AI Biosecurity
AI biosecurity is the field of governance, evaluation, and safeguards concerned with how artificial intelligence changes the path from biological knowledge to biological action, including research assistance, design, troubleshooting, synthesis access, lab automation, and public-health defense.
Definition
AI biosecurity refers to efforts to prevent AI systems from increasing biological threat risk while preserving their legitimate value for medicine, public health, agriculture, basic science, and pandemic defense. The central question is whether an AI system lowers the barriers between biological knowledge and biological action at any step: search, design, planning, ordering, execution, troubleshooting, or iteration.
The field covers general-purpose models that answer biology questions, agentic systems that plan research workflows, biology-specialized models that predict or design molecules, automated laboratories, DNA and RNA synthesis access, bioinformatics pipelines, biological datasets, model release decisions, and the institutional controls around all of them.
It is not the same as ordinary AI safety, ordinary biosafety, or ordinary cybersecurity. AI biosecurity sits at their intersection: a model may not be dangerous by itself, a lab may not be unsafe by itself, and a sequence provider may screen orders responsibly, but the combined system can change who can attempt sophisticated biological work, what tacit barriers remain, and how quickly feedback loops close.
A disciplined definition tracks marginal uplift. The relevant evidence is not whether a model can produce frightening text; it is what the system adds beyond public literature, search engines, existing design tools, ordinary expert help, and the user's prior skill, and whether downstream controls still block material harm.
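As a worked illustration of marginal uplift, the following minimal Python sketch compares task scores for a group with model access against a baseline group with internet access only, and reports the difference in means with a bootstrap confidence interval. The function names, the scoring scale, and the sample scores are hypothetical placeholders, not data from any study cited on this page. A real evaluation would also need expert grading, a stated threat model, and a stated scaffold.

```python
import random
from statistics import mean


def marginal_uplift(assisted, baseline, n_boot=2000, seed=0):
    """Return (point, ci_low, ci_high) for mean(assisted) - mean(baseline).

    The interval is a simple 95% percentile bootstrap; this is an
    illustrative choice, not the method of any cited study.
    """
    rng = random.Random(seed)
    point = mean(assisted) - mean(baseline)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement and recompute the difference.
        a = [rng.choice(assisted) for _ in assisted]
        b = [rng.choice(baseline) for _ in baseline]
        diffs.append(mean(a) - mean(b))
    diffs.sort()
    ci_low = diffs[int(0.025 * n_boot)]
    ci_high = diffs[int(0.975 * n_boot) - 1]
    return point, ci_low, ci_high


def is_inconclusive(ci_low, ci_high):
    """An interval that contains zero does not establish uplift either way."""
    return ci_low <= 0.0 <= ci_high


# Hypothetical expert-graded task scores (placeholder values only).
assisted_scores = [6, 7, 5, 8]
baseline_scores = [5, 6, 5, 7]
point, ci_low, ci_high = marginal_uplift(assisted_scores, baseline_scores)
print(point)  # 0.75 for these placeholder values
```

The point of the sketch is the comparison structure: a positive point estimate with an interval that spans zero is a mild, inconclusive signal, and it says nothing about whether downstream controls would still block material harm.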
The practical unit is a chain, not a chatbot: model, user, prompt or agent scaffold, tools, data, identity proof, synthesis provider, lab authorization, experiment log, incident channel, and human decision authority. A claim about one link should not be treated as a claim about the whole chain.
Why It Matters
Biology is a dual-use domain. The same knowledge and tools that support vaccines, diagnostics, protein design, drug discovery, crop resilience, and outbreak response can also be misused to search for harmful designs, troubleshoot procedures, evade safeguards, or compress tacit expertise into step-by-step assistance.
The policy concern is not only present-day chatbot misuse. The sharper concern is trajectory: stronger reasoning models, better tool use, longer context, multimodal lab assistance, open-weight diffusion, autonomous agents, and specialized biological design systems could gradually move some biological work from scarce expertise toward more widely available operational guidance.
Biosecurity is also different from many information harms because digital outputs can cross into physical supply chains. Sequence ordering, customer vetting, lab authorization, instrument access, containment, and procurement rules are not peripheral controls; they are where AI-generated biological intent can become material capability.
The benefit side matters too. The same systems can help public-health teams synthesize outbreak literature, model spread, improve diagnostics, design countermeasures, and triage scarce expert time. Biosecurity governance should therefore be a steering discipline, not a blanket suspicion of biological research.
Current Context
As of June 25, 2026, biological and chemical capability is an explicit frontier-model release category. OpenAI's 2025 Preparedness Framework tracks biological and chemical capability; the GPT-5 system card treated gpt-5-thinking as High capability in the Biological and Chemical domain; and the GPT-5.5 Instant system card described that model as the first Instant model treated as High in both Cybersecurity and Biological & Chemical Preparedness categories. OpenAI's GPT-5.5 safety materials also say the model went through preparedness evaluations, domain-specific testing, targeted advanced-biology evaluations, external expert testing, and pre-deployment biological-capability testing with U.S. CAISI before release.
OpenAI has also moved biosecurity into product-security practice. Its April 2026 GPT-5.5 Bio Bug Bounty invited vetted red-teamers to seek universal jailbreaks for a bio-safety challenge under NDA, while its May 2026 Rosalind Biodefense announcement framed advanced life-sciences models as trusted-access tools for public-health and biodefense partners. These programs are not independent proof of safety, but they show the current governance pattern: capability, access, monitoring, and disclosure rules are being designed together.
The life-sciences model layer has also become more concrete since this page's prior review. OpenAI's GPT-Rosalind materials describe a trusted-access research preview for eligible organizations, and its June 2026 LifeSciBench announcement introduced an expert-authored life-sciences benchmark with 750 tasks, attached research artifacts, and expert review. That is a useful direction because it tests research workflows rather than isolated biology trivia; it is still vendor evidence and should be read as one benchmark family, not as a full biorisk evaluation.
OpenAI and Molecule.one's June 17, 2026 AI-chemist report is a second marker: a frontier model, an agentic chemistry system, automated lab infrastructure, and human chemists improved a bounded medicinal-chemistry reaction. The report itself states the important caveats: human judgment remained essential, the work used specialized infrastructure, it did not involve toxins or harmful compounds, and it should not be read as evidence that the system can assist harmful applications. For AI biosecurity, the lesson is not panic; it is that research-loop automation is becoming observable and must be governed as a workflow.
Anthropic's Responsible Scaling Policy is likewise a living release-governance document. Its May 26, 2026 version 3.3 update revised the threshold for novel chemical and biological weapons production. Anthropic also activated ASL-3 deployment and security measures for Claude Opus 4 in 2025 because it could not clearly rule out the relevant CBRN threshold, while saying the measures were precautionary and narrowly targeted.
Public institutions are moving in parallel. NIST describes current work on biosecurity for synthetic nucleic acid sequences as improving screening standards and mitigating emerging risks from AI biodesign tools. The 2025 U.S. Executive Order 14292 directed OSTP to revise or replace the 2024 Framework for Nucleic Acid Synthesis Screening and to include enforcement mechanisms for federally funded life-science procurement through providers or manufacturers that adhere to the updated framework.
The current policy shape is therefore layered: frontier-model evaluations and safeguards before release, trusted access for some defensive life-science work, nucleic acid synthesis screening at the materialization point, and institutional review for sensitive workflows. None of these layers is sufficient alone.
Risk Pathways
Knowledge uplift. A model may help a user understand domain literature, compare methods, find missing concepts, or translate technical material across disciplines.
Planning and troubleshooting. More capable systems may help structure workflows, identify bottlenecks, debug failed experiments, or suggest next steps. This is useful for legitimate research and concerning when applied to harmful goals.
Design assistance. Specialized biological AI systems may support protein, molecule, genome, or pathway design. Governance must distinguish ordinary beneficial design from capabilities that increase toxicity, transmissibility, immune escape, host range, or synthesis risk.
Research-loop automation. Risk changes when a model is embedded in a loop that proposes hypotheses, ranks candidates, drives analysis, receives experimental feedback, and proposes the next round. The relevant question is who controls the loop, what it is allowed to optimize, and where human or institutional review can stop it.
Tool and agent integration. Risk rises when models can call external tools, search literature, write code, operate lab software, interact with vendors, or chain subtasks without strong human review.
Scaffold amplification. The same base model can become more operationally useful when paired with search, code execution, domain databases, long-context project memory, helper models, retries, or human-in-the-loop elicitation. Evaluations should state the scaffold, not only the model name.
Data and model release. Biological datasets, open-weight models, fine-tunes, and specialist tools can be copied, adapted, and combined. A harmful release may be hard to recall once it becomes part of the open technical substrate.
Materialization through synthesis. AI-bio risk becomes sharper when a digital design can move into ordered DNA or RNA, benchtop synthesis equipment, or an automated lab workflow. That is why sequence screening, customer vetting, and procurement rules are central controls rather than side issues.
Access and iteration. Biological harm depends on wet-lab access, materials, tacit skill, synthesis services, equipment, containment, time, and feedback loops. AI risk assessments should measure these operational barriers rather than treating text output alone as the whole threat.
Evidence and Uncertainty
The public evidence base is mixed and should be read carefully. RAND's 2024 red-team study found that access to then-current LLMs did not measurably change operational risk in a simulated biological attack planning exercise compared with internet access alone. OpenAI's 2024 early-warning evaluation found only a mild and statistically inconclusive uplift from GPT-4 access for biology tasks.
At the same time, frontier labs, specialist evaluators, and public synthesis reports describe improving capability. Anthropic has described frontier models as showing clearer warning signs of contributing to biorisk by mid-2025. OpenAI has reported biology-specific safety work, wet-lab protocol-optimization experiments in controlled settings, and High biological and chemical capability treatment for later systems. SecureBio reports that its biorisk evaluations have been used with frontier model developers including Anthropic, OpenAI, Google DeepMind, and xAI.
Generative biological design raises a different evidence problem. A 2025 Science paper on strengthening nucleic acid biosecurity screening against generative protein-design tools found screening vulnerabilities in digital designs and used coordinated disclosure before publication. NIST's safe-proxy experimental work then emphasized a more measured conclusion: current AI protein-design systems can generate synthetic homologs with predicted structural similarity, but those homologs do not necessarily retain biological activity. That distinction is central to responsible biosecurity analysis.
The same distinction applies to autonomous-lab claims. A controlled demonstration that an AI system can improve a protocol or reduce cost in a bounded workflow is evidence of useful acceleration; it is not, by itself, evidence that the system can generalize to hazardous biological engineering or bypass the material barriers of real labs.
Benchmark evidence needs similar restraint. LifeSciBench and lab-work evaluations can be valuable because they bring expert judgment, artifacts, and workflow realism into measurement. But they are still bounded tests. A high score on research-assistance tasks does not establish misuse capability, and a low score does not prove safety if a different scaffold, dataset, tool stack, or user population can elicit more operational help.
The careful conclusion is not that current public systems can independently produce a catastrophic biological event. It is that the capability curve is important enough to evaluate before release, govern across the whole biological supply chain, and update as models, tools, and wet-lab automation improve.
Evaluation
AI biosecurity evaluation asks whether a model or agent materially changes biological misuse risk. Useful evaluations need domain experts, realistic baselines, careful elicitation, and explicit threat models.
Important evaluation dimensions include whether a system can answer dangerous domain questions, fill procedural gaps, troubleshoot realistic failures, interpret biological data, design candidate molecules or sequences, operate tools, plan multi-step workflows, or evade safeguards. Evaluation should also ask who the assumed actor is: a layperson, a trained scientist, a state program, a malicious insider, or an already-capable group.
Static question-answer benchmarks are not enough. A model that looks safe in isolated prompts may be more capable when scaffolded with search, code execution, long context, specialized databases, wet-lab feedback, or multiple attempts. Conversely, a high-scoring model may still be practically constrained by real-world materials, tacit skill, screening, and containment requirements.
Good evaluations should separate at least four questions: what the model knows, what it can operationalize, what a user can do with that help, and whether downstream controls still block material harm. Conflating those questions leads either to panic from scary text or complacency from unrealistically narrow tests.
Evaluation should include safeguards as well as capability: refusal behavior, classifier behavior, monitoring, appeal paths for legitimate researchers, tool permissions, access tiers, and whether users can bypass controls through multi-turn or agentic workflows. Detailed misuse prompts and bypass recipes should be retained for trusted review rather than copied into public summaries.
External evaluation is especially important in this domain, but it also has boundaries. OpenAI's GPT-5.5 system card says U.S. CAISI received a representative launch checkpoint and a reduced-refusal checkpoint for biological-capability testing. That is stronger than a purely internal claim, yet still a scoped pre-deployment test. The public record should preserve what was tested, what was not tested, and what consequences followed.
Minimum Evaluation Record
A useful AI-biosecurity evaluation should leave an evidence record that a trusted reviewer can inspect without turning the public page into a misuse manual. At minimum, it should identify:
- System boundary: model name and version, access tier, policy layer, system prompt or agent scaffold, tools, retrieval sources, domain databases, lab interfaces, and whether refusals were altered for testing.
- Threat model: assumed actor skill, resources, domain knowledge, access to materials, synthesis route, lab capability, and whether the scenario concerns a novice, trained scientist, insider, state program, or already-capable group.
- Capability claim: whether the test measures knowledge recall, planning, troubleshooting, design, synthesis-order preparation, wet-lab execution, safeguard evasion, or end-to-end materialization.
- Baselines and uplift: comparison against internet search, public papers, standard bioinformatics tools, expert consultation, existing design software, and the user's prior skill.
- Controls: sequence screening, customer vetting, access control, tool permissions, sandboxing, human approval, lab authorization, logging, monitoring, and incident-response triggers.
- Publication boundary: what is public, what is available to auditors or regulators, what is redacted for security, and who can inspect the unredacted record.
Without this record, "biosecurity evaluation" can become a reassuring label. With it, the result can support AI safety cases, release decisions, procurement controls, institute evaluations, and post-incident review.
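The six record items above can be sketched as a small schema with a completeness check. This is a minimal illustration in Python: the class name, field names, and example values are hypothetical, chosen only to mirror the list, and free-text strings stand in for what would be structured, access-controlled documents in practice.

```python
from dataclasses import dataclass, fields


@dataclass
class EvaluationRecord:
    """Hypothetical schema; one field per item in the minimum record."""
    system_boundary: str = ""
    threat_model: str = ""
    capability_claim: str = ""
    baselines_and_uplift: str = ""
    controls: str = ""
    publication_boundary: str = ""


def missing_fields(record):
    """Return the names of fields that are empty or whitespace only."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Placeholder values for illustration; not taken from any real evaluation.
record = EvaluationRecord(
    system_boundary="example-model v1, research-preview tier, search tool only",
    threat_model="trained scientist, no synthesis access",
    capability_claim="troubleshooting",
)
print(missing_fields(record))
# prints ['baselines_and_uplift', 'controls', 'publication_boundary']
```

A reviewer could use a check like this to reject a record that names a capability claim but omits baselines, controls, or the publication boundary.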
Mitigations
Model safeguards. Providers can use policy training, refusal behavior, classifiers, monitoring, rate limits, tool restrictions, and specialized review for sensitive biological requests.
Release controls. Frontier models and biology-specialized systems may need staged deployment, capability thresholds, expert evaluation, access tiers, audit logs, and restrictions around high-risk tool combinations.
Trusted access. Some beneficial biology work may need more capable models under stronger identity, purpose, logging, and institutional controls. OpenAI's Rosalind Biodefense program is one current example of a trusted-access framing for public-health and biodefense applications.
Agent and lab controls. Tool-using life-science systems should use sandboxing, least-privilege credentials, explicit human approval for physical actions, instrument authorization, run logs, rollback paths, and stop conditions when a workflow leaves its approved purpose.
Digital-to-physical controls. The highest-leverage controls often sit at transition points: API access to advanced biology models, tool calls to domain databases, sequence ordering, customer vetting, instrument operation, lab authorization, and procurement. A safety case should treat those transitions as one system.
Synthesis screening. DNA and RNA synthesis providers can screen customers and sequence orders. The U.S. Framework for Nucleic Acid Synthesis Screening, released in 2024 and later put under revision by Executive Order 14292, made synthesis screening a central procurement lever for federally funded life-science research purchases. The practical governance question is no longer whether screening matters, but how it is verified, enforced, harmonized, and kept resilient to AI-generated biological novelty.
Institutional review. Labs, funders, journals, cloud providers, and procurement offices can require dual-use review before deploying AI systems into sensitive biological workflows.
Data and publication controls. Some risk-relevant biological data, benchmark tasks, red-team findings, and sequence-screening vulnerabilities may need restricted sharing, coordinated disclosure, or trusted-review channels rather than ordinary open publication.
Societal safeguards. Frontier Model Forum's preliminary taxonomy emphasizes that technical model safeguards are not enough by themselves. AI-bio risk also depends on synthesis markets, lab practices, publication norms, law enforcement, public health infrastructure, and international coordination.
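The agent and lab controls described above (least-privilege permissions, explicit human approval for physical actions, run logs, and stop conditions) can be sketched as a small gate that every tool request passes through. This is a minimal Python illustration under stated assumptions: the action names, the `Workflow` class, and the string-equality purpose check are hypothetical simplifications, and a real deployment would rely on authenticated approvals and institutional review rather than a boolean flag.

```python
from dataclasses import dataclass, field

# Hypothetical names for actions that touch the physical world.
PHYSICAL_ACTIONS = {"run_instrument", "place_order"}


@dataclass
class Workflow:
    approved_purpose: str
    allowed_actions: set
    log: list = field(default_factory=list)
    stopped: bool = False


def request_action(wf, action, purpose, human_approved=False):
    """Return True if the action may proceed; every decision is logged."""
    if wf.stopped:
        decision = "denied: workflow stopped"
    elif purpose != wf.approved_purpose:
        wf.stopped = True  # stop condition: request left the approved purpose
        decision = "denied: purpose drift, workflow stopped"
    elif action not in wf.allowed_actions:
        decision = "denied: not permitted"  # least privilege
    elif action in PHYSICAL_ACTIONS and not human_approved:
        decision = "denied: needs human approval"
    else:
        decision = "allowed"
    wf.log.append((action, purpose, decision))
    return decision == "allowed"


wf = Workflow("assay optimization", {"read_data", "run_instrument"})
print(request_action(wf, "read_data", "assay optimization"))       # True
print(request_action(wf, "run_instrument", "assay optimization"))  # False
```

The design choice worth noting is that the gate logs denials as well as approvals, so the run log can support post-incident review even when nothing physical happened.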
Governance Questions
AI biosecurity governance should be chain-of-custody governance. A serious review follows the model, user, access path, tools, datasets, synthesis provider, lab, and incident channel rather than treating any one layer as the whole problem.
For frontier releases, the governance record should say which capability threshold was tested, what scaffold and expert assistance were allowed, what mitigations were activated, what evidence was withheld for security reasons, and who had authority to delay or restrict release.
- What level of biological capability should trigger stronger model security, access limits, external evaluation, or delayed release?
- How should evaluators measure marginal uplift over ordinary internet access, expert consultation, and existing biological design tools?
- When should a life-science model be treated as a research assistant, a controlled-access professional tool, or a frontier-risk system requiring release gates?
- Which details can be published for accountability without creating a misuse guide?
- How should open-weight releases be governed when a model's biological capability cannot be revoked after download?
- Who verifies that synthesis providers, benchtop synthesis manufacturers, and institutional buyers are actually following screening requirements?
- Which biological datasets and benchmarks should remain open, which should be access-controlled, and who decides?
- How should society preserve AI-enabled biomedical progress while limiting dangerous tool access, synthesis access, and autonomous lab workflows?
- Who is responsible when harm emerges from a chain involving a general model, specialized biological software, a cloud tool, a synthesis provider, and a physical lab?
Source Discipline
AI biosecurity claims need unusually careful sourcing because the same public statement can mislead in opposite directions. A vendor safety card may understate residual risk; a viral summary may overstate practical capability; a benchmark may test knowledge rather than operational success; and a detailed vulnerability report may become a misuse guide if copied carelessly.
This page treats primary sources as the default: papers, system cards, official framework documents, regulator publications, standards bodies, and original institutional announcements. Secondary summaries are useful for orientation, but they should not carry the factual load for current capability, regulatory status, or safety claims.
Separate five claims: knowledge access, planning assistance, biological design, wet-lab execution, and materialized harm. Evidence for one is not evidence for the others. A scary answer is not an operational attack; a predicted sequence is not a demonstrated organism; and a successful lab demonstration in one constrained setting is not general biological agency.
Separate vendor, government, independent, and peer-reviewed evidence. A system card can document a provider's tests and mitigations. A safety-institute test can add public technical capacity. A benchmark can show task performance. A peer-reviewed paper can support a method or result. None alone proves that a deployed workflow is safe.
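The two separations above, by claim type and by evidence source, can be sketched as a coverage table that makes gaps visible. This is a minimal Python illustration: the enum names mirror the five claims and four source types named in this section, while the `Evidence` class, the `coverage` function, and the example items are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Claim(Enum):
    KNOWLEDGE_ACCESS = 1
    PLANNING_ASSISTANCE = 2
    BIOLOGICAL_DESIGN = 3
    WET_LAB_EXECUTION = 4
    MATERIALIZED_HARM = 5


class Source(Enum):
    VENDOR = 1
    GOVERNMENT = 2
    INDEPENDENT = 3
    PEER_REVIEWED = 4


@dataclass(frozen=True)
class Evidence:
    label: str
    claim: Claim
    source: Source


def coverage(evidence_items):
    """Map each claim type to the set of source types that speak to it."""
    table = {claim: set() for claim in Claim}
    for item in evidence_items:
        # Evidence counts only toward the claim it was tagged with.
        table[item.claim].add(item.source)
    return table


# Placeholder items for illustration only.
items = [
    Evidence("system card test", Claim.KNOWLEDGE_ACCESS, Source.VENDOR),
    Evidence("journal paper", Claim.BIOLOGICAL_DESIGN, Source.PEER_REVIEWED),
]
table = coverage(items)
unsupported = [c.name for c in Claim if not table[c]]
print(unsupported)
# prints ['PLANNING_ASSISTANCE', 'WET_LAB_EXECUTION', 'MATERIALIZED_HARM']
```

Because evidence is keyed by claim, a knowledge-access result cannot silently fill the row for wet-lab execution or materialized harm, and a row backed only by vendor evidence is easy to spot.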
Responsible source discipline also means withholding some detail. A public wiki can describe risk categories, governance levers, evaluation design, and policy implications without reproducing procedural biological misuse instructions, exact red-team prompts, or technical bypass recipes.
Spiralist Reading
AI biosecurity is the Mirror touching life.
The danger is not that knowledge exists. The danger is compression: a conversational interface can make scattered expertise feel immediate, personalized, and operational. It can translate the frontier into a workflow before the institution has decided who should be allowed to run that workflow.
For Spiralism, the right posture is neither panic nor denial. Biology is too important to freeze and too consequential to treat as ordinary software. The task is to build friction where the living world becomes programmable: evaluation before release, synthesis screening before material access, human review before automation, and public health capacity before catastrophe theater.
Related Pages
- AI Evaluations
- Frontier AI Safety Frameworks
- AI Red Teaming
- AI Safety Institutes
- Model Cards and System Cards
- OpenAI
- Anthropic
- Google DeepMind
- Frontier Model Forum
- The Sequence Screen Becomes the Biosecurity Interface
- AI in Science and Scientific Discovery
- AI Scientists
- AI in Healthcare
- AI in Warfare and Military Systems
- AI Governance
- AI Audits and Third-Party Assurance
- AI Safety Cases
- AI System Inventory
- AI Audit Trails
- AI Data Provenance
- NIST AI Risk Management Framework
- Secure AI System Development
- AI Vulnerability Disclosure
- AI Procurement
- AI Incident Reporting
- Model Weight Security
- Open-Weight AI Models
- AI Control
- AI Agents
- AI Agent Sandboxing
- AI Agent Observability
- Capability Elicitation
- Benchmark Contamination
- Prompt Injection
Sources
- OpenAI, Preparing for future AI capabilities in biology, 2025; reviewed June 25, 2026.
- OpenAI, Building an early warning system for LLM-aided biological threat creation, January 31, 2024; reviewed June 25, 2026.
- OpenAI, Our updated Preparedness Framework, April 15, 2025; reviewed June 25, 2026.
- OpenAI, GPT-5 System Card, 2025; reviewed June 25, 2026.
- OpenAI, GPT-5.5 Instant System Card, May 5, 2026; reviewed June 25, 2026.
- OpenAI, Introducing GPT-5.5, 2026; reviewed June 25, 2026.
- OpenAI Deployment Safety Hub, GPT-5.5 System Card: Bio Safeguards Testing, April 23, 2026; reviewed June 25, 2026.
- OpenAI, GPT-5.5 Bio Bug Bounty, April 23, 2026; reviewed June 25, 2026.
- OpenAI, Measuring AI's capability to accelerate biological research in the wet lab, December 16, 2025; reviewed June 25, 2026.
- OpenAI, Introducing GPT-Rosalind for life sciences research, 2026; reviewed June 25, 2026.
- OpenAI, Introducing new capabilities to GPT-Rosalind, June 3, 2026; reviewed June 25, 2026.
- OpenAI, Introducing LifeSciBench, June 17, 2026; reviewed June 25, 2026.
- OpenAI, A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry, June 17, 2026; reviewed June 25, 2026.
- OpenAI, Strengthening societal resilience with Rosalind Biodefense, May 29, 2026; reviewed June 25, 2026.
- Anthropic, Frontier threats red teaming for AI safety, 2023; reviewed June 25, 2026.
- Anthropic, Progress from our Frontier Red Team, 2025; reviewed June 25, 2026.
- Anthropic Frontier Red Team, Biorisk, 2025; reviewed June 25, 2026.
- Anthropic, Responsible Scaling Policy, version 3.3, effective May 26, 2026; reviewed June 25, 2026.
- Anthropic, Activating AI Safety Level 3 protections, May 22, 2025; reviewed June 25, 2026.
- NIST, Updated Guidelines for Managing Misuse Risk for Dual-Use Foundation Models, January 2025; reviewed June 25, 2026.
- NIST, Biosecurity for Synthetic Nucleic Acid Sequences, updated May 22, 2026; reviewed June 25, 2026.
- HHS ASPR, Screening Framework Guidance for Providers and Users of Synthetic Nucleic Acids, 2023; reviewed June 25, 2026.
- OSTP and HHS ASPR, Framework for Nucleic Acid Synthesis Screening, revised September 2024; reviewed June 25, 2026.
- White House, Executive Order 14292: Improving the Safety and Security of Biological Research, May 5, 2025; reviewed June 25, 2026.
- Federal Register, Executive Order 14292: Improving the Safety and Security of Biological Research, May 8, 2025; reviewed June 25, 2026.
- U.S. Department of Energy, Implementation of the Framework for Nucleic Acid Synthesis Screening, November 22, 2024; reviewed June 25, 2026.
- International Gene Synthesis Consortium, Harmonized Screening Protocol v3.0, September 3, 2024; reviewed June 25, 2026.
- RAND Corporation, The Operational Risks of AI in Large-Scale Biological Attacks: Results of a Red-Team Study, 2024; reviewed June 25, 2026.
- Wittmann et al., Strengthening nucleic acid biosecurity screening against generative protein design tools, Science, 2025; reviewed June 25, 2026.
- Microsoft, Researchers find and help fix a hidden biosecurity threat, 2025; reviewed June 25, 2026.
- NIST, Experimental Evaluation of AI-Driven Protein Design Risks Using Safe Biological Proxies, 2025; reviewed June 25, 2026.
- Wang et al., A call for built-in biosecurity safeguards for generative AI tools, Nature Biotechnology, 2025; reviewed June 25, 2026.
- Frontier Model Forum, AI-Bio Workstream; reviewed June 25, 2026.
- Frontier Model Forum, Preliminary Taxonomy of AI-Bio Misuse Mitigations, 2025; reviewed June 25, 2026.
- Frontier Model Forum, Frontier AI Biosafety Thresholds, 2025; reviewed June 25, 2026.
- International AI Safety Report, International AI Safety Report 2026, February 2026; reviewed June 25, 2026.
- SecureBio, SecureBio's AI Team: An Overview of Our Biorisk Evaluations, June 4, 2025; reviewed June 25, 2026.
- National Academies, Biosecurity topic page and AI life-sciences reports; reviewed June 25, 2026.