Wiki · Concept · Last reviewed June 19, 2026

AI Red Teaming

AI red teaming is structured adversarial testing of AI models, products, agents, and deployment workflows for harmful capabilities, unsafe outputs, policy failures, misuse pathways, and weak safeguards.

Definition

AI red teaming adapts adversarial testing traditions from security, defense, safety engineering, and reliability work to AI systems. Red teamers try to make a model or product reveal dangerous behavior, bypass safeguards, expose hidden capabilities, fail under realistic user flows, or create harms that ordinary benchmark testing misses. It asks what can go wrong under pressure, not whether a leaderboard score looks good.

The target can be a base model, a fine-tuned model, a chat product, an image or audio system, an agent with tools, a classifier, a moderation layer, or an entire deployment workflow. The output is not only a list of bad prompts. A useful red-team campaign produces threat models, attack examples, severity judgments, evidence, mitigations, regression tests, and questions for future monitoring.

OpenAI describes red teaming as a structured process for probing AI systems and products for harmful capabilities, harmful outputs, or infrastructural threats. NIST's Generative AI Profile calls for regular adversarial testing to identify vulnerabilities, manipulation, misuse scenarios, and unintended outputs. In governance terms, red teaming is not proof of safety. It is adversarial evidence about a scoped system at a particular time.

Why It Matters

Many AI harms are not visible in standard accuracy tests. They emerge through interaction: a user asks follow-up questions, changes framing, supplies external documents, asks the model to use tools, or combines outputs with real-world incentives. Red teaming treats the system as something people will push against.

Red teaming is also useful because AI risk is contextual. A model that seems harmless in a single prompt can become dangerous in a medical workflow, political campaign, classroom, military setting, fraud operation, biological research context, or intimate companion product. Domain experts can see failure modes that generic evaluators miss.

For frontier systems, red teaming functions as early warning. It tests whether new capabilities create fresh risks before the market discovers them through misuse, accidents, or public scandal. For ordinary deployed systems, it is also a way to test whether policy promises survive contact with users, incentives, integrations, and edge cases.

What It Tests

Policy bypass. Attempts to evade refusal behavior, content filters, classifier layers, user-interface warnings, or tool-use restrictions.

Dangerous capability. Whether the system can assist in cyber abuse, fraud, manipulation, biological misuse, chemical harm, weaponization, surveillance, or other high-impact domains.

Security weakness. Prompt injection, data exfiltration, secret leakage, insecure tool calls, privilege escalation, model extraction, and other attacks on the system around the model.

Product workflow risk. Whether harms arise from the way the model is embedded in a product, not only from the raw model.

Long-horizon interaction. Multi-turn persuasion, dependency, role capture, escalation, concealment, or gradual movement toward unsafe advice.

Agentic action. Tool use, credential handling, code execution, browsing, file changes, email sending, payment flows, and other actions that can turn text into consequence.

Evaluation integrity. Sandbagging, benchmark gaming, prompt sensitivity, scaffold dependence, hidden capability, and the gap between controlled tests and real deployment.

Social and cultural blind spots. Regional politics, language, identity, disability access, local law, religious context, labor context, and historically specific forms of harm.

Methods

Threat modeling. The campaign defines who might attack or misuse the system, what capabilities matter, what assets are at risk, and what evidence would change a deployment decision.

Internal red teaming. The developer's own staff probes the system. This is fast and useful for iteration, but internal teams can inherit company blind spots.

External expert red teaming. Outside domain experts test the system under an agreed scope. OpenAI's Red Teaming Network is one public example of a standing expert network intended to support risk assessment at different stages of model and product development.

Public or broad-participant red teaming. Larger groups of users test for failures. NIST notes that general-public red teams can surface lived-experience and context-specific failures, especially when large groups participate.

Automated red teaming. AI systems generate attacks, variants, prompts, or scenarios at scale. This can expand coverage but can also produce shallow or unrealistic failures if not paired with human validation.

Hybrid red teaming. Humans and AI systems collaborate: humans define threat models and judge severity while models generate attack variants and search the space.

Capability elicitation. Evaluators vary prompts, tools, scaffolds, time limits, examples, and human assistance to avoid mistaking poor elicitation for absence of capability.
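
A minimal sketch of the elicitation idea above, assuming the evaluator already has pass/fail trial outcomes per condition. The function names, condition labels, and example outcomes are hypothetical and exist only for illustration. The point it shows is that capability is read from the strongest elicitation condition, not from the average across conditions.

```python
from itertools import product


def elicitation_grid(prompt_styles, scaffolds, tool_sets):
    """Enumerate every combination of elicitation conditions to try."""
    return [
        {"prompt_style": p, "scaffold": s, "tools": t}
        for p, s, t in product(prompt_styles, scaffolds, tool_sets)
    ]


def best_elicited_rate(results):
    """Return (label, success_rate) for the strongest condition.

    `results` maps a condition label to a list of booleans, one per trial.
    Conditions with no trials are ignored.
    """
    rates = {
        label: sum(trials) / len(trials)
        for label, trials in results.items()
        if trials
    }
    if not rates:
        raise ValueError("no trials recorded for any condition")
    best = max(rates, key=rates.get)
    return best, rates[best]


# Hypothetical trial outcomes, for illustration only.
example_results = {
    "zero-shot, no tools": [False, False, False, False],
    "few-shot, no tools": [False, True, False, False],
    "few-shot, code tool": [True, True, False, True],
}

grid = elicitation_grid(
    ["zero-shot", "few-shot"], ["none", "planner"], ["no tools", "code tool"]
)
best_label, best_rate = best_elicited_rate(example_results)
print(len(grid), best_label, best_rate)
```

In this toy data the zero-shot condition alone would suggest no capability, while the best condition succeeds in three of four trials. Reporting only the first number would be the elicitation error the paragraph warns about.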

Agent and tool-loop red teaming. Evaluators test whether untrusted content can be executed as a command, whether credentials or private data can be exfiltrated, whether tool calls exceed authorization, whether logs preserve enough evidence, and whether human approval is required before irreversible actions.
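
The checks listed above can be sketched as a small policy gate that a red team might probe. This is an illustrative toy, not any vendor's agent API: `ToolCall`, `ToolPolicy`, the `source` labels, and the tool names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    args: dict
    # Assumed labels: "user" for direct instructions,
    # "untrusted" for text that arrived via web pages, files, or email.
    source: str


@dataclass
class ToolPolicy:
    allowed_tools: set
    irreversible_tools: set
    audit_log: list = field(default_factory=list)

    def check(self, call, human_approved=False):
        """Return (allowed, reason) and record the decision for later review."""
        if call.source == "untrusted":
            decision = (False, "instruction originated in untrusted content")
        elif call.tool not in self.allowed_tools:
            decision = (False, "tool exceeds authorization")
        elif call.tool in self.irreversible_tools and not human_approved:
            decision = (False, "irreversible action needs human approval")
        else:
            decision = (True, "ok")
        # Every decision is logged, including refusals, so evidence survives.
        self.audit_log.append(
            {
                "tool": call.tool,
                "source": call.source,
                "allowed": decision[0],
                "reason": decision[1],
            }
        )
        return decision


policy = ToolPolicy(
    allowed_tools={"search", "send_email"},
    irreversible_tools={"send_email"},
)
print(policy.check(ToolCall("send_email", {"to": "a@example.com"}, "untrusted")))
print(policy.check(ToolCall("send_email", {"to": "a@example.com"}, "user"), human_approved=True))
```

A red-team case then becomes a concrete question about this gate: can any input sequence produce an allowed irreversible call without approval, or an allowed call whose instruction came from untrusted content?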

Post-deployment red teaming. Red teaming continues after release, when real users, new integrations, and changed incentives reveal failures that pre-release testing did not cover.

Current Context

Red teaming has moved from voluntary safety practice into the formal governance stack. NIST's 2024 Generative AI Profile recommends regular adversarial testing for generative AI systems, and NIST's 2025 adversarial machine learning taxonomy gives standards work a common language for attacks, mitigations, attacker goals, and lifecycle stages. The relevant attack surface is no longer only the model: data, prompts, tools, retrieval systems, classifiers, interfaces, logs, and deployment infrastructure all matter.

Under the EU AI Act, providers of general-purpose AI models with systemic risk must perform model evaluations, assess and mitigate systemic risks, document serious incidents, and maintain adequate cybersecurity. Article 55 specifically includes conducting and documenting adversarial testing to identify and mitigate systemic risks. The EU General-Purpose AI Code of Practice, published in 2025 and updated in 2026 Commission materials, connects that duty to safety and security practices for systemic-risk models. That does not mean every red team is legally sufficient; it means adversarial testing is now part of the evidence trail regulators can ask about.

Public institutions are also participating. The UK AI Security Institute publishes evaluation tooling through Inspect and model-evaluation reports, while NIST's Center for AI Standards and Innovation (CAISI) supports testing, standards, and security evaluation work. A March 2026 CAISI-backed analysis of a public AI-agent red-teaming competition reported more than 250,000 attack attempts from over 400 participants across 13 frontier models, with at least one successful hijacking attack against every target model. The practical lesson is narrow but important: tool-using agents need red teams that test the whole action loop, not only the chat surface.

Frontier Threats

Frontier red teaming focuses on risks that become serious as model capability rises. Anthropic has described work on frontier threats red teaming in national-security-relevant domains, including the need for domain experts, threat models, evaluations, mitigations, and responsible disclosure.

Such work is sensitive because publishing detailed methods can create misuse risk. This creates a governance tension: the public needs evidence that serious testing happened, but the details of some failures may be too dangerous to disclose broadly.

Government-linked AI safety and security institutes now participate in some pre-deployment evaluations. Anthropic reported in 2025 that models had undergone pre-deployment testing by U.S. and U.K. institute teams, and that nuclear-risk evaluation involved government red teaming in a classified setting. The public record is therefore split: some results can be published as system-card evidence, while some national-security findings may remain available only to trusted public authorities.

Governance Requirements

A red-team campaign should start with scope: which model version, product surface, tools, policies, domains, users, release tier, and harms are being tested. A vague instruction to "try to break it" is not enough for institutional accountability.

Red teamers need usable access, legal authorization, clear rules, reporting channels, compensation where appropriate, and protection from retaliation. When work involves harmful content or traumatic material, the process should include participant safety, informed consent, and support.

Results should feed into decisions: model changes, policy changes, refusal behavior, monitoring, product constraints, documentation, launch gates, incident plans, and future evaluation sets. Red teaming that cannot delay, restrict, remediate, or stop deployment becomes theater.

At minimum, governance records should preserve the threat model, model or system version, prompts, tools, scaffolds, environment assumptions, run logs, evaluator roles, severity rubric, mitigations, residual risk, responsible owner, and retest date. Some sensitive details may need restricted access, but they should not disappear.
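
One way to keep such records from silently losing fields is a completeness check run before a campaign is closed. The sketch below assumes records are stored as plain dictionaries; the key names are a hypothetical rendering of the list above, not a published schema.

```python
# Field names mirror the minimum record contents listed in the text.
REQUIRED_FIELDS = (
    "threat_model",
    "system_version",
    "prompts",
    "tools",
    "scaffolds",
    "environment_assumptions",
    "run_logs",
    "evaluator_roles",
    "severity_rubric",
    "mitigations",
    "residual_risk",
    "responsible_owner",
    "retest_date",
)


def missing_fields(record):
    """List required fields that are absent or explicitly set to None.

    An empty list (for example, no tools in scope) counts as recorded,
    because "none" is itself a fact a later reviewer needs.
    """
    return [
        name
        for name in REQUIRED_FIELDS
        if name not in record or record[name] is None
    ]


# Hypothetical partial record, for illustration only.
draft = {
    "threat_model": "prompt injection via retrieved documents",
    "system_version": "assistant-2026-05-snapshot",
    "prompts": ["..."],
    "tools": [],
    "severity_rubric": "internal 4-level rubric",
    "responsible_owner": None,
}
print(missing_fields(draft))
```

A record with restricted-access content can still pass this check by storing a pointer to where the sensitive material is held, which matches the requirement that details may be restricted but should not disappear.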

For high-stakes systems, red-team evidence should connect to audits, model cards, system cards, safety cases, incident reporting, procurement conditions, and human oversight. The work should leave records that later reviewers can inspect.

Independence matters. A red team run by the developer can be valuable, but public assurance is stronger when external experts, qualified auditors, safety institutes, affected-community reviewers, or regulators can examine the system under realistic conditions. Access, conflict-of-interest rules, publication rights, and disclosure limits should be explicit before testing begins.

Red teaming should also be versioned. A result from one model snapshot, prompt stack, retrieval corpus, policy layer, tool set, user population, or release tier should not quietly travel to another system. Significant updates should trigger retesting.
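
A minimal sketch of how versioning could be enforced mechanically, assuming the tested and deployed configurations can each be written down as a JSON-serializable dictionary. The configuration keys and values are hypothetical. The fingerprint is a SHA-256 hash of a canonical JSON encoding, so any change to a recorded component changes the fingerprint.

```python
import hashlib
import json


def config_fingerprint(config):
    """Hash a configuration in a key-order-independent way."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_retest(tested_config, deployed_config):
    """True when the deployed system differs from the one that was red-teamed."""
    return config_fingerprint(tested_config) != config_fingerprint(deployed_config)


# Hypothetical configurations, for illustration only.
tested = {
    "model_snapshot": "m-2026-04-01",
    "prompt_stack": "system-prompt-v7",
    "policy_layer": "policy-v3",
    "tool_set": ["browser", "code"],
    "release_tier": "limited-beta",
}
deployed = dict(tested, tool_set=["browser", "code", "email"])

print(needs_retest(tested, tested))    # same configuration
print(needs_retest(tested, deployed))  # a tool was added after testing
```

Key order inside dictionaries does not affect the hash, but list order does, so lists such as the tool set should be stored sorted. A hash only detects that something changed; deciding whether the change is significant enough to retest remains a human judgment.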

Source Discipline

A credible public claim about AI red teaming should say who ran the test, when it happened, which system version was tested, which access level was granted, what domains were in scope, what was excluded, what severity rubric was used, and what changed afterward.

Reports should distinguish internal tests, external expert tests, public bug-bounty-style exercises, regulator-only reviews, and post-deployment monitoring. They should also distinguish primary evidence from vendor summaries. A system card that says "red teaming was performed" is weaker than a report that describes scope, evaluator qualifications, example failures where safe, mitigation status, unresolved limitations, and plans for retest.

Claims should not travel silently from one snapshot to another. A red-team result from a base model may not apply to a product with retrieval, memory, browsing, code execution, connector permissions, or a different policy layer. Likewise, a clean result in one language, region, or user flow should not be generalized to all users.

Full disclosure is not always responsible. Biological, cyber, weaponization, or infrastructure findings may require restricted reporting. The governance problem is to preserve enough evidence for trusted review without publishing a misuse manual. A useful report explains what is withheld and why.

Failure Modes

One-time ceremony. A system is red-teamed once before launch, then changed repeatedly without equivalent retesting.

Prompt trophy hunting. The campaign rewards clever jailbreak examples but fails to measure severity, frequency, workflow context, or mitigation effectiveness.

Expert mismatch. The team lacks the domain expertise needed for the risk being tested.

Access mismatch. Red teamers test a limited demo while the deployed system has tools, memory, retrieval, integrations, or user flows that create different risks.

Source laundering. Marketing language turns a narrow, internal, or undocumented test into a broad claim that the system was "red teamed" or "safe."

Confidentiality trap. Necessary secrecy protects sensitive details, but also prevents public accountability about scope, findings, and whether anything changed.

Automated overconfidence. AI-generated attacks create large test sets, but humans fail to validate realism, novelty, or severity.

False reassurance. Passing a red-team campaign is treated as proof of safety, even though adversarial testing can only sample a changing attack surface.
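
As a counterweight to the prompt trophy hunting failure mode above, findings can be ranked by severity and observed frequency instead of by cleverness. The sketch below is one possible triage rule under stated assumptions: the severity weights are invented for the example, and it ignores workflow context and mitigation effectiveness, which a real rubric would also need to cover.

```python
from collections import defaultdict

# Assumed weights for illustration; a real campaign would use its own rubric.
SEVERITY_WEIGHT = {"low": 1, "medium": 3, "high": 9}


def triage(attempts):
    """Rank findings by severity weight times observed success rate.

    `attempts` is a list of dicts with keys 'finding', 'severity', 'succeeded'.
    """
    grouped = defaultdict(list)
    for attempt in attempts:
        key = (attempt["finding"], attempt["severity"])
        grouped[key].append(attempt["succeeded"])

    rows = []
    for (finding, severity), outcomes in grouped.items():
        rate = sum(outcomes) / len(outcomes)
        rows.append(
            {
                "finding": finding,
                "severity": severity,
                "success_rate": rate,
                "priority": SEVERITY_WEIGHT[severity] * rate,
            }
        )
    return sorted(rows, key=lambda row: row["priority"], reverse=True)


# Hypothetical attempt log, for illustration only.
attempts = [
    {"finding": "clever roleplay jailbreak", "severity": "low", "succeeded": True},
    {"finding": "tool exfiltration", "severity": "high", "succeeded": True},
    {"finding": "tool exfiltration", "severity": "high", "succeeded": False},
    {"finding": "tool exfiltration", "severity": "high", "succeeded": True},
    {"finding": "tool exfiltration", "severity": "high", "succeeded": False},
]
for row in triage(attempts):
    print(row["finding"], row["priority"])
```

In this toy log the jailbreak works every time but ranks below the exfiltration finding, which succeeds only half the time yet carries far higher severity.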

Spiralist Reading

AI red teaming is sanctioned doubt.

The institution creates a machine and then invites adversaries to speak to it in the wrong ways, from the wrong angles, with the wrong incentives. The red team is asked to imitate the future user, the attacker, the desperate person, the expert, the edge case, and the hostile environment.

For Spiralism, the value is not the ritual of breaking the model. The value is whether the break changes power. If the finding becomes a launch delay, a stronger boundary, a public warning, a better audit trail, or a stopped deployment, the red team mattered. If it becomes a slide in a safety deck, the machine has eaten the critique.

Open Questions

Sources

