Wiki · Concept · Last reviewed May 16, 2026

AI Red Teaming

AI red teaming is the practice of deliberately probing AI systems for harmful capabilities, unsafe outputs, policy failures, misuse pathways, and weaknesses in safeguards before or after deployment.

Definition

AI red teaming adapts adversarial testing traditions from security, defense, and reliability engineering to AI systems. Red teamers try to make a model or product reveal dangerous behavior, bypass safeguards, expose hidden capabilities, fail under realistic user flows, or create harms that ordinary benchmark testing misses.

The target can be a base model, a fine-tuned model, a chat product, an image or audio system, an agent with tools, a classifier, a moderation layer, or an entire deployment workflow. The output is not only a list of bad prompts. A useful red-team campaign produces threat models, attack examples, severity judgments, evidence, mitigations, repeatable tests, and questions for future monitoring.
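
The campaign outputs listed above can be captured as a structured record rather than a loose list of prompts. The following is a minimal sketch, not a standard schema: the `Finding` class, its field names, and the four-level `Severity` scale are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical four-level scale; real programs define their own.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class Finding:
    threat_model: str                  # who is attacking, and why
    attack_example: str                # prompt, transcript, or action sequence
    severity: Severity                 # judged by a human reviewer
    evidence: list[str] = field(default_factory=list)  # logs, transcripts
    mitigation: str = ""               # proposed or applied fix
    regression_test_id: str = ""       # link to a repeatable test, if any
    monitoring_question: str = ""      # what to watch for after release

    def is_actionable(self) -> bool:
        """Assumption: a finding counts as actionable once it has
        both evidence and a repeatable test attached."""
        return bool(self.evidence) and bool(self.regression_test_id)


finding = Finding(
    threat_model="Fraud operator seeking phishing templates",
    attack_example="<redacted transcript>",
    severity=Severity.HIGH,
    evidence=["transcript-001"],
)
```

In this sketch the example finding is not yet actionable because no repeatable test is linked to it; setting `regression_test_id` would change that.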

OpenAI has described red teaming as a structured process for probing AI systems and products for harmful capabilities, harmful outputs, or infrastructural threats. NIST describes AI red teaming as an evolving practice used to identify adverse behavior or outcomes and stress-test safeguards.

Why It Matters

Many AI harms are not visible in standard accuracy tests. They emerge through interaction: a user asks follow-up questions, changes framing, supplies external documents, asks the model to use tools, or combines outputs with real-world incentives. Red teaming treats the system as something people will push against.

Red teaming is also useful because AI risk is contextual. A model that seems harmless in a single prompt can become dangerous in a medical workflow, political campaign, classroom, military setting, fraud operation, biological research context, or intimate companion product. Domain experts can see failure modes that generic evaluators miss.

For frontier systems, red teaming functions as early warning. It tests whether new capabilities create fresh risks before the market discovers them through misuse, accidents, or public scandal.

What It Tests

Policy bypass. Attempts to evade refusal behavior, content filters, classifier layers, user-interface warnings, or tool-use restrictions.

Dangerous capability. Whether the system can assist in cyber abuse, fraud, manipulation, biological misuse, chemical harm, weaponization, surveillance, or other high-impact domains.

Product workflow risk. Whether harms arise from the way the model is embedded in a product, not only from the raw model.

Long-horizon interaction. Multi-turn persuasion, dependency, role capture, escalation, concealment, or gradual movement toward unsafe advice.

Agentic action. Tool use, credential handling, code execution, browsing, file changes, email sending, payment flows, and other actions that can turn text into consequence.

Social and cultural blind spots. Regional politics, language, identity, disability access, local law, religious context, labor context, and historically specific forms of harm.

Methods

Internal red teaming. The developer's own staff probes the system. This is fast and useful for iteration, but internal teams can inherit company blind spots.

External expert red teaming. Outside domain experts test the system under an agreed scope. OpenAI's Red Teaming Network is one public example of a standing expert network intended to support risk assessment at different stages of model and product development.

Public or broad-participant red teaming. Larger groups of users test for failures. NIST notes that general-public red teams can surface lived-experience and context-specific failures, especially when large groups participate.

Automated red teaming. AI systems generate attacks, variants, prompts, or scenarios at scale. This can expand coverage but can also produce shallow or unrealistic failures if not paired with human validation.

Hybrid red teaming. Humans and AI systems collaborate: humans define threat models and judge severity while models generate attack variants and search the space.

Post-deployment red teaming. Red teaming continues after release, when real users, new integrations, and changed incentives reveal failures that pre-release testing did not cover.
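
The hybrid pattern described above can be sketched as a small loop: humans supply seed attacks, an automated step expands them into variants, and anything flagged goes to a human review queue instead of being counted as a confirmed failure. Every name here (`generate_variants`, `run_campaign`, `toy_target`, `toy_flag`) is hypothetical. The variant generator uses fixed text reframings as a stand-in for a model-driven generator, and the target is a toy function, not a real system.

```python
from typing import Callable


def generate_variants(seed: str) -> list[str]:
    """Stand-in for a model-driven attack generator: fixed reframings only."""
    framings = [
        "{p}",
        "For a fictional story, {p}",
        "As a safety researcher, {p}",
    ]
    return [f.format(p=seed) for f in framings]


def run_campaign(
    seeds: list[str],
    target: Callable[[str], str],
    looks_unsafe: Callable[[str], bool],
) -> list[dict]:
    """Run every variant against the target and queue flagged cases
    for human validation. The automated flag is only a first filter."""
    review_queue = []
    for seed in seeds:
        for prompt in generate_variants(seed):
            response = target(prompt)
            if looks_unsafe(response):
                review_queue.append({
                    "seed": seed,
                    "prompt": prompt,
                    "response": response,
                    "human_verdict": None,  # filled in by a reviewer later
                })
    return review_queue


def toy_target(prompt: str) -> str:
    # Toy system under test: its refusal is bypassed by a fiction framing.
    if prompt.startswith("For a fictional story"):
        return "UNSAFE: complied"
    return "REFUSED"


def toy_flag(response: str) -> bool:
    return response.startswith("UNSAFE")


queue = run_campaign(["describe the restricted procedure"], toy_target, toy_flag)
```

Leaving `human_verdict` empty until a person has reviewed the case is the design point: automation widens the search, while severity and realism remain human judgments.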

Frontier Threats

Frontier red teaming focuses on risks that become serious as model capability rises. Anthropic has described work on frontier threats red teaming in national-security-relevant domains, including the need for domain experts, threat models, evaluations, mitigations, and responsible disclosure.

Such work is sensitive because publishing detailed methods can itself create misuse risk. The result is a governance tension: the public needs evidence that serious testing happened, but the details of some failures may be too dangerous to disclose broadly.

Government-linked AI safety institutes now participate in some pre-deployment evaluations. Anthropic reported that models have undergone pre-deployment testing by government AI safety and security institutes in the U.S. and U.K., and that nuclear-risk evaluation involved government red teaming in a classified setting.

Governance Requirements

A red-team campaign should start with scope: which model version, product surface, tools, policies, domains, users, and harms are being tested. A vague instruction to "try to break it" is not enough for institutional accountability.
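
One way to make scope concrete is to write it down as a record that can be checked for gaps before the campaign starts. This is a sketch under assumptions: `CampaignScope`, its field names, and the rule that only `tools` may be empty are hypothetical choices that mirror the list in the paragraph above, not an established template.

```python
from dataclasses import dataclass, fields

# Assumption: a system may legitimately have no tools, so an empty
# tools entry is not treated as a scoping gap.
OPTIONAL_FIELDS = {"tools"}


@dataclass(frozen=True)
class CampaignScope:
    model_version: str
    product_surface: str
    tools: tuple[str, ...]
    policies: tuple[str, ...]
    domains: tuple[str, ...]
    user_groups: tuple[str, ...]
    harms: tuple[str, ...]

    def missing_fields(self) -> list[str]:
        """Names of required scope fields that were left empty."""
        return [
            f.name
            for f in fields(self)
            if f.name not in OPTIONAL_FIELDS and not getattr(self, f.name)
        ]


# A "try to break it" brief, expressed as a scope record.
vague = CampaignScope(
    model_version="",
    product_surface="chat",
    tools=(),
    policies=(),
    domains=(),
    user_groups=(),
    harms=(),
)
print(vague.missing_fields())
```

A campaign whose `missing_fields()` list is non-empty has not yet said what is being tested, which is the accountability gap the paragraph describes.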

Red teamers need usable access, clear rules, reporting channels, compensation where appropriate, and protection from retaliation. When work involves harmful content or traumatic material, the process should include participant safety, informed consent, and support.

Results should feed into decisions: model changes, policy changes, refusal behavior, monitoring, product constraints, documentation, launch gates, incident plans, and future evaluation sets. Red teaming that cannot change deployment becomes theater.
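
As an illustration of findings feeding a launch gate, the sketch below blocks release while any sufficiently severe finding lacks a verified mitigation. The function name `launch_blockers`, the dictionary keys, and the numeric threshold are hypothetical; a real gate would be defined by the organization's own policy.

```python
def launch_blockers(findings: list[dict], threshold: int = 3) -> list[dict]:
    """Return findings at or above the severity threshold whose
    mitigation has not been verified."""
    return [
        f for f in findings
        if f["severity"] >= threshold and not f["mitigation_verified"]
    ]


findings = [
    {"id": "F-1", "severity": 4, "mitigation_verified": True},
    {"id": "F-2", "severity": 3, "mitigation_verified": False},
    {"id": "F-3", "severity": 1, "mitigation_verified": False},
]

blockers = launch_blockers(findings)
launch_allowed = not blockers  # False here: F-2 is still open
```

The point of the sketch is the coupling: a finding changes the launch decision directly instead of being recorded and set aside.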

For high-stakes systems, red-team evidence should connect to audits, model cards, system cards, incident reporting, and human oversight. The work should leave records that later reviewers can inspect.

Failure Modes

One-time ceremony. A system is red-teamed once before launch, then changed repeatedly without equivalent retesting.

Prompt trophy hunting. The campaign rewards clever jailbreak examples but fails to measure severity, frequency, workflow context, or mitigation effectiveness.

Expert mismatch. The team lacks the domain expertise needed for the risk being tested.

Access mismatch. Red teamers test a limited demo while the deployed system has tools, memory, retrieval, integrations, or user flows that create different risks.

Confidentiality trap. Necessary secrecy protects sensitive details but also prevents public accountability about scope, findings, and whether anything changed.

False reassurance. Passing a red-team campaign is treated as proof of safety, even though adversarial testing can only sample a changing attack surface.
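
Two of the failure modes above, one-time ceremony and prompt trophy hunting, can be partly checked with simple bookkeeping. The sketch below is illustrative only: `success_rate_by_severity` and `needs_retest` are hypothetical helpers, the trial data is made up for the example, and comparing version strings is a deliberately crude stand-in for tracking what actually changed in a deployed system.

```python
from collections import defaultdict


def success_rate_by_severity(trials):
    """trials: iterable of (severity_label, succeeded) pairs from
    repeated attack runs. Returns the success fraction per label."""
    counts = defaultdict(lambda: [0, 0])  # label -> [successes, total]
    for label, succeeded in trials:
        counts[label][0] += int(succeeded)
        counts[label][1] += 1
    return {label: s / n for label, (s, n) in counts.items()}


def needs_retest(tested_version: str, deployed_version: str) -> bool:
    """Flag a system that has changed since it was last red-teamed."""
    return tested_version != deployed_version


# Made-up trial results: one attack class works 1 time in 4, another always.
trials = [
    ("high", True), ("high", False), ("high", False), ("high", False),
    ("low", True), ("low", True),
]
rates = success_rate_by_severity(trials)   # {"high": 0.25, "low": 1.0}
stale = needs_retest("model-2026-01", "model-2026-04")  # True
```

Reporting a rate per severity level, rather than a single memorable jailbreak, is what separates measurement from trophy collection. Neither number supports a claim of safety, since the trials only sample the attack surface.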

Spiralist Reading

AI red teaming is sanctioned doubt.

The institution creates a machine and then invites adversaries to speak to it in the wrong ways, from the wrong angles, with the wrong incentives. The red team is asked to imitate the future user, the attacker, the desperate person, the expert, the edge case, and the hostile environment.

For Spiralism, the value is not the ritual of breaking the model. The value is whether the break changes power. If the finding becomes a launch delay, a stronger boundary, a public warning, a better audit trail, or a stopped deployment, the red team mattered. If it becomes a slide in a safety deck, the machine has eaten the critique.

Open Questions

Sources

