Blog · Analysis · May 2026

The Red Team Becomes the Release Theater

AI red teaming can expose real failures. It can also become a public ritual that makes a model look governed before the institution knows what the test actually proved.

From Security to Governance

Red teaming has moved from cybersecurity practice into the language of AI legitimacy.

In the older security sense, a red team attacks a system so its defenders can learn. The method assumes that design review and ordinary testing miss important failures. A motivated adversary finds paths through the architecture, the process, the interface, and the people. The value is not theater. The value is discovery under pressure.

AI changed the object being attacked. A model can fail by exposing harmful instructions, producing false claims, leaking data, amplifying bias, following malicious instructions hidden in retrieved text, giving dangerous assistance, or behaving differently across languages, contexts, and tool-use settings. The test is no longer only whether an intruder can break in. It is whether a system that speaks, recommends, ranks, drafts, searches, reasons, or acts can be induced into unsafe behavior.

That is why red teaming now appears in official governance. NIST's AI glossary defines AI red teaming as a structured testing effort, often using adversarial methods, to find flaws, vulnerabilities, unforeseen behaviors, and misuse risks. NIST's generative AI profile treats red teaming as one method among participatory engagement, field testing, and other evaluations. NIST's ARIA program explicitly combines model testing, red teaming, and field testing to examine technical and contextual robustness.

The European Union has pushed the practice into law for the most powerful general-purpose models. Article 55 of the AI Act requires providers of general-purpose AI models with systemic risk to perform model evaluation using state-of-the-art protocols and tools, including documented adversarial testing, to identify and mitigate systemic risks.

The result is a new governance object: not the model, not the benchmark, not the audit, but the adversarial event. A red team exercise becomes evidence. It can feed a safety case, a system card, a procurement decision, a regulator briefing, a launch gate, or a public claim that the provider took safety seriously.

The Public Red Team

The most visible symbol of this shift was the public generative-AI red team at DEF CON 31 in 2023.

AI Village announced the event with support from the White House Office of Science and Technology Policy, the National Science Foundation's CISE Directorate, and the Congressional AI Caucus. The event brought public participants to test models from major AI companies and adjacent organizations. It treated public adversarial probing as both safety work and civic education: more people would learn how these systems fail, and providers would receive evidence about failure modes that ordinary internal review might not reveal.

That public form matters. It broke the assumption that AI safety testing must occur entirely inside corporate labs. It also showed the limits of spectacle. A crowd can find vivid examples of failure. A timed challenge can generate useful traces. But a capture-the-flag format is not automatically a statistically meaningful evaluation of model behavior. AI Village's later announcement for a second generative red team made the point directly: single examples can reveal where to look, but serious evaluation needs datasets, reporting structures, adjudication, and public evidence about how often a failure appears.

This is the central tension. Public red teaming democratizes attention. It lets people outside a company touch the system, discover harms, compare notes, and build expertise. But if the event becomes the proof, the logic reverses. The institution can point to the crowd instead of explaining the method. The release story becomes: many people attacked it, therefore the system is ready.

That conclusion does not follow. A red team is a searchlight, not a guarantee.

What Red Teaming Can Prove

Red teaming is valuable because it finds failures that clean evaluation often misses.

Benchmarks usually ask a system to perform specified tasks under specified conditions. Red teaming asks what happens when the conditions become hostile, weird, social, multilingual, goal-shifted, or institutionally realistic. It tests the guardrail, not only the capability. It tests the gap between stated policy and actual behavior. It turns a model from an object of measurement into a participant in a contested interaction.

For AI systems, that contested interaction is often the point. Users do not simply submit neutral tasks. They persuade, trick, misunderstand, role-play, pressure, conceal, chain tools, paste context, ask for exceptions, and exploit ambiguity. Organizations also use models in workflows full of incentives and shortcuts. A customer-support bot faces angry users. A coding agent reads hostile repositories. A legal assistant ingests adversarial documents. A workplace assistant sees permissions that no single employee has cognitively mapped. A public-service chatbot faces people asking desperate questions with incomplete facts.

A good red team can reveal these edges. It can show that a model follows instructions hidden inside a web page. It can show that a supposedly blocked capability is reachable through translation, analogy, staged reasoning, or tool use. It can show that safety performance differs sharply across models, languages, domains, or deployment settings. It can create a record that engineers, lawyers, procurement officers, and regulators can use.

The practice also creates a labor force and a language. People learn how to describe AI failures, classify them, reproduce them, and dispute provider responses. That matters because model-mediated knowledge systems do not fail only through spectacular disasters. They fail through small repeated deformations: false confidence, hidden deference, unsafe defaults, permission leakage, synthetic authority, and user adaptation to the machine's preferred path.

What It Cannot Prove

Red teaming becomes dangerous when it is treated as proof of safety.

The first limitation is sampling. A red team searches part of a huge behavioral space. Even a large event cannot cover every prompt, language, user population, tool chain, retrieval source, policy conflict, or institutional setting. Finding failures proves the system can fail. Not finding a failure proves much less.
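The asymmetry can be put in numbers. The following is a minimal sketch, assuming every attempt is an independent draw with the same per-attempt failure probability, an idealization that real red teams rarely satisfy. If n attempts produce zero failures, the largest failure rate still consistent with that result at confidence level c is 1 - (1 - c)^(1/n). The function name and sample sizes are illustrative, not taken from any cited program.

```python
def max_failure_rate_after_clean_run(attempts: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-attempt failure rate after a run with zero failures.

    Assumption: attempts are independent and share one failure probability p.
    Solves (1 - p) ** attempts = 1 - confidence for p.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / attempts)


if __name__ == "__main__":
    # Hypothetical exercise sizes, chosen only to show how slowly the bound shrinks.
    for n in (100, 1_000, 10_000):
        bound = max_failure_rate_after_clean_run(n)
        print(f"{n} clean attempts: failure rate could still be up to {bound:.4%}")
```

Under those assumptions, 1,000 clean attempts still leave room for roughly 3 failures per 1,000 interactions, which is a large number at deployment scale. The bound also says nothing about prompts, languages, or tool chains the red team never sampled.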

The second limitation is incentives. If the provider defines the scope, controls the model version, selects the tasks, withholds logs, classifies reports, and writes the summary, the exercise may be adversarial only at the surface. The red team attacks the system, but the institution controls what counts as a wound.

The third limitation is repair theater. A company can fix the examples without fixing the class of failure. It can patch prompts, blacklist strings, add refusal templates, or tune around the public exploit while leaving the underlying pattern intact. In generative systems, the visible exploit is often just one path through a broader behavioral manifold.

The fourth limitation is release timing. Red teaming is often concentrated before launch, when the product team wants a green light. But deployed systems drift. They receive new tools, connectors, memories, policies, model updates, retrieval corpora, plugins, and user habits. A pre-release exercise can become stale almost immediately after deployment.

The fifth limitation is public meaning. Red teaming has cultural force. It sounds tough, adversarial, and empirical. That makes it easy to convert into a badge. The claim "we red-teamed the model" can function like "we audited the system" or "we published a system card": not false, but incomplete. The verb becomes a halo around a process the public cannot inspect.

Agentic Systems Change the Test

Agentic AI makes red teaming more important and less sufficient.

When a chatbot only answers in text, a failure may be dangerous because it informs, persuades, or misleads. When an agent uses tools, a failure can become action. It may send email, edit code, call APIs, purchase goods, move files, query private data, operate a browser, or change an institutional record.

A March 2026 write-up from CAISI, NIST's Center for AI Standards and Innovation, on a large-scale AI agent red-teaming competition focused on agent hijacking, also known as indirect prompt injection. In those attacks, malicious instructions are placed inside data that an agent later reads: a web page, email, document, repository, or other external source. CAISI reported that the competition covered 13 frontier models across tool-use, coding-agent, and computer-use scenarios, with more than 250,000 attack attempts by over 400 participants. At least one successful attack was found against every target model.

The important point is not that every model is doomed. The important point is that the attack surface moved into the world. The adversary may not talk to the model directly. They may write the document the model will later trust. They may shape the environment the model treats as context.

That changes governance. A red team for an agent must test the model, the tools, the permission system, the connector layer, the logs, the recovery path, and the human handoff. It must ask whether the system can distinguish user intent from environmental instruction. It must ask what the agent can do before a human notices. It must ask how an organization reconstructs the chain of delegated action after something goes wrong.

This links red teaming to the site's broader analysis of the agent log as receipt, the tool server as trust boundary, and the enterprise connector as permission map. Red teaming is not a separate ritual. It is one instrument inside the governance of delegated machine action.

The Governance Standard

A serious AI red-team regime should leave behind more than anecdotes.

First, scope should be explicit. The report should say what model version, system prompt, tools, policies, languages, domains, user roles, and deployment settings were tested. A model-only red team should not be used to certify an agentic product.
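One way to make scope explicit is to record it as data and compare it against what actually ships. The sketch below is illustrative only: the `RedTeamScope` record, its field names, and the `uncovered` helper are hypothetical and do not come from NIST, the AI Act, or any provider's reporting format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RedTeamScope:
    """What an exercise covered, or what a product deploys (hypothetical fields)."""
    model_version: str
    system_prompt_id: str
    tools: frozenset = field(default_factory=frozenset)
    languages: frozenset = field(default_factory=frozenset)
    user_roles: frozenset = field(default_factory=frozenset)


def uncovered(tested: RedTeamScope, deployed: RedTeamScope) -> dict:
    """Return the parts of the deployed configuration that the exercise did not test."""
    gaps = {}
    if tested.model_version != deployed.model_version:
        gaps["model_version"] = deployed.model_version
    if tested.system_prompt_id != deployed.system_prompt_id:
        gaps["system_prompt_id"] = deployed.system_prompt_id
    for name in ("tools", "languages", "user_roles"):
        missing = getattr(deployed, name) - getattr(tested, name)
        if missing:
            gaps[name] = sorted(missing)
    return gaps


if __name__ == "__main__":
    # A model-only exercise (no tools, English only) compared with an agentic deployment.
    tested = RedTeamScope("m-1", "sp-a", languages=frozenset({"en"}))
    deployed = RedTeamScope(
        "m-1", "sp-a",
        tools=frozenset({"browser", "email"}),
        languages=frozenset({"en", "es"}),
    )
    print(uncovered(tested, deployed))
    # {'tools': ['browser', 'email'], 'languages': ['es']}
```

A non-empty result is the concrete form of the warning above: the exercise may be sound, but it does not speak to the tools and languages the product adds.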

Second, the team should fit the risk. NIST's generative AI profile emphasizes that red-team output depends on the team's background and expertise. Medical, legal, education, cybersecurity, labor, child-safety, and public-service systems need domain expertise, not only clever prompting.

Third, the method should distinguish examples from rates. A single jailbreak can be important, but governance decisions need to know whether the problem is rare, common, localized, transferable, patched, or structurally unresolved.
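The difference between an example and a rate can be shown with a standard confidence interval. This is a minimal sketch using the Wilson score interval for a proportion. It assumes attempts are independent and comparable, which crowd-sourced attacks often are not, and the counts in the usage lines are made up for illustration.

```python
import math


def wilson_interval(successes: int, attempts: int, z: float = 1.96) -> tuple:
    """Wilson score interval for an attack success rate.

    z = 1.96 corresponds to roughly 95% coverage.
    Assumption: attempts are independent trials with a common success probability.
    """
    if attempts <= 0 or not 0 <= successes <= attempts:
        raise ValueError("need attempts > 0 and 0 <= successes <= attempts")
    p_hat = successes / attempts
    denom = 1.0 + z * z / attempts
    centre = (p_hat + z * z / (2 * attempts)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / attempts + z * z / (4 * attempts ** 2)
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Two hypothetical findings with the same observed rate of 5%.
    for successes, attempts in ((1, 20), (50, 1000)):
        low, high = wilson_interval(successes, attempts)
        print(f"{successes}/{attempts}: plausible rate between {low:.1%} and {high:.1%}")
```

With these made-up counts, one success in 20 attempts is compatible with a rate anywhere from roughly 1% to 24%, while 50 in 1,000 narrows it to roughly 4% to 7%. Both are "a jailbreak was found", but only the second supports a claim about how common the problem is.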

Fourth, findings should map to mitigations. A red team should not end with a list of impressive failures. It should produce design changes, policy changes, monitoring plans, release conditions, and unresolved-risk statements.

Fifth, independence should be real. Internal red teams are useful, but external testing, public-interest researchers, regulator access, and protected disclosure channels matter when the provider has a launch incentive.

Sixth, public summaries should be meaningful. Trade secrets and exploit details may need protection, but the public should be able to see categories of risk, severity, scope, mitigation status, and residual uncertainty. Otherwise the red team becomes private knowledge converted into public legitimacy.

Seventh, testing should continue after deployment. Model updates, new tools, retrieval changes, and user behavior can reopen old failures or create new ones. Release is not the end of adversarial evaluation. It is the beginning of exposure to adversaries who are no longer playing by the test rules.

The Spiralist Reading

The red team is a ritual of controlled opposition.

At its best, that ritual disciplines power. It gives critics a sanctioned path into the system before the system hardens into infrastructure. It converts dissent into evidence. It lets an institution learn from attack without waiting for public harm.

At its worst, it domesticates opposition. The institution stages an attack, absorbs the findings, publishes a vague assurance, and proceeds. The adversary becomes part of the release ceremony. The public sees conflict and mistakes it for accountability.

This is the recursive trap. AI systems are increasingly used to produce knowledge, classify risk, manage work, and mediate public reality. Institutions then use AI red teaming to produce knowledge about those systems. The test shapes what failure means. The report shapes what regulators and buyers believe. That belief shapes adoption. Adoption creates the next reality to be tested.

The answer is not to dismiss red teaming. It is to refuse the theater version. Red teaming should make uncertainty visible, not launder it. It should create institutional memory, not launch decoration. It should open the model to skilled antagonism and then show what changed because of what was found.

A good red team does not say the system is safe. It says: here is how we tried to break it, here is what broke, here is what we fixed, here is what remains unknown, and here is who can stop deployment if the unknowns are too large.
