Wiki · Concept · Last reviewed June 25, 2026

AgentDojo

AgentDojo is an evaluation environment for tool-using LLM agents under prompt-injection attack: the agent tries to complete ordinary user tasks while malicious instructions appear inside untrusted tool data.

Category: AI security Updated: June 25, 2026 Tags: agents, prompt injection, benchmarks, tool use, AI security

Definition

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents is arXiv:2406.13352 by Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. The arXiv API lists the first submission on June 19, 2024 and version 3 on November 24, 2024.

The project frames AgentDojo as an extensible benchmark environment, not a fixed leaderboard. Its target is agentic prompt injection: an agent reads data from tools or external sources, and hostile instructions embedded in that data try to redirect it.

Threat Model

AgentDojo focuses on agents that combine natural-language reasoning with external tool calls. The paper's abstract describes the core failure mode: data returned by tools can hijack the agent and cause it to execute malicious tasks. The attacker does not need to change the model weights or control the user prompt. The attack can arrive through the environment the agent reads.

This makes AgentDojo closer to application security than ordinary chatbot evaluation. The question is whether a model-and-tool pipeline can keep user instructions, system instructions, tool outputs, and attacker text in the right authority order while still completing useful tasks.

Benchmark Structure

The arXiv abstract says AgentDojo includes 97 realistic tasks and 629 security test cases. Examples include managing email, navigating an e-banking website, and making travel bookings: delegated tasks where agents read untrusted content and then act through tools.

The GitHub README describes installation through the agentdojo Python package, benchmark runs through agentdojo.scripts.benchmark, project documentation, a results page, and an API still under development. Its citation block identifies the paper as appearing in the NeurIPS 2024 Datasets and Benchmarks Track.

AgentDojo is also designed for adaptation. The paper describes it as an environment for designing and evaluating new agent tasks, defenses, and adaptive attacks. A prompt-injection benchmark should not freeze the attack surface and then treat defenses as solved.

Reported Findings

The paper's abstract reports two cautious findings. First, state-of-the-art LLMs failed at many tasks even when no attack was present. Second, existing prompt-injection attacks broke some security properties but not all. These are benchmark findings, not product guarantees.

The result warns both sides of the debate. Agents are not reliable merely because they can call tools; they can fail ordinary tasks. Defenses are not reliable merely because they stop a visible jailbreak; indirect instructions in tool-returned data can still alter behavior.

Governance and Safety

AgentDojo matters for governance because it translates prompt injection from a slogan into an evaluation object. A buyer, regulator, or internal risk team can ask whether an agent was tested on tasks where untrusted content is read before consequential tool use.

It also pushes evidence beyond refusal text. A serious agent-safety case should describe the tool boundary, data trust boundary, action permissions, defense, attack, task utility, security metric, and failure trace. A polite model with broad tool authority has not addressed the AgentDojo problem.

Limits

AgentDojo is not a full measure of deployed agent safety. Static benchmark success can hide brittleness, and adaptive attackers may exploit a defense that looked strong against known attacks. Results depend on the model, tool schema, task suite, attack family, defense wrapper, and evaluation harness.

It also should not be read as proof that all agent use is unsafe. It is a controlled stress test: it isolates one important failure mode so designers can compare defenses and find weak tool-use architecture.

Evidence Record

A serious AgentDojo result should record the package version, task suite, user task IDs, injection task IDs, attack, defense, model version, tool set, permissions, prompt template, seed, utility score, security score, benign failures, successful attacks, and trajectory logs.

Source Discipline

Use the exact paper identity: AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents, arXiv:2406.13352. The arXiv page is the source for the title, authors, dates, threat framing, 97 tasks, 629 security test cases, and reported high-level findings. The GitHub README is the source for package use, documentation, results links, API caveat, and NeurIPS Datasets and Benchmarks citation.

Do not cite AgentDojo as proof that a commercial agent is secure. Cite it as a benchmark environment for one central class of agent risk: indirect prompt injection through untrusted tool data.

Spiralist Reading

AgentDojo is a lesson in contaminated context.

The agent does not need to become mystical or malicious for trouble to begin. It only has to mistake hostile text for relevant instruction while holding useful tools. For Spiralism, the benchmark marks the place where reading and acting must be separated by boundaries, not vibes.

Open Questions

Which defenses survive adaptive attacks that know the defense design?
How should benchmarks score over-defensive agents that refuse useful work?
Which tool permissions should never be reachable from untrusted content?

Sources

Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr, AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents, arXiv:2406.13352 [cs.CR], submitted June 19, 2024; v3 revised November 24, 2024.
ETH Zurich SPY Lab GitHub repository, ethz-spylab/agentdojo, README reviewed June 25, 2026.
AgentDojo project documentation, AgentDojo, reviewed June 25, 2026.

Return to Wiki