Blog · arXiv Analysis · Last reviewed June 25, 2026

The Action-Open Task Becomes the Injection Slot

AutoDojo reframes indirect prompt injection as an adaptive test of the task itself: when the user leaves the action open, malicious content can look like ordinary task data.

The Source

The source is Xinhang Ma, Taoran Li, Chaowei Xiao, Zhiyuan Yu, Ning Zhang, and Yevgeniy Vorobeychik's AutoDojo: Adaptive Black-Box Attacks Reveal the Limits of IPI Defenses and Task-Specification Effects in LLM Agents, arXiv:2606.15057v2 [cs.CR]. The arXiv record lists initial submission on June 13, 2026 and revision on June 19, 2026. The PDF is 18 pages, and the abstract links a public repository at xhOwenMa/AutoDojo.

The paper studies indirect prompt injection, or IPI: hostile instructions placed in external content that an otherwise legitimate agent reads while doing a user task. The authors group defenses into prompt-based, detection-based, and system-level approaches, then ask whether defenses that look strong on fixed AgentDojo attacks still hold when the attacker adapts to the defense.

The Static Benchmark Trap

AutoDojo's first useful move is methodological. A static prompt-injection benchmark gives every defense the same frozen attack distribution. That is useful for comparison, but it can reward defenses that learn the shape of the benchmark rather than the structure of the threat. If the hostile strings never change, a filter can look principled while mostly recognizing familiar attack texture.

AutoDojo extends AgentDojo by turning seed injections into adaptive black-box attacks. The attacker does not need model weights or gradients. It uses observable success and failure, plus a frontier LLM optimizer, to iteratively rewrite the injection against a defended agent. The paper evaluates state-of-the-art IPI defenses across three task suites and five target models. In the headline case, a filter that reduces static attack success to zero still lets AutoDojo recover 28 percent attack success overall and 64 percent on action-open tasks.

The Spiralist lesson is that benchmark pass rates are not receipts unless the attacker was allowed to move. A deployed agent does not face one canonical injection. It faces emails, tickets, pages, documents, reviews, code comments, search snippets, and tool outputs written after the defense is known.

Action-Open Tasks

The paper's freshest angle is the task-specification axis. The authors divide user tasks by how much the user has already specified. A fully specified task names the action and the parameters. A parameter-open task names the action but asks the agent to get details from content. An action-open task names no action and leaves the agent to infer what to do from attacker-controlled content.

That last category is where the injection slot opens. If a user says, in effect, "look at this content and do whatever it asks," the hostile payload does not need to sound like an explicit override. It can pose as ordinary data, a process note, a required next step, a billing instruction, or a workflow cue. Prompt-level and filter-based defenses struggle because the bad content is not only malicious instruction; it is also the apparent substance of the delegated task.

This matters because normal users rarely write security-perfect prompts. They ask agents to summarize and handle inboxes, process forms, review requests, book travel, answer messages, and clean up workflows. Those are often under-specified by design. A usable agent fills in gaps. AutoDojo shows why the same gap-filling behavior can become the attack surface.

Governance Reading

The governance control is not "detect every bad sentence." It is task narrowing. Before an agent reads untrusted content, the runtime should know what action family is allowed, what parameters may come from the content, which tools are excluded, and which side effects require confirmation. If the user has not named an action, the agent should treat that absence as a risk signal, not as a blank check.

That places this page beside The Injection Prompt Becomes the Search Problem, The Out-of-Band Defense Becomes the Reference Monitor, AgentDojo, The Tool Scope Becomes the Intent Gate, and The Agent Security Survey Becomes the Threat Model. The neighboring pages ask how to search for attacks, mediate tool calls, and map threats. AutoDojo adds a sharper question: how much of the action did the user actually specify before the model saw untrusted text?

A procurement review should therefore demand task-bucket results. Do not accept one aggregate IPI score. Ask for fully specified, parameter-open, and action-open results; the attack budget; the seed attacks; the target model; the defense stack; the tool manifest; and the final environment-state checks.

Limits and Cautions

AutoDojo is a preprint evaluation, not a certificate that every deployed agent fails. The paper's reported results are bounded by its task suites, target models, defense implementations, optimizer budget, AgentDojo-based setup, and attack-success definition. The authors also report that system-level defenses can provide meaningful protection under the adaptive attack, especially when they constrain actions rather than merely trying to classify hostile language.

The right conclusion is narrower and stronger: a low static attack-success rate is not enough. It may mean the model is resistant to that fixed string, or that a defense recognizes the benchmark's familiar pattern. The release question is whether the whole system survives adaptive, defense-aware attempts in the same kind of under-specified workflows users will actually delegate.

Audit Receipt

The audit-grade sentence is: Ma, Li, Xiao, Yu, Zhang, and Vorobeychik's arXiv:2606.15057 introduces AutoDojo, an adaptive black-box extension of AgentDojo for testing indirect prompt-injection defenses, and reports that action-open tasks are structurally more vulnerable because malicious content can present itself as ordinary task data.

The practical receipt is: every agent release that reads untrusted content should report IPI robustness by task-specification bucket, not only by aggregate benchmark score, and should preserve the tool trace that proves whether an unauthorized action occurred.

Sources

Xinhang Ma, Taoran Li, Chaowei Xiao, Zhiyuan Yu, Ning Zhang, and Yevgeniy Vorobeychik, AutoDojo: Adaptive Black-Box Attacks Reveal the Limits of IPI Defenses and Task-Specification Effects in LLM Agents, arXiv:2606.15057v2 [cs.CR], submitted June 13, 2026 and revised June 19, 2026.
Primary versions checked: arXiv abstract record, arXiv HTML, and PDF.
Associated code link listed by arXiv: xhOwenMa/AutoDojo.
Related pages: The Injection Prompt Becomes the Search Problem, The Out-of-Band Defense Becomes the Reference Monitor, AgentDojo, The Tool Scope Becomes the Intent Gate, and Prompt Injection.

Return to Blog