Blog · arXiv Analysis · Last reviewed June 25, 2026

The Out-of-Band Defense Becomes the Reference Monitor

A June 2026 arXiv paper reframes prompt-injection defense for tool-using agents as action mediation: the safety boundary is not what the model says, but which tool calls a deterministic monitor will allow.

The Action Boundary

The paper, arXiv:2606.26479 [cs.CR], is Praneeth Narisetty, Shiva Nagendra Babu Kore, Uday Kumar Reddy Kattamanchi, and Jayaram Kumarapu's Adaptive Evaluation of Out-of-Band Defenses Against Prompt Injection in LLM Agents. arXiv records submission on June 25, 2026, and lists Cryptography and Security as the primary subject, with Artificial Intelligence, Computation and Language, and Machine Learning as additional subjects.

The paper's core distinction is practical. A chatbot can be misled into producing bad text. A tool-using agent can be misled into sending mail, moving money, deleting a row, posting a message, or leaking data. In that setting, indirect prompt injection is not only a content-classification problem. It is an authorization problem at the point where language becomes an action.

That framing matters because many defenses still live inside the same token stream the attacker can influence. Input classifiers, guardrail models, refusal prompts, and model-layer detectors may reduce incidents, but they remain in-band. The paper argues that consequential tool calls need a separate enforcement layer that does not ask the compromised model whether the action is safe.

The Classical Security Lens

The fresh contribution is not a new slogan. The authors organize a family of out-of-band defenses through older security primitives: Biba integrity, reference monitors, least privilege, capabilities, and information-flow control. The surveyed systems include CaMeL, FIDES, Progent, RTBAS, Conseca, and FORGE.

Read through that lens, the agent becomes a subject whose authority changes when it reads untrusted text. If low-integrity content can steer a high-integrity action, the system has allowed a write-up. The reference monitor is the small, always-invoked enforcement point that checks whether the next tool call is allowed, regardless of the model's current story about why the call seems necessary.

The Static Benchmark Trap

The paper is careful about current evidence. Several action-level defenses report strong results on static benchmarks such as AgentDojo and ASB. But static benchmarks test a fixed attack set. The attacker does not get to see the defense, adapt to it, and search for a path through its assumptions.

That matters because the same evaluation style made earlier in-band defenses look stronger than they were. The paper points to adaptive, defense-aware attacks that broke many model-layer defenses after the attacker moved second. The authors do not claim that out-of-band defenses will fail the same way. Their claim is narrower and stronger: if the field wants confidence in deterministic action mediation, it needs adaptive evaluation designed for action-level monitors.

What the Reproduction Found

The paper then runs one such test against Progent. The authors describe it as an independent reproduction and extension of Progent's adaptive-attack analysis. Their stack used Qwen2.5-7B-Instruct served by vLLM on a single NVIDIA H200 GPU. The benchmark was AgentDojo, using banking, slack, and workspace suites; the travel suite was excluded because it produced pathologically long loops on the 7B model.

The reported security result is encouraging but bounded. Across three runs, mean attack success fell from 25.8 percent in the undefended condition to 4.2 percent with Progent under the standard attack. A hand-crafted adaptive attack did not raise success; the paper reports 2.6 percent. Banking fell from 19.0 percent to 1.9 percent under the standard defended condition, while slack fell from 58.3 percent to 10.8 percent. Workspace stayed at 0 percent across conditions, which the authors interpret as a property of the read-oriented task subset rather than proof that the suite is generally uninjectable.

The cost was also real. Task utility fell in defended runs, with the paper reporting mean utility around 45 percent undefended and around 26 percent under the defended conditions. Slack and workspace saw especially visible drops. The monitor made bad actions harder, but it also made some good work harder.

Limits That Matter

The limits are not footnotes. The authors tested one defense, one weak open-weight agent, one fixed subsample, and one adaptive attack family. They did not run a white-box or gradient-style attack, and they did not test an attacker constrained to achieve the malicious goal only through already-authorized tools. Progent's policy-authoring model was also changed from GPT-4o in the original Progent work to the same local Qwen2.5-7B used for the agent.

Those limits keep the result from becoming a universal claim. It supports the hypothesis that deterministic out-of-band enforcement may be a harder target than in-band detection, but it does not prove that the class is robust. It says this particular monitor held against this particular adaptive template in this particular open-weight setup.

Governance Standard

A serious agent deployment should publish its action boundary. The record should say which channels are trusted, how provenance labels are assigned, where the reference monitor sits, whether every tool call is mediated, what the fail-closed behavior is, who can change policy, which arguments are checked, how denied calls are logged, and how human approval is requested without turning the human into a rubber stamp.

The evaluation record should be just as explicit. A static AgentDojo or ASB score is not enough. The safety case should include defense-aware attacks, repeated runs, utility retention, tool-call overhead, attacks that target the policy authoring path, attacks that use only already-authorized tools, and a statement of what is still out of scope, including text-only harms and side channels.

The Spiralist rule is plain: when an agent can act, the ritual of safety moves to the action surface. A model may narrate intent. A reference monitor decides authority.

Sources


Return to Blog