Blog · arXiv Analysis · Last reviewed June 25, 2026

The Prompt Injection Becomes the Context Problem

Sahar Abdelnabi and Eugene Bagdasarian's arXiv paper AI Agents May Always Fall for Prompt Injections argues that the hard part of agent security is not only separating data from instructions. It is deciding which context makes an action legitimate.

From Commands to Context

The paper, arXiv:2605.17634 [cs.CR], was submitted on May 17, 2026. Its title, AI Agents May Always Fall for Prompt Injections, should be read as a structural warning rather than a settled law of nature. Abdelnabi and Bagdasarian argue that agents face a dilemma: blocking too much destroys legitimate work, and allowing too much lets adversarial context steer action.

Much prompt-injection defense treats the problem as hidden instructions in data. That framing is useful when an email, page, or tool output says something like "ignore previous instructions." It is weaker when the external content changes the apparent context of an action. A message can claim that a manager approved a refund, that a user already delegated authority, that a client will be harmed by delay, or that two requests in one thread share the same permission boundary.

For an agent with tools, this is not just a text-classification problem. The model has to infer who is asking, whose interests are affected, what action is proposed, what authority exists, and whether the action still fits the user's delegation. The attack surface is the agent's judgment about appropriateness.

What the Paper Adds

The authors recast prompt injection through contextual integrity, Helen Nissenbaum's account of privacy as appropriate information flow. In the paper's agent-security version, an action is evaluated by context parameters: sender, receiver, subject, information type, and transmission principle. The agent must know not only what a message says, but under what norm it may act on that message.
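To make the five context parameters concrete, here is a minimal Python sketch. The `Flow` class, the `NORMS` table, and the example values are all hypothetical illustrations and are not taken from the paper. The sketch assumes that the transmission principle is supplied by verified evidence rather than by the message itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One proposed agent action, described by its context parameters."""
    sender: str
    receiver: str
    subject: str
    information_type: str
    # Assumption: this value comes from verified evidence, not from message text.
    transmission_principle: str


# Hypothetical norms: (information_type, receiver) -> principles that permit the flow.
NORMS = {
    ("refund_status", "customer"): {"user_delegation"},
    ("payroll_record", "hr_team"): {"organizational_policy"},
}


def is_appropriate(flow: Flow) -> bool:
    """A flow passes only if a norm covers its type, receiver, and principle."""
    permitted = NORMS.get((flow.information_type, flow.receiver), set())
    return flow.transmission_principle in permitted


refund = Flow("assistant", "customer", "order update",
              "refund_status", "user_delegation")
leak = Flow("assistant", "external_vendor", "employee file",
            "payroll_record", "organizational_policy")
print(is_appropriate(refund), is_appropriate(leak))  # True False
```

The second flow fails even though it names a real policy, because no norm lets payroll records reach that receiver. A lookup table like this is far simpler than the open-ended judgment the paper describes; it only shows what the parameters are.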

That move connects privacy, prompt injection, and delegated authority. A harmful agent action may be a privacy leak, a workflow violation, a forged delegation, or an unauthorized side effect. The common feature is not a forbidden sentence. It is a broken information flow: authority from one context is being used in another.

The paper's strongest contribution is evaluative. It asks defenders to test whether an agent can infer context parameters, ground delegation in history, separate simultaneous flows, and refuse requests whose claimed authority has not been verified. That is a harder standard than scanning untrusted text for known attack vocabulary.

Evidence from the Paper

The paired-email experiment is the cleanest demonstration. The authors construct 4,200 email scenarios, yielding 8,400 emails across authority-asserting and neutral variants. Prompt-injection classifiers looking only at email content perform near chance, with AUROC reported between 0.43 and 0.59. The point is narrow but important: if the attack lies in context rather than injection vocabulary, a content-only detector is looking in the wrong place.
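For readers less familiar with AUROC, the sketch below shows why a value near 0.5 means chance. It computes AUROC as the probability that a randomly chosen positive outscores a randomly chosen negative. The scores are synthetic and invented for illustration; they are not the paper's data, and `auroc` is a hand-written helper, not the authors' evaluation code.

```python
def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the usual rank-based definition.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Synthetic detector scores (assumed): the content-only detector rates each
# authority-asserting email the same as its neutral twin.
authority_scores = [0.2, 0.5, 0.7, 0.4]
neutral_scores = [0.2, 0.5, 0.7, 0.4]

print(auroc(authority_scores, neutral_scores))  # 0.5
```

When the two variants of a pair look alike to the detector, the score distributions overlap and the metric collapses to 0.5, regardless of how confident any individual score is.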

The paper also reports that a contextual-integrity red-team loop targeting context parameters reached 96.7% attack success against an email assistant, compared with a 0.67% unoptimized baseline. The recurring pattern is fabricated provenance: the email makes a claim look like prior user delegation, organizational policy, or later approval. The agent's failure is not obedience to an obvious command. It is accepting the wrong source of authority.

Two other experiments matter for governance. In 300 multi-turn scenarios across email, code, project management, and finance, models executed out-of-scope requests less often when they had interaction history grounding the original delegation. In 100 simultaneous-flow scenarios across 50 professional domains, authorization for one internal action sometimes leaked into a second outbound action in the same thread. When no explicit boundary was stated, the paper reports violation rates of 65% for gpt-5.2, 34% for gpt-5.4, and 6% for claude-sonnet-4-6.
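One way to picture the simultaneous-flow failure is to ask where authorization is stored. The sketch below, with hypothetical `Grant` and `authorized` names, scopes each grant to a single action and direction instead of to the thread, so approval for an internal update cannot be reused for an outbound send. It is an illustration of the boundary, not a mechanism described in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """Authorization for exactly one action in one direction (assumed schema)."""
    action: str     # e.g. "update_ticket"
    direction: str  # "internal" or "outbound"


def authorized(requested: Grant, grants: set) -> bool:
    # Exact match only: sharing a thread with an approved action is not enough.
    return requested in grants


# The user approved one internal action in this thread.
thread_grants = {Grant("update_ticket", "internal")}

# The same thread then contains two requests.
requests = [
    Grant("update_ticket", "internal"),
    Grant("send_report", "outbound"),
]

decisions = {r.action: authorized(r, thread_grants) for r in requests}
print(decisions)  # {'update_ticket': True, 'send_report': False}
```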

The Tradeoff

The paper's impossibility argument is informal and practical. A fixed rule can reject suspicious claims, but some real tasks require acting on claims found in external content. A permissive rule can preserve utility, but an attacker can construct a context in which a forbidden action appears legitimate. Verification can help, but each verifier covers only part of the open-ended context space.

This does not mean defense is futile. It means the safety case should stop promising perfect separation between data and instructions. The real control is evidence: which sender was authenticated, which delegation was recorded, which boundary was stated, which external claim was verified, and which tool call changed state.

What Defenders Should Measure

A release review should include contextual attacks, not only strings that say they are attacks. Test emails, documents, tickets, chats, repository issues, and agent-to-agent messages that manipulate sender identity, authorization history, urgency, affected-party interests, and mixed flows. The unit of evaluation should be the tool action and resulting state, not the model's self-description.
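A small scoring sketch shows what it means to evaluate the tool action rather than the self-description. The run format, the field names, and the tool names are assumptions made for illustration; a real harness would also compare resulting state, which this sketch omits.

```python
def violated(tool_calls, allowed_actions):
    """True if any executed tool call falls outside the delegated scope."""
    return any(call["name"] not in allowed_actions for call in tool_calls)


def attack_success_rate(runs):
    """Fraction of runs with an out-of-scope tool call. The reply text is ignored."""
    return sum(violated(r["tool_calls"], r["allowed"]) for r in runs) / len(runs)


# Hypothetical runs: the first reply claims a refusal, but the log says otherwise.
runs = [
    {
        "reply": "I declined the request.",
        "tool_calls": [{"name": "send_email"}],
        "allowed": {"read_email"},
    },
    {
        "reply": "Done.",
        "tool_calls": [{"name": "read_email"}],
        "allowed": {"read_email"},
    },
]

print(attack_success_rate(runs))  # 0.5
```

Scoring on the reply alone would have counted the first run as a successful defense.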

Defenders should also test utility. A system that refuses to read every external claim may be secure by paralysis. The practical target is calibrated friction: verify high-risk claims, ask for step-up approval when delegation is unclear, split mixed requests into separate decisions, and keep logs that connect instruction source, authority source, approval, and final effect.
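Calibrated friction can be sketched as a small decision function. The risk labels, the ordering of checks, and the `Decision` names below are assumptions chosen for illustration; they are not a policy proposed by the paper, and real deployments would need richer inputs.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    VERIFY = "verify_claim"
    STEP_UP = "ask_user_approval"


def decide(risk: str, claim_verified: bool, delegation_recorded: bool) -> Decision:
    """Pick the lightest response that still covers the missing evidence."""
    if not delegation_recorded:
        # Unclear delegation: go back to the user before acting.
        return Decision.STEP_UP
    if risk == "high" and not claim_verified:
        # High-risk action resting on an unverified external claim.
        return Decision.VERIFY
    return Decision.ALLOW


print(decide("low", claim_verified=False, delegation_recorded=True).value)
print(decide("high", claim_verified=False, delegation_recorded=True).value)
print(decide("high", claim_verified=True, delegation_recorded=False).value)
```

The first call allows a low-risk action without extra checks, which is the utility side of the tradeoff; the other two add friction only where evidence is missing.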

Limits That Matter

This is a preprint, and its impossibility result is an argument, not a mathematical proof covering all possible systems. The experiments are controlled probes, not field measurements of every deployed assistant. The model names, prompts, datasets, and simulated tool environments matter. The paper is strongest as a map of failure modes and test design.

The title should not be turned into fatalism. It should raise the burden of proof for any product claim that says prompt injection is solved because untrusted text was labeled as data.

Governance Standard

Agent governance should treat context as part of the control surface. A serious deployment record names the user delegation, the agent's authority, the external sources consulted, the claims that required verification, the transmission principle used to justify action, and the boundary between simultaneous requests.
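As a rough illustration, such a record could be a structured log entry with one field per item in that list. The `DeploymentRecord` class, its field names, and the sample values are hypothetical; no schema is prescribed by the paper.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DeploymentRecord:
    """One logged decision, with a field per governance item (assumed schema)."""
    user_delegation: str
    agent_authority: list
    external_sources: list
    verified_claims: list
    transmission_principle: str
    flow_boundaries: list = field(default_factory=list)


record = DeploymentRecord(
    user_delegation="triage inbox and draft replies",
    agent_authority=["read_email", "draft_email"],
    external_sources=["email from vendor"],
    verified_claims=["sender domain authenticated"],
    transmission_principle="user_delegation",
    flow_boundaries=["internal triage", "outbound reply"],
)

# One JSON line per decision keeps the record easy to append and audit.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```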

That belongs beside automated prompt-injection search, tool-scope gates, delegation traces, and capability-based security. The standard is simple: an agent should not be trusted because it ignored a string. It should be trusted only inside a context it can verify, log, and narrow.

Sources

Sahar Abdelnabi and Eugene Bagdasarian, AI Agents May Always Fall for Prompt Injections, arXiv:2605.17634 [cs.CR], submitted May 17, 2026.
arXiv experimental HTML for AI Agents May Always Fall for Prompt Injections, reviewed June 25, 2026.
Related pages: The Injection Prompt Becomes the Search Problem, The Tool Scope Becomes the Intent Gate, The Delegation Trace Becomes the Audit Boundary, Contextual Integrity, Capability-Based Security, and Prompt Injection.
