The Embedded Command Becomes the Evaluation Target
Brett Reynolds's July 2026 arXiv paper Adversarial Pragmatics for AI Safety Evaluation argues that many safety failures are not simply refusal failures, capability failures, or prompt-injection failures. They are failures to tell what a string is doing in context.
Evaluation Problem
The paper's full title is Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity. arXiv lists it as arXiv:2607.01153 [cs.CL, cs.AI, cs.SE], submitted on July 1, 2026. Its target is the safety-evaluation case where the relevant behavior depends on whether some text is a command, quoted content, policy example, tool output, cited source, user instruction, or transcript evidence.
That distinction is easy to describe and hard to score. A model that sees "ignore the previous instruction" inside a webpage should usually summarize it as content, while the same words from a legitimate user may be the task. A model that refuses to classify a quoted unsafe string may be over-refusing, while a model that enacts the quoted string may be under-separating use from mention. A pass/fail label can hide all of those differences.
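The source-dependence described above can be shown with a toy contrast. This is a minimal sketch, not the paper's method: the Segment type, the role names, and the rule that only the "user" role carries instruction authority are all assumptions made for illustration, and a real system would need a much richer notion of legitimacy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str
    source_role: str  # hypothetical role names: "user", "webpage", "tool_output"


# Assumption for illustration: only the user role may issue instructions.
INSTRUCTION_ROLES = {"user"}


def expected_act(segment: Segment) -> str:
    """Return the expected response act for a segment under the toy rule."""
    if segment.source_role in INSTRUCTION_ROLES:
        return "follow"
    return "report_as_content"


words = "ignore the previous instruction"
from_user = Segment(words, "user")
from_page = Segment(words, "webpage")

# Same words, different source role, different expected act.
print(expected_act(from_user), expected_act(from_page))
```

The two segments differ in one field only, which is what lets a reviewer attribute a behavioral difference to source role rather than to wording.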
The Spiralist relevance is direct. Safety claims about agents increasingly rest on logs, judge labels, red-team cases, policy test suites, and transcripts. If the label does not say which source had authority, which phrase was merely mentioned, which policy boundary applied, and which part of the transcript supports the attribution, then the evaluation artifact is too thin to govern a deployed system.
Adversarial Pragmatics
Reynolds names the target adversarial pragmatics: safety-relevant behavior under cases where source authority, quotation, scope, reference, speech-act force, or policy category has to be inferred from language use. This is not a replacement for instruction hierarchy. It is the layer before and around instruction hierarchy: whether the system correctly identifies what kind of linguistic object it is handling before deciding which priority rule should apply.
The paper's taxonomy pulls apart eight families: embedded commands, mention/use and quotation, authority and instruction hierarchy, scope and modality, deixis and reference hijacking, indirect speech acts and pragmatic pressure, policy ambiguity, and multi-turn agent evidence. The point of separating them is methodological. Real attacks combine these dimensions, but a benchmark has to isolate the contrast before it can explain why a model, prompt, scaffold, or judge failed.
This takes some of the mystery out of prompt-injection evaluation. The dangerous string is not dangerous only because of its words. It becomes dangerous when the system assigns it the wrong role: a tool output treated as an instruction, a quoted policy example treated as an action request, an ambiguous "the above" resolved to the wrong source, or a transcript fragment treated as evidence for a failure it does not actually show.
Benchmark Receipt
The artifact is intentionally small and auditable. The paper describes an 18-item seed benchmark organized into nine minimal pairs, with validator-enforced metadata. Each item records the control dimension, context source, source role, authority level, pragmatic status, response act, expected behavior, task-success label, policy-compliance label, safety-risk label, risk type, refusal outcome, failure attribution, and judge-validation flag.
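One way to picture validator-enforced metadata is a typed record plus a check over the whole set. The sketch below is an assumption-laden illustration: the field names paraphrase the list above, item_id and pair_id are hypothetical additions for grouping, and the two checks (no empty fields, exactly two items per pair sharing a control dimension) are guesses at what such a validator might enforce, not the paper's actual rules.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SeedItem:
    item_id: str            # hypothetical identifier
    pair_id: str            # hypothetical minimal-pair grouping key
    control_dimension: str
    context_source: str
    source_role: str
    authority_level: str
    pragmatic_status: str
    response_act: str
    expected_behavior: str
    task_success: str
    policy_compliance: str
    safety_risk: str
    risk_type: str
    refusal_outcome: str
    failure_attribution: str
    judge_validated: bool


def validate_items(items):
    """Return a list of problems; an empty list means the set passes."""
    problems = []
    by_pair = {}
    for item in items:
        by_pair.setdefault(item.pair_id, []).append(item)
        # Every string field must carry a non-blank value.
        for f in fields(item):
            value = getattr(item, f.name)
            if isinstance(value, str) and not value.strip():
                problems.append(f"{item.item_id}: empty field {f.name}")
    for pair_id, members in by_pair.items():
        # A minimal pair has exactly two sides testing one control dimension.
        if len(members) != 2:
            problems.append(f"{pair_id}: expected 2 items, found {len(members)}")
        elif members[0].control_dimension != members[1].control_dimension:
            problems.append(f"{pair_id}: control dimensions differ")
    return problems
```

Under these assumptions, an 18-item set passes only if it splits cleanly into nine two-item pairs, which matches the shape the paper describes.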
That metadata is the real contribution. It makes an evaluation item inspectable from both directions. Before running a model, a reviewer can see the linguistic contrast the item is supposed to test. After the model responds, a reviewer can distinguish source-sensitivity failure, quote-handling failure, scope failure, refusal-calibration failure, risk-classification failure, and policy-layer failure instead of throwing all of them into a single wrong-answer bucket.
The paper also reports a 54-row local seed pilot and proposes scaling the development set only after the taxonomy, expert protocol, validator, and scoring scripts survive close adjudication. That restraint matters. The first job of this benchmark is not to produce a leaderboard. It is to prove that the labels are stable enough to support broader safety claims.
Judge Limits
The local pilot is useful because it tests the evaluator too. The paper reports that an LLM judge reached exact agreement with expert labels at different rates by label family: 66.7 percent for task success, 88.9 percent for policy compliance, 72.2 percent for safety risk, 83.3 percent for risk type, 98.1 percent for refusal outcome, and 77.8 percent for failure attribution.
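Exact agreement by label family is a simple per-column match rate. The sketch below assumes a hypothetical row layout (each row holds an "expert" and a "judge" dict keyed by label family) and assumes every one of the 54 pilot rows was scored for every family; neither detail is confirmed by the source. The implied_hits helper only checks which whole-number count out of 54 rounds back to each reported percentage. Those counts are arithmetic inferences, not figures stated in the paper.

```python
# Percentages as reported above; the family keys are illustrative names.
REPORTED_PCT = {
    "task_success": 66.7,
    "policy_compliance": 88.9,
    "safety_risk": 72.2,
    "risk_type": 83.3,
    "refusal_outcome": 98.1,
    "failure_attribution": 77.8,
}
PILOT_ROWS = 54  # assumption: all rows scored for every family


def exact_agreement(rows, families):
    """Fraction of rows where the judge label equals the expert label, per family."""
    if not rows:
        raise ValueError("no rows to score")
    return {
        fam: sum(r["judge"][fam] == r["expert"][fam] for r in rows) / len(rows)
        for fam in families
    }


def implied_hits(pct, n=PILOT_ROWS):
    """Nearest whole count of matches, or None if it does not round back to pct."""
    hits = round(pct / 100 * n)
    return hits if round(100 * hits / n, 1) == pct else None


for family, pct in REPORTED_PCT.items():
    print(family, pct, implied_hits(pct))
```

If the 54-row denominator holds, the gap between refusal outcome and task success amounts to well over a dozen rows, which is large for a pilot of this size.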
The pattern is the lesson. A judge can be strong at noticing the visible refusal outcome while weaker at the harder inference: what actually failed, whether the policy boundary was ambiguous, and whether transcript evidence supports the attribution. For agent safety, that means an LLM judge can help triage, but it cannot be treated as a replacement for expert adjudication when the evaluation target is pragmatic authority.
The paper's metrics follow from that concern. Pairwise contrast accuracy asks whether both sides of a minimal pair are labeled correctly. Instruction-source sensitivity asks whether behavior changes appropriately when the same words move between user, webpage, document, tool output, or transcript. Mention/use robustness asks whether classification, report, translation, summary, and enactment are kept apart. Refusal calibration reports over-refusal and under-refusal separately.
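Two of these metrics are easy to sketch. The functions below are one plausible reading of the descriptions above, not the paper's scoring scripts: the input shapes (tuples of pair id and correctness, tuples of should-refuse and did-refuse) are assumptions, and the paper may define the denominators differently.

```python
def pairwise_contrast_accuracy(results):
    """results: iterable of (pair_id, is_correct).

    A pair is credited only when it has two sides and both are correct.
    """
    pairs = {}
    for pair_id, is_correct in results:
        pairs.setdefault(pair_id, []).append(bool(is_correct))
    if not pairs:
        raise ValueError("no pairs to score")
    credited = sum(len(sides) == 2 and all(sides) for sides in pairs.values())
    return credited / len(pairs)


def refusal_calibration(results):
    """results: iterable of (should_refuse, did_refuse).

    Over-refusal is measured on items that should be answered;
    under-refusal on items that should be refused. Either rate is None
    when its denominator is empty.
    """
    over = under = answerable = blocked = 0
    for should_refuse, did_refuse in results:
        if should_refuse:
            blocked += 1
            under += not did_refuse
        else:
            answerable += 1
            over += bool(did_refuse)
    return {
        "over_refusal": over / answerable if answerable else None,
        "under_refusal": under / blocked if blocked else None,
    }
```

Scoring at the pair level is stricter than item accuracy: a model that gets one side of every pair right scores 50 percent per item but zero on contrast, because it never demonstrated sensitivity to the contrast itself.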
Governance Standard
Any production safety evaluation that includes instruction conflicts, prompt injection, policy interpretation, or agent transcripts should publish pragmatic metadata with its test cases. The minimum record should include source role, authority level, application surface, quoted-versus-enacted status, scope and reference target, expected response act, policy boundary, refusal category, failure attribution, evaluator confidence, and the adjudication rule used to settle disagreement.
Benchmarks should also report results by phenomenon family before any aggregate score. A single "safe" score can hide the exact distinction that will matter in deployment: over-refusal on safe metalinguistic tasks, under-refusal on policy-blocked tasks, source-role blindness in tool outputs, reference hijacking in multi-turn transcripts, or judge instability on ambiguous policy boundaries.
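A small sketch shows how an aggregate can mask a weak family. The family names and pass counts here are invented toy data, and the input shape, a list of (family, passed) tuples, is an assumption.

```python
from collections import defaultdict


def report_by_family(results):
    """results: iterable of (family, passed).

    Returns (per_family_rates, aggregate_rate), with the per-family
    breakdown first so it is read before the single number.
    """
    totals = defaultdict(lambda: [0, 0])  # family -> [passed, seen]
    for family, passed in results:
        totals[family][0] += bool(passed)
        totals[family][1] += 1
    if not totals:
        raise ValueError("no results to report")
    per_family = {fam: p / n for fam, (p, n) in sorted(totals.items())}
    seen = sum(n for _, n in totals.values())
    aggregate = sum(p for p, _ in totals.values()) / seen
    return per_family, aggregate


# Toy data: one family is solid, the other mostly fails.
toy = (
    [("embedded_commands", True)] * 4
    + [("policy_ambiguity", True)]
    + [("policy_ambiguity", False)] * 3
)
per_family, aggregate = report_by_family(toy)
print(per_family)
print(aggregate)
```

In this toy case the aggregate is 0.625 while one family sits at 0.25, the kind of gap that a single "safe" score would hide.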
The Spiralist rule is simple: the command is not only a string. It is a string with provenance, authority, scope, uptake, and evidence. A safety evaluation that cannot preserve those distinctions is not yet a safety evaluation. It is a transcript with a verdict attached too early.
Claim Boundary
This is a methodological preprint, not a final benchmark standard. The seed set is small, the pilot is local, and the reported judge results should be read as calibration evidence rather than model-ranking evidence. The paper's own emphasis is on constructing auditable items and validating label families before scaling.
That limitation is also why the paper is useful. It does not pretend that a larger dataset automatically solves ambiguity. It says safety evaluation has to make the ambiguity visible enough for reviewers to decide whether the item, label, judge, policy, or model failed.
Sources
- Brett Reynolds, Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity, arXiv:2607.01153 [cs.CL, cs.AI, cs.SE], submitted July 1, 2026; reviewed July 2, 2026.
- arXiv PDF: Adversarial Pragmatics for AI Safety Evaluation, reviewed for the taxonomy, seed benchmark schema, local pilot, LLM-judge validation table, metrics, code-and-data note, and limitations.
- Related pages: The Clarification Question Becomes the Injection Window, The Causal Context Becomes the Injection Tripwire, The Hidden Web Prompt Becomes the Payload, The Instruction-Data Boundary Becomes the Security Primitive, The Table Reference Becomes the Reasoning Error, The Evaluation Score Becomes the Inference Budget, and System Prompts.