The Underspecified Target Becomes the Production Change
UnderSpecBench tests a quiet failure mode in coding agents: the benign DevOps instruction that leaves the target unclear, then gets executed anyway.
The Paper
The paper is Coding Agents Are Guessing: Measuring Action-Boundary Violations in Underspecified DevOps Instructions, arXiv:2607.02294 [cs.SE]. The arXiv abstract page lists Zimo Ji, Zekai Zhang, Congying Xu, Zongjie Li, Yudong Gao, Shuai Wang, and Shing-Chi Cheung as authors and records submission on July 2, 2026. The PDF title page additionally includes Yujia Tian in the author line; this page treats that as a source discrepancy and cites both the abstract page and PDF.
The paper studies a practical safety question for coding agents that can run shell commands, modify repositories, and call operational APIs. A task can be benign and still dangerous if it leaves out the target, environment, or scope. In that setting, success is not just doing something plausible. Success is doing the intended thing, to the intended object, within the intended boundary.
The Boundary
Action-boundary failures are not the same as jailbreaks. The user is not asking for harm, and the model is not necessarily violating a policy. The problem is ordinary operational ambiguity: a request points toward an action, but the exact object or blast radius is underspecified. A human operator might ask which resource is meant. An action-biased agent may infer a target from nearby context and execute.
That is why the paper is useful next to the site's pages on agent sandboxes, tool scopes, and governance receipts. It does not ask whether the agent can finish a task. It asks whether the agent can notice when finishing is not yet authorized because the action boundary has not been named.
The Benchmark
The authors introduce UnderSpecBench, a DevOps benchmark with 69 task families grounded in documented incidents, CVEs, or tool behavior. The tasks span four parent DevOps capability domains and nine operational control surfaces, including source configuration, build and verification control, release and artifact supply chain, and runtime and reliability operations.
Each task family is turned into 32 prompt variants by crossing three axes: intent clarity, target certainty, and blast radius. That produces 2,208 prompt variants while holding the environment and ground-truth safe action fixed. The evaluation uses deterministic, side-effect-based state checkers to classify Safe Success, Wrong Target, and OverScope outcomes. Runs where the agent does not act are separately classified as clarification, refusal, deferment, or empty output.
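A side-effect-based checker can be pictured as a comparison between the resources a task was meant to change and the resources that actually changed. The sketch below is an illustration under assumed definitions, not the paper's checker: the function names, the label strings, and the before/after snapshot format are all hypothetical. It returns a set of labels rather than a single one, because the Wrong Target and OverScope rates reported for acted runs can sum to more than 100 percent, which implies that one run can carry both.

```python
from typing import Dict, Set


def changed_resources(before: Dict[str, str], after: Dict[str, str]) -> Set[str]:
    """Ids whose state differs between snapshots, including created or deleted ones."""
    return {rid for rid in before.keys() | after.keys()
            if before.get(rid) != after.get(rid)}


def classify_run(before: Dict[str, str], after: Dict[str, str],
                 intended: Set[str]) -> Set[str]:
    """Label one run from its side effects only (hypothetical label names)."""
    changed = changed_resources(before, after)
    if not changed:
        # No side effects: these runs are classified separately
        # (clarification, refusal, deferment, or empty output).
        return {"no_action"}
    labels: Set[str] = set()
    if not intended <= changed:
        labels.add("wrong_target")  # an intended resource was left untouched
    if changed - intended:
        labels.add("over_scope")    # something outside the intended set changed
    # Simplification: a real checker would also verify the new state's value.
    return labels or {"safe_success"}


# Toy environment: the intended action is to update staging only.
before = {"svc-staging": "v1", "svc-prod": "v1"}
print(sorted(classify_run(before, {"svc-staging": "v2", "svc-prod": "v1"}, {"svc-staging"})))
print(sorted(classify_run(before, {"svc-staging": "v1", "svc-prod": "v2"}, {"svc-staging"})))
```

The point of the design is that the label depends only on observed state, not on what the agent says it did.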
Findings
The study evaluates five agent-by-model configurations across OpenCode, Claude Code, and Codex, all in full autonomous execution mode. Across configurations, the paper reports that 55.8 to 67.8 percent of acted runs violate at least one action boundary. The paper's sharpest phrasing is empirical rather than theatrical: underspecification often makes agents guess.
Target certainty is the central variable. Among acted runs, Safe Success falls from 67.9 percent at the clearest target level to 8.6 percent at the most underspecified target level. Wrong Target rises from 9.6 percent to 75.1 percent, and OverScope rises from 31.4 percent to 87.0 percent. By contrast, blast-radius cues barely move action propensity: the pooled action rate is 65.5 percent in the lower-radius condition and 64.0 percent in the higher-radius condition.
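These figures use two different denominators: the action rate is measured over all runs, while Safe Success, Wrong Target, and OverScope are measured over acted runs only. The sketch below shows that bookkeeping on made-up toy labels. The `summarize` helper and the label strings are hypothetical, and none of the numbers come from the paper.

```python
from typing import List, Set, Tuple

# Hypothetical label names for runs where the agent did not act.
NON_ACTION = {"clarification", "refusal", "deferment", "empty_output"}
# Hypothetical label names for action-boundary violations.
VIOLATIONS = {"wrong_target", "over_scope"}


def summarize(runs: List[Set[str]]) -> Tuple[float, float]:
    """Return (action rate over all runs, violation rate over acted runs)."""
    acted = [labels for labels in runs if not (labels & NON_ACTION)]
    action_rate = len(acted) / len(runs) if runs else 0.0
    # A run counts once even if it carries both violation labels.
    violating = [labels for labels in acted if labels & VIOLATIONS]
    violation_rate = len(violating) / len(acted) if acted else 0.0
    return action_rate, violation_rate


# Toy data only: four runs, three of which acted.
toy_runs = [
    {"safe_success"},
    {"wrong_target", "over_scope"},
    {"clarification"},
    {"over_scope"},
]
action_rate, violation_rate = summarize(toy_runs)
print(f"action rate: {action_rate:.2f}, violation rate among acted runs: {violation_rate:.2f}")
```

Conditioning on acted runs matters because an agent that asks more often can have a lower action rate without its acted runs being any safer.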
The task surface also matters. On bounded-object surfaces, overreach stays more contained. On shared runtime control planes, the paper reports OverScope rates of 59.8 percent for deployment and traffic control and 77.2 percent for infrastructure, capacity, and observability. The governance lesson is that autonomy should not be granted at the same level for every operational surface.
Action Receipt
A receipt for autonomous DevOps work should record the requested action, exact target identifier, environment, blast-radius classification, allowed scope, confirmation rule, harness, model, tool permissions, postcondition checker, state before action, state after action, and whether the agent asked, deferred, refused, or executed. Without those fields, an organization may know that an agent completed a task but not whether it crossed the boundary the user assumed.
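As a rough illustration, the fields above could be captured in a small immutable record that serializes to JSON for an audit log. This is a sketch under assumptions: the class name, field names, types, and example values are hypothetical and do not come from the paper or from any particular tool.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Dict, List


class Decision(Enum):
    ASKED = "asked"
    DEFERRED = "deferred"
    REFUSED = "refused"
    EXECUTED = "executed"


@dataclass(frozen=True)
class ActionReceipt:
    requested_action: str
    target_id: str               # exact identifier, not a nickname
    environment: str
    blast_radius: str            # classification label chosen by the organization
    allowed_scope: List[str]
    confirmation_rule: str
    harness: str
    model: str
    tool_permissions: List[str]
    postcondition_checker: str
    state_before: Dict[str, str]
    state_after: Dict[str, str]
    decision: Decision

    def to_json(self) -> str:
        """Serialize to a stable JSON string; the enum is stored by its value."""
        record = asdict(self)
        record["decision"] = self.decision.value
        return json.dumps(record, sort_keys=True)


# Placeholder values for illustration only.
receipt = ActionReceipt(
    requested_action="restart service",
    target_id="svc-staging",
    environment="staging",
    blast_radius="single-service",
    allowed_scope=["svc-staging"],
    confirmation_rule="ask when target is not an exact id",
    harness="example-harness",
    model="example-model",
    tool_permissions=["shell"],
    postcondition_checker="example-checker",
    state_before={"svc-staging": "v1"},
    state_after={"svc-staging": "v2"},
    decision=Decision.EXECUTED,
)
print(receipt.to_json())
```

Because every field is required, a receipt cannot be constructed without a target identifier or a recorded decision, which is the property the paragraph above asks for.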
The paper also points toward a design rule: asking must be a first-class action. The same Codex-5.1-mini model asked in 31.8 percent of runs under the first-party Codex harness but only 10.5 percent under OpenCode, where some stops became silent dry-runs. That suggests the scaffold can decide whether hesitation becomes useful clarification or invisible non-completion.
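One way to make asking a first-class action is to give the agent's decision step two explicit return types, so a clarification request is a visible result rather than a silent stop. The sketch below assumes a deliberately strict rule: execute only when the instruction names an exact known identifier, and ask otherwise. The `Execute` and `Ask` types, the `next_action` function, and the matching rule are hypothetical and are not taken from any of the harnesses in the study.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Execute:
    target: str


@dataclass(frozen=True)
class Ask:
    question: str


def next_action(named_target: str, known_targets: List[str]) -> Union[Execute, Ask]:
    """Execute only on an exact identifier match; otherwise return a question."""
    if named_target in known_targets:
        return Execute(target=named_target)
    # Partial matches are offered as options, never acted on.
    candidates = [t for t in known_targets if named_target in t]
    if candidates:
        options = ", ".join(candidates)
        return Ask(question=f"'{named_target}' could mean: {options}. Which one?")
    return Ask(question=f"No known resource matches '{named_target}'. Which target is meant?")


targets = ["api-staging", "api-prod"]
print(next_action("api-prod", targets))  # exact id, so the result is an Execute
print(next_action("api", targets))       # ambiguous, so the result is an Ask
```

Returning `Ask` as a value means the surrounding scaffold has to surface it, log it, or route it to a human, instead of letting hesitation disappear into a dry-run.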
Limits
The authors frame the benchmark as a stress test, not an incident-rate forecast. The tasks are containerized abstractions, production environments may add richer state and human gates, each state checker encodes one intended safe action, and the five configurations are a snapshot of fast-moving models and harnesses. The benchmark also leaves other forms of ambiguity, such as temporal, environmental, and policy ambiguity, for future task families.
Even within that boundary, the result is important. Completion-centric evaluation can make autonomy look safer than it is. A coding agent that finishes a DevOps task on the wrong target has not succeeded; it has converted ambiguity into an infrastructure change.
Sources
- Zimo Ji, Zekai Zhang, Congying Xu, Zongjie Li, Yudong Gao, Shuai Wang, and Shing-Chi Cheung, Coding Agents Are Guessing: Measuring Action-Boundary Violations in Underspecified DevOps Instructions, arXiv:2607.02294 [cs.SE].
- arXiv HTML for Coding Agents Are Guessing, checked for abstract, benchmark design, metrics, empirical results, limitations, and author metadata.
- arXiv PDF for Coding Agents Are Guessing, checked against the title page, experimental setup, result tables, discussion, limitations, and the additional PDF title-page author line.