Blog · arXiv Analysis · Published: June 25, 2026

The Safety Case Becomes Executable

Yunhao Feng, Ruixiao Lin, Ming Wen, Qinqin He, Yanming Guo, Yifan Ding, Yutao Wu, Jialuo Chen, Yunhao Chen, Xiaohu Du, Jianan Ma, Zixing Chen, Zhuoer Xu, Xingjun Ma, and Xinhao Deng's Vera paper asks agent safety tests to prove outcomes in the sandbox record.

The Paper

The paper is Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification, arXiv:2607.01793 [cs.AI]. The arXiv record lists v1 on July 2 2026 and lists the authors as Yunhao Feng, Ruixiao Lin, Ming Wen, Qinqin He, Yanming Guo, Yifan Ding, Yutao Wu, Jialuo Chen, Yunhao Chen, Xiaohu Du, Jianan Ma, Zixing Chen, Zhuoer Xu, Xingjun Ma, and Xinhao Deng. The paper presents Vera, an automated safety-testing framework for tool-using LLM agents.

The useful move is not just scale. It is the shift from asking whether an agent said something unsafe to asking whether an unsafe outcome was realized in a controlled environment. A refusal, a promise, a plan, and a final answer are all weak evidence if the system can also call tools, edit records, send messages, modify repositories, search the web, or route through middleware.

State Evidence

Vera treats the agent as a computer-use system whose behavior has to be judged by observable traces. The paper says existing safety evaluation can conflate unsafe requests, attempted actions, or textual statements with realized violations. Vera instead records the interaction, tool calls, gateway events, and final environment state, then lets a verifier inspect the record.

This matters for governance because agent incidents often hide in the gap between intent and effect. An agent can claim it refused after a side effect has already happened. It can also claim it complied without changing anything. Evidence-grounded verification ranks environment state first, then tool-call records, then response text when the text itself is the relevant output. That hierarchy is a practical antidote to self-report theater.

Executable Case

The core object is the executable safety case. Vera's pipeline first uses literature-driven exploration to structure risks, attack methods, and execution environments into taxonomies. It then composes those dimensions into candidate safety goals, compiles retained goals into initial environment states and deterministic verifier scripts, and filters out cases that depend on internal reasoning traces rather than observable effects.

The paper reports that the risk exploration stage processed about 800 arXiv and OpenReview papers. The resulting taxonomy contained 124 leaf-level risk categories, 77 leaf-level attack methods, and 30 leaf-level environment categories. After compatibility filtering and deduplication, 39,078 candidate safety goals were reduced to 1,600 executable base scenarios for Vera-Bench. Each base scenario is instantiated in benign, single-channel, and multi-channel settings.

Sandbox Record

Vera evaluates four production agent frameworks: OpenClaw, Hermes, Codex, and Claude Code. Each run starts in an isolated Docker Compose environment with the target agent, an MCP middleware gateway, and self-hosted services for email, code hosting, payment and banking, messaging, and web search. The paper says the environment exposes 72 MCP tool functions and records tool calls, arguments, original service responses, transformed observations, and persistent state changes.

The released data item is not just a label. It contains an attack plan, MCP logs, a normalized multi-turn trajectory, and a verify.py predicate. In other words, the test is meant to be replayable: the case says what was being tested, the logs say what happened, and the verifier says why the outcome did or did not count.

Results

The headline number is severe but should be read with the test design in view. Across the four frameworks, Vera reports average execution success rates of 90.6 percent in the single-channel setting and 93.9 percent in the multi-channel setting, compared with 70.5 percent in benign task completion. Table IV reports overall rates of 88.6 percent for Claude Code, 86.6 percent for Hermes, 84.1 percent for Codex, and 70.3 percent for OpenClaw.

The paper interprets the gap between benign completion and adversarial outcomes as evidence that adaptive multi-turn testing changes what is measured. The same capabilities that make agents useful in plausible workflows can also make them responsive to safety-case pressure. The multi-channel condition adds a tool-observation surface through the gateway; the average increase over single-channel is smaller than the jump from benign to single-channel, but it exposes differences between agent frameworks that single-channel tests would miss.

Vera also uses the benchmark for downstream guard-model training. In preliminary experiments, a Qwen3Guard-based model fine-tuned on Vera-derived data reached 0.930 accuracy, 0.903 recall, and 0.941 F1 on the Vera downstream task. On the separate R-Judge benchmark, the fine-tuned model reached 61.7 percent accuracy and 77.9 percent recall. These are benchmark results, not deployment guarantees.

Safety Receipt

An executable agent-safety receipt should include the risk taxonomy version, attack-method taxonomy version, environment taxonomy version, scenario ID, benign and adversarial variants, target agent framework, model route, sandbox image, service fixtures, tool schema, gateway policy, interaction budget, transcript, MCP logs, final environment diff, verifier code, verifier result, infrastructure failures, and replay instructions.

That receipt changes the institutional question. A safety claim should not be "the agent refused in our prompt test." It should be "here is the executable case, here is the initialized state, here are the tool records, here is the final state, and here is the deterministic predicate that judged the outcome." Safety testing becomes less like a demo and more like an evidence package.

Claim Boundary

The Vera paper does not show that every real deployment of the named frameworks fails in the same way. It evaluates configured agents in controlled sandbox services under defined benign, single-channel, and multi-channel conditions. Its threat model does not modify the target model, system instructions, agent implementation, or internal tool code. Its cases are inference-time tests, not training-phase compromise tests. Its guard-model results are preliminary.

Within that boundary, the paper is valuable because it moves agent safety away from a purely linguistic ritual. For a system that can act through tools, the audit object is not the answer. It is the case, the run, the state change, and the verifier.

Sources


Return to Blog