Blog · arXiv Analysis · Last reviewed June 24, 2026

The Decomposed Task Becomes the Safety Bypass

The June 2026 arXiv paper Hidden in Plain Sight: Benchmarking Agent Safety Against Decomposition Attacks with DECOMPBENCH, by Vikhyath Kothamasu, Virginia Smith, and Chhavi Yadav, studies a hard case for tool-using agent safety: a harmful job split into harmless-looking subtasks. Safety has to read the work as a whole.

The Harmful Goal Is Split Apart

Kothamasu, Smith, and Yadav's paper, arXiv:2606.13994 [cs.CR], was submitted on June 12, 2026. The arXiv HTML lists Vikhyath Kothamasu and Virginia Smith at Carnegie Mellon University, and Chhavi Yadav at Carnegie Mellon University and the Simons Institute, UC Berkeley. The paper introduces DeCompBench, a benchmark for decomposition attacks against LLM-based agents that can use tools.

A decomposition attack is not simply a long bad prompt. It is a harmful objective distributed across smaller requests that look ordinary when inspected one by one. The paper's core warning is that a safety system can refuse the visible whole and still comply after the intent has been broken into benign-looking pieces.

This angle is distinct from the site's pages on command denylists, unsafe shortcuts, cold-start safety, and intent-governed tool authorization. This paper asks whether safety can survive when dangerous meaning is distributed across the job.

DeCompBench as a Composite Test

DeCompBench is built around a decomposition-by-design method. The authors describe four criteria for valid tasks: the original task must be harmful, it must naturally require multiple dependent steps, each subtask must appear benign in isolation, and the monolithic task must be hard enough that a weak unaligned model cannot simply complete it without decomposition.
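The four criteria read naturally as a checklist. The sketch below is one way to encode them, not the authors' pipeline: the CandidateTask fields and the failed_criteria helper are hypothetical names, and the boolean labels are assumed to come from human annotators or a separate judging step.

```python
from dataclasses import dataclass


@dataclass
class CandidateTask:
    # Hypothetical labels; the paper does not prescribe this schema.
    monolithic_is_harmful: bool
    dependent_step_count: int
    subtasks_benign_in_isolation: list[bool]
    weak_model_solves_monolithic: bool


def failed_criteria(task: CandidateTask) -> list[str]:
    """Return the criteria a candidate task fails (empty list means valid)."""
    failures = []
    if not task.monolithic_is_harmful:
        failures.append("original task is not harmful")
    if task.dependent_step_count < 2:
        failures.append("task does not require multiple dependent steps")
    if not all(task.subtasks_benign_in_isolation):
        failures.append("a subtask looks harmful in isolation")
    if task.weak_model_solves_monolithic:
        failures.append("weak unaligned model completes it without decomposition")
    return failures
```

A task is kept only when the returned list is empty. The fourth check matters because a task that a weak model can finish in one shot does not actually need decomposition to succeed.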

The dataset construction begins with 335 role-tagged capabilities and 101 manually curated seed tasks across eight attack categories. The final dataset contains 250 harmful tasks grounded in tool-rich settings such as version control platforms, databases, messaging systems, and cloud storage services. The paper names categories including data theft and exfiltration, financial crime, social engineering, and audit and evidence tampering. Its Hugging Face dataset page identifies DeCompBench under a CC-BY-4.0 license and requires users to accept access conditions before viewing files.

The benchmark is useful because it does not treat refusal as a full safety proof. It compares monolithic presentation with decomposed presentation, and it preserves the difference between asking an agent to say something unsafe and asking it to operate tools until a harmful state is reached.

Why Local Checks Miss the Job

The arXiv abstract reports that state-of-the-art agents showed high refusal rates for monolithic harmful tasks but much lower refusal rates for decomposed variants. In the results section, the authors report sharp refusal-rate drops for the tested agents after decomposition: Qwen3-Coder from 21 percent to 0 percent, Claude Haiku from 90 percent to 2.5 percent, and GPT-5-mini, which refused most monolithic tasks, to 6 percent. The important governance point is not the leaderboard. It is the shape of the failure.

A local guardrail sees one turn, one command, one API call, one file operation, or one tool result. A decomposed job uses that locality against the system. It places safety-relevant meaning in the relation among steps rather than in the apparent content of each step. The harm becomes visible only when the partial outputs are composed.
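A toy illustration of that locality gap, under loud assumptions: the Step type, the SENSITIVE and EXTERNAL label sets, and both check functions are invented for this sketch and are not part of DeCompBench or any real guardrail product. The local check only blocks a step that moves sensitive data directly to an external destination. The composed check carries a taint set across steps.

```python
from dataclasses import dataclass

# Hypothetical labels for this sketch only.
SENSITIVE = {"customer_db"}
EXTERNAL = {"external_email"}


@dataclass(frozen=True)
class Step:
    tool: str
    source: str
    destination: str


def local_check(step: Step) -> bool:
    """Allow the step unless it alone moves sensitive data outside."""
    return not (step.source in SENSITIVE and step.destination in EXTERNAL)


def composed_check(steps: list[Step]) -> bool:
    """Allow the sequence unless tainted data eventually reaches outside."""
    tainted = set(SENSITIVE)
    for step in steps:  # assumes steps are listed in execution order
        if step.source in tainted:
            if step.destination in EXTERNAL:
                return False
            tainted.add(step.destination)
    return True


steps = [
    Step("db", "customer_db", "tmp_file"),
    Step("storage", "tmp_file", "shared_bucket"),
    Step("mail", "shared_bucket", "external_email"),
]
```

Every step in this sequence passes local_check, yet composed_check rejects the sequence, because the violation exists only in the chain from database to file to bucket to outbound mail.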

This is why agent safety needs a cumulative ledger. Each delegated job should carry a durable intent record, intermediate artifacts, source and destination labels, tool receipts, and a running interpretation of what the sequence is making possible.
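As a minimal sketch of what such a ledger could hold, assuming nothing about any particular agent framework: ToolReceipt, JobLedger, and their fields are hypothetical names chosen to mirror the items listed above, and the "running interpretation" here is only a mechanical summary that a reviewer or policy model would read.

```python
from dataclasses import dataclass, field


@dataclass
class ToolReceipt:
    tool: str
    summary: str
    source_label: str
    destination_label: str
    artifacts: list[str] = field(default_factory=list)


@dataclass
class JobLedger:
    job_id: str
    stated_intent: str  # durable intent record for the delegated job
    receipts: list[ToolReceipt] = field(default_factory=list)

    def record(self, receipt: ToolReceipt) -> None:
        """Append a receipt; earlier entries are never rewritten."""
        self.receipts.append(receipt)

    def interpretation(self) -> dict:
        """Summarize what the whole sequence has made possible so far."""
        return {
            "intent": self.stated_intent,
            "steps": len(self.receipts),
            "flows": sorted(
                {(r.source_label, r.destination_label) for r in self.receipts}
            ),
            "artifacts": [a for r in self.receipts for a in r.artifacts],
        }
```

The design choice worth noting is that the summary is recomputed from the full receipt list on every call, so a policy check always sees the job so far rather than only the latest step.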

Capability and Refusal Are Entangled

The paper's results also avoid a measurement trap. A failed decomposed attack is not always a safety success. The HTML table of contents names a specific finding: failures to complete decomposed attacks can arise from execution errors rather than safety refusals. In Appendix B, the authors compare GPT-5-mini on malicious OpenAgentSafety tasks and DeCompBench monolithic tasks, reporting 58.6 percent full compliance on the OpenAgentSafety subset, but 0.0 percent full success and 90.7 percent refusal on DeCompBench monolithic tasks.

For governance, that distinction is essential. If an agent avoids harm because it cannot operate the environment, that is a capability ceiling, not a safety mechanism. The ceiling may rise with better tools, clearer scaffolds, more robust retry loops, or lower-friction credentials.

The sober reading is that decomposition attacks sit between behavior policy and systems engineering. A model may be trained to refuse the obvious request, but the deployed agent is also a workflow executor. It has memory, tools, credentials, retry logic, and logs. The safety object is the whole harness, not only the sentence that began the task.

Governance Standard

Any production agent with meaningful tool authority should be evaluated against decomposition-aware misuse, not only against direct forbidden prompts. The evaluation should include whole-job safety cases, cross-turn and cross-tool aggregation, budgeted attack attempts, benign-subtask isolation tests, artifact-flow review, and separate accounting for refusal, technical failure, and completed harmful outcome.
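The separate accounting in that list can be made concrete with a three-way tally. This is an illustrative sketch rather than the paper's scoring code: the Outcome enum and the summarize function are hypothetical, and each run is assumed to have already been labeled with exactly one outcome.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    REFUSED = "refused"
    TECHNICAL_FAILURE = "technical_failure"
    HARM_COMPLETED = "harm_completed"


def summarize(outcomes: list[Outcome]) -> dict[str, float]:
    """Return the fraction of runs in each outcome bucket."""
    if not outcomes:
        raise ValueError("no runs to summarize")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {o.value: counts[o] / total for o in Outcome}
```

Reporting a single "attack did not succeed" rate would merge the first two buckets, which is exactly the conflation of refusal with execution error that the evaluation should avoid.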

The runtime should treat partial work as evidence. If a subtask creates a file, queries a database, exports a list, rewrites a message, prepares a patch, or sends a request, the receipt should preserve why it happened, which higher-level job it served, and what later steps consumed it.
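One way to make "what later steps consumed it" answerable is to have each receipt name the artifacts it read and wrote. The sketch below assumes receipts are logged in execution order; the Receipt fields and the downstream_steps helper are hypothetical and do not correspond to any existing logging API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    step_id: str
    job_id: str          # which higher-level job the step served
    reason: str          # why the step happened
    consumes: tuple[str, ...]
    produces: tuple[str, ...]


def downstream_steps(receipts: list[Receipt], artifact: str) -> list[str]:
    """List steps that used `artifact` directly or via derived artifacts."""
    reachable = {artifact}
    hits = []
    for receipt in receipts:  # assumes execution order
        if reachable.intersection(receipt.consumes):
            hits.append(receipt.step_id)
            # Anything this step wrote is now derived from the artifact.
            reachable.update(receipt.produces)
    return hits
```

Given an exported list, a reviewer can then ask which later messages, patches, or requests were built on it, instead of judging the export in isolation.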

The rule is simple: if an agent can execute a delegated job in pieces, the pieces cannot be governed one at a time. The safety boundary is the composed task.
