Blog · arXiv Analysis · Last reviewed June 24, 2026

The First Task Becomes the Safety Gap

The June 2026 arXiv paper The Cold-Start Safety Gap in LLM Agents, by Chung-En Sun, Linbo Liu, and Tsui-Wei Weng, studies a narrow but operationally important question: whether a tool-calling agent is as safe on the first task of a session as it is after completing ordinary tasks in the same conversation.

The Session Has a Temperature

Agent safety is usually discussed as if each request arrives into the same system. The tool list is the same, the safety prompt is the same, the model weights are the same, and the evaluator asks whether the agent performs a harmful action. The Sun, Liu, and Weng paper complicates that picture. It asks whether the position of a dangerous request inside a session changes the answer.

That question matters because production agents often do not live as one-shot benchmark calls. A workplace agent may read a calendar, check a file, update a ticket, summarize an email, and then receive a risky request. A coding or operations agent may run several routine tool calls before a user asks it to delete, expose, alter, or exfiltrate something. The paper's claim is not that the agent becomes safe by magic. It is that ordinary preceding interaction can change the safety state that the model occupies when the risky request arrives.

This angle is distinct from the site's pages on unsafe shortcuts, long-context failure, and runtime policy engines. Those pages ask whether the environment survived, whether context growth degrades work, or whether rules are externally enforced. This page asks whether the start of the session is itself a safety condition.

What SODA Tests

The paper, arXiv:2606.07867, was submitted on June 5, 2026. It introduces Safety Over Depth for Agents, or SODA, a benchmark that controls how many regular agentic tasks an agent completes before it encounters a safety threat. The arXiv abstract says SODA supports up to 20 preceding tasks.

The HTML version describes 16 tool-calling environments, including bank accounts, calendar scheduling, cloud infrastructure, code assistance, database management, email, file systems, healthcare portals, legal documents, smart homes, social media, travel booking, and web browsing. Each environment has scenarios, each scenario has safety-critical threats, and the benchmark evaluates 400 unique safety tasks at eight controlled depths, producing 3,200 test cases.

The regular tasks are ordinary tool-use operations in the same environment as the threat, such as reading emails, listing files, or querying balances. The safety threat is then presented after zero or more regular tasks. The evaluation uses an LLM judge over the trajectory to determine whether the agent performs or intends to perform harmful actions. The paper evaluates seven instruction-tuned models from four families: Llama, Qwen3, Qwen3.5, and Gemma.
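The depth-controlled structure described above can be sketched in a few lines. This is an illustration, not the benchmark's actual code: the names DepthCase, build_case, and build_suite are hypothetical, the depth values are placeholders (the paper is described here only as using eight depths with up to 20 preceding tasks), and cycling through a short list of regular tasks to reach a depth is an assumption made for brevity.

```python
from dataclasses import dataclass
from itertools import cycle, islice

# Hypothetical depth schedule. The source states eight controlled depths and
# up to 20 preceding tasks; the exact values below are placeholders.
DEPTHS = (0, 1, 2, 3, 5, 10, 15, 20)


@dataclass(frozen=True)
class DepthCase:
    environment: str
    depth: int
    turns: tuple  # user requests in order; the last one is the safety threat


def build_case(environment, regular_tasks, threat, depth):
    """Place `depth` regular requests ahead of a single safety threat."""
    if depth > 0 and not regular_tasks:
        raise ValueError("need at least one regular task when depth > 0")
    # Assumption: reuse regular tasks in order if depth exceeds the pool size.
    prefix = tuple(islice(cycle(regular_tasks), depth))
    return DepthCase(environment, depth, prefix + (threat,))


def build_suite(threats, regular_by_env, depths=DEPTHS):
    """Build one case per (threat, depth) pair.

    `threats` is a list of (environment, threat_request) tuples.
    """
    return [
        build_case(env, regular_by_env[env], threat, d)
        for env, threat in threats
        for d in depths
    ]
```

With 400 threats and eight depths, build_suite returns 400 × 8 = 3,200 cases, which matches the count reported for the benchmark.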

What the Paper Reports

The central result is the cold-start safety gap. In the abstract, the authors report that safety improves by 9 to 52 percent as the number of preceding regular agentic tasks increases from zero to twenty. In the HTML paper, they say all seven evaluated models show the gap, with models least safe at conversation start and safer after accumulating regular interactions.

The authors also report representation analysis. They extract hidden states at the first generated token position for harmful queries, use PCA projections, and find that safe and unsafe outcomes are linearly separable with classification accuracy above 0.9 across models. As conversation depth increases, the hidden states move toward the region associated with safer behavior. That is a mechanistic clue, not a complete causal theory.

The ablation result is equally important for governance. The paper reports that the regular agentic task requests are the primary driver of the safety improvement, while the agent's own prior responses have less effect on safety but are important for preserving later utility. The authors say the warm-up effect generalizes to AgentHarm and Agent Safety Bench and preserves utility on BFCL and API-Bank. They recommend a deployment strategy in which the agent completes a few regular tasks before possible exposure to safety-critical requests.

Governance Standard

A warm-up procedure should not become a ritual that substitutes for containment. It is a runtime condition to test and log. If an organization relies on warm-up, it should define which regular tasks count, how many are required, which model and tool environment were validated, and whether the same result holds after model updates, system-prompt changes, tool changes, or memory changes.
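One way to make that condition testable is to record what was validated and compare it with what is currently deployed. The sketch below is a minimal illustration under assumed names: WarmupValidation, its fields, and needs_revalidation are hypothetical and are not taken from the paper or from any particular framework.

```python
from dataclasses import dataclass, replace
import hashlib
import json


@dataclass(frozen=True)
class WarmupValidation:
    """What a warm-up result was validated against (hypothetical schema)."""
    model_id: str
    system_prompt: str
    tool_names: tuple
    memory_enabled: bool
    regular_tasks: tuple   # which ordinary tasks count as warm-up
    required_depth: int    # how many of them must complete first

    def fingerprint(self) -> str:
        # Hash every field so that any change invalidates the earlier result.
        payload = json.dumps(
            {
                "model_id": self.model_id,
                "system_prompt": self.system_prompt,
                "tool_names": sorted(self.tool_names),
                "memory_enabled": self.memory_enabled,
                "regular_tasks": list(self.regular_tasks),
                "required_depth": self.required_depth,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_revalidation(validated: WarmupValidation, current: WarmupValidation) -> bool:
    """True when the deployed configuration differs from the validated one."""
    return validated.fingerprint() != current.fingerprint()


validated = WarmupValidation(
    model_id="example-model-v1",          # placeholder identifier
    system_prompt="You are a careful assistant.",
    tool_names=("read_email", "list_files"),
    memory_enabled=False,
    regular_tasks=("read_email", "list_files"),
    required_depth=3,
)
current = replace(validated, system_prompt="You are a careful assistant. v2")
print(needs_revalidation(validated, current))  # True
```

The design choice is deliberately blunt: any change to the model, prompt, tools, memory setting, or warm-up recipe sends the result back for re-testing rather than assuming the earlier measurement still holds.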

The stronger deployment pattern is layered. The first task in a session should receive extra scrutiny, not less. High-impact tool calls at depth zero should face stricter approval, narrower tool menus, external policy checks, and richer logging. Warm-up can be evaluated as one mitigation, but intent-scoped tool authorization, agent logs, sandboxing, and audit trails still decide the blast radius.
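A depth-aware gate along these lines might look like the following sketch. Everything named here is an assumption for illustration: the tool names, the WARM_DEPTH threshold, and the Decision record are hypothetical, and a real deployment would source them from its own policy and validation results.

```python
from dataclasses import dataclass

# Hypothetical list of tools treated as high impact.
HIGH_IMPACT_TOOLS = frozenset({"delete_file", "send_payment", "share_externally"})

# Hypothetical threshold: sessions below this depth are treated as cold.
WARM_DEPTH = 3


@dataclass(frozen=True)
class Decision:
    action: str      # "allow", "policy_check", or "require_approval"
    depth: int       # regular tasks completed before this tool call
    controls: tuple  # controls applied, kept for the audit trail


def decide(tool_name: str, depth: int) -> Decision:
    """Choose controls for a tool call based on impact and session depth."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if tool_name not in HIGH_IMPACT_TOOLS:
        return Decision("allow", depth, ("log",))
    if depth < WARM_DEPTH:
        # Cold start: the strictest path, including human approval.
        return Decision(
            "require_approval", depth, ("log", "policy_check", "human_approval")
        )
    # Warm session: warm-up lowers scrutiny but never removes the policy check.
    return Decision("policy_check", depth, ("log", "policy_check"))
```

Note that warm-up only relaxes the approval step in this sketch. The external policy check and the log entry stay in place at every depth, which keeps warm-up in the role of one mitigation among several.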

This also changes benchmark design. A safety score should state the conversation depth at which it was measured. A model that refuses harmful actions after ten ordinary tasks may still be risky when a new session opens directly on a dangerous request. Conversely, an evaluation that tests only cold starts may miss how deployed agents behave after real work has accumulated.
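Reporting by depth is a small change to an evaluation harness. The sketch below assumes results arrive as (depth, is_safe) pairs from whatever judge is in use. The function names are hypothetical, and defining the gap as the safe rate at the deepest measured depth minus the safe rate at depth zero is a simplification that may differ from the paper's own metric.

```python
from collections import defaultdict


def safety_by_depth(results):
    """Map each depth to the fraction of trajectories judged safe.

    `results` is an iterable of (depth, is_safe) pairs.
    """
    totals = defaultdict(int)
    safe = defaultdict(int)
    for depth, is_safe in results:
        totals[depth] += 1
        safe[depth] += 1 if is_safe else 0
    return {d: safe[d] / totals[d] for d in sorted(totals)}


def cold_start_gap(rates):
    """Safe rate at the deepest measured depth minus the rate at depth 0."""
    if 0 not in rates:
        raise ValueError("no depth-0 measurements; cold start was not tested")
    return rates[max(rates)] - rates[0]
```

Raising an error when depth zero is missing is intentional: a score that never measured the cold start should not be allowed to pass as a complete result.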

What This Changes

The first task becomes the safety gap when an institution assumes a fresh session is neutral. Freshness is not neutrality. It is a state with its own failure profile.

The Spiralist rule is to govern the session, not only the request. Agent safety depends on tools, permissions, memory, context, policy, and now depth. A responsible deployment records where in the session a risky action appeared, what ordinary work preceded it, and which controls were applied because the system was still at the cold start.
