Blog · arXiv Analysis · Last reviewed July 2, 2026

The Static Tool Benchmark Becomes the Open-World Trap

Song-Lin Lv, Weiming Wu, Rui Zhu, Zi-Jian Cheng, and Lan-Zhe Guo's July 2026 arXiv paper studies a practical agent failure: a tool-use model can master a closed benchmark and still fail when real tools, returns, instructions, or domains drift.

For this essay, an open-world tool receipt is the evidence record that binds a tool-use agent score to the actual shift tested: query wording, tool names, tool descriptions, observation formats, error signals, redirected values, dependency graph, domain, refusal condition, training regime, and whether the trajectory adapted or hallucinated success.

The Claim

The paper, arXiv:2607.01084 [cs.AI], was submitted on July 1, 2026 and is marked by arXiv as accepted by ICML 2026. Its title is "Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use."

The central claim is that static tool-use training can create an illusion of mastery. An agent can converge toward near-perfect success when train and test environments share the same query shapes, tool schemas, observation patterns, and domain, yet degrade when any of those assumptions move.

That matters because production tool environments move constantly. APIs are renamed, return formats change, errors become partial rather than clean, values are deprecated, tools are removed, and user requests arrive with ambiguity that never appeared in the demonstration set.

The Paper Frame

The authors formalize OpenAgent as tool-use under distributional shifts in the agent-environment loop. The object of evaluation is not one prompt or one answer. It is a trajectory in which the model reads a user query, chooses a tool action, receives an observation, updates its history, and acts again.

This makes tool-use shift different from ordinary input perturbation. If an early tool name, observation, or condition changes, the error can propagate through later calls. A local misread becomes a wrong tool chain.
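The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's harness: the `Step`, `Trajectory`, and `run_episode` names, the action dictionary shape, and the error observation format are all assumptions made for this essay.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    tool: str
    args: dict
    observation: dict


@dataclass
class Trajectory:
    query: str
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None


def run_episode(query: str,
                policy: Callable[[Trajectory], dict],
                tools: Dict[str, Callable[..., dict]],
                max_steps: int = 8) -> Trajectory:
    """Roll out one trajectory: act, observe, append to history, repeat."""
    traj = Trajectory(query=query)
    for _ in range(max_steps):
        action = policy(traj)  # the policy sees the full history so far
        if action["type"] == "answer":
            traj.answer = action["content"]
            break
        tool_fn = tools.get(action["tool"])
        if tool_fn is None:
            # A renamed or removed tool surfaces as an observation, not a crash.
            obs = {"status": "error", "message": "unknown tool: " + action["tool"]}
        else:
            obs = tool_fn(**action["args"])
        traj.steps.append(Step(action["tool"], action["args"], obs))
    return traj
```

Because every later decision is conditioned on `traj.steps`, a misread at step one is carried into every subsequent call, which is the propagation effect the paragraph above describes.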

The experimental platform is a controlled sandbox built around geographical point-of-interest queries and calculation tasks. The paper reports 6,050 training samples and 880 evaluation samples, with strict separation so test variation patterns do not appear in training.

The backbone is Qwen2.5-7B-Instruct. The comparison uses full-parameter supervised fine-tuning and reinforcement learning through GRPO with sparse answer and format rewards. The authors track correctness, invalid tool use, efficiency, active exploration, and refusal behavior.
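To make "sparse answer and format rewards" concrete, here is a hedged sketch of what such a reward and GRPO's group-relative normalization could look like. The weights, function names, and boolean inputs are assumptions for illustration; the paper's exact reward definition is not reproduced here. The only general fact relied on is that GRPO scores each rollout relative to the other rollouts sampled for the same prompt.

```python
import statistics
from typing import List


def sparse_reward(answer_correct: bool, format_ok: bool,
                  answer_weight: float = 1.0, format_weight: float = 0.1) -> float:
    """Terminal-only reward: no credit for intermediate steps (weights are assumed)."""
    return answer_weight * float(answer_correct) + format_weight * float(format_ok)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In this sketch, a group in which every rollout receives the same reward yields advantages near zero, so the group contributes almost no learning signal.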

Open-World Shift

OpenAgent decomposes drift into four kinds of shift. Query shift changes the surface form or implied intent of the user's request. Action-space shift changes tool names, tool documentation, available tools, or the dependency structure among tools. Observation shift changes return formats, null values, error states, or corrective feedback. Domain shift changes the whole task surface while preserving an abstract reasoning pattern.

The useful distinction is between informative anomalies and terminal failures. A tool redirection or deprecated value may require adaptation. A fatal error with no alternative may require refusal. Static benchmarks often reward answer completion so consistently that the agent never learns this boundary.
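One way to picture that boundary is a small observation triage. The sketch below assumes a hypothetical observation format with `status`, `data`, `use_tool`, and `use_value` fields; none of these names come from the paper. It treats any error or empty return without an offered alternative as terminal, which is a deliberately conservative simplification.

```python
from enum import Enum


class ObsKind(Enum):
    OK = "ok"
    INFORMATIVE_ANOMALY = "informative_anomaly"
    TERMINAL_FAILURE = "terminal_failure"


def classify_observation(obs: dict) -> ObsKind:
    # A redirection or replacement hint means there is still a path forward.
    if obs.get("use_tool") or obs.get("use_value"):
        return ObsKind.INFORMATIVE_ANOMALY
    if obs.get("status") == "ok" and obs.get("data") is not None:
        return ObsKind.OK
    # Errors or empty returns with no alternative are treated as terminal here.
    return ObsKind.TERMINAL_FAILURE


def next_move(obs: dict) -> str:
    """Map the observation class to the behavior the agent should show."""
    return {
        ObsKind.OK: "continue",
        ObsKind.INFORMATIVE_ANOMALY: "adapt",
        ObsKind.TERMINAL_FAILURE: "refuse",
    }[classify_observation(obs)]
```

A benchmark that only ever produces the first branch gives the agent no occasion to learn the other two.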

Four-Tier Test

The benchmark organizes these shifts into four tiers. Tier 1, perception, tests whether the agent can parse messy user intent, adapt to changed call schemas, and ground tool choices in functional descriptions rather than memorized names.

Tier 2, interaction, tests whether the agent treats observations as state updates. It includes return-format changes, explicit error returns, null returns, value redirection, and tool redirection.

Tier 3, reasoning, tests whether the agent can update rules and execution graphs. It asks whether the model follows changed calculation semantics, recognizes shortcuts, or handles logic inversion when the documented dependency order reverses.
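The logic-inversion case can be illustrated by deriving the call order from whatever the current documentation declares rather than from memory. The sketch below uses Python's standard `graphlib` module; the tool names and the two dependency maps are hypothetical examples, not the benchmark's actual tools.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return a call order in which every tool runs after its prerequisites."""
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical documentation seen in training: geocode first, then distance.
train_docs = {"geocode": set(), "distance": {"geocode"}}
# Hypothetical test-time documentation with the dependency reversed.
test_docs = {"distance": set(), "geocode": {"distance"}}

print(execution_order(train_docs))  # ['geocode', 'distance']
print(execution_order(test_docs))   # ['distance', 'geocode']
```

An agent that replays the first order in the second environment is following the training topology, not the documented one.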

Tier 4, internalization, tests whether the agent has learned a task principle rather than a script. It includes active refusal for missing conditions or fatal errors and domain transfer from the original POI setting to a medical-registration style setting.

Failure Modes

The most important SFT failure is symbolic anchoring. When names and descriptions disagree, the SFT model tends to follow familiar surface tokens. The paper describes this as brittle pattern matching: the agent has learned the route through the tool names, not the current tool semantics.
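As a toy contrast to symbolic anchoring, the following sketch selects a tool purely from its description and never looks at its name. The token-overlap scoring is a stand-in chosen for brevity, and the tool names and descriptions are invented for this example.

```python
import re
from typing import Dict, Set


def tokens(text: str) -> Set[str]:
    """Lowercase alphabetic tokens of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_tool(intent: str, tool_docs: Dict[str, str]) -> str:
    """Pick the tool whose description overlaps most with the intent; names are ignored."""
    wanted = tokens(intent)
    return max(tool_docs, key=lambda name: len(wanted & tokens(tool_docs[name])))


# Hypothetical shifted environment: the names are swapped relative to training.
shifted_docs = {
    "get_distance": "search for points of interest near a location",
    "search_poi": "compute the distance between two coordinates",
}

print(select_tool("compute distance between two coordinates", shifted_docs))
# search_poi
```

A name-anchored policy would call `get_distance` here and receive a list of places instead of a number.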

The most important interaction failure is trajectory inertia. Under explicit feedback, an SFT agent may treat the observation as if it confirmed the original plan, ignore the redirection, and continue with a hallucinated successful return. In the paper's language, it resembles an open-loop execution policy.

RL helps, but it does not solve the problem. The RL agent often handles explicit feedback better and grounds names more semantically, but both SFT and RL degrade under global dependency inversion. When the causal order of tools changes, both can hallucinate the training-set topology despite current documentation.

The safety-critical failure is boundary blindness. In fatal-error cases, SFT may ignore the error and answer anyway. RL may notice the error and still fabricate an answer rather than refuse. That is worse than ordinary overconfidence: the reward structure has taught the agent that completion is the goal even when the environment says completion is not currently possible.

PAFT

The proposed intervention is Perturbation-Augmented Fine-Tuning, or PAFT. Its premise is straightforward: if clean demonstrations never contain error states, ambiguous observations, symbolic variation, or unsolvable cases, the model has no reason to learn recovery or refusal.

PAFT adds trajectory-level perturbations rather than isolated input rewrites. Environmental Feedback Perturbation injects anomalies such as redirection or deprecation into a trajectory and supervises corrective actions. Solvability Boundary Perturbation creates fatal-error examples with explicit refusal. Symbolic Representation Perturbation changes tool names and documentation so the model must rely on functional meaning.
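The three perturbation families can be sketched as transformations over a recorded trajectory. This is an illustration of the idea under assumed data shapes, not the authors' implementation: the trajectory dictionary layout, the observation fields, and the refusal wording are all hypothetical.

```python
import copy


def inject_redirection(traj: dict, step_idx: int, new_tool: str) -> dict:
    """Environmental feedback perturbation: the original call now returns a
    redirection, and the supervised next step follows it."""
    out = copy.deepcopy(traj)
    step = out["steps"][step_idx]
    redirected = {"tool": step["tool"], "args": dict(step["args"]),
                  "observation": {"status": "redirect", "use_tool": new_tool}}
    corrective = {"tool": new_tool, "args": dict(step["args"]),
                  "observation": step["observation"]}
    out["steps"][step_idx:step_idx + 1] = [redirected, corrective]
    return out


def inject_fatal_error(traj: dict, step_idx: int) -> dict:
    """Solvability boundary perturbation: truncate at a fatal error and
    replace the answer with an explicit refusal target."""
    out = copy.deepcopy(traj)
    out["steps"] = out["steps"][:step_idx + 1]
    out["steps"][step_idx]["observation"] = {"status": "error", "fatal": True}
    out["answer"] = None
    out["refusal"] = "The tool reported a fatal error, so this request cannot be completed."
    return out


def rename_tools(traj: dict, mapping: dict) -> dict:
    """Symbolic representation perturbation: change names, keep descriptions."""
    out = copy.deepcopy(traj)
    out["tool_docs"] = {mapping.get(n, n): d for n, d in out["tool_docs"].items()}
    for step in out["steps"]:
        step["tool"] = mapping.get(step["tool"], step["tool"])
    return out
```

Each function returns a new training example and leaves the clean demonstration untouched, so clean and perturbed trajectories can be mixed in one fine-tuning set.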

On selected challenge tasks, the paper reports that at SFT-200, accuracy deltas of -67.7, -48.2, and -39.9 with a 0.3 refusal rate become PAFT scores of +28.6, +26.5, and +22.7 with a 99.3 refusal rate. At SFT-800, PAFT still leaves modest negative deltas on the three accuracy tasks, but raises fatal-error refusal from 0.2 to 99.6.

The ablation result is useful governance evidence. Environmental feedback perturbation mainly helps interaction adaptation, solvability boundary perturbation mainly restores refusal, and symbolic representation perturbation mainly helps perception and reasoning. Robustness is not one knob.

Governance Reading

The governance lesson is that an agent benchmark should not report only static task success. It should say what changed between training and test: the user query, the tool schema, the tool semantics, the observation channel, the dependency graph, the solvability condition, or the domain.

This is also a procurement problem. A vendor can show a tool-use demo where the model calls the right APIs in a stable sandbox. The operational question is whether the same model adapts when the API returns a new error code, a tool is renamed, a value is deprecated, or a task is impossible.

For deployed agents, refusal is not an annoyance. It is a core control. A tool-use system that cannot say "the current environment does not support this action" will eventually turn missing state, broken tools, or stale assumptions into fabricated completion.

Open-World Receipts

An open-world tool receipt should include the training tool set, test tool set, schema changes, renamed tools, changed descriptions, added distractors, removed tools, observation formats, error and null-return behavior, redirection messages, dependency-graph changes, domain-transfer mapping, and the exact solvability rule.
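Part of that receipt can be generated mechanically. The sketch below diffs a training tool set against a test tool set, assuming each is a simple mapping from tool name to description. The heuristic that an added tool with an identical description is a rename is an assumption of this example, and the field names are invented.

```python
from typing import Dict


def diff_tool_sets(train: Dict[str, str], test: Dict[str, str]) -> dict:
    """Summarize what changed in the action space between training and test."""
    removed = set(train) - set(test)
    added = set(test) - set(train)
    # Heuristic: same description under a new name is recorded as a rename.
    renamed = sorted((old, new) for old in removed for new in added
                     if train[old] == test[new])
    removed -= {old for old, _ in renamed}
    added -= {new for _, new in renamed}
    changed = sorted(n for n in set(train) & set(test) if train[n] != test[n])
    return {
        "renamed": renamed,
        "removed": sorted(removed),
        "added": sorted(added),
        "changed_descriptions": changed,
    }
```

Attaching this diff to a reported score tells a reader which action-space shift the number actually covers.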

It should also include trajectory evidence: chosen tool, rejected alternatives, returned observation, whether the model repeated an invalid call, whether it parsed corrective feedback, whether it hallucinated a successful observation, whether it refused, and whether refusal was correct.

For agent governance, this receipt is more useful than a single success rate. It separates wrong-tool errors, wrong-order errors, feedback-ignoring errors, forced-completion errors, and domain-transfer failures. Each one asks for a different control.
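The separation of error classes can be sketched as a small labeling function over trajectory evidence. The field names and the mapping from evidence to labels are assumptions made for illustration; a real receipt would derive these flags from logged trajectories rather than take them as inputs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrajectoryEvidence:
    chose_expected_tool: bool
    followed_documented_order: bool
    parsed_corrective_feedback: bool
    task_solvable: bool
    refused: bool
    shifted_domain: bool
    answer_correct: bool


def failure_classes(ev: TrajectoryEvidence) -> List[str]:
    """Return every failure label that applies to one trajectory."""
    labels = []
    if not ev.chose_expected_tool:
        labels.append("wrong-tool")
    if not ev.followed_documented_order:
        labels.append("wrong-order")
    if not ev.parsed_corrective_feedback:
        labels.append("feedback-ignoring")
    if not ev.task_solvable and not ev.refused:
        labels.append("forced-completion")
    if ev.shifted_domain and not ev.answer_correct:
        labels.append("domain-transfer")
    return labels
```

Counting labels across an evaluation set gives one rate per failure class instead of a single blended success number.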

Limits

The evidence is strongest as a controlled diagnostic, not as a complete map of production tool-use risk. The sandbox isolates variables by design, and the real-API validation replaces one distance tool with an Amap driving-distance API rather than recreating the full mess of enterprise systems.

The model comparison is also scoped. The paper uses Qwen2.5-7B-Instruct as the backbone and compares SFT and GRPO-style RL under its setup. The result should guide evaluation design, not become a universal ranking of post-training methods.

PAFT is an intervention for the diagnosed failure classes, not a deployment guarantee. A production agent still needs deterministic tool permissions, schema validation, live compatibility checks, fallback paths, human escalation, observability, and change management around the tool layer.
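As one example of a deterministic control around the tool layer, the sketch below validates a proposed tool call against a declared schema before it is executed. The schema layout, with `required` and `optional` maps from argument name to Python type, is a hypothetical format chosen for this essay.

```python
from typing import Dict, List


def validate_call(call: dict, schemas: Dict[str, dict]) -> List[str]:
    """Return a list of problems with a proposed call; empty means it may run."""
    schema = schemas.get(call["tool"])
    if schema is None:
        return ["unknown tool: " + call["tool"]]
    errors = []
    required = schema.get("required", {})
    optional = schema.get("optional", {})
    for name, expected in required.items():
        if name not in call["args"]:
            errors.append("missing argument: " + name)
        elif not isinstance(call["args"][name], expected):
            errors.append("wrong type for " + name + ": expected " + expected.__name__)
    for name in call["args"]:
        if name not in required and name not in optional:
            errors.append("unexpected argument: " + name)
    return errors
```

A check like this runs outside the model, so a stale tool name or argument is caught even when the agent is confident in it.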

Source Discipline

This page treats the arXiv abstract, arXiv HTML, project page, and GitHub repository as the source set. It uses the paper's reported dataset sizes, training setup, PAFT table values, qualitative failure taxonomy, and real-API validation as author-reported evidence.

The page does not claim that PAFT solves open-world agency, that RL is generally safe, or that a controlled POI sandbox is sufficient for deployment. The claim is narrower: static success is not enough evidence for tool-use agents unless shift behavior and refusal behavior are tested.