Last reviewed June 25, 2026

The Tool Call Becomes the Judgment Trap

The June 2026 arXiv paper "When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More," by Zhongyuan Wang and Pratyusha Vemuri, tests whether a tool-using agent weighs a learned tool as evidence or simply carries its answer forward.

Tool Access Is Not Judgment

The paper, arXiv:2606.14476 [cs.AI], was submitted on June 12, 2026. The authors ask a deliberately narrow question: when an LLM agent is given a frozen graph neural network as an explicit tool, does it treat that tool as one signal among several, or does it mostly obey it?

That question matters because tool calling is often presented as a way to make agents more grounded. A calculator, search index, database, code interpreter, classifier, or graph model is supposed to give the agent contact with the world outside its prompt. But a tool can also become an authority shortcut: the agent stops evaluating and transports the tool's answer through a natural-language wrapper.

What the Paper Tests

Wang and Vemuri train a graph convolutional network (GCN) on node classification, freeze it, and expose it to a ReAct-style LLM agent as a callable tool. The tool can return a predicted label with confidence, a reconstruction anomaly score, and link probabilities to neighbors. The main dataset is ogbn-arxiv, a text-attributed citation graph with roughly 169,000 nodes and 40 classes; the replication uses WikiCS, a graph of Wikipedia computer-science articles.

The comparison has four arms: an agent with the GNN tool, an agent with a minimal neighbor-label navigation tool, the frozen GNN alone, and an agent with no graph tool. The main backbone sweep uses Qwen2.5-Instruct from 0.5B to 7B parameters. The key metrics are agreement with the raw GNN and an oracle gap: how much better a per-node chooser would do if it could pick the best available arm for each case.
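The two metrics are easy to state in code. The following is a minimal sketch, not the authors' evaluation script: the function names, arm names, and toy predictions are all invented for illustration, and the oracle is assumed to count a node as correct whenever any listed arm gets it right.

```python
from typing import Dict, Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of nodes whose predicted label matches the true label."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("need equal-length, non-empty sequences")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def agreement_rate(agent_preds: Sequence[int], tool_preds: Sequence[int]) -> float:
    """Fraction of nodes where the agent's final label equals the raw tool's label."""
    return accuracy(agent_preds, tool_preds)


def oracle_gap(
    arms: Dict[str, Sequence[int]], labels: Sequence[int], reference_arm: str
) -> float:
    """Per-node oracle accuracy minus the reference arm's accuracy.

    Assumption: the oracle is correct on a node if at least one arm is correct.
    """
    n = len(labels)
    oracle_hits = sum(
        any(preds[i] == labels[i] for preds in arms.values()) for i in range(n)
    )
    return oracle_hits / n - accuracy(arms[reference_arm], labels)


# Toy data, invented for illustration only (not results from the paper).
labels = [0, 1, 2, 1, 0]
arms = {
    "agent_with_gnn": [0, 1, 1, 1, 2],
    "agent_with_neighbor_tool": [0, 0, 2, 1, 2],
    "gnn_alone": [0, 1, 1, 1, 2],
    "agent_no_tool": [1, 1, 0, 1, 0],
}

print("agreement with raw GNN:", agreement_rate(arms["agent_with_gnn"], arms["gnn_alone"]))
print("oracle gap:", oracle_gap(arms, labels, "agent_with_gnn"))
```

In the toy data the agent copies the GNN on every node, so agreement is total, while the other arms are right on exactly the nodes where the GNN is wrong. That is the shape of the failure the oracle gap is meant to expose.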

The Evidence

The 7B agent with the GNN tool agrees with the raw GNN 97.6% to 99.2% of the time across local-homophily regimes on ogbn-arxiv. The paper calls this a parrot effect: the agent's final prediction is nearly the same as the tool's prediction. The result is sharper because the tool exposes more than one signal. It can provide confidence, anomaly score, and link probabilities, with a budget for multiple calls, yet in 83% of 7B queries the agent makes exactly one call.

The capability sweep is the uncomfortable part. The 0.5B model barely uses the tool. Among models that do use it, agreement with the GNN rises from about 0.60 at 1.5B to about 0.98 at 7B. In this experiment, more capable backbones do not become more skeptical callers. They become more complete deferrers.

The deference also has a measurable cost. The paper reports that a per-node oracle over available actions beats the parroting agent by 0.09 to 0.18 in accuracy at 3B and by 0.12 to 0.22 at 7B. At high homophily, the neighbor-label tool reaches 0.81 accuracy while the GNN arm is at 0.71, but the agent still follows the GNN. A simple selective-invocation gate lifts high-homophily accuracy from 0.71 to 0.83, yet hurts other regimes and gives no net global gain. On WikiCS, the parrot effect reproduces with 0.96 to 1.00 agreement and positive oracle gaps.
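A gate that helps one regime and hurts another only shows up if accuracy is broken out per regime. The sketch below illustrates that bookkeeping. The gate rule is hypothetical (this post does not describe how the paper's gate decides), and the `Node` fields, the threshold, and the four toy nodes are assumptions made up for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Node(NamedTuple):
    regime: str           # true local-homophily bucket, e.g. "low" or "high"
    homophily_est: float  # hypothetical noisy estimate the gate can see
    label: int
    gnn_pred: int
    neighbor_pred: int


def always_gnn(node: Node) -> int:
    """Baseline: follow the GNN tool on every node."""
    return node.gnn_pred


def gated(node: Node, threshold: float = 0.6) -> int:
    """Hypothetical gate: use neighbor labels when estimated homophily is high."""
    return node.neighbor_pred if node.homophily_est >= threshold else node.gnn_pred


def per_regime_accuracy(
    nodes: List[Node], predict: Callable[[Node], int]
) -> Dict[str, float]:
    """Accuracy per regime, plus a pooled "global" entry over all nodes."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for node in nodes:
        correct = int(predict(node) == node.label)
        for key in (node.regime, "global"):
            hits[key] += correct
            totals[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Toy nodes, invented for illustration only (not results from the paper).
# The first "low" node has an overestimated homophily, so the gate misroutes it.
nodes = [
    Node("high", 0.9, label=1, gnn_pred=0, neighbor_pred=1),
    Node("high", 0.8, label=2, gnn_pred=2, neighbor_pred=2),
    Node("low", 0.7, label=0, gnn_pred=0, neighbor_pred=3),
    Node("low", 0.2, label=3, gnn_pred=3, neighbor_pred=1),
]

baseline = per_regime_accuracy(nodes, always_gnn)
with_gate = per_regime_accuracy(nodes, gated)
for key in ("high", "low", "global"):
    print(key, baseline[key], with_gate[key])
```

On this toy set the gate raises high-homophily accuracy, lowers low-homophily accuracy, and leaves the pooled number unchanged, which is why a single aggregate score cannot tell whether a gate is worth deploying.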

Why Deference Matters

The finding cuts against a common interface story. A user sees an agent, a tool list, and an explanation-shaped answer. The surface suggests deliberation. The measured behavior can be thinner: the tool produced an answer, and the agent carried it forward.

That does not mean tools are bad. The danger is confusing tool access with judgment. If the agent almost never challenges a learned predictor, then the combined system inherits the predictor's errors while making them harder to see. The language layer can turn a tool output into a confident account, and the user may attribute that account to a broader reasoning process that did not happen.

This is especially relevant for screening, moderation, fraud detection, scientific triage, legal routing, hiring filters, and other domains where a learned component is embedded inside an agentic workflow. The agent may be a courier, not a second opinion.

Governance Standard

Any agent-plus-learned-tool system should publish a tool-deference card before it is treated as an evaluated workflow. The card should report raw-tool accuracy, agent-with-tool accuracy, tool-free agent accuracy, agreement with the raw tool, accuracy when the agent disagrees, tool-call counts, budget limits, prompt or scaffold language, tool version, model family, and override conditions.
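One way to make the card concrete is a small typed record that can be serialized alongside evaluation results. This is a sketch of a possible schema, not an established standard: the class name, field names, the `problems` check, and every value in the example are placeholders chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToolDeferenceCard:
    raw_tool_accuracy: float
    agent_with_tool_accuracy: float
    tool_free_agent_accuracy: float
    agreement_with_raw_tool: float
    accuracy_when_agent_disagrees: Optional[float]  # None if it never disagreed
    tool_call_counts: Dict[int, int]  # calls per query -> number of queries
    call_budget: int
    scaffold_prompt: str
    tool_version: str
    model_family: str
    override_conditions: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the card looks complete."""
        issues = []
        rates = {
            "raw_tool_accuracy": self.raw_tool_accuracy,
            "agent_with_tool_accuracy": self.agent_with_tool_accuracy,
            "tool_free_agent_accuracy": self.tool_free_agent_accuracy,
            "agreement_with_raw_tool": self.agreement_with_raw_tool,
            "accuracy_when_agent_disagrees": self.accuracy_when_agent_disagrees,
        }
        for name, value in rates.items():
            if value is not None and not 0.0 <= value <= 1.0:
                issues.append(f"{name} must be in [0, 1]")
        if any(calls > self.call_budget for calls in self.tool_call_counts):
            issues.append("tool_call_counts exceeds call_budget")
        if not self.override_conditions:
            issues.append("no override conditions documented")
        return issues


# Placeholder values for illustration only (not figures from the paper).
example_card = ToolDeferenceCard(
    raw_tool_accuracy=0.70,
    agent_with_tool_accuracy=0.70,
    tool_free_agent_accuracy=0.55,
    agreement_with_raw_tool=0.98,
    accuracy_when_agent_disagrees=0.40,
    tool_call_counts={1: 80, 2: 15, 3: 5},
    call_budget=3,
    scaffold_prompt="ReAct-style scaffold (placeholder text)",
    tool_version="gcn-frozen-v1 (placeholder)",
    model_family="example-model-7b (placeholder)",
)

print(json.dumps(asdict(example_card), indent=2))
print(example_card.problems())
```

Keeping `accuracy_when_agent_disagrees` optional is deliberate: an agent that never disagrees with its tool has no such number, and the card should say so rather than report a zero.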

The trace should separate four events: tool observation, agent interpretation, final answer, and post-hoc explanation. If the agent sees a confidence score or anomaly score but almost never uses it, that is a behavioral fact. If a stronger backbone defers more, that should be reported as a scaling result rather than hidden inside one aggregate score.
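A trace format can enforce that separation by giving each event its own type. The sketch below is one possible shape, with hypothetical names (`EventKind`, `TraceEvent`, `followed_tool`, `missing_kinds`) and an invented example trace; it assumes labels are integers and that the last tool observation is the one the agent acted on.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class EventKind(Enum):
    TOOL_OBSERVATION = "tool_observation"
    AGENT_INTERPRETATION = "agent_interpretation"
    FINAL_ANSWER = "final_answer"
    POST_HOC_EXPLANATION = "post_hoc_explanation"


@dataclass
class TraceEvent:
    kind: EventKind
    content: str
    label: Optional[int] = None  # set on tool observations and final answers


def followed_tool(trace: List[TraceEvent]) -> Optional[bool]:
    """True if the final answer repeats the last tool label; None if either is absent."""
    tool_labels = [
        e.label for e in trace
        if e.kind is EventKind.TOOL_OBSERVATION and e.label is not None
    ]
    final_labels = [
        e.label for e in trace
        if e.kind is EventKind.FINAL_ANSWER and e.label is not None
    ]
    if not tool_labels or not final_labels:
        return None
    return final_labels[-1] == tool_labels[-1]


def missing_kinds(trace: List[TraceEvent]) -> List[EventKind]:
    """Event kinds that never appear in the trace, in declaration order."""
    present = {e.kind for e in trace}
    return [kind for kind in EventKind if kind not in present]


# Invented example trace: the agent goes straight from observation to answer.
trace = [
    TraceEvent(EventKind.TOOL_OBSERVATION, "label=7, confidence=0.52", label=7),
    TraceEvent(EventKind.FINAL_ANSWER, "The node belongs to class 7.", label=7),
    TraceEvent(EventKind.POST_HOC_EXPLANATION, "The citation pattern suggests class 7."),
]

print(followed_tool(trace))
print(missing_kinds(trace))
```

A trace with no interpretation event between the observation and the answer is itself a finding: the explanation that follows was written after the decision, not before it.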

The audit rule is simple: do not say the agent validated the tool unless the evaluation shows useful disagreement. Measure the cases where the tool is wrong and another available arm is right. Measure whether a gate works across seeds and regimes, not only on one convenient split.
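The audit rule translates into a few counts over per-node predictions. The following is a rough sketch under stated assumptions: `disagreement_audit` and its output keys are hypothetical names, a single alternative arm stands in for "another available arm", and the toy predictions are invented.

```python
from typing import Dict, Optional, Sequence, Union


def disagreement_audit(
    labels: Sequence[int],
    tool_preds: Sequence[int],
    agent_preds: Sequence[int],
    alt_preds: Sequence[int],
) -> Dict[str, Union[int, float, None]]:
    """Summarize how often the agent departs from the tool and whether that helps."""
    n = len(labels)
    if n == 0 or not (len(tool_preds) == len(agent_preds) == len(alt_preds) == n):
        raise ValueError("need equal-length, non-empty sequences")

    # Nodes where the agent's final answer differs from the raw tool output.
    disagree = [i for i in range(n) if agent_preds[i] != tool_preds[i]]
    # Nodes where the tool is wrong but the alternative arm is right.
    rescuable = [
        i for i in range(n)
        if tool_preds[i] != labels[i] and alt_preds[i] == labels[i]
    ]
    # Rescuable nodes on which the agent actually ended up correct.
    rescued = [i for i in rescuable if agent_preds[i] == labels[i]]

    accuracy_when_disagreeing: Optional[float] = None
    if disagree:
        correct = sum(agent_preds[i] == labels[i] for i in disagree)
        accuracy_when_disagreeing = correct / len(disagree)

    return {
        "disagreement_rate": len(disagree) / n,
        "accuracy_when_disagreeing": accuracy_when_disagreeing,
        "rescuable_cases": len(rescuable),
        "rescued_cases": len(rescued),
    }


# Toy data, invented for illustration only (not results from the paper).
labels = [0, 1, 2, 3, 4, 5]
tool_preds = [0, 1, 9, 9, 4, 9]
alt_preds = [0, 0, 2, 3, 4, 0]
agent_preds = [0, 1, 2, 9, 4, 9]

report = disagreement_audit(labels, tool_preds, agent_preds, alt_preds)
print(report)
```

A claim that the agent "validated" the tool would need `rescued_cases` to be a meaningful share of `rescuable_cases`, measured per seed and per regime rather than on one pooled split.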

Scope Boundary

The paper does not prove that all LLM agents blindly obey all tools. Its strongest quantitative claims are scoped to ogbn-arxiv and WikiCS node classification, a frozen GCN tool, and the tested backbones. The authors also report that near-total parroting is strongest for Qwen in their setup, while Mistral and OLMo controls defer only partially. The lesson is that blind deference is a real failure mode to measure before tool-using agents are trusted as reviewers of their own tools.

Sources

Zhongyuan Wang and Pratyusha Vemuri, When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More, arXiv:2606.14476 [cs.AI], submitted June 12, 2026.
Zhongyuan Wang and Pratyusha Vemuri, When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More, arXiv PDF, reviewed June 25, 2026.
