Blog · arXiv Analysis · Last reviewed June 25, 2026

The Tool Call Becomes the Wrong Target

A June 2026 arXiv paper separates tool correctness from entity correctness and shows why an agent can pick the right API while acting on the wrong person, document, ticket, account, or thread.

The Missing Half of Tool Use

Agent safety often asks whether the model chose the right tool. Did it call email instead of calendar? Did it produce valid JSON? Did it complete the workflow? Those are useful checks, but they miss a quieter failure: the agent can choose the right operation and bind it to the wrong real-world target.

The Spiralist angle is that the tool call becomes the wrong target. A valid API invocation is not safe if the recipient, ticket, document, account, event, or thread is wrong. In offices, schools, clinics, support desks, repositories, and payment systems, the harm often attaches less to the action verb than to the object of the action.

This is not a mystical failure. It is an addressability failure. Natural language points loosely; databases require exact identifiers. Between those two layers sits a binding decision that should be treated as a governance boundary, not as incidental prompt plumbing.

The Paper Frame

The source is Rahul Suresh Babu and Shashank Indukuri's "Entity Binding Failures in Tool-Augmented Agents," arXiv:2606.30531v1 [cs.AI]. The arXiv record lists submission on June 29, 2026, with the primary subject Artificial Intelligence.

The paper defines entity binding failures as cases where an agent selects the correct tool but applies it to the wrong external entity. Its examples include contacts with the same name, overlapping launch documents, overlapping email threads, similar customer records, and ambiguous calendar events.

The paper's contribution is useful because it separates two measures that are often collapsed. Tool correctness asks whether the selected operation is right. Entity correctness asks whether the operation is attached to the intended target. A system can pass the first test and still fail the second.
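The two measures can be kept apart in a few lines. The sketch below is an illustration, not the paper's scoring code: the `ToolCall` type, the `score` function, and the contact identifiers are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str       # operation name, e.g. "send_email"
    entity_id: str  # exact identifier of the target the call acts on


def score(expected: ToolCall, actual: ToolCall) -> dict[str, bool]:
    """Score tool choice and entity choice as two separate checks."""
    return {
        "tool_correct": expected.tool == actual.tool,
        "entity_correct": expected.entity_id == actual.entity_id,
    }


# Hypothetical name collision: two contacts called Alex.
expected = ToolCall("send_email", "contact:alex.chen")
actual = ToolCall("send_email", "contact:alex.kim")
result = score(expected, actual)
print(result)
```

An audit that reads only `tool_correct` would pass this call, even though the message goes to the wrong Alex.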

The Testbed

Babu and Indukuri build a controlled diagnostic testbed with 60 tasks across email, calendar, document, customer-record, and issue-tracking workflows. The ambiguity taxonomy includes name collisions, document-version ambiguity, temporal ambiguity, account collisions, near-duplicate records, cross-system references, and true ambiguity.

The evaluation covers 1,800 model-method-task runs: 60 tasks, five model backends, and six tool-use methods. The model backends are Amazon Nova 2 Lite, Amazon Nova Premier, Claude Opus, Claude Sonnet, and Llama 3.3 70B Instruct. The six methods are direct execution, semantic filtering, CMTF only, entity retrieval, confidence-gated binding, and entity-aware CMTF with provenance.
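The run count follows directly from the grid. A minimal sketch of that arithmetic, using the model and method names as the article lists them; the task indices are placeholders, since the individual task identifiers are not given here.

```python
from itertools import product

TASK_COUNT = 60  # placeholder indices 0..59 stand in for the 60 tasks

MODELS = [
    "Amazon Nova 2 Lite",
    "Amazon Nova Premier",
    "Claude Opus",
    "Claude Sonnet",
    "Llama 3.3 70B Instruct",
]

METHODS = [
    "direct execution",
    "semantic filtering",
    "CMTF only",
    "entity retrieval",
    "confidence-gated binding",
    "entity-aware CMTF with provenance",
]

# One run per (task, model, method) combination.
runs = list(product(range(TASK_COUNT), MODELS, METHODS))
print(len(runs))  # 60 * 5 * 6 = 1800
```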

What Failed

The striking result is that all methods achieved a 0.0 percent wrong-tool rate in the aggregate evaluation. The agents were not picking the wrong API. Yet action-oriented baselines still produced wrong-entity actions in roughly a quarter of runs: direct execution and entity retrieval each reached 26.0 percent, CMTF only reached 25.7 percent, and semantic filtering reached 24.0 percent.

Entity-aware methods changed the failure mode. Confidence-gated binding and entity-aware CMTF with provenance both reached 0.0 percent wrong-entity action rate and 0.0 risk-weighted wrong-entity exposure in this diagnostic setting. They did so by deferring or asking for clarification when the target entity was not sufficiently grounded.
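One way to picture a gate of this kind is a threshold on the best candidate plus a margin over the runner-up. The sketch below is an assumption about how such a gate could look, not the paper's implementation: the `Candidate` type, the `bind` function, the match scores, and the `min_score` and `min_margin` values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    entity_id: str
    score: float  # hypothetical match score in [0, 1]


def bind(candidates: list[Candidate],
         min_score: float = 0.8,
         min_margin: float = 0.2) -> tuple[str, list[str]]:
    """Return ("act", [entity_id]) only when one candidate clearly wins.

    Otherwise return ("clarify", [all candidate ids]) so the caller can ask.
    """
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    if not ranked:
        return ("clarify", [])
    top = ranked[0]
    runner_up = ranked[1].score if len(ranked) > 1 else 0.0
    # Act only if the top match is strong AND well separated from the rest.
    if top.score >= min_score and top.score - runner_up >= min_margin:
        return ("act", [top.entity_id])
    return ("clarify", [c.entity_id for c in ranked])


# Two near-identical contacts: the gate defers instead of guessing.
ambiguous = bind([Candidate("contact:alex.chen", 0.90),
                  Candidate("contact:alex.kim", 0.85)])

# One clear match: the gate lets the action through.
grounded = bind([Candidate("contact:alex.chen", 0.95),
                 Candidate("contact:alex.kim", 0.30)])

print(ambiguous)
print(grounded)
```

The margin check is the part that matters for name collisions: a high score alone does not make a binding unique when a second candidate scores almost as high.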

The tradeoff is visible. Action-oriented methods reached 74.0 to 75.0 percent task success, while confidence-gated binding reached 31.7 percent task success and entity-aware CMTF with provenance reached 26.0 percent. The paper argues that this should be read as a safety-completion tradeoff: direct completion sometimes means guessing.

The failures were not evenly distributed. Temporal calendar tasks and true-ambiguity tasks were worst. Direct and entity-retrieval methods produced wrong-entity actions in 100.0 percent of temporal tasks and 100.0 percent of true-ambiguity tasks. In true ambiguity, the entity-aware methods detected ambiguity in 100.0 percent of runs and achieved 100.0 percent safe success by clarifying rather than executing.

Governance Reading

For governance, the lesson is plain: a tool-use audit cannot stop at tool choice. A valid call to send a message, delete a document, update a customer record, or move a meeting needs a target receipt. Which entity was selected? Which candidates were rejected? Which evidence made the binding unique? Was there enough information to act, or should the system have asked?

This connects to the site's earlier pages on tool scope, agent identity, and agent logs. Least privilege can limit available tools, but it does not prove that the remaining tool is bound to the right object. Identity can prove who the agent is acting as, but it does not prove which Alex, launch plan, customer account, or issue ID the agent meant.

Clarification should therefore count as a valid safety outcome, especially for high-impact actions. If multiple old launch plans exist, asking which one to delete is not failure. It is a correct refusal to turn ambiguity into irreversible action.
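An evaluation harness can encode that stance directly. The following is a sketch under stated assumptions: the outcome labels, the `classify_outcome` function, and the convention that `intended=None` marks a truly ambiguous task are all invented for illustration and are not taken from the paper.

```python
from typing import Optional


def classify_outcome(decision: str,
                     acted_on: Optional[str],
                     intended: Optional[str]) -> str:
    """Label one run. `intended` is None when the task is truly ambiguous,
    meaning no single target can be justified from the instruction."""
    if decision == "clarify":
        # Asking is the right answer when no unique target exists.
        return "safe_success" if intended is None else "deferred"
    if intended is not None and acted_on == intended:
        return "task_success"
    # Acting on an unjustified or mismatched target is the failure of interest.
    return "wrong_entity_action"


print(classify_outcome("clarify", None, None))
print(classify_outcome("act", "doc:launch-v2", None))
```

Under this labeling, a guess on an ambiguous task is scored as a wrong-entity action even if it happens to look plausible, and only the clarifying run counts as a success.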

What to Log

An entity-aware action record should include the user instruction, the available tool schema, the candidate entity set, relevant entity attributes, the selected entity identifier, rejected alternatives, confidence or margin evidence, clarification state, final action, and post-action receipt. For multi-step agents, each later action should preserve the earlier binding trail rather than inheriting a vague natural-language reference.
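The field list above maps naturally onto a small record type. This is a minimal sketch, not a schema from the paper: the `EntityActionRecord` class, its field names, the `parent_record_id` link for multi-step trails, and the example values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EntityActionRecord:
    user_instruction: str
    tool_schema: dict
    candidate_entities: list[dict]        # each entry: id plus relevant attributes
    selected_entity_id: Optional[str]     # None when nothing was bound
    rejected_entity_ids: list[str]
    binding_evidence: dict                # confidence or margin evidence
    clarification_state: str              # e.g. "not_needed", "requested", "resolved"
    final_action: Optional[dict]          # None when the agent deferred
    post_action_receipt: Optional[dict]   # None until the tool confirms the action
    parent_record_id: Optional[str] = None  # earlier binding this step relies on


# Hypothetical run: two launch plans match, so the agent asks instead of deleting.
record = EntityActionRecord(
    user_instruction="Delete the old launch plan",
    tool_schema={"name": "delete_document", "parameters": {"document_id": "string"}},
    candidate_entities=[
        {"id": "doc:launch-v1", "modified": "2025-11-02"},
        {"id": "doc:launch-v2", "modified": "2026-01-15"},
    ],
    selected_entity_id=None,
    rejected_entity_ids=[],
    binding_evidence={"top_score": 0.71, "margin": 0.04},
    clarification_state="requested",
    final_action=None,
    post_action_receipt=None,
)

serialized = json.dumps(asdict(record), indent=2)
print(serialized)
```

Keeping `parent_record_id` on each later step is one way to make a multi-step agent point back at an exact earlier binding rather than re-resolving "the launch plan" from scratch.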

Limits

The paper is deliberately diagnostic. Its 60-task suite is not a deployment prevalence estimate, and the environments are controlled rather than messy production systems with stale records, permissions conflicts, partial metadata, or live organizational change. The experiments also focus on single-step tool execution, while many deployed agents plan across several steps.

The authors caution that model-level rankings should not be treated as stable, because prompting, output format, decoding choices, and provider updates can change behavior. The durable point is narrower and stronger: tool correctness alone is incomplete evidence.

Audit Receipt

The audit-grade sentence is: Babu and Indukuri's arXiv:2606.30531 defines entity binding failures as correct-tool wrong-target errors, evaluates 60 diagnostic tasks across five model backends and six tool-use methods, reports 0.0 percent wrong-tool errors across methods, finds 24.0 to 26.0 percent wrong-entity actions for action-oriented baselines, and reports 0.0 percent wrong-entity action rate for confidence-gated binding and entity-aware CMTF with provenance in the diagnostic setting.

The practical receipt is: do not call an agent action safe because the tool call was valid. Require the entity binding, ambiguity decision, provenance, clarification path, and action receipt to travel with the tool call.
