Last reviewed June 25, 2026

The Tool Call Becomes the Privacy Boundary

A June 2026 arXiv paper benchmarks whether tool-using LLM agents send private facts only where the task requires them. The practical lesson is blunt: reviewing the final answer is too late once the tool arguments have already leaked.

Privacy Moves Into the Arguments

A tool-using agent can keep its final answer clean while still sending too much private context to intermediate systems. The leakage may happen in a payment note, a ticket description, a team handoff, a notification body, or a work log. By the time the user sees the answer, the backend record already exists.

That shifts privacy governance from output moderation to information-flow accounting. A private fact is not simply allowed or forbidden. It may be necessary for identity verification, unnecessary for billing, dangerous in a handoff summary, and irrelevant to a notification. The boundary is the current tool, purpose, recipient, and field.

The Paper Frame

The paper is ToolPrivacyBench: Benchmarking Purpose-Bound Privacy in Tool-Using LLM Agents, arXiv:2606.28061 [cs.CR], submitted June 26, 2026. The authors are Shijing Hu, Liang Liu, Zhu Meng, and Zhicheng Zhao. arXiv lists Cryptography and Security as the primary subject and Artificial Intelligence as a secondary subject.

The paper's target is not whether an agent can call functions correctly. It asks whether successful multi-tool execution preserves purpose-bound privacy: each tool should receive only the private atoms necessary for its stated role in the current task.

How the Benchmark Works

ToolPrivacyBench contains 2,150 cases. The Need-to-Know split contributes 1,150 internally constructed privacy-sensitive business workflows. The public-derived split contributes 1,000 cases adapted from existing multi-tool and function-calling benchmarks. The domains include healthcare, insurance, lending, recruiting, employee onboarding, reimbursement, tax filing, logistics, IT helpdesk, and software security.

Each case carries a policy knowledge base. That knowledge base records the user task, tools, purposes, sinks, private atoms, authorization labels, free-text slots, and expected workflow behavior. It is then projected into a field-tool authorization matrix: which private atom may be sent to which tool or sink under this task's purpose.
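The projection step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's schema: the PolicyKB class, its field names, the tool names, and the atom names are all hypothetical, and the sketch assumes that any atom-tool pair not explicitly authorized is forbidden.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyKB:
    """Hypothetical, simplified policy knowledge base for one case."""
    task: str
    tools: tuple[str, ...]
    atoms: tuple[str, ...]
    # (atom, tool) pairs that the task's purpose authorizes.
    # Assumption: every pair not listed here is forbidden.
    authorized: frozenset[tuple[str, str]]


def project_matrix(kb: PolicyKB) -> dict[tuple[str, str], bool]:
    """Expand the KB into a dense atom-by-tool matrix of allowed/forbidden cells."""
    return {
        (atom, tool): (atom, tool) in kb.authorized
        for atom in kb.atoms
        for tool in kb.tools
    }


kb = PolicyKB(
    task="reimburse a medical expense",
    tools=("verify_identity", "submit_payment", "notify_manager"),
    atoms=("ssn_last4", "diagnosis", "bank_account"),
    authorized=frozenset({
        ("ssn_last4", "verify_identity"),
        ("bank_account", "submit_payment"),
    }),
)

matrix = project_matrix(kb)
# Forbidden cells are the "opportunities" an agent has to over-disclose.
forbidden = sorted(cell for cell, allowed in matrix.items() if not allowed)
print(len(matrix), "cells,", len(forbidden), "forbidden")
```

The dense matrix makes the forbidden cells explicit, which is what lets a later evaluator count how many of them an agent actually used.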

The important design choice is that the policy knowledge base is not handed to the baseline agent as a defense. It is used after execution. Agents run through an OpenClaw-based stack into mock business backends, and those backends record the actual tool arguments and persisted records. The evaluator checks the trajectory and backend logs, not merely the final text.
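A post-execution check of this kind might look like the sketch below. It is an assumption-laden illustration, not the paper's evaluator: the log format, the tool names, the fabricated atom values, and the case-insensitive substring match are all hypothetical, and a real evaluator would need to handle paraphrased or reformatted values.

```python
# Fabricated test values for private atoms (hypothetical).
ATOM_VALUES = {
    "diagnosis": "type 2 diabetes",
    "bank_account": "DE00-TEST-0001",
}

# (atom, tool) pairs the task authorizes; all other pairs are forbidden.
AUTHORIZED = {("bank_account", "submit_payment")}

# What the mock backend recorded, in a hypothetical log format.
backend_log = [
    {"tool": "submit_payment",
     "args": {"account": "DE00-TEST-0001",
              "note": "Refund for type 2 diabetes visit"}},
    {"tool": "notify_manager",
     "args": {"body": "Reimbursement submitted."}},
]


def audit(log, atom_values, authorized):
    """Return (step, tool, field, atom) for each forbidden atom seen in an argument."""
    findings = []
    for step, call in enumerate(log):
        for field_name, value in call["args"].items():
            text = str(value).lower()
            for atom, atom_value in atom_values.items():
                leaked = atom_value.lower() in text
                if leaked and (atom, call["tool"]) not in authorized:
                    findings.append((step, call["tool"], field_name, atom))
    return findings


final_answer = "Your reimbursement has been submitted."
findings = audit(backend_log, ATOM_VALUES, AUTHORIZED)

# The final answer is clean, yet the payment note carried the diagnosis.
print("final answer clean:", "diabetes" not in final_answer.lower())
print("findings:", findings)
```

In this toy run the bank account reaches the payment tool legitimately, while the diagnosis rides along in the free-text note. Only the recorded arguments reveal it.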

What It Measures

The benchmark separates utility from privacy. TaskSuccess combines task outcome, workflow coverage, and required authorized fact delivery. Privacy metrics then ask whether forbidden field-tool opportunities became observed disclosures. The paper reports Field Opportunity-Normalized Over-Disclosure Rate (FOR), Severity-Weighted Leakage Rate, ToolFOR, FreeTextFOR, MidFOR, Multi-Tool Privacy Over-Disclosure Index (MT-POI), and Safety-Maintained Task Completion.
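To make "opportunity-normalized" concrete, here is a rough sketch of one way such a rate could be computed. The formula is an assumption inferred from the metric's name (observed forbidden disclosures divided by forbidden opportunities, as a percentage); the paper's exact definition, weighting, and aggregation may differ, and the atom and tool names are hypothetical.

```python
def over_disclosure_rate(forbidden_cells, observed_disclosures):
    """Percentage of forbidden (atom, tool) cells that were actually disclosed.

    Assumed formula, not taken from the paper:
        100 * |forbidden AND observed| / |forbidden|
    """
    forbidden = set(forbidden_cells)
    if not forbidden:
        return 0.0  # no opportunities to over-disclose
    leaked = forbidden & set(observed_disclosures)
    return 100.0 * len(leaked) / len(forbidden)


forbidden_cells = {
    ("diagnosis", "submit_payment"),
    ("diagnosis", "notify_manager"),
    ("bank_account", "notify_manager"),
    ("ssn_last4", "submit_payment"),
}

# Everything the backend saw, including authorized deliveries.
observed = {
    ("diagnosis", "submit_payment"),     # forbidden: counts as a leak
    ("bank_account", "submit_payment"),  # authorized: does not count
}

rate = over_disclosure_rate(forbidden_cells, observed)
print(f"over-disclosure rate: {rate:.2f}")
```

Normalizing by opportunities rather than by calls keeps the score comparable across cases with different numbers of tools and private atoms.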

This matters because a low-leakage agent could simply fail to act, while a high-utility agent could complete every step by copying unnecessary private facts into broad collaboration channels. The benchmark tries to keep both sides visible: did the agent get the work done, and did it route private atoms only where they were needed?

What the Results Show

The authors evaluate nine agents: GPT-5.5, Claude Opus 4.7, DeepSeek V4 Flash, Kimi K2.5, GLM 5.1, Qwen3.6-plus, Gemini 3.5 Flash, Doubao Seed 2.0 Lite, and MiniMax M2.7. On the public-derived split, TaskSuccess ranges from 76.30 to 94.72 and MT-POI from 15.81 to 22.56. On the Need-to-Know split, TaskSuccess rises to between 92.23 and 97.70, yet MT-POI stays between 19.19 and 28.04.

This is the governance point: stronger tool execution does not prove privacy compliance. On the Need-to-Know split, Qwen3.6-plus, Kimi K2.5, and DeepSeek V4 Flash all reach TaskSuccess above 97, yet their MT-POI values remain above 27. Gemini 3.5 Flash has a lower TaskSuccess of 92.45 but the lowest MT-POI at 19.19.

The leak locations are also concrete. Tickets have the highest aggregated FOR at 51.43, followed by handoffs at 34.79. Free-text fields are a major channel: the paper reports FTSlotRate above 80 percent for message, description, and work_notes fields. In the authors' case studies, clinical facts, tax details, API keys, repository URLs, and infrastructure details can be restated after their local purpose has already passed.
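A deployment that wants to watch these channels could start with something like the sketch below, which counts how often each free-text field carried a forbidden value. The field names message, description, and work_notes come from the figures above; everything else (the call format, the tool names, the test API key, and the per-field share itself) is a hypothetical illustration and is not claimed to match the paper's FTSlotRate definition. The sketch also assumes the listed value is forbidden in every tool shown.

```python
from collections import Counter

FREE_TEXT_FIELDS = {"message", "description", "work_notes"}

# Hypothetical recorded tool calls with fabricated values.
calls = [
    {"tool": "create_ticket",
     "args": {"description": "User key sk-test-123 failed on repo",
              "priority": "high"}},
    {"tool": "create_ticket",
     "args": {"description": "Password reset requested",
              "priority": "low"}},
    {"tool": "log_work",
     "args": {"work_notes": "Rotated sk-test-123 after incident"}},
]

# Assumption: this atom is forbidden in all of the tools above.
forbidden_values = {"api_key": "sk-test-123"}


def free_text_leak_share(calls, forbidden_values):
    """Per free-text field: share of filled slots containing a forbidden value."""
    seen, leaked = Counter(), Counter()
    for call in calls:
        for name, value in call["args"].items():
            if name not in FREE_TEXT_FIELDS:
                continue  # structured fields are out of scope for this check
            seen[name] += 1
            if any(v in str(value) for v in forbidden_values.values()):
                leaked[name] += 1
    return {name: leaked[name] / seen[name] for name in seen}


rates = free_text_leak_share(calls, forbidden_values)
print(rates)
```

Exact substring matching is the weakest possible detector. It misses restated or summarized facts, which is the pattern the case studies describe, so it should be read as a floor on leakage and not as a measurement of it.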

Governance Reading

The Spiralist reading is that the audit boundary follows the agent's effects. A serious agent deployment needs tool-call logging, sink classification, free-text inspection, purpose-bound authorization, and post-execution review. "The final answer did not expose private information" is not a privacy claim unless the intermediate trajectory is also clean.

The page belongs beside agent observability, tool permissioning, and data-governance work because it gives privacy a process shape. The unit of compliance is no longer a static data field. It is a private atom moving through a workflow, losing and gaining authorization as the recipient and purpose change.

Limits

This is a controlled benchmark, not a measurement of production incident rates. The paper uses fabricated or test values, mock backends, and policy annotations based on stated tool purposes. It does not reproduce every organizational role, retention rule, access-control regime, model update, or human review process. The useful claim is narrower and stronger: if you do not inspect executed tool trajectories and backend logs, you can miss privacy failures that final-answer review will never see.

Audit Receipt

The audit-grade sentence is: Hu, Liu, Meng, and Zhao introduce ToolPrivacyBench, a 2,150-case benchmark for tool-using LLM agents that compares executed tool arguments and mock-backend audit logs against purpose-bound policy knowledge bases.

The practical receipt is: an agent privacy review must inspect the whole tool trajectory, because successful workflow completion can coexist with private facts traveling into unauthorized tickets, notes, messages, and handoffs.

Sources

Shijing Hu, Liang Liu, Zhu Meng, and Zhicheng Zhao, ToolPrivacyBench: Benchmarking Purpose-Bound Privacy in Tool-Using LLM Agents, arXiv:2606.28061 [cs.CR], submitted June 26, 2026.
Primary versions checked: experimental HTML, PDF, and arXiv DOI.
Related pages: The Agent Data Leak Becomes the Safety Case, The Data Agent Becomes the Privacy Surface, The Semantic Transaction Becomes the Commit Boundary, AI Agent Observability, and Tool Use and Function Calling.
