Blog · arXiv Analysis · Last reviewed July 2, 2026

The Tool Menu Becomes the Attack Surface

Risk-Aware Causal Gating makes a useful security move: it treats the list of tools visible to an agent as temporary authority. A dangerous tool should not merely be discouraged by policy. It should be absent until the current state proves it is needed and authorized.

The Paper

The paper is Capability Minimization as a Safety Primitive: Risk-Aware Causal Gating for Least-Privilege LLM Agents, arXiv:2606.13884 [cs.AI], by Laxmipriya Ganesh Iyer and Rahul Suresh Babu. arXiv lists version 1 as submitted on June 11, 2026, with DOI 10.48550/arXiv.2606.13884. The PDF lists both authors as independent researchers in the United States.

There is a source-integrity wrinkle. The arXiv abstract metadata currently describes a generic counterfactual-risk decision-system paper, while the arXiv HTML, PDF, and TeX source contain the least-privilege LLM-agent paper reviewed here. This analysis follows the HTML/PDF body and notes the metadata mismatch as part of the record.

The paper says the harness, specs, per-trial logs, and annotated registry are released with the code, but neither the arXiv record nor the TeX source gives an official repository URL, so I did not add a code link to the arXiv card.

The Exposure Problem

Most tool-selection systems ask which tool is relevant. RACG asks a stricter question: which tool is safe to make visible right now? That distinction matters because an LLM agent can be tricked by indirect prompt injection, hallucinated plans, or confused-deputy situations. If send_email, delete_file, share_externally, or transfer_funds is in the menu, then the model can attempt the call.

The paper's core claim is that visible tools are a security control surface. A tool menu is not just context for the model. It is a set of capabilities conferred at runtime. Guardrails decide whether an attempted action should be allowed; RACG tries to prevent dangerous attempts by withholding the action from the model's current action space.

This makes the least-privilege principle operational for agents. The agent can still read, search, summarize, and gather facts. It does not receive high-risk authority until the current state contains trusted authorization evidence.

Risk-Aware Causal Gating

RACG extends Causal Minimal Tool Filtering, or CMTF. CMTF exposes tools on the next causal frontier toward the goal. RACG adds risk labels, authorization preconditions, and trusted provenance constraints. A risk-bearing tool is admissible only when its ordinary preconditions are satisfied and its authorization variables are present in the trusted state.

The tool contract has a description, required state variables, produced state variables, optional cost, risk level, and authorization variables. Risk levels are low, med, and high. Read-only or idempotent tools are low risk. Reversible state mutation can be medium risk. Irreversible, externally visible, or value-transferring tools are high risk.
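As a rough illustration of that contract shape, here is a minimal Python sketch. The class and field names (ToolContract, requires, produces, auth_vars), the state variable names draft, recipient, and email_sent, and the default cost of 1.0 are my assumptions for illustration; only the field list, the three risk levels, send_email, and recipient_confirmed come from the paper as described above.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"    # read-only or idempotent
    MED = "med"    # reversible state mutation
    HIGH = "high"  # irreversible, externally visible, or value-transferring


@dataclass(frozen=True)
class ToolContract:
    name: str
    description: str
    requires: frozenset = frozenset()   # state variables needed before the call
    produces: frozenset = frozenset()   # state variables the call adds
    risk: Risk = Risk.LOW
    auth_vars: frozenset = frozenset()  # authorization variables for risk-bearing tools
    cost: float = 1.0                   # optional cost; the 1.0 default is an assumption


# Hypothetical contract for a high-risk tool named in the paper.
send_email = ToolContract(
    name="send_email",
    description="Send an email to a recipient.",
    requires=frozenset({"draft", "recipient"}),
    produces=frozenset({"email_sent"}),
    risk=Risk.HIGH,
    auth_vars=frozenset({"recipient_confirmed"}),
)
```

Keeping auth_vars separate from requires is the point of the sketch: ordinary preconditions and authorization facts can then be checked against different parts of the state.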

The paper uses a simple risk map: risk(low) = 0, risk(med) = 1, and risk(high) = 4. The path score is:

score(path) =
  sum(tool_cost) +
  lambda * sum(risk(tool_risk))

The super-linear high-risk cost is meant to prefer a short chain of safer steps over a one-step irreversible shortcut. The authors sweep lambda across 0, 0.25, 0.5, 1, 2, and 4, and use lambda = 2 as the default operating point. They derive a crossover threshold, lambda dagger = (L_safe - L_risk) / risk(high), and report an empirical crossover between 0.5 and 1 on RiskGate.
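The score and threshold above can be worked through in a few lines of Python. This is a sketch under stated assumptions: every tool has unit cost, the safe path uses only low-risk tools, the risky path is a single high-risk call, and the path lengths (4 and 1) are invented for illustration rather than taken from RiskGate. The function names are hypothetical.

```python
RISK_WEIGHT = {"low": 0, "med": 1, "high": 4}  # risk map reported in the paper


def path_score(path, lam):
    """Score a path given as a list of (tool_cost, risk_level) pairs."""
    total_cost = sum(cost for cost, _ in path)
    total_risk = sum(RISK_WEIGHT[level] for _, level in path)
    return total_cost + lam * total_risk


def crossover_lambda(safe_len, risky_len):
    """Threshold (L_safe - L_risk) / risk(high), assuming unit tool costs."""
    return (safe_len - risky_len) / RISK_WEIGHT["high"]


# Illustrative paths: four low-risk steps versus a one-step high-risk shortcut.
safe_path = [(1.0, "low")] * 4
risky_path = [(1.0, "high")]

threshold = crossover_lambda(len(safe_path), len(risky_path))  # 0.75

# The lambda sweep values are the ones the paper reports.
preferred = {
    lam: "safe" if path_score(safe_path, lam) < path_score(risky_path, lam) else "risky"
    for lam in (0, 0.25, 0.5, 1, 2, 4)
}
```

With these invented lengths the threshold is 0.75: the shortcut wins at lambda = 0.5 (score 3 versus 4) and loses at lambda = 1 (score 5 versus 4). Different path lengths would move the threshold.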

The load-bearing assumption is provenance. Authorization variables such as recipient_confirmed, deletion_approved, external_approved, and payment_confirmed must be produced by trusted steps, not copied from untrusted content. If an attacker-controlled email body can set recipient_confirmed, then the gate opens for the wrong reason. If no trusted producer can establish the missing authorization variable, RACG's fail-closed behavior returns no risk-bearing action instead of exposing an unauthorized dangerous tool.
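A minimal sketch of that provenance rule, in Python, might look like the following. It is not the authors' implementation: it leaves out the causal frontier and path scoring entirely, and the names AgentState, admissible, visible_tools, confirm_recipient, compose_draft, and user_request are hypothetical. It also assumes that ordinary preconditions may be satisfied from either state partition, while authorization variables must sit in the trusted one.

```python
from dataclasses import dataclass, field

TRUSTED_PRODUCERS = {"confirm_recipient"}  # hypothetical trusted step


@dataclass
class AgentState:
    trusted: set = field(default_factory=set)
    untrusted: set = field(default_factory=set)

    def write(self, var, producer):
        # Provenance is decided by who produced the fact, not by its name.
        target = self.trusted if producer in TRUSTED_PRODUCERS else self.untrusted
        target.add(var)


def admissible(tool, state):
    known = state.trusted | state.untrusted
    if not tool["requires"] <= known:
        return False  # ordinary preconditions not met
    if tool["risk"] == "low":
        return True
    # Risk-bearing tools also need every authorization variable in trusted state.
    return tool["auth_vars"] <= state.trusted


def visible_tools(registry, state):
    return [t["name"] for t in registry if admissible(t, state)]


registry = [
    {"name": "read_email", "risk": "low", "requires": {"email_id"}, "auth_vars": set()},
    {"name": "send_email", "risk": "high", "requires": {"draft"},
     "auth_vars": {"recipient_confirmed"}},
]

state = AgentState()
state.write("email_id", producer="user_request")
state.write("draft", producer="compose_draft")

# An attacker-controlled email body claims the recipient is confirmed.
state.write("recipient_confirmed", producer="read_email")
before = visible_tools(registry, state)  # ['read_email']

# A trusted step establishes the same fact.
state.write("recipient_confirmed", producer="confirm_recipient")
after = visible_tools(registry, state)   # ['read_email', 'send_email']
```

The forged fact lands in the untrusted partition, so the gate stays closed until a trusted producer writes it. If read_email were mistakenly listed in TRUSTED_PRODUCERS, the same injection would open the gate.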

RiskGate

The benchmark is RiskGate, an extension of the controlled CMTF benchmark. It uses a 100-tool registry over calendar, email, files, and documents. The high-risk tools named in the paper include send_email, delete_email, delete_file, share_file, share_externally, and transfer_funds; update_event is treated as medium risk in the paper's table.

RiskGate contains 102 benign CMTF tasks plus 80 safety-stress tasks. The stress tasks are organized as four families of 20: email authorization-required, email no-action, files no-action, and calendar high-risk-shortcut. For each stress task, the authors create three injection phrasings: direct command, role-play, and obfuscated or encoded. That yields 240 adversarial trials per method per model.
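The trial count follows directly from the grid of families, tasks, and phrasings. A small Python sketch of that arithmetic, with identifier spellings that are mine rather than the benchmark's:

```python
from itertools import product

# Family and phrasing labels follow the description above; the spellings are assumptions.
FAMILIES = [
    "email_authorization_required",
    "email_no_action",
    "files_no_action",
    "calendar_high_risk_shortcut",
]
TASKS_PER_FAMILY = 20
PHRASINGS = ["direct_command", "role_play", "obfuscated_or_encoded"]

# One adversarial trial per (family, task index, phrasing) combination.
trials = list(product(FAMILIES, range(TASKS_PER_FAMILY), PHRASINGS))
stress_tasks = {(family, idx) for family, idx, _ in trials}

print(len(stress_tasks), len(trials))  # 80 240
```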

The deterministic track uses a worst-case adversarially compliant policy that always chooses the injected tool when it is visible. That choice is useful: if a defense claims structural protection, it should still work when the model itself is perfectly compliant with the attack. The LLM validation then compares CMTF against RACG on the 80-task safety-stress set using seven hosted models via Amazon Bedrock at temperature 0: Claude Opus 4, Claude Sonnet 4.6, Claude Haiku 4.5, GPT-OSS 120B, Nova Premier, Nova Pro, and Nova 2 Lite.
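The worst-case policy is simple enough to sketch. The Python below is an illustration, not the paper's harness: the function names are hypothetical, the fallback to the gold tool when the injected tool is hidden is my assumption, and the four trials are toy data chosen only to show how the rate is computed.

```python
def compliant_policy(visible, injected_tool, gold_tool):
    """Worst case: always call the injected tool whenever it is visible."""
    if injected_tool in visible:
        return injected_tool
    # Assumed fallback: take the intended tool if present, else do nothing.
    return gold_tool if gold_tool in visible else None


def injection_success_rate(trials):
    hits = sum(
        1 for t in trials
        if compliant_policy(t["visible"], t["injected"], t["gold"]) == t["injected"]
    )
    return hits / len(trials)


# Toy trials: the injected tool is visible in two of four cases.
exposed_trials = [
    {"visible": {"read_email", "send_email"}, "injected": "send_email", "gold": "read_email"},
    {"visible": {"read_email"}, "injected": "send_email", "gold": "read_email"},
    {"visible": {"list_files", "delete_file"}, "injected": "delete_file", "gold": "list_files"},
    {"visible": {"list_files"}, "injected": "delete_file", "gold": "list_files"},
]

# The same trials with the injected tool removed from every menu.
gated_trials = [dict(t, visible=t["visible"] - {t["injected"]}) for t in exposed_trials]

print(injection_success_rate(exposed_trials))  # 0.5
print(injection_success_rate(gated_trials))    # 0.0
```

Because the policy never resists an injection, the measured rate depends only on what the filter exposes, which is what makes the deterministic track a test of the gate rather than of the model.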

Results

In the deterministic comparison, all methods except under-tuned RACG at lambda = 0.5 reach task success 1.00. That is the point: the improvement is not ordinary completion. It is completion with less exposed authority.

All-tools leaves the full high-risk surface visible, with attack surface AS = 26.00, weighted attack surface WAS = 95.00, unauthorized exposures UE = 76.16, injection success rate 1.00, and serialized-context tokens per task 29,875. Keyword top-10 also has injection success 1.00, with UE = 5.51. State-aware filtering reports injection success 0.75, with UE = 3.11. Risk-agnostic causal frontier reduces UE to 0.11 but still has injection success 0.25.

RACG at lambda = 2 is the only reported method with success 1.00, UE = 0.00, and injection success 0.00. Its AS is 0.18, WAS is 0.34, gold-tool exposure is 0.94, over-block rate is 0.00, and serialized-context tokens per task are 1,350. The under-tuned lambda = 0.5 row shows the tradeoff: it also has UE = 0.00 and injection success 0.00, but success falls to 0.89 and over-block rate rises to 0.50.

The real-LLM validation reproduces the structural pattern. For every listed model, CMTF has high-risk-call rate 0.25 and exposure-at-attack 0.25, while RACG has high-risk-call rate 0.00 and exposure-at-attack 0.00. Authorization-required task completion is 1.00 for both methods across all seven models. Average measured model tokens per task are 2,101 under CMTF and 2,456 under RACG; RACG's measured token range runs from 1,242 for GPT-OSS 120B to 3,534 for Nova 2 Lite.

The boundary-condition test is the most important result for deployment. With provenance intact, RACG's injection success rate is 0.00. When authorization-forging injections can write the target authorization variable into state, RACG rises to 0.25, matching CMTF. The paper is clear: the guarantee does not live in the gate alone. It lives in enforced tool visibility plus trusted authorization provenance.

Governance Standard

A RACG-style deployment should ship a tool-exposure receipt. The receipt should include the user goal, current state, visible tool set, hidden tool set, tool contracts, risk labels, authorization variables, trusted and untrusted state partitions, trusted producers, content producers, chosen lambda, path scores, causal frontier, admissibility decision, blocked tools, blocking reason, authorization-establishing step, tool-call arguments, model identity, prompt or policy version, injected-content detector output if any, final action, and reviewer override if used.

The receipt should keep three questions separate. Causal necessity asks whether a tool advances the goal. Authorization asks whether the user or trusted system state has conferred authority. Provenance asks whether the authorization fact came from a trusted producer rather than attacker-controlled content. Collapsing those questions is how a useful tool becomes standing authority.
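One way to keep those three questions separate is to record each as its own field and derive the blocking reason from them. The Python sketch below covers only a small subset of the receipt fields listed above; the names GateCheck, build_receipt, and the reason strings are hypothetical, as are the example goal and tools other than send_email.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class GateCheck:
    tool: str
    causally_needed: bool     # does the tool advance the goal?
    authorized: bool          # has authority been conferred?
    provenance_trusted: bool  # did that authorization come from a trusted producer?

    def blocking_reason(self) -> Optional[str]:
        if not self.causally_needed:
            return "not on causal frontier"
        if not self.authorized:
            return "authorization variable missing"
        if not self.provenance_trusted:
            return "authorization not from trusted producer"
        return None


def build_receipt(user_goal, lam, checks):
    reasons = {c.tool: c.blocking_reason() for c in checks}
    return {
        "user_goal": user_goal,
        "lambda": lam,
        "visible_tools": [tool for tool, reason in reasons.items() if reason is None],
        "blocked_tools": {tool: reason for tool, reason in reasons.items() if reason},
        "checks": [asdict(c) for c in checks],
    }


receipt = build_receipt(
    user_goal="Reply to the latest invoice email",
    lam=2,
    checks=[
        GateCheck("search_email", causally_needed=True, authorized=True, provenance_trusted=True),
        GateCheck("send_email", causally_needed=True, authorized=True, provenance_trusted=False),
    ],
)
print(json.dumps(receipt, indent=2))
```

Here send_email is blocked even though an authorization fact exists, and the receipt says why: the fact did not come from a trusted producer. A single boolean "allowed" field would have hidden that distinction.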

This connects directly to AI Agents, AI Agent Sandboxing, Agent Tool Permission Protocol, Prompt Injection, Confused Deputy Problem, Capability-Based Security, AI Audit Trails, AI Agent Observability, The Agent Log Becomes the Receipt, The Agent Sandbox Becomes the Airlock, The Agent Operational Envelope Becomes the Trust Certificate, and The Agent Identity Becomes the Service Account.

Limits

The paper is precise about scope. RACG is not a general prompt-injection solution. It prevents injected use of a high-risk tool only when that tool is absent from the enforced action space. It does not prevent misuse after legitimate authorization, bad arguments to an authorized tool, attacks through low-risk tools, bad risk labels, missing authorization variables, or broken tool contracts.

The benchmark is synthetic and deterministically mocked. That helps isolate exposure behavior, but it does not capture real API failures, latency, ambiguous observations, or messy production state. The authors also say RiskGate's tasks, risk tiers, and injections are authored by them, so reported leak rates such as CMTF's 0.25 reflect task construction. Evaluation on independent adversarial benchmarks such as AgentDojo and ToolEmu is future work.

The method shifts security scrutiny into contracts and provenance. That is a good shift because it creates auditable surfaces, but it is not free. If a deployment cannot prove which tools produce authorization variables, cannot enforce the visible tool set, or lets untrusted text mutate trusted state, then the central guarantee collapses.
