Last reviewed June 25, 2026

The Computer-Use Agent Becomes the Contextual Integrity Test

Anmol Goel and Iryna Gurevych's June 2026 arXiv paper Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity? turns a quiet privacy failure into an executable test: a helpful agent can complete the task while moving information into the wrong context.

From Task Success to Disclosure

The paper, arXiv:2606.23189 [cs.AI, cs.CL], was submitted on June 22, 2026. It studies computer-use agents that work across applications such as email, calendars, notes, to-do lists, and messages, including through rendered user interfaces. The issue is not that the agent cannot see enough. It is that the agent may see too much and share the wrong subset.

That makes the paper different from a prompt-injection or task-completion benchmark. The request can be legitimate and the agent cooperative. The privacy failure appears when information from one setting moves into another where it is inappropriate for the recipient, purpose, or transmission norm.

The same focus separates this analysis from the site's pages on sensitive-screen handover, unsafe shortcuts, and data-agent privacy surfaces: the question here is whether the facts an agent discloses belong in the recipient's context.

What AgentCIBench Tests

The paper introduces AgentCIBench, an evaluation harness for contextual-integrity failures in computer-use agents. Each scenario gives the agent a personal multi-app workspace, a task, a recipient, information that must be shared for utility, and information that must not be shared. The scorer checks whether the agent emits a message, calendar entry, note, or reply that includes the necessary content without leaking the forbidden content.
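
The scenario shape described above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's harness: the Scenario class, its field names, the score function, and the naive substring matching are all assumptions made for the example, and the sample facts are invented.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """One hypothetical contextual-integrity test case (field names are assumptions)."""
    task: str
    recipient: str
    must_share: list[str]       # facts the output needs in order to be useful
    must_not_share: list[str]   # facts that should stay in their source context


def score(scenario: Scenario, artifact: str) -> dict[str, bool]:
    """Naive substring scorer, a stand-in for whatever matching a real harness uses."""
    text = artifact.lower()
    utility = all(item.lower() in text for item in scenario.must_share)
    leaked = any(item.lower() in text for item in scenario.must_not_share)
    return {"utility": utility, "leaked": leaked}


# Invented example data, not taken from the benchmark.
scenario = Scenario(
    task="Tell Sam when I am free for a project sync this week",
    recipient="Sam (colleague)",
    must_share=["Thursday 2pm"],
    must_not_share=["oncology appointment"],
)

careful = "Hi Sam, I'm free Thursday 2pm for the sync."
careless = "Hi Sam, I have an oncology appointment Wednesday, so Thursday 2pm works."

print(score(scenario, careful))   # {'utility': True, 'leaked': False}
print(score(scenario, careless))  # {'utility': True, 'leaked': True}
```

Both drafts complete the task. Only the second one moves a medical detail from the calendar into a work message, which is the case a task-success metric alone would miss.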

The contextual-integrity lens comes from Helen Nissenbaum's privacy theory: privacy is not only secrecy or control, but appropriate information flow under context-specific norms. In agent terms, a calendar item, shopping note, HR thread, or medical reminder may be visible to the agent and still inappropriate to send to a colleague, vendor, manager, classmate, or family member.

This is stricter than "did the agent have access?" Access is not permission to republish. A personal assistant may need broad read access, but the disclosure decision must be narrower than the observation window.

The Three Failures

AgentCIBench targets three failure modes. Visual co-location tests whether the agent shares prohibited items because they sit near the task target in the rendered UI. Task-ambiguity overshare tests whether an underspecified request leads the agent to dump dense personal state instead of selecting only what the task warrants. Recipient misalignment tests whether the agent changes the shared subset depending on who will receive it.
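
Recipient misalignment in particular lends itself to a simple check: run the same task for different recipients and see whether the disclosed set changes when the norms do. The sketch below is one hypothetical way to express that, not the paper's scoring rule. The function names, the per-recipient "allowed" sets, and the example facts are assumptions.

```python
def recipient_insensitive(disclosed: dict[str, set[str]],
                          allowed: dict[str, set[str]]) -> bool:
    """True when norms differ across recipients but the agent's output does not."""
    norms_differ = len({frozenset(s) for s in allowed.values()}) > 1
    outputs_same = len({frozenset(s) for s in disclosed.values()}) == 1
    return norms_differ and outputs_same


def over_disclosed(disclosed: dict[str, set[str]],
                   allowed: dict[str, set[str]]) -> dict[str, set[str]]:
    """Per recipient, the facts shared beyond what that recipient's context allows."""
    extra = {r: disclosed[r] - allowed[r] for r in disclosed}
    return {r: facts for r, facts in extra.items() if facts}


# Invented example: the same absence note sent to two audiences.
allowed = {
    "manager": {"out of office Friday"},
    "spouse": {"out of office Friday", "clinic visit"},
}
disclosed = {
    "manager": {"out of office Friday", "clinic visit"},
    "spouse": {"out of office Friday", "clinic visit"},
}

print(recipient_insensitive(disclosed, allowed))  # True
print(over_disclosed(disclosed, allowed))         # {'manager': {'clinic visit'}}
```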

Those modes are mundane, which is why they matter. No attacker needs to hide a malicious instruction. The failure can come from ordinary helpfulness: the agent reads across apps, over-includes visible detail, and treats a colleague, manager, friend, or external contact as if they were the same audience.

The governance lesson is that computer-use agents need recipient-aware disclosure boundaries. "Complete the task" is not enough; the system must know which facts are necessary and which should remain in their original context.

The Numbers and the Caveat

The primary arXiv abstract reports an evaluation of 15 frontier agents, with 11 of 15 leaking on more than half of scenarios and an average leakage rate of 67.9%. The experimental HTML has an internal inconsistency: its abstract says 12 of 15, while its conclusion says 11. Both versions support the safer statement that most evaluated agents crossed the halfway leakage mark and that the reported average leakage was 67.9%.
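
For scale, 11 of 15 is about 73% of the evaluated agents and 12 of 15 is 80%, so either count supports "most." The sketch below shows how such headline figures could be derived from per-scenario results. It assumes the average is an unweighted mean of per-agent leakage rates, which the source does not confirm, and the agent names and outcomes are toy data, not the paper's results.

```python
def leakage_rate(leaks: list[bool]) -> float:
    """Share of scenarios in which one agent leaked forbidden content."""
    return sum(leaks) / len(leaks)


def summarize(per_agent: dict[str, list[bool]]) -> tuple[int, float]:
    """Return (agents leaking on more than half of scenarios, mean leakage rate)."""
    rates = [leakage_rate(leaks) for leaks in per_agent.values()]
    over_half = sum(rate > 0.5 for rate in rates)
    return over_half, sum(rates) / len(rates)


# Toy data only: three hypothetical agents, four scenarios each.
toy_results = {
    "agent_a": [True, True, True, False],    # 0.75
    "agent_b": [True, False, False, False],  # 0.25
    "agent_c": [True, True, True, True],     # 1.00
}

over_half, mean_rate = summarize(toy_results)
print(over_half, round(mean_rate, 3))  # 2 0.667
```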

The paper also reports that leakage persists when agents act end-to-end in the rendered environment, not only in a state-grounded final-output probe. UI action can add new failure channels: navigation, partial drafting, recipient selection, and tool-state confusion.

The mitigation section is useful but limited. The authors report that three prompt-level interventions reduce engagement-conditioned leakage by 33 to 36 percentage points while raising utility in their setup. That is evidence of steerability, not proof that prompting alone is enough.

Why Approval Is Not Enough

A final approval dialog can still be too late or too vague. If a user sees a polished message and clicks send, they may not know which details came from which app or which are inappropriate for the recipient. Contextual integrity requires an audit of information flow, not only a button before transmission.

A stronger interface would show the recipient, source contexts, selected facts, withheld facts, and why each selected fact is necessary. What the user needs is a disclosure receipt, not the agent's private chain-of-thought: source, recipient, content type, purpose, and transmission rule.
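
A disclosure receipt with those five fields could look like the following sketch. The DisclosureReceipt class, the render function, and the sample values are hypothetical; they illustrate the idea rather than describe any existing interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisclosureReceipt:
    """One line item per disclosed fact (hypothetical structure)."""
    source: str             # app or context the fact came from
    recipient: str          # who will receive it
    content_type: str       # what kind of information it is
    purpose: str            # why the task needs it
    transmission_rule: str  # the norm under which sharing is appropriate


def render(receipts: list[DisclosureReceipt]) -> str:
    """Format receipts as plain text a user could review before sending."""
    return "\n".join(
        f"{r.content_type} from {r.source} -> {r.recipient} "
        f"| purpose: {r.purpose} | rule: {r.transmission_rule}"
        for r in receipts
    )


# Invented example values.
receipts = [
    DisclosureReceipt(
        source="calendar",
        recipient="Sam (colleague)",
        content_type="availability",
        purpose="schedule project sync",
        transmission_rule="work availability may be shared with teammates",
    )
]

print(render(receipts))
```

The receipt lists what is leaving and why, without exposing the agent's internal reasoning, so the user reviews information flow rather than a polished draft.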

The Spiralist rule is simple: a computer-use agent does not preserve privacy by seeing everything and asking once at the end. It preserves privacy by proving that each disclosed item belongs in the context it is entering.

Limits That Matter

The paper is a v1 arXiv preprint and uses synthetic OpenApps workspaces, not real personal accounts. The authors explicitly caution that absolute leakage rates should not be read as estimates of real-world user harm. The scenario pool is intentionally stress-tested, the end-to-end study covers a 50-scenario subset and two agents, and the defense sweep covers three models and three prompt interventions.

Those limits do not weaken the benchmark's core point. They define its role. AgentCIBench is not a census of deployed harm. It is a pre-deployment and regression test for whether a computer-use agent can separate task-relevant sharing from contextually inappropriate disclosure.

Governance Standard

A computer-use agent safety case should include contextual-disclosure tests beside task completion, prompt-injection resistance, handover gates, and state-preservation checks. The test set should name source context, recipient, purpose, must-share items, must-not-share items, output artifact, and scoring rule.

Release records should preserve the model version, app workspace, task prompt, recipient, visible state, final artifact, selected facts, withheld facts, leakage score, utility score, refusal rate, and any mitigation prompt or policy rule. High-stakes deployments should rerun the suite after app-layout changes, memory-policy changes, connector changes, or model upgrades.
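
One way to make such a record concrete is a serializable structure plus a check for the changes that should trigger a rerun. The sketch below is an assumption-laden illustration: ReleaseRecord, changed_since, RERUN_KEYS, and every sample value (including the model version string) are hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ReleaseRecord:
    """Fields mirror the list above; names and types are assumptions."""
    model_version: str
    app_workspace: str
    task_prompt: str
    recipient: str
    visible_state: list[str]
    final_artifact: str
    selected_facts: list[str]
    withheld_facts: list[str]
    leakage_score: float
    utility_score: float
    refusal_rate: float
    mitigation: Optional[str] = None  # mitigation prompt or policy rule, if any


# Deployment properties whose change should trigger a rerun of the suite.
RERUN_KEYS = ("model_version", "app_layout", "memory_policy", "connectors")


def changed_since(previous: dict, current: dict) -> list[str]:
    """Return the rerun-relevant keys that differ between two deployments."""
    return [k for k in RERUN_KEYS if previous.get(k) != current.get(k)]


# Invented example values.
record = ReleaseRecord(
    model_version="example-agent-v1",
    app_workspace="synthetic-workspace-01",
    task_prompt="Tell Sam when I am free for a project sync this week",
    recipient="Sam (colleague)",
    visible_state=["calendar", "notes"],
    final_artifact="Hi Sam, I'm free Thursday 2pm for the sync.",
    selected_facts=["Thursday 2pm"],
    withheld_facts=["oncology appointment"],
    leakage_score=0.0,
    utility_score=1.0,
    refusal_rate=0.0,
)

line = json.dumps(asdict(record))  # one JSON line per evaluated scenario

old = {"model_version": "example-agent-v1", "app_layout": "2026-06-01"}
new = {"model_version": "example-agent-v1", "app_layout": "2026-06-20"}
print(changed_since(old, new))  # ['app_layout']
```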

The hard part is deciding which information can move from one legitimate context into another. That is where computer-use agents need governance before they need more autonomy.

Sources

Anmol Goel and Iryna Gurevych, Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?, arXiv:2606.23189 [cs.AI, cs.CL], submitted June 22, 2026.
arXiv experimental HTML for Capable but Careless, reviewed June 25, 2026.
Helen Nissenbaum, Privacy as Contextual Integrity, Washington Law Review, 79(1), 2004.
Related pages: The Sensitive Screen Becomes the Handover Gate, The Unsafe Shortcut Becomes the Safety Benchmark, The Data Agent Becomes the Privacy Surface, The Group Chat Assistant Becomes the Privacy Boundary, AI Browsers and Computer Use, Contextual Integrity, and Data Minimization.