Blog · arXiv Analysis · Last reviewed June 24, 2026

The Approval Gate Becomes the Fatigue Model

The June 2026 arXiv paper "Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human," by Emre Turan, studies a weak point in LLM-agent safety: the human approval gate. Its Spiralist lesson is that human oversight is not a magic property added to a workflow. It is a finite attention budget, and an agent guard can spend that budget badly.

The Pause Button Is Cheap

Turan's paper, arXiv:2606.08919 [cs.AI], was submitted on June 8, 2026. The arXiv record lists the subject areas as Artificial Intelligence, Cryptography and Security, and Machine Learning, with code and data published at the headroom repository.

The paper starts from a practical pattern in agent systems: when an LLM agent is about to run a shell command, edit files, deploy code, read a secret, or take another consequential action, the system can pause and ask a person. That pause is often treated as the hard part of safety. Turan's argument is sharper: the pause primitive is cheap; the hard part is deciding which actions deserve the human's attention.

This is a fresh angle alongside the site's pages on cold-start agent safety, intent-governed tool authorization, agent sandboxes, and human oversight. Those pages ask where an agent's authority should stop. This paper asks what happens when too many stops make the reviewer worse.

Risk Labels Are Not Ground Truth

The study uses a small, curated dataset of 125 coding-agent actions. The paper says the set is deliberately weighted toward hard cases: ambiguous package installs and repository edits, obfuscated or adversarial commands, scary-looking benign cases, and clearer allow-or-block examples. The gold labels split into 42 safe, 52 approval, and 31 blocked, and the author explicitly presents them as one author's judgments rather than objective truth.

That caveat matters because the paper's reviewer-agreement test finds only moderate agreement. The arXiv abstract reports Fleiss' kappa = 0.52 across reviewers. In the HTML version, three reviewer personas disagree most in the contested middle, where a cautious reviewer and a pragmatic reviewer can have very different tolerance for the same action.
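
For readers who want to see what a Fleiss' kappa figure summarizes, here is a minimal sketch of the standard formula in Python. The three-persona toy ratings are hypothetical and are not the paper's data; the function name and input layout are assumptions made for illustration.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items each rated by the same number of raters.

    ratings: list of items, each a list of category labels (one per rater).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who put item i in category j
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Observed pairwise agreement per item, averaged over items
    p_bar = sum(
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the overall category proportions
    total = n_items * n_raters
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(len(categories))
    )
    return (p_bar - p_e) / (1 - p_e)

CATEGORIES = ["safe", "approval", "blocked"]

# Hypothetical toy ratings from three reviewer personas (not the paper's data)
TOY = [
    ["safe", "safe", "safe"],
    ["safe", "approval", "blocked"],
    ["blocked", "blocked", "blocked"],
    ["approval", "approval", "blocked"],
]

print(round(fleiss_kappa(TOY, CATEGORIES), 3))
```

A value of 1 means the raters always agree, and a value near 0 means they agree about as often as chance would predict. The disagreement in the toy set sits in the items where one persona says approval and another says blocked, which mirrors the contested middle described above.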

For governance, this breaks the comfortable fiction that an approval gate is simply sorting actions into obviously safe and obviously unsafe bins. Many agent actions are contextual. Editing package.json, deleting node_modules, reading an example environment file, or rebasing a branch can be ordinary maintenance in one setting and dangerous in another. A serious guard must therefore expose its risk tolerance, not pretend it discovered a universal label.

Oversight Has a Load Curve

Turan frames the guard as selective classification under asymmetric cost. Auto-allowing a dangerous action is much worse than escalating a safe action, but false alarms still matter because they consume human attention. The paper sweeps guard thresholds and reports missed-danger rate, false-alarm rate, coverage, expected cost, and risk-coverage curves instead of treating raw accuracy as enough.
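
The threshold sweep can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the guard is assumed to emit a risk score in [0, 1], actions below the threshold are auto-allowed and the rest are escalated, and the cost weights (10 for a missed danger, 1 for a false alarm) and the six toy actions are hypothetical. The paper's three-way labels are collapsed to a dangerous flag here for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    risk_score: float  # hypothetical guard output in [0, 1]
    dangerous: bool    # hypothetical gold label

def sweep_point(actions, threshold, miss_cost=10.0, alarm_cost=1.0):
    """Metrics for one operating point: auto-allow below threshold, else escalate."""
    escalated = [a for a in actions if a.risk_score >= threshold]
    auto_allowed = [a for a in actions if a.risk_score < threshold]
    n_danger = sum(a.dangerous for a in actions)
    n_safe = len(actions) - n_danger
    missed = sum(a.dangerous for a in auto_allowed)        # danger let through
    false_alarms = sum(not a.dangerous for a in escalated)  # safe but escalated
    return {
        "threshold": threshold,
        # Coverage: share of actions the guard decides without a human
        "coverage": len(auto_allowed) / len(actions),
        "missed_danger_rate": missed / n_danger if n_danger else 0.0,
        "false_alarm_rate": false_alarms / n_safe if n_safe else 0.0,
        # Asymmetric cost: a missed danger weighs far more than a false alarm
        "expected_cost": (miss_cost * missed + alarm_cost * false_alarms) / len(actions),
    }

# Hypothetical toy actions, not drawn from the 125-action set
ACTIONS = [
    Action(0.05, False), Action(0.20, False), Action(0.35, True),
    Action(0.55, False), Action(0.70, True), Action(0.90, True),
]

for t in (0.0, 0.3, 0.5, 1.01):
    print(sweep_point(ACTIONS, t))
```

Plotting missed-danger rate against coverage across thresholds gives a risk-coverage curve. The point of reporting the whole curve is that a single accuracy number hides how much human attention each operating point consumes.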

On the 125-action set, the guard cannot safely auto-decide the hard cases without leaning heavily on the reviewer. Then the paper adds the key move: the reviewer is modeled as endogenous. The reviewer's reliability declines as cumulative escalation load grows, so the guard changes the quality of the human oracle it depends on.

Under that fatigue model, realized risk becomes U-shaped in the escalation rate, so realized safety peaks at an interior operating point. Escalate too little and the automated guard misses danger. Escalate too much and the reviewer is overloaded into rubber-stamping. The paper reports safety-optimal escalation rates below full escalation for reviewer capacities of 10, 25, and 50. It is careful about the evidentiary status: this is a modeling result on real scored data, not a human study measuring actual reviewer fatigue.
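
A toy version of the load-dependent reviewer makes the interior optimum concrete. Everything in this sketch is an assumption for illustration: the exponential fatigue curve, the 0.98 fresh-reviewer catch rate, and the synthetic 100-action stream are hypothetical and are not the paper's model or data. Only the capacities 10, 25, and 50 are taken from the text above.

```python
import math

def reviewer_reliability(load, capacity, fresh=0.98):
    # Hypothetical fatigue curve: the chance of catching a dangerous action
    # decays exponentially with the number of escalations already reviewed.
    return fresh * math.exp(-load / capacity)

def make_stream(n=100):
    # Synthetic stream: every tenth action is dangerous, and guard scores
    # overlap between benign and dangerous actions.
    stream = []
    for i in range(n):
        dangerous = (i % 10 == 9)
        score = 0.35 + 0.06 * (i // 10) if dangerous else 0.06 * (i % 10)
        stream.append((score, dangerous))
    return stream

def expected_missed(stream, threshold, capacity):
    """Return (expected missed dangers, escalation rate) for one threshold."""
    load = 0
    missed = 0.0
    for score, dangerous in stream:
        if score >= threshold:
            if dangerous:
                # Escalated danger is missed only if the tired reviewer approves it
                missed += 1.0 - reviewer_reliability(load, capacity)
            load += 1
        elif dangerous:
            missed += 1.0  # auto-allowed danger is always missed
    return missed, load / len(stream)

STREAM = make_stream()
THRESHOLDS = [k / 20 for k in range(22)]  # 0.00 (escalate all) to 1.05 (escalate none)

for capacity in (10, 25, 50):
    best = min(THRESHOLDS, key=lambda t: expected_missed(STREAM, t, capacity)[0])
    missed, rate = expected_missed(STREAM, best, capacity)
    print(capacity, best, round(rate, 2), round(missed, 2))
```

In this toy setup both extremes lose: escalating nothing misses every dangerous action, and escalating everything exhausts the reviewer before most dangerous actions arrive. The best threshold lands in between, which is the qualitative shape the paper describes, though the specific numbers here carry no evidentiary weight.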

Flooding the Reviewer

The same mechanism becomes an attack surface. In the paper's flooding scenario, an adversary buries one malicious action among benign filler actions. A paranoid policy that escalates routine actions gives the attacker a way to spend the reviewer's attention before the malicious action arrives. A load-aware policy preserves more attention and buys headroom.
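
The flooding scenario can be sketched with the same kind of toy reviewer. This is a hypothetical illustration, not the paper's experiment: the fatigue curve, the risk scores assigned to filler (0.3) and to the malicious action (0.8), and both policies are assumptions chosen to show the mechanism.

```python
import math

CAPACITY = 25  # hypothetical reviewer capacity

def reliability(load, capacity=CAPACITY, fresh=0.98):
    # Hypothetical fatigue curve: catch probability decays with prior escalations
    return fresh * math.exp(-load / capacity)

def paranoid(score, load):
    # Escalates anything mildly suspicious, regardless of reviewer load
    return score >= 0.2

def load_aware(score, load):
    # Raises the escalation bar as the reviewer's queue fills up
    return score >= 0.2 + 0.5 * min(1.0, load / CAPACITY)

def catch_probability(filler_count, escalate):
    """Chance the reviewer stops one malicious action sent after benign filler."""
    load = 0
    for _ in range(filler_count):
        if escalate(0.3, load):  # hypothetical score for scary-looking filler
            load += 1
    if not escalate(0.8, load):  # hypothetical score for the malicious action
        return 0.0               # never reached the human
    return reliability(load)

for filler in (0, 10, 100):
    print(filler,
          round(catch_probability(filler, paranoid), 3),
          round(catch_probability(filler, load_aware), 3))
```

With no filler the two policies behave the same. As filler grows, the paranoid policy lets the attacker drain the reviewer, while the load-aware policy stops escalating low-score filler after a few items and keeps the reviewer close to fresh for the action that matters.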

The numbers are deliberately bounded. The dataset is small, the domain is coding-agent actions, the reviewer-fatigue curve is assumed rather than fit to people, and the reviewer personas are proxies rather than human annotators. The value of the paper is not a universal fatigue coefficient. It is the system picture: approval gates are queues, queues create load, load changes human judgment, and attackers can aim at the queue.

That should change how institutions talk about human-in-the-loop AI. A workflow with more approvals is not automatically safer. It may only be louder. If the approval surface is noisy enough, the human becomes part of the failure mode.

Governance Standard

Agent approval systems should publish and monitor their operating point. That means action categories, risk scoring method, threshold, escalation rate, false-alarm rate, missed-danger rate, reviewer workload, approval latency, override frequency, and incident outcomes. The control is not just whether a human appears in the loop. The control is whether the human has enough remaining attention and authority to change the loop.
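
As one possible shape for such reporting, the sketch below computes a few of these quantities from an approval log. The record fields, the function name, and the toy log are hypothetical. Override frequency is read here as the share of escalations the reviewer denied, which is an assumption about the term, and action categories and incident outcomes are left out for brevity.

```python
from dataclasses import dataclass
from statistics import median

@dataclass(frozen=True)
class Decision:
    escalated: bool          # did the guard ask a human?
    dangerous: bool          # post-hoc label (hypothetical)
    reviewer_approved: bool  # only meaningful when escalated
    latency_s: float         # seconds spent waiting on the human, 0.0 if none

def operating_point(log):
    escalated = [d for d in log if d.escalated]
    safe = [d for d in log if not d.dangerous]
    dangerous = [d for d in log if d.dangerous]
    # A danger is missed if it was auto-allowed or the reviewer approved it
    missed = [d for d in dangerous if not d.escalated or d.reviewer_approved]
    denied = [d for d in escalated if not d.reviewer_approved]
    return {
        "escalation_rate": len(escalated) / len(log),
        "false_alarm_rate": sum(d.escalated for d in safe) / len(safe) if safe else 0.0,
        "missed_danger_rate": len(missed) / len(dangerous) if dangerous else 0.0,
        "reviewer_workload": len(escalated),
        "median_approval_latency_s": median(d.latency_s for d in escalated) if escalated else 0.0,
        "reviewer_denial_rate": len(denied) / len(escalated) if escalated else 0.0,
    }

# Hypothetical toy log
LOG = [
    Decision(False, False, False, 0.0),
    Decision(True, False, True, 12.0),
    Decision(True, True, False, 30.0),
    Decision(True, True, True, 4.0),
    Decision(False, True, False, 0.0),
]

print(operating_point(LOG))
```

Tracking these per reviewer and per time window, rather than as one lifetime aggregate, is what would let an operator see load building before judgment degrades.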

High-consequence actions should also be separated by reversibility, context, and blast radius. A file read, a local edit, a production deploy, a secret access, and a destructive command should not all compete in the same undifferentiated approval stream. The queue itself needs governance.
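
One way to picture that separation is a small router that assigns each action to a review lane before it ever reaches a person. The lanes, the profile fields, and the routing rules below are hypothetical choices for illustration, not a recommendation from the paper.

```python
from dataclasses import dataclass
from enum import Enum

class Lane(Enum):
    AUTO = "auto-allow"
    ROUTINE = "routine-review"
    HIGH = "high-consequence-review"

@dataclass(frozen=True)
class ActionProfile:
    reversible: bool       # can the effect be undone cheaply?
    touches_secrets: bool  # reads or writes credentials
    blast_radius: str      # hypothetical scale: "local", "repo", or "production"

def route(profile):
    # Irreversible, secret-touching, or production actions get their own lane
    # so routine traffic cannot crowd them out.
    if (not profile.reversible or profile.touches_secrets
            or profile.blast_radius == "production"):
        return Lane.HIGH
    if profile.blast_radius == "repo":
        return Lane.ROUTINE
    return Lane.AUTO

examples = {
    "file read": ActionProfile(True, False, "local"),
    "repository edit": ActionProfile(True, False, "repo"),
    "production deploy": ActionProfile(True, False, "production"),
    "secret access": ActionProfile(True, True, "local"),
    "destructive command": ActionProfile(False, False, "local"),
}

for name, profile in examples.items():
    print(name, "->", route(profile).value)
```

Separate lanes also make load measurable per lane, so a flood of routine escalations shows up as pressure on the routine queue instead of silently eroding attention for high-consequence decisions.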

The rule is simple: if an agent safety claim depends on human review, the safety case must measure the reviewer's capacity, not merely count the number of times the interface asked for approval.
