Blog · arXiv Analysis · Last reviewed June 25, 2026

The LLM Label Becomes the Review Tax

Ranim Khojah, Francisco Gomes de Oliveira Neto, Mazen Mohamad, Julian Frattini, and Philipp Leitner's June 2026 arXiv paper on LLM-labelled code review shows that provenance labels are not neutral stickers. They change attention, create verification work, and turn the prompt into a missing review artifact.

The Label Is Not a Warning Label

The paper, arXiv:2606.26505v1, was submitted on June 25, 2026. arXiv lists the exact title as Same Scrutiny, More Time: Eye Tracking Insights into Reviewing LLM-Labelled Code, with categories cs.SE and cs.HC. The arXiv comment says it was accepted at the 41st IEEE/ACM International Conference on Automated Software Engineering (ASE 2026).

The governance question is narrower than whether developers should use code generators. It asks what happens when a reviewer is told that a code segment came from an LLM. Many organizations want provenance labels because hiding machine assistance is risky. But the label is not a magic warning. It changes the reviewer's attention and creates a new kind of work: deciding how to inspect code whose origin carries a special suspicion.

That makes the label part of the review interface. A provenance mark can support accountability, but it can also become a vague anxiety trigger if the organization has not defined what extra review means. The label says "look here." It does not say what evidence would make the code acceptable.

What the Experiment Measured

The authors conducted a Wizard-of-Oz experiment with 32 participants. Participants reviewed four Python files presented as pull requests during one-hour sessions while an eye tracker recorded gaze behavior. The files averaged 77 ± 10 lines. Across the tasks, selected areas of interest appeared either with an LLM-generated label or without one, and label assignments were rotated so the same code could be compared across participants under labelled and unlabelled conditions.
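The rotation idea can be illustrated with a minimal sketch. This is not the authors' assignment procedure: the unit names, the choice of two labelled units per participant, and the function name rotate_label_assignments are all assumptions made for illustration. The point is only that shifting the labelled set by one position per participant puts every unit in both conditions equally often.

```python
def rotate_label_assignments(participants, units, labelled_per_participant):
    """Rotate which units carry the LLM label so each unit is seen
    labelled and unlabelled equally often across participants.

    Hypothetical scheme for illustration; the paper's exact rotation
    is not reproduced here.
    """
    n = len(units)
    schedule = {}
    for i, participant in enumerate(participants):
        # Shift the labelled window by one unit per participant.
        labelled = {units[(i + k) % n] for k in range(labelled_per_participant)}
        schedule[participant] = {unit: unit in labelled for unit in units}
    return schedule


participants = [f"P{i:02d}" for i in range(1, 33)]  # 32 participants, as in the study
units = ["unit_a", "unit_b", "unit_c", "unit_d"]    # hypothetical unit names
schedule = rotate_label_assignments(participants, units, labelled_per_participant=2)

# How many participants saw each unit with the label attached.
labelled_counts = {u: sum(schedule[p][u] for p in participants) for u in units}
print(labelled_counts)
```

Under these assumptions each unit ends up labelled for half of the participants, which is what lets labelled and unlabelled views of the same code be compared.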

The label was designed to feel realistic. It included a comment header indicating that the code was generated by an LLM, along with metadata such as generation time and the prompt used to produce the code. The authors then used Bayesian data analysis on fixation durations and saccade lengths, and combined that with qualitative analysis of gaze paths and exit interviews.
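To make the label concrete, here is a rough sketch of what such a comment header might look like when rendered. The exact wording, field names, and layout used in the study are not reproduced here; the banner text, the function render_llm_label, and the sample prompt are all hypothetical.

```python
from datetime import datetime, timezone


def render_llm_label(prompt, generated_at):
    """Render a Python comment header that marks code as LLM-generated.

    The header layout is an assumption for illustration, not the
    format used in the paper.
    """
    lines = [
        "# --- LLM-GENERATED CODE ---",
        f"# Generated at: {generated_at.isoformat()}",
        "# Prompt:",
    ]
    # Keep multi-line prompts readable by commenting each line separately.
    lines += [f"#   {line}" for line in prompt.splitlines()]
    lines.append("# --- END LLM METADATA ---")
    return "\n".join(lines)


header = render_llm_label(
    "Write a function that parses ISO dates.\nReturn None on failure.",  # hypothetical prompt
    datetime(2026, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(header)
```

Even in this toy form the trade-off is visible: the header tells the reviewer where to look and what was asked, but it also adds lines to the file that someone has to maintain.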

This design matters because it separates perceived provenance from code quality. The study is not simply comparing human-written code to machine-written code. It asks how reviewers behave when the same review task contains an origin signal that changes what the code means institutionally.

Same Scrutiny, More Time

The headline finding is narrow. The label produced no practical change in review thoroughness as measured by saccade length: participants scanned labelled and unlabelled code in broadly comparable ways. But they spent more time fixating on code labelled as LLM-generated. The paper reports a credible effect of the LLM label on fixation duration, while file order and carryover effects were not credible in the same analysis.

The interviews sharpen that result. Twenty participants, or 62.5 percent, reported no change in their evaluation criteria between labelled and unlabelled code. Twelve participants did adjust criteria, often looking differently at logical correctness or code quality depending on their assumptions about LLM mistakes. Fourteen participants used the prompt during review.
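The interview counts can be checked with simple arithmetic against the 32-participant sample. The snippet below only recomputes shares from the counts quoted above; the dictionary keys are shorthand labels chosen here, not terms from the paper.

```python
total_participants = 32

# Counts as reported in the interview findings above.
counts = {
    "no_change_in_criteria": 20,
    "adjusted_criteria": 12,
    "used_prompt_during_review": 14,
}

# Share of the full sample, in percent.
shares = {name: 100 * n / total_participants for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]}/{total_participants} = {share}%")
```

The first two groups add up to the full sample, while prompt use is a separate count that can overlap with either of them.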

The Spiralist lesson is that human oversight is not just a role assignment. It is a cognitive workload. A reviewer can sincerely believe they are applying the same standard while their eyes reveal a different cost profile. The LLM label becomes the review tax: extra attention without necessarily a clearer rule for acceptance.

The Prompt Becomes an Artifact

The prompt was not merely background information. Some participants treated it like a requirement: they compared the generated code against what the prompt asked for. Others treated it more like documentation, using it to understand the intent behind a code segment. In both cases, the prompt became part of the review material.

This is one of the paper's strongest governance contributions. If a team labels code as LLM-generated but withholds the prompt, the reviewer sees risk without seeing the instruction that shaped the output. If the prompt is embedded as a code comment, it can clutter the file and create maintenance problems. The authors argue instead for prompt-to-code traceability as metadata that can be accessed on demand, especially for multi-turn generation where one prompt is not the whole history.
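One way to picture on-demand traceability is a small index kept outside the source file and queried by file path and line number. The sketch below is an assumption about how a team might build this, not a design from the paper; PromptTurn, TraceRecord, TraceIndex, and the sample paths are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class PromptTurn:
    role: str  # e.g. "user" or "assistant"
    text: str


@dataclass
class TraceRecord:
    path: str
    start_line: int
    end_line: int
    # A list of turns, so multi-turn generation keeps its whole history.
    turns: list = field(default_factory=list)


class TraceIndex:
    """In-memory prompt-to-code index, stored apart from the code itself."""

    def __init__(self):
        self._records = []

    def add(self, record):
        self._records.append(record)

    def lookup(self, path, line):
        """Return every trace record covering the given line of the file."""
        return [
            r for r in self._records
            if r.path == path and r.start_line <= line <= r.end_line
        ]


index = TraceIndex()
index.add(TraceRecord(
    path="src/dates.py",  # hypothetical file
    start_line=10,
    end_line=42,
    turns=[
        PromptTurn("user", "Write a function that parses ISO dates."),
        PromptTurn("user", "Also return None on failure."),
    ],
))

for record in index.lookup("src/dates.py", 17):
    for turn in record.turns:
        print(f"{turn.role}: {turn.text}")
```

Keeping the turns in a separate record means the file stays uncluttered, and a reviewer who wants the prompt history can pull it for exactly the lines under review.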

Policy Boundary

A serious AI coding policy should therefore define what LLM-labelled code requires from reviewers. Does the reviewer check prompt alignment, security properties, tests, edge cases, licensing, performance, maintainability, or all of them? Who owns that check when the original author used an assistant? What must be preserved when the generated code is modified? When does a label expire because the human has rewritten enough?

This connects to related pages on the Codex workflow reorganization, the machine contributor maintainer tax, the contributor ladder and agent queue, human oversight of AI systems, and AI audit trails. A code provenance label is only useful when it is tied to authority, evidence, and action.

Limits

The paper is a controlled study, not a census of all software review. The authors note external-validity risks: participants' stance toward AI may vary, eye-tracking equipment may affect behavior, and the lab task cannot fully reproduce industrial review settings. The labels were also intentionally constructed for the experiment, so real company tooling may produce different effects.

Those limits do not weaken the practical point. If provenance labels change attention under controlled conditions, teams should not assume labels are neutral in production. The right response is not to hide AI use. It is to test how labels interact with review tools, policy, team norms, and maintainer workload.

Review Receipt

A review receipt for LLM-assisted code should name the generated segment, model or tool if known, generation context, prompt or conversation trace, author modifications, tests run, reviewer role, required verification criteria, reviewer comments, unresolved concerns, and final merge authority. The prompt should travel as review metadata, not as folklore in a chat window.
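The receipt fields listed above map naturally onto a small record type. The following is a minimal sketch, assuming a team wants to block a merge until the required fields are filled in. Which fields count as optional is a policy choice made here for illustration (only "model or tool if known" is marked optional in the text), and ReviewReceipt, OPTIONAL_FIELDS, and missing_fields are hypothetical names.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ReviewReceipt:
    generated_segment: str = ""        # e.g. file path and line range
    model_or_tool: str = ""            # optional: "if known"
    generation_context: str = ""
    prompt_trace: list = field(default_factory=list)
    author_modifications: str = ""
    tests_run: list = field(default_factory=list)
    reviewer_role: str = ""
    verification_criteria: list = field(default_factory=list)
    reviewer_comments: list = field(default_factory=list)
    unresolved_concerns: list = field(default_factory=list)
    merge_authority: str = ""


# Assumption: these may legitimately be empty on a complete receipt.
OPTIONAL_FIELDS = {"model_or_tool", "reviewer_comments", "unresolved_concerns"}


def missing_fields(receipt):
    """List required receipt fields that are still empty."""
    return [
        f.name for f in fields(receipt)
        if f.name not in OPTIONAL_FIELDS and not getattr(receipt, f.name)
    ]


draft = ReviewReceipt(
    generated_segment="src/dates.py:10-42",  # hypothetical segment
    prompt_trace=["Write a function that parses ISO dates."],
    tests_run=["pytest tests/test_dates.py"],
    reviewer_role="maintainer",
)
print(missing_fields(draft))
```

A check like this turns the label into a method: the receipt is incomplete until someone has stated the verification criteria and named who accepts the remaining risk.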

The label becomes the review tax when it asks for suspicion without supplying a method. The better artifact is a label plus a trace: this code was assisted, this is what was asked, this is what changed, this is what was checked, and this is who accepted the remaining risk.

Sources

Ranim Khojah, Francisco Gomes de Oliveira Neto, Mazen Mohamad, Julian Frattini, and Philipp Leitner, Same Scrutiny, More Time: Eye Tracking Insights into Reviewing LLM-Labelled Code, arXiv:2606.26505 [cs.SE, cs.HC], submitted June 25, 2026.
Primary arXiv versions checked: metadata API record, PDF, and experimental HTML, reviewed for title, authors, submission date, categories, ASE 2026 note, Wizard-of-Oz design, 32-participant study, pull-request review task, eye-tracking measures, Bayesian analysis, prompt-use findings, policy implications, and validity threats.
Related pages: The Codex Workflow Becomes the Reorganization, The Machine Contributor Becomes the Maintainer Tax, The Contributor Ladder Becomes the Agent Queue, Human Oversight of AI Systems, and AI Audit Trails.
