Last reviewed June 25, 2026

The Code Line Becomes the Authorship Receipt

A June 2026 arXiv paper turns mixed human and AI-written Python files into a line-level authorship benchmark. The practical lesson is that software provenance can no longer stop at the file, pull request, or badge.

The File Is No Longer the Unit

AI-assisted programming turns authorship into a patchwork. A developer may write the interface, accept a generated helper, edit three returned lines, keep a model's error-handling branch, and delete the rest. By the time the file lands in review, "human-authored" and "AI-authored" are no longer stable file-level categories.

That matters for labor, security, and accountability. Teams want to know which code needs extra review, what policies were followed, whether generated code entered regulated systems, and where a defect originated. A badge saying that a pull request used an AI assistant is too coarse. The governance unit is becoming the line or chunk, annotated with its test status, generator, and review trail.

The Paper Frame

The paper is HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection, arXiv:2606.12620 [cs.SE], submitted June 10, 2026. The authors are Luke Patterson, Li Wang, and Adam Faulkner of Capital One. arXiv lists the subjects as Software Engineering and Artificial Intelligence, and the paper records acceptance to LREC 2026.

The authors argue that many earlier AI-code datasets do not match the new engineering reality. Some use academic or puzzle-style code; others assume that an entire snippet is either human-written or AI-generated. HybridCodeAuthorship instead asks what happens when a real source file contains interleaved human and generated lines.

How the Benchmark Is Built

The starting point is CodeSearchNet, a GitHub and Microsoft Research corpus whose Python subset contains more than 450,000 files from over 13,000 repositories. The benchmark uses 4,814 Python files that were still available on GitHub, then runs a two-part pipeline: first code testing, then code interleaving.

The code-testing phase tries to locate tests, build an environment, install dependencies, and run unit tests across Python 2.7, 3.6, and 3.12. Each original file receives a validity tier such as unit-test passed, AST parsable, or unparsable. The interleaving phase masks portions of code, asks an LLM to regenerate the missing pieces, and then labels the resulting file line by line through a Python diff.
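The line-labeling step can be pictured with Python's standard difflib module. This is a minimal sketch, assuming that a line surviving unchanged from the original counts as human and any inserted or replaced line counts as AI; the function name label_lines and the label strings are hypothetical, and the paper's own diff procedure may differ in detail.

```python
import difflib

def label_lines(original: str, hybrid: str) -> list[tuple[str, str]]:
    """Pair each line of `hybrid` with 'human' or 'ai'.

    Assumption: lines that match the original in sequence are human;
    everything else was produced by the generator.
    """
    orig_lines = original.splitlines()
    hyb_lines = hybrid.splitlines()
    matcher = difflib.SequenceMatcher(a=orig_lines, b=hyb_lines, autojunk=False)

    labels = ["ai"] * len(hyb_lines)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":  # block carried over from the original file
            for j in range(j1, j2):
                labels[j] = "human"
    return list(zip(hyb_lines, labels))

original = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))\n"
hybrid = "def add(a, b):\n    total = a + b\n    return total\n\nprint(add(1, 2))\n"

for line, label in label_lines(original, hybrid):
    print(f"{label:5} | {line}")
```

One design consequence is worth noticing: if a model regenerates a line identically to the original, a diff-based labeler of this kind calls it human, so the labels describe textual change rather than who typed the line.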

The generation setup used Llama3.3-70B, Llama-4-Scout, and GPT-OSS-120b. For generated variants, the target replacement percentage was sampled from 10 percent through 100 percent in 10-point steps. A random 10 percent of files were intentionally left unchanged and labeled as human, giving detectors a control case rather than only hybrid examples.
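The sampling scheme can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: it assumes the control set is an exact 10 percent random sample and that targets are drawn uniformly, and the names plan_variants, REPLACEMENT_TARGETS, and CONTROL_FRACTION are hypothetical.

```python
import random

# Target replacement percentages: 10, 20, ..., 100.
REPLACEMENT_TARGETS = list(range(10, 101, 10))
CONTROL_FRACTION = 0.10  # share of files left unchanged and labeled human

def plan_variants(file_ids, seed=0):
    """Map each file id to a target replacement percentage (0 = untouched control)."""
    rng = random.Random(seed)
    ids = list(file_ids)
    n_control = round(len(ids) * CONTROL_FRACTION)
    control = set(rng.sample(ids, n_control))
    return {
        fid: 0 if fid in control else rng.choice(REPLACEMENT_TARGETS)
        for fid in ids
    }

plan = plan_variants([f"file_{i}.py" for i in range(100)])
print(sum(1 for pct in plan.values() if pct == 0), "control files")
```

Fixing the seed keeps the plan reproducible, which matters if the same split must be regenerated for each of the three models.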

What the Dataset Contains

The authors report 4,196 source files that completed the pipeline for at least one LLM, yielding 10,488 records. The full dataset contains 2,827,938 lines of code, including 488,896 AI-generated lines. The paper also distinguishes trivial from nontrivial lines, with 1,943,728 lines counted as nontrivial.
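A quick arithmetic check on those counts shows the shape of the dataset: about 17 percent of lines are AI-generated, about 69 percent are nontrivial, and each source file yields roughly 2.5 records. The variable names below are only for illustration; the numbers are the ones reported above.

```python
# Counts as reported for the full dataset.
total_lines = 2_827_938
ai_lines = 488_896
nontrivial_lines = 1_943_728
records = 10_488
source_files = 4_196

human_lines = total_lines - ai_lines
ai_share = ai_lines / total_lines
nontrivial_share = nontrivial_lines / total_lines
records_per_file = records / source_files

print(f"human lines:       {human_lines:,}")
print(f"AI share of lines: {ai_share:.1%}")
print(f"nontrivial share:  {nontrivial_share:.1%}")
print(f"records per file:  {records_per_file:.2f}")
```

The class imbalance is the point to carry forward: with fewer than one line in five labeled AI, a detector that labels everything human looks accurate while finding nothing.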

Testing status travels with the data. On the human-authored side, 4,103, or 39 percent, passed unit tests. On the AI-interleaved side, 3,000, or 29 percent, passed. Both percentages appear to be computed against the 10,488 records rather than the 4,196 source files. That does not make unit tests a complete correctness proof, but it prevents the benchmark from treating every generated splice as equally plausible code.
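The validity tiers that travel with each record can be approximated with the standard ast module. This is a rough sketch, assuming three tiers ordered from strongest to weakest; the tier strings and the validity_tier function are hypothetical names, and the tests_passed flag stands in for the real test harness, which is out of scope here.

```python
import ast

def validity_tier(source: str, tests_passed: bool = False) -> str:
    """Return the strongest validity tier a file qualifies for.

    Assumption: `tests_passed` comes from an external test run.
    """
    if tests_passed:
        return "unit_test_passed"
    try:
        ast.parse(source)  # uses the grammar of the running interpreter
    except (SyntaxError, ValueError):
        return "unparsable"
    return "ast_parsable"

print(validity_tier("x = 1"))
print(validity_tier("def broken(:"))
print(validity_tier("x = 1", tests_passed=True))
```

Because ast.parse follows the interpreter running it, Python 2 source such as a bare print statement is reported as unparsable under Python 3, which is one reason a pipeline spanning several Python versions needs one environment per version.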

Detector Results

The paper adapts two AI-generated code detectors, DroidDetect and AIGCode Detector, to line-level and chunk-level authorship detection. AIGCode Detector performs better across the reported variants, but the best figures are modest: the paper reports top F1 scores of 0.56 at line level and 0.48 at chunk level. Chunk-level detection is harder, and trivial segments behave differently from nontrivial code.
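To make an F1 figure concrete, here is a minimal line-level scorer. It assumes binary F1 with AI as the positive class over a flat list of line labels; how the paper averages across files or classes is not restated here, and the function name f1_for_ai is hypothetical.

```python
def f1_for_ai(true_labels, predicted_labels, positive="ai"):
    """Binary F1 for the `positive` class over parallel label lists."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    tp = fp = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1  # flagged a human line as AI
        elif truth == positive:
            fn += 1  # missed an AI line

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

truth = ["human", "ai", "ai", "human", "human"]
guess = ["human", "ai", "human", "ai", "human"]
print(f1_for_ai(truth, guess))  # one hit, one false alarm, one miss -> 0.5
```

At a score in this range, a detector is wrong about a large share of the lines it flags or misses, which is why the output is better read as a triage signal than as attribution.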

The result should cool the fantasy of easy provenance. A detector can help triage, but it is not a blame machine. If line-level authorship is needed for compliance, safety review, labor accounting, or incident response, detection must be paired with generation logs, prompts, tests, repository context, and human review records.

Governance Reading

The strongest Spiralist reading is procedural: code provenance becomes credible only when authorship labels are receipts, not vibes. A repository manager should be able to answer which model produced which lines, which file version they entered, which tests passed, which reviewer accepted them, and which later edits changed their status.
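What such a receipt might look like as data is easy to sketch. Everything below is hypothetical: the LineReceipt class, its field names, and the sample values are illustrative assumptions, not a schema from the paper or from any existing tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LineReceipt:
    """One provenance record for a contiguous span of lines (hypothetical schema)."""
    path: str
    line_start: int
    line_end: int
    author_kind: str                      # "human" or "ai"
    commit: str                           # file version the lines entered
    model: Optional[str] = None           # generator, if author_kind == "ai"
    prompt_ref: Optional[str] = None      # pointer to a stored prompt or log
    tests_passed: Optional[bool] = None   # None means not tested
    reviewer: Optional[str] = None        # who accepted the lines
    superseded_by: Optional[str] = None   # later commit that changed them

def needs_review(receipts):
    """AI-authored spans that are still live and have no recorded reviewer."""
    return [
        r for r in receipts
        if r.author_kind == "ai" and r.reviewer is None and r.superseded_by is None
    ]

receipts = [
    LineReceipt("app/io.py", 1, 12, "human", commit="a1b2c3"),
    LineReceipt("app/io.py", 13, 20, "ai", commit="a1b2c3",
                model="example-model", tests_passed=True),
    LineReceipt("app/io.py", 21, 30, "ai", commit="a1b2c3",
                model="example-model", tests_passed=True, reviewer="reviewer-1"),
]
for r in needs_review(receipts):
    print(f"{r.path}:{r.line_start}-{r.line_end} awaits review")
```

Recording spans rather than single lines keeps the log small while still answering the line-level question, and the superseded_by field lets a later human edit change a span's status without erasing its history.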

This also protects workers. A vague "AI-written" label can inflate suspicion around a whole contribution. A vague "human-written" label can hide generated material in regulated systems. Line-level receipts make it harder to turn AI assistance into either a productivity myth or a disciplinary shortcut.

Limits

The paper is a benchmark contribution, not evidence that authorship detection is solved. The present release focuses on Python. CodeSearchNet was selected partly to avoid AI-written source, but that also means the underlying code is at least six years old and may omit newer libraries and development styles. The authors also note pipeline limits: old repositories can be hard to test, long files can exceed context windows, and LLMs do not always follow code-marking instructions.

There is also a simulation gap. HybridCodeAuthorship creates interleavings through a controlled masking-and-generation pipeline. Real-world IDE assistance includes autocompletion, chat-driven rewrites, copy-paste from external tools, review suggestions, and human edits after generation. The benchmark is valuable because it names the unit of audit, not because it exhausts the practice.

Audit Receipt

The audit-grade sentence is: Patterson, Wang, and Faulkner introduce HybridCodeAuthorship, a Python benchmark of 10,488 records from 4,196 CodeSearchNet-derived files, with line-level human and AI labels produced through controlled code interleaving and evaluated against adapted code-authorship detectors.

The practical receipt is: if an organization relies on AI coding tools, its governance should attach authorship, model, prompt, test, and review evidence at a finer grain than the pull request.

Sources

Luke Patterson, Li Wang, and Adam Faulkner, HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection, arXiv:2606.12620 [cs.SE], submitted June 10, 2026.
Primary versions checked: experimental HTML, PDF, arXiv DOI, and LREC proceedings DOI.
Benchmark repository checked from the paper: CapitalOne-Research/c1-hybrid-code-authorship.
