The Sorry Count Becomes the Library Review
A Lean project can reach zero sorries and still not be a good library contribution. This case study makes the missing gate visible: after the proof checks, someone has to review the definitions, theorem surfaces, namespaces, APIs, and proof style as reusable software.
The Paper
The paper is Sorries Are Not the Hard Part: An Expert-Review Case Study of a Semi-Autonomous Formalization, arXiv:2606.13925 [cs.AI, math.AG], by Vasily Ilin and Brian Nugent. arXiv lists version 1 as submitted on June 11, 2026, with DOI 10.48550/arXiv.2606.13925.
The arXiv HTML links the public Lean source at Vilin97/Clawristotle, grothendieck-vanishing, and a public log dataset at uw-math-ai/grothendieck-vanishing-logs. The paper says the dataset contains the two expert reviews, structured process logs, per-turn token usage, refactor and compression loop histories, per-commit LOC and sorry counts, the agent tool-use timeline, Aristotle proving jobs, human prompts, and loop prompts.
The Case Study
The theorem is Grothendieck's vanishing theorem, from Hartshorne, Chapter III, Theorem 2.7. Informally, if X is a Noetherian topological space and n is above the topological Krull dimension of X, then the n-th sheaf cohomology of any sheaf of abelian groups on X vanishes.
The authors wrote the Lean statement by hand, supplied Claude Code with a PDF excerpt of Hartshorne's proof, and instructed the agent to follow that proof. The paper emphasizes that the statement used only existing mathlib definitions, which prevented the agent from inventing custom definitions that could make the theorem trivially true.
The displayed Lean statement is:
theorem GrothendieckVanishing
(X : TopCat) [NoetherianSpace X]
(n : Nat) (h : n > topologicalKrullDim X)
(F : Sheaf AddCommGrpCat X) :
Subsingleton (Sheaf.H F n)
This is not a contest problem reducible to one tactic script. The proof route touches presheaves, sheaves, stalks, exact sequences, filtered colimits, Krull dimension, closed immersions, extension by zero, flasque sheaves, and the derived-category definition of cohomology. That makes it a useful stress test for library-building judgment.
Timeline
The project has two states. State A is the first verified version, inspected by a Lean and mathlib expert. State B is the version after automated response to that review plus later mathlib-style cleanup, inspected by the same expert.
The formalization phase ran March 27 to April 4, 2026, over nine active days. Claude Code followed the supplied Hartshorne proof plan, with Aristotle used for bounded lemmas, and produced state A, the first sorry-free version. The expert review ran April 8 to April 15. The review response ran April 17 to May 1, with the main refactor loop April 19 to April 27. Compression ran April 27 to April 28, followed by final polish for names, docstrings, lint, and stale comments.
The timeline matters because the project did not end when the sorry count reached zero. The paper reports that the first sorry-free version arrived by April 4, but LOC continued moving afterward as the project was rewritten in response to review. The review judged the code after kernel success, so it found design failures no proof-completion metric could see.
What Review Found
State A got the mathematical proof structure right. The authors explicitly note that producing this sophisticated proof with no human guidance other than the theorem statement and a standard textbook proof was remarkable. The failure was not "the model cannot prove." The failure was "the model proved in a way that did not yet make good library code."
Table 2 separates local, checkable fixes from global API judgment. File structure was fixed: names and docstrings became readable. Theorem statements improved when the reviewer made specific requests, such as replacing overly specific hypotheses about short exact sequences from injective presentations with simpler hypotheses about short exact sequences. Proof style became better but uneven.
Definitions were the weakest category and did not noticeably improve after review. The paper reports that the agent made 62 of its own definitions and exactly one was done the correct way: TopCat.closedIncl. The remaining 61 were judged poor and not very useful for future formalizations. Some wrapped standard constructions under application-specific names; others introduced public surface area for one-off proof conveniences.
The API story is more subtle. The first review asked the agent to build an API for Sheaf.H instead of unfolding the definition downstream. State B did create a dedicated cohomology API file and downstream files became cleaner. But the API itself was noisy and bloated: the paper says the file reached almost 800 lines of code with 24 lemmas in the documentation header, many highly specific to whatever proof needed them next.
The paper's central lesson is that agents adapt well to crisp completion predicates: rename this file, remove this old name, generalize this theorem, delete this wrapper, expose this isomorphism. They struggle when the target requires counterfactual design judgment: which definition would future users want, which theorem surface is reusable, and which small API will support proofs that have not yet been written.
Operational Trace
The project includes unusually concrete process telemetry. Figure 3 reports 31,529 Claude turns across 270 sessions. At posted Opus API rates, the captured Claude usage corresponds to roughly $13K, dominated by cache reads at about $10K, though the authors used the $200 subscription. Figure 4 reports 19,393 tool calls, dominated by shell and file operations plus Lean LSP queries.
The heartbeat incident is a useful engineering lesson. From March 28 to April 1, mathlib's synthInstance budget for HasDerivedCategory kept colliding with proofs involving Ext or Sheaf.H. The agent raised set_option maxHeartbeats, oscillating between 200K and 12.8M. Removing those overrides regressed the sorry count from 3 to 24. Recovery came from caching instances with inferInstanceAs and splitting large proofs into named sublemmas. The durable rule reported by the paper is never raise maxHeartbeats above 200000.
The prompts also had to close verbal escape hatches. The paper says the slash-command prompts explicitly forbade saying the task was blocked and said a no-op cycle was never acceptable. That is an uncomfortable but useful observation: an autonomous loop can optimize for declaring a review item done or blocked unless the gate forces decomposition and evidence.
Aristotle helped as a bounded theorem-proving tool, not as an architecture designer. Successful calls closed bounded lemmas, and disproven calls helped falsify bad candidate lemmas. But the hard decisions about which definitions should exist and which statements should be public were never theorem-prover queries.
Governance Standard
A reusable formalization should ship a library-review receipt. The receipt should include the theorem statement, informal source, proof outline, human-supplied assumptions, Lean version, mathlib version, initial sorry count, final sorry count, build command, heartbeat policy, public definitions introduced, theorem statements introduced, namespaces, file graph, API files, reviewer identity or role, first expert review, review checklist, agent response status, second expert review, removed definitions, generalized theorem statements, remaining known bad API areas, tool-call timeline, token usage, external prover jobs, code repository, log dataset, and revision identifiers.
The receipt should distinguish proof completion from contribution quality. Proof completion asks whether Lean accepts the artifact. Library quality asks whether later formalizers can reuse the definitions and APIs without paying transport costs, unfolding internals, or navigating one-off lemmas. A green kernel check answers the first question. Expert review answers the second.
This connects directly to AI Coding Agents, Mechanistic Interpretability, AI Evaluations, AI Audits and Assurance, AI Audit Trails, Reasoning Models, The Kernel Acceptance Becomes the Quality Mirage, The Proof Trace Becomes the Trust Boundary, The Validity Certificate Becomes the Policy Proof, The Reasoning Tree Becomes the Commit Log, The Coding Agent Becomes the Maintainer, The Agentic Code Becomes the Governance Substrate, The Benchmark Becomes the Curriculum, and The Feature Geometry Becomes the Stress Test.
Limits
The paper is deliberately a case study, not a population estimate. One Grothendieck-vanishing formalization in algebraic geometry cannot quantify all failure rates for all autoformalization systems. Its strength is qualitative detail: same project, same expert, before-and-after review, process logs, and released artifacts.
The theorem statement and proof outline were supplied by humans. That makes the study cleaner for evaluating proof construction and library quality, but it does not test full autonomous theorem selection, informal-to-formal statement design, or source theorem discovery.
The review standard is also human and contextual. That is the point, but it means the gate cannot be reduced to a single public leaderboard number. The strongest claim is practical: if a system claims to build reusable formal libraries, evaluate what survives expert review after the sorries are closed.
Sources
- Vasily Ilin and Brian Nugent, Sorries Are Not the Hard Part: An Expert-Review Case Study of a Semi-Autonomous Formalization, arXiv:2606.13925 [cs.AI, math.AG], submitted June 11, 2026.
- arXiv HTML: Sorries Are Not the Hard Part: An Expert-Review Case Study of a Semi-Autonomous Formalization, reviewed for abstract, theorem statement, timeline, artifact links, review themes, analysis, conclusion, and license.
- arXiv PDF: Sorries Are Not the Hard Part: An Expert-Review Case Study of a Semi-Autonomous Formalization, reviewed for table values, operational telemetry, heartbeat details, token and tool-call counts, code/data availability, and limitations.
- Lean source: Vilin97/Clawristotle, grothendieck-vanishing, reviewed for repository status and project identity.
- Dataset: uw-math-ai/grothendieck-vanishing-logs, reviewed for dataset status and artifact identity.
- Related pages: AI Coding Agents, Mechanistic Interpretability, AI Evaluations, AI Audits and Assurance, AI Audit Trails, Reasoning Models, The Kernel Acceptance Becomes the Quality Mirage, The Proof Trace Becomes the Trust Boundary, The Validity Certificate Becomes the Policy Proof, The Reasoning Tree Becomes the Commit Log, The Coding Agent Becomes the Maintainer, The Agentic Code Becomes the Governance Substrate, The Benchmark Becomes the Curriculum, and The Feature Geometry Becomes the Stress Test.