Blog · arXiv Analysis · Last reviewed July 2, 2026

The Kernel Acceptance Becomes the Quality Mirage

OpenMath shows why "Lean builds" is not the same claim as "the mathematics was faithfully formalized." The kernel can accept a theorem whose statement quietly dropped a clause, added an easier hypothesis, or restricted the parameters until the source theorem is no longer the theorem being proved.

The Paper

The paper is Formalizing Numerical Analysis: An Agent Pipeline and Quality Audit Beyond Kernel Acceptance, arXiv:2606.14000 [cs.AI], by Theodore Meek, Siyuan Ge, Di Qiu Xiang, Simon Chess, and Vasily Ilin. arXiv lists version 1 as submitted on June 12, 2026, with DOI 10.48550/arXiv.2606.14000.

The object of formalization is J. C. Butcher's Numerical Methods for Ordinary Differential Equations. That choice matters because Mathlib covers numerical analysis far more thinly than it covers library-rich areas of mathematics. The paper therefore tests an agent's ability to build Lean 4 theory rather than simply rediscover a nearby existing theorem.

The arXiv record does not list an official code link in its metadata. The UW Math AI Lab GitHub organization lists the public uw-math-ai/OpenMath repository as code for semi-autonomous formalization of undergraduate numerical analysis in Lean 4 and Mathlib.

Why Kernel Acceptance Is Too Weak

Kernel acceptance is a necessary check. It says the Lean term type-checks, the imports resolve, the declarations are well formed, and the project builds. It does not say the Lean statement preserved the source theorem. A formal theorem can be valid, mechanically checked, and still answer the wrong mathematical question.

The paper's core critique is that whole-textbook autoformalization has often treated compilation, zero open sorries, or low axiom counts as the success metric. That hides errors at the source-to-formal boundary: incomplete multi-part definitions, implicit constraints translated incorrectly, hypotheses added to make a theorem easier, or universal statements narrowed to a few special cases.

That is the quality mirage. A build log can look clean because the theorem prover is only asked to prove the statement it was given. If the statement has already drifted away from the book, the kernel has no independent memory of the book.
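A toy illustration of that drift, in Lean 4 without Mathlib. The source claim and the theorem names here are hypothetical and are not taken from the paper or from Butcher; the point is only that all three declarations type-check, while only the first states the claim as written.

```lean
-- Hypothetical source claim: for all natural numbers a and b, a + b = b + a.

-- Faithful statement: same quantifiers, no extra assumptions.
theorem add_comm_faithful (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Drift by added hypothesis: the proof never uses `_h`, but the statement
-- now applies only when a caller can supply it.
theorem add_comm_extra_hyp (a b : Nat) (_h : 0 < a) : a + b = b + a :=
  Nat.add_comm a b

-- Drift by narrowing: only the special case a = 0 is stated.
theorem add_comm_narrowed (b : Nat) : 0 + b = b + 0 :=
  Nat.add_comm 0 b
```

The kernel accepts each declaration on its own terms. Nothing in the build output distinguishes the faithful statement from the two weaker ones, because the source claim exists only in the comment.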

The Pipeline

The OpenMath pipeline is an autonomous loop driven by a Python orchestrator. It repeatedly invokes four LLM roles: Planner, Worker, Evaluator, and Consultant. Planner and Worker use Claude Opus 4.6, while Evaluator uses Claude Sonnet. The work is coordinated through GitHub Actions and targets whole Lean projects rather than isolated theorem snippets.

The agents are stateless across cycles. Persistent state lives in repository files: strategy.md, task_results/cycle_NNN.md, issues/name.md, history.jsonl, cycle, heartbeat.json, and run.lock. The system commits only when Lean builds, and the workflow treats repository state as the shared memory substrate.
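The stateless-cycle pattern can be sketched in a few lines of Python. This is a minimal sketch, not the OpenMath orchestrator: the file names cycle, history.jsonl, and task_results/cycle_NNN.md come from the description above, while run_cycle, its callable parameters, and the choice to record failed cycles in the history file are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def run_cycle(
    repo: Path,
    work: Callable[[Path, int], str],      # hypothetical Worker step, returns a summary
    build_ok: Callable[[Path], bool],      # hypothetical build gate (e.g. wraps a Lean build)
    commit: Callable[[Path, str], None],   # hypothetical commit action
) -> bool:
    """Run one cycle using only on-disk state; commit only if the build passes."""
    cycle_file = repo / "cycle"
    # Nothing is carried in memory between cycles: the counter is re-read each time.
    cycle = int(cycle_file.read_text().strip()) + 1 if cycle_file.exists() else 1

    summary = work(repo, cycle)
    built = build_ok(repo)

    # Persist what happened so the next (stateless) cycle can pick it up.
    results = repo / "task_results"
    results.mkdir(exist_ok=True)
    (results / f"cycle_{cycle:03d}.md").write_text(summary + "\n")
    with (repo / "history.jsonl").open("a") as fh:
        fh.write(json.dumps({"cycle": cycle, "built": built}) + "\n")
    cycle_file.write_text(f"{cycle}\n")

    if built:
        commit(repo, f"cycle {cycle}: build passed")
    return built
```

Passing the build and commit steps in as callables keeps the gate itself easy to test, and it makes the limit of the gate explicit: build_ok can only report whether the project compiles, not whether the statements match the book.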

The resulting artifact is not just a theorem-proving benchmark score. It is a record of multi-step software work: source selection, Lean file edits, proof attempts, stuck states, external theorem-prover calls, build gates, and evaluator judgments. That shape makes it much closer to a coding-agent governance problem than to a single-shot translation task.

The Three-Dimensional Audit

The paper evaluates formalization quality along three dimensions: semantic correctness, Mathlib reuse, and cross-file reuse. Semantic correctness asks whether the Lean statement faithfully captures the informal source. Mathlib reuse asks whether the project uses existing library results instead of re-proving available theory. Cross-file reuse asks whether later files use earlier project results rather than duplicating them.

The semantic audit combines direct and round-trip LLM-as-judge methods. In the direct pass, the judge compares the natural-language statement and surrounding context with the Lean declaration. In the round-trip pass, a blind back-translator converts the Lean statement back into natural language, and a separate judge compares that back-translation to the source. The paper uses DeepSeek V4 Pro for the direct judge and back-translation, and gpt-oss-120b for round-trip judgment.
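The two passes can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's harness: the judge and back-translator are plain callables standing in for model calls, the names audit_pair and agreement_rate are hypothetical, and the verdict labels are reduced to "OK" and "DIV".

```python
from typing import Callable, List, NamedTuple

Judge = Callable[[str, str], str]        # (source text, candidate text) -> "OK" or "DIV"
BackTranslator = Callable[[str], str]    # Lean statement -> natural language


class Verdicts(NamedTuple):
    direct: str
    round_trip: str


def audit_pair(
    source: str,
    lean: str,
    direct_judge: Judge,
    back_translate: BackTranslator,
    round_trip_judge: Judge,
) -> Verdicts:
    """Judge one source/Lean pair by both methods."""
    # Direct pass: the judge sees the source and the Lean declaration together.
    direct = direct_judge(source, lean)
    # Round-trip pass: the back-translator is blind, so it receives only the Lean text.
    recovered = back_translate(lean)
    round_trip = round_trip_judge(source, recovered)
    return Verdicts(direct, round_trip)


def agreement_rate(verdicts: List[Verdicts]) -> float:
    """Fraction of pairs on which the two methods return the same label."""
    if not verdicts:
        raise ValueError("no verdicts to compare")
    return sum(v.direct == v.round_trip for v in verdicts) / len(verdicts)
```

Keeping the back-translator's signature to a single Lean string is what enforces blindness in this sketch: it has no argument through which the source text could leak.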

The audit has different pairing constraints across systems. OpenMath uses curated source-to-Lean status data from the Butcher formalization. RepoProver uses blueprint environments marked with Lean snippets. M2F relies on generated docstrings for 1,834 of 1,864 entities, leaving only 30 true textbook-fidelity samples, all from Rockafellar section 1.1. That makes the M2F semantic audit closer to a self-consistency audit than a full source-fidelity audit.

Results

For Butcher, the project identifies 175 statements. It reports 84 fully proof-complete Lean formalizations, or 48.0 percent. Another 12 statements, or 6.9 percent, have a Lean statement plus at least one sorry. Combined coverage is 54.9 percent. Four of five chapters exceed 74 percent combined coverage, while Chapter 3 on Runge-Kutta methods is the bottleneck at 34.8 percent.
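The coverage percentages follow directly from the counts. A quick check, using only the figures quoted above (the helper name pct is illustrative):

```python
TOTAL_STATEMENTS = 175
PROOF_COMPLETE = 84      # Lean statement with a complete proof
STATEMENT_ONLY = 12      # Lean statement with at least one sorry


def pct(count: int, total: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


print(pct(PROOF_COMPLETE, TOTAL_STATEMENTS))                   # 48.0
print(pct(STATEMENT_ONLY, TOTAL_STATEMENTS))                   # 6.9
print(pct(PROOF_COMPLETE + STATEMENT_ONLY, TOTAL_STATEMENTS))  # 54.9
```

The combined figure counts statements that exist in Lean at all, so the remaining 79 of 175 statements have no Lean counterpart in the project.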

Raw proof-completeness metrics look strong. OpenMath reports 0 sorries and 0 axioms. RepoProver has 5 sorries and 0 axioms, with the sorries described as intentional exercises. M2F has 12 sorries and 4 axioms. Those counts are useful, but the semantic audit shows why they cannot stand alone.

The direct semantic audit finds substantial divergence: 42 percent of OpenMath judgments are different or divergent, compared with 25 percent for M2F and 18 percent for RepoProver. The direct and round-trip methods agree on roughly 72 to 78 percent of cases, depending on the system. A human spot-check by two Lean-expert authors agrees with the judge on 18 of 20 OpenMath samples, including all 10 OK cases and 8 of 10 DIV cases.
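The spot-check numbers are worth breaking out by label, because the disagreements are not spread evenly. A small worked calculation from the counts above (the dictionary layout is just one convenient way to hold them):

```python
# label assigned by the judge -> (samples where the humans agreed, samples checked)
spot_check = {"OK": (10, 10), "DIV": (8, 10)}

agreed = sum(a for a, _ in spot_check.values())
checked = sum(n for _, n in spot_check.values())

overall = agreed / checked
per_label = {label: a / n for label, (a, n) in spot_check.items()}

print(overall)     # 0.9
print(per_label)   # {'OK': 1.0, 'DIV': 0.8}
```

Both disagreements fall on samples the judge marked DIV. On a sample of 20 that is only suggestive, but it points the same way as an overcounting failure mode: the judge is more likely to flag a faithful statement than to pass a divergent one.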

The recurring failure patterns are concrete. Definition 312A in the textbook includes elementary weights, internal weights, and derivative weights across multiple equations, while the Lean version defines only derivativeWeight. Theorem 514A adds an extra h_norm hypothesis. Theorem 520B adds h_inv and hY_stage. Theorems 530B and 530C add explicit-method and Lipschitz hypotheses. Theorem 550A proves only dimensions 1 through 7 for a claim whose textbook version ranges over arbitrary n by n doubly companion matrices; the paper notes the universal claim fails for n at least 8.

The Mathlib reuse audit also tempers the story. RepoProver has 6,960 proofs with 4 percent overlap with Mathlib, M2F has 4,636 proofs with 5 percent overlap, and OpenMath has 4,264 proofs with 2 percent overlap. Low overlap may reflect genuinely new territory, but it can also reveal missed opportunities to reuse established library statements.

Cross-file reuse is similarly partial. The paper validates 77 of 136 OpenMath candidate dependency edges as real, or 57 percent, and 2,603 of 7,090 RepoProver edges, or 37 percent. Among LLM-validated dependencies, the reflected reuse rate is 48 percent for OpenMath and 45 percent for RepoProver. Citation-derived edges survive more often than keyword-derived edges, which is exactly the kind of distinction a governance receipt should preserve.

Operational Cost

The cost profile is large enough to matter. Appendix D reports 59,340 lines of Lean 4 across 57 files, with 309 definitions, 931 theorem or lemma declarations, 15 structures, 2 classes, and approximately 1,257 named declarations. The project used Lean's default 200,000-heartbeat budget without set_option maxHeartbeats overrides.

The runtime budget was about 215 to 270 active hours out of a 500-hour wall-clock span, a 45 to 55 percent duty cycle. Claude Code consumed about 5.73 B tokens, including 5.4 B cached input reads and 33 M output tokens. Across the same window, Claude issued 19,968 tool calls, with Bash, Read, Edit, TodoWrite, Grep, Write, and MCP calls forming the operational surface.

Aristotle was used as a free batch theorem-proving backend. The paper reports 89 jobs, of which 72 returned and 17 were cancelled. Of the 72 returned jobs, 27 were incorporated and 45 went unused. The committed Aristotle contribution closed 47 sorries across 72 lemmas and 1,417 lines of proof code, with a large shifted-Legendre development in section 342.
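The job counts reconcile, and the yield is easy to derive from them. The rates below are computed here from the reported counts rather than quoted from the paper:

```python
jobs = {"incorporated": 27, "returned_unused": 45, "cancelled": 17}

returned = jobs["incorporated"] + jobs["returned_unused"]
submitted = returned + jobs["cancelled"]

print(returned)    # 72
print(submitted)   # 89

# Share of jobs whose output ended up in the committed project.
print(round(100 * jobs["incorporated"] / submitted, 1))  # 30.3 (percent of all jobs)
print(round(100 * jobs["incorporated"] / returned, 1))   # 37.5 (percent of returned jobs)
```

Roughly one job in three contributed committed proof code, which is the figure to weigh against the fact that the backend was free to call.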

Governance Standard

An autoformalization project should ship a formalization-quality receipt. On the source side, the receipt should include the source statement, source context, source location, and informal dependencies. On the formal side, it should include the Lean declaration, proof status, sorry count, and axiom count. For the build, it should record the Lean version, Mathlib version, build command, build log, and imported files. For reuse, it should list reused Mathlib lemmas and cross-file dependencies. For the semantic audit, it should name the direct judge model, round-trip judge model, and back-translator model, and include the prompt rubric, judge verdict, human review sample, divergence class, and Skolemization exception notes. For cost, it should report runtime, token counts, and tool calls. For provenance, it should give the code repository, license, and revision identifier.

That receipt should keep three claims separate. Kernel acceptance says the formal object is accepted by Lean. Semantic correctness says the formal object still means the source claim. Project quality says the formal object fits the surrounding library and reuses available work rather than growing a disconnected parallel theory. Treating those as one metric lets quality debt hide behind a green build.
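One way to keep the three claims from collapsing into one is to make them separate fields in the receipt's data model. The sketch below is hypothetical: the class name, the reduced field set, and the rule that semantic correctness needs an "OK" from both judges are assumptions for illustration, not a schema from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class QualityReceipt:
    source_statement: str
    lean_declaration: str
    builds: bool                                 # did the project build with this declaration
    sorry_count: int
    axiom_count: int
    direct_verdict: Optional[str] = None         # "OK", "DIV", or None if not judged
    round_trip_verdict: Optional[str] = None
    reused_mathlib_lemmas: Tuple[str, ...] = ()
    cross_file_dependencies: Tuple[str, ...] = ()

    def semantic_ok(self) -> Optional[bool]:
        """None means 'not audited', which is different from 'audited and failed'."""
        if self.direct_verdict is None or self.round_trip_verdict is None:
            return None
        return self.direct_verdict == "OK" and self.round_trip_verdict == "OK"

    def claims(self) -> dict:
        """Report the three claims side by side instead of a single pass/fail."""
        return {
            "kernel_acceptance": self.builds,
            "proof_complete": self.builds and self.sorry_count == 0,
            "semantic_correctness": self.semantic_ok(),
            "mathlib_reuse_count": len(self.reused_mathlib_lemmas),
            "cross_file_reuse_count": len(self.cross_file_dependencies),
        }
```

A receipt for a declaration that builds cleanly but was judged divergent then reports kernel_acceptance as True and semantic_correctness as False, so the green build cannot stand in for the fidelity claim.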

This connects directly to AI Coding Agents, AI Evaluations, AI Audits and Assurance, AI Audit Trails, Reasoning Models, AIME Math Benchmarks, The Proof Trace Becomes the Trust Boundary, The Validity Certificate Becomes the Policy Proof, The Evaluation Bench Becomes the Test Rig, The Grading Cascade Becomes the Evaluation Artifact, The Logic Benchmark Becomes the Control Panel, The Reasoning Tree Becomes the Commit Log, The Agentic Code Becomes the Governance Substrate, The Compatibility Rescue Becomes the Source-Only Audit, The Performance Benchmark Score Becomes the Coding Agent, The Coding Agent Becomes the Maintainer, and The Benchmark Becomes the Curriculum.

Limits

The audit is better than build-only evaluation, but it is not a final oracle. The paper names an LLM-judge failure mode: the judge can misread Lean parameter bindings or Skolemized existentials as added hypotheses, overcounting divergence. It can also miss mathematically subtle equivalences that a domain expert would recognize.

The cross-file reuse audit is a lower bound. Lean dependencies can travel through implicit mechanisms such as simplification sets and definitional unfolding, which are not always captured by explicit dependency signals. A missing reflected edge does not automatically prove the project failed to reuse prior work.

The M2F comparison has a source-fidelity caveat because most of its natural-language side is generated from the formal artifact itself. Some entities were also dropped because of parsing failures or API errors, and those drops were not random; complex statements can cluster in the failure set. The result is an important audit method, not a guarantee that every divergence label is correct.
