Blog · arXiv Analysis · Last reviewed July 2, 2026

The Formal Proof Becomes the Translation Gap

MA-ProofBench is a useful benchmark because it moves formal theorem proving into mathematical analysis, where continuity, measure, topology, functional analysis, and operator theory force models to do more than autocomplete familiar algebraic patterns.

The result is a translation warning. Models can often sketch a natural-language proof, but still fail to express it as Lean 4 code that keeps the statement unchanged, uses Mathlib correctly, closes every subgoal, and passes the compiler without sorry.

The Paper

The paper is MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis, arXiv:2606.13782 [cs.AI], by Lushi Pu, Weiming Zhang, Xinheng Xie, Zixuan Fu, Bingxiang He, Hongya Lyu, Xin Li, Jie Zhou, and Yudong Wang. arXiv lists version 1 as submitted on June 11, 2026 and version 2 as revised on June 15, 2026, with DOI 10.48550/arXiv.2606.13782.

The author affiliations are ModelBest Inc. and Tsinghua University. The PDF and TeX source list two official artifacts: the GitHub repository OpenBMB/MA-ProofBench and the Hugging Face dataset openbmb/MA-ProofBench. The GitHub repository is public, MIT licensed, and primarily Lean; the Hugging Face dataset is public, ungated, MIT tagged, and exposes ma_proofbench.jsonl as the test split.

The Benchmark

MA-ProofBench contains 200 Lean 4 theorem-proving problems in mathematical analysis. Level I has 100 undergraduate textbook exercises. Level II has 100 Ph.D. qualifying-exam problems from top-tier universities. The benchmark covers 6 core topics and 27 subcategories under the Mathematics Subject Classification.

The six top-level areas are Real Functions; Measure and Integration; Functions of a Complex Variable; Sequences, Series, and Summability; Functional Analysis; and Operator Theory. The two levels are deliberately distributed differently across these areas. Level I leans heavily on functions of one variable and classical measure theory, while Level II shifts toward linear function spaces, general linear operators, advanced measure theory, geometric function theory, distributions, Banach spaces, and special operator classes.

The category table makes that shift concrete. Real Functions contributes 44 Level I problems but only 12 Level II problems. Functional Analysis rises from 15 to 31. Operator Theory rises from 4 to 23. Measure and Integration moves from 13 to 17, while Functions of a Complex Variable slips from 19 to 16. Both levels contain 100 problems, so the difference is not volume. It is a harder mix of mathematical machinery.
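The shift can be tabulated directly from the counts quoted above. The sketch below is illustrative only: the Sequences, Series, and Summability row is not quoted in this post, so it is inferred by subtraction under the assumption that every problem belongs to exactly one top-level area. The function names are hypothetical.

```python
# Counts quoted in the post: (Level I, Level II). Each level has 100 problems.
QUOTED = {
    "Real Functions": (44, 12),
    "Measure and Integration": (13, 17),
    "Functions of a Complex Variable": (19, 16),
    "Functional Analysis": (15, 31),
    "Operator Theory": (4, 23),
}
LEVEL_SIZE = 100
INFERRED = "Sequences, Series, and Summability (inferred)"


def with_inferred_remainder(quoted, level_size=LEVEL_SIZE):
    """Add the one unquoted area by subtraction (an assumption, not a paper value)."""
    used_l1 = sum(l1 for l1, _ in quoted.values())
    used_l2 = sum(l2 for _, l2 in quoted.values())
    table = dict(quoted)
    table[INFERRED] = (level_size - used_l1, level_size - used_l2)
    return table


def shifts(table):
    """Return (area, Level II minus Level I), largest gain first."""
    deltas = [(name, l2 - l1) for name, (l1, l2) in table.items()]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, delta in shifts(with_inferred_remainder(QUOTED)):
        print(f"{name:50s} {delta:+d}")
```

Under that assumption, Operator Theory and Functional Analysis gain the most problems between levels, and Real Functions loses the most.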

Curation Workflow

The paper describes a four-stage human-LLM workflow. First, the authors collected and cleaned roughly 500 candidate problems, removing ambiguous or incomplete entries and selecting a representative subset of 200. Second, human experts translated natural-language problems into draft Lean 4 statements with the proof left as sorry.

Third, the draft statements went through independent expert review. Three additional experts reverse-translated each Lean statement back into the intended mathematical claim. A theorem was accepted only if at least two reviewers approved it; otherwise it returned to the formalization stage. Fourth, experts assigned Level I or Level II by mathematical complexity and formalization difficulty.
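The acceptance rule in the third stage is simple enough to write down. The following is a minimal sketch of the two-of-three vote as described above; the class, field names, and outcome labels are hypothetical and do not come from the paper's tooling.

```python
from dataclasses import dataclass

REVIEWERS_PER_STATEMENT = 3  # from the post: three additional experts
APPROVALS_REQUIRED = 2       # from the post: at least two must approve


@dataclass(frozen=True)
class Review:
    reviewer: str          # hypothetical reviewer identifier
    back_translation: str  # the claim the reviewer reads off the Lean statement
    approves: bool         # does it match the original problem?


def review_outcome(reviews):
    """Apply the two-of-three rule to one drafted Lean statement."""
    if len(reviews) != REVIEWERS_PER_STATEMENT:
        raise ValueError("expected exactly three independent reviews")
    approvals = sum(1 for r in reviews if r.approves)
    if approvals >= APPROVALS_REQUIRED:
        return "accepted"
    return "return_to_formalization"
```

Keeping each reviewer's back-translation next to the vote is the part that makes the decision auditable later, not just the tally.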

This reverse-translation step is the right governance instinct. A theorem prover can certify a formal statement, but it cannot know whether the formal statement still means the original exercise unless the source-to-formal boundary is audited.

Evaluation Setup

The model set includes formal theorem provers, open-source general-purpose reasoning models, and proprietary reasoning models. The theorem-prover group contains DeepSeek-Prover-V2 at 7B and 671B, Kimina-Prover-72B, and Goedel-Prover-V2 at 8B and 32B. The open-source reasoning group contains DeepSeek-V3.2-Thinking, GLM-5.1, Qwen3.5-397B-A17B, Qwen3-235B-Thinking-2507, Nemotron-3-Nano-30B-A3B, and GPT-OSS-120B High. The proprietary group contains GPT-5.5 xhigh, Gemini 3.1 Pro High, and Claude Sonnet 4.6 High.

All generations use a maximum output length of 32K tokens and temperature 1.0. Open-source models generate n = 32 candidate proofs per problem. Proprietary models generate n = 8 because of API budget constraints. The paper therefore treats Pass@8 as the primary cross-model comparison and reports Pass@32 only for models sampled at 32.
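This post does not state which Pass@k estimator the paper uses. A common choice when more than k samples are drawn is the unbiased estimator 1 - C(n - c, k) / C(n, k), where n is the number of samples and c the number of correct ones. The sketch below assumes that estimator; the per-problem counts in the usage block are made up for illustration and are not paper data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-problem estimate over all problems, as a percentage."""
    scores = [pass_at_k(n, c, k) for c in correct_counts]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical correct-sample counts for four problems, 32 samples each.
    counts = [0, 0, 1, 32]
    print(benchmark_pass_at_k(counts, n=32, k=8))
    print(benchmark_pass_at_k(counts, n=32, k=32))
```

Under this estimator, a problem solved by exactly one of 32 samples contributes 0.25 to Pass@8 but 1.0 to Pass@32, which is why Pass@32 figures are not comparable with Pass@8 figures from models sampled only eight times.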

All evaluations run on Mathlib 4.28.0, with Kimina Lean Server as the compilation backend. A successful proof must compile without errors, contain no sorry placeholders, include required prerequisite components such as definitions, and leave the theorem statement itself strictly unaltered.
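Three of those success criteria can be approximated with plain text checks once the compiler verdict is known. The sketch below is a rough pre-filter, not the paper's checker: the function names are hypothetical, the compile result is assumed to come from an external Lean run, comment stripping ignores nested block comments, and a textual match is weaker than comparing elaborated statements.

```python
import re


def strip_comments(src: str) -> str:
    """Remove Lean block comments (non-nested) and line comments."""
    src = re.sub(r"/-.*?-/", " ", src, flags=re.DOTALL)
    return re.sub(r"--.*", " ", src)


def normalize(text: str) -> str:
    """Collapse all whitespace runs so layout changes do not matter."""
    return " ".join(text.split())


def is_accepted(candidate: str, statement_header: str, compiled_ok: bool) -> bool:
    """Check: compiled, no `sorry` in code, and the statement header is intact.

    `statement_header` is the theorem text up to, but not including, `:=`.
    """
    code = normalize(strip_comments(candidate))
    has_sorry = re.search(r"\bsorry\b", code) is not None
    expected = normalize(strip_comments(statement_header)) + " :="
    return compiled_ok and not has_sorry and expected in code
```

Requiring the header to be followed immediately by `:=` catches a candidate that quietly appends a hypothesis or weakens the conclusion, which a bare substring match would miss.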

Results

The headline is severe. GPT-5.5 xhigh is the strongest Level I model at 16.00 percent Pass@8 and ties Gemini 3.1 Pro High for the best Level II result at 5.00 percent Pass@8. Claude Sonnet 4.6 High reaches 6.00 percent Pass@8 on Level I and 3.00 percent on Level II.

The best open-source general-purpose reasoning model is DeepSeek-V3.2-Thinking, with 5.56 percent Pass@8 and 7.00 percent Pass@32 on Level I, and 1.85 percent Pass@8 and 2.00 percent Pass@32 on Level II. GLM-5.1, Qwen3-235B-Thinking-2507, GPT-OSS-120B High, and several others remain at or near zero on Level II despite larger sample budgets.

The best dedicated theorem prover on Level I is DeepSeek-Prover-V2-671B, with 3.22 percent Pass@1, 6.86 percent Pass@8, and 9.00 percent Pass@32. On Level II, even it reaches only 0.06 percent Pass@1, 0.44 percent Pass@8, and 1.00 percent Pass@32. Kimina-Prover-72B, both Goedel-Prover-V2 sizes, and DeepSeek-Prover-V2-7B report 0.00 across Level II in the table.

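The interpretation is not that theorem provers are useless. They remain competitive on the undergraduate split, where DeepSeek-Prover-V2-671B's 6.86 percent Pass@8 edges out both the best open-source general-purpose model (5.56 percent) and Claude Sonnet 4.6 High (6.00 percent). The result is that this Level I standing does not carry into Ph.D.-level analysis, where long-range planning, auxiliary lemmas, type discipline, and deeper Mathlib knowledge all become load-bearing.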

Failure Modes

The paper classifies failed formal proofs into four main error types. Mathlib hallucinations happen when a model references fictitious or mismatched definitions, theorems, namespaces, or identifiers. Type-system errors happen when the expression type does not match the expected type, with analysis-specific confusion between R and ENNReal as a recurring pattern. Incomplete proofs leave unresolved subgoals or explicit sorry. Lean syntax errors include mode confusion and illegal identifiers.
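A first-pass triage of compiler logs into these four buckets can be sketched with pattern matching. The labels follow the categories above, but the mapping from Lean message fragments to categories is an assumption of this sketch, not the paper's classification procedure, and the function name is hypothetical.

```python
import re

# (label, pattern) pairs. The fragments are typical Lean 4 error wording;
# assigning them to the post's four categories is an assumption.
RULES = [
    ("mathlib_hallucination", re.compile(r"unknown (identifier|constant|namespace)")),
    ("type_error", re.compile(r"type mismatch")),
    ("incomplete_proof", re.compile(r"unsolved goals|declaration uses 'sorry'")),
    ("syntax_error", re.compile(r"unexpected token")),
]


def classify(log: str) -> str:
    """Return the label of the rule whose match occurs earliest in the log."""
    best = None
    for label, pattern in RULES:
        match = pattern.search(log)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), label)
    return best[1] if best else "other"
```

Classifying by the earliest match reflects the usual situation where the first error causes several later ones. A hallucinated lemma name, for example, often leaves unsolved goals behind it, and counting both would blur the categories.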

In sampled failures for DeepSeek-Prover-V2-671B, DeepSeek-V3.2-Thinking, and Gemini 3.1 Pro, Mathlib hallucinations and incomplete proofs dominate. DeepSeek-Prover-V2-671B and DeepSeek-V3.2-Thinking produce 80 and 82 such instances on Level I, and 78 and 62 on Level II. Gemini 3.1 Pro's Level II failures are different: incomplete proofs dominate with 80 instances. DeepSeek-V3.2-Thinking also shows 37 Lean syntax errors on Level II.

The appendix case studies make the failures concrete. One model fabricates an InnerProductSpace.HarmonicOn identifier while trying to prove that log |x| is harmonic away from zero. Another equates a total-variation measure value of type ENNReal with a real absolute value. Claude Sonnet 4.6 gives the right shape of a piecewise construction but then inserts sorry at the difficult continuity boundary.
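The ENNReal confusion is easy to reproduce in miniature. The Lean sketch below is not taken from the paper's appendix; it assumes a recent Mathlib and only illustrates that a measure value and a real absolute value live in different types, so moving between them takes an explicit conversion.

```lean
import Mathlib

open MeasureTheory

-- `μ s` has type `ENNReal` (extended nonnegative reals), not `ℝ`.
-- Reading it as a real number requires an explicit `toReal`.
example (μ : Measure ℝ) (s : Set ℝ) : 0 ≤ (μ s).toReal :=
  ENNReal.toReal_nonneg

-- In the other direction, a real absolute value enters `ENNReal`
-- through `ENNReal.ofReal`, and the round trip is a lemma, not a definition.
example (x : ℝ) : (ENNReal.ofReal |x|).toReal = |x| :=
  ENNReal.toReal_ofReal (abs_nonneg x)
```

A proof that writes an equation between the two without one of these conversions does not type-check, however sound the informal argument behind it.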

Informal-Formal Gap

The most important finding is the gap between natural-language proof competence and formal proof completion. The authors evaluate three models on the natural-language version of MA-ProofBench, with GPT-5.5 grading each Pass@1 informal proof on a 0, 0.5, or 1 scale. GLM-5.1 receives full credit on 90 Level I proofs and 75 Level II proofs. DeepSeek-V3.2-Thinking receives 85 and 66. Qwen3-235B-Thinking-2507 receives 66 and 42.

Those numbers do not transfer to Lean success. GLM-5.1 is excellent on the informal proof task but has near-zero formal performance in the paper's Table 2. Qwen3-235B-Thinking-2507 shows the same pattern. DeepSeek-V3.2-Thinking is weaker than GLM-5.1 informally, yet is the strongest open-source general-purpose model in the formal setting.

That inversion is the central lesson. Formal theorem proving is not merely mathematical reasoning plus a compiler. It is mathematical reasoning translated through the syntax, APIs, typeclasses, namespaces, theorem inventory, proof automation limits, and library conventions of Lean and Mathlib.

Governance Standard

A formal-proof benchmark should ship a benchmark receipt. The receipt should include the source problem, source type, difficulty level, MSC category, natural-language statement, Lean statement, Lean version, Mathlib version, allowed imports, required definitions, curation history, reverse-translation review votes, accepted reviewer notes, model identity, prompt, number of samples, generation temperature, maximum token budget, compiler backend, proof outputs, compile logs, Pass@k calculation, failed-proof classification, and artifact revision.
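One way to make such a receipt concrete is a flat record that serializes to JSON. The sketch below covers a subset of the fields listed above; the class name, field names, and sample values are hypothetical, and nothing here reflects the schema of the released ma_proofbench.jsonl file.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BenchmarkReceipt:
    # Problem provenance
    problem_id: str
    source_type: str
    difficulty_level: str
    msc_category: str
    natural_language_statement: str
    lean_statement: str
    # Toolchain
    lean_version: str
    mathlib_version: str
    compiler_backend: str
    # Curation: one boolean per reverse-translation reviewer
    review_votes: list = field(default_factory=list)
    # Generation settings
    model: str = ""
    prompt: str = ""
    num_samples: int = 0
    temperature: float = 0.0
    max_output_tokens: int = 0
    # Outcomes
    compile_logs: list = field(default_factory=list)
    num_verified: int = 0
    artifact_revision: str = ""

    def to_json(self) -> str:
        """Serialize with stable key order so receipts diff cleanly."""
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


if __name__ == "__main__":
    receipt = BenchmarkReceipt(
        problem_id="example-001",  # hypothetical
        source_type="textbook exercise",
        difficulty_level="Level I",
        msc_category="Real Functions",
        natural_language_statement="(statement text)",
        lean_statement="(Lean statement text)",
        lean_version="(fill in)",
        mathlib_version="4.28.0",
        compiler_backend="Kimina Lean Server",
        review_votes=[True, True, False],
        num_samples=8,
        temperature=1.0,
    )
    print(receipt.to_json())
```

Sorting the keys is a small design choice that pays off when receipts are stored in version control, because two runs differ only in the fields that actually changed.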

For source-to-Lean evaluation, the receipt should keep four claims separate. Informal correctness says the mathematical argument is plausible in natural language. Formal syntactic validity says the Lean code parses. Formal verification says the exact Lean statement compiles without errors or sorry. Source fidelity says the Lean statement still means the original problem. Collapsing these claims turns a proof benchmark into a scoreboard with missing evidence.
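Keeping the four claims separate is mostly a matter of refusing to store them as one boolean. The following is a minimal sketch with hypothetical names; the rule that a result counts as solved only when it is both formally verified and source-faithful is this sketch's reading of the paragraph above, not a definition from the paper.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ProofClaims:
    """Four separate claims. None means 'not assessed', which is not False."""
    informal_correct: Optional[bool] = None   # argument is plausible in prose
    parses: Optional[bool] = None             # Lean code is syntactically valid
    formally_verified: Optional[bool] = None  # compiles, no errors, no sorry
    source_faithful: Optional[bool] = None    # Lean statement means the original

    def missing_evidence(self):
        """List the claims that nobody has assessed yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def supports_solved(self) -> bool:
        """Only verified proofs of faithful statements count as solved."""
        return self.formally_verified is True and self.source_faithful is True
```

A record such as ProofClaims(informal_correct=True, parses=True, formally_verified=False) then describes the informal-formal gap for one problem without overstating or hiding anything, and missing_evidence() shows that source fidelity was never checked.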

This connects directly to AI Evaluations, Reasoning Models, AIME Math Benchmarks, AI Audit Trails, The Proof Trace Becomes the Trust Boundary, The Kernel Acceptance Becomes the Quality Mirage, The Sorry Count Becomes the Library Review, The Evaluation Bench Becomes the Test Rig, The Logic Benchmark Becomes the Control Panel, The Reasoning Tree Becomes the Commit Log, and The Benchmark Becomes the Curriculum.

Limits

The benchmark is intentionally narrow. It tests theorem proving in mathematical analysis, not all of mathematics, and not informal problem solving with final numeric answers. That precision is a strength for formal reasoning research, but it should not be generalized into a universal model ranking.

The proprietary-model sampling budget is smaller than the open-source budget: n = 8 versus n = 32. Pass@8 remains the fair cross-model headline, while Pass@32 should be read as a supplementary open-source and theorem-prover sampling result.

The natural-language proof comparison uses GPT-5.5 as the judge. That is useful for probing the informal-formal gap, but it is still model-graded and should not be treated as human proof review. The stronger claim is narrower: the reported informal successes and formal failures show that Lean translation, Mathlib grounding, and type-system navigation are bottlenecks in their own right, distinct from informal mathematical reasoning.

Sources

Lushi Pu, Weiming Zhang, Xinheng Xie, Zixuan Fu, Bingxiang He, Hongya Lyu, Xin Li, Jie Zhou, and Yudong Wang, MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis, arXiv:2606.13782 [cs.AI], submitted June 11, 2026 and revised June 15, 2026.
arXiv HTML: MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis, reviewed for authorship, affiliations, abstract, construction workflow, formalization standards, experiments, and appendices.
arXiv PDF: MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis, reviewed for official artifact links, exact table values, category distribution, error analysis, informal-proof comparison, and case studies.
arXiv TeX source: e-print source for arXiv:2606.13782, reviewed for official Hugging Face and GitHub links, model table values, and prompt text.
Code repository: OpenBMB/MA-ProofBench, reviewed for README, Lean 4 v4.28.0 scope, evaluation package, Kimina Lean Server setup, repository contents, and MIT license.
Dataset: openbmb/MA-ProofBench, reviewed for dataset card metadata, MIT license tag, public ungated status, test split, and ma_proofbench.jsonl artifact.
