Last reviewed July 2, 2026

The Diagram Becomes the Verification Trace

VeriGeo treats a synthetic geometry problem as more than a prompt completion. The question, diagram, constraints, proof steps, repair attempts, and rejection decision become one executable record.

The Paper

The paper is VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification, arXiv:2606.14176 [cs.AI], by Xiaoxian Duan, Zequn Liu, and Yingce Xia. arXiv lists version 1 as submitted on June 12, 2026, with DOI 10.48550/arXiv.2606.14176. The arXiv HTML lists Xiaoxian Duan with the Institute of Automation, Chinese Academy of Sciences, Beijing, China, and Zhongguancun Academy, Beijing, China; Zequn Liu and Yingce Xia are listed with Zhongguancun Academy, with correspondence to Zequn Liu and Yingce Xia.

The paper's object is synthetic geometry data for AI-assisted education and multimodal mathematical reasoning. Its claim is not merely that an LLM can write plausible questions. Its claim is that generation should be grounded in executable reasoning traces so the statement, diagram, geometric constraints, and solution can be checked together.

Why Geometry Data Breaks

Geometry problems are fragile because the answer depends on cross-modal agreement. The natural-language problem must match the diagram. The diagram must satisfy the implied constraints. The proof must use valid relationships, not invented facts. If any layer drifts, a training example can look normal while teaching a model a false geometry world.

The paper frames three requirements: controllability, verifiability, and diversity. Teachers and dataset builders may want target concepts, a chosen difficulty, or specific diagram requirements. The generator must obey those controls while still producing varied problems and valid solutions.

Existing strategies split the tradeoff. Seed-based rewriting is flexible, but it can hallucinate constraints or create cross-modal inconsistency. Diagram-first construction improves validity, but it is less suited to arbitrary user constraints. VeriGeo tries to keep the control surface while moving validity checks into the generation loop.

Executable Trace

VeriGeo uses two LLM roles. An Author agent receives user constraints, such as target concepts and difficulty, and generates the geometry problem and diagram. A Solver agent then produces a proof-aligned solution. The important part is that both agents work through a shared action sequence.

That shared action sequence is the bridge between prose, picture, constraint system, and proof. The paper's grammar includes construction and verification moves such as AddPoint, MovePoint, AddAuxLine, AddCircle, AddEdge, VerifyPoint, and VerifyFunction. Relationships such as collinear, parallel, perpendicular, equal length, equal angle, midpoint, and circle incidence are no longer just English claims. They become objects the verifier can inspect.
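
To make the idea concrete, here is a minimal Python sketch of a shared action trace being executed into a diagram state. The operation names AddPoint, AddEdge, and VerifyPoint come from the paper's grammar as described above; the Action class, the argument layout, and the execute function are assumptions of this post, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Action:
    op: str      # operation name, e.g. "AddPoint" or "VerifyPoint"
    args: tuple  # operand layout is an assumption of this sketch


def execute(trace: List[Action]):
    """Replay a trace into points, edges, and pending verification checks."""
    points: Dict[str, Point] = {}
    edges: List[Tuple[str, str]] = []
    checks: List[Action] = []
    for action in trace:
        if action.op == "AddPoint":
            name, x, y = action.args
            points[name] = (float(x), float(y))
        elif action.op == "AddEdge":
            a, b = action.args
            if a not in points or b not in points:
                raise ValueError(f"edge {a}{b} uses an undefined point")
            edges.append((a, b))
        elif action.op.startswith("Verify"):
            # Verification moves are collected so a checker can run them later.
            checks.append(action)
        else:
            raise ValueError(f"unsupported action: {action.op}")
    return points, edges, checks


trace = [
    Action("AddPoint", ("A", 0, 0)),
    Action("AddPoint", ("B", 4, 0)),
    Action("AddPoint", ("M", 2, 0)),
    Action("AddEdge", ("A", "B")),
    Action("VerifyPoint", ("midpoint", "M", "A", "B")),
]
points, edges, checks = execute(trace)
```

The point of the sketch is that the midpoint claim is no longer a sentence in the problem text. It is an entry in checks that a verifier can evaluate against the coordinates in points.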

This is the paper's key governance move. The diagram is not a decorative input attached to a question. It is part of the verification trace. A generated example can be accepted, repaired, or rejected because the system has something more precise than surface plausibility to inspect.

Verification Gates

The first gate is numerical verification. It executes the action sequence and checks local geometric consistency under numerical tolerances. If a construction says two lines are perpendicular, three points are collinear, or a point lies on a circle, the local geometry has to satisfy that claim.
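
A tolerance-based check of this kind can be sketched in a few lines of Python. The function names, the relative-tolerance scheme, and the default value of TOL below are assumptions chosen for illustration; the paper's actual tolerances and predicates are not reproduced here.

```python
import math

TOL = 1e-6  # assumed tolerance, not a value taken from the paper


def _sub(p, q):
    return (p[0] - q[0], p[1] - q[1])


def is_collinear(a, b, c, tol=TOL):
    """True if a, b, c lie on one line, up to a relative tolerance."""
    u, v = _sub(b, a), _sub(c, a)
    cross = u[0] * v[1] - u[1] * v[0]
    return abs(cross) <= tol * math.hypot(*u) * math.hypot(*v)


def is_perpendicular(a, b, c, d, tol=TOL):
    """True if line ab is perpendicular to line cd, up to a relative tolerance."""
    u, v = _sub(b, a), _sub(d, c)
    dot = u[0] * v[0] + u[1] * v[1]
    return abs(dot) <= tol * math.hypot(*u) * math.hypot(*v)


def on_circle(p, center, radius, tol=TOL):
    """True if p lies on the circle with the given center and radius."""
    return abs(math.dist(p, center) - radius) <= tol
```

For example, is_collinear((0, 0), (1, 1), (2, 2)) holds, while moving the last point to (2, 2.1) fails the check. Scaling the tolerance by the segment lengths keeps the test from depending on how large the diagram happens to be drawn.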

The second gate is analytical verification. It compiles the constraints into algebraic systems and checks realizability. This is the difference between a diagram that happens to be drawn plausibly and a constraint set that can actually coexist.
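
The paper's compilation step is not reproduced in this post, so the following Python sketch is a stand-in for one tiny case only: deciding exactly whether three stated side lengths can coexist as a triangle. It uses Heron's formula written in squared lengths with exact rational arithmetic, so there is no tolerance involved. The function names are assumptions of this sketch.

```python
from fractions import Fraction


def sixteen_area_squared(a, b, c):
    """Return 16 * area^2 for side lengths a, b, c, computed exactly."""
    A, B, C = (Fraction(s) ** 2 for s in (a, b, c))
    # Heron's formula expressed in squared side lengths.
    return 2 * (A * B + B * C + C * A) - (A * A + B * B + C * C)


def triangle_realizable(a, b, c, allow_degenerate=False):
    """True if the three lengths can form a triangle in the plane."""
    if any(Fraction(s) <= 0 for s in (a, b, c)):
        return False
    value = sixteen_area_squared(a, b, c)
    return value >= 0 if allow_degenerate else value > 0
```

A 3-4-5 triangle gives 576, which is 16 times the square of its area 6, so it is realizable. Side lengths 1, 1, 3 give a negative value: no drawing can satisfy them, however plausible the accompanying picture looks.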

The third gate is LLM-assisted logical verification. This stage audits global consistency among the problem text, diagram, action sequence, and solution. It looks for contradictory assumptions, unsupported inferences, and missing cases. The paper is explicit that this last gate uses an LLM. That makes it a judge channel rather than a theorem prover, though one embedded inside a broader checking pipeline.

When a generation fails, VeriGeo uses verification-guided reflection to repair recoverable failures and reject unrecoverable ones. That distinction matters. A synthetic-data system should not silently pass examples just because the prompt eventually produced fluent text. It needs a rejection rule as much as a repair loop.
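
The control flow behind repair and rejection can be sketched as a small Python loop. Everything named here (Verdict, Outcome, generate_with_reflection, max_repairs, and the toy gates) is hypothetical and stands in for the paper's LLM-driven reflection; only the accept, repair, or reject structure is taken from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Verdict:
    ok: bool
    recoverable: bool = True
    reason: str = ""


@dataclass
class Outcome:
    status: str  # "accepted" or "rejected"
    candidate: Optional[dict]
    repairs: int
    reasons: List[str] = field(default_factory=list)


def generate_with_reflection(candidate, gates, repair, max_repairs=2):
    """Run gates in order; repair recoverable failures, reject the rest."""
    reasons, repairs = [], 0
    while True:
        failure = next((v for v in (g(candidate) for g in gates) if not v.ok), None)
        if failure is None:
            return Outcome("accepted", candidate, repairs, reasons)
        reasons.append(failure.reason)
        if not failure.recoverable or repairs >= max_repairs:
            return Outcome("rejected", None, repairs, reasons)
        candidate = repair(candidate, failure)
        repairs += 1


def side_gate(c):
    if all(s > 0 for s in c["sides"]):
        return Verdict(True)
    return Verdict(False, recoverable=False, reason="non-positive side length")


def angle_gate(c):
    if sum(c["angles"]) == 180:
        return Verdict(True)
    return Verdict(False, reason="angles do not sum to 180")


def toy_repair(c, failure):
    # Stand-in for an LLM reflection step: recompute the third angle.
    first, second, _ = c["angles"]
    return {**c, "angles": [first, second, 180 - first - second]}


gates = [side_gate, angle_gate]
repaired = generate_with_reflection({"angles": [60, 60, 70], "sides": [1, 1, 1]}, gates, toy_repair)
rejected = generate_with_reflection({"angles": [60, 60, 60], "sides": [1, -1, 1]}, gates, toy_repair)
```

In this toy run the first candidate is accepted after one repair, and the second is rejected immediately because its failure is marked unrecoverable. The rejection reason is kept in both cases, which is what makes the outcome auditable later.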

Results

Across five LLM backbones, the average direct-pass rate, which reads as the share of raw generations that clear verification without any repair, is only 29.02 percent. That number is the paper's warning label: raw generations frequently fail the checks. The repair stage recovers a substantial fraction for stronger backbones, including 36.00 percent repaired for Gemini-3.1-Pro, 30.67 percent for Qwen3.5-Plus, and 20.22 percent for Claude-Opus-4.6.

The diversity claim is also measurable. In a 100-sample analysis, VeriGeo covers 354 distinct geometry concepts. The paper reports target-difficulty matching of 100.0 percent for the Harder condition and 80.0 percent for the Equivalent condition in its seed-conditioned difficulty evaluation.

The downstream result uses 8.7k verified examples generated by VeriGeo for supervised fine-tuning of Qwen2.5-VL-7B-Instruct. The reported scores are 59.40 percent on PGPS9K, 82.74 percent on GeoQA, and 75.96 percent on MathVista-GPS. The paper presents this as the best reported GeoQA result among end-to-end multimodal LLM-based geometry solvers, and as strong PGPS9K and MathVista-GPS performance relative to prior geometry data generation methods used for supervised fine-tuning.

The comparison is precise enough to keep in view. The baseline Qwen2.5-VL-7B-Instruct scores 42.40 percent on PGPS9K, 72.07 percent on GeoQA, and 53.37 percent on MathVista-GPS. GeoGen-SFT-7B, trained on 224k examples, reports 54.30, 78.00, and 74.00 percent on the same three benchmarks. VeriGeo's 7B model reports 59.40, 82.74, and 75.96 percent using 8.7k verified examples. The claimed advantage is data quality, not just data volume.
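
The gaps are easier to read as differences. The short Python calculation below uses only the scores quoted in this post and Decimal arithmetic to avoid floating-point noise. The dictionary keys and the gains helper are assumptions for illustration, and the data ratio is approximate because 8.7k is itself a rounded figure.

```python
from decimal import Decimal

BENCHMARKS = ("PGPS9K", "GeoQA", "MathVista-GPS")

# Scores in percent, copied from the comparison above.
scores = {
    "baseline": ("42.40", "72.07", "53.37"),
    "geogen": ("54.30", "78.00", "74.00"),
    "verigeo": ("59.40", "82.74", "75.96"),
}


def gains(model, reference):
    """Percentage-point difference of model over reference, per benchmark."""
    return {
        bench: Decimal(m) - Decimal(r)
        for bench, m, r in zip(BENCHMARKS, scores[model], scores[reference])
    }


over_baseline = gains("verigeo", "baseline")
over_geogen = gains("verigeo", "geogen")

# Rough ratio of training-set sizes: 224k examples versus about 8.7k.
data_ratio = Decimal(224_000) / Decimal(8_700)
```

Over the baseline, the gains are 17.00, 10.67, and 22.59 percentage points. Over GeoGen-SFT-7B they are 5.10, 4.74, and 1.96 points, achieved with roughly one twenty-fifth of the training examples.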

Governance Standard

A geometry-data generator should ship a geometry-data receipt. The receipt should include the prompt constraints, target concepts, requested difficulty, diagram requirements, model name, decoding settings, seed, action grammar version, action trace, diagram artifact, problem text, solver trace, numerical verifier version, tolerance settings, analytical solver, logical judge prompt, logical judge model, repair iterations, rejection reason, accepted or rejected status, time, tokens, cost, training split, downstream benchmark split, license, and provenance.
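
As a sketch of what such a receipt might look like in code, here is a Python dataclass covering a subset of the fields listed above. The class name, field names, and the problems check are this post's assumptions, not an artifact the paper ships.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class GeometryReceipt:
    target_concepts: List[str]
    requested_difficulty: str
    model_name: str
    seed: int
    action_grammar_version: str
    action_trace: List[str]
    numerical_tolerance: float
    logical_judge_model: str
    repair_iterations: int = 0
    status: str = "rejected"  # "accepted" or "rejected"
    rejection_reason: Optional[str] = None

    def problems(self) -> List[str]:
        """Return a list of internal inconsistencies; empty means well-formed."""
        found = []
        if self.status not in ("accepted", "rejected"):
            found.append("status must be 'accepted' or 'rejected'")
        if self.status == "rejected" and not self.rejection_reason:
            found.append("rejected receipts need a rejection reason")
        if self.status == "accepted" and self.rejection_reason:
            found.append("accepted receipts must not carry a rejection reason")
        if self.repair_iterations < 0:
            found.append("repair_iterations cannot be negative")
        return found

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


receipt = GeometryReceipt(
    target_concepts=["midpoint", "perpendicular"],
    requested_difficulty="harder",
    model_name="example-backbone",      # placeholder, not a real model name
    seed=7,
    action_grammar_version="0.1",       # placeholder version string
    action_trace=["AddPoint A", "AddPoint B", "VerifyPoint midpoint M A B"],
    numerical_tolerance=1e-6,
    logical_judge_model="example-judge",  # placeholder
    repair_iterations=1,
    status="accepted",
)
```

The useful property is that a rejected example without a recorded reason is itself flagged as malformed, so rejection decisions cannot disappear from the dataset's history.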

The receipt should keep three claims separate. Consistency means the statement, diagram, trace, constraints, and solution survived the checks. Pedagogical value means the problem is worth teaching. Benchmark value means training on it improves held-out performance without contaminating evaluation. VeriGeo gives a stronger consistency story than ordinary prompt synthesis, but consistency is not the whole educational or evaluation claim.

This connects directly to AI Evaluations, AIME and Math Benchmarks, Reasoning Models, Synthetic Data and Model Collapse, Training Data, The Benchmark Becomes the Curriculum, The Evaluation Bench Becomes the Test Rig, The Grading Cascade Becomes the Evaluation Artifact, The Validity Certificate Becomes the Policy Proof, The Difficulty Estimate Becomes the Reasoning Trace, The Verifier Horizon Becomes the Agent Reward, The Proof Trace Becomes the Trust Boundary, The Logic Benchmark Becomes the Control Panel, and The Reaction Rule Becomes the Verification Loop. Synthetic examples become governable only when their construction evidence travels with them.

Limits

The paper's own limitations are important. The current implementation focuses on geometry representable by the executable action grammar and by the supported geometric operators, constraint types, and algebraic verification routines. That leaves future work for richer diagrams, more advanced theorem-level constructions, 3D geometry, and curriculum variants.

Efficiency depends heavily on the backbone and the verification-reflection budget. The appendix reports very different token, time, cost, repair, and rejection profiles across Gemini-3.1-Pro, Qwen3.5-Plus, and Claude-Opus-4.6. A generator that is valid but too expensive still needs routing, early rejection, dynamic reflection budgets, or lightweight verifier pre-filtering.

The logical gate and some evaluation judgments still rely on LLMs. That is not disqualifying, but it should be logged as a judge channel rather than confused with formal proof. The downstream results focus on supervised fine-tuning with verified examples; process-level supervision, reinforcement learning, verifier-guided selection, and larger-scale training remain future directions.

At review time, I found the arXiv record, HTML, and PDF, but no official code repository linked from the arXiv record. That makes the implementation and benchmark claims paper-reported rather than locally reproduced.
