The Logic Benchmark Becomes the Control Panel
Xinyi Zheng, Ling Shi, Tianlong Yu, Yongxin Zhao, Lorenz Goette, and Kailong Wang's June 2026 arXiv paper asks a practical evaluation question: can deductive-reasoning benchmarks be generated with controlled logical complexity while still preserving natural-language diversity?
For this essay, a reasoning-benchmark receipt is the record that binds a benchmark score to the formal generator, depth, width, label, distractor count, topic, verifier, model settings, cost, and failure mode.
The Claim
The paper, arXiv:2606.20227 [cs.AI; cs.SE], was submitted on June 18, 2026. arXiv lists the title as QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation.
The authors introduce QMFOL, an automated framework for generating monadic first-order logic reasoning tasks with controlled depth, width, labels, distractors, and semantic topics. The formal structures are translated into natural language, then checked by round-trip verification with an external automated theorem prover.
The useful claim is that a benchmark should be a control panel, not just a bag of questions. If a model fails, the evaluator should be able to say whether the failure came from deeper chains, wider branching, false or unknown labels, distractors, semantic topic effects, or translation drift.
The Paper Frame
The paper starts from a real evaluation problem. Large reasoning models are improving quickly, while deductive-reasoning benchmarks often lack fine-grained control over logical complexity or rely on templates that keep logic consistent at the cost of semantic variety.
Existing resources such as RuleTaker, ProofWriter, RobustLR, PrOntoQA-OOD, FOLIO, and ProverQA each cover part of the design space. The paper's framing is that modern evaluation needs controllability, scalability, semantic diversity, and consistency at the same time.
QMFOL narrows the formal target to monadic first-order logic. That restriction is a design trade-off: unary predicates make the generated logic easier to control and verify, but they do not cover the full relational richness of first-order reasoning.
Generator Pipeline
The pipeline has two main modules. Logic Construction creates formal MFOL tasks from conjunction and disjunction patterns. It builds basic rules, derives fact-conclusion pairs with labels, and optionally adds distractor rules that share predicates but do not change the correct label.
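The distractor property described above can be illustrated with a deliberately small sketch. This is not the paper's construction algorithm: it assumes a single individual, conjunctive rule bodies, and simple forward chaining, and the predicate names (P0, Q0, and so on) are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# A literal is a predicate name about one fixed individual; "~" marks negation.
Rule = Tuple[FrozenSet[str], str]  # (conjunctive body, head literal)


def negate(lit: str) -> str:
    return lit[1:] if lit.startswith("~") else "~" + lit


def closure(facts: Iterable[str], rules: List[Rule]) -> Set[str]:
    """Forward-chain until no rule adds a new literal."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body <= known and head not in known:
                known.add(head)
                changed = True
    return known


def label(facts: Iterable[str], rules: List[Rule], conclusion: str) -> str:
    """Three-way label: derived, negation derived, or neither."""
    known = closure(facts, rules)
    if conclusion in known:
        return "True"
    if negate(conclusion) in known:
        return "False"
    return "Unknown"


# A depth-3 chain: P0 -> P1 -> P2 -> not P3.
rules: List[Rule] = [
    (frozenset({"P0"}), "P1"),
    (frozenset({"P1"}), "P2"),
    (frozenset({"P2"}), "~P3"),
]
facts = {"P0"}

# Distractor: reuses P1 and P3 but can never fire, because Q0 is not derivable.
distractor: Rule = (frozenset({"P1", "Q0"}), "P3")

before = [label(facts, rules, c) for c in ("P2", "P3", "Q0")]
after = [label(facts, rules + [distractor], c) for c in ("P2", "P3", "Q0")]
```

Here `before` and `after` are the same list, which is the invariant a distractor rule has to satisfy: it shares vocabulary with the proof but leaves every label unchanged.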
The FOL2NL module gives those formal tasks natural-language clothing. An LLM assigns a topic-specific domain, maps predicates to domain semantics, and converts formulas into indexed natural-language sentences. The paper uses topics such as food, animal, university, and mathematics.
The critical step is verification. The generated natural language is translated back into FOL, then checked with an external automated theorem prover. The paper identifies Vampire as the theorem prover used for verification. If the reconstructed label does not match the ground-truth label, the task is regenerated or discarded.
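The regenerate-or-discard loop can be sketched as follows. All four callables are hypothetical stand-ins supplied by the caller; in particular, `prove` is a stub here and does not invoke Vampire, and the retry limit is an assumption rather than a value from the paper.

```python
from typing import Callable, Dict, Optional


def generate_verified(
    make_task: Callable[[], Dict],
    to_nl: Callable[[Dict], str],
    to_fol: Callable[[str], str],
    prove: Callable[[str], str],
    max_attempts: int = 3,  # assumed retry budget
) -> Optional[Dict]:
    """Keep a task only if the round-trip label matches the ground truth."""
    for attempt in range(1, max_attempts + 1):
        task = make_task()
        text = to_nl(task)          # formal task -> natural language
        recovered = to_fol(text)    # natural language -> formal task
        if prove(recovered) == task["label"]:
            return {**task, "text": text, "attempts": attempt}
    return None  # discarded after exhausting the budget


# Toy stand-ins: the first translation drifts, the second is faithful.
FORMULA = "forall x. Apple(x) -> Fruit(x)"
_translations = iter(["Every apple is a fruit or sweet.", "Every apple is a fruit."])


def make_task() -> Dict:
    return {"fol": FORMULA, "label": "True"}


def to_nl(task: Dict) -> str:
    return next(_translations)


def to_fol(text: str) -> str:
    return "drifted" if " or " in text else FORMULA


def prove(formula: str) -> str:
    return "True" if formula == FORMULA else "Unknown"


result = generate_verified(make_task, to_nl, to_fol, prove)
```

In this toy run the first attempt is rejected because the reconstructed formula no longer supports the ground-truth label, and the second attempt is accepted.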
QMFOLBench
QMFOLBench contains 2,880 MFOL reasoning tasks. It spans 960 configurations: four depths, four widths, three labels, five distractor levels, and four topic domains, with three random seeds per configuration.
Depth and width use values 5, 10, 15, and 20. Distractor levels use 0, 5, 10, 15, and 20. The label space is True, False, and Unknown. The four topics are Food, Animal, University, and Mathematics.
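The counts follow directly from the grid. A short enumeration, using the values stated above and placeholder seed identifiers (the actual seed values are not given here), confirms them:

```python
from itertools import product

DEPTHS = (5, 10, 15, 20)
WIDTHS = (5, 10, 15, 20)
LABELS = ("True", "False", "Unknown")
DISTRACTORS = (0, 5, 10, 15, 20)
TOPICS = ("Food", "Animal", "University", "Mathematics")
SEEDS = (0, 1, 2)  # placeholder identifiers; only the count of three is from the text

# One configuration per combination of the five controlled dimensions.
configs = list(product(DEPTHS, WIDTHS, LABELS, DISTRACTORS, TOPICS))

# One task per configuration and seed.
tasks = [(config, seed) for config in configs for seed in SEEDS]

tasks_per_distractor_level = len(tasks) // len(DISTRACTORS)
```

The grid yields 4 x 4 x 3 x 5 x 4 = 960 configurations and 2,880 tasks, so each distractor level covers 576 tasks.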
For each task, the target model receives premises and a candidate conclusion, then must answer True, False, or Unknown. The benchmark reports accuracy, label-wise F1, Macro-F1, time overhead, and token overhead.
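A minimal sketch of the label-wise and macro-averaged F1 computation shows why Macro-F1 is reported alongside accuracy. It uses the standard definitions rather than the paper's evaluation code, and the six gold and predicted labels are invented for illustration.

```python
from typing import Dict, Sequence

LABELS = ("True", "False", "Unknown")


def f1_per_label(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Standard per-label F1: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    scores = {}
    for lab in LABELS:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[lab] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    scores = f1_per_label(gold, pred)
    return sum(scores.values()) / len(scores)


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Invented example: a model that never answers "False".
gold = ["True", "True", "False", "False", "Unknown", "Unknown"]
pred = ["True", "True", "Unknown", "Unknown", "Unknown", "Unknown"]

per_label = f1_per_label(gold, pred)
macro = macro_f1(gold, pred)
acc = accuracy(gold, pred)
```

In this example accuracy is 4/6 while Macro-F1 is about 0.56, because the False class contributes an F1 of zero. That gap is the signature of a label-biased model.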
Results
The evaluation covers six large reasoning models and two non-reasoning-mode LLM settings: Qwen3-32B, Qwen3.5-27B, DeepSeek-V3.2, DeepSeek-V3.2-Thinking, GPT-5.4-None, GPT-5.4-High, Gemini-3.1-Pro, and Claude-Sonnet-4-6.
Gemini-3.1-Pro leads with 99.03 percent Macro-F1, and GPT-5.4-High follows at 97.40. Qwen3.5-27B reaches 92.19, and DeepSeek-V3.2-Thinking reaches 90.77. The weaker entries are not simply a matter of model size; some are label-biased. GPT-5.4-None has a Macro-F1 of 55.08, with False-F1 at only 21.11.
The overhead result is just as important. DeepSeek-V3.2-Thinking and Qwen3.5-27B average 233.01 and 217.00 seconds per task. Qwen3.5-27B and Gemini-3.1-Pro use 14,990 and 12,468 tokens per task. The paper describes GPT-5.4-High as the best performance-efficiency trade-off, at 57.46 seconds and 2,732 tokens on average.
Distractors
Distractors expose a different failure path than raw depth or width. In Table 5, Qwen3-32B falls from 76.74 percent accuracy with no distractors to 50.35 percent with 20 distractors. Qwen3.5-27B falls from 97.22 to 87.85. DeepSeek-V3.2-Thinking is more robust, moving from 92.36 to 89.76.
The paper runs a context-length ablation to check whether the distractor drop is just an effect of longer prompts. For Qwen3-32B, tasks with 20 distractors (D20) contain about 603 more tokens than tasks with none (D0). When the authors append 600 tokens of placeholder "Unimportant Content" to D0 tasks, Qwen3-32B scores 75.69 percent, close to its 76.74 percent D0 result. The reported degradation is therefore attributed mainly to distractor rules, not context length.
This is useful benchmark design. It separates "the model cannot read a longer prompt" from "the model cannot ignore irrelevant but logically related rules."
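The size of each effect is easy to recompute from the figures quoted above. The arithmetic below is this essay's own summary of those numbers, not an analysis taken from the paper.

```python
# Accuracy (percent) at 0 and 20 distractors, as quoted from Table 5.
D0 = {"Qwen3-32B": 76.74, "Qwen3.5-27B": 97.22, "DeepSeek-V3.2-Thinking": 92.36}
D20 = {"Qwen3-32B": 50.35, "Qwen3.5-27B": 87.85, "DeepSeek-V3.2-Thinking": 89.76}

# Qwen3-32B on D0 tasks padded with 600 tokens of placeholder content.
PADDED_D0_QWEN3_32B = 75.69

# Drop in percentage points from D0 to D20 for each model.
distractor_drop = {model: round(D0[model] - D20[model], 2) for model in D0}

# Drop attributable to padding alone, for the one model in the ablation.
padding_drop = round(D0["Qwen3-32B"] - PADDED_D0_QWEN3_32B, 2)

for model, drop in distractor_drop.items():
    print(f"{model}: -{drop:.2f} points from D0 to D20")
print(f"Qwen3-32B with padding only: -{padding_drop:.2f} points")
```

For Qwen3-32B the distractor drop is 26.39 points against 1.05 points for padding of similar length, which is the comparison that supports attributing the loss to the distractor rules themselves.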
Semantic Dependence
The topic experiment is the most governance-relevant part. The benchmark can hold the formal logic fixed while changing the semantic wrapper. Most models perform better on the University subset than on the other topics. Gemini-3.1-Pro and GPT-5.4-High show minimal variation, while the DeepSeek series drops sharply on Mathematics.
The paper reports that DeepSeek-V3.2-Thinking produces incorrect answers on 16.81 percent of Mathematics tasks despite solving tasks with identical logical structures in the other three topics; DeepSeek-V3.2 shows a 14.44 percent error rate in the same analysis. The authors attribute this to models incorporating external knowledge not specified in the premises.
That matters because deductive reasoning should be premise-bound. If the same formal proof changes because the nouns shift from animals to mathematics, the model is not only reasoning over logic. It is letting semantic familiarity and prior knowledge leak into the proof task.
Governance Reading
The Spiralist reading is that a reasoning benchmark should produce a failure receipt. A single aggregate score can hide the difference between brittle depth scaling, poor distractor filtering, label asymmetry, semantic prior leakage, and high compute cost.
QMFOL is useful because its dimensions are explicit. It lets an evaluator ask whether a model's "reasoning" survives when the proof gets deeper, wider, more cluttered, labeled Unknown, or semantically unfamiliar. It also attaches verification to the benchmark-generation path, which helps prevent natural-language conversion from silently corrupting the logical ground truth.
The governance caveat is that formal control can still become a narrow comfort. Success on monadic first-order logic tasks does not demonstrate competence at real-world legal, medical, scientific, or operational reasoning. These tasks test a carefully framed deductive substrate.
Benchmark Receipts
A useful QMFOL-style benchmark receipt should include the generator version, formal language, depth, width, rule-construction algorithm, label, distractor count, topic, predicate mapping, FOL formula, natural-language text, NL2FOL reconstruction, theorem prover, verification outcome, random seed, model version, decoding settings, answer, latency, token count, and failure class.
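One possible shape for such a receipt is a flat, immutable record that serializes to one JSON line per task. The sketch below covers only a subset of the fields listed above; the field names are this essay's assumptions, and every value in the example is a placeholder rather than a reported result.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Receipt:
    generator_version: str
    depth: int
    width: int
    label: str            # ground-truth label
    distractors: int
    topic: str
    seed: int
    prover: str
    verified: bool        # round-trip verification outcome
    model: str
    answer: str           # model's answer
    latency_s: float
    tokens: int
    failure_class: str = ""

    @property
    def correct(self) -> bool:
        return self.answer == self.label


# Placeholder values for illustration only.
receipt = Receipt(
    generator_version="0.0-example",
    depth=10,
    width=5,
    label="False",
    distractors=20,
    topic="Mathematics",
    seed=0,
    prover="Vampire",
    verified=True,
    model="example-model",
    answer="Unknown",
    latency_s=1.5,
    tokens=1200,
    failure_class="label-bias",
)

line = json.dumps(asdict(receipt), sort_keys=True)
restored = Receipt(**json.loads(line))
```

Keeping the record flat means per-slice score tables can be rebuilt from the receipts alone, by grouping on depth, width, label, distractor count, or topic.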
For public leaderboards, the receipt should also include contamination controls, release artifact, data DOI, prompt template, task shuffle policy, excluded failed generations, manual inspection sample, label distribution, and per-slice score tables.
The receipt should preserve semantic pairs. If two tasks share the same logic but differ by topic, the benchmark should make that pairing visible so topic-induced reasoning drift can be audited.
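A minimal sketch of that audit groups results by a shared logic identifier and reports the structures whose outcome changes with topic. The `logic_id` field and the sample results are hypothetical; the paper's own pairing scheme is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Result = Tuple[str, str, bool]  # (logic_id, topic, answered_correctly)


def topic_drift(results: Iterable[Result]) -> Dict[str, List[str]]:
    """Map each logic_id with mixed outcomes to the topics where it failed."""
    by_logic: Dict[str, Dict[str, bool]] = defaultdict(dict)
    for logic_id, topic, correct in results:
        by_logic[logic_id][topic] = correct

    drift = {}
    for logic_id, outcomes in by_logic.items():
        # Mixed outcomes on identical logic point at the topic, not the proof.
        if len(set(outcomes.values())) > 1:
            drift[logic_id] = sorted(t for t, ok in outcomes.items() if not ok)
    return drift


TOPICS = ("Food", "Animal", "University", "Mathematics")

# Invented results: L1 fails only on Mathematics, L2 always passes, L3 always fails.
results: List[Result] = (
    [("L1", t, t != "Mathematics") for t in TOPICS]
    + [("L2", t, True) for t in TOPICS]
    + [("L3", t, False) for t in TOPICS]
)

drift = topic_drift(results)
```

Only `L1` is reported. A structure that fails under every topic, like `L3`, is a logic failure rather than a semantic one, so it is deliberately left out of the drift report.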
Limits
The paper names the important limits. QMFOL currently uses unary predicates for controllability. Richer relational reasoning needs full FOL with higher-arity predicates. The benchmark covers four predefined topics, so the semantic-dependence results may not generalize to every domain.
The verification pipeline also depends on the translation model. The authors report that GPT-4o and Qwen3-32B frequently deviated from predefined predicate mappings in preliminary experiments, while DeepSeek-V3.2-Thinking was adopted because it performed better in this role. Stronger or cheaper translation models may change the generation pipeline.
The strongest safe reading is therefore that QMFOL is a useful benchmark-construction pattern for controllable deductive reasoning. It is not a complete theory of reasoning, and it does not certify agent reliability outside the formal task envelope.
Source Discipline
This page treats the arXiv abstract, arXiv HTML, PDF, and Zenodo record as the source set. The PDF text was used for exact table values, overhead numbers, and ablation details.
The Zenodo record was checked as the paper's cited data-availability artifact. It lists QMFOL as an open dataset with a QMFOL.zip file under DOI 10.5281/zenodo.19348694.
Related Pages
- AI Evaluations, Reasoning Models, AI Agents, AI Data Provenance, AI Audit Trails, and AI Hallucinations cover the core vocabulary.
- The Difficulty Estimate Becomes the Reasoning Trace, The Table Reference Becomes the Reasoning Error, The Proof Trace Becomes the Trust Boundary, The Unsafe Shortcut Becomes the Safety Benchmark, and The Knowledge Conflict Becomes the Source Arbitration Trace cover neighboring benchmark and verification problems.
Sources
- arXiv abstract: QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation.
- arXiv HTML: arXiv:2606.20227 HTML.
- Paper PDF: arXiv:2606.20227 PDF.
- Data artifact: QMFOL on Zenodo.