Blog · arXiv Analysis · Last reviewed June 25, 2026

The Nuclear Benchmark Becomes the Competence Receipt

Henry Shaowu Yuchi and coauthors' June 2026 arXiv paper treats nuclear-engineering QA as a competence receipt: not one score for "knows science," but separate evidence for factual recall, numeric reasoning, and conceptual understanding.

Not a Nuclear Safety Certification

The paper, arXiv:2606.27047 [cs.CL, cs.AI], was submitted on June 25, 2026. arXiv lists the exact title as NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models, by Henry Shaowu Yuchi, Michal Kucer, Benjamin H. Sims, Selma Peterson, and Emily Taylor.

This page is about benchmark governance, not nuclear operation, design, regulation, or safety advice. The paper asks how to test large language models on nuclear-engineering knowledge. Its useful warning is that a model can look competent on familiar factual questions while still failing on calculation or conceptual explanation, and that a reported score can be shaped by evaluation artifacts.

The Benchmark Shape

NuclearQAv2 contains 1,239 question-answer pairs. The authors divide them into three task types: 750 boolean questions, 206 numeric questions, and 283 verbal questions. The taxonomy is the point. Boolean questions test factual grounding with yes/no or true/false answers. Numeric questions test multi-step deduction, calculation, and quantitative reasoning. Verbal questions test conceptual understanding through short answers.

That makes the benchmark more useful than one blended science score. A procurement team, lab, university, or safety office should not hear "the model passed the nuclear benchmark" and infer one portable capability. It should ask which part passed: recognition, calculation, or explanation. Those are different forms of competence.
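To make the point concrete, here is a minimal sketch of how a blended score can hide a failed task type. The task counts are the ones reported above. The count-weighted blend and the per-task accuracies in the example are assumptions for illustration only, not the paper's stated aggregation method or any model's real results.

```python
# Task counts as reported for NuclearQAv2 in this post.
TASK_COUNTS = {"boolean": 750, "numeric": 206, "verbal": 283}


def task_shares(counts):
    """Return each task type's fraction of the total question count."""
    total = sum(counts.values())
    return {task: n / total for task, n in counts.items()}


def count_weighted_score(per_task_accuracy, counts):
    """Blend per-task accuracies, weighting each by its question count.

    Assumption: this micro-average is one plausible way to blend scores;
    the paper may aggregate differently.
    """
    total = sum(counts.values())
    return sum(per_task_accuracy[t] * counts[t] for t in counts) / total


shares = task_shares(TASK_COUNTS)

# Hypothetical model: perfect on boolean and verbal, zero on numeric.
hypothetical = {"boolean": 1.0, "numeric": 0.0, "verbal": 1.0}
blended = count_weighted_score(hypothetical, TASK_COUNTS)

for task, share in shares.items():
    print(f"{task}: {share:.3f} of questions")
print(f"blended score with zero numeric accuracy: {blended:.3f}")
```

Because numeric questions are roughly one sixth of the set, a count-weighted blend can stay high even when calculation fails entirely, which is why the per-task scores need to travel with any aggregate.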

How the Dataset Was Built

The paper describes a hybrid construction pipeline. The dataset draws on three sources: a nuclear-engineering textbook used with domain-expert assistance, existing benchmark material, and LLM-assisted generation from domain-specific technical corpora. The authors use Nougat to parse PDF material into structured text, excluding image information for simplicity.

For some of the boolean questions, domain experts supply QA pairs within the textbook's coverage. Numeric questions draw on exercise-style quantitative questions. Boolean and verbal questions are also generated from parsed technical text using Meta Llama 3 70B Instruct, under structured prompts intended to avoid ambiguity, trivial document-structure questions, and context-dependent answers. A manual filtering and curation step removes incomplete, unclear, or multiply-answerable QA pairs.

How Answers Were Scored

Scoring is task-specific. Boolean answers use exact-match accuracy after requiring a binary response. Numeric answers use a tolerance-based metric; the paper sets epsilon to 0.15, allowing a response within 15 percent of the reference value. Verbal answers are evaluated through an LLM-based semantic assessment intended to handle paraphrases and synonyms.
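The two deterministic rules can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function names, the accepted yes/no spellings, the relative-error form of the tolerance, the inclusive boundary, and the handling of a zero reference are all choices made here. Verbal scoring is left out because it depends on a judge model and prompt.

```python
def normalize_boolean(text):
    """Map a free-text reply to True/False, or None if it is not binary.

    Assumption: only these four spellings count as a binary response.
    """
    token = text.strip().lower().rstrip(".")
    if token in {"true", "yes"}:
        return True
    if token in {"false", "no"}:
        return False
    return None


def score_boolean(response, reference):
    """Exact match after normalization; non-binary replies score as wrong."""
    return normalize_boolean(response) == reference


def score_numeric(response, reference, epsilon=0.15):
    """Correct if the relative error is at most epsilon (0.15 per the post).

    Assumption: relative error is |response - reference| / |reference|,
    and a zero reference requires an exactly zero response.
    """
    if reference == 0:
        return response == 0
    return abs(response - reference) / abs(reference) <= epsilon


print(score_boolean("Yes.", True))   # accepted binary reply
print(score_boolean("maybe", True))  # not binary, scored as wrong
print(score_numeric(110.0, 100.0))   # 10 percent off, inside tolerance
print(score_numeric(120.0, 100.0))   # 20 percent off, outside tolerance
```

A 15 percent band is generous for some quantities and tight for others, so the tolerance is one of the fields that should be reported alongside any numeric score.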

This is where the receipt must include the evaluator, not just the evaluated model. A verbal answer judged by another model is still a model-mediated score. The paper is explicit that verbal evaluation can depend on evaluator choice and prompt formulation, and that future work should compare automated judgments against expert human assessment.

What the Scores Showed

The authors evaluate nine models: OpenAI GPT-5.2, OpenAI GPT-5.4, Meta Llama 3 70B Instruct, Amazon Nova Pro, Mistral 7B Instruct, OpenAI GPT-4o, OpenAI gpt-oss-120B, NVIDIA Nemotron-3 Super 120B A12B FP8, and Meta Llama 3 8B Instruct. The evaluation is run three times for each model to estimate mean accuracy and standard error.
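For readers who want to see what three runs buy, the following sketch computes a mean and a standard error from repeated accuracies. The run values are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption; the post does not state which estimator the paper uses.

```python
import math
import statistics


def mean_and_standard_error(run_accuracies):
    """Return (mean, standard error of the mean) over repeated runs.

    Assumption: standard error = sample standard deviation / sqrt(n).
    """
    n = len(run_accuracies)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.fmean(run_accuracies)
    sem = statistics.stdev(run_accuracies) / math.sqrt(n)
    return mean, sem


# Hypothetical accuracies from three runs of one model on one task type.
runs = [0.80, 0.79, 0.81]
mean, sem = mean_and_standard_error(runs)
print(f"mean accuracy {mean:.4f}, standard error {sem:.4f}")
```

With only three runs the standard error is itself a rough estimate, so small gaps between models are better read as ties than as rankings.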

The task split matters. OpenAI GPT-5.2 is reported as strongest on boolean tasks with 0.8022 accuracy and strongest on verbal tasks with 0.8233. OpenAI gpt-oss-120B is strongest on numeric tasks with 0.7913 and has the highest aggregate score at 0.7923. Several models perform much worse on numeric tasks than on boolean tasks; Mistral 7B Instruct is the sharpest example, with 0.3831 on boolean tasks and 0.0307 on numeric tasks. The governance reading is simple: factual recall is not quantitative competence.

Governance Reading

For AI evaluations, NuclearQAv2 is useful because it resists the single-score shortcut. It treats a technical domain as a layered practice: remembering facts, calculating values, and explaining concepts. That reading belongs beside the site's AI in Science page and its broader warning about source-bound factuality.

The governance problem is benchmark laundering. A model vendor, lab group, or institution can cite a domain benchmark as evidence of readiness while omitting the task mix, generation method, evaluator, tolerance rule, missing modalities, and failure modes. A benchmark becomes a competence receipt only when those fields travel with the score.

Limits

The paper's own limits keep the claim bounded. The quality of generated QA pairs depends on prompt design and the generation model. The authors say the automatic generation relies on an older language model, which may limit generated-question quality. Verbal scoring may be sensitive to judge choice and prompt formulation. The benchmark currently handles text only, while many scientific and engineering tasks rely on diagrams, figures, schematics, plots, or other multimodal material.

Those limits do not make the benchmark useless. They make it governable. A benchmark that names its scaffolding is easier to improve than a benchmark that presents one leaderboard as settled truth.

Competence Receipt

A domain-science benchmark receipt should record: domain boundary, source corpus, expert role, generated-question model, manual filtering rule, task counts, question taxonomy, answer format, numeric tolerance, judge model, prompt template, run count, aggregate score, per-task score, unavailable modalities, and known failure cases. The audit-grade sentence is not "the model knows nuclear engineering." It is: under this text-only benchmark, with these task types and scoring rules, this model showed these separable capabilities and limits.
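One way to keep those fields attached to a score is to treat the receipt as a record with a completeness check. The sketch below is hypothetical: the class name, field names, and types are invented here to mirror the list above, and the example values are only the details reported in this post, with everything else left unset.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CompetenceReceipt:
    """Hypothetical record mirroring the sixteen receipt fields listed above."""
    domain_boundary: Optional[str] = None
    source_corpus: Optional[str] = None
    expert_role: Optional[str] = None
    generation_model: Optional[str] = None
    manual_filtering_rule: Optional[str] = None
    task_counts: Optional[dict] = None
    question_taxonomy: Optional[list] = None
    answer_format: Optional[dict] = None
    numeric_tolerance: Optional[float] = None
    judge_model: Optional[str] = None
    prompt_template: Optional[str] = None
    run_count: Optional[int] = None
    aggregate_score: Optional[float] = None
    per_task_scores: Optional[dict] = None
    unavailable_modalities: Optional[list] = None
    known_failure_cases: Optional[list] = None


def missing_fields(receipt):
    """List the receipt fields that were never filled in."""
    return [f.name for f in fields(receipt) if getattr(receipt, f.name) is None]


# Partially filled from details reported in this post; the rest stay unset.
receipt = CompetenceReceipt(
    domain_boundary="nuclear engineering, text only",
    generation_model="Meta Llama 3 70B Instruct",
    task_counts={"boolean": 750, "numeric": 206, "verbal": 283},
    question_taxonomy=["boolean", "numeric", "verbal"],
    numeric_tolerance=0.15,
    run_count=3,
)

print("fields still missing:", missing_fields(receipt))
```

A score cited with a non-empty missing list is the benchmark-laundering case described earlier: the number is present, but the conditions that give it meaning are not.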

Sources

Henry Shaowu Yuchi, Michal Kucer, Benjamin H. Sims, Selma Peterson, and Emily Taylor, NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models, arXiv:2606.27047 [cs.CL, cs.AI], submitted June 25, 2026.
arXiv PDF and experimental HTML versions, reviewed for title, authorship, date, subject categories, benchmark size, question-type counts, construction pipeline, model list, task-wise scores, aggregate scores, scoring methods, and stated limits.
Related pages: AI Evaluations, AI in Science, The Health LLM Becomes the Black-Box Evaluation, The Source ID Becomes the Factuality Test, The Lab Simulator Becomes the Instrument Gate, and The Benchmark Becomes the Curriculum.
