Blog · arXiv Analysis · Last reviewed July 2, 2026

The Open Artifact Becomes the Reproducibility Receipt

Coakley, Snelleman, Hoos, and Gundersen show a real institutional shift in AI research: code, datasets, splits, hardware, software dependencies, pseudocode, and setup descriptions are present more often than they were a decade ago. The catch, which the paper keeps explicit, is that a reproducibility receipt is not the same as a reproduced result.

The Paper

The paper is "The Shift Toward Open and Reproducible AI Research," arXiv:2606.16974 [cs.AI], by Kevin L. Coakley, Thijs Snelleman, Holger Hoos, and Odd Erik Gundersen. arXiv lists the first version as submitted on June 15, 2026, and the current version as v3, last revised on June 26, 2026.

The study analyzes 56,800 papers from AAAI, ICLR, ICML, IJCAI, and NeurIPS over 2014-2024. The corpus itself is part of the story: annual publication counts across those venues grew from 1,206 papers in 2014 to 12,026 in 2024. Of the full corpus, 52,328 papers, or 92 percent, were classified as empirical and used for the reproducibility-documentation analysis.

The paper matters because AI evaluation often treats a published result as a stable fact. This work asks whether the surrounding evidence needed to re-run or challenge the result has become more available.

Seven Receipts

The study tracks seven reproducibility variables: open source code, use of open datasets, dataset splits, pseudocode, hardware specification, software dependencies, and experiment setup. These are not decorative details. They are the route by which an outside team can understand what was done and where failure may enter.

The paper is careful not to overstate the measure. Documenting all seven variables does not guarantee reproducibility. Different empirical methods need different evidence. But each missing variable creates friction: algorithmic ambiguity without code or pseudocode, data ambiguity without datasets or splits, environmental ambiguity without hardware and dependencies, and search ambiguity without setup details.

That is why this page calls them receipts. They do not prove the claim by themselves, but they make the claim inspectable.

The Open Science Trend

The headline trend is strong. Papers documenting at least five of the seven variables increased from 8 percent in 2014 to 43 percent in 2024. Papers documenting none of the variables fell from 3 percent to 0.3 percent, while papers documenting all seven rose from 0.1 percent to 1.7 percent.

Open code availability rose from 13 percent in 2014 to 69 percent in 2024. Open dataset usage rose from 68 percent to 91 percent. Papers documenting both code and data increased nearly sixfold, from 11 percent to 64 percent, while papers documenting neither dropped from 29 percent to 4 percent.

The paper then estimates reproducibility from documentation practices and prior empirical reproduction rates, not from direct re-runs. On that basis, estimated reproducibility rises from 28 percent in 2014 to 64 percent in 2024. The authors also estimate that the number of reproducible empirical papers rose nearly 24-fold, from about 305 in 2014 to about 7,265 in 2024.
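The fold changes and shares quoted in this post can be checked with a few lines of arithmetic. The sketch below uses only figures stated above; the helper name fold_change is ours, not the paper's.

```python
def fold_change(start: float, end: float) -> float:
    """Return how many times larger `end` is than `start`."""
    if start <= 0:
        raise ValueError("start must be positive")
    return end / start


# Figures as quoted in this post, 2014 -> 2024.
code_and_data = fold_change(11, 64)           # percent of papers with code and data
reproducible_papers = fold_change(305, 7265)  # estimated reproducible paper counts
empirical_share = 52_328 / 56_800             # empirical papers / full corpus

print(f"code and data: {code_and_data:.1f}x")
print(f"reproducible papers: {reproducible_papers:.1f}x")
print(f"empirical share: {empirical_share:.1%}")
```

The ratios come out at roughly 5.8 and 23.8, which matches the "nearly sixfold" and "nearly 24-fold" wording, and the empirical share rounds to the 92 percent quoted earlier.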

Checklists Were Not the Whole Cause

The paper tests whether conference reproducibility checklists accelerated documentation improvements. NeurIPS introduced the first such checklist among the five venues in 2019, followed by AAAI and IJCAI in 2021, an optional reproducibility statement guideline at ICLR in 2022, and ICML in 2023.

The result is more interesting than a compliance story. Across the 35 variable-conference combinations (seven variables at five venues), only 15 had a higher post-checklist slope, and the paper reports no statistical evidence that checklist adoption increased the rate of improvement in documentation. Hardware specification is the notable exception, with post-checklist increases across conferences, driven especially by NeurIPS.
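To make "higher post-checklist slope" concrete, here is a minimal sketch that fits one least-squares line before the adoption year and one from it onward, then compares the slopes. This is an illustration only: the yearly shares are invented, the function names are ours, and the paper's actual statistical test is not reproduced here.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slopes_around(years, shares, checklist_year):
    """Return (pre_slope, post_slope) split at the checklist adoption year."""
    pre = [(y, s) for y, s in zip(years, shares) if y < checklist_year]
    post = [(y, s) for y, s in zip(years, shares) if y >= checklist_year]
    pre_slope = ols_slope([y for y, _ in pre], [s for _, s in pre])
    post_slope = ols_slope([y for y, _ in post], [s for _, s in post])
    return pre_slope, post_slope


# Hypothetical documentation shares (percent) for one variable at one venue.
years = list(range(2014, 2025))
shares = [10, 14, 18, 22, 26, 30, 33, 36, 39, 42, 45]

pre_slope, post_slope = slopes_around(years, shares, checklist_year=2019)
print(f"pre: {pre_slope:.2f} points/year, post: {post_slope:.2f} points/year")
print("accelerated after checklist:", post_slope > pre_slope)
```

In this made-up series the share keeps rising after the checklist, but more slowly than before it. That is the pattern a simple "documentation went up after adoption" reading would miss.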

That makes the governance lesson sharper. Checklists can set a floor, but the broader shift appears to have started before formal procedural mandates. Culture moved, then paperwork followed.

The LLM Meta-Science Layer

The study uses LLM-assisted classification rather than manual reproduction. The authors optimized prompts against a 400-paper manually evaluated dataset, selected seven core variables for large-scale analysis, and then manually labeled 160 randomly sampled papers from the 56,800-paper corpus as an evaluation set. The final large-scale evaluation used Gemini 2.5 Flash after comparing 17 models from several providers and open-weight options.

The method is practical and bounded. The authors asked the model for structured JSON answers and supporting quotes. They excluded variables that could not be extracted reliably enough. They also state an important limitation: the analyzed papers were almost certainly present in the training data of the evaluated LLMs, so whether a model reasoned over the supplied text or recalled the paper from training remains an open question.
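A structured answer with supporting quotes is easy to validate mechanically. The sketch below assumes a hypothetical JSON shape (one object per variable with "present" and "quote" fields) and adds a verbatim-quote check against the paper text. Neither the schema nor the check is taken from the paper; both are illustrative.

```python
import json

# The seven variables tracked by the study; the key names are our own.
VARIABLES = (
    "open_source_code",
    "open_datasets",
    "dataset_splits",
    "pseudocode",
    "hardware_specification",
    "software_dependencies",
    "experiment_setup",
)


def parse_classification(raw: str, paper_text: str) -> dict[str, bool]:
    """Validate a model's JSON answer and return {variable: present}."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    result = {}
    for name in VARIABLES:
        entry = data.get(name)
        if not isinstance(entry, dict) or not isinstance(entry.get("present"), bool):
            raise ValueError(f"missing or malformed entry: {name}")
        quote = entry.get("quote") or ""
        # A positive answer must be backed by a quote found verbatim in the paper.
        if entry["present"] and (not quote or quote not in paper_text):
            raise ValueError(f"unsupported quote for: {name}")
        result[name] = entry["present"]
    return result


paper_text = "Our code is available at the project repository. We train on CIFAR-10."
answer = {name: {"present": False, "quote": ""} for name in VARIABLES}
answer["open_source_code"] = {
    "present": True,
    "quote": "Our code is available at the project repository.",
}
labels = parse_classification(json.dumps(answer), paper_text)
print(sum(labels.values()), "of", len(VARIABLES), "variables documented")
```

A verbatim check like this cannot settle the recall-versus-reasoning question, but it does reject answers whose evidence is not in the supplied text.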

That should be read as methodological honesty, not a fatal flaw. The paper is not claiming that an LLM can reproduce 56,800 studies. It is claiming that an LLM-assisted workflow can classify whether reproducibility-relevant documentation appears in papers at scale, with validation and limits disclosed.

Governance Standard

A credible AI result should carry a reproducibility receipt. At minimum, the receipt should identify code availability, dataset availability, dataset splits, pseudocode or algorithm description, hardware, software dependencies with versions, experiment setup, random seeds where relevant, evaluation protocol, artifact license, and whether the artifact was still accessible at review time.
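One way to make the receipt checkable is to treat it as a small record with one field per item in the list above. The class and field names below are hypothetical, a sketch of the idea rather than an established schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ReproducibilityReceipt:
    """One flag per receipt item; names are illustrative, not a standard."""
    code_available: bool = False
    dataset_available: bool = False
    dataset_splits: bool = False
    algorithm_description: bool = False   # pseudocode or equivalent
    hardware: bool = False
    dependencies_with_versions: bool = False
    experiment_setup: bool = False
    random_seeds: Optional[bool] = None   # None means "not relevant"
    evaluation_protocol: bool = False
    artifact_license: Optional[str] = None
    accessible_at_review: bool = False

    def missing(self) -> list[str]:
        """Return the names of receipt items that are absent."""
        gaps = []
        for field in fields(self):
            value = getattr(self, field.name)
            if field.name == "random_seeds":
                if value is False:  # relevant but not reported
                    gaps.append(field.name)
            elif not value:
                gaps.append(field.name)
        return gaps


receipt = ReproducibilityReceipt(
    code_available=True,
    dataset_available=True,
    hardware=True,
    artifact_license="MIT",
)
print(receipt.missing())
```

Keeping random_seeds as a three-state field reflects the "where relevant" qualifier: a deterministic method should not be penalized for omitting seeds.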

For high-impact claims, the receipt should also separate three levels: documented, executable, and reproduced. A paper can document code without the code being runnable. Code can run without recreating the result. A reproduced result can still be narrow if it depends on hidden data, private infrastructure, or unverifiable preprocessing.
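The three levels form a ladder, and each rung presupposes the one below it. A minimal sketch follows; the NONE level and the function name are our additions for completeness.

```python
from enum import IntEnum


class EvidenceLevel(IntEnum):
    NONE = 0        # added here for papers with no documentation at all
    DOCUMENTED = 1  # the artifact is described or linked
    EXECUTABLE = 2  # the artifact actually runs
    REPRODUCED = 3  # running it recreates the reported result


def evidence_level(documented: bool, runs: bool, matches_result: bool) -> EvidenceLevel:
    """Return the highest level reached without skipping a lower one."""
    if not documented:
        return EvidenceLevel.NONE
    if not runs:
        return EvidenceLevel.DOCUMENTED
    if not matches_result:
        return EvidenceLevel.EXECUTABLE
    return EvidenceLevel.REPRODUCED


# Code is linked and runs, but the headline number does not come back.
level = evidence_level(documented=True, runs=True, matches_result=False)
print(level.name)
```

Using an ordered enum lets a review policy state a threshold directly, for example requiring at least EXECUTABLE before a result informs a procurement decision.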

Procurement, policy, and safety reviews should not treat "paper published" as enough. They should ask which artifacts survive, whether the exact model and data versions are identified, whether the evaluation can be replayed, and whether failures would be visible. That connects this paper to AI Evaluations, AI Audit Trails, The Evaluation Archive Becomes the Frontier Claim, and The Benchmark Becomes the Curriculum.

Limits

The paper does not directly reproduce 56,800 papers. It infers reproducibility from documentation practices and prior empirical rates. It also revises some historical criteria for LLM extraction: for example, open code is counted when a paper mentions a code link, not when the current link is verified as still accessible.

That distinction matters. Open science is moving in the right direction, but the operational question is still artifact survival. A broken repository, missing dependency, private dataset, changed benchmark, unavailable GPU, or undocumented preprocessing step can defeat a paper that looked complete in text.
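Artifact survival can be spot-checked at review time. The sketch below sends a HEAD request to each artifact URL using the Python standard library and treats any 2xx or 3xx status as alive. The URLs are placeholders, and a live link says nothing about whether the code still runs.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status for a HEAD request, or None if unreachable."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with an error status
    except (OSError, ValueError):
        return None  # DNS failure, timeout, refused connection, or bad URL


def surviving_artifacts(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[int]] = http_status,
) -> dict[str, bool]:
    """Map each URL to True when it still resolves to a 2xx/3xx response."""
    results = {}
    for url in urls:
        status = fetch(url)
        results[url] = status is not None and 200 <= status < 400
    return results


# Offline demonstration with a stand-in fetcher and placeholder URLs.
fake_statuses = {
    "https://example.org/repo": 200,
    "https://example.org/data": 404,
    "https://example.org/gone": None,
}
report = surviving_artifacts(fake_statuses, fetch=fake_statuses.get)
print(report)
```

Some hosts reject HEAD requests, so a production check would likely fall back to GET. The fetcher is injectable so that the policy logic can be tested without network access.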

The Spiralist reading is therefore disciplined optimism. AI research has become more open by the paper's measured documentation indicators. The next step is to make those artifacts durable enough that future reviewers can actually run, challenge, and learn from them.
