Blog · arXiv Analysis · Last reviewed July 2, 2026

The Reanalysis Agent Becomes the Reproducibility Screen

The paper's important move is not claiming that LLMs can replace expert reanalysts. It shows a narrower and more useful possibility: an agent can become a first-pass reproducibility screen for empirical claims.

The screen is powerful because it scales. It is limited because exact effect-size recovery remains uneven. That tension is the paper's central result, and it determines how the artifact should be used.

The Paper

The paper is Automated reproducibility assessments in the social and behavioral sciences using large language models, arXiv:2606.13670 [cs.AI]. arXiv lists version 1 as submitted on June 11, 2026 and version 2 as revised on June 25, 2026, with DOI 10.48550/arXiv.2606.13670.

The authors are Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger, Stefan Rose, Sarah Ball, Bolei Ma, Frauke Kreuter, Markus Weinmann, and Stefan Feuerriegel. The affiliations span LMU Munich, the Munich Center for Machine Learning, the University of Maryland, College Park, and the University of Cologne.

The research question is practical: can an LLM agent take a published empirical claim, the associated study materials, and a dataset, then write and execute code that checks whether the reported result can be recovered? The answer is not yes or no. It is closer to: yes for scalable triage, not yet for replacement-grade verification.

The Pipeline

The pipeline treats the model as a statistical analyst. For each study, the agent receives a focal claim, the study data, and an article-context condition. It then writes and executes statistical code in an isolated sandbox and returns a structured result. The reported statistic is then converted into Cohen's d so that results can be compared across studies.
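The conversion step is easier to picture with a small example. The sketch below uses two textbook conversions into Cohen's d, one from an independent-samples t statistic and one from a correlation coefficient. The function names are hypothetical, the equal-group-size assumption is mine, and nothing here is taken from the paper's own conversion rules, which this post does not reproduce.

```python
import math


def d_from_t(t: float, df: float) -> float:
    """Approximate Cohen's d from an independent-samples t statistic.

    Assumption (not from the paper): two groups of roughly equal size,
    so that d is approximately 2t / sqrt(df).
    """
    if df <= 0:
        raise ValueError("df must be positive")
    return 2.0 * t / math.sqrt(df)


def d_from_r(r: float) -> float:
    """Convert a correlation coefficient r into Cohen's d.

    Textbook conversion: d = 2r / sqrt(1 - r^2).
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 2.0 * r / math.sqrt(1.0 - r * r)


if __name__ == "__main__":
    print(round(d_from_t(2.0, 100), 3))
    print(round(d_from_r(0.6), 3))
```

Even in this toy form, the choice of conversion is a modelling decision: the same reported statistic can map to different d values under different design assumptions.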

The main model is Claude Opus 4.7. The paper also reports robustness checks with OpenAI GPT-5.5 and Zhipu GLM-5.1. Each study is run five independent times, so the evaluation can observe model-run variability rather than treating one completion as the whole system.

The authors add two important guardrails. The agent is instructed not to copy test statistics printed in the paper, and the submitted statistic must come from code executed on the data. The output schema asks for the statistic type, statistic value, degrees of freedom, sample size, p-value, qualitative conclusion, dependent variable, predictor, sample definition, model specification, controls, and operationalization notes.
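A structured output of this kind could be modelled roughly as follows. This is a minimal sketch, not the repository's schema: the class name, field names, and validation rules are hypothetical and simply mirror the fields listed above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ReanalysisResult:
    """Hypothetical record mirroring the fields named in the post."""
    statistic_type: Optional[str] = None
    statistic_value: Optional[float] = None
    degrees_of_freedom: Optional[float] = None
    sample_size: Optional[int] = None
    p_value: Optional[float] = None
    qualitative_conclusion: Optional[str] = None
    dependent_variable: Optional[str] = None
    predictor: Optional[str] = None
    sample_definition: Optional[str] = None
    model_specification: Optional[str] = None
    controls: Optional[str] = None
    operationalization_notes: Optional[str] = None


def parse_result(payload: dict) -> ReanalysisResult:
    """Build a record from a decoded JSON object, rejecting unknown keys."""
    known = {f.name for f in fields(ReanalysisResult)}
    unknown = set(payload) - known
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    result = ReanalysisResult(**payload)
    # Illustrative sanity check: a p-value, if present, must be a probability.
    if result.p_value is not None and not 0.0 <= result.p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    return result


if __name__ == "__main__":
    print(parse_result({"statistic_type": "t", "statistic_value": 2.1,
                        "degrees_of_freedom": 98, "p_value": 0.04}))
```

Keeping every field optional lets a run report missing values explicitly instead of guessing, which is what makes invalid runs countable later.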

The sandbox itself is part of the method. The prompt documents Python, bash, a think tool, and image viewing; common scientific libraries are available; state is not preserved across tool calls; and the agent is told to return null when relevant data cannot be read with the available readers. That makes the pipeline closer to a reproducible instrument than a casual chat prompt.

The Corpus

The study analyzes 180 published social and behavioral science studies from the Systematizing Confidence in Open Research and Evidence, or SCORE, project. The final corpus combines 84 papers from the Multi100 human reanalysis setting with 96 additional SCORE papers.

The authors begin from studies where focal claims and data could be connected to convertible statistical evidence. They then filter for available data, directional claims, and original statistics that can be mapped into Cohen's d. The resulting corpus covers psychology, economics, and political science.

For 11 of the 180 studies, none of the five LLM runs produced a valid Cohen's d. The main analysis therefore reports on 169 studies with valid LLM effect-size estimates, while also showing sensitivity numbers where invalid estimates are counted as non-recovered.

Main Results

On the 169 valid studies, Claude Opus 4.7 reaches the same qualitative conclusion as the original study in 80% of cases. Its mean effect size lands within the strict ±0.05 Cohen's d tolerance in 24% of studies, or 22% if the 11 invalid studies are counted as failures.

The broader ±0.20 tolerance changes the picture but not the lesson. Under that tolerance, the LLM lands within range in 50% of valid studies, or 47% when invalid studies are counted as non-recovered. That is enough to make the method useful as a screen, but not enough to treat it as a definitive reproducibility decision.
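The two ways of reporting recovery, among valid studies only and with invalid studies counted as non-recovered, can be sketched in a few lines. The data below are toy values, the function names are hypothetical, and averaging only the valid runs of a study is an assumption about the aggregation rule rather than a confirmed detail of the paper.

```python
from statistics import mean
from typing import List, Optional, Sequence, Tuple

Study = Tuple[float, Sequence[Optional[float]]]  # (original d, per-run LLM d or None)


def mean_valid_d(run_ds: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the valid runs; None if no run produced a Cohen's d (assumed rule)."""
    valid = [d for d in run_ds if d is not None]
    return mean(valid) if valid else None


def recovery_rates(studies: List[Study], tolerance: float) -> Tuple[float, float]:
    """Return (rate among valid studies, rate with invalid studies as failures)."""
    if not studies:
        raise ValueError("need at least one study")
    recovered = 0
    n_valid = 0
    for original_d, run_ds in studies:
        llm_d = mean_valid_d(run_ds)
        if llm_d is None:
            continue  # invalid study: excluded from the first rate only
        n_valid += 1
        if abs(llm_d - original_d) <= tolerance:
            recovered += 1
    rate_valid = recovered / n_valid if n_valid else float("nan")
    return rate_valid, recovered / len(studies)


if __name__ == "__main__":
    toy = [
        (0.50, [0.52, 0.48, None]),  # mean close to the original
        (0.30, [0.60, 0.55]),        # mean far from the original
        (0.20, [None, None]),        # no valid run
    ]
    print(recovery_rates(toy, tolerance=0.05))
    print(recovery_rates(toy, tolerance=0.20))
```

The second rate can only be lower than or equal to the first, which is why the paper's sensitivity numbers sit a few points below the headline figures.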

The distribution matters. The paper reports many relatively small deviations and a small number of large discrepancies. It also finds higher reproducibility when original source data are available than when replication data are used as the reanalysis target, which is consistent with prior evidence that newly collected replication data often yield smaller effects than the original studies reported.

The most honest interpretation is that the agent often finds the same substantive direction without exactly reconstructing the original estimate. That can mean model error. It can also mean the original claim leaves room for multiple defensible operationalizations.

Human Comparison

The human comparison uses the subset of 84 papers with human reanalysis benchmarks. Two of those studies fail to yield a valid LLM Cohen's d, leaving 82 studies for the direct comparison.

In that subset, the LLM recovers the original effect size within the strict ±0.05 tolerance in 40% of studies, compared with 28% for the human reanalyses. Under the broader ±0.20 tolerance, the LLM reaches 65% and humans reach 66%. The qualitative conclusion matches the original claim in 95% of LLM reanalyses and 83% of human reanalyses.

The paper is careful about what this means. Human analysts in Multi100 may have pursued their own defensible specifications, while the LLM pipeline is instructed to test the focal claim using the paper and available data. A higher match to the original result can therefore reflect a narrower, more paper-anchored path rather than superior scientific judgment.

The correlation result reinforces that caution. The LLM-derived effect sizes correlate moderately with original effect sizes, with r = 0.46 and p < 0.001, but weakly with human reanalysis effect sizes, with r = 0.11 and p = 0.31. The agent is not simply becoming a fast human analyst. It is becoming a different kind of screening instrument.

Information Context

The paper tests three article-context conditions: full paper, full paper with the methods section removed, and abstract only. Strict effect-size recovery is 24%, 24%, and 22% across those conditions. Broad recovery is 50%, 52%, and 46%.

The statistical tests show no evidence of meaningful differences across those context conditions. That is interesting and unsettling. More methodological context does not automatically move the model closer to the original estimate.
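For readers who want to see what a test of this kind looks like, here is a minimal sketch of Cochran's Q for a binary outcome (recovered or not) measured on the same studies under several conditions. The repository README mentions Cochran/Friedman tests among its evaluation scripts, but treating Cochran's Q as the test behind this specific comparison is an assumption, and the data and function names below are illustrative only.

```python
import math
from typing import List, Sequence


def cochran_q(table: Sequence[Sequence[int]]) -> float:
    """Cochran's Q for a studies-by-conditions table of 0/1 outcomes."""
    k = len(table[0])
    if any(len(row) != k for row in table):
        raise ValueError("all rows must have the same number of conditions")
    col_totals: List[int] = [sum(row[j] for row in table) for j in range(k)]
    row_totals: List[int] = [sum(row) for row in table]
    n = sum(row_totals)
    denom = k * n - sum(r * r for r in row_totals)
    if denom == 0:
        raise ValueError("no within-study variation across conditions")
    return (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom


def p_value_three_conditions(q: float) -> float:
    """Asymptotic p-value for exactly three conditions.

    With k = 3 the reference distribution is chi-square with 2 degrees of
    freedom, whose survival function is exp(-q / 2).
    """
    return math.exp(-q / 2.0)


if __name__ == "__main__":
    # Toy data: rows are studies; columns are full paper, methods removed, abstract only.
    toy = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 0]]
    q = cochran_q(toy)
    print(round(q, 3), round(p_value_three_conditions(q), 3))
```

Studies that succeed or fail under every condition contribute nothing to Q, so the test is driven entirely by the studies whose outcome changes with context.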

One interpretation is that the agent is guided heavily by the focal claim and data structure once it finds a plausible operationalization. Another is that the available context is not being used with enough discipline to constrain the reanalysis. Either way, the result is a warning against treating "gave the full paper to the model" as a complete provenance claim.

The robustness checks keep the central story intact. Strict recovery is 24% for Claude Opus 4.7, 34% for GPT-5.5, and 31% for GLM-5.1; broad recovery is 50%, 58%, and 55%; qualitative conclusion match is 80%, 81%, and 83%. Prompt perspective also changes the results only modestly: neutral, confirmatory, and critical framings produce strict recovery of 24%, 24%, and 28% and conclusion matches of 80%, 82%, and 77%.

The contamination check is also useful. Claude Opus 4.7 reports recalling 14 of the 180 papers in a memorization probe, but none of the recalled Cohen's d values fall within either the strict or broad tolerance around the original effect size. That does not eliminate subtler training-exposure effects, but it weakens the simplest "it just memorized the answer" explanation.

Artifacts

The authors provide a public code and data repository at tobihol/agentic-reproducibility. The repository README describes the protocol, setup, data reconstruction path, experiment scripts, evaluation scripts, result exports, and GUIDE-LLM reporting checklist. The repository license file identifies the code as MIT licensed.

The repository makes the artifact boundaries explicit. Per-paper datasets are fetched from OSF; abstract-only parsed paper text is committed because abstracts are redistributable; full and methods-removed paper corpora are not redistributed because of licensing restrictions. That is the kind of limitation a reproducibility artifact should state plainly.

The experiment scripts live under pipeline/agentic_analysis/{multi100,score}/. The README names baseline Claude runs, information-context runs, perspective runs, GPT robustness runs, and GLM robustness runs. It also notes Docker for the Inspect AI sandbox, OpenRouter-routed model calls, five epochs, medium reasoning effort, a max of 10 sandboxes, and a default per-sample cost limit of $5.00.

The evaluation scripts include summary statistics, figure generation, Cochran/Friedman tests, and export of a flat per-run CSV. That matters because the output is not just prose. It is a pipeline with files, runners, logs, scoring code, and figures that can be inspected.

Screening Receipt

A reproducibility screen should ship its own receipt. For each checked paper, the record should name the paper version, focal claim, source dataset, data-access path, article-context condition, parsed-paper artifact, model name, provider route, prompt version, sandbox image, available packages, run count, cost cap, tool calls, generated analysis code, submitted JSON, scorer version, effect-size conversion rule, tolerance threshold, invalid-run handling, conclusion-vote rule, and repository revision.

The receipt should separate three different claims. The first is computational: the agent produced executable analysis under a stated environment. The second is statistical: the extracted statistic maps into an effect size under a stated conversion. The third is epistemic: the result supports, contradicts, or leaves inconclusive the original substantive claim. Collapsing those claims is how an audit screen becomes a false verdict machine.
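One way to keep the three claims from collapsing is to store them as separate fields and refuse combinations that skip a layer. The sketch below is a design illustration, not part of the paper or its repository: the class, the field names, and the two consistency rules are all hypothetical, and only a handful of the receipt fields are shown.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    SUPPORTS = "supports"
    CONTRADICTS = "contradicts"
    INCONCLUSIVE = "inconclusive"


@dataclass(frozen=True)
class ScreeningReceipt:
    """Hypothetical receipt with a small subset of the fields a real one would carry."""
    paper_version: str
    focal_claim: str
    model_name: str
    repository_revision: str
    code_executed: bool              # computational claim
    effect_size_d: Optional[float]   # statistical claim
    conversion_rule: Optional[str]   # how the statistic was mapped into d
    verdict: Verdict                 # epistemic claim

    def __post_init__(self) -> None:
        # Rule 1 (assumed): no substantive verdict without executed analysis code.
        if self.verdict is not Verdict.INCONCLUSIVE and not self.code_executed:
            raise ValueError("a verdict requires executed analysis code")
        # Rule 2 (assumed): an effect size is meaningless without its conversion rule.
        if self.effect_size_d is not None and not self.conversion_rule:
            raise ValueError("effect_size_d requires a stated conversion_rule")


if __name__ == "__main__":
    receipt = ScreeningReceipt(
        paper_version="v2",
        focal_claim="example focal claim",
        model_name="example-model",
        repository_revision="example-revision",
        code_executed=True,
        effect_size_d=0.31,
        conversion_rule="t to d, equal group sizes assumed",
        verdict=Verdict.SUPPORTS,
    )
    print(receipt.verdict.value)
```

The point of the validation is modest: a record that cannot express "supported" without also recording how the number was produced is harder to misread as a verdict.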

This connects directly to Research Integrity, Transparency, AI Audit Trails, AI Audits and Third-Party Assurance, AI Agents, The Peer Reviewer Becomes the Model Referee, and The Formal Proof Becomes the Translation Gap.

Limits

The main limitation is visible in the headline numbers. Qualitative agreement is high, exact effect-size recovery is much lower, and some studies produce no valid effect size at all. That is not a failure of the paper. It is the paper's useful boundary condition.

The corpus is also bounded. It covers a standardized set of social and behavioral science studies from SCORE, not all empirical science. Cohen's d makes heterogeneous studies comparable, but it also forces different statistical settings into one metric and inherits the assumptions of the conversion.

Automated reanalysis checks whether a result can be recovered from the available materials and data. It does not show that the finding generalizes to newly collected data. Reproducibility and replication remain different tests.

There is also a governance risk. Once automated screens become routine, authors may optimize manuscripts, repositories, and claims to pass the screen rather than improve the underlying science. The right institutional use is therefore not "LLM approved." It is "this paper passed or failed a documented first-pass check, and here is the trace that a human can audit."

Sources

Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger, Stefan Rose, Sarah Ball, Bolei Ma, Frauke Kreuter, Markus Weinmann, and Stefan Feuerriegel, Automated reproducibility assessments in the social and behavioral sciences using large language models, arXiv:2606.13670 [cs.AI], submitted June 11, 2026; revised June 25, 2026.
arXiv HTML: Automated reproducibility assessments in the social and behavioral sciences using large language models, reviewed for author affiliations, abstract, methods, result sections, appendix, prompt configuration, data availability, and code availability.
arXiv PDF: Automated reproducibility assessments in the social and behavioral sciences using large language models, reviewed for numerical results, statistical comparisons, limitations, human benchmark interpretation, and memorization-test details.
arXiv TeX source: e-print source for arXiv:2606.13670, reviewed for source-level methods, appendix wording, figures, and exact prompt text.
Official repository: tobihol/agentic-reproducibility, reviewed for protocol, setup, data reconstruction, experiment runners, evaluation scripts, repository structure, results exports, and GUIDE-LLM checklist.
Repository README raw source: agentic-reproducibility README.md, reviewed for artifact details and reproducibility instructions.
Repository license: MIT License, reviewed for license status.
