The Evaluation Archive Becomes the Frontier Claim
Yanan Long's paper argues that public AI evaluation evidence is not a final leaderboard row. It is a selective, time-indexed archive shaped by reporting rules, benchmark revisions, missing entries, and publication timing.
The Paper
The paper is Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations, arXiv:2606.17005 [cs.AI, stat.ME], by Yanan Long. arXiv lists it as submitted on June 15, 2026.
The paper's target is a familiar failure mode: a public AI evaluation becomes a social fact because a model sits at or near the top of a leaderboard. The number travels farther than the archive that produced it. Long argues that repeated public records should instead be treated as a Bayesian inference and decision-audit problem.
The paper studies several public histories in distinct evidence roles: LiveBench and Open LLM Leaderboard v2 as primary longitudinal objective archives, LMArena as a preference stress test, and GAIA plus tau-bench as limited agentic pilots. Its strongest lesson is evidentiary: the archive must carry enough time, version, selection, and missingness information to support the claim being made.
Terminal Scores Lose the Path
A terminal leaderboard row compresses a repeated history into a single public cross-section. That can be useful for navigation, but it is weak evidence for frontier claims about saturation, timing, headroom, or rates of progress. The paper's abstract gives a constructed example: under one fixed reporting convention, a terminal-only example over 1,000 systems is compatible with two different pre-terminal histories, yielding times of 23.03 or 75.13 to reach within 0.05 of the ceiling under the same terminal-tail model.
The point is not that one of those exact numbers is the hidden truth. The point is that the terminal row erased the difference. If a public claim depends on how quickly the frontier approached a ceiling, or whether a benchmark has saturated, the path matters. The final rank is not the path.
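A toy calculation makes the erased difference concrete. The sketch below does not reproduce the paper's model or its 23.03 and 75.13 figures. It assumes a hypothetical exponential approach to the ceiling, and the function names `fit_rate` and `time_to_within`, the starting gaps, and the 10-unit observation window are all illustrative choices, not values from the paper.

```python
import math

CEILING = 1.0
TARGET_GAP = 0.05  # "within 0.05 of the ceiling"


def fit_rate(gap_start: float, gap_end: float, elapsed: float) -> float:
    """Decay rate r under the assumed form gap(t) = gap_start * exp(-r * t)."""
    return math.log(gap_start / gap_end) / elapsed


def time_to_within(gap_now: float, rate: float, target_gap: float = TARGET_GAP) -> float:
    """Additional time until the gap shrinks to target_gap at a constant rate."""
    if gap_now <= target_gap:
        return 0.0
    return math.log(gap_now / target_gap) / rate


# Two hypothetical histories that end on the same terminal frontier score (0.80)
# after 10 time units, but started from different gaps to the ceiling.
terminal_gap = CEILING - 0.80
fast = fit_rate(gap_start=0.60, gap_end=terminal_gap, elapsed=10.0)
slow = fit_rate(gap_start=0.25, gap_end=terminal_gap, elapsed=10.0)

time_fast = time_to_within(terminal_gap, fast)
time_slow = time_to_within(terminal_gap, slow)

print(f"fast history: {time_fast:.2f} more time units to reach the target gap")
print(f"slow history: {time_slow:.2f} more time units to reach the target gap")
```

Both histories produce an identical terminal row, yet the projected time to the target gap differs several-fold. Only the pre-terminal records distinguish them.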
This is the same evidence discipline the site applies to coding-agent performance benchmarks and benchmark culture. A score can be true and still be under-described. The missing pieces are often the parts governance needs most.
The Archive Contract
Long's archive contract treats metadata as evidence, not bookkeeping. Each source record should preserve the public source, snapshot unit, timestamp field, score fields, score orientation, rank handling, duplicate policy, missingness summary, and inclusion grade. Timestamps must be source-native or explicitly flagged as derived.
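One way to picture the contract is as a typed record that refuses to leave any field blank. The following is a minimal sketch, assuming a Python dataclass representation. The class name `SourceRecord`, the helper `contract_problems`, the field spellings, and the example values are hypothetical and are not taken from the paper's own schema.

```python
from dataclasses import dataclass, fields
from typing import Literal


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical container for the contract fields listed above."""
    public_source: str
    snapshot_unit: str
    timestamp_field: str
    timestamp_provenance: Literal["source_native", "derived"]
    score_fields: tuple[str, ...]
    score_orientation: Literal["higher_is_better", "lower_is_better"]
    rank_handling: str
    duplicate_policy: str
    missingness_summary: str
    inclusion_grade: str


def contract_problems(record: SourceRecord) -> list[str]:
    """Return human-readable contract violations; an empty list means none found."""
    problems = []
    for f in fields(record):
        # Empty strings and empty tuples count as unrecorded metadata.
        if getattr(record, f.name) in ("", ()):
            problems.append(f"{f.name} is empty")
    # Literal types are not enforced at runtime, so check provenance explicitly.
    if record.timestamp_provenance not in ("source_native", "derived"):
        problems.append("timestamp_provenance must be 'source_native' or 'derived'")
    return problems


# Illustrative record with made-up values (not a description of any real archive).
example = SourceRecord(
    public_source="example-leaderboard (hypothetical)",
    snapshot_unit="monthly release",
    timestamp_field="release_date",
    timestamp_provenance="derived",
    score_fields=("overall",),
    score_orientation="higher_is_better",
    rank_handling="ties share the lower rank",
    duplicate_policy="keep latest submission per system",
    missingness_summary="3 of 40 systems lack a release date",
    inclusion_grade="primary",
)
print(contract_problems(example))
```

Treating provenance as a required, closed-vocabulary field is the design point: a derived timestamp is allowed, but it cannot pass silently as a source-native one.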
The paper's compact source-validation readout assigns different evidence roles. LiveBench and Open LLM Leaderboard v2 are primary objective archives. LMArena is a preference stress test rather than an objective archive. GAIA is a secondary agentic pilot, and tau-bench is an agentic stress-test pilot. LiveCodeBench, HELM Capabilities, and SWE-bench Verified are explicitly excluded from the evidence baseline because the public histories used in the paper lacked the versioned source tables needed for temporal reconstruction.
That exclusion is useful. It shows what serious evaluation hygiene looks like: do not quietly pour every impressive benchmark into the same evidentiary bucket. Name which sources can support which claims, and name which sources cannot.
The Audit Gates
The paper separates archive-level evidence from model endorsement. A candidate selection-aware frontier model is tested against synthetic recovery, primary objective-archive prediction, preference-regime transfer, and posterior uncertainty calibration. The paper reports that this candidate fails all four falsification gates, so fixed audit gates reject its stronger claims.
That negative result is the contribution. The protocol is not a machine for laundering a clever model into authority. It is a way to falsify unsupported claims before they become public certainty. A model that cannot recover synthetic truth, predict held-out archive observations, transfer to the preference regime, or calibrate uncertainty should not be used to announce strong conclusions about the frontier.
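The logic of fixed gates can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the four gate names follow the article, but the `Gate` class, the metric directions, and every threshold value are hypothetical placeholders chosen only to show the shape of a pre-declared test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """A pass/fail test whose threshold is fixed before results are seen."""
    name: str
    threshold: float
    higher_is_better: bool

    def passes(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold


# Hypothetical thresholds; the paper's actual gate definitions are not reproduced here.
GATES = (
    Gate("synthetic_recovery", threshold=0.90, higher_is_better=True),
    Gate("primary_archive_prediction", threshold=0.10, higher_is_better=False),
    Gate("preference_regime_transfer", threshold=0.15, higher_is_better=False),
    Gate("posterior_calibration", threshold=0.05, higher_is_better=False),
)


def audit(observed: dict[str, float]) -> dict[str, bool]:
    """Evaluate every gate; a metric that was never reported counts as a failure."""
    return {
        g.name: (g.name in observed and g.passes(observed[g.name]))
        for g in GATES
    }


def endorsed(observed: dict[str, float]) -> bool:
    """A claim is endorsed only if it survives all declared gates."""
    return all(audit(observed).values())


# Made-up metrics for a candidate that misses every gate.
candidate = {
    "synthetic_recovery": 0.55,
    "primary_archive_prediction": 0.40,
    "preference_regime_transfer": 0.60,
    "posterior_calibration": 0.30,
}
print(audit(candidate), endorsed(candidate))
```

Because the thresholds live in a frozen structure declared before evaluation, a disappointing result cannot be rescued by quietly moving the bar afterward.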
The governance value is the fixed gate. If an evaluation claim is submitted against the protocol, it is supported or falsified by the archive instead of being left as narrative pressure around a leaderboard.
Agentic Evaluation Needs More Metadata
The paper is especially careful about agentic benchmarks. Agentic records may index not only a base model, but a system configuration: prompt, tools, memory, planner, environment policy, retry policy, scaffold, judge, and human-intervention rule. GAIA and tau-bench show archive applicability for aggregate agentic histories, but the paper says they do not provide full agent-trace observability.
That matters because deployed agents can fail in the harness as well as in the model. A leaderboard entry can hide the tool budget, browsing policy, scaffold identity, retry limit, simulator version, grader version, and human repair path. Without those fields, a future auditor cannot tell whether progress came from model capability, tool access, benchmark revision, harness tuning, or selective reporting.
For this site, the agentic implication is direct: no consequential agent evaluation should be treated as complete unless the archive preserves the system configuration that generated the score.
Governance Standard
Any public frontier-evaluation claim should come with an archive receipt. The receipt should name the source, retrieval date, snapshot unit, timestamp provenance, score orientation, benchmark version, task slice, system identity, rank rule, duplicate rule, missingness rule, inclusion grade, and whether the source supports primary objective prediction, preference stress testing, agentic applicability, or only contextual discussion.
For agentic evaluations, the receipt should also record the base model, scaffold, prompts where available, tools, memory policy, environment version, tool budget, retry policy, judge version, human-intervention policy, and whether full execution traces are available. A terminal score without these fields should be treated as a claim teaser, not a deployment-grade evidence packet.
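The receipt described above lends itself to a simple completeness check. The sketch below is one possible encoding, assuming the receipt is a plain dictionary. The key spellings, the `missing_fields` and `classify` helpers, and the two labels are hypothetical conveniences drawn from the wording of this section, not a published schema.

```python
# Field names paraphrase the receipt items listed above; spellings are assumptions.
BASE_FIELDS = (
    "source", "retrieval_date", "snapshot_unit", "timestamp_provenance",
    "score_orientation", "benchmark_version", "task_slice", "system_identity",
    "rank_rule", "duplicate_rule", "missingness_rule", "inclusion_grade",
    "evidence_role",
)
AGENTIC_FIELDS = (
    "base_model", "scaffold", "prompts", "tools", "memory_policy",
    "environment_version", "tool_budget", "retry_policy", "judge_version",
    "human_intervention_policy", "full_traces_available",
)


def missing_fields(receipt: dict, agentic: bool = False) -> list[str]:
    """List required fields that are absent, None, or an empty string.

    An explicit value such as False or "unavailable" counts as recorded,
    since stating that traces or prompts are not available is itself evidence.
    """
    required = BASE_FIELDS + (AGENTIC_FIELDS if agentic else ())
    return [name for name in required if receipt.get(name) in (None, "")]


def classify(receipt: dict, agentic: bool = False) -> str:
    """Label a receipt using the section's own distinction."""
    return "evidence packet" if not missing_fields(receipt, agentic) else "claim teaser"


# Illustrative receipt with placeholder values for every base field.
receipt = dict.fromkeys(BASE_FIELDS, "recorded")
print(classify(receipt))                 # complete for a non-agentic claim
print(missing_fields(receipt, agentic=True))  # agentic fields still absent
```

The check is deliberately blunt. It verifies only that each field was filled in, not that the value is correct, so it is a floor for review rather than a substitute for it.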
Fixed audit gates should be declared before the claim is accepted. The gate may test held-out prediction, synthetic recovery, cross-regime transfer, uncertainty calibration, contamination handling, or benchmark-revision sensitivity. The important point is that the claim has to survive a named test rather than inherit authority from rank.
This belongs beside AI Evaluations, AI Audit Trails, Chatbot Arena and LMArena, GAIA Benchmark, Tau-bench, and AI Agent Observability. The Spiralist rule is that a frontier claim is only as strong as the archive path that can reconstruct it.
Limits
The paper is candid about limits. Real public archives provide future-observation validation rather than direct access to latent frontier truth. Candidate-pool reconstruction remains partly assumption-driven. The Bayesian decision layer uses stylized loss families and synthetic posterior draws, so its timing and action readouts should not be generalized into operational policy without elicited utilities and robustness checks.
Those limits are not a defect in the governance lesson. They are the lesson. Public AI evaluation claims should say what their archive can support, what it cannot support, which assumptions carry the inference, and which sources were excluded because the reconstruction evidence was not good enough.
Sources
- Yanan Long, Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations, arXiv:2606.17005 [cs.AI, stat.ME], submitted June 15, 2026.
- arXiv HTML: Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations, reviewed for the abstract, archive contract, source-validation table, observability regime, audit gates, results, and limitations.
- arXiv PDF: Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations.
- Related pages: AI Evaluations, Chatbot Arena and LMArena, GAIA Benchmark, Tau-bench, The Benchmark Becomes the Curriculum, and The Performance Benchmark Becomes the Measurement Trap.