Last reviewed June 25, 2026

The Evaluation Schema Becomes the Public Ledger

The June 2026 arXiv paper Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results, by Jan Batzner, Sree Harsha Nelaturu, Damian Stachura, Anastassia Kornilova, and 44 coauthors, treats AI evaluation scores as records that need provenance, versioning, and public memory.

Scores Need Chain of Custody

The paper, arXiv:2606.14516 [cs.AI, cs.CL, cs.CY], was submitted on June 12, 2026. It begins from a mundane problem with high governance stakes: AI evaluation scores are scattered across leaderboards, papers, blog posts, harness logs, and custom repositories, often in incompatible formats.

A benchmark score looks like a fact, but it is really a compressed event. Someone ran a model, through a particular access path, with particular prompts, decoding settings, benchmark version, metric definition, scorer, and data-processing stack. When that context disappears, the number remains but the evidence thins.

What the Paper Builds

Every Eval Ever is not a new evaluation harness. The authors describe it as a translation layer above existing sources. It defines a shared, versioned JSON schema for AI evaluation results, with an optional instance-level companion schema for prompts, outputs, references, scores, and metadata. The schema captures source provenance, model access mode, generation configuration, benchmark metadata, metric semantics, and, where available, sample-level data.
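As a rough illustration of what one such record could look like, the Python sketch below models a single aggregate result and serializes it to JSON. The field names, the version string, and every example value are assumptions made for this post; they are not the schema the authors publish.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class EvalRecord:
    """One aggregate evaluation result. All field names are hypothetical."""
    schema_version: str
    source_url: str           # where the score was originally published
    model_id: str
    access_mode: str          # e.g. local weights versus a hosted API
    benchmark: str
    metric_name: str
    higher_is_better: bool    # metric direction, stored explicitly
    score: float
    generation_config: dict = field(default_factory=dict)
    benchmark_version: Optional[str] = None   # None means "not recorded"
    instances_path: Optional[str] = None      # optional sample-level companion file


# Placeholder values only; none of these come from the paper.
record = EvalRecord(
    schema_version="0.1.0",
    source_url="https://example.org/leaderboard",
    model_id="example-model-7b",
    access_mode="local_weights",
    benchmark="example-benchmark",
    metric_name="accuracy",
    higher_is_better=True,
    score=0.612,
    generation_config={"temperature": 0.0},
)

serialized = json.dumps(asdict(record), indent=2, sort_keys=True)
restored = json.loads(serialized)
print(serialized)
```

Fields left as None serialize to null, so an absent benchmark version stays visible in the record rather than silently disappearing.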

The project also supplies converters for HELM, lm-eval-harness, and Inspect AI, plus contributed converters for leaderboards and other formats. Submitted records are validated against the schema before entering the repository. The paper says the crowdsourced Hugging Face datastore spans 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats, with more than 200,000 aggregate results as of May 4, 2026.
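The validation gate can be pictured with a small standard-library check. This is a minimal sketch of the idea only: the required fields and the validate function are hypothetical, and a real project would more likely validate against the published JSON schema with a dedicated validator.

```python
# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {
    "schema_version": str,
    "source_url": str,
    "model_id": str,
    "benchmark": str,
    "metric_name": str,
    "score": (int, float),
}


def validate(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means accepted."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing required field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for field: {name}")
    return errors


good = {
    "schema_version": "0.1.0",
    "source_url": "https://example.org/leaderboard",
    "model_id": "example-model-7b",
    "benchmark": "example-benchmark",
    "metric_name": "accuracy",
    "score": 0.612,
}
bad = {k: v for k, v in good.items() if k != "score"}

print(validate(good))
print(validate(bad))
```

Returning every problem at once, rather than failing on the first, gives a contributor a complete list to fix before resubmitting.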

What the Ledger Reveals

The paper's examples show why a score ledger matters. It notes that LLaMA 65B has been reported at both 63.7 and 48.8 on MMLU, with the difference traced to different evaluation harnesses. Every Eval Ever does not declare one number morally correct. It stores both as separate records with metadata, so the discrepancy becomes visible instead of folklore.
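A small sketch shows how keeping both records makes the gap queryable. The two scores are the ones quoted above; the harness labels harness_a and harness_b are placeholders, since this post does not name the harnesses, and the record layout and tolerance are assumptions.

```python
from collections import defaultdict

# Two coexisting records for the same model and benchmark.
# "harness_a" and "harness_b" are placeholder labels, not real harness names.
records = [
    {"model": "LLaMA 65B", "benchmark": "MMLU", "score": 63.7, "harness": "harness_a"},
    {"model": "LLaMA 65B", "benchmark": "MMLU", "score": 48.8, "harness": "harness_b"},
]


def find_discrepancies(rows, tolerance=1.0):
    """Group rows by (model, benchmark) and report groups whose scores
    differ by more than `tolerance` points. No record is dropped or merged."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["model"], row["benchmark"])].append(row)

    conflicts = {}
    for key, group in grouped.items():
        scores = [r["score"] for r in group]
        spread = max(scores) - min(scores)
        if spread > tolerance:
            conflicts[key] = {
                "spread": round(spread, 1),
                "harnesses": sorted(r["harness"] for r in group),
            }
    return conflicts


conflicts = find_discrepancies(records)
print(conflicts)
```

The function only reports the spread and the metadata that differs; deciding which run is more appropriate for a given claim remains a human judgment.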

The datastore also exposes missing context. The authors report that inference platform is unknown or omitted in 98% of evaluation rows by micro-average (every row counted equally), and reported at a rate of only 27% by macro-average (each of the 31 formats weighted equally). By the same macro-average, temperature is reported at a rate of 77% and max tokens at 30%. These are not cosmetic fields; they are conditions under which a score was produced.
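The gap between the two averages is easier to see with toy numbers. In the sketch below, the three formats and their counts are invented for illustration and are not the paper's data: one very large format never reports a field, while two small formats always do.

```python
# Invented counts for illustration only; these are not figures from the paper.
formats = {
    "format_a": {"rows": 9800, "reported": 0},
    "format_b": {"rows": 100, "reported": 100},
    "format_c": {"rows": 100, "reported": 100},
}


def micro_rate(stats):
    """Pool all rows together, so large formats dominate the result."""
    total_rows = sum(f["rows"] for f in stats.values())
    total_reported = sum(f["reported"] for f in stats.values())
    return total_reported / total_rows


def macro_rate(stats):
    """Average the per-format rates, so each format counts once."""
    per_format = [f["reported"] / f["rows"] for f in stats.values()]
    return sum(per_format) / len(per_format)


micro = micro_rate(formats)
macro = macro_rate(formats)
print(f"micro: {micro:.1%}, macro: {macro:.1%}")
```

With these toy counts the field is reported in 2% of pooled rows but at an average per-format rate of about 67%, which is why the two averages should always be labeled when they are quoted.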

In a reproducibility case study, the authors converted official HELM records and local reproductions for three models across fourteen single-turn HELM benchmarks. The schema surfaced mismatched example sets, empty or truncated completions, stochastic disagreement, and residual score differences. It could expose the symptom, but where serving details were missing it could not always identify the cause.

Why This Matters

AI governance increasingly asks evaluations to do institutional work. A score can influence release gates, procurement, press claims, safety cases, model cards, audits, and regulatory risk assessments. If the score cannot be traced, it becomes a badge rather than evidence.

The public ledger framing changes the burden. A lab can still report a headline result, but the result should travel with the run identity, source relationship, model access path, benchmark version, metric direction, uncertainty, generation settings, and missing fields. For agents, it should also preserve tool and sandbox configuration. This does not make the score true. It makes the score inspectable.

The best feature of Every Eval Ever is not centralization for its own sake. It is that conflicting records can coexist. A public evaluation ecosystem should not pretend disagreement is an error whenever two methodologically valid runs differ. Sometimes the disagreement is the finding.

Governance Standard

Any public AI evaluation should produce an evaluation receipt: model identifier, provider or local engine, exact checkpoint or API name, benchmark name and version, split, prompt template, generation settings, scorer, metric direction, sample count, timestamp, evaluator relationship, source URL, code or harness version, uncertainty fields, and known missing metadata.
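One way to make "known missing metadata" concrete is to let the receipt report its own gaps. The sketch below covers only a subset of the fields listed above; the class, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvaluationReceipt:
    """A partial, hypothetical receipt. None means the detail was not recorded."""
    model_id: str
    benchmark: str
    metric_name: str
    score: float
    benchmark_version: Optional[str] = None
    prompt_template: Optional[str] = None
    temperature: Optional[float] = None
    max_tokens: Optional[int] = None
    harness_version: Optional[str] = None
    inference_platform: Optional[str] = None

    def missing_metadata(self) -> list:
        """List the names of fields that were never filled in."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Placeholder values only.
receipt = EvaluationReceipt(
    model_id="example-model-7b",
    benchmark="example-benchmark",
    metric_name="accuracy",
    score=0.612,
    temperature=0.0,
)
missing = receipt.missing_metadata()
print(missing)
```

Because a temperature of 0.0 is a real value and not an absence, the check compares against None and does not rely on truthiness.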

For high-impact systems, the receipt should be preserved with the release decision. For benchmarks used in public claims, the receipt should be machine-readable. For disputed scores, the institution should retain both records, annotate the conflict, and avoid rewriting history in place. Evaluation evidence should behave more like audit evidence than marketing copy.
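The "retain both records, annotate the conflict" rule maps naturally onto an append-only structure. The following is a minimal in-memory sketch under that assumption; the Ledger and LedgerEntry names are hypothetical, and a real institution would need durable storage and access controls on top.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LedgerEntry:
    entry_id: int
    kind: str                       # "result" or "annotation"
    payload: dict
    refers_to: Tuple[int, ...] = () # ids of the entries an annotation is about


class Ledger:
    """Append-only: entries can be added and annotated, never edited or removed."""

    def __init__(self):
        self._entries = []

    def _append(self, kind, payload, refers_to=()):
        # Copy the payload so later changes by the caller do not alter the ledger.
        entry = LedgerEntry(len(self._entries), kind, dict(payload), tuple(refers_to))
        self._entries.append(entry)
        return entry.entry_id

    def append_result(self, payload):
        return self._append("result", payload)

    def annotate_conflict(self, entry_ids, note):
        known = {e.entry_id for e in self._entries}
        unknown = [i for i in entry_ids if i not in known]
        if unknown:
            raise KeyError(f"unknown entry ids: {unknown}")
        return self._append("annotation", {"note": note}, entry_ids)

    def entries(self):
        return tuple(self._entries)


ledger = Ledger()
first = ledger.append_result({"model": "LLaMA 65B", "benchmark": "MMLU", "score": 63.7})
second = ledger.append_result({"model": "LLaMA 65B", "benchmark": "MMLU", "score": 48.8})
note_id = ledger.annotate_conflict([first, second], "Scores differ across harnesses.")
print(len(ledger.entries()))
```

The annotation is itself a new entry that points at the disputed results, so the dispute and both original scores remain in the history.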

Scope Boundary

The paper is infrastructure work, not a proof that evaluation science is solved. The authors say coverage is strongest for text-based single-model evaluations, while multimodal evaluations, human-preference judgments, and multi-agent settings are only partly supported. The schema depends on community adoption; labs and leaderboard operators may still omit important metadata. It also does not run evaluations, so it cannot recover details that were never recorded.

That limitation is the point. A schema cannot make a weak measurement strong. It can make the weakness visible before the number becomes institutional memory.
