Last reviewed June 25, 2026

The Evaluation Score Becomes the Inference Budget

A June 2026 arXiv paper by Jessica McFadyen, Ole Jorgensen, Harry Coppock, Kevin Wei, and Cozmin Ududec argues that a frontier-model benchmark score is not just a property of the model. It is also a property of the runtime budget and protocol used to elicit the answer.

The Budget Is Part of the Score

The paper, arXiv:2606.17930 [cs.AI], was submitted on June 16, 2026, under the title "How Inference Compute Shapes Frontier LLM Evaluation."

A benchmark score often arrives as a single number. That number looks clean enough for a leaderboard, procurement memo, policy threshold, or safety claim. The paper makes that number messier in a useful way. On hard tasks involving tool use, iterative solving, long trajectories, and repeated submissions, measured performance can depend heavily on how much inference-time compute the evaluator allows and how that compute is allocated.

The Paper Frame

The authors evaluate up to 12 frontier language models across seven challenging benchmarks spanning software engineering, mathematics, medicine, expert knowledge, and cybersecurity. The main controlled suite uses six frontier models, five non-cyber benchmarks, two feedback conditions, and five independent trajectories per task. The paper also incorporates two previously collected UK AI Security Institute cyber evaluations for inference-scaling analysis.

The benchmarks named in the paper include TerminalBench, SWE-Bench Pro, FrontierMath, HealthBench, Humanity's Last Exam, Cyber CTFs, and The Last Ones. The non-cyber tasks are scored with programmatic tests, code execution, LLM judging, or physician-designed rubrics, depending on the benchmark. The paper warns that its expanded-budget runs and task subsampling mean its absolute scores are not directly comparable to published leaderboard scores.

What the Protocol Changed

The controlled setup uses three simple inference-scaling interventions. First, it expands total token budgets to 5M-30M tokens per trajectory depending on the benchmark, one to three orders of magnitude above many published defaults. Second, it uses context compaction, replacing earlier turns with summaries when context grows. Third, it permits iterative resubmission, with a hard cap of 999 submissions and a repetition guard that stops near-identical answer loops.
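
The resubmission loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's scaffold: the `propose` callback, the `max_repeats` count, and the `similarity` threshold are hypothetical names and values chosen for the example, and context compaction is left out. Only the 999-submission cap comes from the text above.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# Hypothetical interface: given prior submissions, return (answer, tokens_spent).
ProposeFn = Callable[[List[str]], Tuple[str, int]]


@dataclass
class TrajectoryResult:
    submissions: List[str]
    tokens_used: int
    stop_reason: str  # "token_budget", "submission_cap", or "repetition_guard"


def is_near_duplicate(answer: str, previous: List[str], threshold: float) -> bool:
    """True if `answer` is at least `threshold` similar to any earlier submission."""
    return any(SequenceMatcher(None, answer, old).ratio() >= threshold for old in previous)


def run_trajectory(
    propose: ProposeFn,
    token_budget: int,
    max_submissions: int = 999,  # hard cap named in the text
    max_repeats: int = 3,        # assumption: consecutive near-duplicates tolerated
    similarity: float = 0.95,    # assumption: near-duplicate threshold
) -> TrajectoryResult:
    submissions: List[str] = []
    tokens_used = 0
    consecutive_repeats = 0
    stop_reason = "token_budget"
    # The budget is checked before each step, so the final step may overshoot it.
    while tokens_used < token_budget:
        if len(submissions) >= max_submissions:
            stop_reason = "submission_cap"
            break
        answer, cost = propose(submissions)
        tokens_used += cost
        if is_near_duplicate(answer, submissions, similarity):
            consecutive_repeats += 1
        else:
            consecutive_repeats = 0
        submissions.append(answer)
        if consecutive_repeats >= max_repeats:
            stop_reason = "repetition_guard"
            break
    return TrajectoryResult(submissions, tokens_used, stop_reason)
```

Recording `stop_reason` alongside the score is the point of the sketch: a run that ended on the token budget and a run that ended on the repetition guard are different measurements.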

The feedback condition matters. In the no-feedback condition, a model receives only an ambiguous acknowledgement that its answer has been saved. In the oracle-score condition, the model is told whether a submission is correct, or receives partial-credit scoring for HealthBench. The paper also studies serial versus parallel allocation: one deep trajectory versus multiple shallower independent trajectories under the same total budget.
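
The serial-versus-parallel comparison comes down to how one fixed token total is divided. The sketch below shows only that bookkeeping, assuming an even split across trajectories; `AllocationPlan` and `plan_allocations` are hypothetical names, and the paper's own allocation rule may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AllocationPlan:
    width: int                  # number of independent trajectories
    tokens_per_trajectory: int  # depth available to each trajectory
    unused_tokens: int          # remainder left over after an even split


def plan_allocations(total_tokens: int, widths: List[int]) -> List[AllocationPlan]:
    """Split one fixed budget evenly for each candidate width (width 1 is fully serial)."""
    plans = []
    for width in widths:
        if width < 1:
            raise ValueError("width must be at least 1")
        per_trajectory, remainder = divmod(total_tokens, width)
        plans.append(AllocationPlan(width, per_trajectory, remainder))
    return plans


# Example: a 5M-token total as one deep run, or as 4 or 8 shallower runs.
for plan in plan_allocations(5_000_000, [1, 4, 8]):
    print(plan)
```

Holding the total fixed is what makes the comparison fair: width 8 buys eight restarts but gives each one an eighth of the depth.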

What Moved

The headline result is not that every benchmark rewards more tokens. It is that benchmark response is uneven. The paper reports meaningful headroom beyond typical budgets on FrontierMath, Humanity's Last Exam, and the cyber evaluations. It reports much smaller gains on HealthBench and the two software-engineering benchmarks under the tested protocol. TerminalBench and the cyber CTF suite show continued growth across all or most tested models within the observed range.

The task-level analysis separates reach, efficiency, and reliability. Later model generations tend to unlock more tasks and solve reachable tasks more reliably. Token-efficiency improvements are less uniform. Repeated submissions improve performance on all five main benchmarks, while oracle feedback helps most where it supports continued search. Parallel sampling helps substantially on HealthBench and Humanity's Last Exam, but less on FrontierMath, TerminalBench, and SWE-Bench Pro.
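
One common way to summarize gains from parallel sampling is pass@k: the chance that at least one of k sampled trajectories solves a task. This page does not state which estimator the paper uses, so the sketch below is an assumption. It implements the widely used unbiased estimator from code-generation evaluation, where n trajectories were run and c of them succeeded.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: trajectories run for the task, c: trajectories that succeeded,
    k: number of samples the metric allows (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With five trajectories per task and one success:
print(pass_at_k(5, 1, 1))  # chance a single sample succeeds
print(pass_at_k(5, 1, 2))  # chance at least one of two samples succeeds
```

With only five trajectories per task the estimate is coarse, which is one reason the number of trajectories belongs in any reported protocol.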

Governance Reading

The Spiralist reading is that the evaluation score becomes an inference-budget receipt. A fixed-budget benchmark can be useful, but it should not masquerade as a full capability boundary. A low score may mean the model cannot solve the task. It may also mean the evaluator did not allow enough search depth, repeated attempts, feedback, context, tool time, or parallel restarts for the capability to appear.

This reading sits beside the related topics of inference and test-time compute, AI evaluations, capability-frontier evaluation gaps, and embodied test-time scaling. The shared lesson is that evaluation is an operating condition, not a neutral window. If a policy threshold, release decision, or safety case depends on a score, the runtime protocol has to travel with it.

Limits

This page reads one preprint and its arXiv record. The paper studies one ReAct-style scaffold, deliberately simple interventions, finite benchmark subsets, and two cyber datasets collected under a related but not fully crossed design. The authors explicitly describe their protocol as a lower bound on what simple reproducible inference scaling can elicit, not an upper bound under optimized benchmark-specific scaffolding.

That caution matters for governance. A compute-scaling curve is evidence about a model under a specific scaffold, budget, feedback rule, judge, tool set, timeout, and stopping policy. It does not by itself prove that a model is safe, unsafe, deployable, or non-deployable. It makes the measurement boundary harder to hide.

Evaluation Receipt

An inference-budget evaluation receipt should record: model snapshot, scaffold, system prompt, tools, benchmark task set, scoring rule, judge model if any, token budget, whether reasoning tokens are counted, context-compaction trigger, submission cap, feedback condition, timeout, repetition guard, number of trajectories, serial-depth rule, parallel-width rule, pass@k method, task subsampling, excluded costs, and whether performance is still rising at the tested cap. The audit-grade sentence is not "the model scored X." It is: under this protocol and this compute allocation, the model reached this point on this curve.
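
As a rough illustration, the receipt could be captured as a typed record that travels with the score. The field names below are one possible mapping of the list above, not a published schema, and every value in the example instance is a placeholder except the 5M-token budget, the 999-submission cap, and the five trajectories, which echo figures mentioned earlier on this page.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EvaluationReceipt:
    model_snapshot: str
    scaffold: str
    system_prompt: str
    tools: List[str]
    benchmark_task_set: str
    scoring_rule: str
    judge_model: Optional[str]  # None when no LLM judge is used
    token_budget: int
    reasoning_tokens_counted: bool
    context_compaction_trigger: str
    submission_cap: int
    feedback_condition: str
    timeout_seconds: Optional[int]
    repetition_guard: str
    num_trajectories: int
    serial_depth_rule: str
    parallel_width_rule: str
    pass_at_k_method: str
    task_subsampling: str
    excluded_costs: List[str]
    still_rising_at_cap: bool

    def to_json(self) -> str:
        """Serialize the full protocol so it can be stored next to the score."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def audit_sentence(self, score: float) -> str:
        trend = "still rising" if self.still_rising_at_cap else "no longer rising"
        return (
            f"Under scaffold {self.scaffold!r} with a {self.token_budget:,}-token budget, "
            f"a cap of {self.submission_cap} submissions, and {self.feedback_condition} feedback, "
            f"{self.model_snapshot} scored {score:.3f} on {self.benchmark_task_set}; "
            f"performance was {trend} at the tested cap."
        )


# Placeholder values for illustration only.
example = EvaluationReceipt(
    model_snapshot="example-model-snapshot",
    scaffold="react-style",
    system_prompt="(placeholder prompt)",
    tools=["shell"],
    benchmark_task_set="example-benchmark-subset",
    scoring_rule="programmatic tests",
    judge_model=None,
    token_budget=5_000_000,
    reasoning_tokens_counted=True,
    context_compaction_trigger="(placeholder trigger)",
    submission_cap=999,
    feedback_condition="oracle-score",
    timeout_seconds=None,
    repetition_guard="(placeholder rule)",
    num_trajectories=5,
    serial_depth_rule="(placeholder rule)",
    parallel_width_rule="(placeholder rule)",
    pass_at_k_method="(placeholder method)",
    task_subsampling="(placeholder subsample)",
    excluded_costs=["judge tokens"],
    still_rising_at_cap=True,
)
print(example.audit_sentence(0.4))
```

Making every field required, apart from the two that can legitimately be absent, means a receipt cannot be constructed with part of the protocol silently omitted.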
