The Metric Choice Becomes the Governance Fork
Alex Fogelson, Zachary A. Brown, Hans Gundlach, Jayson Lynch, and Neil Thompson's July 2026 arXiv paper argues that whether AI capability gaps persist or close depends on the metric used to measure capability.
For this essay, a metric-governance receipt is the record that names the performance metric, whether it is bounded, what utility claim it represents, how it scales with training or inference compute, and what policy conclusion would change if a different but related metric were chosen.
The Claim
The paper, arXiv:2607.00913 [cs.AI], was submitted on July 1, 2026, and is listed by arXiv as accepted into the 2026 ICML Technical AI Governance Research Workshop. Its title is Two AI Metrics Diverged: Will it Make All the Difference?
The paper asks a policy question that looks technical only at first: as frontier developers spend exponentially more on compute, will their models stay far ahead of fixed-budget developers, or will cheaper models eventually reach similar practical capability because hardware, algorithms, and data practices improve for everyone?
The authors' answer is conditional. Validation loss and many bounded benchmark scores can suggest convergence. Other metrics, especially metrics that measure an unbounded quantity such as task horizon, can imply durable frontier advantage. The difference is not a footnote. It can change whether policy treats compute concentration as a lasting moat or a temporary lead.
The Fork
The paper names two families of metrics. A meek metric is one where models with slower effective-compute growth eventually close the measured gap against models with faster growth. A mighty metric is a normal performance metric that does not have that convergence property.
The practical theorem is simple enough to carry into governance meetings: bounded performance metrics are meek. If a benchmark is scored from zero to one hundred percent, two systems can differ greatly for a while, but the measured distance cannot grow without limit. As both approach the ceiling, the gap on that metric shrinks.
That does not mean the real capability has become socially equal. It means the chosen metric has run out of room to express the difference. A capped score can be useful for ranking near-term systems and still be weak evidence for long-term concentration.
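The shrinking gap on a capped score can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the saturating form score = 1 - compute^(-alpha), the value of alpha, and the two annual compute growth factors.

```python
def bounded_score(compute: float, alpha: float = 0.3) -> float:
    """Hypothetical score in [0, 1) that saturates as compute grows.

    The functional form and alpha are illustrative assumptions.
    """
    return 1.0 - compute ** (-alpha)


def score_gap(year: int, fast_growth: float = 4.0, slow_growth: float = 2.0) -> float:
    """Score difference between a fast-growing and a slow-growing developer.

    Both start from compute = 1 in year 0; the growth factors are hypothetical.
    """
    frontier = bounded_score(fast_growth ** year)
    follower = bounded_score(slow_growth ** year)
    return frontier - follower


# Gap on the bounded score at five-year intervals.
for year in range(0, 31, 5):
    print(year, round(score_gap(year), 4))
```

Under these assumptions the gap opens during the first few years and then narrows steadily, even though the frontier developer's compute lead keeps widening. The narrowing comes from the ceiling on the score, not from any equalization of compute.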
Metric Examples
The paper's software-engineering example is the clearest governance object. One engineer may care about accuracy over a fixed review interval. That metric is bounded: the system cannot exceed one hundred percent success on that interval. Another engineer may care about the longest task the agent can complete within an acceptable error tolerance. That metric can keep expanding as models handle longer work chains.
The paper also discusses ordinary benchmarks, validation loss, reinforcement-learning game performance, and task horizon. Benchmarks and win rates are often bounded. Task horizon can be unbounded because it asks how long a task can become while preserving a fixed success probability. A model that raises a one-hour frontier to a one-day frontier is changing the scale of delegation, not only the percentage score on a fixed test.
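The contrast between a bounded score and a task horizon can be sketched with a toy model. It assumes, purely for illustration, that a task is a chain of steps that succeed independently with the same probability; this independence model and the function name `max_chain_length` are not from the paper.

```python
import math


def max_chain_length(step_success: float, min_task_success: float = 0.5) -> int:
    """Longest chain of steps whose overall success probability stays at or
    above min_task_success, assuming independent steps (a toy assumption)."""
    if not 0.0 < step_success < 1.0:
        raise ValueError("step_success must be strictly between 0 and 1")
    if not 0.0 < min_task_success < 1.0:
        raise ValueError("min_task_success must be strictly between 0 and 1")
    # step_success ** n >= min_task_success  <=>  n <= log(min) / log(step)
    return math.floor(math.log(min_task_success) / math.log(step_success))


weaker = max_chain_length(0.99)    # per-step accuracy of 99 percent
stronger = max_chain_length(0.999)  # per-step accuracy of 99.9 percent
print(weaker, stronger)  # prints: 68 692
```

In this toy model the two systems differ by less than one percentage point on the bounded per-step score, yet the horizon at a fifty percent success tolerance differs by roughly a factor of ten. As per-step accuracy approaches its ceiling, the horizon keeps growing without limit.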
Governance Reading
The governance stakes are direct. If the valuable metric is meek, a frontier developer's compute lead may be commercially useful but transient. If the valuable metric is mighty, compute can remain a moat. That difference changes antitrust analysis, public compute arguments, export-control expectations, and the plausibility of broad downstream access.
For compute controls, the paper's warning is sharp: if the relevant capability is bounded in the terms society cares about, restrictions may delay a rival rather than impose a lasting ceiling. If the relevant capability is unbounded, restricting effective compute may preserve a durable gap.
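The "delay rather than ceiling" case can be made concrete with back-of-the-envelope arithmetic. The sketch assumes a one-time cut in a rival's effective compute followed by steady exponential growth that the restriction does not otherwise slow; both inputs are hypothetical and are not estimates from the paper.

```python
import math


def delay_years(compute_cut_factor: float, annual_growth_factor: float) -> float:
    """Years a rival needs to regain a fixed effective-compute level after a
    one-time cut, assuming steady exponential growth (an illustrative model)."""
    if compute_cut_factor < 1.0:
        raise ValueError("compute_cut_factor must be at least 1")
    if annual_growth_factor <= 1.0:
        raise ValueError("annual_growth_factor must exceed 1")
    # growth ** t = cut  =>  t = log(cut) / log(growth)
    return math.log(compute_cut_factor) / math.log(annual_growth_factor)


# Hypothetical inputs: a tenfold cut and 2.5x effective-compute growth per year.
example_delay = delay_years(compute_cut_factor=10.0, annual_growth_factor=2.5)
print(round(example_delay, 2))
```

With these hypothetical inputs the rival reaches any fixed compute level about two and a half years late. If the capability that matters is a threshold on a bounded metric, that lag is the whole effect of the restriction. The arithmetic says nothing about the unbounded case, where the measured gap depends on how the metric scales with compute.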
Dangerous capabilities need the same discipline. Some harms may be threshold-like: once a model can reliably help with a cyber, persuasion, or biosecurity workflow, marginal gains above the threshold matter less than access to the threshold. Other harms may be scale-sensitive: the number of targets, speed of adaptation, or duration of autonomous work may continue to matter after the threshold is crossed.
Metric Receipts
A metric-governance receipt should name whether the metric is bounded, what real-world utility it stands for, whose utility function it represents, whether the policy question concerns threshold access or marginal advantage, and whether the metric is ordinal only or meaningful in cardinal units.
It should also say what compute axis is being studied: training compute, inference compute, hardware efficiency, algorithmic efficiency, data progress, raw spending, or some mixture. The same benchmark score can behave differently when the relevant policy question is about a fixed test, a longer task, a bigger target set, or a faster action loop.
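One way to make such a receipt concrete is a small record type. The sketch below is a hypothetical schema: the class and field names are invented for this page and are not defined by the paper. The only rule it encodes is the paper's statement that bounded metrics are meek; for unbounded metrics it deliberately returns no verdict.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PolicyQuestion(Enum):
    THRESHOLD_ACCESS = "threshold access"
    MARGINAL_ADVANTAGE = "marginal advantage"


class ScaleType(Enum):
    ORDINAL = "ordinal"
    CARDINAL = "cardinal"


@dataclass(frozen=True)
class MetricReceipt:
    """Hypothetical record of the choices behind a metric-based policy claim."""
    metric_name: str
    bounded: bool
    utility_claim: str            # what real-world utility the metric stands for
    utility_holder: str           # whose utility function it represents
    policy_question: PolicyQuestion
    scale_type: ScaleType
    compute_axes: tuple[str, ...]  # e.g. training compute, inference compute
    conclusion_if_metric_changed: str

    def meek_by_boundedness(self) -> Optional[bool]:
        """True when the metric is bounded; None when boundedness alone
        does not settle whether the metric is meek or mighty."""
        return True if self.bounded else None


# Illustrative entries based on the software-engineering example.
accuracy_receipt = MetricReceipt(
    metric_name="accuracy over a fixed review interval",
    bounded=True,
    utility_claim="reliable completion of fixed-length review tasks",
    utility_holder="engineer delegating fixed-length tasks",
    policy_question=PolicyQuestion.THRESHOLD_ACCESS,
    scale_type=ScaleType.CARDINAL,
    compute_axes=("training compute",),
    conclusion_if_metric_changed="a task-horizon metric may not show convergence",
)
horizon_receipt = MetricReceipt(
    metric_name="longest task within an acceptable error tolerance",
    bounded=False,
    utility_claim="length of work that can be delegated",
    utility_holder="engineer delegating long work chains",
    policy_question=PolicyQuestion.MARGINAL_ADVANTAGE,
    scale_type=ScaleType.CARDINAL,
    compute_axes=("training compute", "inference compute"),
    conclusion_if_metric_changed="a capped accuracy score would suggest convergence",
)
```

Keeping the alternative-metric field mandatory forces the author of a receipt to state what would change under a related metric, which is the point of the exercise.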
Limits
This page treats the paper as a theoretical and governance-analysis contribution, not as a direct forecast of a specific lab, country, model family, or date. The paper itself discusses limitations around positional metrics, alternative compute-scaling paths, scale-dependent algorithmic progress, proprietary data or algorithmic advantages, and near-term prediction.
The safest reading is therefore methodological: policy should not inherit a conclusion from a metric before checking whether that metric expresses the actual power, harm, access, or utility that the policy is trying to govern.
Source Discipline
This page uses the arXiv abstract and HTML version as the primary source for title, authorship, submission date, workshop status, definitions, theorem-level claims, examples, governance implications, and limitations. It does not independently prove the mathematical results, reproduce the plotted assumptions, or validate the cited compute-trend literature.
Related Pages
- AI Capability Forecasting, Scaling Laws, AI Compute, Compute Governance, AI Evaluations, and Frontier AI Safety Frameworks cover the surrounding measurement and governance frame.
- The Capability Frontier Becomes the Evaluation Gap, The Evaluation Archive Becomes the Frontier Claim, The Cooperation Metric Becomes the Manipulation Trap, The Token Meter Becomes the AI Budget, and The Compute Border Becomes AI Governance give adjacent receipt patterns.
Sources
- Alex Fogelson, Zachary A. Brown, Hans Gundlach, Jayson Lynch, and Neil Thompson, Two AI Metrics Diverged: Will it Make All the Difference?, arXiv:2607.00913 [cs.AI], submitted July 1, 2026; accepted into the 2026 ICML Technical AI Governance Research Workshop.
- arXiv HTML: arXiv:2607.00913 HTML, reviewed for definitions, theorem statements, metric examples, inference-time extension, governance implications, limitations, and appendix proof framing.
- Paper PDF: arXiv:2607.00913 PDF.