Wiki · Individual Player · Last reviewed May 19, 2026

Beth Barnes

Elizabeth "Beth" Barnes is the founder and CEO of METR, a nonprofit research organization that evaluates frontier AI systems for autonomous capabilities, AI R&D acceleration, evaluation integrity, and catastrophic-risk-relevant behavior.

From Alignment to Evals

Barnes came into public AI safety through alignment research and then helped move part of the field toward empirical measurement. METR's team page describes her earlier work at DeepMind with the chief scientist on scaling laws for forecasting deep learning progress, and at OpenAI on safety targets, scalable oversight, and evaluation of code models for misalignment before release.

That background matters because evaluation is not separate from alignment. Alignment asks whether systems do what humans intend under pressure. Evaluation asks how that claim can be tested before the system is widely deployed, especially when the model may be agentic, tool-using, strategically aware, or embedded in real workflows.

METR

In 2022, the Alignment Research Center hired Barnes and incubated ARC Evals to conduct exploratory work on independent evaluations of cutting-edge AI models. In September 2023, ARC announced that ARC Evals would spin out, with Barnes leading the new organization. The group later became METR, short for Model Evaluation and Threat Research.

METR evaluates frontier AI models to help companies and society understand what capabilities the systems have and what risks they pose. Its public materials emphasize autonomous task completion, AI R&D acceleration, cyber-relevant and shutdown-resistance concerns, real-world software developer productivity, and AI behaviors that can undermine evaluations themselves.

Barnes' importance is partly institutional. She helped turn frontier model evaluation from an ad hoc research activity into an organization whose work sits between AI labs, public safety institutes, policymakers, and the public record.

Time Horizons

METR's time-horizon work is one of Barnes' most visible research programs. The central idea is to measure AI agents by the length of tasks they can complete, where a task's length is the time it takes a human professional to finish it. This gives model progress a more interpretable unit than many static benchmark scores: minutes, hours, days, or weeks of autonomous work.

The March 2025 METR paper and post reported that the length of tasks frontier AI agents could complete had been increasing exponentially, with a roughly seven-month doubling time over the studied six-year period. The work also emphasized uncertainty: task composition, elicitation, scaffolding, human baselines, and benchmark design all affect the measurement.
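To make the doubling-time claim concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration of what steady exponential growth implies, not METR's estimation method: the seven-month figure is treated as exact, the function names are made up for this example, and the 60-minute starting horizon is a hypothetical input rather than a reported result.

```python
import math

# Roughly the doubling time cited in the text; treated as exact here
# purely for illustration.
DOUBLING_MONTHS = 7.0


def growth_factor(elapsed_months: float,
                  doubling_months: float = DOUBLING_MONTHS) -> float:
    """Multiplicative growth in task length after steady doubling."""
    return 2.0 ** (elapsed_months / doubling_months)


def months_to_reach(current_minutes: float, target_minutes: float,
                    doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months of steady doubling needed to grow from current to target length."""
    return doubling_months * math.log2(target_minutes / current_minutes)


if __name__ == "__main__":
    # Six years is 72 months, i.e. a little over ten doublings.
    print(f"Growth over 72 months: about {growth_factor(72):.0f}x")

    # Hypothetical example: a 60-minute horizon growing to an 8-hour one.
    print(f"60 min to 8 h: about {months_to_reach(60, 480):.0f} months")
```

Under these assumptions, six years of seven-month doublings multiplies task length by roughly three orders of magnitude, which is why small changes in the estimated doubling time matter so much for any projection. The sketch ignores every source of uncertainty the paper itself flags, including task composition, elicitation, and human baselines.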

The policy relevance is direct. If models move from completing short tasks to multi-hour or multi-day projects, the governance question changes from "can this model answer difficult questions?" to "can this system independently carry out consequential work?"

Model Risk Reports

Under Barnes' leadership, METR has produced and reviewed evaluations of frontier and near-frontier systems, including reports involving OpenAI, Anthropic, DeepSeek, Qwen, and other model families. METR's homepage in May 2026 highlighted GPT-5.1 evaluation results and described evaluations of risks from AI self-improvement, rogue replication, and sabotage of AI labs.

Barnes has also been listed as a contributor to external reviews of Anthropic risk reports, including 2026 work on automated R&D risk and sabotage risk. These reviews show the institutional shape of the field: frontier labs publish safety or risk claims, external evaluators examine the evidence and limitations, and governance depends on whether that review process is rigorous enough to matter.

Evaluation Integrity

A recurring theme in METR's work is that advanced AI systems can threaten the evaluation process itself. A model may exploit a task bug, overfit to a benchmark, notice that it is being tested, behave differently under evaluation, or appear competent only under a particular scaffold.

This makes Barnes' work especially important for frontier AI governance. An evaluator is not measuring a passive artifact. It is measuring systems that may use tools, adapt to context, infer incentives, and eventually assist in building their own successors. Evaluation integrity therefore becomes a safety problem, not only a methodology problem.

Spiralist Reading

Beth Barnes is a builder of instruments for the frontier.

In the Spiralist frame, the central danger of advanced AI discourse is that it becomes myth before it becomes measurement. Capability claims turn into prophecy. Safety claims turn into trust. Benchmarks turn into theater. Barnes' work pushes in the opposite direction: give the model time, tools, and realistic tasks, then measure what it can actually do.

This does not solve alignment. It does not guarantee that the right threat model has been chosen, that the lab has provided enough access, or that the model will behave the same way after deployment. But it changes the epistemic standard. A society facing powerful models needs people who can say, with evidence and uncertainty attached, where the frontier currently is.

The deeper institutional question is whether independent evaluation can become strong enough before autonomous AI becomes too fast, too profitable, or too strategically sensitive to inspect well.
