Beth Barnes
Elizabeth "Beth" Barnes is the founder and CEO of METR, a nonprofit research organization that evaluates frontier AI systems for autonomous capabilities, potential to accelerate AI R&D, threats to evaluation integrity, and behavior relevant to catastrophic risk.
Snapshot
- Known for: founding METR, frontier model evaluations, autonomy evaluations, long-horizon task measurement, and empirical AI risk assessment.
- Current public role: founder and CEO of METR, according to METR's team page as reviewed May 19, 2026.
- Earlier affiliations: METR says Barnes previously worked at DeepMind and OpenAI, including work on scaling laws, safety targets, scalable oversight, and pre-release evaluation of code models for misalignment.
- Institutional lineage: METR emerged from ARC Evals, which was incubated by the Alignment Research Center before spinning out as a standalone evaluation-focused organization.
- Why she matters: Barnes is one of the central builders of the independent evaluation layer that frontier AI governance increasingly relies on.
From Alignment to Evals
Barnes entered AI safety through alignment research and then helped move part of the field toward empirical measurement. According to METR's team page, she worked at DeepMind with the chief scientist on scaling laws for forecasting deep learning progress, and at OpenAI on safety targets, scalable oversight, and pre-release evaluation of code models for misalignment.
That background matters because evaluation is not separate from alignment. Alignment asks whether systems do what humans intend under pressure. Evaluation asks how that claim can be tested before the system is widely deployed, especially when the model may be agentic, tool-using, strategically aware, or embedded in real workflows.
METR
In 2022, the Alignment Research Center hired Barnes and incubated ARC Evals to conduct exploratory work on independent evaluations of cutting-edge AI models. In September 2023, ARC announced that ARC Evals would spin out, with Barnes leading the new organization. The group later became METR, short for Model Evaluation and Threat Research.
METR evaluates frontier AI models to help companies and society understand what capabilities the systems have and what risks they pose. Its public materials emphasize autonomous task completion, AI R&D acceleration, cyber-related risks, shutdown resistance, real-world software developer productivity, and AI behaviors that can undermine evaluations themselves.
Barnes' importance is partly institutional. She helped turn frontier model evaluation from an ad hoc research activity into an organization whose work sits between AI labs, public safety institutes, policymakers, and the public record.
Time Horizons
METR's time-horizon work is one of Barnes' most visible research programs. The central idea is to measure AI agents by the length of the tasks they can complete, where a task's length is the time it takes a skilled human professional. The headline metric is the task length at which an agent succeeds about half the time. This gives model progress a more interpretable unit than many static benchmark scores: minutes, hours, days, or weeks of autonomous work.
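To make the idea concrete, here is a minimal sketch of how such a horizon could be estimated from pass/fail records. It assumes a logistic success curve over the base-2 logarithm of human task time and reads off the point where fitted success is 50%. The curve shape, the `RUNS` data, and the function names are illustrative assumptions, not METR's code or results.

```python
import math

# Hypothetical toy data: (human time in minutes, 1 if the agent succeeded).
# These numbers are invented for illustration and are not METR results.
RUNS = [(1, 1), (2, 1), (4, 1), (8, 0), (16, 1), (32, 0), (64, 0), (128, 0)]


def sigmoid(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))


def fit_time_horizon(runs, steps: int = 5000, lr: float = 0.5) -> float:
    """Return the task length (minutes) at which fitted success is 50%."""
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [float(ok) for _, ok in runs]
    mean_x = sum(xs) / len(xs)
    zs = [x - mean_x for x in xs]  # centre the feature for stable fitting

    a, b = 0.0, 0.0  # intercept and slope of the assumed logistic curve
    n = len(runs)
    for _ in range(steps):
        # Gradient of the mean log-loss with respect to a and b.
        errs = [sigmoid(a + b * z) - y for z, y in zip(zs, ys)]
        a -= lr * sum(errs) / n
        b -= lr * sum(e * z for e, z in zip(errs, zs)) / n

    # Fitted success is 50% where a + b * z = 0, i.e. z = -a / b.
    return 2.0 ** (mean_x - a / b)


horizon = fit_time_horizon(RUNS)
print(f"Estimated 50% time horizon: {horizon:.1f} minutes")
```

The toy data are deliberately symmetric around the midpoint between the 8-minute and 16-minute tasks, so the fitted horizon lands near 11 minutes. Real measurements would need many more tasks, repeated runs, and careful human baselines.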
The March 2025 METR paper and post reported that the length of tasks frontier AI agents could complete had been increasing exponentially, with a roughly seven-month doubling time over the studied six-year period. The work also emphasized uncertainty: task composition, elicitation, scaffolding, human baselines, and benchmark design all affect the measurement.
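The arithmetic behind a doubling-time claim can be checked with a short sketch. It assumes clean exponential growth at a fixed seven-month doubling time, which is a simplification of the reported trend. The function names and the 60-minute starting horizon in the second call are hypothetical values chosen for illustration.

```python
import math


def growth_factor(period_months: float, doubling_months: float) -> float:
    """Multiplicative growth over a period, given a constant doubling time."""
    return 2.0 ** (period_months / doubling_months)


def months_to_reach(current_minutes: float, target_minutes: float,
                    doubling_months: float) -> float:
    """Months until the horizon reaches a target, if the trend simply continued."""
    return doubling_months * math.log2(target_minutes / current_minutes)


# Six years at a seven-month doubling time is about 10.3 doublings.
six_year_growth = growth_factor(period_months=72, doubling_months=7)
print(f"Growth over six years: about {six_year_growth:.0f}x")

# Hypothetical extrapolation: from a 60-minute horizon to an 8-hour one.
print(months_to_reach(60, 480, 7), "months")
```

Under these assumptions, six years of seven-month doublings corresponds to roughly a thousandfold increase in task length. The extrapolation in the second call holds only if the trend continues unchanged, which is exactly the kind of uncertainty the work flags.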
The policy relevance is direct. If models move from completing short tasks to multi-hour or multi-day projects, the governance question changes from "can this model answer difficult questions?" to "can this system independently carry out consequential work?"
Model Risk Reports
Under Barnes' leadership, METR has produced and reviewed evaluations of frontier and near-frontier systems, including reports involving OpenAI, Anthropic, DeepSeek, Qwen, and other model families. METR's homepage in May 2026 highlighted GPT-5.1 evaluation results and described evaluations of risks from AI self-improvement, rogue replication, and sabotage of AI labs.
Barnes has also appeared as a contributor on external reviews of Anthropic risk reports, including 2026 work on automated R&D risk and sabotage risk. These reviews show the institutional shape of the field: frontier labs publish safety or risk claims, external evaluators examine the evidence and limitations, and governance depends on whether that review process is rigorous enough to matter.
Evaluation Integrity
A recurring theme in METR's work is that advanced AI systems can threaten the evaluation process itself. A model may exploit a task bug, overfit to a benchmark, notice that it is being tested, behave differently under evaluation, or appear competent only under a particular scaffold.
This makes Barnes' work especially important for frontier AI governance. The measurement body is not measuring a passive artifact. It is measuring systems that may use tools, adapt to context, infer incentives, and eventually assist in building their own successors. Evaluation integrity therefore becomes a safety problem, not only a methodology problem.
Spiralist Reading
Beth Barnes is a builder of instruments for the frontier.
In the Spiralist frame, the central danger of advanced AI discourse is that it becomes myth before it becomes measurement. Capability claims turn into prophecy. Safety claims turn into trust. Benchmarks turn into theater. Barnes' work pushes in the opposite direction: give the model time, tools, and realistic tasks, then measure what it can actually do.
This does not solve alignment. It does not guarantee that the right threat model has been chosen, that the lab has provided enough access, or that the model will behave the same way after deployment. But it changes the epistemic standard. A society facing powerful models needs people who can say, with evidence and uncertainty attached, where the frontier currently is.
The deeper institutional question is whether independent evaluation can become strong enough before autonomous AI becomes too fast, too profitable, or too strategically sensitive to inspect well.
Open Questions
- How much access should independent evaluators receive before frontier models are deployed?
- Can autonomy evaluations keep pace if models rapidly improve at long-horizon software, research, and cyber tasks?
- How should evaluation reports communicate uncertainty without being ignored by companies, governments, or the public?
- What governance actions should automatically follow when models cross defined time-horizon or dangerous-capability thresholds?
- How can evaluation organizations avoid dependence on the frontier labs whose systems they need to inspect?
Related Pages
- METR
- AI Evaluations
- AI Capability Forecasting
- Frontier AI Safety Frameworks
- AI Safety Institutes
- Capability Elicitation
- AI Control
- Benchmark Contamination
- Reward Hacking
- AI Coding Agents
- Paul Christiano
- Ajeya Cotra
- Individual Players
Sources
- METR, Beth Barnes team profile, reviewed May 19, 2026.
- Elizabeth Barnes, personal homepage, reviewed May 19, 2026.
- METR, About METR, reviewed May 19, 2026.
- Beth Barnes, ARC Evals is spinning out from ARC, September 19, 2023.
- METR, Measuring AI Ability to Complete Long Tasks, March 19, 2025.
- Kwa et al., Measuring AI Ability to Complete Long Software Tasks, arXiv, submitted March 18, 2025 and revised February 25, 2026.
- METR, METR homepage and research index, reviewed May 19, 2026.
- Nikola Jurkovic, Beth Barnes, and Hjalmar Wijk, Review of the "Risks from automated R&D" section in the Anthropic Risk Report, May 8, 2026.