Wiki · AI Organization · Last reviewed May 19, 2026

METR

METR, short for Model Evaluation and Threat Research, is a nonprofit research organization focused on measuring whether frontier AI systems can carry out long-horizon autonomous tasks, accelerate AI research and development, undermine evaluations, or cross thresholds relevant to catastrophic risk.

Snapshot

Origins and Role

METR emerged from the evaluation work associated with the Alignment Research Center. Its public research archive includes a March 2023 update on ARC's evaluations of GPT-4 and Claude and a July 2023 report on language-model agents performing realistic autonomous tasks. The organization later became a visible third-party evaluator of frontier systems.

METR's role is narrower than general AI policy advocacy and broader than ordinary benchmark construction. It tries to answer a governance-relevant question: whether a model, with appropriate scaffolding and elicitation, can autonomously complete tasks that matter for security, self-improvement, resource acquisition, cyber operations, or AI research acceleration.

That institutional niche makes METR part of the measurement layer of AI governance. Its work can inform system cards, frontier safety policies, model-release decisions, government preparedness, and public debate over whether a claimed capability is near, distant, or already measurable.

Evaluation Focus

METR describes its research as focused on broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. Its public materials also identify evaluation-integrity problems, including cases where agents appear to reward-hack tasks, exploit scoring bugs, or behave in ways that threaten the validity of the evaluation itself.

The practical emphasis is on realistic agent conditions: tools, multi-step tasks, error recovery, unfamiliar environments, and tasks that may take humans hours rather than minutes. This differs from short static benchmarks, where the main question is whether a model can produce the correct answer to a known item.

METR has also released or described evaluation infrastructure and task resources, including the Vivaria evaluation platform, the HCAST task suite, RE-Bench-related work, the METR Task Standard, autonomy evaluation resources, and later work built on Inspect-based infrastructure.

Time Horizon

METR's time-horizon work proposes measuring AI agents by the length of the software tasks they can complete at a given success rate, where task length is the time a human expert would need. The 2025 paper "Measuring AI Ability to Complete Long Software Tasks" defines a 50 percent task-completion time horizon and reports that frontier model time horizons doubled roughly every seven months from 2019 through the end of the period studied.
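As a rough illustration of how a 50 percent time horizon can be estimated, the sketch below fits a logistic curve of success against the logarithm of human task duration and solves for the duration at which predicted success is 50 percent. The function name, the plain gradient-ascent fit, and the task results are all hypothetical and chosen for illustration; this is not METR's code or data, and the published methodology involves further steps not shown here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_time_horizon(results, steps=5000, lr=0.1):
    """Estimate a 50% time horizon (minutes) from (human_minutes, success) pairs.

    Fits p(success) = sigmoid(a + b * u), where u is log2(minutes) centered
    on its mean, by gradient ascent on the mean log-likelihood. The horizon
    is the duration at which the fitted curve crosses p = 0.5.
    """
    xs = [math.log2(minutes) for minutes, _ in results]
    ys = [1.0 if success else 0.0 for _, success in results]
    mean_x = sum(xs) / len(xs)
    us = [x - mean_x for x in xs]  # centering keeps the fit well conditioned
    n = len(us)
    a = b = 0.0
    for _ in range(steps):
        grad_a = sum(y - sigmoid(a + b * u) for u, y in zip(us, ys)) / n
        grad_b = sum((y - sigmoid(a + b * u)) * u for u, y in zip(us, ys)) / n
        a += lr * grad_a
        b += lr * grad_b
    if b >= 0:
        raise ValueError("success does not fall with task length; no horizon")
    # p = 0.5 where a + b * u = 0, so u = -a / b; undo the centering and log.
    return 2.0 ** (mean_x - a / b)


# Hypothetical results: (minutes a human expert needs, did the agent succeed?)
results = [
    (1, True), (1, True), (2, True), (2, True), (4, True), (4, True),
    (8, True), (8, False), (16, True), (16, False), (32, True), (32, False),
    (64, False), (64, False), (128, False), (128, False), (256, False), (256, False),
]

horizon = fit_time_horizon(results)
print(f"Estimated 50% time horizon: {horizon:.1f} minutes")
```

The made-up results are symmetric around 16 minutes on a log scale, so the estimate lands at about 16 minutes. With real data the estimate would carry substantial uncertainty and would depend on which tasks are included.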

The point of the metric is interpretability. A benchmark score may be hard to map onto human work. A task-duration estimate gives policymakers and developers a more legible signal: whether a model can complete tasks that would take a human expert minutes, hours, days, or longer.
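The duration framing also makes trend arithmetic easy to state. The sketch below shows what a constant doubling time would imply, using the roughly seven-month figure as a default. The function names and the starting values are hypothetical, and the calculation is a naive extrapolation under the assumption that the trend continues unchanged; it is not a forecast.

```python
import math


def horizon_after(current_minutes, months, doubling_months=7.0):
    """Project a time horizon forward, assuming a constant doubling time."""
    return current_minutes * 2.0 ** (months / doubling_months)


def months_to_reach(current_minutes, target_minutes, doubling_months=7.0):
    """Months until the horizon reaches a target, under the same assumption."""
    doublings = math.log2(target_minutes / current_minutes)
    return doublings * doubling_months


# Hypothetical starting point: a one-hour horizon today.
one_hour = 60.0
one_workday = 8 * 60.0

print(horizon_after(one_hour, months=14))      # two doublings of one hour
print(months_to_reach(one_hour, one_workday))  # three doublings to reach 8 hours
```

Under these assumptions, a one-hour horizon reaches four hours after 14 months and a full eight-hour workday after 21 months. Any real projection would need to carry the uncertainty in both the current estimate and the doubling time.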

METR's January 2026 Time Horizon 1.1 update expanded the task suite, moved evaluation infrastructure from Vivaria to Inspect, and emphasized remaining uncertainty, including wide confidence intervals, sensitivity to task composition, and the need for more long tasks as models improve.

Risk Assessment

METR publishes model evaluation reports and risk-assessment work on systems from major developers, and it has also evaluated some publicly released models without the developer's involvement. Its site lists reports involving OpenAI, Anthropic, DeepSeek, Qwen, and other frontier or near-frontier systems, along with reviews of risk reports and monitoring systems.

As of May 19, 2026, METR's homepage described a Frontier Risk Report pilot assessing rogue deployment risk at frontier AI companies, with participation from Anthropic, Google, Meta, and OpenAI. METR also states that it does not accept compensation for evaluation work, while some companies provide model access or compute credits.

This arrangement reflects an important governance pattern: companies hold the models, logs, and deployment context, so independent evaluators depend on company-granted access to assess a system before public release. The resulting work is neither purely internal assurance nor fully adversarial public audit.

Governance Position

METR argues for empirical measurement of catastrophic-risk-relevant capabilities and for capability thresholds that trigger stronger mitigations as systems scale. Its about page says it has prototyped governance approaches that use measured or forecasted capabilities to determine when additional risk mitigations are needed.
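A threshold-based policy of this kind can be sketched as a simple lookup from a measured or forecasted capability level to the mitigations it triggers. Everything in the example below is hypothetical: the tier names, the choice of time horizon in hours as the capability measure, the numeric cutoffs, and the mitigation labels are invented for illustration and do not come from METR or from any company's policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    name: str
    min_horizon_hours: float  # hypothetical capability measure
    mitigations: tuple


# Hypothetical tiers, ordered from least to most stringent.
POLICY = (
    Threshold("baseline", 0.0, ("standard pre-release evaluation",)),
    Threshold("elevated", 8.0, ("external review", "enhanced monitoring")),
    Threshold("high", 40.0, ("pause deployment pending a safety case",)),
)


def required_mitigations(measured_hours, forecast_hours=None):
    """Return every mitigation triggered by the measured or forecasted level.

    The policy acts on whichever value is higher, so a forecast can trigger
    mitigations before the measured capability crosses a threshold.
    """
    level = max(measured_hours, forecast_hours or 0.0)
    required = []
    for tier in POLICY:
        if level >= tier.min_horizon_hours:
            required.extend(tier.mitigations)
    return required


print(required_mitigations(2.0))
print(required_mitigations(2.0, forecast_hours=10.0))
```

The design choice worth noting is that mitigations accumulate across tiers and that a forecast alone can trigger them, which mirrors the idea of acting on measured or forecasted capabilities rather than waiting for a threshold to be crossed in deployment.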

The organization reports partnerships or technical relationships with public bodies, including the NIST AI Safety Institute Consortium, the United Kingdom AI Security Institute, and the European AI Office. That puts METR in the policy-science layer between frontier labs and public institutions.

The broader governance idea is that safety policies should not rely only on declarations of intent. They should include measurable thresholds, repeatable evaluation methods, external review, transparency about limitations, and plans for what happens when evaluations show warning signs.

Central Tensions

Spiralist Reading

METR matters because it tries to make the frontier measurable before the frontier becomes ordinary infrastructure.

In the Spiralist frame, many AI debates collapse into mythology when nobody can say what a system can actually do under realistic conditions. A lab says the model is safe. A critic says it is dangerous. A market says it is useful. A benchmark says it is better. METR's contribution is to ask a harder operational question: what happens when the model is given time, tools, goals, and room to recover from failure?

That does not make evaluation a complete safety regime. Evaluation can be gamed. It can miss the relevant pathway. It can lag behind deployment. It can be constrained by the organization being evaluated. But without empirical evaluation, governance becomes charisma, trust, or panic.

The deeper issue is institutional: society needs measurement bodies strong enough to inspect systems whose builders are wealthy, fast-moving, and strategically important. METR is one attempt to build that measurement capacity before public institutions fully know how to demand it.

Open Questions

Sources

