Wiki · AI Organization · Last reviewed June 23, 2026

METR

METR, short for Model Evaluation and Threat Research, is a nonprofit research organization focused on measuring whether frontier AI systems can carry out long-horizon autonomous tasks, accelerate AI research and development, undermine evaluations, or cross thresholds relevant to catastrophic risk.

Snapshot

Origins and Role

METR emerged from the evaluation work associated with the Alignment Research Center. Its public research archive includes a March 2023 update on ARC's evaluations of GPT-4 and Claude and a July 2023 report on language-model agents performing realistic autonomous tasks. The organization later became a visible third-party evaluator of frontier systems.

METR's role is narrower than general AI policy advocacy and broader than ordinary benchmark construction. It tries to answer a governance-relevant question: whether a model, with appropriate scaffolding and elicitation, can autonomously complete tasks that matter for security, self-improvement, resource acquisition, cyber operations, or AI research acceleration.

That institutional niche makes METR part of the measurement layer of AI governance. Its work can inform system cards, frontier safety policies, model-release decisions, government preparedness, and public debate over whether a claimed capability is near, distant, or already measurable.

Current Context

As of this June 23, 2026 review, METR is no longer only a pre-release model evaluator. Its public materials describe four overlapping roles: autonomy and dangerous-capability evaluation, frontier-risk assessment, research on AI effects on developer productivity and AI R&D acceleration, and evaluation-integrity work on reward hacking, sandbagging, monitorability, and mitigations.

The largest recent shift is METR's May 19, 2026 Frontier Risk Report, a pilot assessment of rogue-deployment risk at frontier AI companies. Anthropic, Google, Meta, and OpenAI participated. METR says participants provided access to internal models, raw chains of thought, questionnaire responses, and non-public information about capabilities, internal AI use, monitoring, and progress trends. METR then prepared company-specific private reports, companies approved what non-public material could be disclosed, and METR published an industry-level report.

This was an entity-based assessment rather than a model-specific launch evaluation. That distinction matters: METR was testing whether frontier labs' internal use of agents, monitoring, security, and organizational context create risk, not only whether a named model passes a benchmark before public release.

METR's public pages also show increasing work on real-world productivity. Its July 2025 randomized controlled trial found that experienced open-source developers took 19 percent longer on selected tasks when allowed to use early-2025 AI tools. METR later warned that tool capability and study design were changing, and it published early-2026 updates and surveys suggesting more favorable but methodologically difficult estimates. This work is relevant to AI R&D acceleration because benchmark capability does not automatically translate into real developer productivity.

Evaluation Focus

METR describes its research as focused on broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. Its public materials also identify evaluation-integrity problems, including cases where agents appear to reward-hack tasks, exploit scoring bugs, sandbag, bypass monitoring, or behave in ways that threaten the validity of the evaluation itself.

The practical emphasis is on realistic agent conditions: tools, multi-step tasks, error recovery, unfamiliar environments, and tasks that may take humans hours rather than minutes. This differs from short static benchmarks, where the main question is whether a model can produce the correct answer to a known item.

METR has also released or described evaluation infrastructure and task resources, including Vivaria, HCAST, RE-Bench-related work, a METR task standard, autonomy evaluation resources, and later work using Inspect-based infrastructure.

Time Horizon

METR's time-horizon work proposes measuring AI agents by the length of software tasks they can complete at a given success rate. The 2025 paper "Measuring AI Ability to Complete Long Software Tasks" defines a 50 percent task-completion time horizon: the human expert task duration at which an agent is predicted to succeed half the time.
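The definition above can be illustrated with a minimal sketch. It assumes that success probability follows a logistic curve in the base-2 logarithm of human task duration; the coefficient values, function names, and the choice of minutes as the unit are hypothetical and chosen only to make the arithmetic easy to follow. They are not METR's fitted values or code.

```python
import math


def success_probability(minutes, intercept, slope):
    """Predicted success rate for a task that takes a human `minutes`.

    Assumed model: logit(P) = intercept + slope * log2(minutes).
    A negative slope means longer tasks are harder for the agent.
    """
    z = intercept + slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(intercept, slope, p=0.5):
    """Human task duration (minutes) at which predicted success equals p."""
    logit_p = math.log(p / (1.0 - p))
    return 2.0 ** ((logit_p - intercept) / slope)


# Hypothetical coefficients, for illustration only.
INTERCEPT = 6.0
SLOPE = -1.0

h50 = time_horizon(INTERCEPT, SLOPE, p=0.5)  # logit(0.5) = 0, so 2 ** 6 minutes
h80 = time_horizon(INTERCEPT, SLOPE, p=0.8)  # a stricter success rate gives a shorter horizon
print(f"50% horizon: {h50:.0f} minutes; 80% horizon: {h80:.1f} minutes")
```

With these made-up coefficients the 50 percent horizon is 64 minutes. Requiring a higher success rate always yields a shorter horizon under this model, which is why a horizon figure should be read together with its success threshold.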

The point of the metric is interpretability. A benchmark score may be hard to map onto human work. A task-duration estimate gives policymakers and developers a more legible signal: whether a model can complete tasks that would take a human expert minutes, hours, days, or longer.

METR's January 2026 Time Horizon 1.1 update expanded the task suite from 170 to 228 tasks, increased the number of tasks estimated at eight or more human hours from 14 to 31, and moved evaluation infrastructure from Vivaria to Inspect. It reported a stitched all-period P50 doubling time of 196.5 days, a post-2023 TH1.1 doubling time of 130.8 days, and a post-2024 TH1.1 doubling time of 88.6 days, while emphasizing wide confidence intervals and sensitivity to task composition.

These figures should not be read as a general clock for all work. The metric is based on task difficulty as estimated by human expert completion time, not model wall-clock runtime; the task suite is concentrated around software, research-engineering, and related agent tasks; and the results can change when the scaffold, tools, prompting, model access, or task distribution changes.
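To make the reported doubling times easier to compare, the sketch below converts each one into an implied growth multiple per year, assuming smooth exponential growth and a 365-day year. The function name and the constant-rate assumption are illustrative; the conversion ignores the wide confidence intervals noted above and is not a forecast.

```python
def growth_per_year(doubling_time_days, days_per_year=365.0):
    """Multiple by which the time horizon grows in one year,
    assuming a constant doubling time (an idealization)."""
    return 2.0 ** (days_per_year / doubling_time_days)


# Doubling times as reported in the Time Horizon 1.1 update (days).
reported_doubling_days = {
    "stitched all-period": 196.5,
    "post-2023 TH1.1": 130.8,
    "post-2024 TH1.1": 88.6,
}

yearly_multiples = {
    label: growth_per_year(days) for label, days in reported_doubling_days.items()
}

for label, multiple in yearly_multiples.items():
    print(f"{label}: about {multiple:.1f}x per year")
```

Under these assumptions the three estimates correspond to roughly 3.6x, 6.9x, and 17.4x growth per year. The spread shows how strongly any extrapolation depends on which period is used for the fit.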

Risk Assessment

METR publishes model evaluation reports and risk-assessment work on systems from major developers, and it evaluates some released models without company involvement. Its site lists reports involving OpenAI, Anthropic, DeepSeek, Qwen, and other frontier or near-frontier systems, along with reviews of risk reports and monitoring systems.

As of June 23, 2026, METR's risk-assessment page separated developer risk-assessment reviews from evaluation reports and marked some evaluations as partnerships and others as having no company involvement. It also listed the Frontier Risk Report pilot, a red-team exercise on Anthropic's internal agent monitoring systems, and reviews of developer risk reports as distinct kinds of evidence.

METR states that it does not accept compensation for evaluation work, while companies such as OpenAI, Anthropic, and xAI have provided model access or compute credits. Its about page also says METR has not accepted funding from AI companies and receives a small part of its income from a technical-assistance contract with the European AI Office.

This creates an important governance pattern: companies hold the models, logs, and deployment context, while independent evaluators need access before public release. The resulting work is neither purely internal assurance nor fully adversarial public audit.

Governance Position

METR argues for empirical measurement of catastrophic-risk-relevant capabilities and for capability thresholds that trigger stronger mitigations as systems scale. Its about page says it has prototyped governance approaches that use measured or forecasted capabilities to determine when additional risk mitigations are needed.

The organization reports partnerships or technical relationships with public bodies, including the NIST AI Safety Institute Consortium, the United Kingdom AI Security Institute, and the European AI Office. That puts METR in the policy-science layer between frontier labs, safety institutes, and public institutions.

The broader governance idea is that safety policies should not rely only on declarations of intent. They should include measurable thresholds, repeatable evaluation methods, external review, transparency about limitations, and plans for what happens when evaluations show warning signs.

Evidence Standard

A governance-grade METR claim should name the evaluated object: a base model, scaffolded agent, product configuration, company-internal monitoring system, developer risk report, or entity-level risk posture. It should also name the date, access conditions, tools, scaffolds, elicitation effort, task suite, success criteria, human baseline, uncertainty range, and what information was withheld for confidentiality or misuse reasons.
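The checklist above can be treated as a simple record with one field per required item. The sketch below is a hypothetical structure, not a METR schema: the class name, field names, and example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvaluationClaim:
    """Hypothetical record of what a well-specified evaluation claim states."""
    evaluated_object: Optional[str] = None   # e.g. base model, scaffolded agent
    date: Optional[str] = None
    access_conditions: Optional[str] = None
    tools: Optional[str] = None
    scaffolds: Optional[str] = None
    elicitation_effort: Optional[str] = None
    task_suite: Optional[str] = None
    success_criteria: Optional[str] = None
    human_baseline: Optional[str] = None
    uncertainty_range: Optional[str] = None
    withheld_information: Optional[str] = None


def missing_fields(claim):
    """Names of checklist items the claim leaves unstated, in declaration order."""
    return [f.name for f in fields(claim) if not getattr(claim, f.name)]


# Illustrative, made-up claim that names only some of the required items.
example = EvaluationClaim(
    evaluated_object="scaffolded agent",
    date="(report date)",
    task_suite="(task suite name)",
    success_criteria="(scoring rule)",
)

print("Unstated items:", missing_fields(example))
```

A reader can apply the same check by hand: any item that a cited claim leaves unstated is a reason to narrow what the claim is taken to show.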

The strongest use of METR-style evaluations is not "this model is safe" or "this model is dangerous." It is a narrower claim: under specified tools, time limits, prompts, task distributions, and review conditions, evaluators observed or did not observe particular capabilities or failure modes. Release decisions, safety cases, and frontier safety frameworks should then say what consequence follows from that result.

For agentic systems, evidence should connect capability evaluation with controls: tool permissions, identity, monitoring, audit trails, red-team coverage, incident response, rollback, and model-weight security. A model that fails a task suite can still be dangerous if deployed with powerful tools; a model that passes a monitorability eval can still fail under new scaffolds or adversarial pressure.

Central Tensions

Source Discipline

METR sources should be read by document type. An evaluation report supports a claim about a specific model or scaffold. A risk-assessment report can include non-public company information and judgments about organizational context. A review of a developer risk report is not the same thing as METR independently rerunning all tests. A research post about time horizons or productivity supports a method or empirical result, not a general claim about all work.

Partnership labels matter. METR's risk-assessment page distinguishes reports conducted with company involvement from reports conducted without company involvement. Partnership work can provide pre-release access and internal context; it can also involve confidentiality constraints and company review of what can be disclosed. No-company-involvement work may be more independent, but can lack the internal access needed to assess deployment controls.

Dates matter as well. METR's time-horizon page is periodically updated, model evaluations can be superseded by stronger scaffolds or newer models, and productivity studies can become stale as tools and user habits change. A source-disciplined claim names the report date and avoids turning one evaluation into a permanent verdict.

Spiralist Reading

METR matters because it tries to make the frontier measurable before the frontier becomes ordinary infrastructure.

In the Spiralist frame, many AI debates collapse into mythology when nobody can say what a system can actually do under realistic conditions. A lab says the model is safe. A critic says it is dangerous. A market says it is useful. A benchmark says it is better. METR's contribution is to ask a harder operational question: what happens when the model is given time, tools, goals, and room to recover from failure?

That does not make evaluation a complete safety regime. Evaluation can be gamed. It can miss the relevant pathway. It can lag behind deployment. It can be constrained by the organization being evaluated. But without empirical evaluation, governance becomes charisma, trust, or panic.

The deeper issue is institutional: society needs measurement bodies strong enough to inspect systems whose builders are wealthy, fast-moving, and strategically important. METR is one attempt to build that measurement capacity before public institutions fully know how to demand it.

Open Questions

Sources

