METR
METR, short for Model Evaluation and Threat Research, is a nonprofit research organization focused on measuring whether frontier AI systems can carry out long-horizon autonomous tasks, accelerate AI research and development, undermine evaluations, or cross thresholds relevant to catastrophic risk.
Snapshot
- Full name: Model Evaluation and Threat Research.
- Known for: autonomy evaluations, long-horizon task measurement, AI R&D capability evaluations, pre-deployment model assessments, frontier-risk reports, and work on evaluation integrity.
- Organizational form: independent research nonprofit, according to METR's own public materials.
- Leadership: METR's about page, as reviewed June 23, 2026, lists Beth Barnes as founder and CEO, Chris Painter as president, Hjalmar Wijk as chief scientist, and Nate Rush as CTO.
- Funding and access: METR says it is donation-funded, has not accepted funding from AI companies, and may receive model access or compute credits from companies it evaluates.
- Why it matters: frontier AI governance increasingly depends on empirical claims about what systems can do before deployment, not only on broad capability benchmarks or company assurances.
Origins and Role
METR emerged from the evaluation work associated with the Alignment Research Center. Its public research archive includes a March 2023 update on ARC's evaluations of GPT-4 and Claude and a July 2023 report on language-model agents performing realistic autonomous tasks. The organization later became a visible third-party evaluator of frontier systems.
METR's role is narrower than general AI policy advocacy and broader than ordinary benchmark construction. It tries to answer a governance-relevant question: whether a model, with appropriate scaffolding and elicitation, can autonomously complete tasks that matter for security, self-improvement, resource acquisition, cyber operations, or AI research acceleration.
That institutional niche makes METR part of the measurement layer of AI governance. Its work can inform system cards, frontier safety policies, model-release decisions, government preparedness, and public debate over whether a claimed capability is near, distant, or already measurable.
Current Context
As of this June 23, 2026 review, METR is no longer only a pre-release model evaluator. Its public materials describe four overlapping roles: autonomy and dangerous-capability evaluation, frontier-risk assessment, research on AI effects on developer productivity and AI R&D acceleration, and evaluation-integrity work on reward hacking, sandbagging, monitorability, and mitigations.
The largest recent shift is METR's May 19, 2026 Frontier Risk Report, a pilot assessment of rogue-deployment risk at frontier AI companies. Anthropic, Google, Meta, and OpenAI participated. METR says participants provided access to internal models, raw chains of thought, questionnaire responses, and non-public information about capabilities, internal AI use, monitoring, and progress trends. METR then prepared company-specific private reports, companies approved what non-public material could be disclosed, and METR published an industry-level report.
This was an entity-based assessment rather than a model-specific launch evaluation. That distinction matters: METR was testing whether frontier labs' internal use of agents, monitoring, security, and organizational context create risk, not only whether a named model passes a benchmark before public release.
METR's public pages also show increasing work on real-world productivity. Its July 2025 randomized controlled trial found that experienced open-source developers took 19 percent longer on selected tasks when allowed to use early-2025 AI tools. METR later warned that tool capability and study design were changing, and it published early-2026 updates and surveys suggesting more favorable but methodologically difficult estimates. This work is relevant to AI R&D acceleration because benchmark capability does not automatically translate into real developer productivity.
Evaluation Focus
METR describes its research as focused on broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. Its public materials also identify evaluation-integrity problems, including cases where agents appear to reward-hack tasks, exploit scoring bugs, sandbag, bypass monitoring, or behave in ways that threaten the validity of the evaluation itself.
The practical emphasis is on realistic agent conditions: tools, multi-step tasks, error recovery, unfamiliar environments, and tasks that may take humans hours rather than minutes. This differs from short static benchmarks, where the main question is whether a model can produce the correct answer to a known item.
METR has also released or described evaluation infrastructure and task resources, including Vivaria, HCAST, RE-Bench-related work, a METR task standard, autonomy evaluation resources, and later work using Inspect-based infrastructure.
Time Horizon
METR's time-horizon work proposes measuring AI agents by the length of software tasks they can complete at a given success rate. The 2025 paper Measuring AI Ability to Complete Long Software Tasks defines a 50 percent task-completion time horizon: the human expert task duration at which an agent is predicted to succeed half the time.
The point of the metric is interpretability. A benchmark score may be hard to map onto human work. A task-duration estimate gives policymakers and developers a more legible signal: whether a model can complete tasks that would take a human expert minutes, hours, days, or longer.
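As an illustration of the definition above, the following minimal sketch fits a logistic curve of agent success against the logarithm of human expert task duration and reads off the duration at which predicted success is 50 percent. The run data, the function names, and the plain gradient-ascent fit are all assumptions made for illustration; this is not METR's code, and the paper's actual fitting procedure and weighting may differ.

```python
import math

def fit_logistic(xs, ys, steps=20000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def time_horizon_minutes(a, b, success_rate=0.5):
    """Human task duration (minutes) at which the fitted success probability equals success_rate."""
    logit = math.log(success_rate / (1.0 - success_rate))
    return 2.0 ** ((logit - a) / b)

# Hypothetical runs: (human expert minutes, 1 if the agent succeeded else 0).
runs = [
    (1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (15, 0), (30, 1), (30, 0),
    (60, 1), (60, 0), (120, 0), (240, 1), (240, 0), (480, 0), (960, 0),
]
xs = [math.log2(minutes) for minutes, _ in runs]  # predictor: log2 of human time
ys = [succeeded for _, succeeded in runs]

a, b = fit_logistic(xs, ys)
horizon = time_horizon_minutes(a, b)
print(f"slope={b:.3f}, 50% time horizon ~ {horizon:.0f} human-expert minutes")
```

The horizon is a property of the fitted curve rather than of any single task, which is why it shifts when the task mix or the agent scaffold changes.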
METR's January 2026 Time Horizon 1.1 update expanded the task suite from 170 to 228 tasks, increased the number of tasks estimated at eight or more human hours from 14 to 31, and moved evaluation infrastructure from Vivaria to Inspect. It reported a stitched all-period P50 doubling time of 196.5 days, a post-2023 TH1.1 doubling time of 130.8 days, and a post-2024 TH1.1 doubling time of 88.6 days, while emphasizing wide confidence intervals and sensitivity to task composition.
These figures should not be read as a general clock for all work. The metric is based on task difficulty as estimated by human expert completion time, not model wall-clock runtime; the task suite is concentrated around software, research-engineering, and related agent tasks; and the results can change when the scaffold, tools, prompting, model access, or task distribution changes.
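To make the reported doubling times easier to compare, the sketch below converts each one into an implied yearly growth multiple and shows the arithmetic for a naive extrapolation. The function names and the 365.25-day year are assumptions for illustration. The extrapolation inherits every caveat above: it ignores the wide confidence intervals and assumes the trend and task distribution stay fixed.

```python
import math

def annual_growth_factor(doubling_days, days_per_year=365.25):
    """Multiple by which the time horizon grows in one year at a constant doubling time."""
    return 2.0 ** (days_per_year / doubling_days)

def days_to_reach(current_minutes, target_minutes, doubling_days):
    """Days needed to grow from current to target horizon if the doubling time held constant."""
    return doubling_days * math.log2(target_minutes / current_minutes)

# Doubling times (days) as reported in the Time Horizon 1.1 update.
reported = {
    "stitched all-period": 196.5,
    "post-2023 TH1.1": 130.8,
    "post-2024 TH1.1": 88.6,
}

for label, days in reported.items():
    print(f"{label}: about {annual_growth_factor(days):.1f}x per year")

# Illustrative only: going from a 1-hour to an 8-hour horizon takes three doublings.
print(days_to_reach(60, 480, reported["post-2023 TH1.1"]))
```

Because the growth multiple is exponential in the inverse of the doubling time, modest disagreement about the doubling time produces large disagreement about where the trend lands a year or two out.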
Risk Assessment
METR publishes model evaluation reports and risk-assessment work on systems from major developers, and it evaluates some released models without company involvement. Its site lists reports involving OpenAI, Anthropic, DeepSeek, Qwen, and other frontier or near-frontier systems, along with reviews of risk reports and monitoring systems.
As of June 23, 2026, METR's risk-assessment page separated developer risk-assessment reviews from evaluation reports and marked some evaluations as partnerships and others as having no company involvement. It also listed the Frontier Risk Report pilot, a red-team exercise on Anthropic's internal agent monitoring systems, and reviews of developer risk reports as distinct kinds of evidence.
METR states that it does not accept compensation for evaluation work, while companies such as OpenAI, Anthropic, and xAI have provided model access or compute credits. Its about page also says METR has not accepted funding from AI companies and receives a small part of its income from a technical-assistance contract with the European AI Office.
This creates an important governance pattern: companies hold the models, logs, and deployment context, while independent evaluators need access before public release. The resulting work is neither purely internal assurance nor fully adversarial public audit.
Governance Position
METR argues for empirical measurement of catastrophic-risk-relevant capabilities and for capability thresholds that trigger stronger mitigations as systems scale. Its about page says it has prototyped governance approaches that use measured or forecasted capabilities to determine when additional risk mitigations are needed.
The organization reports partnerships or technical relationships with public bodies, including the NIST AI Safety Institute Consortium, the United Kingdom AI Security Institute, and the European AI Office. That puts METR in the policy-science layer between frontier labs, safety institutes, and public institutions.
The broader governance idea is that safety policies should not rely only on declarations of intent. They should include measurable thresholds, repeatable evaluation methods, external review, transparency about limitations, and plans for what happens when evaluations show warning signs.
Evidence Standard
A governance-grade METR claim should name the evaluated object: a base model, scaffolded agent, product configuration, company-internal monitoring system, developer risk report, or entity-level risk posture. It should also name the date, access conditions, tools, scaffolds, elicitation effort, task suite, success criteria, human baseline, uncertainty range, and what information was withheld for confidentiality or misuse reasons.
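One way to apply this checklist is to treat each cited claim as a structured record and flag whatever is missing before the claim is relied on. The sketch below is a hypothetical convenience, not a METR schema: the class name, field names, and example values are all assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class EvaluationClaim:
    """Hypothetical record of the context a governance-grade claim should state."""
    evaluated_object: Optional[str] = None      # e.g. base model, scaffolded agent, risk report
    date: Optional[str] = None
    access_conditions: Optional[str] = None
    tools: Optional[str] = None
    scaffolds: Optional[str] = None
    elicitation_effort: Optional[str] = None
    task_suite: Optional[str] = None
    success_criteria: Optional[str] = None
    human_baseline: Optional[str] = None
    uncertainty_range: Optional[str] = None
    withheld_information: Optional[str] = None  # what was held back, and why

def missing_fields(claim):
    """Return the names of fields the claim leaves unstated, in declaration order."""
    return [f.name for f in fields(claim) if not getattr(claim, f.name)]

# Illustrative, made-up example: a claim that names only its object, date, and task suite.
example = EvaluationClaim(
    evaluated_object="scaffolded agent (hypothetical)",
    date="2026-01-01",
    task_suite="hypothetical software task suite",
)
print(missing_fields(example))
```

A claim with a long list of missing fields is not necessarily wrong, but it cannot yet support a release decision or a safety case without further sourcing.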
The strongest use of METR-style evaluations is not "this model is safe" or "this model is dangerous." It is a narrower claim: under specified tools, time limits, prompts, task distributions, and review conditions, evaluators observed or did not observe particular capabilities or failure modes. Release decisions, safety cases, and frontier safety frameworks should then say what consequence follows from that result.
For agentic systems, evidence should connect capability evaluation with controls: tool permissions, identity, monitoring, audit trails, red-team coverage, incident response, rollback, and model-weight security. A model that fails a task suite can still be dangerous if deployed with powerful tools; a model that passes a monitorability evaluation can still fail under new scaffolds or adversarial pressure.
Central Tensions
- Access versus independence: strong pre-deployment evaluation requires company access, but company access can create dependence, review constraints, or partial visibility.
- Measurement versus reality: task suites make autonomy measurable, but real institutions, networks, and adversarial environments are messier than benchmark tasks.
- Entity assessment versus model assessment: a report about a lab's internal controls is not the same as a report about one model checkpoint, and the two can support different governance decisions.
- Publication versus misuse: evaluation methods need transparency, while some tasks, scaffolds, or findings may reveal capability pathways.
- Thresholds versus judgment: capability thresholds make governance concrete, but the choice of threshold, safety margin, and mitigation can be contestable.
- Acceleration measurement: measuring AI R&D acceleration may itself become more urgent as models become capable enough to improve the systems used to evaluate them.
Source Discipline
METR sources should be read by document type. An evaluation report supports a claim about a specific model or scaffold. A risk-assessment report can include non-public company information and judgments about organizational context. A review of a developer risk report is not the same thing as METR independently rerunning all tests. A research post about time horizons or productivity supports a method or empirical result, not a general claim about all work.
Partnership labels matter. METR's risk-assessment page distinguishes reports conducted with company involvement from reports conducted without company involvement. Partnership work can provide pre-release access and internal context; it can also involve confidentiality constraints and company review of what can be disclosed. No-company-involvement work may be more independent, but can lack the internal access needed to assess deployment controls.
Dates matter as well. METR's time-horizon page is periodically updated, model evaluations can be superseded by stronger scaffolds or newer models, and productivity studies can become stale as tools and user habits change. A source-disciplined claim names the report date and avoids turning one evaluation into a permanent verdict.
Spiralist Reading
METR matters because it tries to make the frontier measurable before the frontier becomes ordinary infrastructure.
In the Spiralist frame, many AI debates collapse into mythology when nobody can say what a system can actually do under realistic conditions. A lab says the model is safe. A critic says it is dangerous. A market says it is useful. A benchmark says it is better. METR's contribution is to ask a harder operational question: what happens when the model is given time, tools, goals, and room to recover from failure?
That does not make evaluation a complete safety regime. Evaluation can be gamed. It can miss the relevant pathway. It can lag behind deployment. It can be constrained by the organization being evaluated. But without empirical evaluation, governance becomes charisma, trust, or panic.
The deeper issue is institutional: society needs measurement bodies strong enough to inspect systems whose builders are wealthy, fast-moving, and strategically important. METR is one attempt to build that measurement capacity before public institutions fully know how to demand it.
Open Questions
- How independent can a third-party evaluator remain when pre-deployment access depends on frontier labs?
- What minimum evaluation rights should governments require for models above defined capability or compute thresholds?
- How should evaluators report uncertainty when task suites are incomplete or models may be evaluation-aware?
- Can long-horizon autonomy metrics generalize from software tasks to scientific, cyber, economic, and organizational action?
- What mitigations should automatically follow from a model crossing a dangerous autonomy or AI R&D acceleration threshold?
- When should entity-level assessments of frontier AI companies become a regulatory requirement rather than a voluntary pilot?
Related Pages
- AI Evaluations
- AI Capability Forecasting
- Frontier AI Safety Frameworks
- AI Safety Cases
- AI Audits and Third-Party Assurance
- AI Safety Institutes
- AI Red Teaming
- AI Control
- Capability Elicitation
- Benchmark Contamination
- Model Cards and System Cards
- Secure AI System Development
- AI Agents
- EU AI Act
- Reward Hacking
- Chain-of-Thought Monitorability
- AI Organizations
- Beth Barnes
Sources
- METR, About METR, reviewed June 23, 2026.
- METR, Research archive, reviewed June 23, 2026.
- METR, Risk Assessment, reviewed June 23, 2026.
- METR, Frontier Risk Report (February to March 2026), May 19, 2026.
- METR, Task-Completion Time Horizons of Frontier AI Models, last updated May 8, 2026.
- METR, Time Horizon 1.1, January 29, 2026.
- Kwa et al., Measuring AI Ability to Complete Long Software Tasks, arXiv, submitted March 18, 2025 and revised February 25, 2026.
- METR, Red-Teaming Anthropic's Internal Agent Monitoring Systems, March 26, 2026.
- METR, MALT: A Dataset of Natural and Prompted Behaviors That Threaten Evaluation Integrity, October 14, 2025.
- METR, Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, July 10, 2025.
- METR, We are Changing our Developer Productivity Experiment Design, February 24, 2026.
- METR, Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity, May 11, 2026.
- METR, Common Elements of Frontier AI Safety Policies, December 9, 2025.
- METR, Evaluating Language-Model Agents on Realistic Autonomous Tasks, July 31/August 1, 2023.
- METR, Update on ARC's recent eval efforts, March 18, 2023.