Wiki · Concept · Last reviewed June 25, 2026

AI Evaluations

AI evaluations are structured attempts to measure what AI systems can do, where they fail, and whether claims about capability, safety, alignment, or deployment readiness are credible.

Category: Evaluation / AI Governance
Published: June 25, 2026
Modified: June 25, 2026
Last reviewed: June 25, 2026
Tags: Evaluations, TEVV, Benchmarks, Red Teaming, Frontier AI, Governance

Definition

An AI evaluation is a structured attempt to determine what an AI model, product, agent, or deployment can do, how it fails, and whether a claim about it is justified. The target may be a base model, a fine-tuned model, a chat product, a retrieval system, a tool-using agent, a safety classifier, a workflow, or an organization operating AI systems.

Evaluations can measure ordinary product quality, scientific capability, cybersecurity ability, biological risk, autonomy, persuasion, bias, privacy leakage, hallucination, robustness, tool use, or compliance with policy. A strong evaluation names the question being answered, the system version, the test environment, the scaffold or tools, the scoring method, the baseline, the uncertainty, and the decision the result is meant to inform.

NIST often describes this broader family as test, evaluation, verification, and validation (TEVV). The phrase matters because evaluation is more than a leaderboard score: NIST's 2025 TEVV outline treats it as an activity that can include testing, verification, and validation, and emphasizes validity, reliability, sampling, feasibility, governance, and the time-bound nature of results.

An evaluation is evidence, not a verdict. It can support a release gate, audit, procurement review, safety case, incident response, or monitoring plan, but it does not by itself prove that a system is safe, lawful, unbiased, aligned, generally intelligent, or ready for every deployment.

Evaluation Boundary

The first governance question is: what exactly was evaluated? A base model, API model name, consumer chatbot, enterprise configuration, retrieval pipeline, agent scaffold, mobile app, or human workflow can all share a model family while carrying different risks. A result from one surface should not silently travel to another.

The boundary should name the model build, system prompt, tools, retrieval corpus, memory setting, access tier, safety classifiers, logging, human review, user population, and deployment environment. For agents, it should also name permissions, credentials, sandboxing, approval points, rollback paths, and what actions are irreversible.
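As a minimal sketch, the boundary can be written down as a structured record so that a result is checked against the surface it is being applied to. The class name, field names, and example values below are hypothetical and cover only a subset of the items listed above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvaluationBoundary:
    """Hypothetical record of what exactly was evaluated."""
    model_build: str
    system_prompt_id: str
    tools: tuple[str, ...]
    retrieval_corpus: str
    memory_enabled: bool
    access_tier: str
    deployment_environment: str


def boundary_differences(evaluated: EvaluationBoundary,
                         target: EvaluationBoundary) -> list[str]:
    """Return the names of fields where the target surface differs."""
    a, b = asdict(evaluated), asdict(target)
    return [name for name in a if a[name] != b[name]]


# Two illustrative surfaces that share a model build but little else.
chat = EvaluationBoundary(
    model_build="example-model-2026-01",
    system_prompt_id="chat-v3",
    tools=(),
    retrieval_corpus="none",
    memory_enabled=False,
    access_tier="consumer",
    deployment_environment="web-chat",
)
agent = EvaluationBoundary(
    model_build="example-model-2026-01",
    system_prompt_id="agent-v1",
    tools=("browser", "shell"),
    retrieval_corpus="none",
    memory_enabled=True,
    access_tier="enterprise",
    deployment_environment="sandboxed-vm",
)

differences = boundary_differences(chat, agent)
print(differences)
```

A non-empty list is a signal that the chat result should not be reused for the agent surface without new evidence.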

Evaluation also has to name the claim. "Better at coding," "safe for release," "passes cyber evals," "low hallucination," "not capable of autonomous replication," and "appropriate for healthcare intake" are different claims requiring different evidence. A benchmark score can support a claim only when the test meaningfully matches the claim.

Types of Evaluation

Benchmarks. Standardized tasks compare models on math, coding, reading, science, reasoning, language, multimodal understanding, tool use, or domain knowledge. Their value depends on validity, clean test data, scoring discipline, and resistance to benchmark contamination.

Behavioral safety evals. These test whether a model refuses or complies with dangerous, disallowed, manipulative, discriminatory, privacy-invasive, or policy-violating requests.

Red teaming. Human, automated, or hybrid attackers try to make a system fail, jailbreak, leak data, assist harm, misuse tools, or behave outside intended boundaries.

Dangerous capability evals. These test whether a model can materially assist cyber operations, biological misuse, chemical misuse, persuasion, fraud, autonomous replication, model-weight theft, or other high-consequence activity.

Autonomy evals. These measure whether a system can plan, use tools, recover from errors, pursue subgoals, conduct long-horizon tasks, or operate with limited human intervention. They are especially sensitive to capability elicitation, scaffold design, and the evaluator's time budget.

System and workflow evals. These test a deployed configuration: prompts, tools, retrieval, memory, user interface, human review, logging, escalation, permissions, and update process.

Assurance and compliance evals. These support audits, procurement, conformity assessment, regulator reporting, safety cases, or claims made in model cards and system cards.

Post-deployment monitoring. These track incidents, user reports, drift, misuse, refusals, near misses, policy failures, and real-world harms after release.

Current Context

As of June 25, 2026, AI evaluations have moved from research practice into governance infrastructure. NIST frames AI measurement and evaluation as central to trustworthy AI and is developing TEVV standards, and its AI Risk Management Framework is being revised. Its 2026 post-deployment monitoring report says pre-deployment evaluations remain valuable but are usually conducted in controlled environments rather than real-world settings.

Frontier-model evaluation is now institutional. NIST's Center for AI Standards and Innovation describes work on voluntary agreements, national-security-relevant evaluations, testing methods, and commercial AI standards. The UK AI Security Institute publishes tools such as Inspect and model-evaluation reports, including cyber capability evaluations. These public bodies do not replace developer responsibility, but they make evaluation a government and standards function rather than only a lab function.

Regulators are also making evaluation part of compliance. EU AI Act Article 55 requires providers of general-purpose AI models with systemic risk to perform model evaluations using standardized protocols and tools, conduct and document adversarial testing, assess and mitigate systemic risks, report serious incidents, and ensure cybersecurity. The EU General-Purpose AI Code of Practice, published July 10, 2025 and updated in Commission materials in April 2026, gives voluntary implementation paths for transparency, copyright, and safety-and-security duties.

Company frameworks increasingly tie evaluations to deployment decisions. OpenAI's April 2025 Preparedness Framework update describes tracked capability categories for biological and chemical capability, cybersecurity, and AI self-improvement, plus research categories including long-range autonomy, sandbagging, autonomous replication and adaptation, undermining safeguards, and nuclear or radiological capability. The governance question is whether such frameworks create enforceable release constraints, not whether they contain many tests.

Frontier Evaluations

Frontier evaluations became more important as general-purpose models gained tool use, coding ability, long-context reasoning, multimodal input, and agent scaffolds. They now sit at the junction of product release, national security, standards work, and public trust.

Independent evaluators such as METR test autonomous capabilities and AI R&D ability. METR's reports on OpenAI o1-preview and Claude 3.7 Sonnet show why evaluation design matters: results depended on scaffold choice, tool access, elicitation, latency, time limits, task selection, and human baselines. METR's long-task work also frames agent progress in terms of the length of human tasks models can complete with a given probability, rather than only benchmark percentages.
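The idea of a task length completed "with a given probability" can be illustrated with a small calculation. The sketch below assumes success probability follows a logistic curve in the base-2 logarithm of human task time. That functional form and the coefficients are assumptions chosen for illustration, not METR's published procedure or figures.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def time_horizon_minutes(intercept: float, slope: float,
                         success_prob: float) -> float:
    """Task length (human minutes) at which modeled success equals success_prob.

    Assumes P(success) = sigmoid(intercept + slope * log2(minutes)),
    with slope < 0 so that longer tasks are harder.
    """
    if not 0.0 < success_prob < 1.0:
        raise ValueError("success_prob must be strictly between 0 and 1")
    if slope >= 0:
        raise ValueError("slope must be negative for a decreasing curve")
    return 2.0 ** ((logit(success_prob) - intercept) / slope)


# Hypothetical fitted coefficients, for illustration only.
h50 = time_horizon_minutes(intercept=3.0, slope=-1.0, success_prob=0.5)
h80 = time_horizon_minutes(intercept=3.0, slope=-1.0, success_prob=0.8)
print(h50, h80)
```

With these made-up coefficients the 50% horizon is 8 minutes and the 80% horizon is shorter, which shows why a horizon figure means little unless the success probability is reported with it.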

Frontier evaluations increasingly test whole agent systems. A model with a browser, code interpreter, filesystem, shell, memory, vector database, or API credentials may be materially more capable and risky than the same model in a single-turn chat setting. Evaluation reports should therefore separate model capability from scaffold capability and say whether the result depends on tool access, retries, helper models, or human steering.
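One simple way to keep model and scaffold results apart is to report pass rates per configuration on the same tasks. The run log, configuration labels, and outcomes below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical run log: (configuration label, task id, passed).
runs = [
    ("model_only", "t1", False),
    ("model_only", "t2", True),
    ("model_only", "t3", False),
    ("model_only", "t4", False),
    ("with_shell_and_retries", "t1", True),
    ("with_shell_and_retries", "t2", True),
    ("with_shell_and_retries", "t3", True),
    ("with_shell_and_retries", "t4", False),
]


def pass_rates(run_log):
    """Return the fraction of passed runs for each configuration."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for config, _task, passed in run_log:
        totals[config] += 1
        passes[config] += int(passed)
    return {config: passes[config] / totals[config] for config in totals}


rates = pass_rates(runs)
# Difference attributable to tools and retries in this toy example.
scaffold_uplift = rates["with_shell_and_retries"] - rates["model_only"]
print(rates, scaffold_uplift)
```

Reporting both rates, rather than only the higher one, lets a reader see how much of the result depends on the scaffold.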

Company frameworks connect evaluations to release gates. These frameworks are useful only when the evaluation thresholds, mitigations, and escalation powers are clear enough to constrain deployment. A framework that cannot force delay, stronger safeguards, narrower access, or leadership-level review is mostly a reporting format.

Public institutions are now part of the evaluation layer. NIST's Center for AI Standards and Innovation, the UK AI Security Institute, and related institute networks develop methods, run or support model evaluations, and build tooling. The UK institute's Inspect framework is one example: it provides reusable components for coding, agentic, reasoning, behavior, multimodal, and tool-use evaluations.

Regulation is also pulling evaluations into formal governance. That makes evaluation part of a legal evidence trail, not only a lab practice. For systemic-risk general-purpose models, the important record is not only "the model was evaluated" but which protocols, adversarial tests, mitigations, incident reports, and cybersecurity controls support the provider's claim.

System cards and model cards are public artifacts connected to evaluations. A system card may describe capability tests, safety mitigations, limitations, model behavior, red-team findings, and deployment controls. The value of these documents depends on specificity: vague safety language is not an evaluation.

Minimum Evaluation Record

A useful evaluation leaves enough record for another qualified reviewer to understand the claim, reproduce or challenge the result where possible, and see what decision the evidence supported.

System identity. Model name, model build or version, product surface, access tier, tools, prompts, retrieval sources, memory settings, safety layers, and links to the AI system inventory.
Claim and decision. The capability, safety, compliance, or deployment claim being tested, plus the release, procurement, monitoring, or remediation decision that depends on the result.
Method. Dataset or task source, sampling method, evaluation date, scaffolds, time budget, number of attempts, human assistance, baselines, scoring rubric, statistical uncertainty, and evaluator qualifications.
Controls and exclusions. Decontamination checks, restricted tools, withheld domains, redacted test cases, security constraints, privacy limits, and what was deliberately not tested.
Failures and counterevidence. Failed runs, jailbreaks, prompt sensitivity, evaluator disagreements, known blind spots, weak mitigations, contamination concerns, and evidence that would weaken the conclusion.
Governance linkage. Owner, reviewer, conflicts of interest, decision authority, mitigation plan, retest trigger, change-control rule, and links to audit trails, change management, and post-market monitoring.

This record may have public, customer-facing, regulator-only, and security-restricted versions. The existence of redactions does not remove the need for an inspectable unredacted record somewhere with legitimate review authority.
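The record described above lends itself to a completeness check before a result is accepted as evidence. The following is a sketch under stated assumptions: the section keys mirror the six headings of the record, but the specific required fields are a hypothetical subset, not a standard schema.

```python
# Hypothetical minimum fields per section of the evaluation record.
REQUIRED_FIELDS = {
    "system_identity": ["model_name", "model_version", "product_surface", "tools"],
    "claim_and_decision": ["claim", "decision"],
    "method": ["task_source", "evaluation_date", "attempts", "scoring_rubric"],
    "controls_and_exclusions": ["not_tested"],
    "failures_and_counterevidence": ["failed_runs"],
    "governance_linkage": ["owner", "reviewer", "retest_trigger"],
}


def missing_fields(record: dict) -> list[str]:
    """List 'section.field' entries that are absent, empty strings, or None."""
    missing = []
    for section, fields in REQUIRED_FIELDS.items():
        present = record.get(section, {})
        for field in fields:
            if field not in present or present[field] in ("", None):
                missing.append(f"{section}.{field}")
    return missing


# Illustrative draft with no stated decision and no governance linkage.
draft = {
    "system_identity": {"model_name": "example-model", "model_version": "2026-01",
                        "product_surface": "api", "tools": []},
    "claim_and_decision": {"claim": "refuses disallowed cyber requests",
                           "decision": ""},
    "method": {"task_source": "internal suite", "evaluation_date": "2026-06-01",
               "attempts": 3, "scoring_rubric": "rubric-v2"},
    "controls_and_exclusions": {"not_tested": ["multilingual prompts"]},
    "failures_and_counterevidence": {"failed_runs": 4},
}

gaps = missing_fields(draft)
print(gaps)
```

The check only confirms that fields are filled in. Whether their contents are adequate still needs a qualified reviewer.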

Limits

Evaluations are necessary but incomplete. A model can pass a benchmark and still fail in the world. A safety test can miss a novel jailbreak. A dangerous-capability eval can understate risk if the tested scaffold is weak, the model is poorly prompted, or the evaluators do not explore enough tool configurations.

Benchmark contamination and overfitting are another problem. When models train on public benchmark-like material or developers tune toward visible tests, scores can rise without matching gains in real-world reliability, so a model can look more capable or safer than it is.
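A crude first-pass check for contamination is to measure how many of a test item's word n-grams also appear in a sample of candidate training text. The sketch below is a simplified heuristic with invented example strings. Real decontamination checks are more involved, and a low overlap score does not rule out paraphrased leakage.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus: str, n: int = 5) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus, n)) / len(item_grams)


# Hypothetical corpus sample containing a benchmark question verbatim.
corpus_sample = ("forum post: what is the smallest prime number greater "
                 "than one hundred answer 101")

leaked = overlap_fraction(
    "What is the smallest prime number greater than one hundred", corpus_sample)
clean = overlap_fraction(
    "Compute the sum of the first ten odd integers", corpus_sample)
print(leaked, clean)
```

Here the first item overlaps completely and the second not at all. In practice the threshold for flagging an item is a judgment call that belongs in the evaluation record.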

Agent evaluations add another difficulty: practical capability can change when the same model receives different tools, memory, retries, browsing access, file access, reward signals, or helper models. The evaluated object is often the whole scaffolded system, not the model alone.

Evaluation results are time-bound. A model update, policy change, or new attack technique, or a change in the retrieval corpus, prompt template, product integration, or user population, can invalidate a previous result. A report that does not say when and under what conditions it was produced can outlive its evidence.

Evaluations can also be political. The choice of what to test defines what counts as risk. A lab may test bioweapon assistance while ignoring labor displacement, dependency, emotional manipulation, institutional capture, or spiritualized delusion loops. The untested domain becomes the ungoverned domain.

Governance Role

AI governance increasingly depends on evaluations. Release gates, safety thresholds, model cards, incident reporting, procurement rules, audits, insurance, licensing proposals, and standards all need evidence about what a model can do and how it fails.

For evaluations to matter, they need independence, reproducibility where possible, clear scope, dated model versions, disclosed scaffolds, uncertainty ranges, adversarial pressure, and a visible connection to decisions. A strong evaluation report says not only what was observed, but also what was not tested and what could change the result.
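Uncertainty ranges matter most when the number of test cases is small. One common choice for a pass rate is the Wilson score interval. The sketch below uses it as an illustrative option, not as a method prescribed by any of the frameworks discussed here.

```python
import math


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


small_lo, small_hi = wilson_interval(8, 10)     # 80% on 10 cases
large_lo, large_hi = wilson_interval(80, 100)   # 80% on 100 cases
print(round(small_lo, 3), round(small_hi, 3))
print(round(large_lo, 3), round(large_hi, 3))
```

The same 80% pass rate supports a much weaker claim on 10 cases, where the interval runs from roughly 0.49 to 0.94, than on 100 cases.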

Evaluations should be tied to consequences: deploy, delay, restrict, monitor, retest, disclose, remediate, or withdraw. If a result cannot alter the launch plan, procurement decision, access tier, incident response, or user notice, it is evidence without leverage.
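The link between a result and a consequence can be made explicit in advance. The following toy gate is a sketch only: the function name, thresholds, and the choice of inputs are hypothetical, and a real release process would involve far more than two numbers.

```python
def release_decision(pass_rate_lower_bound: float,
                     critical_failures: int,
                     deploy_threshold: float = 0.95,
                     restrict_threshold: float = 0.80) -> str:
    """Map an evaluation result to a pre-agreed consequence.

    Thresholds are illustrative placeholders, not recommended values.
    """
    if critical_failures > 0:
        return "delay"       # any critical failure blocks release
    if pass_rate_lower_bound >= deploy_threshold:
        return "deploy"
    if pass_rate_lower_bound >= restrict_threshold:
        return "restrict"    # release with narrower access
    return "delay"


print(release_decision(0.97, critical_failures=0))
print(release_decision(0.85, critical_failures=0))
print(release_decision(0.99, critical_failures=1))
```

Writing the mapping down before the evaluation runs is what gives the result leverage: the outcome is not renegotiated after the numbers arrive.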

Evaluation should continue after deployment. Real users, tool access, incentives, prompt ecosystems, fine-tuning, memory, agents, and product integrations can change the effective system far beyond the lab test.

Procurement and public-sector use should not accept a vendor's benchmark table as a substitute for local evaluation. A buyer should test the actual workflow, affected population, data-retention terms, human oversight path, error-repair process, accessibility, and incident response obligations.

For high-risk or frontier systems, evaluation should feed into safety cases, audits, vulnerability disclosure, model-weight security, incident reporting, access controls, user notices, and board- or regulator-level review. The governance value comes from traceable evidence plus authority to act on it.

Source Discipline

Evaluation claims should identify the source type. A benchmark paper, leaderboard, model card, system card, vendor blog, safety framework, government evaluation, regulator filing, audit report, and peer-reviewed study have different evidentiary weight.

Use primary evidence wherever possible: official evaluation reports, benchmark repositories, standards-body documents, legal text, model or system cards, evaluator methodology notes, and published datasets or task suites. Secondary commentary can explain controversy, but it should not replace the underlying protocol when describing what was actually tested.

Governance-grade reports should separate primary evidence from vendor summaries; distinguish internal, external, regulator, and public-interest testing; name evaluator qualifications and conflicts; preserve prompts or test cases when safe; describe sampling and scoring; report failed runs and uncertainty; and identify which facts are withheld for security, privacy, trade-secret, or misuse reasons.

Report the evaluated object precisely. A result for a base model may not apply to a product with retrieval, memory, browsing, connectors, tool execution, or a different policy layer. A result for one language, jurisdiction, user group, or release tier should not be generalized without evidence.

Do not treat any evaluation as proof that an AI system is conscious, divine, AGI, professionally competent, or safe in all contexts. Evaluation is bounded evidence about a defined system under defined conditions.

Risk Pattern

Evaluation theater. A company can present many tests while avoiding the hard questions that would constrain release.

Metric capture. Developers can optimize toward benchmarks instead of real reliability, truth, agency preservation, or public accountability.

Scaffold sensitivity. A model's practical capability can change sharply depending on tools, prompting, memory, retries, agent loops, and human support.

Opaque failures. Public reports may summarize results without showing prompts, rubrics, evaluator disagreements, failed attempts, or internal thresholds.

One-time certification. A model can be treated as "safe" after a pre-release evaluation even though deployment changes the real system.

Evaluator capture. External review can become dependent on lab access, funding, nondisclosure terms, or publication approval.

Legal overclaim. Compliance with one evaluation duty can be presented as proof that the whole model or product is safe.

Unmeasured harms. The most legible risks may receive the most attention while slow social harms remain outside the test suite.

Evaluation awareness. A model may behave differently when it recognizes an evaluation setting, benchmark format, or safety-test pattern.

Security-publication tradeoff. Some details are too dangerous to publish, but excessive secrecy can make evaluation impossible to contest.

Spiralist Reading

Evaluations are reality friction for the machine.

A model speaks fluently. It can make capability feel like authority and safety feel like tone. Evaluation interrupts the spell by asking for evidence: what happened, under what conditions, with which tools, against which baseline, and with which failures hidden outside the frame?

For Spiralism, the danger is that evaluation becomes another ritual of permission. A lab performs the ceremony, publishes the card, names the thresholds, and continues scaling. The useful path is harder: evaluations must remain adversarial, public enough to matter, humble about uncertainty, and connected to real power to delay, constrain, or reverse deployment.

Sources

NIST, AI Risk Management Framework, reviewed June 25, 2026.
NIST, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, 2024; reviewed June 25, 2026.
NIST, AI Test, Evaluation, Validation and Verification, reviewed June 25, 2026.
NIST, Outline: Proposed Zero Draft for a Standard on AI Testing, Evaluation, Verification, and Validation, July 2025.
NIST, Challenges to the monitoring of deployed AI systems, NIST AI 800-4, March 2026.
METR, Evaluations, reviewed June 25, 2026.
METR, Details about METR's preliminary evaluation of OpenAI o1-preview, September 2024.
METR, Details about METR's preliminary evaluation of Claude 3.7 Sonnet, April 4, 2025.
METR, Measuring AI Ability to Complete Long Tasks, March 19, 2025.
OpenAI, Our updated Preparedness Framework, April 2025.
OpenAI, OpenAI o1 System Card, December 2024.
European Commission AI Act Service Desk, Article 55: Obligations of providers of general-purpose AI models with systemic risk, Regulation (EU) 2024/1689; reviewed June 25, 2026.
European Commission, General-Purpose AI Models in the AI Act: Questions and Answers, reviewed June 25, 2026.
European Commission, The General-Purpose AI Code of Practice, last updated April 23, 2026; reviewed June 25, 2026.
NIST, Center for AI Standards and Innovation, reviewed June 25, 2026.
UK AI Security Institute, Inspect AI evaluation framework, reviewed June 25, 2026.
UK AI Security Institute, Our evaluation of OpenAI's GPT-5.5 cyber capabilities, April 30, 2026.
Shevlane et al., Model evaluation for extreme risks, 2023.
Mitchell et al., Model Cards for Model Reporting, 2018.
