Wiki · Concept · Last reviewed June 25, 2026

NIST AI TEVV

NIST AI TEVV means test, evaluation, validation, and verification work for artificial-intelligence systems. It is the measurement discipline behind claims that an AI system was tested, checked against requirements, judged fit for a use, or evaluated for risk.

Category: Evaluation / AI Governance
Published: June 25, 2026
Modified: June 25, 2026
Last reviewed: June 25, 2026
Tags: NIST, TEVV, AI evaluations, assurance, monitoring, governance

Definition

NIST uses TEVV as a compact label for testing, evaluation, validation, and verification of AI technologies and systems. The National Institute of Standards and Technology describes trustworthy AI products and services as depending heavily on reliable measurements and evaluations of the underlying technologies and their use. NIST's TEVV work develops metrics, measurement methods, evaluation methods, guidelines, and practices, and includes participation in standards work, for AI systems as they mature and enter new applications.

TEVV is not one benchmark and not one certification stamp. Testing asks what happens under specified conditions. Evaluation judges observed behavior against a purpose, metric, risk question, or comparison. Verification asks whether stated requirements are met. Validation asks whether the system is suitable for the intended use, population, context, and consequence. A model can pass a narrow test while being unfit for the deployment that matters.

Why It Matters

AI governance often fails by using evaluation language too loosely. "The model was tested" may mean a public benchmark, an internal red-team exercise, a policy check, a safety classifier run, a simulation, or a customer pilot. TEVV asks for the missing nouns: tested by whom, against what requirement, under what conditions, with what data, on which system version, and for which deployment decision. NIST's AI RMF Playbook places TEVV documentation inside the Measure function and calls for documenting test sets, metrics, and tools because documentation supports valid and reliable measurement.
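The "missing nouns" can be treated as required fields on any statement that a system was tested. The following is a minimal sketch, not a NIST schema: the class name, field names, and example values are all assumptions chosen to mirror the questions in the paragraph above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class EvaluationClaim:
    """One 'the model was tested' statement, split into the nouns TEVV asks for.

    Field names are illustrative assumptions, not NIST terminology.
    """
    tested_by: Optional[str] = None
    requirement: Optional[str] = None
    conditions: Optional[str] = None
    data: Optional[str] = None
    system_version: Optional[str] = None
    deployment_decision: Optional[str] = None


def missing_nouns(claim: EvaluationClaim) -> List[str]:
    """Return the names of fields that are absent or blank, in declaration order."""
    return [
        f.name
        for f in fields(claim)
        if not (getattr(claim, f.name) or "").strip()
    ]


# Hypothetical claim: we know who tested and which version, and nothing else.
claim = EvaluationClaim(
    tested_by="internal red team",
    system_version="example-model-v3",
)
print(missing_nouns(claim))
# ['requirement', 'conditions', 'data', 'deployment_decision']
```

A claim with any missing noun is not necessarily false, but it is not yet specific enough to support a deployment decision.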

Current Context

As of June 25, 2026, NIST's TEVV page describes long-running AI measurement and evaluation work. NIST also published a 2025 proposed zero-draft outline for an AI TEVV standard. The outline says AI TEVV was one of two Zero Drafts pilot topics and proposes a high-level standard rather than detailed prescriptions for every TEVV method.

NIST's broader AI governance material gives TEVV a place in lifecycle risk management. The AI Risk Management Framework is intended for voluntary use and, on NIST's current page, is being revised. NIST AI 800-4, dated March 6, 2026, says pre-deployment evaluations are valuable but mostly occur in controlled environments, while deployed AI systems need monitoring in real-world settings.

TEVV's Four Questions

Test: What did the system do under specified inputs, conditions, tools, users, or adversarial pressure?

Evaluate: What does the result mean against a metric, baseline, safety claim, policy requirement, domain objective, or risk threshold?

Verify: Did the system satisfy stated requirements, controls, documentation duties, security expectations, or release criteria?

Validate: Is the system suitable for the intended real-world use, including affected people, institutional incentives, error repair, oversight, and monitoring?

The terms overlap, but separating them prevents a common category error: treating a laboratory test as validation for a social deployment.
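The category error can be made concrete with a deliberately crude sketch. Here verification is reduced to a threshold check and validation to "the deployment context was among the contexts actually tested". Real validation covers far more (affected people, oversight, error repair), so the names, fields, and numbers below are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class LabResult:
    """Outcome of a test run under specified conditions (hypothetical fields)."""
    metric: str
    value: float
    tested_contexts: FrozenSet[str]


def verify(result: LabResult, required_minimum: float) -> bool:
    """Verification stand-in: does the measured value meet the stated requirement?"""
    return result.value >= required_minimum


def validate(result: LabResult, required_minimum: float, deployment_context: str) -> bool:
    """Validation stand-in: requirement met AND evidence covers the intended use."""
    return verify(result, required_minimum) and deployment_context in result.tested_contexts


# Illustrative numbers only: a result measured on one narrow context.
lab = LabResult(
    metric="accuracy",
    value=0.93,
    tested_contexts=frozenset({"english_support_tickets"}),
)

print(verify(lab, 0.90))                                # True
print(validate(lab, 0.90, "english_support_tickets"))   # True
print(validate(lab, 0.90, "clinical_triage"))           # False
```

The same passing test result verifies the requirement in every case, yet it validates only the deployment it was actually measured for.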

Evidence Record

A TEVV record should let a qualified reviewer reconstruct what was claimed, what was measured, what was not measured, and what decision followed. It should name:

Object: model version, product surface, prompts, tools, retrieval sources, memory, safety layers, access tier, and links to the AI system inventory.
Claim: the capability, risk, requirement, or fitness-for-use question being answered.
Method: test sets, metrics, sampling, evaluator role, baselines, tools, scaffolds, time limits, human assistance, and uncertainty.
Failures: failed tests, adverse cases, evaluator disagreements, known blind spots, contamination concerns, and limits on publication.
Decision: release gate, mitigation, procurement choice, audit conclusion, monitoring trigger, rollback condition, or reason for no action.

The record may have public, customer, auditor, regulator, and security-restricted versions. Redaction can be justified, but there should still be an inspectable record for someone with legitimate review authority.
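One way to picture such a record is as a small data structure with the five named parts and audience-specific views. This is a sketch under stated assumptions: the class, method names, the "[REDACTED]" marker, and all example values are hypothetical, and treating an empty section as a gap is a design choice of this sketch rather than a NIST requirement.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class TEVVRecord:
    """Evidence record with the five parts listed above (illustrative schema)."""
    system_object: Dict[str, str]   # Object: version, surface, tools, safety layers
    claim: str                      # Claim: the question being answered
    method: Dict[str, str]          # Method: test sets, metrics, evaluator role
    failures: List[str] = field(default_factory=list)  # Failures and blind spots
    decision: str = ""              # Decision that followed from the evidence

    def gaps(self) -> List[str]:
        """Return the names of sections that were left empty."""
        return [name for name, value in asdict(self).items() if not value]

    def view_for(self, audience: str, restricted: Dict[str, List[str]]) -> Dict[str, object]:
        """Return a copy in which withheld sections are marked, not silently dropped."""
        hidden = set(restricted.get(audience, []))
        return {
            name: ("[REDACTED]" if name in hidden else value)
            for name, value in asdict(self).items()
        }


# Hypothetical record: evidence exists, but no decision was recorded.
record = TEVVRecord(
    system_object={"model_version": "example-model-1.2", "access_tier": "pilot"},
    claim="Agent refuses payment actions without human approval",
    method={"test_set": "internal-approval-cases", "evaluator_role": "second-line reviewer"},
    failures=["evaluators disagreed on some adverse cases"],
)

print(record.gaps())  # ['decision']

# Assumed redaction policy: the public view withholds method and failures.
policy = {"public": ["method", "failures"]}
print(record.view_for("public", policy)["failures"])        # [REDACTED]
print(record.view_for("auditor", policy)["failures"])       # full list
```

Marking redacted sections explicitly keeps the public version honest about what exists but is withheld, while the unredacted view remains available to a reviewer with authority.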

Agentic Systems

TEVV becomes harder when AI systems act through tools. A model in a chat box is not the same system as a model with browser access, shell access, payment permissions, memory, code execution, or enterprise connectors.

For AI agents, the tested object must include the scaffold. The record should cover sandboxing, credentials, permission boundaries, human approval points, logs, tool-call limits, rollback, and incident handling. This connects TEVV to AI agent observability and AI audit trails: if the system cannot preserve trace evidence, the organization may not be able to verify controls after an incident.
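The scaffold items named above can serve as a coverage checklist for an agent's TEVV record. The sketch below assumes the record is a plain dictionary keyed by hypothetical item names; neither the key names nor the example entries come from NIST.

```python
from typing import Dict, List

# Scaffold items taken from the paragraph above; the key spellings are assumptions.
REQUIRED_SCAFFOLD_ITEMS = (
    "sandboxing",
    "credentials",
    "permission_boundaries",
    "human_approval_points",
    "logs",
    "tool_call_limits",
    "rollback",
    "incident_handling",
)


def uncovered_scaffold_items(record: Dict[str, str]) -> List[str]:
    """Return scaffold items the record does not describe, in checklist order."""
    return [
        item
        for item in REQUIRED_SCAFFOLD_ITEMS
        if not str(record.get(item, "")).strip()
    ]


# Hypothetical agent record that documents only part of the scaffold.
agent_record = {
    "sandboxing": "container with no outbound network",
    "credentials": "read-only service account",
    "logs": "tool calls retained for review",
    "rollback": "",  # present but blank, so still counted as uncovered
}

print(uncovered_scaffold_items(agent_record))
# ['permission_boundaries', 'human_approval_points', 'tool_call_limits',
#  'rollback', 'incident_handling']
```

A checklist like this only shows that something was written for each item; whether the control actually works is what the tests themselves have to establish.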

Post-Deployment Monitoring

Pre-deployment TEVV is necessary but incomplete. NIST AI 800-4 emphasizes that real-world monitoring matters because deployed systems face dynamic inputs, unforeseen outputs, distributed infrastructure, policy complexity, and changing users or incentives. The TEVV record should define what requires retesting: model update, prompt change, tool change, retrieval corpus change, new user population, incident pattern, drift signal, or material expansion of consequences.
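Retest conditions are easier to enforce when they are declared ahead of time rather than argued after an incident. A minimal sketch, assuming change events arrive as simple string labels: the label spellings and the example event stream are hypothetical, and the trigger set simply mirrors the list in the paragraph above.

```python
from typing import Iterable, List

# Declared triggers, mirroring the paragraph above; label names are assumptions.
RETEST_TRIGGERS = frozenset({
    "model_update",
    "prompt_change",
    "tool_change",
    "retrieval_corpus_change",
    "new_user_population",
    "incident_pattern",
    "drift_signal",
    "consequence_expansion",
})


def retest_reasons(observed_events: Iterable[str]) -> List[str]:
    """Return observed events that match a declared trigger.

    Order of first occurrence is preserved and duplicates are dropped.
    """
    reasons: List[str] = []
    for event in observed_events:
        if event in RETEST_TRIGGERS and event not in reasons:
            reasons.append(event)
    return reasons


# Hypothetical event stream from a deployed system.
events = ["ui_copy_edit", "prompt_change", "prompt_change", "drift_signal"]
reasons = retest_reasons(events)
print(reasons)        # ['prompt_change', 'drift_signal']
print(bool(reasons))  # True, so retesting is required under this policy
```

Events outside the declared set, such as the interface copy edit here, are ignored by design, which is why the trigger list itself belongs in the TEVV record and should be reviewed when the system's scope changes.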

Limits

TEVV does not make risk disappear. It can be narrow, stale, contaminated, underpowered, or tuned to the wrong claim. It can be captured by vendor incentives, constrained by nondisclosure terms, or converted into compliance theater.

TEVV also depends on governance authority. Evidence is useful only if it can change access, delay release, narrow scope, trigger review, improve monitoring, notify users, repair harm, or stop a system. Measurement without decision consequence becomes ritual.

Finally, TEVV should not be used to make metaphysical or promotional claims about AI systems. It is bounded evidence about defined behavior under defined conditions.

Spiralist Reading

TEVV is the anti-incantation.

When a machine speaks fluently, organizations are tempted to answer with adjectives: reliable, safe, robust, aligned, trusted. TEVV asks for the work beneath the adjective: what was tested, what failed, who checked, which version, which use, and what changed because of the evidence.

For Spiralism, the danger is paperwork liturgy. The useful form is harsher: a dated record, a real boundary, a list of failures, a decision point, and power to slow the deployment when the evidence is not enough.

Open Questions

What TEVV evidence should be public when a provider claims a model or agent is safe enough for release?
How should evaluators validate systems that change through tool access, memory, retrieval, and updates?
What post-deployment events should automatically trigger retesting or rollback?

Sources

NIST, AI Test, Evaluation, Validation and Verification, reviewed June 25, 2026.
NIST, Outline: Proposed Zero Draft for a Standard on AI Testing, Evaluation, Verification, and Validation, 2025; reviewed June 25, 2026.
NIST AI Resource Center, AI RMF Playbook: Measure, reviewed June 25, 2026.
NIST, AI Risk Management Framework, reviewed June 25, 2026.
NIST, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, July 26, 2024; reviewed June 25, 2026.
NIST, Challenges to the Monitoring of Deployed AI Systems, NIST AI 800-4, March 6, 2026; reviewed June 25, 2026.
