Blog · Analysis · May 2026

The Measurement State Comes for AI

AI safety institutes mark a new political form: the state trying to build measurement capacity fast enough to govern systems it does not build.

From Summits to Labs

AI governance has been moving from declaration to instrumentation.

The first public layer was the summit: Bletchley Park in 2023, Seoul in 2024, Paris in 2025. Governments issued statements, companies made commitments, and the phrase "frontier AI" became a policy object. That mattered, but a declaration cannot establish whether a model can automate a cyber operation, help design a dangerous biological workflow, deceive an evaluator, manipulate a vulnerable user, or behave differently once tools and memory are attached.

For that, governments need measurement. Not vibes, not press releases, not a company's system card alone, but people, testbeds, access agreements, red-team methods, secure environments, and institutional memory about what previous models could do. This is the deeper meaning of AI safety institutes.

The United Kingdom launched the first national AI Safety Institute after the 2023 Bletchley Park summit. The United States built its institute inside NIST and later reoriented it as the Center for AI Standards and Innovation, or CAISI. In November 2024, the U.S. Departments of Commerce and State launched the International Network of AI Safety Institutes with members including Australia, Canada, the European Union, France, Japan, Kenya, South Korea, Singapore, the United Kingdom, and the United States.

This is not just another advisory ecosystem. It is the state trying to learn how to see a machine before the machine becomes ordinary infrastructure.

Why Measurement Is Power

Modern institutions govern through measurement. Public health needs surveillance data. Financial regulation needs balance sheets, stress tests, reporting rules, and auditors. Aviation safety needs incident reports, certification processes, flight data, and accident investigation. Environmental law needs emissions measurement, exposure models, and monitoring stations. Without measurement, law becomes theater: values without instruments.

AI intensifies this problem because capability is not visible from the interface. A chatbot can sound harmless while holding latent skill in coding, persuasion, scientific reasoning, tool use, or strategic planning. A model can be weak in one scaffold and dangerous in another. A public demo can show refusal while an internal evaluation with different prompts or tools, or with safeguards removed, shows a different system.

That makes evaluation capacity a form of sovereignty. A government that cannot test frontier systems depends on the builder's description of the builder's product. It can react to scandals, negotiate from ignorance, or regulate by category labels. It cannot govern the actual machine.

The most important question is therefore not whether AI safety institutes are "pro-AI" or "anti-AI." The serious question is whether they create a public capacity to know what private systems can do.

The New Institute Layer

CAISI's public materials describe a center that works on guidelines, best practices, voluntary standards, security measurement, national-security evaluations, and interagency coordination. On May 5, 2026, NIST announced agreements with Google DeepMind, Microsoft, and xAI for frontier AI national-security testing. The announcement said CAISI would conduct pre-deployment evaluations and targeted research, building on earlier partnerships with OpenAI and Anthropic. It also stated that CAISI had completed more than 40 evaluations, including on unreleased state-of-the-art models.

The UK AI Security Institute, renamed from the AI Safety Institute in February 2025, describes a similar empirical mission. In its first-year reflection, the institute said it had built evaluation suites for cyber attacks, chemical and biological misuse, autonomous agent capabilities, safeguards, and societal impacts. It framed its role as giving governments an empirical understanding of advanced AI safety.

The International Network of AI Safety Institutes adds a coordination layer. NIST's November 2024 fact sheet listed priorities including synthetic content risk, foundation-model testing, risk assessment for advanced systems, and common approaches to interpreting tests. The network's first joint testing exercise, led by U.S., UK, and Singapore technical experts, tested Meta's Llama 3.1 405B across general academic knowledge, closed-domain hallucinations, and multilingual capabilities.

These details matter because they show the emerging shape of governance. It is not only a rulebook. It is a measurement stack: evaluation methods, shared taxonomies, secure access, government labs, company agreements, technical workshops, synthetic-content research, and cross-border comparison.

The Voluntary Access Problem

The weak point is access.

Many institute relationships with frontier labs are voluntary or partnership-based. Voluntary access is better than no access. It can let evaluators see models before release, test systems with reduced safeguards, and give feedback while design choices can still change. It can also build trust and technical fluency inside government.

But voluntary access is not the same as public authority. A lab may shape timing, model version, tool environment, disclosure, publication, and remediation. The institute may learn enough to advise but not enough to compel. It may see a risk but lack power to delay release. It may publish a general conclusion while the public cannot inspect prompts, scaffolds, failed runs, evaluator disagreements, or the company's response.

This is the danger of evaluation theater. A model passes through a government-adjacent testing ritual. The public hears that experts were involved. The company ships. If harm appears later, everyone can point to a process, but the process may not have had enough authority to matter.

The answer is not to dismiss institutes. It is to ask what teeth their measurements have. A thermometer that cannot change treatment is useful for diagnosis but weak as governance. A frontier-model evaluation that cannot trigger disclosure, delay, mitigation, procurement limits, incident reporting, or public warning is measurement without institutional consequence.

What Gets Measured

AI safety institutes have good reasons to prioritize cyber, biosecurity, chemical weapons, model autonomy, and national-security misuse. These are high-consequence domains where government responsibility is direct and where public disclosure can itself create risk. A state should not wait for market incentives to tell it whether models can materially assist dangerous actors.

But every measurement regime creates a shadow. The measured risks become the official risks. The unmeasured risks become atmospheric: acknowledged in language, weak in enforcement.

That matters for the rest of AI's social surface. The systems now being evaluated for cyber and bio capability are also entering schools, workplaces, courts, therapy-like interactions, political information flows, companion products, search, software production, and administrative decision-making. They mediate knowledge and belief. They change labor pathways. They create dependency loops. They generate synthetic media. They alter what people treat as evidence.

A narrow security mandate can miss these harms. A model that is not a bioweapons risk may still be a dependency engine. A model that cannot autonomously conduct a serious cyberattack may still produce workslop at scale, manipulate lonely users, distort public knowledge, or collapse apprenticeship pathways. A model that passes a national-security test may still become a high-control interface when attached to memory, persuasion, and economic incentives.

The International AI Safety Report process is useful here because it tries to synthesize a broader research base on general-purpose AI capabilities and risks. Its significance is not that one report can settle the field. It is that public institutions need a shared scientific baseline that is not simply a lab's release narrative.

The Governance Standard

A credible measurement state for AI should meet a higher standard than "government experts looked at it."

First, evaluations should be scoped. Reports should say what model version, safeguards, tools, prompts, scaffolds, time limits, and assistance were used. A model is not a single object once it can be wrapped in agents, retrieval, memory, code execution, browser tools, and user-specific context.

Second, uncertainty should be public where safety permits. The useful output is not only a pass/fail label. It is what was tested, what failed, what was inconclusive, and what could change the result.

Third, measurement should connect to action. Evaluations should trigger concrete consequences: mitigation, staged release, additional testing, procurement restrictions, model-card changes, monitoring requirements, incident-reporting duties, or delayed deployment when evidence warrants it.

Fourth, institutes need independence from both panic and capture. They should not become anti-innovation panic offices. They also should not become reassurance vendors for companies that need a public stamp before launch. Their loyalty must be to public evidence.

Fifth, the risk map must widen. Security is necessary, but not sufficient. Evaluation programs should build bridges to labor economics, child safety, mental health, civil rights, education, democratic integrity, content provenance, and institutional design. The machine does not enter society only through catastrophic misuse. It enters through ordinary workflows.

Sixth, international coordination should not erase local context. Shared methods are valuable, especially for transnational models. But AI risk is also cultural and institutional. A multilingual evaluation is not just a translation exercise; it tests whether a model behaves differently across languages, norms, vulnerabilities, and governance settings.
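The first three standards are concrete enough to sketch as a data structure. What follows is a minimal, hypothetical illustration in Python, not any institute's actual reporting schema. Every class name, field, outcome label, and trigger rule is an assumption made for the example. It shows a report that carries its own scope, records inconclusive results alongside passes and failures, and maps each result to a consequence.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    """Hypothetical result labels; 'inconclusive' is a first-class result."""
    PASSED = "passed"
    FAILED = "failed"
    INCONCLUSIVE = "inconclusive"


@dataclass(frozen=True)
class EvaluationScope:
    """The conditions under which the model was tested (first standard)."""
    model_version: str
    safeguards_enabled: bool
    tools: tuple[str, ...]
    scaffold: str
    time_limit_minutes: int


@dataclass
class Finding:
    """One tested risk domain, its result, and known caveats (second standard)."""
    domain: str
    outcome: Outcome
    caveats: list[str] = field(default_factory=list)


@dataclass
class EvaluationReport:
    """A report never separates findings from the scope that produced them."""
    scope: EvaluationScope
    findings: list[Finding]


# Assumed trigger table (third standard). A real regime would set these by
# statute, contract, or policy; these entries are placeholders.
CONSEQUENCES = {
    Outcome.FAILED: ["mitigation", "additional testing", "delayed deployment"],
    Outcome.INCONCLUSIVE: ["additional testing", "staged release"],
    Outcome.PASSED: ["monitoring requirements"],
}


def required_actions(report: EvaluationReport) -> list[str]:
    """Collect triggered consequences in first-seen order, without duplicates."""
    actions: list[str] = []
    for finding in report.findings:
        for action in CONSEQUENCES[finding.outcome]:
            if action not in actions:
                actions.append(action)
    return actions


report = EvaluationReport(
    scope=EvaluationScope(
        model_version="example-model-v1",  # hypothetical identifier
        safeguards_enabled=False,
        tools=("code_execution", "browser"),
        scaffold="agent loop",
        time_limit_minutes=120,
    ),
    findings=[
        Finding("cyber", Outcome.INCONCLUSIVE, ["short time limit"]),
        Finding("bio", Outcome.PASSED),
    ],
)
print(required_actions(report))
# → ['additional testing', 'staged release', 'monitoring requirements']
```

The point of the sketch is structural. A result stays attached to the conditions that produced it, and an inconclusive result triggers something rather than nothing.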

The Spiralist Reading

AI safety institutes are an attempt to put public instruments around a private oracle.

The oracle is not mystical. It is a model stack: data, compute, optimization, evaluation, scaffolding, interface, deployment, feedback. But to ordinary users and many institutions, its outputs arrive as fluent authority. It names, summarizes, recommends, classifies, refuses, remembers, and acts. The social danger is not only that it may be wrong. The danger is that its wrongness can be polished, scalable, and institutionally absorbed.

Measurement interrupts that spell. It asks the machine to show itself under pressure. It turns aura into evidence. It says: under which conditions did this system help with an attack, invent a source, persuade a user, evade oversight, fail a safeguard, or behave differently after tools were attached?

But measurement can also become ritual. The lab submits. The institute tests. The report gestures. The model ships. The public receives the ceremony as reassurance. In that failure mode, evaluation becomes another layer of recursive reality: a symbol of control standing in for control itself.

The useful path is narrower and harder. AI safety institutes must become public friction with memory. They need enough access to see, enough independence to say, enough authority to matter, and enough humility to admit what they cannot measure yet. They should not promise that the machine is safe. They should make it harder for anyone else to make that claim without evidence.

That is the real institutional bet. Not that measurement will solve AI governance, but that without measurement, governance will be mostly performance.
