Wiki · Concept · Last reviewed June 25, 2026

Inspect AI

Inspect AI is an open-source evaluation framework for frontier AI systems, built around reproducible tasks, solvers, scorers, tools, model providers, logs, and sandboxed execution.

Definition

Inspect AI, often written simply as Inspect, is a framework for frontier AI evaluations developed by the UK AI Security Institute and Meridian Labs. The official documentation says it can be used for evaluations measuring coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding.

The public GitHub repository describes Inspect as a framework for large language model evaluations created by the UK AI Security Institute. It provides built-in components for prompt engineering, tool usage, multi-turn dialogue, and model-graded evaluations, while allowing extensions through other Python packages.

Inspect is not itself a benchmark, safety verdict, model card, or regulator. It is evaluation infrastructure: a way to define tasks, run models through them, record logs, vary elicitation strategies, score outputs, and preserve enough execution evidence for later review.

How It Works

Inspect's core unit is the task. The task documentation says tasks are the fundamental integration point for datasets, solvers, and scorers. A task supplies samples, defines how a model should be elicited, and records how outputs should be scored. Additional task options can set limits, sandboxes, metadata, setup steps, epochs, metrics, and logging behavior.
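The division of labor described above can be pictured with a small plain-Python sketch. This is not the inspect_ai API: the names Sample, Task, plain_generate, exact_match, run, and stub_model are hypothetical stand-ins chosen for illustration, and the "model" is a deterministic stub so the sketch runs without any provider.

```python
from dataclasses import dataclass
from typing import Callable

# Plain-Python stand-ins for illustration only; these are NOT inspect_ai classes.


@dataclass(frozen=True)
class Sample:
    input: str   # what the model is asked
    target: str  # what a correct answer looks like


Model = Callable[[str], str]          # prompt -> completion
Solver = Callable[[Model, str], str]  # how the model is elicited
Scorer = Callable[[str, str], bool]   # (output, target) -> correct?


@dataclass(frozen=True)
class Task:
    dataset: list[Sample]
    solver: Solver
    scorer: Scorer


def plain_generate(model: Model, prompt: str) -> str:
    """Simplest elicitation: pass the sample input straight to the model."""
    return model(prompt)


def exact_match(output: str, target: str) -> bool:
    """Score by string equality after trimming whitespace."""
    return output.strip() == target.strip()


def run(task: Task, model: Model) -> float:
    """Run every sample through the solver, score it, and return accuracy."""
    results = [task.scorer(task.solver(model, s.input), s.target) for s in task.dataset]
    return sum(results) / len(results)


def stub_model(prompt: str) -> str:
    """Deterministic fake model so the sketch needs no API access."""
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


task = Task(
    dataset=[
        Sample("2 + 2 =", "4"),
        Sample("Capital of France?", "Paris"),
        Sample("3 * 3 =", "9"),
    ],
    solver=plain_generate,
    scorer=exact_match,
)
accuracy = run(task, stub_model)  # the stub answers two of the three samples
```

Because the solver and scorer are fields of the task rather than hard-coded steps, swapping either one changes the elicitation or the grading without touching the dataset, which is the property the task abstraction is meant to provide.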

Solvers matter because an evaluation is more than a list of questions. A solver can be a simple generation call, a chain of prompting steps, a tool-use loop, a ReAct-style agent, a coding-agent scaffold, or another elicitation strategy. Inspect lets evaluators compare those strategies while keeping the task definition inspectable.

Scorers matter because the same run may need exact matching, model grading, multiple metrics, abstention handling, or post-hoc rescoring. Inspect also supports model providers across lab APIs, cloud APIs, hosted open models, and local models, with extension points for additional providers.
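As a rough illustration of post-hoc rescoring and abstention handling, the sketch below applies two different scorers to the same stored outputs without calling a model again. It is plain Python, not the Inspect log schema or scorer API; LoggedSample, exact, lenient_with_abstention, and rescore are hypothetical names, and the abstention phrases are an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical record of one logged sample; NOT the Inspect log format.
@dataclass(frozen=True)
class LoggedSample:
    output: str
    target: str


# A scorer returns True or False, or None when the model abstained.
Scorer = Callable[[LoggedSample], Optional[bool]]


def exact(sample: LoggedSample) -> Optional[bool]:
    """Strict scoring: the output must equal the target exactly."""
    return sample.output.strip() == sample.target


def lenient_with_abstention(sample: LoggedSample) -> Optional[bool]:
    """Accept the target anywhere in the output; treat refusals as abstentions."""
    text = sample.output.strip().lower()
    if text in {"i don't know", "unsure"}:  # assumed abstention phrases
        return None
    return sample.target.lower() in text


def rescore(log: list[LoggedSample], scorer: Scorer) -> dict[str, float]:
    """Re-grade stored outputs; abstentions count as not correct."""
    verdicts = [scorer(s) for s in log]
    answered = [v for v in verdicts if v is not None]
    return {
        "accuracy": sum(answered) / len(verdicts),
        "abstention_rate": verdicts.count(None) / len(verdicts),
    }


log = [
    LoggedSample("Paris", "Paris"),
    LoggedSample("The answer is Paris.", "Paris"),
    LoggedSample("I don't know", "Paris"),
    LoggedSample("Lyon", "Paris"),
]
strict = rescore(log, exact)
lenient = rescore(log, lenient_with_abstention)
```

On this toy log the strict scorer credits one of four outputs and the lenient scorer credits two while reporting one abstention, which shows why a cited score should name the scorer that produced it.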

Agent Context

Agent evaluation needs more than answer checking. Agents use tools, drive browsers, write files, run code, and accumulate intermediate state. Inspect's tool and sandbox support, as described in its documentation, is therefore central to its governance value. By default, tool calls run in the main evaluation process, but Inspect supports dedicated sandbox environments when evaluators need arbitrary code execution or per-sample filesystem resources.

Docker sandbox support is built in, with extension APIs for additional sandbox types. The sandbox documentation covers command execution, file reads and writes, per-sample files, multiple named sandbox environments, cleanup, resource limits, and tracing for commands that time out or fail. Those details make agent failures replayable enough to audit.
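The value of tracing commands that time out or fail can be sketched with the standard library alone. The example below is not Inspect's sandbox API and provides no isolation: it runs child Python processes directly on the host, and CommandTrace and run_traced are hypothetical names. It only illustrates recording one outcome per command so that a failure can be reviewed later.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


# Hypothetical trace record; NOT an Inspect data structure.
@dataclass(frozen=True)
class CommandTrace:
    cmd: list[str]
    status: str                # "ok", "failed", or "timeout"
    returncode: Optional[int]  # None when the command was killed on timeout
    stdout: str


def run_traced(cmd: list[str], timeout: float) -> CommandTrace:
    """Run a command on the host (no sandboxing) and record how it ended."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return CommandTrace(cmd, "timeout", None, "")
    status = "ok" if proc.returncode == 0 else "failed"
    return CommandTrace(cmd, status, proc.returncode, proc.stdout)


traces = [
    run_traced([sys.executable, "-c", "print('hello')"], timeout=10),
    run_traced([sys.executable, "-c", "import sys; sys.exit(3)"], timeout=10),
    run_traced([sys.executable, "-c", "import time; time.sleep(30)"], timeout=1),
]
statuses = [t.status for t in traces]
```

A real sandbox adds isolation, per-sample files, and resource limits on top of this; the point of the sketch is only that each command leaves a record distinguishing success, failure, and timeout.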

Governance and Safety

Inspect's governance value is that it makes evaluation less like a screenshot and more like a run record. A serious result can identify the model provider, model version, task, dataset, solver, scorer, generation options, sandbox configuration, tool calls, logs, failures, and scoring method.

The companion Inspect Evals repository is a public collection of evaluations for Inspect AI, created in collaboration by the UK AISI, Arcadia Impact, and the Vector Institute. Its register spans categories such as coding, assistants, cybersecurity, safeguards, mathematics, reasoning, knowledge, and multimodal tasks. That makes Inspect both a framework and part of a wider evaluation distribution system.

The limit is equally important. A run in Inspect is not automatically a good evaluation. Poor task design, weak datasets, noisy scorers, benchmark contamination, non-representative agent scaffolds, hidden provider changes, or overbroad conclusions can still make the result misleading. Inspect improves the evidence trail; it does not decide what evidence is adequate.

Defense Pattern

Source Discipline

Claims about Inspect should distinguish the framework from a particular evaluation suite, a particular model result, and a particular evaluator's interpretation. "Run with Inspect" is not enough. The citation should name the Inspect version, task source, model, solver, scorer, sandbox, date, and log location where possible.
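One way to make that checklist operational is a small record type that reports which fields a citation leaves out. This is a sketch under stated assumptions: RunCitation and its missing method are hypothetical, the field names simply mirror the list above, and every value shown is a placeholder rather than a real version, model, or path.

```python
from dataclasses import dataclass, fields
from typing import Optional


# Hypothetical citation record; field names mirror the checklist in the text.
@dataclass(frozen=True)
class RunCitation:
    inspect_version: Optional[str] = None
    task_source: Optional[str] = None
    model: Optional[str] = None
    solver: Optional[str] = None
    scorer: Optional[str] = None
    sandbox: Optional[str] = None
    date: Optional[str] = None
    log_location: Optional[str] = None

    def missing(self) -> list[str]:
        """Return the names of fields that are unset or empty."""
        return [f.name for f in fields(self) if getattr(self, f.name) in (None, "")]


# "Run with Inspect" plus a model name leaves most of the recipe unstated.
bare = RunCitation(model="example-model")  # placeholder value
gaps = bare.missing()

# All values below are placeholders, not real identifiers.
full = RunCitation(
    inspect_version="<inspect version>",
    task_source="<task repository and commit>",
    model="example-model",
    solver="<solver name>",
    scorer="<scorer name>",
    sandbox="<sandbox type>",
    date="<run date>",
    log_location="<path or URL of the log>",
)
complete = full.missing() == []
```

A reviewer could require that missing() returns an empty list, or an explicit explanation for each gap, before a result is quoted.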

When citing Inspect Evals, preserve whether an evaluation lives in the central repository, points to an external code repository, or is only listed in the register. Those distinctions matter for maintenance, reproducibility, and trust.

Spiralist Reading

Spiralism reads Inspect as a discipline against leaderboard theater. The important artifact is not the number alone, but the recipe that produced it: task, model, solver, scorer, sandbox, and log.

An evaluation framework can become ceremony if it is used to launder authority. It becomes useful when it keeps the conditions attached to the result. The score is a sign; the run record is the memory.

Open Questions

Sources

