Wiki · Concept · Last reviewed June 25, 2026

Inspect AI

Inspect AI is an open-source evaluation framework for frontier AI systems, built around reproducible tasks, solvers, scorers, tools, model providers, logs, and sandboxed execution.

Category: Concept · Updated: June 25, 2026 · Tags: AI evaluation, UK AI Security Institute, Inspect AI, evals, sandboxes, frontier models

Definition

Inspect AI, often written simply as Inspect, is a framework for frontier AI evaluations developed by the UK AI Security Institute and Meridian Labs. The official documentation says it can be used for evaluations measuring coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding.

The public GitHub repository describes Inspect as a framework for large language model evaluations created by the UK AI Security Institute. It provides built-in components for prompt engineering, tool usage, multi-turn dialogue, and model-graded evaluations, while allowing extensions through other Python packages.

Inspect is not itself a benchmark, safety verdict, model card, or regulator. It is evaluation infrastructure: a way to define tasks, run models through them, record logs, vary elicitation strategies, score outputs, and preserve enough execution evidence for later review.

How It Works

Inspect's core unit is the task. The task documentation says tasks are the fundamental integration point for datasets, solvers, and scorers. A task supplies samples, defines how a model should be elicited, and records how outputs should be scored. Additional task options can set limits, sandboxes, metadata, setup steps, epochs, metrics, and logging behavior.
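The relationship between samples, solver, and scorer can be sketched in plain Python. This is a toy model for illustration only, not Inspect's API: ToySample, ToyTask, run, stub_model, and exact_match are hypothetical names, and the "model" is a stub that adds two integers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToySample:
    """One evaluation item: what the model sees and what counts as correct."""
    input: str
    target: str


# Hypothetical signatures: a solver elicits an output, a scorer judges it.
Solver = Callable[[str], str]
Scorer = Callable[[str, str], bool]


@dataclass(frozen=True)
class ToyTask:
    """Binds a dataset, an elicitation strategy, and a scoring rule together."""
    name: str
    samples: tuple
    solver: Solver
    scorer: Scorer


def run(task: ToyTask) -> dict:
    """Run every sample through the solver, score it, and keep a per-sample log."""
    results = []
    for sample in task.samples:
        output = task.solver(sample.input)
        results.append({
            "input": sample.input,
            "output": output,
            "target": sample.target,
            "correct": task.scorer(output, sample.target),
        })
    accuracy = sum(r["correct"] for r in results) / len(results)
    return {"task": task.name, "accuracy": accuracy, "results": results}


def stub_model(prompt: str) -> str:
    """Stand-in for a model call: only understands prompts shaped like 'a + b'."""
    a, _, b = prompt.split()
    return str(int(a) + int(b))


def exact_match(output: str, target: str) -> bool:
    return output.strip() == target.strip()


toy = ToyTask(
    name="toy_addition",
    samples=(ToySample("2 + 2", "4"), ToySample("3 + 5", "8")),
    solver=stub_model,
    scorer=exact_match,
)
report = run(toy)
```

The point of the sketch is that the task object, not the score, is the unit that gets published: anyone holding it can see which samples were used, how the model was elicited, and how outputs were judged.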

Solvers matter because an evaluation is more than a list of questions. A solver can be a simple generation call, a chain of prompting steps, a tool-use loop, a ReAct-style agent, a coding-agent scaffold, or another elicitation strategy. Inspect lets evaluators compare those strategies while keeping the task definition inspectable.
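A small sketch shows why elicitation strategy belongs in the record. It is a toy under stated assumptions, not Inspect's solver API: chain, add_instruction, and stub_generate are hypothetical names, and the stub "model" is deliberately built to answer tersely only when the prompt asks for the number only.

```python
from typing import Callable

# Hypothetical state passed between steps: the raw question, the prompt as sent,
# and the output once generated.
State = dict
Solver = Callable[[State], State]


def chain(*steps: Solver) -> Solver:
    """Compose steps into one solver that applies them in order."""
    def run(state: State) -> State:
        for step in steps:
            state = step(state)
        return state
    return run


def add_instruction(text: str) -> Solver:
    """Prompting step: prepend an instruction to the prompt."""
    def step(state: State) -> State:
        return {**state, "prompt": f"{text}\n{state['prompt']}"}
    return step


def stub_generate(state: State) -> State:
    """Stand-in for a model call. Verbose unless asked for the number only."""
    a, _, b = state["question"].split()
    total = str(int(a) + int(b))
    terse = "number only" in state["prompt"]
    return {**state, "output": total if terse else f"The answer is {total}."}


# The samples and the exact-match scoring rule are held fixed across strategies.
SAMPLES = [("2 + 2", "4"), ("3 + 5", "8")]


def accuracy(solver: Solver) -> float:
    hits = 0
    for question, target in SAMPLES:
        final = solver({"question": question, "prompt": question, "output": None})
        hits += final["output"] == target
    return hits / len(SAMPLES)


plain = chain(stub_generate)
instructed = chain(add_instruction("Answer with the number only."), stub_generate)
scores = {"plain": accuracy(plain), "instructed": accuracy(instructed)}
```

With the same samples and the same scorer, the two strategies score differently in this toy, which is why a reported number is hard to interpret without the solver that produced it.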

Scorers matter because the same run may need exact matching, model grading, multiple metrics, abstention handling, or post-hoc rescoring. Inspect also supports model providers across lab APIs, cloud APIs, hosted open models, and local models, with extension points for additional providers.
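Post-hoc rescoring can be illustrated with stored outputs and several scoring rules applied after the fact. The sketch below is not Inspect's scorer API: the logged records are invented for illustration, and exact, includes_target, with_abstention, and rescore are hypothetical names.

```python
import re
from collections import Counter

# Invented outputs standing in for a saved run; no model is called again.
logged = [
    {"output": "The answer is 4.", "target": "4"},
    {"output": "8", "target": "8"},
    {"output": "I am not sure.", "target": "15"},
]


def exact(output: str, target: str) -> str:
    """Strict rule: the whole output must equal the target."""
    return "correct" if output.strip() == target else "incorrect"


def includes_target(output: str, target: str) -> str:
    """Lenient rule: the target must appear as a whole token in the output."""
    tokens = re.findall(r"\w+", output)
    return "correct" if target in tokens else "incorrect"


def with_abstention(output: str, target: str) -> str:
    """Lenient rule that reports refusals to answer as a separate outcome."""
    if "not sure" in output.lower():
        return "abstained"
    return includes_target(output, target)


def rescore(records, scorers):
    """Apply each scorer to the same stored outputs and tally the verdicts."""
    return {
        name: Counter(fn(r["output"], r["target"]) for r in records)
        for name, fn in scorers.items()
    }


summary = rescore(
    logged,
    {"exact": exact, "includes": includes_target, "abstention": with_abstention},
)
```

Because the outputs are preserved, the choice of scorer can be challenged and changed later without repeating the model calls.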

Agent Context

Agent evaluation needs more than answer checking. Agents use tools, drive browsers, write files, run code, and accumulate intermediate state. Inspect's tool and sandbox documentation is therefore central to its governance value. By default, tool calls run in the main evaluation process, but Inspect supports dedicated sandbox environments when evaluators need arbitrary code execution or per-sample filesystem resources.

Docker sandbox support is built in, with extension APIs for additional sandbox types. The sandbox documentation covers command execution, file reads and writes, per-sample files, multiple named sandbox environments, cleanup, resource limits, and tracing for commands that time out or fail. Those details make agent failures replayable enough to audit.
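The idea of tracing commands that succeed, fail, or time out can be sketched with the standard library. This is only an illustration of the record-keeping: a subprocess in a temporary directory provides no real isolation, it is not how Inspect's sandboxes are implemented, and run_traced and the trace entry fields are hypothetical.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def run_traced(cmd, cwd, timeout):
    """Run one command and return a trace entry, whatever the outcome."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd, cwd=cwd, capture_output=True, text=True, timeout=timeout
        )
        entry = {
            "cmd": cmd,
            "status": "ok" if proc.returncode == 0 else "failed",
            "returncode": proc.returncode,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }
    except subprocess.TimeoutExpired:
        # The child is killed on timeout; record that instead of losing the event.
        entry = {"cmd": cmd, "status": "timeout", "returncode": None,
                 "stdout": "", "stderr": ""}
    entry["seconds"] = round(time.monotonic() - start, 3)
    return entry


trace = []
# A per-sample working directory that is cleaned up when the block exits.
with tempfile.TemporaryDirectory() as workdir:
    Path(workdir, "input.txt").write_text("hello")  # per-sample file
    trace.append(run_traced(
        [sys.executable, "-c", "print(open('input.txt').read())"], workdir, 10))
    trace.append(run_traced(
        [sys.executable, "-c", "import sys; sys.exit(3)"], workdir, 10))
    trace.append(run_traced(
        [sys.executable, "-c", "import time; time.sleep(5)"], workdir, 0.5))

statuses = [entry["status"] for entry in trace]
```

What matters for audit is that the failed and timed-out commands leave entries of the same shape as the successful one, so a reviewer can see what the agent attempted and not only what worked.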

Governance and Safety

Inspect's governance value is that it makes evaluation less like a screenshot and more like a run record. A serious result can identify the model provider, model version, task, dataset, solver, scorer, generation options, sandbox configuration, tool calls, logs, failures, and scoring method.
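One way to picture a run record is as a small structured object with a stable fingerprint. The sketch below is an assumption-laden illustration, not Inspect's log format: RunRecord, its field names, fingerprint, and all values are hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical summary of the conditions attached to one result."""
    provider: str
    model_version: str
    task: str
    dataset: str
    solver: str
    scorer: str
    generation_options: dict = field(default_factory=dict)
    sandbox: str = "none"
    log_path: str = ""


def fingerprint(record: RunRecord) -> str:
    """Hash a canonical JSON form so identical conditions give identical digests."""
    canonical = json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values only; none of these name a real model or file.
base = RunRecord(
    provider="example-provider",
    model_version="example-model-v1",
    task="toy_addition",
    dataset="toy_addition@v1",
    solver="instructed generate",
    scorer="exact match",
    generation_options={"temperature": 0.0},
    sandbox="docker",
    log_path="logs/run-001.json",
)
same = RunRecord(**asdict(base))
warmer = RunRecord(**{**asdict(base), "generation_options": {"temperature": 0.7}})

digests = {
    "base": fingerprint(base),
    "same": fingerprint(same),
    "warmer": fingerprint(warmer),
}
```

Two results with different digests were not produced under the same recorded conditions, which is a cheap first check before comparing their scores.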

The companion Inspect Evals repository is a public collection of evaluations for Inspect AI, created collaboratively by the UK AI Security Institute, Arcadia Impact, and the Vector Institute. Its register spans categories such as coding, assistants, cybersecurity, safeguards, mathematics, reasoning, knowledge, and multimodal tasks. That makes Inspect both a framework and part of a wider evaluation distribution system.

The limit is equally important. A run in Inspect is not automatically a good evaluation. Poor task design, weak datasets, noisy scorers, benchmark contamination, non-representative agent scaffolds, hidden provider changes, or overbroad conclusions can still make the result misleading. Inspect improves the evidence trail; it does not decide what evidence is adequate.

Defense Pattern

Name the task. Publish or archive the task definition, dataset version, sample selection, and task options.
Separate solver from scorer. Record how the model was elicited separately from how the result was judged.
Pin model access. Identify model provider, endpoint, generation parameters, and any model roles used for grading.
Use sandboxes for action. Tool-using, coding, or cyber tasks should preserve the environment where side effects occur.
Preserve logs. Treat logs, tool calls, traces, and rescoring records as part of the evaluation evidence.
Report scope. State what the evaluation does not test before turning scores into governance claims.
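The six items above can be turned into a mechanical completeness check. The sketch below is one possible reading, not an official schema: the evidence keys in CHECKLIST, the gaps function, and the example values are all hypothetical, and it simplifies by requiring a sandbox entry for every run even though the pattern only asks for one when the task takes actions.

```python
# Hypothetical mapping from each pattern item to the evidence keys it implies.
CHECKLIST = {
    "Name the task": [
        "task_definition", "dataset_version", "sample_selection", "task_options",
    ],
    "Separate solver from scorer": ["solver", "scorer"],
    "Pin model access": ["provider", "endpoint", "generation_parameters"],
    "Use sandboxes for action": ["sandbox"],
    "Preserve logs": ["log_location"],
    "Report scope": ["out_of_scope"],
}


def gaps(evidence: dict) -> dict:
    """Return, per pattern item, the evidence keys that are absent or empty."""
    missing = {}
    for item, keys in CHECKLIST.items():
        absent = [key for key in keys if not evidence.get(key)]
        if absent:
            missing[item] = absent
    return missing


# Placeholder evidence for a run that omits endpoint, sandbox, and scope.
evidence = {
    "task_definition": "tasks/toy_addition.py",
    "dataset_version": "v1",
    "sample_selection": "all samples",
    "task_options": {"epochs": 1},
    "solver": "instructed generate",
    "scorer": "exact match",
    "provider": "example-provider",
    "generation_parameters": {"temperature": 0.0},
    "log_location": "logs/run-001.json",
}
missing = gaps(evidence)
```

A check like this only confirms that something was recorded for each item. Whether the recorded task, scorer, or scope statement is adequate remains a judgment for the reviewer.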

Source Discipline

Claims about Inspect should distinguish the framework from a particular evaluation suite, a particular model result, and a particular evaluator's interpretation. "Run with Inspect" is not enough. The citation should name the Inspect version, task source, model, solver, scorer, sandbox, date, and log location where possible.
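A citation that follows this advice can be generated from whatever was recorded, with gaps made visible instead of silently dropped. This is a minimal sketch: the FIELDS tuple mirrors the list in the paragraph above, while cite, the output layout, and the placeholder values are assumptions.

```python
# Field names follow the paragraph above; the order and layout are assumptions.
FIELDS = (
    "inspect_version", "task_source", "model", "solver",
    "scorer", "sandbox", "date", "log_location",
)


def cite(run: dict) -> str:
    """Render one citation line, marking any field that was not recorded."""
    parts = [f"{name}={run.get(name) or 'not recorded'}" for name in FIELDS]
    return "Run with Inspect: " + "; ".join(parts)


# Placeholder values; "X.Y.Z" stands for whichever version was actually used.
run = {
    "inspect_version": "X.Y.Z",
    "task_source": "example/toy_addition",
    "model": "example-provider/example-model",
    "solver": "generate",
    "scorer": "exact match",
    "date": "2026-06-25",
}
citation = cite(run)
```

Writing "not recorded" explicitly keeps the "where possible" qualifier honest: a reader can tell a missing sandbox entry from a run that used none.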

When citing Inspect Evals, preserve whether an evaluation lives in the central repository, points to an external code repository, or is only listed in the register. Those distinctions matter for maintenance, reproducibility, and trust.

Spiralist Reading

Spiralism reads Inspect as a discipline against leaderboard theater. The important artifact is not the number alone, but the recipe that produced it: task, model, solver, scorer, sandbox, and log.

An evaluation framework can become ceremony if it is used to launder authority. It becomes useful when it keeps the conditions attached to the result. The score is a sign; the run record is the memory.

Open Questions

Which Inspect logs should be public for high-stakes model claims, and which should remain protected because they contain sensitive prompts, artifacts, or exploit details?
How should Inspect tasks be versioned when a benchmark changes faster than governance review cycles?
When should model-graded scoring be accepted, challenged, or replaced with human adjudication?
How should sandbox evidence be summarized for auditors who cannot safely rerun the environment?

Sources

UK AI Security Institute and Meridian Labs, Inspect AI documentation.
UKGovernmentBEIS, GitHub, inspect_ai: Inspect, a framework for large language model evaluations.
Inspect AI documentation, Tasks.
Inspect AI documentation, Using Models.
Inspect AI documentation, Sandboxing.
UKGovernmentBEIS, GitHub, inspect_evals: collection of evaluations for Inspect AI.
