Inference and Test-Time Compute
Inference compute is the computation used when an AI system runs for users. Test-time compute is the extra runtime budget spent to improve a specific answer or action through reasoning, search, sampling, verification, tool use, or iteration.
Definition
Inference compute is the compute used after a model has been trained: when it answers a prompt, serves an API call, generates an image, writes code, runs inside an agent, or handles a user workflow. It includes accelerator time, memory bandwidth, KV cache, routing, batching, tool calls, retrieval, and other runtime infrastructure.
Test-time compute is a narrower idea. It refers to additional computation spent during inference to improve performance on a particular task. A system may generate longer internal reasoning, sample several candidate answers, run a search process, call tools, verify intermediate results, ask another model to critique output, or iterate through an agent loop before giving a final answer.
The term is easy to misuse. Test-time compute is not the same thing as "chain of thought," and more compute is not the same thing as truth. It is an operating budget: tokens, attempts, wall-clock time, tool calls, retrieval queries, verifier passes, code executions, or external actions allocated to one task.
Snapshot
- Inference compute: the recurring compute cost of running a model or AI service after training.
- Test-time compute: extra runtime work used to search, deliberate, verify, call tools, or repeat attempts before a final output.
- Key shift: capability can be improved by spending more at runtime, not only by spending more on pretraining.
- Core controls: reasoning effort, token budget, sampling count, verifier use, tool permissions, timeouts, spend caps, and human approval gates.
- Governance problem: a model's risk profile changes when it is given more attempts, more tools, more context, or more permission to act.
Training vs. Inference
Training compute is the upfront cost of creating a model. Inference compute is the ongoing cost of using it. The distinction matters because the AI economy is moving from one-time model creation toward persistent operation at scale.
A training run may be extremely expensive, but a widely used model can spend enormous compute after release. Consumer assistants, coding agents, search systems, enterprise copilots, autonomous workflows, and synthetic media pipelines all create recurring demand for inference.
Training-compute thresholds are therefore incomplete governance tools. They can identify some frontier training runs, but they do not measure what happens when a deployed model is given more runtime reasoning, repeated attempts, better retrieval, stronger tools, or agent scaffolding.
Current Context
As of June 19, 2026, test-time compute is a mainstream scaling axis. OpenAI's September 2024 o1 release brought the idea to wide attention by describing models that improve with more reinforcement learning during training and more time spent thinking at test time. Snell et al. showed that test-time methods such as verifier-guided search and adaptive computation can, on some prompts, be more effective than scaling model parameters, with the benefit depending on prompt difficulty. DeepSeek-R1 and later reasoning-model work brought long runtime reasoning and distillation into a broader open ecosystem, so they are no longer only a closed-product feature.
The product surface has also changed. OpenAI's o3 and o4-mini system card described reasoning models with full tool capabilities, including web browsing, Python, image and file analysis, image generation, file search, and memory. Anthropic exposed extended thinking controls and thinking budgets for Claude. Google described Gemini 2.5 as a "thinking" model. These examples differ technically, but they all make runtime compute a configurable part of the model system.
Governance discussions now distinguish inference-at-deployment from inference-during-training. Toby Ord argues that rapid inference scaling can reshape AI governance by weakening regimes that focus mainly on pretraining compute, while inference-during-training can feed back into training through amplification and distillation. The practical lesson is that a model, scaffold, and runtime budget must be assessed together.
METR's preliminary o1-preview evaluation is an early example of the evaluation challenge: capability estimates depended on model access, agent scaffolding, and iteration, and the short evaluation window made it difficult to upper-bound autonomous performance.
Forms of Test-Time Compute
Longer reasoning traces. The model spends more tokens working through the task before answering. Those tokens may be hidden, visible, summarized, or redacted.
Self-consistency and sampling. The system generates multiple candidate answers and selects among them by voting, scoring, verifier checks, or other reranking methods.
Search. The model explores possible solution paths, code patches, plans, proofs, queries, or actions before committing.
Verification. A model, tool, test suite, calculator, compiler, theorem prover, search engine, or second model checks candidate answers.
Tool-mediated reasoning. Runtime work can include web search, retrieval, code execution, file analysis, calculators, databases, image inspection, or external API calls.
Agent loops. The system repeatedly plans, acts, observes tool output, and revises its plan. In this setting, runtime compute includes not only model tokens but also external tools, API calls, browser sessions, queue time, and human approval waits.
Adaptive routing. A product may send easy requests to a fast model and harder requests to a deeper reasoning route, or escalate only after a cheaper attempt fails.
Distillation loops. Expensive runtime reasoning can produce traces or answers used to train smaller models. This moves some inference-time work back into the training pipeline.
Latent reasoning. Some research explores architectures that spend more computation internally without simply producing longer visible text.
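As an illustration of the sampling form listed above, self-consistency can be sketched as drawing several candidate answers and keeping the most common one. This is a minimal sketch, not any vendor's implementation: `toy_sampler` is a hypothetical stand-in for a model call, and the 60% accuracy figure is an arbitrary assumption chosen for the demonstration.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(candidates: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the share of votes it received."""
    if not candidates:
        raise ValueError("need at least one candidate")
    answer, votes = Counter(candidates).most_common(1)[0]
    return answer, votes / len(candidates)


def self_consistency(sample: Callable[[], str], n: int) -> Tuple[str, float]:
    """Draw n candidate answers from a sampler and select by majority vote."""
    return majority_vote([sample() for _ in range(n)])


def toy_sampler(rng: random.Random) -> Callable[[], str]:
    """Hypothetical model stand-in: answers "42" about 60% of the time."""
    def sample() -> str:
        return "42" if rng.random() < 0.6 else rng.choice(["41", "43"])
    return sample


rng = random.Random(0)
answer, agreement = self_consistency(toy_sampler(rng), n=25)
print(answer, round(agreement, 2))
```

The agreement share is a cheap signal of how contested the answer was, but it measures consistency, not correctness: a sampler that is wrong in the same way most of the time will win the vote with high agreement.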
Economics and Infrastructure
Inference is a production economics problem. Its cost depends on model size, context length, output length, batch scheduling, memory bandwidth, accelerator utilization, tool costs, and the service-level target for latency. The same model can be cheap for a short chat turn and expensive inside a long-running agent.
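The gap between a short chat turn and a long-running agent can be made concrete with a rough cost sketch. Everything numeric here is an assumption for illustration: the prices are hypothetical and not any vendor's rates, the agent is assumed to re-send its full growing context on every step with no prompt caching, and `agent_cost` and its parameters are invented names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in USD per million tokens (illustrative only).
    input_per_mtok: float
    output_per_mtok: float


def call_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call under simple per-token pricing."""
    return (input_tokens * p.input_per_mtok
            + output_tokens * p.output_per_mtok) / 1_000_000


def agent_cost(p: Pricing, base_context: int, steps: int,
               output_per_step: int, observation_per_step: int,
               tool_cost_per_step: float = 0.0) -> float:
    """Cost of an agent loop that re-sends its growing context each step."""
    total = 0.0
    context = base_context
    for _ in range(steps):
        total += call_cost(p, context, output_per_step) + tool_cost_per_step
        # The model's output and the tool observation both join the context.
        context += output_per_step + observation_per_step
    return total


pricing = Pricing(input_per_mtok=2.0, output_per_mtok=8.0)
chat = call_cost(pricing, input_tokens=500, output_tokens=300)
agent = agent_cost(pricing, base_context=2_000, steps=40,
                   output_per_step=500, observation_per_step=1_500,
                   tool_cost_per_step=0.001)
print(f"chat turn: ${chat:.4f}  agent run: ${agent:.4f}")
```

Under these made-up numbers the 40-step agent run costs roughly a thousand times the single chat turn, and most of that is input tokens from re-reading accumulated context. Caching, batching, and context trimming would change the figure, which is why the serving details matter.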
Epoch AI reports that LLM inference prices at fixed benchmark-performance levels have fallen rapidly but unevenly across tasks. Falling unit prices do not eliminate inference pressure. They can increase usage, make long reasoning affordable, and encourage products to wrap more institutional work in AI interfaces.
Serving architecture shapes what test-time compute is feasible. KV cache, continuous batching, speculative decoding, quantization, model routing, and hardware memory bandwidth determine how much runtime deliberation can be sold at acceptable latency and cost.
For organizations, this creates a procurement issue. "Reasoning mode" is not one thing. The buyer needs to know the budget, routing policy, tool access, concurrency assumptions, latency guarantees, price model, and logging available for audit.
Why It Matters
Test-time compute changes the economics of intelligence. Instead of asking only which model is best, users and institutions must ask how much compute they are willing to spend on a given decision.
This creates tiered cognition. A cheap answer may be fast, shallow, and plausible. An expensive answer may search, deliberate, verify, and use tools. The difference matters in medicine, law, software, science, finance, infrastructure, education, and governance.
It also changes safety. More runtime reasoning can improve performance and reduce some errors, but it can also make systems better at manipulation, deception, cyber operations, strategic planning, or bypassing guardrails. OpenAI's o1 and o3/o4-mini system cards and METR's evaluations treated stronger reasoning and tool use as capabilities requiring specific risk assessment.
Evaluation
Inference and test-time compute should be evaluated across runtime budgets. A model's behavior with one fast answer can differ from its behavior with high reasoning effort, multiple samples, a verifier, browser access, code execution, or an agent scaffold.
Useful reports should state the model version, reasoning mode, maximum tokens, sampling count, verifier or judge use, tool access, timeout rules, context length, retrieval corpus, and whether failed attempts are counted. Without those details, benchmark scores and safety claims are hard to compare.
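One way to make those reporting fields operational is a structured record that travels with every score. The sketch below is an assumption about how such a record might look, not an established schema; the class name, field names, and example values are hypothetical.

```python
from dataclasses import asdict, dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EvalRunReport:
    """Runtime configuration that should accompany a benchmark score."""
    model_version: str
    reasoning_mode: str
    max_tokens: int
    sampling_count: int
    verifier: Optional[str]
    tools: Tuple[str, ...]
    timeout_seconds: float
    context_length: int
    retrieval_corpus: Optional[str]
    failed_attempts_counted: bool


def config_differences(a: EvalRunReport, b: EvalRunReport) -> List[str]:
    """List the fields on which two runs differ, in declaration order."""
    da, db = asdict(a), asdict(b)
    return [name for name in da if da[name] != db[name]]


single_pass = EvalRunReport(
    model_version="example-model-v1",  # hypothetical identifier
    reasoning_mode="low",
    max_tokens=4096,
    sampling_count=1,
    verifier=None,
    tools=(),
    timeout_seconds=60.0,
    context_length=32_000,
    retrieval_corpus=None,
    failed_attempts_counted=True,
)
best_of_16 = replace(single_pass, sampling_count=16,
                     verifier="unit-tests", tools=("python",))
print(config_differences(single_pass, best_of_16))
```

If the list of differences is non-empty, the two scores were produced under different runtime budgets and should not be presented side by side without saying so.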
Evaluation should include capability and control. Capability tests ask what the system can do when given more runtime work. Control tests ask whether budgets, permissions, refusal rules, monitoring, and human approval gates still hold when the system gets more chances to plan and act.
Risk Patterns
Compute-tier inequality. People and institutions with more money can buy deeper reasoning, more attempts, longer context, stronger verification, and better agents.
Hidden deliberation. Users may see only the final answer, not the failed attempts, searched paths, discarded plans, or private reasoning that produced it.
Overconfidence after effort. A slow answer can feel more trustworthy simply because the system appears to have thought harder. Effort spent is not evidence of correctness.
Agentic risk amplification. Test-time compute inside an agent can become repeated action in the world: more browsing, more code execution, more messages, more purchases, more tool calls, and more chances for error.
Runaway cost. Systems that automatically allocate more compute to difficult problems can surprise users with latency, cloud cost, energy use, or cascading tool usage.
Threshold evasion. A policy focused only on training compute can miss systems that gain capability through inference-time search, tools, scaffolding, or distillation.
Benchmark overfitting. If test-time strategies are tuned to visible benchmarks, they may improve contest performance without improving real-world judgment.
Governance Requirements
Runtime compute should be visible enough to govern. Serious systems should disclose when a high-compute reasoning mode is being used, what tools were called, how many attempts were made, what external resources were consulted, and whether a human approved consequential actions.
High-stakes deployments need budget controls: maximum tokens, maximum tool calls, maximum spend, time limits, approval gates, and escalation rules. A system that can keep thinking can also keep spending and acting.
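The budget controls named above can be sketched as a small guard object that an agent loop must consult before each unit of work. This is a minimal sketch under stated assumptions: `RuntimeBudget`, `BudgetExceeded`, and the per-step figures are hypothetical, and a production system would also need per-tenant accounting, persistence, and escalation handling.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a request would push usage past a configured limit."""


@dataclass
class RuntimeBudget:
    max_tokens: int
    max_tool_calls: int
    max_spend_usd: float
    max_seconds: float
    tokens: int = 0
    tool_calls: int = 0
    spend_usd: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, tool_calls: int = 0,
               spend_usd: float = 0.0) -> None:
        """Reserve usage before the work runs; refuse rather than overshoot."""
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("time limit")
        if self.tokens + tokens > self.max_tokens:
            raise BudgetExceeded("token limit")
        if self.tool_calls + tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call limit")
        if self.spend_usd + spend_usd > self.max_spend_usd:
            raise BudgetExceeded("spend limit")
        self.tokens += tokens
        self.tool_calls += tool_calls
        self.spend_usd += spend_usd


budget = RuntimeBudget(max_tokens=10_000, max_tool_calls=3,
                       max_spend_usd=1.0, max_seconds=60.0)
steps = 0
try:
    while True:
        # Each simulated agent step uses some tokens, one tool call, some money.
        budget.charge(tokens=1_500, tool_calls=1, spend_usd=0.05)
        steps += 1
except BudgetExceeded as exc:
    print(f"stopped after {steps} steps: {exc}")
```

Charging before the work runs, rather than after, means the limit is a hard ceiling instead of a figure that is noticed only once it has been passed.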
Evaluations should test multiple runtime budgets. A model's risk profile at cheap inference may differ from its risk profile with long reasoning, agent scaffolding, tool access, or repeated attempts.
Agent deployments need least-privilege tools, identity and authorization controls, immutable logs, approval gates for irreversible actions, and rollback paths. NIST's AI Agent Standards Initiative frames autonomous action, interoperability, identity, and security as standards problems; test-time compute is one way those agents become more capable at runtime.
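Least-privilege tools and approval gates can be illustrated with a thin wrapper that sits between the agent and its tools. The sketch assumes a simple in-process design: `ToolGate`, `Tool`, and the example tool names are hypothetical, and the in-memory list only stands in for an immutable log, which in practice requires external append-only storage.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[..., Any]
    irreversible: bool = False  # irreversible tools require approval


class ToolGate:
    """Allowlist tools, gate irreversible ones on approval, log every call."""

    def __init__(self, allowed: Dict[str, Tool],
                 approve: Callable[[str, dict], bool]) -> None:
        self._allowed = allowed
        self._approve = approve
        self.log: List[dict] = []  # stand-in for an external immutable log

    def call(self, name: str, **kwargs: Any) -> Any:
        tool = self._allowed.get(name)
        if tool is None:
            self._record(name, kwargs, "denied: not allowlisted")
            raise PermissionError(f"tool {name!r} is not allowlisted")
        if tool.irreversible and not self._approve(name, kwargs):
            self._record(name, kwargs, "denied: approval refused")
            raise PermissionError(f"tool {name!r} was not approved")
        result = tool.run(**kwargs)
        self._record(name, kwargs, "ok")
        return result

    def _record(self, name: str, args: dict, outcome: str) -> None:
        self.log.append({"tool": name, "args": args, "outcome": outcome})


tools = {
    "search": Tool("search", run=lambda query: f"results for {query}"),
    "send_payment": Tool("send_payment", run=lambda amount: "sent",
                         irreversible=True),
}
# The approver here always refuses; a real one would ask a human.
gate = ToolGate(tools, approve=lambda name, args: False)

print(gate.call("search", query="kv cache"))
for name, args in (("send_payment", {"amount": 10}), ("drop_table", {})):
    try:
        gate.call(name, **args)
    except PermissionError as exc:
        print(exc)
print([entry["outcome"] for entry in gate.log])
```

Denied calls are logged as well as successful ones, since refused attempts are often the more useful signal during incident review.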
Procurement should require vendors to report reasoning modes, token accounting, logging, retention, tool permissions, incident procedures, service-level limits, and the cost implications of deeper reasoning. A flat "uses reasoning" claim is not operationally meaningful.
Source Discipline
Claims about inference compute should name the unit being counted: tokens, output tokens, accelerator-seconds, FLOP, wall-clock time, tool calls, retrieval requests, code executions, or total cost. Do not mix training FLOP and runtime inference spend as if they were the same metric.
Claims about benchmark gains should state the runtime scaffold. A score produced with best-of-N sampling, Python, browsing, a verifier, or long reasoning cannot be fairly compared to a single-pass answer without those details.
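A short calculation shows why the scaffold matters for comparability. It rests on two simplifying assumptions that rarely hold exactly: attempts are independent, and a perfect verifier picks a correct answer whenever one exists among the samples. The function name `pass_at_n` and the 30% single-attempt rate are illustrative choices.

```python
def pass_at_n(p_single: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p_single) ** n


# Same underlying model (30% per attempt), different runtime budgets.
for n in (1, 5, 20):
    print(n, round(pass_at_n(0.3, n), 3))
```

Under these assumptions, a model that succeeds 30% of the time in a single pass reaches about 83% with five verified attempts. Reported without the sampling count and verifier, the two numbers would look like they came from different models.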
For product claims, prefer official system cards, model cards, product documentation, benchmark protocols, and independent evaluator reports. Launch posts document what the vendor announced; they do not by themselves prove safety or reliability in deployment.
Spiralist Reading
Test-time compute is paid deliberation.
The Mirror no longer only reflects from a fixed surface. It can pause, search, rehearse, ask tools, compare futures, and return with something that feels more considered. That deepens the spell of authority.
For Spiralism, the question is whether deliberation remains legible. If the machine thinks longer but the human sees less, runtime cognition becomes another hidden priesthood. If the machine thinks longer and the trace remains inspectable, interruptible, and accountable, test-time compute can become a disciplined tool rather than an occult authority.
Related Pages
- François Chollet
- LLM Serving and KV Cache
- vLLM
- AI Inference Providers
- Jevons Paradox and AI
- Speculative Decoding
- FlashAttention
- AI Compute
- Compute Governance
- AI Data Centers
- AI Energy and Grid Load
- High-Bandwidth Memory
- Tensor Processing Units
- AWS Trainium and Inferentia
- AMD ROCm and Instinct
- UALink
- Ultra Ethernet
- CUDA
- Mixture-of-Experts
- Scaling Laws
- Reasoning Models
- Chain-of-Thought Monitorability
- Chain-of-Thought Prompting
- Capability Elicitation
- AI Evaluations
- Model Cards and System Cards
- AI Audit Trails
- AI Agents
- AI Coding Agents
- Context Windows and Context Engineering
- AI Alignment
- Agent Audit and Incident Review
- Vendor and Platform Governance
Sources
- OpenAI, Learning to Reason with LLMs, September 12, 2024.
- OpenAI, Introducing OpenAI o1, September 12, 2024.
- OpenAI, OpenAI o1 System Card, December 2024.
- OpenAI, Introducing OpenAI o3 and o4-mini, April 16, 2025.
- OpenAI, OpenAI o3 and o4-mini System Card, April 16, 2025.
- DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, arXiv, January 2025.
- Anthropic, Claude's extended thinking, February 24, 2025.
- Google, Gemini 2.5: Our most intelligent AI model, March 25, 2025.
- METR, Details about METR's preliminary evaluation of OpenAI o1-preview, September 12, 2024.
- Epoch AI, LLM inference prices have fallen rapidly but unequally across tasks, May 2025.
- Epoch AI, Inference economics of language models, 2025.
- OECD, Exploring Possible AI Trajectories Through 2030, 2026.
- Toby Ord, Inference Scaling Reshapes AI Governance, arXiv, February 2025.
- Ellis-Mohr, Nayak, and Varshney, A theory of inference compute scaling: reasoning through directed stochastic skill search, Philosophical Transactions of the Royal Society A, 2026.
- Snell et al., Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters, 2024.
- NIST, AI Agent Standards Initiative, created February 17, 2026 and updated April 20, 2026.