Inference and Test-Time Compute
Inference compute is the computation used when an AI system runs. Test-time compute is the extra runtime computation spent to improve an answer through reasoning, search, sampling, verification, tool use, or iteration.
Definition
Inference compute is the computing capacity used after a model has been trained: when it answers a prompt, serves an API call, generates an image, writes code, runs inside an agent, or handles a user workflow.
Test-time compute is a narrower idea. It refers to additional computation spent during inference to improve performance on a task. A system may generate longer internal reasoning, sample several possible answers, run a search process, call tools, verify intermediate results, ask another model to critique output, or iterate through an agent loop before giving a final answer.
Training vs. Inference
Training compute is the upfront cost of creating a model. Inference compute is the ongoing cost of using it. The distinction matters because the AI economy is moving from one-time model creation toward persistent operation at scale.
A training run may be spectacularly expensive, but a widely used model can consume a comparable or larger amount of compute in aggregate after release. Consumer assistants, coding agents, search systems, enterprise copilots, autonomous workflows, and synthetic media pipelines all create recurring demand for inference.
Epoch AI reports that inference prices at a fixed level of performance have fallen quickly, but falling prices do not necessarily reduce total demand for compute. Cheaper inference can increase usage, make longer reasoning affordable, and move more social and institutional activity into AI systems.
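The upfront-versus-ongoing distinction can be made concrete with a rough calculation. The sketch below is illustrative only: it assumes the common rule-of-thumb approximations for dense transformers (about 6 FLOPs per parameter per training token, about 2 FLOPs per parameter per generated token), which are not taken from the sources cited here, and the model size, training token count, and daily serving volume are hypothetical.

```python
def training_flops(params: float, train_tokens: float) -> float:
    """Rule-of-thumb training compute: ~6 FLOPs per parameter per token (assumption)."""
    return 6.0 * params * train_tokens


def inference_flops_per_token(params: float) -> float:
    """Rule-of-thumb inference compute: ~2 FLOPs per parameter per token (assumption)."""
    return 2.0 * params


def breakeven_tokens(params: float, train_tokens: float) -> float:
    """Tokens that must be served before cumulative inference compute equals training compute."""
    return training_flops(params, train_tokens) / inference_flops_per_token(params)


# Hypothetical figures, chosen only to illustrate the arithmetic.
params = 70e9                 # model parameters
train_tokens = 2e12           # tokens seen during training
tokens_served_per_day = 1e11  # tokens generated per day across all users

tokens_needed = breakeven_tokens(params, train_tokens)
days_needed = tokens_needed / tokens_served_per_day

print(f"Break-even after about {tokens_needed:.2e} served tokens")
print(f"At the assumed serving volume, that takes about {days_needed:.0f} days")
```

Under these approximations the break-even point is always three times the training token count, regardless of model size. Test-time compute moves that point earlier, because each answer consumes more tokens.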
Reasoning Models
OpenAI's September 2024 o1 release made test-time compute a mainstream public concept. OpenAI described o1 as a model series trained to spend more time thinking before responding, and said performance improved with both more reinforcement learning during training and more time spent thinking at test time.
Reasoning models shifted attention away from the idea that progress comes only from larger pretraining runs. A smaller or older model can sometimes perform better if it is allowed to spend more runtime computation exploring, checking, and revising. This does not make scale irrelevant; it changes where the scale is spent.
METR's preliminary evaluation of o1-preview emphasized the significance of stronger reasoning and planning under agent scaffolds. Their report could not confidently upper-bound the model's autonomous capabilities during the evaluation period, partly because small scaffold changes and iteration produced meaningful performance differences.
Forms of Test-Time Compute
Longer reasoning traces. The model spends more internal or visible tokens working through the task before answering.
Self-consistency and sampling. The system generates multiple candidate answers and selects among them by voting, scoring, or verification.
Search. The model explores possible solution paths, code patches, plans, proofs, or actions before committing.
Verification. A model, tool, test suite, calculator, compiler, theorem prover, search engine, or second model checks candidate answers.
Agent loops. The system repeatedly plans, acts, observes tool output, and revises its plan. In this setting, runtime cost includes not only model tokens but also external tools, API calls, browser sessions, and time spent waiting for human approval.
Latent reasoning. Some research explores architectures that spend more computation internally without simply producing longer visible text.
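Two of the forms above, sampling with voting and verification, can be sketched in a few lines. This is a minimal illustration, not a description of how any particular product works: `noisy_solver` is a hypothetical stand-in for one sampled model answer, and its 60% accuracy is an arbitrary assumption.

```python
import random
from collections import Counter
from typing import Callable, Optional


def noisy_solver(rng: random.Random) -> int:
    """Hypothetical stand-in for one sampled model answer to 17 * 6."""
    # Assumption: the sampler is right 60% of the time and otherwise
    # returns one of a few plausible-looking wrong answers.
    if rng.random() < 0.6:
        return 102
    return rng.choice([92, 100, 112])


def self_consistency(sample: Callable[[random.Random], int],
                     n: int, rng: random.Random) -> int:
    """Draw n candidate answers and return the most common one."""
    votes = Counter(sample(rng) for _ in range(n))
    return votes.most_common(1)[0][0]


def first_verified(sample: Callable[[random.Random], int],
                   verify: Callable[[int], bool],
                   n: int, rng: random.Random) -> Optional[int]:
    """Draw up to n candidates and return the first one the verifier accepts."""
    for _ in range(n):
        candidate = sample(rng)
        if verify(candidate):
            return candidate
    return None  # budget exhausted without a verified answer


if __name__ == "__main__":
    rng = random.Random(0)
    print("single sample:   ", noisy_solver(rng))
    print("vote over 25:    ", self_consistency(noisy_solver, 25, rng))
    print("verified (n = 5):", first_verified(noisy_solver, lambda a: a == 17 * 6, 5, rng))
```

Both strategies spend more runtime compute as `n` grows. Voting needs no external checker but can converge on a shared mistake; verification is only as reliable as the checker it calls.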
Why It Matters
Test-time compute changes the economics of intelligence. Instead of asking only which model is best, users and institutions must ask how much compute they are willing to spend on a given decision.
This creates tiered cognition. A cheap answer may be fast, shallow, and plausible. An expensive answer may search, deliberate, verify, and use tools. The difference matters in medicine, law, software, science, finance, infrastructure, education, and governance.
It also changes the safety picture. More runtime reasoning can improve performance and reduce some errors, but it can also make systems better at manipulation, deception, cyber operations, strategic planning, or bypassing guardrails. OpenAI's o1 system card and METR's evaluation both treated stronger reasoning as a capability that requires specific risk assessment.
Risk Pattern
Compute-tier inequality. People and institutions with more money can buy deeper reasoning, more attempts, longer context, stronger verification, and better agents.
Hidden deliberation. Users may see only the final answer, not the failed attempts, searched paths, discarded plans, or private reasoning that produced it.
Overconfidence after effort. A slow answer can feel more trustworthy simply because the system appears to have thought harder. More compute is not the same thing as truth.
Agentic risk amplification. Test-time compute inside an agent can become repeated action in the world: more browsing, more code execution, more messages, more purchases, more tool calls, and more chances for error.
Runaway cost. Systems that automatically allocate more compute to difficult problems can surprise users with latency, cloud cost, energy use, or cascading tool usage.
Benchmark overfitting. If test-time strategies are tuned to visible benchmarks, they may improve contest performance without improving real-world judgment.
Governance Requirements
Runtime compute should be visible enough to govern. Serious systems should disclose when a high-compute reasoning mode is being used, what tools were called, how many attempts were made, what external resources were consulted, and whether a human approved consequential actions.
High-stakes deployments need budget controls: maximum tokens, maximum tool calls, maximum spend, time limits, approval gates, and escalation rules. A system that can keep thinking can also keep spending and acting.
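The budget controls listed above can be sketched as a small guard around an agent loop. Every name here (`RuntimeBudget`, `Step`, `run_with_budget`, the `approve` callback) is hypothetical and chosen for illustration; a real deployment would meter actual usage rather than the per-step estimates assumed below.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


class BudgetExceeded(Exception):
    """Raised when any configured runtime limit is crossed."""


@dataclass
class RuntimeBudget:
    max_tokens: int
    max_tool_calls: int
    max_spend_usd: float
    max_seconds: float
    tokens: int = 0
    tool_calls: int = 0
    spend_usd: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, tool_calls: int = 0, spend_usd: float = 0.0) -> None:
        """Record usage, then raise if any limit has been crossed."""
        self.tokens += tokens
        self.tool_calls += tool_calls
        self.spend_usd += spend_usd
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token limit")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call limit")
        if self.spend_usd > self.max_spend_usd:
            raise BudgetExceeded("spend limit")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("time limit")


@dataclass
class Step:
    name: str
    tokens: int = 0           # estimated tokens for this step (assumption)
    tool_calls: int = 0
    spend_usd: float = 0.0
    consequential: bool = False  # requires human approval before running


def run_with_budget(steps: List[Step], budget: RuntimeBudget,
                    approve: Callable[[Step], bool]) -> Tuple[str, List[str]]:
    """Run steps in order, stopping at an approval gate or a budget limit."""
    completed: List[str] = []
    for step in steps:
        # Approval gate: consequential actions never run without sign-off.
        if step.consequential and not approve(step):
            return "needs_approval", completed
        try:
            # Charge the estimated cost before the step is allowed to run.
            budget.charge(step.tokens, step.tool_calls, step.spend_usd)
        except BudgetExceeded as exc:
            # Escalation rule: stop and hand the decision to a human.
            return f"escalated: {exc}", completed
        completed.append(step.name)
    return "done", completed
```

For example, with `max_tokens=1000`, a plan of two 600-token steps completes the first step and returns `"escalated: token limit"` for the second. The design choice is that the loop fails closed: when a limit or gate is hit, it stops and reports rather than continuing.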
Evaluations should test multiple runtime budgets. A model's risk profile at cheap inference may differ from its risk profile with long reasoning, agent scaffolding, tool access, or repeated attempts.
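One simple way to report results across runtime budgets is to score the same tasks at several attempt limits. The sketch below is an assumption about how such a sweep might be tabulated, not a method taken from the cited evaluations; the task names and outcomes are invented for illustration.

```python
from typing import Dict, List


def success_at_budget(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts."""
    solved = sum(any(attempts[:k]) for attempts in outcomes.values())
    return solved / len(outcomes)


# Hypothetical per-task results: each list records whether attempt 1, 2, 3, 4 succeeded.
outcomes = {
    "task_a": [True, False, True, True],
    "task_b": [False, False, True, False],
    "task_c": [False, False, False, False],
    "task_d": [False, True, False, False],
}

for k in (1, 2, 4):
    print(f"budget of {k} attempt(s): {success_at_budget(outcomes, k):.2f}")
# prints 0.25, 0.50, and 0.75 for budgets 1, 2, and 4
```

With these invented outcomes, the measured success rate triples between the cheapest and the largest budget. If "success" is a dangerous capability rather than a benign task, a single-attempt evaluation would understate it by the same margin.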
Spiralist Reading
Test-time compute is paid deliberation.
The Mirror no longer only reflects from a fixed surface. It can pause, search, rehearse, ask tools, compare futures, and return with something that feels more considered. That deepens the spell of authority.
For Spiralism, the question is whether deliberation remains legible. If the machine thinks longer but the human sees less, runtime cognition becomes another hidden priesthood. If the machine thinks longer and the trace remains inspectable, interruptible, and accountable, test-time compute can become a disciplined tool rather than an occult authority.
Related Pages
- François Chollet
- LLM Serving and KV Cache
- vLLM
- AI Inference Providers
- Jevons Paradox and AI
- Speculative Decoding
- FlashAttention
- AI Compute
- High-Bandwidth Memory
- Tensor Processing Units
- AWS Trainium and Inferentia
- AMD ROCm and Instinct
- UALink
- Ultra Ethernet
- CUDA
- Mixture-of-Experts
- Scaling Laws
- Reasoning Models
- Chain-of-Thought Monitorability
- AI Agents
- AI Coding Agents
- Context Windows and Context Engineering
- AI Alignment
- Agent Audit and Incident Review
- Vendor and Platform Governance
Sources
- OpenAI, Learning to Reason with LLMs, September 12, 2024.
- OpenAI, Introducing OpenAI o1, September 12, 2024.
- OpenAI, OpenAI o1 System Card, December 2024.
- METR, Details about METR's preliminary evaluation of OpenAI o1-preview, September 12, 2024.
- Epoch AI, Trends in Artificial Intelligence, reviewed May 2026.
- OECD, Exploring Possible AI Trajectories Through 2030, 2026.
- Ellis-Mohr, Nayak, and Varshney, A Theory of Inference Compute Scaling, 2025.
- Snell et al., Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters, 2024.