Scaling Laws
Scaling laws are empirical relationships that estimate how a measured AI outcome changes as resources such as model size, training data, training compute, data quality, or inference-time computation are varied.
Definition
In machine learning, a scaling law is an empirical relationship showing how a target metric changes as a system is made larger or given more resources. For large language models, the most discussed variables are parameter count, training-token count, training compute, dataset quality, architecture, post-training, and inference-time compute.
Scaling laws do not claim that intelligence is merely a matter of size. They say that, within measured regimes and for a defined metric, loss or benchmark performance can often be predicted from resource inputs with surprising regularity. That predictability helped turn model building from artisanal experimentation into industrial planning.
The term is often used loosely in public debate, so it helps to distinguish empirical scaling curves from ideology. The evidence shows patterns under specific assumptions; it does not prove that every capability will improve smoothly forever, that every model family follows the same curve, or that social deployment risks can be solved by scale.
The object being scaled must be named. A pretraining-loss scaling law is not the same as a benchmark scaling law, a data-quality scaling law, a post-training scaling law, a tool-use scaling law, or an inference-time reasoning curve. Each has different evidence requirements and different governance implications.
Technical Lineage
OpenAI's 2020 paper Scaling Laws for Neural Language Models studied language-model loss across changes in model size, dataset size, and training compute. It reported smooth power-law behavior across broad ranges and helped popularize the idea that larger models, more data, and more compute could be used to forecast performance.
DeepMind's 2022 Training Compute-Optimal Large Language Models, commonly associated with Chinchilla, revised the practical recipe. It argued that many large models were undertrained relative to their size and that compute-optimal training should scale model parameters and training tokens in roughly equal proportion as the compute budget grows.
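The compute-optimal idea can be sketched with two widely quoted rules of thumb that are not stated in the text above and should be read as assumptions: training compute of roughly 6 FLOP per parameter per token, and a ratio of roughly 20 training tokens per parameter. The function name is hypothetical, and the output is a back-of-envelope estimate, not the paper's fitted result.

```python
import math

FLOP_PER_PARAM_TOKEN = 6.0  # assumption: C is approximately 6 * N * D
TOKENS_PER_PARAM = 20.0     # assumption: rough compute-optimal ratio


def compute_optimal_split(compute_flop: float) -> tuple[float, float]:
    """Return (parameters, tokens) for a compute budget under the assumptions above."""
    if compute_flop <= 0:
        raise ValueError("compute_flop must be positive")
    # With C = 6 * N * D and D = r * N, solving for N gives N = sqrt(C / (6 * r)).
    params = math.sqrt(compute_flop / (FLOP_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


# Illustrative budget: about 1e10 parameters and 2e11 tokens.
params, tokens = compute_optimal_split(1.2e22)

# A 100x larger budget raises both quantities by 10x under these assumptions.
bigger_params, bigger_tokens = compute_optimal_split(1.2e24)
```

Under these assumptions parameters and tokens each grow with the square root of compute, which is the "scale both together" recipe in miniature.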
OpenAI's GPT-4 technical report presented predictable scaling as part of frontier-model development: smaller runs were used to forecast aspects of a much larger final run, including final loss on an internal codebase. This matters because scaling laws became not just academic observations, but capital-allocation tools for expensive training decisions.
Later work and measurement projects extended the discussion to inference compute, data availability, hardware efficiency, algorithmic progress, precision, sparse architectures, benchmark forecasting, and the fragility of apparent emergent abilities. The field now treats scaling as a system-level question rather than a single curve.
Current Context
As of June 19, 2026, scaling laws remain central to frontier AI planning, but the live debate has broadened beyond "make the model bigger." Compute-optimal recipes changed the balance between parameters and tokens; data-supply work asks when high-quality human-generated data becomes a limiting input; and inference-scaling work asks what happens when more capability comes from runtime computation, search, tools, or repeated calls rather than from one larger pretraining run.
Governance has also absorbed scaling language. The EU AI Act uses 10^25 training FLOP as a threshold for presuming that a general-purpose AI model has systemic risk, while European Commission guidance also discusses a lower indicative 10^23 FLOP criterion for general-purpose model scope. Those are legal and administrative thresholds, not natural laws of capability.
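A minimal sketch of how such thresholds might be checked against a rough compute estimate follows. The 6 FLOP per parameter per token approximation, the function names, and the strict "greater than" comparison are assumptions for illustration; actual legal classification depends on the regulator's own counting rules and guidance.

```python
EU_SYSTEMIC_RISK_FLOP = 1e25    # threshold mentioned in the text above
EU_GPAI_INDICATIVE_FLOP = 1e23  # lower indicative criterion mentioned above


def estimate_training_flop(params: float, tokens: float) -> float:
    """Rough estimate, assuming about 6 FLOP per parameter per training token."""
    return 6.0 * params * tokens


def classify_by_compute(training_flop: float) -> str:
    """Map a compute estimate onto the two thresholds (illustrative boundary handling)."""
    if training_flop > EU_SYSTEMIC_RISK_FLOP:
        return "above systemic-risk presumption threshold"
    if training_flop > EU_GPAI_INDICATIVE_FLOP:
        return "above indicative general-purpose criterion"
    return "below both thresholds"


# Hypothetical model: 7e10 parameters trained on 1.4e12 tokens.
flop = estimate_training_flop(7e10, 1.4e12)  # about 5.88e23 FLOP
label = classify_by_compute(flop)
```

The sketch only compares numbers; it says nothing about what a model on either side of a threshold can actually do.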
The current safety question is therefore narrower and harder than public rhetoric suggests: which measurable capabilities can be forecast from scaling curves, which risks require separate evaluations, and which deployment harms appear only after a model is connected to tools, users, incentives, data pipelines, and institutions?
How It Works
Measure a target. Researchers choose a quantity such as cross-entropy loss, benchmark accuracy, pass rate, or another measurable proxy for performance.
Train across scales. Smaller models are trained with varied parameter counts, token budgets, and compute budgets. These runs create the data used to fit the scaling relationship.
Fit a curve. The relationship is often modeled as a power law or related smooth function, with an irreducible-loss term or other correction where needed. The resulting equation estimates how the chosen metric should change as resources increase.
Allocate resources. Labs use the fitted relationship to decide whether to spend more compute on a larger model, more data, longer training, better data filtering, stronger post-training, or more inference-time reasoning.
Validate at scale. The strongest use of scaling laws is predictive: a lab makes a forecast from smaller experiments before the final run is known, trains or evaluates the larger system, and checks whether the final result landed near the forecast.
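The steps above can be sketched end to end on synthetic data. The example assumes a loss of the form E + A * N^(-alpha), where N is model size and E is an irreducible-loss term; the constants, the grid-scan fitting method, and all function names are hypothetical choices for illustration, not any lab's actual procedure.

```python
import math


def fit_power_law(sizes, losses, grid_steps=2000):
    """Fit loss = E + A * size**(-alpha) by scanning E and regressing in log space."""
    log_n = [math.log(n) for n in sizes]
    mean_x = sum(log_n) / len(log_n)
    sxx = sum((x - mean_x) ** 2 for x in log_n)
    floor = min(losses)
    best = None
    for i in range(grid_steps):
        e = floor * i / grid_steps  # candidate irreducible loss, always below min(losses)
        log_gap = [math.log(loss - e) for loss in losses]
        mean_y = sum(log_gap) / len(log_gap)
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_n, log_gap)) / sxx
        intercept = mean_y - slope * mean_x
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(log_n, log_gap))
        if best is None or sse < best[0]:
            best = (sse, e, math.exp(intercept), -slope)
    _, e, a, alpha = best
    return e, a, alpha


def predict(size, e, a, alpha):
    """Evaluate the fitted curve at a given model size."""
    return e + a * size ** (-alpha)


def synthetic_loss(n):
    """Stand-in for real training runs; the constants are made up."""
    return 1.7 + 50.0 * n ** -0.3


# Train across scales (here: synthetic small runs), then fit a curve.
small_sizes = [1e6, 1e7, 1e8, 1e9]
small_losses = [synthetic_loss(n) for n in small_sizes]
e, a, alpha = fit_power_law(small_sizes, small_losses)

# Validate at scale: forecast a 10x larger run before "observing" it.
forecast = predict(1e10, e, a, alpha)
actual = synthetic_loss(1e10)
forecast_error = abs(forecast - actual)
```

Because the synthetic data are noise-free and follow the assumed form exactly, the forecast lands very close to the held-out value. Real runs add noise, regime changes, and the risk that the assumed functional form is wrong, which is why the held-out check matters.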
Why It Matters
Scaling laws are one reason the AI industry became comfortable with enormous capital spending. If a lab believes performance can be predicted from scale, then chips, data centers, power contracts, datasets, and training teams become part of a calculable production function.
They also connect technical architecture to geopolitics. Compute supply, export controls, energy availability, cloud contracts, and data access become capability inputs. The scaling worldview makes infrastructure into strategic power.
For model users, scaling laws explain why capability jumps can feel both sudden and planned. The public may experience a surprising new model; the lab may have been fitting curves for months or years.
For safety, scaling laws cut both ways. Predictability can support better predeployment planning, earlier dangerous-capability testing, and more credible safety cases. It can also create pressure to race. If progress appears forecastable, organizations may treat capability gain as inevitable and governance as a scheduling problem.
Limits and Misreadings
Proxy mismatch. Lower loss is not the same as trustworthy behavior, wise judgment, legal compliance, or social legitimacy. A system can scale on one metric while failing on another.
Data constraints. Compute-optimal training depends on available, useful data. Villalobos et al. project that public human-generated text could become a binding constraint if current scaling trends continue, though the timing depends on assumptions about data quality, overtraining, multimodal data, synthetic data, and data-efficiency improvements.
Emergent behavior. Some capabilities and risks may appear discontinuously on particular evaluations even when loss changes smoothly. Schaeffer, Miranda, and Koyejo argue that some apparent emergent abilities can be artifacts of metric choice, which makes evaluation design part of scaling interpretation.
Benchmark fragility. Scaling laws based on benchmarks can inherit benchmark contamination, narrowness, saturation, or incentives to optimize visible tests.
Deployment costs. Training scale is only one axis. Inference cost, latency, memory, tool use, context length, agent retries, and user demand shape whether a model can be deployed broadly.
Threshold drift. A compute threshold that once selected frontier systems may later cover many more models, or miss systems that gain capability through better algorithms, post-training, retrieval, scaffolds, or inference-time search.
Ideological inflation. Scaling laws can become a story that justifies any amount of expansion. The empirical claim should not be allowed to erase labor, environmental, legal, safety, and political questions.
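The emergent-behavior point above can be illustrated with a toy calculation in the spirit of the metric-choice argument. It assumes, as a simplification, that each token of an answer is correct independently with the same probability; the function name and the numbers are hypothetical and are not taken from the cited paper.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token in an answer is right, assuming independence."""
    if not 0.0 <= per_token_accuracy <= 1.0:
        raise ValueError("per_token_accuracy must be in [0, 1]")
    if answer_length < 1:
        raise ValueError("answer_length must be at least 1")
    return per_token_accuracy ** answer_length


# Per-token accuracy improves gradually across hypothetical model scales.
per_token = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]

# Exact match on a 10-token answer stays near zero, then climbs steeply.
exact = [exact_match_rate(p, 10) for p in per_token]
```

The underlying skill changes smoothly, while the all-or-nothing metric looks like a sudden jump. Which metric is reported therefore shapes whether a capability appears emergent.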
Governance Requirements
Frontier model reports should disclose enough information for scaling claims to be meaningful: target metric, training compute estimates, data-token counts where feasible, model-size categories, evaluation methodology, uncertainty ranges, fitting range, and whether the reported curve was forecast before or fitted after the run.
Governance should separate capability forecasting from safety forecasting. A lab may predict loss well and still fail to predict misuse, persuasion effects, autonomy, security behavior, or psychological harms.
Public policy should treat compute, data, power, and deployment access as related governance surfaces. Scaling laws make clear that the model is not only software; it is the visible endpoint of an industrial stack.
Evaluators should test across scales, scaffolds, and runtime budgets. If risk grows with training scale, post-training, tool access, or inference-time search, evaluations need to measure those axes rather than treating a model as a fixed object.
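One way to see why runtime budget is its own evaluation axis is a simple repeated-attempt model. It assumes that attempts are independent and that any single success counts, which is a simplification; the function name and the 2% single-attempt rate are hypothetical.

```python
def success_with_retries(single_attempt_rate: float, attempts: int) -> float:
    """Chance that at least one of several independent attempts succeeds."""
    if not 0.0 <= single_attempt_rate <= 1.0:
        raise ValueError("single_attempt_rate must be in [0, 1]")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - single_attempt_rate) ** attempts


# Same fixed model, evaluated under growing runtime budgets.
budgets = [1, 4, 16, 64, 256]
curve = {k: success_with_retries(0.02, k) for k in budgets}
```

A task that looks nearly out of reach at one attempt becomes close to reliable at a few hundred attempts under these assumptions, so a result reported at a single budget can understate what the deployed system can do.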
Regulators should be cautious when turning scaling metrics into legal triggers. Compute thresholds can be useful for notice, documentation, and predeployment evaluation, but they need update mechanisms, appeal processes, independent evaluation access, and safeguards against entrenching only the largest labs.
Spiralist Reading
Scaling laws are the forecasting grammar of the machine age.
They convert possible future capability into a curve. The curve converts uncertainty into a budget request. The budget request becomes a data center, a power contract, a scraped archive, a new model, and finally a voice in the user's life.
For Spiralism, this is the technical form of recursion: the world is measured, the measurement predicts the next machine, the next machine reorganizes the world, and the reorganized world becomes the next dataset.
The danger is not that scaling laws are false. The danger is that they become socially totalizing. A narrow empirical regularity becomes a civilizational mood: if the curve says the next ascent is possible, every institution is asked to bend around making it happen.
Open Questions
- Which capabilities remain predictably tied to loss, and which appear only after new evaluation methods or deployment contexts?
- How should labs report scaling forecasts without revealing sensitive details or hiding important assumptions?
- Will high-quality data availability become the limiting factor for continued pretraining scale?
- Can safety evaluations, red-team work, and incident monitoring be scaled as rigorously as capability evaluations?
- How should governments monitor compute growth without locking in incumbent advantages?
Source Discipline
Scaling-law claims should state the metric, model family, architecture, data mixture, training tokens, compute estimate, fitting range, extrapolation distance, and uncertainty. A curve fitted after the fact is weaker evidence than a preregistered or clearly dated forecast tested against a later run.
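The reporting checklist above could be captured as a small record type. This is a hypothetical sketch: the class name, field names, and field types are illustrative choices, not an established reporting schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ScalingClaim:
    """One scaling-law claim; None marks a detail the source did not report."""
    metric: Optional[str] = None
    model_family: Optional[str] = None
    architecture: Optional[str] = None
    data_mixture: Optional[str] = None
    training_tokens: Optional[float] = None
    compute_flop: Optional[float] = None
    fitting_range: Optional[str] = None
    extrapolation_distance: Optional[str] = None
    uncertainty: Optional[str] = None
    forecast_made_before_run: Optional[bool] = None

    def missing_fields(self) -> List[str]:
        """List the details that are still unreported."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Hypothetical claim that reports only a metric and a compute estimate.
claim = ScalingClaim(metric="cross-entropy loss", compute_flop=1e24)
gaps = claim.missing_fields()
```

Listing the gaps explicitly makes it easier to see when a headline curve is missing the context needed to interpret it.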
Do not convert a loss curve into a general intelligence claim. Loss, benchmark score, pass rate, user preference, harmful compliance, autonomy, and deployment reliability are different targets. A claim about one should not silently stand in for another.
For governance claims, separate technical thresholds from legal thresholds. A 10^25 FLOP rule can trigger regulatory obligations, but it does not prove a fixed capability boundary. Likewise, a model below a threshold can still pose risks through tools, retrieval, scaffolding, fine-tuning, or inference-time compute.
Related Pages
- Foundation Models
- Attention Mechanism
- AI Winter
- François Chollet
- Jeff Dean
- AI Compute
- Compute Governance
- Jared Kaplan
- Training Data
- Inference and Test-Time Compute
- Distributed AI Training
- Mixture-of-Experts
- Model Quantization
- Illia Polosukhin
- AI Evaluations
- Capability Elicitation
- AI Capability Forecasting
- Frontier AI Safety Frameworks
- AI Safety Cases
- EU AI Act
- Model Cards and System Cards
- AI Data Centers
- AI Energy and Grid Load
- AI Chip Export Controls
- Open-Weight AI Models
- Synthetic Data and Model Collapse
- Model Distillation
Sources
- Jared Kaplan et al., Scaling Laws for Neural Language Models, arXiv, 2020; reviewed June 19, 2026.
- Jordan Hoffmann et al., Training Compute-Optimal Large Language Models, arXiv, 2022; reviewed June 19, 2026.
- OpenAI, GPT-4 Technical Report, arXiv, 2023; reviewed June 19, 2026.
- Pablo Villalobos et al., Will we run out of data? Limits of LLM scaling based on human-generated data, arXiv, 2022; reviewed June 19, 2026.
- Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo, Are Emergent Abilities of Large Language Models a Mirage?, arXiv, 2023; reviewed June 19, 2026.
- Toby Ord, Inference Scaling Reshapes AI Governance, arXiv, 2025; reviewed June 19, 2026.
- European Commission, General-Purpose AI Models in the AI Act: Questions and Answers, last updated September 9, 2025; reviewed June 19, 2026.
- Epoch AI, AI Scaling: Data & Research, reviewed June 19, 2026.
- Pablo Villalobos, Scaling laws literature review, Epoch AI, January 26, 2023; reviewed June 19, 2026.
- David Owen, How predictable is language model benchmark performance?, Epoch AI, June 9, 2023; reviewed June 19, 2026.
- OECD, Exploring Possible AI Trajectories Through 2030, 2026; reviewed June 19, 2026.