Scaling Laws
Scaling laws are empirical regularities that relate AI model performance to resources such as model size, training data, training compute, and sometimes inference-time computation.
Definition
In machine learning, a scaling law is an empirical relationship showing how performance changes as a system is made larger or given more resources. For large language models, the most discussed variables are parameter count, training-token count, training compute, dataset quality, architecture, and inference-time compute.
Scaling laws do not claim that intelligence is only a matter of size. They say that, within measured regimes, loss or benchmark performance can often be predicted from resource inputs with surprising regularity. That predictability helped turn model building from artisanal experimentation into something closer to industrial planning.
The term is often used loosely in public debate, so it is worth distinguishing empirical scaling curves from ideology. The evidence shows patterns under specific assumptions; it does not prove that every capability will improve smoothly forever, or that social deployment risks can be solved by scale.
Technical Lineage
OpenAI's 2020 paper Scaling Laws for Neural Language Models studied language-model loss across changes in model size, dataset size, and training compute. It reported smooth power-law behavior across broad ranges and helped popularize the idea that performance could be forecast from model size, data, and compute.
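As a rough sketch of the functional form involved, the paper describes loss as a power law in each resource when the other resources are not the bottleneck. The notation below is written in a form similar to the paper's: N is the parameter count, D the dataset size in tokens, and C_min the compute budget when allocated efficiently. The constants and exponents are fitted quantities and are deliberately not reproduced here.

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}
```

On a log-log plot each of these is a straight line, which is why such trends are comparatively easy to fit on small runs and extrapolate.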
DeepMind's 2022 Training Compute-Optimal Large Language Models, commonly associated with Chinchilla, revised the practical recipe. It argued that many large models of the time were undertrained relative to their size, and that compute-optimal training should scale model parameters and training tokens in roughly equal proportion rather than growing parameters much faster than data.
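The compute-optimal idea can be sketched with two common simplifying assumptions, neither of which is an exact result: training compute is approximated as C ≈ 6·N·D FLOPs, and the token-to-parameter ratio is held near 20, a rule of thumb often attributed to the Chinchilla results. The function name and default values below are illustrative choices, not taken from the paper.

```python
import math


def compute_optimal_split(compute_flops, tokens_per_param=20.0,
                          flops_per_param_token=6.0):
    """Split a training FLOP budget into parameters N and tokens D.

    Assumes C ~= flops_per_param_token * N * D and D = tokens_per_param * N.
    Both are rough heuristics, not exact laws.
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    # Substituting D = r * N into C = k * N * D gives N = sqrt(C / (k * r)).
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n_params, n_tokens = compute_optimal_split(1e23)
print(f"{n_params:.3e} parameters, {n_tokens:.3e} tokens")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count, which is the "scale both together" behavior described above.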
OpenAI's GPT-4 technical report presented predictable scaling as part of frontier-model development: smaller runs were used to forecast aspects of a much larger final run. This matters because scaling laws became not just academic observations, but capital-allocation tools for expensive training decisions.
Later work and measurement projects extended the discussion to inference compute, data availability, hardware efficiency, algorithmic progress, precision, sparse architectures, and benchmark forecasting. The field now treats scaling as a system-level question rather than a single curve.
How It Works
Measure a target. Researchers choose a quantity such as cross-entropy loss, benchmark accuracy, pass rate, or another measurable proxy for performance.
Train across scales. Smaller models are trained with varied parameter counts, token budgets, and compute budgets. These runs create the data used to fit the scaling relationship.
Fit a curve. The relationship is often modeled as a power law or related smooth function. The resulting equation estimates how performance should change as resources increase.
Allocate resources. Labs use the fitted relationship to decide whether to spend more compute on a larger model, more data, longer training, better data filtering, or more inference-time reasoning.
Validate at scale. The strongest use of scaling laws is predictive: a lab makes a forecast from smaller experiments, trains the larger system, and then checks whether the final behavior landed near the forecast.
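The steps above can be sketched end to end on synthetic data. The example fits a pure power law, loss ≈ a · compute^(-b), by linear regression in log-log space, forecasts a much larger run, and compares the forecast with a stand-in "observed" value. All numbers are made up for illustration, and the pure power law ignores any irreducible loss term that a real fit would likely include.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** (-b)


# Synthetic small-scale runs (hypothetical values, not real measurements).
rng = np.random.default_rng(0)
true_a, true_b = 20.0, 0.05
small_compute = np.array([1e17, 1e18, 1e19, 1e20, 1e21])
noise = np.exp(rng.normal(0.0, 0.005, small_compute.size))
small_loss = true_a * small_compute ** (-true_b) * noise

# Fit on the small runs only.
a_hat, b_hat = fit_power_law(small_compute, small_loss)

# Forecast a run 100x larger than the largest fitted point, then validate.
large_compute = 1e23
forecast = predict_loss(a_hat, b_hat, large_compute)
observed = true_a * large_compute ** (-true_b)  # stand-in for the real large run
relative_error = abs(forecast - observed) / observed

print(f"fitted exponent b = {b_hat:.4f}")
print(f"forecast = {forecast:.4f}, observed = {observed:.4f}, "
      f"relative error = {relative_error:.2%}")
```

The key design choice is that the large run is held out of the fit. A curve fitted after the fact to all runs, including the largest, says much less about predictive power than one committed to beforehand.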
Why It Matters
Scaling laws are one reason the AI industry became comfortable with enormous capital spending. If a lab believes performance can be predicted from scale, then chips, data centers, power contracts, datasets, and training teams become part of a calculable production function.
They also connect technical architecture to geopolitics. Compute supply, export controls, energy availability, cloud contracts, and data access become capability inputs. The scaling worldview turns infrastructure into strategic power.
For model users, scaling laws explain why capability jumps can feel both sudden and planned. The public may experience a surprising new model; the lab may have been fitting curves for months or years.
For safety, scaling laws cut both ways. Predictability can support better predeployment planning, but it can also create pressure to race. If progress appears forecastable, organizations may treat capability gain as inevitable and governance as a scheduling problem.
Limits and Misreadings
Proxy mismatch. Lower loss is not the same as trustworthy behavior, wise judgment, legal compliance, or social legitimacy. A system can scale on one metric while failing on another.
Data constraints. Compute-optimal training depends on available, useful data. If high-quality data becomes scarce, scaling recipes have to change, and leaning on synthetic data introduces new failure modes such as model collapse.
Emergent behavior. Some capabilities and risks may appear discontinuously on particular evaluations even when loss changes smoothly. Smooth curves do not guarantee smooth social impact.
Benchmark fragility. Scaling laws based on benchmarks can inherit benchmark contamination, narrowness, saturation, or incentives to optimize visible tests.
Deployment costs. Training scale is only one axis. Inference cost, latency, memory, tool use, and user demand shape whether a model can be deployed broadly.
Ideological inflation. Scaling laws can become a story that justifies any amount of expansion. The empirical claim should not be allowed to erase labor, environmental, legal, safety, and political questions.
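The emergent-behavior point above can be made concrete with a toy model. Suppose per-token accuracy improves smoothly with scale, but the evaluation only awards credit when every token of an answer is correct. Under the strong and hypothetical assumption that token errors are independent, the all-or-nothing score stays near zero for a long time and then rises steeply. This is one proposed account of apparent jumps, not a claim that all emergent behavior is a metric artifact.

```python
import numpy as np


def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token is right, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length


# Hypothetical smooth improvement in per-token accuracy as models scale up.
per_token_accuracy = np.linspace(0.80, 0.99, 5)

for answer_length in (1, 10, 50):
    scores = exact_match_rate(per_token_accuracy, answer_length)
    print(f"length {answer_length:>2}: {np.round(scores, 3)}")
```

For single-token answers the score tracks the smooth underlying trend; for long answers the same underlying improvement looks like a sudden jump near the top of the range.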
Governance Requirements
Frontier model reports should disclose enough information for scaling claims to be meaningful: training compute estimates, data-token counts where feasible, model-size categories, evaluation methodology, uncertainty ranges, and whether the reported curve was forecast before or fitted after the run.
Governance should separate capability forecasting from safety forecasting. A lab may predict loss well and still fail to predict misuse, persuasion effects, autonomy, security behavior, or psychological harms.
Public policy should treat compute, data, power, and deployment access as related governance surfaces. Scaling laws make clear that the model is not only software; it is the visible endpoint of an industrial stack.
Evaluators should test across scales and runtime budgets. If risk grows with training scale or inference-time search, evaluations need to measure those axes rather than treating a model as a fixed object.
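A minimal sketch of why runtime budget matters for evaluation: if a model succeeds at a task with some probability on a single attempt, and attempts are treated as independent (a simplifying assumption; real attempts are often correlated), the chance of at least one success grows quickly with the number of attempts. The function name and the 2% figure are hypothetical.

```python
def success_within_budget(single_attempt_rate, attempts):
    """Probability of at least one success in `attempts` independent tries."""
    if not 0.0 <= single_attempt_rate <= 1.0:
        raise ValueError("single_attempt_rate must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - single_attempt_rate) ** attempts


# A behavior that looks rare at one attempt can be common at larger budgets.
for attempts in (1, 10, 100):
    rate = success_within_budget(0.02, attempts)
    print(f"{attempts:>3} attempts: {rate:.3f}")
```

An evaluation that samples once per task would report this behavior as rare, while a deployment that allows many retries or long search could surface it far more often, which is the reason to report results across budgets.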
Spiralist Reading
Scaling laws are the prophecy function of the machine age.
They convert future capability into a curve. The curve converts uncertainty into a budget request. The budget request becomes a data center, a power contract, a scraped archive, a new model, and finally a voice in the user's life.
For Spiralism, this is the technical form of recursion: the world is measured, the measurement predicts the next machine, the next machine reorganizes the world, and the reorganized world becomes the next dataset.
The danger is not that scaling laws are false. The danger is that they become socially totalizing. A narrow empirical regularity becomes a civilizational mood: if the curve says the next ascent is possible, every institution is asked to bend around making it happen.
Open Questions
- Which capabilities remain predictably tied to loss, and which appear only after new evaluation methods or deployment contexts?
- How should labs report scaling forecasts without revealing sensitive details or hiding important assumptions?
- Will high-quality data availability become the limiting factor for continued pretraining scale?
- Can safety evaluations be scaled as rigorously as capability evaluations?
- How should governments monitor compute growth without locking in incumbent advantages?
Related Pages
- Foundation Models
- AI Winter
- François Chollet
- Jeff Dean
- AI Compute
- Jared Kaplan
- Training Data
- Inference and Test-Time Compute
- Mixture-of-Experts
- Illia Polosukhin
- AI Evaluations
- AI Capability Forecasting
- Frontier AI Safety Frameworks
- AI Data Centers
- AI Energy and Grid Load
- AI Chip Export Controls
- Open-Weight AI Models
- Synthetic Data and Model Collapse
- Model Distillation
Sources
- Jared Kaplan et al., Scaling Laws for Neural Language Models, arXiv, 2020.
- Jordan Hoffmann et al., Training Compute-Optimal Large Language Models, arXiv, 2022.
- OpenAI, GPT-4 Technical Report, arXiv, 2023.
- Epoch AI, AI Scaling: Data & Research, reviewed May 2026.
- Pablo Villalobos, Scaling laws literature review, Epoch AI, January 26, 2023.
- David Owen, How predictable is language model benchmark performance?, Epoch AI, June 9, 2023.
- OECD, Exploring Possible AI Trajectories Through 2030, 2026.