When the Training Set Starts Eating Itself
Synthetic data is becoming a central AI input. The risk is not that generated data exists. The risk is that generated worlds start replacing grounded records without provenance, renewal, or human correction.
The New Data Loop
Generative AI has made a strange new supply chain ordinary: one model produces examples, another model learns from them, a reward model filters them, a smaller model imitates them, a benchmark absorbs them, a user publishes them, and the next crawl treats them as part of the world.
This is not a fringe practice. Synthetic data now appears across model development because the old data frontier is tightening. Stanford's 2024 AI Index summarized Epoch projections that high-quality language data could be exhausted around 2024, low-quality language data within two decades, and image data later, depending on projection method. The exact dates are uncertain, but the pressure is real: frontier AI has already consumed large shares of the public web, books, code, images, papers, forums, and commercial datasets that can be scraped, licensed, or purchased.
Synthetic data looks like an answer to that scarcity. If a model can generate a million math problems, simulated conversations, safety examples, tool-use traces, images, labels, critiques, or edge cases, then the training stack no longer waits for the world to produce enough examples. The machine can manufacture curriculum for the next machine.
The phrase sounds cleaner than the practice. "Synthetic data" can mean privacy-preserving records, simulator output, model-generated instruction examples, generated preference data, synthetic images, proof traces, role-played conversations, code tasks, benchmark variants, and adversarial prompts. Some of these are useful. Some are risky. The governance problem begins when they are all treated as the same substance: data.
Why Synthetic Data Is Useful
The case for synthetic data is strong enough that dismissing it would be unserious.
Google DeepMind's AlphaGeometry is the cleanest example. In January 2024, DeepMind described a system combining a neural language model with a symbolic deduction engine. The team generated 100 million synthetic geometry examples and trained without human demonstrations. On a benchmark of 30 Olympiad geometry problems, AlphaGeometry solved 25 within competition time limits, close to the reported average of 25.9 for human gold medalists on that set.
That success works because the domain has structure. Geometry proofs can be generated, checked, constrained, and paired with a symbolic system. The synthetic examples are not free-floating vibes. They are produced inside a formal environment where wrong steps can often be rejected.
NVIDIA's Nemotron-4 340B release shows a different use case. NVIDIA positioned the model family as especially useful for generating synthetic data to train smaller language models, and reported that more than 98 percent of the data used in its model alignment process was synthetic. Microsoft Research's Phi-3 technical report also foregrounded dataset quality, describing Phi-3 training as based on heavily filtered web data and synthetic data.
These examples show why the synthetic-data turn is not simply decay. Synthetic data can compress expensive expertise, multiply rare examples, protect sensitive records, create controllable tests, improve small models, and generate curriculum in domains where verification is possible. It is not inherently fake in the sense that matters. A generated theorem proof that checks is different from a generated rumor. A simulated sensor edge case is different from a fabricated medical record. A model-written instruction example is different from a human account of workplace discrimination.
The danger is not synthesis. The danger is ungoverned substitution.
What Model Collapse Means
The strongest warning comes from research on recursive training.
In the 2023 preprint "The Curse of Recursion: Training on Generated Data Makes Models Forget," later published in Nature as "AI models collapse when trained on recursively generated data," Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal examined what happens when generative models are trained across generations on model-generated content. Their central finding is not that one synthetic example poisons a model. It is that recursively training on generated data can create irreversible defects: low-probability parts of the original distribution disappear, and later models lose information about the real distribution they were supposed to learn.
The paper tested the phenomenon across model types, including language models, variational autoencoders, and Gaussian mixture models. The mechanism is intuitive. A model approximates a distribution. Its samples are not the distribution itself. If the next generation learns from those samples as if they were the world, approximation errors, sampling errors, and missing rare events can become inherited reality. Repeat the loop, and the dataset becomes less like the world and more like the previous model's idea of the world.
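The loop described above can be sketched with a toy model. This is a minimal illustration, not the experimental setup of the cited paper: the "model" here is just a table of category frequencies, the "world" is a hypothetical distribution with five common categories and fifty rare ones, and the names next_generation and run_loop are invented for the example. Each generation draws a finite sample from the previous model and refits by counting. A category that draws zero samples gets zero probability and can never return.

```python
import random
from collections import Counter


def next_generation(weights, sample_size, rng):
    """Draw a finite sample from the current model, then refit by counting."""
    categories = list(weights)
    draws = rng.choices(
        categories, weights=[weights[c] for c in categories], k=sample_size
    )
    counts = Counter(draws)
    # The refit model only knows categories that appeared in the sample.
    return {c: counts[c] / sample_size for c in counts}


def run_loop(true_weights, sample_size, generations, seed=0):
    """Return the number of surviving categories after each generation."""
    rng = random.Random(seed)
    weights = dict(true_weights)
    support_sizes = [len(weights)]
    for _ in range(generations):
        weights = next_generation(weights, sample_size, rng)
        support_sizes.append(len(weights))
    return support_sizes


# Hypothetical "world": five common categories and a long tail of rare ones.
world = {f"common_{i}": 0.18 for i in range(5)}
world.update({f"rare_{i}": 0.002 for i in range(50)})

sizes = run_loop(world, sample_size=200, generations=20)
print(sizes)  # support size per generation; it can only stay flat or shrink
```

The sample size and generation count are arbitrary. The point of the sketch is structural: with no fresh draws from the original distribution, the number of surviving categories never goes back up.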
A related 2023 paper, "Self-Consuming Generative Models Go MAD," studied autophagous loops in image generation. Its conclusion was similarly practical: without enough fresh real data in each generation, future generative models tend to lose quality, diversity, or both.
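One way to act on that finding is to fix the share of real examples in every training mix rather than let it drift. The sketch below is an assumption-laden illustration: build_training_mix, the pool names, and the 25 percent share are all hypothetical, and neither paper prescribes a specific ratio. Each item keeps its origin tag so the real and synthetic portions stay distinguishable after shuffling.

```python
import random


def build_training_mix(real_pool, synthetic_pool, total, real_fraction, seed=0):
    """Sample a training set with a fixed share of real examples.

    Returns (origin, example) pairs so the origin stays attached after mixing.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be between 0 and 1")
    n_real = round(total * real_fraction)
    n_synth = total - n_real
    if n_real > len(real_pool) or n_synth > len(synthetic_pool):
        raise ValueError("not enough examples in one of the pools")
    rng = random.Random(seed)
    mix = [("real", x) for x in rng.sample(real_pool, n_real)]
    mix += [("synthetic", x) for x in rng.sample(synthetic_pool, n_synth)]
    rng.shuffle(mix)
    return mix


# Hypothetical pools; in practice these would be documented datasets.
real_pool = [f"human_{i}" for i in range(100)]
synthetic_pool = [f"generated_{i}" for i in range(500)]

mix = build_training_mix(real_pool, synthetic_pool, total=200, real_fraction=0.25)
n_real = sum(1 for origin, _ in mix if origin == "real")
print(n_real, len(mix))  # 50 200
```

The function refuses to build a mix when the real pool is too small, which makes a shortage of grounded data a visible error instead of a silent substitution.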
This matters because the internet is becoming a mixed corpus. Human-origin text, images, code, videos, product pages, forum posts, comments, and reviews now sit beside generated material, lightly edited generated material, generated SEO pages, model-written help articles, synthetic images, generated spam, auto-translated sludge, and model-assisted social output. Future training runs will not encounter a clean boundary between human record and machine residue unless someone builds that boundary.
The Tail Disappears First
The most important phrase in the model-collapse literature is not "nonsense." It is the disappearance of the tail.
In a distribution, the tail contains rare events: uncommon dialects, minority languages, unusual medical presentations, local customs, edge-case code, atypical legal facts, obscure crafts, unpopular political positions, small subcultures, odd artistic forms, rare errors, and ways of living that do not appear often enough to dominate a scrape. A model trained on the public record may already underrepresent these cases. A model trained on another model's smoothed version of the public record can underrepresent them further.
This is why collapse is not only a technical failure. It is a cultural failure mode. The mean gets louder. The rare case gets quieter. The generated answer becomes more polished and less strange. The world becomes easier for the model to imitate because the model has helped erase the parts it did not represent well.
That pattern connects to the site's earlier essay "The Benchmark Becomes the Curriculum." When benchmarks drive training, models learn the shape of the test. When generated data drives training, models learn the shape of prior models. In both cases, measurement or synthesis can stop serving reality and start replacing it.
The loss is not always visible to the average user. A collapsed model may still sound fluent. It may still answer common prompts. It may still pass familiar tests. The damage appears first in diversity, novelty, edge cases, minority contexts, and the ability to surprise correctly.
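A small arithmetic sketch shows how a healthy average can hide a damaged tail. The evaluation log below is invented for illustration, and the slice names and helper functions are hypothetical: 95 common prompts and 5 rare-dialect prompts, scored as correct or incorrect.

```python
from collections import defaultdict


def accuracy_by_slice(results):
    """results: iterable of (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in results:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


def overall_accuracy(results):
    """Plain average over every example, ignoring slices."""
    results = list(results)
    return sum(int(ok) for _, ok in results) / len(results)


# Hypothetical log: the model does well on common prompts, badly on rare ones.
results = [("common", True)] * 93 + [("common", False)] * 2
results += [("rare_dialect", True)] * 1 + [("rare_dialect", False)] * 4

print(overall_accuracy(results))                    # 0.94
print(accuracy_by_slice(results)["rare_dialect"])   # 0.2
```

The headline number is 94 percent while the rare slice sits at 20 percent. Because the rare slice is only five examples out of a hundred, it barely moves the average, which is why the damage stays invisible unless the slice is reported on its own.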
Recursive Reality as Training Policy
Model collapse is often described as a problem for AI companies: they need clean data to keep improving models. That is true, but too narrow.
Recursive training changes the politics of public knowledge. If model outputs flood the web, and the web trains models, then generated claims can become future priors. If generated product reviews, synthetic survey responses, auto-written local news, AI-written legal summaries, and model-produced educational pages enter the data stream, then the next model may learn from artifacts that only looked like public life.
This is the same danger named in "The Synthetic Respondent Becomes the Public": generated personas can be mistaken for opinion. At training scale, generated artifacts can be mistaken for culture.
The problem becomes worse when provenance disappears. A training example should not be treated as morally or epistemically identical across origin classes. A medical record, a peer-reviewed paper, a journalist's report, a court filing, a human forum post, a model-generated summary of that forum post, and a synthetic dialogue designed to train helpfulness all carry different claims on reality. They may all be useful, but they are not the same kind of evidence.
The training set is therefore an institution. It decides what the model will count as the world. Synthetic data turns that institution inward. The model no longer only consumes the archive; it manufactures archive-like material that future models may consume.
What Governance Should Require
A mature synthetic-data regime should not ban generated examples. It should make substitution visible and accountable.
First, training-data provenance should distinguish origin classes. Developers should track whether data came from direct human records, licensed corpora, public web crawls, simulations, model-generated examples, human-edited model outputs, generated preference labels, or benchmark material. The categories will never be perfect, but total opacity is a governance choice.
Second, synthetic data should have a purpose statement. Generated examples used for math, code, safety refusals, red-teaming, role play, rare-case augmentation, summarization, or alignment should not be blended under one label. The purpose determines the failure mode.
Third, verification should match the domain. Synthetic geometry problems can be checked differently from synthetic therapy dialogues. Synthetic code can be tested. Synthetic medical cases require clinical oversight. Synthetic civic opinion should not be treated as a substitute for public voice.
Fourth, fresh grounded data should remain part of the loop. The model-collapse papers are clearest here: real data renewal matters. That does not mean every human trace should be scraped. It means society needs lawful, consensual, well-documented ways to keep models connected to the world rather than only to prior models.
Fifth, evaluations should test tails, not only averages. If synthetic data narrows distributional coverage, ordinary benchmark averages may hide the damage. Evaluations should probe rare languages, minority dialects, low-resource domains, unusual fact patterns, disability contexts, local knowledge, and adversarially similar generated sludge.
Sixth, public datasets need contamination labels. Archives, publishers, repositories, schools, governments, and platforms should record when content is AI-generated or AI-assisted where feasible. Provenance is imperfect, as argued in "The Provenance Layer Is Not a Truth Machine," but imperfect provenance is still better than pretending origin no longer matters.
Seventh, synthetic public evidence should be presumptively weak. Generated survey respondents, generated testimonials, generated comments, generated reviews, generated case narratives, and generated stakeholder submissions should not be allowed to impersonate public participation. Synthetic data can train a model. It should not counterfeit a constituency.
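The first three requirements can be sketched as a record that travels with each training example. This is one possible shape, not an established schema: the class names, field names, and the decision to count human-edited model output as synthetic are assumptions made for illustration. The origin classes are taken from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    HUMAN_RECORD = "direct human record"
    LICENSED_CORPUS = "licensed corpus"
    PUBLIC_WEB_CRAWL = "public web crawl"
    SIMULATION = "simulation"
    MODEL_GENERATED = "model-generated example"
    HUMAN_EDITED_MODEL_OUTPUT = "human-edited model output"
    GENERATED_PREFERENCE_LABEL = "generated preference label"
    BENCHMARK_MATERIAL = "benchmark material"


# Assumption: these origin classes are treated as synthetic for auditing.
SYNTHETIC_ORIGINS = {
    Origin.SIMULATION,
    Origin.MODEL_GENERATED,
    Origin.HUMAN_EDITED_MODEL_OUTPUT,
    Origin.GENERATED_PREFERENCE_LABEL,
}


@dataclass(frozen=True)
class ProvenanceRecord:
    example_id: str
    origin: Origin
    purpose: str = ""        # e.g. "math", "red-teaming", "rare-case augmentation"
    verification: str = ""   # e.g. "unit tests", "symbolic checker", "clinical review"

    def problems(self):
        """Return a list of governance gaps for this record (empty if none)."""
        issues = []
        if self.origin in SYNTHETIC_ORIGINS:
            if not self.purpose.strip():
                issues.append("synthetic example has no purpose statement")
            if not self.verification.strip():
                issues.append("synthetic example has no verification path")
        return issues


documented = ProvenanceRecord("ex-1", Origin.MODEL_GENERATED, "code", "unit tests")
undocumented = ProvenanceRecord("ex-2", Origin.MODEL_GENERATED)

print(documented.problems())    # []
print(undocumented.problems())  # two gaps: purpose and verification
```

A check like this does not decide whether a synthetic example is good. It only makes the missing purpose statement or verification path visible before the example is blended into a corpus.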
The Spiralist Reading
The training set eating itself is recursive reality made technical.
First the model learns from the world. Then the model speaks into the world. Then the world's record includes the model's speech. Then the next model learns from that record. At each turn, representation becomes environment. The map is not only mistaken for the territory; the map is shredded, composted, and returned as soil.
The proper response is not nostalgia for a pure human archive. Such an archive never existed. Human records have always been biased, partial, censored, commercial, repetitive, and full of error. The response is source discipline. A generated artifact can be useful if its origin, purpose, verification path, and limits remain attached. It becomes dangerous when it floats loose and returns as reality.
This is also a labor question. Human records become more valuable when machine records multiply. Testimony, local reporting, field observation, expert annotation, community archives, maintained datasets, and careful institutional memory are not quaint. They are the grounding layer. A society that lets that layer decay while filling the surface with generated content will not simply degrade its models. It will degrade its shared sense of what has happened.
The rule is simple enough to state: synthetic data should extend contact with reality, not replace it. It should generate cases that can be checked, fill gaps that can be named, and support learning that remains answerable to the world. Once generated data becomes an unmarked substitute for human record, the machine is no longer learning from civilization. It is learning from its own reflection and asking civilization to adapt.
Sources
- Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal, "AI models collapse when trained on recursively generated data," Nature, July 24, 2024.
- Ilia Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget," arXiv, May 2023.
- Sina Alemohammad et al., "Self-Consuming Generative Models Go MAD," arXiv, July 2023.
- Google DeepMind, "AlphaGeometry: An Olympiad-level AI system for geometry," January 17, 2024.
- NVIDIA Research, "Nemotron-4 340B," June 14, 2024.
- Microsoft Research, "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone," April 23, 2024.
- Stanford Institute for Human-Centered AI, "Artificial Intelligence Index Report 2024," especially Chapter 1 discussion of data exhaustion and synthetic data.
- Church of Spiralism Wiki, "Synthetic Data and Model Collapse," "Content Provenance and Watermarking," and "AI Evaluations."