When the Training Set Starts Eating Itself
Synthetic data is becoming a central AI input. The risk is not that generated data exists. The risk is that generated worlds start replacing grounded records without provenance, renewal, verification, or human correction.
For this essay, the training set eats itself when model outputs, synthetic benchmarks, generated labels, simulated people, or generated environments re-enter training and evaluation as if they were ordinary records of the world. The governance problem is recursive substitution: generated evidence slowly displaces the outside anchors that made the model useful in the first place.
The governed object is the data loop: origin, generation, filtering, mixture, evaluation, deployment feedback, downstream reuse, and the record that proves those steps stayed separate where separation matters.
The New Data Loop
Generative AI has made a strange new supply chain ordinary: one model produces examples, another model learns from them, a reward model filters them, a smaller model imitates them, a benchmark absorbs them, a user publishes them, and the next crawl treats them as part of the world.
This is not a fringe practice. Synthetic data now appears across model development because the old data frontier is tightening. Epoch AI estimated in 2024 that language-model developers could fully use the effective stock of quality-adjusted, human-generated public text between 2026 and 2032 if trends continue, or earlier under intense overtraining. The exact dates remain uncertain, but the pressure is real: frontier model development has already relied on large public-web, book, code, image, paper, forum, and commercial datasets that can be scraped, licensed, or purchased.
Synthetic data looks like an answer to that scarcity. If a model can generate a million math problems, simulated conversations, safety examples, tool-use traces, images, labels, critiques, or edge cases, then the training stack no longer waits for the world to produce enough examples. The machine can manufacture curriculum for the next machine.
The phrase "synthetic data" sounds clean. In this essay, synthetic data means data generated by an algorithm rather than directly collected from an event, measurement, document, person, sensor, or existing artifact. It can include privacy-preserving tabular records, simulator output, model-generated instruction examples, generated preference data, synthetic images, proof traces, role-played conversations, code tasks, benchmark variants, and adversarial prompts. Some of these are useful. Some are risky. The governance problem begins when they are all treated as the same substance: data.
OECD's 2025 taxonomy of AI-training data collection makes the same point from a policy angle: data sourcing mechanisms have different implications for AI developers, data subjects, and rights holders. Synthetic data is therefore not one bucket. It is a set of origin classes, each with a different claim on reality.
Current Context
As of June 25, 2026, synthetic data has moved from a technical workaround into a normal part of model development, evaluation, and platform governance. The important split is no longer "real data versus fake data." It is the difference between verified synthetic curriculum, simulation, model-generated alignment data, privacy-motivated synthetic records, generated public content, and unlabeled machine residue scraped back into future datasets.
The research picture is also more precise than the slogan. Shumailov and coauthors showed that indiscriminate recursive training on generated data can make models forget the true underlying distribution. Alemohammad and coauthors showed image-generation loops losing quality or diversity without enough fresh real data. Gerstgrasser and coauthors found that accumulating synthetic data alongside the original real data can avoid collapse in their experiments. The 2026 Physical Review Letters closed-loop result found absorbing-state behavior in exponential-family models under self-generated retraining, while also identifying outside ground-truth points, priors, and regularization as ways to prevent collapse in that setting. The governance target is therefore not "synthetic data bad." It is replacement without anchors, labels, or mixture control.
The policy layer has tightened since the early model-collapse papers. The EU AI Act's Article 53 requires providers of general-purpose AI models to maintain technical documentation and publish a summary of training content, while Article 10 imposes data-governance duties for high-risk AI systems. NIST's synthetic-content report treats provenance, labeling, watermarking, detection, testing, and auditing as complementary transparency controls. The 2025 joint AI Data Security guidance from NSA, CISA, FBI, and international partners puts reliable sourcing and data provenance at the front of the AI data-security lifecycle. None of these sources is a complete synthetic-data rulebook, but together they mark the current floor: origin, transformation, mixture, and security are now governance evidence.
Why Synthetic Data Is Useful
The case for synthetic data is strong enough that dismissing it would be unserious.
Google DeepMind's AlphaGeometry is the clean example. In January 2024, DeepMind described a system combining a neural language model with a symbolic deduction engine. The team generated 100 million synthetic geometry examples and trained without human demonstrations. On a benchmark of 30 Olympiad geometry problems, AlphaGeometry solved 25 within competition time limits, close to the reported average for human gold medalists on that set.
That success works because the domain has structure. Geometry proofs can be generated, checked, constrained, and paired with a symbolic system. The synthetic examples are not loose imitations. They are produced inside a formal environment where wrong steps can often be rejected.
NVIDIA's Nemotron-4 340B release shows a different use case. NVIDIA positioned the model family as especially useful for generating synthetic data to train smaller language models, and reported that more than 98 percent of the data used in its model alignment process was synthetic. Microsoft Research's Phi-3 technical report also foregrounded dataset quality, describing Phi-3 training as based on heavily filtered web data and synthetic data.
These examples show why the synthetic-data turn is not simply decay. Synthetic data can compress expensive expertise, multiply rare examples, protect sensitive records, create controllable tests, improve small models, and generate curriculum in domains where verification is possible. It is not inherently fake in the sense that matters. A generated theorem proof that checks is different from a generated rumor. A simulated sensor edge case is different from a fabricated medical record. A model-written instruction example is different from a human account of workplace discrimination.
The useful pattern is not "synthetic is safe." It is narrower: synthetic examples are strongest when their purpose is explicit, their generator is recorded, their quality can be tested, and their outputs supplement rather than silently replace grounded records.
The danger is not synthesis. The danger is ungoverned substitution.
What Model Collapse Means
The strongest warning comes from research on recursive training.
In the 2023 preprint "The Curse of Recursion," later published in Nature as "AI models collapse when trained on recursively generated data," Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal examined what happens when generative models are trained across generations on model-generated content. Their central finding is not that one synthetic example poisons a model. It is that recursively training on generated data can create irreversible defects: low-probability parts of the original distribution disappear, and later models lose information about the real distribution they were supposed to learn.
The paper tested the phenomenon across model types, including language models, variational autoencoders, and Gaussian mixture models. The mechanism is intuitive. A model approximates a distribution. Its samples are not the distribution itself. If the next generation learns from those samples as if they were the world, approximation errors, sampling errors, and missing rare events can become inherited reality. Repeat the loop, and the dataset becomes less like the world and more like the previous model's idea of the world.
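The mechanism can be sketched with a toy loop. The example below is an illustration under simplifying assumptions, not a reproduction of the paper's experiments: the "world" is a single Gaussian, each generation is a maximum-likelihood refit on samples drawn only from the previous fit, and the function name, sample size, and generation count are arbitrary choices made here.

```python
import random
import statistics


def refit_on_own_samples(generations, sample_size, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation after each generation, with the
    grounded distribution N(0, 1) recorded at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the distribution we call "the world"
    history = [sigma]
    for _ in range(generations):
        # The next dataset consists only of samples from the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # Maximum-likelihood refit: sample mean and population std-dev.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


history = refit_on_own_samples(generations=500, sample_size=10)
print(f"std at generation 0:   {history[0]:.3f}")
print(f"std at generation 500: {history[-1]:.3e}")
```

In this setup the fitted spread drifts toward zero: each refit slightly underestimates the variance on average, and nothing outside the loop corrects it. The small sample size makes the effect fast enough to see; larger samples slow the drift but do not remove it in this toy model.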
A related 2023 paper, "Self-Consuming Generative Models Go MAD," studied autophagous loops in image generation. Its conclusion was similarly practical: without enough fresh real data in each generation, future generative models tend to lose quality, diversity, or both.
The caution should be stated precisely. Model collapse is not an automatic result of every synthetic example. Gerstgrasser and coauthors' 2024 preprint found that replacement of real data by synthetic generations tends toward collapse, while accumulating synthetic data alongside the original real data avoided collapse in their experiments. The risk is recursive replacement, weak provenance, and poor mixture control, not the mere existence of generated examples.
This matters because the internet is becoming a mixed corpus. Human-origin text, images, code, videos, product pages, forum posts, comments, and reviews now sit beside generated material, lightly edited generated material, generated SEO pages, model-written help articles, synthetic images, generated spam, auto-translated sludge, and model-assisted social output. Future training runs will not encounter a clean boundary between human record and machine residue unless someone builds that boundary.
The Outside Anchor
The newest useful research direction is not "never use generated data." It is "do not let generated data become a closed evidentiary loop." In 2026, Jangjoo, Di Sarra, Marsili, and Roudi studied closed-loop learning in exponential-family models. They found that maximum-likelihood retraining on self-generated data converges toward absorbing states that amplify initial bias, while the collapse can be prevented in that setting by adding at least one ground-truth data point at each iteration, using maximum a posteriori estimation, or adding regularization.
That result is mathematically narrow. It is not proof that one human example rescues frontier-model pretraining, and it does not remove the need for dataset quality, rights review, or tail evaluation. Its governance lesson is still valuable: a synthetic-data loop needs an outside anchor. The anchor may be a held-out human-origin corpus, a physical measurement, a formal verifier, an expert-reviewed gold set, fresh field data, a simulator validation protocol, or a legally sourced reference dataset. Whatever the anchor is, it should be separate from the generated pool it is evaluating.
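A toy comparison makes the anchor idea concrete. This is a sketch under the same caveat of narrowness: a one-dimensional Gaussian refit by maximum likelihood, with a hypothetical `anchors_per_gen` parameter controlling how many fresh draws from the true distribution join the self-generated pool at each step. It does not implement the paper's analysis, and the pool sizes are arbitrary.

```python
import random
import statistics


def retrain_loop(generations, synthetic_per_gen, anchors_per_gen, seed=0):
    """Refit a Gaussian each generation on self-generated samples plus
    `anchors_per_gen` fresh draws from the true distribution N(0, 1).

    Returns the fitted standard deviation after the last generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(synthetic_per_gen)]
        # Anchors come from outside the loop: they never depend on mu or sigma.
        anchors = [rng.gauss(0.0, 1.0) for _ in range(anchors_per_gen)]
        pool = synthetic + anchors
        mu = statistics.fmean(pool)
        sigma = statistics.pstdev(pool)
    return sigma


closed = retrain_loop(500, synthetic_per_gen=10, anchors_per_gen=0)
anchored = retrain_loop(500, synthetic_per_gen=10, anchors_per_gen=10)
print(f"closed loop std:   {closed:.3e}")
print(f"anchored loop std: {anchored:.3f}")
```

The closed loop loses its spread, while the anchored loop keeps a nonzero spread because the fresh draws reintroduce variation the model did not generate. The design point matches the text: the anchor is sampled independently of the generated pool it corrects.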
That makes mixture control an accountability object. The synthetic share, generator model, generator version, prompts or simulator, source dependencies, rejection filters, human review, ground-truth renewal cadence, and known exclusions should be recorded in a data sheet, linked into the deployed system's AI bill of materials, and surfaced through an AI register when the system affects public services, employment, education, health, law, benefits, credit, or safety.
The Synthetic-Data Ledger
A useful synthetic-data ledger is not a ban on generation and not a public dump of sensitive training material. It is a record of substitution. It answers the practical question a regulator, auditor, researcher, affected person, or incident reviewer will ask later: where did this example come from, what real-world anchor did it depend on, and what did it replace?
At minimum, the ledger should preserve dataset ID and version, origin class, generator model or simulator version, prompts or generation configuration, source dependencies, rights or consent basis, synthetic ratio, intended use, real-data anchor, validation method, rejection and filtering rules, privacy and leakage tests, tail-coverage tests, benchmark-decontamination checks, human review role, downstream systems, retention rule, and rollback path. That is the dataset version of an audit trail.
Not every field belongs in a public notice. Security-sensitive prompts, private records, personal data, licensed source details, or exploit examples may need restricted inspection. But redaction should not erase accountability. Public summaries can disclose categories, ratios, intended uses, governance owner, and validation approach, while confidential annexes preserve the evidence needed for audit, incident response, deletion, or rights review.
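One way to picture such a ledger is as an immutable record with a restricted full view and a narrower public view. The sketch below is illustrative only: the class name, field names, and example values are assumptions made here, and which fields belong in the public summary is a policy choice rather than something this code settles.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class LedgerEntry:
    """One dataset version in a hypothetical synthetic-data ledger."""
    dataset_id: str
    version: str
    origin_class: str
    generator_version: str
    generation_config: str
    source_dependencies: Tuple[str, ...]
    rights_basis: str
    synthetic_ratio: float
    intended_use: str
    real_data_anchor: str
    validation_method: str
    filtering_rules: str
    privacy_leakage_tests: str
    tail_coverage_tests: str
    decontamination_checks: str
    human_review_role: str
    downstream_systems: Tuple[str, ...]
    retention_rule: str
    rollback_path: str
    governance_owner: str

    def __post_init__(self):
        if not 0.0 <= self.synthetic_ratio <= 1.0:
            raise ValueError("synthetic_ratio must be between 0 and 1")

    def public_summary(self):
        """Disclosable fields; everything else stays in the restricted annex."""
        return {
            "dataset_id": self.dataset_id,
            "version": self.version,
            "origin_class": self.origin_class,
            "synthetic_ratio": self.synthetic_ratio,
            "intended_use": self.intended_use,
            "governance_owner": self.governance_owner,
            "validation_method": self.validation_method,
        }


# Hypothetical entry; every value below is invented for illustration.
entry = LedgerEntry(
    dataset_id="math-curriculum", version="1",
    origin_class="model_generated", generator_version="generator-x 2.1",
    generation_config="prompt set A, temperature 0.7",
    source_dependencies=("licensed-textbook-corpus",),
    rights_basis="license", synthetic_ratio=0.4,
    intended_use="fine-tuning", real_data_anchor="expert-reviewed gold set",
    validation_method="answers checked by a solver",
    filtering_rules="drop unverified answers",
    privacy_leakage_tests="not applicable", tail_coverage_tests="topic audit",
    decontamination_checks="hash match against benchmarks",
    human_review_role="sample review by a subject expert",
    downstream_systems=("tutor-model",), retention_rule="delete after 3 years",
    rollback_path="restore version 0", governance_owner="data governance team",
)

# Frozen records are never edited in place; a change creates a new version.
next_version = replace(entry, version="2", synthetic_ratio=0.55)
print(next_version.public_summary())
```

Freezing the record and issuing a new version for every change keeps the history of substitution intact, which is the property an audit trail depends on.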
The ledger also prevents category mistakes. A synthetic medical case used for clinician training is not the same evidence as a real adverse event. A generated comment is not public participation. A synthetic benchmark is not an independent test if it was generated from the same family of prompts used for training. The ledger keeps the relationship between data provenance, data minimization, data poisoning, and benchmark contamination visible instead of letting all of it dissolve into "more data."
The Tail Disappears First
The most important phrase in the model-collapse literature is not "nonsense." It is the disappearance of the tail.
In a distribution, the tail contains rare events: uncommon dialects, minority languages, unusual medical presentations, local customs, edge-case code, atypical legal facts, obscure crafts, unpopular political positions, small subcultures, odd artistic forms, rare errors, and ways of living that do not appear often enough to dominate a scrape. A model trained on the public record may already underrepresent these cases. A model trained on another model's smoothed version of the public record can underrepresent them further.
This is why collapse is not only a technical failure. It is a cultural failure mode. The mean gets louder. The rare case gets quieter. The generated answer becomes more polished and less strange. The world becomes easier for the model to imitate because the model has helped erase the parts it did not represent well.
That pattern connects to the earlier essay The Benchmark Becomes the Curriculum. When benchmarks drive training, models learn the shape of the test. When generated data drives training, models learn the shape of prior models. In both cases, measurement or synthesis can stop serving reality and start replacing it.
The loss is not always visible to the average user. A collapsed model may still sound fluent. It may still answer common prompts. It may still pass familiar tests. The damage appears first in diversity, novelty, edge cases, minority contexts, and the ability to surprise correctly.
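The tail effect can be illustrated with a categorical toy model. The sketch below assumes a dataset of labeled items that is repeatedly replaced by a same-sized resample from its own frequencies; the category names and counts are invented, and real training pipelines are far more complex than this. What it shows is structural: once a rare category draws zero samples, nothing in a closed loop can bring it back.

```python
import random
from collections import Counter


def resample_generations(counts, generations, seed=0):
    """Replace a categorical dataset with a resample of itself, repeatedly.

    Returns the number of distinct categories still present after each
    generation, with the original dataset at index 0.
    """
    rng = random.Random(seed)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    size = sum(weights)
    support = [sum(1 for w in weights if w > 0)]
    for _ in range(generations):
        # Draw the next dataset from the current empirical frequencies.
        drawn = Counter(rng.choices(categories, weights=weights, k=size))
        # A category with zero draws gets weight zero and can never return.
        weights = [drawn[c] for c in categories]
        support.append(sum(1 for w in weights if w > 0))
    return support


# Hypothetical corpus: one dominant category and twenty rare ones.
corpus = {"common": 900}
corpus.update({f"rare_{i}": 5 for i in range(20)})

support = resample_generations(corpus, generations=200)
print("categories at start:", support[0])
print("categories at end:  ", support[-1])
```

The count of surviving categories can only fall or stay flat, never rise, which is the toy version of the observation that the rare case gets quieter first.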
Recursive Reality as Training Policy
Model collapse is often described as a problem for AI companies: they need clean data to keep improving models. That is true, but too narrow.
Recursive training changes the politics of public knowledge. If model outputs flood the web, and the web trains models, then generated claims can become future priors. If generated product reviews, synthetic survey responses, auto-written local news, AI-written legal summaries, and model-produced educational pages enter the data stream, then the next model may learn from artifacts that only looked like public life.
This is the same danger named in The Synthetic Respondent Becomes the Public: generated personas can be mistaken for opinion. At training scale, generated artifacts can be mistaken for culture.
The problem becomes worse when provenance disappears. A training example should not be treated as morally or epistemically identical across origin classes. A medical record, a peer-reviewed paper, a journalist's report, a court filing, a human forum post, a model-generated summary of that forum post, and a synthetic dialogue designed to train helpfulness all carry different claims on reality. They may all be useful, but they are not the same kind of evidence.
The training set is therefore an institution. It decides what the model will count as the world. Synthetic data turns that institution inward. The model no longer only consumes the archive; it manufactures archive-like material that future models may consume. That places this essay beside Training Data, AI Data Provenance, and The Paper Mill Becomes the Literature: in all three cases, the question is whether evidence survives being converted into machine input.
Failure Modes
The first failure mode is replacement drift. Generated examples begin as augmentation, but over time they replace human-origin records, field measurements, fresh observations, or expert annotations because they are cheaper and easier to scale.
The second is provenance flattening. A simulator trace, generated safety refusal, human interview, public-web page, licensed book, benchmark item, and synthetic dialogue become rows in one dataset with no durable origin class.
The third is tail erasure. Rare languages, local knowledge, disability contexts, minority dialects, unusual medical cases, edge-case code, and nonstandard cultural forms vanish first because generated data tends to reproduce the generator's center of mass.
The fourth is benchmark cannibalism. Models generate benchmark variants, benchmark-style training items, or synthetic evaluation sets that reward the style of the generator rather than the target capability. The test becomes curriculum again, only now with a model in the middle.
The fifth is public-evidence substitution. Synthetic respondents, generated comments, generated reviews, generated testimonials, or generated local-news pages are reused as evidence about people, markets, or communities that were never contacted.
The sixth is anchor decay. A system once validated against a human-origin or field-measured corpus keeps using old anchors while the world changes. The ledger says "grounded," but the ground is stale.
The seventh is security laundering. Poisoned, mislabeled, copyrighted, privacy-sensitive, or adversarial synthetic data passes through generation and filtering into a cleaner-looking dataset, making later rollback and rights review harder.
The eighth is synthetic-ratio drift. A small synthetic augmentation becomes the majority of a fine-tuning set, safety set, or evaluation set through repeated refreshes, with no material-change review.
The ninth is evaluation reuse. A generated training item, model-written critique, or synthetic red-team scenario later appears as independent evidence that the system is safe, capable, or compliant.
The tenth is derivative wash. Generation hides the licensing, consent, privacy, or community-context limits of source material, so a downstream buyer sees a clean dataset while the original constraints remain unresolved.
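Some of these failure modes can be checked mechanically. As a minimal sketch for the ninth, evaluation reuse, the code below flags evaluation items whose normalized text also appears in the training pool. The function names are hypothetical, and the check is deliberately narrow: it catches verbatim reuse after case and whitespace normalization, not paraphrases or items generated from the same prompt family.

```python
import hashlib
import re


def fingerprint(text):
    """Hash a case- and whitespace-normalized form of the text."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def reused_items(training_items, evaluation_items):
    """Return evaluation items whose fingerprint also occurs in training."""
    seen = {fingerprint(item) for item in training_items}
    return [item for item in evaluation_items if fingerprint(item) in seen]


# Hypothetical pools for illustration.
training_pool = [
    "Refuse requests for instructions to pick a lock.",
    "Explain why the sky is blue.",
]
evaluation_pool = [
    "refuse requests for   instructions to pick a lock.",
    "Summarize the attached contract clause.",
]

overlap = reused_items(training_pool, evaluation_pool)
print(f"{len(overlap)} evaluation item(s) also appear in training")
```

A clean result from a check like this is weak evidence of independence, since near-duplicates pass it; a hit is strong evidence that an evaluation item is not independent.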
What Governance Should Require
A mature synthetic-data regime should not ban generated examples. It should make substitution visible and accountable.
First, training-data provenance should distinguish origin classes. Developers should track whether data came from direct human records, licensed corpora, public web crawls, simulations, model-generated examples, human-edited model outputs, generated preference labels, or benchmark material. The categories will never be perfect, but total opacity is a governance choice.
Second, every synthetic dataset should have a purpose statement and a mixture record. Generated examples used for math, code, safety refusals, red-teaming, role play, rare-case augmentation, summarization, or alignment should not be blended under one label. The record should name the generator, version, prompts or simulator, filtering process, human review, source data dependencies, percentage of the final mixture, and known exclusions. The purpose determines the failure mode.
Third, verification should match the domain. Synthetic geometry problems can be checked differently from synthetic therapy dialogues. Synthetic code can be tested. Synthetic medical cases require clinical oversight. Synthetic civic opinion should not be treated as a substitute for public voice.
Fourth, fresh grounded data should remain part of the loop. The model-collapse papers are clearest here: real data renewal matters. That does not mean every human trace should be scraped. It means society needs lawful, consensual, well-documented ways to keep models connected to the world rather than only to prior models.
Fifth, evaluations should test tails, not only averages. If synthetic data narrows distributional coverage, ordinary benchmark averages may hide the damage. Evaluations should probe rare languages, minority dialects, low-resource domains, unusual fact patterns, disability contexts, local knowledge, and adversarially similar generated sludge.
Sixth, separate training, evaluation, and evidence. A generated item used to train a model should not later be counted as independent evaluation evidence. A generated scenario used for rehearsal should not become proof of safety without separate validation. A generated public comment should not become proof of public participation.
Seventh, public datasets need contamination labels. Archives, publishers, repositories, schools, governments, and platforms should record when content is AI-generated or AI-assisted where feasible. Provenance is imperfect, as argued in The Provenance Layer Is Not a Truth Machine, but imperfect provenance is still better than pretending origin no longer matters. NIST's synthetic-content report points in the same direction: provenance, labeling, watermarking, detection, testing, and auditing are complementary controls, not a single magic label.
Eighth, synthetic public evidence should be presumptively weak. Generated survey respondents, generated testimonials, generated comments, generated reviews, generated case narratives, and generated stakeholder submissions should not be allowed to impersonate public participation. Synthetic data can train a model. It should not counterfeit a constituency.
Ninth, secure the data supply chain. Provenance records should be protected against tampering, not merely written into a spreadsheet. The 2025 AI Data Security guidance from NSA, CISA, FBI, and international partners puts reliable sourcing and data provenance first because poisoned, mislabeled, or unverifiable data can compromise both training and operation.
Tenth, bind governance to deployment risk. The EU AI Act's Article 10 makes data collection processes, origin, preparation, assumptions, bias examination, and data gaps part of the legal requirements for high-risk AI systems. Article 53 separately makes training-content documentation and public summaries part of the general-purpose-model transparency regime. Even outside that jurisdiction, the principle is sound: synthetic-data governance should be stricter when outputs affect health, law, employment, education, benefits, finance, safety, or public participation.
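The fifth requirement, testing tails rather than averages, has a simple computational form. The sketch below assumes evaluation results arrive as pairs of a slice label and a correctness flag; the slice names and scores are invented for illustration, and the function names are not from any particular evaluation library.

```python
from collections import defaultdict


def accuracy_by_slice(results):
    """Compute accuracy per slice from (slice_name, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, seen]
    for slice_name, is_correct in results:
        totals[slice_name][0] += int(is_correct)
        totals[slice_name][1] += 1
    return {name: correct / seen for name, (correct, seen) in totals.items()}


def summarize(results):
    """Return overall accuracy, the worst slice, and that slice's accuracy."""
    results = list(results)
    overall = sum(int(ok) for _, ok in results) / len(results)
    per_slice = accuracy_by_slice(results)
    worst = min(per_slice, key=per_slice.get)
    return overall, worst, per_slice[worst]


# Hypothetical results: 90 common prompts (88 correct) and
# 10 low-resource-language prompts (3 correct).
results = (
    [("common", True)] * 88 + [("common", False)] * 2
    + [("low_resource_language", True)] * 3
    + [("low_resource_language", False)] * 7
)

overall, worst, worst_accuracy = summarize(results)
print(f"overall accuracy: {overall:.2f}")
print(f"worst slice: {worst} at {worst_accuracy:.2f}")
```

With these invented numbers the headline accuracy is 0.91 while the low-resource slice sits at 0.30. Reporting the worst slice next to the average is one way to keep tail damage from disappearing into the mean.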
In practice, this requires change management, not only a statement that "synthetic data was used." A generator change, synthetic-ratio change, evaluation-pool change, or replacement of the outside anchor should be treated as a material data change. The live record should say who approved it, what evidence changed, what downstream systems inherited it, and how rollback would work if the generated layer later proves contaminated.
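A change-management gate of this kind can be sketched as a comparison between the last approved data record and a proposed one. Everything named below is an assumption for illustration: the dictionary keys, the list of material fields, and the five-percentage-point ratio tolerance are placeholders that a real program would set by policy.

```python
# Fields whose change is always treated as material (hypothetical list).
MATERIAL_FIELDS = ("generator_version", "evaluation_pool", "outside_anchor")
# Hypothetical tolerance for synthetic-share movement before review is needed.
RATIO_TOLERANCE = 0.05


def material_changes(approved, proposed, ratio_tolerance=RATIO_TOLERANCE):
    """List the material differences between two data-record dictionaries."""
    changes = [
        name for name in MATERIAL_FIELDS if approved[name] != proposed[name]
    ]
    drift = abs(proposed["synthetic_ratio"] - approved["synthetic_ratio"])
    if drift > ratio_tolerance:
        changes.append("synthetic_ratio")
    return changes


# Hypothetical records for illustration.
approved_record = {
    "generator_version": "generator-x 2.1",
    "evaluation_pool": "eval-2026-01",
    "outside_anchor": "expert gold set v3",
    "synthetic_ratio": 0.20,
}
proposed_record = {
    "generator_version": "generator-x 3.0",
    "evaluation_pool": "eval-2026-01",
    "outside_anchor": "expert gold set v3",
    "synthetic_ratio": 0.35,
}

flagged = material_changes(approved_record, proposed_record)
if flagged:
    print("material data change, review required:", ", ".join(flagged))
```

Comparing against the last approved baseline rather than the previous refresh is a deliberate choice in this sketch: a series of small refreshes, each under the tolerance, would otherwise let the synthetic share drift without ever triggering review.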
What This Changes
The training set eating itself is recursive reality made technical.
First the model learns from the world. Then the model speaks into the world. Then the world's record includes the model's speech. Then the next model learns from that record. At each turn, representation becomes environment. The map is not only mistaken for the territory; the map is shredded, composted, and returned as soil.
The proper response is not nostalgia for a pure human archive. Such an archive never existed. Human records have always been biased, partial, censored, commercial, repetitive, and full of error. The response is source discipline. A generated artifact can be useful if its origin, purpose, verification path, and limits remain attached. It becomes dangerous when it floats loose and returns as reality.
This is also a labor question. Human records become more valuable when machine records multiply. Testimony, local reporting, field observation, expert annotation, community archives, maintained datasets, and careful institutional memory are not quaint. They are the grounding layer. A society that lets that layer decay while filling the surface with generated content will not simply degrade its models. It will degrade its shared sense of what has happened.
The rule is simple enough to state: synthetic data should extend contact with reality, not replace it. It should generate cases that can be checked, fill gaps that can be named, and support learning that remains answerable to the world. Once generated data becomes an unmarked substitute for human record, the machine is no longer learning from civilization. It is learning from its own reflection and asking civilization to adapt.
Related Pages
- Synthetic Data and Model Collapse
- Training Data
- AI Data Provenance
- Content Provenance and Watermarking
- AI Evaluations
- Benchmark Contamination
- Data Poisoning
- The Data Sheet Becomes the Supply Chain
- The AI Bill of Materials Becomes the Supply Chain Map
- The Synthetic Respondent Becomes the Public
- The Benchmark Becomes the Curriculum
- The AI Slop Farm Becomes the Knowledge Supply Chain
Source Discipline
The evidence here separates capability examples from risk evidence and governance sources. AlphaGeometry, Nemotron-4 340B, and Phi-3 are developer-authored examples of synthetic-data use; they show that synthetic examples can be useful, not that all synthetic data is reliable. Shumailov et al., Alemohammad et al., Gerstgrasser et al., and Jangjoo et al. are research evidence about recursive training, closed-loop learning, anchors, and mixture design; they should not be flattened into the slogan "synthetic data always causes collapse."
The 2026 closed-loop result should be read with particular care. It is a precise result about exponential-family models, and its supporting evidence comes from a class of models narrower than frontier LLMs. It supports an engineering and governance instinct, not a universal cure: keep external ground truth, priors, regularization, and verification inside the loop when generated examples become training material.
Epoch and OECD are context sources about data pressure and data-sourcing mechanisms. NIST, C2PA, the AI Data Security guidance, and the EU AI Act are governance and infrastructure sources. They support the article's practical standard: record origin, purpose, transformation, mixture, verification, security, and limits before generated data becomes evidence for the next system. Article 53 training-content summaries are transparency evidence, not a complete public audit of every training example.
Current-source claims were rechecked on June 25, 2026 against primary papers, official developer publications, standards documents, and regulator or public-agency materials where available. The article treats those sources as evidence of specific practices and obligations, not as a general claim that synthetic data is either safe or doomed.
Sources
- Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal, AI models collapse when trained on recursively generated data, Nature, July 24, 2024, reviewed June 25, 2026.
- Ilia Shumailov et al., The Curse of Recursion: Training on Generated Data Makes Models Forget, arXiv, May 2023, reviewed June 25, 2026.
- Sina Alemohammad et al., Self-Consuming Generative Models Go MAD, arXiv, July 2023, reviewed June 25, 2026.
- Matthias Gerstgrasser et al., Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data, arXiv, April 2024, reviewed June 25, 2026.
- Fariba Jangjoo, Giovanni Di Sarra, Matteo Marsili, and Yasser Roudi, Lost in Retraining: Closed-Loop Learning and Model Collapse in Exponential Families, Physical Review Letters 136(19), Article 197301, published May 14, 2026; arXiv preprint, reviewed June 25, 2026.
- Google DeepMind, AlphaGeometry: An Olympiad-level AI system for geometry, January 17, 2024, reviewed June 25, 2026.
- Trieu H. Trinh et al., Solving olympiad geometry without human demonstrations, Nature, January 2024, reviewed June 25, 2026.
- NVIDIA Research, Nemotron-4 340B, June 14, 2024, reviewed June 25, 2026.
- NVIDIA, Leverage the Latest Open Models for Synthetic Data Generation With NVIDIA Nemotron-4 340B, September 16, 2024, reviewed June 25, 2026.
- Microsoft Research, Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, April 23, 2024, reviewed June 25, 2026.
- Epoch AI, Will we run out of data? Limits of LLM scaling based on human-generated data, 2024, reviewed June 25, 2026.
- OECD, Mapping relevant data collection mechanisms for AI training, OECD Artificial Intelligence Papers No. 48, 2025, reviewed June 25, 2026.
- NIST, Reducing Risks Posed by Synthetic Content: An Overview of Technical Approaches to Digital Content Transparency, NIST AI 100-4, November 20, 2024, updated April 8, 2026, reviewed June 25, 2026.
- NIST, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, July 26, 2024, reviewed June 25, 2026.
- NSA, CISA, FBI, and international partners, AI Data Security: Best Practices for Securing Data Used to Train and Operate AI Systems, May 2025, reviewed June 25, 2026.
- European Commission AI Act Service Desk, Article 10: Data and data governance and Article 53: Obligations for providers of general-purpose AI models, Regulation (EU) 2024/1689, reviewed June 25, 2026.
- European Commission, Explanatory Notice and Template for the Public Summary of Training Content for general-purpose AI models, reviewed June 25, 2026.
- C2PA, Content Credentials: C2PA Technical Specification 2.4, April 2026, reviewed June 25, 2026.