As they say, Garbage In, Garbage Out. As synthetic-data consumption outpaces human content creation, an information crisis could soon arrive.
A stark warning from The Register’s opinion column claims generative AI (GenAI) systems are “cannibalizing their own future” by training on synthetic data, with errors compounding so severely that outputs risk becoming “unrecognizable from reality”.
The opinion piece, published on 27 May, argues that “model collapse”, a phenomenon in which AI systems degrade after ingesting their own outputs, is accelerating as firms prioritize cost-cutting over data quality. The collapse stems from three compounding factors:
- Error accumulation: each model generation inherits and amplifies flaws from previous versions, causing outputs to drift away from the original data patterns.
- Loss of tail data: rare events are gradually erased from the training data until, eventually, entire concepts are blurred.
- Feedback loops: narrow patterns are reinforced, producing repetitive text or biased recommendations (a toy simulation of this dynamic appears below).
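The loss-of-tail-data dynamic is easy to reproduce in a toy setting. The following Python sketch (illustrative only, not from the column or the Nature study) repeatedly “trains” a naive model by estimating token frequencies from a sample, then generates the next generation’s training data purely from that model. Rare tokens that fail to appear in one generation can never reappear, so the vocabulary steadily shrinks. The vocabulary size, sample size, and Zipf-like distribution are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0: "human" data over a vocabulary of 1,000 tokens with a
# long-tailed (Zipf-like) frequency distribution.
vocab = 1_000
true_probs = 1.0 / np.arange(1, vocab + 1)
true_probs /= true_probs.sum()

sample_size = 5_000
data = rng.choice(vocab, size=sample_size, p=true_probs)

for generation in range(1, 21):
    # "Train" the next model: estimate token probabilities from current data.
    counts = np.bincount(data, minlength=vocab)
    est_probs = counts / counts.sum()

    # The next generation trains only on synthetic text sampled from that model.
    data = rng.choice(vocab, size=sample_size, p=est_probs)

    surviving = np.count_nonzero(np.bincount(data, minlength=vocab))
    print(f"gen {generation:2d}: {surviving} of {vocab} tokens still appear")
```

Run it and the count of surviving tokens falls generation after generation: once a rare token drops out of the synthetic corpus, no later model can recover it.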
The column cites a 2024 Nature study showing that AI models trained on their predecessors’ outputs develop “irreversible defects,” including error accumulation, loss of rare data patterns, and feedback loops that amplify biases.
The self-poisoning cycle of synthetic data
According to Steven J Vaughan-Nichols, OpenAI’s claim of generating 100bn words daily underscores the scale: with synthetic content flooding the web, future models will increasingly lack human-crafted training material.
Retrieval-augmented generation (RAG) systems, designed to ground AI in external data, offer no panacea. A Bloomberg study of 11 leading models found that RAG reduced hallucinations but introduced new risks such as private data leaks and biased financial advice. “It’s garbage in, garbage out — but the garbage now breeds exponentially,” the column warns.
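For readers unfamiliar with the technique, the sketch below outlines the basic RAG pattern: retrieve documents relevant to a query, then prepend them to the prompt so the model answers from supplied evidence rather than from memory alone. It is a minimal illustration under assumed names; the corpus, the keyword-overlap retriever, and the prompt format are hypothetical stand-ins, not any system discussed in the column or the Bloomberg study.

```python
from typing import List

# Toy corpus standing in for an external knowledge source (hypothetical data).
CORPUS = [
    "Acme Corp reported Q1 revenue of $12.4M in its SEC 10-Q filing.",
    "Model collapse describes degradation when models train on their own outputs.",
    "Retrieval-augmented generation grounds answers in retrieved documents.",
]

def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Rank documents by naive keyword overlap with the query (a real system
    would use vector embeddings or a search index)."""
    q_terms = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda doc: len(q_terms & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, docs: List[str]) -> str:
    """Prepend the retrieved evidence so the model answers from grounded text."""
    context = "\n".join(f"- {d}" for d in docs)
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}")

if __name__ == "__main__":
    question = "What revenue did Acme Corp report?"
    prompt = build_prompt(question, retrieve(question, CORPUS))
    print(prompt)  # This prompt would then be sent to a language model.
```

Even with this grounding step, the column’s point stands: if the retrieved corpus itself fills up with synthetic content, RAG grounds the model in garbage.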
No easy fixes for the “AI ice age”
Proposals to mix synthetic and human data face a harsh reality: as media outlets and academia increasingly rely on GenAI-generated output, high-quality human content is dwindling. The op-ed quotes AI firm Aquant’s blunt assessment: “When AI is trained on its own outputs, the results drift further from reality.”
With businesses prioritizing AI-driven “efficiency” over accuracy, the column predicts a tipping point where even “brain-dead CEOs” notice plummeting reliability. Early symptoms include hallucinated market data and reliance on content farms over SEC filings. As one exasperated researcher asked: “Where is new, expert human content supposed to come from?”