Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Model collapse is a degenerative phenomenon in machine learning where AI models trained on data generated by other AI models progressively lose quality, diversity, and accuracy over successive generations. The process creates a feedback loop in which each new generation of models amplifies the errors and biases of the previous one, ultimately producing homogenized, error-prone outputs disconnected from real-world data distributions. 1)
The phenomenon was formally identified and named in the landmark 2023 paper “The Curse of Recursion: Training on Generated Data Makes Models Forget” by Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, and colleagues, initially published as a preprint and later in Nature (2024). 2) The researchers demonstrated that training generative models — including language models, variational autoencoders (VAEs), and Gaussian mixture models — on their own outputs causes compounding information loss across generations.
The paper identified two distinct stages: early model collapse, in which the model begins to lose information about the tails of the distribution (rare, low-probability events), and late model collapse, in which the model converges to a narrow distribution with substantially reduced variance that bears little resemblance to the original data.
The core mechanism of model collapse is a recursive feedback loop. When AI-generated content is published to the internet and subsequently scraped for training data, new models are inadvertently trained on synthetic rather than human-generated data. Each successive generation inherits the errors and biases of the one before it, loses additional information about rare events, and drifts further from the original human data distribution.
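This fit-and-resample loop can be sketched numerically. The toy experiment below is an illustrative sketch, not the paper's exact setup: it repeatedly fits a Gaussian to a small sample of the previous generation's output and draws the next generation's "training set" from the fitted model. Over many generations the estimated spread of the data shrinks dramatically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" training data drawn from the true distribution N(0, 1).
data = rng.normal(0.0, 1.0, size=20)

stds = [data.std(ddof=1)]
for _ in range(500):
    # Fit a Gaussian to the previous generation's output (the "model")...
    mu, sigma = data.mean(), data.std(ddof=1)
    # ...then draw the next generation's training set from that model.
    data = rng.normal(mu, sigma, size=20)
    stds.append(data.std(ddof=1))

print(f"sample std, generation 0:   {stds[0]:.3f}")
print(f"sample std, generation 500: {stds[-1]:.3f}")  # typically far below 1
```

The small sample size (20 points per generation) is what drives the effect: each fit is a noisy estimate, and the noise compounds in only one direction, toward lower variance.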
The analogy commonly used is photocopying: each copy of a copy degrades the image until details are lost entirely. 6) In a concrete example, a dataset with 90% yellow objects and 10% blue objects produces a model that generates an even higher proportion of yellow. After several generations, blue objects disappear entirely from the output. 7)
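The yellow/blue example can be simulated directly. In the sketch below (the function name and parameters are illustrative), each generation's "model" is simply the proportion of blue objects observed in a finite sample drawn from the previous model; this is mathematically equivalent to genetic drift, and the minority colour is eventually lost.

```python
import numpy as np

def resample_generations(p_blue=0.1, n_samples=30, n_generations=500, seed=0):
    """Repeatedly fit a two-colour distribution to a finite sample of the
    previous generation's output, then generate the next generation from it."""
    rng = np.random.default_rng(seed)
    history = [p_blue]
    for _ in range(n_generations):
        blues = rng.binomial(n_samples, p_blue)  # finite "training sample"
        p_blue = blues / n_samples               # the new model's estimate
        history.append(p_blue)
    return history

history = resample_generations()
print(f"p(blue) at generation 0:   {history[0]:.2f}")
print(f"p(blue) at generation 500: {history[-1]:.2f}")  # usually 0.0: blue has vanished
```

Note that once the estimate hits 0.0 it can never recover: a model that has stopped producing blue objects provides no blue examples for its successor to learn from.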
Researchers have termed this process model autophagy disorder (MAD), or colloquially, AI cannibalism — a state where models consuming their own outputs produce increasingly homogenized results detached from reality. 8)
Model collapse makes data provenance — tracking the origins and history of training data — a critical concern. As AI-generated content floods the internet, distinguishing synthetic data from human-created data becomes increasingly difficult. 9)
Key challenges include distinguishing synthetic from human-generated content at scale, the lack of reliable labeling or watermarking of AI output, and preserving access to verifiably human-created datasets.
The Harvard Journal of Law and Technology has noted the legal implications, arguing for a “right to uncontaminated human-generated data” as a foundation for maintaining AI quality. 11)
Researchers have proposed several approaches to counter model collapse, including retaining human-generated data and mixing it into every round of training, accumulating data across generations rather than replacing it with synthetic output, watermarking AI-generated content so it can be filtered from future training corpora, and tracking data provenance so that dataset curators can verify sources.
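One such mitigation, mixing fresh human-generated data into each training round, can be illustrated with a toy fit-and-resample experiment. This is an illustrative sketch under assumed parameters, not the protocol of any specific paper: each generation fits a Gaussian to its training set, and that training set contains a configurable fraction of newly collected "human" data alongside the previous model's samples.

```python
import numpy as np

def train_generations(fresh_fraction, n=20, generations=500, seed=0):
    """Fit-and-resample loop where each generation's training set mixes
    model output with a fraction of fresh human data (here: N(0, 1))."""
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, size=n)
    for _ in range(generations):
        mu, sigma = data.mean(), data.std(ddof=1)
        n_fresh = int(n * fresh_fraction)
        fresh = rng.normal(0.0, 1.0, size=n_fresh)           # newly collected human data
        synthetic = rng.normal(mu, sigma, size=n - n_fresh)  # model output
        data = np.concatenate([fresh, synthetic])
    return data.std(ddof=1)

print(f"final std, all synthetic:  {train_generations(0.0):.3f}")  # collapses toward 0
print(f"final std, 50% fresh data: {train_generations(0.5):.3f}")  # typically stays near 1
```

The fresh human data acts as an anchor: because half of every training set is drawn from the true distribution, the fitted variance cannot drift arbitrarily far toward zero.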
Model collapse intersects with several related concepts in machine learning, including mode collapse in generative adversarial networks (GANs), catastrophic forgetting, overfitting, and data poisoning.