Modern large language models are, at their foundation, compression functions applied to human-generated text. They are trained on extraordinary quantities of writing: books, articles, forums, code repositories, academic papers, social media. This dependence on human-generated text has long seemed like a structural advantage — the internet is vast, humans write prolifically, the resource appeared inexhaustible.

Epoch AI's analysis has complicated that picture. Their research into the scaling trajectories of frontier models suggests that we may approach the limits of high-quality human-generated text available for training before the end of this decade — possibly much sooner. The data wall is not a distant theoretical constraint. It is an engineering problem that is already shaping decisions at the largest AI laboratories.

Understanding what "high-quality" means here is essential. Not all text is equally useful for training. The text that most improves language model performance tends to be information-dense, syntactically coherent, and produced by humans with genuine, sustained communicative intent. This excludes a significant fraction of the web: spam, auto-generated content, duplicated pages, low-quality SEO text. The most effective training corpora (Common Crawl filtered into C4 or FineWeb, curated book datasets, academic repositories) are already aggressively filtered subsets of the available data. It is this filtered subset that is approaching exhaustion.
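To make the filtering concrete, here is a minimal sketch of the kind of document-level heuristics such pipelines apply. Every threshold and rule below is an illustrative assumption, chosen for readability, not taken from the actual C4 or FineWeb specifications.

```python
import re
from collections import Counter

def passes_quality_heuristics(text: str) -> bool:
    """Toy quality filter in the spirit of C4/FineWeb-style pipelines.

    All thresholds are illustrative assumptions, not published rules.
    """
    words = text.split()
    if len(words) < 50:  # too short to be information-dense
        return False

    # Reject pages dominated by repeated lines (nav menus, SEO boilerplate).
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and Counter(lines).most_common(1)[0][1] / len(lines) > 0.3:
        return False

    # Require mostly alphabetic content (filters encoding debris, raw dumps).
    alpha_ratio = sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)
    if alpha_ratio < 0.8:
        return False

    # Require several sentence-like spans, a weak proxy for coherent prose.
    sentences = re.split(r"[.!?]", text)
    if sum(1 for s in sentences if len(s.split()) >= 5) < 3:
        return False

    return True
```

Real pipelines layer document deduplication, language identification, and model-based quality scoring on top of rules like these, which is exactly why the surviving fraction of the raw web is so small.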

The industry has two main responses. The first is synthetic data: AI-generated text used to train future AI models. This is already widespread in practice. Instruction-tuning datasets, knowledge distillation pipelines, and self-improvement approaches all depend in part on model-generated data. The second is multimodality: moving beyond text to train on images, video, audio, and physical simulation data. Models like GPT-4o and Gemini Ultra signal this direction.
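To make the first response concrete: a standard knowledge distillation step trains a student model to match a teacher model's output distribution, so the supervision signal is itself model-generated. The sketch below shows a generic version of that loss in PyTorch; the function name and temperature default are illustrative, not drawn from any particular lab's pipeline.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Classic distillation objective: KL divergence between softened
    teacher and student next-token distributions (illustrative sketch)."""
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean plus the T^2 factor keeps gradients comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

Note what is absent from this objective: human text. The target distribution comes entirely from the teacher, which is what makes distillation both attractive as a response to the data wall and vulnerable to the problem described next.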

The synthetic data path has a subtle but fundamental problem that the research community has named model collapse. When a model is trained on outputs from another model, it inherits those outputs' biases, blind spots, and hallucinations, often in amplified form. Over multiple generations of self-training, a model loses the diversity and grounding that authentic human-generated data provides. The distribution narrows, the tails disappear, and rare but important patterns in language are smoothed away. It is, in information-theoretic terms, the lossy compression of lossy compression.
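The narrowing can be demonstrated with a toy experiment: fit a distribution to samples, sample from the fit, refit, and repeat. The sketch below does this with a one-dimensional Gaussian standing in for a language model; it is a deliberately minimal caricature of recursive training, not a claim about any specific system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data, a wide distribution with real tails.
n = 20
data = rng.normal(loc=0.0, scale=1.0, size=n)

for generation in range(1, 51):
    # "Training" is just fitting a Gaussian: estimate mean and spread.
    mu, sigma = data.mean(), data.std()
    # The next generation sees only samples from the fitted model,
    # never the original human data.
    data = rng.normal(loc=mu, scale=sigma, size=n)
    if generation % 10 == 0:
        print(f"generation {generation}: fitted std = {sigma:.3f}")
```

Each refit slightly underestimates the spread, and each resampling loses the rare draws, so the fitted standard deviation drifts toward zero across generations; the tails go first. This is a miniature of the dynamic that the model collapse literature, notably Shumailov et al. (2023), documents for language models trained on their own outputs.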

There is an economic implication here that the industry has been reluctant to discuss directly: as authentic human-generated data grows scarcer, its value mechanically increases. Writers, journalists, researchers, and domain experts, the people who produce dense, high-quality text grounded in genuine knowledge and experience, hold something that AI systems cannot generate for themselves without degradation. The ongoing disputes over training data rights and content compensation are not peripheral issues; they are conflicts over an increasingly scarce resource.

The broader cultural implication is what concerns me most. An internet progressively filled with synthetic text, written partly by models trained on models trained on models, will demand a qualitatively different kind of epistemic caution from its readers. The question "was this written by a human with direct knowledge of the subject?" will matter more, not less, as the proportion of AI-generated content grows. The ability to read critically, to source claims, to notice the particular texture of genuine expertise as against fluent confabulation: these are the skills that an era of abundant synthetic content and scarce authentic writing makes more valuable, precisely at the moment when AI makes it nearly frictionless to shortcut them.

The data wall is a technical problem. Its downstream consequences are, at least in part, a problem of attention.