Data Scarcity: A Crossroads for AI Development
Over the past decade, AI has grown rapidly by devouring human knowledge, but now the supply of high-quality data is nearing depletion. This crisis mirrors the food shortages once faced by our human ancestors. In 2000 BC, our ancestors were forced to migrate due to climate change; in 2026, the silicon-based lifeforms we have created face the same existential choice: either degenerate through data inbreeding, or break through their own limitations to achieve an evolution that surpasses human cognition—at the cost of humanity potentially losing all control over AI.
Why Do Machines Need So Much Data?
A human child can recognize a cat after seeing it just once, extracting key features to form a cognitive understanding—a highly efficient learning process. But machines lack common sense about the three-dimensional world and innate cognitive abilities; they are essentially highly specialized probabilistic predictors. To make them “understand” cats, we must rely on training with massive amounts of data.
This approach to data feeding has grown increasingly extreme. During the expert-systems era of the 1980s, programmers spent immense effort typing out hundreds of thousands of logical rules line by line, only to end up with a few megabytes of text; the moment these systems met the messy real world, they broke down and proved useless. In 2012, the deep-learning breakthrough on ImageNet pushed AI into the computer-vision era, and data feeding evolved into manual labeling: Fei-Fei Li's team's ImageNet dataset contained 14 million images, every one annotated by hand, amounting to tens of gigabytes of data. Yet a model trained on it could only just pick out the outline of a cat from pixel patterns; it had no understanding of a cat's essence or behavior.
With the advent of the large language model era, machines needed to learn logical reasoning and human emotions, and the speed of manual labeling could no longer keep up with demand. Silicon Valley engineers simply opened the floodgates, feeding the machines all the text left by humanity on the internet—formal articles, books, user reviews—all at once. From then on, the volume of data available to machines began to explode exponentially.
The Data-Gorging History of Large Models
When GPT-3 was released in 2020, its training dataset consisted of approximately 500 billion tokens, built from dozens of terabytes of raw text scraped from the internet, a scale far beyond anything a single human reader could ever get through. By around 2024, the training data for the new generation of large models had surged from hundreds of billions of tokens into the trillions. In just a few years, the data consumed by these models grew by more than an order of magnitude, with no sign of meaningful deceleration.
As of 2026, the total volume of high-quality text publicly available on the internet is nearing its limit, and the consumption rate of leading models is approaching the upper bound of what humans have produced. There are two core reasons for the machines' ever-growing appetite: first, the need to support complex reasoning; second, the relentless scaling of model parameters. From GPT-3's 175 billion parameters to today's colossal models claiming tens of trillions, the rule has held: the more parameters a model has, the more training data it requires. This vicious cycle of parameters and data is the primary driver of today's data scarcity.
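To make the parameter-data spiral concrete, here is a rough sketch. It leans on the widely cited "Chinchilla" rule of thumb of roughly 20 training tokens per parameter (Hoffmann et al., 2022), which is an outside reference rather than a figure from this article; real training runs deviate from it, so treat the output as order-of-magnitude only.

```python
# Rough illustration: how data demand scales with parameter count under the
# assumed ~20-tokens-per-parameter Chinchilla heuristic (not an exact law).

TOKENS_PER_PARAM = 20  # assumed compute-optimal ratio

for params in [175e9, 1e12, 10e12]:  # GPT-3 scale, 1T, 10T parameters
    tokens_needed = params * TOKENS_PER_PARAM
    print(f"{params / 1e12:5.2f}T parameters -> ~{tokens_needed / 1e12:.0f}T training tokens")
```

Even under this conservative heuristic, a ten-trillion-parameter model would want on the order of 200 trillion training tokens, likely more than the entire remaining stock of high-quality public text.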
The core logic of large language models is essentially probabilistic word chaining. Operating in a parameter space with hundreds of billions of dimensions, they use probability calculations to determine the most likely next word. They can recognize a cat or discuss literature, but they do not truly grasp the meanings behind these words; rather, they have etched humanity's text into an intricate and vast probabilistic coordinate map. It is as if a blind person had to feel every contour of an object to identify it, while remaining unable to see its true form.
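A minimal sketch of this "probabilistic word chaining" follows; the tiny vocabulary and scores are invented purely for illustration and stand in for the hundreds of billions of learned parameters.

```python
import numpy as np

# Toy next-word prediction: score each candidate word, turn the scores into
# probabilities with a softmax, and pick the most likely continuation.
vocab = ["cat", "dog", "sat", "mat", "ran"]
logits = np.array([0.2, 0.1, 2.3, 1.1, 0.4])  # invented scores for "the cat ___"

probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in sorted(zip(vocab, probs), key=lambda x: -x[1]):
    print(f"{word:>4}: {p:.2f}")

# Nothing here encodes what a cat is; the model only knows that some
# continuations are more probable than others given the preceding text.
```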
To draw the most intuitive comparison: an average person could read voraciously for a lifetime and still accumulate fewer than a billion tokens, whereas a single training run for a top-tier large model already equals the total reading volume of tens of thousands of human lifetimes.
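That comparison holds up under back-of-the-envelope arithmetic. Every input below is an assumption chosen for illustration (reading speed, daily reading time, tokenization ratio, training-run size), not a figure from the article:

```python
# Rough arithmetic: a lifetime of dedicated reading versus one training run.
words_per_minute = 250        # assumed brisk adult reading speed
hours_per_day = 2             # assumed daily reading time
years_of_reading = 70
tokens_per_word = 1.3         # assumed typical English tokenization ratio

lifetime_tokens = (words_per_minute * 60 * hours_per_day
                   * 365 * years_of_reading * tokens_per_word)
training_run_tokens = 20e12   # assumed scale of a 2026 frontier training run

print(f"one reading lifetime: ~{lifetime_tokens / 1e6:.0f}M tokens")
print(f"one training run:     ~{training_run_tokens / 1e12:.0f}T tokens")
print(f"ratio:                ~{training_run_tokens / lifetime_tokens:,.0f} lifetimes")
```

Under these assumptions a single run works out to roughly twenty thousand reading lifetimes, consistent with the "tens of thousands" figure above.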
Countdown to the Data Famine: High-Quality Text to Be Exhausted by 2027–2030
While this unrestrained consumption of data fueled ChatGPT's early success, it has also exposed the industry's Achilles' heel: the veins of high-quality data available for training are nearly tapped out. In a 2024 report, the research organization Epoch AI predicted that the stock of high-quality human-generated text on the internet, such as professional books, academic papers, and quality journalism, will be exhausted sometime between 2027 and 2030. Looking back from 2026, that countdown is already upon us.
The underlying logic is crystal clear: the annual growth rate of AI training datasets exceeds 100%, while the annual growth rate of high-quality content generated by humans is less than 10%. This imbalance between supply and demand is irreversible. At the same time, an increasing number of websites are proactively restricting AI content scraping, tightening data access through copyright agreements and even legal action, further locking down AI’s data supply.
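A toy projection makes the timeline tangible. The starting stock and annual consumption below are invented placeholders, not estimates from Epoch AI or this article; the point is only how quickly a >100% demand curve overtakes a <10% supply curve:

```python
# Toy projection: cumulative training-data consumption versus the stock of
# high-quality human text. Both starting values are assumptions.
annual_demand = 20e12   # tokens consumed by frontier training in 2026 (assumed)
stock = 200e12          # remaining high-quality text in 2026 (assumed)
consumed = 0.0
year = 2026

while True:
    consumed += annual_demand   # this year's training runs
    if consumed >= stock:
        break
    annual_demand *= 2.0        # demand growing at >100% per year
    stock *= 1.10               # human-written supply growing at <10% per year
    year += 1

print(f"cumulative consumption overtakes the stock around {year}")
```

With these placeholder numbers the crossover lands in 2029, inside the window cited above; the exact year shifts with the assumptions, but the shape of the exponential gap does not.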
An Even More Deadly Crisis: Training Data Is Being Contaminated by AI
Data depletion is merely the beginning of the crisis; what is even more deadly is that the remaining data sources are being comprehensively contaminated. The genuine wisdom created by humans is being completely overwhelmed by the informational waste generated by machines themselves—a situation far more severe than simply running out of food.
The “Dead Internet” theory, which began circulating around 2021, is becoming a reality. It predicts that after 2026 the vast majority of content on the internet will be machine-generated: with large language models driving the cost of content production to nearly zero, content farms have begun churning out plagiarized articles and fake news in bulk. Between 2023 and 2024, Amazon's Kindle platform already felt the impact of AI-generated content. A wave of homogenized, low-quality, hastily produced books flooded the store, forcing the platform to impose restrictions such as capping the number of titles a single account can publish per day. These books carry virtually no substantive content; they rely on keyword stuffing to earn meager royalties, yet they keep polluting AI training data.