Large Language Models (LLMs) require large amounts of data for training. Very large. Like the entire textual content of the World Wide Web large. In the case of the largest such models — OpenAI’s GPT, Google’s Gemini, Meta’s LLaMA, France’s Mistral — most of the data used is simply vacuumed up from the internet, if not by the companies themselves then by third-party bot-jockeys like Common Crawl, which provides structured subsets of the data suitable for AI training. Other tranches come from digitized archives like the Books1, Books2 and Books3 collections and Z-Library.
In nearly all cases, the hoovering and archive-compiling have been done without the permission, or even the knowledge, of the creators or rights owners of the vacuumed-up haul.
Some of those sources are more useful than others for training LLMs. Professionally produced or vetted content, like newspaper archives or Wikipedia entries, is generally better teaching material than the grist from the comment thread of a random blog, because it’s more likely to have been fact-checked and to hew to the rules of syntax and grammar LLM developers want their models to learn.
Some publishers, particularly news media outlets, have begun to block web crawlers from sucking up their content in hopes of being able to charge for access to their archives.
Given the vast quantity of content available on the internet, however, no single source or archive is so essential to the efficiency of a model that not having it for training would be a showstopper. Nor is it generally possible, given the complexity of the computational processes involved in training, to attribute the output of a model to all the possible sources that went into it.
As discussed here in previous posts, the lack of those reference points has so far prevented the development of any sort of price discovery mechanism for training data that would allow an efficient market to emerge, even where there is a willingness among the parties to engage. The licensing deals that have been done so far have been done largely for instrumental purposes — to advertise good faith on the part of AI companies, or, as with Reddit’s deals with Google and others, to bolster the investment case for its stock.
According to a fascinating new report in the Wall Street Journal, however, the scales may be starting to tip. Deep as the well of training material on the internet is, the supply may be starting to run dry for the next iterations of LLMs.
According to researchers cited by the Journal, “the industry’s need for high-quality text data could outstrip supply within two years,” making progress on developing more powerful models difficult.
As the Journal reports: “AI companies are hunting for untapped information sources, and rethinking how they train these systems. OpenAI, the maker of ChatGPT, has discussed training its next model, GPT-5, on transcriptions of public YouTube videos, people familiar with the matter said.”
According to Pablo Villalobos, a researcher at Epoch cited by the Journal, GPT-5 could require 60 trillion to 100 trillion tokens to make a meaningful qualitative leap over GPT-4, which he estimates was trained on a mere 12 trillion tokens. He projected there is a 90% chance that the demand for high-quality training data will outstrip supply by 2029.
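To put those figures in rough perspective, here is a minimal back-of-envelope sketch in Python, using only the estimates reported by the Journal; the words-per-token conversion is a common rule of thumb, not a figure from the article:

```python
# Rough scale-up implied by Epoch's estimates (token counts as reported by
# the Journal; the ~0.75 words-per-token ratio is an assumed rule of thumb).
gpt4_tokens = 12e12                 # estimated GPT-4 training tokens
gpt5_token_range = (60e12, 100e12)  # projected GPT-5 requirement

for tokens in gpt5_token_range:
    multiple = tokens / gpt4_tokens
    words = tokens * 0.75
    print(f"{tokens / 1e12:.0f}T tokens = {multiple:.1f}x GPT-4's training data, "
          f"or roughly {words / 1e12:.0f} trillion words")
```

Even the low end of that projection implies assembling five times the high-quality text that went into GPT-4, which is why untapped sources like YouTube transcripts suddenly look attractive.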
Other companies are investigating using smaller, more carefully selected training corpora to create more powerful and efficient models. Some executives, including OpenAI’s Sam Altman, suggest the days of very large models like GPT may soon be in the rear-view mirror.
We’ve suggested before that a trend toward smaller models trained with smaller datasets could make it easier for a viable market for data to develop, by increasing archive owners’ bargaining power.
A tightening supply of high-quality data relative to growing demand could further enhance the leverage of rights owners, particularly those holding sources that have not yet been tapped, such as the transcripts of YouTube videos.
A rebalancing of supply and demand for training data, along with the seeds of market development taking root, could also spur policymakers to establish some basic rules of the road, giving investors confidence that a viable AI data market will emerge and provide reasonable returns.
The EU AI Act’s data transparency requirement is a step in the direction of market formation. But a supply crunch could do much more.