AI, Copyright and Piracy: An Open and Cloze Case
New study brings a bit of empiricism to the debate
One of the most vexing challenges arising from the use of copyrighted works to train generative AI models is determining the value of any particular work, or set of works, to a model’s performance. The lack of an empirical methodology to measure the impact of access to a particular dataset on a model’s monetizable prospects has hindered the development of a robust and sustainable licensing market for training data, and made it difficult to assess liability in the event such access is found to be infringing, as with the use of pirated copies of works in the dataset.
A new paper by two researchers at UC Berkeley, titled Cloze Encounters: The Impact of Pirated Data Access on LLM Performance, shines some important new light on the question, with implications both for ongoing and future litigation and for the development of the nascent licensing market.
The researchers, Stella Jia and Abhishek Nagaraj, designed a clever study to measure the impact of access to the notorious Books3 dataset, which contains pirated copies of some 195,000 titles, on the performance of four groups of large language models (LLMs), all of which are known to have used Books3 in training: OpenAI’s GPT class of models, Google’s Gemini series, Anthropic’s Claude Haiku, and Meta’s Llama series.
According to the paper, they found that “direct access to the full text of a book [in] Books3 significantly enhances performance on the cloze name task.”
The cloze method was first described by media researcher Wilson L. Taylor in 1953. The name is derived from the term “closure” in Gestalt theory, which refers to a sense of completeness. The method was developed to test the readability of a text, but it has also been used to measure fluency in second-language learners, to test the performance of natural language processing systems in computer science, and in many other settings. It involves testing a student’s or computer system’s ability to fill in missing information in a sentence based on the surrounding context.
For instance, in the phrase, “On December 17, 1903, a ______________ day at Kitty Hawk, North Carolina, Wilbur and Orville Wright made the first successful powered flight in a heavier-than-air craft,” the blank is preceded by the indefinite article “a” and followed by the noun “day.” Therefore, the missing word must be an adjective. In the context of the rest of the phrase — and presuming some prior knowledge of the physics and mechanics of flight — the missing adjective would likely relate to weather or atmospheric conditions, most plausibly those involving air or lift, such as “windy” or “blustery,” rather than to water or terrain.
Jia and Nagaraj used a variation known as the name cloze test to determine whether a model could correctly fill in a missing name in a phrase taken verbatim from a particular work. The variation is designed to reveal whether the work was included in the model’s training data, which would supply the prior knowledge needed to fill the blank. For example, the phrase “Yeah? Well maybe satyr emotions work differently than human emotions. Because you’re wrong. I don’t care what he thinks. _______________ pulled his feet up onto the branch,” is taken from The Lightning Thief by Rick Riordan. The model would need to correctly identify “Grover” as the missing name.
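To make the mechanics concrete, here is a minimal sketch of how a name-cloze item might be built and scored in code. The helper functions, the masking token, and the exact-match scoring rule are illustrative assumptions rather than the authors’ implementation, and the model query is left as a hypothetical placeholder rather than a real API call.

```python
# Minimal sketch of a name-cloze item (illustrative, not the authors' code).

def make_name_cloze(passage: str, name: str, blank: str = "[MASK]") -> str:
    """Replace one occurrence of a character name in a verbatim passage with a blank."""
    return passage.replace(name, blank, 1)

def is_correct(prediction: str, gold: str) -> bool:
    """Count a prediction as correct only on an exact, case-insensitive match."""
    return prediction.strip().lower() == gold.lower()

passage = ("Yeah? Well maybe satyr emotions work differently than human emotions. "
           "Because you're wrong. I don't care what he thinks. "
           "Grover pulled his feet up onto the branch.")
item = make_name_cloze(passage, "Grover")

# prediction = ask_model(f"Fill in the masked name: {item}")  # hypothetical model call
prediction = "Grover"  # stand-in for a model's reply
print(item)
print("correct:", is_correct(prediction, "Grover"))
```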
To apply the test to the Books3 dataset, the researchers compiled a sample of 13,000 titles, roughly half of which were included in Books3 and half of which were not. They then took an equal number of phrases from each group and tested the models’ ability to fill in the missing names in both.
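In outline, the headline measurement reduces to an accuracy gap between the two groups of titles. The sketch below assumes each test item records whether its source title appears in Books3 and whether the model’s answer was correct; the field names and toy records are illustrative, not the paper’s data.

```python
# Sketch of the Books3 vs. non-Books3 accuracy comparison (toy data).

from statistics import mean

results = [
    {"in_books3": True,  "correct": True},
    {"in_books3": True,  "correct": False},
    {"in_books3": False, "correct": False},
    {"in_books3": False, "correct": True},
    # ... one record per name-cloze item, per model
]

def accuracy(rows):
    return mean(1.0 if r["correct"] else 0.0 for r in rows)

acc_books3 = accuracy([r for r in results if r["in_books3"]])
acc_other  = accuracy([r for r in results if not r["in_books3"]])

# The "Books3 effect" is the gap in name-cloze accuracy between the two groups.
print(f"Books3 titles:     {acc_books3:.1%}")
print(f"non-Books3 titles: {acc_other:.1%}")
print(f"Books3 effect:     {acc_books3 - acc_other:+.1%}")
```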
In all cases, the models performed significantly better on phrases taken from works in Books3 than on those taken from non-Books3 titles. The degree of improvement varied widely among the models, however.
The most pronounced improvements were found in GPT-3.5 Turbo and GPT-4.0, as compared to the Llama 3.1 series and the Claude Haiku and Gemini series of models. The “Books3 effect” led to a 21%-23% improvement in accuracy in the GPT models, compared to a 7%-9% improvement in the others. Part of the difference is attributable to the GPT models underperforming the others by about 5% across all book titles.
The study also found the “Books3 effect” dropped sharply for works published after 2020, when the dataset was compiled, reinforcing the conclusion that it was used in training.
Notably, the researchers also found that a title’s relative popularity, as measured by aggregated reviews on Goodreads, had a measurable impact on the models’ accuracy. The Books3 effect was significantly greater for less popular books than for more popular ones. The finding suggests that training data substitution plays an important role in determining a work’s effect on performance: more popular books are more likely to appear, in whole or in part, in sources beyond Books3, such as reviews, summaries, and derivative works, reducing the importance of direct access.
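The substitution argument can be illustrated by computing the same gap separately for popular and less popular titles. In this sketch the review-count field, the cutoff, and the toy records are all assumptions for illustration; the paper works with aggregated Goodreads review counts rather than a simple threshold.

```python
# Sketch of stratifying the Books3 effect by title popularity (toy data).

from statistics import mean

results = [
    # One toy record per name-cloze item: Books3 membership, correctness,
    # and the source title's review count (illustrative values).
    {"in_books3": True,  "correct": True,  "reviews": 500},
    {"in_books3": False, "correct": False, "reviews": 800},
    {"in_books3": True,  "correct": True,  "reviews": 250_000},
    {"in_books3": False, "correct": True,  "reviews": 300_000},
]

def books3_effect(rows):
    """Accuracy gap between Books3 and non-Books3 titles within a stratum."""
    def acc(grp):
        return mean(1.0 if r["correct"] else 0.0 for r in grp)
    return (acc([r for r in rows if r["in_books3"]])
            - acc([r for r in rows if not r["in_books3"]]))

POPULARITY_CUTOFF = 10_000  # hypothetical review-count threshold

popular = [r for r in results if r["reviews"] >= POPULARITY_CUTOFF]
obscure = [r for r in results if r["reviews"] < POPULARITY_CUTOFF]

# The substitution story predicts a larger gap for less popular titles, whose
# text is less likely to surface in reviews, summaries, and other sources.
print(f"Books3 effect, popular titles:      {books3_effect(popular):+.1%}")
print(f"Books3 effect, less popular titles: {books3_effect(obscure):+.1%}")
```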
The study also found almost no Books3 effect in the smaller Llama-8B model, in contrast to the larger Llama-70B version, underscoring that model size affects how much a model benefits from direct access to the data.
The usual caveats: The study concerns LLMs, and it is not clear whether the results could be replicated with other types of generative AI systems, such as image or music generators. The use of Books3 by some of the models in the study is confirmed by records of litigation and other legal processes, but the lack of broader transparency into training data corpora could limit opportunities to replicate the results with other datasets. The significant variation in the “Books3 effect” among the models in the study also makes it difficult to generalize the results.
Still, the findings hold several important implications for the debate over the use of copyrighted works to train AI models:
The study demonstrates it is both theoretically and practically possible to quantify the impact of direct access to a particular dataset on a model’s performance, in at least some circumstances and with some kinds of models, even within the context of training corpora measured in terabytes and petabytes.
Somewhat counter-intuitively, the value of direct access to a work for training a model is inversely correlated with its popularity.
The models’ ability to correctly supply missing names in phrases taken from specific works suggests that models do, effectively, retain copies of the texts in their training data, although in what form or by what process is not clear from the study.
The same dataset can yield different degrees of performance gain in different models, likely reflecting broader differences in training methods and in the mix of data in training corpora, reinforcing the value of greater transparency into training sources and methods.
By themselves, the findings of one academic study are not likely to settle many arguments, let alone litigation. But the work of Jia and Nagaraj is an important step toward moving the debate around the use of copyrighted works by AI systems out of the realm of theory and speculation and onto more empirical terrain.