Split Decision: Court Rules AI Training Is Fair Use, But Consider The Source
Anthropic both wins and loses in books case
For folks on both sides of the AI vs. copyright divide, there’s a lot to like, and a lot not so much, in Tuesday’s court ruling in the case of Andrea Bartz, et al. v. Anthropic. Judge William Alsup of the federal district court for the Northern District of California found the Amazon-backed AI startup’s use of copyrighted books to train its Claude LLMs to be a “spectacularly” transformative fair use, “among the most transformative [technologies] many of us will see in our lifetimes.” But he said Anthropic will need to stand trial anyway for knowingly, and possibly willfully, compiling a vast library of pirated copies of the books used in training and retaining them for other, unspecified possible future uses.
I’ll leave it to better-informed legal minds than mine to discuss whether Judge Alsup’s conclusions are legally sound. But elements of the analysis in his 32-page opinion are likely to find their way into the broader policy debate around AI and copyright, whether his ruling holds up or not.
Transparency
The order draws a clear distinction between the sources of training material and the end uses of that material. Even if the end use is legal, Alsup makes clear, that does not by itself render the sources and methods used to acquire the material legal.
This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use. There is no decision holding or requiring that pirating a book that could have been bought at a bookstore was reasonably necessary to writing a book review, conducting research on facts in the book, or creating an LLM. Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.
Creators and rights owners are likely to seize on that language to bolster their demands for greater transparency into the data used to train AI models, just as the cause is losing support among policymakers.
The transparency issue nearly derailed the U.K. government’s Data (Use and Access) Bill earlier this month. The bill is a centerpiece of Prime Minister Keir Starmer’s plan to goose Britain’s sluggish economy by clearing the way for data-intensive domestic technology development and attracting investment from outside the U.K. Peers in the House of Lords, however, led by one-time filmmaker Baroness Beeban Kidron, repeatedly sought to insert an amendment requiring AI developers to disclose the data used to train their models.
The Starmer government strongly opposed the amendment, under heavy pressure from U.S. technology companies and the Trump administration, and managed to get it stripped from the bill in the House of Commons each time “the other place” (as they say in Parliament) inserted it. It took six round trips between the two chambers, but the bill finally passed, without the transparency amendment, after the Kidron forces concluded they had made their point and that further resistance would be futile.
The EU’s AI Act actually included a provision requiring AI developers to provide a “sufficiently detailed summary” of the data used in training. But the European Commission, also under pressure from the White House, is now reviewing elements of the act and may delay its full implementation beyond the scheduled August 1 deadline to consider possible changes to the law. The AI Code of Practice mandated by the act, meant to provide a blueprint for developers to comply with the law’s many do’s and don’ts, has also been softened from its original design during the drafting process to make it more accommodating to tech companies.
Various bills floated in the U.S. Congress have also included a data transparency mandate, but none has come to a vote.
Rights owners will be hoping that the attention on the coming trial over Anthropic’s use of data sourced from known repositories of pirated material, along with the similar charges faced by Meta in the separate Kadrey case, will bring renewed urgency to the transparency debate.
Quality In, Quality Out
Alsup’s discussion also highlights the commercial value of the data found in professionally published texts as a factor underlying Anthropic’s extensive use of books in training.
Over time, Anthropic came to value most highly for its data mixes books like the ones Authors had written, and it valued them because of the creative expressions they contained. Claude’s customers wanted Claude to write as accurately and as compellingly as Authors. So, it was best to train the LLMs underlying Claude on works just like the ones Authors had written, with well-curated facts, well-organized analyses, and captivating fictional narratives — above all with “good writing” of the kind “an editor would approve of.” Anthropic could have trained its LLMs without using such books or any books at all. That would have required spending more on, say, staff writers to create competing exemplars of good writing, engineers to revise bad exemplars into better ones, energy bills to power more rounds of training and fine-tuning, and so on. Having canonical texts to draw upon helped.
The direct connection between the quality of the input and the quality of the output highlighted in that passage is likely to be noted by publishers of all kinds in future discussions with AI companies over licensing their archives as training data.
Indiscriminate scraping of the web will reap tonnages of text, but what it yields is already being commoditized. To create a model with commercially marketable features, training on professionally produced content of the same type as the features being marketed is likely to yield the best results. If you want well-written output, you need well-written, well-edited input.
It also introduces some workable metrics into potential licensing negotiations. The commercial value of an archive to a model can at least be inferred from the commercial or reputational success of its contents and its relevance to the model’s intended use case.
Out Of Style
Another factor in the policy debate over AI and copyright is whether an author’s or artist’s distinctive style can or should be protected from copying or mimicry. But Judge Alsup shoots that notion down.
Authors further argue that the training was intended to memorize their works’ creative elements — not just their works’ non-protectable ones (Opp. 17). But this is the same argument. Again, Anthropic’s LLMs have not reproduced to the public a given work’s creative elements, nor even one author’s identifiable expressive style (assuming arguendo that these are even copyrightable). Yes, Claude has outputted grammar, composition, and style that the underlying LLM distilled from thousands of works. But if someone were to read all the modern-day classics because of their exceptional expression, memorize them, and then emulate a blend of their best writing, would that violate the Copyright Act? Of course not.
There is plenty more to chew on in Alsup’s opinion, which may yet be appealed, as it is a summary judgment ruling on a matter of law. It advances the debate, both legally and as a matter of policy. But it is certainly not the last word on either.