From RAG to Riches
Retrieval-augmented generation is a threat to publishers, but also an opportunity
On February 7, a federal district court ruled that Ross Intelligence’s unlicensed use of Westlaw content to develop a competing, AI-powered legal research tool harmed the market for Westlaw’s service and therefore did not qualify as fair use. Less than a week later, a group of U.S. and Canadian news media publishers filed a copyright infringement lawsuit against Canadian AI company Cohere, which also cited harm to the plaintiffs’ content licensing market from Cohere’s generative AI tools.
Despite the timing, the two cases are almost certainly not directly related. There are 14 named plaintiffs on the Cohere complaint. It is simply not possible to put that many lawyers in a room and expect consensus on a brief to emerge in less than a week. The complaint even reads a bit like a committee project, with occasional digressions to dot some particular “i” or cross a favored “t.” It had to have been in the works well before the ruling in Ross came down.
But the two cases do rhyme. They both come in the wake of the U.S. Supreme Court’s 2023 ruling in Andy Warhol Foundation v. Goldsmith, which emphasized that harm to the market for the original work from a derivative use is the most important factor in a fair use analysis. To judge from the two cases, copyright litigators as well as the lower courts got the memo.
As Berkeley Law professor Pamela Samuelson spelled out in her keynote at last year’s RightsTech AI Summit, the harm from a derivative use must be concrete, affecting an existing market, not merely a speculative harm to a potential business opportunity for the plaintiff, if it is to weigh against a finding of fair use. In Ross, the court found that the defendant’s legal research tool competed directly with Westlaw’s existing business of licensing its research and data service to law firms and libraries. In the Cohere case, the plaintiffs argue they all have long-standing content licensing businesses and emphasize that a market for licensing content for use in AI systems, while nascent, is real and growing.
At the heart of the Cohere case is retrieval-augmented generation. RAG systems marry large language models with search engines to provide timely responses to user prompts. LLMs can take months to train and can quickly grow stale because much of the data they are trained on has a shelf life. Any relevant information that emerges after training is complete is invisible to the model. RAG systems try to solve that problem by using search technology to scour the internet and retrieve the most up-to-date information on the subject of the prompt. They then feed that data to the LLM to generate a response.
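In code, the loop is simple. Here is a minimal sketch of the retrieve-augment-generate cycle described above; the search_web and llm_generate helpers are hypothetical stand-ins for whatever search index and language model a given system uses, not any particular vendor’s API.

```python
# Minimal sketch of a RAG pipeline. search_web and llm_generate are
# hypothetical placeholders, not any particular vendor's API.

def search_web(query: str, top_k: int = 3) -> list[str]:
    """Stand-in for a search engine: return the text of the most
    relevant, most recent documents for the query."""
    raise NotImplementedError("plug in a real search index here")

def llm_generate(prompt: str) -> str:
    """Stand-in for a large language model completion call."""
    raise NotImplementedError("plug in a real model here")

def rag_answer(user_prompt: str) -> str:
    # 1. Retrieve: fetch fresh documents the frozen model has never seen.
    documents = search_web(user_prompt)
    # 2. Augment: prepend the retrieved text to the user's question.
    context = "\n\n".join(documents)
    prompt = (
        "Answer the question using only the sources below.\n\n"
        f"Sources:\n{context}\n\nQuestion: {user_prompt}"
    )
    # 3. Generate: the model summarizes the retrieved material.
    return llm_generate(prompt)
```

Everything current the system “knows” flows through that retrieval step, which is why ongoing access to fresh content matters so much in what follows.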
Often, the response merely summarizes the information the system retrieves. In many cases, those summaries are sufficient to answer the user’s question without any need to click through to the websites from which the information was sourced, depriving those sites of valuable traffic. For obvious reasons, the most up-to-date information is often found on news websites.
According to the publishers’ complaint, however, Cohere has not secured licenses for the data it scrapes from the web to power its RAG models. Moreover, the plaintiffs claim, many of the summaries Cohere’s models generate contain verbatim or near-verbatim copies of publishers’ copyrighted articles, or substantial excerpts from them, making them directly competitive with the original sources.
Last week’s suit against Cohere is not the first to challenge a RAG model. That honor belongs to News Corp., which sued RAG pioneer Perplexity in October 2024. But it’s unlikely to be the last, either.
Even more than LLMs themselves, RAG models represent both a danger and an opportunity for publishers and other rights owners. Because RAG systems scrape content from the web in real time to respond to prompts, unlicensed scraping poses an ongoing competitive threat to publishers’ business, to say nothing of the recurring copyright infringement it may entail.
The flip side of that coin, however, is that RAG technology could offer publishers something LLMs alone cannot, if an effective licensing system can be established: an ongoing revenue stream from AI.
Content licensing deals between publishers and LLM developers now number in the dozens, including last week’s deal between The Guardian and OpenAI. Yet those deals almost certainly undervalue news content to AI models.
Nearly all of the AI licensing deals to date are effectively one-offs. The AI developer is given access to the publisher’s content archive to train one or more models during the term of the deal in exchange for payment. Once the content has been ingested and processed, however, its entire value has been extracted and permanently, irretrievably incorporated into the model. The model retains that value, but once training is complete, continued access to the content is worth nothing further to the AI developer, and there is no need to renew the deal with the publisher.
For the publisher, it’s one and done.
In the case of RAG systems, however, where real-time collection of current information is essential to the model, ongoing and uninterrupted access to news reporting has clear and quantifiable value to the model provider.
As I discuss in a forthcoming report for the Variety Intelligence Platform (VIP+), the challenge publishers face in trying to capture that value is technical more than contractual. Apart from the decades-old and unenforceable robots.txt protocol, publishers currently have few tools to effectively regulate access to their content by automated bots and web crawlers. As the plaintiffs against Cohere point out in their complaint, even paywalled content is not immune to being accessed and used by AI systems.
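To see why robots.txt is so weak an instrument, it helps to look at how it works: the file simply lists advisory directives, and crawlers that honor it check the rules before fetching, while crawlers that don’t face no technical barrier. Below is a minimal sketch using Python’s standard-library urllib.robotparser; the sample rules and URLs are purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# robots.txt is purely advisory: it declares which paths a site would
# like crawlers to skip, but nothing enforces compliance. These sample
# rules mirror the kind many news sites publish to block AI crawlers.
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /premium/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A compliant crawler checks before fetching...
print(parser.can_fetch("GPTBot", "https://news.example/latest"))      # False
print(parser.can_fetch("SomeBot", "https://news.example/premium/x"))  # False

# ...but the check is entirely voluntary: a crawler that never calls
# can_fetch() can download the same pages with no technical barrier.
```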
If that technical challenge can be overcome, however, RAG could be a road to AI riches for publishers.