The structural incentives of the digital economy have long encouraged an ethos among technology companies of seeking forgiveness rather than permission when it comes to engaging with the creators and owners of copyrighted works. In a winner-take-all race, moving first and moving fast are viewed as essential, even at the cost of breaking things along the way. If you win the race, any necessary cleanup after the fact — settling some lawsuits, signing a few cosmetic licensing deals — becomes a tolerable cost of doing business, especially when you have financial backers willing to underwrite the risk. If you don’t win the race, none of it matters anyway.
That history and those same incentives have largely driven Silicon Valley’s approach to generative AI as well. AI companies have vacuumed up terabytes of copyrighted material to feed their models without so much as a howdy-do to creators and rights owners, and then wrapped themselves in claims of fair use when sued over it. And by and large, they have so far gotten away with it.
In at least one pending AI lawsuit, however, that appetite for risk may just have caught up with a tech company. The case, in the Northern District of California, pits a group of authors, including comedian Sarah Silverman and essayist Ta-Nehisi Coates, against Meta Platforms, the parent of Facebook and Instagram. The authors filed suit in July 2023, charging Meta with using their works without permission to train its Llama large language model (LLM). Yet in the year and a half since, the plaintiffs have had trouble gaining traction with their copyright claims, several of which were tossed out by the judge in November 2023.
Recently, however, the authors obtained new counsel, led by the estimable David Boies, who famously led the U.S. government’s successful prosecution of Microsoft on antitrust grounds in 1998. In a filing with the court last week, Boies’ team revealed it had uncovered potentially incriminating new evidence against Meta and its CEO, Mark Zuckerberg, in the course of pre-trial discovery, including evidence of willful copyright infringement and intentional removal of copyright management information (CMI) from documents used in training.
This week, the court unsealed some of the newly uncovered evidence (see here and here), and it paints a vivid and unflattering picture of a tech company running through one copyright warning light after another as it races to keep pace with its AI competitors, including Google and OpenAI (h/t Jason Kint).
Boies’ legal eagles found a series of internal emails indicating Meta, at the personal direction of “MZ” (presumably Mark Zuckerberg), made use of the notorious Library Genesis (LibGen) dataset to train Llama, despite the misgivings of Meta’s executive team. A so-called shadow library apparently originating in Russia, LibGen contains tens of millions of texts, including books, articles, academic papers and other copyrighted works, nearly all of which are believed to have been pirated.
Zuckerberg’s approved plan for accessing the database, moreover, involved downloading the collection via a torrent site using Meta’s own computers. As with most torrent sites, the one Meta used requires downloaders to participate in the torrent network in return, which means uploading portions of the same files they are downloading, a process known as “seeding.”
In other words, Meta allegedly not only illegally copied the works in the LibGen collection but distributed them as well, indicating possible willful infringement. That could significantly increase any damages assessed if Meta were to lose the case.
According to the latest filing, Meta engineers, presumably in light of their misgivings about the whole enterprise, seeded LibGen with the least amount of material they could in order to download the rest, a practice known as “leeching.”
Here’s how one engineer described Meta’s actions in a deposition:
Meta: That’s what we did and the library that we used [was called] Lib Torrent for downloading LibGen, [Meta employee] Bashlykov configured the configure setting so the smallest amount of seeding could occur.
Plaintiffs’ Counsel: What does that mean?
Meta: When you use a Torrent protocol, part of the configuration on how Torrents work is, you [ ] can only download as long as you offer to participate in the Torrent network in some way and seeding means up to open up some . . . sharing [sic] of the Torrent file while you’re downloading.
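To make the seeding-versus-leeching mechanics concrete, here is a minimal, illustrative sketch of how a BitTorrent client can be tuned to throttle its uploads, using the Python bindings of the open-source libtorrent library (plausibly the “Lib Torrent” referenced in the deposition). To be clear, this is an assumption-laden illustration, not Meta’s actual code: the file name, rate limit, and settings shown are hypothetical.

import time
import libtorrent as lt  # open-source BitTorrent library; a stand-in, not Meta's tooling

ses = lt.session()
ses.apply_settings({
    "upload_rate_limit": 1024,  # cap uploads at ~1 KiB/s; 0 would mean unlimited
    "active_seeds": 0,          # keep no torrents in the seeding rotation
})

params = lt.add_torrent_params()
params.ti = lt.torrent_info("dataset.torrent")  # hypothetical torrent file
params.save_path = "./downloads"
handle = ses.add_torrent(params)

# Download, then drop out of the swarm rather than seeding back to it.
while not handle.status().is_seeding:
    time.sleep(1)
ses.remove_torrent(handle)

The point of the sketch is the asymmetry the deposition describes: the protocol obliges a peer to offer some uploading in order to download, but a client can be configured so that the amount it actually contributes back is minimal.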
The new filing also references evidence suggesting Meta may have removed CMI from the documents it downloaded in part to help conceal its use of pirated data in training.
Meta’s corporate representative also admitted that Meta’s removal of CMI may have “facilitated” the training of Llama, thereby assisting the copyright infringement at the center of this case and resulting in conduct that clearly violates the DMCA.
This discovery suggests that Meta strips CMI not just for training purposes, but also to conceal its copyright infringement, because stripping copyrighted works’ CMI prevents Llama from outputting copyright information that might alert Llama users and the public to Meta’s infringement. To further minimize the risks… Meta’s programmers included “supervised samples” of data when fine-tuning Llama to ensure Llama’s output would include less incriminating answers when answering prompts regarding the source of Meta’s AI training data.
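For readers unfamiliar with the term, CMI under the DMCA covers information conveyed with a work such as its title, author, copyright notice, and identifiers like ISBNs. Purely as an illustration of what “removing CMI” from training text could look like — a generic sketch, not Meta’s actual pipeline, with hypothetical patterns — a preprocessing step might simply filter out lines matching such notices:

import re

# Hypothetical patterns for common copyright-management lines; real-world
# CMI removal, if it occurred, could take many other forms.
CMI_PATTERNS = [
    r"(?im)^.*copyright\s*(©|\(c\))?\s*\d{4}.*$",  # copyright notices
    r"(?im)^.*all rights reserved.*$",             # rights statements
    r"(?im)^.*isbn[\s:]*[\d-]{10,17}.*$",          # ISBN identifiers
]

def strip_cmi(text: str) -> str:
    """Delete lines that match the patterns above."""
    for pattern in CMI_PATTERNS:
        text = re.sub(pattern, "", text)
    return text

sample = "A Book Title\nCopyright © 2019 Jane Author. All rights reserved.\nISBN 978-0-00-000000-0\nChapter 1 ..."
print(strip_cmi(sample))  # prints the text with the notice lines blanked out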
None of the new evidence has yet been tested or vetted by the court. Nor has Meta yet filed a response to the latest filing.
It is not clear, for instance, whether any of the material Meta uploaded to the torrent network included works by the named plaintiffs. If not, the uploading could be deemed irrelevant to the immediate case (although if the court certifies the case as a class action that may not matter). Nor is there yet evidence that anyone downloaded the material Meta itself uploaded, which could nullify a claim of distribution (although such unlicensed “making available” could leave Meta liable for infringement in Europe).
Yet if borne out, the new evidence could poke a big hole in Meta’s fair use defense. The evidence portrays Meta’s actions as premeditated, intentionally deceptive, and in themselves not particularly transformative. Any transformation of the training data that may have occurred happened well after, and not directly in conjunction with, the allegedly infringing copying.
More broadly, the acute competitive pressures the case portrays Meta as responding to are by no means unique to Meta. All major AI companies face the same pressure to collect as much training data as possible, as quickly as possible, to keep their models competitive. And it would be no great surprise to find many of them following the same course of conduct, aware in advance that their actions were likely infringing and taking them anyway.
Whether liable or not, it’s not a pretty picture.