Raw Story Media and AlterNet last week became the latest rights owners to see their claims against an AI company dismissed before trial by a federal court.
Unlike many other AI plaintiffs, Raw Story, which owns AlterNet, had not charged the defendant, OpenAI, with copyright infringement based on the unauthorized reproduction of its works to train ChatGPT. Instead, it brought its claims under §1202(b) of the Digital Millennium Copyright Act, which prohibits the intentional removal of copyright management information (CMI) from protected works.
The strategy was part of a growing trend among AI plaintiffs to try to sidestep a court's involved and unpredictable fair use analysis by casting the defendant's actions as a straight-up violation of statutory law: no complex, multi-factor analysis required. Fair use is not a defense to a §1202(b) violation.
Unfortunately for Raw Story, US District Judge Colleen McMahon didn't need a multi-factor analysis to dismiss its complaint, either. The problem, she wrote in her 10-page ruling, was a straight-up matter of Article III standing: a failure to allege an identifiable, concrete harm.
First [sic], Plaintiffs argue that they have standing to pursue damages because "the unlawful removal of CMI from a copyrighted work is a concrete injury." ... Second [sic], Plaintiffs argue that they have standing to pursue injunctive relief, because they have alleged that there is a substantial risk that Defendants' program will "provide responses to users that incorporate[] material from Plaintiffs' copyright-protected works or regurgitate[] copyright-protected works verbatim or nearly verbatim." ... Defendants respond that neither theory of harm identifies a concrete injury-in-fact sufficient to establish standing. I agree with Defendants. Plaintiffs' claims for both damages and injunctive relief are DISMISSED because Plaintiffs' [sic] lack Article III standing.
Judge McMahon is undoubtedly correct as a matter of law. And she is hardly the first jurist to look skeptically at speculative claims of harm to rights owners from the training of AI models, whether brought under §1202 or §106. In her ruling, she locates the problem in the operation of AI systems.
“When a user inputs a question into ChatGPT, ChatGPT synthesizes the relevant information in its repository into an answer,” she wrote. “Given the quantity of information contained in the repository, the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs' articles seems remote.”
Whether she intended to or not, McMahon here puts her finger precisely on the nature of the fundamental conflict at issue. The law is deterministic. Effects must have identifiable, particularized causes to be justiciable. Liability cannot attach to speculative harms.
Generative AI is probabilistic. The harm it poses to rights owners is not speculative but stochastic: the likelihood that a particular work or collection of works used in training will be implicated in a particular output is not zero, but it is drawn from a probability distribution rather than fixed as a deterministic outcome. There is harm, in the form of appropriated latent value, but legally speaking no foul.
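The stochastic point can be made concrete with a toy calculation. Every number below is hypothetical and illustrative only; none is drawn from the ruling or from any measurement of ChatGPT. The sketch simply shows how the per-work probability the court called "remote" and a non-trivial corpus-wide rate can coexist:

```python
# Illustrative sketch only: all figures are made up for the arithmetic,
# not measurements of any real system.

corpus_size = 10_000_000       # hypothetical number of works in the training set
p_regurgitate = 0.001          # hypothetical chance any one output regurgitates *some* work
queries_per_day = 100_000_000  # hypothetical daily query volume

# Chance that one specific plaintiff's article is implicated in a given output,
# assuming regurgitation risk is spread uniformly across the corpus:
p_single = p_regurgitate / corpus_size  # vanishingly small -- "remote," per the court

# Expected number of outputs per day that regurgitate *some* work in the corpus:
expected_daily = p_regurgitate * queries_per_day  # substantial in aggregate

print(f"Per-work, per-query probability: {p_single:.1e}")
print(f"Expected regurgitating outputs per day, corpus-wide: {expected_daily:,.0f}")
```

Under these assumptions any individual plaintiff's injury looks negligible, which is exactly the standing problem, while the aggregate appropriation across all rights owners is large. The harm is real at the level of the distribution, invisible at the level of the particularized claim.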
So far, no court or legal theorist has devised a strategy to reconcile the two modalities at work, at least within the realm of copyright. The square deterministic peg of the law simply does not fit into the round hole of stochastic probabilism at the heart of AI.
McMahon alludes to the problem in her conclusion.
Let us be clear about what is really at stake here. The alleged injury for which Plaintiffs truly seek redress is not the exclusion of CMI from Defendants' training sets, but rather Defendants' use [sic] of Plaintiffs' articles to develop ChatGPT without compensation to Plaintiffs… Whether or not that type of injury satisfies the injury-in-fact requirement, it is not the type of harm that has been "elevated" by Section 1202(b)(i) of the DMCA… Whether there is another statute or legal theory that does elevate this type of harm remains to be seen. But that question is not before the Court today.
It should be the question before policymakers, however. Generative AI systems indisputably create value by divining statistical relationships among words, word parts and sentences in published texts. But the process is extractive.
Even if you accept that statistical correlations are distinct from the expressive features of training texts, and therefore fall outside the existing parameters of copyright, it does not obviously follow that no other obligation of equity should apply. Mining companies generate value by extracting ore from the ground, but access to the land and mineral rights comes at a price.
The question is whether a legally sustainable and equitable theory of stochastic liability for AI training should (and could) be devised, within copyright or some other body of law, to address the current imbalance between AI developers and copyright owners, and to do so without leaving courts to engage in arcane statistical exegesis on individual claims.
That might still leave the question of whether the juice would be worth the squeeze to rights owners, given the enormous volume of texts used to train even specialized AI language models. But that at least would be a question for rights owners to answer for themselves.
ETA: Actual attorney Peter Csathy disagrees that Judge McMahon got it right on the law and suggests the ruling could be overturned on appeal.