Reading Comprehension
Opting-out of AI training will require a machine-readable technical standard
One of the more critical and contentious questions that has loomed over the EU AI Act since it went into effect August 1 is whether or how the text-and-data mining (TDM) exception in the EU Copyright Directive applies to the use of copyrighted works to train generative AI models.
The TDM exception in the Directive on Copyright in the Digital Single Market (DSM Directive) provides a mandatory exception for the reproduction of copyrighted works and the extraction of information from certain legally accessible databases for scientific research purposes, without requiring permission or payment to rights holders. Rights owners can prevent their works from being covered by the TDM exception, however, by expressly reserving the right to extract text and data.
A recent study by two scholars presented to the European Parliament concluded that the training of generative AI models is not covered by the TDM exception and is therefore infringing. The text of the AI Act, however, requires model developers “to identify and comply with... a reservation of rights expressed pursuant” to the opt-out provision in the DSM Directive, which would seem to indicate the authors of the AI Act believed the exception applied. Why require compliance with an opt-out if the exception doesn’t apply in the first place?
Last week, a district court in Hamburg, Germany, handed down a ruling (Google Translate version) holding that the widely used LAION (Large Artificial Intelligence Open Network) training dataset qualifies for the TDM exception.
The case was brought by a photographer over the inclusion of his photo among the LAION data. LAION is a non-profit organization founded in 2021 to advance machine learning research and applications, particularly machine vision. It compiles various datasets of image-text pairs and makes them available free of charge. The datasets include hyperlinks to images or image files that are publicly available on the internet paired with text descriptions of the image along with associated metadata.
The photographer, Robert Kneschke, sued, claiming LAION reproduced his image without permission and ignored an express reservation of rights appearing on the website of the photo agency to which he had licensed his photo.
As a case of first impression under the AI Act, the proceeding was closely watched in EU legal circles. The district court’s ruling is subject to appeal, however, so the final result may not yet be settled. And given the salience of the question of the TDM exception’s application to AI training, it is likely to continue to spark discussion and debate.
The district court’s ruling is quite narrow, in any case. It held that LAION, as a non-profit foundation that is not substantially controlled or influenced by any for-profit commercial entity, was engaged in scientific research and therefore eligible for the TDM exception. From that, the court concluded it did not need to reach the further question of whether Kneschke or the photo agency had met the legal threshold for opting out by sufficiently asserting a reservation of rights.
“Whether the defendant can rely on the limitation of Section 44b of the Copyright Act [the opt-out provision] seems doubtful in the present case”
— Hamburg Regional Court
The court did, however, offer some dicta on the question that could reverberate among both rights owners and technology companies even after the main question is finally resolved.
The law stipulates that “a reservation of use for works accessible online is only effective if it is made in machine-readable form.” The disclaimer on the photo agency’s website, however, appeared only as natural language, human-readable text. In its discussion, the court averred that, “Whether the defendant can rely on the limitation of Section 44b of the Copyright Act [the opt-out provision] seems doubtful in the present case,” because the reservation was not made by the plaintiff himself and was not in machine-readable form.
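What a machine-readable reservation might look like in practice is still unsettled. One emerging candidate is the W3C community-group TDM Reservation Protocol, which lets a site signal its opt-out via an HTTP header rather than human-readable boilerplate. The sketch below shows how a crawler might check for such a signal; the header name and semantics here follow that proposal loosely and should be treated as illustrative, not as a settled legal standard.

```python
# Minimal sketch of how a crawler might honor a machine-readable
# TDM reservation expressed as an HTTP response header, loosely
# modeled on the W3C community-group TDM Reservation Protocol.
# (Header semantics here are illustrative assumptions.)

def tdm_reserved(headers: dict[str, str]) -> bool:
    """Return True if the response headers signal a TDM opt-out.

    A value of "1" in the tdm-reservation header means the rights
    holder reserves text-and-data-mining rights; "0" or absence
    means no machine-readable reservation was expressed.
    """
    value = headers.get("tdm-reservation", "").strip()
    return value == "1"

# A human-readable disclaimer alone -- as on the photo agency's
# website in this case -- would not register here; only the
# explicit machine-readable flag does.
print(tdm_reserved({"tdm-reservation": "1"}))       # opt-out asserted
print(tdm_reserved({"content-type": "text/html"}))  # no reservation
```

The point of the example is the asymmetry the Hamburg court hinted at: a crawler can evaluate the header above in microseconds at dataset scale, while a natural-language disclaimer requires interpretation no automated pipeline can reliably perform.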
In response to or anticipation of the passage of the AI Act, many rights owners rushed out statements purporting to reserve their rights with respect to TDM for AI. In May, Sony Music sent letters to 700 AI developers and streaming services demanding information on whether and how they had used Sony music in connection with AI training. It subsequently published a “Declaration of AI Training Opt-Out” on its website. Warner Music followed with a similar warning in July. Dutch image collection society Pictoright and France’s author society Sacem have drafted general reservation of rights statements creators and rights owners can use to opt out of AI training.
But can they? If the Hamburg court’s doubts regarding Kneschke’s reliance on a non-machine-readable declaration are borne out beyond the present case, it would seem that natural-language opt-out statements, by themselves, will not be legally sufficient to evade the TDM exception in the context of AI under EU law.
More critically, though, the case throws a spotlight on the lack of any standardized machine-readable method or protocol for opting out of AI training. Given the scale of the datasets needed to train large models, an automated process is a practical, if not a legal, imperative for any viable, large-scale licensing system. It would also be needed to support a mandatory compliance regime, should things go that way.
Efforts are being made. The Coalition for Content Provenance and Authenticity (C2PA) has developed a “do not train” metadata tag as part of the Adobe-led Content Credentials initiative, and has recently gained support from leading AI companies including OpenAI, Meta, Google and Amazon — or at least enough support from them to come to the meetings. On the other hand, OpenAI just raised $6.6 billion in new funding at a valuation of $157 billion and is going to need access to all the data it can get to train the new models those investors will be expecting, so…
In any case, Content Credentials is a voluntary, industry-led project with no mechanism to require or enforce compliance with the do-not-train tag. And a blanket “do not train” declaration, whether in human- or machine-readable form, may not be flexible enough to permit discretionary licensing.
Netherlands-based Liccium is developing an automated opt-out protocol and registry based on the International Standard Content Code (ISCC) recently approved by the International Organization for Standardization (ISO). Unlike metadata that is extrinsic to a data file, an ISCC code can be generated from the data file itself and is unique to the file’s content. That would likely satisfy the machine-readable legal standard. But the Liccium protocol still relies on the opt-out obligation that falls on the rights owner in the EU’s TDM exception, not on any compliance obligation by AI companies.
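The idea that an identifier is derived from the file itself, rather than attached to it, is the key difference from metadata tags like Content Credentials. The real ISCC is a composite of several similarity-preserving codes, but the core property can be sketched in a few lines; the `CC-` prefix and hash-based code below are hypothetical simplifications, not the actual ISCC algorithm.

```python
# Illustrative sketch of a content-derived identifier in the spirit
# of ISCC: the code is computed from the file's bytes, so anyone
# holding the same file derives the same identifier, with no
# external metadata to strip or lose. (The real ISCC composes
# several similarity-preserving codes; this shows only the idea.)
import hashlib

def content_code(data: bytes) -> str:
    """Derive a short identifier from the raw bytes of a file."""
    digest = hashlib.sha256(data).hexdigest()
    return f"CC-{digest[:16]}"  # hypothetical short-code format

a = content_code(b"same image bytes")
b = content_code(b"same image bytes")
c = content_code(b"different bytes")
print(a == b)  # True: identical content yields an identical code
print(a == c)  # False: different content yields a different code
```

Because the code travels with the content implicitly, a registry keyed on such identifiers can answer “has the owner of this exact file opted out?” even when the file is scraped from a third-party site with its metadata stripped.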
Industry-led technical standards like Content Credentials have worked in other contexts. If enough big AI companies get on board with the do-not-train standard, that may be sufficient for many rights owners’ purposes. And certainly a definitive ruling from the courts that unlicensed use of copyrighted works to train AI models is infringing would be a game-changer insofar as it could undergird a discretionary opt-in system of licensing.
But we’re a long way from that point. Nearly all the licensing deals that have been announced so far are blanket deals — a negotiated fee for access to a publisher’s archive, the dross as well as the silver. That approach may work for select rights owners with particularly valuable archives. But it is not scalable to a general licensing system. And right now, the technological as well as the legal infrastructure is not in place to support a scalable model.