The European Union AI Office, established by the AI Act, last week issued a first draft of its Code of Conduct for providers of “general purpose” (i.e. foundation) models, including sections on compliance with EU copyright law and on transparency about that compliance. And it would seem to settle the question of whether a reservation of rights by copyright owners, such as opting out of AI training, must be in machine-readable form to be enforceable. It does.
Here’s the relevant section:
When Signatories engage in text and data mining according to Article 2(2) of Directive (EU) 2019/790 for the development of their general-purpose AI models, they commit to ensure that they have lawful access to copyright-protected content and to identify and comply with rights reservations expressed pursuant to Article 4(3) of Directive (EU) 2019/790.
For those joining us late, Directive (EU) 2019/790, aka the Copyright Directive, permits the use of copyrighted works for text-and-data mining (TDM) in certain circumstances. But it also allows copyright owners to opt out of having their works used in TDM by expressly “reserving” their rights (Article 4(3)).
The Code of Conduct is intended to provide guidance to AI developers on best practices for demonstrating compliance with the requirements imposed on them by the AI Act. That includes, as here, complying with the Copyright Directive, which means complying with valid TDM opt-out assertions.
To satisfy the requirement to comply with TDM opt-outs, the Code stipulates, developers must:
[O]nly employ crawlers that read and follow instructions expressed in accordance with the Robot Exclusion Protocol (robots.txt); and
[M]ake best efforts in accordance with widely used industry standards to identify and comply with other appropriate machine-readable means to express a rights reservation at source and/or work level… In particular, Signatories are encouraged to implement widely adopted tools that enable expressions of rights reservations at aggregate level.
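The first of those commitments is mechanically checkable, and Python's standard library happens to ship a Robot Exclusion Protocol parser. A minimal sketch: GPTBot and Google-Extended are real crawler tokens published by OpenAI and Google for exactly this kind of opt-out, but the robots.txt rules and URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a rights owner might publish. GPTBot and
# Google-Extended are real AI-training crawler tokens; the rules and
# URLs here are illustrative only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant training crawler checks every URL before fetching it.
print(parser.can_fetch("GPTBot", "https://example.com/article"))         # False
print(parser.can_fetch("SomeSearchBot", "https://example.com/article"))  # True
```

Note that this only works if the crawler honestly identifies itself; robots.txt is a request, not an access control.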
No mention of complying with the kinds of Terms of Service provisions or general opt-out declarations many rights owners implemented in response to passage of the AI Act.
The catch, of course, is that the robots.txt protocol was developed long before generative AI was a thing and isn’t fully fit for purpose (although it was updated by Google earlier this year); and no other “widely used industry standards” to identify and comply with machine-readable opt-out indicators currently exist. Efforts to develop such systems are underway, but none could fairly be considered “widely used” at this point.
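One such effort is the W3C community group's TDM Reservation Protocol (TDMRep), which proposes, among other mechanisms, signaling a reservation through a `tdm-reservation: 1` HTTP response header. A minimal sketch of how a crawler might honor it, using canned header dicts rather than live requests:

```python
# Sketch of honoring the proposed TDM Reservation Protocol (TDMRep).
# Under the proposal, "tdm-reservation: 1" among a response's headers
# reserves TDM rights, and "tdm-policy" may point to licensing terms.
# The dicts below stand in for real HTTP responses.

def tdm_rights_reserved(headers: dict) -> bool:
    """True if the response headers reserve TDM rights."""
    normalized = {k.lower(): v for k, v in headers.items()}
    return normalized.get("tdm-reservation", "").strip() == "1"

print(tdm_rights_reserved({"TDM-Reservation": "1"}))       # True
print(tdm_rights_reserved({"Content-Type": "text/html"}))  # False
```

Whether TDMRep or something else ultimately wins out, this is the shape of check the Code expects crawlers to perform at the source or work level.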
The Code does require general-purpose AI developers to “engage in bona fide discussions” with rights owners, standards organizations, and other stakeholders to “develop interoperable machine-readable standards to express a rights reservation.” It also says the EU Commission will convene and chair such meetings, and at some point may “issue information about state-of-the-art solutions that providers are expected to honour.”
The draft Code offers no time frame for settling on or implementing a standard. But the overall emphasis on compliance with automated, machine-readable opt-out indicators makes clear — if it wasn’t already — that the EU will let the burden fall on creators and rights owners to regulate access to their works to train AI models rather than on technology companies to affirmatively secure agreements for access before making such use.
Technology companies are not completely off the hook, however. Other provisions of the draft require developers of general-purpose AI models to:
Conduct “reasonable copyright due diligence” before entering into contracts with third-party dataset providers;
Request information from any such third party as to how it “identified and complied with” rights reservations;
Take steps to mitigate the risk that downstream applications that integrate with a general-purpose model will generate infringing output;
Avoid “overfitting” models during training that could cause them to memorize and regurgitate protected works; and
Avoid crawling and scraping sites offering pirated content.
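The last of those obligations is the easiest to enforce mechanically: filter the crawl frontier against a blocklist. A sketch, with an entirely hypothetical blocklist (a real deployment would rely on curated lists of infringing sites):

```python
from urllib.parse import urlparse

# Hypothetical blocklist of domains hosting pirated content.
PIRATE_DOMAINS = {"piratebooks.example", "freemovies.example"}

def allowed_for_crawl(url: str) -> bool:
    """Reject URLs on a blocked domain or any of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return not any(host == d or host.endswith("." + d) for d in PIRATE_DOMAINS)

print(allowed_for_crawl("https://news.example.org/story"))         # True
print(allowed_for_crawl("https://cdn.piratebooks.example/book1"))  # False
```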
The draft also spells out various transparency requirements for developers regarding their compliance with opt-out indicators, including:
Making public, in “broadly understood” language, up-to-date information on the measures they adopt to identify and comply with rights reservations;
Providing up-to-date information on all crawlers they deploy to collect data for training, including all their relevant robots.txt features;
Providing the AI Office, upon its request, information about the sources of data used to train, test, and validate models, and about authorizations (i.e. licenses or other permissions) to access and use the data.
It is unclear from the draft whether that last requirement is the same as, or in addition to, the AI Act’s requirement that developers make public “sufficiently detailed summaries” of the data used to train their models.
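The crawler-disclosure measure among the transparency requirements above doesn't come with a prescribed format. As a purely illustrative guess at how a provider might publish it in machine-readable form (the provider name and crawler token here are made up):

```python
import json

# Entirely hypothetical disclosure document; the draft Code prescribes
# no format, and "ExampleTrainingBot" and the provider are invented.
crawler_disclosure = {
    "provider": "Example AI Ltd",
    "crawlers": [
        {
            "user_agent": "ExampleTrainingBot/1.0",
            "robots_txt_token": "ExampleTrainingBot",
            "purpose": "collection of training data",
            "honors": ["robots.txt"],
        }
    ],
}
print(json.dumps(crawler_disclosure, indent=2))
```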
On its face, the draft Code seems favorable to creators and rights owners insofar as it includes substantive copyright-compliance and monitoring measures AI developers must follow to comply with the AI Act. But rights owners are likely to find the Code’s implicit adoption of an opt-out regime for AI training problematic.
As Fairly Trained founder and CEO Ed Newton-Rex noted in a recent blog post, it places a substantial administrative burden on creators; neither URL-based nor works-based opt-out indicators are effective at controlling usage of downstream copies of the works; and there is no requirement that models be retrained if works are opted out after the initial training and release.
The draft, of course, is just that: a draft, a work-in-progress. Stakeholders and EU policymakers will now have an opportunity to bang away at the code and bang on about their views. The AI Act stipulates that the Code of Conduct will go through four drafting rounds before the final version is released next April.
Let the banging begin.