While controversy and litigation continue to rage around the intersection of copyright and generative AI technology, and likely will continue to do so for the foreseeable future, some rights owners and AI developers have begun, tentatively, to explore ways to create a less contentious relationship. The first seven months of 2024 have seen at least a trickle of deals between news media publishers and AI companies to allow the use of publishers’ content to train large language models (LLMs), starting in February with Reddit’s widely reported deal with Google. Since then, Google, OpenAI and Microsoft have reached deals with Vox, the Financial Times, News Corp., Le Monde, The Atlantic and other publishers. AI developers have also struck a number of agreements with photo agencies and stock houses for data to train text-to-image (T2I) generators.
While encouraging, the agreements that have been reached to date are largely the result of direct negotiations between individual rights owners and individual AI developers and involve discrete, relatively limited datasets. Given the enormous quantity and range of data being used to train generative AI models, that approach poses obvious practical challenges as a solution for large corpora of data involving multiple rights owners.
Those challenges have led many analysts and stakeholders to look to collective licensing as a model for the use of copyrighted works in AI systems, albeit generally without specifying how that would or could work at the scale needed to accommodate AI. Earlier this month, however, the Copyright Clearance Center (CCC) took a small step toward figuring that out, announcing an extension to its blanket Annual Copyright License to provide AI re-use rights from a broad range of publishers.
CCC was launched in 1978 to provide blanket licenses to enterprises for certain types of re-use of content from its roster of academic, scientific and corporate publisher clients. While initially focused on photocopying, over the years it has gradually expanded the re-use cases covered by its basic license. The AI re-use case is the latest such extension.
According to CCC president & CEO Tracey Armstrong, the AI extension has been in the works since last year, and the impetus for adopting it came from both its publisher members and its licensees. She describes the adoption of the extension so far as “very good,” from both ends, but declines to provide percentages.
“Our clients are going through the [license] renewal process now, and as they do they’re adopting” the newly expanded license, Armstrong tells me.
For now, the AI license remains limited. It applies only to the internal use of publishers’ content by enterprises “in AI systems,” such as using an AI model to summarize multiple journal articles on the same subject. Works cannot be used to train AI models and outputs cannot be used or distributed externally.
There is also no content feed associated with the AI rights.
“This is a rights-only license,” for content licensees already subscribe to or have otherwise obtained, according to Armstrong.
The AI license is also only available as part of CCC’s broader basic license, which could limit its appeal to enterprises that are not already licensees. A standalone AI license could be in the offing, however.
“You’ll be seeing more licenses over the next 12 months,” Armstrong says.
Although discussions with publishers and licensees about the AI license were ongoing for 20 months or so, CCC was able to get to this point only by holding in abeyance for now some of the harder questions that arise around collective licensing for AI.
Money collected from the AI license, for instance, will go into a pool and eventually be distributed to publishers as royalties. CCC is not planning to track actual usage of the content by licensees, however, and is still working with economists and econometricians to devise an equitable model for allocating money from the pool.
Armstrong insists it’s a nut CCC will be able to crack based on its past experience allocating pooled revenue from its basic license. Part of the challenge to rights owners in relation to AI, however, is that AI tends to confound many earlier models and assumptions.
Nowhere is that more the case than on the question of attribution. Unlike collective licensing of music, for instance, where radio play can be monitored and streaming plays can be tracked to the second, and royalties can be paid out more or less accordingly, attributing AI outputs to particular inputs is extremely difficult if not impossible.
Just how sensitive the question of attribution can be, particularly for the sort of academic publishers represented by CCC, came out the same week its new AI license was announced. Informa, the parent company of academic publisher Taylor & Francis, revealed in an earnings report that it had struck an AI licensing deal with Microsoft for an initial fee of $10 million plus annual payments for three years.
Not only had Informa neglected to inform T&F authors whose books and journal articles were included in the deal, it apparently made no provision for attribution of those works when used by Microsoft’s models. That raised alarms among the authors over the potential loss of citations from the mash-up of their work with others. Citations are a valuable coin of the realm in academic circles. How often your work is cited by other researchers can often determine whether you secure tenure or further publishing deals, and the loss of any significant number of them can be more costly in the long run than any loss from unpaid royalties for the licensing of your work.
Because CCC is a middleman representing publishers in licensing deals, that’s not something that could be pinned on it if it were to arise. But it’s an indication of how fraught the issue of collective licensing for AI use could become, even when done in good faith.
Collective management of rights and blanket licensing have proved to be workable solutions to large-scale licensing challenges for creators and rights owners in the past. AI may not let them hide under a blanket in the future.