Deals Hint at Emerging Market for AI Training Data

Plus: DOJ Puts AI Companies On Notice, and Raising the Price of Music

May 31, 2024

For all the sturm und drang around the unlicensed use of copyrighted works to train AI models, there sure seem to be a lot of licensing deals being discussed all of a sudden for access to those same works. In just the past two weeks, OpenAI has signed licensing deals with Wall Street Journal publisher News Corp., Vox Media and The Atlantic for access to their archives for training, and we’ve seen reports that Meta, Alphabet and OpenAI are all in conversations with the Hollywood studios about licensing movie and television footage.

In the few months before that, we saw OpenAI strike a deal with the Financial Times, and Reddit sign with both Google and OpenAI. Photo agencies and archives, including Shutterstock, Photobucket and Flickr, have also been actively striking deals with AI companies over the past few months.

What gives? Have AI companies suddenly gotten religion, so that they’re now willing to license what they previously just scraped?

More likely, incentives have simply changed, on both sides of the table. The flood of litigation brought by creators and rights owners against AI companies is one factor behind the seeming change of heart. But so, too, is a shift in ordinary market forces and conditions.

The biggest shift is the depletion of easy fodder for training. The internet has been all but scraped clean of the most easily accessible content. AI companies now need access to content that is behind paywalls and bot blockers, handing a degree of leverage to publishers.

Related to that depletion of scrapable content is a decreasing return on scale. For the first few iterations of models, simply ingesting more training data brought measurable improvement to their quality and capability. But there are signs that sheer scale no longer brings competitively meaningful improvement to models, or that the increase in scale required to make a difference is effectively unachievable.

Achieving meaningful improvements in the usefulness of models is now more sensitive to the quality of training data used. That has handed leverage to rights owners sitting on the choicest caches of content, as evidenced by the roster of respected mastheads represented in the deals so far, and the absence of more problematic brands.

On the flip side of that coin is a measure of price capitulation on the part of publishers, at least for now. As discussed in previous posts, the current dispute between the New York Times and OpenAI feels very much like a dispute over price rather than a refusal by either party to deal. But all publishers are anxious to see a market develop for access to their data for training and many leading brands are willing to sign deals now for relatively modest sums to prime that development.

As Reddit COO Jen Wong explained of the UGC platform’s initial agreements, “I'd say, these are midterm deals is how we think about them because it's such a nascent and early market that we want to see how things unfold. So, not forever, but long enough to understand value.”

Another shift has been rights owners’ growing interest in leveraging AI technology for their own purposes. Anxious for access to the latest AI technology and expertise, they may be more willing to trade access to their content to get it.

The Hollywood studios are particularly keen to leverage the technology.

“We are very focused on AI. The biggest problem with making films today is the expense,” Sony Pictures CEO Tony Vinciquerra told an investor event last week. “We will be looking at ways to…produce both films for theaters and television in a more efficient way, using AI primarily.”

Photo agencies are also trading content for technology. Shutterstock has developed its own AI image generator using OpenAI’s DALL-E technology but has also struck training deals with Meta, Google, Amazon and Apple. Getty Images and Adobe have also developed their own image generators trained on their own content but relying on technology provided by others.

There have been other shifts also helping drive the trend toward more selective use of content to train models. Greater regulatory scrutiny of how AI companies access and use data in training, including new mandatory disclosure rules regarding training datasets, is clearly a factor. So, too, is growing interest among potential enterprise customers in models that are more narrowly tuned to their particular use cases rather than relying on large, generically trained models.

While encouraging, the market that has emerged thus far is a narrow one. The licensing agreements reached to date have all been direct deals, between individual AI companies and individual rights owners, for access to discrete, more or less finite datasets. That market does nothing for the broad swath of creators and rights owners too small to treat directly with large AI companies, who might benefit more from the development of collective rights management structures.

But it’s a start.

ICYMI

DOJ: AI Companies Must Find a Way to Compensate Artists and Authors

To date, the Federal Trade Commission has mostly carried the antitrust ball on AI companies and their use of data in their models. But this week, the Department of Justice took the handoff. Speaking at a conference at Stanford University, Assistant Attorney General Jonathan Kanter, the head of the antitrust division, warned AI companies they could face action by his department if they don’t find a way to fairly compensate artists, performers and other creators for the use of their work. “If firms in the AI ecosystem violate the antitrust laws, the antitrust division will have something to say about that,” Kanter said. “What incentive will tomorrow’s writers, creators, journalists, thinkers and artists have if AI has the ability to extract their ingenuity without appropriate compensation? The people who create and produce these inputs must be properly compensated.”

Music: Still Going For a Song?

While Spotify, Amazon, Apple and other leading music streaming platforms have all successfully pushed through price increases over the past several months, tunes are still undervalued, according to analysts at Morgan Stanley. In a report issued this week, they argue that music in general, including streaming and live, is highly underpriced, leaving ample room to increase revenue. “On a nominal basis, US consumer spending on recorded music is below 1999,” they wrote, “suggesting significant opportunity for the industry to exercise pricing power - something we have begun to see in only the last year or two.” That will be music to the ears of artists, publishers and record companies.
