Access Denied: A New Model For AI Data Licensing
Cloudflare's pay-per-crawl approach could bypass the AI-copyright divide
Creators and rights holders generally chafe at the proposition that the burden should fall on them to affirmatively opt out of letting their works be used to train AI models. Yet that opt-out approach was enshrined in the European Union’s Copyright Directive and the AI Act, and is very much on the table as the UK debates new AI and copyright legislation. Given national governments’ zeal to encourage domestic AI development, including in the U.S., an opt-out policy is trending toward becoming the de facto global standard.
In contrast, rights holders as a rule maintain that an opt-in policy should be the default, under which the burden would fall on would-be AI developers to secure licenses before ingesting any copyrighted material.
The policy debate aside, both approaches present technical challenges. With opt-out, publishers have no reliable, machine-readable means to signal their preferences to the AI scrapers and bots that harvest their content, beyond the unenforceable robots.txt protocol and a few ISO-certified but non-mandatory industry standards such as C2PA’s Content Credentials specification and Liccium’s TDM-AI registry. Nor is there a settled taxonomy for specifying different levels of permission.
With opt-in, the challenge has been even greater. Apart from lacking statutory support, there has been no automated technical means to implement it. With the exception of a few fledgling efforts at collective licensing, such as the Copyright Clearance Center’s AI training license, every opt-in to date has required direct negotiation between a single publisher and a single AI company.
Last week, however, Cloudflare made opt-in technically feasible at scale. The internet infrastructure provider announced it will begin blocking AI crawlers by default on websites running on its Content Delivery Network.
Called pay-per-crawl, the new system uses a “managed robots.txt” platform that lets publishers turn over management of their robots policy to Cloudflare. The system then leverages the often neglected HTTP 402 response code, “Payment Required.” When a bot’s request for access to a Cloudflare-registered site hits the network, it must signal its intent to pay in its request headers or it will receive a 402 response and be denied access. Publishers, as well as AI developers willing to pay, can each register their payment details with Cloudflare, and where payment is required Cloudflare acts as merchant of record to execute the transaction.
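Stripped to its essentials, that exchange is simple enough to sketch. The Python below models the gate from the network’s side; the header names (“crawler-max-price,” “crawler-price,” “crawler-charged”), the price, and the decision logic are illustrative placeholders based on the description above, not Cloudflare’s published specification or implementation.

```python
# Toy model of the network-level gate: a crawler request that carries no
# intent-to-pay signal gets a 402 Payment Required instead of the page.
# Header names and price are hypothetical.

PRICE_PER_CRAWL = 0.005  # publisher-set price per crawl in USD (illustrative)

def gate(request_headers: dict[str, str], page_body: bytes) -> tuple[int, dict[str, str], bytes]:
    """Return (status_code, response_headers, body) for one crawler request."""
    declared_max = request_headers.get("crawler-max-price")
    if declared_max is None or float(declared_max) < PRICE_PER_CRAWL:
        # No intent to pay, or the crawler's ceiling is below the asking
        # price: deny access and quote the price in the response.
        return 402, {"crawler-price": str(PRICE_PER_CRAWL)}, b""
    # The declared ceiling covers the price: the charge is recorded (with
    # Cloudflare acting as merchant of record) and the content is served.
    return 200, {"crawler-charged": str(PRICE_PER_CRAWL)}, page_body
```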
Publishers can also specify which pages on their site require payment and which can be accessed free of charge, for example charging for pages monetized with ads while leaving developer documentation or support pages open. Publishers can set their own pricing, and can also tailor their responses to requests. For instance, if a crawler doesn’t have a billing relationship with Cloudflare, publishers can send a 403 “Forbidden” response while indicating that a paid relationship could be available in the future.
AI developers can set their crawler’s payment parameters, such as agreeing to pay up to a certain dollar amount (you can read more on the technical details here).
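Seen from the crawler’s side, the negotiation amounts to a pair of plain HTTP requests. The Python sketch below, using the requests library, reuses the same placeholder header names and a hypothetical bot identity; it is a rough illustration of the flow, not a client for Cloudflare’s production API.

```python
import requests

MAX_PRICE_USD = "0.01"  # the most this hypothetical crawler will pay per page

def crawl(url: str) -> requests.Response:
    """Fetch a page, retrying with a declared price ceiling if payment is required."""
    headers = {"User-Agent": "ExampleAIBot/1.0"}  # hypothetical bot identity
    resp = requests.get(url, headers=headers)
    if resp.status_code != 402:
        return resp  # freely crawlable, or blocked outright (e.g. 403)

    quoted = resp.headers.get("crawler-price", "unknown")
    print(f"{url}: payment required, publisher quotes ${quoted} per crawl")

    # Retry, declaring the maximum price we are willing to pay. If the quote
    # falls within that ceiling, the charge is settled and the page returned.
    headers["crawler-max-price"] = MAX_PRICE_USD
    return requests.get(url, headers=headers)
```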
Pay-per-crawl is currently available only in private beta to Cloudflare customers. But Cloudflare operates the largest CDN on the internet, handling roughly 20% of global web traffic, serving 78 million HTTP requests per second, and used by more than 2 million websites. Some 35 leading publishers are already on board, including Condé Nast, Time, The Atlantic, AP, BuzzFeed, Reddit, and Ziff-Davis, and at least one major record label, Universal Music, issued a statement cheering the launch.
Technical achievement aside, the Cloudflare system introduces a number of key capabilities to enable a more meaningful and robust licensing system for AI training data.
Critically, it makes robots.txt enforceable by default, by blocking access at the network level unless payment is first agreed to. That shifts the burden of compliance from copyright owners onto those who would use copyrighted works to train AI models, even in the absence of statutory support for such compliance.
Second, pay-per-crawl is both flexible and granular. Publishers can set a flat price for all crawlers, but they can also allow specific crawlers to bypass the payment wall, for example those affiliated with AI companies with which they have direct licensing arrangements. The system also supports dynamic pricing, enabling publishers to establish different prices for different bot-request paths or for different types of content on their sites. For instance, the price-per-crawl could dynamically adjust based on how many users an AI application has, or whether a request comes from a for-profit or non-profit AI developer.
That flexibility gives publishers direct control over their AI-facing business model, rather than having to accept terms set by a third-party application.
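To make that granularity concrete, here is a minimal sketch of the kind of pricing rules a publisher might express: a flat paywall for most pages, free passage for crawlers with a direct licensing deal, open developer and support paths, and a discount for non-profit developers. The crawler names, paths, prices, and the non-profit attribute are all invented for illustration; in practice these choices would be expressed through Cloudflare’s own configuration rather than code a publisher writes.

```python
from dataclasses import dataclass

@dataclass
class CrawlRequest:
    path: str          # e.g. "/archive/2024/story.html"
    crawler: str       # bot identity as verified at the edge
    nonprofit: bool    # hypothetical attribute a publisher might key pricing on

# Crawlers covered by a direct licensing deal bypass the paywall entirely.
LICENSED_CRAWLERS = {"partner-bot"}  # hypothetical name

def price_for(req: CrawlRequest) -> float | None:
    """Return the per-crawl price in USD, or None if the request is free."""
    if req.crawler in LICENSED_CRAWLERS:
        return None
    if req.path.startswith("/docs/") or req.path.startswith("/support/"):
        return None                                 # keep developer and support pages open
    base = 0.02 if req.path.startswith("/archive/") else 0.005
    return base / 2 if req.nonprofit else base      # discount for non-profit developers
```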
Most critically, however, pay-per-crawl enables a licensing model based on access rather than on copyright.
Basing licensing on access rather than copyright has several advantages. It is unaffected by the outcome of the policy debate around AI and copyright. Whether training AI models on copyrighted works is fair use, or not, is irrelevant if the material in question cannot be collected in the first place.
It also places licensing in the realm of ordinary, well-developed contract law rather than the more fluid, contingent waters of copyright law. Charging for access is a familiar, well-established business model and relatively straightforward to administer; most subscription businesses are based on it. When you subscribe to Netflix or Spotify, your payment grants you access to any and all of the programming available on those platforms, not to specific programs. You might pay for certain additional permissions, such as the number of concurrent streams or the number of devices on which you can use the service. But the content you access with those additional permissions is not relevant to the model.
When disputes arise — if access is gained without entitlement — adjudication does not require four-part balancing tests and does not implicate constitutional principles.
The same virtues can apply in a B2B context as well. The advantages of elevating access over use in licensing may even be amplified in the context of AI. It could, for instance, lessen the urgency around some of the most confounding and contentious questions complicating the current licensing debate, such as determining the value of a particular archive to a particular AI model, or accounting for the content’s ongoing use and relevance to a model post-training. A certain amount of ongoing usage could be assumed and baked into the price for access, perhaps with payments spread over time.
The pay-per-crawl model also provides both sides with clear, quantifiable metrics for accounting and reporting purposes. For publishers, each permitted crawl can be tabulated and assigned a known and fixed value for calculating revenue from AI. For AI companies, each accepted request represents a known unit of cost for calculating the cost of materials. The new metrics might even provide a foundation to resolve the attribution problem for royalty calculations.
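Because every settled crawl is a discrete, priced event, the bookkeeping on both sides reduces to simple aggregation. A minimal sketch, assuming a hypothetical log of (crawler, price_charged) records:

```python
from collections import defaultdict

def totals_by_crawler(crawl_log: list[tuple[str, float]]) -> dict[str, float]:
    """Sum settled charges per crawler: revenue for the publisher,
    cost of materials for each AI company."""
    totals = defaultdict(float)
    for crawler, price in crawl_log:
        totals[crawler] += price
    return dict(totals)

# Hypothetical month of activity.
log = [("example-bot-a", 0.005), ("example-bot-a", 0.005), ("example-bot-b", 0.02)]
print(totals_by_crawler(log))  # {'example-bot-a': 0.01, 'example-bot-b': 0.02}
```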
Cloudflare is not the first to try to ground the use of copyrighted works by AI in access. Some web publishers, most notably Reddit of late, have sought to leverage terms-of-service click-licenses to control access to their sites by AI crawlers. The startup Tollbit.ai has developed technology that lets individual web publishers block access by unauthorized crawlers, and its system is used by several of the same publishers that have signed up with Cloudflare. But Cloudflare’s ability to control access by default, at the network level and at scale, could make access-based licensing the norm.