The Infrastructure of Control
Building an AI data marketplace from the ground up
In Brussels last week, the Legal Affairs Committee of the European Parliament, by a vote of 17-3, approved a series of proposals for ensuring the “fair remuneration” of rightsholders for the use of copyrighted works by generative AI. The proposals will be put before the full Parliament at the next plenary session scheduled for March.
While no doubt well-intentioned, like so many similar pronouncements the proposals are long on what the committee would like to see happen, and short on how to make it happen in the real world.
The committee calls, for instance, for rightsholders to have “full control over the digital use of their content by AI systems and models for training purposes,” by means of a “robust and functioning” (i.e. machine-readable) opt-out mechanism.
All well and good, but a meaningful opt-out mechanism needs to be machine-enforceable, not just machine-readable. That is, the opt-out signal, once detected, must functionally block the requesting machine’s access to the copyrighted work at the network level, and that restriction must persist through downstream copies of the work.
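To make the distinction concrete, here is a minimal sketch, in Python, of what enforcement at the network edge could look like: instead of trusting the crawler to honor a robots.txt opt-out, a hypothetical filter on the publisher’s side of the network reads the directives itself and refuses to serve the file. The edge_filter function and the domain are illustrative assumptions, not any vendor’s actual API.

    # A sketch of network-side enforcement of a robots.txt AI opt-out.
    # The edge_filter function and domain are hypothetical illustrations.
    from urllib import robotparser

    rules = robotparser.RobotFileParser()
    rules.parse([
        "User-agent: GPTBot",  # the machine-readable opt-out signal
        "Disallow: /",
    ])

    def edge_filter(user_agent: str, url: str) -> int:
        """Serve the page only if the site's robots.txt permits this
        agent; otherwise block before the request reaches the work."""
        if rules.can_fetch(user_agent, url):
            return 200  # pass the request through to the origin
        return 403      # machine-enforceable: access is disabled

    # A crawler that ignores the opt-out still never gets the file,
    # because the check runs on the network side, not in the bot.
    print(edge_filter("GPTBot", "https://example-publisher.com/article"))  # 403

Note what the sketch does not solve: once a copy of the work leaves the publisher’s network, nothing in robots.txt travels with it, which is why persistence through downstream copies is the harder half of the problem.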
My point is not to pick on the Legal Affairs Committee. The problem is fundamental to the digital media economy: the how is almost entirely in the network. It cannot be mandated or imposed from the outside. Any solution for realizing the what must become part of the network’s architecture.
That’s not easy to do as a technical matter. Reproducing files composed of binary code is fundamental to how von Neumann machines, and the networks that connect them, operate.
Nor is it an easy sell to shareholders or prospective investors. Inhibiting the fundamental operation of the network is not nearly as sexy a proposition as building the network itself or the applications and capabilities it supports. Those are easier to do as a technical matter and therefore more likely to produce a near-term return on investment.
That’s one basic reason why you see hundreds of billions of dollars in investment capital going into building AI data centers today. Investing in infrastructure pencils out even if the business case for AI applications doesn’t yet. It’s also why, historically, far less investment has gone into figuring out how to architect IP principles into digital networks, and why, where it has been tried, the results have often been discouraging, as any music-blockchain investor can tell you.
That under-investment is starting to change, however. Last week, the World Intellectual Property Organization (WIPO) announced the planned March 17 launch of the Artificial Intelligence Infrastructure Interchange (AIII, pronounced “A triple I”), and opened applications for “any creator, rightsholder, tech professional, AI developer, or academic with the expertise” to join the project’s Technical Exchange Network (TEN). The goal of the initiative is to advance a “global dialogue on the technical and operational aspects of the intellectual property system in the context of artificial intelligence.”
That is, to figure out how to “enable digital content to circulate globally while helping to ensure that creators and rightsholders are properly identified and, where applicable, compensated” by including “copyright infrastructure” within the architecture of AI.
Also last week, the internet infrastructure provider Cloudflare announced the acquisition of Human Native, a company I profiled back in 2024. Cloudflare is a major content delivery network (CDN) and cybersecurity provider whose core business is securing corporate and government networks against attacks. But as discussed here last year, it also developed the pay-per-crawl economic model for web publishers based on its managed-robots.txt platform.
The platform makes AI opt-outs expressed in a website’s robots.txt functionally enforceable at the network level: it blocks a web crawler’s access to the site unless the bot’s request headers signal an intent to pay or an existing payment relationship between the bot’s operator and the publisher. Publishers and AI developers can register their payment credentials with Cloudflare, which then acts as the merchant of record to execute the transaction.
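As a rough sketch of that handshake (the header names and price logic here are my illustrative assumptions, not Cloudflare’s exact protocol), the gate quotes a price to a bot with no payment signal via HTTP 402 “Payment Required,” and lets a registered payer or a sufficient bid through:

    # A sketch of a pay-per-crawl gate; header names are assumptions.
    from decimal import Decimal

    PRICE = Decimal("0.01")  # publisher's per-crawl price (USD), hypothetical

    def gate(request_headers: dict, registered_payer: bool):
        """Decide at the network edge whether a crawler may fetch a page.
        Returns (status, response_headers)."""
        offer = request_headers.get("crawler-max-price")
        if registered_payer or (offer and Decimal(offer) >= PRICE):
            # Access granted; the charge settles through the registered
            # payment credentials, with Cloudflare as merchant of record.
            return 200, {"crawler-charged": str(PRICE)}
        # No payment signal and no existing relationship: block and quote.
        return 402, {"crawler-price": str(PRICE)}

    print(gate({}, registered_payer=False))            # (402, {'crawler-price': '0.01'})
    print(gate({"crawler-max-price": "0.02"}, False))  # (200, {'crawler-charged': '0.01'})

The design choice worth noticing is that the decision runs before any bytes of content move, which is what makes the opt-out enforceable rather than advisory.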
What sets pay-per-crawl apart from other opt-out based approaches to licensing is that it is based on access to the content, not the content itself. It does not rely on embedding a signal into the content, which may or may not prove persistent. And it removes the transaction from the abstract realm of copyright to the concrete world of network engineering.
Pay-per-crawl’s other critical feature is scalability. Plenty of direct AI licensing deals are being struck these days between individual rightsholders and AI developers. Scarcely a week goes by, it seems, that Universal Music Group doesn’t announce a new one. But as I and many other analysts have noted, that’s not a model that can scale beyond the relative handful of publishers with archives valuable enough that developers will be willing to pay for them. Because pay-per-crawl is not based on the content itself, the size or perceived value of an archive is irrelevant.
With its acquisition of Human Native, Cloudflare adds another dimension to its AI licensing arsenal. Human Native’s business is aggregating datasets from rightsholders and making them available to AI developers. In between, it cleans and structures the data and ensures the necessary clearances are in place, making it easily usable for training AI models. It’s a different approach to creating a model that can scale.
“I think the two approaches are actually very complementary,” Human Native co-founder James Smith told me last week. “It’s about giving publishers control.”
The goal, Smith says, is to “create an alternative to scraping.” Scraping, he notes, yields “messy data” that is unstructured and often unreliable. The combination of Cloudflare’s access-control capability and a marketplace of pre-cleaned, pre-cleared datasets is intended to steer AI developers away from blind scraping and toward ready-to-use, quality datasets.
“What developers need from data,” Smith tells me, is “access, preparation, legality and quality. And it needs to be accessible in a marketplace.”
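As an illustration of what a single listing in such a marketplace might record to cover those four needs, here is a hypothetical schema; the field names are my own invention, not Human Native’s actual format.

    # A hypothetical marketplace listing covering access, preparation,
    # legality, and quality; the schema is illustrative, not Human Native's.
    from dataclasses import dataclass

    @dataclass
    class DatasetListing:
        dataset_id: str
        rightsholder: str
        access_endpoint: str   # access: where a licensed developer fetches it
        schema_version: str    # preparation: cleaned, structured format
        license_terms: str     # legality: clearances and permitted uses
        quality_checks: tuple  # quality: validation the aggregator ran

    listing = DatasetListing(
        dataset_id="news-archive-2024",
        rightsholder="Example Publisher Ltd.",
        access_endpoint="https://marketplace.example/datasets/news-archive-2024",
        schema_version="jsonl-v2",
        license_terms="AI training, worldwide, non-exclusive",
        quality_checks=("deduplicated", "language-verified", "rights-cleared"),
    )
    print(listing.license_terms)  # the clearance travels with the listing

The point of the exercise: in a marketplace, the clearance and quality metadata travel with the dataset, which is exactly what blind scraping cannot provide.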
It doesn’t hurt, he adds, that Cloudflare’s CDN carries roughly 20% of global internet traffic, including all those blind scraper bots, giving the project a built-in degree of scale. The ultimate goal, Smith says, is to build “an internet-scale infrastructure for data commerce.”
Watch this space.


Hey, great read as always. What if implementing machine-enforceable opt-outs at the network level creates significant latency for models or becomes a huge attack surface?
Excellent framing of the architecture problem. The shift from content-based signals to network-level access control is huge because it sidesteps the whole persistence issue with embedded metadata. I've seen similar patterns in enterprise security where perimeter controls always outperform embedded protections. Cloudflare carrying 20% of traffic gives this real teeth compared to past blockchain attempts that never reached critical mass.