The distinguishing characteristic of large language models (LLMs) is, as the name implies, their sheer size. Meta's LLaMA-2 and OpenAI's GPT-4 each comprise tens to hundreds of billions of parameters — the individual weights and variables they derive from their training data and use to process prompt inputs.
Scale is also the defining characteristic of the training process LLMs undergo. The datasets they ingest are almost incomprehensibly large — approaching the scale of the entire World Wide Web — and require immense amounts of computing capacity and energy to analyze.
Generative AI models don’t need to be that large to be useful, however. Researchers at Microsoft, for instance, recently published a technical report on a language model they call phi-1.5, comprising a mere 1.3 billion parameters, or about one one-hundredth the size of GPT-3.5, which powered the original ChatGPT. A new version, called phi-2, contains 2.7 billion parameters.
Those small language models were trained on correspondingly smaller, more selective datasets. And while not as capable as LLMs, in benchmarking tests the phi models displayed capabilities comparable to those of models five to ten times their size, according to the researchers. They even exhibited multi-modality, or the ability to process images as well as text.
If assigned to the right tasks, such small language models could prove to be perfectly suitable alternatives to large foundation models like GPT and LLaMA. And they could be developed and deployed at a fraction of the cost, and with a fraction of the computing capacity and energy requirements.
They could also help address other challenges posed by the rise of generative AI technology.
The scale of LLMs is a major contributing factor to the controversy over the use of copyrighted material in training. Even if it were to be established that any such use requires authorization from the rights owner, the sheer volume and diversity of data involved will make it extremely difficult to devise and implement a fair and manageable system to administer those authorizations.
Scale is also likely to defeat any attempt to attribute the output of LLMs to particular inputs for purposes of remuneration. With hundreds of billions of discrete parameters within a model, it is effectively impossible to know what any one parameter is accomplishing or to trace the calibration of that parameter to a particular input or set of inputs.
Scaling down the size of the models could help on both scores. If smaller but still capable models can be trained on smaller, more tightly curated datasets, the process would more readily lend itself to licensing. A few large archives licensed from a small number of individual rights owners might be sufficient to train models for specific applications.
The contributions of each archive to the model would also be less opaque, making attribution more plausible.
StabilityAI’s Stable Audio model, for instance, was trained on a mere 800,000 licensed tracks, for which rights owners were remunerated.
Even a single archive, if sufficiently comprehensive, could be adequate to train a usable, task-specific model.
Getty Images, for instance, recently unveiled an image generator in partnership with Nvidia that it claims was trained exclusively with Getty’s own library of licensed photos.
Adobe’s Firefly image generator was trained entirely on its own archive of photos and other images, according to the company.
The Big 3 record companies are all exploring music generator models that could be trained exclusively on their own internal libraries of tracks.
In addition to being cheaper for developers to create, small models trained on limited datasets offer benefits to users compared with relying on large, foundational language and diffusion models. Foremost among those benefits is protection against the potential copyright liability that can come with using applications built on Stable Diffusion, GPT or other large foundation models.
While most of the copyright litigation against generative AI developers has so far targeted foundation models, the companies behind those models are working feverishly to shift liability onto downstream application developers and closer to the end user (see below).
Models trained on smaller, more selectively curated datasets might also prove better suited to specific use cases.
In short, while the training and use of large foundation models may continue to pose difficult legal and copyright policy challenges due to their scale, traditional market forces could increasingly shift much of ordinary business and consumer use of generative AI technology to smaller, task-specific models trained on liability-free datasets. That could help shift the discussion around AI from an argument over fair use and derivative works to the more mundane concerns of cost, time-to-market and product design.
Watch List
No foundation? With barely a week to go before the next and final scheduled “trilogue” session on the EU AI Act, agreement on the final text of the law is looking further off than when the discussions started. Big technology companies, led by OpenAI, Microsoft and Google, have been lobbying feverishly behind the scenes to exempt large foundation models from the Act’s strictest provisions and shift most of the regulatory burden onto downstream applications. And they’ve managed to convince Germany, France and Italy, the EU’s three biggest powers, to adopt their position, upsetting what had seemed to be broad consensus among the member states and the EU Parliament. Ironically, should the tech companies get their way, what was once expected to become a global benchmark for AI regulation could end up being less robust than the executive order recently issued by the Biden Administration in the U.S. Euractiv. Corporate Europe Observatory.
Split decision The mystery of why Daryl Hall is suing his long-time music and business partner John Oates has been solved. Hall is seeking to block Oates from selling his interest in their joint venture to Primary Wave, which already owns a significant interest in the duo’s back catalog. Hall, who has expressed regrets over selling off part of the catalog 15 years ago, before streaming sent the price of music copyrights through the roof, claims Oates’ proposed sale of his stake now would violate the terms of the business agreement between the two. As the hot market for song catalogs continues apace, more disputes over deals involving jointly owned copyrights are inevitable. Billboard.
And another thing Reply comments are due next week (Dec. 6) in the U.S. Copyright Office’s inquiry into copyright and artificial intelligence. The office received nearly 10,000 comments in the first round, according to U.S. Register of Copyrights Shira Perlmutter, all of which have been or will be read by USCO staff. The comments and replies will inform the office’s report to Congress due next year on whether changes or additions to the Copyright Act are needed to address generative AI. USCO.