The technology and vision behind generative AI are amazing, but stealing the work of the world's creators to build it is not.
There are three key things that AI companies need to build their models, three key resources -- people, compute and data. That is, engineers to build the models, GPUs to run the training process and data to train the models on. AI companies spend vast sums on the first two, sometimes a million dollars per engineer and up to a billion dollars per model. But they expect to take the third resource, training data, for free.
Right now, many AI companies train on creative work they haven't paid for or even asked permission to use. This is unfair and unsustainable. But if we reset, and license our training data, we can build a better generative AI ecosystem that works for everyone, both the AI companies themselves and the creators, without whose work these models would not exist.
Most AI companies today do not license the majority of their training data. They use web scrapers to find, download and train on as much content as they can gather. They're often pretty secretive about what they do train on, but what's clear is that training on copyrighted work without a license is rife.
For instance, when the Mozilla Foundation looked at 47 large language models released between 2019 and 2023, they found that 64 percent of them were trained, in part, on Common Crawl, a dataset that includes copyrighted works, such as newspaper articles from major publications. And a further 21 percent didn’t reveal enough information to know either way. Training on copyrighted work without a license has rapidly become standard across much of the generative AI industry.
But this training, this unlicensed training on creative work, has serious negative consequences for the people behind that work. And this is for the simple reason that generative AI competes with its training data. This is not the narrative that AI companies like to portray. We like to talk about democratization, about letting more people be creative. But the fact that AI competes with its training data is inescapable. A large language model trained on short stories can create competing short stories. An AI image model trained on stock images can create competing stock images. An AI music model trained on music that's licensed to TV shows can create competing music to license to TV shows. These models, however imperfect, are so quick and easy to use that this competition is inevitable.
And this isn't just theoretical. Generative AI is still pretty new, but we're already seeing exactly the sort of effects you'd expect in a world in which generative AI competes with its training data. For instance, the well-known filmmaker Ram Gopal Varma recently said that he'll use AI music in all his projects going forward. Indeed, there are multiple reports of people starting to listen to AI music in place of human-produced music, and recently, an AI song hit number 48 in the German charts. In all these cases, AI music is competing with the songs it was trained on.
Or take Kelly McKernan. Kelly is an artist from Nashville. For 10 years, they made enough money selling their work that art was their full-time income. But in 2022, a dataset that included their works was used to train a popular AI image model. Their name was one of many used by huge numbers of people to create art in the style of specific human artists. Kelly's income fell by 33 percent almost overnight. Illustrators around the world report similar stories, being outcompeted by AI models they have reason to believe were trained on their work.
The freelance platform Upwork wrote a white paper looking at the effects of generative AI that they've seen on the job market. They looked at how job postings on their platform have changed since the introduction of ChatGPT, and sure enough, they found exactly what you'd expect: generative AI has reduced the demand for freelance writing tasks by 8 percent, a figure that rises to 18 percent if you look only at what they term lower-value tasks.
So the initial data we have, plus the individual stories we hear, all align with the logical assumption: "Generative AI competes with the work it's trained on." It's so quick and easy to use, it's inevitable, and it competes with the people behind that work.
Now creators argue this training is illegal. The legal framework of copyright affords creators the exclusive right to authorize copies of their work, and AI training involves copying. Here, in the US, many AI companies argue that training AI falls under the fair use copyright exception, which allows unlicensed copying in a limited set of circumstances, such as creating parodies of a work. Creators and rights holders strongly disagree, saying there's no way this narrow exception can be used to legitimize the mass exploitation of creative work to create automated competitors to that work. And for the record, I entirely agree.
Of course, this question has yet to be tested in the courts, and there are currently around 30 ongoing lawsuits brought by rights holders against AI companies, which will help to address it. But this will take time, and creators are suffering from what they see as unjust competition right now.
So they propose a solution that has been used and worked before -- licensing. If a commercial entity wants to use copyrighted work, be it for merchandise manufacturing or building a streaming service, they license that work. Now AI companies have a bunch of reasons why this shouldn’t apply to them. There’s the fair use legal exception that I’ve already mentioned. There's also the argument that since humans can train on copyrighted work without a license, AI should be allowed to, too. But this is a very hard claim to justify. Artists have been learning from each other for centuries. When you create, you expect other people to learn from you. You learn from a range of sources, from other art to textbooks to taking lessons. Much of this you or someone else paid for, supporting the entire ecosystem. In generative AI, commercial entities valued at millions or billions of dollars scrape as much content as they can, often against creators' will, without payment, making multiple copies along the way -- which are subject to copyright law -- to create a highly scalable competitor to what they're copying. So scalable, in fact, that there are AI image generators estimated to be making 2.5 million images a day and AI song generators outputting 10 songs a second. To argue that human learning and AI training are the same and should be treated the same is preposterous.
AI companies also argue that licensing their training data would be impractical. They use so much training data, they say, that individual payments to each creator behind the data would be small. But this is true of many content-licensing markets. Creators still want to get paid, even if the payments are small. AI companies also argue that they simply use too much data for licensing to even be feasible. But this is harder and harder to believe in a world in which there is such a range of datasets that you can access with permission. You can license data from media companies. There have been 27 major deals between AI companies and rights holders in the last year alone, and that's to say nothing of the smaller ones that don't get reported. There are marketplaces of training data where you can get more data. You can expand this with data that's in the public domain -- that is, in which no copyright exists, like the 500-billion-word dataset Common Corpus. You can expand this further with synthetic data -- that is, data that's itself created by an AI model, in which usually no copyright exists. So there are multiple options available to you if you want to build your model without infringing copyright.
But the strongest evidence that it's possible to license all your data is that there are multiple companies doing it already. I know, because I've done it myself. I've worked in what we now call generative AI for over a decade, and last September, my team at Stability AI released an AI music model that trained on licensed music.
A number of other companies have done the same thing, and I founded Fairly Trained in order to highlight this fact, and these companies. Fairly Trained is a nonprofit that certifies generative AI companies that don't train on copyrighted work without a license. We launched in January of this year, and we've already certified 18 companies. Now these companies take a variety of approaches to licensing their training data. We have an AI voice model that's trained on individual voices it has licensed. We have an AI music model that has licensed more than 40 music catalogs. We have a large language model that's trained only on data in the public domain, mostly from government documents and records. We have companies who have paid upfront fees for their data. We have companies who share their revenue with their data providers. There is no one answer to exactly how one of these licensing deals has to work. The beauty of licensing is that the two parties can come together and figure out what works for them. And this is happening more and more now.
You will hear that a requirement to license training data somehow stifles innovation, that it's only the big AI companies that can afford these huge upfront licensing fees. But in reality, it's the smaller start-ups who are bothering to license all their data, and they're doing so, often, without hefty upfront licensing fees, but using models such as revenue shares.
And there's another major upside to licensing your training data. All of this training on copyrighted work is forcing publishers to shut off access to their content. The Data Provenance Initiative looked at 14,000 websites commonly used in AI training sets, and they found that, over the course of a single year, looking only at the domains of highest value for AI training, the proportion restricted via opt-outs or terms of service increased from three percent to between 20 and 33 percent. The web is being gradually closed due to unlicensed training. Now this is bad for new AI models, for new entrants to the market, but also for everyone -- researchers, consumers and more, who benefit from an open internet.
It should come as no surprise that the general public do not agree with AI companies about what they can train their models on. One poll from the AI Policy Institute, in April, asked people about the common policy among AI companies of training on publicly available data. This is data that is openly available online, which of course includes a lot of copyrighted work, like news articles and, often, pirated media. 60 percent of people said this should not be allowed versus only 19 percent who said it should. The same poll went on to ask whether AI companies should compensate data providers. 74 percent said yes, and only nine percent said no. Time and time again, when we ask the public these questions, they show support for requirements around permission and payment, and a rejection of the notion that something being publicly available somehow makes it fair game.
And the people who make the art that society consumes feel the same way. Today, we launched a "Statement on AI Training," a short, simple open letter, which simply reads: “The unlicensed use of creative works for training generative AI is a major, unjust threat to the livelihoods of the people behind those works, and must not be permitted.” This has already been signed by 11,000 creators and counting around the world, including Nobel-winning authors and Oscar-winning actors and composers. And if you agree with this sentiment, I encourage you to sign it today at aitrainingstatement.org.
What this statement and previous ones like it make abundantly clear is that these artists, these creators, view the unlicensed training on their work by generative AI models as totally unjust and potentially catastrophic to their professions. So if you are an advocate for unlicensed AI training, just remember that the people who wrote the music that you are listening to and the books you’re reading probably disagree.
So where does this leave us? Well, right now, many of the world's artists, writers, musicians, creators straight-up hate generative AI. And we know, from their own words, that one of the reasons for this is that we're training on their work without asking them. But it doesn't have to be this way. The AI industry and the creative industries can be and should be mutually beneficial. But for this mutually beneficial relationship to emerge, we have to start from a position of respect for the value of the works being trained on and the rights of the people who made them.
I'm not arguing that all AI development should be halted. I'm not arguing that AI should not exist. What I'm arguing is that the resources used to build generative AI should be paid for. Licensing is hard work. It will slow you down in the short term, but you'll ultimately reach exactly the same point -- models that are just as capable, just as powerful -- and you'll do so without forcing the world's publishers to batten down the hatches and destroy the commons, and without pitting the world's creators against you.
So I hope that more AI companies will follow the example set by those we've certified at Fairly Trained, and license all their training data. I hope that employees at these companies will demand this of their employers. And I hope that everyone who uses generative AI will ask what their favorite models were trained on. There is a future in which generative AI and human creativity can coexist, not just peacefully, but symbiotically. It's been a rough start, but it's not too late to change course.
Thank you.
(Applause)