This week, a YouTuber from Massachusetts named David Millette filed a class action lawsuit against OpenAI, alleging that the company has been training its models on YouTube transcripts without notifying creators like him or providing compensation. Millette’s suit argues that OpenAI creates “valuable” products based on “works that were copied…without consent, without credit, and without compensation.” He’s seeking a jury trial and over $5 million in damages on behalf of all YouTube users and creators whose data was scooped up by OpenAI for training. Neither Google nor OpenAI has yet commented on the suit, though it’s worth noting that either company could pay $50 million tomorrow and never think about it again.
Once again, this case pivots on an extremely open question in the field of AI, one we’ve discussed previously: does training an AI model on someone else’s content constitute a copyright violation? Some copyright questions are relatively cut-and-dried – if I download a YouTube video, repost it somewhere else, and collect ad revenue, that’s a clear violation – but many, if not most, of the practical debates happening in chat rooms, internet forums, and now courtrooms nationwide are considerably murkier.
These questions pick at the edges of our understanding of copyright, intellectual property, and creative work more generally. What if I download just a brief portion of a YouTube video and repost it somewhere with my own commentary? Is that stealing or remixing? What if I repost a segment of the audio underneath the original visuals? What if I re-edit the video and add my own creative spin to it? And what if I transcribe a YouTube video, then use that text to train a generative AI model? Take OpenAI’s ChatGPT, which can engage a human user in (reasonably) authentic casual conversation and answer queries in plain text. ChatGPT and other AI models like it need to train on vast amounts of material in order to work properly.
It doesn’t even matter what, specifically, those training materials are, so long as they represent real writing by human authors. (The hope is that one day, models might be able to train on “synthetic data” rather than real examples of human writing, but so far, this approach remains impractical.) A data trove known as “The Pile,” which was used by a number of major tech companies like Salesforce and Apple to train their AI software, included a grab-bag of materials such as Wikipedia entries, archives from the European Parliament, and old corporate emails sent by Enron executives that were released as part of the federal investigation into the failed energy company. So long as text was written by human beings communicating in everyday language, it can help train an AI model that hopes to recreate that form of written communication.
The Pile also included transcripts from hundreds of thousands of YouTube videos, compiled from over 48,000 channels. These included brands like MIT, Khan Academy, and NPR, as well as individual creators like tech reviewer Marques Brownlee and comedian/game streamer Jacksepticeye.
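For a sense of how little effort this takes, here’s a minimal sketch of transcript collection using the open-source youtube-transcript-api Python package. It’s purely illustrative – the video ID is a placeholder, and this is not the tooling The Pile’s compilers actually used:

```python
# Illustrative only: turning public YouTube captions into plain-text training data.
# Requires: pip install youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

video_ids = ["dQw4w9WgXcQ"]  # placeholder; a real crawl would span thousands of channels

corpus = []
for video_id in video_ids:
    # Each transcript comes back as a list of {"text", "start", "duration"} segments.
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    corpus.append(" ".join(segment["text"] for segment in segments))

print(corpus[0][:200])  # preview the recovered transcript
```

A few dozen lines like these, pointed at a channel list and left running, yield exactly the kind of conversational human text a language model trains on.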
Apparently, this is just one example of many. Earlier this week, Samantha Cole of 404 Media reported that Nvidia staffers downloaded a huge collection of videos from YouTube, Netflix, and elsewhere to help train the company’s commercial AI software, including the Omniverse 3D world generator, a self-driving car system, and other products. The graphics card maker’s advanced chips run the industry’s massive AI workloads, which require intense amounts of processing power to function. The explosion of interest in AI has made Nvidia the world’s most valuable company, with a market capitalization of $3.34 trillion as of June 2024. They’re now worth more than Apple or Microsoft, and yet they’re covertly “borrowing” content from YouTubers.
In a response to Engadget, Nvidia representatives argued that the company’s use of YouTube, Netflix, and other copyrighted data for training was “in full compliance with the letter and the spirit of copyright law.” They suggested that these laws protect specific expressions, not the “facts, ideas, data, or information” scooped up by AI models during training, which the programs then use “to make their own expression(s).”
Based on responses gathered by Business Insider from the company’s own AI chatbot, Facebook, Instagram, and WhatsApp owner Meta also trained its AI on YouTube transcripts. According to Meta’s AI service, the company built its own bot – called Meta Scraping and Extraction, or MSAE – which it used to gather vast amounts of data from the web for AI training. Meta AI reports that its training dataset features 3.7 million YouTube transcripts, even though YouTube’s terms of service specifically prohibit the use of bots and scrapers to collect data from videos. A Meta spokesperson didn’t specifically deny the AI chatbot’s responses to Business Insider, but suggested that perhaps it had been mistaken in its answers.
(So either the chatbot is telling the truth, and Meta is sort of covering up the fact that they transcribed 3.7 million YouTube videos they don’t own… or Meta’s AI chatbot is outright lying to this reporter for no clear reason. This story is a bit of a lose-lose for Meta.)
Back in April, the New York Times revealed that OpenAI used transcriptions of over 1 million hours of YouTube content to train its GPT-4 model. YouTube transcripts were such a central part of GPT-4 training that OpenAI developed its own audio transcription model, known as Whisper, to speed the process along. According to the Times, OpenAI President Greg Brockman was directly involved in selecting and gathering the YouTube videos as part of this effort.
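Whisper was eventually released as open source, so the general technique is easy to illustrate. Here’s a minimal sketch using the openai-whisper Python package – the file names are hypothetical, and this is a sketch of the approach, not OpenAI’s actual internal pipeline, which has never been made public:

```python
# A minimal sketch of bulk audio-to-text transcription with open-source Whisper.
# Requires: pip install openai-whisper
import whisper

model = whisper.load_model("base")  # larger checkpoints trade speed for accuracy

audio_files = ["video_0001.mp3", "video_0002.mp3"]  # hypothetical file names

for path in audio_files:
    result = model.transcribe(path)
    # Each result carries the full transcript as plain text, ready to be
    # cleaned and folded into a training corpus.
    print(path, "->", result["text"][:80])
```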
OpenAI’s response treats training an AI model as functionally equivalent to a human being reading an article or watching a video. After all, it’s not a copyright violation to read a book and learn from the information it contains. OpenAI spokesperson Lindsay Held described the company’s software in the same fashion, telling the Times that its models all have “a unique data set that we curate to help their understanding of the world.” It’s a clever rhetorical strategy, shuffling the agency off to the computer – OpenAI is simply helping its software understand the world – rather than the humans who built the computer and told it what to do.
But semantics aside, the question remains: Is transcribing a YouTube video and then using it to train an AI chatbot just an example of on-the-job learning? Does it amount to actual intellectual property theft? Or is it somewhere in between? Considering the vast amounts of money to be made or lost in the AI space, and the necessity of using previously published materials to train these models, IP and copyright conflicts will almost certainly be settled in a courtroom. And probably sooner rather than later.
Generative AI technology can’t continue advancing at anywhere near its current pace without fresh training materials, and now that the word is out about how important and valuable those materials are, publishers and online platforms have gotten smarter about protecting their archives from being scraped. Consider, as well, that the sheer volume of material AI companies need to properly train their systems means they couldn’t actually pay for all the content they use while still turning any kind of profit. It would be impractical to pay every YouTuber for the rights to their videos, even if all those videos added up to the perfect AI chatbot.
As Silicon Valley attorney Sy Damle explained in a 2023 discussion about copyright law: “the only practical way for [AI] tools to exist is if they can be trained on massive amounts of data without having to license that data. The data needed is so massive that even collective licensing can’t really work.” So free, unfettered access to other people’s content, including the work of individual content creators, is an existential matter for AI companies. They need these rulings to go their way in order to keep making their products at all.
When the AI hype cycle was new, the questions surrounding it were theoretical and abstract. Would it ever be possible for a computer to make its own videos, or even become an influencer? Could these tools complement the work of human creators rather than replace them? Now that the world has had a few years to adjust, and has seen both the creative potential and the drawbacks of these tools, we’ve come around to the practicalities. Who holds the rights to the information available to all of us on the internet, and who gets to decide how it’s used?