A new report from Wired, co-published with Proof News, confirms what many have long suspected, and what The New York Times first suggested back in April: even though it’s prohibited by the platform’s Terms of Service, and seemingly violates the copyrights digital creators hold over their own work, AI models have been training on YouTube videos.
Here’s how it works. A non-profit research group called EleutherAI created a dataset known as the Pile, which contains a vast trove of textual information taken from across the internet. This includes articles from English-language Wikipedia, records from the US Patent Office, PubMed, the archives of the European Parliament, and even employee emails from the Enron Corporation that were made public as part of a federal investigation into the defunct energy firm. The Pile also contains a dataset called “YouTube Subtitles,” consisting of plain-text video transcripts, mostly in English but sometimes including translations into other languages as well.
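To picture what’s actually in that dataset: the Pile is distributed as compressed JSON Lines files, with each record carrying the raw text plus a label naming which component it came from. Here’s a minimal sketch of filtering one local shard down to just the YouTube transcripts. The record layout follows the Pile’s published format, but the exact set name and the shard filename here are assumptions.

```python
import io
import json

import zstandard as zstd  # Pile shards ship as zstd-compressed JSON Lines

# A minimal sketch, not a definitive pipeline: pull the "YouTube Subtitles"
# records out of one local Pile shard. The record layout ({"text": ...,
# "meta": {"pile_set_name": ...}}) follows the Pile's published format, but
# the set name "YoutubeSubtitles" and the filename below are assumptions.
def iter_youtube_subtitles(shard_path: str):
    with open(shard_path, "rb") as raw:
        reader = zstd.ZstdDecompressor().stream_reader(raw)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            record = json.loads(line)
            if record.get("meta", {}).get("pile_set_name") == "YoutubeSubtitles":
                yield record["text"]  # the transcript itself, as plain text

# Peek at the first matching transcript in a shard
for transcript in iter_youtube_subtitles("00.jsonl.zst"):
    print(transcript[:300])
    break
```

Because the transcripts are stored as ordinary text alongside everything else, a model trained on the Pile ingests a creator’s words the same way it ingests a Wikipedia article.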
By Wired and Proof’s estimates, the Pile includes subtitles from over 170,000 YouTube videos, taken from more than 48,000 different channels. These include professional networks and brands like Khan Academy, MIT, Harvard, and NPR, but also plenty of popular independent creators: YouTube’s most-subscribed superstar Jimmy “MrBeast” Donaldson had his videos scraped, as did tech reviewer Marques Brownlee, political commentator David Pakman, and gamers Jacksepticeye and PewDiePie, among many others. (Proof has a tool you can use to determine if your videos have been included in the Pile.)
All of that content has been scooped up by EleutherAI, added to the Pile, and then integrated into AI models from all sorts of major tech companies, including Salesforce, Nvidia, Anthropic, and even Apple. Once these models are trained, it’s not entirely clear – even to their creators – exactly how these data sets will be reimagined and repurposed into original outputs. And there’s no way to “untrain” an AI model. Essentially, what’s done is done.
And as large a collection of data as the Pile represents, it’s only the tip of the iceberg. Creating the large language models that power apps like ChatGPT and its follow-ups was already a gargantuan task. But fine-tuning them requires even more training data, curated and specialized for particular tasks. To create its GPT-4 upgrade, OpenAI transcribed more than 1 million hours of YouTube videos.
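That transcription work reportedly leaned on Whisper, OpenAI’s own speech-to-text model, which is also available as open source. As a rough sketch of what a single transcription step might look like (the model size and audio filename here are illustrative; the pipeline OpenAI actually ran at scale has not been disclosed):

```python
import whisper  # pip install openai-whisper

# A rough sketch of turning downloaded audio into training text with the
# open-source Whisper package. The "base" model and the filename are
# illustrative assumptions, not OpenAI's actual production setup.
model = whisper.load_model("base")               # small multilingual model
result = model.transcribe("downloaded_audio.mp3")
print(result["text"])                            # plain-text transcript
```

Run across a million-plus hours of video, a loop like this turns spoken content into exactly the kind of text corpus the Pile’s subtitles already provide.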
Annie Gilbertson – one of the two reporters behind the Wired and Proof News piece – spoke with Passionfruit about the massive scale of AI data collection, and the impact it’s already having on creators.
“The Pile is still a small fraction of what companies use to train their models,” she said. “Those that watchdog AI see the scale as akin to gobbling up the internet. The research group Epoch AI estimates that in the next few years, AI will have consumed all human-generated public text.”
Developers of AI apps insist that they’re not stealing or plagiarizing training material, because the models don’t just repeat what they’ve read but generate original outputs, thus qualifying as “fair use.” In practice, this doesn’t always hold up to scrutiny. YouTuber David Pakman notes in the Wired piece that he’s already encountered AI clips on TikTok that make it sound like Tucker Carlson is saying words Pakman originally spoke on his own YouTube channel. Just a few weeks ago, Forbes called out AI startup Perplexity for plagiarizing one of its news articles.
OpenAI CTO Mira Murati told the Wall Street Journal that she’s not 100% sure whether YouTube was also used to train the company’s text-to-video app, Sora. YouTube CEO Neal Mohan suggests this would be against the rules, but it’s not clear exactly what, if anything, he plans to do about it. After all, Google also trained its own AI models on YouTube videos. For the time being, it’s not at all certain whether any major company or platform will have the backs of its individual creators and users in this fight, or even whether there’s anything left to be done.
As Gilbertson explained, “there is no requirement for companies building AI to be transparent about what training data their engineers use. They often don’t disclose anything. It took us weeks of going through research papers and posts to find evidence of who was using YouTube Subtitles, and there’s no reason to believe we’ve identified every user.”
Our fundamental understanding of artistic licensing and copyright protection seems fairly clear on this: no one else is allowed to appropriate or reuse your work without your permission and, depending on your terms, compensation. Self-interested companies gambling their future solvency and relevance on AI apps clearly hope to work around these definitions, and it seems likely that the courts will ultimately have to rule on some of these issues. But protracted legal fights cost money, and digital creators vs. the entire technology industry makes for a lopsided fight.
“There is ongoing litigation, which is a slow process, as well as legislative proposals that would require transparency when using copyrighted work,” Gilbertson told Passionfruit. “Until the rules are clarified, I expect big tech will continue to write their own.”
Seeing this drama play out on YouTube, where so many large corporations with near-infinite resources have used copyright law to limit the reach and financial independence of independent creators, makes at least one internet double standard abundantly clear. Apple might single-handedly demonetize a video of someone reviewing Apple TV+ shows for using too much of its content, then turn around and train its AI models on that reviewer’s videos, then use those models to generate original video content without hiring or supporting a real human creator. So is copyright enforceable or not? Are companies the only ones who own their content, or do individual people have the same rights?
Some creators appeared to take the news in relative stride. Marques Brownlee, among the YouTubers whose videos Wired confirmed had been transcribed for the Pile, tweeted that he feels Apple “technically avoids ‘fault’ here because they’re not the ones scraping.” Gilbertson told me that a few creators she spoke with have experimented with AI tools themselves, and some were unsurprised that the tech industry’s “reverse Robin Hood mentality” had finally come for their work.
But for the most part, creators found the move inappropriate and even “disrespectful… as if engineers [see] YouTubers as too unimportant to even inform.” In a statement to Wired, Julie Walsh Smith – the CEO of Complexly, the studio behind scraped YouTube shows like “Crash Course” and “SciShow” – hit upon the central point that most creators seem to be raising. “We are frustrated to learn that our thoughtfully produced educational content has been used in this way without our consent,” she wrote.
Many if not most digital creators would probably have approved the use of their work to train AI models had they received an actual formal request, and especially if they’d been compensated. It’s the “better to ask forgiveness than permission” approach, particularly in an environment where creators themselves are not afforded the same privilege when it comes to copyright or fair use, that’s so galling.
The legal system may be the only remaining recourse for creators at this point, though Gilbertson also suggests that some are holding out hope that AI apps will be their own undoing. Many creators remain dubious that these apps can produce the kind of worthwhile or compelling content their developers have long promised. “I also heard mixed beliefs about whether AI content generation will be a threat to [creators’] livelihood in the future,” Gilbertson said. “Several told me audiences won’t connect to material made by a machine in the same way they connect to a person’s creative work.”
If AI apps can’t ever duplicate the work of real-world YouTubers, regardless of how many videos are used in their training, they’ll remain a potential violation of copyright law. (Not to mention YouTube’s own stated rules.) But at least they won’t be putting creators out of a job any time soon.