DMR News

Advancing Digital Conversations

Apple and NVIDIA Among Companies Using YouTube Transcripts Without Permission to Train AI Models

By Hilary Ong

Jul 19, 2024

Apple, NVIDIA, Anthropic, and Salesforce used YouTube transcripts without permission to train their AI models, according to an investigation by Proof News. The dataset in question, created by the nonprofit EleutherAI, includes transcripts from 173,536 YouTube videos spanning more than 48,000 channels.

Prominent creators like Marques Brownlee and MrBeast, alongside major news outlets such as The New York Times, BBC, and ABC News, have their content included in this dataset, highlighting a significant issue in AI development: the use of data from creators without their consent or compensation.

What’s Included in the Dataset?

The dataset, known as “YouTube Subtitles,” is part of a larger collection called “The Pile.” Its transcripts cover a wide variety of content, including:

  • Transcripts from educational providers like Khan Academy, MIT, and Harvard.
  • Videos from media outlets like The Wall Street Journal, NPR, and the BBC.
  • Transcripts from entertainment shows like “The Late Show With Stephen Colbert.”
  • Content from YouTube stars such as MrBeast, Jacksepticeye, and PewDiePie.

Proof News contributor Alex Reisner, who uncovered The Pile last year, created a searchable database of its content, emphasizing that intellectual property owners should know if their work is being used to train AI systems.

Marques Brownlee confirmed on X that Apple sourced data for its AI from several companies, some of which scraped transcripts from YouTube videos, his own among them. He predicted that the issue will remain a problem for a long time.

Google responded to these revelations by reiterating YouTube CEO Neal Mohan’s stance that using YouTube’s data to train AI models violates the platform’s terms and conditions. Jennifer Martinez, a spokesperson for AI startup Anthropic, stated that while the company had used The Pile to train its generative AI assistant, YouTube’s terms cover direct use of its platform, which is distinct from use of The Pile dataset. She referred questions about potential violations of YouTube’s terms of service to the dataset’s authors.

Apple clarified in an email to Mashable that its open-source language model, OpenELM, did use the dataset, but emphasized that the project is solely for research purposes and will not underpin any of its AI services, including Apple Intelligence. Apple also highlighted its option for websites to opt out of AI training and said that its models are built using high-quality, licensed, and publicly available data.

Salesforce also told Mashable in an email that The Pile was used to train an AI model in 2021 strictly for academic and research purposes. Salesforce emphasized that the dataset was publicly available and released under a permissive license.

The lack of transparency from AI companies about their training data sources has sparked criticism. Artists and photographers recently criticized Apple for not disclosing the sources of training data for Apple Intelligence, its new generative AI feature.

YouTube, as the world’s largest repository of videos, offers a vast amount of data, making it an attractive resource for AI training. Earlier this year, OpenAI’s chief technology officer, Mira Murati, avoided specifying whether YouTube videos were used to train Sora, an upcoming AI video generation tool. Alphabet CEO Sundar Pichai also warned that using YouTube data for AI training violates the platform’s terms of service.

Discovery and Impact of The Pile

Reisner explained why he built the searchable database after discovering The Pile: “I think it’s hard for us as a society to have a conversation about AI if we don’t know how it’s being built. I thought YouTube creators might want to know that their work is being used. It’s also relevant for anyone who’s posting videos, photos, or writing anywhere on the internet because right now AI companies are abusing whatever they can get their hands on.”

Although The Pile has been removed from its official download site, it remains available on file-sharing services. Creators who want to check whether their YouTube video subtitles are included in the dataset can use a lookup tool provided by Proof News.


Featured Image courtesy of metamorworks/Getty Images/iStockphoto
