Apple, Nvidia, Anthropic reportedly trained AI models on YouTube data without consent
An investigation by Proof News found Apple, Nvidia, and Anthropic, among others, have used material from thousands of YouTube videos to train AI despite YouTube’s rules against harvesting materials from the platform without permission.
Annie Gilbertson and Alex Reisner for Wired:
Tech companies are turning to controversial tactics to feed their data-hungry artificial intelligence models, vacuuming up books, websites, photos, and social media posts, often unbeknownst to the creators.
Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce.
The dataset, called YouTube Subtitles, contains video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard. The Wall Street Journal, NPR, and the BBC also had their videos used to train AI, as did The Late Show With Stephen Colbert, Last Week Tonight With John Oliver, and Jimmy Kimmel Live.
Proof News also found material from YouTube megastars, including MrBeast (289 million subscribers, two videos taken for training), Marques Brownlee (19 million subscribers, seven videos taken), Jacksepticeye (nearly 31 million subscribers, 377 videos taken), and PewDiePie (111 million subscribers, 337 videos taken). Some of the material used to train AI also promoted conspiracies such as the “flat-earth theory.”
Representatives at EleutherAI, the creators of the dataset, did not respond to requests for comment on Proof’s findings, including allegations that videos were used without permission… According to a research paper published by EleutherAI, the dataset is part of a compilation the nonprofit released called the Pile… Most of the Pile’s datasets are accessible and open for anyone on the internet with enough space and computing power to access them.
Support MacDailyNews at no extra cost to you by using this link to shop at Amazon.
MacDailyNews Take: So, it looks like EleutherAI, not Apple, downloaded the YouTube Subtitles data. Apple hadn’t responded to Wired’s request for comment at the time of publication, so it’s possible that Apple, Anthropic, Nvidia, and the others were unaware that YouTube Subtitles was included in the dataset. Generative AI needs data, a lot of data, and it’s clearly in its “Wild West” stage.
Proof News offers a tool that allows users to search YouTube videos to reveal those used to train generative AI here.
The post Apple, Nvidia, Anthropic reportedly trained AI models on YouTube data without consent appeared first on MacDailyNews.