"Classic Pirate Sites" Power Most GenAI Models (Not Just Meta's), New Research Shows

Leading AI "Forensics" Expert Lays Out The Data In His New Report (Available Here)

Apr 29, 2025

Welcome to your weekly “brAIn” dump! This time, on a Tuesday and at a later time (consider this my very own A/B test). My “mAIn event” features new research by leading AI “forensics” expert Thomas Heldrup that shows, with reams of supporting data, widespread AI training on content (even paywalled content) from “classic pirate sources.” Then, my “AI video of the week”: my recent “Future of AI & Music” talk. Next, the “mosAIc,” a collage of important AI stories and podcasts I curate for you. Finally, the “AI Litigation Tracker,” with updates on key GenAI-focused IP cases by Partner Avery Williams of McKool Smith (access the full “Tracker” via this link).

But First …

Will Hollywood hit an iceberg with generative AI?

Well, famed Hollywood “Titanic” director James Cameron believes there should be no copyright scrutiny on the “input” (training) side of the AI equation — i.e., he believes that scraping of copyrighted works without consent and compensation should be permissible. Instead, he argues that copyright scrutiny should happen only on the “output” (display) side of the equation. He likens AI model scraping to our own everyday human “scraping” of the “data” that surrounds us in life. “We’re all models. All reacting to our training data,” he says in this video. “You can’t control the input.” Do you agree with Cameron? Take the poll below.

I. The mAIn Event - AI & The “Classic Pirate Sources” Used to Fuel It

By Thomas Heldrup, Head of Content Protection & Enforcement, Danish Rights Alliance, and a Leading AI “Forensics” Expert

[Peter’s editorial note: Heldrup’s exclusive new report, titled “Report on Pirated Content Used in the Training of Generative AI,” is available here as a free download.]

I Won’t Bury The Lead: Here’s Heldrup’s Key Take-Away

“[G]enerative AI model providers across the spectrums of LLMs, image, video, music generation have obtained infringing copies of copyrighted protected works that were sourced from what I call ‘classic pirate sources’ such as illegal file-sharing and streaming sites.”

Heldrup’s Summary of His Findings

[Peter’s editorial note: Here are Heldrup’s findings, in his own words, to back up his central conclusion].

“So what is being mashed together when generative AI models produce output? To a large extent it’s infringing copies of copyright protected content sourced from classic ‘pirate’ sites such as illegal file sharing and streaming sites (LibGen, Anna’s Archive, Z-library, Watchseries, OpenSubtitles, etc.).”

Heldrup points out that “these datasets contain content spanning books, press publications, song lyrics, recorded music, movies and TV.”

Heldrup further says that his research demonstrates “the prevalence of pirated content in generative AI training data and how this content is being collected.” In his words, “AI models — from just about every major GenAI developer — have been trained on copyrighted works.”

Meta Is Most Brazen. But Even Apple Does It.

Heldrup calls out Meta, in particular, for brazenly continuing to leverage pirate sources like LibGen, despite facing lawsuits in the U.S. over its use of those sources. Meta notoriously refuses to entertain any media licensing deals. A recent “must read” feature story in Vanity Fair captures the depth of Meta’s contempt. Titled “This Is How Meta AI Staffers Deemed More Than 7 Million Books to Have No ‘Economic Value,’” the article describes how top Meta executives looked the other way in the wake of “the company’s decision to train its model on a database containing more than 7 million pirated books.” They knew exactly what they were doing (infringing the works of countless authors), but went full steam ahead anyway, according to the story.

But Meta certainly isn’t alone. Heldrup reports that Apple is guilty too, despite its artist-first and privacy-focused marketing and overall positioning.

In a critical finding, Heldrup concludes that several major AI developers’ training data sets even include copyrighted paywalled content from Netflix, YouTube and other streaming platforms. Those AI developers, he says, use “stream ripping” software to circumvent DRM protections. New research by the University of Washington, the University of Copenhagen, and Stanford backs up Heldrup’s research and conclusions, as recently reported in TechCrunch.

A 4th Potential Kind of AI Copyright Infringement?

So do these findings (i.e., copyrighted content accessed via “classic pirate sites” for AI training purposes) open the door to a potential new 4th infringement claim on the part of content owners against AI developers? The Kadrey v. Meta plaintiff authors certainly think so (Kadrey is one of the most high-profile AI copyright infringement cases). And it’s hard to argue with their position when Meta concedes it uses pirated data sets for training.

So, it certainly looks like IP rights-holders may have 4 separate potential copyright infringement claims in the context of GenAI: (1) AI “training” infringement, (2) RAG infringement, (3) display/output infringement, and now (4) “classic pirate site” sourcing infringement.

Expert, Researcher & Author Thomas Heldrup is Head of Content Protection & Enforcement at the Danish Rights Alliance, a leading member-driven, non-profit organization that represents the interests of Danish rights holders in the creative industries (film, music, literature). I was introduced to Heldrup months ago by other leading AI experts, who described him as one of the world’s foremost AI “forensics” experts.

Feel free to send Peter Csathy your feedback at peter@creativemedia.biz.

II. Video of the Week - The Future of Music (In An Increasingly AI World): My Recent Talk

I recently gave a talk about “The Future of Sound” — focusing on AI and music — at AI expert Curt Doty’s great El sAIlon event in Santa Fe, New Mexico. I was joined by other experts who offered their perspectives. It’s an overall presentation you’ll find both informative and entertaining. Watch it by clicking on “play” in the video below.

III. The mosAIc — “Must Read” Stories/Podcasts

(1) The Washington Post Licenses Its Content To OpenAI

In yet another major AI content licensing deal, Jeff Bezos’s The Washington Post just signed with Sam Altman’s OpenAI. No financial terms were disclosed, nor was it clear whether “training” was part of the overall deal (likely not). This is yet another indicator that a vibrant and growing market exists for IP rights-holders to license their content to AI developers, a reality that undercuts OpenAI’s (and the others’) claims of “fair use” (as I’ve written repeatedly).

(2) Meanwhile Ziff Davis Sues OpenAI

On the flip side, one of the biggest U.S. publishers chose the other path, suing OpenAI for mass infringement and seeking hundreds of millions of dollars. Read more here.

(3) Listen to My Interview with Mike Campbell of Tom Petty & The Heartbreakers

I interview Mike Campbell, Tom Petty’s long-time collaborator and guitarist of the Heartbreakers, about the band’s classic breakout song “Refugee.” It’s all part of the new Season 4 of my “The Story Behind the Song” podcast (you can find it on all major podcast platforms). What do these music interviews have to do with AI? They remind us of the power of human creativity & personality in an increasingly AI world. You can listen to or watch my interview with Mike via the buttons below.

LISTEN TO MIKE CAMPBELL INTERVIEW

WATCH MY INTERVIEW WITH MIKE

And if you like my interview with Campbell, you’ll love my 60+ interviews with other legendary artists that range from Don McLean (American Pie), to Debbie Harry of Blondie (Rapture) and Lindsey Buckingham of Fleetwood Mac (Tusk), to Billy Idol (White Wedding) and the Cure, Tears for Fears, The Killers, The Shins, Run DMC, Rick Astley, Boy George, and so many more. All my guests are icons and legends. And all of their stories are timeless and fascinating. Here’s the link to my podcast series.

(4) Actors Who Sold AI Avatars Stuck In Dystopian World

Read this fascinating (and sobering) real-life tale about unintended AI consequences — Synthesia style. Synthesia is a leading AI video avatar company, now valued at over $2 billion. Fun fact: I followed (and believed in) this company from the first time I saw CEO Victor Riparbelli pitch it to investors at a United Talent Agency event 6-ish years ago. Alas, I didn’t put my money where my mouth was!

Now Hiring! Send Me Referrals.

If you know someone who would be a great “fit” for the position below, please reach out to me directly at peter@creativemedia.biz. Or ask them to reach out.

IV. AI Litigation Tracker: Updates on Key Generative AI/Media Cases (by McKool Smith)

Partner Avery Williams and the team at McKool Smith (named “Plaintiff IP Firm of the Year” by The National Law Journal) lay out the facts of — and latest critical developments in — the key generative AI IP litigations below. All of those detailed updates can be accessed via this link to the “AI Litigation Tracker”. McKool is a leader in both copyright and patent-related AI litigation.

(1) Kadrey v. Meta

(2) The New York Times v. Microsoft & OpenAI

(3) Thomson Reuters v. Ross Intelligence

(4) In re OpenAI Litigation (class action)

(5) Dow Jones, et al. v. Perplexity AI

(6) UMG Recordings v. Suno

(7) UMG Recordings v. Uncharted Labs (d/b/a Udio)

(8) Getty Images v. Stability AI and Midjourney

(9) Universal Music Group, et al. v. Anthropic

(10) Sarah Andersen v. Stability AI

(11) Raw Story Media v. OpenAI

(12) The Center for Investigative Reporting v. OpenAI

(13) Authors Guild et al. v. OpenAI

NOTE: Go to the “AI Litigation Tracker” tab at the top of “the brAIn” website for the full discussions and analyses of these and other key generative AI/media litigations. And reach out to me, Peter (peter@creativemedia.biz), if you would like an introduction to McKool Smith to discuss these and other legal & litigation matters related to media/entertainment.

About My Firm Creative Media

My firm and I represent media companies and rights-holders for generative AI content strategy and licensing, with uniquely deep relationships and exclusive “insider” market intelligence. We know the key players inside AI tech and pride ourselves on reaching them in record time to execute, not just talk. We also specialize in breakthrough business development, M&A, and cost-effective legal services in the worlds of media, entertainment, AI and tech.

Reach out to me at peter@creativemedia.biz to explore working with us.

Learn more about Creative Media

Send your feedback to me and my newsletter via peter@creativemedia.biz.

Peter Csathy's "the brAIn" - real media & AI intelligence
