Perplexity's Use of "RAG" Opens Up 3 Paths to Copyright Infringement

The Wall Street Journal & The New York Times Have New Supreme Court and "Google Books" Court Precedent to Beat Back "Fair Use"

Oct 22, 2024

A funny thing happened Monday morning. I posted my weekly newsletter laying out The New York Times’ recent “cease and desist” demand against generative AI search tool Perplexity. And then within two hours, News Corp announced it was suing Perplexity for similar reasons: copyright infringement on a mass scale. Coincidence? I think not! There was something in the air.

And for that reason — and in order to make my earlier article as up-to-date and concise as possible — I’ve re-imagined it for you here, in this special bonus edition of “the brAIn.” If you didn’t read Monday’s newsletter carefully — or if you simply want a better version of it to reconsider all the massive issues at stake for both media/entertainment and GenAI tech — then this one’s for you!

Both The New York Times’ demand and News Corp’s new lawsuit reveal that Perplexity’s actions, including its use of so-called “RAG” (Retrieval-Augmented Generation) natural language processing, expose the Jeff Bezos-backed GenAI darling to three different possible paths to infringement. And “fair use” defenses likely will not shield Perplexity from liability in the courts (if these cases don’t settle first).

“RAG” Is Critical Here, Yet Little Discussed & Understood

RAG combines real-time information retrieval with GenAI to enable LLMs to retrieve relevant external content, and then use that retrieved content to generate the most accurate, informative, contextually relevant, and up-to-date responses to user prompts. And RAG exposes Perplexity and others like it (including OpenAI’s ChatGPT) to three different possible forms of infringement.
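For readers who want to see the mechanics, here is a minimal sketch of the retrieve-then-generate loop described above. It is a toy under stated assumptions, not Perplexity's actual system: the two-item corpus, the word-overlap ranking, the `generate` stub, and every function and source name are hypothetical and exist only for illustration. Real products retrieve from web crawls or search indexes and send the prompt to an actual LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    source: str  # hypothetical publisher label
    text: str    # the content that gets retrieved


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {word.strip(".,!?\"'").lower() for word in text.split()} - {""}


def retrieve(query: str, corpus: list[Document], top_k: int = 1) -> list[Document]:
    """Rank documents by shared words with the query (a toy relevance score)."""
    query_words = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & tokenize(doc.text)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, retrieved: list[Document]) -> str:
    """Copy the retrieved text into the prompt that is handed to the model."""
    context = "\n".join(f"[{doc.source}] {doc.text}" for doc in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."


def generate(prompt: str) -> str:
    """Stand-in for an LLM call (assumption: a real system calls a model here)."""
    return f"(model response based on {len(prompt)} characters of prompt)"


# Hypothetical corpus standing in for crawled or indexed web content.
corpus = [
    Document("example-news-site", "The central bank raised interest rates on Tuesday."),
    Document("example-sports-site", "The home team won the championship game."),
]

query = "Did the central bank change interest rates?"
retrieved = retrieve(query, corpus)      # step 1: retrieval
prompt = build_prompt(query, retrieved)  # step 2: augmentation
answer = generate(prompt)                # step 3: generation
```

Note what happens in step 2 of this sketch: the retrieved article text is copied verbatim into the prompt at query time, whether or not the user ever sees that text in the final answer.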

1. Generative AI’s Path #1 to Infringement

Last week, I laid out why the non-consensual and uncompensated scraping of unlicensed content for AI “training” purposes by GenAI developers like Perplexity is not “fair use,” despite their wholesale reliance on that defense. I call this GenAI infringement’s “Path #1,” and you can refer to my full analysis of this “Path #1” here.

2. GenAI’s Paths #2 and #3 to Infringement

First, RAG-enabled search is only possible through direct real-time (or near real-time) copying of the content it crawls (using search indexes like Bing to side-step even paywalls), even if that content isn’t ultimately exposed to the end user. RAG processing, therefore, is a form of wholesale copying that is entirely separate from the initial copying of content to “train” LLMs. So RAG’s copying is Path #2 to infringement.

Apart from these Paths #1 and #2 to infringement, GenAI “outputs” themselves can be infringing, even if they aren’t substantially similar to the relevant “inputs.” This is Path #3 to infringement.

Perplexity (and others like it) defend their actions by claiming “fair use,” meaning that no license from rights-holders is necessary to do any of these things. But the most critical recent copyright infringement court decisions pave the way to rejection of their arguments for all Paths: (1) the U.S. Supreme Court’s Andy Warhol decision (which I discussed at length here), and (2) the Second Circuit Court of Appeals’ new September 4th decision in Hachette Book Group v. Internet Archive (“Hachette”), which relies heavily on Warhol’s new reasoning. Critically, the Second Circuit is the same court that decided the famous “Google Books” case on which GenAI companies hang their “fair use” hats. So the Second Circuit’s new decision in Hachette carries particular weight.

3. Setting the Stage

Consider this real world situation. If Perplexity trains on quality content sources (like The Wall Street Journal and The NY Times) without a license in order to be your personalized one-stop source for up-to-date news and information, and if Perplexity spits out its up-to-date summaries based on those unlicensed quality sources, should the mere fact that Perplexity’s output (for any specific prompt) doesn’t precisely track The WSJ’s and The NY Times’ relevant articles word-for-word mean that those media companies have no claims against Perplexity?

They certainly don’t think so. In the words of The Times’ cease and desist, Perplexity is “unjustly enriched” when it uses The Times’ “expressive, carefully written and researched, and edited journalism without a license.”

4. Context Is Critical - In Fact, It Is Everything

Perplexity’s “fair use” arguments — just like similar arguments made by virtually all other GenAI companies — fail under both the Supreme Court’s Warhol decision and the Second Circuit’s latest ruling in Hachette.

First, the question of infringement in the context of GenAI is unlike any type of infringement previously considered by the courts. And, context matters — and can make all the difference. Don’t take that from me. Take it from the Second Circuit in Hachette, where the Court writes that “each case raising the question [of fair use] must be decided on its own facts” — and further, that “fair use is a flexible concept whose application varies depending on the context.” In the specific context before it in Hachette, the Second Circuit rejected the defense’s argument of “fair use.” It wasn’t a GenAI case, but Hachette’s core rationale is directly relevant to GenAI.

At least one federal court has already determined that GenAI infringement cases are utterly unique and, therefore, require new ways of thinking about what constitutes infringement. U.S. District Judge William Orrick of the Northern District of California, overseeing a high profile case against image generator Stability AI, recently wrote that “run of the mill” substantial similarity copyright cases are “unhelpful in this case where the copyrighted works themselves are alleged to have not only been used to train the AI models but also invoked in their operation.”

5. It’s Market Substitution, Stupid!

So, because context matters, let’s take a close look at GenAI’s fundamental over-arching themes. In Google Books, the Second Circuit excused Google’s copying on “fair use” grounds, because Google’s actions “amplified” the market for the books it copied (I won’t repeat my analysis of “Google Books” here, but you can refer back to last week’s newsletter where I discuss the issue at length). But in its new Hachette ruling, the Second Circuit distinguished the context before it from Google Books to reject “fair use,” relying heavily on the Supreme Court’s rejection of “fair use” in Warhol (I also discuss Warhol at length in last week’s newsletter). The Second Circuit’s fundamental rationale for its decision in Hachette, just like the Supreme Court’s in Warhol, is a finding of “market substitution,” which is 180 degrees different from the market “amplification” rationale at the core of Google Books.

In reaching its decision in Hachette, the Second Circuit applied all four factors of the Copyright Act’s relevant “fair use” test, relying most heavily on the fourth, which considers “the effect of the use upon the potential market for or value of the copyrighted work.” It called this fourth factor “undoubtedly the most important element of fair use” and concluded that defendant Internet Archive’s actions directly and adversely impacted the plaintiffs’ relevant commercial licensing opportunities.

6. A Lucrative GenAI Content Licensing Market Is At Stake

There can be little doubt that the kind of commercial harm and market substitution at the core of both Hachette and Warhol is also at stake here with Perplexity. Applying the “fair use” test’s fourth factor (the most important one, as recent court decisions have emphasized), an established market now exists (and is growing) for media and entertainment companies to license their content for GenAI applications. OpenAI’s recent $250 million licensing deal with News Corp. is proof positive of that.

So Perplexity’s and OpenAI’s unlicensed use of The WSJ, The NY Times and other copyrighted content directly and adversely impacts the licensing revenues that would otherwise flow to those rights-holders. And, as the Second Circuit points out, “the impact on potential licensing revenues is a proper subject for consideration in assessing the fourth [fair use] factor.”

7. RAG Output Is Not “Transformative”

Perplexity and other GenAI players like OpenAI try to obfuscate and muddy the waters about infringement Paths #1, 2 and 3 by first mixing them all up together into one basket, and second, by saying that GenAI “outputs” show no substantial similarity to the relevant content inputs — and, therefore, are protected “transformative” uses.

But the Second Circuit’s new ruling in Hachette — which heavily relied upon the Supreme Court’s Warhol decision — gives The WSJ, The NY Times and other media companies strong arguments to beat back those “fair use” challenges.

First, RAG’s real-time copying alone (separate and apart from the initial training copying of Path #1) should be enough to find infringement for the market “substitution” reasons noted above. After all, Perplexity’s intent, when copying, is to be the single source for up-to-date information. Second, Perplexity’s GenAI “repackaging” of The WSJ’s and The NY Times’ unlicensed content doesn’t necessarily make it “transformative.” In fact, in the words of the Hachette court, “to be transformative, a use must do something more than repackage or republish the original copyrighted work.”

Critically, the Second Circuit noted that the Supreme Court in Warhol significantly narrowed what it means for an output to be “transformative” (and, therefore, non-infringing). If the “new work merely supplants the original” (i.e., is intended to “achieve a purpose that is the same as, or highly similar to, that of the original”), that can be enough to defeat a claim of “transformative” use. In fact, it was enough for the Supreme Court to reject “fair use.”

Perplexity’s actions can be infringing even if its “outputs” aren’t verbatim or even substantially similar reproductions of the relevant copyrighted content. The Second Circuit, in fact, underscores that “the word ‘transformative,’ then, cannot be taken too literally.”

If Perplexity could simply do as it pleases and build a business that competes directly with The WSJ and The NY Times based, at least in meaningful part, on using their content without licensing it (which it does with RAG), it would be doing so in such a manner, in the words of the Court, “as to deprive the rights holder of significant revenues,” and that could lead to the kind of widespread “market harm that would result from the unrestricted and widespread conduct of the same sort.” In other words, to bless Perplexity’s acts would be to bless the similar acts of the other Perplexitys of the world.

8. Public Policy Favors The WSJ & The NY Times

Perplexity and all other GenAI developers will predictably scream that public policy demands that their powerful new GenAI tools be made widely available to the public, and that they should be applauded for enabling a new form of creative democratization. In their view, this societal good outweighs the commercial monopoly that the Copyright Act gives to individual rights-holders like The Times.

But even conceding some degree of merit to GenAI’s purportedly noble argument, the Second Circuit in Hachette rejected similar ones, underscoring that the Copyright Act’s “monopolistic power is a feature, not a bug” and “rewards the individual author in order to benefit the public.” The Court emphasized, much in the same way as the Supreme Court did in Warhol, that too literal an application of an output’s “transformative” nature could lead to absurd results. To rule otherwise, it writes, would be to “significantly narrow — if not entirely eviscerate — copyright owners’ exclusive right to prepare (or not prepare) derivative works.”

The “derivative works” at stake here in the GenAI context are, among other things, the substantial and lucrative commercial content licensing opportunities that should flow to rights-holders in order for GenAI to be able to “do its thing.” Without this essential content to feed on, GenAI models are practically useless.

So, to excuse Perplexity’s unlicensed “taking” in this case would be to render The WSJ’s and The NY Times’ massive investments in their reporting and analyses essentially meaningless. Remember, it’s not just about the harm caused by one infringer like Perplexity. As the Second Circuit points out, it’s also about “the market harm that would result from unrestricted and widespread conduct of the same sort.” If The WSJ, The NY Times and other rights-holders can’t collect from GenAI players like Perplexity that try to cash in and compete using their content for free, then why even bother? As the Court notes in Hachette, “it is difficult to compete with free.”

9. GenAI Carries the Burden of Proving “Fair Use”

One more critical thing. The Second Circuit underscored that those who rely upon the defense of “fair use” to beat back claims of infringement (i.e., Perplexity in this example) carry the “burden of proving that the secondary use does not compete in the relevant market.” It’s not the other way around.

Perplexity, in my view, would not be able to satisfy its burden against The Times.

The Wall Street Journal’s new case against Perplexity, together with The New York Times’ recent “cease and desist” (and ongoing litigation against OpenAI), shines a bright spotlight on RAG and GenAI’s three potential pathways to copyright infringement. And the Second Circuit’s new Hachette decision, together with the Supreme Court’s recent Warhol decision, paves the way for rejection of GenAI “fair use” attempts to block infringement’s Paths #1, 2 and 3.

Ultimately, I believe the courts should — and will — simply find “unjust enrichment” in these GenAI cases and demand that Tech pays to play.

Peter Csathy's "the brAIn" - real media & AI intelligence
