Reddit Sues Perplexity AI, Alleging ‘Industrial-Scale’ Data Theft

In brief

Social media platform Reddit has sued Perplexity AI, accusing the firm of an “industrial-scale” scheme to scrape its user-generated content.
Reddit alleges billions of search pages were scraped through tools that bypassed both its own and Google’s protections.
The lawsuit names Perplexity, SerpApi, Oxylabs, and AWM Proxy as defendants.

Social media platform Reddit sued Perplexity AI in federal court on Wednesday, alleging that the artificial intelligence company and its data partners orchestrated an “industrial-scale” scheme to scrape the platform’s user-generated content.

Reddit alleges that the other defendants, SerpApi, Oxylabs, and AWM Proxy, developed and sold tools specifically designed to break security measures protecting its content, enabling the large-scale scraping of Reddit data from search results.

The tools were allegedly built with the intention of bypassing two layers of protection: first, by evading Reddit’s own anti-scraping systems, and second, by circumventing Google’s controls to extract Reddit content directly from its search engine results.

The data companies operated as “data-scraping service providers” and “circumvented Google’s technological control measures and automatedly accessed, without authorization, almost three billion search engine results pages,” a copy of the lawsuit reads.

Reddit claims Perplexity used data from the three firms for its answer engine even after receiving a cease-and-desist letter in May 2024.

A representative from Perplexity replied by sharing the company’s full response, which was posted on Reddit.

Perplexity intentionally posted its response on Reddit “to illustrate a simple point: it’s a public Reddit link accessible to anyone, yet by the logic of Reddit’s lawsuit, if you refer to it in any way, they just might sue you too,” the representative told Decrypt.

Perplexity described the lawsuit as “a sad example of what happens when public data becomes a big part of a public company’s business model.”

“Reddit thinks that’s their right. But it is the opposite of an open internet,” Perplexity stated.

A representative from SerpApi told Decrypt they did not receive “any communication or service from Reddit” on the matter, adding that they “strongly disagree with Reddit’s allegations” and intend to seek legal recourse.

“No company should claim ownership of public data that does not belong to them. It is possible that it is just an attempt to sell the same public data at an inflated price,” Denas Grybauskas, chief governance and strategy officer at Oxylabs, told Decrypt in an emailed statement.

Reddit similarly “made no attempt to speak” with Oxylabs, Grybauskas said.

Decrypt has reached out to Reddit, Google, and AWM Proxy for comment and will update this article should they respond.

A legal tangle

In cases like this, courts would need to look first at whether the terms of service from platforms like Reddit “explicitly addresses AI training, data scraping, and commercial use,” Andrew Rossow, public affairs attorney and director of strategic partnerships at video search and content intelligence platform Oriane, told Decrypt.

If a user agreed to terms that “grant the platform a broad, perpetual, royalty-free license to their content,” that license “generally governs the relationship between the user and the platform,” Rossow explained.

But it doesn’t “automatically grant the AI company a license” to do the same, unless the terms permitted the platform “to sublicense or sell the data for that purpose,” he added.

Courts would then have to “distinguish between the user’s copyright in their expression (the text of the post) and the use of the content for data mining (extracting patterns, facts, and language models),” he explained.

Still, the supposed “knowledge” behind a large language model (LLM) “is the product of millions of users’ time, effort, and creative expression,” Rossow argued.

“Treating this human-generated content as a free, raw, undifferentiated resource is a form of labor exploitation that devalues online contributions,” Rossow opined, adding that AI companies need to “respect digital citizenship and community norms,” given how these are “the implicit and explicit rules of the digital public spaces they ingest.”