As publishers block the Wayback Machine over AI scraping fears, the preservation of the web’s public record is threatened

AI threatens the Internet Archive’s Wayback Machine

2026/04/18 13:00
5 min read

For nearly three decades, the nonprofit Internet Archive has served as one of the web’s most important memory systems. Its tool, the Wayback Machine, allows users to retrieve past versions of web pages. It has become an essential function in an online ecosystem where content is routinely edited, deleted, or lost.

For example, a powerful entity might force a publisher to take down a webpage. If a user can save that page on the Wayback Machine before it is removed, then the record is preserved.

One high-profile case in the Philippines is that of Senator Tito Sotto, who asked the Inquirer to remove articles about Pepsi Paloma, a request to which the news site acquiesced. The Inquirer articles were taken down, but not before they were preserved on the Wayback Machine.

A Reddit user sums up the importance of Wayback’s function: “The internet never forgets. Thanks to the Wayback Machine.” It’s an invaluable part of the online information ecosystem, preserving records that would otherwise disappear.

By early 2026, the scale of that effort had become immense. Reporting from Firstpost notes that the Wayback Machine has surpassed one trillion archived pages.

But that vast archive is now under mounting pressure, not from technical limits, but from how artificial intelligence is reshaping the economics of online publishing.

The gist? News sites are blocking the Wayback Machine from saving their pages, largely because AI companies have been circumventing the sites’ AI blocks by scraping the saved copies on Wayback instead.

The AI backlash against open archives

A growing number of media organizations are restricting the Wayback Machine’s ability to crawl and preserve their content. A WIRED report earlier this week, citing analysis from AI detection firm Originality AI, found that at least 23 major news sites (not identified in the article) now block ia_archiver, the Internet Archive’s primary crawler.

Earlier, in January, NiemanLab found that a total of 241 outlets across nine countries restrict at least one of Wayback’s bots.

USA Today Co., the largest newspaper publisher in the US, accounts for a large number of the blocked sites, effectively removing hundreds of local publications from the archival record. The New York Times has implemented similar measures, while Reddit announced in August 2025 that it would block Wayback’s crawlers. The Guardian, meanwhile, allows limited crawling but restricts how its articles appear in the archive, which in turn makes them harder for the public to access.
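These blocks are typically expressed in a site’s robots.txt file, which names the crawlers a publisher wants to turn away. As a minimal sketch of how such a block can be detected (the kind of check an analysis like Originality AI’s implies), the snippet below uses Python’s standard-library robots.txt parser. The user agent `ia_archiver`, the example URL, and the sample robots.txt are illustrative assumptions, not taken from any specific publisher:

```python
# Sketch: checking whether a robots.txt file blocks the Internet Archive's
# crawler. The agent name "ia_archiver" and the sample rules below are
# assumptions for illustration; real sites vary in which bots they list.
from urllib.robotparser import RobotFileParser


def is_archiver_blocked(robots_txt: str, url: str, agent: str = "ia_archiver") -> bool:
    """Return True if the given robots.txt text disallows `agent` from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# A hypothetical robots.txt that singles out the archive's crawler
# while leaving the site open to everyone else:
example_robots = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""

print(is_archiver_blocked(example_robots, "https://example.com/news/story"))   # True
print(is_archiver_blocked(example_robots, "https://example.com/news/story",
                          agent="SomeOtherBot"))                               # False
```

Note that robots.txt is a voluntary convention: it only excludes crawlers that choose to honor it, which is precisely why publishers worry about scrapers that ignore their AI blocks.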

New York Times spokesperson Graham James, quoted by WIRED, said that “the issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us.”

So despite differences in implementation, the logic appears to be aligned: concern that archived content could be reused to train AI systems without compensation.

AI has already changed how people interact with news sites: Google’s search summaries and ChatGPT are now a source of information for a growing number of users, cutting into the traffic that brings in revenue.

Now, with the trend of news sites blocking the Wayback Machine, AI is changing how the world preserves valuable information, potentially making the internet far more amorphous in the near future.

The USA Today contradiction

A USA Today report used the Wayback Machine to analyze how US Immigration and Customs Enforcement changed detention statistics over time, work that depended on archived web pages. Ironically, USA Today Co. simultaneously blocks the Archive from preserving its own content.

Wayback Machine director Mark Graham described this contradiction to WIRED, noting that publishers are able to rely on the archive’s records while restricting its access. He has also characterized the Archive as “collateral damage” in a broader conflict between publishers and AI companies.

Is there a valid reason for news sites to act this way? A 2023 analysis by The Washington Post found that data from the Internet Archive has indeed appeared in major training datasets.

Because the Wayback Machine aggregates large volumes of structured, historical web data, it has become an attractive resource for AI development, raising concerns among publishers about downstream use. So the question remains: does the Wayback Machine also need to adjust its systems and find a way to block AI scraping of its archives in order to appease news sites?

Fighting for the Wayback Machine

In response to the growing trend of media organizations blocking the Wayback Machine, a coalition of advocacy groups, including Fight for the Future, the Electronic Frontier Foundation (EFF), and Public Knowledge, organized an open letter of support titled “Journalists Applaud the Internet Archive’s Role In Preserving the Public Record”.

The letter, which has collected over 100 signatures from working journalists, serves as a formal thank-you to the Internet Archive for its “essential service” in a media landscape increasingly defined by “link rot, corporate consolidation, and cost-cutting.”

It also highlights some valuable functions:

  • Preserving knowledge ecosystems: The letter points out that the Wayback Machine maintains permanent citations for nearly 5 million news articles referenced on Wikipedia.
  • Responsible archiving: To counter the narrative of publishers concerned about AI, the signatories highlight that the Internet Archive does not engage in “paywall circumvention or irresponsible scraping.” Instead, they argue that the Archive is a proactive partner that treats the work of journalists with integrity.

The EFF also said: “The Internet Archive has preserved the web’s historical record for nearly thirty years. If major publishers begin blocking that mission, future researchers may find that huge portions of that historical record have simply vanished. There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake.”

The Internet Archive’s Mark Graham, quoted by WIRED, says that they are “in conversation” with the Times along with other publishers. Graham also said, “there’s no question that the general locking-down of more and more of the public web is impacting society’s ability to understand what’s going on in our world.” – Rappler.com
