For nearly three decades, the nonprofit Internet Archive has served as one of the web’s most important memory systems. Its signature tool, the Wayback Machine, allows users to retrieve past versions of web pages, an essential function in an online ecosystem where content is routinely edited, deleted, or lost.
For example, a powerful entity might force a publisher to take down a webpage. If a user can save that page on the Wayback Machine before it is removed, then the record is preserved.
One high-profile case in the Philippines is that of Senator Tito Sotto, who asked the Inquirer to remove articles about Pepsi Paloma, a request the news site granted. The articles were taken down, but not before they were preserved on the Wayback Machine.
A Reddit user summed up the importance of that function: “The internet never forgets. Thanks to the Wayback Machine.” The service is an invaluable part of the online information ecosystem precisely because it preserves such records.
By early 2026, the scale of that effort had become immense. Reporting from Firstpost notes that the Wayback Machine has surpassed one trillion archived pages worldwide.
But that vast archive is now under mounting pressure, not from technical limits, but from how artificial intelligence is reshaping the economics of online publishing.
The gist? News sites are blocking the Wayback Machine from saving their pages, primarily because AI companies have been circumventing the sites’ own AI blocks by scraping archived copies from the Wayback Machine instead.
A growing number of media organizations are restricting the Wayback Machine’s ability to crawl and preserve their content. Reporting by WIRED earlier this week, citing analysis from AI detection firm Originality AI, found that at least 23 major news sites (not identified in the article) now block ia_archiver, the Internet Archive’s primary crawler.
Earlier in January, NiemanLab found that a total of 241 outlets across nine countries restrict at least one of Wayback’s bots.
USA Today Co., the largest newspaper publisher in the US, accounts for a large share of the blocked sites, effectively removing hundreds of local publications from the archival record. The New York Times has implemented similar measures, while Reddit announced in August 2025 that it would block Wayback’s crawlers. The Guardian, meanwhile, allows limited crawling but restricts how its articles appear in the archive, which, in turn, makes them harder for the public to access.
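For context, these blocks are typically declared in a site’s public robots.txt file, which lists the crawlers a publisher wants turned away. The following is a minimal sketch in Python (standard library only), assuming the commonly cited ia_archiver user-agent token and a purely illustrative site URL, of how one might check whether a given site disallows the Wayback Machine’s crawler:

```python
# Minimal sketch: check whether a site's robots.txt disallows the
# Internet Archive's crawler. "ia_archiver" is the commonly cited
# user-agent token; the site URL passed in is purely illustrative.
from urllib.robotparser import RobotFileParser

def blocks_wayback(site: str, user_agent: str = "ia_archiver") -> bool:
    """Return True if the site's robots.txt disallows `user_agent` at the root."""
    parser = RobotFileParser()
    parser.set_url(f"{site.rstrip('/')}/robots.txt")
    parser.read()  # fetch and parse the live robots.txt
    return not parser.can_fetch(user_agent, f"{site.rstrip('/')}/")

if __name__ == "__main__":
    # Hypothetical example; substitute any news site's homepage URL.
    print(blocks_wayback("https://example.com"))
```

Note that robots.txt is a voluntary convention: it signals a publisher’s wishes to crawlers rather than technically preventing a page from being saved.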
New York Times spokesperson Graham James, quoted by WIRED, said that “the issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us.”
So despite differences in implementation, the logic appears to be aligned: concern that archived content could be reused to train AI systems without compensation.
AI has already changed the way people interact with news sites: with Google’s search summaries and ChatGPT serving as sources of information for a growing number of users, the traffic that brings publishers revenue has suffered.
Now, with the trend of news sites blocking the Wayback Machine, AI is changing how the world preserves valuable information, potentially making the internet far more amorphous in the near future.
A USA Today report used the Wayback Machine to analyze how US Immigration and Customs Enforcement changed detention statistics over time, work that depended on archived web pages. Ironically, USA Today Co. simultaneously blocks the Archive from preserving its own content.
Wayback Machine director Mark Graham described this contradiction to WIRED, noting that publishers are able to rely on the archive’s records while restricting its access. He has also characterized the Archive as “collateral damage” in a broader conflict between publishers and AI companies.
Is there a valid reason for news sites to act this way? A 2023 analysis by The Washington Post found that data from the Internet Archive have indeed appeared in major AI training datasets.
Because the Wayback Machine aggregates large volumes of structured, historical web data, it has become an attractive resource for AI development, raising concerns among publishers about downstream use. So the question is: does the Wayback Machine also need to adjust its systems and find a way to deter AI scraping in order to appease news sites?
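That structure is easy to see in practice: the Internet Archive exposes a public CDX API that returns machine-readable listings of every snapshot of a URL. Below is a minimal sketch in Python using the documented web.archive.org/cdx/search/cdx endpoint; the target URL and snapshot limit are illustrative.

```python
# Minimal sketch: query the Wayback Machine's public CDX API for a
# structured list of archived snapshots of a given URL.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def list_snapshots(url: str, limit: int = 5) -> list[dict]:
    """Return up to `limit` archived snapshots of `url` as dicts."""
    query = urlencode({"url": url, "output": "json", "limit": limit})
    with urlopen(f"https://web.archive.org/cdx/search/cdx?{query}") as resp:
        rows = json.load(resp)
    if not rows:
        return []
    header, *records = rows  # the first row holds the field names
    return [dict(zip(header, record)) for record in records]

if __name__ == "__main__":
    # Illustrative target; substitute any archived URL.
    for snap in list_snapshots("example.com"):
        # Each timestamp (YYYYMMDDhhmmss) keys a retrievable past version.
        print(snap["timestamp"], snap["original"])
```

Each returned timestamp keys a retrievable past version of the page, exactly the kind of clean, longitudinal data that makes the archive attractive for large-scale scraping.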
In response to the growing trend of media organizations blocking the Wayback Machine, a coalition of advocacy groups, including Fight for the Future, the Electronic Frontier Foundation (EFF), and Public Knowledge, organized an open letter of support titled “Journalists Applaud the Internet Archive’s Role In Preserving the Public Record”.
The letter, which has collected over 100 signatures from working journalists, serves as a formal thank-you to the Internet Archive for its “essential service” in a media landscape increasingly defined by “link rot, corporate consolidation, and cost-cutting.”
The letter also highlights several of the archive’s valuable functions.
The EFF, for its part, said: “The Internet Archive has preserved the web’s historical record for nearly thirty years. If major publishers begin blocking that mission, future researchers may find that huge portions of that historical record have simply vanished. There are real disputes over AI training that must be resolved in courts. But sacrificing the public record to fight those battles would be a profound, and possibly irreversible, mistake.”
The Internet Archive’s Mark Graham, quoted by WIRED, says the organization is “in conversation” with the Times and other publishers. Graham also said, “there’s no question that the general locking-down of more and more of the public web is impacting society’s ability to understand what’s going on in our world.” – Rappler.com