In an engineer’s ideal world, architecture is always beautiful. In the real world of high-scale systems, you have to make compromises. One of the fundamental problems an engineer must think about at the start is the vicious trade-off between Write Speed and Read Speed.
Usually, you sacrifice one for the other. But in our case, working with petabytes of data in AWS, this compromise didn’t hit our speed; it hit the wallet.
We built a system that wrote data perfectly, but every time it read from the archive, it burned through the budget in the most painful way imaginable. After all, reading petabytes from AWS costs money for data transfer, request counts, and storage class retrievals… A lot of money!
This is the story of how we optimized it to make it more efficient and cost-effective!
True story: a few months back, one of our solution architects wanted to pull a sample export from a rare, low-traffic website to demonstrate the product to a potential client. Due to a bug in the API, the safety limit on file count wasn’t applied.
Because the data for this “rare” site was scattered across millions of archives alongside high-traffic sites, the system tried to restore nearly half of our entire historical storage to find those few pages.
That honest mistake ended up costing us nearly $100,000 in AWS fees!
Now, I fixed the API bug immediately (and added strict limits), but the architectural vulnerability remained. It was a ticking time bomb…
Let me tell you the story of the Bright Data Web Archive architecture: how I drove the system into the trap of “cheap” storage and how I climbed out using a Rearrange Pipeline.
When I started working on the Web Archive, the system was already ingesting a massive data stream: millions of requests per minute, tens of terabytes per day. The foundational architecture was built with a primary goal: capture everything without data loss.
It relied on the most durable strategy for high-throughput systems: Append-only Log.
For the ingestion phase, this design was flawless. Storing data in Deep Archive costs pennies, and the write throughput is virtually unlimited.
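To make the write path concrete, here is a minimal sketch of that append-only ingestion, assuming the workers have already packed a batch of pages into a TAR buffer; the key scheme and bucket wiring are illustrative stand-ins, not the production code.

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { randomUUID } from "node:crypto";

const s3 = new S3Client({});

// Append-only ingestion: whatever landed in the buffer is packed into a TAR
// and written exactly once under a chronological key. Nothing is ever updated
// in place, which is what keeps write throughput effectively unlimited.
async function flushBuffer(bucket: string, tarBody: Buffer): Promise<string> {
  const now = new Date();
  const datePrefix = now.toISOString().slice(0, 10).replaceAll("-", "/"); // e.g. 2024/05/05
  const key = `${datePrefix}/${randomUUID()}_${now.getTime()}.tar`;

  await s3.send(new PutObjectCommand({
    Bucket: bucket,
    Key: key,
    Body: tarBody,
    StorageClass: "DEEP_ARCHIVE", // cheapest S3 storage class at rest
  }));
  return key;
}
```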
The architecture worked perfectly for writing… until clients came asking for historical data. That’s when I faced a fundamental contradiction:
Data is written strictly chronologically, but clients read it by domain. A typical request looks like: “Give me everything you have for cnn.com, google.com, and shop.xyz.cnn.com for the last year.”

Here lies the mistake that inspired this article. Like many engineers, I’m used to thinking about latency, IOPS, and throughput. But I overlooked the AWS Glacier billing model.
I thought: “Well, retrieving a few thousand archives is slow (48 hours), but it’s not that expensive.”
The Reality: AWS charges not just for the API call, but for the volume of data restored ($ per GB retrieved).
Imagine a client requests 1,000 pages from a single domain. Because the writing logic was chronological, these pages can be spread across 1,000 different TAR archives.
To give the client these 50 MB of useful data, a disaster occurs: the system has to restore all 1,000 archives from Deep Archive, each one stuffed with pages from completely unrelated sites. We were paying to restore terabytes of trash just to extract grains of gold. It was a classic Data Locality problem that turned into a financial black hole.
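To see how the bill grows, here is a back-of-the-envelope cost model; every number in it (archive sizes, per-GB and per-request prices) is an assumption for illustration, not quoted AWS pricing. The point it demonstrates: cost scales with the total volume of archives you touch, not with the useful bytes you actually deliver.

```typescript
// Rough restore-cost model. All prices below are placeholders; plug in the
// current rates for your region and retrieval tier.
interface RestoreCostInput {
  archiveCount: number;        // how many TAR archives must be restored
  avgArchiveGb: number;        // average size of one archive, in GB
  pricePerGbRestored: number;  // assumed $/GB for the retrieval tier
  pricePerRequest: number;     // assumed $ per restore request
}

function estimateRestoreCost(i: RestoreCostInput): number {
  const dataFee = i.archiveCount * i.avgArchiveGb * i.pricePerGbRestored;
  const requestFee = i.archiveCount * i.pricePerRequest;
  return dataFee + requestFee;
}

// 1,000 pages scattered across 1,000 archives: the client needs ~50 MB,
// but we pay to restore every full TAR those pages happen to live in.
const scattered = estimateRestoreCost({
  archiveCount: 1_000,
  avgArchiveGb: 5,           // assumed archive size
  pricePerGbRestored: 0.02,  // assumed rate
  pricePerRequest: 0.0001,   // assumed rate
});
console.log(`Scattered layout: ~$${scattered.toFixed(2)} to serve ~50 MB of pages`);
```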
I couldn’t quickly change the ingestion method: the incoming stream is too parallel and massive to sort “on the fly” (though I am working on that), and I needed a solution that worked for already-archived data, too.
So, I designed the Rearrange Pipeline, a background process that “defragments” the archive.
This is an asynchronous ETL (Extract, Transform, Load) process, with several critical core components:
1. Selection: It makes no sense to sort data that clients aren’t asking for. Thus, I direct all new data into the pipeline, as well as data that clients have specifically asked to restore. We overpay for the retrieval the first time, but it never happens a second time.
2. Shuffling (Grouping): Multiple workers download unsorted files in parallel and organize buffers by domain. Since the system is asynchronous, I don’t worry about the incoming stream overloading memory. The workers handle the load at their own pace.
3. Rewriting: I write the sorted files back to S3 under a new prefix (to distinguish sorted files from raw ones). A sketch of the Shuffling and Rewriting stages follows this list.
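Here is a minimal sketch of those two stages; the record shape, the packTar helper, and the exact key layout are hypothetical stand-ins, not the real pipeline.

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { randomUUID } from "node:crypto";

// Hypothetical record shape: one captured page plus the domain it belongs to.
interface PageRecord {
  domain: string;
  url: string;
  body: Buffer;
}

const s3 = new S3Client({});

// Shuffling: group records (already extracted from the raw TARs) by domain.
function groupByDomain(records: PageRecord[]): Map<string, PageRecord[]> {
  const byDomain = new Map<string, PageRecord[]>();
  for (const record of records) {
    const group = byDomain.get(record.domain) ?? [];
    group.push(record);
    byDomain.set(record.domain, group);
  }
  return byDomain;
}

// Rewriting: one archive per domain, written under a new per-domain prefix so
// sorted data is easy to tell apart from the raw chronological dump.
async function rewriteSorted(
  bucket: string,
  datePrefix: string, // e.g. "2024/05/05"
  records: PageRecord[],
  packTar: (records: PageRecord[]) => Buffer, // stand-in for real TAR packing
): Promise<void> {
  for (const [domain, group] of groupByDomain(records)) {
    await s3.send(new PutObjectCommand({
      Bucket: bucket,
      Key: `${datePrefix}/${domain}/${randomUUID()}_${Date.now()}.tar`,
      Body: packTar(group),
      StorageClass: "DEEP_ARCHIVE",
    }));
  }
}
```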
In terms of object keys, the change looks like this:

Before: `2024/05/05/random_id_ts.tar` → [cnn, google, zara, cnn]
After: `2024/05/05/cnn/random_id_ts.tar` → [cnn, cnn, cnn, ...]

There is no cheap way to reorganize the data in place; anything like a MERGE INTO or UPDATE over millions of archived objects is prohibitively expensive, which is why the pipeline simply rewrites under a new prefix.

This change radically altered the product’s economics: when a client asks for cnn.com, the system now restores only the data where cnn.com lives.
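To show the read side under the new layout, here is a rough sketch of a targeted restore; the restore window and retrieval tier are assumptions, and the per-domain prefix mirrors the example above.

```typescript
import {
  S3Client,
  ListObjectsV2Command,
  RestoreObjectCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Restore only the archives under one domain's prefix, instead of every TAR
// the domain's pages used to be scattered across.
async function restoreDomain(bucket: string, datePrefix: string, domain: string) {
  let token: string | undefined;
  do {
    const page = await s3.send(new ListObjectsV2Command({
      Bucket: bucket,
      Prefix: `${datePrefix}/${domain}/`,
      ContinuationToken: token,
    }));
    for (const obj of page.Contents ?? []) {
      await s3.send(new RestoreObjectCommand({
        Bucket: bucket,
        Key: obj.Key!,
        RestoreRequest: {
          Days: 7, // how long the restored copy stays readable (assumed)
          GlacierJobParameters: { Tier: "Bulk" }, // cheapest tier, ~48h for Deep Archive
        },
      }));
    }
    token = page.NextContinuationToken;
  } while (token);
}
```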
Bright Data is scaling the Web Archive even further. If you enjoy working on problems like this, I’d love to talk.
We’re hiring strong Node.js engineers to help build the next generation of the Web Archive. Having data engineering and ETL experience is highly advantageous. Feel free to send your CV to vadimr@brightdata.com.
More updates coming as I continue scaling the archive, and as I keep finding new and creative ways to break it!