**Note:** See [[Statuses#Books|here]] for valid statuses
> [!META]- Inline Metadata
[author::]
[recommended by::]
[status:: reading]
[related goals::]
[link::]
[tags:: #source/books #concepts/programming/system-design #career/interview ]
[rating::]
# What is the Book About?
**Note**: See [[How To Read a Book]]
What kind of book is it?
- Non-fiction, practical book. Examples of system design interview questions.
What is the book about in 1-2 sentences at most?
- System design interview questions and how to answer them.
# Summaries
## Chapter 1: Scale from Zero to Millions of Users
### Single Server Setup
- Request flow:
1. Users access websites through domain names, which are resolved by DNS
2. An Internet Protocol (IP) address is returned to the browser.
3. Once the IP address is obtained, HTTP requests are sent directly to the server
4. The web server returns HTML pages or a JSON response.
### Database
- Separating web/mobile traffic (web tier) from database (data tier) servers allows them to be scaled independently.
- Non-relational DBs might be the right choice if:
- Application requires super-low latency
- Data is unstructured
- You only need to serialize and deserialize data
- You need to store a massive amount of data.
### Vertical vs. Horizontal Scaling
- **Vertical scaling** - adding more power (CPU, memory, etc.) to servers. "Scale up"
- Limitations:
- Hard limit - can't add unlimited CPU and memory
- No failover or redundancy
- **Horizontal scaling** - adding more servers to the pool of resources
### Load Balancer
- A **load balancer** evenly distributes traffic among web servers.
- The load balancer communicates with web servers via private IPs.
- Load balancers can redirect traffic away from offline servers
- They also handle rapid growth in traffic: you can simply add more servers behind the load balancer to accommodate it.
### Database Replication
- Master/slave terminology: master only handles writes, slave databases only support reads and get copies of databases from the master.
- Advantages
- Better performance
- More reliable
- If only one slave DB is available and it goes offline, read operations are temporarily directed to the master database
- If master goes offline, after some time a slave will be promoted to the new master.
### Cache
A **cache** is a temporary storage area that stores the results of expensive responses/frequently accessed data in memory to serve them more quickly.
A cache tier not only reduces database workloads and increases system performance, but you're also able to scale the cache tier independently.
**Read-through cache strategy** is a strategy where, upon receiving a request, the web server checks if the cache has the appropriate response for the client. If it does, it's sent back to the client. If not, it queries the database and stores the result in the cache before sending it back to the client.
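A minimal sketch of this flow, with plain dicts standing in for the cache and the database (names are illustrative):

```python
# Minimal sketch of the cache strategy described above (dicts stand in for real cache/DB clients).
def get_user(user_id, cache, db):
    # 1. Check the cache first.
    value = cache.get(user_id)
    if value is not None:
        return value            # cache hit: return immediately
    # 2. Cache miss: query the underlying data store.
    value = db.get(user_id)
    if value is not None:
        cache[user_id] = value  # 3. Store the result so future reads hit the cache.
    return value
```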
#### Best Practices
- Caches are best used for data that is accessed frequently but modified infrequently. Caches often use volatile memory (RAM), so they're not a good choice for persisting data.
- An expiration policy should be used to remove stale or unused data from the cache.
- Consistency is a challenge - whether due to scaling across regions or to modified data in the underlying data store.
- A single cache server becomes a single point of failure, so multiple cache servers across different data centers should be used.
- Caches should have an **eviction policy** like least-recently-used (LRU) to remove data from a full cache.
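As an illustration of an LRU eviction policy, a minimal sketch built on Python's `OrderedDict` (one common approach, not the book's implementation):

```python
from collections import OrderedDict

# A tiny LRU cache sketch: the least-recently-used entry is evicted when capacity is exceeded.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)         # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry
```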
### Content delivery network (CDN)
A **CDN** is a network of geographically dispersed servers used to deliver static content.
#### Workflow
1. User A tries to get `image.png` via a URL whose domain is provided by the CDN provider.
2. If the CDN server doesn't have the image in its own cache, it will request it from the origin.
3. Origin returns `image.png` to the CDN server, optionally with a TTL header that describes how long the CDN server should cache the image.
4. The CDN caches the image and returns it to the user; it stays in the cache until the TTL expires.
5. A different user requests the same image.
6. The image is returned from the cache as long as the TTL hasn't expired.
#### CDN Considerations
- Cost - you are charged for transfers into and out of the CDN
- Setting appropriate cache expiration
- CDN fallback - how clients handle a CDN outage
- File invalidation
### Stateless web tier
For horizontal scaling of the web tier, state (such as user session data) should be moved out of the web tier and into persistent storage, turning a **stateful architecture** into a **stateless** one.
### Data centers
Geo-routing services can be used to route requests to a data center based on the location of the requester.
Technical challenges to resolve for a multi-data center setup:
- Traffic redirection, like GeoDNS
- Data synchronization, such as replicating data across multiple data centers
- Testing and deployment need to happen at each data center
### Message queue
A **message queue** is a memory-based component that supports asynchronous communication using a [pub/sub framework](https://cloud.google.com/pubsub/docs/overview).
Decoupling producers and consumers via asynchronous messaging allows for better reliability because the producer and consumer don't need the other to do their work and can be scaled independently.
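A toy sketch of this decoupling using Python's in-process queue as a stand-in for a real message broker:

```python
import queue
import threading

# Minimal sketch of producer/consumer decoupling via an in-process queue
# (a stand-in for a real message queue such as a pub/sub service).
q = queue.Queue()

def producer():
    for i in range(5):
        q.put(f"job-{i}")                  # producer publishes work and moves on

def consumer():
    while True:
        job = q.get()
        print("processing", job)           # consumer processes at its own pace
        q.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
q.join()                                   # wait until all published jobs are processed
```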
### Logging, metrics, automation
Monitoring error logs either per-server or aggregated in a centralized service allows for detection of systemic problems.
Metrics can help understand the health status of the system, such as:
- Host-level metrics (CPU, memory, I/O)
- Aggregated metrics such as performance for the entire cache tier
- Business metrics like DAUs, retention, revenue, etc.
Automation would include CI/CD, automated testing, etc.
### Database scaling
See [[System Design Interview#Vertical vs Horizontal Scaling|vertical vs horizontal scaling above]].
Horizontal scaling for database servers is also called **sharding**, with the additional requirement that each server holds the same schema but unique data. When data is looked up, a hash function is used to find the correct shard to pull from.
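A minimal sketch of shard selection by hashing the key (shard count and hash choice are illustrative):

```python
import hashlib

# Minimal sketch of shard selection: hash the key, then map it to one of N shards.
NUM_SHARDS = 4  # assumed shard count for illustration

def shard_for(user_id):
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# e.g. shard_for("user_123") always lands on the same shard,
# but changing NUM_SHARDS remaps most keys (which is where consistent hashing helps).
```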
#### Sharding considerations
- Resharding data - needed when a shard fills up faster than others, data needs to be redistributed and the sharding function updated. **[[System Design Interview#Chapter 5 Design Consistent Hashing|Consistent hashing]]** can be used to address this.
- **Celebrity problem** - aka *hotspot key problem*. Excessive access to a specific shard could cause server overload of that particular shard. One solution is a separate database server for "celebrity" data.
- **Join and denormalization** - after sharding, performing joins can be difficult. One solution is denormalizing the data so that queries can be performed on a single table. Non-relational databases can also be used for this.
## Chapter 2: Back-of-the-envelope estimation
Some reference tables to aid in back-of-the-envelope calculations.
### Powers of Two
One ASCII character uses one byte of memory, or 8 bits.
$2^n$
Power | Approx. value | Full name | Short name
------|---------------|-----------|-----------
10 | 1 thousand | 1 Kilobyte | 1 KB
20 | 1 million | 1 Megabyte | 1 MB
30 | 1 billion | 1 Gigabyte | 1 GB
40 | 1 trillion | 1 Terabyte | 1 TB
50 | 1 quadrillion | 1 Petabyte | 1 PB
### Latency numbers
Operation name | Time
---------------|------
L1 cache reference | 0.5 ns
Branch mispredict | 5 ns
L2 cache reference | 7 ns
Mutex lock/unlock | 100 ns
Main memory reference | 100 ns
Compress 1K bytes with Zippy | 10,000 ns => 10 μs
Send 2K bytes over 1 Gbps network | 20,000 ns => 20 μs
Read 1 MB sequentially from memory | 250,000 ns => 250 μs
Round trip within same datacenter | 500,000 ns => 500 μs
Disk seek | 10,000,000 ns => 10 ms
Read 1 MB sequentially from network | 10,000,000 ns => 10 ms
Read 1 MB sequentially from disk | 30,000,000 ns => 30 ms
Send packet CA -> NL -> CA | 150,000,000 ns => 150 ms
tl;dr: Memory access is much faster than disk access, so avoid disk seeks where possible. Simple compression algorithms are fast, so compress data before sending it over the Internet when possible.
### Availability (Nines)
Availability % | Downtime per day | Downtime per week | Downtime per month | Downtime per year
---------------|------------------|-------------------|--------------------|------------------
99% | 14.40 minutes | 1.68 hrs | 7.31 hrs | 3.65 days
99.9% | 1.44 minutes | 10.08 minutes | 43.83 minutes | 8.77 hrs
99.99% | 8.64 s | 1.01 minutes | 4.38 minutes | 52.6 minutes
99.999% | 864 ms | 6.05 s | 26.3 s | 5.26 minutes
99.9999% | 86.4 ms | 604.8 ms | 2.63 s | 31.56 s
### Example
Assume:
- 300M MAUs ^tjbvf6
- 50% use Twitter daily ^iwv2u8
- Users post 2 tweets per day on avg ^ap3z1w
- 10% of tweets contain media
- Data stored for 5 years
#### Query per second estimate
- DAU = [[System Design Interview#^tjbvf6|300M]] * [[System Design Interview#^iwv2u8|50%]] = 150M ^xxe92k
- Tweets QPS = [[System Design Interview#^xxe92k|150M]] * [[System Design Interview#^ap3z1w|2 tweets / 24 hours]] / 3600 s = ~3500
- Peak QPS = 2 * QPS = ~7000
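The same arithmetic as a quick script (values taken from the assumptions above):

```python
# Back-of-the-envelope arithmetic from the assumptions above.
maus = 300_000_000
dau = maus * 0.50                   # 150M daily active users
tweets_per_day = dau * 2            # 300M tweets/day
qps = tweets_per_day / (24 * 3600)  # ≈ 3,472 -> ~3500
peak_qps = 2 * qps                  # ≈ 7,000
```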
## Chapter 3: A Framework for System Design Interviews
The system design interview simulates real-life problem solving where coworkers collaborate on an ambiguous problem and come up with a solution that meets their requirements. The process used is much more important than the final answer you arrive at.
Interviewers will also be looking for red flags, including the following:
- over-engineering
- narrow-mindedness
- stubbornness
### A 4-step process for an effective system design interview
1. Understand the problem and establish design scope
- Ask questions
- Do not give an answer quickly without thinking it through
- Write down assumptions
- Questions:
- What features?
- How many users?
- How fast to scale up?
2. Propose high-level design and get buy-in
- Come up with initial blueprint
- Treat interviewer as a teammate and work together
- Draw box diagrams on whiteboard or paper
- Do [[System Design Interview#Chapter 2 Back-of-the-envelope estimation|back of the envelope estimations]]
3. Design deep dive
- Work with the interviewer to identify and prioritize components.
- Don't get into unnecessary detail - manage time
4. Wrap-up
- Talk about possible improvements
- Talk about error cases
- Discuss operational issues (ie monitoring)
- Can dig into how to handle the next scale curve (1M users vs 10M)
## Chapter 4: Design a Rate Limiter
A **rate limiter** is used to control the rate of traffic that is sent by a client or service.
### Step 1: Understand problem and establish design scope
Some example questions to ask from the text:
Q: What kind of rate limiter? Server or client side?
A: Server-side
Q: Does it throttle requests based on a user ID or something else?
A: Should be flexible to handle different throttle rules.
Q: What is scale of the system?
A: Should be able to handle a large number of requests.
Requirements:
- Limit excessive requests
- Low latency
- Use as little memory as possible
- Distributed rate limiting
- Exception handling
- High fault tolerance
### Step 2: Propose high-level design and get buy-in
#### Where to put rate limiter?
Client side: ❌ because it can be altered by the client
Server side
Middleware: a recent, popular approach. In cloud environments, rate limiting is often part of an API gateway.
#### How to evaluate where to put rate limiter
- What works best with current tech stack?
- What works with the algorithm you want to use? API gateways may not support the algorithm you want to use.
- If you're already using an API gateway that supports rate limiting, adding the rate limiter there may be the simplest option
#### Algorithms for rate limiting
- Token bucket algorithm ^oe15hx
- A container with a pre-defined capacity where tokens are added at a prescribed rate until it's full.
- Each request consumes a token, if any exist in the bucket.
- Parameters: bucket size and refill rate
- Usually a separate bucket is needed for each API endpoint (a minimal token-bucket sketch appears after this list of algorithms)
- Leaking bucket algorithm
- Similar to [[System Design Interview#^oe15hx|token bucket]] but requests are processed at a fixed rate.
- When a request arrives, the system checks if the queue is full. If not, request is added to the queue. If it is, the request is dropped.
- Requests are pulled from the queue and processed at regular intervals.
- 2 parameters: bucket (queue) size, outflow rate
- Pros:
- Memory efficient
- Requests processed at fixed rate so useful where stable request processing rate is needed
- Cons:
- A burst of traffic can fill the queue with old requests and then cause new ones to be rate limited.
- May be difficult to tune both parameters
- Fixed window algorithm ^t81l6b
- Divides the timeline into fixed-size time windows, each with a counter
- Each request within the window increments the counter by 1. If the counter reaches its max within that window, all new requests are dropped.
- Also problematic with bursts of traffic. Spikes at edges of window could cause more requests to come through than anticipated.
- Sliding window log algorithm ^k0llhe
- Request timestamps are kept track of in a cache like Redis
- When a new request comes in, old timestamps (any outside the current window) are flushed, and timestamp of the new request is added to the cache.
- If the size of the log is <= the allowed size, the request is accepted.
- Very accurate, but consumes a lot of memory
- Sliding window counter algorithm
- Hybrid of [[System Design Interview#^t81l6b|fixed window]] and [[System Design Interview#^k0llhe|sliding window log]].
- Calculate requests according to the following formula:
- requests in current window + requests in previous window * overlap percentage of the rolling window with the previous window
- Smooths out spikes in traffic and is memory efficient.
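A minimal single-process sketch of the token bucket algorithm from the list above (parameter names are illustrative):

```python
import time

# A minimal token bucket sketch (single-threaded; not production code).
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity            # bucket size
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the bucket capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1                # each request consumes one token
            return True
        return False                        # bucket empty: request is rate limited
```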
#### High-level architecture
Where to store counters? In-memory cache like Redis.
Redis has `INCR` and `EXPIRE` commands that can help us with our rate limiter.
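A rough sketch of a fixed-window counter built on those commands, assuming the `redis-py` client and a local Redis instance (the limit and window values are illustrative):

```python
import redis  # assumes the redis-py client and a local Redis server

r = redis.Redis()

# Rough fixed-window counter using INCR and EXPIRE.
def is_allowed(client_id, limit=100, window_seconds=60):
    key = f"rate:{client_id}"
    count = r.incr(key)                  # INCR creates the key at 1 if it doesn't exist
    if count == 1:
        r.expire(key, window_seconds)    # start the window on the first request
    return count <= limit
```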
### Step 3: Design Deep Dive
#### Rate Limiting Rules
Use config files. These can have the domain, the unit and requests_per_unit of the rate limit, etc. These are saved on disk, and workers access them and put frequently accessed ones in a cache (one separate from the one in which timestamps are stored).
#### Exceeding the Rate Limit
Return HTTP status code **429** - too many requests. You can decide what to do with the rate-limited requests, like adding them to a queue for later processing.
HTTP response headers are used to communicate the status of requests to the client - requests remaining, the rate limit itself, etc.
#### Rate Limiters in a Distributed Environment
**Race conditions** occur in highly concurrent environments, for example when two threads get the counter value from Redis before either writes to it to update it, so they both increment from, say, 3 instead of one incrementing from 3 and the other one from 4. This can be solved by using sorted sets in Redis.
**Synchronization** issues can also occur with distributed systems. This is easily worked around by using a single centralized cache (like Redis) for all rate limiters.
For performance optimization, a multi-data center setup and synching data with eventual consistency are both good strategies.
**Monitoring** is also necessary to see if your rules are too strict or lax.
## Chapter 5: Design Consistent Hashing
If you have $n$ cache servers, it's common to balance the load using this hash method: $\text{serverIndex} = \text{hash}(key) \bmod n$
This works well with even data distribution and fixed server size, but if a server goes offline, this can cause lots of cache misses.
### Consistent hashing
> Consistent hashing is a special kind of hashing such that when a hash table is re-sized and consistent hashing is used only $\frac{k}{n}$ keys need to be remapped on average, where $k$ is the number of keys and $n$ is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped.
- Wikipedia
To visualize consistent hashing, imagine some hash function $f$ like SHA-1. In SHA-1, all possible values (the **hash space**) are between 0 and $2^{160} - 1$. If you visualize it as an array with slots for each value, then connecting either end, you get a **hash ring**.
To use consistent hashing, use hash function $f$ to map servers onto the ring, and to map cache keys onto the ring.
To find where a key is stored, go clockwise on the hash ring from the key's position and find the nearest server.
Adding a server only requires redistributing a fraction of the keys: only the keys that fall between the newly inserted server and the previous server on the ring need to move to the new server.
Likewise, removing a server only requires the redistribution of some fraction of keys by the same logic.
### Virtual nodes
**However**, keys and servers aren't always uniformly distributed on the hash ring, especially as you add or remove servers. You can make the distribution of servers more uniform by using virtual nodes, which are additional references to an existing server. If `s0` is server 0, then `s0_0`, `s0_1`, `s0_2` are virtual nodes representing server 0. As the number of virtual nodes increases, key distribution becomes more balanced.
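A compact sketch of a hash ring with virtual nodes, using SHA-1 as in the example above (the vnode count and naming scheme are illustrative):

```python
import bisect
import hashlib

# A compact consistent-hashing sketch with virtual nodes (illustrative, not production code).
class HashRing:
    def __init__(self, servers, vnodes=3):
        self.ring = []                                # sorted list of (position, server)
        for server in servers:
            for i in range(vnodes):
                pos = self._hash(f"{server}_{i}")     # e.g. s0_0, s0_1, s0_2
                self.ring.append((pos, server))
        self.ring.sort()

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    def server_for(self, key):
        pos = self._hash(key)
        # Walk clockwise: find the first virtual node at or after the key's position.
        idx = bisect.bisect_left(self.ring, (pos,))
        if idx == len(self.ring):
            idx = 0                                   # wrap around the ring
        return self.ring[idx][1]

ring = HashRing(["s0", "s1", "s2", "s3"])
print(ring.server_for("some_cache_key"))
```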
## Chapter 6: Design a Key-Value Store
Build a key-value store that supports `put(key, value)` and `get(key)`.
In a single server setup, it is enough to store k-v pairs in a hash table in memory. A distributed key-value store or **distributed hash table** is more complex.
### CAP Theorem
The **CAP theorem** says that it is impossible for a distributed system to provide more than 2 of the 3 guarantees:
**Consistency**: all clients see the same data at the same time no matter which node they connect to
**Availability**: any client that requests data gets a response even if some nodes are down
**Partition tolerance**: a partition in this context is a communication break between nodes. Partition tolerance, then, means the system continues to operate despite partitions
If you choose a CP system (consistency over availability), then if a node goes down you would have to block all writes to the system to keep all nodes consistent.
If you choose an AP system (availability over consistency), then if a node goes down the system will accept reads even though it might return stale data and writes will keep being accepted, with data being synced when the node comes back on.
---
Core components of a key-value store:
- Data partition
- Partitioning the data across nodes is necessary for a distributed hash table
- [[System Design Interview#Chapter 5 Design Consistent Hashing|Consistent hashing]] can be used for this
- Data replication
- Again using consistent hashing, data can be replicated to the first $N$ servers encountered while walking clockwise from the key's position on the hash ring
- Consistency
- **Quorum consensus**, where some number of nodes for each operation (read $R$ or write $W$) must acknowledge success before the coordinator marks the operation as successful.
- **Strong consistency** ($W + R > N$) is when a client never sees out-of-date data
- **Weak consistency** is when subsequent read operations may not see the most updated value
- **Eventual consistency** is a form of weak consistency: given enough time, all updates are propagated and all replicas become consistent. This is the model recommended for the k-v store.
- Inconsistency resolution
- Versioning is used here, where each update is treated as a new immutable version of the data.
- A **vector clock** helps with this: a list of $[server, version]$ pairs associated with each data item (a small sketch of comparing clocks follows this list).
- To keep the clock from getting too large, some threshold is set so the oldest pairs are discarded.
- Reconciliation happens when there is a conflict, i.e. neither version's clock descends from the other.
- e.g. at $D2([S_x, 2])$, servers $S_y$ and $S_z$ simultaneously do another write, which gives $D3([S_x, 2], [S_y, 1])$ and $D4([S_x, 2], [S_z, 1])$; the reconciled object is $D5([S_x, 3], [S_y, 1], [S_z, 1])$
- Handling failures
- **Gossip protocol** - each node maintains a node list with member IDs and heartbeat counters. Every node periodically increments its heartbeat counter and sends heartbeats to random nodes, which propagate them onward. When nodes receive heartbeats, they update their node lists. If a node's heartbeat counter hasn't increased for more than a predefined number of periods, that node is considered offline.
- To handle temporary failures:
- **Strict quorum** would block read and write operations
- **Sloppy quorum** improves availability by choosing only known healthy servers for reads and writes, ignoring offline servers. Data is pushed back when servers come back up. ^9gv3ds
- To handle permanent failures:
- [[Merkle trees]] are created
- Write path
1. Write request is persisted to commit log
2. Write is added to cache
3. After some threshold (time, number of writes, full cache, etc.) cache is flushed and data written to disk
- Read path
1. System checks if requested data is in memory.
2. If not, checks [[Bloom Filter]] to figure out which SSTable (sorted string table) to look at
3. SSTables return dataset to system, system returns it to user
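A minimal sketch of vector-clock comparison for the versioning discussed above (dictionaries map server name to counter; names are illustrative, not from the book):

```python
# A small vector-clock sketch: detect whether one version descends from another
# or whether the two conflict and need reconciliation.
def descends(a, b):
    """True if clock `a` is equal to or newer than clock `b` on every server."""
    return all(a.get(server, 0) >= version for server, version in b.items())

def compare(a, b):
    if descends(a, b):
        return "a is newer (or equal)"
    if descends(b, a):
        return "b is newer"
    return "conflict: reconcile and merge"

d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}
print(compare(d3, d4))  # conflict, as in the D3/D4 example above
```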
### Summary Table
Goal/Problems | Technique
--------------|-----------
Ability to store big data | [[System Design Interview#Chapter 5 Design Consistent Hashing\|Consistent hashing]] to spread load across servers
High availability reads | Data replication / Multi-datacenter setup
High availability writes | Versioning and conflict resolution with vector clocks
Dataset partition | [[System Design Interview#Chapter 5 Design Consistent Hashing\|Consistent hashing]]
Incremental scalability | [[System Design Interview#Chapter 5 Design Consistent Hashing\|Consistent hashing]]
Heterogeneity | [[System Design Interview#Chapter 5 Design Consistent Hashing\|Consistent hashing]]
Tunable consistency | Quorum consensus ($W$/$R$ configuration)
Handling temporary failures | [[System Design Interview#^9gv3ds\|Sloppy quorum]] and hinted handoff
Handling permanent failures | Merkle trees
Handling data center outage | Cross-datacenter replication
## Chapter 7: Design a Unique ID Generator in Distributed Systems
- Can't use an auto-increment feature in a traditional database in a distributed environment because a single server isn't large enough for the data and generating IDs this way across multiple DBs is extremely challenging.
### Requirements
- IDs must be:
- Unique
- Numerical values ^wihifh
- 64 bit ^p2wrn9
- Ordered by date ^359cio
- able to be generated > 10k UIDs/s ^duaoo2
### Multi-master Replication
Multi-master replication does use the auto-increment feature of database systems, but increments by the number of servers in use $k$.
#### Drawbacks
- Difficult to scale with multiple datacenters
- IDs don't [[System Design Interview#^359cio|increase with time]] across multiple servers
- Doesn't scale well when servers are added or removed (you have to change $k$ across all servers)
### UUID
A UUID is a 128-bit number used to identify information in computer systems, with an extremely low probability of collision.
Using a UUID does not require coordination between servers because of the low probability of a collision.
#### Example
`a8098c1a-f86e-11da-bd1a-00112444be1e`
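For reference, Python's standard library can generate UUIDs directly; the example above is a time-based (version 1) UUID, while version 4 is random:

```python
import uuid

print(uuid.uuid1())  # version 1: time-based, like the example above
print(uuid.uuid4())  # version 4: random; collision probability is negligible
```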
#### Drawbacks
- 128 bits long, vs. the [[System Design Interview#^p2wrn9|64-bit requirement]]
- IDs don't [[System Design Interview#^359cio|increase with time]]
- IDs are able to be [[System Design Interview#^wihifh|non-numeric]]
### Ticket Server
Originally developed by Flickr, this centralizes the auto-increment feature to one server, and web servers use that to generate their primary keys.
#### Drawbacks
- Single point of failure ^qptcxy
- Using multiple ticket servers to get around the [[System Design Interview#^qptcxy|single point of failure]] raises the same synchronization issues that [[#Multi-master Replication]] has.
### Twitter Snowflake Approach
This strategy uses divide and conquer, where each segment of the 64-bit ID represents something different:
- Sign bit: The 1st bit, always 0 but reserved for future use.
- Timestamp: 41 bits. ms since epoch. This allows sorting by time. Max value is $2^{41} - 1 = 2{,}199{,}023{,}255{,}551$ ms, or about 69 years since the epoch
> [!WARNING]
> A deeper dive thing to go into would be to address clock synchronization between machines generating these timestamps.
- Datacenter ID: 5 bits, max capacity of $2 ^ 5 = 32$ datacenters
- Machine ID: 5 bits, max capacity of $2^5 = 32$ machines (per datacenter)
- Sequence number: 12 bits. For every ID generated on the same machine or process in the same millisecond, this is incremented by 1. After each millisecond, reset to 0. This means that we can have $2 ^ {12} = 4096$ combinations per millisecond, which is well above our [[System Design Interview#^duaoo2|requirement of > 10k UIDs/s]].
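A minimal, single-threaded sketch of a Snowflake-style generator under the bit layout above (the custom epoch value is the commonly cited Twitter one; treat it as an assumption):

```python
import time

# Minimal Snowflake-style ID generator sketch (not thread-safe).
# Bit layout: 1 sign bit | 41 timestamp bits | 5 datacenter bits | 5 machine bits | 12 sequence bits.
EPOCH_MS = 1288834974657  # assumed custom epoch (commonly cited for Twitter Snowflake)

class SnowflakeGenerator:
    def __init__(self, datacenter_id, machine_id):
        assert 0 <= datacenter_id < 32 and 0 <= machine_id < 32
        self.datacenter_id = datacenter_id
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1

    def next_id(self):
        now_ms = int(time.time() * 1000)
        if now_ms == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF   # 12-bit sequence
            if self.sequence == 0:                        # 4096 IDs used this ms: wait for the next ms
                while now_ms <= self.last_ms:
                    now_ms = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = now_ms
        return (
            ((now_ms - EPOCH_MS) << 22)
            | (self.datacenter_id << 17)
            | (self.machine_id << 12)
            | self.sequence
        )
```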
## Chapter 8: Design a URL Shortener
Design a URL shortener that does the following:
- Shortening - given long URL, return shorter URL
- Redirecting - given a shortened URL, redirect to original URL
- High availability, scalability, and fault tolerance
### Estimations
Writes: given 100M URLs generated per day, write QPS is $100M / (24 \times 3600) \approx 1160$
Reads: assuming a 10:1 read-to-write ratio, read QPS $\approx 11600$
Assume the URL shortener will run for 10 years, so it must support $100M \times 365 \times 10 = 365B$ records
If the average URL length is 100 characters, the storage requirement over 10 years is $365B \times 100 \text{ bytes} = 36.5$ TB
### API Endpoints
Shortening: client sends POST with one parameter, original URL.
Redirecting: client sends GET request, returns long URL for redirect
>[!Note]
>There are two redirect codes that can be used. The 301 redirect communicates a permanent move to the long URL, which causes the browser to cache the response and send future requests straight to the original URL. The 302 redirect communicates a temporary move, so future requests will also go through the service. The upside of 302 is that it enables analytics on traffic usage patterns on the service itself.
### Data Model
A simple row where `id` is the primary key, and there are two more fields: `longUrl` and `shortUrl`.
### Hash Function
To shorten a URL, we want a hash function that maps a long URL to a hash value.
The hash value consists of `[0-9A-Za-z]`, which gives us $10 + 26 + 26 = 62$ possible characters per position. We need the smallest hash length $n$ such that $62^n$ exceeds the number of records we must support (365B). $62^7 \approx 3.5$ trillion, so $n = 7$ is sufficient. ^3528td
#### Hash + collision resolution
You can use existing hash functions like CRC32, MD5, etc. They generate longer strings than required, so you'll have to slice some of the generated keys, which increases the risk of collision. If one happens, you can recursively add predefined strings until the collision is resolved.
Querying the DB to check for a collision on each attempt can get expensive, so you can use bloom filters to help with that.
#### Base-62 conversion
Since there are [[System Design Interview#^3528td|62 possible characters]] at each position in the string, you can convert the primary key for that URL to base-62 as your hash function. This needs a [[#Chapter 7: Design a Unique ID Generator in Distributed Systems|unique ID generator]] for the primary keys, and if that generator creates IDs sequentially, nearby URLs can be guessed, which is a security risk.
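A minimal base-62 encoder sketch (the alphabet ordering here is an assumption; any fixed ordering of `[0-9A-Za-z]` works as long as it's used consistently):

```python
# Minimal base-62 conversion sketch for a numeric ID.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(num):
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num > 0:
        num, rem = divmod(num, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

# e.g. base62_encode(2009215674938) == "zn9edcu" (7 characters) with this alphabet ordering
```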
### URL Shortening Deep Dive
This deep dive covers the base-62 conversion.
Steps:
1. Input is `longUrl`
2. Check if `longUrl` is in the DB
3. If so, fetch the shortened URL from the DB and return it to the client
4. If not, generate a new unique ID with the unique ID generator
5. Convert the ID to a short URL with base-62 conversion
6. Save the ID, short URL, and long URL to the DB
### URL Redirect Deep Dive
1. User clicks shortened URL
2. Load balancer forwards request to web servers
3. If the short URL is in the cache, return the long URL directly
4. If not in the cache, check the DB. If it's not there either, the user likely entered an invalid short URL.
5. Return the long URL to the user
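A toy end-to-end sketch of both flows, with in-memory dicts standing in for the database, a simple counter standing in for the distributed unique ID generator, and reusing the `base62_encode` sketch above:

```python
import itertools

# Toy in-memory version of the shortening and redirect flows above.
db_long_to_short = {}
db_short_to_long = {}
id_generator = itertools.count(1)           # stand-in for a distributed unique ID generator

def shorten(long_url):
    if long_url in db_long_to_short:        # steps 2-3: already shortened
        return db_long_to_short[long_url]
    new_id = next(id_generator)             # step 4: unique ID
    short = base62_encode(new_id)           # step 5: base-62 conversion (sketch above)
    db_long_to_short[long_url] = short      # step 6: persist the mapping
    db_short_to_long[short] = long_url
    return short

def redirect(short_url):
    return db_short_to_long.get(short_url)  # None means an invalid short URL
```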
### Other points to talk about
- rate limiting
- analytics
- database scaling
## Chapter 9: Design a Web Crawler
The basic algorithm is as follows:
1. Given set of URLs, download all associated web pages
2. Extract any URLs in those pages
3. Add new URLs to list, repeat
A good web crawler is...
1. Scalable - the web is huge, crawling it should be extremely efficient and use parallelization
2. Robust - crawler must handle bad HTML, unresponsive servers, crashes, malicious links, etc.
3. Polite - crawler should not make too many requests to a single site
4. Extensible - minimal changes to support new content types
### Back of the envelope estimation
- Assume 1B web pages downloaded every month
- QPS: 1,000,000,000 pages / 30 days / 24 hours / 3600 seconds = ~400 pages per second
- Peak QPS = 2 * QPS = 800
- Assume avg web page size = 500 KB
- 1B pages * 500 KB = 500 TB of storage per month
- If stored for 5 years: 500 TB * 12 months * 5 years = 30 PB
### High level design
![[IMG_9002 Large.png]]
### Seed URLs
- Simply a list of URLs
- You need to intelligently pick URLs, so think out loud about how to choose and divide
### URL Frontier
This is a FIFO queue of URLs to be downloaded; FIFO traversal gives breadth-first search (BFS), which works better for the web than DFS.
This can ensure politeness, prioritization, and freshness.
The URL frontier actually consists of two queues:
#### Priority (front) queue
![[IMG_9004 Large.png]]
Prioritizer takes list of addresses and computes priorities. Each queue has some priority associated with it, and the selector selects from queues based on their priority, sending URLs to the back queue.
#### Politeness (back) queue
Uses a mapping table to map domain to queue. A router ensures URL goes to right queue, while the selector ensures that URLs are selected in order to avoid DDoS'ing and other bad behaviors. Worker threads do the actual downloads.
![[IMG_9003 Large.png]]
#### Storage
Most URLs are on disk, but in-memory buffers are used for enqueue/dequeue operations.
### HTML Downloader
"Simply" reads URLs and downloads them if not blocked by a `robots.txt` file.
The downloader calls the DNS resolver to get IPs from URL and sends downloaded content to the content parser.
#### Performance Optimizations
1. Distribute crawl across servers and threads, partitioning URLs
2. Cache DNS - DNS is synchronous so it can become a bottleneck
3. Locality - distribute crawl servers close to their targets
4. Short time-out - don't spend too much time on slow servers.
### Content Parser
This parses the HTML from downloader.
### Content seen?
This hashes incoming content, then checks if it's in a database of encountered content. If not, adds it.
### Content storage
Store popular sites in cache, put the rest on disk-based database.
### URL extractor
This gets URLs for future use, converts relative URLs to absolute ones.
### URL filter
Used to exclude error pages, blacklisted sites, etc.
### URL seen
Use a bloom filter or hash map to prevent adding the same URL back to the beginning of the crawler structure.
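A tiny Bloom filter sketch for the "URL seen?" check (the bit-array size and hash count are illustrative):

```python
import hashlib

# A tiny Bloom filter sketch for "URL seen?" checks.
class BloomFilter:
    def __init__(self, size_bits=1_000_000, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, url):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{url}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        # False means definitely unseen; True means probably seen (false positives possible).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

seen = BloomFilter()
if not seen.might_contain("https://example.com"):
    seen.add("https://example.com")   # enqueue only genuinely new URLs
```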
### Additional system improvements
#### Robustness
- Consistent hashing to spread load across servers
- Save crawl state and data so crawls can be restarted after failures
- Exception handling
- Data validation
#### Extensibility
Support new content types with new modules. A modular architecture would enable this.
#### Avoiding problematic content
1. Redundant content - checksums or hashes to skip duplicates
2. Spider traps - hard to avoid programmatically, will have to manually blacklist
3. Data noise - ads, code snippets, spam URLs, etc.