Imagine you are the city planner for a new town. You do not start by pouring concrete. You start by asking: how many people will live here? Where will they work? How will they get around? What happens if the main road floods?
System design is the same thing, but for software. Before you write a single line of code, you answer: who are the users? What will they do? How much traffic will the system handle? What breaks first, and how do we recover?
Most engineers learn to build features. System design teaches you to build systems that keep working when 10 users become 10 million, when a database dies at 3am, or when a viral tweet sends 50x your normal traffic.
This distinction matters in interviews because interviewers do not ask “can you implement a login page?” They ask “design Twitter” or “design a URL shortener.” They want to see you think about the whole system, not just one function.
The difference: a feature is a function. A system is an architecture of components that communicate, fail, and scale together. This post gives you the vocabulary and mental models to think at the system level.
Think of a restaurant. You (the client) sit at a table and look at a menu. You tell the waiter (the network) what you want. The waiter takes your order to the kitchen (the server). The kitchen prepares your food and sends it back through the waiter. You eat.
That is the entire client-server model. The client makes requests, the server processes them, and the network carries messages between them.
Three components:
The request-response cycle is the fundamental pattern. The client sends a request, the server processes it and returns a response. The client waits. This is synchronous by default.
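To make the cycle concrete, here is a minimal sketch in Python. It assumes a connected pair of local sockets standing in for the network, and the names `serve_once` and `request_response` are made up for illustration. A real server would listen on a port and loop over many clients.

```python
import socket
import threading


def serve_once(conn: socket.socket) -> None:
    """Server: read one request, process it, send back one response."""
    with conn:
        # A single recv() is enough for this tiny payload; real servers
        # loop until the whole request has arrived.
        order = conn.recv(1024).decode()
        conn.sendall(f"served: {order}".encode())


def request_response(order: str) -> str:
    """Client: send a request, then block until the response arrives."""
    # socketpair() returns two connected sockets, standing in for the network.
    client_end, server_end = socket.socketpair()
    kitchen = threading.Thread(target=serve_once, args=(server_end,))
    kitchen.start()
    with client_end:
        client_end.sendall(order.encode())
        reply = client_end.recv(1024).decode()  # synchronous: the client waits here
    kitchen.join()
    return reply


if __name__ == "__main__":
    print(request_response("burger"))
```

The client does nothing between `sendall` and `recv` returning, which is what "synchronous by default" means in practice.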
Variations exist. A thin client does almost nothing (a browser rendering HTML from the server). A thick client does heavy lifting locally (a mobile app with offline support and local computation). Peer-to-peer systems remove the central server from the data path: every node acts as both client and server (BitTorrent, WebRTC), although peers often still rely on a server to find each other.
In practice a request rarely goes straight from client to server. It travels from the client through DNS to the server, the server often calls a database, and then the response flows back. Each step has a time cost, and those costs add up.
When you type https://example.com/users into your browser and press Enter, a surprising amount happens before anything appears on screen. Understanding each step is what separates engineers who debug effectively from engineers who stare at logs hoping for inspiration.
Here is the full lifecycle:
1. DNS Lookup: The browser asks a DNS resolver to translate example.com into an IP address like 93.184.216.34. The lookup checks the browser cache, then the OS cache, then the router cache, then the ISP's DNS server. On a cache miss, the resolver performs a recursive lookup, querying the root, TLD, and authoritative name servers in turn. Time: 20-120ms.
2. TCP Handshake: The browser opens a TCP connection to that IP address. Three packets: SYN (client says “I want to connect”), SYN-ACK (server says “OK”), ACK (client says “Let’s go”). Time: 10-100ms depending on physical distance.
3. TLS Handshake: For HTTPS, the browser and server negotiate encryption. The server presents its certificate, both sides agree on a cipher suite, and they exchange key material. After this, all application data on the connection is encrypted. Time: 30-150ms.
4. HTTP Request: The browser sends the actual request: method (GET), path (/users), headers (Authorization, Accept, cookies), and a body for methods such as POST and PUT. Time: 5-50ms.
5. Server Processing: The server receives the request, runs middleware (authentication, rate limiting, logging), executes the route handler, queries the database or cache, and constructs the response. Time: 10-500ms.
6. HTTP Response: The server sends back a status code (200 OK), headers (Content-Type, Cache-Control), and the response body (JSON, HTML, etc.). Time: 5-50ms.
7. Browser Rendering: The browser parses HTML, builds the DOM, applies CSS, executes JavaScript, and paints pixels. Time: 50-300ms.
Adding up the ranges gives a total of 130-1270ms for a typical request on a fresh connection. And that is for the happy path. Things can go wrong at every step: DNS can time out, TCP connections get dropped, TLS certificates expire, servers crash, databases lock up, and the browser can run out of memory.
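The total is just the sum of the per-phase ranges, and writing it out makes it easy to see which phase dominates. A small sketch, using the ranges quoted in the lifecycle; the names `PHASES_MS`, `total_range_ms`, and `slowest_phase` are illustrative:

```python
# Latency ranges in milliseconds, taken from the lifecycle steps above.
PHASES_MS = {
    "DNS lookup": (20, 120),
    "TCP handshake": (10, 100),
    "TLS handshake": (30, 150),
    "HTTP request": (5, 50),
    "Server processing": (10, 500),
    "HTTP response": (5, 50),
    "Browser rendering": (50, 300),
}


def total_range_ms(phases: dict) -> tuple:
    """Sum the best-case and worst-case time across all phases."""
    best = sum(low for low, _ in phases.values())
    worst = sum(high for _, high in phases.values())
    return best, worst


def slowest_phase(phases: dict) -> str:
    """Return the phase with the largest worst-case time."""
    return max(phases, key=lambda name: phases[name][1])


if __name__ == "__main__":
    print(total_range_ms(PHASES_MS))  # (130, 1270)
    print(slowest_phase(PHASES_MS))   # Server processing
```

In the worst case, server processing alone accounts for 500 of the 1270ms, which is why backend work tends to be the first place to look.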
Imagine a highway. Latency is the time it takes one car to travel from one end to the other. Throughput is how many cars pass through per hour. These are different things, and optimizing one does not automatically improve the other.
A single lane road might have low latency (one car goes fast) but terrible throughput (only one car at a time). A ten-lane highway has high throughput but the same latency per car. Adding more lanes helps throughput, not latency. Making cars faster helps latency, not throughput.
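One way to see the difference is a toy pipeline model. It assumes a constant latency per request and a constant admission rate, which real systems only approximate, and `completion_time_s` is a hypothetical helper name:

```python
def completion_time_s(num_requests: int, latency_s: float, throughput_per_s: float) -> float:
    """Time until the last of num_requests finishes in a simple pipeline model."""
    if num_requests < 1:
        return 0.0
    # The last request is admitted after the first (num_requests - 1) have
    # entered, then takes one full latency to complete.
    return (num_requests - 1) / throughput_per_s + latency_s


if __name__ == "__main__":
    print(completion_time_s(1, 0.1, 100))      # one car: latency dominates
    print(completion_time_s(1000, 0.1, 100))   # many cars: throughput dominates
    print(completion_time_s(1000, 0.1, 200))   # doubling lanes roughly halves the total
    print(completion_time_s(1000, 0.05, 100))  # faster cars barely change the total
```

In this model, halving latency only helps a bulk job by a few hundredths of a second, while doubling throughput cuts its completion time roughly in half.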
Formal definitions:
These two are connected by Little’s Law: L = lambda * W, where L is the number of requests in the system, lambda is the arrival rate (throughput), and W is the average time in the system (latency). If you know any two, you can calculate the third.
Why this matters: if your API has 200ms latency and you need to handle 10,000 QPS, Little’s Law tells you that at any given moment there are 0.2 * 10000 = 2000 requests in flight. Your system needs capacity (connections, memory, threads) for all 2000 simultaneously.
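The same arithmetic as a short sketch. The function names are illustrative, and the second helper simply rearranges the law to answer the reverse question: given a concurrency limit, how much traffic can you sustain?

```python
def requests_in_flight(arrival_rate_per_s: float, avg_latency_s: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * avg_latency_s


def max_arrival_rate(concurrency_limit: float, avg_latency_s: float) -> float:
    """Little's Law rearranged: lambda = L / W."""
    return concurrency_limit / avg_latency_s


if __name__ == "__main__":
    # 10,000 QPS at 200ms latency: about 2000 requests in flight.
    print(requests_in_flight(10_000, 0.2))
    # A pool of 500 connections at 200ms latency sustains about 2500 QPS.
    print(max_arrival_rate(500, 0.2))
```

The second call shows why latency regressions hurt capacity: if latency doubles to 400ms, the same 500 connections sustain only about 1250 QPS.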
Optimizing one can hurt the other. Batching requests improves throughput (fewer round trips) but increases latency (each request waits for the batch to fill). Compression reduces the bandwidth you use but adds CPU time (higher latency per request). Connection pooling reduces latency (reusing connections) but can cap throughput, because a fixed pool size limits how many requests run at once.
Your laptop is slow. You have two options: buy a faster laptop (vertical scaling) or buy more laptops (horizontal scaling).
Vertical scaling (scale up) means adding more resources to a single machine: more CPU, more RAM, faster SSD, more network bandwidth. It is easy: usually no code changes and no distributed systems complexity. But it has hard limits. You cannot buy a machine with infinite CPU. At some point the biggest machine available is not enough, and the price climbs much faster than the capacity as you approach the top of the range.
Horizontal scaling (scale out) means adding more machines and distributing load across them. In the ideal case, two servers handle about twice the traffic and ten servers about ten times. There is no hard ceiling, because you can keep adding machines. But it is hard. Your application servers must be stateless (any request can go to any server), you need a load balancer to distribute traffic, you need to handle data consistency across machines, and failures become normal (if you have 100 servers, expect one to die every few days).
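Here is a minimal sketch of the two ingredients, a load balancer and a stateless handler. Round-robin is just one possible distribution strategy, and the class and function names are made up for illustration:

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in a fixed rotation, one per incoming request."""

    def __init__(self, servers: list) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(servers)

    def pick(self) -> str:
        return next(self._rotation)


def handle(server: str, user_id: int) -> str:
    """Stateless handler: the result depends only on the request, not the server."""
    return f"profile for user {user_id}"


def simulate(num_requests: int, servers: list) -> Counter:
    """Route requests through the balancer and count the load per server."""
    balancer = RoundRobinBalancer(servers)
    load = Counter()
    for user_id in range(num_requests):
        server = balancer.pick()
        handle(server, user_id)
        load[server] += 1
    return load


if __name__ == "__main__":
    print(simulate(9, ["web-1", "web-2", "web-3"]))
```

Because `handle` keeps nothing in server memory between calls, it does not matter which machine the balancer picks, and that property is what makes adding a fourth server safe.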
Auto-scaling is the practice of automatically adding or removing servers based on load. When CPU usage exceeds 70%, spin up more servers. When it drops below 30%, remove some. Cloud providers (AWS Auto Scaling, GCP Autoscaler) do this for you, but you need to design your system to support it — stateless servers, health checks, and graceful shutdown.
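The threshold rule can be sketched in a few lines. This is not the API of any cloud provider: the function name, the step of one server at a time, and the `min_servers` / `max_servers` bounds are assumptions added for illustration. Only the 70% and 30% thresholds come from the text.

```python
def desired_servers(
    current: int,
    cpu_percent: float,
    scale_up_at: float = 70.0,
    scale_down_at: float = 30.0,
    min_servers: int = 1,
    max_servers: int = 20,
) -> int:
    """Return the server count a simple threshold policy would ask for."""
    if cpu_percent > scale_up_at:
        target = current + 1   # overloaded: add one server
    elif cpu_percent < scale_down_at:
        target = current - 1   # underused: remove one server
    else:
        target = current       # inside the comfortable band: do nothing
    # Never drop below the floor or grow past the ceiling.
    return max(min_servers, min(max_servers, target))


if __name__ == "__main__":
    print(desired_servers(4, 85.0))  # 5
    print(desired_servers(4, 20.0))  # 3
    print(desired_servers(4, 50.0))  # 4
```

The gap between the two thresholds matters: if they were close together, the fleet would flap between adding and removing servers on every small change in load.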
When to use which: start with vertical scaling for simplicity. Move to horizontal when you hit the limits of a single machine or need the redundancy that multiple machines provide. Most production systems use both: reasonably powerful machines, scaled horizontally.
A store that is open 24/7 but sometimes sells expired food has high availability and low reliability. A store with perfect products that closes randomly has high reliability and low availability. You want both, but they are different things.
Availability is the percentage of time your system is operational and reachable. Measured in “nines”:
| Nines | Downtime per year | Downtime per month |
|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds |
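Each additional nine cuts the downtime budget by a factor of ten. Moving from 99% to 99.9% is usually straightforward. Moving from 99.99% to 99.999% is far harder and more expensive, because it leaves only about five minutes of downtime per year. Each additional nine requires redundancy, failover mechanisms, multi-region deployment, and extensive testing.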
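The downtime figures in the table follow directly from the percentage. A short sketch, assuming a 365-day year and a month defined as one twelfth of a year; the function names are illustrative:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


def downtime_seconds_per_month(availability_percent: float) -> float:
    """Allowed downtime per month, taking a month as 1/12 of a year."""
    return downtime_seconds_per_year(availability_percent) / 12


if __name__ == "__main__":
    for nines in (99.0, 99.9, 99.99, 99.999):
        per_year_h = downtime_seconds_per_year(nines) / 3600
        per_month_min = downtime_seconds_per_month(nines) / 60
        print(f"{nines}%: {per_year_h:.2f} h/year, {per_month_min:.2f} min/month")
```

Running it for 99.9% gives 8.76 hours per year and 43.8 minutes per month, matching the three-nines row.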
Reliability is the probability that the system performs correctly when it is available. A system can be available but return wrong data (available but unreliable). A system can be down for maintenance but when it is up, every response is correct (reliable but not available).
Three terms you will hear constantly:
In interviews, when someone says “design a system with 99.99% availability,” they are asking you to think about redundancy, failover, and how to minimize downtime across all failure modes.
Every car comes with a spare tire. You do not plan to get a flat, but you prepare for it because flats happen. Fault tolerance is the software equivalent: designing your system so that individual component failures do not cause system-wide outages.
Three terms to distinguish:
The key strategy is redundancy: having multiple copies of every critical component. If you have one database and it dies, your system fails. If you have one primary and one replica, the replica takes over. If you have three servers behind a load balancer and one dies, the other two handle the traffic.
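A rough way to quantify what redundancy buys you, as a sketch under strong assumptions: the copies fail independently of each other, and failover is instant and never fails itself. Real systems fall short of both, so treat the result as an upper bound. The function name is illustrative.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, any one of which suffices.

    Assumes independent failures: the system is down only when every copy
    is down at the same time, which happens with probability (1 - single) ** copies.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - single) ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, combined_availability(0.99, n))
```

Under these assumptions, two copies of a 99% component give roughly four nines and three copies roughly six. Correlated failures, such as a shared power supply or a bad deploy, are what erode that number in practice.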
Graceful degradation means the system provides reduced functionality rather than failing completely. If your search service is down, show cached results instead of a blank page. If your recommendation engine fails, show popular items. If your image CDN is slow, serve lower-resolution images. The user gets a degraded experience rather than no experience.
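The search example might look like the sketch below. Everything here is hypothetical: `search_with_fallback`, the `live_search` callable, and the dictionary used as a cache are stand-ins for whatever your system actually uses.

```python
from typing import Callable


def search_with_fallback(
    query: str,
    live_search: Callable[[str], list],
    cached_results: dict,
) -> dict:
    """Try the live search service; fall back to cached results if it fails."""
    try:
        return {"results": live_search(query), "degraded": False}
    except Exception:
        # Broad catch keeps the sketch short; a real system would catch
        # specific errors (timeouts, connection failures) and log them.
        return {"results": cached_results.get(query, []), "degraded": True}


def broken_search(query: str) -> list:
    """Stand-in for a search service that is currently down."""
    raise ConnectionError("search service unavailable")


if __name__ == "__main__":
    cache = {"shoes": ["cached shoe A", "cached shoe B"]}
    print(search_with_fallback("shoes", broken_search, cache))
```

Returning a `degraded` flag lets the caller decide how to present the result, for example by showing a "results may be out of date" notice instead of pretending nothing is wrong.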
Common fault tolerance patterns:
You built an API. It works. But how well does it work? These are the metrics that matter:
QPS (Queries Per Second) is the raw volume metric. How many requests is your system handling right now? This tells you about load. If your QPS is 100 and your system can handle 10,000, you have headroom. If your QPS is 9,500 and your limit is 10,000, you are about to have a bad day.
Latency percentiles tell you about user experience. The most important ones:
Why averages lie: if 98 requests take 50ms and 2 requests take 5000ms, the average is 149ms. That looks fine. But p99 is 5000ms, and those two users had a terrible experience. The average hides outliers. Always use percentiles.
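You can check this kind of claim yourself with a few lines of Python. The sketch below assumes the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values) and a sample of 100 requests in which 98 take 50ms and 2 take 5000ms:

```python
import math
import statistics


def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    latencies_ms = [50] * 98 + [5000] * 2
    print("mean:", statistics.mean(latencies_ms))  # 149
    print("p50: ", percentile(latencies_ms, 50))   # 50
    print("p99: ", percentile(latencies_ms, 99))   # 5000
```

The mean and the median both look healthy here. Only the tail percentile shows that some users waited five seconds.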
Bandwidth vs throughput: bandwidth is the maximum data transfer rate of your network link (measured in Mbps or Gbps). Throughput is the actual data transfer rate you achieve (always less than or equal to bandwidth). You can have a 10Gbps network link (bandwidth) but only push 2Gbps through it (throughput) because your disks are slow, your CPU is busy, or the other end cannot receive fast enough.
Interviewers love capacity estimation questions: “Estimate the storage requirements for Twitter” or “How much bandwidth does Netflix need?” This is not about exact numbers. It is about showing you can think in orders of magnitude and follow a structured approach.
The method is always the same, just with different numbers:
Let’s walk through it for Twitter:
Those numbers are rough estimates, but they are in the right ballpark. The interviewer is checking: can you identify the right variables, use reasonable assumptions, and do the math without getting lost?
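Every estimate of this kind boils down to a handful of multiplications, so it can help to see the arithmetic written out. The sketch below uses made-up inputs (200 million daily active users, 2 posts per user per day, 300 bytes per post, a 2x peak factor). These are illustrative assumptions, not Twitter's real figures, and `estimate_capacity` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def estimate_capacity(
    daily_active_users: int,
    posts_per_user_per_day: float,
    bytes_per_post: int,
    peak_factor: float = 2.0,
) -> dict:
    """Back-of-the-envelope write load and storage from a few assumed inputs."""
    posts_per_day = daily_active_users * posts_per_user_per_day
    avg_write_qps = posts_per_day / SECONDS_PER_DAY
    storage_per_day_bytes = posts_per_day * bytes_per_post
    return {
        "posts_per_day": posts_per_day,
        "avg_write_qps": avg_write_qps,
        "peak_write_qps": avg_write_qps * peak_factor,
        "storage_per_day_bytes": storage_per_day_bytes,
        "storage_per_year_bytes": storage_per_day_bytes * 365,
    }


if __name__ == "__main__":
    # All inputs are assumptions chosen for round numbers.
    result = estimate_capacity(200_000_000, 2, 300)
    print(f"posts/day:      {result['posts_per_day']:,.0f}")
    print(f"avg write QPS:  {result['avg_write_qps']:,.0f}")
    print(f"peak write QPS: {result['peak_write_qps']:,.0f}")
    print(f"storage/day:    {result['storage_per_day_bytes'] / 1e9:,.0f} GB")
    print(f"storage/year:   {result['storage_per_year_bytes'] / 1e12:,.1f} TB")
```

With these inputs the result is roughly 4,600 writes per second on average and about 44 TB of post text per year. Change any assumption and the order of magnitude can shift, which is exactly the sensitivity an interviewer wants you to talk through.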
Answer these without looking back:
| Concept | What It Measures | Typical Unit | Interview Use |
|---|---|---|---|
| Latency | Time per request | Milliseconds | “p99 under 200ms” |
| Throughput | Requests per time | QPS / RPS | “10,000 QPS” |
| Availability | Uptime percentage | Nines (99.9%) | “four nines availability” |
| Reliability | Correctness | Error rate % | “error rate below 0.1%” |
| Bandwidth | Network capacity | Mbps / Gbps | “10 Gbps backbone” |
| Storage | Data volume | TB / PB | “500 TB per year” |