Load Balancing & API Gateway: Traffic Control for Distributed Systems

Why Do We Need Load Balancing?

Imagine a restaurant with a single waiter serving 50 tables. Every customer waits. The waiter forgets orders. Customers leave angry.

Now imagine 4 waiters with a host at the front door who seats people evenly. No one waits too long. The host is the load balancer.

In software, the problem is identical. One server can only handle so many concurrent requests. Once CPU hits 100% or memory fills up, every new request slows down — or fails entirely. A single server is also a single point of failure. When it crashes, your entire service goes down.

Without a load balancer:

One server gets crushed while others sit idle
No failover — if the server dies, the service dies
No way to scale — adding more servers doesn’t help if traffic only hits one
Downtime during deploys — restarting the server means zero availability

A load balancer solves all of this by sitting between clients and servers, distributing requests intelligently.

How Load Balancers Work

The load balancer sits between clients and servers. Clients talk to the load balancer’s IP. The load balancer decides which backend server handles each request and forwards it. The client never knows which server actually processed the request.

The LB maintains a pool of backend servers. It needs to know which servers are alive and ready to handle traffic. This is where health checks come in.

Health checks come in two forms. An active health check periodically sends each server a request, typically an HTTP request to a /health endpoint. A passive health check watches real traffic for failed connections. If a server doesn’t respond within the timeout, the LB removes it from the pool and stops sending traffic there. When it starts responding again, the LB adds it back.

The result: if a server crashes, the LB detects it within seconds and redistributes traffic to the remaining healthy servers. Clients see a brief slowdown or a retry, not an outage.

L4 vs L7 Load Balancers

Not all load balancers are equal. The main distinction is which layer of the network stack they operate at.

L4 (Transport layer) load balancers make routing decisions based on IP address and port number. They see a TCP connection from 203.0.113.42:52341 to 10.0.0.5:443 and forward the entire TCP stream to a backend server. They don’t inspect the HTTP content. This makes them fast — they can handle millions of connections per second with minimal latency.

Think of L4 as a mail room. It looks at the address on the envelope and puts it in the right bin. It doesn’t open the letter.

L7 (Application layer) load balancers understand HTTP. They inspect the request method, path, headers, cookies, and even the request body. This lets them route /api/users to one service and /api/orders to another. They can terminate TLS, rewrite URLs, add headers, and make decisions based on session cookies.

Think of L7 as a receptionist who reads the letter, understands what it says, and decides which department should handle it.

| Feature | L4 | L7 |
|---------|----|----|
| Decision basis | IP + port | HTTP path, headers, cookies |
| Speed | Very fast (millions/sec) | Slower (thousands/sec) |
| Content inspection | No | Yes |
| Use case | Distribute TCP connections | Route HTTP to microservices |
| Examples | HAProxy (TCP mode), NLB (AWS) | Nginx, ALB (AWS), Envoy |

When to use each: use L4 when you need raw throughput and all servers are identical. Use L7 when you have microservices with different endpoints, need SSL termination at the LB, or want content-based routing.
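To make the contrast concrete, here is a minimal sketch of the two kinds of routing decision. The pool names, the port table, and the route table are all hypothetical; a real load balancer would also track connections and health.

```python
# Hypothetical pools and routes, for illustration only.
L4_LISTENERS = {443: "backend-pool-a", 5432: "postgres-pool"}
L7_ROUTES = {
    "/api/users": "user-service",
    "/api/orders": "order-service",
}


def route_l4(dst_port: int) -> str:
    """L4: only addressing information is visible, so the port decides."""
    return L4_LISTENERS[dst_port]


def route_l7(path: str) -> str:
    """L7: the HTTP path is visible, so the longest matching prefix decides."""
    matches = [prefix for prefix in L7_ROUTES if path.startswith(prefix)]
    if not matches:
        return "default-service"
    return L7_ROUTES[max(matches, key=len)]


# Both requests arrive on port 443, so the L4 decision is the same for each.
print(route_l4(443), route_l4(443))
# The L7 decision differs because the path is inspected.
print(route_l7("/api/users/7"), route_l7("/api/orders/9"))
```

The L4 function never receives the path at all, which is the point: it cannot tell /api/users from /api/orders.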

Load Balancing Algorithms

The algorithm is the brain of the load balancer — it decides which server gets each request.

Round Robin is the simplest. Requests cycle through servers in order: A, B, C, D, A, B, C, D. Fair when all servers are identical. Doesn’t account for different server capacities or current load.
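A minimal sketch of the rotation, assuming a fixed pool of four servers named A to D:

```python
from itertools import cycle

servers = ["A", "B", "C", "D"]
rotation = cycle(servers)  # endless iterator: A, B, C, D, A, B, ...


def next_server() -> str:
    """Return the next server in the rotation, ignoring load and capacity."""
    return next(rotation)


order = [next_server() for _ in range(8)]
print(order)  # ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D']
```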

Weighted Round Robin accounts for different server sizes. If Server A has 4 cores and Server D has 1 core, you assign weights 4:3:2:1. Server A gets roughly 40% of traffic, Server D gets 10%. Still simple, but respects hardware differences.
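One way to implement the 4:3:2:1 split is a "smooth" weighted rotation, which interleaves servers instead of sending four requests in a row to Server A. The sketch below is one possible approach under that assumption, not the only way to honor weights.

```python
from collections import Counter

weights = {"A": 4, "B": 3, "C": 2, "D": 1}
current = {name: 0 for name in weights}  # running score per server
total = sum(weights.values())


def next_server() -> str:
    """Raise every score by its weight, pick the highest, then penalize it."""
    for name, weight in weights.items():
        current[name] += weight
    chosen = max(current, key=current.get)
    current[chosen] -= total
    return chosen


picks = [next_server() for _ in range(10)]
counts = Counter(picks)
print(picks)
print(counts)  # A: 4, B: 3, C: 2, D: 1 over one full cycle of 10 requests
```

Over each cycle of 10 requests, Server A receives 4 (40%) and Server D receives 1 (10%), matching the weights.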

Least Connections sends each request to the server with the fewest active connections. Better when requests take variable time — a server handling a slow query won’t accumulate connections because the LB sends new ones elsewhere.
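A minimal sketch, assuming the balancer keeps an in-memory count of open connections per server. The acquire and release names are illustrative.

```python
active = {"A": 0, "B": 0, "C": 0}  # open connections per server


def acquire() -> str:
    """Pick the server with the fewest open connections and count the new one."""
    server = min(active, key=active.get)
    active[server] += 1
    return server


def release(server: str) -> None:
    """Call when a request finishes so the server becomes attractive again."""
    active[server] -= 1


first_three = [acquire() for _ in range(3)]  # one connection each
release("B")                                 # B finishes early
next_pick = acquire()                        # B now has the fewest connections
print(first_three, next_pick)
```

A server stuck on a slow query keeps its count high, so new requests flow to the others until it calls release.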

Least Response Time combines connection count with response latency. It sends requests to the server that’s both least busy and fastest-responding. The most adaptive algorithm, but requires more computation.
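How connection count and latency are combined varies between implementations. The sketch below assumes one plausible scoring rule: (active connections + 1) multiplied by a moving average of latency. The rule, the smoothing factor, and the class name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    active: int = 0               # open connections
    avg_latency_ms: float = 1.0   # exponentially weighted moving average

    def observe(self, latency_ms: float, alpha: float = 0.2) -> None:
        """Blend a new latency sample into the moving average."""
        self.avg_latency_ms = (1 - alpha) * self.avg_latency_ms + alpha * latency_ms

    def score(self) -> float:
        """Lower is better: few connections and low latency both help."""
        return (self.active + 1) * self.avg_latency_ms


def pick(backends: list) -> Backend:
    """Return the backend with the lowest combined score."""
    return min(backends, key=lambda backend: backend.score())


pool = [
    Backend("A", active=1, avg_latency_ms=50.0),  # score 100
    Backend("B", active=3, avg_latency_ms=10.0),  # score 40
]
print(pick(pool).name)  # B: busier, but much faster
```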

IP Hash runs the client’s IP through a hash function and uses the result to pick a server. As long as the pool doesn’t change, the same IP always hits the same server. This provides session stickiness, which is useful when sessions are stored locally (though shared sessions or sticky tokens are better solutions).
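A minimal sketch, assuming a simple hash-then-modulo mapping. It uses hashlib rather than Python's built-in hash(), because string hashes from hash() are randomized per process by default and would not be stable across restarts.

```python
import hashlib


def pick_by_ip(client_ip: str, servers: list) -> str:
    """Map a client IP to a server; the same IP and pool give the same result."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["A", "B", "C", "D"]
first = pick_by_ip("203.0.113.42", servers)
second = pick_by_ip("203.0.113.42", servers)
print(first == second)  # True: repeat visits land on the same server
```

With plain modulo, adding or removing a server changes the mapping for many clients at once, which is one reason stickiness based on IP hash is fragile.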

| Algorithm | Best When | Ignores Capacity | Session Sticky |
|-----------|-----------|------------------|----------------|
| Round Robin | Identical servers | Yes | No |
| Weighted RR | Different server sizes | No (uses weights) | No |
| Least Connections | Variable request duration | No (uses active conns) | No |
| Least Response Time | Mixed workloads | No (uses latency) | No |
| IP Hash | Need session stickiness | Yes | Yes |

Health Checks & Failover

A load balancer is only as good as its awareness of server health. Health checks are how the LB stays informed.

Active health checks are periodic requests the LB sends to each server, typically a GET request to a /health endpoint every 5-30 seconds. The server responds with 200 OK if healthy. If the LB gets no response, a 5xx error, or a timeout (e.g., 3 seconds), it marks the server as unhealthy.
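A minimal sketch of one round of active checks, using only the standard library. The pool layout (URL mapped to a healthy flag) and the function names are assumptions; a real balancer would call run_checks on a timer, for example every 5-30 seconds.

```python
import urllib.request


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True only for a 2xx response that arrives within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib raises HTTPError, an OSError subclass, for 4xx and 5xx).
        return False


def run_checks(pool: dict, probe=http_probe) -> list:
    """Probe every server once, update the pool in place, return healthy URLs."""
    for url in pool:
        pool[url] = probe(url)
    return [url for url, healthy in pool.items() if healthy]


# Hypothetical backends; nothing is contacted until run_checks(pool) is called.
pool = {
    "http://10.0.0.5:8080/health": True,
    "http://10.0.0.6:8080/health": True,
}
```

Passing the probe as a parameter keeps the loop testable without a network: a stub such as `lambda url: False` simulates every server being down.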

Passive health checks monitor real traffic. If the LB detects a pattern of failed responses from a server (e.g., 5 consecutive 500s), it marks the server unhealthy without sending explicit health check requests. This catches issues that health endpoints might miss, such as an exhausted database connection pool where /health returns 200 but real requests fail.
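The "5 consecutive 500s" rule can be sketched as a small counter per server. The class name and threshold parameter are illustrative, and recovery is deliberately left out: this sketch only shows how a server gets ejected.

```python
class PassiveMonitor:
    """Track consecutive 5xx responses seen on real traffic."""

    def __init__(self, max_failures: int = 5):
        self.max_failures = max_failures
        self.failures = {}      # server -> consecutive failure count
        self.unhealthy = set()  # servers ejected from the pool

    def record(self, server: str, status: int) -> None:
        """Feed in the status code of every proxied response."""
        if status >= 500:
            self.failures[server] = self.failures.get(server, 0) + 1
            if self.failures[server] >= self.max_failures:
                self.unhealthy.add(server)
        else:
            self.failures[server] = 0  # any success resets the streak


monitor = PassiveMonitor(max_failures=5)
for _ in range(5):
    monitor.record("A", 500)
print(monitor.unhealthy)  # {'A'}
```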

Failover strategies:

Active-Active: All servers receive traffic. When one fails, its share is redistributed to the rest. This is the most common setup — you get full capacity and automatic failover.
Active-Passive: One server handles all traffic. A standby server waits idle. If the active server fails, the standby takes over. Simpler but wastes resources — you’re paying for a server that does nothing most of the time.

When a server fails and gets removed from the pool, the LB redistributes its traffic to the remaining servers. In a four-server pool where every server was already at 60% capacity, losing one pushes the other three to 80%. This is why you need capacity headroom: running servers close to 100% means any failure causes cascading overload.
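The headroom arithmetic can be checked with a few lines. This assumes traffic is spread evenly and all servers are the same size; the pool of four servers is an assumption that makes the 60% to 80% figure work out.

```python
def utilization_after_failure(utilization: float, servers: int, failed: int = 1) -> float:
    """Per-server utilization once the survivors absorb the failed servers' load."""
    if failed >= servers:
        raise ValueError("no servers left to absorb the load")
    return utilization * servers / (servers - failed)


def max_safe_utilization(servers: int, failed: int = 1) -> float:
    """Highest utilization at which the survivors stay at or below 100%."""
    return (servers - failed) / servers


print(f"{utilization_after_failure(0.60, servers=4):.0%}")  # 80%
print(f"{max_safe_utilization(servers=4):.0%}")             # 75%
```

With four servers, anything above 75% utilization means a single failure pushes the remaining three past their capacity.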

What Is an API Gateway?

Back to the restaurant analogy. The load balancer is the host who seats people at tables. The API gateway is the maitre d’ who handles reservations, checks dress code, manages the waitlist, and directs you to the right dining room.

An API gateway is a single entry point for all client requests. Instead of clients connecting directly to each microservice, they connect to the gateway, which handles cross-cutting concerns and routes requests to the appropriate backend.

Why not expose services directly? Because each service would need its own auth, rate limiting, logging, and CORS configuration. That’s duplicated effort and inconsistent security. The gateway centralizes all of it.

What a gateway does:

Routing: Maps incoming requests to the right backend service based on path, headers, or method
Authentication: Verifies JWT tokens, API keys, or OAuth credentials before forwarding
Rate limiting: Prevents any single client from overwhelming the system
Request/response transformation: Adds headers, strips internal fields, converts protocols
Logging & monitoring: Records every request for debugging and analytics
API versioning: Routes /v1/ and /v2/ to different service versions

API Gateway Deep Dive

Request routing is the core function. The gateway uses rules to match incoming requests to backend services. Path-based routing sends /api/users/* to the Users service and /api/orders/* to the Orders service. Header-based routing can send requests with X-Version: 2 to a different service version. Method-based routing can separate GET /api/products (read) from POST /api/products (write) into different services.
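A minimal sketch of rule-based routing that combines path, header, and method matching. The rule format and service names are hypothetical. Rules are checked in order, so more specific rules go first, and header names are matched exactly here even though HTTP treats them as case-insensitive.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    service: str
    path_prefix: str
    method: Optional[str] = None                 # None matches any method
    headers: dict = field(default_factory=dict)  # all listed headers must match

    def matches(self, method: str, path: str, headers: dict) -> bool:
        if not path.startswith(self.path_prefix):
            return False
        if self.method is not None and self.method != method:
            return False
        return all(headers.get(k) == v for k, v in self.headers.items())


RULES = [
    Rule("users-v2", "/api/users/", headers={"X-Version": "2"}),
    Rule("users", "/api/users/"),
    Rule("products-write", "/api/products", method="POST"),
    Rule("products-read", "/api/products", method="GET"),
    Rule("orders", "/api/orders/"),
]


def route(method: str, path: str, headers: Optional[dict] = None) -> Optional[str]:
    """Return the first matching service name, or None if nothing matches."""
    headers = headers or {}
    for rule in RULES:
        if rule.matches(method, path, headers):
            return rule.service
    return None


print(route("GET", "/api/users/7", {"X-Version": "2"}))  # users-v2
print(route("POST", "/api/products"))                    # products-write
```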

Authentication at the gateway means backend services don’t need to verify tokens individually. The gateway validates the JWT once, extracts user info, and passes it downstream as headers (X-User-Id, X-User-Role). This simplifies backend code and ensures consistent auth logic.
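To illustrate the validate-once-then-forward flow, here is a sketch that verifies an HS256-signed JWT with the standard library and derives the downstream headers. The claim names (sub, role) and function names are assumptions, and the sketch skips checks such as expiry and audience, so a production gateway would more likely rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64url_decode(part: str) -> bytes:
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build a demo token; in practice the auth service issues tokens."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def identity_headers(token: str, secret: bytes) -> Optional[dict]:
    """Return headers to forward downstream, or None if the token is invalid."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
            return None
        expected = hmac.new(
            secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
        ).digest()
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        claims = json.loads(_b64url_decode(payload_b64))
        return {
            "X-User-Id": str(claims["sub"]),
            "X-User-Role": str(claims.get("role", "")),
        }
    except (ValueError, KeyError, TypeError, AttributeError):
        return None  # malformed token: treat as unauthenticated


token = sign_hs256({"sub": "42", "role": "admin"}, b"demo-secret")
print(identity_headers(token, b"demo-secret"))
```

Backends that trust X-User-Id must only be reachable through the gateway; otherwise a client could set those headers directly.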

Rate limiting protects your system from abuse. Common strategies: fixed window (100 requests per minute), sliding window (smoother distribution), token bucket (allows short bursts). The gateway tracks requests per client IP or API key and returns 429 Too Many Requests when the limit is exceeded.
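Of the three strategies, the token bucket is the least obvious, so here is a minimal sketch. It assumes one bucket per client, kept in memory; the class name and the injectable clock are illustrative choices that make the behavior easy to test.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `refill_per_sec` tokens/second."""

    def __init__(self, capacity: int, refill_per_sec: float, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._now = now
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated = now()

    def allow(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        current = self._now()
        elapsed = current - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def status_for(bucket: TokenBucket) -> int:
    """HTTP status the gateway would return for the next request."""
    return 200 if bucket.allow() else 429


bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
print([status_for(bucket) for _ in range(3)])  # burst of 2 allowed, third rejected
```

A gateway would keep one such bucket per client IP or API key, for example in a dictionary keyed by that identifier.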

Request/response transformation lets the gateway modify traffic in flight. Add an X-Request-Id header for tracing. Strip internal fields from responses. Convert XML to JSON for legacy clients. Aggregate responses from multiple services into one response.
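Two of these transformations are small enough to sketch directly. The list of internal field names is hypothetical, and the functions operate on plain dictionaries rather than any particular framework's request objects.

```python
import uuid

INTERNAL_FIELDS = {"password_hash", "internal_notes"}  # hypothetical field names


def transform_request(headers: dict) -> dict:
    """Copy the headers and add X-Request-Id unless the client already sent one."""
    out = dict(headers)
    out.setdefault("X-Request-Id", str(uuid.uuid4()))
    return out


def transform_response(body: dict) -> dict:
    """Drop top-level fields that should never leave the internal network."""
    return {key: value for key, value in body.items() if key not in INTERNAL_FIELDS}


print(transform_request({"Accept": "application/json"}))
print(transform_response({"id": 7, "name": "Ada", "password_hash": "..."}))
```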

Logging & monitoring at the gateway gives you a single place to observe all traffic. Log the method, path, status code, response time, and client IP for every request. Feed this into your monitoring system (Datadog, Grafana, ELK) for dashboards and alerts.

API versioning through the gateway lets you run multiple versions simultaneously. Route /v1/users to the legacy service and /v2/users to the new one. Gradually migrate clients without big-bang deployments.

Gateway vs Load Balancer

They sound similar, and some tools (like Nginx, Envoy, HAProxy) can do both. But they solve different problems.

A load balancer distributes traffic across multiple instances of the same service. It cares about server health and connection distribution. Its question: “which server instance should handle this request?”

An API gateway manages traffic at the application level across different services. It cares about routing, auth, rate limiting, and transformation. Its question: “which service should handle this request, and does this request have permission?”

| Concern | Load Balancer | API Gateway |
|---------|---------------|-------------|
| Primary job | Distribute traffic across instances | Manage API access |
| Routing | Same service, different instances | Different services by path |
| Auth | Usually not | Yes (JWT, API keys, OAuth) |
| Rate limiting | Rarely | Yes |
| Content inspection | L7 LBs can, L4 cannot | Yes (always) |
| TLS termination | Often | Yes |
| Health checks | Yes | Sometimes |
| Request transformation | No | Yes |

When you need both: in most production systems, you do. The load balancer sits in front of the gateway instances (distributing traffic across multiple gateway replicas for high availability). The gateway sits behind the LB and handles application-level concerns before routing to backend services.

The typical architecture: Client -> LB (L4) -> Gateway replicas (L7) -> Backend services

Self-Check

Can you explain why a single server is a single point of failure?
What’s the difference between L4 and L7 load balancing? When would you choose each?
Which load balancing algorithm would you use for servers with different CPU capacities?
Which algorithm gives you session stickiness without any shared state?
What happens when a health check fails? How does the LB respond?
What’s the difference between active and passive health checks?
What’s the difference between a gateway and a load balancer? Can one tool do both?
Name 3 things an API gateway does that a load balancer typically does not.
Why would you put a load balancer in front of your API gateway instances?
What HTTP status code does a gateway return when a client exceeds the rate limit?
