Design a Webhook System: Reliable Event Delivery at Scale

What Are Webhooks?

Webhooks are automated HTTP callbacks triggered by events. When something happens in a source system — a payment succeeds, a repository is pushed to, a new user signs up — the source sends an HTTP POST request to a pre-registered URL owned by the consumer.

Unlike polling, where the client repeatedly asks “is there new data?”, webhooks push the data the moment it exists. The client does nothing but listen.

Real-world examples:

Stripe sends payment_intent.succeeded to your endpoint after a charge goes through
GitHub sends push events to CI systems like Jenkins or GitHub Actions
Slack sends event_callback for messages, reactions, and channel activity
Twilio sends status_callback when an SMS is delivered or fails
Shopify sends orders/create when a customer places an order

Polling vs Webhooks

| Aspect | Polling | Webhooks |
|---|---|---|
| Direction | Client pulls from server | Server pushes to client |
| Latency | Depends on poll interval (seconds to minutes) | Near real-time (milliseconds) |
| Server load | Wasted requests when no data | Zero wasted requests |
| Complexity | Simple to implement | Requires public endpoint, retry logic |
| Reliability | Client controls retry | Server must handle retry + delivery guarantees |
| Firewall | Client initiates, no inbound needed | Client must expose a public endpoint |

When to Use Webhooks

Webhooks shine when you need real-time event delivery and control over the consumer is limited (third-party integrations). They struggle when clients are behind strict firewalls, cannot maintain a public endpoint, or need ordered exactly-once delivery (which webhooks do not natively guarantee).

Webhook System Requirements

The requirements for a webhook delivery system fall into four areas.

Delivery

Deliver events to registered webhook URLs via HTTP POST
Retry failed deliveries with exponential backoff
Idempotent event delivery via unique event IDs
At-least-once delivery guarantee per event
Fan-out delivery to multiple registered URLs per event
Partial payload delivery (send only changed fields)
Custom retry intervals per client configuration

Security

HMAC-SHA256 signature in header for payload verification
Webhook secret rotation without downtime

Reliability

Rate limiting per client URL (max N requests per second)
Dead letter queue for events exceeding max retries
Circuit breaker to pause delivery to failing endpoints

Observability

Delivery log with timestamps, status codes, and response bodies
Manually replay events from the delivery log

Event Delivery Lifecycle

Every webhook delivery follows the same lifecycle. Understanding this lifecycle is the foundation of any webhook system design.

The Delivery Sequence

Event creation — An event occurs in the source system (e.g., a Stripe charge succeeds)
Persistence — The event is stored in the database with a unique ID and status pending
Enqueue — A delivery job is pushed onto a queue (per client or shared)
Delivery attempt — A worker pulls the job and sends an HTTP POST to the registered URL
Response evaluation — The worker checks the HTTP response code:
- 2xx: mark as delivered, done
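- 4xx (client error): the endpoint is misconfigured or rejecting the request, so do not retry (except 408 Request Timeout and 429 Too Many Requests, which are transient and should be retried with backoff)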
- 5xx (server error): retry with exponential backoff
- Timeout or network error: retry (same as 5xx)
Retry or DLQ — On failure, the event is re-enqueued with a delay. After max retries, it moves to the dead letter queue
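
The response-evaluation step can be sketched as a small pure function, which keeps the retry policy easy to test in isolation. This is a minimal sketch: `Outcome`, `RETRYABLE_4XX`, and `evaluate_response` are hypothetical names, and a timeout or network error is represented as `None`. Treating 408 and 429 as retryable is a common refinement that can be switched off by emptying `RETRYABLE_4XX`.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    DELIVERED = "delivered"
    FAILED = "failed"  # permanent failure, no retry
    RETRY = "retry"


# Assumption: these 4xx codes are transient; empty the set for strict no-retry-on-4xx
RETRYABLE_4XX = {408, 429}


def evaluate_response(status_code: Optional[int]) -> Outcome:
    """Map an HTTP status (None = timeout or network error) to the next action."""
    if status_code is None:
        return Outcome.RETRY
    if 200 <= status_code < 300:
        return Outcome.DELIVERED
    if 400 <= status_code < 500 and status_code not in RETRYABLE_4XX:
        return Outcome.FAILED
    # 5xx, retryable 4xx, and anything unexpected (an assumption) are retried
    return Outcome.RETRY
```

The worker then only needs to act on the returned `Outcome`: mark delivered, mark failed, or schedule a retry.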

Exponential Backoff Schedule

Each retry waits longer than the last. A typical schedule:

{
  "retry_1": "1 minute",
  "retry_2": "5 minutes",
  "retry_3": "15 minutes",
  "retry_4": "1 hour",
  "retry_5": "6 hours",
  "retry_6": "24 hours"
}

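Total time before DLQ: approximately 31.4 hours (1 + 5 + 15 + 60 + 360 + 1,440 = 1,881 minutes). This schedule is hand-tuned and grows faster than simple doubling. A formula-driven alternative computes each delay as base * 2^(attempt - 1), which gives 1, 2, 4, 8, 16, and 32 minutes for a 1-minute base. In both cases, add a small random jitter to prevent the thundering herd problem when many deliveries fail and retry at the same moment.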

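import random

def compute_delay(attempt: int, base_minutes: int = 1) -> float:
    """Return the delay in minutes before retry number `attempt` (1-based)."""
    # Pure doubling: 1, 2, 4, 8, 16, 32 minutes for a 1-minute base
    backoff = base_minutes * (2 ** (attempt - 1))
    # Up to 30% of the base as jitter so retries do not fire in lockstep
    jitter = random.uniform(0, 0.3 * base_minutes)
    return backoff + jitter

# Simulate retry schedule
for attempt in range(1, 7):
    delay_min = compute_delay(attempt)
    print(f"Retry {attempt}: wait {delay_min:.1f} minutes")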
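
The fixed schedule listed earlier (1 minute up to 24 hours) does not follow the doubling formula, so one way to implement it is a lookup table. The sketch below is illustrative only: `RETRY_SCHEDULE_MINUTES`, `scheduled_delay`, and the 10% jitter fraction are assumptions, not part of the original design. It also computes the cumulative retry window from the table.

```python
import random
from typing import Optional

# Delays in minutes for retries 1..6, copied from the schedule in the text
RETRY_SCHEDULE_MINUTES = [1, 5, 15, 60, 360, 1440]


def scheduled_delay(attempt: int, jitter_fraction: float = 0.1) -> Optional[float]:
    """Delay in minutes before retry `attempt` (1-based), or None when retries are exhausted."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt > len(RETRY_SCHEDULE_MINUTES):
        return None  # caller should move the event to the dead letter queue
    base = RETRY_SCHEDULE_MINUTES[attempt - 1]
    # Jitter scales with the delay so long waits spread out more (assumed policy)
    return base + random.uniform(0, jitter_fraction * base)


def total_window_hours() -> float:
    """Cumulative time covered by the schedule, ignoring jitter."""
    return sum(RETRY_SCHEDULE_MINUTES) / 60
```

Without jitter the table sums to 1,881 minutes, which is the figure any idempotency cache TTL needs to exceed.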

Webhook Delivery Flow

Event Created -> Queue -> Delivery Worker -> HTTP POST -> 2xx? -> Retry/Backoff -> DLQ

Retry schedule: #1 1m, #2 5m, #3 15m, #4 1h, #5 6h, #6 24h (max 6 retries, ~31.4 hours cumulative)

AT-LEAST-ONCE DELIVERY

Each event is retried until the endpoint returns 2xx or max retries are exhausted. Events never silently drop.

DEAD LETTER QUEUE

After 6 failed retries (~31.4 hours), the event is moved to a DLQ for manual inspection and replay.

Idempotency via Event IDs

Idempotency means performing an operation multiple times produces the same result as doing it once. In webhooks, this is critical because the at-least-once delivery model means the same event might be delivered twice.

Every event carries a unique idempotency key in the payload:

{
  "id": "evt_3QpLmN7XyZ",
  "idempotency_key": "evt_3QpLmN7XyZ_1716000000",
  "type": "payment_intent.succeeded",
  "data": { ... }
}

The receiver stores seen keys in a cache or database. When a duplicate arrives, the receiver detects the known key and returns a 200 OK without processing the event again.

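seen_keys = redis_cache()  # hypothetical cache client with exists/get/set

# The TTL must exceed the total retry window (48 hours here)
IDEMPOTENCY_TTL_SECONDS = 48 * 60 * 60

def handle_webhook(payload: dict):
    key = payload["idempotency_key"]
    if seen_keys.exists(key):
        # Duplicate delivery: return the stored result without reprocessing
        return {"status": "duplicate", "existing_response": seen_keys.get(key)}

    result = process_event(payload)
    seen_keys.set(key, result, ttl=IDEMPOTENCY_TTL_SECONDS)
    return result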

Common gotcha: the TTL on the idempotency cache must exceed the total retry window. If retries span roughly 31 hours, the cache TTL should be at least 48 hours.
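
A second subtlety in a check-then-set handler is that two concurrent deliveries of the same event can both pass the existence check. The sketch below shows one way to make the check and the write a single atomic "claim". `IdempotencyStore` and `claim` are hypothetical names, and the in-memory dictionary stands in for a shared cache. A real deployment would more likely use an atomic set-if-absent operation in its cache or a unique constraint in its database.

```python
import threading
import time
from typing import Callable, Dict


class IdempotencyStore:
    """In-memory set of seen keys with expiry; a stand-in for a shared cache."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}
        self._lock = threading.Lock()

    def claim(self, key: str) -> bool:
        """Return True only for the first caller to present `key` within the TTL."""
        now = self._clock()
        with self._lock:  # the check and the write happen atomically
            expiry = self._expires_at.get(key)
            if expiry is not None and expiry > now:
                return False  # duplicate delivery
            self._expires_at[key] = now + self._ttl
            return True


# 48 hours, comfortably longer than the retry window
store = IdempotencyStore(ttl_seconds=48 * 3600)


def handle_webhook(payload: dict) -> str:
    if not store.claim(payload["idempotency_key"]):
        return "duplicate"
    # Real event processing would run here
    return "processed"
```

Claiming before processing has a trade-off: if processing fails after the claim, the key should be released (or the failure recorded) so that a retry is not mistaken for a duplicate.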

Security: Webhook Payload Signing

Without signing, a malicious actor could forge webhook requests and trigger unintended actions in your system. Most major webhook providers sign requests so the receiver can verify authenticity.

HMAC-SHA256 Signing

The sender computes a signature over the raw request body using a shared secret:

signature = HMAC-SHA256(webhook_secret, request_body)

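The signature is sent in a header. Stripe's format includes a timestamp, and the signed string is the timestamp and the raw body joined by a period:

Stripe-Signature: t=1716000000,v1=5257a869e7ecebeda32affa62cd4fa7c27f6c7d5d5f6e5c5c5f6a7b8c9d0e1f2

Because the timestamp t is part of the signed string, an attacker cannot alter it, which lets the receiver reject stale requests and limits replay attacks. The receiver verifies: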

Extract t and v1 from the header
Recompute HMAC-SHA256(secret, f"{t}.{body}")
Compare the computed signature to v1 using constant-time comparison
Check that t is within an acceptable time window (e.g., 5 minutes)

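import hashlib
import hmac
import time

def verify_signature(
    payload: bytes,
    header: str,
    secret: str,
    tolerance_seconds: int = 300,
) -> bool:
    # Parse "t=...,v1=..." into a dict; a malformed header is rejected
    try:
        params = dict(part.strip().split("=", 1) for part in header.split(","))
        timestamp = int(params["t"])
        received = params["v1"]
    except (KeyError, ValueError):
        return False

    # Reject requests outside the tolerance window (replay protection)
    if abs(time.time() - timestamp) > tolerance_seconds:
        return False

    # Sign the raw bytes; decoding and re-encoding the body could alter them
    signed_payload = f"{timestamp}.".encode() + payload
    expected = hmac.new(
        secret.encode(), signed_payload, hashlib.sha256
    ).hexdigest()

    # Constant-time comparison avoids leaking information through timing
    return hmac.compare_digest(expected.encode(), received.encode())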
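
The sender side mirrors the verification steps. A minimal sketch, assuming the same `t=...,v1=...` header layout: `sign_payload` is a hypothetical helper name and `whsec_example` is a placeholder secret, not a real credential.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign_payload(payload: bytes, secret: str, timestamp: Optional[int] = None) -> str:
    """Build a header value of the form 't=<unix seconds>,v1=<hex digest>'."""
    ts = int(time.time()) if timestamp is None else timestamp
    # The signed string is "<timestamp>.<raw body>", matching the receiver
    signed_payload = f"{ts}.".encode() + payload
    digest = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    return f"t={ts},v1={digest}"


# Fixed timestamp so the example is reproducible
header = sign_payload(b'{"id": "evt_3QpLmN7XyZ"}', "whsec_example", timestamp=1716000000)
```

The sender must sign the exact bytes it transmits. If either side re-serializes the JSON before signing or verifying, key order or whitespace can change and the signatures will no longer match.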

Webhook Payload Signing

The sender signs the payload with a shared secret. The receiver recomputes the signature and compares. If they match, the payload is authentic. If not, the request is rejected.

Example payload body:

{
  "id": "evt_3QpLmN7XyZ",
  "type": "payment_intent.succeeded",
  "data": {
    "object": {
      "id": "pi_3QpLmN7XyZ",
      "amount": 2999,
      "currency": "usd",
      "status": "succeeded"
    }
  }
}

SENDER (Webhook Provider)

Signs payload with shared secret
Adds the signature header
Posts to client endpoint

RECEIVER (Webhook Client)

Extracts signature from header
Recomputes HMAC from payload + secret
Compares: match = authentic, reject if not

Rate Limiting Per Client

A misbehaving client should not degrade service for others. Each client gets a rate limit defined at registration time (e.g., 100 requests per second). Workers use a token bucket or sliding window counter to throttle delivery.

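# TokenBucket is a hypothetical helper (rate = tokens per second,
# capacity = burst size) whose consume() returns True when a token is available.
from token_bucket import TokenBucket

client_limits = {
    "client_a": TokenBucket(rate=100, capacity=100),
    "client_b": TokenBucket(rate=10, capacity=10),
}

# queue, http_post, and url_to_client are assumed to be defined elsewhere
async def deliver(webhook_url: str, payload: dict):
    bucket = client_limits.get(url_to_client(webhook_url))
    # Clients without a configured limit are not throttled in this sketch
    if bucket is not None and not bucket.consume():
        # Over the limit: put the event back and try again in one second
        await queue.reenqueue(payload, delay=1.0)
        return
    await http_post(webhook_url, payload)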
If a client consistently exceeds their limit, excess events queue up. The admin can adjust the limit or pause delivery entirely.
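
The token bucket itself is not spelled out in the text. A minimal sketch of one possible implementation follows, using the same `rate`, `capacity`, and `consume()` shape as the delivery code. The injectable `clock` parameter is an assumption added so the behavior can be tested deterministically.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def consume(self, tokens: float = 1.0) -> bool:
        """Take `tokens` if available; return False without blocking otherwise."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, capped at capacity
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

A bucket with `rate=100, capacity=100` lets a client absorb a burst of 100 deliveries at once and then sustains 100 per second. A smaller capacity smooths bursts at the cost of more re-enqueues.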

Delivery Service Architecture

The delivery service sits between event producers and client endpoints. It handles persistence, queuing, rate-limited delivery, retries, and observability.

Queue-Per-Client Design

A shared queue works for low volumes but creates a head-of-line blocking problem: when one client’s endpoint is slow, deliveries for all other clients wait behind it.

The fix: give each client its own queue (a separate Redis list, SQS queue, or RabbitMQ queue). Workers poll from all queues in a round-robin or priority-based fashion.

import json

class WebhookQueueManager:
    def __init__(self, redis_client):
        self.redis = redis_client

    def enqueue(self, client_id: str, event: dict):
        # LPUSH at the head plus RPOP from the tail gives FIFO order per client
        queue_key = f"webhook_queue:{client_id}"
        self.redis.lpush(queue_key, json.dumps(event))

    def dequeue_multi(self, client_ids: list[str], batch_size: int = 10):
        # Take up to batch_size events from each client's queue in turn
        events = []
        for cid in client_ids:
            for _ in range(batch_size):
                event = self.redis.rpop(f"webhook_queue:{cid}")
                if not event:
                    break
                events.append((cid, json.loads(event)))
        return events
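
The batch-per-client loop above takes up to `batch_size` events from one client before moving to the next. A stricter round-robin takes one event per client per pass, so a client with a large backlog cannot crowd a batch. The sketch below illustrates that idea with in-memory deques standing in for the per-client queues. `round_robin_dequeue` is a hypothetical name.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def round_robin_dequeue(queues: Dict[str, Deque[dict]],
                        max_events: int) -> List[Tuple[str, dict]]:
    """Take at most one event per client per pass until max_events is reached."""
    picked: List[Tuple[str, dict]] = []
    while len(picked) < max_events:
        progressed = False
        for client_id, q in queues.items():
            if q and len(picked) < max_events:
                picked.append((client_id, q.popleft()))
                progressed = True
        if not progressed:
            break  # every queue is empty
    return picked


queues = {
    "client_a": deque({"id": f"a{i}"} for i in range(5)),  # large backlog
    "client_b": deque([{"id": "b0"}]),
}
batch = round_robin_dequeue(queues, max_events=4)
```

Here `client_b`'s single event is picked in the first pass even though `client_a` has five events waiting.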

Delivery Worker Pool

A fixed pool of workers (e.g., 50 threads or async tasks) continuously polls queues, sends HTTP POSTs, and evaluates responses. Workers mark events as delivered, re-enqueue them for retry with a delay, or move them to the DLQ once retries are exhausted.

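import asyncio
import httpx

MAX_RETRIES = 6  # queue, client_registry, and db are assumed to exist elsewhere

async def worker_loop():
    # Reuse one HTTP client so connections are pooled across deliveries
    async with httpx.AsyncClient(timeout=10) as client:
        while True:
            events = queue.dequeue_multi(client_registry.all_ids())
            if not events:
                await asyncio.sleep(0.5)  # avoid a busy loop when idle
                continue
            for client_id, event in events:
                url = client_registry.get_url(client_id)
                try:
                    resp = await client.post(url, json=event)
                except httpx.TransportError:  # timeouts and network errors
                    retry_or_dlq(event)
                    continue
                if 200 <= resp.status_code < 300:
                    db.mark_delivered(event["id"])
                elif 400 <= resp.status_code < 500:
                    db.mark_failed(event["id"])  # no retry
                else:
                    retry_or_dlq(event)

def retry_or_dlq(event: dict):
    event["attempt"] = event.get("attempt", 0) + 1
    if event["attempt"] > MAX_RETRIES:
        move_to_dlq(event)
    else:
        queue.reenqueue_with_delay(event, compute_delay(event["attempt"]))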

Webhook System Architecture

Event Flow

Event Producer (Payment Service, User Service, etc.)
        |
        v
Event Service (validates, signs, enqueues)
        |
        v
Message Queue (Redis, RabbitMQ, or SQS; per client or shared)
        |
        v
Delivery Workers (N workers pull events, POST to client URLs)
        |
        v
Client Endpoint (receives POST with signed payload)

Possible outcomes: 200 OK, 4xx/5xx, Timeout
Dead Letter Queue: failed after max retries
Delivery Log: full audit trail

PER-CLIENT QUEUE

Each client gets its own queue. A slow or failing client does not block deliveries to others.

RATE LIMITING

Workers throttle POSTs to each client based on their configured rate limit. Excess events are queued.

SIGNING + IDEMPOTENCY

Payloads are signed with HMAC-SHA256. Each event carries a unique idempotency key for safe retries.

Dead Letter Queues

After the maximum number of retries (typically 6), the event moves to a dead letter queue. A DLQ is a separate queue or database table that stores undeliverable events for manual inspection.

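import json
import time

# redis_client is assumed to be a connected Redis client;
# notify_admin is assumed to be defined elsewhere
def move_to_dlq(event: dict):
    dlq_key = f"webhook_dlq:{event['id']}"
    redis_client.set(dlq_key, json.dumps({
        **event,
        "dlq_reason": "max_retries_exceeded",
        "dlq_timestamp": time.time(),
        "attempts": event.get("attempt", 0),
    }))
    notify_admin(event)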

Events in the DLQ can be:

Replayed — a new delivery attempt is triggered manually via the admin panel
Inspected — logs show every attempt with request/response bodies
Expired — automatically deleted after a retention period (e.g., 14 days)
Forwarded — sent to an alternative fallback endpoint

Replaying Events

The admin panel shows every failed event with full delivery history. An operator can click “Replay” to re-enqueue the event with fresh delivery attempts. This is essential for handling transient infrastructure issues on the client side.
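
A replay can be sketched as moving the stored event back onto the delivery queue with its attempt counter reset. The example below is a minimal in-memory illustration: `DeadLetterQueue`, `replay`, and the `replayed` flag are hypothetical names, and a plain list stands in for the delivery queue.

```python
from typing import Dict, List


class DeadLetterQueue:
    """In-memory stand-in for a DLQ table or queue."""

    def __init__(self) -> None:
        self._events: Dict[str, dict] = {}

    def add(self, event: dict, reason: str) -> None:
        self._events[event["id"]] = {**event, "dlq_reason": reason}

    def __contains__(self, event_id: str) -> bool:
        return event_id in self._events

    def replay(self, event_id: str, delivery_queue: List[dict]) -> dict:
        """Move an event back to the delivery queue with a fresh attempt counter."""
        entry = self._events.pop(event_id)  # raises KeyError if not in the DLQ
        fresh = {k: v for k, v in entry.items() if k != "dlq_reason"}
        fresh["attempt"] = 0       # the retry schedule starts over
        fresh["replayed"] = True   # lets the delivery log distinguish replays
        delivery_queue.append(fresh)
        return fresh
```

The event ID and idempotency key are deliberately left unchanged, so a receiver that already processed the event before it was dead-lettered still recognizes the replay as a duplicate.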

Delivery Guarantees: At-Least-Once vs Exactly-Once

Webhooks fundamentally provide at-least-once delivery. The event will eventually reach the endpoint or land in the DLQ, but duplicates are possible.

At-least-once means:

Every event is delivered at least one time
Duplicates can happen (network retry, timing edge cases)
The receiver must handle duplicates via idempotency

Exactly-once is not achievable with HTTP-based webhooks over an unreliable network. The closest you can get is:

Idempotent receivers (deduplicate via event IDs)
Strong ordering guarantees within a single queue
Exactly-once semantics within your internal system, surfaced as idempotent webhooks

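# Receiver-side deduplication
# db is a hypothetical database handle; the event ID column should be unique
def handle_webhook(request):
    event_id = request.json["id"]
    with db.transaction():
        if db.events.exists(event_id):
            return {"status": "already_processed"}, 200
        # Record and process in one transaction so a failure rolls back both
        db.events.insert(event_id, request.json)
        process_event(request.json)
    return {"status": "ok"}, 200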
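
For a version that runs as-is, the same idea can be expressed with SQLite from the Python standard library, letting a primary key constraint reject duplicates instead of a separate existence check. This is a sketch under assumptions: the `processed_events` table, `handle_event` function, and `processed` list are illustrative names, and the list stands in for real side effects.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_events (event_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
)

processed = []  # stands in for real side effects


def handle_event(event: dict) -> str:
    """Process an event at most once; the primary key rejects duplicates."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO processed_events (event_id, body) VALUES (?, ?)",
                (event["id"], json.dumps(event)),
            )
            processed.append(event["id"])  # business logic would run here
    except sqlite3.IntegrityError:
        return "already_processed"
    return "ok"
```

Because the insert and the business logic share one transaction, a crash during processing rolls back the insert and the provider's next retry is handled as a first delivery.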

Delivery Analytics and Observability

A production webhook system needs comprehensive observability:

Delivery log — every attempt with timestamp, status code, response body, and latency
Metrics — delivery success rate, average attempts per event, DLQ rate, queue depth per client
Alerts — notify the team when a client’s success rate drops below 90% or DLQ grows rapidly
Dashboard — real-time view of delivery health per client

-- Sample query: delivery success rate per client
SELECT
    client_url,
    COUNT(*) AS total_events,
    SUM(CASE WHEN status = 'delivered' THEN 1 ELSE 0 END) AS delivered,
    ROUND(SUM(CASE WHEN status = 'delivered' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS success_rate
FROM webhook_deliveries
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY client_url
ORDER BY success_rate ASC;
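
The same success-rate calculation and the 90% alert threshold mentioned above can be sketched in application code, for example inside a periodic health check. The function names and the `(client_url, status)` tuple shape are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def success_rates(deliveries: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Percentage of 'delivered' rows per client URL, rounded to one decimal."""
    totals: Dict[str, int] = defaultdict(int)
    delivered: Dict[str, int] = defaultdict(int)
    for client_url, status in deliveries:
        totals[client_url] += 1
        if status == "delivered":
            delivered[client_url] += 1
    return {url: round(100.0 * delivered[url] / totals[url], 1) for url in totals}


def clients_below(rates: Dict[str, float], threshold: float = 90.0) -> List[str]:
    """Client URLs whose success rate is under the alert threshold."""
    return sorted(url for url, rate in rates.items() if rate < threshold)
```

A scheduler could feed the last 24 hours of delivery rows into `success_rates` and page the team for every URL returned by `clients_below`.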

Putting It All Together: Stripe-Style Architecture

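A Stripe-style architecture brings together the patterns we have covered. Stripe's internal implementation is not public, so treat the following as an illustrative end-to-end design rather than a description of their system: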

Event Producer (Payment Service)
        |
        v
Event Service (validates, assigns ID, signs payload)
        |
        v
Event Store (PostgreSQL)
        |
        v
Per-Client Queue (Redis streams or RabbitMQ)
        |
        v
Delivery Workers (N processes, rate-limited per client)
        |
        v
Client Endpoint (HTTP POST with Stripe-Signature header)
        |
        +--> 200 OK: mark delivered, log success
        |
        +--> 5xx/Timeout: re-enqueue with backoff delay
        |
        +--> Max retries: move to Dead Letter Queue
                           |
                           v
                    Admin Panel (view, replay, inspect)

What You Need to Implement

Building a webhook system from scratch requires these components:

Event schema with unique ID, type, data, idempotency key, and timestamp
Database for event persistence and delivery tracking
Queue per client (Redis streams, SQS, or RabbitMQ)
Worker pool for HTTP delivery with rate limiting
Retry engine with exponential backoff and jitter
Signing layer using HMAC-SHA256
Dead letter queue for failed events
Admin panel for monitoring and replaying
Observability pipeline with metrics, logs, and alerts
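
As a starting point for the first component, the event schema can be sketched as a small immutable dataclass. The field names follow the example payload earlier in the article. The class name `WebhookEvent`, the `created` field name, and the UUID-based default ID are assumptions, and the idempotency key simply joins the ID and timestamp as in the earlier example.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class WebhookEvent:
    type: str
    data: Dict[str, Any]
    id: str = field(default_factory=lambda: f"evt_{uuid.uuid4().hex}")
    created: int = field(default_factory=lambda: int(time.time()))

    @property
    def idempotency_key(self) -> str:
        # Stable across retries because id and created never change
        return f"{self.id}_{self.created}"

    def to_json(self) -> bytes:
        """Serialize once; these exact bytes are what gets signed and sent."""
        body = asdict(self)
        body["idempotency_key"] = self.idempotency_key
        return json.dumps(body, separators=(",", ":"), sort_keys=True).encode()
```

Serializing to bytes in one place keeps signing and delivery consistent, since both operate on the same byte string.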

Self-Check Questions

Before deploying your webhook system, ask yourself:

Can clients verify the authenticity of every webhook request?
What happens if the client’s endpoint returns 500 for 32 hours straight?
How does a client recover from data loss if they miss a webhook?
Can a slow client block delivery to other clients?
How do you handle a compromised webhook secret?
Can an operator view, filter, and replay any failed delivery?
What is the TTL on your idempotency cache, and does it exceed the retry window?

Building a reliable webhook system is not complicated at the component level, but the edge cases around retries, duplicates, and misbehaving clients make the difference between a system that is trusted and one that is constantly on fire.