We send about 10 million webhooks per day now. Started at 100K/day a year ago. Thought it would scale fine. It didn’t. Here’s everything that broke and how we fixed it.

The naive implementation

import requests

def send_webhook(event_type, payload, url):
    try:
        requests.post(
            url,
            json={"event": event_type, "data": payload},
            timeout=5
        )
    except Exception:
        # ¯\_(ツ)_/¯
        pass

Call this function whenever something happens. Fire and forget. What could go wrong?

Problem 1: Blocking the main thread

Webhook delivery was part of the request cycle. User creates an order, we save it, then send webhooks to their configured endpoints. If their endpoint is down or slow, the user’s request times out.

A customer reported checkout taking 30 seconds. Their webhook endpoints were timing out, and we waited the full 5 seconds on each before giving up.

Fix: Move to background jobs.

import logging

import requests
from celery import shared_task

logger = logging.getLogger(__name__)

@shared_task
def send_webhook_async(event_type, payload, url):
    try:
        response = requests.post(
            url,
            json={"event": event_type, "data": payload},
            timeout=5
        )
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        # Still swallowing errors but at least not blocking users
        logger.error(f"Webhook failed: {e}")

Now webhook delivery happens in Celery workers, not during the request. User checkouts are fast again.
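
The call site barely changes: instead of sending inline, the view just queues the task. A sketch (the Order model and webhook_url field are illustrative, not our actual schema):

from django.http import JsonResponse

def create_order(request):
    order = Order.objects.create(user=request.user)  # illustrative model

    # .delay() hands delivery to a Celery worker and returns immediately
    send_webhook_async.delay(
        "order.created",
        {"order_id": order.id},
        request.user.webhook_url,  # illustrative field
    )
    return JsonResponse({"id": order.id}, status=201)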

Problem 2: No retries

Webhooks failed silently. Customer endpoints go down for maintenance? They miss events permanently.

Added retries with exponential backoff:

@shared_task(bind=True, max_retries=5)
def send_webhook_async(self, event_type, payload, url):
    # Celery tracks the retry count for us; a manual counter wouldn't
    # increment, because retry() re-queues the task with the same arguments.
    attempt = self.request.retries
    try:
        response = requests.post(
            url,
            json={"event": event_type, "data": payload},
            timeout=5,
            headers={"X-Webhook-Attempt": str(attempt + 1)}
        )
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        # Retry with exponential backoff: 1min, 5min, 25min, ~2h, ~10h
        self.retry(exc=e, countdown=60 * (5 ** attempt))

Better, but created a new problem.

Problem 3: Retry storms

A popular customer’s endpoint went down. We had 50K webhooks queued for them. All 50K retried at the same time after 1 minute. Their endpoint came back up, got hit with 50K requests in 10 seconds, went down again.

Retry storm. Classic distributed systems problem.

Solution: Add jitter and rate limiting per endpoint.

import hashlib
import random

# "redis" below is a shared Redis client (e.g. redis.Redis()) configured elsewhere

@shared_task(bind=True, max_retries=5)
def send_webhook_async(self, event_type, payload, url, webhook_id):
    attempt = self.request.retries

    # Rate limit: max 100 req/min per endpoint. Use a stable digest for the
    # key; Python's built-in hash() is randomized per process.
    url_hash = hashlib.sha256(url.encode()).hexdigest()
    rate_limit_key = f"webhook:ratelimit:{url_hash}"
    current_count = redis.incr(rate_limit_key)
    if current_count == 1:
        redis.expire(rate_limit_key, 60)

    if current_count > 100:
        # Back off if we're sending too fast (retry() raises, ending this run)
        self.retry(countdown=random.randint(60, 120))

    try:
        response = requests.post(url, json={
            "event": event_type,
            "data": payload,
            "webhook_id": webhook_id,
            "attempt": attempt + 1
        }, timeout=10)
        response.raise_for_status()

        # Log success
        redis.hincrby("webhook:stats", "success", 1)

    except requests.exceptions.RequestException as e:
        # Add jitter to retry timing
        base_delay = 60 * (5 ** attempt)
        jitter = random.randint(0, base_delay // 2)
        self.retry(exc=e, countdown=base_delay + jitter)

Now retries are spread out over time and we don’t hammer any single endpoint.
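
For a sense of scale, here's the window each retry lands in with this backoff plus jitter (a quick illustration of the schedule, not production code):

# base delay is 60 * 5**attempt; jitter adds up to half of that again
for attempt in range(5):
    base = 60 * (5 ** attempt)
    print(f"retry {attempt + 1}: between {base}s and {base + base // 2}s")
# retry 1: between 60s and 90s
# retry 2: between 300s and 450s
# retry 3: between 1500s and 2250s
# retry 4: between 7500s and 11250s
# retry 5: between 37500s and 56250s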

[System architecture diagram]

Problem 4: No delivery guarantees

Webhooks still got lost. Celery workers crashed, Redis went down, tasks vanished.

Fix: persist first, deliver second. Every webhook gets a database record before it's queued, so a lost task can always be replayed:

from django.utils import timezone

# Store webhook in database first
webhook = Webhook.objects.create(
    event_type=event_type,
    payload=payload,
    url=url,
    status='pending'
)

# Then queue for delivery
send_webhook_async.delay(webhook.id)

@shared_task(bind=True, max_retries=5)
def send_webhook_async(self, webhook_id):
    webhook = Webhook.objects.get(id=webhook_id)
    attempt = self.request.retries

    if webhook.status == 'delivered':
        return  # Already delivered, skip

    try:
        response = requests.post(webhook.url, json={
            "event": webhook.event_type,
            "data": webhook.payload,
            "id": str(webhook.id),
            "attempt": attempt + 1
        }, timeout=10)

        # Treat any 2xx as delivered, not just 200
        if 200 <= response.status_code < 300:
            webhook.status = 'delivered'
            webhook.delivered_at = timezone.now()
            webhook.save()
        else:
            raise Exception(f"HTTP {response.status_code}")

    except Exception as e:
        webhook.attempts += 1
        webhook.last_error = str(e)
        webhook.save()

        if attempt >= 4:  # Final retry, give up
            webhook.status = 'failed'
            webhook.save()
        else:
            self.retry(exc=e, countdown=60 * (5 ** attempt) + random.randint(0, 300))

Now we have a source of truth. If a webhook wasn’t delivered, we know about it.
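
That source of truth also makes recovery a plain query. A sketch, assuming the Webhook model above has a created_at timestamp: anything still pending after an hour is stuck or lost, so re-enqueue it.

from datetime import timedelta

from django.utils import timezone

# Sweep for webhooks that never reached 'delivered' or 'failed'
stale = Webhook.objects.filter(
    status='pending',
    created_at__lt=timezone.now() - timedelta(hours=1),
)
for webhook in stale:
    send_webhook_async.delay(webhook.id)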

Problem 5: Database bottleneck

10M webhooks/day = ~115 per second. Each webhook is 2 DB writes (create + update on success/fail). That’s 230 writes/second to one table.

Database started struggling. Connection pool maxed out. Queries queued.

Optimization 1: Batch updates

from datetime import datetime, timedelta

# Instead of updating each webhook individually,
# collect successful webhook IDs in a per-minute Redis set
minute = datetime.now().strftime("%Y-%m-%d:%H:%M")
redis.sadd(f"webhook:batch:success:{minute}", webhook_id)

# Separate job runs every minute
@periodic_task(run_every=timedelta(minutes=1))
def batch_update_webhooks():
    # Flush the previous minute's bucket so we don't race with in-flight adds
    minute = (datetime.now() - timedelta(minutes=1)).strftime("%Y-%m-%d:%H:%M")
    key = f"webhook:batch:success:{minute}"
    success_ids = [int(i) for i in redis.smembers(key)]

    if success_ids:
        Webhook.objects.filter(id__in=success_ids).update(
            status='delivered',
            delivered_at=timezone.now()
        )
        redis.delete(key)

Reduced DB writes by 80%. Single batch update instead of individual updates.

Optimization 2: Partition the table

-- PostgreSQL partitioning by month
CREATE TABLE webhooks (
    id BIGSERIAL,
    created_at TIMESTAMP,
    -- other fields
) PARTITION BY RANGE (created_at);

CREATE TABLE webhooks_2026_01 PARTITION OF webhooks
FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

CREATE TABLE webhooks_2026_02 PARTITION OF webhooks
FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');

Queries only scan relevant partitions. Old partitions can be archived without affecting current data.
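
Archiving an old month is then a single statement. A sketch using Django's raw cursor (the same ALTER TABLE works directly in psql); DETACH leaves the rows as a standalone table that can be dumped to cold storage and dropped later:

from django.db import connection

# Detach last January's partition; new writes never touch it again
with connection.cursor() as cursor:
    cursor.execute("ALTER TABLE webhooks DETACH PARTITION webhooks_2026_01")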

Problem 6: Webhook signature verification

Customers asked “how do we know webhooks are really from you?”

Added HMAC signatures:

import hashlib
import hmac
import json
import time

import requests

def send_webhook_async(webhook_id):
    webhook = Webhook.objects.get(id=webhook_id)
    customer = webhook.customer

    payload = json.dumps({
        "event": webhook.event_type,
        "data": webhook.payload,
        "id": str(webhook.id),
        "timestamp": int(time.time())
    })

    # Sign the exact bytes we send with the customer's secret key
    signature = hmac.new(
        customer.webhook_secret.encode(),
        payload.encode(),
        hashlib.sha256
    ).hexdigest()

    response = requests.post(
        webhook.url,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "X-Webhook-Signature": f"sha256={signature}",
            "X-Webhook-ID": str(webhook.id)
        },
        timeout=10
    )

Customers verify signatures on their end:

# Customer's webhook receiver
import hashlib
import hmac

from django.http import HttpResponse

def verify_webhook(request):
    payload = request.body  # raw bytes, exactly as signed
    signature = request.headers.get("X-Webhook-Signature", "")

    expected = hmac.new(
        SECRET_KEY.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()

    # Constant-time comparison to avoid leaking the signature byte by byte
    if not hmac.compare_digest(f"sha256={expected}", signature):
        return HttpResponse(status=401)

    # Process webhook
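
The signed payload also includes a timestamp, which receivers can use to reject stale or replayed deliveries. A minimal sketch, assuming a five-minute tolerance (the window is the receiver's call):

import json
import time

def is_fresh(request, max_age_seconds=300):
    # Reject deliveries whose signed timestamp is too old or too far ahead
    body = json.loads(request.body)
    return abs(time.time() - body["timestamp"]) <= max_age_seconds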

Problem 7: Observability

When webhooks failed, we had no visibility. Is our system broken or is the customer’s endpoint down?

Added comprehensive logging:

# Log attempt
logger.info("webhook.attempt", extra={
    "webhook_id": webhook.id,
    "customer_id": webhook.customer_id,
    "event_type": webhook.event_type,
    "attempt": attempt,
    "url": webhook.url
})

# Log result
logger.info("webhook.delivered", extra={
    "webhook_id": webhook.id,
    "duration_ms": (time.time() - start) * 1000,
    "status_code": response.status_code
})

# Or on failure
logger.warning("webhook.failed", extra={
    "webhook_id": webhook.id,
    "error": str(e),
    "will_retry": attempt < 4
})

Structured logging feeds into Datadog. We built dashboards showing:

  • Delivery rate by customer
  • Failed webhooks by error type
  • Average delivery time
  • Retry success rate

Now when a customer reports missing webhooks, we can see exactly what happened.

The current architecture

User action
    ↓
Create webhook record (DB)
    ↓
Queue webhook (Redis/Celery)
    ↓
Worker picks up task
    ↓
Rate limit check (Redis)
    ↓
Send HTTP request
    ↓
Success? → Batch update DB
    ↓
Failure? → Retry with backoff
    ↓
Max retries exceeded? → Mark as failed, alert customer

Handles 10M/day without breaking a sweat. Could probably scale to 100M/day with more workers.

What actually matters

  • Don’t block user requests - webhooks must be async
  • Retries are mandatory - networks are unreliable
  • Add jitter to retries - prevent thundering herds
  • Persist everything - queues can lose data
  • Rate limit per endpoint - protect customer servers
  • Sign your webhooks - customers need to verify authenticity
  • Batch database writes - individual updates don’t scale
  • Log everything - you’ll need it for debugging

We went from constant webhook issues to a system that just works. The key was treating webhooks as a first-class feature with proper architecture, not an afterthought.

Also learned that most webhook receivers are terribly implemented. We’ve seen endpoints that return 200 but don’t actually process the webhook, endpoints that hang for 60 seconds, endpoints that crash on duplicate deliveries. Plan for that.
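
If you're on the receiving side, the cheapest defense against duplicate deliveries is an idempotent handler. A sketch that dedupes on the webhook ID, assuming a shared Redis client and a hypothetical process_event handler; any store with a set-if-absent operation works:

from django.http import HttpResponse

def handle_webhook(request):
    webhook_id = request.headers.get("X-Webhook-ID")

    # set(nx=True) only succeeds for the first delivery of this ID;
    # duplicates fall through and get acknowledged without reprocessing
    first_time = redis.set(f"webhook:seen:{webhook_id}", 1, nx=True, ex=86400)
    if not first_time:
        return HttpResponse(status=200)

    process_event(request)  # hypothetical: your real handler
    return HttpResponse(status=200)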