Rate limiting seems simple until it isn’t. We thought we had it figured out - Redis counters, sliding windows, the works. Then a client with a distributed system hit our API and everything fell apart.

What we built initially

Standard stuff. A fixed-window counter in Redis:

def check_rate_limit(api_key: str) -> bool:
    key = f"rate_limit:{api_key}"
    count = redis.incr(key)    # atomic increment: no read-then-write race
    if count == 1:
        redis.expire(key, 60)  # start the 1-minute window on the first request only
    return count <= 100        # 100 req/min

Worked great in testing. 100 requests per minute per API key, clean reset every minute.

Shipped it. Felt smart.

The problem

Week 2 of production, support ticket: “We’re getting 429s but we’re only making 60 requests per minute.”

Weird. Checked logs. They were indeed making ~60 req/min. But our counter showed 180.

Turns out they were running the same API key across 3 different servers. Each server was making 60 req/min. Our rate limiter saw 180 req/min from the same key and blocked them.

“Working as intended,” I thought. But that’s not how customers think. They think in terms of total requests, not requests per origin server.

Attempt 1: Fix the counter

Maybe we just needed better counting:

def check_rate_limit(api_key: str, ip: str) -> bool:
    key = f"rate_limit:{api_key}:{ip}"
    # ... same logic

Now each IP gets its own bucket. Problem solved, right?

Nope. Some clients use load balancers with IP rotation. One request from 1.2.3.4, next from 1.2.3.5. They blew through our limits because each IP got a fresh bucket.

Plus we couldn’t actually enforce account-level limits anymore. A malicious user could just spin up 100 IPs and get 100x the quota.

What actually works

We needed to track both:

  • Per-API-key total (account-level quota)
  • Per-IP rate (prevent single-IP abuse)

But here’s the key insight: use different limits for different purposes.

from datetime import datetime, timedelta
from typing import Tuple

def check_rate_limits(api_key: str, ip: str) -> Tuple[bool, str]:
    now = datetime.now()
    minute_key = now.strftime("%Y-%m-%d:%H:%M")

    # Account-level: 1000 req/hour (sliding window)
    account_key = f"rl:account:{api_key}:{minute_key}"
    # NOTE: this fans out to 60 sequential GETs on every request (the flaw called out below)
    account_count = sum(
        int(redis.get(f"rl:account:{api_key}:{(now - timedelta(minutes=i)).strftime('%Y-%m-%d:%H:%M')}") or 0)
        for i in range(60)
    )

    if account_count >= 1000:
        return False, "Account quota exceeded"

    # IP-level: 200 req/min (fixed window)
    ip_key = f"rl:ip:{ip}:{minute_key}"
    ip_count = int(redis.get(ip_key) or 0)

    if ip_count >= 200:
        return False, "IP rate limit exceeded"

    # Increment both
    pipe = redis.pipeline()
    pipe.incr(account_key)
    pipe.expire(account_key, 3600)  # 1 hour
    pipe.incr(ip_key)
    pipe.expire(ip_key, 60)  # 1 minute
    pipe.execute()

    return True, "OK"

This is closer but still has issues. That sum() call issues 60 sequential Redis GETs on every single request. Doesn't scale.
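
A single MGET would at least collapse those 60 round trips into one. A stopgap sketch (reusing now, api_key, and the minute-bucket key scheme from the function above):

keys = [
    f"rl:account:{api_key}:{(now - timedelta(minutes=i)).strftime('%Y-%m-%d:%H:%M')}"
    for i in range(60)
]
# One network round trip instead of 60, but still 60 keys per check
account_count = sum(int(v or 0) for v in redis.mget(keys))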

The actual solution: Sorted sets

Redis sorted sets let us do sliding windows efficiently:

import time
from typing import Tuple

def check_rate_limit(api_key: str, ip: str) -> Tuple[bool, str]:
    now = time.time()
    minute_ago = now - 60
    hour_ago = now - 3600

    # Account quota: sliding 1-hour window
    account_key = f"rl:account:{api_key}"
    pipe = redis.pipeline()
    pipe.zremrangebyscore(account_key, 0, hour_ago)  # Remove old entries
    pipe.zcard(account_key)  # Count remaining
    pipe.zadd(account_key, {f"{now}:{ip}": now})  # Add current request
    pipe.expire(account_key, 3600)

    # IP rate: sliding 1-minute window
    ip_key = f"rl:ip:{ip}"
    pipe.zremrangebyscore(ip_key, 0, minute_ago)
    pipe.zcard(ip_key)
    pipe.zadd(ip_key, {f"{now}:{api_key}": now})
    pipe.expire(ip_key, 60)

    results = pipe.execute()

    account_count = results[1]  # ZCARD on the account key (2nd pipeline command)
    ip_count = results[5]       # ZCARD on the IP key (6th pipeline command)

    if account_count >= 1000:
        return False, "Account quota exceeded (1000/hour)"
    if ip_count >= 200:
        return False, "Too many requests from IP (200/min)"

    return True, "OK"

Each request adds a timestamped entry to the sorted set. We remove entries older than the window, count what's left, and add the new request, all in one pipeline (which redis-py wraps in MULTI/EXEC by default, so the sequence runs atomically). One side effect worth knowing: the ZADD lands before the limit check, so rejected requests still count toward the window, and a client that keeps hammering stays blocked.

This actually works. Sliding windows, accurate counts, handles distributed clients gracefully.
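
If you ever want to see what's inside one of these windows, ZRANGE with scores dumps the timestamped members (a debugging sketch; the key name is illustrative):

# Peek at one account's sliding window, newest entries first
entries = redis.zrange("rl:account:some-api-key", 0, -1, desc=True, withscores=True)
for member, score in entries:
    print(member, score)  # e.g. b'1736075430.12:1.2.3.4' 1736075430.12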

[Image: API monitoring dashboard]

The performance catch

Sorted sets use more memory than simple counters. For 1M API keys making 1000 req/hour each:

  • Simple counters: ~50MB
  • Sorted sets: ~2GB

We handled this with Redis memory limits and LRU eviction:

maxmemory 4gb
maxmemory-policy allkeys-lru

If memory gets tight, Redis evicts least-recently-used keys. Which is fine - if an API key hasn’t been used in a while, losing its rate limit history doesn’t matter. Next request starts fresh.
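
If you want to sanity-check those memory numbers against your own traffic, Redis reports per-key cost directly (a quick sketch; the key name is illustrative, and MEMORY USAGE needs Redis 4.0+):

# Approximate bytes consumed by one account's sorted set
bytes_used = redis.memory_usage("rl:account:some-api-key")
print(f"{bytes_used} bytes for one account's window")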

Burst handling

Some clients have legitimate burst patterns: quiet for 50 minutes, then 250 requests in five minutes. The hourly quota allows that, and it should. What it shouldn't allow is the whole 1000-request quota landing in a single spike.

So we capped bursts at 300 requests in any 5-minute window:

def check_rate_limit(api_key: str, ip: str) -> Tuple[bool, dict]:
    # ... same sliding-window setup as above (8 pipeline commands) ...

    # Check burst: max 300 req in any 5-minute window
    five_min_ago = now - 300
    pipe.zcount(account_key, five_min_ago, now)  # appended as the 9th command

    # ... execute pipeline as before ...

    burst_count = results[8]  # zcount is the 9th pipeline command, so index 8
    if burst_count >= 300:
        return False, {
            "error": "Burst limit exceeded",
            "limit": 300,
            "window": "5min",
            "retry_after": 60
        }

This catches abuse while allowing normal traffic spikes.

Response headers

We added rate limit info to response headers (following GitHub’s API pattern):

from flask import g

@app.after_request
def add_rate_limit_headers(response):
    if hasattr(g, 'rate_limit_info'):
        info = g.rate_limit_info
        response.headers['X-RateLimit-Limit'] = str(info['limit'])
        response.headers['X-RateLimit-Remaining'] = str(info['remaining'])
        response.headers['X-RateLimit-Reset'] = str(info['reset'])
    return response
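
Those headers only work if something populates g.rate_limit_info first. A minimal before_request sketch (the X-API-Key header name, the hard-coded limit, and the extra ZCARD are assumptions; a tidier version would have check_rate_limit return its own counts):

import time
from flask import g, jsonify, request

ACCOUNT_LIMIT = 1000  # matches the 1000/hour account quota above

@app.before_request
def enforce_rate_limit():
    api_key = request.headers.get('X-API-Key', '')  # header name is an assumption
    allowed, msg = check_rate_limit(api_key, request.remote_addr)

    used = redis.zcard(f"rl:account:{api_key}")  # one extra read, just for the headers
    g.rate_limit_info = {
        'limit': ACCOUNT_LIMIT,
        'remaining': max(ACCOUNT_LIMIT - used, 0),
        'reset': int(time.time()) + 3600,  # worst case for a sliding window
    }

    if not allowed:
        return jsonify({'error': msg}), 429  # short-circuits the request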

Clients can see their quota usage without guessing. Support tickets dropped by 40%.

Cost considerations

Redis isn’t free. We’re using ~4GB for rate limiting across all our API keys. On AWS ElastiCache, that’s ~$150/month.

Could we use a cheaper solution? Probably. In-memory counters with sticky sessions, or database-backed rate limiting with caching.

But Redis handles our peak load (50K req/sec) without breaking a sweat. The $150/month is worth not waking up at 3am because the rate limiter crashed.

Lessons learned

  1. Test with distributed clients. Our initial testing was single-threaded curl requests. Useless. (There's a minimal harness sketched after this list.)

  2. Different limits for different purposes. Per-IP and per-account limits catch different abuse patterns.

  3. Sliding windows > fixed windows. Users understand “1000 per hour” better than “1000 per hour but it resets at :00.”

  4. Return helpful error messages. {"error": "rate limit exceeded"} vs {"error": "rate limit exceeded", "retry_after": 45, "quota_resets": "2026-01-05T12:00:00Z"} - second one gets way fewer support tickets.

  5. Memory management matters. Sorted sets eat RAM. Plan for it.
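
For what it's worth, even a crude thread-pool harness would have caught the problem sooner than serial curl did: three workers sharing one API key reproduce exactly the aggregate traffic that tripped us up. A sketch, with a hypothetical endpoint and header name:

# Crude load harness: three "servers" sharing one API key
from concurrent.futures import ThreadPoolExecutor
import requests

def worker(_: int) -> int:
    r = requests.get(
        "https://api.example.com/v1/ping",              # hypothetical endpoint
        headers={"X-API-Key": "shared-key-under-test"},
    )
    return r.status_code

with ThreadPoolExecutor(max_workers=3) as pool:
    codes = list(pool.map(worker, range(180)))          # ~3 origins' worth of traffic

print(f"{codes.count(429)} of {len(codes)} requests got rate-limited")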

Would I build this differently today? Maybe use a proper rate limiting service like Upstash or Redis Enterprise. But for our scale, this works. And I understand every line of it, which matters when things go wrong at 3am.