Rate limiting seems simple until it isn’t. We thought we had it figured out - Redis counters, sliding windows, the works. Then a client with a distributed system hit our API and everything fell apart.
What we built initially
Standard stuff. A fixed-window counter in Redis:
# redis is a module-level redis.Redis client, shared by the snippets in this post
def check_rate_limit(api_key: str) -> bool:
    key = f"rate_limit:{api_key}"
    current = redis.get(key)
    if current and int(current) >= 100:  # 100 req/min
        return False
    pipe = redis.pipeline()
    pipe.incr(key)
    pipe.expire(key, 60)
    pipe.execute()
    return True
Worked great in testing. 100 requests per minute per API key, clean reset every minute.
Shipped it. Felt smart.
The problem
Week 2 of production, support ticket: “We’re getting 429s but we’re only making 60 requests per minute.”
Weird. Checked logs. They were indeed making ~60 req/min. But our counter showed 180.
Turns out they were running the same API key across 3 different servers. Each server was making 60 req/min. Our rate limiter saw 180 req/min from the same key and blocked them.
“Working as intended,” I thought. But that’s not how customers think. They think in terms of total requests, not requests per origin server.
Attempt 1: Fix the counter
Maybe we just needed better counting:
def check_rate_limit(api_key: str, ip: str) -> bool:
    key = f"rate_limit:{api_key}:{ip}"
    # ... same logic
Now each IP gets its own bucket. Problem solved, right?
Nope. Some clients use load balancers with IP rotation. One request from 1.2.3.4, next from 1.2.3.5. They blew through our limits because each IP got a fresh bucket.
Plus we couldn’t actually enforce account-level limits anymore. A malicious user could just spin up 100 IPs and get 100x the quota.
What actually works
We needed to track both:
- Per-API-key total (account-level quota)
- Per-IP rate (prevent single-IP abuse)
But here’s the key insight: use different limits for different purposes.
from datetime import datetime, timedelta
from typing import Tuple

def check_rate_limits(api_key: str, ip: str) -> Tuple[bool, str]:
    now = datetime.now()
    minute_key = now.strftime("%Y-%m-%d:%H:%M")

    # Account-level: 1000 req/hour (sliding window)
    account_key = f"rl:account:{api_key}:{minute_key}"
    account_count = sum(
        int(redis.get(f"rl:account:{api_key}:{(now - timedelta(minutes=i)).strftime('%Y-%m-%d:%H:%M')}") or 0)
        for i in range(60)
    )
    if account_count >= 1000:
        return False, "Account quota exceeded"

    # IP-level: 200 req/min (fixed window)
    ip_key = f"rl:ip:{ip}:{minute_key}"
    ip_count = int(redis.get(ip_key) or 0)
    if ip_count >= 200:
        return False, "IP rate limit exceeded"

    # Increment both
    pipe = redis.pipeline()
    pipe.incr(account_key)
    pipe.expire(account_key, 3600)  # 1 hour
    pipe.incr(ip_key)
    pipe.expire(ip_key, 60)  # 1 minute
    pipe.execute()
    return True, "OK"
This is closer but still has issues. That sum() call does 60 Redis GET operations. Doesn’t scale.
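Even batching those reads into a single MGET only softens it - one round trip instead of 60, but still 60 keys touched on every request. A sketch of that stopgap (not what we shipped):

def account_count_last_hour(api_key: str, now: datetime) -> int:
    # One MGET instead of 60 GETs: fewer round trips, same number of keys read
    keys = [
        f"rl:account:{api_key}:{(now - timedelta(minutes=i)).strftime('%Y-%m-%d:%H:%M')}"
        for i in range(60)
    ]
    return sum(int(v or 0) for v in redis.mget(keys))

It delays the problem rather than solving it, so we kept looking.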
The actual solution: Sorted sets
Redis sorted sets let us do sliding windows efficiently:
import time
from typing import Tuple

def check_rate_limit(api_key: str, ip: str) -> Tuple[bool, str]:
    now = time.time()
    minute_ago = now - 60
    hour_ago = now - 3600

    # Account quota: sliding 1-hour window
    account_key = f"rl:account:{api_key}"
    pipe = redis.pipeline()
    pipe.zremrangebyscore(account_key, 0, hour_ago)  # Remove old entries
    pipe.zcard(account_key)                          # Count remaining
    pipe.zadd(account_key, {f"{now}:{ip}": now})     # Add current request
    pipe.expire(account_key, 3600)

    # IP rate: sliding 1-minute window
    ip_key = f"rl:ip:{ip}"
    pipe.zremrangebyscore(ip_key, 0, minute_ago)
    pipe.zcard(ip_key)
    pipe.zadd(ip_key, {f"{now}:{api_key}": now})
    pipe.expire(ip_key, 60)

    results = pipe.execute()
    account_count = results[1]
    ip_count = results[5]

    if account_count >= 1000:
        return False, "Account quota exceeded (1000/hour)"
    if ip_count >= 200:
        return False, "Too many requests from IP (200/min)"
    return True, "OK"
Each request adds a timestamped entry to the sorted set. We remove entries older than the window, count what’s left, and add the new request. All in one pipeline.
This actually works. Sliding windows, accurate counts, handles distributed clients gracefully.
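One wrinkle worth knowing about: the pipeline adds the request to the sorted set before the limit check runs, so rejected requests still occupy the window. If that matters for your traffic, the trim/count/add can be made atomic and conditional with a short Lua script. A sketch (not the code we shipped; it uses redis-py's register_script, and the names are mine):

# Sliding-window check as a single atomic server-side script
SLIDING_WINDOW_LUA = """
local key    = KEYS[1]
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
local member = ARGV[4]

redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
if redis.call('ZCARD', key) >= limit then
    return 0
end
redis.call('ZADD', key, now, member)
redis.call('EXPIRE', key, window)
return 1
"""

sliding_window = redis.register_script(SLIDING_WINDOW_LUA)

def allow(key: str, member: str, window: int, limit: int) -> bool:
    # 1 = allowed (and recorded), 0 = rejected (and not recorded)
    return sliding_window(keys=[key], args=[time.time(), window, limit, member]) == 1

The account and IP windows would each call it with their own key, window size, and limit.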
The performance catch
Sorted sets use more memory than simple counters. For 1M API keys making 1000 req/hour each:
- Simple counters: ~50MB
- Sorted sets: ~2GB
We handled this with Redis memory limits and LRU eviction:
maxmemory 4gb
maxmemory-policy allkeys-lru
If memory gets tight, Redis evicts least-recently-used keys. Which is fine - if an API key hasn’t been used in a while, losing its rate limit history doesn’t matter. Next request starts fresh.
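Those numbers are estimates; before settling on a maxmemory value it's worth sampling real keys. A quick sketch (the key pattern and sample size here are arbitrary):

def average_rate_limit_key_bytes(pattern: str = "rl:account:*", sample: int = 1000) -> float:
    # MEMORY USAGE on a sample of live keys beats back-of-the-envelope math
    sizes = []
    for key in redis.scan_iter(match=pattern, count=1000):
        size = redis.memory_usage(key)
        if size:
            sizes.append(size)
        if len(sizes) >= sample:
            break
    return sum(sizes) / len(sizes) if sizes else 0.0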
Burst handling
Some clients have legitimate burst patterns - quiet for 50 minutes, then a couple hundred requests packed into a few minutes.
We added a burst limit that leaves room for those spikes while still blocking runaway bursts:
def check_rate_limit(api_key: str, ip: str) -> Tuple[bool, dict]:
    # ... existing code ...

    # Check burst: max 300 req in any 5-minute window
    five_min_ago = now - 300
    pipe.zcount(account_key, five_min_ago, now)

    # ... execute pipeline ...

    burst_count = results[8]  # From the zcount
    if burst_count >= 300:
        return False, {
            "error": "Burst limit exceeded",
            "limit": 300,
            "window": "5min",
            "retry_after": 60
        }
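Putting the three windows into one pipeline looks roughly like this (a sketch, not verbatim production code; the limits and key names follow the earlier snippets, and the success payload is only there to feed the response headers below):

def check_rate_limit(api_key: str, ip: str) -> Tuple[bool, dict]:
    now = time.time()
    account_key = f"rl:account:{api_key}"
    ip_key = f"rl:ip:{ip}"

    pipe = redis.pipeline()
    pipe.zremrangebyscore(account_key, 0, now - 3600)  # 0: trim account window
    pipe.zcard(account_key)                            # 1: account count
    pipe.zadd(account_key, {f"{now}:{ip}": now})       # 2: record this request
    pipe.expire(account_key, 3600)                     # 3
    pipe.zremrangebyscore(ip_key, 0, now - 60)         # 4: trim IP window
    pipe.zcard(ip_key)                                 # 5: IP count
    pipe.zadd(ip_key, {f"{now}:{api_key}": now})       # 6
    pipe.expire(ip_key, 60)                            # 7
    pipe.zcount(account_key, now - 300, now)           # 8: burst count, last 5 min
    results = pipe.execute()

    account_count, ip_count, burst_count = results[1], results[5], results[8]
    if account_count >= 1000:
        return False, {"error": "Account quota exceeded", "limit": 1000, "window": "1h"}
    if ip_count >= 200:
        return False, {"error": "IP rate limit exceeded", "limit": 200, "window": "1min"}
    if burst_count >= 300:
        return False, {"error": "Burst limit exceeded", "limit": 300, "window": "5min", "retry_after": 60}
    return True, {"limit": 1000, "remaining": max(0, 1000 - account_count - 1)}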
This catches abuse while allowing normal traffic spikes.
Response headers
We added rate limit info to response headers (following GitHub’s API pattern):
from flask import g

@app.after_request
def add_rate_limit_headers(response):
    if hasattr(g, 'rate_limit_info'):
        info = g.rate_limit_info
        # Header values as strings
        response.headers['X-RateLimit-Limit'] = str(info['limit'])
        response.headers['X-RateLimit-Remaining'] = str(info['remaining'])
        response.headers['X-RateLimit-Reset'] = str(info['reset'])
    return response
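The other half of that wiring - populating g.rate_limit_info and turning a rejected check into a 429 - looks something like this (a sketch; the X-API-Key header name and the coarse reset value are assumptions, and check_rate_limit is the combined function sketched above):

from flask import g, jsonify, request

@app.before_request
def enforce_rate_limits():
    api_key = request.headers.get("X-API-Key", "")  # assumed header name
    allowed, info = check_rate_limit(api_key, request.remote_addr)
    g.rate_limit_info = {
        "limit": 1000,
        "remaining": info.get("remaining", 0),
        "reset": int(time.time()) + 3600,  # coarse: worst case for a sliding hour
    }
    if not allowed:
        response = jsonify(info)
        response.status_code = 429
        return response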
Clients can see their quota usage without guessing. Support tickets dropped by 40%.
Cost considerations
Redis isn’t free. We’re using ~4GB for rate limiting across all our API keys. On AWS ElastiCache, that’s ~$150/month.
Could we use a cheaper solution? Probably. In-memory counters with sticky sessions, or database-backed rate limiting with caching.
But Redis handles our peak load (50K req/sec) without breaking a sweat. The $150/month is worth not waking up at 3am because the rate limiter crashed.
Lessons learned
Test with distributed clients. Our initial testing was single-threaded curl requests. Useless.
Different limits for different purposes. Per-IP and per-account limits catch different abuse patterns.
Sliding windows > fixed windows. Users understand “1000 per hour” better than “1000 per hour but it resets at :00.”
Return helpful error messages. {"error": "rate limit exceeded"} vs {"error": "rate limit exceeded", "retry_after": 45, "quota_resets": "2026-01-05T12:00:00Z"} - the second one gets way fewer support tickets (see the sketch below).
Memory management matters. Sorted sets eat RAM. Plan for it.
Would I build this differently today? Maybe use a proper rate limiting service like Upstash or Redis Enterprise. But for our scale, this works. And I understand every line of it, which matters when things go wrong at 3am.