We were spending $12,000/month on Redis before I realized we were doing it completely wrong.
Not “slightly inefficient” wrong. Full-on “why is our cache bigger than our database” wrong. Here’s what I learned after three months of firefighting and optimization.
The problem nobody warned us about
Our Redis instance hit 32GB of memory. Our actual PostgreSQL database? 8GB.
Something was very, very wrong.
Started digging through our caching logic. Found this gem:
def get_user_feed(user_id):
    cache_key = f"feed:{user_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)

    # Generate feed (expensive operation)
    feed = generate_feed(user_id)

    # Cache it
    redis.set(cache_key, json.dumps(feed))
    return feed
Looks innocent. We thought so too. Until we realized we never set an expiration.
We had feed data from users who hadn’t logged in for 18 months. Dead accounts. Banned users. Test accounts from 2023. All sitting in Redis, eating memory, costing money.
Mistake #1: No TTL means forever
The fix seems obvious now:
# Set expiration - 1 hour for feeds
redis.setex(cache_key, 3600, json.dumps(feed))
But here’s the thing nobody tells you: choosing the right TTL is hard.
Too short? Cache misses everywhere, database gets hammered.
Too long? Stale data and wasted memory.
What actually worked for us:
# Different TTLs for different data types
CACHE_TTLS = {
    'user_profile': 86400,      # 24 hours - rarely changes
    'user_feed': 3600,          # 1 hour - updates frequently
    'trending_posts': 300,      # 5 minutes - needs to be fresh
    'static_content': 604800,   # 1 week - almost never changes
}

def cache_set(key, value, data_type='default'):
    ttl = CACHE_TTLS.get(data_type, 3600)
    redis.setex(key, ttl, json.dumps(value))
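Call sites then pick a data type instead of hand-rolling a TTL; reusing the feed example from earlier:

feed = generate_feed(user_id)
cache_set(f"feed:{user_id}", feed, data_type='user_feed')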
Dropped our Redis memory usage by 40% immediately.
Mistake #2: Caching entire objects
Found this pattern everywhere:
def get_post(post_id):
    cache_key = f"post:{post_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)

    # Fetch post with ALL relationships
    post = db.query(Post).options(
        joinedload(Post.author),
        joinedload(Post.comments),
        joinedload(Post.tags),
        joinedload(Post.reactions)
    ).filter_by(id=post_id).first()

    redis.set(cache_key, json.dumps(post.to_dict()))
    return post
A single post cache entry was 50KB because we included:
- 200 comments with full user profiles
- All reactions (thousands of them)
- Complete tag metadata
- Author’s full profile
We were caching data we didn’t need 90% of the time.
Better approach:
def get_post(post_id):
    # Cache just the post
    cache_key = f"post:{post_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)

    # Fetch ONLY the post
    post = db.query(Post).filter_by(id=post_id).first()
    post_data = post.to_dict()
    redis.setex(cache_key, 3600, json.dumps(post_data))
    return post_data

def get_post_with_comments(post_id, limit=20):
    # Separate cache for comments
    post = get_post(post_id)
    comments_key = f"post:{post_id}:comments:{limit}"
    cached_comments = redis.get(comments_key)
    if cached_comments:
        post['comments'] = json.loads(cached_comments)
    else:
        comments = fetch_comments(post_id, limit)
        redis.setex(comments_key, 600, json.dumps(comments))
        post['comments'] = comments
    return post
Cache size dropped another 60%. Why? Because most requests just need the post, not everything.
Mistake #3: Cache stampede
This one took down our API for 10 minutes.
Scenario: Popular post’s cache expires. Suddenly 500 requests all hit the database at once trying to regenerate it. Database melts. Everything breaks.
Our original code had zero protection:
cached = redis.get(cache_key)
if not cached:
    # 500 requests all do this simultaneously
    data = expensive_database_query()
    redis.set(cache_key, data)
The fix: cache locking
import time
import uuid

def get_with_lock(cache_key, fetch_func, ttl=3600):
    # Try to get from cache
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)

    # Try to acquire lock
    lock_key = f"lock:{cache_key}"
    lock_id = str(uuid.uuid4())

    # Atomic lock acquisition with 10 second timeout
    lock_acquired = redis.set(
        lock_key,
        lock_id,
        nx=True,  # Only set if doesn't exist
        ex=10     # Lock expires in 10 seconds
    )

    if lock_acquired:
        try:
            # We got the lock, fetch the data
            data = fetch_func()
            redis.setex(cache_key, ttl, json.dumps(data))
            return data
        finally:
            # Release lock only if we still own it
            script = """
            if redis.call("get", KEYS[1]) == ARGV[1] then
                return redis.call("del", KEYS[1])
            else
                return 0
            end
            """
            redis.eval(script, 1, lock_key, lock_id)
    else:
        # Someone else is fetching, wait and retry
        time.sleep(0.1)
        return get_with_lock(cache_key, fetch_func, ttl)
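Plugging it into the earlier feed lookup is then a one-liner (illustrative wiring, not our exact call site):

def get_user_feed(user_id):
    return get_with_lock(
        f"feed:{user_id}",
        lambda: generate_feed(user_id),
        ttl=CACHE_TTLS['user_feed'],
    )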
Haven’t had a stampede since.
Mistake #4: Not monitoring cache hit rate
We had no idea if our cache was even working. Turns out, our hit rate was 23%. Terrible.
Added monitoring:
def get_cached(key, fetch_func, ttl=3600):
    cached = redis.get(key)
    if cached:
        # Track hit
        statsd.increment('cache.hit', tags=[f'key_prefix:{key.split(":")[0]}'])
        return json.loads(cached)

    # Track miss
    statsd.increment('cache.miss', tags=[f'key_prefix:{key.split(":")[0]}'])
    data = fetch_func()
    redis.setex(key, ttl, json.dumps(data))
    return data
This revealed that certain cache keys had 5% hit rates. We were caching data that was almost never reused. Removed those caches entirely.
Now we’re at 87% hit rate.
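If you just want the cluster-wide number without statsd, Redis already tracks it; a quick sanity check against the INFO stats section looks like this:

# Global hit rate straight from Redis; the statsd counters above
# give the per-prefix breakdown this number can't.
stats = redis.info('stats')
hits = stats['keyspace_hits']
misses = stats['keyspace_misses']
hit_rate = hits / (hits + misses) if (hits + misses) else 0.0
print(f"global hit rate: {hit_rate:.1%}")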
Mistake #5: JSON everywhere
We were JSON encoding everything:
redis.set(f"counter:{user_id}", json.dumps(count)) # WHY?
For simple values, just use native Redis commands:
# Counters
redis.incr(f"counter:{user_id}")
# Sets
redis.sadd(f"followers:{user_id}", follower_id)
# Sorted sets for leaderboards
redis.zadd("leaderboard", {user_id: score})
# Hashes for objects
redis.hset(f"user:{user_id}", mapping={
"name": "John",
"email": "[email protected]"
})
Faster. Less memory. Built-in atomic operations.
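The read side stays just as direct, with no json.loads in sight:

count = int(redis.get(f"counter:{user_id}") or 0)
is_follower = redis.sismember(f"followers:{user_id}", follower_id)
top_ten = redis.zrevrange("leaderboard", 0, 9, withscores=True)  # [(member, score), ...]
email = redis.hget(f"user:{user_id}", "email")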
Mistake #6: Caching the wrong things
We cached database query results. But we should have been caching computation results.
Example: User reputation score calculation involves:
- Counting posts (DB query)
- Counting upvotes (DB query)
- Counting comments (DB query)
- Complex formula
We cached each query result separately. Still slow: every request made three cache round-trips plus the formula, and a miss on any one of them sent us back to the database.
Better:
def get_reputation_score(user_id):
    cache_key = f"reputation:{user_id}"
    score = redis.get(cache_key)
    if score:
        return int(score)

    # Calculate ONCE
    posts = count_user_posts(user_id)
    upvotes = count_user_upvotes(user_id)
    comments = count_user_comments(user_id)
    score = calculate_reputation(posts, upvotes, comments)

    # Cache the RESULT
    redis.setex(cache_key, 3600, score)
    return score
Cache the computation, not the components.
What we learned
After three months of fixes:
- Redis memory: 32GB → 9GB
- Monthly cost: $12,000 → $3,000
- Average API response time: 450ms → 120ms
- Cache hit rate: 23% → 87%
Key lessons:
- Always set TTLs - Default to 1 hour, adjust based on data
- Cache small and specific - Not entire object graphs
- Protect against stampedes - Use locks or probabilistic early expiration (sketched below)
- Monitor everything - Hit rate, memory usage, evictions
- Use native Redis types - Not everything needs JSON
- Cache computation, not data - Cache what’s expensive to calculate
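We shipped the locking approach from Mistake #3, but the probabilistic alternative in that list deserves a sketch. The idea (XFetch-style): occasionally refresh before expiry, with a probability that grows as expiry approaches, so a thundering herd never forms. This is a minimal sketch that assumes you store the compute time and logical expiry alongside the value; the entry layout and beta parameter are illustrative, not from our codebase:

import json
import math
import random
import time

def get_with_early_expiry(cache_key, fetch_func, ttl=3600, beta=1.0):
    raw = redis.get(cache_key)
    if raw:
        entry = json.loads(raw)
        # Refresh early with rising probability near expiry;
        # log of a (0,1) draw is negative, so this subtracts headroom.
        rand = random.random() or 1e-12
        if time.time() - entry["delta"] * beta * math.log(rand) < entry["expiry"]:
            return entry["value"]

    start = time.time()
    value = fetch_func()
    delta = time.time() - start  # how long the recompute took
    entry = {"value": value, "delta": delta, "expiry": time.time() + ttl}
    # Keep the key a bit longer than the logical expiry so early refreshes can happen.
    redis.setex(cache_key, ttl + 60, json.dumps(entry))
    return value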
Redis is amazing when you use it right. We just had to learn the hard way.
One more thing: we now have a weekly job that analyzes cache key patterns and reports unused or rarely-hit keys. Found 2GB of garbage last week.
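A job like that doesn't need much: SCAN the keyspace and ask Redis how long each key has sat idle. A minimal sketch, assuming an LRU maxmemory-policy (OBJECT IDLETIME isn't available under LFU) and Redis 4.0+ for MEMORY USAGE; not our exact job:

ONE_WEEK = 7 * 24 * 3600  # seconds

def report_stale_keys(pattern="*"):
    stale = []
    for key in redis.scan_iter(match=pattern, count=500):
        idle = redis.object("idletime", key)      # seconds since last read/write
        if idle and idle > ONE_WEEK:
            size = redis.memory_usage(key) or 0   # approximate bytes
            stale.append((key, idle, size))
    # Biggest offenders first
    return sorted(stale, key=lambda item: item[2], reverse=True)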
Your cache should make things faster AND cheaper. If it’s not doing both, you’re doing it wrong.