We were spending $12,000/month on Redis before I realized we were doing it completely wrong.

Not “slightly inefficient” wrong. Full-on “why is our cache bigger than our database” wrong. Here’s what I learned after three months of firefighting and optimization.

The problem nobody warned us about

Our Redis instance hit 32GB of memory. Our actual PostgreSQL database? 8GB.

Something was very, very wrong.

Started digging through our caching logic. Found this gem:

import json
from redis import Redis

redis = Redis()  # shared redis-py client, reused throughout these snippets

def get_user_feed(user_id):
    cache_key = f"feed:{user_id}"
    cached = redis.get(cache_key)
    
    if cached:
        return json.loads(cached)
    
    # Generate feed (expensive operation)
    feed = generate_feed(user_id)
    
    # Cache it
    redis.set(cache_key, json.dumps(feed))
    
    return feed

Looks innocent. We thought so too. Until we realized we never set an expiration.

We had feed data from users who hadn’t logged in for 18 months. Dead accounts. Banned users. Test accounts from 2023. All sitting in Redis, eating memory, costing money.

Mistake #1: No TTL means forever

The fix seems obvious now:

# Set expiration - 1 hour for feeds
redis.setex(cache_key, 3600, json.dumps(feed))

But here’s the thing nobody tells you: choosing the right TTL is hard.

Too short? Cache misses everywhere, database gets hammered.
Too long? Stale data and wasted memory.

What actually worked for us:

# Different TTLs for different data types
CACHE_TTLS = {
    'user_profile': 86400,      # 24 hours - rarely changes
    'user_feed': 3600,          # 1 hour - updates frequently
    'trending_posts': 300,      # 5 minutes - needs to be fresh
    'static_content': 604800,   # 1 week - almost never changes
}

def cache_set(key, value, data_type='default'):
    ttl = CACHE_TTLS.get(data_type, 3600)
    redis.setex(key, ttl, json.dumps(value))
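
Call sites just pass the data type — roughly:

feed = generate_feed(user_id)
cache_set(f"feed:{user_id}", feed, 'user_feed')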

Dropped our Redis memory usage by 40% immediately.

Mistake #2: Caching entire objects

Found this pattern everywhere:

def get_post(post_id):
    cache_key = f"post:{post_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    
    # Fetch post with ALL relationships
    post = db.query(Post).options(
        joinedload(Post.author),
        joinedload(Post.comments),
        joinedload(Post.tags),
        joinedload(Post.reactions)
    ).filter_by(id=post_id).first()
    
    redis.set(cache_key, json.dumps(post.to_dict()))
    return post

A single post cache entry was 50KB because we included:

  • 200 comments with full user profiles
  • All reactions (thousands of them)
  • Complete tag metadata
  • Author’s full profile

We were caching data we didn’t need 90% of the time.

Better approach:

def get_post(post_id):
    # Cache just the post
    cache_key = f"post:{post_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    
    # Fetch ONLY the post and cache the serialized dict
    post = db.query(Post).filter_by(id=post_id).first()
    post_data = post.to_dict()
    redis.setex(cache_key, 3600, json.dumps(post_data))
    return post_data

def get_post_with_comments(post_id, limit=20):
    # Separate cache for comments
    post = get_post(post_id)
    
    comments_key = f"post:{post_id}:comments:{limit}"
    cached_comments = redis.get(comments_key)
    
    if cached_comments:
        post['comments'] = json.loads(cached_comments)
    else:
        comments = fetch_comments(post_id, limit)
        redis.setex(comments_key, 600, json.dumps(comments))
        post['comments'] = comments
    
    return post

Cache size dropped another 60%. Why? Because most requests just need the post, not everything.

Mistake #3: Cache stampede

This one took down our API for 10 minutes.

Scenario: Popular post’s cache expires. Suddenly 500 requests all hit the database at once trying to regenerate it. Database melts. Everything breaks.

Our original code had zero protection:

cached = redis.get(cache_key)
if not cached:
    # 500 requests all do this simultaneously
    data = expensive_database_query()
    redis.set(cache_key, data)

The fix: cache locking

import time
import uuid

def get_with_lock(cache_key, fetch_func, ttl=3600):
    # Try to get from cache
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    
    # Try to acquire lock
    lock_key = f"lock:{cache_key}"
    lock_id = str(uuid.uuid4())
    
    # Atomic lock acquisition with 10 second timeout
    lock_acquired = redis.set(
        lock_key, 
        lock_id, 
        nx=True,  # Only set if doesn't exist
        ex=10     # Lock expires in 10 seconds
    )
    
    if lock_acquired:
        try:
            # We got the lock, fetch the data
            data = fetch_func()
            redis.setex(cache_key, ttl, json.dumps(data))
            return data
        finally:
            # Release lock only if we still own it
            script = """
            if redis.call("get", KEYS[1]) == ARGV[1] then
                return redis.call("del", KEYS[1])
            else
                return 0
            end
            """
            redis.eval(script, 1, lock_key, lock_id)
    else:
        # Someone else is fetching, wait and retry
        time.sleep(0.1)
        return get_with_lock(cache_key, fetch_func, ttl)
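
We then wrapped the hot paths in it — something like this (fetch_post_from_db is a stand-in for whatever expensive query you're protecting):

post = get_with_lock(f"post:{post_id}", lambda: fetch_post_from_db(post_id))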

Haven’t had a stampede since.

Mistake #4: Not monitoring cache hit rate

We had no idea if our cache was even working. Turns out, our hit rate was 23%. Terrible.

Added monitoring:

def get_cached(key, fetch_func, ttl=3600):
    cached = redis.get(key)
    
    if cached:
        # Track hit
        statsd.increment('cache.hit', tags=[f'key_prefix:{key.split(":")[0]}'])
        return json.loads(cached)
    
    # Track miss
    statsd.increment('cache.miss', tags=[f'key_prefix:{key.split(":")[0]}'])
    
    data = fetch_func()
    redis.setex(key, ttl, json.dumps(data))
    return data

This revealed that certain cache keys had 5% hit rates. We were caching data that was almost never reused. Removed those caches entirely.

Now we’re at 87% hit rate.
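
If you don't have statsd wired up, Redis tracks this itself. A quick sanity check using the INFO stats counters (instance-wide, not per key prefix):

def global_hit_rate():
    stats = redis.info('stats')
    hits = stats['keyspace_hits']
    misses = stats['keyspace_misses']
    return hits / (hits + misses) if (hits + misses) else 0.0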

Mistake #5: JSON everywhere

We were JSON encoding everything:

redis.set(f"counter:{user_id}", json.dumps(count))  # WHY?

For simple values, just use native Redis commands:

# Counters
redis.incr(f"counter:{user_id}")

# Sets
redis.sadd(f"followers:{user_id}", follower_id)

# Sorted sets for leaderboards
redis.zadd("leaderboard", {user_id: score})

# Hashes for objects
redis.hset(f"user:{user_id}", mapping={
    "name": "John",
    "email": "[email protected]"
})

Faster. Less memory. Built-in atomic operations.
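
Reading them back is just as direct — a few examples with the same keys as above:

# Set cardinality, top of the leaderboard, full hash
follower_count = redis.scard(f"followers:{user_id}")
top_ten = redis.zrevrange("leaderboard", 0, 9, withscores=True)
user = redis.hgetall(f"user:{user_id}")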

Mistake #6: Caching the wrong things

We cached database query results. But we should have been caching computation results.

Example: User reputation score calculation involves:

  • Counting posts (DB query)
  • Counting upvotes (DB query)
  • Counting comments (DB query)
  • Complex formula

We cached each query result separately. It was still slow: every request pulled three cache entries and re-ran the formula, when all we actually needed was the final number.

Better:

def get_reputation_score(user_id):
    cache_key = f"reputation:{user_id}"
    score = redis.get(cache_key)
    
    if score:
        return int(score)
    
    # Calculate ONCE
    posts = count_user_posts(user_id)
    upvotes = count_user_upvotes(user_id)
    comments = count_user_comments(user_id)
    
    score = calculate_reputation(posts, upvotes, comments)
    
    # Cache the RESULT
    redis.setex(cache_key, 3600, score)
    return score

Cache the computation, not the components.

What we learned

After three months of fixes:

  • Redis memory: 32GB → 9GB
  • Monthly cost: $12,000 → $3,000
  • Average API response time: 450ms → 120ms
  • Cache hit rate: 23% → 87%

Key lessons:

  1. Always set TTLs - Default to 1 hour, adjust based on data
  2. Cache small and specific - Not entire object graphs
  3. Protect against stampedes - Use locks or probabilistic early expiration (sketched after this list)
  4. Monitor everything - Hit rate, memory usage, evictions
  5. Use native Redis types - Not everything needs JSON
  6. Cache computation, not data - Cache what’s expensive to calculate
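
We went with locks, but the other option from lesson 3, probabilistic early expiration, is worth a sketch: store how long the value took to compute alongside it, and as expiry approaches, let a random fraction of requests refresh it early so the recompute never lands on everyone at once. Roughly — this is the idea, not the code we run:

import json
import math
import random
import time

def get_with_early_expiration(key, fetch_func, ttl=3600, beta=1.0):
    cached = redis.get(key)
    if cached:
        entry = json.loads(cached)
        # The closer we are to expiry (and the longer the recompute takes),
        # the more likely this request is to refresh the value early
        rand = random.random() or 1e-10
        if time.time() - entry["delta"] * beta * math.log(rand) < entry["expiry"]:
            return entry["value"]

    start = time.time()
    value = fetch_func()
    delta = time.time() - start  # how long the recompute took

    entry = {"value": value, "delta": delta, "expiry": time.time() + ttl}
    redis.setex(key, ttl, json.dumps(entry))
    return value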

Redis is amazing when you use it right. We just had to learn the hard way.

One more thing: we now have a weekly job that analyzes cache key patterns and reports unused or rarely-hit keys. Found 2GB of garbage last week.
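
If you want to build something similar, SCAN plus OBJECT IDLETIME gets you most of the way there. A rough sketch (note that OBJECT IDLETIME isn't available when the instance runs an LFU eviction policy):

def find_idle_keys(max_idle_seconds=7 * 86400):
    # Walk the keyspace without blocking Redis and flag keys
    # that haven't been read or written in a week
    idle_keys = []
    for key in redis.scan_iter(count=1000):
        idle = redis.object("idletime", key)  # seconds since last access
        if idle and idle > max_idle_seconds:
            idle_keys.append((key, idle))
    return idle_keys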

Your cache should make things faster AND cheaper. If it’s not doing both, you’re doing it wrong.