Common Redis Caching Mistakes (Q&A Format)

Q: Our Redis instance is 32GB but our database is only 8GB. Is this normal?

A: No, this is a red flag indicating poor cache hygiene.

We encountered this exact scenario. Investigation revealed cache entries from:

  • Users inactive for 18+ months
  • Deleted/banned accounts
  • Test accounts from 2023
  • Orphaned session data

Problem root cause: Missing TTL (Time To Live) on cache entries.

# ❌ Wrong - lives forever
redis.set(cache_key, json.dumps(data))

# ✅ Correct - expires after 1 hour
redis.setex(cache_key, 3600, json.dumps(data))

Impact: Setting appropriate TTLs cut memory usage by 40% immediately.
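Legacy keys written before the fix still live forever. A minimal backfill sketch, assuming a `redis_client` connection and a `cache:` key prefix (both placeholders, not from the original incident):

```python
def backfill_ttls(redis_client, pattern="cache:*", ttl=3600):
    """Give every matching key that has no TTL a finite lifetime."""
    fixed = 0
    # scan_iter pages through keys with SCAN, so Redis is never blocked
    for key in redis_client.scan_iter(match=pattern, count=1000):
        if redis_client.ttl(key) == -1:  # -1 means "no expiry set"
            redis_client.expire(key, ttl)
            fixed += 1
    return fixed
```

Run it once after deploying the `setex` change so old entries stop accumulating.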


Q: How do I choose the right TTL value?

A: TTL should match data volatility, not be arbitrary.

Our TTL strategy by data type:

  Data Type          TTL      Rationale
  User Profile       24h      Rarely changes, safe to cache long
  User Feed          1h       Updates frequently, needs freshness
  Trending Content   5min     Real-time requirements
  Static Content     7d       Almost never changes
  Session Data       30min    Security consideration

CACHE_TTLS = {
    'user_profile': 86400,      # 24 hours
    'user_feed': 3600,          # 1 hour
    'trending_posts': 300,      # 5 minutes
    'static_content': 604800,   # 1 week
    'session_data': 1800,       # 30 minutes
}

def cache_set(key, value, data_type='default'):
    ttl = CACHE_TTLS.get(data_type, 3600)
    redis.setex(key, ttl, json.dumps(value))

Testing methodology: Start with conservative (short) TTLs, monitor cache hit rate, adjust incrementally.
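A related pitfall while tuning: keys written in the same burst with the same TTL also expire in the same burst. A common mitigation, not mentioned above but worth considering, is a small random jitter on each TTL:

```python
import random

def jittered_ttl(base_ttl, spread=0.1):
    """Return base_ttl +/- up to spread (10% by default), so keys
    written together don't all expire in the same instant."""
    delta = int(base_ttl * spread)
    return base_ttl + random.randint(-delta, delta)

# e.g. redis.setex(key, jittered_ttl(3600), payload)
```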


Q: Why are my cache entries so large?

A: You’re likely caching entire object graphs instead of focused data.

Our mistake:

# ❌ Wrong - caches 50KB per post
post = db.query(Post).options(
    joinedload(Post.author),
    joinedload(Post.comments),      # 200 comments!
    joinedload(Post.tags),
    joinedload(Post.reactions)       # Thousands of reactions
).filter_by(id=post_id).first()

redis.set(f"post:{post_id}", json.dumps(post.to_dict()))

Better approach - granular caching:

# ✅ Correct - cache only what you need
def get_post_basic(post_id):
    """Cache just the post (2KB)"""
    cache_key = f"post:{post_id}"
    if cached := redis.get(cache_key):
        return json.loads(cached)
    
    post = db.query(Post).filter_by(id=post_id).first()
    data = post.to_dict()
    redis.setex(cache_key, 3600, json.dumps(data))
    return data

def get_post_comments(post_id, limit=20):
    """Separate cache for comments (10KB)"""
    comments_key = f"post:{post_id}:comments:{limit}"
    if cached := redis.get(comments_key):
        return json.loads(cached)
    
    comments = fetch_comments(post_id, limit)
    redis.setex(comments_key, 600, json.dumps(comments))
    return comments

Result: 60% cache size reduction. Why? Most API calls need basic post data, not everything.


Q: What’s a cache stampede and how do I prevent it?

A: A cache stampede occurs when a popular cache entry expires and hundreds of requests simultaneously hit the database to regenerate it.

Symptom: Periodic API timeouts coinciding with cache expiration.

Our incident:

  • Popular post cache expired
  • 500 concurrent requests hit database
  • Database connection pool exhausted
  • 10-minute outage

Solution - Cache Locking Pattern:

import uuid
import time

def get_with_lock(cache_key, fetch_func, ttl=3600):
    # Try cache first
    if cached := redis.get(cache_key):
        return json.loads(cached)
    
    # Attempt lock acquisition
    lock_key = f"lock:{cache_key}"
    lock_id = str(uuid.uuid4())
    
    lock_acquired = redis.set(
        lock_key,
        lock_id,
        nx=True,    # Only set if doesn't exist
        ex=10       # Lock expires in 10 seconds
    )
    
    if lock_acquired:
        try:
            # Winner fetches data
            data = fetch_func()
            redis.setex(cache_key, ttl, json.dumps(data))
            return data
        finally:
            # Release lock with Lua script (atomic)
            script = """
            if redis.call("get", KEYS[1]) == ARGV[1] then
                return redis.call("del", KEYS[1])
            end
            """
            redis.eval(script, 1, lock_key, lock_id)
    else:
        # Losers wait briefly, then retry; by then the winner's result
        # should be cached. In production, cap the number of retries.
        time.sleep(0.1)
        return get_with_lock(cache_key, fetch_func, ttl)

Key concepts:

  1. Only first request acquires lock and fetches data
  2. Other requests wait briefly then retry (data will be cached)
  3. Lock timeout prevents deadlocks
  4. Atomic lock release prevents race conditions
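For comparison, redis-py ships a built-in `lock()` helper that implements the same SET NX + token-checked release internally; a sketch of the winner's path using it (the client object and `fetch_func` are placeholders):

```python
import json

def populate_with_lock(client, cache_key, fetch_func, ttl=3600):
    """Winner's path using redis-py's Lock instead of a hand-rolled
    SET NX plus a Lua release script."""
    lock = client.lock(f"lock:{cache_key}", timeout=10, blocking_timeout=5)
    if lock.acquire():
        try:
            data = fetch_func()
            client.setex(cache_key, ttl, json.dumps(data))
            return data
        finally:
            lock.release()
    return None  # lost the race; caller should re-read the cache
```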

Q: How do I know if my caching strategy is working?

A: Measure cache hit rate.

Our discovery: Cache hit rate was 23%. Terrible. We were wasting resources.

Instrumentation code:

def get_cached(key, fetch_func, ttl=3600):
    cached = redis.get(key)
    key_prefix = key.split(":")[0]
    
    if cached:
        statsd.increment('cache.hit', tags=[f'prefix:{key_prefix}'])
        return json.loads(cached)
    
    statsd.increment('cache.miss', tags=[f'prefix:{key_prefix}'])
    
    data = fetch_func()
    redis.setex(key, ttl, json.dumps(data))
    return data

Monitoring dashboard metrics:

  • Overall hit rate: hits / (hits + misses)
  • Per-key-type hit rate
  • Memory usage by key prefix
  • Eviction rate
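If instrumenting every call is too invasive, Redis tracks hits and misses itself: `keyspace_hits` and `keyspace_misses` in `INFO stats`. A small helper to turn that into a rate (the helper itself is a sketch):

```python
def hit_rate(stats):
    """Overall hit rate from the dict returned by
    redis_client.info("stats")."""
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    return hits / total if total else 0.0

# e.g. hit_rate(redis.info("stats"))
```

Note the server-side counters cover every keyspace lookup, not just your cache, so the per-prefix statsd tags above remain the finer-grained signal.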

Action items from monitoring:

  • Identified cache keys with <10% hit rate → Removed them
  • Found poorly tuned TTLs → Adjusted based on actual access patterns
  • Discovered over-cached data → Implemented granular caching

Current metrics: 87% hit rate (target: >80%)


Q: Should I JSON-encode everything I cache?

A: No. Use Redis native data types when appropriate.

Inefficient:

# ❌ Wrong - unnecessary JSON overhead
redis.set(f"counter:{user_id}", json.dumps(42))
redis.set("online_users", json.dumps([1, 2, 3, 4]))

Efficient:

# ✅ Correct - use native Redis types

# Counters
redis.incr(f"counter:{user_id}")                # Atomic increment
value = int(redis.get(f"counter:{user_id}"))

# Sets
redis.sadd("online_users", 1, 2, 3, 4)         # No duplicates
users = redis.smembers("online_users")

# Sorted Sets (leaderboards)
redis.zadd("leaderboard", {user_id: score})
top10 = redis.zrevrange("leaderboard", 0, 9, withscores=True)

# Hashes (objects)
redis.hset(f"user:{user_id}", mapping={
    "name": "John",
    "email": "[email protected]"
})
user_data = redis.hgetall(f"user:{user_id}")

Advantages:

  • Less memory (no JSON encoding overhead)
  • Faster operations
  • Atomic operations built-in
  • Type-specific commands (e.g., ZINCRBY, SINTER)

Q: What should I cache - the data or the computation?

A: Cache the expensive computation result, not intermediate data.

Inefficient approach:

# ❌ Caching queries, still doing computation
def get_reputation_score(user_id):
    # Cache each query
    posts = cache_get(f"posts:{user_id}", lambda: count_posts(user_id))
    upvotes = cache_get(f"upvotes:{user_id}", lambda: count_upvotes(user_id))
    comments = cache_get(f"comments:{user_id}", lambda: count_comments(user_id))
    
    # Still computing this every time!
    return posts * 5 + upvotes * 2 + comments * 3

Efficient approach:

# ✅ Cache the final computation
def get_reputation_score(user_id):
    cache_key = f"reputation:{user_id}"
    
    if score := redis.get(cache_key):
        return int(score)
    
    # Calculate once
    posts = count_posts(user_id)
    upvotes = count_upvotes(user_id)
    comments = count_comments(user_id)
    
    score = posts * 5 + upvotes * 2 + comments * 3
    
    # Cache the result
    redis.setex(cache_key, 3600, score)
    return score

Principle: Cache what’s expensive to compute, not what’s expensive to fetch.
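Caching the final computation has a flip side the snippet above doesn't show: the cached score must be invalidated when any of its inputs change. A minimal sketch (the write-path hooks are assumptions):

```python
def invalidate_reputation(redis_client, user_id):
    """Call from the write paths (new post, upvote, comment) so the
    next get_reputation_score() recomputes from fresh counts."""
    redis_client.delete(f"reputation:{user_id}")
```

With the 1-hour TTL as a backstop, a missed invalidation only means a briefly stale score.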


Q: Our team keeps adding cache keys. How do we maintain cache hygiene?

A: Implement automated cache analysis.

Our solution - Weekly audit job:

def analyze_cache_patterns():
    """Analyze cache usage and identify waste"""
    
    cursor = 0
    patterns = {}
    
    while True:
        cursor, keys = redis.scan(cursor, count=1000)
        
        for key in keys:
            prefix = key.split(b":")[0].decode()
            ttl = redis.ttl(key)
            size = len(redis.dump(key))
            
            if prefix not in patterns:
                patterns[prefix] = {
                    'count': 0,
                    'total_size': 0,
                    'no_ttl': 0
                }
            
            patterns[prefix]['count'] += 1
            patterns[prefix]['total_size'] += size
            if ttl == -1:  # No TTL set
                patterns[prefix]['no_ttl'] += 1
        
        if cursor == 0:
            break
    
    # Generate report
    report = []
    for prefix, stats in sorted(patterns.items(), 
                                 key=lambda x: x[1]['total_size'], 
                                 reverse=True):
        report.append({
            'prefix': prefix,
            'key_count': stats['count'],
            'total_mb': stats['total_size'] / 1024 / 1024,
            'keys_without_ttl': stats['no_ttl']
        })
    
    return report

Weekly actions:

  1. Review top memory consumers
  2. Identify keys without TTL
  3. Check for orphaned test keys
  4. Analyze cache hit rates by prefix
  5. Remove unused cache patterns

Last week’s findings: Removed 2GB of abandoned test data.
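Removing an abandoned prefix should go through SCAN rather than KEYS, since KEYS blocks the server while it walks the entire keyspace. A sketch:

```python
def delete_pattern(redis_client, pattern):
    """Delete every key matching pattern, paging with SCAN so
    Redis stays responsive throughout."""
    deleted = 0
    for key in redis_client.scan_iter(match=pattern, count=1000):
        redis_client.delete(key)
        deleted += 1
    return deleted
```

For large values, `unlink` (which frees memory asynchronously) is a drop-in replacement for `delete`.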


Summary Checklist

  • Always set TTLs - match TTL to data volatility
  • Cache granularly - small, focused cache entries
  • Prevent stampedes - use lock patterns for expensive operations
  • Monitor continuously - track hit rate, memory, evictions
  • Use native types - avoid JSON when Redis types suffice
  • Cache computation - not intermediate data
  • Audit regularly - weekly cache hygiene review

Results After 3 Months

  Metric              Before     After     Improvement
  Redis Memory        32GB       9GB       -72%
  Monthly Cost        $12,000    $3,000    -75%
  API Response Time   450ms      120ms     -73%
  Cache Hit Rate      23%        87%       +278%

Total engineering time invested: ~80 hours
Monthly savings: $9,000
ROI: Positive after 1 month


Need help with Redis optimization? Common issues include: improper TTL strategy, cache stampedes, poor key design, and inadequate monitoring. Address these systematically for best results.