Microservices Auth: We Tried 4 Patterns, Here's What Actually Worked

Authentication between microservices is one of those things that seems simple until you actually try to implement it. We went through 4 different patterns over 18 months. Each one solved some problems but created new ones. Here’s what we learned, and what we eventually settled on.

The problem

You have 15 microservices. API Gateway authenticates users with JWT. But how do internal services verify requests from each other? Service A needs to call Service B. Service B needs to know: ...
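To make the question concrete, here is a minimal Python sketch (not from the article) of one common pattern: Service B verifying a short-lived, internally issued JWT presented by Service A. The issuer name, audience, secret, and claim layout are assumptions for illustration only.

    # Hypothetical service-to-service token check using PyJWT.
    import jwt  # PyJWT

    INTERNAL_ISSUER = "auth.internal"   # hypothetical internal token issuer
    SHARED_SECRET = "change-me"         # real setups often use per-service or asymmetric keys

    def verify_service_token(token: str, expected_caller: str) -> bool:
        try:
            claims = jwt.decode(
                token,
                SHARED_SECRET,
                algorithms=["HS256"],
                issuer=INTERNAL_ISSUER,
                audience="service-b",   # this service's own identity
            )
        except jwt.InvalidTokenError:
            return False
        # The caller's identity travels in the subject claim; reject unexpected callers.
        return claims.get("sub") == expected_caller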

January 21, 2026 · DevCraft Studio · 4518 views

Node.js Memory Leak: Two Weeks to Find One Missing removeListener()

Incident Report: Node.js Memory Leak Analysis

Date: 2026-01-20
Severity: P1 (Production Impact)
MTTR: 14 days
Root Cause: Event listener leak in WebSocket handler

Timeline of Events

Day 1 - Jan 6, 09:30 UTC
- Monitoring alerts: API instances restarting every 6 hours
- Memory usage shows sawtooth pattern (gradual climb, sudden drop)
- Initial hypothesis: Database connection leak

Day 3 - Jan 8
- Ruled out database connections (pool metrics normal)
- Added heap profiling to staging environment
- Identified EventEmitter instances growing unbounded

Day 7 - Jan 13 ...

January 20, 2026 · DevCraft Studio · 4395 views

Redis Caching: The Mistakes That Cost Us $12K/Month

Common Redis Caching Mistakes (Q&A Format)

Q: Our Redis instance is 32GB but our database is only 8GB. Is this normal?

A: No, this is a red flag indicating poor cache hygiene. We encountered this exact scenario. Investigation revealed cache entries from:
- Users inactive for 18+ months
- Deleted/banned accounts
- Test accounts from 2023
- Orphaned session data

Problem root cause: Missing TTL (Time To Live) on cache entries.

    # ❌ Wrong - lives forever
    redis.set(cache_key, json.dumps(data))

    # ✅ Correct - expires after 1 hour
    redis.setex(cache_key, 3600, json.dumps(data))

Impact: Set appropriate TTLs → 40% memory reduction immediately. ...
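Before adding TTLs, it helps to know how many keys are missing one. A minimal audit sketch (not from the article), assuming redis-py and no particular key naming scheme:

    # Count keys that have no expiry, using SCAN so Redis isn't blocked.
    import redis

    r = redis.Redis()

    def count_keys_without_ttl(pattern: str = "*") -> int:
        missing = 0
        for key in r.scan_iter(match=pattern, count=1000):
            if r.ttl(key) == -1:   # -1: key exists but has no TTL set
                missing += 1
        return missing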

January 19, 2026 · DevCraft Studio · 3421 views

PostgreSQL JSONB: Three Months of Pain and What I Learned

Executive Summary

This document analyzes a three-month production deployment of PostgreSQL JSONB columns, documenting performance issues encountered, indexing strategies implemented, and architectural patterns that proved effective. The findings are relevant for teams considering JSONB for schema flexibility in high-traffic applications.

Key Metrics:
- Initial table size: 2M records
- Average JSONB column size: 50KB
- Write throughput: 10K updates/minute
- Query degradation: 200ms → 14s (70x regression)

Background and Context

In Q4 2025, our engineering team migrated user preference storage from normalized columns to JSONB format. Primary motivation was reducing schema migration overhead as product requirements evolved. ...
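As one example of the indexing strategies referenced above, a GIN index with jsonb_path_ops is a common starting point for containment lookups. This is a generic sketch, not the article's code; the user_preferences table, prefs column, and connection string are hypothetical.

    # Illustrative only: create a GIN index for JSONB containment (@>) queries.
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute(
            "CREATE INDEX IF NOT EXISTS idx_prefs_gin "
            "ON user_preferences USING GIN (prefs jsonb_path_ops)"
        )
        # A query this index can serve: users whose prefs contain the fragment.
        cur.execute(
            "SELECT id FROM user_preferences WHERE prefs @> %s",
            ('{"theme": "dark"}',),
        )
        rows = cur.fetchall()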

January 8, 2026 · DevCraft Studio · 4236 views

Why Our API Rate Limiter Failed (And How We Fixed It)

Rate limiting seems simple until it isn’t. We thought we had it figured out - Redis counters, sliding windows, the works. Then a client with a distributed system hit our API and everything fell apart.

What we built initially

Standard stuff. Token bucket in Redis:

    def check_rate_limit(api_key: str) -> bool:
        key = f"rate_limit:{api_key}"
        current = redis.get(key)
        if current and int(current) >= 100:  # 100 req/min
            return False
        pipe = redis.pipeline()
        pipe.incr(key)
        pipe.expire(key, 60)
        pipe.execute()
        return True

Worked great in testing. 100 requests per minute per API key, clean reset every minute. ...
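One common way to close the check-then-increment race in the snippet above is to do the counting atomically in a Lua script. This is a generic sketch, not necessarily the fix the article arrives at.

    # Atomic fixed-window counter: INCR and the limit check happen in one script.
    import redis

    r = redis.Redis()

    RATE_LIMIT_LUA = """
    local current = redis.call('INCR', KEYS[1])
    if current == 1 then
      redis.call('EXPIRE', KEYS[1], tonumber(ARGV[2]))
    end
    if current <= tonumber(ARGV[1]) then return 1 else return 0 end
    """
    check_limit = r.register_script(RATE_LIMIT_LUA)

    def check_rate_limit(api_key: str, limit: int = 100, window: int = 60) -> bool:
        key = f"rate_limit:{api_key}"
        return bool(check_limit(keys=[key], args=[limit, window]))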

January 5, 2026 · DevCraft Studio · 4583 views

Database Indexing: The Stuff Nobody Tells You

Indexes are supposed to make queries fast. Sometimes they make them slower. Here’s what I wish someone told me before I tanked production performance trying to “optimize” our database.

The query that started it all

Support said the admin dashboard was timing out. Found this in the slow query log:

    SELECT * FROM orders
    WHERE customer_id = 12345
      AND status IN ('pending', 'processing')
      AND created_at > '2025-12-01'
    ORDER BY created_at DESC
    LIMIT 20;

Took 8 seconds on a table with 4 million rows. Obviously needs an index, right? ...
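For reference, the composite index most people would reach for first on that query puts the equality column before the range/sort column. Illustrative sketch only (the article's whole point is that the obvious index isn't always the right call); the connection details are hypothetical.

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # Equality column first, then the low-cardinality filter, then the
        # range/ORDER BY column so the index can also satisfy the sort.
        cur.execute(
            "CREATE INDEX IF NOT EXISTS idx_orders_customer_status_created "
            "ON orders (customer_id, status, created_at DESC)"
        )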

January 3, 2026 · DevCraft Studio · 5053 views

Webhooks at Scale: What We Learned Sending 10M/Day

We send about 10 million webhooks per day now. Started at 100K/day a year ago. Thought it would scale fine. It didn’t. Here’s everything that broke and how we fixed it.

The naive implementation

    def send_webhook(event_type, payload, url):
        try:
            requests.post(
                url,
                json={"event": event_type, "data": payload},
                timeout=5
            )
        except Exception:
            # ¯\_(ツ)_/¯
            pass

Call this function whenever something happens. Fire and forget. What could go wrong?

Problem 1: Blocking the main thread

Webhook delivery was part of the request cycle. User creates an order, we save it, then send webhooks to their configured endpoints. If their endpoint is down or slow, the user’s request times out. ...
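The usual remedy for Problem 1 is to take delivery off the request path and hand it to a worker with retries. A minimal in-process sketch of that idea (not the article's implementation; a real system would use a durable queue such as Celery, SQS, or similar):

    import queue
    import threading
    import time
    import requests

    webhook_queue: "queue.Queue[dict]" = queue.Queue()

    def enqueue_webhook(event_type, payload, url):
        # Called from the request path; returns immediately instead of blocking.
        webhook_queue.put({"event": event_type, "data": payload, "url": url, "attempt": 0})

    def delivery_worker():
        while True:
            job = webhook_queue.get()
            try:
                requests.post(
                    job["url"],
                    json={"event": job["event"], "data": job["data"]},
                    timeout=5,
                )
            except requests.RequestException:
                if job["attempt"] < 5:
                    job["attempt"] += 1
                    time.sleep(2 ** job["attempt"])  # crude backoff; blocks this worker only
                    webhook_queue.put(job)
            finally:
                webhook_queue.task_done()

    threading.Thread(target=delivery_worker, daemon=True).start()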

January 1, 2026 · DevCraft Studio · 5044 views

GraphQL N+1 Problem: From 2000 Queries to 3

GraphQL makes it stupidly easy to write queries that murder your database. Our homepage was hitting the database 2,031 times per page load. Yeah. Here’s how we got it down to 3 queries without changing the API.

The query that looked innocent

    query Homepage {
      posts {
        id
        title
        author {
          id
          name
          avatar
        }
        comments {
          id
          text
          author {
            name
          }
        }
      }
    }

Looks fine. Perfectly reasonable GraphQL query. Returns 10 posts with their authors and comments. ...
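The standard cure for this shape of query is batching resolvers (the DataLoader pattern). A library-free Python sketch of the idea, with hypothetical names rather than the article's code: collect the author ids needed for a page, then fetch them in one query instead of one query per post.

    class AuthorLoader:
        def __init__(self, db):
            self.db = db          # assumed to expose fetch_all(sql, params)
            self.cache = {}

        def load_many(self, author_ids):
            missing = [i for i in set(author_ids) if i not in self.cache]
            if missing:
                # One query for every missing author instead of N single-row queries.
                rows = self.db.fetch_all(
                    "SELECT id, name, avatar FROM users WHERE id = ANY(%s)", (missing,)
                )
                for row in rows:
                    self.cache[row["id"]] = row
            return [self.cache.get(i) for i in author_ids]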

December 29, 2025 · DevCraft Studio · 3667 views

Go HTTP Timeouts & Resilience Defaults

Client defaults

- Set Timeout on http.Client; set Transport with DialContext timeout (e.g., 3s), TLSHandshakeTimeout (3s), ResponseHeaderTimeout (5s), IdleConnTimeout (90s), MaxIdleConns/MaxIdleConnsPerHost.
- Retry only idempotent methods with backoff + jitter; cap attempts.
- Use context.WithTimeout per request; cancel on exit.

Server defaults

- ReadHeaderTimeout (e.g., 5s) to mitigate slowloris.
- ReadTimeout/WriteTimeout to bound handler time (align with business SLAs).
- IdleTimeout to recycle idle connections; prefer HTTP/2 when available.

Patterns

- Wrap handlers with middleware for deadline + logging when timeouts hit.
- For upstreams, expose metrics: connect latency, TLS handshake, TTFB, retries.
- Prefer connection re-use; avoid per-request clients.

Checklist

- Timeouts set on both client and server.
- Retries limited to idempotent verbs with jitter.
- Connection pooling tuned; idle conns reused.
- Metrics for latency stages and timeouts.

May 12, 2025 · DevCraft Studio · 4125 views

Benchmarking Go REST APIs with k6

Test design

- Define goals: latency budgets (p95/p99), error ceilings, throughput targets.
- Scenarios: ramping arrival rate, soak tests, spike tests; match production payloads.
- Include auth headers and realistic think time.

k6 script skeleton

    import http from "k6/http";
    import { check, sleep } from "k6";

    export const options = {
      thresholds: { http_req_duration: ["p(95)<300"] },
      scenarios: {
        api: {
          executor: "ramping-arrival-rate",
          startRate: 10,
          timeUnit: "1s",
          preAllocatedVUs: 50,
          maxVUs: 200,
          stages: [
            { target: 100, duration: "3m" },
            { target: 100, duration: "10m" },
            { target: 0, duration: "2m" },
          ],
        },
      },
    };

    export default function () {
      const res = http.get("https://api.example.com/resource");
      check(res, { "status 200": (r) => r.status === 200 });
      sleep(1);
    }

Run & observe

- Capture k6 summary + JSON output; feed to Grafana for trends.
- Correlate with Go metrics (pprof, Prometheus) to find CPU/alloc hot paths.
- Record build SHA; compare runs release over release (see the comparison sketch after this list).

Checklist

- Scenarios match prod traffic shape.
- Thresholds tied to SLOs; tests fail on regressions.
- Service metrics/pprof captured during runs.
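To compare runs release over release, one option is to diff the exported end-of-test summaries. A rough Python sketch; it assumes k6 was run with --summary-export and that the JSON exposes http_req_duration percentiles under metrics (verify the exact field names against your k6 version). The 30 ms regression budget is an arbitrary placeholder.

    import json
    import sys

    def p95(path):
        with open(path) as f:
            summary = json.load(f)
        # Assumed layout: {"metrics": {"http_req_duration": {"p(95)": ...}}}
        return summary["metrics"]["http_req_duration"]["p(95)"]

    baseline, candidate = sys.argv[1], sys.argv[2]
    delta = p95(candidate) - p95(baseline)
    print(f"p95 change: {delta:+.1f} ms")
    sys.exit(1 if delta > 30 else 0)   # fail the comparison on regression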

April 2, 2025 · DevCraft Studio · 4041 views