All posts

Deep, practical engineering stories — browse everything.

Laravel Queues with Horizon: Reliable Setup

Workers & balancing
- Define queue priorities; dedicate workers per queue (emails, webhooks, default).
- Use balance strategies (simple, auto) and cap max processes per supervisor.

Reliability
- Set retry/backoff per job; push non-idempotent tasks carefully.
- Configure timeout and retry_after (keep retry_after > max job time).
- Use Redis with persistence; enable horizon:supervisors monitors.

Observability
- Horizon dashboard: throughput, runtime, failures, retries.
- Alert on rising failures and long runtimes; log payload/context for failed jobs.
- Prune failed jobs with a retention policy; send to a DLQ when needed.

Deployment
- Restart Horizon on deploy to pick up new code; use horizon:terminate.
- Ensure supervisor/systemd restarts Horizon if it dies.

Checklist
- Queues prioritized; supervisors sized.
- Retries/backoff and timeouts set; DLQ plan in place.
- Monitoring/alerts configured; failed job retention in place.

Abstract security illustration

API Security Hardening Checklist

Identity & access
- Enforce strong auth (OIDC/JWT); short-lived tokens + refresh; audience/issuer checks.
- Fine-grained authz (RBAC/ABAC); deny by default; rate-limit per identity.

Input & data
- Validate/normalize input; reject oversized bodies; JSON Schema where possible.
- Output-encode; avoid reflecting raw user data; paginate results.
- Store secrets in a vault/KMS; rotate keys; never log secrets/tokens.

Transport & headers
- TLS everywhere; HSTS; modern ciphers.
- Security headers: Content-Security-Policy, X-Content-Type-Options=nosniff, X-Frame-Options=DENY, Referrer-Policy (a filter sketch follows below).

Abuse protection
- Rate limit + burst control; CAPTCHA/step-up for sensitive actions.
- Bot detection where relevant; geo/IP allow/deny for admin surfaces.

Observability
- Structured audit logs with identity, action, resource, result; avoid PII spill.
- Alerts on auth failures, unusual rate spikes, and 5xx anomalies.

Checklist
- AuthZ enforced; least privilege.
- Input validated; size limits set.
- TLS + security headers applied.
- Rate limits + abuse controls configured.
- Secrets vaulted and rotated.
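To make the security-headers bullet concrete, here is a minimal sketch of a servlet filter that stamps those headers on every response, assuming the Jakarta Servlet API; the class name and header values are illustrative, not taken from the post.

    import jakarta.servlet.Filter;
    import jakarta.servlet.FilterChain;
    import jakarta.servlet.ServletException;
    import jakarta.servlet.ServletRequest;
    import jakarta.servlet.ServletResponse;
    import jakarta.servlet.http.HttpServletResponse;
    import java.io.IOException;

    // Hypothetical filter: adds the headers named in the checklist to every response.
    // The concrete values are illustrative defaults; tune CSP per application.
    public class SecurityHeadersFilter implements Filter {

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletResponse response = (HttpServletResponse) res;
            response.setHeader("Strict-Transport-Security", "max-age=31536000; includeSubDomains");
            response.setHeader("Content-Security-Policy", "default-src 'none'");
            response.setHeader("X-Content-Type-Options", "nosniff");
            response.setHeader("X-Frame-Options", "DENY");
            response.setHeader("Referrer-Policy", "no-referrer");
            chain.doFilter(req, res);
        }
    }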


Spring Boot Observability: Metrics, Traces, Logs

Metrics
- Use Micrometer + Prometheus: management.endpoints.web.exposure.include=prometheus,health,info.
- Add JVM + Tomcat/DB pool meters; set percentiles for latencies (see the timer sketch below).
- Create SLIs: request latency, error rate, saturation (threads/connections), GC pauses.

Traces
- Spring Boot 3 supports OpenTelemetry via Micrometer Tracing: add spring-boot-starter-actuator + micrometer-tracing-bridge-otel + an exporter (OTLP/Zipkin/Jaeger).
- Propagate headers (traceparent); make sure async executors propagate the tracing context (e.g., a context-propagating executor wrapper).
- Sample smartly: lower rates on noisy paths; raise them for errors.

Logs
- Use a JSON layout; include traceId/spanId for correlation.
- Avoid verbose INFO in hot paths; keep payload sizes bounded.

Dashboards & alerts
- Latency/error SLO dashboards per endpoint.
- DB pool saturation, thread pool queue depth, GC pauses, heap used %, 5xx rate.
- Alerts on SLO burn rates; include exemplars linking metrics → traces → logs.

Checklist
- Actuator endpoints secured and exposed only where needed.
- OTLP exporter configured; sampling tuned.
- Trace/log correlation verified in staging.
- Dashboards + alerts reviewed with on-call.
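To make the latency-percentile bullet concrete, here is a minimal Micrometer sketch; the meter name and percentile choices are illustrative assumptions, not values from the post.

    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;
    import java.time.Duration;

    public class CheckoutTimings {

        private final Timer checkoutTimer;

        public CheckoutTimings(MeterRegistry registry) {
            // Hypothetical meter name; percentiles are published alongside the histogram
            // so latency SLIs can be graphed per endpoint.
            this.checkoutTimer = Timer.builder("http.checkout.latency")
                    .publishPercentiles(0.5, 0.95, 0.99)
                    .publishPercentileHistogram()
                    .register(registry);
        }

        public void recordCheckout(Duration elapsed) {
            checkoutTimer.record(elapsed);
        }
    }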

Incident response illustration

DevOps Incident Response Playbook

During incident
- Roles: incident commander, comms lead, ops/feature SMEs, scribe.
- Declare severity quickly; open a shared channel/bridge; timestamp actions.
- Stabilize first: roll back, feature-flag off, scale up, or shed load.

Runbooks & tooling
- Prebuilt runbooks per service: restart/rollback steps, dashboards, logs, feature flags.
- One-click access to dashboards (metrics, traces, logs), recent deploys, and toggles.
- Paging rules with escalation; avoid noisy alerts.

Comms
- Single source of truth: the incident doc; external status page if needed.
- Regular updates with impact, scope, mitigation, ETA.

After incident
- Blameless postmortem: timeline, root causes, contributing factors.
- Action items with owners/deadlines; track to completion.
- Add tests/alerts/runbook updates; reduce time-to-detect and time-to-recover.

PostgreSQL performance tuning illustration

PostgreSQL Performance Tuning: The Power of work_mem

PostgreSQL’s work_mem parameter is one of the most impactful yet misunderstood configuration settings for database performance. This post explores how adjusting work_mem can dramatically improve query execution times, especially for operations involving sorting, hashing, and joins.

Understanding work_mem

work_mem specifies the amount of memory PostgreSQL can use for internal sort operations and hash tables before writing to temporary disk files. The default value is 4MB, which is conservative to ensure PostgreSQL runs on smaller machines. ...
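As a hedged illustration of per-session tuning (not an example from the post itself): raising work_mem only for a heavy sort via plain JDBC, then using EXPLAIN ANALYZE to check whether the sort stayed in memory. The connection details, the events table, and the 64MB value are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class WorkMemDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection details, for illustration only.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/app", "app", "secret");
                 Statement st = conn.createStatement()) {

                // Raise work_mem for this session only, so a large sort can stay in memory
                // instead of spilling to temporary files on disk.
                st.execute("SET work_mem = '64MB'");

                // EXPLAIN ANALYZE reports whether the sort ran in memory
                // ("Sort Method: quicksort") or spilled ("external merge").
                try (ResultSet rs = st.executeQuery(
                        "EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }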


Tuning Kafka Consumers (Java)

Core settings
- max.poll.interval.ms sized to processing time; max.poll.records to batch size.
- fetch.min.bytes / fetch.max.wait.ms to trade latency vs. throughput.
- enable.auto.commit=false; commit sync/async after processing the batch (see the sketch below).

Concurrency
- Prefer multiple consumer instances over a massive max.poll.records.
- For CPU-bound steps, hand off to a bounded executor; avoid blocking the poll thread.

Ordering & retries
- Keep partition affinity when ordering matters; use a DLT for poison messages.
- Back off with jitter on retries; limit attempts per message.

Observability
- Metrics: lag per partition, commit latency, rebalances, processing time, error rates.
- Log offsets and partition for errors; trace batch sizes.

Checklist
- Poll loop never blocks; work delegated to a bounded pool.
- Commits after successful processing; DLT in place.
- Lag and rebalance metrics monitored.
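A minimal sketch of that commit-after-processing loop, assuming the standard Kafka Java client; the broker address, group id, topic, and the handleRecord step are hypothetical.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class OrderEventsConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");    // hypothetical broker
            props.put("group.id", "order-events");               // hypothetical group
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());
            props.put("enable.auto.commit", "false");  // commit manually after processing
            props.put("max.poll.records", "200");      // bound batch size to the processing budget

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));  // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        handleRecord(record);  // placeholder for real work / bounded executor hand-off
                    }
                    // Acknowledge offsets only once the whole batch has been processed.
                    consumer.commitSync();
                }
            }
        }

        private static void handleRecord(ConsumerRecord<String, String> record) {
            System.out.printf("partition=%d offset=%d%n", record.partition(), record.offset());
        }
    }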


REST API Design Best Practices: Building Production-Ready APIs

Designing a REST API that is intuitive, maintainable, and scalable requires following established best practices. Here’s a comprehensive guide.

1. Use Nouns, Not Verbs

Good
    GET /users
    GET /users/123
    POST /users
    PUT /users/123
    DELETE /users/123

Bad
    GET /getUsers
    GET /getUserById
    POST /createUser
    PUT /updateUser
    DELETE /deleteUser

2. Use Plural Nouns

Good
    GET /users
    GET /orders
    GET /products

Bad
    GET /user
    GET /order
    GET /product

3. Use HTTP Methods Correctly

    // GET - Retrieve resources
    GET /users          // Get all users
    GET /users/123      // Get specific user

    // POST - Create new resources
    POST /users         // Create new user

    // PUT - Update entire resource
    PUT /users/123      // Replace user

    // PATCH - Partial update
    PATCH /users/123    // Update specific fields

    // DELETE - Remove resources
    DELETE /users/123   // Delete user

4. Use Proper HTTP Status Codes

    // Success
    200 OK              // Successful GET, PUT, PATCH
    201 Created         // Successful POST
    204 No Content      // Successful DELETE

    // Client Errors
    400 Bad Request     // Invalid request
    401 Unauthorized    // Authentication required
    403 Forbidden       // Insufficient permissions
    404 Not Found       // Resource doesn't exist
    409 Conflict        // Resource conflict

    // Server Errors
    500 Internal Server Error
    503 Service Unavailable

5. Consistent Response Format

    {
      "data": {
        "id": 123,
        "name": "John Doe",
        "email": "[email protected]"
      },
      "meta": {
        "timestamp": "2024-11-15T10:00:00Z"
      }
    }

Error Response Format

    {
      "error": {
        "code": "VALIDATION_ERROR",
        "message": "Invalid input data",
        "details": [
          { "field": "email", "message": "Invalid email format" }
        ]
      }
    }

6. Versioning

URL Versioning
    /api/v1/users
    /api/v2/users

Header Versioning
    Accept: application/vnd.api+json;version=1

7. Filtering, Sorting, Pagination

    GET /users?page=1&limit=20
    GET /users?sort=name&order=asc
    GET /users?status=active&role=admin
    GET /users?search=john

8. Nested Resources

    GET /users/123/posts
    POST /users/123/posts
    GET /users/123/posts/456
    PUT /users/123/posts/456
    DELETE /users/123/posts/456

9. Use HTTPS

Always use HTTPS in production to encrypt data in transit. ...


Go Profiling in Production with pprof

Capturing safely
- Expose /debug/pprof behind auth/VPN; or run curl -sK -H "Authorization: Bearer ...".
- CPU profile: go tool pprof http://host/debug/pprof/profile?seconds=30.
- Heap profile: .../heap; goroutines: .../goroutine?debug=2.
- For containers: kubectl port-forward, then grab profiles; avoid prod CPU throttling while profiling.

Reading CPU profiles
- Look at flat vs. cumulative time; identify hot functions.
- Flamegraph: go tool pprof -http=:8081 cpu.pprof.
- Check GC activity and syscalls; watch for mutex contention.

Reading heap profiles
- Compare live allocations vs. in-use objects; watch large []byte and map growth.
- Look for leaks via a rising heap over time; diff profiles between runs.

Goroutine dumps
- Spot leaked goroutines (blocked on channel/lock/I/O).
- Common culprits: missing cancel, unbounded worker creation, stuck time.After.

Best practices
- Add pprof only when needed in prod; default on in staging.
- Sample under load close to real traffic.
- Keep artifacts: store profiles with build SHA + timestamp; compare after releases.
- Combine with metrics (alloc rate, GC pauses, goroutines) to validate fixes.

Microservices communication patterns illustration

Microservices Communication Patterns: Synchronous vs Asynchronous

Microservices communicate through various patterns, each with trade-offs. Understanding when to use synchronous vs asynchronous communication is crucial for building scalable, resilient systems.

Communication patterns overview

Microservices can communicate through:
- Synchronous: direct request-response (REST, gRPC)
- Asynchronous: message-based (queues, events, pub/sub)
- Hybrid: a combination of both approaches

Synchronous communication

REST API

The most common pattern for service-to-service communication.

Advantages:
- Simple and familiar
- Language agnostic
- Easy to debug
- Good tooling support

Disadvantages:
- Tight coupling
- Cascading failures
- Latency accumulation
- No guaranteed delivery

Example: ...
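The post's own example is truncated in this preview; as a stand-in, here is a minimal sketch of the synchronous pattern using the JDK's built-in HTTP client with explicit timeouts. The service URL and timeout values are assumptions, and each chained hop like this adds latency and a failure mode to the calling request.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class SyncOrderLookup {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(2))  // fail fast instead of hanging
                    .build();

            // Hypothetical downstream service; the caller blocks until it answers.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://orders.internal/api/orders/123"))
                    .timeout(Duration.ofSeconds(3))  // bound the whole call
                    .GET()
                    .build();

            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }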


Circuit Breakers with Resilience4j

Core settings
- Sliding window (count/time), failure rate threshold, slow-call threshold, minimum calls (see the configuration sketch below).
- Wait duration in open state; half-open permitted calls; automatic transition.

Patterns
- Wrap HTTP/DB/queue clients; combine with timeouts/retries/bulkheads.
- Tune per dependency; differentiate fast-fail vs. tolerant paths.
- Provide a fallback only when safe/idempotent.

Observability
- Export metrics: state changes, calls/success/failure/slow, not-permitted count.
- Log state transitions; add exemplars linking to traces.
- Alert on frequent open/half-open oscillation.

Checklist
- Per-downstream breaker with tailored thresholds.
- Timeouts and retries composed correctly (timeout → breaker → retry).
- Metrics/logs/traces wired; alerts on open rate.
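A minimal configuration sketch covering the core settings above, assuming the resilience4j-circuitbreaker library; the breaker name, threshold values, and the callInventory supplier are illustrative, not taken from the post.

    import io.github.resilience4j.circuitbreaker.CircuitBreaker;
    import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
    import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
    import java.time.Duration;
    import java.util.function.Supplier;

    public class InventoryClientBreaker {

        public static void main(String[] args) {
            CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                    .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
                    .slidingWindowSize(50)                        // judge on the last 50 calls
                    .minimumNumberOfCalls(20)                     // don't trip on a tiny sample
                    .failureRateThreshold(50.0f)                  // open at 50% failures
                    .slowCallDurationThreshold(Duration.ofSeconds(2))
                    .slowCallRateThreshold(80.0f)                 // treat slow calls as trouble too
                    .waitDurationInOpenState(Duration.ofSeconds(30))
                    .permittedNumberOfCallsInHalfOpenState(5)
                    .automaticTransitionFromOpenToHalfOpenEnabled(true)
                    .build();

            CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
            CircuitBreaker breaker = registry.circuitBreaker("inventory");  // hypothetical downstream

            // Decorate the remote call; while the breaker is open, calls fail fast
            // with CallNotPermittedException instead of hitting the dependency.
            Supplier<String> guarded =
                    CircuitBreaker.decorateSupplier(breaker, InventoryClientBreaker::callInventory);
            System.out.println(guarded.get());
        }

        private static String callInventory() {
            return "stock=42";  // placeholder for the real HTTP/DB call
        }
    }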
