How would you scale a Django app for high traffic reliably?
Django Developer
Answer
A solid Django scaling plan layers horizontal scaling, async work, and smart delivery. I run stateless web pods behind a load balancer with sessions in Redis (no sticky sessions) and autoscaling. Heavy or slow tasks move to Celery with priority queues and retries; websockets and long I/O go through ASGI. For data, read replicas and sharding protect the primary. A CDN offloads assets, caching (per-view plus explicit keys) trims DB load, and observability guards p95 latency.
Long Answer
Scaling a Django app for high traffic is a playbook, not a switch. Eliminate single points of failure, push non-interactive work off the request path, and make each layer elastic and observable.
1) Stateless web + horizontal scaling
Run multiple gunicorn/uvicorn pods behind a layer-7 load balancer. Keep state out of pods: sessions in Redis, files in S3/GCS, config in a secret store. Disable sticky sessions; autoscale on CPU/latency with health-checked rolling deploys. Terminate TLS at the edge; tune workers to cores.
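A minimal settings sketch for the Redis-session piece, assuming django-redis is installed and a REDIS_URL is supplied by the environment (both are assumptions for illustration):

```python
# settings.py (sketch) — sessions live in Redis so pods stay interchangeable.
# Assumes django-redis and an env-provided REDIS_URL.
import os

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": os.environ.get("REDIS_URL", "redis://redis:6379/1"),
    }
}

SESSION_ENGINE = "django.contrib.sessions.backends.cache"  # no DB hit per request
SESSION_CACHE_ALIAS = "default"
```

With sessions in Redis and files in object storage, any pod can serve any request, so stickiness stays off and autoscaling simply adds or removes pods.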
2) Async with Celery and ASGI
Long or failure-prone work (emails, media processing, webhooks, ETL) goes to Celery with Redis/RabbitMQ. Separate queues by priority, rate-limit external APIs, and make tasks idempotent with dedupe keys and backoff. Keep the request path skinny: enqueue and return.
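A sketch of an idempotent, retry-safe task, assuming Redis backs the Django cache used for the dedupe key; the queue name and key scheme are illustrative, not prescribed:

```python
# tasks.py (sketch) — retries with backoff plus an idempotency guard.
from celery import shared_task
from django.core.cache import cache
from django.core.mail import send_mail


@shared_task(
    bind=True,
    queue="low",                  # urgent work goes to a separate "high" queue
    autoretry_for=(Exception,),   # narrow this to transient errors in real code
    retry_backoff=True,           # exponential backoff between retries
    retry_kwargs={"max_retries": 5},
    acks_late=True,               # requeue if the worker dies mid-task
)
def send_receipt(self, order_id: int, email: str) -> None:
    # Idempotency guard: cache.add is atomic on Redis, so retries and duplicate
    # enqueues send at most one receipt per order per day.
    if not cache.add(f"receipt-sent:{order_id}", 1, timeout=86400):
        return
    send_mail(
        subject=f"Receipt for order {order_id}",
        message="Thanks for your purchase!",
        from_email="shop@example.com",
        recipient_list=[email],
    )
```

The view only enqueues (`send_receipt.delay(order.id, customer_email)`) and returns immediately.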
3) Database: indexing, replicas + sharding
Start with indexing, keyset pagination, and query budgets to stop N+1. Add read replicas and a router that ships safe SELECTs off the primary; monitor lag and fall back when stale. When one database runs hot, shard by tenant or hashed id. Keep migrations additive and avoid cross-shard joins on the hot path.
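A minimal router sketch for the replica split, assuming DATABASES defines a "default" (primary) and a "replica" alias:

```python
# routers.py (sketch) — safe SELECTs go to the replica; writes and migrations
# stay on the primary. Alias names are illustrative.
class PrimaryReplicaRouter:
    def db_for_read(self, model, **hints):
        return "replica"

    def db_for_write(self, model, **hints):
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases point at the same data, so relations are always allowed.
        return True

    def allow_migrate(self, db, app_label, **hints):
        return db == "default"  # run migrations only against the primary
```

Register it via DATABASE_ROUTERS in settings, and pin read-after-write paths to the primary with `.using("default")` wherever replica lag would surface stale data.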
4) Caching and CDN
Per-view caching for anonymous traffic, fragment caches for dynamic pages, and explicit key caches for hot queries. Version keys per tenant and invalidate on write (signals or write-through) with stampede control locks. Set ETag/Cache-Control and let a CDN serve public pages and assets worldwide.
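For the explicit key caches, a sketch combining key versioning with a single-flight rebuild; note that `cache.lock()` is a django-redis extra rather than core Django, and `build_product_snapshot` is a hypothetical stand-in for the expensive query:

```python
# caching.py (sketch) — versioned read-through cache with stampede control.
from django.core.cache import cache

PRODUCT_CACHE_VERSION = 3  # bump to invalidate every product snapshot at once


def product_snapshot(product_id: int) -> dict:
    key = f"product:{product_id}"
    data = cache.get(key, version=PRODUCT_CACHE_VERSION)
    if data is not None:
        return data

    # Stampede control: only one caller rebuilds; the rest wait briefly and
    # re-check the cache instead of all hitting the database at once.
    with cache.lock(f"lock:{key}", timeout=10):  # django-redis lock
        data = cache.get(key, version=PRODUCT_CACHE_VERSION)
        if data is None:
            data = build_product_snapshot(product_id)  # hypothetical expensive query
            cache.set(key, data, timeout=300, version=PRODUCT_CACHE_VERSION)
    return data
```

On write, either bump the version or delete the specific key from a post_save signal so readers never see a stale snapshot for long.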
5) Files and the edge
Move media to object storage with signed URLs; generate thumbnails in background tasks. Use a CDN for images, fonts, and scripts; enable brotli and HTTP/2/3.
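A storages sketch for signed-URL media, assuming Django 4.2+ (for the STORAGES setting) and django-storages with the boto3 backend; the bucket name and expiry are illustrative:

```python
# settings.py (sketch) — private media on object storage, served via signed URLs.
# Credentials, region, and endpoint come from the environment.
STORAGES = {
    "default": {
        "BACKEND": "storages.backends.s3.S3Storage",
        "OPTIONS": {
            "bucket_name": "myapp-media",   # illustrative bucket name
            "querystring_auth": True,       # .url returns a time-limited signed URL
            "querystring_expire": 3600,     # signed URLs valid for one hour
        },
    },
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.ManifestStaticFilesStorage",
    },
}
```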
6) Concurrency and back pressure
Fit gunicorn workers to CPUs; prefer async workers under ASGI for I/O-heavy endpoints. Bound queues and set timeouts; add circuit breakers for flaky dependencies. Protect Postgres with pgbouncer; keep transactions short. Back-pressure slow clients with streaming or 429s.
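A gunicorn config sketch with the knobs above; the numbers are starting points to tune under load, not prescriptions:

```python
# gunicorn.conf.py (sketch)
import multiprocessing

cores = multiprocessing.cpu_count()

workers = cores * 2 + 1    # classic sizing for sync WSGI workers
# For I/O-heavy ASGI endpoints, switch to async workers and fewer processes:
# worker_class = "uvicorn.workers.UvicornWorker"
timeout = 30               # kill requests stuck past 30s so they don't pile up
graceful_timeout = 20      # give in-flight requests a window during restarts
keepalive = 5
max_requests = 1000        # recycle workers periodically to contain slow leaks
max_requests_jitter = 100  # stagger recycling so workers don't restart together
```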
7) Observability and SLOs
Track latency by endpoint, queue depth, DB load, replica lag, cache hit ratio, and CDN offload. Trace edge→app→tasks→DB. Define SLOs for latency/error rate and tie autoscaling and alerts to them. Run chaos-lite drills to expose hidden coupling.
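As a starting point for per-endpoint latency, a minimal middleware sketch; it only logs here, but the same hook is where you would emit to StatsD, Prometheus, or OpenTelemetry:

```python
# middleware.py (sketch) — wall-clock latency per matched route.
import logging
import time

logger = logging.getLogger("request.latency")


class LatencyMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        start = time.monotonic()
        response = self.get_response(request)
        elapsed_ms = (time.monotonic() - start) * 1000
        route = request.resolver_match.route if request.resolver_match else "unmatched"
        logger.info("route=%s status=%s latency_ms=%.1f",
                    route, response.status_code, elapsed_ms)
        return response
```

Place it near the top of MIDDLEWARE so the timer wraps the rest of the stack.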
8) Safe deploys
Blue/green or canary deploys localize risk. Feature flags ramp costly paths gradually. Ship additive migrations; guard destructive changes behind kill-switches. Enforce budgets in CI: max queries per view, task runtime ceilings.
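One way to enforce the query budget in CI is Django's assertNumQueries, which pins the exact count so a new N+1 fails the build; the URL name here is hypothetical:

```python
# tests/test_budgets.py (sketch) — per-view query budget as a CI guardrail.
from django.test import TestCase
from django.urls import reverse


class QueryBudgetTests(TestCase):
    def test_product_list_stays_within_budget(self):
        # assertNumQueries checks the exact count, so any regression
        # (e.g. a new N+1) shows up as a failing test.
        with self.assertNumQueries(8):
            response = self.client.get(reverse("product-list"))
        self.assertEqual(response.status_code, 200)
```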
9) Cost
Expire caches, cap queue concurrency, delete stale blobs. Measure ‘queries avoided’ and ‘bytes offloaded to CDN’ as wins. A tight Django scaling loop is: profile → shed work → push to Celery or edge → shard/cache → measure → repeat.
Together these moves turn a monolith into a resilient service: horizontally scalable, async where it counts, data-tier aware, and aggressively cached—observable and safe to change under load.
Common Mistakes
- Treating scaling as “add more servers” while keeping sticky sessions and local file writes; pods can’t scale if state is glued to them.
- Leaving background work on request threads, so spikes stall pages.
- Overusing caches without versioning or stampede control, causing stale data or thundering herds.
- Relying on read replicas but ignoring lag, serving stale rows to users.
- Using offset pagination for infinite scroll, hammering the database at deep pages.
- Running Celery with a single queue, so low-value jobs starve the urgent ones.
- Piling connections into Postgres without pgbouncer, then timing out under load.
- Skipping SLOs and tracing, so autoscaling reacts to CPU graphs instead of latency.
- Attempting cross-shard joins on the hot path.
- Deploying without canaries, rolling a bad migration everywhere at once.
- Treating the CDN as a magic shield; without proper headers and purges, the edge just hides problems.
Sample Answers (Junior / Mid / Senior)
Junior:
I'd run multiple app instances behind a load balancer and keep the Django app stateless by moving sessions to Redis and media to S3. I'd add per-view cache for anonymous pages and push emails to Celery so requests return fast. A CDN serves static files globally.
Mid:
My plan is layered: autoscaled pods, Redis/Memcached caching (fragment + keys), and Celery with separate high/low priority queues and retries. I'd add read replicas with a router for SELECTs, guard with pgbouncer, and switch to keyset pagination. Assets live on a CDN; I track p95 latency and queue depth to tune capacity.
Senior:
I design for failure first: blue/green or canary deploys, feature flags, and SLO-driven autoscaling. Database hot spots get sharded by tenant; cross-shard joins are pushed to analytics. I add stampede control, cache versioning, circuit breakers, and edge caching of HTML where policy allows. Tracing ties edge→app→tasks→DB so we see bottlenecks before users do.
Evaluation Criteria
Strong answers show a layered Django scaling plan tied to user outcomes. Look for: stateless web tier behind a load balancer; Redis sessions; autoscaling without stickiness; async offload via Celery with priorities, idempotency, and backoff; read replicas with routing and lag awareness; a path to database sharding (by tenant/hash) with no hot-path cross-shard joins. Candidates explain caching (per-view, fragments, explicit keys, stampede control) and CDN strategy with correct headers/purges. They quantify ops with SLOs, traces, queue depth, replica lag, and cache hit ratio; deploy safely with blue/green or canaries; guard DB with pgbouncer and short transactions. Senior signals: cost thinking (edge offload %, queries avoided), CI guardrails (query budgets, task runtime caps), and rollback plans. Bonus: seek vs offset pagination tradeoffs, circuit breakers/timeouts for dependencies, and cache key versioning per tenant to prevent stale collisions. Weak answers hand-wave 'add servers' or 'use a CDN' without invalidation, ignore observability, or overlook data-tier limits.
Preparation Tips
Build a mini stack that mirrors prod: Nginx/ALB → gunicorn/uvicorn → Redis → Postgres → Celery workers → CDN (dev mode). Create a slow endpoint; baseline p95 and queries. Fix N+1 queries, add keyset pagination, verify with EXPLAIN. Add per-view cache and a fragment cache; design explicit key caches with versioning and stampede control. Move emails/image processing into Celery with two queues and idempotent tasks; simulate failures and verify retries/backoff. Add a read replica (physical or logical) and route SELECTs; test lag handling. Introduce pgbouncer and confirm connection stability under load. Script k6/Locust spikes; watch SLO dashboards (latency, errors, queue depth, cache hit ratio). Add canary deploys in your demo with weighted traffic; ship a backward-compatible migration and rollback. Configure CDN rules for assets and a simple HTML route; validate purge via webhook. Record before/after (p95, RPS, DB CPU, replica lag, cache hit %, CDN offload %), and keep a runbook noting limits so you can defend tradeoffs.
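For the spike-test step, a minimal Locust sketch; the endpoint paths are placeholders for your demo app:

```python
# locustfile.py (sketch) — run with `locust -f locustfile.py --host http://localhost:8000`
from locust import HttpUser, task, between


class Shopper(HttpUser):
    wait_time = between(0.5, 2)  # think time between requests per simulated user

    @task(3)
    def browse_catalog(self):
        self.client.get("/api/products/?after=0")

    @task(1)
    def view_product(self):
        self.client.get("/api/products/42/")
```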
Real-world Context
A marketplace hit 6× traffic during a TV promo. The old monolith ran hot despite big instances. We replatformed the Django app to stateless pods behind an ALB, moved sessions to Redis, and pushed emails/webhooks to Celery with high/low queues. Per-view cache plus key caches for product queries dropped DB reads 65%. Replica routing handled catalog reads while the primary focused on writes. A CDN served images and precompressed assets; edge cache absorbed 70% of requests. Pgbouncer stabilized connections; keyset pagination removed deep OFFSET scans on search. During a later launch, canary deploys caught a slow migration in 5% traffic and auto-rolled back. Observability paid off: we watched p95 latency, queue depth, replica lag, and cache hit ratio in one dashboard. Outcome: p95 from 1.8s to 280ms at 4× baseline RPS, cost flat due to offload and caching. Later we introduced tenant sharding for the largest merchants; joins moved to analytics, and hot paths stayed simple and fast.
Key Takeaways
- Make the web tier stateless; scale horizontally behind a load balancer.
- Offload long work to Celery; keep requests skinny and tasks idempotent.
- Use replicas first, then database sharding when a single node runs hot.
- Cache at multiple layers and push assets/pages to a CDN.
- Observe everything and deploy safely with canaries and SLOs.
Practice Exercise
Scenario: You must scale a checkout API from 300 to 2,000 RPS without raising p95 above 300 ms.
Tasks:
- Baseline: capture p95 latency, DB CPU, query count per request, and cache hit ratio under k6. Identify top two slow endpoints.
- Web tier: run 4 stateless pods behind a load balancer; disable stickiness; tune gunicorn workers to cores. Add autoscaling on latency.
- Async: move email/receipt webhooks and third-party calls to Celery with high/low queues, idempotent tasks, and exponential backoff.
- Data: add indexes for the hottest predicates, switch slow lists to keyset pagination (see the sketch after this task list), enable pgbouncer, and add a read replica for catalog reads.
- Caching: add per-view cache for anonymous pages; implement explicit key caches for product snapshots with versioning and single-flight locks.
- Edge: route static/media to a CDN; set Cache-Control/ETag; validate purge via webhook. Precompress with brotli.
- Safety: canary deploy the change set at 5% traffic; ship backward-compatible migrations with kill-switch.
- Observability: add traces edge→app→tasks→DB; alert on queue depth, replica lag, and error budget burn.
- Rerun load: aim for p95 ≤ 300 ms at 2,000 RPS; document queries avoided, cache offload %, and cost deltas.
- Post-mortem drill: kill a worker and a replica; verify graceful degradation (429s, retries) and recovery time. Document lessons learned and the next step (tenant database sharding for the top 1% merchants).
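
A sketch of the keyset pagination mentioned in the data task; the Order model, app name, and fields are hypothetical:

```python
# views.py (sketch) — keyset ("seek") pagination for a hot list endpoint.
from django.http import JsonResponse

from myapp.models import Order  # hypothetical app and model


def order_feed(request):
    page_size = 50
    qs = Order.objects.order_by("-id")

    # The client passes the last id it saw; we seek past it instead of using
    # OFFSET, so deep pages cost the same as page one.
    after = request.GET.get("after")
    if after:
        qs = qs.filter(id__lt=int(after))

    rows = list(qs.values("id", "total", "created_at")[:page_size])
    next_cursor = rows[-1]["id"] if len(rows) == page_size else None
    return JsonResponse({"results": rows, "next": next_cursor})
```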

