How to architect an enterprise-scale web app for multi-tenancy and HA?
Enterprise Web Developer
Answer
A robust enterprise-scale web application uses a clear multi-tenancy architecture with tenant isolation, global high availability, and cohesive domains. Run stateless services behind a global load balancer across regions; scope per-tenant data in schemas/partitions and enforce access via claims/RBAC. Add queues, caching, idempotent writes, and the outbox pattern; version contracts; automate DR with tested RTO/RPO. Track SLOs per tenant and shard hot tenants to keep latency and cost predictable.
Long Answer
A production-grade enterprise-scale web application must balance multi-tenancy architecture, complex business logic, and global high availability without becoming a distributed hairball. This blueprint layers tenancy, app shape, data, and ops.
1) Tenancy model and isolation
Choose pooled (shared DB keyed by tenantId), siloed (DB per tenant), or hybrid. Make tenant a first-class key across APIs, events, logs, and caches. Enforce isolation in middleware, data (schemas/partitions/RLS), and edge (auth claims + RBAC/ABAC).
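A minimal sketch of edge enforcement, assuming OIDC-style claims; the names (`TenantContext`, `resolveTenant`, `scopedCacheKey`) are illustrative, not from any specific framework. The point is that tenant context is resolved and validated once, and every downstream call (repositories, caches, logs) takes it as an explicit parameter so it cannot be forgotten.

```typescript
// Hypothetical claims shape from a decoded, verified token.
interface Claims {
  sub: string;
  tenantId?: string;
  roles: string[];
}

interface TenantContext {
  tenantId: string;
  userId: string;
  roles: string[];
}

class TenantError extends Error {}

// Reject early if the token carries no tenant claim; everything after this
// point can assume a validated tenant.
function resolveTenant(claims: Claims): TenantContext {
  if (!claims.tenantId) {
    throw new TenantError("missing tenant claim");
  }
  return { tenantId: claims.tenantId, userId: claims.sub, roles: claims.roles };
}

// Cache keys (and log fields, partition keys, etc.) derive from the context,
// so tenant scoping is structural rather than a per-call-site convention.
function scopedCacheKey(ctx: TenantContext, resource: string): string {
  return `${ctx.tenantId}:${resource}`;
}
```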
2) Domain architecture
Start as a modular monolith split by domains (Accounts, Billing, Orders). Layers: controllers → services → repositories; domain events handle cross-module reactions. Extract hot domains via an API gateway only when scale or ownership demands it.
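The cross-module reactions above can be sketched with a tiny in-process event dispatcher (names assumed, synchronous dispatch for brevity): Orders emits an event, Billing subscribes, and neither module imports the other.

```typescript
type Handler = (event: unknown) => void;

// In-process domain events for a modular monolith; replace with a broker
// only after a domain is extracted to its own service.
class DomainEvents {
  private handlers = new Map<string, Handler[]>();

  on(name: string, handler: Handler): void {
    const list = this.handlers.get(name) ?? [];
    list.push(handler);
    this.handlers.set(name, list);
  }

  // Cross-module reactions subscribe here instead of calling each other directly.
  emit(name: string, event: unknown): void {
    for (const h of this.handlers.get(name) ?? []) h(event);
  }
}
```

Because modules only share event names and payload shapes, the later extraction of a hot domain swaps this dispatcher for a queue without rewriting callers.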
3) Contracts and reliability
Treat OpenAPI/GraphQL as truth; generate clients/servers. Version APIs/events. Use idempotency keys on writes and an outbox to publish reliable events.
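A hedged sketch of the idempotency-key-plus-outbox combination, using in-memory maps as stand-ins; in production the result record and the outbox row would be two tables committed in one database transaction, with a separate relay publishing unpublished rows.

```typescript
interface OutboxRecord {
  topic: string;
  payload: unknown;
  published: boolean;
}

// Illustrative service: a retried write with the same idempotency key returns
// the original result, and the domain event is staged atomically with the write.
class OrderService {
  private results = new Map<string, string>(); // idempotency key -> orderId
  readonly outbox: OutboxRecord[] = [];
  private nextId = 1;

  placeOrder(idempotencyKey: string, item: string): string {
    const existing = this.results.get(idempotencyKey);
    if (existing) return existing; // replayed request: no duplicate order

    const orderId = `order-${this.nextId++}`;
    // Write and outbox entry happen together, so the event is never lost
    // even if the broker is down at commit time.
    this.results.set(idempotencyKey, orderId);
    this.outbox.push({
      topic: "order.placed",
      payload: { orderId, item },
      published: false,
    });
    return orderId;
  }
}
```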
4) Data layout and consistency
Pick stores per workload. For global strong consistency use Spanner-class systems; else a regional primary with read replicas. Scope tenant data via schemas (SQL) or partition keys (NoSQL). Denormalize read models to hit 1–2 queries per view. Stream operational data to search/analytics; apply per-tenant quotas.
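The "1–2 queries per view" target can be illustrated with a denormalized projection keyed by tenant (event and view names are assumptions): events are folded into a precomputed summary so the dashboard read is a single lookup rather than a multi-join query.

```typescript
interface OrderPlaced {
  tenantId: string;
  amountCents: number;
}

interface TenantOrderSummary {
  orderCount: number;
  totalCents: number;
}

// Per-tenant read model; the tenant id doubles as the partition key,
// mirroring the schemas-or-partition-keys scoping described above.
class OrderSummaryProjection {
  private views = new Map<string, TenantOrderSummary>();

  // Fold each event into the view as it streams in.
  apply(e: OrderPlaced): void {
    const v = this.views.get(e.tenantId) ?? { orderCount: 0, totalCents: 0 };
    v.orderCount += 1;
    v.totalCents += e.amountCents;
    this.views.set(e.tenantId, v);
  }

  // One query per view, as the blueprint targets.
  get(tenantId: string): TenantOrderSummary | undefined {
    return this.views.get(tenantId);
  }
}
```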
5) Global high availability
Front with a global anycast load balancer; route by latency to ≥2 regions. Keep services stateless (tokens/Redis). Ship identical builds; use health checks/circuit breakers, cache at CDN + regional layers, and design DB failover (multi-region replicas or promoted secondaries). Drill RTO/RPO.
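The circuit-breaker piece can be sketched as follows (thresholds and cooldown are made-up numbers, and the clock is injectable for testing): after a run of consecutive failures the breaker opens and fails fast instead of piling load onto a struggling dependency.

```typescript
// Compact circuit breaker: open after maxFailures consecutive errors,
// fail fast until cooldownMs has elapsed, close again on any success.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 3,
    private readonly cooldownMs = 5_000,
    private readonly now: () => number = Date.now,
  ) {}

  call<T>(fn: () => T): T {
    if (
      this.failures >= this.maxFailures &&
      this.now() - this.openedAt < this.cooldownMs
    ) {
      throw new Error("circuit open: failing fast");
    }
    try {
      const result = fn();
      this.failures = 0; // any success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = this.now();
      throw err;
    }
  }
}
```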
6) Async workflows
Offload long work to queues/schedulers. Orchestrate with Temporal/Step Functions; keep steps idempotent and time-bounded. Persist saga state per tenant.
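A minimal sketch of the idempotent-step requirement, assuming a per-tenant state store (`SagaState` and the key layout are illustrative, not the Temporal or Step Functions API): each step records completion keyed by tenant, workflow, and step name, so a crashed-and-resumed run skips work it already finished.

```typescript
// Per-tenant saga state; in production this set would be a durable table
// written in the same transaction as the step's side effect.
class SagaState {
  private done = new Set<string>();

  // Returns true if the action ran, false if it was skipped on replay.
  runStep(
    tenantId: string,
    workflowId: string,
    step: string,
    action: () => void,
  ): boolean {
    const key = `${tenantId}:${workflowId}:${step}`;
    if (this.done.has(key)) return false; // already executed; safe to replay
    action();
    this.done.add(key);
    return true;
  }
}
```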
7) Security and compliance
Zero-trust: mTLS/service identity, short-lived tokens, least privilege. Encrypt in transit/at rest; segregate keys for premium tenants as needed. Centralize policy (OPA/ABAC). Capture audit trails and provide tenant exports.
8) Observability and SLOs
Emit structured logs with tenantId, region, correlation IDs. Trace end-to-end (OpenTelemetry). Define SLOs per tier; alert on error-budget burn. Operate a dashboard for hot tenants, throttling, queue depth, replica lag.
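The structured-log requirement amounts to one JSON object per line with the tenant, region, and correlation fields always present; a tiny sketch (field names are an assumption, not a standard schema):

```typescript
interface LogFields {
  level: "info" | "warn" | "error";
  msg: string;
  tenantId: string;
  region: string;
  correlationId: string;
}

// One JSON object per line keeps logs machine-parseable, so per-tenant
// filtering and cross-service correlation are simple queries.
function logLine(fields: LogFields): string {
  return JSON.stringify({ ts: new Date().toISOString(), ...fields });
}
```

Because the type makes `tenantId` and `correlationId` mandatory, a call site cannot compile without them, which is the "enforce it once" idea applied to telemetry.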
9) CI/CD and change safety
Build once, deploy many with canary/blue-green per region and auto-rollback on SLO regression. Run online DB migrations (expand-migrate-contract) behind flags. Keep IaC and secrets in a vault.
10) Cost and capacity
Expose tenant limits and bursts. Scale on RPS/queue depth and cache hot paths. Map pricing tiers to technical limits; track cost per tenant.
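Tenant limits and bursts can be sketched as a per-tenant token bucket (capacity and refill rate are illustrative numbers a pricing tier would supply; the clock is injectable for testing):

```typescript
interface Bucket {
  tokens: number;
  lastRefill: number;
}

// Per-tenant token bucket: capacity caps the burst, refillPerSec sets the
// sustained rate, and each tenant gets an independent bucket so a noisy
// neighbor cannot exhaust another tenant's quota.
class TenantRateLimiter {
  private buckets = new Map<string, Bucket>();

  constructor(
    private readonly capacity: number,
    private readonly refillPerSec: number,
    private readonly now: () => number = Date.now,
  ) {}

  allow(tenantId: string): boolean {
    const t = this.now();
    const b =
      this.buckets.get(tenantId) ?? { tokens: this.capacity, lastRefill: t };
    // Refill proportionally to elapsed time, capped at burst capacity.
    b.tokens = Math.min(
      this.capacity,
      b.tokens + ((t - b.lastRefill) / 1000) * this.refillPerSec,
    );
    b.lastRefill = t;
    this.buckets.set(tenantId, b);
    if (b.tokens < 1) return false; // throttle: quota exhausted
    b.tokens -= 1;
    return true;
  }
}
```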
Result: an enterprise-scale web application with clean multi-tenancy architecture, preserved complex business logic, and global high availability through stateless compute, resilient data, and disciplined ops.
Common Mistakes
- Building microservices first and recreating a distributed monolith; a modular monolith would have shipped faster with fewer faults.
- Skipping tenant context at the edge, then trying to retrofit checks deep in code.
- Using a single shared DB with no partitions or RLS: one bug leaks data across tenants.
- Letting complex business logic live in controllers; no services, no domain events.
- No idempotency or outbox, so retries duplicate orders.
- Global sessions in memory; region failover logs everyone out.
- Reads routed cross-ocean to a single primary, blowing p95.
- Unbounded queues and cron jobs without timeouts.
- Contracts drift from reality: no versioning or codegen.
- CDN disabled for dynamic-but-cacheable GETs.
- No SLOs or burn-rate alerts; teams fly blind.
- One-shot deploys to all regions with no canary, and migrations that delete columns before readers switch.
- Zero cost visibility per tenant; noisy neighbors dominate the bill and throttle everyone else.
Sample Answers
Junior:
I’d keep a modular monolith, split by domains, and include tenantId in every request. Data sits in one DB with schemas per tenant. Two regions behind a global load balancer; sessions via tokens. I’d cache GETs and write basic SLOs.
Mid:
My multi-tenancy architecture uses pooled storage with partitions and RLS. Routes are thin; services hold complex business logic. Contracts are OpenAPI with versioning; writes use idempotency keys and outbox to a queue. Two active regions with CDN/Redis, and read replicas to cut latency. SLO dashboards show per-tenant p95 and errors.
Senior:
Tenancy is hybrid: pooled by default, siloed for regulated tiers. Domains start in a modular monolith and graduate to services via an API gateway when needed. Per-tenant quotas protect capacity. Global HA uses anycast LB, three regions, staged canary, and online migrations. Observability is OTEL with correlation + cost per tenant. Security follows zero-trust with OPA policy, short-lived creds, and audited admin actions.
Evaluation Criteria
Assess whether the candidate can tie multi-tenancy architecture, complex business logic, and global high availability into one coherent plan.
- Tenancy: Clear stance (pooled/siloed/hybrid) and how tenant context flows through auth, services, data, logs, and caches. Isolation proven via RLS/partitions and centralized checks.
- Architecture: Modular monolith first; clean layering; domain events; a credible path to services without chatty calls.
- Data & reliability: Store choices per workload; replicas; read models; idempotency + outbox; saga/orchestration where needed.
- HA & performance: Global LB, ≥2 regions, CDN/cache, failover tested, and p95 targets by region/tenant.
- Security & compliance: mTLS, RBAC/ABAC, secrets management, audit trails.
- Ops: OTEL traces, per-tenant SLOs and burn-rate alerts; canary/blue-green; online migrations; IaC.
Red flags: No tenant in path/claims, microservices by default, DB as global singleton, no idempotency, one region only, and “scan-and-hope” observability.
Preparation Tips
Build a tiny reference enterprise-scale web application: modular monolith with Accounts/Billing/Orders, OpenAPI contracts, and a tenant middleware that injects/validates tenantId.
Add pooled storage with schemas or partitions and RLS; write one dashboard using a denormalized read model. Implement idempotent POST + outbox → queue. Add per-tenant limits and rate-limit headers.
Stand up two regions behind a global LB with CDN and Redis; simulate region loss and document RTO/RPO. Add a background workflow using Temporal/Step Functions, proving idempotent steps and timeouts.
Instrument OTEL traces and a per-tenant SLO dashboard; create burn-rate alerts. Practice canary/blue-green with an online migration (expand-migrate-contract) and instant rollback. Automate with IaC and a secret vault.
Finally, compute cost per tenant (storage, egress, CPU) and set quotas. Draft a security checklist (mTLS, RBAC/ABAC, audit). Write a one-page ADR explaining why you kept a modular monolith and when to extract services.
Real-world Context
SaaS CRM: Began pooled with partitions and RLS; hot enterprise tenants triggered noisy-neighbor issues. Moving their workloads to siloed DBs (hybrid model) cut p95 by 35% and simplified compliance audits. Outbox + idempotency eliminated duplicate lead imports during retries.
Global retail: Two regions became three after a DR drill showed long RTO. Anycast LB + CDN + regional caches dropped median latency 28%. A modular monolith held complex promotions logic; only checkout extracted to a service, reducing release risk.
Fintech B2B: Strict audit trails and ABAC policies made multi-tenancy safe. A saga for payouts orchestrated via Temporal recovered cleanly from a downstream outage; outbox guaranteed exactly-once effects at boundaries.
Analytics platform: Operational queries left OLTP sluggish. Streaming events to search/BigQuery produced 1–2 query views; per-tenant cost dashboards enabled tiered pricing and stopped surprise bills. Quarterly index/pruning and SLO burn-alerts prevented capacity drift over time.
Key Takeaways
- Put tenant context everywhere; enforce it once at the edge and in data.
- Start modular monolith; extract only where scale or ownership demands.
- Contracts + idempotency + outbox keep complex business logic reliable.
- Design for global high availability with stateless compute and tested failover.
- Track SLOs and cost per tenant; control noisy neighbors.
Practice Exercise
Scenario:
You’re launching an enterprise SaaS in three regions with free/standard/enterprise tiers. Tenants expect SSO, regional failover, and auditability. Propose a blueprint balancing multi-tenancy architecture, complex logic, and global high availability.
Tasks:
- Tenancy: Pick pooled, siloed, or hybrid. Define tenantId flow through auth (OIDC claims), logs, caches, and DB (schemas/partitions). State isolation (RLS/ABAC) and per-tenant quotas.
- Domain shape: Sketch a modular monolith (Accounts, Billing, Catalog, Orders). Name one domain to extract first and why.
- Contracts & reliability: Publish OpenAPI for two writes with idempotency keys. Describe outbox → queue → consumer and exactly-once across retries.
- Data & HA: Choose SQL vs NoSQL per domain. Propose replication (primary + replicas vs multi-region). Set RTO/RPO and failover automation; list what is stateless.
- Async workflows: Pick one long process; outline a saga/orchestration with timeouts and compensation.
- Security & ops: Zero-trust (mTLS, short-lived tokens), secrets vault, audit schema. Define SLOs (availability, p95) and a burn-rate alert.
- CI/CD: Describe canary/blue-green per region, online migrations, and rollback.
- Cost: Draft a per-tenant cost model (storage, egress, CPU) and tier limits.
Deliverable:
A 1–2 page brief + diagram tracing one request across regions, showing isolation points, and explaining how the design scales while staying maintainable.

