How do you ensure consistency across microservices?
Microservices Developer
Short Answer
I avoid two-phase commit and model distributed transactions as sagas. Commands change local state atomically, publish events via an outbox to guarantee delivery, and define compensations for failures. Read models follow CQRS, consuming events to stay eventually consistent. I pick orchestrated sagas for complex, branching flows and choreography for simple chains. Idempotency keys, effectively-once semantics, and retries with backoff protect against duplicates. Observability and timeouts close the loop.
Long Answer
In microservices, strong consistency via a global transaction manager is a dead end: two-phase commit (2PC) harms latency, availability, and operational agility. My strategy makes consistency explicit, using sagas, eventual consistency, CQRS, and the outbox pattern, plus hard guardrails: idempotency, ordering guarantees where needed, and deep observability.
1) Principle: local ACID, global eventual
Each service owns its data and performs an atomic local transaction per request. Cross-service work is split into steps whose outcomes converge through messages. We accept temporary divergence but constrain it with timeouts, retries, and compensations.
2) Saga patterns: orchestrated vs choreographed
- Orchestrated saga: a coordinator issues commands (ReserveCredit, AllocateInventory, CreateShipment), listens for replies, and drives compensations on failure (ReleaseInventory, RefundPayment). I choose this when flows branch, have deadlines, or require policy decisions.
- Choreographed saga: services react to domain events (OrderPlaced → PaymentReserved → InventoryAllocated). It is simpler and loosely coupled, but it can hide coupling in the event graph. I add sequence diagrams and contract tests to keep reasoning clear.
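The orchestrated variant can be sketched in a few lines. This is a minimal in-memory illustration, not a production coordinator: the `Step`, `SagaResult`, and `run_saga` names are hypothetical, the step functions stand in for commands sent to other services, and a real orchestrator would persist saga state and talk to services asynchronously.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]  # forward command; raises on failure
    compensation: Optional[Callable[[dict], None]] = None  # business-level undo


@dataclass
class SagaResult:
    succeeded: bool
    log: List[str] = field(default_factory=list)


def run_saga(steps: List[Step], ctx: dict) -> SagaResult:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    result = SagaResult(succeeded=True)
    completed: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
            result.log.append(f"done:{step.name}")
            completed.append(step)
        except Exception as exc:
            result.succeeded = False
            result.log.append(f"failed:{step.name}:{exc}")
            for done in reversed(completed):
                if done.compensation is not None:
                    done.compensation(ctx)
                    result.log.append(f"compensated:{done.name}")
            break
    return result


# Hypothetical stand-ins for remote commands and their compensations.
def reserve_credit(ctx: dict) -> None:
    ctx["credit_reserved"] = True


def refund_payment(ctx: dict) -> None:
    ctx["credit_reserved"] = False


def allocate_inventory(ctx: dict) -> None:
    ctx["allocated"] = True


def release_inventory(ctx: dict) -> None:
    ctx["allocated"] = False


def create_shipment(ctx: dict) -> None:
    raise RuntimeError("carrier unavailable")  # simulated failure


ORDER_SAGA = [
    Step("ReserveCredit", reserve_credit, refund_payment),
    Step("AllocateInventory", allocate_inventory, release_inventory),
    Step("CreateShipment", create_shipment),
]
```

Calling `run_saga(ORDER_SAGA, {})` fails at the shipment step and then runs ReleaseInventory and RefundPayment in reverse order, which is the behavior the coordinator is responsible for.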
3) Outbox → reliable publication
To prevent the classic “updated DB but failed to publish event,” I use the outbox pattern: write the domain change and the event record in the same local transaction; a relay (poller or CDC) publishes to the broker. This yields at-least-once delivery; receivers must be idempotent. For high scale, I batch reads, mark sent with a durable cursor, and shard outbox tables to avoid lock contention.
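A minimal sketch of the outbox write and a polling relay, using SQLite from the Python standard library so it runs anywhere. The table layout, the `place_order` and `relay_once` names, and the `publish` callback are assumptions for illustration; a real relay would publish to a broker and might use CDC instead of polling.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            sent INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def place_order(conn: sqlite3.Connection, order_id: str) -> None:
    # One local transaction: the state change and the event row
    # are committed together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"orderId": order_id})),
        )


def relay_once(
    conn: sqlite3.Connection,
    publish: Callable[[str, dict], None],
    batch_size: int = 100,
) -> int:
    """Publish unsent rows in order; returns how many were published."""
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM outbox "
        "WHERE sent = 0 ORDER BY seq LIMIT ?",
        (batch_size,),
    ).fetchall()
    published = 0
    for seq, event_type, payload in rows:
        # If publish raises, the row stays unsent and is retried next poll.
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))
        published += 1
    return published
```

Because a crash between `publish` and the `UPDATE` causes the same row to be published again on the next poll, this gives at-least-once delivery, which is why consumers must be idempotent.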
4) Idempotency, deduplication, ordering
Every command/event carries an idempotency key and a version (an aggregate version number or a vector clock). Consumers store processed keys to drop duplicates. If a stream requires per-aggregate ordering (for example, Order-123 events), I route by key to a single partition. Global ordering is avoided; it does not scale.
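Both ideas fit in a short sketch: a stable hash that routes an aggregate's events to one partition, and a handler that drops duplicates by idempotency key. The `partition_for` and `PaymentHandler` names and the event field names are hypothetical. The processed-key set lives in memory here; in a real service it would be stored in the same local transaction as the state change.

```python
import hashlib
from typing import Dict, Set


def partition_for(aggregate_id: str, partitions: int) -> int:
    """Stable hash so every event for one aggregate lands on one partition."""
    digest = hashlib.sha256(aggregate_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


class PaymentHandler:
    """Idempotent consumer: applies each idempotency key at most once."""

    def __init__(self) -> None:
        self.processed_keys: Set[str] = set()
        self.charged: Dict[str, int] = {}  # orderId -> total charged (cents)

    def handle(self, event: dict) -> bool:
        key = event["idempotencyKey"]
        if key in self.processed_keys:
            return False  # duplicate delivery: no side effect
        order_id = event["orderId"]
        self.charged[order_id] = self.charged.get(order_id, 0) + event["amount"]
        self.processed_keys.add(key)
        return True
```

Delivering the same event twice charges the order once, and `partition_for("Order-123", 12)` returns the same partition on every call, so events for that order are consumed in the order they were produced.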
5) CQRS for read-side speed and isolation
I separate writes (commands) from reads (queries). Write models enforce invariants; read models are projections updated asynchronously from events. This isolates read scaling and allows denormalized, query-optimized stores (search indexes, caches). Read freshness is communicated to users with “last updated” stamps or step-state indicators.
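A read-side projection can be sketched as a pure function of the event stream. The event shape (`type`, `orderId`, `version`, `occurredAt`), the status mapping, and the `OrderStatusProjection` class are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical mapping from event type to the status shown to users.
STATUS_BY_EVENT = {
    "OrderPlaced": "PLACED",
    "PaymentReserved": "PAYMENT_RESERVED",
    "InventoryAllocated": "ALLOCATED",
}


@dataclass
class OrderStatusView:
    order_id: str
    status: str
    version: int
    last_updated: str  # surfaced to users as the freshness stamp


class OrderStatusProjection:
    def __init__(self) -> None:
        self.views: Dict[str, OrderStatusView] = {}

    def apply(self, event: dict) -> None:
        status = STATUS_BY_EVENT.get(event["type"])
        if status is None:
            return  # event type not relevant to this read model
        current = self.views.get(event["orderId"])
        if current is not None and event["version"] <= current.version:
            return  # stale or duplicate event: keep the newer view
        self.views[event["orderId"]] = OrderStatusView(
            order_id=event["orderId"],
            status=status,
            version=event["version"],
            last_updated=event["occurredAt"],
        )

    def rebuild(self, event_log: Iterable[dict]) -> None:
        """Drop the read model and replay it from the event log."""
        self.views.clear()
        for event in event_log:
            self.apply(event)
```

The version check makes the projection tolerant of duplicates and late arrivals, and `rebuild` is what allows a read store to be discarded and regenerated.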
6) Compensations and semantic integrity
Compensation is not rollback; it is a new action that restores business meaning (cancel shipment, issue refund, re-open inventory). I model compensations alongside the command, including safety windows (do not ship if payment not captured in N minutes), and timeouts to avoid hung sagas. Partial failure states are explicit and queryable.
7) Timeouts, retries, and backoff
All cross-service calls set deadlines. Retries use exponential backoff with jitter, and circuit breakers prevent retry storms. If a step cannot complete by its SLA, the saga escalates: it compensates or parks for manual intervention. Poison messages go to a dead-letter queue (DLQ) with correlation IDs.
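A minimal sketch of retry with exponential backoff, full jitter, and an overall deadline. The function names and default values are assumptions, and the `sleep` and `clock` parameters exist only so the behavior can be tested without waiting; circuit breaking is left out.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(
    attempt: int,
    base: float = 0.1,
    cap: float = 5.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    source = rng if rng is not None else random
    return source.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    op: Callable[[], T],
    max_attempts: int = 5,
    deadline_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry op until it succeeds, attempts run out, or the deadline would pass."""
    start = clock()
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception as exc:
            last_error = exc
            delay = backoff_delay(attempt)
            out_of_attempts = attempt == max_attempts - 1
            past_deadline = (clock() - start) + delay > deadline_s
            if out_of_attempts or past_deadline:
                break
            sleep(delay)
    # The caller (for example, the saga) decides whether to compensate or park.
    raise TimeoutError("retries exhausted") from last_error
```

The jitter spreads retries from many clients over time, so a recovering dependency is not hit by synchronized waves of requests.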
8) Message contracts and evolution
Events are versioned, additive-first, and schema-validated. I publish a contract (OpenAPI/AsyncAPI/Protobuf) and run consumer-driven contract tests. Breaking changes ship as new topics or versions with a deprecation window.
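One way to honor additive-first versioning on the consumer side is an upcaster that lifts old events to the current shape before handling them. This is a sketch under assumptions: the `schemaVersion` field, the added `currency` field, and its `"USD"` default are all hypothetical and not taken from any specific schema registry or library.

```python
from typing import Any, Dict

# Fields the current (v2) consumer requires; assumed for illustration.
REQUIRED_V2 = ("orderId", "amount", "currency")


def upcast_order_placed(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return the event in v2 shape, tolerating unknown extra fields."""
    version = event.get("schemaVersion", 1)
    if version == 1:
        # v2 added "currency"; old events get a documented default.
        upgraded = {**event, "currency": "USD", "schemaVersion": 2}
    elif version == 2:
        upgraded = dict(event)
    else:
        # A newer major version should arrive on a new topic or stream.
        raise ValueError(f"unsupported schemaVersion: {version}")
    missing = [name for name in REQUIRED_V2 if name not in upgraded]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return upgraded
```

Unknown fields pass through untouched, so producers can add data without breaking this consumer, while an unsupported version fails loudly instead of being misread.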
9) Observability and governance
I trace sagas end-to-end with distributed tracing (span per step, saga ID as trace/group key). Metrics track success rates, latency per step, compensation frequency, and outbox lag. Dashboards show in-flight sagas and stuck states. Alerts fire when compensation rates exceed baseline or when outbox/backlog grows.
10) Storage choices and consistency knobs
For write models I prefer strong local consistency (ACID DB or a single-writer log). For read models I use stores that match access patterns (search, cache, analytics). When inter-service agreement must be fast (for example, inventory), I gate with time-bound reservations that are released automatically on expiry, which bounds the window of inconsistency.
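The inventory case can be sketched as a reservation with a time-to-live. The `Inventory` class and its method names are hypothetical, time is passed in explicitly to keep the example deterministic, and a real service would keep this state in its own database.

```python
from typing import Dict, Tuple


class Inventory:
    """Stock for one SKU with reservations that lapse after a TTL."""

    def __init__(self, stock: int) -> None:
        self.stock = stock
        # order_id -> (quantity, expires_at)
        self.reservations: Dict[str, Tuple[int, float]] = {}

    def _expire(self, now: float) -> None:
        expired = [oid for oid, (_, exp) in self.reservations.items() if exp <= now]
        for oid in expired:
            del self.reservations[oid]  # automatic release on expiry

    def available(self, now: float) -> int:
        self._expire(now)
        return self.stock - sum(qty for qty, _ in self.reservations.values())

    def reserve(self, order_id: str, qty: int, now: float, ttl: float) -> bool:
        self._expire(now)
        if order_id in self.reservations:
            return True  # idempotent: a retry of the same reservation
        if qty > self.available(now):
            return False
        self.reservations[order_id] = (qty, now + ttl)
        return True

    def confirm(self, order_id: str, now: float) -> bool:
        """Turn a live reservation into a stock decrement."""
        self._expire(now)
        held = self.reservations.pop(order_id, None)
        if held is None:
            return False  # expired or never reserved: saga must compensate
        self.stock -= held[0]
        return True
```

A reservation that is never confirmed stops holding stock once its TTL passes, so an abandoned or hung saga cannot block other orders indefinitely.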
11) Testing the unhappy path
I run fault injection: drop or delay messages, reorder them, crash the orchestrator mid-step, and simulate duplicate deliveries. Property tests assert saga invariants (never ship without charge; never double-refund). Replay tooling rebuilds read models from an event log to verify determinism.
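A small harness shows the shape of such a test: duplicate and reorder messages under many random seeds, then assert the invariant. Everything here is illustrative: `RefundLedger` is a hypothetical handler, the message fields are assumed, and a real suite would more likely use a property-testing library and the actual service code.

```python
import random
from typing import Dict, List, Set


class RefundLedger:
    """Handler under test: at most one refund per order."""

    def __init__(self) -> None:
        self.refunded: Dict[str, int] = {}
        self.seen: Set[str] = set()

    def handle(self, msg: dict) -> None:
        if msg["messageId"] in self.seen:
            return  # exact duplicate delivery
        self.seen.add(msg["messageId"])
        if msg["type"] == "RefundRequested" and msg["orderId"] not in self.refunded:
            self.refunded[msg["orderId"]] = msg["amount"]


def chaotic_delivery(
    messages: List[dict], rng: random.Random, duplicate_rate: float = 0.5
) -> List[dict]:
    """Simulate a broker that duplicates and reorders messages."""
    delivered: List[dict] = []
    for msg in messages:
        delivered.append(msg)
        if rng.random() < duplicate_rate:
            delivered.append(msg)
    rng.shuffle(delivered)
    return delivered


def check_never_double_refund(seeds: int = 200) -> bool:
    messages = [
        {"messageId": "m1", "type": "RefundRequested", "orderId": "order-1", "amount": 50},
        # A producer retry that got a new message ID for the same order.
        {"messageId": "m2", "type": "RefundRequested", "orderId": "order-1", "amount": 50},
        {"messageId": "m3", "type": "RefundRequested", "orderId": "order-2", "amount": 20},
    ]
    for seed in range(seeds):
        ledger = RefundLedger()
        for msg in chaotic_delivery(messages, random.Random(seed)):
            ledger.handle(msg)
        if ledger.refunded != {"order-1": 50, "order-2": 20}:
            return False  # invariant violated for this delivery order
    return True
```

Seeding the random generator keeps any failing delivery order reproducible, which matters more than the number of cases tried.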
12) When to use which
- Simple chain (order → pay → allocate): choreography + outbox, idempotent handlers.
- Complex branching (discounts, mixed inventory, partial capture): orchestration for clarity.
- High-read systems: CQRS with multiple projections.
- Legacy systems lacking events: introduce transactional outbox or CDC to bootstrap.
This toolbox yields resilient, observable event-driven flows that preserve business correctness without global locks, keeping latency low and change velocity high.
Common Mistakes
- Emulating monolith transactions with 2PC, increasing latency and failure coupling.
- Publishing events outside the DB transaction (events lost if the service crashes after commit).
- Omitting idempotency, letting retries double-charge or double-ship.
- Relying on global ordering instead of per-aggregate ordering and keys.
- Overusing choreography so business logic hides in event spaghetti; no clear owner.
- Treating compensation as rollback rather than a forward action with business meaning.
- Skipping timeouts and DLQs, leaving sagas stuck forever.
- Ignoring schema evolution and breaking consumers with implicit changes.
Sample Answers
Junior:
“I avoid 2PC and use a saga. Each service commits locally, writes an outbox record, and a relay publishes the event. Consumers are idempotent and store processed IDs. For reads I build a projection so the UI is eventually consistent.”
Mid:
“For simple flows I use choreography; for branching and deadlines I pick orchestration with compensations. I ensure per-aggregate ordering by partitioning on the entity ID. Events are versioned, and a DLQ captures poison messages. Traces use saga IDs.”
Senior:
“My design applies CQRS for scale, outbox/CDC for guaranteed delivery, and sagas for business invariants. I encode timeouts, retries with jitter, and compensations as first-class steps. Contracts are versioned and tested. Dashboards show outbox lag, compensation rate, and in-flight sagas, enabling rapid diagnosis and safe evolution.”
Evaluation Criteria
Look for a concrete plan that replaces global transactions with sagas and eventual consistency, with event delivery guaranteed by the outbox. Strong answers distinguish orchestrated vs choreographed sagas and justify when each fits. They include CQRS for read scaling, idempotency and per-aggregate ordering to tame duplicates, and explicit compensations with timeouts and DLQs. They mention message contracts and versioning, and observability with tracing and lag metrics. Red flags: pushing 2PC, ignoring idempotency, no outbox, or hand-waving around failure and schema evolution.
Preparation Tips
- Build a small order-payment-inventory demo: start with choreography, then switch to orchestration; compare clarity.
- Implement the outbox pattern (TX insert + relay) and verify delivery under crash/restart.
- Add idempotency keys and processed-message stores; write a duplicate-delivery test.
- Partition the message stream by aggregate ID; confirm in-order handling per entity.
- Create a CQRS read model and rehearse replay from the event log.
- Add compensations and deadlines; simulate timeouts and verify invariant preservation.
- Version an event and run consumer-driven contract tests to prevent breaks.
- Add tracing with saga IDs; build a dashboard for outbox lag and compensation rate.
Real-world Context
A marketplace replaced a brittle 2PC checkout with an orchestrated saga. Payments reserve funds, inventory allocates stock, shipping prepares labels; any failure triggers compensations. Outbox + CDC removed lost-event incidents and cut mean time to recovery. A ride-hailing firm adopted CQRS projections for driver ETA queries, decoupling heavy reads from the write path. A fintech added per-account partitioning and idempotent handlers, eliminating duplicate settlements during broker outages. In each case, sagas, eventual consistency, and outbox delivered correctness with low latency and high change velocity.
Key Takeaways
- Prefer sagas over 2PC; commit locally, coordinate globally.
- Guarantee delivery with the outbox pattern; design idempotent consumers.
- Use CQRS to scale reads and isolate writes.
- Enforce per-aggregate ordering; avoid global ordering.
- Make compensations, timeouts, retries, and observability first-class.
Practice Exercise
Scenario:
You must implement an order workflow across Payment, Inventory, and Shipping services. The business requires “never ship without payment capture” and “always release inventory on failure.” The system must tolerate broker outages and prevent double actions under retries.
Tasks:
- Model the flow as a saga. Choose orchestration. Define steps: ReservePayment → AllocateInventory → CapturePayment → CreateShipment. Define compensations: ReleaseInventory, RefundPayment, CancelShipment.
- Implement the outbox in each service: on local commit, write an event row; a relay publishes to the broker with at-least-once delivery.
- Add idempotency keys per order and store processed keys. Partition topics by orderId to guarantee per-order ordering.
- Build a CQRS read model (OrderStatusView) that projects events for the UI. Include a “freshness” timestamp.
- Set timeouts per step and retries with exponential backoff and jitter. Send stuck messages to a DLQ with full correlation data.
- Version your event schema and add a consumer-driven contract test.
- Add tracing with a sagaId spanning all steps; create dashboards for outbox lag, compensation rate, and step latency.
Deliverable:
A working design and notes proving that sagas, eventual consistency, CQRS, and the outbox pattern satisfy invariants, prevent duplication, and keep latency predictable under failures.

