How do you keep data consistent in eventually consistent systems?
Back-End Developer
Answer
In distributed systems, ensure data consistency by designing around eventual consistency: model clear ownership, use idempotency keys, and apply sagas (TCC/compensation) instead of distributed locks. Persist events with the outbox pattern and deliver via exactly-once semantics at the application layer (dedup + retries). For reads, combine quorum/read-repair with versioning (ETags, vector/Lamport clocks). Where conflicts are expected, use CRDTs. Monitor invariants with audits and drift alarms.
Long Answer
Distributed systems trade instantaneous agreement for availability and scale; your job is to deliver data consistency and integrity over time without blocking the world. That means making writes safe, reads honest about staleness, and conflicts deterministic.
1) Model boundaries and ownership
Start with clear data domains and a single writer per aggregate. Cross-service updates flow by events, not in-place remote writes. This shrinks contention and clarifies who resolves conflicts. Use stable keys, immutable events, and append-only logs so history is auditable.
2) Idempotency, retries, and dedup
Networks fail; clients retry. Every mutating request carries an idempotency key so the handler can upsert once. Keep a dedup store keyed by (operation, key, idempotency_id) with result hashes and expiry. Return the same outcome on replay. This is your real “exactly once.”
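A minimal sketch of that dedup store, assuming a hypothetical `dedup` table keyed by (operation, key, idempotency_id); sqlite stands in for whatever store you actually use:

```python
import json
import sqlite3
import time

# Illustrative only: sqlite stands in for the real deduplication store.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE dedup (
    operation TEXT, entity_key TEXT, idempotency_id TEXT,
    result_json TEXT, expires_at REAL,
    PRIMARY KEY (operation, entity_key, idempotency_id))""")

def handle_once(operation, entity_key, idempotency_id, apply_fn, ttl_seconds=48 * 3600):
    """Run apply_fn at most once per (operation, key, idempotency_id); replays return the stored result."""
    row = db.execute(
        "SELECT result_json FROM dedup WHERE operation = ? AND entity_key = ? "
        "AND idempotency_id = ? AND expires_at > ?",
        (operation, entity_key, idempotency_id, time.time()),
    ).fetchone()
    if row:                      # retry or duplicate delivery: return the original outcome
        return json.loads(row[0])
    result = apply_fn()          # the actual business mutation (itself a local transaction)
    db.execute(
        "INSERT INTO dedup VALUES (?, ?, ?, ?, ?)",
        (operation, entity_key, idempotency_id, json.dumps(result), time.time() + ttl_seconds),
    )
    db.commit()
    return result

# The retry carries the same idempotency key, so the charge is applied exactly once.
def charge():
    return {"status": "charged", "amount": 42}

print(handle_once("charge", "order-1", "idem-abc", charge))
print(handle_once("charge", "order-1", "idem-abc", charge))  # replay returns the cached outcome
```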
3) Transactional messaging: outbox/inbox
Use the outbox pattern: commit business state and an event in the same DB transaction; a dispatcher relays the event to Kafka/SNS and marks it sent. Consumers keep an inbox table to dedup deliveries. This avoids the “write DB then maybe publish” split-brain that creates ghosts or missing updates.
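A sketch of the outbox write path under those assumptions; table and column names are illustrative, and the dispatcher would normally be a separate poller or CDC process:

```python
import json
import sqlite3
import uuid

# sqlite stands in for the orders service's database: business row and event commit together.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (id TEXT PRIMARY KEY, topic TEXT, payload TEXT, sent INTEGER DEFAULT 0)")

def place_order(order_id):
    # One local transaction covers both inserts: no "wrote the DB, lost the publish" gap.
    with db:  # sqlite connection as context manager = BEGIN ... COMMIT (ROLLBACK on error)
        db.execute("INSERT INTO orders VALUES (?, 'PLACED')", (order_id,))
        db.execute(
            "INSERT INTO outbox (id, topic, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "orders.placed", json.dumps({"order_id": order_id})),
        )

def dispatch_outbox(publish):
    # The relay publishes unsent rows, then marks them sent. The broker may still deliver
    # duplicates, which the consumer's inbox table dedups.
    rows = db.execute("SELECT id, topic, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)                            # e.g. a Kafka or SNS producer call
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order("order-1")
dispatch_outbox(lambda topic, payload: print("published", topic, payload))
```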
4) Sagas over 2PC
Avoid distributed 2PC. Use sagas: a series of local transactions with compensation if a later step fails (e.g., release inventory, refund payment). For stronger guarantees, apply TCC (try-confirm-cancel) with hold/confirm timeouts and fencing tokens to prevent stale confirms. Persist saga state for replay and idempotent compensation.
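A simplified orchestration sketch of a saga with compensations; the step names are illustrative, and in a real system each step is an idempotent call to another service with saga state persisted so a crashed orchestrator can resume:

```python
# Each step pairs an action with a compensation. If a later step fails, completed steps
# are compensated in reverse order. Compensations must themselves be idempotent.
def reserve_stock(order):    print("reserve stock for", order)
def release_stock(order):    print("release stock for", order)
def hold_payment(order):     print("hold payment for", order)
def refund_payment(order):   print("refund payment for", order)
def confirm_shipment(order): raise RuntimeError("carrier unavailable")  # simulate a failure
def cancel_shipment(order):  print("cancel shipment for", order)

SAGA = [
    (reserve_stock, release_stock),
    (hold_payment, refund_payment),
    (confirm_shipment, cancel_shipment),
]

def run_saga(order):
    completed = []
    try:
        for action, compensation in SAGA:
            action(order)
            completed.append(compensation)
    except Exception as exc:
        print("step failed:", exc, "- compensating")
        for compensation in reversed(completed):
            compensation(order)
        return "COMPENSATED"
    return "COMPLETED"

print(run_saga("order-1"))
```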
5) Concurrency control and guards
Protect invariants with optimistic concurrency (row/version/ETag) and reject stale writes. For shared resources, use logical leases + fencing tokens issued by a sequencer (ZK/etcd). Avoid naïve Redis locking (split brain). When you must serialize, constrain scope and timeout aggressively.
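A sketch of a version-guarded (optimistic) write; a fencing check has the same shape, comparing a monotonically increasing token instead of a row version. Table and column names are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER, version INTEGER)")
db.execute("INSERT INTO accounts VALUES ('acct-1', 100, 1)")
db.commit()

def update_balance(account_id, new_balance, expected_version):
    """Compare-and-set on the row version: a stale writer updates zero rows and must re-read."""
    cur = db.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 WHERE id = ? AND version = ?",
        (new_balance, account_id, expected_version),
    )
    db.commit()
    if cur.rowcount == 0:
        raise ValueError("stale write rejected: re-read and retry with the current version")
    return expected_version + 1

print(update_balance("acct-1", 80, expected_version=1))   # succeeds, version -> 2
try:
    update_balance("acct-1", 90, expected_version=1)      # lost-update attempt is rejected
except ValueError as err:
    print(err)
```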
6) Replication, quorum, and repair
Replication makes data available but inconsistent for a while. Choose quorum parameters (N, R, W) so that W + R > N; overlapping read and write quorums guarantee that a read set contains at least one replica with the latest acknowledged write, giving read-your-writes behavior. Use read-repair and anti-entropy to converge replicas. Expose consistency levels to callers (local, quorum, linearizable where needed) so safety-critical reads can pay the price.
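A toy illustration of why overlapping quorums matter and what read-repair does; replicas are just in-memory dicts here, so this is a sketch of the idea, not a real replication protocol:

```python
# Three replicas hold (value, version). With N=3, W=2, R=2 every read quorum intersects
# every write quorum, so at least one replica in the read set has the latest acked write.
replicas = [{"value": "v1", "version": 1} for _ in range(3)]
N, W, R = 3, 2, 2
assert W + R > N, "quorums must overlap"

def quorum_write(value, version):
    for replica in replicas[:W]:              # ack after W replicas accept (replicas 0 and 1)
        replica.update(value=value, version=version)

def quorum_read():
    responses = replicas[-R:]                 # read a different subset (replicas 1 and 2)
    freshest = max(responses, key=lambda r: r["version"])
    for replica in responses:                 # read-repair: push the freshest copy to stale replicas
        if replica["version"] < freshest["version"]:
            replica.update(freshest)
    return dict(freshest)

quorum_write("v2", version=2)
print(quorum_read())                          # sees v2 via the overlap and repairs replica 2
```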
7) Conflict handling and CRDTs
Conflicts happen with multi-region, offline edits, or fan-out writes. Prefer mergeable state: counters, sets, or document fields that can be composed. Use CRDTs (G-Counter, OR-Set, LWW-Register) when user intent tolerates commutative merges (likes, presence, cart items). For business-critical aggregates (payments), forbid concurrent writers and funnel through a sequencer.
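A minimal G-Counter sketch to show the commutative-merge idea: each node increments only its own slot, and merge takes the element-wise maximum, so replicas converge regardless of merge order:

```python
class GCounter:
    """Grow-only counter CRDT: per-node counts, merged with element-wise max."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}                 # node_id -> count observed from that node

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

# Two regions count likes independently, then converge in either merge order.
us, eu = GCounter("us-east"), GCounter("eu-west")
us.increment(3)
eu.increment(2)
us.merge(eu)
eu.merge(us)
assert us.value() == eu.value() == 5
print(us.value())
```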
8) Versioning and time
Wall clocks lie; rely on Lamport or vector clocks to order causally related events. Store version, updated_at, and a monotonic logical clock; compare on write to prevent lost updates. For APIs, surface ETag and require If-Match for updates. For clients, advertise staleness (“as of T”) and provide “refresh” affordances.
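A compact Lamport clock sketch showing how logical time orders causally related events without trusting wall clocks:

```python
class LamportClock:
    """Logical clock: tick on local events; on receive, take max(local, remote) + 1."""
    def __init__(self):
        self.time = 0

    def tick(self):
        self.time += 1
        return self.time

    def receive(self, remote_time):
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
send_ts = a.tick()            # A performs a local write and stamps the event
recv_ts = b.receive(send_ts)  # B receives it; B's clock jumps past A's timestamp
assert recv_ts > send_ts      # causally later events always carry larger timestamps
print(send_ts, recv_ts)
```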
9) Read models and projections
Build CQRS: commands mutate the write model; subscribers project events into read stores (denormalized views). Reads are eventually consistent; compensate with stale-while-revalidate, cache keys that include version, and invalidate precisely from event types. For totals, maintain materialized counters updated atomically from the event stream; reconcile nightly to catch drift.
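A sketch of a projector that folds events into a denormalized read view; the event shapes and in-memory store are illustrative, and the applied-event set plays the role of the projection's inbox:

```python
# The write model emits immutable events; this projector maintains a read-side view.
# It is idempotent per event id, so redelivered events do not double-count.
read_model = {}            # order_id -> denormalized view
applied_event_ids = set()  # dedup set acting as the projection's inbox

def project(event):
    if event["event_id"] in applied_event_ids:
        return                                 # duplicate delivery: skip
    view = read_model.setdefault(event["order_id"], {"status": "NEW", "total": 0, "version": 0})
    if event["type"] == "ItemAdded":
        view["total"] += event["price"]
    elif event["type"] == "OrderConfirmed":
        view["status"] = "CONFIRMED"
    view["version"] += 1                       # version can feed cache keys / ETags
    applied_event_ids.add(event["event_id"])

for e in [
    {"event_id": "e1", "order_id": "o1", "type": "ItemAdded", "price": 30},
    {"event_id": "e2", "order_id": "o1", "type": "OrderConfirmed"},
    {"event_id": "e1", "order_id": "o1", "type": "ItemAdded", "price": 30},  # duplicate
]:
    project(e)

print(read_model)   # {'o1': {'status': 'CONFIRMED', 'total': 30, 'version': 2}}
```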
10) Integrity via invariants and reconciliation
Encode invariants as code and checks: “sum(ledger) == sum(balances)”, “no negative stock,” “one open order per id.” Run continuous reconciliation jobs that compare authoritative sources and read models; flag and auto-heal divergences with compensating events. Track drift budgets and alert if exceeded.
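A sketch of a reconciliation job that compares the authoritative ledger against materialized balances and alerts when drift exceeds a budget; the data, names, and threshold are illustrative:

```python
# Authoritative ledger entries vs. the materialized per-account balances (read model).
ledger = [("acct-1", 50), ("acct-1", -20), ("acct-2", 100)]
balances = {"acct-1": 30, "acct-2": 90}       # acct-2 has drifted by 10

def reconcile(drift_budget=0):
    expected = {}
    for account, amount in ledger:
        expected[account] = expected.get(account, 0) + amount
    divergences = {
        account: (expected.get(account, 0), balances.get(account, 0))
        for account in set(expected) | set(balances)
        if expected.get(account, 0) != balances.get(account, 0)
    }
    total_drift = sum(abs(e - b) for e, b in divergences.values())
    if total_drift > drift_budget:
        # In a real system: page, and emit compensating events to heal the read model.
        print("ALERT: drift", total_drift, "divergences:", divergences)
    return divergences

reconcile()
```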
11) Schemas, evolution, and contracts
Use serialization with schema evolution support (Avro/Protobuf/JSON Schema) and version topics and fields. Roll out backward/forward-compatible changes behind feature flags, with dual-write/dual-read until all consumers upgrade. Store content hashes to detect partial replays or out-of-order merges.
12) Testing, chaos, and SLOs
Test with failure injection: drop/duplicate/reorder messages, pause consumers, and simulate partitions (Jepsen-style). Assert idempotency and convergence. Define SLOs: time-to-consistency, max allowed drift, duplicate rate. Monitor lag, DLQ depth, and compensation counts; they’re your leading indicators.
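A toy property-style test of that idea: feed the same events duplicated and reordered into an idempotent, version-merging fold and assert it converges to the same state. The fold here is a self-contained stand-in for a real projector or consumer:

```python
import random

# Idempotent, order-tolerant fold: events carry ids and per-entity versions; applying each
# event id at most once and keeping the max version makes delivery order irrelevant.
def apply_all(events):
    state, seen = {}, set()
    for event in events:
        if event["id"] in seen:
            continue
        seen.add(event["id"])
        current = state.get(event["key"], {"version": 0, "value": None})
        if event["version"] > current["version"]:
            state[event["key"]] = {"version": event["version"], "value": event["value"]}
    return state

events = [
    {"id": "e1", "key": "order-1", "version": 1, "value": "PLACED"},
    {"id": "e2", "key": "order-1", "version": 2, "value": "PAID"},
    {"id": "e3", "key": "order-2", "version": 1, "value": "PLACED"},
]

baseline = apply_all(events)
for _ in range(100):                     # chaos: duplicate and reorder deliveries
    noisy = events + random.sample(events, k=2)
    random.shuffle(noisy)
    assert apply_all(noisy) == baseline  # the convergence property must hold
print("converged under duplication and reordering:", baseline)
```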
13) Security and durability
Sign events, encrypt at rest/in flight, and keep append-only logs with retention to rebuild state. Treat the log as the system of record; snapshots are optimizations. Backups include offsets and outbox/inbox so recovery is consistent.
The result is a system that admits temporary divergence but guarantees convergence with deterministic merges, guarded writes, and continuous verification. That’s integrity in an eventually consistent world.
Common Mistakes
- Treating retries as harmless without idempotency, causing duplicates.
- Publishing events outside the database transaction, so messages go missing or phantom updates appear.
- Reaching for 2PC instead of sagas, then deadlocking the distributed world.
- Using wall-clock timestamps for conflict resolution; clock skew makes them lie.
- Blindly relying on “eventual” without quorum or read-repair, so some replicas never converge.
- Ignoring schema evolution and breaking consumers mid-deploy.
- Hiding staleness from users; trust erodes when numbers “jump.”
- No reconciliation or DLQ, so errors quietly accumulate until audits explode.
Sample Answers (Junior / Mid / Senior)
Junior:
“I add idempotency keys to writes and use optimistic concurrency with ETags. For cross-service flows I avoid locks and use simple sagas with compensation. Reads show ‘as of’ timestamps so users know about staleness.”
Mid:
“I implement outbox/inbox for transactional messaging and dedup. Sagas coordinate orders, payments, inventory with TCC. Read side uses quorum where needed and read-repair. We run daily reconciliation queries to detect drift and emit compensating events.”
Senior:
“We define single-writer ownership, event-sourced aggregates, and CQRS projections. Mutations are idempotent; fencing tokens guard confirms. Conflicts use CRDTs where semantics allow; otherwise, a sequencer enforces order. We measure time-to-consistency SLOs, test with failure injection, and gate deploys on drift budgets.”
Evaluation Criteria
Look for: idempotency and dedup at the app layer; outbox/inbox for atomic write+publish; sagas/TCC instead of 2PC; optimistic concurrency and fencing tokens; quorum/read-repair; conflict strategies (CRDTs vs single writer); CQRS with precise invalidation; staleness surfaced in UX; invariants and reconciliation with alarms; schema evolution. Weak answers name only one tool (“use transactions”) or ignore retries, ordering, replays, and user-visible staleness.
Preparation Tips
Build a demo with two services (orders, payments). Add outbox in orders, inbox in payments; make writes idempotent. Coordinate with a saga: reserve → capture → confirm or compensate. Add ETag-based updates and idempotency keys on APIs. Provide a read projection that’s eventually consistent; show “as of T” and a refresh button. Inject chaos: duplicate and reorder messages, pause consumers, simulate partition—verify convergence and compensation. Write reconciliation SQL that compares ledgers vs balances; alert on drift. Practice a 60–90s story focused on guarantees and trade-offs.
Real-world Context
An e-commerce platform replaced ad-hoc callbacks with outbox/inbox; duplicate payments vanished under retries. Inventory oversells stopped after moving to saga-based TCC with short holds and fencing tokens. A social app adopted CRDT sets for likes; conflicts disappeared across regions. A fintech added reconciliation jobs with compensating events; ledger and balance drift dropped below 0.1%. Exposing “as of T” timestamps in dashboards improved trust: users understood staleness and could force a refresh when decisions were critical.
Key Takeaways
- Idempotency + outbox/inbox = practical “exactly once.”
- Prefer sagas/TCC over 2PC; compensate, don’t lock.
- Use optimistic concurrency and fencing tokens to avoid lost updates.
- Choose quorum/repair or CRDTs to converge replicas deterministically.
- Surface staleness; enforce invariants; reconcile and alert on drift.
Practice Exercise
Scenario: You run orders, payments, and inventory services across two regions. The business tolerates eventual consistency but demands no double charges, no negative stock, and <1-minute time-to-consistency for orders.
Tasks:
- Define ownership: payments write the ledger; inventory owns stock; orders orchestrate via a saga.
- Implement idempotency on all POSTs; store (op,key,idempotency_id) + outcome for 48h.
- Add outbox to orders (DB txn includes event) and inbox to payments/inventory with dedup.
- Use TCC: reserve stock and payment hold with TTL; confirm or cancel with fencing tokens.
- Expose read models via projections; mark responses “as of T.” Enable quorum reads for payment status APIs.
- Write reconciliation jobs: (a) sum(ledger) == sum(balances), (b) stock ≥ 0, (c) orders closed iff both payment+shipment events exist.
- Chaos test: duplicate, drop, reorder messages; pause a region; verify convergence within SLO and only compensations, not inconsistencies.
Deliverable: a 60–90s run-through explaining guarantees, failure handling, and how the system converges safely.

