How do you ensure data integrity and consistency in distributed databases?
Database Developer
Answer
I ensure data integrity and distributed database consistency with layered controls: strong schemas and constraints, idempotent writes, and transactional guarantees (ACID where required, eventual consistency where safe). I use consensus (Raft/Paxos) or per-partition quorum writes/reads, plus conflict resolution (CRDTs/CfV, last-write-wins only when harmless). I add CDC + audit logs, checksum guards, and Saga/outbox patterns across services. Observability (SLIs/SLOs) validates invariants in production.
Long Answer
Achieving data integrity and distributed database consistency is about protecting invariants while embracing the realities of partitions, latency, and independent failures. My strategy layers schema guarantees, transactional design, consensus/quorum choices, and operational checks so the system remains correct under stress.
1) Model invariants explicitly
Start by writing down invariants: “an order’s total equals the sum of its items,” “inventory never negative,” “user id unique.” Encode them in the database via schemas (types, NOT NULL), keys/unique indexes, foreign keys (or application-enforced references when the engine lacks them), and check constraints. Add domain guards in code for cross-entity rules that no single table can express.
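A minimal sketch of pushing those invariants into the engine (SQLite DDL for brevity; table and column names are illustrative, not taken from any particular system):

import sqlite3

# Encode "inventory never negative", "user id unique", and referential
# integrity in the schema so bad rows are rejected by the engine itself.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs per connection
conn.executescript("""
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE                        -- uniqueness enforced by the engine
);
CREATE TABLE inventory (
    sku      TEXT PRIMARY KEY,
    quantity INTEGER NOT NULL CHECK (quantity >= 0)   -- "never negative"
);
CREATE TABLE orders (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id)     -- referential integrity
);
""")

conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO inventory (sku, quantity) VALUES ('widget', 5)")
try:
    # Violates the CHECK constraint: the engine rejects the write outright.
    conn.execute("UPDATE inventory SET quantity = -1 WHERE sku = 'widget'")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)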
2) Choose the right consistency per boundary
Not all data needs the same strength. For money, require ACID (single-region primary or multi-leader with strict conflict resolution). For read-heavy product catalogs, eventual consistency is acceptable. Use a vocabulary: strong (linearizable), read-your-writes, monotonic reads, or eventual. Document which API endpoints promise which level so callers design appropriate UX (e.g., optimistic UI with reconciliation).
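One lightweight way to keep those promises visible is to declare the consistency level per endpoint in one reviewable place. The sketch below is illustrative only, with hypothetical endpoint names, not a feature of any particular framework:

from enum import Enum

class Consistency(Enum):
    LINEARIZABLE = "linearizable"            # strong: reads reflect the latest committed write
    READ_YOUR_WRITES = "read_your_writes"
    MONOTONIC_READS = "monotonic_reads"
    EVENTUAL = "eventual"

# Hypothetical endpoint-to-guarantee map; the point is that the promise is
# written down and reviewable, not buried in per-query driver options.
ENDPOINT_CONSISTENCY = {
    "POST /payments":       Consistency.LINEARIZABLE,
    "GET /orders/{id}":     Consistency.READ_YOUR_WRITES,
    "GET /catalog/search":  Consistency.EVENTUAL,
}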
3) Transactions and Sagas
Within a shard/replica set, favor transactions for atomic multi-row updates. Across services or databases, use Saga patterns: split a business action into steps with compensations. Pair Sagas with the outbox pattern (write event + state atomically; a relay publishes from the outbox) to avoid dual-write anomalies. Idempotency keys on commands allow safe retries after timeouts.
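A minimal outbox-plus-idempotency sketch, assuming a relational store (SQLite here for brevity) and hypothetical table names; the key point is that the state change and the event it implies commit in one local transaction, and a unique idempotency key makes retries safe:

import json, sqlite3, uuid

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, idempotency_key TEXT UNIQUE, total_cents INTEGER);
CREATE TABLE outbox (id TEXT PRIMARY KEY, topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def create_order(idempotency_key: str, total_cents: int) -> str:
    order_id = str(uuid.uuid4())
    try:
        with db:  # one atomic transaction: order row + outbox event, no dual-write gap
            db.execute("INSERT INTO orders VALUES (?, ?, ?)",
                       (order_id, idempotency_key, total_cents))
            db.execute("INSERT INTO outbox (id, topic, payload) VALUES (?, ?, ?)",
                       (str(uuid.uuid4()), "order.created",
                        json.dumps({"order_id": order_id, "total_cents": total_cents})))
    except sqlite3.IntegrityError:
        # Retried command with the same key: return the order already created.
        row = db.execute("SELECT id FROM orders WHERE idempotency_key = ?",
                         (idempotency_key,)).fetchone()
        return row[0]
    return order_id

first = create_order("req-123", 4999)
second = create_order("req-123", 4999)   # client retry after a timeout
assert first == second                   # no duplicate order, no ghost event
# A separate relay process would poll outbox WHERE published = 0 and publish.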
4) Replication, quorum, and consensus
Pick replication that matches consistency goals. With Raft/Paxos systems, writes commit after majority; reads can be linearizable (leader) or bounded-stale (followers). In quorum stores (e.g., Dynamo-style), select R/W quorums (R + W > N) for strong reads, or lower for latency. For geo-distribution, pin write leaders near writers or use per-partition leaders to minimize cross-region hops.
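A simplified quorum model (ignoring sloppy quorums, hinted handoff, and read repair) shows why R + W > N yields strong reads: the read and write replica sets must overlap, so a quorum read sees at least one copy of the latest acknowledged write.

def quorums_overlap(n: int, w: int, r: int) -> bool:
    # N replicas, W write acks, R read acks; overlap guarantees freshness.
    return r + w > n

def read_quorum(replica_values, r: int):
    """Pick the newest of R replica responses by version/timestamp.
    replica_values: list of (version, value) tuples from distinct replicas."""
    responses = replica_values[:r]          # in reality: the first R to answer
    return max(responses, key=lambda pair: pair[0])

print(quorums_overlap(n=3, w=2, r=2))   # True  -> strong reads
print(quorums_overlap(n=3, w=1, r=1))   # False -> fast but possibly stale
print(read_quorum([(7, "qty=41"), (6, "qty=42")], r=2))   # (7, 'qty=41')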
5) Conflict handling
Conflicts arise with multi-writer topologies and offline edits. Prefer convergent structures: CRDTs for counters/sets, or conflict-free versions (CfV) using vector clocks/Lamport timestamps plus merge functions. Only fall back to last-write-wins when the domain tolerates loss (e.g., ephemeral presence). Store causal metadata so merges are deterministic; log reconciliations for audit.
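A grow-only counter (G-Counter) is the simplest convergent structure; the sketch below (replica names are illustrative) shows why merges converge regardless of delivery order:

def increment(counter: dict, replica_id: str, amount: int = 1) -> dict:
    # Each replica only increments its own slot.
    counter = dict(counter)
    counter[replica_id] = counter.get(replica_id, 0) + amount
    return counter

def merge(a: dict, b: dict) -> dict:
    # Element-wise max is commutative, associative, and idempotent,
    # which is what makes the merge deterministic.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(counter: dict) -> int:
    return sum(counter.values())

us = increment({}, "us-east")        # {'us-east': 1}
eu = increment({}, "eu-west", 2)     # {'eu-west': 2}
assert merge(us, eu) == merge(eu, us)   # order-independent
assert value(merge(us, eu)) == 3        # no lost updates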
6) Integrity during change
For schema evolution, use backward-/forward-compatible (expand/contract) changes: add the new column, dual-write, switch readers, then remove the old path, deploying in order (DB → writers → readers). Migrate with online tools (shadow tables/backfills), gating writes behind flags. Verify parity with dual-read or CDC comparisons until cutover.
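A sketch of a batched online backfill, assuming SQLite for brevity and a hypothetical reserved_qty column; the idea is small, committed chunks so locks stay short and readers are never blocked:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, qty INTEGER NOT NULL)")
db.executemany("INSERT INTO items VALUES (?, ?)", [(i, i % 5) for i in range(1, 1001)])

# Step 1: additive change only; existing writers and readers keep working.
db.execute("ALTER TABLE items ADD COLUMN reserved_qty INTEGER")

# Step 2: backfill in keyed batches, committing after each chunk.
BATCH = 100
last_id = 0
while True:
    with db:
        rows = db.execute(
            "SELECT id FROM items WHERE id > ? AND reserved_qty IS NULL "
            "ORDER BY id LIMIT ?", (last_id, BATCH)).fetchall()
        if not rows:
            break
        ids = [r[0] for r in rows]
        db.executemany("UPDATE items SET reserved_qty = 0 WHERE id = ?",
                       [(i,) for i in ids])
        last_id = ids[-1]

# Step 3 (not shown): flip readers to the new column, then retire old paths.
print(db.execute("SELECT COUNT(*) FROM items WHERE reserved_qty IS NULL").fetchone())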
7) Observability & guardrails
Track invariants as metrics: count of FK violations prevented, negative inventory attempts, stale-read ratio, read-your-write misses. Emit checksums (periodic row-level hashes per chunk) and compare leaders vs. followers; trigger repair if drift detected. Alert on replication lag, quorum failures, and transaction abort spikes. Define SLOs for correctness (e.g., <0.01% reconciliation events per day).
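A sketch of the checksum guard: hash ordered row chunks on leader and follower, then flag any chunk whose digests disagree so repair can be targeted (the rows here are in-memory stand-ins for ordered range scans):

import hashlib

def chunk_checksum(rows) -> str:
    h = hashlib.sha256()
    for row in rows:                       # rows must be iterated in key order
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()

def drifted_chunks(leader_chunks, follower_chunks):
    """Return indexes of chunks whose digests disagree."""
    return [i for i, (l, f) in enumerate(zip(leader_chunks, follower_chunks))
            if chunk_checksum(l) != chunk_checksum(f)]

leader   = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
follower = [[(1, "a"), (2, "b")], [(3, "c"), (4, "STALE")]]
print(drifted_chunks(leader, follower))    # [1] -> trigger targeted repair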
8) Performance without sacrificing correctness
Batch writes where possible, use idempotent upserts, and lean on indexes that align with access paths to minimize lock contention. Prefer optimistic concurrency (ETags/version columns) with retry loops to avoid hot locks; only escalate to serializable isolation for truly critical sections. Cache safely: stampede protection and write-through for strong correctness, write-behind only with replayable logs.
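A minimal optimistic-concurrency sketch with a version column and a bounded retry loop (SQLite for brevity; the same pattern works with ETags over HTTP):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER, version INTEGER)")
db.execute("INSERT INTO inventory VALUES ('widget', 10, 1)")

def reserve(sku: str, amount: int, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        qty, version = db.execute(
            "SELECT qty, version FROM inventory WHERE sku = ?", (sku,)).fetchone()
        if qty < amount:
            return False                       # invariant guard: never go negative
        cur = db.execute(
            "UPDATE inventory SET qty = qty - ?, version = version + 1 "
            "WHERE sku = ? AND version = ?", (amount, sku, version))
        if cur.rowcount == 1:                  # our version was still current
            db.commit()
            return True
        # rowcount == 0 means another writer won; loop re-reads and retries
    return False

print(reserve("widget", 3))   # True, version bumps to 2
print(reserve("widget", 20))  # False, would violate qty >= 0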
9) Backup, repair, and audits
Enable point-in-time recovery and regular snapshots. Continuously verify backups with restore drills. Run background scrub jobs (range scans with checksums) and compare against CDC to detect silent corruption. Maintain immutable audit logs of mutations for post-incident forensics.
10) Real examples
— Payments: single-leader ledger with ACID transactions; outbox to publish balance events; monotonic reads for statements.
— Social feed: eventually consistent fan-out with CRDT likes; background reconciliation repairs missed writes.
— Inventory: quorum writes + conditional updates; Sagas handle reservation/compensation on timeouts.
Together these practices keep distributed database consistency high while safeguarding data integrity, even as topology, load, and schemas evolve.
Common Mistakes
Treating all data as needing linearizability bloats latency and cost. The opposite, declaring everything eventually consistent, breaks money/inventory invariants. Teams often perform dual writes (DB + queue) without an outbox, creating ghost events. Some pick last-write-wins by default, then discover user edits vanish. Ignoring idempotency turns retries into duplicates. Skipping schema constraints and relying solely on app logic lets bad rows slip in during deploy races. Many migrate schemas without backward compatibility, breaking older writers and readers. Finally, poor observability (no replication-lag alerts, no checksum scrubs, no SLOs for stale reads) means integrity issues are caught by customers, not dashboards.
Sample Answers (Junior / Mid / Senior)
Junior:
“I enforce schemas, unique keys, and foreign keys for data integrity. For reads, caches are updated via write-through. I use optimistic concurrency (row version) and retries.”
Mid-Level:
“I map domains to consistency levels: strong for payments, eventual consistency for catalogs. Inside a shard I use ACID; across services I implement Saga + outbox to avoid dual-writes. Reads honor read-your-writes where promised.”
Senior:
“I design for distributed database consistency with per-partition leaders and quorum writes (R+W>N). Conflicts are resolved with CRDTs or capability-specific merges; LWW only when harmless. Schema evolution is backward-compatible with online backfills. Idempotency keys and OCC prevent duplication under retries. Observability tracks stale-read rate, replication lag, and checksum drift; CDC compares replicas. We drill restores and keep SLOs tied to invariants, not just latency.”
Evaluation Criteria
Interviewers look for:
- Clear grasp of data integrity vs. distributed database consistency trade-offs.
- Correct use of ACID transactions within shards and Saga/outbox across systems.
- Thoughtful application of strong vs. eventual consistency by domain.
- Conflict strategies (CRDTs, vector clocks) beyond naive LWW.
- Quorum/consensus understanding and how to tune R/W vs. latency.
- Backward-compatible schema evolution and online migration practices.
- Operational rigor: CDC, checksums, replication-lag alerts, tested backups.
Candidates who tie design to measurable SLOs and show idempotency/OCC patterns typically score highest.
Preparation Tips
Build a demo with two services (orders, inventory). In Postgres or Cassandra, enforce keys/constraints and add a version column for OCC. Implement Saga with outbox so “create order → reserve stock → confirm” is atomic from a business view. Simulate partitions and tune quorums (R/W) to see latency vs. correctness. Add a CRDT counter for “likes” to practice conflict resolution. Run CDC to a verifier that computes per-chunk checksums between leader and follower. Practice a backward-compatible migration (add column, backfill online, switch readers). Instrument SLOs: stale-read rate, reconciliation attempts, replication lag. Rehearse a 60–90s story showing a failed dual-write, how outbox fixed it, and how SLOs caught a drift bug before users did.
Real-world Context
A marketplace moved inventory from single-DB ACID to geo-replicated quorum writes; they kept data integrity by using conditional updates and Sagas for reservations, dropping oversells to near zero. A social app adopted CRDT sets for reactions and eliminated merge incidents during offline edits. A fintech introduced an outbox after losing events in a dual-write crash; reconciliation rate fell 95%. A retailer’s cross-region catalog used eventual consistency with read-repair and bounded staleness; they published guarantees so UX handled short lags. Another team added CDC + checksums to detect replica drift early, repairing via background scrubs. Each case shows careful consistency choices paired with operational checks.
Key Takeaways
- Encode invariants in schemas and guard with constraints.
- Match consistency to domain: ACID for money, eventual consistency for reads.
- Use Saga + outbox across services; design idempotent operations.
- Resolve conflicts deterministically (CRDTs/CfV), avoid blind LWW.
- Observe and repair: CDC, checksums, lag alerts, tested restores.
Practice Exercise
Scenario: You run a distributed commerce platform with services for Orders, Payments, and Inventory across two regions. Users complain about occasional oversells and stale order views after failovers. Your goal: raise distributed database consistency while preserving latency targets.
Tasks:
- Invariants: Write and enforce constraints so inventory ≥ 0 and order totals match item sums. Add version columns for OCC.
- Consistency mapping: Make Payments strong (single leader or quorum with R+W>N), Orders read-your-writes, Catalog eventually consistent with bounded staleness.
- Sagas: Implement “create order → reserve stock → capture payment,” with compensations for each step. Add outbox to publish events atomically.
- Conflicts: Use CRDT counters for reservations in the cache; reconcile to the source via deterministic merges.
- Quorums: Tune write quorum for Inventory; measure latency and oversell rate before/after.
- Observability: Track stale-read SLO, replication lag, reconciliation attempts; emit periodic checksums and compare replicas.
- Migrations: Perform a backward-compatible change adding reserved_qty; backfill online, then flip readers.
- Report: Provide before/after metrics (oversells, stale-reads, p95 write latency) and a rollback plan.
Deliverable: A short deck with graphs proving integrity improved, oversells dropped, and latency targets held.

