How do you handle errors, retries, and idempotency in middleware?
Middleware Engineer
Answer
In high-throughput pipelines I classify errors (transient vs permanent), apply bounded retries with exponential backoff and jitter, and enforce idempotency via request keys, sequence numbers, or de-dup stores. I keep handlers side-effect free until commit and use outbox, sagas, or transactional messaging to avoid partial writes. Circuit breakers, timeouts, and dead-letter queues contain blast radius, while metrics and traces validate success rates and latency.
Long Answer
High-throughput middleware must stay correct under failure while protecting throughput and latency. My approach blends clear error taxonomy, bounded jittered retries, end-to-end idempotency, and backpressure so the system never amplifies an incident.
1) Error taxonomy and policy
I separate transient faults (timeouts, 5xx, resets) from permanent/semantic errors (4xx validation, business rules). Transient faults get automatic retries; permanent ones go to a dead-letter queue with rich context. Policies define max attempts, per-stage budgets, and retryable exception classes.
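A minimal sketch of such a policy, assuming HTTP-style status codes and illustrative names; real pipelines would tune the retryable set per dependency:

```python
from dataclasses import dataclass

# Illustrative taxonomy; the exact sets are per-dependency policy, not a standard.
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}   # timeouts, throttling, upstream faults
PERMANENT_STATUSES = {400, 401, 403, 404, 409, 422}   # validation and business-rule errors

@dataclass
class RetryPolicy:
    max_attempts: int = 5
    per_stage_budget_s: float = 2.0      # wall-clock retry budget for this stage

def route(status: int, attempt: int, elapsed_s: float, policy: RetryPolicy) -> str:
    """Decide RETRY, DLQ, or FAIL from the error class and the remaining budget."""
    if status in PERMANENT_STATUSES:
        return "DLQ"                      # permanent: dead-letter with context, never retry
    if status in TRANSIENT_STATUSES and attempt < policy.max_attempts \
            and elapsed_s < policy.per_stage_budget_s:
        return "RETRY"
    return "FAIL"                         # budget exhausted or unknown error class
```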
2) Retry discipline
Retries are bounded, exponential, and jittered. I apply per-stage timeouts and propagate deadlines in headers so deep calls fail fast. Token-bucket admission and concurrency limits keep queues stable. For hot resources, requests are coalesced (single-flight). All retries must be idempotent; otherwise we add compensations.
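A sketch of that discipline, assuming a generic `call_downstream(timeout=...)` callable and an absolute deadline passed down from the edge:

```python
import random
import time

class TransientError(Exception):
    """Placeholder for the stage's retryable exception classes (timeouts, 5xx, resets)."""

def backoff_with_full_jitter(attempt: int, base: float = 0.05, cap: float = 2.0) -> float:
    # Full jitter: sleep a random duration in [0, min(cap, base * 2**attempt)).
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

def call_with_retries(call_downstream, deadline_ts: float, max_attempts: int = 5):
    """Retry transient failures until success, attempts run out, or the deadline passes."""
    last_err = None
    for attempt in range(max_attempts):
        remaining = deadline_ts - time.time()
        if remaining <= 0:
            raise TimeoutError("deadline exhausted; fail fast instead of retrying")
        try:
            return call_downstream(timeout=remaining)   # pass the remaining budget downstream
        except TransientError as err:
            last_err = err
            time.sleep(backoff_with_full_jitter(attempt))
    raise last_err or RuntimeError("no attempts were made")
```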
3) Idempotency end-to-end
Each transaction carries an idempotency key (caller id + payload hash + logical time). Middleware persists the key and outcome in a fast store (Redis TTL or key-value table) and short-circuits on replays. For ordered streams I use sequence numbers and at-least-once + de-dup to achieve exactly-once-like effects. Side effects occur only after validation.
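One possible shape of the key store, sketched with redis-py; the key layout, TTL, and `handler` are assumptions rather than a fixed schema:

```python
import hashlib
import json
import redis  # redis-py client; any fast KV store with TTLs works

r = redis.Redis()

def idempotency_key(caller_id: str, payload: dict, logical_time: int) -> str:
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    return f"idem:{caller_id}:{logical_time}:{digest}"

def process_once(caller_id, payload, logical_time, handler, ttl_s=86_400):
    """Run handler at most once per key; replays short-circuit to the stored outcome."""
    key = idempotency_key(caller_id, payload, logical_time)
    # Reserve the key atomically; only the first writer wins.
    if not r.set(key, "IN_PROGRESS", nx=True, ex=ttl_s):
        return r.get(key)            # replay: return the recorded outcome (stored as bytes)
    outcome = handler(payload)       # side effects happen only after the key is reserved
    r.set(key, outcome, ex=ttl_s)    # record the result for future replays
    return outcome
```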
4) Atomic side effects
Use the outbox pattern: write state and an event in one DB transaction; a relay then publishes outbox rows at least once, and consumer de-dup absorbs any republish, which is effectively once for observers. For multi-service flows, sagas with compensations record intent and completion so recovery is deterministic. Transactional messaging or two-phase commit is rare; availability favors sagas.
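A minimal outbox write, sketched with SQLite standing in for the service database; table and column names are illustrative:

```python
import json
import sqlite3
import uuid

def place_order_with_outbox(db: sqlite3.Connection, order: dict) -> None:
    """Write business state and the event it implies in one local transaction."""
    with db:  # both rows commit together or neither does
        db.execute(
            "INSERT INTO orders (id, status, body) VALUES (?, ?, ?)",
            (order["id"], "ACCEPTED", json.dumps(order)),
        )
        db.execute(
            "INSERT INTO outbox (event_id, type, payload, published_at) "
            "VALUES (?, ?, ?, NULL)",
            (str(uuid.uuid4()), "order.accepted", json.dumps(order)),
        )
```

The relay side simply polls rows where published_at IS NULL, publishes them, and stamps them; a crash between publish and stamp yields a republish that consumer de-dup absorbs.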
5) Queues and storage
Queues are durable and right-sized. Visibility timeouts exceed p99 handler time and are renewed while a worker is still processing. A dead-letter queue captures permanent failures with replay tooling that preserves idempotency. Small batches keep tail latency low.
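A lease-renewal heartbeat, sketched against SQS via boto3 (the same idea applies to any broker with extendable visibility); the interval and extension values are assumptions:

```python
import threading
import boto3

sqs = boto3.client("sqs")

def renew_visibility(queue_url: str, receipt_handle: str, stop: threading.Event,
                     extend_s: int = 60, interval_s: int = 45) -> None:
    """Keep extending the message lease while the handler still runs,
    so the broker does not redeliver a message that is merely slow."""
    while not stop.wait(interval_s):
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=extend_s,
        )
```

The worker starts this in a background thread before invoking the handler, sets `stop` on completion, and only then deletes the message.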
6) Observability
We log retry count, last error, idempotency hits, and DLQ reasons. Dashboards show success rate, p95/p99 latency, saturation, and error-budget burn. Distributed tracing ties stages together; alerts trigger on SLO breaches.
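A sketch of that metric surface using prometheus_client; the metric and label names are illustrative:

```python
from prometheus_client import Counter, Histogram

RETRIES = Counter("stage_retries_total", "Retries per stage and error class",
                  ["stage", "error_class"])
IDEMPOTENCY_HITS = Counter("idempotency_hits_total", "Replays short-circuited by key",
                           ["stage"])
DLQ_MESSAGES = Counter("dlq_messages_total", "Messages routed to the DLQ",
                       ["stage", "reason"])
HANDLER_LATENCY = Histogram("handler_latency_seconds", "End-to-end handler latency",
                            ["stage"])

# Example: record one retry and the final handler time for a stage.
RETRIES.labels(stage="charge", error_class="timeout").inc()
HANDLER_LATENCY.labels(stage="charge").observe(0.042)
```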
7) Fault isolation
Circuit breakers and bulkheads isolate dependencies; timeout budgets keep event loops free. During incidents we shed load for low-priority classes and prefer graceful degradation (cache, stale reads). Replays include a dry-run mode to validate idempotency before execution.
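A minimal breaker sketch; the threshold and cooldown are assumptions, and production breakers also track half-open probe outcomes per dependency:

```python
import time

class CircuitBreaker:
    """Minimal breaker: open after N consecutive failures, probe again after a cooldown."""
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a probe through once the cooldown has elapsed.
        return time.time() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()
```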
8) Testing and chaos
Property-based tests enforce idempotency; failure-injection covers timeouts and partial commits. Periodic chaos (latency, packet loss) validates budgets and compensations without crushing throughput.
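A property-based sketch with Hypothesis, using an in-memory stand-in for the de-dup layer; `handle_once` and its storage are hypothetical:

```python
from hypothesis import given, strategies as st

# In-memory stand-in for the idempotency layer under test.
seen, effects = {}, []

def handle_once(key: str, payload: int) -> int:
    if key in seen:                      # replay: return the stored outcome, no new effect
        return seen[key]
    effects.append(payload)              # the "side effect" we must not duplicate
    seen[key] = payload * 2
    return seen[key]

@given(key=st.text(min_size=1), payload=st.integers(), replays=st.integers(1, 5))
def test_replays_do_not_duplicate_effects(key, payload, replays):
    seen.clear(); effects.clear()
    results = [handle_once(key, payload) for _ in range(replays)]
    assert len(set(results)) == 1        # every replay returns the same outcome
    assert len(effects) == 1             # the side effect happened exactly once
```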
Bottom line: classify errors, retry prudently with jitter, enforce idempotency keys and sequence checks, commit side effects atomically, and instrument everything.
Common Mistakes
- Treating all errors as retryable, amplifying outages and costs.
- Infinite retries without backoff + jitter, creating thundering herds and hot partitions.
- Missing idempotency keys so duplicates cause double charges, duplicate emails, or reissued webhooks.
- Performing side effects before idempotency validation or outside transactions; partial writes remain.
- Oversized batches hide tail latency, block fair scheduling, and starve DLQ processing during incidents.
- No visibility timeout renewal; the broker re-delivers while a worker still runs, causing duplication.
- Lack of observability: no split of first-attempt vs retry metrics, no correlation ids, shallow logs.
- Replays without a dry-run mode; bulk reprocessing retriggers external calls and breaks rate limits.
- Assuming the broker provides exactly-once semantics; consumers must de-dup explicitly.
- Forgetting to cap total retry budget per transaction, letting one request monopolize capacity.
Sample Answers (Junior / Mid / Senior)
Junior:
“I separate transient from permanent errors, using exponential backoff with jitter for the former and a DLQ for the latter. I add an idempotency key so retries return the same result, and I keep writes inside transactions to avoid partial updates. I document retry policies so operators know behavior.”
Mid:
“I propagate deadlines, cap attempts per stage, and enforce idempotency with a Redis store and sequence numbers for ordered topics. I set visibility timeouts above p99 and renew them. I log retry count and DLQ reason, and I coalesce duplicate inflight calls to hot dependencies. I add tracing spans around retries to see where time is spent.”
Senior:
“Our pipelines are stateless and idempotent end-to-end. We use outbox + sagas for atomic effects, circuit breakers and bulkheads for isolation, and token-bucket admission to keep queues stable. Replays run through a dry-run validator before execution. Dashboards show p95/p99, idempotency hit rate, and error-budget burn; SLOs gate changes and rollbacks. We run chaos drills quarterly to validate budgets and recovery.”
Evaluation Criteria
Strong answers present a repeatable plan: classify errors by cause, route permanent failures to DLQ, and apply bounded retries with jitter and deadlines. They explain idempotency end-to-end using keys, sequence numbers, and consumer de-dup, and keep side effects behind atomic commits (outbox) or sagas. Candidates should cover broker settings (visibility timeouts, renewal), replay tooling, and observability (structured logs, metrics for retry counts and DLQ age, plus distributed traces). They cite circuit breakers, bulkheads, rate limits, and load shedding to bound blast radius, and they discuss testing (property-based, failure injection, chaos).
Red flags: infinite retries, relying on “exactly-once broker delivery,” missing idempotency keys, side effects before commit, or no plan for replays and DLQs. Bonus signals: SLOs and error-budget alerts that gate changes, runbooks for pause/drain/replay, and single-flight/coalescing for hot dependencies. Mature answers also mention backpressure (token-bucket or concurrency caps) to stabilize queues under load.
Preparation Tips
- Build a sandbox stage with explicit retryable vs permanent errors; add exponential backoff + jitter and verify timing with logs and traces.
- Implement an idempotency store (Redis TTL) and stress test with concurrent duplicates; prove the same key cannot produce multiple effects.
- Add the outbox pattern to a write path; crash between commit and publish to verify the relay produces exactly one event.
- Configure a DLQ; write a safe replay tool with dry-run and per-batch caps; record correlation ids.
- Instrument metrics (retry count, idempotency hit rate, DLQ depth, p95/p99 latency) and tracing spans for each stage; build dashboards.
- Run chaos: add latency/packet loss to a dependency; observe breakers opening, backpressure holding, and SLO alerts.
- Define SLOs and an on-call runbook: pause intake, drain queues, replay safely, and rollback.
- Document policies in the repo (error taxonomy, retry limits, deadlines) so ops and devs share the same playbook.
Real-world Context
A payments pipeline saw duplicate charges during network flaps. Adding idempotency keys with a 24-hour TTL, validating before side effects, and moving effects behind an outbox eliminated duplicates and enabled safe replays; chargebacks dropped by 60%.
An orders service hammered a flaky inventory API. Introducing exponential backoff with full jitter, single-flight for identical requests, and a circuit breaker cut upstream error rates by 80% and stabilized p99 latency without overprovisioning.
A notifications system built large batches and let visibility timeouts expire; messages reappeared and processed twice. We added renewal, smaller batches, and a replay tool with dry-run and rate limits; throughput recovered while avoiding duplicate sends.
Post-mortems added dashboards for idempotency hit rate and DLQ age, plus SLO alerts that now gate risky changes in CI/CD.
Key Takeaways
- Classify errors; retry only the transient ones with backoff + jitter.
- Make effects idempotent using keys, sequences, and de-dup stores.
- Commit state atomically with outbox; use sagas for workflows.
- Protect capacity with breakers, bulkheads, deadlines, and backpressure.
- Instrument retries, DLQ, and idempotency; rehearse replays safely.
Practice Exercise
Scenario:
You own a middleware service ingesting 50k TPS of payments events. A dependency intermittently times out, causing duplicates and a growing backlog.
Tasks:
- Introduce idempotency keys (caller id + payload hash + logical time) stored in Redis with TTL; short-circuit replays and expose a /replay-check endpoint.
- Add exponential backoff with full jitter, cap attempts to N, and propagate a deadline header across hops; log when budgets are exhausted.
- Wrap the dependency with timeouts, circuit breaker, and single-flight to collapse duplicate inflight calls (see the single-flight sketch after this list); add bulkhead pools per dependency.
- Move writes to an outbox table and add a publisher that resumes reliably after crashes; verify exactly-once publication with a crash test.
- Tune the queue: visibility timeout > p99 handler time; enable renewal; reduce batch size to stabilize tail latency; enforce per-tenant rate limits.
- Ship metrics (retry count, idempotency hit rate, DLQ depth/age, p95/p99, breaker open time) and traces; alert on error-budget burn.
- Build a replay tool with dry-run and rate caps that validates idempotency before execution; include correlation ids in audit logs.
- Run a chaos drill: inject 300 ms latency and 5% failure for one hour; prove SLOs hold, queues do not explode, and recovery is automatic.
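A minimal single-flight wrapper for the dependency call in the third task (thread-based; names are illustrative):

```python
import threading

class SingleFlight:
    """Collapse concurrent identical calls: one caller does the work,
    the rest wait for its result instead of piling onto the flaky dependency."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> (event, result box)

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        event, box = entry
        if leader:
            try:
                box["value"] = fn()
            except Exception as err:          # propagate the leader's failure to waiters
                box["error"] = err
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
                event.set()
        else:
            event.wait()
        if "error" in box:
            raise box["error"]
        return box["value"]
```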
Deliverable:
A runbook and PoC showing fewer duplicates, bounded latency, safe and measurable reprocessing, and dashboards that a new on-call engineer can use on day one.

