How do you design observability and failure handling in Rust?
Rust Web Developer
answer
I design observability and failure handling in Rust in three layers. At the code level I use thiserror for precise domain errors and anyhow in request scopes, always attaching context. At the platform edge I add tower middleware for timeouts, retries, rate limiting, and structured logging. I instrument code with tracing spans, export them through OpenTelemetry, and emit metrics and logs with correlation identifiers. Finally, I implement graceful shutdown that captures signals, stops intake, and drains in-flight work before exit.
Long Answer
A reliable Rust web service treats observability and failure handling as first-class design goals. The aim is to expose clear signals when things go wrong, contain blast radius automatically, and shut down safely. I use a layered approach: structured errors in the domain, defensive middleware at the edge, rich spans and metrics for visibility, and a deliberate graceful shutdown sequence that drains in-flight work.
1) Error taxonomy and propagation
I start with a clear error taxonomy. In library and domain code I define enums with thiserror, mapping each variant to a stable code and a human-readable message. Variants carry rich context such as identifiers and boundary fields so operators can diagnose quickly. At service boundaries I convert errors to typed responses through a single error mapper. Inside request handling I allow anyhow for ergonomic composition, but I always annotate with .context("meaningful step") so the chain is searchable. I separate expected errors (validation, not found, conflict) from unexpected ones (invariants, dependency failures) and tag them differently for metrics and alerts.
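A minimal sketch of this taxonomy, assuming the thiserror and anyhow crates; the OrderError type, its variants, and load_order are hypothetical names used only for illustration:

```rust
use anyhow::Context;
use thiserror::Error;

#[derive(Debug, Error)]
pub enum OrderError {
    // Expected errors: validation, not found, conflict.
    #[error("order {0} not found")]
    NotFound(String),
    #[error("invalid quantity {given}; must not exceed {max}")]
    InvalidQuantity { given: u32, max: u32 },
    // Unexpected errors: broken invariants, dependency failures.
    #[error("storage failure")]
    Storage(#[source] std::io::Error),
}

impl OrderError {
    /// Stable machine-readable code used in responses, metrics, and alerts.
    pub fn code(&self) -> &'static str {
        match self {
            OrderError::NotFound(_) => "ORDER_NOT_FOUND",
            OrderError::InvalidQuantity { .. } => "ORDER_INVALID_QUANTITY",
            OrderError::Storage(_) => "ORDER_STORAGE_FAILURE",
        }
    }
}

// Inside a request scope, anyhow composes fallible steps while .context
// keeps every hop in the chain searchable.
pub fn load_order(id: &str) -> anyhow::Result<String> {
    std::fs::read_to_string(format!("/var/lib/orders/{id}.json"))
        .with_context(|| format!("reading order record for {id}"))
}
```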
2) Edge policies with tower middleware
Edge concerns belong in reusable tower middleware. I wrap handlers with timeouts to cap waiting, concurrency limiters to protect downstreams, and retry with exponential backoff for transient classes. I add a circuit breaker to trip when failure rate exceeds thresholds and to recover with half-open probes. I normalize request and response logging, redacting secrets and personal data by default. A request identity middleware assigns a correlation identifier that flows through spans, logs, and metrics. Rate limiting and load shedding protect the process during surges and keep tail latency bounded.
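A sketch of such a stack on an axum router, assuming axum 0.7 and tower with its limit, load-shed, and timeout features enabled; the route, limits, and status mapping are illustrative, and retry, rate limiting, and circuit breaking would be further layers in the same builder:

```rust
use std::time::Duration;

use axum::{error_handling::HandleErrorLayer, http::StatusCode, routing::get, BoxError, Router};
use tower::ServiceBuilder;

#[tokio::main]
async fn main() {
    // Edge policies, outermost first: error mapping, load shedding,
    // a concurrency cap, then a per-request timeout.
    let policies = ServiceBuilder::new()
        .layer(HandleErrorLayer::new(handle_middleware_error))
        .load_shed()
        .concurrency_limit(512)
        .timeout(Duration::from_secs(10));

    let app = Router::new()
        .route("/orders", get(|| async { "ok" }))
        .layer(policies);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

// Map errors produced by the tower layers to explicit HTTP statuses so
// callers see bounded latency and clear retry hints.
async fn handle_middleware_error(err: BoxError) -> (StatusCode, String) {
    if err.is::<tower::timeout::error::Elapsed>() {
        (StatusCode::REQUEST_TIMEOUT, "request timed out".into())
    } else if err.is::<tower::load_shed::error::Overloaded>() {
        (StatusCode::SERVICE_UNAVAILABLE, "overloaded, retry later".into())
    } else {
        (StatusCode::INTERNAL_SERVER_ERROR, format!("middleware error: {err}"))
    }
}
```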
3) Tracing spans, logs, and metrics
I instrument the service with tracing. Each request opens a root span that includes route, method, tenant, and correlation identifier. Critical internal steps create child spans: parse, validate, database call, cache call, queue call, render. Errors are recorded at the span that detected them and bubble up with cause chains. For logs I emit structured, machine-readable events, never printf-style text. Metrics capture request rate, success rate, and latency percentiles per route and dependency. I add counters for specific error classes, histogram buckets for durations, and gauges for queue depth and connection pools. Span fields enable high-cardinality filters only where necessary; otherwise identifiers are hashed to bound cardinality.
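A sketch of handler instrumentation with the tracing crate; create_order, save_order, and the field names are illustrative placeholders:

```rust
use tracing::{error, info, info_span, instrument, Instrument};

// The root span carries route, method, and the correlation identifier so
// every child span and log event inherits them.
#[instrument(
    name = "http_request",
    skip(payload),
    fields(route = "/orders", method = "POST", correlation_id = %correlation_id)
)]
async fn create_order(correlation_id: &str, payload: &str) -> anyhow::Result<()> {
    // Child span around a synchronous step.
    let parsed = {
        let _guard = info_span!("parse").entered();
        payload.trim().to_owned()
    };

    // Structured event with machine-readable fields, not free-form text.
    info!(order.bytes = parsed.len(), "order parsed");

    // Child span around an async dependency call; the error is recorded
    // at the span that detected it before bubbling up.
    save_order(&parsed)
        .instrument(info_span!("db.save_order"))
        .await
        .inspect_err(|e| error!(error = %e, "saving order failed"))
}

async fn save_order(_order: &str) -> anyhow::Result<()> {
    Ok(())
}
```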
4) OpenTelemetry export and sampling
I wire OpenTelemetry to export traces and metrics to a collector. I use parent-based sampling so internal spans inherit the decision from the incoming context, which preserves end-to-end visibility across services. For high-traffic routes I set head sampling to a small percentage and raise the rate during incidents. I attach resource attributes such as service name, version, and deployment region so dashboards can slice and compare. Logs are shipped through the same pipeline with trace context attached, which enables log-trace correlation inside a single tool.
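A sketch of the exporter wiring, assuming the opentelemetry, opentelemetry_sdk, opentelemetry-otlp (with the tonic exporter), tracing-opentelemetry, and tracing-subscriber (json feature) crates; these APIs change between releases, so the exact builder calls below follow the older new_pipeline style and may need adjusting to the versions in use. The service name and region values are illustrative.

```rust
use opentelemetry::KeyValue;
use opentelemetry_sdk::{
    trace::{self, Sampler},
    Resource,
};
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

fn init_telemetry() -> anyhow::Result<()> {
    // Parent-based sampling: honor the incoming decision, otherwise sample
    // a small ratio of new traces at the head.
    let sampler = Sampler::ParentBased(Box::new(Sampler::TraceIdRatioBased(0.05)));

    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(opentelemetry_otlp::new_exporter().tonic())
        .with_trace_config(
            trace::config()
                .with_sampler(sampler)
                // Resource attributes let dashboards slice by service and region.
                .with_resource(Resource::new([
                    KeyValue::new("service.name", "orders-service"),
                    KeyValue::new("service.version", env!("CARGO_PKG_VERSION")),
                    KeyValue::new("deployment.region", "eu-west-1"),
                ])),
        )
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;

    // Bridge tracing spans into OpenTelemetry and keep structured JSON logs.
    tracing_subscriber::registry()
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .with(tracing_subscriber::fmt::layer().json())
        .try_init()?;

    Ok(())
}
```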
5) Health checks and readiness
I separate liveness and readiness. Liveness indicates only that the process is running and should not be restarted. Readiness checks dependencies and configuration such as database connectivity, queue producers, and essential caches. During a deploy or incident I can take the process out of rotation by failing readiness while keeping liveness healthy to allow draining. Health endpoints also expose build information, feature flags, and configuration hashes for faster triage.
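A sketch of the split endpoints with axum; AppState, the dependency probe, and the route paths are illustrative:

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

use axum::{extract::State, http::StatusCode, routing::get, Json, Router};
use serde_json::json;

#[derive(Clone)]
struct AppState {
    // Flipped to false at the start of a drain so the instance leaves rotation.
    accepting: Arc<AtomicBool>,
}

// Liveness says only "keep this process alive"; it never touches dependencies.
async fn liveness() -> StatusCode {
    StatusCode::OK
}

// Readiness reflects dependencies and the drain flag, plus triage metadata.
async fn readiness(State(state): State<AppState>) -> (StatusCode, Json<serde_json::Value>) {
    let ready = state.accepting.load(Ordering::Relaxed) && database_reachable().await;
    let status = if ready { StatusCode::OK } else { StatusCode::SERVICE_UNAVAILABLE };
    (
        status,
        Json(json!({
            "ready": ready,
            "version": env!("CARGO_PKG_VERSION"),
        })),
    )
}

// Placeholder for a cheap probe, e.g. a `SELECT 1` under a short timeout.
async fn database_reachable() -> bool {
    true
}

fn health_routes(state: AppState) -> Router {
    Router::new()
        .route("/healthz", get(liveness))
        .route("/readyz", get(readiness))
        .with_state(state)
}
```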
6) Failure handling patterns
Transient dependency errors use retry with jitter and upper bounds, always respecting idempotency. Non-retryable errors fail fast and include actionable messages. Timeouts wrap every external call so hung sockets cannot stall work indefinitely. I include dead letter queues for messages that repeatedly fail. For background tasks I store durable state and idempotency keys to avoid duplicate effects after restarts. Backpressure is visible through metrics; if queues grow beyond thresholds, the service begins to reject new work with a clear status and retry hints.
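A sketch of such a retry helper using only tokio; the backoff bounds and the jitter source (system clock nanoseconds standing in for a real RNG) are illustrative, and the wrapped operation must be idempotent:

```rust
use std::{
    future::Future,
    time::{Duration, SystemTime, UNIX_EPOCH},
};

/// Retry an idempotent async operation with exponential backoff and jitter.
/// `is_transient` decides which error classes are worth retrying at all.
pub async fn retry_with_backoff<T, E, Fut, Op>(
    mut op: Op,
    is_transient: impl Fn(&E) -> bool,
    max_attempts: u32,
) -> Result<T, E>
where
    Op: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    assert!(max_attempts >= 1);
    let mut delay = Duration::from_millis(100);
    for attempt in 1..=max_attempts {
        match op().await {
            Ok(value) => return Ok(value),
            // Non-retryable classes and the final attempt fail fast.
            Err(err) if attempt == max_attempts || !is_transient(&err) => return Err(err),
            Err(_) => {
                // Full jitter keeps clients from retrying in lockstep; clock
                // nanoseconds stand in for a proper RNG in this sketch.
                let nanos = SystemTime::now()
                    .duration_since(UNIX_EPOCH)
                    .unwrap_or_default()
                    .subsec_nanos() as u64;
                let jitter_ms = nanos % (delay.as_millis() as u64 + 1);
                tokio::time::sleep(Duration::from_millis(jitter_ms)).await;
                delay = (delay * 2).min(Duration::from_secs(5));
            }
        }
    }
    unreachable!("the final attempt always returns above")
}
```

In practice each attempt would also sit inside tokio::time::timeout, mirroring the rule above that a timeout wraps every external call so a hung socket cannot stall the retry loop itself.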
7) Graceful shutdown and in-flight draining
Graceful shutdown begins with signal capture. On the first signal the server stops accepting new connections and closes listeners. A cancellation token is broadcast to long-running tasks so they can decide to finish or abort safely. In-flight requests receive a generous deadline; background workers finish the current unit of work and checkpoint offsets. The process waits until either the drain completes or a hard timeout elapses, then forces remaining work to a safe parking area such as a queue or durable table. I ensure that shutdown hooks flush tracing and metrics exporters so operators do not lose the final spans.
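A sketch of this sequence with axum 0.7, tokio (signal and time features), and tokio-util's CancellationToken; the timings, the worker body, and the flush step are illustrative:

```rust
use std::time::Duration;

use tokio::signal;
use tokio_util::sync::CancellationToken;

async fn run(app: axum::Router) -> anyhow::Result<()> {
    let cancel = CancellationToken::new();

    // Long-running worker: finishes the current unit of work and checkpoints
    // when cancellation is broadcast.
    let worker_cancel = cancel.clone();
    let worker = tokio::spawn(async move {
        loop {
            tokio::select! {
                _ = worker_cancel.cancelled() => {
                    // Checkpoint offsets / park remaining work durably here.
                    break;
                }
                _ = tokio::time::sleep(Duration::from_secs(1)) => {
                    // Normal processing tick.
                }
            }
        }
    });

    // axum stops accepting new connections once the shutdown future resolves,
    // then lets in-flight requests run to completion.
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await?;
    axum::serve(listener, app)
        .with_graceful_shutdown(shutdown_signal(cancel.clone()))
        .await?;

    // Bound the drain of background work with a hard timeout.
    let _ = tokio::time::timeout(Duration::from_secs(30), worker).await;

    // Flush tracing and metrics exporters last so the final spans are not lost.
    Ok(())
}

// Resolve on SIGINT or SIGTERM, then broadcast cancellation to workers.
async fn shutdown_signal(cancel: CancellationToken) {
    let ctrl_c = async {
        signal::ctrl_c().await.expect("failed to install Ctrl+C handler");
    };

    #[cfg(unix)]
    let terminate = async {
        signal::unix::signal(signal::unix::SignalKind::terminate())
            .expect("failed to install SIGTERM handler")
            .recv()
            .await;
    };

    #[cfg(not(unix))]
    let terminate = std::future::pending::<()>();

    tokio::select! {
        _ = ctrl_c => {},
        _ = terminate => {},
    }

    cancel.cancel();
}
```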
8) Testing and chaos exercises
I test error mapping with table driven cases that cover each error variant. I run integration tests that simulate timeouts, connection resets, duplicate deliveries, and partial failures. I add chaos scenarios that kill the process mid-request, then verify that graceful shutdown drains on the next deploy and that idempotency prevents double effects. I assert that sampling and exporters keep trace trees intact across internal and external boundaries.
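A table-driven sketch of the error-mapping test, reusing the hypothetical OrderError from the earlier taxonomy sketch:

```rust
#[cfg(test)]
mod error_mapping_tests {
    use super::OrderError;

    // One row per variant: the mapping to stable codes must never drift.
    #[test]
    fn every_variant_maps_to_a_stable_code() {
        let cases = [
            (OrderError::NotFound("o-42".into()), "ORDER_NOT_FOUND"),
            (
                OrderError::InvalidQuantity { given: 0, max: 10 },
                "ORDER_INVALID_QUANTITY",
            ),
            (
                OrderError::Storage(std::io::Error::other("disk unavailable")),
                "ORDER_STORAGE_FAILURE",
            ),
        ];
        for (err, expected) in cases {
            assert_eq!(err.code(), expected);
        }
    }
}
```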
9) Governance and documentation
A runbook documents error codes, retry guidance, and escalation steps. Dashboards show service level indicators, error budgets, and dependency health. A weekly review addresses top offenders in error counts and tail latency. The codebase includes templates for new routes and new background workers so teams get consistent middleware, spans, and metrics without manual effort.
By combining thiserror and anyhow for precise and contextual errors, tower middleware for protective edge policies, tracing with OpenTelemetry for visibility, and a robust graceful shutdown that drains in-flight work, a Rust service becomes both observable and resilient.
Common Mistakes
Relying on generic error strings without stable codes or context. Mixing anyhow everywhere and losing domain specificity. Logging free-form text instead of structured fields, making search and correlation impossible. Omitting tower middleware for timeouts and retries so hung dependencies stall threads. Spraying high-cardinality identifiers into metrics and breaking storage. Exporting traces without parent-based sampling and losing cross-service visibility. Treating liveness and readiness as the same check. On shutdown, killing the process immediately without stopping intake or draining in-flight work, which creates partial side effects and duplicate processing after restart.
Sample Answers
Junior:
“I define domain error enums with thiserror and map them to responses. I wrap handlers with tower timeouts and retries. I instrument routes with tracing and export to OpenTelemetry. On shutdown I stop new requests and allow current ones to finish.”
Mid-level:
“I separate expected and unexpected errors, attach context with anyhow in request scopes, and use stable error codes. Middleware adds rate limits, backoff, and correlation identifiers. Traces include child spans for database and cache calls, and metrics track latency percentiles. Readiness removes instances from rotation, then graceful shutdown drains workers.”
Senior:
“I enforce a service template with thiserror taxonomies, tower policy layers, and tracing with parent-based sampling. I gate risky dependencies with circuit breakers, and I bound cardinality. Deploys use readiness to drain, then signal-driven cancellation with deadlines and durable checkpoints. Telemetry pipelines ship traces, metrics, and logs together for precise incident triage.”
Evaluation Criteria
Look for a coherent plan that combines structured errors (clear codes, context), protective tower middleware (timeouts, retries, limits, circuit breaking), and comprehensive tracing with OpenTelemetry export. The candidate should distinguish liveness from readiness, describe graceful shutdown that stops intake and drains in-flight work, and show awareness of idempotency and durable checkpoints. Strong signals include correlation identifiers, bounded cardinality, latency histograms, and error budgets. Red flags are ad hoc error strings, missing middleware, no parent-based sampling, and immediate hard exits that risk data loss.
Preparation Tips
Create a starter Rust service with axum or actix-web and a tower stack for timeouts, retries, and logging. Define domain errors with thiserror and a unified response mapper. Add tracing spans for each dependency and export to a local OpenTelemetry collector; verify correlation identifiers link logs to traces. Implement liveness and readiness endpoints and simulate dependency failures. Add a signal handler that stops listeners, cancels tasks, drains a queue, and flushes telemetry. Write tests that force timeouts, duplicate deliveries, and mid-flight termination. Measure latency histograms, error rates, and queue depth; set alert thresholds and practice the runbook.
Real-world Context
A payments service replaced string errors with thiserror enums and saw mean time to resolution drop because dashboards grouped failures by stable codes. Adding tower timeouts and circuit breaking prevented a downstream outage from saturating threads, keeping the service responsive. With tracing and OpenTelemetry, operators followed a request from gateway to database and identified a slow index through span durations. During deploys the platform used readiness to drain connections; a signal-driven graceful shutdown let workers finish messages and checkpoint offsets. After a crash test, idempotent handlers and durable checkpoints ensured no duplicate charges and no lost events.
Key Takeaways
- Use thiserror for domain specificity and anyhow for contextual chains.
- Enforce timeouts, retries, rate limits, and circuit breaking with tower middleware.
- Instrument deep tracing spans and export through OpenTelemetry with parent-based sampling.
- Separate liveness from readiness and publish health that reflects dependencies.
- Implement graceful shutdown that stops intake, cancels safely, drains in-flight work, and flushes telemetry.
Practice Exercise
Scenario:
You own a Rust web service that processes orders and publishes events. Under dependency slowness the service must stay responsive, and during deploys it must not lose or duplicate work. You need observability to diagnose issues quickly.
Tasks:
- Define domain errors with thiserror and map them to stable response codes. In handlers, wrap operations with anyhow and contextual messages.
- Build a tower stack: request identifier, structured logging, timeout per dependency, retry with exponential backoff for transient classes, rate limit, and a circuit breaker.
- Instrument routes with tracing root spans and child spans for database, cache, and queue operations. Add metrics for request rate, success rate, latency percentiles, queue depth, and breaker state.
- Export traces and metrics to an OpenTelemetry collector with parent-based sampling. Include resource attributes for service name, version, and region.
- Implement liveness and readiness endpoints. Readiness must fail if essential dependencies are down.
- Implement graceful shutdown: on signal, stop accepting new connections, broadcast cancellation, wait for in-flight requests to finish, drain a message queue with idempotency keys, checkpoint offsets, and flush telemetry.
- Write tests and a chaos script that introduces timeouts and kills the process mid-request. Verify that retries respect idempotency and that restart does not duplicate effects.
- Deliver a short runbook with dashboards to investigate top errors and slow spans.
Deliverable:
A reference service that demonstrates robust observability, defensive failure handling, and reliable graceful shutdown with in-flight draining in Rust.

