How do you design safe and reliable error handling in Ruby?
Ruby Developer
Answer
Safe and reliable Ruby error handling starts with a clear custom exception hierarchy that encodes intent, not just stack traces. Fail fast on programmer errors, classify operational faults, and surface context with structured logging. Wrap external calls with timeouts and fault-tolerant retries using exponential backoff, jitter, caps, and circuit breakers. Make actions idempotent so retries are safe. Emit correlation identifiers, sanitize sensitive data, and route unrecoverable cases to dead-letter paths for review.
Long Answer
Effective Ruby error handling is a deliberate design that encodes business meaning into exceptions, protects user experience with graceful degradation, and empowers operators to diagnose and recover quickly. The pillars are: an explicit custom exception hierarchy, structured logging with rich context, and fault-tolerant retries that are safe by construction.
1) Model errors with a custom hierarchy
Create a base AppError < StandardError and derive semantically precise types such as ValidationError, NotAuthorizedError, NotFoundError, ConflictError, DependencyTimeout, and RateLimited. Reserve NoMethodError, TypeError, and similar for programmer bugs; do not rescue them broadly. Separate programmer errors (bugs) from operational errors (network, timeouts, resource limits). This taxonomy lets you route failures correctly: present user messages for validation, retry on transient dependencies, and alert on invariants.
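A minimal sketch of such a hierarchy, using the names above (the `code:` keyword for machine-readable error codes is an illustrative addition):

```ruby
# Base type for all domain errors. Deriving from StandardError keeps bare
# `rescue` clauses from accidentally catching system-level exceptions.
class AppError < StandardError
  attr_reader :code

  def initialize(message = nil, code: nil)
    @code = code
    super(message)
  end
end

# User-facing errors: never retried, mapped to 4xx responses.
class ValidationError    < AppError; end
class NotAuthorizedError < AppError; end
class NotFoundError      < AppError; end
class ConflictError      < AppError; end

# Operational errors: transient by nature, candidates for retry.
class DependencyTimeout  < AppError; end
class RateLimited        < AppError; end
```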
2) Fail fast and validate early
Validate inputs at boundaries: controllers, workers, and service objects. Raise ValidationError with a machine-readable code and human message. Use contracts or dry-validation to keep rules explicit. Early validation shrinks blast radius and simplifies retries because only well-formed payloads proceed to side effects.
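A minimal boundary check in plain Ruby might look like the sketch below (the field names and error codes are illustrative; a dry-validation contract would express the same rules declaratively):

```ruby
# Validate at the boundary so only well-formed payloads reach side effects.
def validate_charge!(params)
  amount = params[:amount_cents]
  unless amount.is_a?(Integer) && amount.positive?
    raise ValidationError.new("amount_cents must be a positive integer",
                              code: "charge.amount_invalid")
  end
  if params[:customer_id].to_s.empty?
    raise ValidationError.new("customer_id is required",
                              code: "charge.customer_missing")
  end
  params
end
```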
3) Context-first structured logging
Use a structured logger (for example, Logger with a JSON formatter or a library like Semantic Logger or Lograge in Rails). Every log should include event, severity, service, environment, request_id, correlation_id, user_id (hashed), and tenant. When rescuing, log error_class, message, a sanitized backtrace, and domain context (for example, order_id, amount_cents). Never log secrets, tokens, or personally identifiable information; apply redaction at the formatter. Prefer a small number of well-known error events over free text so alerts can be precise.
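A sketch of a redacting JSON formatter on the standard Logger (the field set, service name, and redaction list are illustrative):

```ruby
require "logger"
require "json"
require "time"

REDACTED_KEYS = %w[password token secret card_number].freeze # illustrative

logger = Logger.new($stdout)
logger.formatter = proc do |severity, time, _progname, payload|
  payload = { message: payload } unless payload.is_a?(Hash)
  # Redact known-sensitive keys at the formatter so no call site can leak them.
  sanitized = payload.to_h { |k, v| [k, REDACTED_KEYS.include?(k.to_s) ? "[REDACTED]" : v] }
  JSON.generate({ severity: severity, time: time.utc.iso8601,
                  service: "payments", env: ENV["APP_ENV"] }.merge(sanitized)) << "\n"
end

# Well-known error events, not free text, so alerts can match on `event`.
logger.info(event: "charge.failed", error_class: "DependencyTimeout",
            correlation_id: "c-42", order_id: 7, token: "sk_live_abc")
```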
4) Timeouts everywhere and defensive wrappers
All external calls can fail. Wrap HTTP, database, cache, and queue interactions with timeouts lower than your user budget. Use libraries that support deadlines and per-call timeouts (Net::HTTP with read_timeout and open_timeout, Faraday timeouts, Redis timeouts). Convert library-specific errors into your hierarchy so callers do not depend on vendor types, for example, rescue Faraday::TimeoutError and raise DependencyTimeout.
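For example, a defensive Faraday wrapper might look like this sketch (the URL, path, and timeout values are illustrative):

```ruby
require "faraday"
require "json"

class GatewayClient
  def initialize(base_url)
    @conn = Faraday.new(url: base_url) do |f|
      f.options.open_timeout = 2 # seconds to establish the connection
      f.options.timeout      = 5 # seconds to read the full response
    end
  end

  def charge(payload)
    response = @conn.post("/charges", JSON.generate(payload),
                          "Content-Type" => "application/json")
    raise RateLimited, "gateway returned 429" if response.status == 429
    response
  rescue Faraday::TimeoutError, Faraday::ConnectionFailed => e
    # Callers depend on the domain type, never on the vendor type.
    raise DependencyTimeout, "payment gateway unreachable: #{e.message}"
  end
end
```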
5) Fault-tolerant retries that do not harm
Retries must be bounded, jittered, and safe. Implement exponential backoff with full jitter to avoid thundering herds. Cap attempts or total elapsed time. Retry only transient classes such as DependencyTimeout or RateLimited. Never retry fatal classes such as ValidationError or NotAuthorizedError. Make operations idempotent: use natural keys, upserts, or compare-and-set so repeat attempts do not double charge or duplicate records. Pair retries with a circuit breaker that opens after consecutive failures, probes half-open with a single request, and closes on success.
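A sketch of such a retry helper and a minimal breaker (attempt caps, base delay, and thresholds are illustrative defaults; production code would add a total-elapsed-time cap and thread safety, or use a gem such as circuitbox):

```ruby
RETRYABLE = [DependencyTimeout, RateLimited].freeze

# Bounded retries with exponential backoff and full jitter.
def with_retries(max_attempts: 5, base: 0.5, max_sleep: 8.0)
  attempt = 0
  begin
    attempt += 1
    yield
  rescue *RETRYABLE
    raise if attempt >= max_attempts
    # Full jitter: sleep a random duration up to the capped exponential step.
    sleep(rand * [base * (2**(attempt - 1)), max_sleep].min)
    retry
  end
end

# Minimal circuit breaker: opens after `threshold` consecutive failures,
# allows a probe after `cool_off` seconds, closes again on success.
class Breaker
  class OpenError < AppError; end

  def initialize(threshold: 5, cool_off: 30)
    @threshold, @cool_off = threshold, cool_off
    @failures, @opened_at = 0, nil
  end

  def call
    raise OpenError, "circuit open" if @opened_at && Time.now - @opened_at < @cool_off
    result = yield
    @failures, @opened_at = 0, nil # success: close the circuit
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @threshold
    raise
  end
end

# Usage: fatal classes such as ValidationError propagate on the first attempt.
# breaker = Breaker.new
# with_retries { breaker.call { gateway.charge(payload) } }
```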
6) Escalation paths and dead letters
After the final retry, record the event for human review. In background systems, route messages to a dead-letter queue with payload digest, error class, and last stack frame. Provide a replay tool that verifies idempotency before reprocessing. In synchronous flows, present a user-safe error and capture a support reference with the correlation identifier.
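A sketch of the dead-letter record, using a JSON-lines file as a stand-in for a real queue or table:

```ruby
require "digest"
require "json"
require "time"

# Record enough to diagnose and replay, but digest the payload rather than
# storing it raw so sensitive fields never land in the dead-letter store.
def dead_letter(payload, error)
  record = {
    at:             Time.now.utc.iso8601,
    payload_digest: Digest::SHA256.hexdigest(JSON.generate(payload)),
    error_class:    error.class.name,
    last_frame:     Array(error.backtrace).first,
    correlation_id: payload[:correlation_id]
  }
  File.open("dead_letters.jsonl", "a") { |f| f.puts(JSON.generate(record)) }
end
```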
7) Global exception boundaries
Install exception boundaries in the web layer (Rack middleware or Rails ActionDispatch::ShowExceptions) and in job runners (Sidekiq error handlers). Translate exceptions into consistent HTTP responses or job outcomes. For example, map NotFoundError to 404, NotAuthorizedError to 403, ValidationError to 422, and operational failures to 503 with a retry-after hint. Centralize rescue logic; do not scatter rescue blocks throughout business code.
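A sketch of such a boundary as Rack middleware (the status mapping follows the text; the response body shape and Retry-After value are illustrative):

```ruby
require "json"
require "securerandom"

class ErrorBoundary
  STATUS = {
    NotFoundError      => 404,
    NotAuthorizedError => 403,
    ValidationError    => 422,
    DependencyTimeout  => 503,
    RateLimited        => 503
  }.freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  rescue AppError => e
    status  = STATUS.fetch(e.class, 500)
    ref     = SecureRandom.uuid # support reference for the user and the logs
    headers = { "content-type" => "application/json" }
    headers["retry-after"] = "5" if status == 503
    [status, headers, [JSON.generate(error: e.class.name, message: e.message, ref: ref)]]
  end
end
```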
8) Observability and alerting
Emit metrics for error counts, rates by class, retry attempts, circuit breaker state, and dead-letter volume. Attach release and commit identifiers to logs and errors to correlate regressions. Integrate an error tracker to group exceptions by fingerprint and surface the top offenders. Alert on user-impacting symptoms such as error budget burn or spikes in DependencyTimeout, not on every single WARN log.
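A sketch of the error-count metric, assuming the statsd-ruby gem and reusing the helper and client sketched above (any metrics client with counters works the same way):

```ruby
require "statsd" # provided by the statsd-ruby gem (an assumed choice)

STATSD = Statsd.new("localhost", 8125)

begin
  with_retries { breaker.call { gateway.charge(payload) } }
rescue AppError => e
  STATSD.increment("errors.#{e.class.name}") # error counts by class
  raise
end
```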
9) Testing and chaos drills
Write tests that assert translation: given Faraday::TimeoutError, the service raises DependencyTimeout. Test the backoff policy deterministically. Fuzz idempotency by invoking the same operation multiple times and asserting one observable effect. Use fault injection in non-production to simulate timeouts, slow responses, and partial failures. Verify that circuit breakers open and close as expected.
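For example, the translation test might look like this RSpec sketch (it stubs the Faraday connection inside the GatewayClient sketched earlier):

```ruby
require "rspec"
require "faraday"

RSpec.describe GatewayClient do
  it "translates vendor timeouts into DependencyTimeout" do
    client = GatewayClient.new("https://gateway.example.com")
    allow_any_instance_of(Faraday::Connection)
      .to receive(:post).and_raise(Faraday::TimeoutError)

    expect { client.charge(amount_cents: 500) }
      .to raise_error(DependencyTimeout)
  end
end
```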
10) Safe messages and user experience
Users should receive clear, non-leaky messages. For validation, echo actionable feedback. For authorization, avoid revealing resource existence. For operational errors, apologize and suggest a retry with a stable support reference. This balance preserves trust while keeping attackers in the dark.
When Ruby error handling encodes intent with a custom exception hierarchy, emits high-fidelity structured logging, and uses fault-tolerant retries with idempotency and circuit breaking, teams achieve reliability without masking real defects. The result is a system that fails predictably, recovers gracefully, and is maintainable under pressure.
Common Mistakes
- Catching Exception or rescuing broadly and swallowing programmer errors.
- Mixing library error types throughout business code instead of converting to a custom exception hierarchy.
- Retrying everything, including validation or authorization failures.
- Implementing retries without backoff or jitter, causing synchronized storms.
- Omitting timeouts and letting calls hang until clients give up.
- Logging raw payloads, secrets, or personally identifiable information.
- Returning generic 500 errors for domain cases like not found or conflict.
- Skipping idempotency so retries double charge or duplicate records.
- Ignoring dead letters, leaving poison messages to recycle forever.
- Alerting on every exception rather than user-impacting symptoms.
Sample Answers (Junior / Mid / Senior)
Junior:
“I define a base AppError and specific child classes like ValidationError and DependencyTimeout. I use structured logging with a correlation identifier and sanitize fields. External calls have timeouts. I retry transient errors with exponential backoff and stop on validation errors.”
Mid:
“I wrap third-party errors and expose a stable custom exception hierarchy to callers. I implement fault-tolerant retries with jitter, caps, and a circuit breaker. Operations are idempotent using upserts or natural keys, and unrecoverable jobs go to a dead-letter queue with metadata and a replay tool.”
Senior:
“I design global exception boundaries in Rack and workers, mapping domain errors to precise HTTP statuses. Logs are structured with tenant and release tags, and alerts are tied to error budget burn. Backoff policies are tested, breakers protect dependencies, and we run chaos drills. Privacy is enforced in logs, and every final failure is auditable and replayable.”
Evaluation Criteria
Strong answers demonstrate a clear custom exception hierarchy, separation of programmer versus operational errors, and conversion of library exceptions into domain types. They show logging as structured, contextual, and sanitized. They use fault-tolerant retries with exponential backoff, jitter, caps, and circuit breakers, and they make operations idempotent so retries are safe. They define global exception boundaries that map to correct HTTP semantics, provide dead-letter handling and replay, and add metrics and alerts tied to user impact. Red flags include rescuing Exception, retrying everything, no timeouts, unstructured logs, no idempotency, and missing dead-letter handling.
Preparation Tips
Create a small Ruby or Rails service with a payment call. Introduce a custom exception hierarchy and wrap Faraday errors into DependencyTimeout or RateLimited. Add structured logging with a JSON formatter, redaction, correlation identifiers, and release tags. Implement a retry helper with exponential backoff, jitter, and caps, and classify retryable versus fatal errors. Add a simple circuit breaker. Make the charge operation idempotent with an upsert keyed by a natural identifier. Add Rack middleware to map ValidationError to 422 and NotAuthorizedError to 403. Build a dead-letter store for final failures with a replay command. Test timeouts, chaos latency, and duplicate attempts.
Real-world Context
A subscription service experienced duplicate charges when a gateway flapped. Introducing idempotency keys and converting gateway errors to DependencyTimeout enabled fault-tolerant retries without double billing. A marketplace drowned in noisy stack traces; structured logging with correlation identifiers and sanitized fields cut mean time to diagnose by more than half. Another team saw synchronized retry storms during a cloud outage; exponential backoff with jitter and a circuit breaker stabilized dependencies. Dead-letter routing plus a replay tool turned mysterious failures into auditable, fixable cases. With a custom exception hierarchy and precise mappings, support could identify user errors versus real incidents instantly.
Key Takeaways
- Encode intent with a custom exception hierarchy and separate programmer from operational errors.
- Use structured, sanitized logging with correlation identifiers and domain context.
- Implement fault-tolerant retries with exponential backoff, jitter, caps, and circuit breakers.
- Ensure idempotency so retries are safe and side effects are not duplicated.
- Provide dead-letter handling, replay, and precise HTTP mappings at global boundaries.
Practice Exercise
Scenario:
You are adding a “charge customer” feature that calls an external payment gateway. The gateway can time out or rate limit, and your team must avoid duplicate charges while providing actionable diagnostics.
Tasks:
- Define AppError plus ValidationError, NotAuthorizedError, DependencyTimeout, RateLimited, and GatewayConflict. Ensure library exceptions are wrapped into these types.
- Implement structured logging with a JSON formatter. Include request_id, correlation_id, user_id (hashed), tenant, error_class, and a sanitized backtrace.
- Add timeouts to the gateway client. Convert timeouts and 429 responses into your domain exceptions.
- Write a retry helper that applies exponential backoff with full jitter, caps attempts and total elapsed time, and retries only DependencyTimeout and RateLimited.
- Make charges idempotent by upserting a payments row keyed by a natural identifier or idempotency key and by verifying final state before attempting a second call.
- Add a circuit breaker around the gateway.
- Build a dead-letter store for final failures with a replay command that respects idempotency.
- Create Rack middleware that maps domain errors to precise HTTP responses and logs a support reference.
Deliverable:
A code sample or design document that demonstrates safe Ruby error handling, structured logging, and fault-tolerant retries that prevent duplicate side effects and accelerate diagnosis.

