How do you design robust error handling and observability in Go?

Use Go errors with wrapping, custom types, and rich context that feeds tracing and metrics.
Master Go error management: wrap and classify errors, expose causes, and wire signals into logs, traces, and SLO-driven alerts.

Answer

In Go, I treat errors as values: return them, wrap with context (fmt.Errorf("…: %w", err)), and classify via custom types and sentinels. Callers use errors.Is/As to branch on retryability or user-facing messages. Each boundary adds context (operation, resource, IDs) and observability hooks: structured logs, metrics (error counters by class), and trace spans with error status. Across services, I propagate correlation IDs, redact PII, and map error classes to SLO-driven alerts.

Long Answer

Great Go error handling balances clarity for developers, signal quality for operators, and safety for users. I design around three pillars: classification, context, and observability—with consistent patterns from libraries to edge APIs.

1) Errors as values with strong contracts

Go’s explicit returns make control flow visible. Every exported function’s doc states whether it can return temporary (retryable) vs permanent errors, and what sentinel/custom types it may yield. Inside packages I keep an internal error taxonomy; at boundaries (HTTP, gRPC, CLI) I translate to domain-specific responses without leaking internals.
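
A minimal sketch of such a contract; the package, Store interface, and names are illustrative:

package user

import (
    "context"
    "errors"
)

// ErrNotFound is returned when no user matches the given ID.
// It is permanent: callers should not retry.
var ErrNotFound = errors.New("user: not found")

type User struct{ ID, Name string }

// Store documents its error contract: Get returns ErrNotFound
// (possibly wrapped) when the user does not exist, and a retryable
// error only on transient storage failures.
type Store interface {
    Get(ctx context.Context, id string) (*User, error)
}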

2) Wrapping and preserving cause

I wrap with %w to preserve the chain:

if err := repo.Save(ctx, u); err != nil {
    return fmt.Errorf("user.save id=%s: %w", u.ID, err)
}

Callers check with errors.Is (for sentinels like ErrNotFound) or errors.As (for typed errors, e.g., *RateLimitError). This yields precise branching—retry/backoff on timeouts, 404s for not found, 400s for validation.
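
A sketch of the caller side, assuming the ErrNotFound sentinel above and the *RateLimitError type introduced in the next section (imports: errors):

// shouldRetry inspects the wrapped chain and decides retryability.
func shouldRetry(err error) bool {
    var rl *RateLimitError
    switch {
    case err == nil:
        return false
    case errors.Is(err, ErrNotFound): // permanent: surface as 404 upstream
        return false
    case errors.As(err, &rl): // throttled: retry after backoff
        return rl.Temporary()
    default:
        return false
    }
}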

3) Custom types and interfaces

I define typed errors for behavior, not cosmetics:

// Retryable marks errors that callers may safely retry.
type Retryable interface{ Temporary() bool }

// RateLimitError reports throttling; Reset is the number of seconds until the limit clears.
type RateLimitError struct{ Limit, Reset int }

func (e *RateLimitError) Error() string   { return "rate limited" }
func (e *RateLimitError) Temporary() bool { return true }

At edges I convert to protocol codes (gRPC codes.ResourceExhausted, HTTP 429) and include safe metadata (retry-after).
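
A hedged sketch of that edge translation for gRPC, assuming the types above and the google.golang.org/grpc status and codes packages:

import (
    "errors"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// toGRPC converts domain errors to canonical codes without leaking internals.
func toGRPC(err error) error {
    var rl *RateLimitError
    switch {
    case err == nil:
        return nil
    case errors.As(err, &rl):
        return status.Errorf(codes.ResourceExhausted,
            "rate limited, retry after %d seconds", rl.Reset)
    case errors.Is(err, ErrNotFound):
        return status.Error(codes.NotFound, "not found")
    default:
        return status.Error(codes.Internal, "internal error")
    }
}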

4) Context: who/what/where

Each layer adds operation context: operation name, resource key, shard/region, and a stable correlation ID from context.Context (e.g., request-id, trace-id). I avoid spewing raw input or PII; instead log hashed IDs and counts. For libraries, I accept a context.Context and never capture globals, enabling cancelation and deadlines that reduce cascading failures.
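
A minimal sketch of that pattern with log/slog; requestIDKey and requestIDFrom are hypothetical stand-ins for whatever middleware places the correlation ID on the context:

type ctxKey string

const requestIDKey ctxKey = "request-id"

func requestIDFrom(ctx context.Context) string {
    if id, ok := ctx.Value(requestIDKey).(string); ok {
        return id
    }
    return "unknown"
}

// logError emits one structured record per failure: operation context,
// a joinable correlation ID, and the error message (no raw input or PII).
func logError(ctx context.Context, op, resource string, err error) {
    slog.ErrorContext(ctx, "operation failed",
        slog.String("op", op),
        slog.String("resource", resource),
        slog.String("request_id", requestIDFrom(ctx)),
        slog.String("err", err.Error()),
    )
}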

5) Observability glue (logs, metrics, traces)

  • Structured logs: key–value fields (op, err.class, resource, tenant, attempt) with log levels by policy. Application messages stay concise; stack traces only at debug or when a panic occurs behind a recover boundary.
  • Metrics: counters like app_errors_total{op, class, code, retryable} and histograms for latency. Error rates roll into SLO burn dashboards.
  • Tracing: OpenTelemetry spans mark StatusError; I attach the wrapped message (sanitized) and error attributes (error.type, error.cause). Exemplars connect spikes in error counters to specific traces.
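
A sketch of the glue behind the metrics and tracing bullets above, assuming a Prometheus registry and an OpenTelemetry tracer are already initialized; the class label comes from the error taxonomy:

import (
    "context"

    "github.com/prometheus/client_golang/prometheus"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
)

var errCounter = prometheus.NewCounterVec(
    prometheus.CounterOpts{Name: "app_errors_total", Help: "Errors by operation and class."},
    []string{"op", "class"},
)

func init() { prometheus.MustRegister(errCounter) }

// recordError wires a single failure into metrics and the active span.
func recordError(ctx context.Context, op, class string, err error) {
    errCounter.WithLabelValues(op, class).Inc()

    span := trace.SpanFromContext(ctx)
    span.RecordError(err)
    span.SetStatus(codes.Error, class)
    span.SetAttributes(
        attribute.String("error.type", class),
        attribute.String("op", op),
    )
}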

6) Mapping to transport and UX

For HTTP I translate: validation → 400 with machine-readable fields; not found → 404; conflict → 409; policy → 403; internal → 500 with generic message and hidden details. For gRPC I map to canonical codes. Clients get actionable hints (retry-after seconds) without leaking stack frames.
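
A sketch of the HTTP side, assuming sentinels from a pkg/errs-style taxonomy (as in the practice exercise below; imports: errors, net/http):

// httpStatus maps classified errors to status codes; anything
// unclassified deliberately falls through to a generic 500.
func httpStatus(err error) int {
    switch {
    case errors.Is(err, errs.ErrValidation):
        return http.StatusBadRequest
    case errors.Is(err, errs.ErrNotFound):
        return http.StatusNotFound
    case errors.Is(err, errs.ErrConflict):
        return http.StatusConflict
    case errors.Is(err, errs.ErrUnavailable):
        return http.StatusServiceUnavailable
    default:
        return http.StatusInternalServerError
    }
}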

7) Resilience patterns

  • Retries with backoff/jitter only for retryable classes (see the sketch after this list); propagate contexts for deadline budgets.
  • Circuit breakers around flaky deps; errors carry an Upstream field naming the dependency for quick diagnosis.
  • Bulkheads: separate pools; error counters per dependency prevent noisy neighbors from masking signals.
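
A minimal retry sketch, assuming the Retryable interface from section 3; attempts is the maximum number of tries:

import (
    "context"
    "errors"
    "fmt"
    "math/rand"
    "time"
)

// retry re-runs fn only while errors report Temporary() == true, with
// exponential backoff plus jitter, and stops as soon as ctx is done.
func retry(ctx context.Context, attempts int, fn func(context.Context) error) error {
    backoff := 50 * time.Millisecond
    for i := 1; ; i++ {
        err := fn(ctx)
        var r Retryable
        if err == nil || i >= attempts || !errors.As(err, &r) || !r.Temporary() {
            return err
        }
        jitter := time.Duration(rand.Int63n(int64(backoff)))
        select {
        case <-time.After(backoff + jitter):
            backoff *= 2
        case <-ctx.Done():
            return fmt.Errorf("retry aborted after %d attempts: %w", i, ctx.Err())
        }
    }
}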

8) Testing and linting

Table-driven tests assert both behavior and classification: given an injected timeout, the service must return a Temporary() error; given a missing row, errors.Is(err, ErrNotFound) must be true and HTTP code 404 must result. Linters (staticcheck, errcheck) ensure errors are handled—not silently dropped.
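
A table-driven sketch of those assertions, assuming the pkg/errs sentinels from the practice exercise and the httpStatus helper sketched above (imports: errors, fmt, net/http, testing):

func TestClassification(t *testing.T) {
    tests := []struct {
        name     string
        err      error
        wantIs   error
        wantCode int
    }{
        {"missing row survives wrapping", fmt.Errorf("user.load id=42: %w", errs.ErrNotFound), errs.ErrNotFound, http.StatusNotFound},
        {"bad input maps to 400", fmt.Errorf("order.create: %w", errs.ErrValidation), errs.ErrValidation, http.StatusBadRequest},
    }
    for _, tc := range tests {
        t.Run(tc.name, func(t *testing.T) {
            if !errors.Is(tc.err, tc.wantIs) {
                t.Fatalf("errors.Is(%v, %v) = false, want true", tc.err, tc.wantIs)
            }
            if got := httpStatus(tc.err); got != tc.wantCode {
                t.Fatalf("httpStatus = %d, want %d", got, tc.wantCode)
            }
        })
    }
}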

9) Panic boundaries

Library code should not panic for expected states. At process edges (HTTP handler, worker main) I use recover middleware to convert panics into 500s, log with stack trace (rate-limited), mark traces as Error, and keep the process healthy.
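
A standard-library sketch of such a boundary for HTTP handlers (rate limiting of the stack logging is omitted for brevity):

import (
    "log/slog"
    "net/http"
    "runtime/debug"
)

// recoverMiddleware converts panics into sanitized 500s, logs the stack,
// and keeps the process serving other requests.
func recoverMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        defer func() {
            if p := recover(); p != nil {
                slog.Error("panic recovered",
                    slog.Any("panic", p),
                    slog.String("stack", string(debug.Stack())),
                )
                http.Error(w, "internal server error", http.StatusInternalServerError)
            }
        }()
        next.ServeHTTP(w, r)
    })
}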

10) Evolving taxonomies

As systems grow, I keep a central error registry (package errs) defining classes (Invalid, NotFound, Conflict, Unauthenticated, Permission, Unavailable, Deadline, Internal). Each wraps an underlying cause. This keeps dashboards consistent and prevents ad-hoc labels that dilute signal.
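
One possible shape for such a registry; the Kind values mirror the classes above, and every Error keeps its cause reachable via Unwrap:

// Package errs centralizes the taxonomy: a Kind per class, a wrapper
// carrying operation and cause, and a helper for classification.
package errs

import (
    "errors"
    "fmt"
)

type Kind int

const (
    Internal Kind = iota
    Invalid
    NotFound
    Conflict
    Unauthenticated
    Permission
    Unavailable
    Deadline
)

type Error struct {
    Kind Kind
    Op   string
    Err  error
}

func (e *Error) Error() string { return fmt.Sprintf("%s: %v", e.Op, e.Err) }
func (e *Error) Unwrap() error { return e.Err }

// E wraps a cause with an operation name and a class.
func E(op string, kind Kind, err error) error {
    return &Error{Kind: kind, Op: op, Err: err}
}

// KindOf reports the outermost class in the chain, defaulting to Internal.
func KindOf(err error) Kind {
    var e *Error
    if errors.As(err, &e) {
        return e.Kind
    }
    return Internal
}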

By treating error values as first-class domain signals, preserving cause with wrapping, exposing behavior via types, and feeding high-quality context into logs/metrics/traces, Go services become debuggable in development and observable in production—without leaking secrets or overwhelming operators.

Table

Aspect | Practice | Implementation | Outcome
Wrapping | Preserve cause with context | fmt.Errorf("op: %w", err) | Root-cause intact, rich breadcrumbs
Classification | Sentinels + typed errors | errors.Is/As, Temporary() | Precise branching & retries
Context | IDs, op, resource, region | Add fields from context.Context | Joinable logs & traces
Transport | Canonical code mapping | HTTP 4xx/5xx, gRPC codes | Predictable client behavior
Metrics | Counters by class/op | app_errors_total{class, op} | Trendable SLO signals
Tracing | Spans + error attrs | OTel StatusError, attributes | Fast drill from SLO to trace
Privacy | Redact PII, hash IDs | Log policy + linters | Safe, compliant telemetry
Resilience | Retry/backoff, CBs | Only for Temporary(), breakers | Fewer cascades, controlled load
Testing | Table-driven checks | Assert Is/As, HTTP mapping | Regression-proof contracts
Panics | Recover at edges | Middleware logs stack, 500 | Process stays up, clean signal

Common Mistakes

  • Losing causes by creating new errors without %w, breaking root-cause analysis.
  • Using only strings; no typed errors or sentinels, so callers can’t branch correctly.
  • Encoding user/secret data in error messages that escape to logs/clients.
  • Mapping every failure to HTTP 500; clients can’t distinguish validation vs not-found vs conflict.
  • Retrying on all errors; hammering dependencies and amplifying incidents.
  • Ignoring contexts: no deadlines/cancelation, leading to orphaned work.
  • Logging stacks for every common error, flooding signal-to-noise.
  • No correlation IDs; can’t stitch logs, traces, and metrics.
  • Panicking for expected states (e.g., empty results), crashing workers.
  • Ad-hoc labels in metrics; dashboards can’t aggregate, alerts flap.

Sample Answers

Junior:
“I return errors and wrap with %w so callers can use errors.Is/As. For HTTP, I convert validation errors to 400 and not-found to 404. I log in JSON with request IDs and avoid putting secrets in messages.”

Mid:
“I define typed errors (e.g., RateLimitError with Temporary()), classify with errors.As, and decide retries with backoff only for temporary classes. Each boundary adds context (op, resource, tenant). I export app_errors_total{class,op} and mark OTel spans with StatusError so we can pivot from alerts to traces.”

Senior:
“I keep a central error taxonomy and helpers: wrap (OpErr(op, err)), classify (Is/As), and translate to transport (HTTP/gRPC) consistently. Observability is first-class: structured logs with correlation IDs, metrics by class, and trace attributes. Privacy is enforced by redaction. Panics are caught at edges; we page on SLO burn, not raw counts, and run table-driven tests to lock mappings.”

Evaluation Criteria

  • Wrapping discipline: Uses %w consistently; preserves causal chains.
  • Classification: Clear taxonomy (sentinels, typed errors) enabling errors.Is/As branching.
  • Context & privacy: Adds op/resource/IDs from context, redacts PII.
  • Transport mapping: Correct, consistent HTTP/gRPC codes; actionable client guidance.
  • Observability: Structured logs, metrics by class/op, trace error status/attrs; correlation IDs.
  • Resilience: Retries/backoff only for temporary errors; circuit breakers and deadlines.
  • Testing: Table-driven tests for Is/As, mapping, and retry decisions.

Red flags: String-only errors, leaked secrets, blanket 500s, retries on permanent errors, no context propagation, panics for normal control flow.

Preparation Tips

  • Build a small library: errs.Wrap(op, err), sentinels (ErrNotFound), and typed errors (Timeout, RateLimit).
  • Practice errors.Is/As branching; write tests proving mapping to HTTP/gRPC codes.
  • Add OpenTelemetry: set span status on error, attach error.type, error.cause.
  • Emit metrics: app_errors_total and latency histograms; create a Grafana panel by class/op.
  • Implement retry with exponential backoff + jitter only for Temporary(); unit-test budgets with context deadlines.
  • Add a recover middleware; assert stack capture and sanitized client message.
  • Write a log policy (fields, redaction rules) and enforce with linters.
  • Run a chaos drill: disable DB, observe retries, breaker open/close, and alert flow tied to SLO burn.
  • Document the taxonomy in pkg/errs for reuse across services.

Real-world Context

Payments API: All DB errors surfaced as 500; operators couldn’t separate conflicts from timeouts. We introduced a taxonomy (Conflict, NotFound, Unavailable) and mapped to HTTP codes. Error counters by class revealed a hotspot; a missing unique index caused most conflicts—fixed in a day.

Messaging service: Retries hammered an unstable broker. Adding Temporary() classification plus jittered backoff cut traffic during incidents by 60% and reduced MTTR.

Multi-region read path: A silent context leak caused background work after client cancellation. Passing ctx through drivers and honoring deadlines eliminated tail-latency outliers.

Observability uplift: Structured logs with trace_id + OTel error spans let on-call pivot from an SLO burn alert directly to the failing query plan; a rolled index resolved p99 error spikes in minutes.

Key Takeaways

  • Use %w to preserve causes; branch with errors.Is/As.
  • Define a small error taxonomy and typed errors for behavior.
  • Add context (op, IDs) and keep messages free of secrets.
  • Map errors to correct HTTP/gRPC codes; guide clients.
  • Feed logs/metrics/traces; alert on SLO burn, not raw counts.

Practice Exercise

Scenario:
You’re building a Go microservice (orders) with HTTP + gRPC endpoints and a Postgres repo. During incidents, operators can’t tell retryable errors from user mistakes, and clients receive inconsistent codes. Implement an error system that improves developer clarity and production observability.

Tasks:

  1. Taxonomy: Create pkg/errs with sentinels (ErrNotFound, ErrConflict, ErrValidation, ErrUnavailable) and typed errors (RateLimitError, TimeoutError implementing Temporary()), plus helpers Wrap(op, err) and Op(op string).
  2. Wrapping: In repo/services, wrap all returns with %w including op, resource IDs, and shard/region from ctx.
  3. Classification: In handlers, decide flows using errors.Is/As. Map to HTTP (400/404/409/429/503/500) and gRPC codes; include safe hints (retry-after).
  4. Observability: Add structured logging (JSON) with trace_id, op, err.class, resource, tenant. Export metrics orders_errors_total{class,op} and latency histograms; set OTel span status and attributes (error.type, error.cause).
  5. Resilience: Implement a retry helper with backoff+jitter that activates only for Temporary() and respects context deadlines. Add a circuit breaker around the repo.
  6. Recovery: Middleware catches panics, logs stack (rate-limited), marks span error, returns sanitized 500.
  7. Tests: Table-driven tests asserting Is/As behavior, HTTP/gRPC mapping, and retry decisions under timeouts vs validation errors.
  8. Runbook: Document classes, mappings, dashboards, and alert policies (burn-rate for availability).

Deliverable:
A minimal repo with pkg/errs, handlers, middleware, metrics/trace wiring, and tests—plus a dashboard screenshot showing errors by class and a failing trace linked from an alert.
