How do you design observability middleware that scales?

Bake in structured logging, tracing, metrics, and actionable alerting via middleware.
Build observability middleware for structured logs, tracing, metrics, and automated alerts—at low latency.

Answer

I design observability as pluggable middleware that standardizes context and emits data at every hop. Requests get a correlation ID, tenant/user tags, and timing; logs are structured JSON with levels, codes, and redaction. Tracing wraps handlers and clients, propagating context over HTTP/gRPC with sampling. Metrics record SLIs (latency, errors, saturation) via histograms and counters. Alerts derive from SLOs; noisy ones are suppressed with burn-rate windows.

Long Answer

Observability middleware turns each request into evidence of behavior and performance. I design logs, traces, metrics, and alerts as one contract that instruments the critical path with low overhead and privacy by default.

1) Structured logging schema
Canonical JSON fields: time, level, service, env, trace_id, span_id, correlation_id, route, method, status, latency_ms, tenant (hashed), event_code. Middleware attaches this context to request locals and outbound clients so every log joins to its trace. PII is hashed or dropped; bodies sampled behind flags. Hot paths emit concise logs; verbose detail comes via dynamic sampling and debug tokens.
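A minimal sketch of this schema in code, assuming a plain WSGI service and only the Python standard library; the field names and the X-Correlation-ID / X-Tenant-ID headers mirror the schema above but are illustrative, not a fixed standard:

import hashlib
import json
import logging
import time
import uuid
from datetime import datetime, timezone

logger = logging.getLogger("request")
logger.setLevel(logging.INFO)
_handler = logging.StreamHandler()
_handler.setFormatter(logging.Formatter("%(message)s"))  # the message is already JSON
logger.addHandler(_handler)

def hash_tenant(tenant_id: str) -> str:
    # Hash rather than log the raw tenant identifier (privacy by default).
    return hashlib.sha256(tenant_id.encode()).hexdigest()[:16]

class RequestLogMiddleware:
    """WSGI middleware that emits one canonical JSON log line per request."""

    def __init__(self, app, service: str, env: str):
        self.app, self.service, self.env = app, service, env

    def __call__(self, environ, start_response):
        corr_id = environ.get("HTTP_X_CORRELATION_ID") or uuid.uuid4().hex
        start = time.perf_counter()
        status = {}

        def _start_response(code, headers, exc_info=None):
            status["code"] = int(code.split(" ", 1)[0])
            headers.append(("X-Correlation-ID", corr_id))  # echo the ID to the caller
            return start_response(code, headers, exc_info)

        try:
            return self.app(environ, _start_response)
        finally:
            tenant = environ.get("HTTP_X_TENANT_ID")
            logger.info(json.dumps({
                "time": datetime.now(timezone.utc).isoformat(),
                "level": "INFO",
                "service": self.service,
                "env": self.env,
                "correlation_id": corr_id,
                # Raw path shown for brevity; prefer the route template
                # (e.g. /orders/{id}) in production to keep cardinality low.
                "route": environ.get("PATH_INFO"),
                "method": environ.get("REQUEST_METHOD"),
                "status": status.get("code"),
                "latency_ms": round((time.perf_counter() - start) * 1000, 2),
                "tenant": hash_tenant(tenant) if tenant else None,
                "event_code": "http.request",
            }))

Wrapping an app is one line, e.g. app = RequestLogMiddleware(app, service="checkout", env="prod"); once trace IDs are added (next section), every log line joins to its trace through the shared fields.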

2) Distributed tracing by default
OpenTelemetry wraps inbound handlers and DB/queue/HTTP clients. Each request creates a root span propagated with W3C Trace-Context/Baggage. Attributes: route, status, retries, cache hits, bounded payload size. Sampling: small baseline with boosts for errors, p95 latency, new releases, or targeted tenants. Exemplars link trace IDs to histogram buckets for p99 drills.
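A minimal sketch with the OpenTelemetry Python SDK (opentelemetry-api / opentelemetry-sdk), using a console exporter for brevity; the 5% baseline ratio and the service name are illustrative, and true tail-aware boosting for errors or slow requests is usually done in a collector-side tail sampler rather than in this head sampler:

from opentelemetry import trace
from opentelemetry.propagate import extract, inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Small baseline sample; parent-based so downstream services honor upstream decisions.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.05)))
# Swap ConsoleSpanExporter for an OTLP exporter in production; batching runs off-thread.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("api-service")

def handle(headers: dict, route: str) -> dict:
    # Continue the caller's trace if a W3C traceparent header is present.
    ctx = extract(headers)
    with tracer.start_as_current_span(route, context=ctx, kind=trace.SpanKind.SERVER) as span:
        span.set_attribute("http.route", route)
        span.set_attribute("cache.hit", False)  # illustrative attribute
        outbound: dict = {}
        inject(outbound)  # adds traceparent/baggage headers for the next hop
        return outbound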

3) Metrics that mirror UX
Expose RED/USE/golden signals—counters (requests_total{route,code}, errors_total{type}); histograms (request_duration_seconds with SLO buckets, db_latency); gauges (in_flight, worker_utilization). Middleware timestamps start/stop and updates metrics even on errors; label cardinality is capped via route templates and allow-lists.
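A sketch of the RED metrics with prometheus_client, assuming the wrapped handler returns an HTTP status code; the bucket boundaries are SLO-aligned examples, and the route label must be a route template, never a raw URL, to keep cardinality bounded:

import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("requests_total", "HTTP requests", ["route", "method", "code"])
ERRORS = Counter("errors_total", "Request errors by exception type", ["type"])
LATENCY = Histogram(
    "request_duration_seconds", "Request latency",
    ["route", "method"],
    buckets=(0.05, 0.2, 0.5, 2.0),  # SLO-aligned: 50/200/500/2000 ms
)
IN_FLIGHT = Gauge("in_flight_requests", "Requests currently being handled")

def observe(route: str, method: str, handler):
    """Record RED metrics around a handler, even when it raises."""
    IN_FLIGHT.inc()
    start = time.perf_counter()
    code = "500"
    try:
        code = str(handler())  # handler returns the HTTP status code
        return code
    except Exception as exc:
        ERRORS.labels(type=type(exc).__name__).inc()
        raise
    finally:
        IN_FLIGHT.dec()
        LATENCY.labels(route=route, method=method).observe(time.perf_counter() - start)
        REQUESTS.labels(route=route, method=method, code=code).inc()

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape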

4) Context across boundaries
Outbound calls and async jobs inherit correlation IDs/tracing headers. Retries/backoffs annotate spans with attempt numbers. Idempotency keys are logged to reconstruct journeys.
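A sketch of cross-boundary propagation, assuming the OpenTelemetry propagation API; queue_client, enqueue, and the header names are hypothetical stand-ins for whatever HTTP client or message broker the stack actually uses:

import json
from opentelemetry.propagate import inject

def outbound_headers(correlation_id: str, attempt: int = 1) -> dict:
    """Headers every outbound HTTP/gRPC call carries; also reused as message attributes."""
    headers = {
        "X-Correlation-ID": correlation_id,
        "X-Attempt": str(attempt),  # retries annotate the attempt number on the next hop
    }
    inject(headers)  # adds W3C traceparent/baggage from the currently active span
    return headers

def enqueue(queue_client, body: dict, correlation_id: str, idempotency_key: str) -> None:
    """Async jobs inherit the same context through message attributes."""
    message = {
        "body": body,
        "attributes": {
            **outbound_headers(correlation_id),
            "Idempotency-Key": idempotency_key,  # logged on both ends to reconstruct the journey
        },
    }
    queue_client.send(json.dumps(message))  # queue_client: stand-in for SQS/RabbitMQ/etc.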

5) Error taxonomy & policy hooks
Errors normalize to (kind, code, cause, http_status, retriable, user_safe). Middleware maps to HTTP codes, emits consistent logs, bumps metrics, and tags traces; sensitive details are masked in responses but preserved (redacted) in telemetry. Throttling protects sinks during storms.
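One way to express that taxonomy in Python; the AppError fields mirror the tuple above, and the helper names are illustrative rather than a library API:

from dataclasses import dataclass
from typing import Optional

@dataclass
class AppError(Exception):
    kind: str                    # e.g. "validation", "dependency", "internal"
    code: str                    # stable machine-readable code, e.g. "payment.card_declined"
    http_status: int
    retriable: bool
    user_safe: str               # message safe to return to clients
    cause: Optional[BaseException] = None

def to_response(err: AppError) -> tuple[int, dict]:
    """Map a typed error to an HTTP response; sensitive detail stays in telemetry only."""
    return err.http_status, {
        "error": err.code,
        "message": err.user_safe,
        "retriable": err.retriable,
    }

def to_log_fields(err: AppError) -> dict:
    """Fields attached to the structured log and the active span (cause is redacted to its type)."""
    return {
        "error.kind": err.kind,
        "error.code": err.code,
        "error.retriable": err.retriable,
        "error.cause": type(err.cause).__name__ if err.cause else None,
    }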

6) Alerting from SLOs
Alerts derive from SLOs (availability, latency, quality) using multi-window, multi-burn-rate policies that page on sustained budget burn; informational symptoms become tickets. Alerts link to runbooks and dashboards; charts mark releases and feature flags.
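The burn-rate math behind those policies, sketched in Python for clarity; in practice the windows are evaluated as rules in the alerting backend, and the thresholds below are illustrative defaults for a 99.9% SLO over a 30-day period:

SLO_TARGET = 0.999              # e.g. 99.9% availability over 30 days
ERROR_BUDGET = 1 - SLO_TARGET   # 0.1% of requests may fail

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is burning."""
    return error_ratio / ERROR_BUDGET

def should_page(err_5m: float, err_1h: float, err_24h: float) -> bool:
    """Page only when a short and a long window both confirm sustained burn."""
    fast = burn_rate(err_5m) > 14.4 and burn_rate(err_1h) > 14.4   # ~2% of a 30-day budget in 1h
    slow = burn_rate(err_1h) > 3.0 and burn_rate(err_24h) > 3.0    # slower but sustained burn
    return fast or slow

# Example: a 1.5% error rate is a 15x burn against a 0.1% budget and pages if sustained.
assert abs(burn_rate(0.015) - 15.0) < 1e-9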

7) Privacy, performance, cost
PII redaction maps, security-event sampling overrides, tiered retention with TTLs, and caps on event size/rate. Hot path uses lock-free counters and batched exporters; exporters run off-thread with backpressure. Overhead targets <1–2% CPU and <1ms via precomputed labels and low cardinality.
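A sketch of the redaction map and size cap, assuming a default-deny policy for unlisted fields; the specific field names and the 8 KB cap are illustrative:

import hashlib

# Field-level policy: keep, hash, or drop. Unknown fields are dropped by default.
REDACTION_MAP = {
    "route": "keep",
    "status": "keep",
    "latency_ms": "keep",
    "user_id": "hash",
    "email": "hash",
    "card_number": "drop",
}
MAX_EVENT_BYTES = 8_192  # illustrative per-event size cap

def redact(fields: dict) -> dict:
    out = {}
    for key, value in fields.items():
        action = REDACTION_MAP.get(key, "drop")  # default-deny anything unlisted
        if action == "keep":
            out[key] = value
        elif action == "hash":
            out[key] = hashlib.sha256(str(value).encode()).hexdigest()[:16]
        # "drop": omit the field entirely
    return out

def cap_event(serialized: str) -> str:
    """Truncate oversized events so one bad payload cannot flood the pipeline."""
    return serialized if len(serialized) <= MAX_EVENT_BYTES else serialized[:MAX_EVENT_BYTES]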

8) Dev experience & ownership
Local pretty logs + in-memory exporters; test helpers assert metrics/spans; scaffolding enforces consistency. Dashboards track golden signals and top user journeys with visible error budgets. Ownership metadata routes pages; scorecards track alert fatigue and MTTR.
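A sketch of such test helpers, assuming prometheus_client and the OpenTelemetry SDK's in-memory span exporter; make_test_tracer and assert_counter_incremented are illustrative helper names, not library APIs:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from prometheus_client import REGISTRY

def make_test_tracer():
    """Tracer wired to an in-memory exporter so tests can assert on emitted spans."""
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    return provider.get_tracer("test"), exporter

def assert_counter_incremented(name: str, labels: dict, before: float) -> None:
    """Assert a labeled Prometheus counter went up by exactly one."""
    after = REGISTRY.get_sample_value(name, labels) or 0.0
    assert after == before + 1, f"{name}{labels}: expected {before + 1}, got {after}"

def test_span_has_route_attribute():
    tracer, exporter = make_test_tracer()
    with tracer.start_as_current_span("GET /orders/{id}") as span:
        span.set_attribute("http.route", "/orders/{id}")
    finished = exporter.get_finished_spans()
    assert finished[0].attributes["http.route"] == "/orders/{id}"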

This middleware makes telemetry a product feature—consistent, privacy-aware, and actionable—so teams answer “what broke?” and “who is affected?” fast.

Table

Area | Practice | Implementation | Outcome
Logging | Canonical JSON; redaction | trace_id, corr_id, route, latency; PII hash | Joinable, private logs
Tracing | OTel spans & propagation | W3C Trace-Context; root span; auto-instrumentation | End-to-end causality
Metrics | RED/USE signals | Counters, SLO histograms, gauges | Clear SLIs, low cardinality
Sampling | Tail-aware boosts | 1–10% base + error/latency bump | Data when it matters
Errors | Typed taxonomy | kind/code/cause/status; mapped responses | Consistent signals
Boundaries | Cross-hop context | IDs in HTTP/gRPC/queues; idempotency keys | Trace continuity
Alerting | SLO, multi-window | 2h/24h burn-rate; page vs ticket | Low-noise paging
Privacy/Cost | Redaction, TTLs, caps | PII maps; size/rate limits; tiered retention | Compliance, predictable spend
Performance | Low-overhead hot path | Lock-free counters; batched export | <1 ms added, low CPU
DX/Ownership | Scaffolds, dashboards | Test helpers; golden boards; owners | Fast adoption; right team paged

Common Mistakes

  • Treating logs, traces, and metrics as separate add-ons instead of one contract.
  • Unstructured text logs with no correlation IDs or redaction.
  • Unlimited labels (user IDs, full URLs) that explode metrics cardinality and cost.
  • Head-based sampling only, so rare errors and new releases go unseen.
  • No typed error taxonomy; every failure is a 500 with identical logs.
  • Alerting on raw thresholds (CPU > 80%) rather than SLO burn, causing page noise.
  • PII in logs “for debugging,” with no retention or TTL.
  • Forgetting async context propagation in queues/jobs, breaking traces.
  • Exporters running on the request thread, adding tail latency.
  • No runbooks or owners on alerts, so the wrong team is paged.
  • Missing backpressure and size caps, so bursts overload the logging pipeline.
  • Dashboards per microservice but none per user journey, hindering RCA.
  • Missing release/flag markers on charts, hiding change correlation.
  • No CI tests for emitted metrics/spans, so regressions slip through.

Sample Answers

Junior:
I add middleware that logs JSON with a correlation ID and status. I use OpenTelemetry to create a span per request and export to the local collector. Prometheus records request duration histograms and error counters. Alerts are basic but tied to a latency SLO. I avoid logging PII and cap log sizes.

Mid:
I define a logging schema and inject IDs into inbound/outbound calls. Tracing propagates W3C headers across HTTP and queues; retries annotate spans. Metrics follow RED with SLO-aligned buckets and label allow-lists. Alerts use two burn-rate windows; each links to a runbook and dashboard. Exporters are batched off-thread to keep tail latency low.

Senior:
I ship a reusable observability middleware kit: structured logs with redaction maps; OTel spans on handlers, DB, cache, and jobs; metrics for RED/USE; exemplars link histograms to traces. Sampling is tail-aware (boost on errors/new releases). SLOs drive paging; everything else tickets. Ownership metadata, release markers, and test helpers make telemetry consistent and actionable.

Evaluation Criteria

  • Design clarity: Treats logs, traces, metrics, and alerts as one system with a declared schema and contract.
  • Logging quality: JSON, correlation IDs, redaction, stable event codes; concise on hot paths.
  • Tracing depth: OTel everywhere (inbound, DB/queue/HTTP); W3C propagation; sampling strategy with tail boosts and exemplars.
  • Metrics rigor: RED/USE with SLO-aligned histograms; cardinality controls; coverage of async jobs.
  • Alert hygiene: SLO burn-rate with multi-window policies; links to runbooks/dashboards; low page noise.
  • Privacy & cost: PII controls, TTLs, caps; exporters off-thread with backpressure; measured overhead.
  • Operations: Dashboards per service and user journey; ownership metadata; release/flag markers.
  • DX & testing: Local zero-config, test helpers asserting metrics/spans, CI checks for telemetry.
    Red flags: Unstructured logs, no IDs, head-only sampling, high-cardinality labels, threshold-only alerts, exporters on request threads, no runbooks or owners.

Preparation Tips

  • Draft a minimal observability contract: logging schema, required labels, trace attributes, SLI list.
  • Implement middleware that injects correlation/trace IDs and emits JSON logs; add redaction maps and size caps.
  • Add OpenTelemetry to handlers and clients; verify W3C propagation locally with a collector/Jaeger.
  • Expose RED metrics with SLO-aligned buckets; enforce label allow-lists to avoid cardinality blowups.
  • Configure burn-rate alerts (short + long windows) for one SLO; link to a runbook and dashboard.
  • Build a dashboard template per service and one per top user journey; add release/flag markers.
  • Move exporters off-thread; batch and compress. Measure overhead and set budgets (<1 ms, <2% CPU).
  • Write test helpers that assert metrics/spans/log fields during integration tests.
  • Run a privacy pass: PII inventory, redaction, TTLs. Enable sampling exceptions for security events.
  • Rehearse a 60-second pitch: “schema, propagation, RED metrics, SLO alerts, low overhead, strong privacy.”

Real-world Context

Checkout latency spike: A retailer saw p99 jump after a release. Exemplars linked the p99 bucket to traces showing cache misses on a new flag. Rollback fixed it within 10 minutes; the SLO burn-rate alert paged once, not every minute.

Queue black hole: A SaaS job runner lost correlation across SQS. Adding IDs to message headers restored trace continuity and exposed a retry storm; a backoff fix cut errors by 70%.

Log bill shock: High-cardinality user labels exploded cost. Switching to route templates and an allow-list reduced cardinality by 90% and cut the logging bill in half with no loss of signal.

On-call sanity: Burn-rate policies plus runbooks cut pages/week from 30 to 6; MTTR fell as dashboards linked alerts to the last deployment and owner team.

Privacy audit: A fintech mapped PII fields and applied redaction in middleware. Logs became privacy-safe, and tiered retention with TTLs passed compliance without starving engineers of data. Moving exporters off-thread shaved ~0.7 ms from tail latency.

Key Takeaways

  • Treat logs, traces, metrics, and alerts as one contract.
  • Use structured JSON, correlation IDs, and redaction by default.
  • Instrument with OpenTelemetry; add tail-aware sampling and exemplars.
  • Expose RED/USE metrics and alert on SLO burn rates, not raw thresholds.
  • Keep overhead low; add ownership, runbooks, and journey dashboards.

Practice Exercise

Scenario:
You’re introducing observability middleware into a polyglot web stack (edge gateway, API, worker jobs). Outages are noisy, logs are unstructured, and no one can tie p99 spikes to a change. Build a plan and prove it in code.

Tasks:

  1. Contract: Define a one-page schema for logs (JSON fields), trace attributes, and SLIs (requests, errors, latency). Publish event codes and a redaction map.
  2. Middleware: Implement correlation/trace ID injection at the gateway; propagate W3C headers to APIs/workers. Emit JSON logs with route template, status, latency.
  3. Tracing: Add OpenTelemetry spans to inbound handlers and DB/HTTP clients; enable exemplars from latency histograms to traces.
  4. Metrics: Export RED metrics with SLO buckets (e.g., 50/200/500/2000 ms). Cap labels to service, route, method, code.
  5. Alerting: Create a latency SLO and two burn-rate alerts (fast 5m/1h, slow 1h/24h). Link to a runbook that includes dashboards and rollback steps.
  6. Privacy/Cost: Enforce PII redaction, size/rate caps, and tiered retention. Sample verbose logs; boost sampling for errors/new releases.
  7. DX: Provide a docker-compose with collector + UI; add test helpers that assert on metrics/spans/log fields in CI.
  8. Demo: Trigger a controlled regression (disable cache). Show: alert fires once, dashboard highlights release marker, exemplar jumps from p99 to the slow span, and owner team is paged.

Deliverable:
Repo + docs demonstrating the middleware, dashboards, alerts, and a recorded demo of detection → diagnosis → rollback within SLO policy.
