How do you design observability middleware that scales?
Middleware Engineer
Answer
I design observability as pluggable middleware that standardizes context and emits data at every hop. Requests get a correlation ID, tenant/user tags, and timing; logs are structured JSON with levels, codes, and redaction. Tracing wraps handlers and clients, propagating context over HTTP/gRPC with sampling. Metrics record SLIs (latency, errors, saturation) via histograms and counters. Alerts derive from SLOs; noisy ones are suppressed with burn-rate windows.
Long Answer
Observability middleware turns each request into evidence of behavior and performance. I design logs, traces, metrics, and alerts as one contract that instruments the critical path with low overhead and privacy by default.
1) Structured logging schema
Canonical JSON fields: time, level, service, env, trace_id, span_id, correlation_id, route, method, status, latency_ms, tenant (hashed), event_code. Middleware attaches this context to request locals and outbound clients so every log joins to its trace. PII is hashed or dropped; bodies sampled behind flags. Hot paths emit concise logs; verbose detail comes via dynamic sampling and debug tokens.
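A minimal sketch of the log-shaping side of this contract using Python's stdlib logging; the REDACT_FIELDS map, service/env values, and context keys are illustrative, not the canonical schema itself.

```python
# Sketch: canonical JSON logs with hashing/redaction applied in the formatter.
# Field names mirror the schema above; REDACT_FIELDS and values are assumptions.
import hashlib
import json
import logging
import time

REDACT_FIELDS = {"email", "ssn", "card_number"}  # hypothetical redaction map

def hash_value(value: str) -> str:
    """One-way hash so tenant/user identifiers stay joinable but unreadable."""
    return hashlib.sha256(value.encode()).hexdigest()[:16]

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        base = {
            "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "api",   # normally injected from env/config
            "env": "prod",
            "message": record.getMessage(),
        }
        # Context attached by middleware (trace_id, correlation_id, route, ...).
        ctx = getattr(record, "ctx", {})
        for key, value in ctx.items():
            if key in REDACT_FIELDS:
                continue                         # drop PII outright
            if key in ("tenant", "user_id"):
                value = hash_value(str(value))   # hash, keep joinable
            base[key] = value
        return json.dumps(base)

logger = logging.getLogger("obs")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage: middleware passes request context via `extra`.
logger.info("request complete", extra={"ctx": {
    "correlation_id": "abc-123", "route": "/orders/{id}", "method": "GET",
    "status": 200, "latency_ms": 42, "tenant": "acme", "event_code": "REQ_OK",
}})
```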
2) Distributed tracing by default
OpenTelemetry wraps inbound handlers and DB/queue/HTTP clients. Each request creates a root span propagated with W3C Trace-Context/Baggage. Attributes: route, status, retries, cache hits, bounded payload size. Sampling: small baseline with boosts for errors, p95 latency, new releases, or targeted tenants. Exemplars link trace IDs to histogram buckets for p99 drills.
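A sketch of how the handler and client wrappers might look with the OpenTelemetry Python API; the tracer name, routes, and attributes are placeholders, and a real setup configures an SDK exporter separately.

```python
# Sketch: inbound span continues the caller's W3C context; outbound call
# injects traceparent/baggage for the next hop. Names are illustrative.
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("checkout-api")

def handle_request(headers: dict, route: str) -> None:
    parent_ctx = extract(headers)   # resume trace from W3C traceparent/baggage
    with tracer.start_as_current_span(
        route, context=parent_ctx, kind=trace.SpanKind.SERVER
    ) as span:
        span.set_attribute("http.route", route)
        span.set_attribute("app.cache_hit", False)  # example bounded attribute
        call_downstream()

def call_downstream() -> None:
    with tracer.start_as_current_span("GET /inventory", kind=trace.SpanKind.CLIENT):
        outbound_headers: dict = {}
        inject(outbound_headers)   # writes traceparent (+ baggage) for the next hop
        # http_client.get("http://inventory/api", headers=outbound_headers)
```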
3) Metrics that mirror UX
Expose RED/USE/golden signals—counters (requests_total{route,code}, errors_total{type}); histograms (request_duration_seconds with SLO buckets, db_latency); gauges (in_flight, worker_utilization). Middleware timestamps start/stop and updates metrics even on errors; label cardinality is capped via route templates and allow-lists.
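A sketch of the RED metrics surface with prometheus_client; the bucket edges and the route allow-list are illustrative choices, not fixed by this design.

```python
# Sketch: SLO-aligned histogram buckets plus a route-template allow-list
# to cap label cardinality. Bucket edges and routes are assumptions.
import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter("requests_total", "Requests by route/code",
                   ["route", "method", "code"])
LATENCY = Histogram(
    "request_duration_seconds", "Request latency",
    ["route", "method"],
    buckets=(0.05, 0.2, 0.5, 2.0),   # mirror the SLO thresholds, not powers of 2
)

ALLOWED_ROUTES = {"/orders/{id}", "/checkout", "/health"}  # cardinality cap

def observe(route_template: str, method: str, status: int, start: float) -> None:
    # `start` is captured by the middleware at request entry via time.monotonic().
    route = route_template if route_template in ALLOWED_ROUTES else "other"
    REQUESTS.labels(route=route, method=method, code=str(status)).inc()
    LATENCY.labels(route=route, method=method).observe(time.monotonic() - start)
```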
4) Context across boundaries
Outbound calls and async jobs inherit correlation IDs/tracing headers. Retries/backoffs annotate spans with attempt numbers. Idempotency keys are logged to reconstruct journeys.
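A sketch of carrying trace context and correlation IDs across an async hop; enqueue/consume are stand-ins for the real queue client, and the header names are assumptions.

```python
# Sketch: context rides in message headers; the consumer resumes the trace
# and records the attempt number. Queue transport itself is out of scope.
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("jobs")

def enqueue(task: dict, correlation_id: str, attempt: int = 0) -> dict:
    headers: dict = {"correlation_id": correlation_id, "attempt": str(attempt)}
    inject(headers)                        # traceparent travels with the message
    return {"headers": headers, "body": task}

def consume(message: dict) -> None:
    headers = message["headers"]
    ctx = extract(headers)                 # resume the producer's trace
    with tracer.start_as_current_span("process_job", context=ctx) as span:
        span.set_attribute("messaging.retry_count", int(headers.get("attempt", 0)))
        span.set_attribute("app.correlation_id", headers["correlation_id"])
        # ... handler logic; on retry, re-enqueue with attempt + 1 ...
```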
5) Error taxonomy & policy hooks
Errors normalize to (kind, code, cause, http_status, retriable, user_safe). Middleware maps to HTTP codes, emits consistent logs, bumps metrics, and tags traces; sensitive details are masked in responses but preserved (redacted) in telemetry. Throttling protects sinks during storms.
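One way the normalized error shape and its mappings could look; the kinds, codes, and helper names are illustrative, not a fixed taxonomy.

```python
# Sketch: one error type carries the full taxonomy; responses expose only
# user-safe fields while telemetry keeps the rest.
from dataclasses import dataclass

@dataclass
class AppError(Exception):
    kind: str          # e.g. "validation" | "dependency" | "internal"
    code: str          # stable event code, e.g. "ORD_STOCK_CONFLICT"
    cause: str         # internal detail, kept (redacted) in telemetry only
    http_status: int
    retriable: bool
    user_safe: str     # message safe to return to the caller

def to_response(err: AppError) -> tuple[int, dict]:
    # Response body exposes only user-safe fields; `cause` stays in logs/traces.
    return err.http_status, {"error": {"code": err.code, "message": err.user_safe}}

def to_log_fields(err: AppError) -> dict:
    # Telemetry keeps the full shape so metrics and traces can slice by kind/code.
    return {"error_kind": err.kind, "event_code": err.code,
            "retriable": err.retriable, "cause": err.cause}
```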
6) Alerting from SLOs
Alerts derive from SLOs (availability, latency, quality) using multi-window, multi-burn-rate policies that page on sustained budget burn; informational symptoms become tickets. Alerts link runbooks/dashboards; charts mark releases and flags.
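The burn-rate arithmetic behind those policies, as a small sketch; the 14.4x factor and 99.9% target follow a common fast-burn convention and are assumptions, not values this design mandates.

```python
# Sketch: page only when both the short and long windows burn budget fast.
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    error_budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return error_ratio / error_budget

def should_page(short_window_errors: float, long_window_errors: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    # The long window proves the burn is sustained; the short window proves
    # it is still happening, so the page clears quickly after recovery.
    return (burn_rate(short_window_errors, slo_target) >= threshold and
            burn_rate(long_window_errors, slo_target) >= threshold)

# Example: 2% errors over 5m and 1.8% over 1h against a 99.9% SLO pages.
assert should_page(0.02, 0.018)
```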
7) Privacy, performance, cost
PII redaction maps, security-event sampling overrides, tiered retention with TTLs, and caps on event size/rate. Hot path uses lock-free counters and batched exporters; exporters run off-thread with backpressure. Overhead targets <1–2% CPU and <1ms via precomputed labels and low cardinality.
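A stdlib-only sketch of the off-thread, batched exporter pattern with drop-on-full backpressure; queue sizes, batch limits, and the flush sink are placeholders.

```python
# Sketch: request threads enqueue and never block; a daemon thread batches
# and ships events, dropping (and counting) when the buffer is full.
import queue
import threading

class BatchedExporter:
    def __init__(self, max_queue: int = 10_000, batch_size: int = 256,
                 flush_interval_s: float = 1.0):
        self._q: queue.Queue = queue.Queue(maxsize=max_queue)
        self._batch_size = batch_size
        self._flush_interval_s = flush_interval_s
        self.dropped = 0                      # surfaced as a metric in practice
        threading.Thread(target=self._run, daemon=True).start()

    def emit(self, event: dict) -> None:
        try:
            self._q.put_nowait(event)         # never block the request thread
        except queue.Full:
            self.dropped += 1                 # backpressure: drop, don't stall

    def _run(self) -> None:
        batch = []
        while True:
            try:
                batch.append(self._q.get(timeout=self._flush_interval_s))
            except queue.Empty:
                pass
            if len(batch) >= self._batch_size or (batch and self._q.empty()):
                self._flush(batch)
                batch = []

    def _flush(self, batch: list) -> None:
        # Placeholder: ship the batch to the collector (OTLP/HTTP) in one call.
        pass
```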
8) Dev experience & ownership
Local pretty logs + in-memory exporters; test helpers assert metrics/spans; scaffolding enforces consistency. Dashboards track golden signals and top user journeys with visible error budgets. Ownership metadata routes pages; scorecards track alert fatigue and MTTR.
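A sketch of a test helper asserting on emitted spans with the OpenTelemetry SDK's in-memory exporter; the span under test stands in for a real handler call.

```python
# Sketch: route spans are captured in memory so tests can assert on names
# and attributes without a running collector.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

def test_handler_emits_route_span():
    tracer = trace.get_tracer("test")
    with tracer.start_as_current_span("GET /orders/{id}") as span:
        span.set_attribute("http.route", "/orders/{id}")   # stand-in for the handler

    spans = exporter.get_finished_spans()
    assert any(s.name == "GET /orders/{id}" and
               s.attributes.get("http.route") == "/orders/{id}" for s in spans)
```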
This middleware makes telemetry a product feature—consistent, privacy-aware, and actionable—so teams answer “what broke?” and “who is affected?” fast.
Common Mistakes
- Treating logs, traces, and metrics as separate add-ons instead of one contract.
- Unstructured text logs with no correlation IDs or redaction.
- Unlimited labels (user IDs, full URLs) that explode metrics cardinality and cost.
- Head-based sampling only, so rare errors and new releases go unseen.
- No typed error taxonomy; every failure is a 500 with identical logs.
- Alerting on raw thresholds (CPU > 80%) rather than SLO burn, causing page noise.
- PII in logs "for debugging," with no retention/TTL.
- Forgetting async context propagation in queues/jobs, breaking traces.
- Exporters running on the request thread, adding tail latency.
- No runbooks or owners on alerts, so the wrong team is paged.
- Missing backpressure/size caps, so bursts overload the logging pipeline.
- Dashboards per microservice but none per user journey, hindering RCA.
- Missing release/flag markers on charts, hiding change correlation.
- CI lacking tests for emitted metrics/spans, so regressions slip through.
Sample Answers
Junior:
I add middleware that logs JSON with a correlation ID and status. I use OpenTelemetry to create a span per request and export to the local collector. Prometheus records request duration histograms and error counters. Alerts are basic but tied to a latency SLO. I avoid logging PII and cap log sizes.
Mid:
I define a logging schema and inject IDs into inbound/outbound calls. Tracing propagates W3C headers across HTTP and queues; retries annotate spans. Metrics follow RED with SLO-aligned buckets and label allow-lists. Alerts use two burn-rate windows; each links to a runbook and dashboard. Exporters are batched off-thread to keep tail latency low.
Senior:
I ship a reusable observability middleware kit: structured logs with redaction maps; OTel spans on handlers, DB, cache, and jobs; metrics for RED/USE; exemplars link histograms to traces. Sampling is tail-aware (boost on errors/new releases). SLOs drive paging; everything else tickets. Ownership metadata, release markers, and test helpers make telemetry consistent and actionable.
Evaluation Criteria
- Design clarity: Treats logs, traces, metrics, and alerts as one system with a declared schema and contract.
- Logging quality: JSON, correlation IDs, redaction, stable event codes; concise on hot paths.
- Tracing depth: OTel everywhere (inbound, DB/queue/HTTP); W3C propagation; sampling strategy with tail boosts and exemplars.
- Metrics rigor: RED/USE with SLO-aligned histograms; cardinality controls; coverage of async jobs.
- Alert hygiene: SLO burn-rate with multi-window policies; links to runbooks/dashboards; low page noise.
- Privacy & cost: PII controls, TTLs, caps; exporters off-thread with backpressure; measured overhead.
- Operations: Dashboards per service and user journey; ownership metadata; release/flag markers.
- DX & testing: Local zero-config, test helpers asserting metrics/spans, CI checks for telemetry.
Red flags: Unstructured logs, no IDs, head-only sampling, high-cardinality labels, threshold-only alerts, exporters on request threads, no runbooks or owners.
Preparation Tips
- Draft a minimal observability contract: logging schema, required labels, trace attributes, SLI list.
- Implement middleware that injects correlation/trace IDs and emits JSON logs; add redaction maps and size caps.
- Add OpenTelemetry to handlers and clients; verify W3C propagation locally with a collector/Jaeger.
- Expose RED metrics with SLO-aligned buckets; enforce label allow-lists to avoid cardinality blowups.
- Configure burn-rate alerts (short + long windows) for one SLO; link to a runbook and dashboard.
- Build a dashboard template per service and one per top user journey; add release/flag markers.
- Move exporters off-thread; batch and compress. Measure overhead and set budgets (<1 ms, <2% CPU).
- Write test helpers that assert metrics/spans/log fields during integration tests.
- Run a privacy pass: PII inventory, redaction, TTLs. Enable sampling exceptions for security events.
- Rehearse a 60-second pitch: “schema, propagation, RED metrics, SLO alerts, low overhead, strong privacy.”
Real-world Context
Checkout latency spike: A retailer saw p99 jump after a release. Exemplars linked the p99 bucket to traces showing cache misses on a new flag. Rollback fixed it within 10 minutes; the SLO burn-rate alert paged once, not every minute.
Queue black hole: A SaaS job runner lost correlation across SQS. Adding IDs to message headers restored trace continuity and exposed a retry storm; a backoff fix dropped error counters by 70%.
Log bill shock: High-cardinality user labels exploded cost. Switching to route templates and an allow-list reduced cardinality by 90% and cut logging by half with no loss of signal.
On-call sanity: Burn-rate policies plus runbooks cut pages/week from 30 to 6; MTTR fell as dashboards linked alerts to the last deployment and owner team.
Privacy audit: A fintech mapped PII fields and applied redaction in middleware. Logs became privacy-safe, and tiered retention with TTLs passed compliance without starving engineers of data. Moving exporters off-thread shaved ~0.7 ms from tail latency.
Key Takeaways
- Treat logs, traces, metrics, and alerts as one contract.
- Use structured JSON, correlation IDs, and redaction by default.
- Instrument with OpenTelemetry; add tail-aware sampling and exemplars.
- Expose RED/USE metrics and alert on SLO burn rates, not raw thresholds.
- Keep overhead low; add ownership, runbooks, and journey dashboards.
Practice Exercise
Scenario:
You’re introducing observability middleware into a polyglot web stack (edge gateway, API, worker jobs). Outages are noisy, logs are unstructured, and no one can tie p99 spikes to a change. Build a plan and prove it in code.
Tasks:
- Contract: Define a one-page schema for logs (JSON fields), trace attributes, and SLIs (requests, errors, latency). Publish event codes and a redaction map.
- Middleware: Implement correlation/trace ID injection at the gateway; propagate W3C headers to APIs/workers. Emit JSON logs with route template, status, latency.
- Tracing: Add OpenTelemetry spans to inbound handlers and DB/HTTP clients; enable exemplars from latency histograms to traces.
- Metrics: Export RED metrics with SLO buckets (e.g., 50/200/500/2000 ms). Cap labels to service, route, method, code.
- Alerting: Create a latency SLO and two burn-rate alerts (fast 5m/1h, slow 1h/24h). Link to a runbook that includes dashboards and rollback steps.
- Privacy/Cost: Enforce PII redaction, size/rate caps, and tiered retention. Sample verbose logs; boost sampling for errors/new releases.
- DX: Provide a docker-compose with collector + UI; add test helpers that assert on metrics/spans/log fields in CI.
- Demo: Trigger a controlled regression (disable cache). Show that the alert fires once, the dashboard highlights the release marker, an exemplar links the p99 bucket to the slow span, and the owner team is paged.
Deliverable:
Repo + docs demonstrating the middleware, dashboards, alerts, and a recorded demo of detection → diagnosis → rollback within SLO policy.

