How would you design a scalable integration architecture?
Systems Integrator
Answer
A resilient integration architecture is event-driven, contract-first, and idempotent by design. Use a message backbone (Kafka/RabbitMQ/SNS+SQS) with CDC/outbox to publish domain events, isolate systems behind APIs, and apply saga orchestration for multi-step workflows. Ensure data consistency via at-least-once delivery, idempotent consumers, and replayable logs; protect with circuit breakers, retries, DLQs, and bulkheads. Scale horizontally with partitioning and stateless workers; govern change using a schema registry and semantic versioning.
Long Answer
Designing a cross-environment integration architecture that spans legacy, cloud, and SaaS begins by separating concerns: transport, contracts, orchestration, and governance. The core principles are event-driven communication, contract discipline, idempotency, and operational resilience at scale.
1) Topology: hub-and-spoke over point-to-point
Avoid brittle meshes of direct integrations. Place an event backbone (Kafka, Pulsar, or SNS+SQS) at the center for asynchronous flows, and an API gateway for synchronous requests. Legacy systems publish state changes via CDC (e.g., Debezium) or an outbox table, so application writes remain atomic and event publication is guaranteed. SaaS apps integrate through webhooks ingested into the backbone; normalize payloads via adapters.
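The outbox half of this pattern can be shown in a minimal sketch: the business write and the event record commit in one local transaction, and a separate relay (polling publisher or CDC) ships the outbox rows to the broker later. Table and function names here are illustrative, not a prescribed schema, and SQLite stands in for the production database.

```python
import json
import sqlite3

def init_db() -> sqlite3.Connection:
    # Illustrative tables: a business table plus an outbox in the SAME database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
    conn.execute(
        "CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " aggregate_id TEXT, event_type TEXT, payload TEXT)"
    )
    return conn

def place_order(conn: sqlite3.Connection, order_id: str, total: float) -> None:
    # One atomic transaction covers both writes, so there is no window in
    # which the order exists but its event was never recorded (or vice versa).
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total)
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, "OrderPlaced",
             json.dumps({"order_id": order_id, "total": total})),
        )
```

A relay process would then poll `outbox` in `seq` order, publish each row to the backbone, and mark or delete it, which is exactly the gap CDC tools close without polling.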
2) Contracts and schema governance
Adopt contract-first design. Define events and APIs with Avro/Protobuf/OpenAPI and store them in a schema registry. Enforce compatibility (backward/forward) and semantic versioning. Provide canonical domain events (e.g., OrderPlaced, InvoicePaid) and avoid leaking internal DB schemas. For queries across systems, expose read models via APIs rather than consuming foreign write models directly.
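The compatibility rule a registry enforces can be illustrated with a toy check. In this hedged model a schema is just a dict of field name to spec; "backward compatible" means a consumer on the new schema can still read records written under the old one, so any field added in the new schema must carry a default. Real Avro/Protobuf rules are richer (type promotion, aliases), but the core idea is the same.

```python
# Toy backward-compatibility check in the spirit of a schema registry.
# A schema here is {field_name: {"type": ..., "default": ...}}.

def is_backward_compatible(old: dict, new: dict) -> bool:
    # Fields that exist only in the new schema must have defaults,
    # otherwise records written with the old schema cannot be decoded.
    added = set(new) - set(old)
    return all("default" in new[f] for f in added)

old_v1 = {"order_id": {"type": "string"}, "total": {"type": "double"}}
new_ok = {**old_v1, "currency": {"type": "string", "default": "USD"}}
new_bad = {**old_v1, "currency": {"type": "string"}}  # no default: breaks old data
```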
3) Consistency model: exactly-once effects through idempotency
In heterogeneous estates, prefer at-least-once delivery with idempotent consumers and deduplication keys to approximate exactly-once outcomes. Use transactional outbox + polling publisher or CDC to avoid dual-write races. Store processed message IDs or hashes per consumer to discard duplicates. For cross-system workflows, implement sagas (orchestration or choreography). Orchestrators manage compensations when a step fails (refund, status revert), ensuring data consistency without distributed transactions.
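The dedup mechanism above can be sketched in a few lines: at-least-once delivery means the same message may arrive twice, and a per-consumer record of processed message IDs turns redelivery into a no-op. In production the "seen" store would live in the consumer's own database and commit atomically with the side effect; an in-memory set here is only a stand-in.

```python
class IdempotentConsumer:
    """Sketch of a consumer that approximates exactly-once EFFECTS
    under at-least-once DELIVERY by deduplicating on message ID."""

    def __init__(self):
        self._seen: set = set()   # processed message IDs (would be a DB table)
        self.applied: list = []   # stands in for real side effects

    def handle(self, message_id: str, payload: dict) -> bool:
        """Return True if applied, False if dropped as a duplicate."""
        if message_id in self._seen:
            return False  # redelivery: side effect already happened
        self.applied.append(payload)
        self._seen.add(message_id)
        return True
```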
4) Fault tolerance and backpressure
Protect upstreams with circuit breakers, timeouts, and bulkheads so one slow dependency cannot sink the fleet. Apply exponential backoff and jitter on retries. Size concurrency via work queues; when overloaded, shed non-critical load and prioritize critical topics with QoS. Capture failures in dead-letter queues (DLQs) with rich headers (trace ID, schema version) to enable reprocessing after fixes.
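The backoff-with-jitter policy can be sketched as a small delay function using the "full jitter" variant: each retry waits a random time in `[0, base * 2**attempt]`, capped, which spreads retries out and avoids synchronized thundering herds. The base and cap values below are illustrative.

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0,
                  rng=None) -> float:
    """Full-jitter exponential backoff: random delay in [0, min(cap, base*2^n)]."""
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

A retry loop would sleep for `backoff_delay(attempt)` after each failure and route the message to a DLQ once a retry budget is exhausted.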
5) Scalability patterns
Partition high-volume topics by a stable key (tenant, aggregate ID) to scale consumers horizontally while preserving order within a key. Keep workers stateless and 12-factor; use compacted topics for latest-value streams and retention policies to bound storage. For sync traffic, horizontally scale API gateways and BFFs behind a CDN; cache read-only projections aggressively with TTL + ETag. Apply CQRS: writes produce events; reads build materialized views optimized for client patterns.
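Key-based partitioning can be sketched with a stable hash: every event for one key lands on one partition, so per-key ordering survives horizontal consumer scaling. MD5 is used here only because Python's built-in `hash()` is not stable across processes; real clients (e.g., Kafka's murmur2 partitioner) use their own stable hash.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a stable key (tenant/aggregate ID) to a fixed partition."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```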
6) Interfacing legacy and SaaS
For legacy RDBMS, avoid polling tables directly; prefer CDC to emit insert/update/delete as ordered events. For SaaS, verify webhook signatures and place an ingress queue between the internet and core to level spikes. Wrap SaaS rate limits with token buckets and schedule backfills with incremental cursors. Keep each external adapter isolated; never let a vendor’s data model bleed into your domain events.
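Webhook signature verification is usually a one-liner worth getting right. The sketch below assumes the common pattern of the vendor signing the raw request body with a shared secret via HMAC-SHA256; header names and encodings vary per provider. `hmac.compare_digest` avoids timing attacks on the comparison.

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check a hex HMAC-SHA256 signature over the raw request body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison: never use == on signatures.
    return hmac.compare_digest(expected, signature_hex)
```

Verify before the ingress queue, so only authenticated events reach the backbone.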
7) Observability and operability
Propagate correlation IDs across hops (HTTP headers / message metadata). Capture golden signals: throughput, consumer lag, P95/P99 latency, retry rates, DLQ counts. Emit business metrics (orders synced, payouts settled) from the integration layer—these catch silent data drift. Add replay tooling: selective re-ingestion from DLQs or timestamp ranges to repair downstream views after schema or logic fixes.
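Correlation-ID propagation reduces to a small rule: reuse the incoming ID if present, mint one at the edge otherwise, and copy it onto every outgoing message's headers. The header name and `publish` stand-in below are illustrative.

```python
import uuid

def with_correlation(headers: dict) -> dict:
    """Reuse the incoming correlation ID, or mint one at the system edge."""
    cid = headers.get("x-correlation-id") or str(uuid.uuid4())
    return {**headers, "x-correlation-id": cid}

def publish(event: dict, incoming_headers: dict) -> dict:
    """Stand-in for a broker publish; returns the enriched message."""
    return {"headers": with_correlation(incoming_headers), "body": event}
```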
8) Security and compliance
Encrypt in transit and at rest. Minimize payload PII; reference sensitive data via tokens or secure lookups. Use customer-managed keys where required. Attach least-privilege IAM roles to each adapter. Redact logs; segregate multi-tenant traffic via namespaces or per-tenant topics to honor data residency.
9) Testing and change management
Create a contract test harness: consumers validate they can read prior schemas; producers verify they do not break backward compatibility. Run end-to-end tests in ephemeral envs with seeded topics and fake SaaS webhooks. Roll out changes with canaries: route a slice of partitions or tenants to a new consumer version, watch lag/error budgets, then ramp.
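The consumer side of such a harness can be as simple as replaying a frozen old-schema payload through the current parser. In this hedged sketch, `currency` is a field assumed to have been added in v2 with a default, and the fixture is a recorded v1 record; names are illustrative.

```python
def parse_order_placed(event: dict) -> dict:
    """Current (v2) consumer logic for a hypothetical OrderPlaced event."""
    return {
        "order_id": event["order_id"],
        "total": event["total"],
        "currency": event.get("currency", "USD"),  # added in v2, defaulted
    }

# Frozen payload recorded before v2 shipped; the contract test proves the
# new consumer still reads it.
V1_FIXTURE = {"order_id": "o-9", "total": 12.5}
```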
10) Build/buy and governance
Decide between custom backbone vs iPaaS/ESB based on scale and compliance. Even with iPaaS, keep contracts and events owned by your domain. Establish an integration review board that approves new topics, schemas, and SLAs, and maintains a catalog (who produces/consumes what, with contact and RTO/RPO).
This produces a scalable, fault-tolerant systems integration platform where data moves as immutable events, sync calls are minimized and protected, and data consistency is enforced through idempotency, sagas, and replayable logs—future-proofed for more clients and higher volumes.
Common Mistakes
- Point-to-point integrations that balloon into N² links and fragile chains.
- Dual writes (DB + message) in one code path without outbox/CDC, causing drift.
- Assuming “exactly-once delivery” from the broker; skipping idempotency and dedupe.
- Letting vendor payloads leak into internal integration architecture contracts.
- No DLQs or replay tooling—errors pile up or require manual DB edits.
- Overusing synchronous calls for cross-system orchestration, leading to cascading timeouts.
- Ignoring schema evolution; producers ship breaking fields and brick consumers.
- No backpressure controls; spikes overwhelm legacy endpoints or rate-limited SaaS.
Sample Answers (Junior / Mid / Senior)
Junior:
“I would use a message queue to decouple systems and an API gateway for synchronous calls. I would define schemas for events, use retries with DLQs, and make consumers idempotent so duplicates do not break data. For legacy, I would use CDC to publish changes instead of polling.”
Mid:
“My integration architecture centers on Kafka with CDC/outbox for producers and a schema registry to evolve contracts. I partition topics by tenant/aggregate, scale stateless consumers, and orchestrate multi-step flows with sagas. Circuit breakers, backoff, and DLQs provide fault tolerance. I expose read models via APIs and validate compatibility with contract tests.”
Senior:
“I implement domain events with strict versioned schemas, CDC/outbox to prevent dual-write anomalies, and choreography-first sagas with compensations. Consistency comes from at-least-once + idempotency and replayable logs. Observability includes lag, P99, and business counters. Adapters isolate SaaS and legacy. Changes roll out via canaries per partition set with automated rollback.”
Evaluation Criteria
Look for an integration architecture that avoids point-to-point sprawl, uses an event backbone plus an API gateway, and formalizes contracts with a schema registry. Strong answers cover data consistency (CDC/outbox, idempotency, sagas), fault tolerance (retries, DLQs, circuit breakers, bulkheads), and scalability (partitioning, stateless workers, CQRS). They address legacy+SaaS adapters, webhook verification, rate limits, and replay tooling. Red flags: reliance on exactly-once semantics, dual writes, no schema governance, synchronous orchestration everywhere, or lack of DLQs/observability. The best responses add rollout strategy, testing, and governance.
Preparation Tips
- Build a demo with one legacy DB producing CDC events, one cloud microservice, and one SaaS webhook.
- Implement an outbox pattern and a polling publisher; prove no drift under failure.
- Add Avro schemas to a registry; practice backward-compatible changes and consumer contract tests.
- Create a small saga: order → reserve inventory → capture payment → ship; include compensations.
- Configure retries with backoff, DLQs, and replay tooling; simulate poison messages.
- Partition a topic by aggregate ID; measure consumer lag and rebalance behavior.
- Add circuit breakers and bulkheads; load test to observe backpressure.
- Instrument correlation IDs and business counters; set alert thresholds for lag and P99s.
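The saga tip above (order → reserve inventory → capture payment → ship) can be sketched as a tiny orchestrator: steps run in order, and a failure triggers the compensations of every completed step in reverse, restoring consistency without a distributed transaction. Step and compensation callables here are illustrative stand-ins.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation) callables; returns event log."""
    log, done = [], []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"{name}:ok")
            done.append((name, compensate))
        except Exception:
            log.append(f"{name}:failed")
            # Unwind every completed step in reverse order.
            for prev, comp in reversed(done):
                comp()
                log.append(f"{prev}:compensated")
            break
    return log
```

A real orchestrator would persist this log durably (so a crash mid-saga resumes correctly) and attach timeouts per step; compensations themselves must be idempotent.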
Real-world Context
A retailer replaced nightly file drops with CDC→Kafka and idempotent consumers; order synchronization latency fell from hours to seconds and drift incidents dropped to near zero. A fintech implemented sagas for account funding; failed payment compensations kept balances consistent without 2PC. A SaaS integrator added schema registry and contract tests; a producer’s breaking change was caught pre-prod. A marketplace adopted DLQs and replay, turning “stuck orders” from war rooms into scheduled reprocess jobs. Across cases, the combination of CDC/outbox, idempotency, sagas, and partitioned scaling delivered dependable systems integration at scale.
Key Takeaways
- Use an event backbone + API gateway; avoid N² point-to-point links.
- Guarantee data consistency with CDC/outbox, idempotency, and sagas.
- Engineer fault tolerance with retries, DLQs, circuit breakers, and bulkheads.
- Achieve scalability via partitioning, stateless consumers, and CQRS read models.
- Govern with schema registries, contract tests, and controlled rollouts.
Practice Exercise
Scenario:
You must integrate a legacy ERP (on-prem RDBMS), a payments SaaS, and a cloud inventory service. Orders originate in the web app, payments clear asynchronously, and inventory updates must propagate in near-real-time. The system must maintain data consistency, remain fault tolerant, and scale for seasonal spikes.
Tasks:
- Propose the backbone (broker choice), API gateway, and partition keys for high-volume topics.
- Describe how you will emit events from the ERP using CDC or an outbox without dual writes.
- Define event contracts (OrderPlaced, PaymentCaptured, InventoryAdjusted) and schema evolution rules; choose a registry strategy.
- Design a saga that coordinates order → payment → inventory; specify compensations and timeouts.
- Specify idempotency mechanisms for each consumer (keys, state tables, or natural IDs).
- Detail fault tolerance: retries/backoff, DLQs, circuit breakers, and bulkheads; include reprocessing tooling.
- Outline observability: correlation IDs, consumer lag dashboards, P95/P99 alerts, and business counters.
- Provide a rollout plan with canary consumers per partition subset and an automated rollback trigger.
Deliverable:
An integration design doc and runbook demonstrating a scalable, fault-tolerant integration architecture that preserves data consistency across legacy, cloud, and SaaS systems.

