How do you detect and fix database performance bottlenecks in prod?
Database Developer
Answer
Detecting database bottlenecks starts with monitoring: slow query logs, execution plans, and resource metrics (CPU, I/O, locks). Use profiling to identify queries with high cost or wait times. Fixes include query tuning, indexes (covering, composite), caching hot reads, partitioning, and connection pool tuning. Long-term, adopt observability dashboards, automate regression checks, and test schema changes in staging to keep production performant and resilient.
Long Answer
Detecting and resolving database performance bottlenecks in production requires a mix of observability, root-cause analysis, and corrective action. A production database is mission-critical: every second of latency impacts user experience and revenue. The key is to move systematically from detection → diagnosis → resolution → prevention.
1) Detect bottlenecks via monitoring and profiling
Start with native DB tools:
- MySQL/Postgres: EXPLAIN, EXPLAIN ANALYZE, slow query logs.
- SQL Server: Query Store, Profiler, Extended Events.
- Oracle: AWR reports, SQL Trace.
Complement with infrastructure metrics: CPU saturation, disk IOPS, memory pressure, and connection pool usage.
Enable application tracing (APM tools like New Relic, Datadog, OpenTelemetry) to connect slow endpoints to queries.
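For a Postgres deployment, a minimal detection pass might look like the sketch below. It assumes the pg_stat_statements extension can be enabled and uses a hypothetical orders table; the 500 ms threshold is purely illustrative.

```sql
-- Log any statement slower than 500 ms (illustrative threshold).
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();

-- Aggregate statement statistics (extension must be in shared_preload_libraries).
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 5 statements by total execution time.
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;

-- Inspect the plan of a suspect query (hypothetical orders table).
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, status
FROM orders
WHERE customer_id = 42 AND status = 'PENDING';
```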
2) Classify bottlenecks
- Query-level: bad SQL (SELECT *), missing filters, N+1 issues.
- Index-level: missing indexes, non-selective indexes, bloated indexes.
- Locking/Contention: long-running transactions, deadlocks, blocking writes (see the lock-inspection sketch after this list).
- Resource-level: insufficient CPU, memory, IOPS, or misconfigured pools.
- Schema-level: poor normalization/denormalization tradeoffs, skewed partitions.
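To make the locking/contention category concrete, the sketch below uses only standard Postgres system views to show which sessions are blocked, who is blocking them, and which transactions have been open the longest.

```sql
-- Blocked sessions paired with the sessions holding the conflicting locks.
SELECT blocked.pid                 AS blocked_pid,
       blocked.query               AS blocked_query,
       blocking.pid                AS blocking_pid,
       blocking.query              AS blocking_query,
       now() - blocking.xact_start AS blocking_xact_age
FROM pg_stat_activity AS blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid
WHERE blocked.wait_event_type = 'Lock';

-- Longest-running open transactions, which often hold the widest locks.
SELECT pid, state, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_age DESC
LIMIT 5;
```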
3) Resolve with targeted tuning
- Query optimization: Rewrite queries to use set-based logic, replace correlated subqueries with joins, and swap SELECT * for explicit column lists. Use CTEs or window functions to avoid repeated scans.
- Indexing: Add composite or covering indexes; remove redundant ones. Monitor index bloat and fragmentation and rebuild as needed (see the sketch after this list).
- Execution plan fixes: Review EXPLAIN output—if sequential scans dominate, add or adjust indexes; if nested-loop joins run over large row counts, refresh statistics or restructure the joins.
- Caching: Move hot read queries to in-memory cache (Redis, Memcached). Use materialized views for expensive aggregates.
- Partitioning/sharding: Break down large tables by range, hash, or list to reduce scan scope.
- Connection pool tuning: Adjust max pool size to avoid thrashing while preventing exhaustion.
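As a rough sketch of the indexing and materialized-view points above (the transactions table, its columns, and the view name are hypothetical), in Postgres this could look like:

```sql
-- Composite index matching a frequent filter; INCLUDE makes it covering for
-- queries that only read amount, enabling index-only scans.
CREATE INDEX CONCURRENTLY idx_transactions_customer_status
    ON transactions (customer_id, status)
    INCLUDE (amount);

-- Confirm the planner now uses the index instead of a sequential scan.
EXPLAIN (ANALYZE)
SELECT amount
FROM transactions
WHERE customer_id = 42 AND status = 'SETTLED';

-- Precompute an expensive aggregate instead of recomputing it per request.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', created_at) AS day, sum(amount) AS revenue
FROM transactions
GROUP BY 1;

-- A unique index lets the view be refreshed without blocking readers.
CREATE UNIQUE INDEX ON daily_revenue (day);
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;
```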
4) Case study example
In a fintech production system, checkout latency spiked. Profiling showed a query running a full table scan on the transactions table despite a WHERE filter on customer_id; no supporting index existed. Adding a composite index on (customer_id, status) cut latency from 2s to 50ms. Later, to handle scale, the table was partitioned by customer region. Monitoring confirmed stable performance.
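A hedged sketch of the partitioning step, using a simplified, hypothetical version of that transactions table partitioned by region (Postgres declarative partitioning; the region values are made up):

```sql
CREATE TABLE transactions (
    transaction_id bigint      NOT NULL,
    customer_id    bigint      NOT NULL,
    region         text        NOT NULL,
    status         text        NOT NULL,
    amount         numeric     NOT NULL,
    created_at     timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (transaction_id, region)   -- partition key must be part of the PK
) PARTITION BY LIST (region);

CREATE TABLE transactions_amer PARTITION OF transactions FOR VALUES IN ('AMER');
CREATE TABLE transactions_emea PARTITION OF transactions FOR VALUES IN ('EMEA');
CREATE TABLE transactions_apac PARTITION OF transactions FOR VALUES IN ('APAC');

-- Queries filtering on region touch only one partition (partition pruning).
EXPLAIN
SELECT count(*) FROM transactions
WHERE region = 'EMEA' AND status = 'PENDING';
```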
5) Prevent recurrence
- Add query regression testing in CI/CD (see the sketch after this list).
- Automate schema review with linters.
- Track query performance dashboards over time; set alerts for anomalies.
- Educate developers on writing performant SQL and using indexes properly.
- Run chaos/performance drills in staging to simulate heavy load.
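One low-tech way to approximate the regression check (assuming pg_stat_statements is enabled in the CI database and the 200 ms budget is illustrative) is to run the CI workload, then fail the pipeline if any tracked statement exceeds the latency budget; a wrapper script can check whether this query returns rows.

```sql
-- Statements whose mean latency exceeds the (illustrative) 200 ms budget.
SELECT query, calls, round(mean_exec_time::numeric, 1) AS mean_ms
FROM pg_stat_statements
WHERE mean_exec_time > 200
  AND query NOT ILIKE '%pg_stat_statements%'
ORDER BY mean_exec_time DESC;
```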
By combining observability, structured diagnosis, and proactive governance, a Database Developer ensures production databases stay fast and reliable even under scale.
Common Mistakes
Typical mistakes include optimizing blindly without evidence, adding too many indexes (hurting writes), or relying only on vertical scaling. Teams often ignore execution plans, missing obvious scan/loop issues. Long transactions with wide locks block others. Developers frequently use SELECT * in production queries, bloating payloads. Another error: caching without an invalidation strategy, leading to stale results. Finally, many fail to monitor continuously—performance regressions slip by until customers complain.
Sample Answers (Junior / Mid / Senior)
Junior:
“I’d check slow query logs and run EXPLAIN on the heaviest queries. I’d add indexes and avoid SELECT *. For blocking, I’d shorten transactions.”
Mid:
“I’d profile queries with EXPLAIN ANALYZE and Query Store, then optimize hot queries with better indexing and joins. For scalability, I’d add caching for read-heavy endpoints and partition large tables. I’d also set up monitoring dashboards.”
Senior:
“I’d integrate APM tracing into CI/CD to flag slow queries before release. In production, I’d monitor execution plans, lock waits, and resource metrics. I’d use composite indexes, materialized views, and sharding for scale. I’d automate schema reviews and regression tests. In one case, I cut checkout latency by 95% by indexing and partitioning transaction data.”
Evaluation Criteria
Interviewers expect answers covering:
- Use of profiling tools (EXPLAIN, slow query logs, AWR).
- Recognition of bottleneck types: query, index, lock, resource, schema.
- Concrete tuning strategies: indexing, query rewrites, caching, partitioning.
- Awareness of monitoring/alerting systems for production.
- Emphasis on data integrity while optimizing.
- Real-world debugging example.
Weak answers only mention “adding indexes” or “scaling servers.” Strong answers demonstrate systematic diagnosis, multiple mitigation tactics, and proactive governance.
Preparation Tips
Practice by setting up a sample Postgres or SQL Server DB. Run a workload, capture slow queries, and use EXPLAIN ANALYZE. Add and remove indexes to observe plan changes. Test partitioning on a large table. Simulate deadlocks with long transactions; analyze lock waits. Try caching results in Redis, then invalidate. Monitor CPU and I/O under load with CloudWatch, and track query timings with pg_stat_statements. Build a Grafana dashboard with query latencies. Prepare a 60–90s story of a bottleneck you diagnosed, tools used, and resolution.
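For the deadlock drill mentioned above, a minimal Postgres sketch (hypothetical accounts table; run the steps in two psql sessions, interleaved in the commented order):

```sql
-- Session A, step 1
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;

-- Session B, step 2
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 2;

-- Session A, step 3 (blocks: row 2 is locked by session B)
UPDATE accounts SET balance = balance + 10 WHERE id = 2;

-- Session B, step 4 (would block on row 1 -> deadlock detected, one session is aborted)
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
```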
Real-world Context
In e-commerce, spikes during sales often expose bottlenecks. One retailer found checkout queries doing full scans on orders. Adding composite indexes plus caching cut response times from 2s to 100ms. In fintech, long-running reporting queries blocked transactional updates, causing deadlocks; separating OLAP and OLTP workloads fixed it. A SaaS company scaled read replicas to offload reporting but later added sharding for tenants. Continuous monitoring with pg_stat_statements and APM dashboards proved critical to catch regressions before users noticed.
Key Takeaways
- Use profiling and monitoring to detect bottlenecks.
- Optimize queries, avoid SELECT *, and use proper indexing.
- Address locks and contention with shorter transactions.
- Leverage caching, partitioning, and sharding for scale.
- Automate regression testing and observability.
Practice Exercise
Scenario: You manage a Postgres production DB for an e-commerce app. During peak sales, API latency jumps from 200ms to 2s.
Tasks:
- Enable slow query log; identify top 5 slow queries.
- Run EXPLAIN ANALYZE on them—determine if sequential scans or nested loops are culprits.
- Add composite indexes for frequent filters (e.g., (customer_id, status)).
- Refactor queries: replace SELECT * with needed columns.
- Test caching hot queries in Redis; set TTL and invalidation rules.
- Partition the orders table by region/date; rerun performance tests.
- Monitor locks/waits; shorten transaction lengths; tune isolation.
- Add a Grafana dashboard with CPU, I/O, and latency panels; alert if p95 latency exceeds 500ms.
Deliverable: Prepare a 60–90s narrative describing detection steps, query/index tuning, caching/partitioning, and monitoring setup.

