Operational Reliability — Interview Q&A
Deep-dive answers for production-readiness and reliability questions common at senior / SDE3 interviews.
"How Do You Handle Server Downtime?"
This question tests whether you design for failure by default, not as an afterthought.
Layer 1 — Process Crashes: PM2 Cluster Mode
javascript// ecosystem.config.js
module.exports = {
apps: [{
name: 'api',
script: 'dist/index.js',
instances: 'max', // one per CPU core
exec_mode: 'cluster', // share port across workers
autorestart: true,
max_restarts: 10,
min_uptime: '5s', // a process that exits within 5s counts as an unstable restart (toward max_restarts)
restart_delay: 1000, // wait 1s between restarts
watch: false, // never watch in prod
env_production: {
NODE_ENV: 'production',
PORT: 3000,
},
}],
};
// pm2 start ecosystem.config.js --env production
// pm2 monit ← live CPU/memory/logs
// pm2 logs ← aggregated log stream
When one worker crashes, PM2 restarts it while the other workers keep serving traffic, so a single-process crash causes no downtime. Requests that were in flight on the crashed worker are still lost.
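How `max_restarts` and `min_uptime` interact is easier to see in a toy model. The sketch below is not PM2's implementation. It is a simplified supervisor with hypothetical names (`createRestartPolicy`, `onExit`), and it assumes that a run shorter than `minUptimeMs` counts as unstable and that one stable run resets the counter.

```javascript
// Simplified crash-loop guard (a model for illustration, not PM2 source code).
function createRestartPolicy({ maxRestarts = 10, minUptimeMs = 5000 } = {}) {
  let unstableRestarts = 0;
  return {
    // Call when a worker exits; uptimeMs is how long that worker stayed alive.
    onExit(uptimeMs) {
      if (uptimeMs >= minUptimeMs) {
        unstableRestarts = 0; // stable run: forget earlier quick crashes
        return 'restart';
      }
      unstableRestarts += 1; // died too quickly: count it
      return unstableRestarts > maxRestarts ? 'give-up' : 'restart';
    },
  };
}

const policy = createRestartPolicy({ maxRestarts: 2, minUptimeMs: 5000 });
const decisions = [
  policy.onExit(60000), // long-lived worker crashed once
  policy.onExit(100),   // crash loop begins
  policy.onExit(100),
  policy.onExit(100),   // third quick crash in a row exceeds maxRestarts
];
console.log(decisions); // [ 'restart', 'restart', 'restart', 'give-up' ]
```

Giving up after repeated quick crashes is what stops a broken build from restarting forever and hiding the failure.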
Layer 2 — Graceful Shutdown (SIGTERM)
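javascript// Orchestrators and process managers (Kubernetes, PM2, systemd) send SIGTERM to stop a process,
// for example during a deploy or scale-down. The load balancer itself does not send signals.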
// The process should drain in-flight requests before exiting.
const server = app.listen(PORT);
let shuttingDown = false;
process.on('SIGTERM', async () => {
console.log('SIGTERM received — starting graceful shutdown');
shuttingDown = true;
// 1. Stop accepting new connections
server.close(async () => {
console.log('HTTP server closed');
// 2. Drain open resources
try {
await Promise.all([
dbPool.end(), // close DB connection pool
redisClient.quit(), // close Redis connection
mqChannel.close(), // close message queue channel
]);
console.log('Resources drained — exiting cleanly');
process.exit(0);
} catch (err) {
console.error('Error during shutdown', err);
process.exit(1);
}
});
// 3. Safety timeout — force exit if drain takes too long
setTimeout(() => {
console.error('Graceful shutdown timed out — forcing exit');
process.exit(1);
}, 15_000);
});
// Reject new requests during shutdown (return 503 to load balancer):
app.use((req, res, next) => {
if (shuttingDown) {
res.setHeader('Connection', 'close');
return res.status(503).json({ error: 'Server shutting down' });
}
next();
});
Layer 3 — Health Checks
javascript// Load balancers (AWS ALB, nginx, k8s) probe /health on an interval, commonly every 10-30s.
// A 503 counts as a failed probe; the server leaves rotation once the configured failure threshold is reached.
app.get('/health', async (req, res) => {
const checks: Record<string, 'ok' | 'fail'> = {};
// Check DB connectivity
try {
await pool.query('SELECT 1');
checks.db = 'ok';
} catch {
checks.db = 'fail';
}
// Check Redis connectivity
try {
await redis.ping();
checks.redis = 'ok';
} catch {
checks.redis = 'fail';
}
const healthy = Object.values(checks).every(v => v === 'ok');
res.status(healthy ? 200 : 503).json({
status: healthy ? 'healthy' : 'degraded',
uptime: process.uptime(),
checks,
});
});
// Kubernetes liveness vs readiness:
// /health/live → is the process up? (restart if not)
// /health/ready → can it serve traffic? (remove from Service if not)
app.get('/health/live', (_req, res) => res.json({ status: 'alive' }));
app.get('/health/ready', async (req, res) => {
// readiness fails during startup or drain
if (shuttingDown || !dbReady) return res.status(503).json({ status: 'not ready' });
res.json({ status: 'ready' });
});
Layer 4 — Zero-Downtime Deployments
Three strategies — choose based on risk tolerance:
1. Rolling Deployment (default in k8s)
┌────────────────────────────────────┐
│ v1 v1 v1 v1 → v2 v1 v1 v1 │
│ → v2 v2 v1 v1 │
│ → v2 v2 v2 v1 │
│ → v2 v2 v2 v2 ✓ │
└────────────────────────────────────┘
- No extra infrastructure needed
- Both versions briefly serve traffic simultaneously
- DB schema must be backward-compatible during rollout
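The rolling sequence in the diagram can be sketched as a loop that replaces one instance at a time and stops as soon as a replacement fails its health check. This is a toy model under stated assumptions: `rollingUpdate` and the `isHealthy` callback are hypothetical names, and a real orchestrator also handles surge capacity, timeouts, and connection draining.

```javascript
// Replace instances one at a time; halt at the first unhealthy replacement.
function rollingUpdate(instances, newVersion, isHealthy) {
  const fleet = [...instances];
  const history = [fleet.join(' ')];
  for (let i = 0; i < fleet.length; i++) {
    const previous = fleet[i];
    fleet[i] = newVersion;
    if (!isHealthy(newVersion, i)) {
      fleet[i] = previous; // put the old instance back and stop the rollout
      return { ok: false, fleet, history };
    }
    history.push(fleet.join(' ')); // old and new versions serve traffic side by side here
  }
  return { ok: true, fleet, history };
}

const result = rollingUpdate(['v1', 'v1', 'v1', 'v1'], 'v2', () => true);
console.log(result.history);
// [ 'v1 v1 v1 v1', 'v2 v1 v1 v1', 'v2 v2 v1 v1', 'v2 v2 v2 v1', 'v2 v2 v2 v2' ]

// A failing second replacement halts the rollout with a mixed fleet.
const halted = rollingUpdate(['v1', 'v1'], 'v2', (_version, index) => index === 0);
console.log(halted.ok, halted.fleet); // false [ 'v2', 'v1' ]
```

Every intermediate state in `history` is a mixed fleet, which is why the schema has to work for both versions during the rollout.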
2. Blue-Green Deployment
┌────────────────────────────────────┐
│ Load Balancer │
│ ↓ │
│ [Blue: v1] ← live traffic │
│ [Green: v2] ← dark (idle) │
│ │
│ Switch LB: Blue → Green (< 1s) │
│ Keep Blue idle for instant rollback│
└────────────────────────────────────┘
- Instant cutover and instant rollback
- Double the infrastructure cost
- DB migrations must run before switch
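The cutover and rollback amount to flipping a single pointer, which is why both are close to instant. The sketch below models that pointer only; `createBlueGreen`, `switchTo`, and `rollback` are hypothetical names, and a real setup would change a load balancer target or DNS weight instead of a local variable.

```javascript
// Two environments stay deployed; only the "live" pointer changes.
function createBlueGreen(blueVersion, greenVersion) {
  const envs = { blue: blueVersion, green: greenVersion };
  let live = 'blue';
  return {
    liveVersion: () => envs[live],
    // Move traffic to the target environment, but only if it passes its check.
    switchTo(target, isHealthy) {
      if (!isHealthy(envs[target])) return false;
      live = target;
      return true;
    },
    // The previous environment was kept idle, so rollback is the same flip.
    rollback() {
      live = live === 'blue' ? 'green' : 'blue';
    },
  };
}

const lb = createBlueGreen('v1', 'v2');
const switched = lb.switchTo('green', () => true);
const afterSwitch = lb.liveVersion();
lb.rollback();
const afterRollback = lb.liveVersion();
console.log(switched, afterSwitch, afterRollback); // true v2 v1
```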
3. Canary Release
┌────────────────────────────────────┐
│ 95% → v1 (stable) │
│ 5% → v2 (canary) │
│ │
│ Monitor error rate, latency, etc. │
│ Gradually increase to 100% if OK │
└────────────────────────────────────┘
- Safest: real production validation
- Requires traffic splitting (nginx, AWS ALB weights, Istio)
- Automatic rollback if error rate spikes
javascript// Database migration rule for zero-downtime:
// Never break the schema while old code is still running.
// Bad: rename column in one migration
// ALTER TABLE users RENAME COLUMN name TO full_name;
// Old code breaks immediately.
// Good: expand-then-contract, spread over several deploys:
// Migration 1: Add new column (backward compatible) and backfill it
// ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
// UPDATE users SET full_name = name;
// Deploy v2: writes to BOTH name and full_name, reads from full_name
// (re-run the backfill once v1 is gone, since v1 only wrote name during the rollout)
// Deploy v3 (after v2 is stable): stop writing name
// Migration 2 (after no running code references name): drop old column
// ALTER TABLE users DROP COLUMN name;
"How Do You Optimize an API's Response Time?"
Step 1 — Measure Before You Touch Anything
javascript// Add structured timing logs to find bottlenecks:
app.use((req, res, next) => {
const start = Date.now();
const timings: Record<string, number> = {};
req.time = (label: string) => {
timings[label] = Date.now() - start;
};
res.on('finish', () => {
logger.info('request completed', {
method: req.method,
path: req.path,
status: res.statusCode,
totalMs: Date.now() - start,
timings,
});
});
next();
});
// Usage in route:
app.get('/api/orders', async (req, res) => {
const user = await getUser(req.userId);
req.time('user_fetch'); // cumulative ms since request start, e.g. { user_fetch: 45 }
const orders = await getOrders(req.userId);
req.time('orders_fetch'); // e.g. { orders_fetch: 340 }, so this step took ~295ms ← bottleneck
res.json({ user, orders });
});
Metrics to care about: p99 latency (not the average, which hides the long tail), error rate, and throughput (req/s).
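A small calculation shows why the average is misleading. The sketch below uses a nearest-rank percentile and a made-up sample of 100 latencies; the helper names `percentile` and `mean` are illustrative, and production systems normally compute percentiles from histograms in the metrics backend, not from raw arrays.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Hypothetical sample: 98 fast requests at 40 ms and 2 slow ones at 3000 ms.
const latencies = [...Array(98).fill(40), 3000, 3000];

const avgMs = mean(latencies);
const p50Ms = percentile(latencies, 50);
const p99Ms = percentile(latencies, 99);
console.log({ avgMs, p50Ms, p99Ms }); // { avgMs: 99.2, p50Ms: 40, p99Ms: 3000 }
```

The average of about 99 ms describes no real request: most took 40 ms, and the slowest 2% took 3 seconds, which only the p99 exposes.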
Step 2 — Parallelize Independent I/O
javascript// Sequential (bad) — waits for each before starting next:
const user = await db.users.find(id); // 50ms
const orders = await db.orders.find(id); // 80ms
const notifs = await db.notifs.find(id); // 40ms
// Total: 170ms
// Parallel (good) — all fire at once:
const [user, orders, notifs] = await Promise.all([
db.users.find(id),
db.orders.find(id),
db.notifs.find(id),
]);
// Total: 80ms (longest of the three)
// Optional parallel (don't fail if non-critical service is down):
const [critical, optional] = await Promise.allSettled([
db.orders.find(id), // must succeed
recommendations.get(id), // nice to have
]);
if (critical.status === 'rejected') throw critical.reason; // a critical failure still fails the request
const orders = critical.value;
const recs = optional.status === 'fulfilled' ? optional.value : null;
Step 3 — Eliminate N+1 Queries
javascript// N+1 pattern (the most common DB killer):
const posts = await db.posts.findAll(); // 1 query
for (const post of posts) {
post.author = await db.users.findById(post.userId); // N queries!
}
// For 100 posts: 101 queries
// Fix 1: JOIN (best for simple cases)
const posts = await db.query(`
SELECT posts.*, users.name AS author_name, users.avatar AS author_avatar
FROM posts
JOIN users ON posts.user_id = users.id
WHERE posts.status = 'published'
ORDER BY posts.created_at DESC
LIMIT 20
`);
// Fix 2: DataLoader batching (for GraphQL or reusable loaders)
const userLoader = new DataLoader(async (ids: readonly string[]) => {
const users = await db.users.findByIds(ids); // 1 query for all IDs
const map = new Map(users.map(u => [u.id, u]));
// DataLoader expects one result (or Error) per key, in the same order as the keys
return ids.map(id => map.get(id) ?? new Error(`User ${id} not found`));
});
// load() calls made in the same tick are batched into a single query:
const posts = await db.posts.findAll();
const authors = await Promise.all(posts.map(p => userLoader.load(p.userId)));
posts.forEach((post, i) => { post.author = authors[i]; });
Step 4 — Cache Aggressively
javascript// Cache hierarchy: in-memory > Redis > DB
// Level 1: In-process memory (< 1ms)
const cache = new Map<string, { data: unknown; expiry: number }>();
function memCache<T>(key: string, ttlMs: number, fn: () => Promise<T>): Promise<T> {
const hit = cache.get(key);
if (hit && hit.expiry > Date.now()) return Promise.resolve(hit.data as T);
return fn().then(data => {
cache.set(key, { data, expiry: Date.now() + ttlMs });
return data;
});
}
// Level 2: Redis (1-5ms, survives restarts, shared across instances)
async function redisCache<T>(key: string, ttlSec: number, fn: () => Promise<T>): Promise<T> {
const cached = await redis.get(key);
if (cached) return JSON.parse(cached) as T;
const data = await fn();
await redis.setex(key, ttlSec, JSON.stringify(data));
return data;
}
// Usage — cache expensive DB aggregation:
app.get('/api/leaderboard', async (req, res) => {
const data = await redisCache('leaderboard:global', 30, async () => {
return db.query(`
SELECT user_id, SUM(score) as total
FROM events
GROUP BY user_id
ORDER BY total DESC
LIMIT 100
`);
});
res.json(data);
});
Step 5 — Paginate, Stream, and Compress
javascript// Cursor-based pagination (more efficient than OFFSET for large tables):
app.get('/api/feed', async (req, res) => {
const cursor = req.query.cursor as string | undefined;
const limit = Math.min(Math.max(parseInt(req.query.limit as string, 10) || 20, 1), 100);
// Compare (created_at, id) as a pair so rows sharing a timestamp are not skipped
const rows = await db.query(`
SELECT id, title, created_at
FROM posts
WHERE ($1::uuid IS NULL OR (created_at, id) < (SELECT created_at, id FROM posts WHERE id = $1))
ORDER BY created_at DESC, id DESC
LIMIT $2
`, [cursor ?? null, limit + 1]);
const hasMore = rows.length > limit;
const items = hasMore ? rows.slice(0, limit) : rows;
res.json({
items,
nextCursor: hasMore ? items[items.length - 1].id : null,
});
});
// Stream large responses instead of buffering in memory:
import { once } from 'node:events';
app.get('/api/export/csv', async (req, res) => {
res.setHeader('Content-Type', 'text/csv');
res.setHeader('Content-Disposition', 'attachment; filename=export.csv');
res.write('id,email,created_at\n'); // header
const cursor = db.queryStream('SELECT id, email, created_at FROM users ORDER BY id');
for await (const row of cursor) {
// Values are written unescaped; quote fields that can contain commas or newlines
const ok = res.write(`${row.id},${row.email},${row.created_at}\n`);
if (!ok) await once(res, 'drain'); // respect backpressure from slow clients
}
res.end();
// Memory usage stays bounded because writes pause until the socket drains
});
// Enable compression (30-70% size reduction for JSON):
import compression from 'compression';
app.use(compression({ threshold: 1024 })); // compress responses > 1KB
"How Do You Handle Traffic Spikes?"
Spike handling strategy (in order of speed to implement):
1. Caching (instant) — serve repeated data from cache, not DB
2. Rate limiting (minutes) — protect DB and downstream services
3. Queue-based leveling (hours) — accept work, process at your own pace
4. Horizontal scaling (hours to days) — more instances
5. Auto-scaling (configured upfront) — scale on CPU/RPS metrics
javascript// Rate limiting with Redis (fixed window per client IP):
import rateLimit from 'express-rate-limit';
import RedisStore from 'rate-limit-redis';
const limiter = rateLimit({
windowMs: 60_000, // 1 minute window
max: 100, // 100 requests per window per IP
standardHeaders: true,
store: new RedisStore({ sendCommand: (...args) => redis.sendCommand(args) }),
handler: (req, res) => {
res.status(429).json({
error: 'Too many requests',
retryAfter: Math.ceil((req.rateLimit.resetTime.getTime() - Date.now()) / 1000), // seconds until the window resets (resetTime is a Date)
});
},
});
app.use('/api/', limiter);
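// Load shedding: reject when the system is overloaded
import os from 'node:os';
import v8 from 'node:v8';
const heapLimit = v8.getHeapStatistics().heap_size_limit; // the most heap V8 will allocate
app.use((req, res, next) => {
// heapTotal grows on demand, so compare usage against the V8 limit instead
const heapRatio = process.memoryUsage().heapUsed / heapLimit;
const cpuLoad = os.loadavg()[0] / os.cpus().length; // 1-min load per core (always 0 on Windows)
if (heapRatio > 0.90 || cpuLoad > 0.95) {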
return res.status(503).json({
error: 'Service temporarily unavailable — high load',
retryAfter: 5,
});
}
next();
});
// Queue-based leveling for expensive work:
app.post('/api/reports/generate', async (req, res) => {
// Accept immediately, process asynchronously
const jobId = await queue.add('generate-report', {
userId: req.userId,
params: req.body,
});
res.status(202).json({
jobId,
statusUrl: `/api/reports/${jobId}/status`,
message: 'Report queued — check statusUrl for progress',
});
});
"How Do You Respond to a Production Incident?"
Incident Response Runbook
Severity levels:
P0: Site down / data loss / security breach → wake up entire team
P1: Partial outage / significant degradation → wake up on-call
P2: Degraded feature / elevated errors → fix during business hours
P3: Minor issue → schedule in next sprint
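The severity table translates naturally into data that alerting code can check. The sketch below is one possible encoding, with hypothetical names (`SEVERITY_POLICY`, `ackBreached`). It uses the acknowledgement SLAs given under DETECT in the response steps, which are stated only for P0 and P1, so P2 and P3 are left as `null` instead of guessed.

```javascript
// Paging policy per severity, taken from the runbook's severity levels.
const SEVERITY_POLICY = {
  P0: { page: 'entire team', ackWithinMin: 5 },
  P1: { page: 'on-call', ackWithinMin: 15 },
  P2: { page: null, ackWithinMin: null }, // fix during business hours
  P3: { page: null, ackWithinMin: null }, // schedule in next sprint
};

// True when an alert has gone unacknowledged for longer than its SLA allows.
function ackBreached(severity, alertAtMs, nowMs) {
  const { ackWithinMin } = SEVERITY_POLICY[severity];
  if (ackWithinMin === null) return false; // no ack SLA defined for this level
  return nowMs - alertAtMs > ackWithinMin * 60_000;
}

const sixMinutes = 6 * 60_000;
console.log(ackBreached('P0', 0, sixMinutes)); // true: past the 5-minute SLA
console.log(ackBreached('P1', 0, sixMinutes)); // false: still inside 15 minutes
```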
Response steps (DAIR):
1. DETECT — monitoring alert or user report
- Acknowledge within SLA (P0: 5 min, P1: 15 min)
- Open incident channel: #inc-YYYY-MM-DD-brief-description
- Post an initial update on the status page
2. ASSESS — understand scope before acting
- What is broken? Which users/services affected?
- When did it start? (correlate with recent deploys)
- Is it getting better, worse, or stable?
- Check dashboards: error rate, latency, DB connections, queue depth
3. INVESTIGATE — find the cause
- git log --since="2 hours ago" (recent deploys?)
- Check logs: grep for ERROR/FATAL around incident start time
- Metrics: before vs after comparison
- Isolate: is it one region, one service, or global?
4. REMEDIATE — fix or mitigate
- If recent deploy: rollback first, investigate later
- If DB query: kill offending query, add index
- If memory leak: restart instances, add node --max-old-space-size limit
- Communicate: update status page every 30 minutes
javascript// Structured logging makes incident investigation fast:
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label }),
},
});
// Every log line carries a request ID so entries can be correlated across services:
import { randomUUID } from 'node:crypto';
app.use((req, res, next) => {
req.requestId = req.headers['x-request-id'] ?? randomUUID();
req.log = logger.child({
requestId: req.requestId,
userId: req.user?.id,
path: req.path,
method: req.method,
});
next();
});
// Error handling with structured context:
app.use((err: Error, req: Request, res: Response, _next: NextFunction) => {
// pino's default err serializer records the error type, message, and stack
req.log.error({ err }, 'Unhandled request error');
res.status(500).json({ error: 'Internal server error', requestId: req.requestId });
// requestId lets users report issues you can trace immediately
});
Post-Mortem Template
markdown## Incident Post-Mortem: [Title]
**Date:** YYYY-MM-DD
**Duration:** HH:MM
**Severity:** P1
**Author:** [Name]
### Summary
One paragraph: what happened, who was affected, how long.
### Timeline
- 14:32 — Alert fired: error rate > 5%
- 14:35 — On-call acknowledged
- 14:40 — Identified root cause: deploy at 14:30 introduced regression
- 14:45 — Rolled back deploy
- 14:47 — Error rate returned to baseline
### Root Cause
New connection pool config set max: 2 instead of max: 20.
Under load, requests queued behind connection wait → timeout → 503s.
### Impact
~15 minutes of elevated errors (12% error rate vs 0.1% baseline).
Estimated 3,400 failed requests.
### What Went Well
- Alert fired within 2 minutes of problem starting
- Rollback executed quickly
### Action Items
| Action | Owner | Due |
|--------|-------|-----|
| Add config validation test for pool settings | @eng | 2024-02-15 |
| Add connection wait time to dashboards | @ops | 2024-02-10 |
### Lessons
No blame — focus on system improvements.
The config was not validated in CI, allowing an invalid value to reach production.
"How Do You Monitor a Node.js Application in Production?"
javascript// Key metrics to track:
import { monitorEventLoopDelay } from 'node:perf_hooks';
// 1. Event loop lag (sign of CPU starvation), sampled by Node's built-in histogram:
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();
// 2. Process health (every 30s):
setInterval(() => {
const { heapUsed, heapTotal, rss } = process.memoryUsage();
metrics.gauge('nodejs.heap.used', heapUsed);
metrics.gauge('nodejs.heap.total', heapTotal);
metrics.gauge('nodejs.rss', rss);
metrics.gauge('nodejs.event_loop.lag_ms', loopDelay.mean / 1e6); // histogram values are in nanoseconds
loopDelay.reset(); // start a fresh window for the next report
}, 30_000).unref();
// > 100ms event loop lag = something is blocking the event loop
// 3. HTTP request metrics (middleware):
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = Date.now() - start;
const route = req.route?.path ?? 'unmatched'; // raw paths as labels would create unbounded cardinality
metrics.histogram('http.request.duration', duration, {
method: req.method,
route,
status: String(res.statusCode),
});
metrics.increment('http.requests.total', {
method: req.method,
route,
status: String(res.statusCode),
});
});
next();
});
// 4. Alerts to configure:
// - Error rate > 1% for 5 minutes → P1
// - p99 latency > 2s for 10 minutes → P2
// - Heap usage > 80% → P2
// - Event loop lag > 200ms for 2 minutes → P1
// - DB connection pool exhausted → P1
// - Failed health checks on > 1 instance → P1
"What Is Your Strategy for Database Query Optimization?"
sql-- 1. EXPLAIN ANALYZE is your best friend:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.id, u.email, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > NOW() - INTERVAL '30 days'
GROUP BY u.id;
-- Look for:
-- Seq Scan on a large table when only a few rows match → missing index
-- Nested Loop on large sets → missing index or bad join
-- "rows=10000 actual rows=1" → stale statistics (ANALYZE table)
-- Hash Batches > 1 → work_mem too low
-- 2. Index the right columns:
-- Index WHERE clauses, JOIN keys, ORDER BY columns
CREATE INDEX CONCURRENTLY idx_users_created_at ON users (created_at DESC);
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);
-- CONCURRENTLY = builds without blocking writes (slower, and cannot run inside a transaction block)
-- 3. Partial indexes (index only rows you query):
-- Most orders queries only touch 'pending' status
CREATE INDEX idx_orders_pending ON orders (created_at DESC)
WHERE status = 'pending';
-- Tiny index, very fast for that use case
-- 4. Covering index (index stores all needed columns — no heap fetch):
CREATE INDEX idx_users_email_covering ON users (email)
INCLUDE (id, name, status);
-- Query: SELECT id, name, status FROM users WHERE email = ?
-- Can be answered by an index-only scan, skipping the heap for pages marked all-visible
javascript// Connection pool tuning (Node.js with pg):
const pool = new Pool({
max: Math.min(10, os.cpus().length * 2), // never more than DB allows
min: 2,
idleTimeoutMillis: 30_000,
connectionTimeoutMillis: 3_000, // fail fast, don't queue forever
statement_timeout: 10_000, // kill queries > 10s (prevent runaway)
query_timeout: 10_000,
});
// Always set statement_timeout to prevent runaway queries
// taking down your entire DB under load
"How Do You Prevent and Handle Cascading Failures?"
A cascading failure: Service A slow → A's thread pool exhausted →
A returns 500s → B retries A aggressively → A gets more traffic →
A is completely overwhelmed → B also goes down → entire system fails.
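The retry step in that chain is multiplicative, which is what makes it dangerous. As a rough worked example (the helper name `worstCaseAmplification` and the traffic numbers are illustrative assumptions), the worst-case load on the deepest service is the base request rate times the product of the attempts made at each layer above it.

```javascript
// Worst-case load multiplier on the deepest service when every call fails.
// attemptsPerLayer[i] = total tries (first call plus retries) made by layer i.
function worstCaseAmplification(attemptsPerLayer) {
  return attemptsPerLayer.reduce((total, attempts) => total * attempts, 1);
}

const baseRps = 1000;

// B calls A and retries up to 3 times: 4 attempts per incoming request.
const oneLayer = baseRps * worstCaseAmplification([4]);

// A client in front of B also makes up to 4 attempts: the factors multiply.
const twoLayers = baseRps * worstCaseAmplification([4, 4]);

console.log(oneLayer, twoLayers); // 4000 16000
```

A service that was already struggling at 1,000 requests per second now receives up to sixteen times that, which is the situation the prevention techniques are meant to avoid.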
Prevention toolkit:
1. Timeouts — never wait forever on a dependency
2. Circuit breaker — stop calling a failing service
3. Bulkhead — isolate resources per dependency
4. Backpressure — signal upstream to slow down
5. Retry jitter — prevent thundering herd on recovery
typescript// Production-grade circuit breaker:
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
class CircuitBreaker {
private state: State = 'CLOSED';
private failureCount = 0;
private lastFailureTime = 0;
constructor(
private readonly threshold = 5, // failures to open
private readonly resetTimeout = 30_000, // ms before trying again
) {}
async execute<T>(fn: () => Promise<T>): Promise<T> {
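if (this.state === 'HALF_OPEN') {
// A probe is already in flight; keep rejecting until it settles
throw new Error('Circuit is HALF_OPEN: probe in progress');
}
if (this.state === 'OPEN') {
if (Date.now() - this.lastFailureTime > this.resetTimeout) {
this.state = 'HALF_OPEN'; // let exactly one probe request through
} else {
throw new Error('Circuit is OPEN — dependency unavailable');
}
}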
try {
const result = await fn();
this.onSuccess();
return result;
} catch (err) {
this.onFailure();
throw err;
}
}
private onSuccess() {
this.failureCount = 0;
this.state = 'CLOSED';
}
private onFailure() {
this.failureCount++;
this.lastFailureTime = Date.now();
if (this.failureCount >= this.threshold || this.state === 'HALF_OPEN') {
this.state = 'OPEN';
console.warn(`Circuit OPEN after ${this.failureCount} failures`);
}
}
}
// Per-dependency breakers (bulkhead pattern):
const paymentBreaker = new CircuitBreaker(5, 30_000);
const emailBreaker = new CircuitBreaker(3, 60_000);
async function checkout(order: Order) {
// Payment is critical — let it throw if circuit is open
const payment = await paymentBreaker.execute(() => paymentService.charge(order));
// Email is non-critical — graceful degradation
try {
await emailBreaker.execute(() => emailService.sendConfirmation(order));
} catch (err) {
logger.warn('Email service unavailable — skipping confirmation', { orderId: order.id });
// Queue for retry later
await queue.add('send-email', { orderId: order.id }, { delay: 60_000 });
}
return payment;
}
// Retry with jitter to avoid thundering herd:
async function retryWithJitter<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
return await fn();
} catch (err) {
if (attempt === maxAttempts) throw err;
const base = Math.min(100 * 2 ** attempt, 10_000); // exponential: 200, 400, 800...
const jitter = Math.random() * base; // add 0-100% randomness
await new Promise(r => setTimeout(r, base + jitter));
}
}
throw new Error('unreachable');
}
Quick-Fire: Operational Q&A
| Question | Answer |
|---|---|
| p99 vs average latency? | p99 = 99th percentile — 1% of requests are slower. Average hides the long tail. Always track p99. |
| When to use blue-green vs canary? | Blue-green for high-risk deploys needing instant rollback. Canary when you want gradual validation. |
| What is a thundering herd? | Many clients retry at the same time → overload recovering service. Prevent with jitter. |
| How many DB connections per Node process? | min(10, cpu_count × 2). Never let pool exhaust. Set connectionTimeoutMillis. |
| What is backpressure? | A signal to slow down: a slow writable stream pausing the readable that feeds it, or queue depth triggering 429s. |
| Rolling restart without downtime? | Wait for the health check to pass before routing traffic to a new instance. Use readiness probes and minReadySeconds in k8s. |
| How do you test resilience? | Chaos engineering: kill random instances, inject latency, block network. Use tc netem or Chaos Monkey. |
| What's the difference between RTO and RPO? | RTO = max downtime (recovery time). RPO = max data loss (recovery point). |
| How do you handle a bad database migration? | Avoid needing a rollback: keep both the old and new column, deploy dual-write code, verify, then drop the old column in a later migration. |
| What makes an API idempotent? | Same request produces same result if called N times. Use idempotency keys for POST operations. |
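The idempotency-key answer can be sketched in a few lines. This is a minimal in-memory model under stated assumptions: `createIdempotentHandler` is a hypothetical helper, the handler is synchronous for brevity, and a real service would keep the key-to-result mapping in a shared store with an expiry so that every instance sees the same keys.

```javascript
// Wrap a side-effecting handler so repeated calls with the same key run it only once.
function createIdempotentHandler(handler) {
  const results = new Map(); // idempotency key -> stored result
  return function handle(key, payload) {
    if (results.has(key)) return results.get(key); // replay: no second side effect
    const result = handler(payload);
    results.set(key, result);
    return result;
  };
}

let charges = 0;
const charge = createIdempotentHandler((order) => {
  charges += 1; // stands in for the real side effect, e.g. charging a card
  return { chargeId: `ch_${charges}`, amount: order.amount };
});

const first = charge('key-123', { amount: 50 });
const retry = charge('key-123', { amount: 50 }); // client retried after a timeout
console.log(first === retry, charges); // true 1
```

The client generates the key once per logical operation and resends it on every retry, so a timeout followed by a retry cannot charge twice.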