Operational Reliability — Interview Q&A

Deep-dive answers for production-readiness and reliability questions common at senior / SDE3 interviews.

"How Do You Handle Server Downtime?"

This question tests whether you design for failure by default, not as an afterthought.

Layer 1 — Process Crashes: PM2 Cluster Mode

javascript// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'api',
    script: 'dist/index.js',
    instances: 'max',        // one per CPU core
    exec_mode: 'cluster',    // share port across workers
    autorestart: true,
    max_restarts: 10,
    min_uptime: '5s',        // don't count crash if process dies in < 5s
    restart_delay: 1000,     // wait 1s between restarts
    watch: false,            // never watch in prod
    env_production: {
      NODE_ENV: 'production',
      PORT: 3000,
    },
  }],
};

// pm2 start ecosystem.config.js --env production
// pm2 monit   ← live CPU/memory/logs
// pm2 logs    ← aggregated log stream

When one worker crashes, PM2 restarts it. The other workers keep serving traffic. Zero downtime for single-process crashes.

Layer 2 — Graceful Shutdown (SIGTERM)

javascript// When a load balancer removes a server from rotation, it sends SIGTERM.
// The process should drain in-flight requests before exiting.

const server = app.listen(PORT);

let shuttingDown = false;

process.on('SIGTERM', async () => {
  console.log('SIGTERM received — starting graceful shutdown');
  shuttingDown = true;

  // 1. Stop accepting new connections
  server.close(async () => {
    console.log('HTTP server closed');

    // 2. Drain open resources
    try {
      await Promise.all([
        dbPool.end(),          // close DB connection pool
        redisClient.quit(),    // close Redis connection
        mqChannel.close(),     // close message queue channel
      ]);
      console.log('Resources drained — exiting cleanly');
      process.exit(0);
    } catch (err) {
      console.error('Error during shutdown', err);
      process.exit(1);
    }
  });

  // 3. Safety timeout — force exit if drain takes too long
  setTimeout(() => {
    console.error('Graceful shutdown timed out — forcing exit');
    process.exit(1);
  }, 15_000);
});

// Reject new requests during shutdown (return 503 to load balancer):
app.use((req, res, next) => {
  if (shuttingDown) {
    res.setHeader('Connection', 'close');
    return res.status(503).json({ error: 'Server shutting down' });
  }
  next();
});

Layer 3 — Health Checks

javascript// Load balancers (AWS ALB, nginx, k8s) probe /health every 10-30s.
// Return 503 → server removed from rotation immediately.

app.get('/health', async (req, res) => {
  const checks: Record<string, 'ok' | 'fail'> = {};

  // Check DB connectivity
  try {
    await pool.query('SELECT 1');
    checks.db = 'ok';
  } catch {
    checks.db = 'fail';
  }

  // Check Redis connectivity
  try {
    await redis.ping();
    checks.redis = 'ok';
  } catch {
    checks.redis = 'fail';
  }

  const healthy = Object.values(checks).every(v => v === 'ok');

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'degraded',
    uptime: process.uptime(),
    checks,
  });
});

// Kubernetes liveness vs readiness:
// /health/live  → is the process up? (restart if not)
// /health/ready → can it serve traffic? (remove from Service if not)
app.get('/health/live', (_req, res) => res.json({ status: 'alive' }));
app.get('/health/ready', async (req, res) => {
  // readiness fails during startup or drain
  if (shuttingDown || !dbReady) return res.status(503).json({ status: 'not ready' });
  res.json({ status: 'ready' });
});

Layer 4 — Zero-Downtime Deployments

Three strategies — choose based on risk tolerance:

1. Rolling Deployment (default in k8s)
   ┌────────────────────────────────────┐
   │ v1 v1 v1 v1  →  v2 v1 v1 v1      │
   │              →  v2 v2 v1 v1      │
   │              →  v2 v2 v2 v1      │
   │              →  v2 v2 v2 v2  ✓   │
   └────────────────────────────────────┘
   - No extra infrastructure needed
   - Both versions briefly serve traffic simultaneously
   - DB schema must be backward-compatible during rollout

2. Blue-Green Deployment
   ┌────────────────────────────────────┐
   │ Load Balancer                       │
   │     ↓                               │
   │  [Blue: v1] ← live traffic         │
   │  [Green: v2] ← dark (idle)        │
   │                                     │
   │  Switch LB: Blue → Green (< 1s)    │
   │  Keep Blue idle for instant rollback│
   └────────────────────────────────────┘
   - Instant cutover and instant rollback
   - Double the infrastructure cost
   - DB migrations must run before switch

3. Canary Release
   ┌────────────────────────────────────┐
   │ 95% → v1 (stable)                 │
   │  5% → v2 (canary)                 │
   │                                     │
   │ Monitor error rate, latency, etc.  │
   │ Gradually increase to 100% if OK  │
   └────────────────────────────────────┘
   - Safest: real production validation
   - Requires traffic splitting (nginx, AWS ALB weights, Istio)
   - Automatic rollback if error rate spikes

javascript// Database migration rule for zero-downtime:
// Never break the schema while old code is still running.

// Bad: rename column in one migration
// ALTER TABLE users RENAME COLUMN name TO full_name;
// Old code breaks immediately.

// Good: expand-then-contract (3 deploys):
// Migration 1: Add new column (backward compatible)
// ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
// UPDATE users SET full_name = name;

// Deploy v2: writes to BOTH name and full_name, reads from full_name

// Migration 2 (after v2 is stable): drop old column
// ALTER TABLE users DROP COLUMN name;

"How Do You Optimize an API's Response Time?"

Step 1 — Measure Before You Touch Anything

javascript// Add structured timing logs to find bottlenecks:
app.use((req, res, next) => {
  const start = Date.now();
  const timings: Record<string, number> = {};

  req.time = (label: string) => {
    timings[label] = Date.now() - start;
  };

  res.on('finish', () => {
    logger.info('request completed', {
      method: req.method,
      path: req.path,
      status: res.statusCode,
      totalMs: Date.now() - start,
      timings,
    });
  });

  next();
});

// Usage in route:
app.get('/api/orders', async (req, res) => {
  const user = await getUser(req.userId);
  req.time('user_fetch');                    // logged: { user_fetch: 45 }
  const orders = await getOrders(req.userId);
  req.time('orders_fetch');                  // logged: { orders_fetch: 340 } ← bottleneck
  res.json({ user, orders });
});

Metrics to care about: p99 latency (not average — average hides long tail), error rate, throughput (req/s).

Step 2 — Parallelize Independent I/O

javascript// Sequential (bad) — waits for each before starting next:
const user    = await db.users.find(id);       // 50ms
const orders  = await db.orders.find(id);      // 80ms
const notifs  = await db.notifs.find(id);      // 40ms
// Total: 170ms

// Parallel (good) — all fire at once:
const [user, orders, notifs] = await Promise.all([
  db.users.find(id),
  db.orders.find(id),
  db.notifs.find(id),
]);
// Total: 80ms (longest of the three)

// Optional parallel (don't fail if non-critical service is down):
const [critical, optional] = await Promise.allSettled([
  db.orders.find(id),           // must succeed
  recommendations.get(id),      // nice to have
]);
const orders  = critical.status === 'fulfilled' ? critical.value : [];
const recs    = optional.status === 'fulfilled' ? optional.value : null;

Step 3 — Eliminate N+1 Queries

javascript// N+1 pattern (the most common DB killer):
const posts = await db.posts.findAll();             // 1 query
for (const post of posts) {
  post.author = await db.users.findById(post.userId); // N queries!
}
// For 100 posts: 101 queries

// Fix 1: JOIN (best for simple cases)
const posts = await db.query(`
  SELECT posts.*, users.name AS author_name, users.avatar AS author_avatar
  FROM posts
  JOIN users ON posts.user_id = users.id
  WHERE posts.status = 'published'
  ORDER BY posts.created_at DESC
  LIMIT 20
`);

// Fix 2: DataLoader batching (for GraphQL or reusable loaders)
const userLoader = new DataLoader(async (ids: string[]) => {
  const users = await db.users.findByIds(ids);      // 1 query for all IDs
  const map = new Map(users.map(u => [u.id, u]));
  return ids.map(id => map.get(id));
});

// Now each post.author call gets batched into a single query:
const posts = await db.posts.findAll();
await Promise.all(posts.map(p => userLoader.load(p.userId)));

Step 4 — Cache Aggressively

javascript// Cache hierarchy: in-memory > Redis > DB

// Level 1: In-process memory (< 1ms)
const cache = new Map<string, { data: unknown; expiry: number }>();

function memCache<T>(key: string, ttlMs: number, fn: () => Promise<T>): Promise<T> {
  const hit = cache.get(key);
  if (hit && hit.expiry > Date.now()) return Promise.resolve(hit.data as T);
  return fn().then(data => {
    cache.set(key, { data, expiry: Date.now() + ttlMs });
    return data;
  });
}

// Level 2: Redis (1-5ms, survives restarts, shared across instances)
async function redisCache<T>(key: string, ttlSec: number, fn: () => Promise<T>): Promise<T> {
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached) as T;
  const data = await fn();
  await redis.setex(key, ttlSec, JSON.stringify(data));
  return data;
}

// Usage — cache expensive DB aggregation:
app.get('/api/leaderboard', async (req, res) => {
  const data = await redisCache('leaderboard:global', 30, async () => {
    return db.query(`
      SELECT user_id, SUM(score) as total
      FROM events
      GROUP BY user_id
      ORDER BY total DESC
      LIMIT 100
    `);
  });
  res.json(data);
});

Step 5 — Paginate, Stream, and Compress

javascript// Cursor-based pagination (more efficient than OFFSET for large tables):
app.get('/api/feed', async (req, res) => {
  const cursor = req.query.cursor as string | undefined;
  const limit  = Math.min(parseInt(req.query.limit as string) || 20, 100);

  const rows = await db.query(`
    SELECT id, title, created_at
    FROM posts
    WHERE ($1::uuid IS NULL OR created_at < (SELECT created_at FROM posts WHERE id = $1))
    ORDER BY created_at DESC
    LIMIT $2
  `, [cursor ?? null, limit + 1]);

  const hasMore = rows.length > limit;
  const items   = hasMore ? rows.slice(0, limit) : rows;

  res.json({
    items,
    nextCursor: hasMore ? items[items.length - 1].id : null,
  });
});

// Stream large responses instead of buffering in memory:
app.get('/api/export/csv', async (req, res) => {
  res.setHeader('Content-Type', 'text/csv');
  res.setHeader('Content-Disposition', 'attachment; filename=export.csv');

  res.write('id,email,created_at\n'); // header

  const cursor = db.queryStream('SELECT id, email, created_at FROM users ORDER BY id');

  for await (const row of cursor) {
    res.write(`${row.id},${row.email},${row.created_at}\n`);
  }

  res.end();
  // Memory usage stays constant regardless of row count
});

// Enable compression (30-70% size reduction for JSON):
import compression from 'compression';
app.use(compression({ threshold: 1024 })); // compress responses > 1KB

"How Do You Handle Traffic Spikes?"

Spike handling strategy (in order of speed to implement):

1. Caching (instant) — serve repeated data from cache, not DB
2. Rate limiting (minutes) — protect DB and downstream services
3. Queue-based leveling (hours) — accept work, process at your own pace
4. Horizontal scaling (hours to days) — more instances
5. Auto-scaling (configured upfront) — scale on CPU/RPS metrics

javascript// Rate limiting with Redis (token bucket per user):
import rateLimit from 'express-rate-limit';
import RedisStore from 'rate-limit-redis';

const limiter = rateLimit({
  windowMs: 60_000,  // 1 minute window
  max: 100,          // 100 requests per window per IP
  standardHeaders: true,
  store: new RedisStore({ sendCommand: (...args) => redis.sendCommand(args) }),
  handler: (req, res) => {
    res.status(429).json({
      error: 'Too many requests',
      retryAfter: Math.ceil(req.rateLimit.resetTime / 1000),
    });
  },
});

app.use('/api/', limiter);

// Load shedding — reject when system is overloaded:
app.use((req, res, next) => {
  const { heapUsed, heapTotal } = process.memoryUsage();
  const heapRatio = heapUsed / heapTotal;
  const cpuLoad   = os.loadavg()[0] / os.cpus().length; // 1-min load per core

  if (heapRatio > 0.90 || cpuLoad > 0.95) {
    return res.status(503).json({
      error: 'Service temporarily unavailable — high load',
      retryAfter: 5,
    });
  }
  next();
});

// Queue-based leveling for expensive work:
app.post('/api/reports/generate', async (req, res) => {
  // Accept immediately, process asynchronously
  const jobId = await queue.add('generate-report', {
    userId: req.userId,
    params: req.body,
  });

  res.status(202).json({
    jobId,
    statusUrl: `/api/reports/${jobId}/status`,
    message: 'Report queued — check statusUrl for progress',
  });
});

"How Do You Respond to a Production Incident?"

Incident Response Runbook

Severity levels:
  P0: Site down / data loss / security breach → wake up entire team
  P1: Partial outage / significant degradation → wake up on-call
  P2: Degraded feature / elevated errors → fix during business hours
  P3: Minor issue → schedule in next sprint

Response steps (DAIR):

1. DETECT — monitoring alert or user report
   - Acknowledge within SLA (P0: 5 min, P1: 15 min)
   - Open incident channel: #inc-YYYY-MM-DD-brief-description
   - Post initial update in status page

2. ASSESS — understand scope before acting
   - What is broken? Which users/services affected?
   - When did it start? (correlate with recent deploys)
   - Is it getting better, worse, or stable?
   - Check dashboards: error rate, latency, DB connections, queue depth

3. INVESTIGATE — find the cause
   - git log --since="2 hours ago" (recent deploys?)
   - Check logs: grep for ERROR/FATAL around incident start time
   - Metrics: before vs after comparison
   - Isolate: is it one region, one service, or global?

4. REMEDIATE — fix or mitigate
   - If recent deploy: rollback first, investigate later
   - If DB query: kill offending query, add index
   - If memory leak: restart instances, add node --max-old-space-size limit
   - Communicate: update status page every 30 minutes

javascript// Structured logging makes incident investigation fast:
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
});

// Every log line has trace context for distributed tracing:
app.use((req, res, next) => {
  req.log = logger.child({
    requestId: req.headers['x-request-id'] ?? crypto.randomUUID(),
    userId: req.user?.id,
    path: req.path,
    method: req.method,
  });
  next();
});

// Error handling with structured context:
app.use((err: Error, req: Request, res: Response, _next: NextFunction) => {
  req.log.error({
    err: { message: err.message, stack: err.stack, name: err.name },
    msg: 'Unhandled request error',
  });
  res.status(500).json({ error: 'Internal server error', requestId: req.requestId });
  // requestId lets users report issues you can trace immediately
});

Post-Mortem Template

markdown## Incident Post-Mortem: [Title]

**Date:** YYYY-MM-DD
**Duration:** HH:MM
**Severity:** P1
**Author:** [Name]

### Summary
One paragraph: what happened, who was affected, how long.

### Timeline
- 14:32 — Alert fired: error rate > 5%
- 14:35 — On-call acknowledged
- 14:40 — Identified root cause: deploy at 14:30 introduced regression
- 14:45 — Rolled back deploy
- 14:47 — Error rate returned to baseline

### Root Cause
New connection pool config set max: 2 instead of max: 20.
Under load, requests queued behind connection wait → timeout → 503s.

### Impact
~15 minutes of elevated errors (12% error rate vs 0.1% baseline).
Estimated 3,400 failed requests.

### What Went Well
- Alert fired within 2 minutes of problem starting
- Rollback executed quickly

### Action Items
| Action | Owner | Due |
|--------|-------|-----|
| Add config validation test for pool settings | @eng | 2024-02-15 |
| Add connection wait time to dashboards | @ops | 2024-02-10 |

### Lessons
No blame — focus on system improvements.
The config was not validated in CI, allowing an invalid value to reach production.

"How Do You Monitor a Node.js Application in Production?"

javascript// Key metrics to track:

// 1. Process health (every 30s):
setInterval(() => {
  const { heapUsed, heapTotal, external, rss } = process.memoryUsage();
  const cpuUsage = process.cpuUsage();

  metrics.gauge('nodejs.heap.used',  heapUsed);
  metrics.gauge('nodejs.heap.total', heapTotal);
  metrics.gauge('nodejs.rss',        rss);
  metrics.gauge('nodejs.event_loop.lag_ms', getEventLoopLag()); // see below
}, 30_000).unref();

// 2. Event loop lag (sign of CPU starvation):
function getEventLoopLag(): number {
  const start = process.hrtime.bigint();
  return new Promise<number>(resolve =>
    setImmediate(() => {
      const lag = Number(process.hrtime.bigint() - start) / 1e6; // ms
      resolve(lag);
    })
  ) as unknown as number;
}
// > 100ms event loop lag = something is blocking the event loop

// 3. HTTP request metrics (middleware):
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    const route = req.route?.path ?? req.path;
    metrics.histogram('http.request.duration', duration, {
      method:  req.method,
      route,
      status:  String(res.statusCode),
    });
    metrics.increment('http.requests.total', {
      method: req.method,
      route,
      status: String(res.statusCode),
    });
  });
  next();
});

// 4. Alerts to configure:
// - Error rate > 1% for 5 minutes → P1
// - p99 latency > 2s for 10 minutes → P2
// - Heap usage > 80% → P2
// - Event loop lag > 200ms for 2 minutes → P1
// - DB connection pool exhausted → P1
// - Failed health checks on > 1 instance → P1

"What Is Your Strategy for Database Query Optimization?"

sql-- 1. EXPLAIN ANALYZE is your best friend:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.id, u.email, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > NOW() - INTERVAL '30 days'
GROUP BY u.id;

-- Look for:
-- Seq Scan (bad on large tables) → missing index
-- Nested Loop on large sets → missing index or bad join
-- "rows=10000 actual rows=1" → stale statistics (ANALYZE table)
-- Hash Batches > 1 → work_mem too low

-- 2. Index the right columns:
-- Index WHERE clauses, JOIN keys, ORDER BY columns
CREATE INDEX CONCURRENTLY idx_users_created_at ON users (created_at DESC);
CREATE INDEX CONCURRENTLY idx_orders_user_id   ON orders (user_id);
-- CONCURRENTLY = no table lock during creation

-- 3. Partial indexes (index only rows you query):
-- Most orders queries only touch 'pending' status
CREATE INDEX idx_orders_pending ON orders (created_at DESC)
WHERE status = 'pending';
-- Tiny index, very fast for that use case

-- 4. Covering index (index stores all needed columns — no heap fetch):
CREATE INDEX idx_users_email_covering ON users (email)
INCLUDE (id, name, status);
-- Query: SELECT id, name, status FROM users WHERE email = ?
-- Never touches the table — pure index scan

javascript// Connection pool tuning (Node.js with pg):
const pool = new Pool({
  max: Math.min(10, os.cpus().length * 2), // never more than DB allows
  min: 2,
  idleTimeoutMillis: 30_000,
  connectionTimeoutMillis: 3_000,    // fail fast, don't queue forever
  statement_timeout: 10_000,         // kill queries > 10s (prevent runaway)
  query_timeout: 10_000,
});

// Always set statement_timeout to prevent runaway queries
// taking down your entire DB under load

"How Do You Prevent and Handle Cascading Failures?"

A cascading failure: Service A slow → A's thread pool exhausted →
A returns 500s → B retries A aggressively → A gets more traffic →
A is completely overwhelmed → B also goes down → entire system fails.

Prevention toolkit:
1. Timeouts     — never wait forever on a dependency
2. Circuit breaker — stop calling a failing service
3. Bulkhead      — isolate resources per dependency
4. Backpressure  — signal upstream to slow down
5. Retry jitter  — prevent thundering herd on recovery

typescript// Production-grade circuit breaker:
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  private state: State = 'CLOSED';
  private failureCount = 0;
  private lastFailureTime = 0;

  constructor(
    private readonly threshold = 5,       // failures to open
    private readonly resetTimeout = 30_000, // ms before trying again
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime > this.resetTimeout) {
        this.state = 'HALF_OPEN'; // probe one request
      } else {
        throw new Error('Circuit is OPEN — dependency unavailable');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  private onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.threshold || this.state === 'HALF_OPEN') {
      this.state = 'OPEN';
      console.warn(`Circuit OPEN after ${this.failureCount} failures`);
    }
  }
}

// Per-dependency breakers (bulkhead pattern):
const paymentBreaker = new CircuitBreaker(5, 30_000);
const emailBreaker   = new CircuitBreaker(3, 60_000);

async function checkout(order: Order) {
  // Payment is critical — let it throw if circuit is open
  const payment = await paymentBreaker.execute(() => paymentService.charge(order));

  // Email is non-critical — graceful degradation
  try {
    await emailBreaker.execute(() => emailService.sendConfirmation(order));
  } catch (err) {
    logger.warn('Email service unavailable — skipping confirmation', { orderId: order.id });
    // Queue for retry later
    await queue.add('send-email', { orderId: order.id }, { delay: 60_000 });
  }

  return payment;
}

// Retry with jitter to avoid thundering herd:
async function retryWithJitter<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      const base  = Math.min(100 * 2 ** attempt, 10_000); // exponential: 200, 400, 800...
      const jitter = Math.random() * base;                 // add 0-100% randomness
      await new Promise(r => setTimeout(r, base + jitter));
    }
  }
  throw new Error('unreachable');
}

Quick-Fire: Operational Q&A

Question	Answer
p99 vs average latency?	p99 = 99th percentile — 1% of requests are slower. Average hides the long tail. Always track p99.
When to use blue-green vs canary?	Blue-green for high-risk deploys needing instant rollback. Canary when you want gradual validation.
What is a thundering herd?	Many clients retry at the same time → overload recovering service. Prevent with jitter.
How many DB connections per Node process?	`min(10, cpu_count × 2)`. Never let pool exhaust. Set `connectionTimeoutMillis`.
What is backpressure?	Signal to slow down: readable stream pausing writeable, queue depth triggering 429s.
Rolling restart without downtime?	Wait for health check to pass before routing traffic to new instance. `minReadySeconds` in k8s.
How do you test resilience?	Chaos engineering: kill random instances, inject latency, block network. Use `tc netem` or Chaos Monkey.
What's the difference between RTO and RPO?	RTO = max downtime (recovery time). RPO = max data loss (recovery point).
How do you handle a bad database migration?	Run both old+new column, deploy dual-write code, verify, then drop old column in a later migration.
What makes an API idempotent?	Same request produces same result if called N times. Use idempotency keys for POST operations.