Database Migrations — Zero-Downtime Strategies
Why Migrations Are Risky
In production, you can't stop the app during a migration. At any point during a deploy, two versions of your app run simultaneously — old (v1) and new (v2). Your schema must be compatible with both.
Deploy Timeline:
t=0: 100% v1 → all use old schema
t=1: 80% v1, 20% v2 → schema must work for BOTH
t=2: 50% v1, 50% v2 → ...
t=3: 0% v1, 100% v2 → only new schema needed
Problem: If v2 needs column `user_type` but v1 writes rows without it,
and it's NOT NULL with no default → v1 writes FAIL during deploy!The Expand-Contract Pattern
The golden rule for zero-downtime migrations. Every schema change is split into multiple deploys.
Phase 1: EXPAND — add new schema (backward compatible)
Phase 2: MIGRATE — backfill data, update app code
Phase 3: CONTRACT — remove old schema (now safe)Common Migration Scenarios
Adding a Column
❌ WRONG: Add NOT NULL column without default in one deploy
→ Old app writes rows without the column → DB rejects them
✅ RIGHT: Expand-contract
Deploy 1 (EXPAND): Add column as NULLABLE
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
Deploy 2 (APP): Both old and new code work.
New code writes phone when available.
Old code ignores the column.
Deploy 3 (BACKFILL): Fill missing values
UPDATE users SET phone = '' WHERE phone IS NULL;
Deploy 4 (CONTRACT): Add NOT NULL constraint
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;
-- or: ALTER TABLE users ALTER COLUMN phone SET DEFAULT '';sql-- PostgreSQL: add column with default is instant in PG 11+ (no table rewrite)
-- Safe to do in one deploy if you have a sensible default:
ALTER TABLE users ADD COLUMN preferences JSONB DEFAULT '{}' NOT NULL;Renaming a Column
❌ WRONG: ALTER TABLE users RENAME COLUMN name TO full_name;
→ Old app still writes to `name` → column not found → ERROR
✅ RIGHT: Three-phase
Deploy 1 (EXPAND): Add new column, copy data
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
UPDATE users SET full_name = name; -- or trigger
Deploy 1b: Add trigger to keep both in sync during transition
CREATE OR REPLACE FUNCTION sync_name() RETURNS trigger AS $$
BEGIN
IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
NEW.full_name := COALESCE(NEW.full_name, NEW.name);
NEW.name := COALESCE(NEW.name, NEW.full_name);
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER sync_name_trigger
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_name();
Deploy 2 (APP): Update all app code to use `full_name`
Deploy 3 (CONTRACT): Drop trigger, drop old column
DROP TRIGGER sync_name_trigger ON users;
ALTER TABLE users DROP COLUMN name;Changing a Column Type
❌ WRONG: ALTER TABLE orders ALTER COLUMN amount TYPE DECIMAL(10,2);
→ Locks the table for potentially minutes (rewrites all rows)
✅ RIGHT: New column approach
Deploy 1 (EXPAND):
ALTER TABLE orders ADD COLUMN amount_v2 DECIMAL(10,2);
-- Trigger to keep in sync
CREATE OR REPLACE FUNCTION sync_amount() RETURNS trigger AS $$
BEGIN
NEW.amount_v2 := NEW.amount::DECIMAL(10,2);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER sync_amount_trigger
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION sync_amount();
Deploy 2 (BACKFILL):
UPDATE orders SET amount_v2 = amount::DECIMAL(10,2)
WHERE amount_v2 IS NULL; -- batch this!
Deploy 3 (APP): Use amount_v2 everywhere
Deploy 4 (CONTRACT):
DROP TRIGGER sync_amount_trigger ON orders;
ALTER TABLE orders DROP COLUMN amount;
ALTER TABLE orders RENAME COLUMN amount_v2 TO amount;Dropping a Column
❌ WRONG: DROP COLUMN in same deploy as removing code
→ Old app still references column during rolling deploy → errors
✅ RIGHT:
Deploy 1 (APP): Remove all code references to the column
-- App ignores the column but doesn't error when it exists
Deploy 2 (CONTRACT): Drop the column
ALTER TABLE users DROP COLUMN legacy_field;Adding an Index
❌ WRONG: CREATE INDEX idx_users_email ON users(email);
→ Locks the table during index creation (minutes on large tables)
✅ RIGHT: CONCURRENTLY
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- No locks! Reads and writes continue.
-- Takes longer but safe for production.
-- Check progress:
SELECT phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;Adding a Foreign Key
sql-- NOT VALID: add constraint without scanning existing rows (instant)
ALTER TABLE orders
ADD CONSTRAINT fk_orders_user
FOREIGN KEY (user_id) REFERENCES users(id)
NOT VALID;
-- Validate in background (no lock on writes)
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user;
-- This takes a ShareUpdateExclusiveLock — allows reads & writesAdding a NOT NULL Constraint
sql-- Fast approach: check constraint first, then NOT NULL
-- Step 1: Add check constraint (NOT VALID = no scan)
ALTER TABLE orders
ADD CONSTRAINT orders_amount_not_null
CHECK (amount IS NOT NULL) NOT VALID;
-- Step 2: Validate (no write lock)
ALTER TABLE orders VALIDATE CONSTRAINT orders_amount_not_null;
-- Step 3: Add NOT NULL (cheap when check constraint exists)
ALTER TABLE orders ALTER COLUMN amount SET NOT NULL;
-- Step 4: Drop redundant check constraint
ALTER TABLE orders DROP CONSTRAINT orders_amount_not_null;Large Table Backfills
Updating millions of rows in one query = long lock, possible timeout, huge WAL write.
sql-- BAD: Single giant update
UPDATE users SET status = 'active' WHERE status IS NULL;
-- Locks all rows, generates huge undo log, can OOM
-- GOOD: Batched updates with delay
DO $$
DECLARE
batch_size INT := 1000;
last_id BIGINT := 0;
max_id BIGINT;
BEGIN
SELECT MAX(id) INTO max_id FROM users;
WHILE last_id < max_id LOOP
UPDATE users
SET status = 'active'
WHERE id > last_id
AND id <= last_id + batch_size
AND status IS NULL;
last_id := last_id + batch_size;
PERFORM pg_sleep(0.01); -- brief pause between batches
END LOOP;
END;
$$;js// Node.js batched backfill script
async function backfillUserStatus() {
let lastId = 0;
const batchSize = 1000;
while (true) {
const result = await db.query(`
UPDATE users
SET status = 'active'
WHERE id > $1 AND id <= $1 + $2 AND status IS NULL
RETURNING id
`, [lastId, batchSize]);
if (result.rows.length === 0) break;
lastId += batchSize;
console.log(`Backfilled up to id ${lastId}`);
await new Promise(r => setTimeout(r, 10)); // 10ms pause
}
}Migration Tools
Prisma Migrate
bash# Development
npx prisma migrate dev --name add_phone_to_users
# Creates migration file + applies it
# Production
npx prisma migrate deploy
# Applies pending migrations (non-interactive)
# Check status
npx prisma migrate statussql-- migrations/20240115_add_phone/migration.sql
-- Generated by Prisma, but edit if needed for safety:
ALTER TABLE "users" ADD COLUMN "phone" TEXT;
-- Note: Prisma doesn't add CONCURRENTLY — add manually for large tablesPrisma Limitations:
- Doesn't generate
CONCURRENTLYfor indexes — add manually - Can't handle complex data migrations — use custom scripts
migrate devwith shadow DB may not match prod behavior exactly
Custom migration with Prisma
typescript// prisma/migrations/20240115_backfill_phone/migration.ts
import { PrismaClient } from '@prisma/client';
const prisma = new PrismaClient();
export async function up() {
// Custom logic Prisma can't auto-generate
let cursor: string | undefined;
do {
const users = await prisma.user.findMany({
take: 1000,
skip: cursor ? 1 : 0,
cursor: cursor ? { id: cursor } : undefined,
where: { phone: null },
select: { id: true, legacyPhone: true },
});
if (users.length === 0) break;
await Promise.all(users.map(u =>
prisma.user.update({
where: { id: u.id },
data: { phone: formatPhone(u.legacyPhone) },
})
));
cursor = users[users.length - 1].id;
} while (true);
}Flyway / Liquibase (Java-world, used in some Node shops)
sql-- V1__create_users.sql
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL
);
-- V2__add_phone_nullable.sql
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- V3__backfill_phone.sql
UPDATE users SET phone = '' WHERE phone IS NULL;
-- V4__phone_not_null.sql
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;Locking Reference (PostgreSQL)
| Operation | Lock Type | Blocks |
|---|---|---|
SELECT |
AccessShare | Only AccessExclusive |
INSERT/UPDATE/DELETE |
RowExclusive | ShareLock, ExclusiveLock |
CREATE INDEX |
ShareLock | ALL writes |
CREATE INDEX CONCURRENTLY |
ShareUpdateExclusive | Only other schema changes |
ALTER TABLE ADD COLUMN (nullable, no default) |
AccessExclusive | Everything |
ALTER TABLE ADD COLUMN DEFAULT (PG 11+) |
AccessExclusive | Brief (no rewrite) |
ALTER TABLE DROP COLUMN |
AccessExclusive | Everything |
VALIDATE CONSTRAINT |
ShareUpdateExclusive | Only schema changes |
AccessExclusive = full table lock. Avoid in production for large tables.
Migration Checklist for Production
Before:
□ Test migration on staging with production-size data
□ Measure migration duration on staging
□ Have rollback plan ready
□ Schedule during low-traffic window if risky
SQL review:
□ No table locks on large tables (use CONCURRENTLY, NOT VALID)
□ Large backfills are batched, not single UPDATE
□ New NOT NULL columns have defaults or are added in phases
□ Indexes created CONCURRENTLY
Deployment:
□ Schema changes deployed BEFORE code that requires them
□ Old code still works with new schema (expand phase)
□ Monitor error rates during deploy
□ Monitor DB CPU/locks during migration
After:
□ Contract phase: drop old columns/indexes once all instances updatedInterview Questions
Q: How do you rename a column in PostgreSQL without downtime?
Three deploys: (1) EXPAND — add new column, backfill from old, add trigger to keep both in sync. (2) APP — update all code to read/write new column name. (3) CONTRACT — drop trigger and old column. Doing it in one step (RENAME COLUMN) requires no application change, but if any old app instances are running they'll fail.
Q: Why is CREATE INDEX CONCURRENTLY important?
Regular CREATE INDEX takes a ShareLock, blocking all writes for the duration (minutes on large tables). CONCURRENTLY uses ShareUpdateExclusiveLock, allowing reads and writes to continue. It takes longer (two passes over the table) but is safe for production.
Q: What is the expand-contract pattern? A three-phase approach to zero-downtime schema changes: (1) EXPAND — add new backward-compatible schema (new nullable columns, new tables). Both old and new app code work. (2) MIGRATE — update app code to use new schema, backfill data. (3) CONTRACT — remove old schema after all instances run new code. Prevents the two-version compatibility problem during rolling deploys.
Q: How do you add a NOT NULL column to a large table safely?
Phase 1: Add the column as nullable. Phase 2: Backfill in batches (not a single UPDATE). Phase 3: Add CHECK (col IS NOT NULL) NOT VALID constraint, then VALIDATE CONSTRAINT (no write lock). Phase 4: Add NOT NULL (cheap now that the check constraint proves all rows are non-null). Phase 5: Drop the redundant check constraint.