Distributed Systems — Deep Dive
CAP Theorem
A distributed system can guarantee at most 2 of 3:
Consistency
△
/ \
/ \
/ \
/ CA \
/─────────\
/ | \
/ CP | AP \
▽───────┼───────▽
Availability Partition
Tolerance- Consistency (C): Every read receives the most recent write (or an error)
- Availability (A): Every request receives a response (not necessarily most recent data)
- Partition Tolerance (P): System continues operating when network partitions occur
The key insight: Network partitions are unavoidable in distributed systems. So the real trade-off is C vs A during a partition.
Real-world Database Classifications
| System | Type | Trade-off |
|---|---|---|
| PostgreSQL (single node) | CA | No partition tolerance |
| Cassandra | AP | Eventually consistent, always available |
| DynamoDB | AP | Eventually consistent (strong consistency optional) |
| MongoDB | CP | Consistent, may reject writes during partition |
| HBase | CP | Consistent, may be unavailable during partition |
| CockroachDB | CP | Consistent (uses Raft) |
| Redis Cluster | AP (default) | Eventually consistent during partition |
Example: AP vs CP during network partition
Node A ←──── partition ────→ Node B
AP (Cassandra):
Write to A succeeds → A and B diverge temporarily
Reads from B return stale data
Partition heals → eventual consistency (CRDT/LWW)
CP (MongoDB):
Write requires majority quorum
If B is majority, writes to A fail (error returned)
System unavailable on A's side, but data consistentPACELC Theorem
Extends CAP: even when no partition, there's a latency vs consistency trade-off.
If Partition: C vs A
Else (normal): L vs C| System | P | EL | EC |
|---|---|---|---|
| DynamoDB | AP | EL | EC |
| Cassandra | AP | EL | EC |
| CockroachDB | CP | EL | EC |
| MongoDB | CP | EL | EC |
| BigTable | CP | EL | EC |
Consistency Models
From strongest to weakest:
Linearizability (Strict)
└─ Operations appear instantaneous, globally ordered
└─ Reads always see the latest write
└─ Cost: high latency (needs coordination)
Sequential Consistency
└─ Operations appear in some sequential order
└─ Each process sees operations in its own order
└─ Different processes may see different orderings
Causal Consistency
└─ Causally related operations seen in order
└─ Concurrent (unrelated) operations may be seen differently
Read-your-writes Consistency
└─ After a write, same client always reads that write
└─ Others may still see stale data
Eventual Consistency
└─ All replicas converge if no new writes
└─ No guarantee on when or ordering
└─ Cost: lowest latencyLinearizability in practice
Timeline:
Client A: ──write(x=1)──────────────────
Client B: ────────────read(x)───────────
Client C: ──────────────────────read(x)─
Linearizable: B might read 0 or 1 (depends on timing)
C MUST read 1 (write completed before C's read)
Non-linearizable: C reads 0 (stale) — violates guaranteeConsensus Algorithms
Why Consensus is Hard
In an asynchronous distributed system, even with 1 faulty node, it's impossible to guarantee consensus (FLP impossibility theorem, 1985).
In practice: algorithms use timeouts/leases to work around this.
Paxos
The original consensus algorithm. Foundation for many systems.
Roles:
- Proposers — propose values
- Acceptors — vote on proposals
- Learners — learn the decided value
Two phases:
Phase 1 — Prepare:
Proposer → Acceptors: "Prepare(n)" (n = proposal number)
Acceptors → Proposer: "Promise(n)" + highest accepted value (if any)
Phase 2 — Accept:
Proposer → Acceptors: "Accept(n, v)" (v = chosen value)
Acceptors → Proposer: "Accepted(n, v)"
Learner learns value once majority has accepted.Problems with Paxos:
- Underspecified (Lamport's paper was confusing)
- Multi-Paxos (for log replication) requires many extensions
- Leader election not defined
Raft — Understandable Consensus
Designed to be more understandable than Paxos. Used by etcd, CockroachDB, TiKV.
Key Concepts:
- Leader election — one leader at a time
- Log replication — leader receives writes, replicates to followers
- Safety — never commits a different value at same log index
Leader Election
States:
Follower → Candidate → Leader
↑ |
└────────────────────────┘ (heartbeat / step down)
Term: monotonically increasing epoch number
(like a logical clock for leadership)
Timeout:
Follower: random 150-300ms election timeout
If no heartbeat → become Candidate
Candidate: vote for self, request votes from all
If majority votes → become Leader
If another leader → revert to FollowerLog Replication
Client ──write("x=1")──→ Leader
│
┌────────────┼────────────┐
↓ ↓ ↓
Follower1 Follower2 Follower3
│ │ │
└─────ACK────┴────ACK─────┘
│
majority ACKs
│
Leader commits
Leader replies to client
Leader sends commit to followersLog Entry:
Index: 1 Term: 1 Command: x=1
Index: 2 Term: 1 Command: y=2
Index: 3 Term: 2 Command: x=3 ← needs majority before commitSafety Properties
- Leader has all committed entries (never loses them)
- Election: candidate must have up-to-date log to win
- Only one leader per term
Clock Synchronization
The Problem
Clocks in distributed systems drift. NTP can sync to ~1-10ms accuracy but can go backward, jump, or be unreliable.
Why it matters:
- Event ordering across nodes
- Distributed tracing
- Conflict resolution in CRDTs
- Distributed transactions
Lamport Timestamps
Logical clock that captures causal ordering.
Rules:
- Before any event in process: increment clock
- Before sending message: increment clock, attach timestamp
- On receiving message:
clock = max(local, received) + 1
Process A: 1────────2────────────5────────
\ ↑
send(ts=2) receive
Process B: 1────────────3────────────
(max(1,2)+1 = 3)Limitation: Lamport timestamps can tell you A happened before B, but NOT that A and B are concurrent.
Vector Clocks
One counter per process. Captures both causality AND concurrency.
3 processes: A, B, C
Vector clock: [A, B, C]
Process A:
event1 → [1,0,0]
send to B → [2,0,0] (attached to message)
Process B:
event1 → [0,1,0]
receive from A → [2,2,0] (max each component + 1 for receive)
send to C → [2,3,0]
Process C:
event1 → [0,0,1]
receive from B → [2,3,2]Comparing vector clocks:
VC(a) < VC(b) = a happened-before b
↔ all components of VC(a) ≤ VC(b)
AND at least one component strictly <
VC(a) || VC(b) = concurrent
↔ neither VC(a) < VC(b) nor VC(b) < VC(a)
Example:
[1,2,0] vs [1,1,2]:
[1,2,0] < [1,1,2]? No (2 > 1 in position B)
[1,1,2] < [1,2,0]? No (2 > 0 in position C)
→ Concurrent!Used in: DynamoDB, Riak, CRDTs for conflict detection.
Hybrid Logical Clocks (HLC)
Combines physical clock with logical clock. Used in CockroachDB.
HLC = (physical_time, logical_counter)
Properties:
- HLC ≥ physical_time (never goes backward)
- Preserves happened-before like Lamport
- Close to physical time (within NTP skew)Distributed Transactions
2-Phase Commit (2PC)
Coordinator Participant 1 Participant 2
│ │ │
│──── PREPARE ─────────► │
│──── PREPARE ──────────────────────────►│
│ │ │
│◄─── VOTE YES ───────── │
│◄─── VOTE YES ──────────────────────────│
│ │ │
│──── COMMIT ──────────► │
│──── COMMIT ───────────────────────────►│
│ │ │
│◄─── ACK ────────────── │
│◄─── ACK ───────────────────────────────│Problems with 2PC:
- Coordinator is single point of failure
- Participants block while waiting for Phase 2
- If coordinator crashes after PREPARE but before COMMIT → participants stuck in "in-doubt" state
Saga Pattern (for microservices)
Long-running transactions across services via compensating transactions.
Order Service Payment Service Inventory Service
│ │ │
Create Order Charge Card Reserve Item
│ │ │
│────────────────────────────────────────►
│ │ │
│◄─── Failure (item out of stock) ────────
│ │ │
Cancel Order Refund Card -
│ │ │
(compensate) (compensate)Types:
- Choreography: Services publish events, others react
- Orchestration: Central coordinator calls services in sequence
Distributed Locking
Redis Redlock
js// Simple Redis lock (single instance — not distributed)
async function withLock(redisClient, key, ttlMs, fn) {
const lockValue = crypto.randomUUID();
const acquired = await redisClient.set(
`lock:${key}`, lockValue,
'PX', ttlMs, // expire after ttlMs
'NX' // only set if not exists
);
if (!acquired) throw new Error('Could not acquire lock');
try {
return await fn();
} finally {
// Release: only if we still own it (Lua script for atomicity)
await redisClient.eval(`
if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("del", KEYS[1])
else
return 0
end
`, 1, `lock:${key}`, lockValue);
}
}Redlock (N Redis nodes):
- Get current timestamp
- Try to acquire lock on N/2+1 nodes (majority)
- Lock is valid only if acquired majority within TTL/2
- Release on all nodes when done
Controversy (Martin Kleppmann): Redlock has edge cases with clock skew. For true safety, use fencing tokens (monotonically increasing number) passed with the lock to the resource.
Gossip Protocol
Used for cluster membership, failure detection (Cassandra, DynamoDB).
Node A ──gossip──► Node B (A's view)
Node B ──gossip──► Node C (merged view)
Node C ──gossip──► Node D
...
Every node knows about every other within O(log N) roundsProperties:
- Epidemic/viral spread
- Eventually consistent
- Fault-tolerant (no single point of failure)
- O(log N) convergence
Interview Questions
Q: Explain CAP theorem. Which do you pick in practice? P (partition tolerance) is non-negotiable in distributed systems — networks fail. So the real choice is C vs A during a partition. For financial systems: choose CP (can't show stale balances). For shopping carts: choose AP (better to show stale cart than error). Most systems are tunable (DynamoDB eventual vs strong consistency).
Q: What's the difference between Paxos and Raft? Both solve distributed consensus (replicated state machine). Raft was designed to be more understandable: clear leader election with randomized timeouts, explicit log replication. Paxos is more foundational but harder to implement correctly (Multi-Paxos needed for production). etcd, CockroachDB use Raft.
Q: What are vector clocks used for? Vector clocks track causal relationships between events. They can determine if event A happened before B, or if they're concurrent. Used in conflict detection for multi-master databases (DynamoDB, Riak). If two writes are concurrent (no causal relationship), the database can apply a conflict resolution strategy (LWW, merge, ask user).
Q: Why is 2PC a problem? Blocking protocol — participants lock resources until coordinator commits/aborts. If coordinator crashes after PREPARE, participants are stuck in-doubt until coordinator recovers. This creates availability problems. Alternatives: Saga pattern for eventual consistency across services, or use a database with distributed transactions (CockroachDB, Spanner).
Q: What is a fencing token and why is it needed for distributed locks? A fencing token is a monotonically increasing number issued with each lock grant. The protected resource rejects requests with a lower token number than the highest seen. This prevents a client that held a lock but was paused (GC pause, network delay) from making writes after the lock expired and another client acquired it with a higher token.