
Consensus: How Distributed Systems Agree

Three servers. One number. They all need to agree it's 42.

Server A says "42." Server B says "I heard 42." Server C never responds.

Is it 42? Do we wait for C? What if C comes back later and says "99"?

This is the consensus problem. It sounds trivial (just pick a number and tell everyone), but networks are unreliable, servers crash, and sometimes a server might even lie. Getting machines to agree on anything turns out to be one of the hardest problems in computer science.

We'll figure out how to make servers agree. We'll start by breaking things, then fix them step by step until we arrive at the algorithms that power every modern distributed database.

Why This Is Hard

Your servers talk by sending messages over a network. Networks are not your friend.

[Interactive demo: Network Chaos. Toggle failures to see how messages behave.]

Enable "Drop Messages" and click "Broadcast." Some messages vanish mid-flight. Enable "Delay Messages" and watch them crawl. That's what networks do.

Server A sends "the value is 42" to Server B. Server B never receives it. Server A thinks everyone knows. Server B has no idea. They've already diverged.

Now add delays. Server A sends "42" to both B and C. The message reaches B instantly but takes 3 seconds to reach C. In those 3 seconds, B and C have different views of the world.

Why not just retry until messages arrive?

You can. But how do you know a message was lost versus just delayed? If you wait too long, your system is slow. If you retry too quickly, you create duplicates. And if a server crashes mid-retry, you're back to square one.

In the asynchronous network model, you cannot tell the difference between a slow message and a lost one, between a crashed server and a busy one.
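To make that concrete, here's a minimal Go sketch (names like sendValue and deliverOverNetwork are hypothetical, not from any real library): the caller sets a timeout, and when it fires, a lost message, a slow message, and a crashed peer all look exactly the same.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sendValue stands in for any RPC that delivers a value to a peer.
// The network may drop, delay, or duplicate the underlying message.
func sendValue(ctx context.Context, peer string, value int) error {
	reply := make(chan error, 1)
	go func() { reply <- deliverOverNetwork(peer, value) }() // may never return
	select {
	case err := <-reply:
		return err
	case <-ctx.Done():
		// Timeout: was the message lost, merely slow, or did the peer crash?
		// From here, all three cases are indistinguishable.
		return ctx.Err()
	}
}

func deliverOverNetwork(peer string, value int) error {
	time.Sleep(3 * time.Second) // simulate a slow (or effectively lost) message
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := sendValue(ctx, "server-b", 42); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("no reply within 1s: retry (risking duplicates) or give up?")
	}
}
```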

Two Kinds of Failure

Not all failures are equal. The difference matters enormously.

[Interactive demo: Crash vs Byzantine Failures. Click node A to change its failure mode.]

Click node A to cycle through modes: Normal → Crashed → Byzantine. In Byzantine mode, hit "A Broadcasts."

See what happens?

Crash failures are predictable. A crashed server is silent. It doesn't respond, doesn't send anything. You can detect the absence of heartbeats and route around it.

Byzantine failures are chaos. A compromised server might tell Server A "the value is 42" while simultaneously telling Server B "the value is 99". From the outside, both A and B think they have the correct value.

Most consensus algorithms assume only crash failures. They're simpler and sufficient for most infrastructure. Your datacenter servers aren't actively trying to deceive each other. Byzantine fault tolerance is for adversarial environments like blockchains.
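Here's roughly what a crash-failure detector looks like, a hedged sketch rather than production code: it just watches for silence, which is exactly why it can't catch a Byzantine node that keeps chattering while lying.

```go
package main

import (
	"fmt"
	"time"
)

// lastHeartbeat records when each peer was last heard from.
// A crash-failure detector looks only for silence; a Byzantine node
// defeats it by continuing to send (misleading) heartbeats.
var lastHeartbeat = map[string]time.Time{
	"A": time.Now(),
	"B": time.Now().Add(-5 * time.Second), // B has gone quiet
}

func suspectedCrashed(peer string, timeout time.Duration) bool {
	return time.Since(lastHeartbeat[peer]) > timeout
}

func main() {
	for _, peer := range []string{"A", "B"} {
		if suspectedCrashed(peer, 2*time.Second) {
			fmt.Println(peer, "missed its heartbeats: suspect a crash and route around it")
		}
	}
}
```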

The Byzantine Generals

Several generals surround an enemy city. They need to agree on a plan: attack or retreat. They can only communicate by messenger. One of them might be a traitor.

How do you reach agreement when someone might be lying?

[Interactive demo: Byzantine Generals. Select a traitor and step through the protocol.]

Set "Generals" to 3 and make one of them a traitor. Step through the protocol.

No consensus. The traitor sends different messages to different generals, and there's no way to tell who's lying.

Now try 4 generals with the same traitor. Step through again.

With 4 generals, the loyal ones can exchange messages and use majority voting. The traitor's lies get outvoted.

The pattern:

  • 3 generals, 1 traitor: fails
  • 4 generals, 1 traitor: works
  • 5 generals, 2 traitors: fails
  • 7 generals, 2 traitors: works

To tolerate f Byzantine failures, you need at least 3f + 1 nodes. For 1 traitor, you need 4. For 2 traitors, you need 7. This is a fundamental limit. No algorithm can do better.

For crash failures (the common case), the bound is better: you need 2f + 1 nodes. For 1 crash, you need 3. For 2 crashes, you need 5.
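Those bounds are easy to turn into code. A tiny sketch of the minimum cluster sizes:

```go
package main

import "fmt"

// Minimum cluster sizes implied by the classic bounds:
// crash faults need 2f+1 nodes, Byzantine faults need 3f+1.
func minNodesCrash(f int) int     { return 2*f + 1 }
func minNodesByzantine(f int) int { return 3*f + 1 }

func main() {
	for f := 1; f <= 2; f++ {
		fmt.Printf("tolerate %d crash(es): %d nodes; %d traitor(s): %d nodes\n",
			f, minNodesCrash(f), f, minNodesByzantine(f))
	}
}
```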

Quorums

Consensus protocols make decisions through quorums. A quorum is a group of nodes large enough that its decision sticks: for any decision to be valid, a quorum must agree to it.

[Interactive demo: Quorum Overlap. Click nodes to toggle quorum membership.]

Try to select two separate groups of 3 nodes (from a 5-node cluster) that don't overlap.

You can't. Any two groups of 3+ nodes must share at least one member.

That shared member is the bridge. It knows both decisions and can prevent conflicts. Consensus protocols require majority quorums because any two majorities must intersect.
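The overlap guarantee is just arithmetic. A small sketch, assuming simple majority quorums:

```go
package main

import "fmt"

// majority returns the quorum size for an n-node cluster.
func majority(n int) int { return n/2 + 1 }

func main() {
	n := 5
	q := majority(n) // 3
	// Two quorums of size q drawn from n nodes must share at least
	// 2q - n members (pigeonhole): with n=5 and q=3, that's at least 1.
	fmt.Printf("cluster=%d, quorum=%d, guaranteed overlap=%d\n", n, q, 2*q-n)
}
```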

First Attempt: Two-Phase Commit

Two-Phase Commit (2PC) is how databases coordinate transactions. One node is the coordinator; the others are participants.

Phase 1 (Prepare): The coordinator asks everyone "Are you ready to commit?"

Phase 2 (Commit): If all say yes, the coordinator tells everyone "Commit!" If anyone says no, everyone aborts.
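A stripped-down sketch of the coordinator's logic (the participant names and the vote stand-in are made up; real 2PC runs over the network and logs its decision durably):

```go
package main

import "fmt"

// vote stands in for asking one participant to prepare.
// A real system would send a PREPARE message and wait for the reply.
func vote(participant string) bool {
	return participant != "P3-out-of-disk" // hypothetical failure case
}

// twoPhaseCommit runs the coordinator side of 2PC.
func twoPhaseCommit(participants []string) string {
	// Phase 1 (Prepare): everyone must say yes.
	for _, p := range participants {
		if !vote(p) {
			return "ABORT" // Phase 2: tell everyone to abort
		}
	}
	// Danger zone: if the coordinator crashes here, participants that
	// voted yes are stuck holding locks, unable to commit or abort.
	return "COMMIT" // Phase 2: tell everyone to commit
}

func main() {
	fmt.Println(twoPhaseCommit([]string{"P1", "P2", "P3"}))
}
```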

[Interactive demo: Two-Phase Commit. Crash the coordinator at different phases.]

Step through the happy path first. Everything commits cleanly.

Now step through until the participants are in the "prepared" state, then crash the coordinator.

The participants said "yes, I'm ready" and are now holding locks, waiting for the final decision. They can't commit because they don't know if everyone else agreed. They can't abort because maybe the coordinator will come back and say "commit." They're stuck.

The fundamental flaw of 2PC: it's blocking. If the coordinator fails at the wrong moment, participants wait forever.

So 2PC is useless?

Not at all. 2PC is used heavily in databases and message queues. But it's for environments where coordinator failures are rare and recovery is fast. For truly fault-tolerant consensus, we need something stronger.

Paxos

In 1989, Leslie Lamport showed that consensus is achievable despite crash failures, as long as a majority of nodes stays up. His algorithm, Paxos, is the foundation of fault-tolerant distributed systems.

Every proposal gets a unique number. Higher numbers win.

Paxos has two phases:

Phase 1 (Prepare): A proposer picks a number (n) and asks acceptors to promise not to accept any proposal with a number less than n.

Phase 2 (Accept): If a majority of acceptors promise, the proposer sends its value. Each acceptor accepts unless it has since promised a higher-numbered proposal.
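Here's a minimal sketch of the acceptor side of those two phases. The struct and method names are mine, not from any particular Paxos implementation:

```go
package main

import "fmt"

// Acceptor holds the two pieces of durable state Paxos requires.
type Acceptor struct {
	promised    int    // highest proposal number promised (0 = none)
	acceptedN   int    // number of the proposal accepted, if any
	acceptedVal string // value of the proposal accepted, if any
}

// Prepare handles Phase 1: promise not to accept anything numbered below n,
// and report any value already accepted so the proposer can adopt it.
func (a *Acceptor) Prepare(n int) (ok bool, prevN int, prevVal string) {
	if n <= a.promised {
		return false, 0, ""
	}
	a.promised = n
	return true, a.acceptedN, a.acceptedVal
}

// Accept handles Phase 2: accept unless a higher-numbered promise was made.
func (a *Acceptor) Accept(n int, val string) bool {
	if n < a.promised {
		return false
	}
	a.promised, a.acceptedN, a.acceptedVal = n, n, val
	return true
}

func main() {
	a := &Acceptor{}
	ok, _, _ := a.Prepare(1)
	fmt.Println("promise(1):", ok, "accept(1, X):", a.Accept(1, "X"))
}
```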

[Interactive demo: Paxos: Single Round. Step through Prepare → Promise → Accept → Accepted.]

Step through all four phases. Watch the acceptors update their "promised" and "accepted" values as messages flow.

Notice how the proposer doesn't send its value until it gets promises from a majority. This is the quorum in action.

When Proposers Collide

What happens when two proposers try different values at the same time?

[Interactive demo: Paxos: Competing Proposers. Watch how a higher proposal number wins.]

Step through and watch Proposer A get rejected when B swoops in with a higher proposal number.

Proposer A sends Prepare(1) and gets promises. Meanwhile, Proposer B sends Prepare(2). B's higher number invalidates A's promises. When A tries Accept(1, "X"), the acceptors reject it because they have already promised not to accept anything numbered below 2.

But wait, what value does B propose?

When acceptors promise to B, they tell B about any value they've already accepted. B must adopt the value from the highest-numbered accepted proposal it hears about, not its own. This is how Paxos stays consistent even under contention.
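A small sketch of that proposer-side rule (the Promise struct is hypothetical): adopt the highest-numbered value any acceptor reports, and only fall back to your own value if nobody has accepted anything.

```go
package main

import "fmt"

// Promise is what an acceptor returns from Phase 1: any proposal it has
// already accepted, or AcceptedN == 0 if it has accepted nothing.
type Promise struct {
	AcceptedN   int
	AcceptedVal string
}

// chooseValue implements the rule that keeps Paxos consistent: if any
// acceptor in the quorum already accepted a value, the proposer must adopt
// the one with the highest proposal number; only otherwise may it use its own.
func chooseValue(promises []Promise, own string) string {
	best := Promise{}
	for _, p := range promises {
		if p.AcceptedN > best.AcceptedN {
			best = p
		}
	}
	if best.AcceptedN == 0 {
		return own
	}
	return best.AcceptedVal
}

func main() {
	promises := []Promise{{}, {AcceptedN: 1, AcceptedVal: "X"}, {}}
	fmt.Println(chooseValue(promises, "Y")) // prints "X": B must carry A's value
}
```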

The downside: two proposers can duel forever, each invalidating the other. In practice, systems use random backoff and leader election to prevent livelocks.

Raft

Paxos works, but it has a reputation for being impossible to understand. In 2014, Diego Ongaro and John Ousterhout published Raft with an explicit goal: be as understandable as possible.

Raft makes three simplifying choices:

  1. Strong leader: All requests go through the leader
  2. Election first: Before agreeing on values, elect a leader
  3. Log-based: Agree on a log of commands, not individual values

Leader Election

Every node is a Follower, Candidate, or Leader. Time is divided into terms; each term has at most one leader.

[Interactive demo: Raft Leader Election. Click a node to trigger an election timeout.]

Click any follower to trigger an election timeout. It becomes a candidate, collects votes, and if it gets a majority, becomes leader.

Try crashing the leader. Watch a new election start automatically.

The randomized timeout is key. Without it, nodes would time out simultaneously and split the vote every time. Random timeouts mean someone usually wins quickly.
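A sketch of the timeout itself, with illustrative durations in the range the Raft paper uses as an example:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout returns a random timeout (150-300ms here). Because each
// follower draws a different value, one usually times out first, collects
// the votes, and the others hear from the new leader before their own
// timers fire.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

func main() {
	for _, node := range []string{"n1", "n2", "n3"} {
		fmt.Println(node, "will start an election after", electionTimeout())
	}
}
```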

Log Replication

Once elected, the leader handles all client requests. Each request becomes an entry in the log. The leader replicates entries to followers.

[Interactive demo: Raft Log Replication. Send commands and watch them replicate. Click a follower to partition/unpartition it.]

Send a few commands, then click "Replicate." Watch entries turn green when committed.

Now try partitioning a follower first, click it to disconnect. Send more commands and replicate. The partitioned follower falls behind.

Reconnect it and replicate again. The follower catches up.

The leader tracks each follower's progress. If someone is behind, the leader sends the missing entries. Followers always converge to match the leader.

Followers don't need complex logic. They accept whatever the current leader sends and overwrite anything in their own log that conflicts with it.
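Here's a heavily simplified sketch of that catch-up logic, ignoring terms and consistency checks: the leader remembers how far each follower has gotten and resends whatever is missing.

```go
package main

import "fmt"

// A drastically simplified leader: a log of commands plus, for each
// follower, the index of the next entry that follower still needs.
type Leader struct {
	log       []string
	nextIndex map[string]int
}

// Replicate sends each follower everything from its nextIndex onward,
// which is how a partitioned follower catches up after it reconnects.
func (l *Leader) Replicate(followerLogs map[string][]string) {
	for f := range l.nextIndex {
		missing := l.log[l.nextIndex[f]:]
		followerLogs[f] = append(followerLogs[f], missing...)
		l.nextIndex[f] = len(l.log)
	}
}

func main() {
	leader := &Leader{
		log:       []string{"SET x=1", "SET y=2", "SET z=3"},
		nextIndex: map[string]int{"f1": 3, "f2": 1}, // f2 was partitioned after entry 1
	}
	followers := map[string][]string{
		"f1": append([]string(nil), leader.log[:3]...),
		"f2": append([]string(nil), leader.log[:1]...),
	}
	leader.Replicate(followers)
	fmt.Println(followers["f2"]) // f2 converges to match the leader
}
```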

Choosing an Algorithm

Algorithm | Tolerates | Blocking? | Complexity | Used In
2PC | Crashes | Yes | Simple | Databases
Paxos | Crashes | No | Complex | Chubby, Spanner
Raft | Crashes | No | Moderate | etcd, Consul, CockroachDB
PBFT | Byzantine | No | Very Complex | Blockchains

2PC is fine when coordinator failures are rare. It's simpler and faster than full consensus.

Paxos is the theoretical foundation. Correct and complete, but notoriously hard to implement.

Raft is the practical choice for most new systems. It's designed to be understandable and has been implemented correctly many times.

PBFT handles Byzantine failures. Slower and more complex, but necessary when you don't trust all participants.

The FLP Impossibility

One last thing. In 1985, Fischer, Lynch, and Paterson proved something troubling:

In an asynchronous system, no consensus algorithm can guarantee termination if even one node might crash.

This doesn't mean consensus is impossible. It means you can't have all three of:

  • Safety: Never return conflicting values
  • Liveness: Eventually return a value
  • Fault tolerance: Work despite crashes

Real systems work around this by making timing assumptions. Raft assumes messages typically arrive within some timeout. The system is "mostly live" in practice, even if theoretical edge cases exist.

Summary

Consensus is hard because networks are unreliable and failures happen.

  • The Byzantine Generals Problem sets the limits: 3f + 1 nodes for Byzantine failures, 2f + 1 for crashes
  • Quorums work because any two majorities must overlap
  • 2PC is simple but blocking. Don't use it where failures are common
  • Paxos proved consensus is possible using proposal numbers and quorums
  • Raft simplifies Paxos with strong leadership and log replication

When your database claims to use "consensus," it probably means Raft. Now you know what that means: leader election, log replication, and quorum commits.

Three servers. One number. They agree it's 42. All of them. Forever.

Or at least until someone unplugs something.