Leader election: how distributed systems pick a boss
You have five database servers (we call each one a node, just a computer in the network) that need to stay synchronized. When someone writes data to one server, all five need to agree on the change.
How do they coordinate? The simplest approach: every node talks to every other node directly.
Starting simple: everyone talks to everyone
Imagine you have 5 nodes in a cluster, a group of computers working together as one system. When node 1 wants to make a change, it sends a message to nodes 2, 3, 4, and 5. When node 2 wants to make a change, it sends messages to nodes 1, 3, 4, and 5. Every node communicates with every other node.
This is called peer-to-peer coordination. Each node is equal, and they all participate in every decision.
Here's the problem: count the messages. With 5 nodes, each node sends to 4 others. That's 5 × 4 = 20 messages per decision. Every time you want to agree on something, the cluster exchanges 20 messages.
What if you had 10 nodes? Each sends to 9 others. That's 10 × 9 = 90 messages. At 100 nodes, you need 100 × 99 = 9,900 messages for a single decision. The formula is n × (n - 1), which grows quickly. We write this as O(n²) in computer science notation, meaning the cost grows with the square of the number of nodes.
This feels wasteful. Do we really need every node talking to every other node?
What if one node coordinated?
Here's an idea: what if we picked one node to be the coordinator? When anyone wants to make a change, they tell the coordinator. The coordinator decides, then broadcasts the result (sends it to all nodes). That's 1 message in, plus n messages out, for a total of n + 1 messages per decision.
With 5 nodes, that's 6 messages instead of 20. With 100 nodes, that's 101 messages instead of 9,900.
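To make the gap concrete, here's a minimal sketch in Python (the article doesn't include any code, so the function names here are purely illustrative) that computes both counts for a given cluster size:

```python
def peer_to_peer_messages(n: int) -> int:
    """Every node sends to every other node: n * (n - 1), i.e. O(n^2)."""
    return n * (n - 1)

def leader_messages(n: int) -> int:
    """One request in to the leader, plus the result broadcast back out: n + 1, i.e. O(n)."""
    return n + 1

for n in (5, 10, 100):
    print(n, peer_to_peer_messages(n), leader_messages(n))
# 5 -> 20 vs 6, 10 -> 90 vs 11, 100 -> 9900 vs 101
```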
We call this special coordinator the leader. The leader collects requests, makes decisions, and tells everyone what happened.
Toggle between modes and watch the message count. Notice how the gap widens as you add more nodes.
The leader handles all coordination. The other nodes (we call them followers) just send their requests to the leader and wait for the result. Simple, fast, efficient.
But here's the catch: what if the leader crashes?
Running into the first wall: who becomes the new leader?
If the leader crashes, the remaining nodes need to pick a replacement. But how do they agree on who that should be?
We need a rule that everyone follows automatically, with no room for disagreement. The simplest rule: whoever has the highest ID wins. Every node has a unique number (1, 2, 3, 4, 5, 6). The node with the highest number that's still alive becomes the new leader.
This is called the bully algorithm, named after how the highest-ranked node "bullies" everyone else into accepting it.
Here's how it works. When a node notices the leader stopped responding (maybe it didn't receive the expected heartbeat message, a periodic "I'm alive" signal the leader sends), it starts an election:
- Send an ELECTION message to all nodes with higher IDs than yours
- Wait for ALIVE responses
- If no one responds, you're the highest. Declare yourself leader
- If someone responds, they'll handle the election. Step back and wait
Say node 3 notices the leader (node 6) crashed. Node 3 sends ELECTION messages to nodes 4, 5, and 6. Nodes 4 and 5 respond with ALIVE. Node 6 doesn't respond because it's dead. Now nodes 4 and 5 each start their own elections, sending ELECTION messages to the nodes above them. Eventually, node 5 gets no response from node 6, so node 5 declares itself leader.
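The rule is mechanical enough to sketch in a few lines of Python. This is a single-process simulation of the rule, not a networked implementation: real systems exchange ELECTION/ALIVE messages with timeouts, and the `alive` set here just stands in for "nodes that would respond":

```python
def bully_election(initiator: int, alive: set[int]) -> int:
    """Return the ID that ends up as leader when `initiator` starts an election."""
    higher = [node for node in alive if node > initiator]
    if not higher:
        return initiator      # no one outranks us: declare ourselves leader
    # Someone higher is alive and will take over the election; since the
    # highest alive ID always wins, the outcome is max(alive).
    return max(alive)

alive_nodes = {1, 2, 3, 4, 5}  # node 6 (the old leader) has crashed
print(bully_election(initiator=3, alive=alive_nodes))  # -> 5
```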
Try this: crash node 6, then start an election from node 3. Watch the messages flow. Then bring node 6 back online and see what happens (it immediately takes over).
The algorithm is simple and deterministic. Everyone agrees on the outcome because everyone follows the same rule: highest ID wins.
But look at the messages. In the worst case, if the lowest-ranked node initiates the election, every node sends ELECTION messages to every higher-ranked node. That's roughly n²/2 messages. We're back to O(n²) cost, just like peer-to-peer coordination.
Can we do better?
Reducing the message cost: passing a token around
The bully algorithm wastes messages because every node broadcasts to multiple others. What if we could use exactly n messages instead?
Here's an idea: arrange nodes in a ring, a logical ordering where each node knows its "next" neighbor. Node 1's successor is node 2, node 2's successor is node 3, and so on until the last node wraps back to node 1.
When a node detects the leader is gone, it starts a message traveling around the ring. The message contains a list of IDs. Each node adds its own ID to the list and passes the message to its successor. When the message makes it all the way around the ring and returns to the node that started the election, that node looks at the list and announces the highest ID as the winner.
Think of it like passing a sign-up sheet around a meeting table. Everyone writes their name, and when it returns to the person who started it, they read out who signed up.
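Here's a rough Python sketch of that pass, again as a single-process simulation rather than real networking. The ring is just a list of IDs in successor order, and `alive` stands in for the nodes that would actually forward the message:

```python
def ring_election(start: int, ring: list[int], alive: set[int]) -> int:
    """`ring` is the node order; the message starts at `start`, hops to each
    successor in turn, and collects IDs from alive nodes only."""
    ids = []
    i = ring.index(start)
    for step in range(len(ring)):           # one full trip around the ring
        node = ring[(i + step) % len(ring)]
        if node in alive:
            ids.append(node)                # each alive node adds its own ID
    return max(ids)                         # back at the starter: highest ID wins

print(ring_election(start=3, ring=[1, 2, 3, 4, 5, 6], alive={1, 2, 3, 4, 5}))  # -> 5
```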
Click nodes to crash them, then start an election. Watch how the message travels around the ring, skipping crashed nodes and collecting IDs.
The ring algorithm uses about n messages for the election pass (one hop per alive node), plus one more trip around to announce the winner, no matter who starts the election. That's O(n) instead of O(n²). Much better.
But notice the new constraint: we now depend on this ring structure. Every node needs to know its successor. If a node crashes, we need to detect that and skip to the next alive node. This adds complexity. The bully algorithm had no topology requirement at all; the ring algorithm trades message efficiency for structural dependency.
Still, both algorithms work. We can reliably elect a new leader. But we've been assuming something: that all nodes can communicate. What happens if they can't?
Hitting the real wall: network failures
So far we've assumed that if a node doesn't respond, it's dead. But what if the node is alive and the network cable between you and that node is unplugged?
This is called a network partition, where some nodes can talk to each other but not to others. The cluster splits into groups that can communicate internally but not with other groups.
Here's the disaster scenario: you have 6 nodes. The network splits into two groups of 3. Nodes 1, 2, and 3 can talk to each other. Nodes 4, 5, and 6 can talk to each other. But the two groups can't talk to each other.
From group 1's perspective, nodes 4, 5, and 6 are dead. They elect node 3 as leader. From group 2's perspective, nodes 1, 2, and 3 are dead. They elect node 6 as leader.
Now you have two leaders. Both groups are accepting writes from clients. Both think they're the source of truth. This is called split brain, and it's catastrophic.
Partition the network and watch both groups elect their own leader. When the partition heals, you have conflicting data. Some writes went to leader 3, some to leader 6. Which version is correct? You have no way to know.
Both the bully and ring algorithms have this fatal flaw: they can't distinguish between a crashed node and a node on the other side of a network partition. They assume if you can't reach a node, it's dead, so they proceed with the election.
How do we prevent split brain?
Requiring a majority: the quorum approach
Here's the key insight: if we require more than half the nodes to agree, two groups can't both elect a leader.
With 6 nodes, more than half means at least 4. This is called a quorum, the minimum number of votes required for a decision to be valid. If the network splits 3-3, neither group has 4 nodes. Neither group can elect a leader. Only the group with the majority can proceed.
If the split is 4-2, the group of 4 can elect a leader. The group of 2 knows it doesn't have a quorum, so it refuses to serve writes. It stays in a read-only or unavailable state until it can reconnect to the majority.
The math is simple: if you need more than n/2 votes, two groups can't both reach a quorum at the same time, because two disjoint groups that each had more than n/2 nodes would together contain more than n nodes. At most one group can hold the majority.
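The rule fits in a couple of lines. Here's a small Python sketch (the function names are just illustrative) covering the 3-3 and 4-2 splits described above:

```python
def quorum_size(n: int) -> int:
    """Smallest group that is more than half of n nodes."""
    return n // 2 + 1

def has_quorum(group_size: int, n: int) -> bool:
    return group_size >= quorum_size(n)

n = 6
print(quorum_size(n))      # 4
print(has_quorum(3, n))    # False: in a 3-3 split, neither side can elect a leader
print(has_quorum(4, n))    # True: in a 4-2 split, only the 4-node side can proceed
```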
This is the foundation of modern consensus algorithms like Raft, Paxos, and ZAB (used in ZooKeeper). They all use quorums to prevent split brain.
The tradeoff is availability. If you have 6 nodes and lose 3, the remaining 3 don't have a quorum, so they can't elect a leader. The system becomes unavailable. We sacrifice availability (the minority partition can't serve requests) to guarantee consistency (at most one leader exists at any time).
With the bully or ring algorithms, the minority partition would keep serving requests, giving you split brain. With quorum-based algorithms, the minority partition shuts down, preserving correctness.
Optimizing for the common case: pre-designated successors
Full elections are expensive, even with quorums. The bully algorithm sends O(n²) messages in the worst case. Raft can need multiple rounds of voting when votes split. But here's the thing: leader failures are rare. If the leader stays up 99.9% of the time, why spend so much effort on elections?
What if we pre-designated the next leader ahead of time?
The current leader maintains a list of successors, nodes that are next in line. When a follower detects the leader crashed, it doesn't start a full election. Instead, it contacts the first successor on the list. If that node is alive, it becomes the new leader immediately. If it's also dead, try the second successor. Only if all successors are dead do you fall back to a full election.
Think of it like a company's succession plan. The CEO lists the VP, then the director, then the senior manager. When the CEO leaves, the VP takes over automatically. No election needed.
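The fast path is essentially "walk a list and promote the first node that answers." A minimal Python sketch, assuming the old leader had already published a successor list (the names here are hypothetical):

```python
def next_leader(successors: list[int], alive: set[int]) -> int | None:
    """Return the first alive successor, or None to signal a full election."""
    for node in successors:
        if node in alive:
            return node        # fast path: promote without running an election
    return None                # everyone on the list is down: fall back

successors = [5, 4, 2]         # maintained by the old leader ahead of time
print(next_leader(successors, alive={1, 2, 4}))  # -> 4 (node 5 is also down)
print(next_leader(successors, alive={1, 3}))     # -> None: run a full election
```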
Watch how fast the failover is compared to running a full bully election. The successor just takes over. No messages to higher-ranked nodes, no waiting for responses.
This optimization only works if leader failures are rare and you trust the leader to maintain an accurate successor list. It's commonly used in production systems as a fast path, with a full quorum-based election as the fallback for edge cases.
A different approach: starting with many leaders
All the algorithms so far start with the assumption that we need to pick exactly one winner from the start. But what if we relaxed that constraint temporarily?
Here's a different idea: start with every node as the leader of its own tiny group. Then let groups merge until only one remains.
This is the invitation algorithm. Every node begins as the leader of a one-node group. Group leaders invite other nodes or groups to join. When two groups merge, one leader steps down (typically the leader of the smaller group defers to the larger group's leader).
Think of it like independent teams at a company that gradually consolidate. Each team has a lead. When two teams merge, one lead becomes the manager of both.
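A rough Python sketch of the merging idea (this captures the general shape, not any particular published implementation): every node starts as the leader of its own one-member group, and each merge keeps the larger group's leader:

```python
# Every node begins as the leader of a group containing only itself.
groups = [{"leader": n, "members": {n}} for n in range(1, 6)]

def merge(a: dict, b: dict) -> dict:
    """Merge two groups; the smaller group's leader defers to the larger's."""
    big, small = (a, b) if len(a["members"]) >= len(b["members"]) else (b, a)
    return {"leader": big["leader"], "members": big["members"] | small["members"]}

# Merge pairwise until one group remains; the merge order doesn't affect safety.
while len(groups) > 1:
    groups.append(merge(groups.pop(), groups.pop()))

print(groups[0])   # one group, one leader
```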
Watch how groups form and merge. Each merge is a local decision between two groups, but eventually everyone ends up in one large group under one leader.
The advantage is flexibility. Groups can merge in any order with no fixed topology requirement. You don't need a ring structure or pre-assigned IDs. The downside is that during the election, you temporarily have multiple leaders. Your application either needs to wait for full convergence (all nodes in one group) or handle the case where different groups might make conflicting decisions.
This approach is less common in practice but demonstrates that leader election isn't always about picking one winner immediately. Sometimes you can build consensus gradually.
Comparing the tradeoffs
Each algorithm makes different tradeoffs between simplicity, message cost, and safety.
Drag the slider to 20 nodes and watch how the bully algorithm's message count explodes while ring and Raft stay linear (growing proportionally with the number of nodes rather than with the square).
| Algorithm | Messages | Split brain protection | Topology requirement |
|---|---|---|---|
| Bully | O(n²) | No | None |
| Ring | O(n) | No | Must maintain ring |
| Raft/Paxos | O(n) | Yes (quorum) | None |
The bully algorithm is the simplest to understand and requires no special topology (nodes don't need to know about each other ahead of time). But it uses up to n²/2 messages in the worst case and can't handle network partitions.
The ring algorithm cuts messages down to exactly n by passing a single message around the ring. The tradeoff is that you need to set up and maintain the ring structure, and it still can't prevent split brain.
Raft and Paxos use quorums to prevent split brain. They use O(n) messages and don't require a special topology. The tradeoff is complexity (they require multiple rounds of voting and careful state management) and availability (minority partitions can't serve requests).
For toy systems or educational purposes, the bully or ring algorithms are fine. For production databases where correctness matters, you need a quorum-based algorithm like Raft.
Making it work in production
Leader election algorithms answer one question: how do we pick a leader? But production systems need to answer several more questions.
How do we detect the leader crashed? Before electing a new leader, you need to know the old one is gone. Most systems use heartbeats, periodic "I'm alive" messages that the leader sends to followers (like a pulse you check to see if someone is conscious). If followers don't receive a heartbeat for some timeout period (say, 5 seconds), they assume the leader crashed and start an election.
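On the follower side, failure detection is just a timestamp plus a timeout check. A minimal Python sketch (the constant and names are illustrative, not taken from any specific system):

```python
import time

HEARTBEAT_TIMEOUT = 5.0        # seconds of silence before we suspect the leader

last_heartbeat = time.monotonic()

def on_heartbeat() -> None:
    """Called whenever a heartbeat arrives from the leader."""
    global last_heartbeat
    last_heartbeat = time.monotonic()

def leader_suspected() -> bool:
    """True if the leader has been silent longer than the timeout."""
    return time.monotonic() - last_heartbeat > HEARTBEAT_TIMEOUT
```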
Step through and watch what happens when the leader crashes. The followers' timeout counters tick down, and when they hit zero, an election begins.
But here's a problem: what if the leader is just slow, not dead? What if it's in the middle of a long garbage collection pause, or the network is temporarily congested? If followers start an election while the old leader is still alive (just slow), you might end up with two leaders temporarily.
How do we prevent overlapping leaders? The solution is lease-based leadership, where the leader holds a time-limited permission to lead (like renting leadership for 30 seconds). The leader must renew its lease before it expires by getting acknowledgment from a quorum of followers. If the leader can't renew the lease (maybe due to network issues or being slow), the lease expires and the leader steps down, refusing to accept new writes even if it's still running. Meanwhile, followers can safely elect a new leader once the old lease expires.
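Here's a sketch of the leader-side check, assuming the leader only extends its lease after hearing back from a quorum (the class and field names are hypothetical):

```python
import time

LEASE_DURATION = 30.0          # seconds of leadership granted per renewal

class LeaseHolder:
    def __init__(self) -> None:
        self.lease_expires_at = 0.0

    def renew(self, quorum_acked: bool) -> None:
        """Extend the lease only if a quorum acknowledged the renewal."""
        if quorum_acked:
            self.lease_expires_at = time.monotonic() + LEASE_DURATION

    def may_lead(self) -> bool:
        """The node accepts writes only while its lease is unexpired."""
        return time.monotonic() < self.lease_expires_at
```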
Watch the lease timer count down. When the network partition happens, the leader can't reach a quorum to renew. The lease expires and the leader voluntarily steps down, even though it's still running.
This gives you a guarantee: at any moment in time, at most one node holds a valid lease and considers itself the leader.
How do we prevent a stale leader from causing damage? Even with leases, there's a race condition. Imagine the old leader's lease expires, a new leader is elected, but the old leader doesn't realize it yet (maybe it was paused and just woke up). The old leader might try to write data, conflicting with the new leader's writes.
The solution is fencing tokens, monotonically increasing version numbers that increment with each new leader (leader 1 gets token 1, leader 2 gets token 2, leader 3 gets token 3, and so on). Every write request includes the leader's token. The storage layer (the actual disk or database) remembers the highest token it's seen and rejects writes from lower tokens.
Step through the scenario where an old leader wakes up after being paused. Notice how the storage layer rejects its write because its token is outdated.
If the old leader (token 5) tries to write after a new leader (token 6) was elected, the storage sees token 5 < 6 and rejects it. This guarantees that even if an old leader doesn't know it's been replaced, its writes won't corrupt the system.
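Here's a minimal Python sketch of that check at the storage layer, mirroring the token 5 vs. token 6 example (the class is illustrative, not a real storage API):

```python
class FencedStorage:
    def __init__(self) -> None:
        self.highest_token = 0
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False                  # stale leader: reject the write
        self.highest_token = token        # remember the newest leadership era
        self.data[key] = value
        return True

storage = FencedStorage()
print(storage.write(6, "x", "new leader"))   # True
print(storage.write(5, "x", "old leader"))   # False: token 5 < 6, rejected
```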
These three mechanisms—heartbeats for failure detection, leases for preventing overlap, and fencing tokens for preventing stale writes—work together to make leader election safe in the presence of network delays, pauses, and failures.
Putting it all together
We started with a simple problem: too many messages when every node talks to every other node. Having one node coordinate cuts the message count from O(n²) to O(n).
But that creates a new problem: what happens when the coordinator crashes? We need to elect a new one.
The bully algorithm uses a simple rule: highest ID wins. It's easy to understand but uses O(n²) messages in the worst case. The ring algorithm optimizes that to O(n) by passing a token around a ring. Both work, but both share a fatal flaw: they can't handle network partitions. If the network splits, you get two leaders.
The fix is quorums: require more than half the nodes to agree. If you need a majority, two partitions can't both elect leaders. This is how modern systems like Raft and Paxos work. The tradeoff is that minority partitions become unavailable, but that's better than split brain.
Production systems layer on additional mechanisms: heartbeats to detect failures, leases to prevent overlapping leaders, and fencing tokens to block stale writes. These turn a theoretical algorithm into a production-ready system.
When someone says their database uses "leader election," they usually mean a quorum-based algorithm like Raft with these practical additions. The simpler algorithms—bully, ring, invitation—are useful for understanding the problem, but real systems need the safety that only quorums can provide.