Replication: Multiple Copies, One Truth
Your database lives on one server. That server's hard drive fails at 3 AM. Your data is gone. Your application is down. Every user sees an error page.
The fix seems obvious: keep a copy on a second server. If one dies, the other keeps serving. Problem solved.
Not quite. The moment you have two copies, you have a new problem, and it is harder than the one you just fixed.
The copy problem
Say you have two servers, each holding a copy of your data. A user writes balance = 500 to Server A. Another user reads balance from Server B. Server B still shows the old value, balance = 0, because it hasn't received the update yet.
Two users looking at the same account see different numbers. Which one is right?
Both are, depending on when you ask. Server A has the latest write. Server B has a stale copy. The data will converge eventually, but right now the copies disagree.
This is the core problem of replication: keeping multiple copies in sync. And every solution comes down to a single question: when does a write become visible on the other copies?
The coordination tax
The simplest answer: make the write visible everywhere before you tell the client it succeeded.
When a client writes balance = 500 to Server A, Server A forwards the write to Server B and waits. Only after Server B acknowledges, "got it, written to disk," does Server A respond to the client. This is synchronous replication, where the writer waits for all copies to confirm.
The result is what developers expect. You write a value, you read it back, you see what you wrote. Always. Every copy is identical at every moment. This guarantee has a name, linearizability, meaning the system behaves as if there is only one copy of the data.
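To make the waiting concrete, here is a minimal sketch of a synchronous primary in Python. The Follower objects are hypothetical in-memory stand-ins for real networked replicas; the point is the blocking, since the client's response waits on every follower.

```python
import concurrent.futures

class Follower:
    """In-memory stand-in for a follower node."""
    def __init__(self):
        self.store = {}

    def write(self, key, value):
        self.store[key] = value          # in reality: a network hop plus a disk write
        return "ack"

class SyncPrimary:
    """Acknowledge the client only after every follower has the write."""
    def __init__(self, followers):
        self.store = {}
        self.followers = followers

    def write(self, key, value):
        self.store[key] = value          # 1. apply locally
        with concurrent.futures.ThreadPoolExecutor() as pool:
            # 2. forward to every follower and block until all of them confirm;
            #    the slowest follower sets the latency of the whole write
            list(pool.map(lambda f: f.write(key, value), self.followers))
        return "ok"                      # 3. only now answer the client

primary = SyncPrimary([Follower(), Follower()])
primary.write("balance", 500)            # returns only after both followers have the value
```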
But there is a cost. If Server A is in New York and Server B is in London, every message between them takes about 70ms for the round trip. The client writes to Server A in New York, Server A sends the write to Server B in London (35ms), Server B acknowledges (35ms back), and only then does Server A respond. Every single write takes at least 70ms of network latency on top of the actual disk write, no matter how fast your servers are.
Add a third copy in Tokyo and it gets worse. The client waits for the slowest replica. If Tokyo takes 150ms to acknowledge, every write takes 150ms, even though London acknowledged in 70ms. The slowest copy sets the pace.
Banks need this. If you deposit $100, your balance must reflect that immediately at every ATM. The latency cost is worth the correctness. But for systems where a few seconds of staleness is acceptable, paying 150ms per write is expensive.
What if we don't wait?
Here's the alternative: don't wait. Write locally, respond to the client immediately, and replicate in the background.
Server A writes balance = 500 to its own disk and responds in maybe 5ms. Meanwhile, it sends the update to Server B and Server C asynchronously. They'll get it in 70ms, or 150ms, or whenever the network delivers it. This is asynchronous replication.
The client gets a fast response. But during that window, Server B and Server C still have the old value. A read from either of them returns stale data. This guarantee, or rather the lack of one, is called eventual consistency: the copies will converge to the same state eventually, but during the convergence window they can disagree.
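For contrast, here is the same primary written asynchronously, a sketch that reuses the hypothetical in-memory Follower class from above: it answers the client first and replicates in a background thread.

```python
import threading

class AsyncPrimary:
    """Acknowledge immediately; replicate in the background."""
    def __init__(self, followers):
        self.store = {}
        self.followers = followers

    def write(self, key, value):
        self.store[key] = value                       # 1. apply locally (a few ms)
        # 2. hand replication to a background thread and return right away
        threading.Thread(target=self._replicate, args=(key, value), daemon=True).start()
        return "ok"                                   # client sees a fast response

    def _replicate(self, key, value):
        # Followers catch up whenever the network delivers the update. Until then,
        # a read served by any of them returns the old value: the stale window above.
        for follower in self.followers:
            follower.write(key, value)
```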
If you post a tweet, some users might see it immediately and others might not see it for a few seconds. For a social feed, that is fine. For a bank balance, it is not.
Content delivery networks (CDNs, networks of servers around the world that cache copies of your files) use eventual consistency. When you update an image on your website, it might take minutes for the new version to propagate to all edge servers. Users in different regions see different versions of the image temporarily. For static assets, this tradeoff makes sense.
The downside is that your application now has to handle stale reads. The complexity that was previously in the database layer, "keep everything in sync," moves into your application code.
There is a middle ground. Semi-synchronous replication waits for at least one replica to acknowledge but not all of them. If you have three copies and you wait for one backup, you get durability (the write survives even if the primary crashes immediately) without paying the latency of the slowest replica.
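A sketch of that middle ground, again assuming the hypothetical in-memory followers from earlier: the primary blocks only until the first acknowledgement arrives.

```python
import concurrent.futures

class SemiSyncPrimary:
    """Wait for the first follower ack, not all of them."""
    def __init__(self, followers):
        self.store = {}
        self.followers = followers
        self.pool = concurrent.futures.ThreadPoolExecutor()

    def write(self, key, value):
        self.store[key] = value
        futures = [self.pool.submit(f.write, key, value) for f in self.followers]
        # Block until at least one follower has the write, so the data survives an
        # immediate primary crash. The stragglers finish in the background; a real
        # system would track and retry them.
        concurrent.futures.wait(futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return "ok"
```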
Who accepts the writes?
So far we have been assuming one server accepts all the writes and copies them to the others. That is one strategy. There are three, and each gives a different answer to the same question: who is allowed to accept a write?
Click through the three strategies below and watch how the write path changes.
Single-leader replication is what we have been describing. One node is the leader (also called primary). All writes go through it. The leader replicates to followers (also called replicas or secondaries). Reads can go to the leader or followers depending on your consistency requirements. PostgreSQL, MySQL, and MongoDB all use this by default. It is simple, it provides strong consistency if you read from the leader, and there is a clear write path. The downside is that the leader is a single point of failure and a bottleneck for write throughput, since you cannot scale writes horizontally.
Multi-leader replication allows multiple nodes to accept writes. Each data center gets its own leader, clients write to the nearest one for lower latency, and leaders sync in the background. The catch is that two leaders can update the same record at the same time, creating a write conflict. You need a conflict resolution strategy: last-write-wins (simple but can lose data), CRDTs (conflict-free replicated data types, data structures designed to merge automatically), or application-level merging. Multi-leader replication is rarely used within a single data center but is common across geographically distributed ones.
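To see what last-write-wins actually does, here is a small illustrative sketch. The VersionedValue type and its fields are invented for the example, and no particular database resolves conflicts exactly this way.

```python
from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: object
    timestamp: float     # wall-clock time of the write, e.g. time.time() on the leader
    writer_id: str       # tie-breaker when timestamps are equal

def last_write_wins(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Resolve a conflict between two leaders' versions of the same record.

    Simple and deterministic, but lossy: the version with the lower timestamp is
    silently discarded even though some client saw that write succeed, and clock
    skew between leaders makes "last" fuzzy in practice.
    """
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.writer_id > b.writer_id else b
```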
Leaderless replication treats all nodes equally. Any node can accept reads and writes. Clients typically write to multiple nodes simultaneously and read from multiple nodes. Amazon's Dynamo popularized this approach, and Cassandra and Riak adopted it. The availability is high, there is no single point of failure, and write latency is low since you write to the nearest nodes. The tradeoff is eventual consistency and the need for quorum tuning, which we will get to shortly.
When the leader goes down
Single-leader is the most common strategy, but it has an obvious weakness: what happens when the leader crashes?
The followers do not know the leader is gone until they stop receiving heartbeats, periodic "I'm alive" signals the leader sends every 100ms or so. Once a follower's heartbeat timer expires (typically after 500ms to 1 second of silence), it concludes the leader is dead and starts an election to choose a new one.
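A minimal sketch of the follower-side detection logic, with illustrative numbers matching the ones above rather than any particular database's defaults:

```python
import time

class FailureDetector:
    """Tracks heartbeats from the leader on a single follower."""
    HEARTBEAT_INTERVAL = 0.1    # the leader sends a heartbeat every 100 ms
    TIMEOUT = 1.0               # declare the leader dead after 1 s of silence

    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        # Called whenever a heartbeat arrives from the leader.
        self.last_heartbeat = time.monotonic()

    def leader_is_dead(self):
        # True once nothing has been heard for longer than the timeout.
        # At that point the follower starts an election for a new leader.
        return time.monotonic() - self.last_heartbeat > self.TIMEOUT
```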
Step through the failover process below and watch the timeline. Notice the gap between the crash and the recovery: that is the window where writes are unavailable.
The total failover time is typically 1-2 seconds. During that window, the system cannot accept writes. Reads from followers continue working, but any write attempt fails. For many applications, a brief pause is acceptable. For systems that need continuous availability, this motivates the move to multi-leader or leaderless strategies.
Reads that lie
There is a subtler problem that shows up even during normal operation, without any crashes.
You write a comment to the leader. The leader accepts it and starts replicating to the followers. You refresh the page. Your load balancer routes the read to a follower that hasn't received the write yet. Your comment is gone.
It is not actually gone. It is on the leader and will arrive at the follower in a few milliseconds. But you don't know that. You refresh again, get routed to a different follower, still nothing. You start to worry.
This is the read-your-writes problem. In a linearizable system, you always see your own writes immediately. In an eventually consistent system, you might not, because the replica you are reading from lags behind the one you wrote to.
Solutions exist. Session affinity routes every read from a given user session to the replica that received that session's writes, so the user always sees their own data. Reading from the leader guarantees freshness but puts more load on the leader. Reading from multiple replicas and taking the latest version works but is expensive. Many systems provide session consistency: within a single user session you always see your own writes, but different users might see different states.
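Here is one way session affinity might look in a routing layer; the Replica class and the pinning rule are simplified stand-ins, and a real router would unpin a session once its replica has caught up.

```python
class Replica:
    """In-memory stand-in for a node that serves reads (and, as leader, writes)."""
    def __init__(self):
        self.store = {}
    def write(self, key, value):
        self.store[key] = value
    def read(self, key):
        return self.store.get(key)

class SessionRouter:
    """Pin each session's reads to a replica guaranteed to have its own writes."""
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers
        self.pinned = {}                        # session_id -> replica to read from

    def write(self, session_id, key, value):
        self.leader.write(key, value)           # all writes still go to the leader
        self.pinned[session_id] = self.leader   # this session now reads where it wrote

    def read(self, session_id, key):
        # Sessions that have written read from their pinned replica and always see
        # their own writes; everyone else can be load-balanced across followers.
        replica = self.pinned.get(session_id, self.followers[0])
        return replica.read(key)
```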
What if we need guarantees?
Some systems cannot tolerate stale reads at all. A bank balance must be correct. Inventory counts must reflect actual stock, or you oversell.
In leaderless systems, where clients write to multiple nodes, there is a mechanism for this: quorums. The idea is straightforward. If you write to enough nodes and read from enough nodes, the two sets must overlap. At least one node in the read set will have the latest write.
The parameters are W (the number of nodes that must acknowledge a write), R (the number of nodes you read from), and N (the total number of replicas). If W + R > N, the read and write sets overlap, and you are guaranteed to see the latest write.
Adjust W and R below. When does the formula guarantee consistency? What happens when it doesn't?
Notice the tradeoff. Higher W means slower writes (you wait for more nodes). Higher R means slower reads (you query more nodes). You can tune these independently. A read-heavy workload might use W = N, R = 1: writes are slow but reads are fast and always consistent. A write-heavy workload might use W = 1, R = N: writes are fast but reads query every node.
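To make the overlap argument concrete, here is a toy quorum client over in-memory nodes. Version numbers stand in for whatever mechanism a real system uses to order writes, and the node selection is random purely to show that any W-node and R-node sets must intersect.

```python
import random

class Node:
    """In-memory replica storing (value, version) pairs."""
    def __init__(self):
        self.data = {}

    def put(self, key, value, version):
        _, current_version = self.data.get(key, (None, -1))
        if version > current_version:
            self.data[key] = (value, version)

    def get(self, key):
        return self.data.get(key, (None, -1))

class QuorumClient:
    """Write to W of N nodes, read from R of N nodes; W + R > N guarantees overlap."""
    def __init__(self, nodes, w, r):
        assert w + r > len(nodes), "need W + R > N for the read and write sets to overlap"
        self.nodes, self.w, self.r = nodes, w, r
        self.version = 0

    def write(self, key, value):
        self.version += 1
        # Any W nodes will do; the rest may stay stale for a while.
        for node in random.sample(self.nodes, self.w):
            node.put(key, value, self.version)

    def read(self, key):
        # Query any R nodes and keep the highest-versioned reply. By the pigeonhole
        # principle, at least one of them holds the latest write.
        replies = [node.get(key) for node in random.sample(self.nodes, self.r)]
        value, _ = max(replies, key=lambda pair: pair[1])
        return value

nodes = [Node() for _ in range(3)]            # N = 3
client = QuorumClient(nodes, w=2, r=2)        # 2 + 2 > 3
client.write("balance", 500)
print(client.read("balance"))                 # 500, no matter which two nodes are queried
```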
The protocol: how a write gets coordinated
Quorums guarantee that the right nodes have the data. But some systems, databases with complex state chief among them, need every replica to process the same operations in the same order. For that you need something stronger: state machine replication.
The idea is that if all replicas start in the same state and execute the same commands in the same order, they end up in the same state. To tolerate f crash failures, you need 2f + 1 replicas, so that a majority can always agree even if some nodes are down.
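A toy illustration of that idea (not the VSR protocol itself): replicas that replay the same ordered log of deterministic commands end up in identical states. The bank-account commands are invented for the example.

```python
class ReplicaStateMachine:
    """Apply commands from a shared, ordered log. Determinism is what matters:
    the same log replayed anywhere produces the same state."""
    def __init__(self):
        self.balances = {}
        self.applied_up_to = 0            # index of the last log entry applied

    def apply(self, log):
        # Apply only the entries not yet executed, strictly in log order.
        for op, account, amount in log[self.applied_up_to:]:
            if op == "deposit":
                self.balances[account] = self.balances.get(account, 0) + amount
            elif op == "withdraw":
                self.balances[account] = self.balances.get(account, 0) - amount
            self.applied_up_to += 1

log = [("deposit", "alice", 500), ("withdraw", "alice", 200)]
replicas = [ReplicaStateMachine() for _ in range(3)]   # 3 replicas tolerate f = 1 crash
for replica in replicas:
    replica.apply(log)
assert all(r.balances == {"alice": 300} for r in replicas)
```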
Viewstamped Replication (VSR) is one of the earliest protocols for this. It is simpler than Paxos and Raft, which makes it a good starting point. VSR uses a primary-backup architecture where one replica is the primary and the others are backups. Time is divided into views, each with a designated primary.
Step through a single write below and count the network round trips.
The minimum latency is two round trips: the client's trip to the primary and back, plus the primary's round trip to the backups, which completes before the primary replies. If your replicas are in the same data center, that might be under 1ms. If they span New York to London (70ms RTT), every write takes at least 140ms. This is why some systems use leaderless or multi-leader replication for geo-distributed setups: they trade consistency for lower latency.
The price of agreement
We started with a server that crashes and loses everything. We fixed that by adding copies. But copies that disagree are worse than useless for some applications, so we added coordination. Coordination costs latency. And the more copies you coordinate across, and the farther apart they are, the more latency you pay.
That is the fundamental tradeoff of replication. The more you coordinate, the slower your writes but the stronger your guarantees. The less you coordinate, the faster your writes but the more your application must handle stale reads, conflicts, and temporary inconsistency.
Network failures make this worse. When the network between replicas goes down (a partition), you have to choose: stop accepting writes until the partition heals (preserving consistency), or keep accepting writes on both sides and reconcile later (preserving availability). You cannot have both. This is the essence of the CAP theorem, and since partitions happen in real networks, it means you are always choosing between consistency and availability during failures.
For most applications, single-leader replication with reads from the leader is the right starting point. It is simple, correct, and handles the majority of workloads. Only relax consistency when latency becomes a measured problem. And the best way to reduce replication latency is to colocate your replicas in the same data center, or at least the same region. Cross-continent replication will always be slow, because the speed of light is not negotiable.