Scaling: build it yourself
There's a moment every developer dreads. Your app works perfectly on your laptop. One server, one database, zero problems. Then you launch. Users arrive. And everything falls apart.
I've been there. You probably have too.
This guide is different. You won't read about scaling. You'll discover it by breaking things. We'll start with a single server and watch it fail. Then we'll fix it. Then it'll fail in a new way. Each failure teaches us the next pattern.
Everything on one machine
Here's your app. One server running everything, web server and database on the same machine. Drag the slider to add traffic and watch what happens when you cross 95%.
Did you see it? Everything crashed together. The web server went down. The database went down. Your entire application went offline because they shared one machine. This is called a single point of failure, a component whose failure brings down the whole system.
So how do we fix this?
Throwing money at the problem
The first thing everyone tries is upgrading the hardware. Bigger CPU. More RAM. This is called vertical scaling, scaling by making your existing machine more powerful.
Set traffic to 100 requests per second. The baseline server struggles. Now double the RAM. Better. Double the CPU. Even better. But keep upgrading and watch what happens to the cost.
Here's the catch. Doubling your RAM gives you about 1.4× the capacity, not 2×. Quadrupling it gets you roughly 2×. By the time you've upgraded to 8× the original, you're only getting about 2.5× the capacity. Each doubling costs the same, but delivers less. This is diminishing returns. Eventually you hit a ceiling: the biggest machine money can buy, and it still has a single point of failure.
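If you want to see the shape of that curve, here's a toy model. The square-root relationship, the $200 baseline price, and the 100 requests-per-second starting capacity are my own rough assumptions, picked to land near the numbers above; real hardware won't follow a neat formula.

```python
# A toy model of vertical scaling's diminishing returns. The square-root
# relationship, the $200 baseline, and the 100 req/s starting capacity are
# rough assumptions chosen to land near the numbers above, not measurements.
BASE_COST = 200       # $/month for the baseline server (hypothetical)
BASE_CAPACITY = 100   # requests/second for the baseline server (hypothetical)

for multiplier in (1, 2, 4, 8):
    cost = BASE_COST * multiplier
    capacity = BASE_CAPACITY * multiplier ** 0.5
    print(f"{multiplier}x hardware: ~{capacity:.0f} req/s "
          f"for ${cost}/mo (${cost / capacity:.2f} per req/s)")
# Each doubling costs twice as much but buys less extra capacity,
# so every request/second gets more expensive as you go up.
```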
There's got to be a better way.
Splitting the machine in two
Before we try something radical, let's try something simple. Remember how everything crashed together? What if we put the web server and database on different machines?
Try this: in the Combined mode, crank the web traffic up and watch how it affects the database. Then switch to Separated and do the same thing.
The database stays calm even when the web server struggles. They're isolated now. When the web server crashes, the database survives. When the database is slow, the web server can still serve cached data. Separation creates fault isolation, where one component failing doesn't drag everything else down with it.
But you still have two single points of failure. One web server. One database. Better than before, but a server crash still takes you offline.
What if we could remove the single points of failure entirely?
What if we just added more?
We've tried making the server bigger (expensive, diminishing returns). We've tried separation (helpful, but still fragile). Now let's try something different: what if we just add more servers?
If one server handles 100 requests per second, two servers should handle 200, right? Start with one maxed-out server and click Add Server.
The load instantly drops to 50% on each server. Add a third, 33% each. A fourth, 25% each.
This is horizontal scaling, scaling by adding more machines instead of making one bigger. Capacity scales linearly: 4 servers means 4× capacity. Cost scales linearly too. And there's no ceiling.
But you just created new problems. How do requests know which server to go to? What if a user logs in on Server 1, then their next request goes to Server 2? And all these servers are still hitting one database, so isn't that still a bottleneck?
Let's solve these one by one.
But which server gets the request?
You have three servers now. But when a request comes in, how does it know which server to go to?
You need a load balancer, a server that sits in front of your web servers and directs each incoming request to one of them. Click "Start Traffic" and watch how it distributes requests evenly.
The simplest strategy is called round-robin, where the load balancer cycles through the servers in order: first request to Server 1, second to Server 2, third to Server 3, then back to Server 1. Step through the algorithm to see it in action:
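In code, round-robin is almost embarrassingly small. This is a sketch, not a real load balancer; the server names are placeholders, and something like nginx or HAProxy gives you this (plus health checks and failover) out of the box.

```python
from itertools import cycle

# A sketch of round-robin routing. The server names are placeholders;
# a real load balancer (nginx, HAProxy) does this for you.
servers = ["server-1", "server-2", "server-3"]
rotation = cycle(servers)  # endlessly yields 1 -> 2 -> 3 -> 1 -> ...


def route(request_id: int) -> str:
    """Hand the next request to the next server in the rotation."""
    server = next(rotation)
    print(f"request {request_id} -> {server}")
    return server


for request_id in range(1, 7):
    route(request_id)
# request 1 -> server-1, request 2 -> server-2, request 3 -> server-3,
# request 4 -> server-1, and so on around the loop.
```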
Simple, right? But now the load balancer itself is a single point of failure. If it crashes, nothing gets through. The solution is to have two load balancers. If one fails, the other takes over. Production systems do this automatically.
You're starting to see a pattern: every time you eliminate one problem, you create another. The goal isn't eliminating all problems. It's knowing which ones are acceptable.
The server forgets who you are
Here's a subtle bug that bites everyone who tries horizontal scaling for the first time.
A user logs in on Server 1. Their session (their login status, shopping cart, preferences) lives in Server 1's memory. Their next request goes to Server 2. Server 2 has no idea who they are.
One workaround is sticky sessions, where the load balancer always routes the same user to the same server. But if that server crashes, the session is gone. Try it: in Sticky Sessions mode, crash the server and watch what happens. Then try Centralized (Redis) mode, where sessions are stored in Redis (a fast in-memory database that all servers share), and crash any server. The session survives.
The lesson: don't store state in your servers. Put it somewhere external that all servers can reach. Make your servers stateless, meaning they don't hold any user-specific data in memory. Stateless servers are interchangeable. Any server can handle any request. When a server crashes, you just remove it from the pool. No data lost.
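Here's what that looks like in practice, as a rough sketch. It assumes a shared Redis instance every web server can reach; the hostname, key format, and one-hour expiry are illustrative choices, not requirements.

```python
import json
import uuid

import redis

# A sketch of externalized sessions, assuming a shared Redis instance every
# web server can reach. The hostname, key format, and one-hour expiry are
# illustrative choices, not requirements.
r = redis.Redis(host="sessions.internal", port=6379, decode_responses=True)

SESSION_TTL_SECONDS = 3600


def create_session(user_id: str, cart: list) -> str:
    """Store the session in Redis so any server can read it later."""
    session_id = str(uuid.uuid4())
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS,
            json.dumps({"user_id": user_id, "cart": cart}))
    return session_id  # goes back to the browser as a cookie


def load_session(session_id: str) -> dict | None:
    """Any server can do this, including one that just replaced a crashed one."""
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```

Because the session lives in Redis rather than in any one server's memory, it doesn't matter which server handles the next request, or whether the server that created the session is even still alive.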
Five servers, one database
Remember how we separated the web server and database? Now you have five web servers, but still one database, and all five are querying it. What happens to the database?
Add more web servers and watch the database load climb:
Here's the thing: most requests are reads, not writes. And most reads are for the same data: product pages, user profiles, settings. What if we stored recently-read data in a fast cache (a temporary store that's much quicker to read from than the database) so we don't have to ask the database every time?
Toggle between No Cache and With Cache and watch the database load:
That's a 95% reduction in database queries. Instead of hitting the database 1000 times, you hit it 50 times. The other 950 requests get the cached answer in milliseconds.
Now experience this firsthand. Type a product ID and click Lookup. Try product_123 (already cached) then product_999 (not cached). Watch the timing: 2ms vs 50ms. Then look up product_999 again. Now it's cached.
The database now only handles cache misses (requests for data not in the cache) and writes. Since most applications are read-heavy, this makes a huge difference.
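The pattern the widget is simulating is usually called cache-aside: check the cache first, fall back to the database on a miss, then store the result for next time. Here's a sketch; the Redis host, key format, five-minute expiry, and the stand-in database function are all illustrative assumptions.

```python
import json

import redis

# A sketch of the cache-aside pattern. The Redis host, key format,
# five-minute TTL, and the stand-in database function are illustrative.
cache = redis.Redis(host="cache.internal", port=6379, decode_responses=True)

CACHE_TTL_SECONDS = 300  # let entries expire so stale data ages out


def fetch_product_from_db(product_id: str) -> dict:
    # Stand-in for the real (slow, ~50 ms) database query.
    return {"id": product_id, "name": "placeholder product"}


def get_product(product_id: str) -> dict:
    key = f"product:{product_id}"

    cached = cache.get(key)           # the fast path: answered from memory
    if cached is not None:
        return json.loads(cached)     # cache hit, the database never sees it

    product = fetch_product_from_db(product_id)                # cache miss
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(product))   # fill for next time
    return product
```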
The last single point of failure
Caching helps with load, but the database is still a single point of failure. If it crashes, you lose everything, cache or not. How do you eliminate this last single point of failure?
The solution is replication, keeping copies of your database on multiple servers. One server is the primary (handles all writes), and the others are replicas (copies that handle reads). But this creates a new problem.
Click "Deposit $50" to write to the primary, then immediately click "Read" on the replicas. See the problem? The replicas show stale data while they're syncing.
There are two ways to deal with this. Try both: in Transactional mode, writes wait until all replicas confirm they have the new data. In Merge mode, writes complete immediately and replicas catch up in the background.
You just found a trade-off that doesn't have a clean answer.
With transactional replication (sometimes called strong consistency), writes are slower because every replica must confirm, but data is always the same everywhere. If a replica is down, writes can fail entirely. With merge replication (eventual consistency), writes are fast because the primary doesn't wait, but different replicas might briefly return different values for the same data.
Neither is "better." Bank transactions need transactional replication, because you can't have two servers disagreeing about an account balance. Social media posts are fine with merge replication, because who cares if a like takes 500ms to appear everywhere?
I'll be honest: I find this trade-off unsatisfying. You want both speed and consistency. You can't have both. That's just how distributed systems work.
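If you'd rather poke at the trade-off in code than in the widget, here's a toy simulation. It's an in-memory stand-in for a primary with two replicas, with replication lag faked by sleeps; nothing about it is a real database.

```python
import random
import threading
import time

# A toy in-memory stand-in for a primary with two read replicas.
# Replication lag is faked with sleeps; nothing here is a real database.
class ReplicatedValue:
    def __init__(self, replica_count: int, wait_for_replicas: bool):
        self.primary = 0
        self.replicas = [0] * replica_count
        self.wait_for_replicas = wait_for_replicas  # True = "transactional" mode

    def write(self, value: int) -> None:
        self.primary = value
        for i in range(len(self.replicas)):
            if self.wait_for_replicas:
                self._sync_replica(i, value)       # slow: block until the replica has it
            else:
                threading.Thread(                  # fast: let it catch up later
                    target=self._sync_replica, args=(i, value)
                ).start()

    def _sync_replica(self, i: int, value: int) -> None:
        time.sleep(random.uniform(0.05, 0.3))      # simulated replication lag
        self.replicas[i] = value

    def read_from_replica(self) -> int:
        return random.choice(self.replicas)        # reads go to any replica


balance = ReplicatedValue(replica_count=2, wait_for_replicas=False)  # "merge" mode
balance.write(50)                     # deposit $50
print(balance.read_from_replica())    # probably still 0: a stale read
time.sleep(0.5)
print(balance.read_from_replica())    # the replicas have caught up: 50
```

Flip `wait_for_replicas` to `True` and the stale read disappears, but every write now takes as long as the slowest replica.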
What you built
Let's step back and look at what you created:
| Architecture | Capacity | Single points of failure | Monthly cost |
|---|---|---|---|
| Single server | 1K req/s | 1 (the whole machine) | $200 |
| Separated | 3K req/s | 2 (web + db) | $400 |
| Multiple web servers | 10K req/s | 2 (LB + db) | $800 |
| + HA load balancers | 10K req/s | 1 (db only) | $1,000 |
| + Redis cache | 50K req/s | 1 (db only) | $1,200 |
| + DB replication | 50K req/s | 0 | $1,800 |
You started with a single server that crashed under load. You now have an architecture that handles 50× the traffic with no single points of failure. It costs more, and it's more complex, but it works.
But what happens when a server actually fails in this final architecture? Start traffic flowing, then crash Server 1. Watch the dropped requests counter: those are requests that arrive during the 3-second window before the load balancer detects the failure. After detection, traffic redistributes automatically.
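What the load balancer is doing during that window is running health checks. Here's a sketch of that loop, assuming each server exposes a /health endpoint; the URLs, one-second check interval, and three-strikes threshold are illustrative numbers, not anyone's defaults, but with them a dead server takes roughly three seconds to detect, about the window you just watched.

```python
import time

import requests

# A sketch of the health-check loop a load balancer runs, assuming each
# server exposes a /health endpoint. The URLs, the 1-second interval, and
# the three-strikes threshold are illustrative numbers, not anyone's defaults.
SERVERS = ["http://server-1:8080", "http://server-2:8080", "http://server-3:8080"]
FAILURE_THRESHOLD = 3   # consecutive failed checks before a server is pulled
CHECK_INTERVAL = 1.0    # seconds between rounds of checks

failures = {server: 0 for server in SERVERS}
healthy = set(SERVERS)  # only servers in this set receive traffic


def run_health_checks() -> None:
    while True:
        for server in SERVERS:
            try:
                requests.get(f"{server}/health", timeout=0.5).raise_for_status()
                failures[server] = 0
                healthy.add(server)            # a recovered server rejoins the pool
            except requests.RequestException:
                failures[server] += 1
                if failures[server] >= FAILURE_THRESHOLD:
                    healthy.discard(server)    # stop routing requests to it
        time.sleep(CHECK_INTERVAL)
```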
What I actually think about scaling
Most apps never need sharding. Most apps never need microservices. Most apps need a couple of servers, a cache, and a replicated database.
The instinct is to build for scale on day one. Don't. Every layer you add is complexity you have to maintain, debug, and pay for. Start with one server. Run it until something breaks. Fix that specific thing. Repeat.
The right architecture is the simplest one that handles your traffic. Not the coolest one. Not the one Netflix uses. The simplest one that works.