TCP Congestion Control
How TCP learns to share the network fairly while pushing the limits of available bandwidth.
What happens when everyone sends at once?
TCP is the Internet's most widely used transport protocol, moving everything from emails to video. But here's the problem: if every sender just blasted data into the network as fast as possible, routers would be overwhelmed. Packets would be dropped. The network would collapse.
And that's not hypothetical. It actually happened.
In October 1986, the Internet suffered the first of a series of "congestion collapses." Throughput on the link between Lawrence Berkeley Lab and UC Berkeley dropped by a factor of roughly 1,000, from 32 kbps to 40 bps. Packets were retransmitted so aggressively that the network got stuck in a vicious cycle: loss caused retransmission, which caused more loss. The collapses went on for months.
Imagine a simple network: a router with a 100 Mbps outgoing link. A single sender can use all 100 Mbps. But what if there are 10 senders? Each should get roughly 10 Mbps. How does each sender know this? The Internet doesn't have a central traffic cop. Senders can't query a router and ask "how much bandwidth do you have?"
If TCP senders have no idea what bandwidth is available, how do they avoid sending so much data that the network becomes congested?
The fix came from Van Jacobson's 1988 paper, "Congestion Avoidance and Control." The insight: on a wired network, packet loss is rarely caused by random bit errors. It's a signal that congestion has occurred. If a sender detects loss, it should slow down.
Treat packet loss as a congestion signal. When a sender detects loss, it reduces its sending rate. The trick is figuring out how much to slow down and when to speed up again.
Two windows, one pipe
TCP already had flow control: the receiver advertises a window size (awnd) telling the sender how much data it can buffer. The sender never sends more than this advertised window.
But flow control only protects the receiver. What about the network itself? If the network is congested but the receiver has plenty of buffer space, packets still get dropped at intermediate routers. We need a separate limit.
Congestion control introduces the congestion window (cwnd), the sender's estimate of how much data the network can handle. The usable window is the minimum of the two:
W = min(cwnd, awnd)
The sender never sends more than W bytes without an acknowledgment. This value is sometimes called the flight size, the amount of data currently in transit.
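Here's a minimal sketch of that send test in Python (the names cwnd, awnd, and flight_size follow this article; a real stack tracks far more state):

def usable_window(cwnd: int, awnd: int) -> int:
    """The sender may have at most min(cwnd, awnd) bytes unacknowledged."""
    return min(cwnd, awnd)

def can_send(flight_size: int, cwnd: int, awnd: int, segment_size: int) -> bool:
    """Allow a new segment only if it fits inside the usable window W."""
    return flight_size + segment_size <= usable_window(cwnd, awnd)

# Example: cwnd (10 segments) is the binding limit, 8 segments already in flight
print(can_send(flight_size=8 * 1460, cwnd=10 * 1460, awnd=64 * 1024,
               segment_size=1460))  # True: a 9th segment still fits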
The ideal value for cwnd is the bandwidth-delay product (BDP): the product of the link capacity and the round-trip time.
BDP = Capacity × RTT
Example: 100 Mbps link, 100 ms RTT
BDP = 100 Mbps × 0.1s = 10 Mb = 1.25 MB
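To sanity-check the unit conversion, a quick Python sketch:

# BDP = capacity (bits/s) x RTT (s), converted to bytes
capacity_bps = 100e6   # 100 Mbps
rtt_s = 0.1            # 100 ms

bdp_bits = capacity_bps * rtt_s
bdp_bytes = bdp_bits / 8
print(f"BDP = {bdp_bits / 1e6:.0f} Mb = {bdp_bytes / 1e6:.2f} MB")  # BDP = 10 Mb = 1.25 MB
print(f"= {bdp_bytes / 1460:.0f} segments of 1460 bytes")           # = 856 segments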
If cwnd equals the BDP, the pipeline is full and you get maximum throughput.
If cwnd is smaller, the link goes idle between bursts. If cwnd is larger, queues build up at routers and packets get dropped.
So the question becomes: how does the sender discover the right value for cwnd?
Starting from nothing
When a TCP connection starts, the sender has no idea how much bandwidth is available. The network could be a gigabit fiber link or a cellular connection. So TCP starts small and carefully increases its sending rate.
This algorithm is called slow start (though it's actually quite fast).
The sender begins with cwnd = 1 SMSS, one segment, typically 1460 bytes.
Here's the key: whenever the sender receives an ACK, it increments cwnd by the number of bytes ACKed. If ACKs aren't delayed, each ACK increases cwnd by 1 segment:
RTT 0: cwnd = 1, send 1 packet
RTT 1: receive ACK, cwnd = 2, send 2 packets
RTT 2: receive 2 ACKs, cwnd = 4, send 4 packets
RTT 3: receive 4 ACKs, cwnd = 8, send 8 packets
...
RTT k: cwnd = 2^k
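The doubling is easy to reproduce in a few lines of Python (a sketch assuming one ACK per segment and no delayed ACKs or loss):

SMSS = 1460  # sender maximum segment size, bytes

def slow_start_rounds(rounds: int) -> list[int]:
    """Record cwnd (in segments) at the start of each RTT."""
    cwnd = SMSS
    history = []
    for _ in range(rounds):
        history.append(cwnd // SMSS)
        acks = cwnd // SMSS    # every segment sent this RTT gets ACKed
        cwnd += acks * SMSS    # +1 SMSS per ACK => cwnd doubles each RTT
    return history

print(slow_start_rounds(8))  # [1, 2, 4, 8, 16, 32, 64, 128]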
Step through the visualization and watch the exponential curve form. Each ACK lets the sender transmit two new packets:
The growth is exponential. Within a few RTTs, you go from 1 packet to hundreds. The algorithm is called "slow start" because it's slower than the alternative (starting with a huge window). But exponential growth means you quickly reach the network's capacity. Eventually, a packet gets dropped.
When a packet is lost, one of two things happens: the sender receives 3 duplicate ACKs (fast retransmit), or a retransmission timer expires (timeout). Either way, the sender knows it overshot. It remembers this point as the slow start threshold (ssthresh):
ssthresh = max(cwnd / 2, 2 SMSS)
The sender cuts the estimate in half. This becomes a memory of the last known good operating point.
Why divide by 2?
If loss occurred at cwnd = 100, the network is clearly rejecting a window that large. But it might accept 50. By halving, we find the boundary without being too aggressive. This is called multiplicative decrease.
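In code, the multiplicative decrease is a one-liner (a sketch using this article's names):

SMSS = 1460

def ssthresh_after_loss(cwnd: int) -> int:
    """Remember half the window that failed, but never less than 2 segments."""
    return max(cwnd // 2, 2 * SMSS)

print(ssthresh_after_loss(100 * SMSS) // SMSS)  # 50: loss at 100 segments -> probe from 50
print(ssthresh_after_loss(3 * SMSS) // SMSS)    # 2: the two-segment floor kicks in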
Careful probing after the first loss
Once cwnd reaches ssthresh, slow start is no longer appropriate. The sender has already overshot once, so it switches to congestion avoidance, which grows cwnd much more conservatively.
Instead of doubling cwnd every RTT, congestion avoidance increases it by roughly 1 segment per RTT:
cwnd = cwnd + SMSS × SMSS / cwnd (for each ACK received)
This results in cwnd increasing by ~1 SMSS per RTT
(approximately linear growth)
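Why does that per-ACK increment add up to one segment per RTT? A full window produces about cwnd/SMSS ACKs per RTT, and each ACK adds SMSS²/cwnd bytes, so the increments sum to roughly one SMSS. A quick Python check:

SMSS = 1460

def one_rtt_of_congestion_avoidance(cwnd: float) -> float:
    """Apply the per-ACK increment once per segment ACKed in one RTT."""
    for _ in range(int(cwnd // SMSS)):
        cwnd += SMSS * SMSS / cwnd
    return cwnd

cwnd = 40 * SMSS
grown = one_rtt_of_congestion_avoidance(cwnd)
print((grown - cwnd) / SMSS)  # ~0.99: about one segment gained per RTT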
Step through this and compare the slope to slow start. The difference is striking:
The growth is linear. The sender probes for additional bandwidth slowly. If the network remains stable, cwnd keeps growing. If loss happens, it halves again.
Now we can see the full lifecycle. Step through the three phases of TCP's congestion response:
The conservation of packets principle
Van Jacobson observed that in steady state, TCP should maintain a constant number of packets in the network. When an ACK arrives, it frees up space in the pipe, so the sender injects exactly one new packet. This keeps the network in equilibrium.
Watch how cwnd and ssthresh evolve as ACKs arrive and losses occur:
Two approaches to recovery
The original algorithm (Tahoe, 1988) was simple: on any packet loss, set ssthresh = cwnd / 2 and reset cwnd to 1. Always restart from scratch.
ssthresh = cwnd / 2
cwnd = 1 ← Always reset to 1!
Enter slow start
This is safe but wasteful. For a high-BDP network (long RTT, high capacity), you spend many RTTs climbing back from cwnd = 1 to where you were.
Reno (1990) introduced fast recovery to avoid this waste. When fast retransmit is triggered (3 duplicate ACKs), instead of resetting to 1:
ssthresh = cwnd / 2
cwnd = ssthresh + 3 SMSS ← Partial recovery!
Enter fast recovery (not slow start!)
For each additional duplicate ACK: cwnd += SMSS
On good ACK: cwnd = ssthresh ← Return to congestion avoidance
The insight: during fast recovery, each duplicate ACK means a packet left the network, so you can inject a new one. This keeps the pipe full while recovering.
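Side by side, the two loss responses look like this (a Python sketch; the names mirror the formulas above):

SMSS = 1460

def tahoe_on_fast_retransmit(cwnd: int) -> tuple[int, int]:
    """Tahoe: halve the estimate, then restart slow start from one segment."""
    ssthresh = max(cwnd // 2, 2 * SMSS)
    return ssthresh, 1 * SMSS               # (ssthresh, new cwnd)

def reno_on_fast_retransmit(cwnd: int) -> tuple[int, int]:
    """Reno: same ssthresh, but keep the pipe partly full (fast recovery)."""
    ssthresh = max(cwnd // 2, 2 * SMSS)
    return ssthresh, ssthresh + 3 * SMSS    # 3 dup ACKs = 3 packets have left

cwnd = 100 * SMSS
print(tahoe_on_fast_retransmit(cwnd)[1] // SMSS)  # 1
print(reno_on_fast_retransmit(cwnd)[1] // SMSS)   # 53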
Toggle between the two variants and notice how Reno avoids the deep valley that Tahoe creates:
NewReno (1998) fixed a subtle problem in Reno: if multiple packets are lost in one window, fast recovery might exit prematurely. NewReno tracks the "recovery point" (the highest sequence number sent when loss was detected) and stays in fast recovery until all lost packets are retransmitted.
SACK (Selective Acknowledgment) lets the receiver report holes in the sequence space. Instead of cumulative ACKs that only report the highest contiguous byte received, SACK tells the sender exactly which packets are missing. The sender can retransmit only the lost segments without guessing.
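For intuition, here's a sketch of how a sender might turn SACK information into a retransmission list. The tuple format is illustrative only; real SACK blocks travel in a TCP option:

def missing_ranges(cum_ack, sack_blocks):
    """Return the sequence-number gaps between the cumulative ACK
    and the selectively acknowledged blocks."""
    gaps = []
    last_end = cum_ack
    for start, end in sorted(sack_blocks):
        if start > last_end:
            gaps.append((last_end, start))  # bytes [last_end, start) are missing
        last_end = max(last_end, end)
    return gaps

# Receiver has everything up to byte 1000, plus islands 2000-3000 and 4000-4500
print(missing_ranges(1000, [(2000, 3000), (4000, 4500)]))
# [(1000, 2000), (3000, 4000)] -- retransmit exactly these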
When the math works against you
Congestion control ultimately depends on two numbers: RTT and packet loss rate. Both limit throughput, and not in the way you might expect.
Higher RTT means fewer ACKs per second. If your RTT is 1 second and you increase cwnd by 1 segment per RTT, you only grow once per second. Growth is slow.
Throughput ≈ cwnd / RTT
If cwnd = 100 packets and RTT = 10 ms: throughput = 10,000 packets/sec
If cwnd = 100 packets and RTT = 1000 ms: throughput = 100 packets/sec
(Same window, 100x lower throughput!)
Loss causes cwnd to drop. High loss means you're halving cwnd more frequently, reducing the average window size. The simplified throughput formula reveals why:
Throughput ≈ MSS / (RTT × √(loss_rate))
Note the SQUARE ROOT: throughput scales as 1/√(loss_rate), so even 1% loss cuts throughput hard.
0.1% loss: throughput ∝ 1/√0.001 ≈ 31.6 (baseline)
1% loss: throughput ∝ 1/√0.01 = 10 (≈3x lower than at 0.1% loss)
10% loss: throughput ∝ 1/√0.1 ≈ 3.2 (≈10x lower than at 0.1% loss)
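Plugging numbers in (a sketch of the simplified Mathis et al. model; it ignores timeouts and a constant factor, so treat the absolute values as rough):

import math

def tcp_throughput_bytes_per_s(mss_bytes, rtt_s, loss_rate):
    """Approximate throughput: MSS / (RTT * sqrt(p))."""
    return mss_bytes / (rtt_s * math.sqrt(loss_rate))

for p in (0.001, 0.01, 0.1):
    mbps = 8 * tcp_throughput_bytes_per_s(1460, rtt_s=0.1, loss_rate=p) / 1e6
    print(f"loss {p:.1%}: ~{mbps:.1f} Mbps")
# loss 0.1%: ~3.7 Mbps
# loss 1.0%: ~1.2 Mbps
# loss 10.0%: ~0.4 Mbps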
Drag the sliders and watch how quickly throughput drops as RTT or loss increases:
Beyond loss-based congestion control
The classic algorithms (Tahoe, Reno, NewReno) were designed for the networks of the 1990s. Modern networks are different: far higher capacities, intercontinental RTTs, and wireless links where loss isn't always a congestion signal. The assumptions that worked then don't always hold now.
CUBIC (the Linux default since 2006) replaces linear growth with a cubic function. After a loss event, the window grows slowly near the previous maximum, then accelerates when exploring uncharted territory. This matches the needs of modern high-speed links better than linear growth.
cwnd(t) = C(t - K)³ + W_max
C is a scaling constant
t is time since loss
K is the time to reach W_max with cubic growth
W_max is cwnd before last loss
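A sketch of the curve in Python. The constants C = 0.4 and β = 0.7 are the values used in the CUBIC paper and RFC 9438; windows are in segments for readability:

def cubic_window(t, w_max, c=0.4, beta=0.7):
    """CUBIC window (in segments) t seconds after a loss event.
    The window restarts at beta * w_max, so K is the time for the cubic
    term to grow back the (1 - beta) * w_max that was lost."""
    k = (w_max * (1 - beta) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max

w_max = 100  # segments: cwnd just before the last loss
for t in range(6):
    print(t, round(cubic_window(t, w_max), 1))
# 70.0, 86.7, 95.6, 99.3, 100.0, 100.2: flat near w_max, then accelerating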
BBR (Google, 2016) takes a different approach entirely. Instead of using loss as the primary congestion signal, it measures the bandwidth and RTT directly and sets cwnd from the BDP.
BBR operates in phases:
1. Startup: Quickly probe bandwidth
2. Drain: Drain extra packets from the queue
3. Steady state: Pace at the measured bandwidth, keeping ~(bandwidth × RTT) in flight
If RTT increases (sign of congestion), reduce sending rate.
If RTT decreases, increase sending rate.
BBR performs much better on high-loss networks (like cellular) where packet loss isn't always a congestion signal. However, it's less fair to legacy TCP in mixed environments.
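A toy model of the core idea (real BBR, per Google's paper, adds windowed max/min filters, pacing gains, and a ProbeRTT phase; everything here is simplified):

class BbrLikeEstimator:
    """Track the max delivered bandwidth and min RTT, and target their product."""

    def __init__(self):
        self.max_bw = 0.0             # bytes/sec
        self.min_rtt = float("inf")   # seconds

    def on_ack(self, delivered_bytes, interval_s, rtt_s):
        self.max_bw = max(self.max_bw, delivered_bytes / interval_s)
        self.min_rtt = min(self.min_rtt, rtt_s)

    def target_cwnd(self, gain=2.0):
        """Aim for gain x BDP in flight; gain > 1 leaves headroom to probe."""
        return gain * self.max_bw * self.min_rtt

est = BbrLikeEstimator()
est.on_ack(delivered_bytes=14600, interval_s=0.01, rtt_s=0.05)
print(est.target_cwnd())  # 146000.0 bytes: 2 x (1.46 MB/s x 50 ms)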
The full algorithm
Here's a simplified pseudocode showing how a TCP sender implements congestion control:
# tcp-congestion-control.pseudo

function init_connection():
    cwnd = 1 SMSS
    ssthresh = 64 KB
    state = SLOW_START

function on_ack(acked_bytes):
    if state == SLOW_START:
        cwnd += acked_bytes              # Exponential growth
        if cwnd >= ssthresh:
            state = CONGESTION_AVOIDANCE
    elif state == CONGESTION_AVOIDANCE:
        cwnd += SMSS * SMSS / cwnd       # Linear growth: ~1 SMSS per RTT
    elif state == FAST_RECOVERY:
        if ack_is_good():                # Cumulative ACK past recovery point
            cwnd = ssthresh
            state = CONGESTION_AVOIDANCE
        else:                            # Duplicate ACK: a packet left the network
            cwnd += SMSS

function on_packet_loss():
    if cause == FAST_RETRANSMIT:         # 3 duplicate ACKs
        ssthresh = max(cwnd / 2, 2 SMSS)
        cwnd = ssthresh + 3 SMSS
        state = FAST_RECOVERY
    elif cause == TIMEOUT:
        ssthresh = max(cwnd / 2, 2 SMSS)
        cwnd = 1 SMSS
        state = SLOW_START

This is TCP Reno. Each state transition is triggered by network events (ACKs, losses), and the algorithm adjusts cwnd accordingly.
Want to go deeper? Read RFC 5681 (TCP Congestion Control) and RFC 9002 (QUIC Loss Detection and Congestion Control). For modern algorithms, explore Google's BBR paper and the CUBIC paper.