TCP Congestion Control

How TCP learns to share the network fairly while pushing the limits of available bandwidth.


What happens when everyone sends at once?

TCP is the Internet's most widely used transport protocol, moving everything from emails to video. But here's the problem: if every sender just blasted data into the network as fast as possible, routers would be overwhelmed. Packets would be dropped. The network would collapse.

And that's not hypothetical. It actually happened.

In October 1986, the NSFNET backbone experienced a "meltdown": effective throughput dropped by a factor of roughly 1,000. Packets were retransmitted so aggressively that the network got stuck in a vicious cycle: loss caused retransmission, which caused more loss. Episodes like this recurred for months.

Imagine a simple network: a router with a 100 Mbps outgoing link. A single sender can use all 100 Mbps. But what if there are 10 senders? Each should get roughly 10 Mbps. How does each sender know this? The Internet doesn't have a central traffic cop. Senders can't query a router and ask "how much bandwidth do you have?"

Problem

If TCP senders have no idea what bandwidth is available, how do they avoid sending so much data that the network becomes congested?

The fix came from Van Jacobson's 1988 paper, "Congestion Avoidance and Control." The insight: on wired links, packet loss rarely comes from random bit errors. It is a signal that a queue somewhere has overflowed, i.e., that congestion has occurred. If a sender detects loss, it should slow down.

Solution

Treat packet loss as a congestion signal. When a sender detects loss, it reduces its sending rate. The trick is figuring out how much to slow down and when to speed up again.


Two windows, one pipe

TCP already had flow control: the receiver advertises a window size (awnd) telling the sender how much data it can buffer. The sender never sends more than this advertised window.

But flow control only protects the receiver. What about the network itself? If the network is congested but the receiver has plenty of buffer space, packets still get dropped at intermediate routers. We need a separate limit.

Congestion control introduces the congestion window (cwnd), the sender's estimate of how much data the network can handle. The usable window is the minimum of the two:

W = min(cwnd, awnd)

The sender never sends more than W bytes without an acknowledgment. The amount of data currently in transit (sent but not yet acknowledged) is called the flight size; it can never exceed W.
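
In code terms, the sender's transmit loop respects both limits at once. A minimal sketch (names are illustrative, not from any real stack):

def usable_window(cwnd: int, awnd: int, flight_size: int) -> int:
    """Bytes the sender may still inject: W = min(cwnd, awnd) minus data in flight."""
    return max(0, min(cwnd, awnd) - flight_size)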

The ideal value for cwnd is the bandwidth-delay product (BDP): the product of the link capacity and the round-trip time.

BDP = Capacity × RTT

Example: 100 Mbps link, 100 ms RTT
BDP = 100 Mbps × 0.1s = 10 Mb = 1.25 MB

If cwnd equals the BDP, the pipeline is full and you get maximum throughput. If cwnd is smaller, the link goes idle between bursts. If cwnd is larger, queues build up at routers and packets get dropped.
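
A quick sanity check of the arithmetic in Python (a toy calculation, nothing protocol-specific):

def bdp_bytes(capacity_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bits that fit in the pipe, converted to bytes."""
    return capacity_bps * rtt_s / 8

print(bdp_bytes(100e6, 0.100))  # 1250000.0 bytes = 1.25 MB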

So the question becomes: how does the sender discover the right value for cwnd?


Starting from nothing

When a TCP connection starts, the sender has no idea how much bandwidth is available. The network could be a gigabit fiber link or a cellular connection. So TCP starts small and carefully increases its sending rate.

This algorithm is called slow start (though it's actually quite fast). Classically, the sender begins with cwnd = 1 SMSS (sender maximum segment size), one segment, typically 1460 bytes. Modern stacks start with up to 10 segments (RFC 6928), but the shape of the algorithm is the same.

Here's the key: whenever the sender receives an ACK, it increments cwnd by the number of bytes ACKed. If ACKs aren't delayed, each ACK increases cwnd by 1 segment:

RTT 0: cwnd = 1, send 1 packet
RTT 1: receive ACK, cwnd = 2, send 2 packets
RTT 2: receive 2 ACKs, cwnd = 4, send 4 packets
RTT 3: receive 4 ACKs, cwnd = 8, send 8 packets
...
RTT k: cwnd = 2^k
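
A few lines of Python reproduce this trace (assuming one ACK per segment and no delayed ACKs):

cwnd = 1  # in segments
for rtt in range(5):
    print(f"RTT {rtt}: cwnd = {cwnd}, send {cwnd} packets")
    cwnd += cwnd  # all cwnd ACKs arrive this RTT, each adding one segment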

Each ACK lets the sender transmit two new packets, and the exponential curve forms quickly:

[Figure: Slow Start Growth — cwnd vs. RTT, doubling each round trip]

The growth is exponential. Within a few RTTs, you go from 1 packet to hundreds. The algorithm is called "slow start" because it's slower than the alternative (starting with a huge window). But exponential growth means you quickly reach the network's capacity. Eventually, a packet gets dropped.

When a packet is lost, one of two things happens: the sender receives 3 duplicate ACKs (fast retransmit), or a retransmission timer expires (timeout). Either way, the sender knows it overshot. It remembers this point as the slow start threshold (ssthresh):

ssthresh = max(cwnd / 2, 2 SMSS)

The sender cuts the estimate in half. This becomes a memory of the last known good operating point.

Why divide by 2?

If loss occurred at cwnd = 100, the network is clearly rejecting a window that large. But it might accept 50. By halving, we find the boundary without being too aggressive. This is called multiplicative decrease.


Careful probing after the first loss

Once cwnd reaches ssthresh, slow start is no longer appropriate. The sender has already overshot once, so it switches to congestion avoidance, which grows cwnd much more conservatively.

Instead of doubling cwnd every RTT, congestion avoidance increases it by roughly 1 segment per RTT:

cwnd = cwnd + SMSS × SMSS / cwnd  (for each ACK received)

This results in cwnd increasing by ~1 SMSS per RTT
(approximately linear growth)
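
To see why: in one RTT the sender receives about cwnd/SMSS ACKs, each adding SMSS²/cwnd bytes, so the total increase is one SMSS. Checking in Python:

SMSS = 1460
cwnd = 32 * SMSS                           # a 32-segment window
acks_per_rtt = cwnd // SMSS
print(acks_per_rtt * SMSS * SMSS / cwnd)   # 1460.0 — one segment per RTT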

Compared to slow start, the slope is strikingly shallow:

[Figure: Congestion Avoidance Growth — linear cwnd increase above ssthresh = 32]

The growth is linear. The sender probes for additional bandwidth slowly. If the network remains stable, cwnd keeps growing. If loss happens, it halves again.

Now we can see the full lifecycle: slow start's exponential climb, a loss event, then congestion avoidance's linear probing.

[Figure: TCP Congestion Control Lifecycle — slow start (exponential), loss, congestion avoidance (linear)]

The conservation of packets principle

Van Jacobson observed that in steady state, TCP should maintain a constant number of packets in the network. When an ACK arrives, it frees up space in the pipe, so the sender injects exactly one new packet. This keeps the network in equilibrium.

As the connection runs, cwnd and ssthresh evolve together: ACKs push cwnd upward, and each loss resets ssthresh and pulls cwnd back.

[Figure: Window Evolution — cwnd and ssthresh over time]

Two approaches to recovery

The original algorithm (Tahoe, 1988) was simple: on any packet loss, set ssthresh = cwnd / 2 and reset cwnd to 1. Always restart from scratch.

ssthresh = cwnd / 2
cwnd = 1  ← Always reset to 1!
Enter slow start

This is safe but wasteful. For a high-BDP network (long RTT, high capacity), you spend many RTTs climbing back from cwnd = 1 to where you were.

Reno (1990) introduced fast recovery to avoid this waste. When fast retransmit is triggered (3 duplicate ACKs), instead of resetting to 1:

ssthresh = cwnd / 2
cwnd = ssthresh + 3 SMSS  ← Partial recovery!
Enter fast recovery (not slow start!)
For each additional duplicate ACK: cwnd += SMSS
On good ACK: cwnd = ssthresh  ← Return to congestion avoidance

The insight: during fast recovery, each duplicate ACK means a packet left the network, so you can inject a new one. This keeps the pipe full while recovering.

Comparing the two variants side by side, Reno avoids the deep valley that Tahoe creates:

[Figure: Tahoe vs Reno Recovery — Tahoe resets cwnd to 1 after loss; Reno resumes from ssthresh = 32]

NewReno (1998) fixed a subtle problem in Reno: if multiple packets are lost in one window, fast recovery might exit prematurely. NewReno tracks the "recovery point" (the highest sequence number sent when loss was detected) and stays in fast recovery until all lost packets are retransmitted.

SACK (Selective Acknowledgment) lets the receiver report holes in the sequence space. Instead of cumulative ACKs that only report the highest contiguous byte received, SACK tells the sender exactly which packets are missing. The sender can retransmit only the lost segments without guessing.
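
As a toy illustration of what SACK enables, here is how a sender could derive the missing byte ranges from a cumulative ACK plus SACK blocks (function and variable names are hypothetical):

def missing_ranges(cum_ack, sack_blocks, highest_sent):
    """Byte ranges not covered by the cumulative ACK or any SACK block."""
    holes, edge = [], cum_ack
    for start, end in sorted(sack_blocks):
        if start > edge:
            holes.append((edge, start))
        edge = max(edge, end)
    if edge < highest_sent:
        holes.append((edge, highest_sent))
    return holes

# Cumulative ACK at byte 1000; receiver holds 2000-3000 and 4000-5000
print(missing_ranges(1000, [(2000, 3000), (4000, 5000)], 6000))
# [(1000, 2000), (3000, 4000), (5000, 6000)] — retransmit exactly these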

TCP Algorithm Variants

Algorithm   Year   Key Improvement
Tahoe       1988   First congestion control. Always resets cwnd to 1 on loss.
Reno        1990   Fast recovery: avoids full slow start after a single loss.
NewReno     1998   Handles multiple losses in one window without timeout.
SACK        1996   Selective acknowledgment: sender knows exactly which packets are lost.
CUBIC       2008   Linux default. Cubic function for window growth on high-speed links.
BBR         2016   Google. Measures bandwidth and RTT directly, not loss-based.

When the math works against you

Congestion control ultimately depends on two numbers: RTT and packet loss rate. Both limit throughput, and not in the way you might expect.

Higher RTT means fewer ACKs per second. If your RTT is 1 second and you increase cwnd by 1 segment per RTT, you only grow once per second. Growth is slow.

Throughput ≈ cwnd / RTT

If cwnd = 100 packets and RTT = 10 ms: throughput = 10,000 packets/sec
If cwnd = 100 packets and RTT = 1000 ms: throughput = 100 packets/sec
(Same window, 100x lower throughput!)

Loss causes cwnd to drop. High loss means you're halving cwnd more frequently, reducing the average window size. The simplified throughput formula reveals why:

Throughput ≈ MSS / (RTT × √(loss_rate))

Note the SQUARE ROOT: throughput scales as 1/√p, so a 100x increase in
loss rate costs 10x in throughput.
0.1% loss: ∝ 1/√0.001 ≈ 31.6   (baseline)
1% loss:   ∝ 1/√0.01  = 10.0   (~3x lower)
10% loss:  ∝ 1/√0.1   ≈ 3.2    (~10x lower than at 0.1% loss)
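
The same arithmetic in Python, using the simplified formula above (real stacks deviate from this model, and the full Mathis et al. formula includes a constant factor):

import math

def reno_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Simplified steady-state TCP throughput estimate."""
    return (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate))

for p in (0.001, 0.01, 0.1):
    print(f"{p:.1%} loss: {reno_throughput_bps(1460, 0.100, p) / 1e6:.2f} Mbps")
# 0.1% loss: 3.69 Mbps / 1.0% loss: 1.17 Mbps / 10.0% loss: 0.37 Mbps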

Throughput falls off quickly as either RTT or the loss rate increases.

[Figure: RTT & Packet Loss Impact — throughput as a function of RTT and loss rate]

Beyond loss-based congestion control

The classic algorithms (Tahoe, Reno, NewReno) were designed for the 1990s. Modern networks have different characteristics: high capacity, long distances, high jitter. The assumptions that worked then don't always hold now.

CUBIC (the Linux default since 2008) replaces linear growth with a cubic function. After a loss event, the window grows slowly near the previous maximum, then accelerates when exploring uncharted territory. This matches the needs of modern high-speed links better than linear growth.

cwnd(t) = C(t - K)³ + W_max

C is a scaling constant
t is time since loss
K is the time to reach W_max with cubic growth
W_max is cwnd before last loss
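
A sketch of this window function in Python, using the constants C = 0.4 and β = 0.7 from RFC 8312 (window measured in segments):

def cubic_window(t, w_max, C=0.4, beta=0.7):
    """CUBIC window (segments) at t seconds after a loss event."""
    K = ((w_max * (1 - beta)) / C) ** (1 / 3)  # time to climb back to w_max
    return C * (t - K) ** 3 + w_max

# After a loss at w_max = 100 segments: slow near 100, fast beyond it
for t in (0, 2, 4, 6, 8):
    print(f"t = {t}s: cwnd ≈ {cubic_window(t, 100):.0f}")
# 70, 96, 100, 102, 122 — concave below w_max, convex above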

BBR (Google, 2016) takes a different approach entirely. Instead of using loss as the primary congestion signal, it measures the bandwidth and RTT directly and sets cwnd to the BDP.

BBR operates in phases:

1. Startup: Quickly probe bandwidth
2. Drain: Drain extra packets from the queue
3. Steady State: Send at (bandwidth × RTT)

If RTT increases (sign of congestion), reduce sending rate.
If RTT decreases, increase sending rate.
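
The core sizing rule, sketched in Python (heavily simplified; real BBR maintains windowed max-bandwidth and min-RTT filters and cycles through pacing gains):

def bbr_inflight_cap_bytes(btl_bw_bps, min_rtt_s, cwnd_gain=2.0):
    """Cap data in flight near the estimated BDP; the gain leaves probing headroom."""
    bdp_bytes = btl_bw_bps * min_rtt_s / 8
    return cwnd_gain * bdp_bytes

# 100 Mbps bottleneck estimate, 100 ms min RTT → cap inflight at ~2.5 MB
print(bbr_inflight_cap_bytes(100e6, 0.100))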

BBR performs much better on high-loss networks (like cellular) where packet loss isn't always a congestion signal. However, it's less fair to legacy TCP in mixed environments.


The full algorithm

Here's a simplified implementation, in Python, showing how a TCP sender adjusts its congestion window:

# tcp_congestion_control.py — simplified TCP Reno sender

SMSS = 1460                 # sender maximum segment size, bytes
SLOW_START, CONGESTION_AVOIDANCE, FAST_RECOVERY = "SS", "CA", "FR"
FAST_RETRANSMIT, TIMEOUT = "fast_retransmit", "timeout"

class RenoSender:
    def __init__(self):
        self.cwnd = SMSS            # start with one segment
        self.ssthresh = 64 * 1024   # initially "arbitrarily high"
        self.state = SLOW_START

    def on_ack(self, acked_bytes, is_duplicate=False):
        if self.state == SLOW_START:
            self.cwnd += acked_bytes               # exponential growth
            if self.cwnd >= self.ssthresh:
                self.state = CONGESTION_AVOIDANCE
        elif self.state == CONGESTION_AVOIDANCE:
            self.cwnd += SMSS * SMSS // self.cwnd  # ~1 SMSS per RTT
        elif self.state == FAST_RECOVERY:
            if not is_duplicate:    # cumulative ACK past the recovery point
                self.cwnd = self.ssthresh
                self.state = CONGESTION_AVOIDANCE
            else:                   # duplicate ACK: a packet left the network
                self.cwnd += SMSS

    def on_packet_loss(self, cause):
        self.ssthresh = max(self.cwnd // 2, 2 * SMSS)
        if cause == FAST_RETRANSMIT:
            self.cwnd = self.ssthresh + 3 * SMSS   # partial recovery
            self.state = FAST_RECOVERY
        else:  # TIMEOUT
            self.cwnd = SMSS                       # back to square one
            self.state = SLOW_START

This is TCP Reno. Each state transition is triggered by network events (ACKs, losses), and the algorithm adjusts cwnd accordingly.


Want to go deeper? Read RFC 5681 (TCP Congestion Control) and RFC 9002 (QUIC Loss Detection and Congestion Control). For modern algorithms, explore Google's BBR paper and the CUBIC paper.