Build Your Own TCP/IP Stack: TCP Data Flow
You've established a TCP connection. The three-way handshake is complete. Both sides know each other's starting sequence numbers.
Now you need to send a 10MB file.
The naive approach: send one packet, wait for acknowledgment, send the next packet, wait for acknowledgment...
But there's a problem with this.
The Throughput Problem
Let's do the math on sending one packet at a time.
Your round-trip time (RTT) to the server is 100ms. That's the time for a packet to reach the server plus the time for the ACK to come back.
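Each packet carries about 1,500 bytes (the typical Ethernet MTU; the TCP payload is slightly smaller once headers are subtracted, but 1,500 keeps the math simple).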
If you send one packet, wait 100ms for the ACK, then send another packet and wait another 100ms, how many packets can you send per second? What's your throughput?
Let's calculate:
1 packet every 100ms = 10 packets per second
10 packets × 1,500 bytes = 15,000 bytes per second
= 15 KB/s
= 0.12 Mbps
That's terrible. You have a gigabit Ethernet connection (1000 Mbps) and you're using 0.012% of it.
The problem: you're waiting. The network is sitting idle while the ACK travels back. During those 100ms, a gigabit link could have carried thousands more packets.
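The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `stop_and_wait_throughput` is made up for illustration, and it ignores transmission time and header overhead, just as the calculation above does.

```python
def stop_and_wait_throughput(segment_bytes, rtt_seconds):
    """Bytes per second when only one segment is in flight per round trip."""
    return segment_bytes / rtt_seconds


bytes_per_sec = stop_and_wait_throughput(segment_bytes=1500, rtt_seconds=0.100)
mbps = bytes_per_sec * 8 / 1_000_000      # bytes/s -> megabits/s
link_mbps = 1000                          # gigabit Ethernet
utilization_pct = mbps / link_mbps * 100

print(f"{bytes_per_sec:.0f} bytes/s = {mbps:.2f} Mbps")
print(f"Link utilization: {utilization_pct:.3f}%")
```

Try changing `rtt_seconds`: the throughput depends only on segment size and RTT, never on how fast the link is.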
How do we fix this?
The Solution: Pipelining
Instead of waiting for each ACK, what if we send multiple packets before waiting for any ACKs?
Think of it like a pipeline:
- Send packet 1 → (starts traveling)
- Send packet 2 → (starts traveling)
- Send packet 3 → (starts traveling)
- ACK for packet 1 arrives → send packet 4
- ACK for packet 2 arrives → send packet 5
- ...
The network is never idle. While packets are traveling one direction and ACKs are traveling back, you keep sending more packets.
This is called pipelining and it's the key to TCP's speed.
But how many packets can we have "in flight" at once?
Inventing the Sliding Window
We need a limit. You can't send unlimited packets: the receiver has only a finite buffer to hold incoming data, and it might not be able to process packets as fast as they arrive.
Let's track a window of bytes we're allowed to send without waiting for ACKs.
The window has three zones:
- Sent and ACKed: done, safe, can forget about these
- Sent but not ACKed: in flight, waiting for confirmation
- Ready to send: window has room for these
As ACKs arrive, the window slides forward:
Before ACK:
[Acked][In Flight......][Ready...]
       ^
       Window left edge
After ACK arrives:
[Acked.....][In Flight...][Ready...]
            ^
            Window slides right!
This mechanism has a name: the sliding window.
The window size determines how much data can be "in flight" at once. Bigger window = more data in flight = better throughput (up to a point).
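To put numbers on "up to a point": with a window of W bytes, at most W bytes can be delivered per round trip, so throughput is capped at roughly W / RTT, and it can never exceed the link rate. The sketch below reuses the 100ms RTT and gigabit link from the earlier example. The helper names are hypothetical, and the model ignores loss and header overhead.

```python
def window_limited_throughput(window_bytes, rtt_seconds, link_bytes_per_sec):
    """Upper bound on throughput: one window per RTT, never above link rate."""
    return min(window_bytes / rtt_seconds, link_bytes_per_sec)


def bandwidth_delay_product(link_bytes_per_sec, rtt_seconds):
    """Bytes that must be in flight to keep the link busy for a full RTT."""
    return link_bytes_per_sec * rtt_seconds


RTT = 0.100                   # 100 ms, as in the earlier example
LINK = 1_000_000_000 / 8      # gigabit Ethernet, in bytes per second

for window in (1_500, 65_535, 1_000_000, 12_500_000, 50_000_000):
    rate = window_limited_throughput(window, RTT, LINK)
    print(f"window={window:>10,} bytes -> {rate * 8 / 1e6:8.2f} Mbps")

bdp = bandwidth_delay_product(LINK, RTT)
print(f"Window needed to fill the link: {bdp:,.0f} bytes")
```

Once the window reaches the bandwidth-delay product, the link itself becomes the bottleneck and a larger window buys nothing.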
But who decides the window size?
The receiver controls the window size! Every segment the receiver sends back, including every ACK, carries a "window" field saying "I can accept this many more bytes." If the receiver is overwhelmed, it shrinks the window. If it catches up, it expands it. This is flow control: it prevents the sender from overwhelming the receiver's buffer.
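Here is one way to picture where the advertised window comes from. This is a toy sketch, not how a real kernel structures its buffers: `ReceiveBuffer` and its methods are hypothetical names, and the advertised window is simply the free space left in a fixed-size buffer.

```python
class ReceiveBuffer:
    """Toy receive buffer that computes the window to advertise."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffered = bytearray()   # received but not yet read by the app

    def advertised_window(self):
        """Free space left: the value that would go in the window field."""
        return self.capacity - len(self.buffered)

    def on_segment(self, data):
        """Accept as much data as fits; return the number of bytes accepted."""
        accepted = data[: self.advertised_window()]
        self.buffered.extend(accepted)
        return len(accepted)

    def app_read(self, n):
        """The application drains up to n bytes, freeing buffer space."""
        chunk = bytes(self.buffered[:n])
        del self.buffered[:n]
        return chunk


buf = ReceiveBuffer(capacity=4000)
print("initial window:", buf.advertised_window())
buf.on_segment(b"x" * 1500)
buf.on_segment(b"y" * 1500)
print("after two segments:", buf.advertised_window())
buf.on_segment(b"z" * 1500)           # only part of this segment fits
print("buffer full:", buf.advertised_window())
buf.app_read(2500)                    # a slow application finally reads
print("after app reads:", buf.advertised_window())
```

The window shrinks as data piles up and reopens only when the application reads, which is exactly the back-pressure the sender sees.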
When Packets Get Lost
The sliding window is working great. Packets are flowing, ACKs are coming back, the window slides forward. Life is good.
Then a packet gets lost.
A router's queue overflows. The packet is dropped. The receiver never sees it.
You've sent packets with sequence numbers 1000, 2000, 3000, 4000. Packet 3000 gets lost. The receiver gets 1000, 2000, 4000. What happens?
The sender has packet 3000 in its "sent but not ACKed" zone. It's waiting for an ACK.
But the ACK will never come. The receiver never got the packet.
We need timeouts.
Retransmission Timeout
The sender starts a timer when it sends a packet. If no ACK arrives before the timer expires, it retransmits.
But how long should the timeout be?
- Too short: You retransmit packets that were just slow, wasting bandwidth
- Too long: You wait forever, killing throughput
TCP solves this by measuring RTT (round-trip time) continuously and adjusting the timeout dynamically. Fast network → short timeout. Slow, variable network → longer timeout.
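As a rough illustration of "measuring RTT and adjusting the timeout," the sketch below follows the smoothing formulas from RFC 6298: a smoothed RTT (SRTT), an RTT variance estimate (RTTVAR), and RTO = SRTT + 4 × RTTVAR. The class name `RtoEstimator` is hypothetical, clock granularity is ignored, and the demo uses a 200ms floor (an assumption for illustration; RFC 6298 itself recommends a 1 second minimum) so that the numbers visibly move.

```python
class RtoEstimator:
    """Smoothed RTT and retransmission timeout, after RFC 6298."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variance
    K = 4           # variance multiplier

    def __init__(self, min_rto=1.0):
        self.min_rto = min_rto   # floor for the timeout, in seconds
        self.srtt = None         # no measurement yet
        self.rttvar = None
        self.rto = 1.0           # initial RTO before any measurement

    def on_rtt_sample(self, r):
        """Feed one RTT measurement (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement seeds both estimates
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the previous SRTT
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        self.rto = max(self.min_rto, self.srtt + self.K * self.rttvar)
        return self.rto


est = RtoEstimator(min_rto=0.2)   # lower floor so the demo numbers move
for sample in (0.100, 0.100, 0.100, 0.300, 0.100):
    rto = est.on_rtt_sample(sample)
    print(f"RTT sample={sample:.3f}s  SRTT={est.srtt:.3f}s  RTO={rto:.3f}s")
```

A steady RTT lets the timeout tighten; a single slow sample widens it again because the variance term reacts faster than the average.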
The Faster Way: Duplicate ACKs
There's a cleverer mechanism. Remember, TCP ACKs are cumulative. The ACK number says "I've received everything up to (but not including) this sequence number."
If the receiver gets packets 1000, 2000, then jumps to 4000 (missing 3000), what does it ACK?
It ACKs 3000. "I've received everything up to 3000."
When packet 5000 arrives, it still ACKs 3000. "Still waiting for 3000." When packet 6000 arrives, it STILL ACKs 3000. "STILL waiting for 3000!"
These are duplicate ACKs. The receiver keeps sending the same ACK number.
When the sender sees three duplicate ACKs, it knows a packet is missing and retransmits immediately, without waiting for the timeout. This is called fast retransmit and it's much faster than waiting for a timeout.
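To see where duplicate ACKs come from, here is a minimal sketch of the receiver's side. The function `cumulative_acks` is a hypothetical helper that assumes fixed-size segments and a receiver that buffers out-of-order data; it returns the ACK number sent after each arrival.

```python
def cumulative_acks(arrivals, initial_seq, segment_len):
    """Return the ACK number the receiver sends after each arriving segment.

    arrivals: sequence numbers in the order they reach the receiver.
    """
    expected = initial_seq     # next in-order byte we are waiting for
    buffered = set()           # out-of-order segments held for later
    acks = []
    for seq in arrivals:
        if seq == expected:
            expected += segment_len
            # Drain any buffered segments that are now in order
            while expected in buffered:
                buffered.remove(expected)
                expected += segment_len
        elif seq > expected:
            buffered.add(seq)  # a gap: keep the data, but cannot ACK past the hole
        # seq < expected is an old duplicate; we just re-send the same ACK
        acks.append(expected)
    return acks


# Segment 3000 is lost, then arrives late as a retransmission
acks = cumulative_acks([1000, 2000, 4000, 5000, 6000, 3000],
                       initial_seq=1000, segment_len=1000)
print(acks)   # [2000, 3000, 3000, 3000, 3000, 7000]
```

The first ACK of 3000 is the normal one; the next three are the duplicates that trigger fast retransmit. Once the missing segment arrives, the ACK jumps straight to 7000 because the buffered segments fill in behind it.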
Closing the Connection
You've sent all your data. The file is transferred. Now you want to close the connection.
But there's a complication: TCP is full-duplex. Data flows in both directions independently. Just because you're done sending doesn't mean the other side is done.
How do you close a connection when each side might finish at different times? How do you ensure both sides agree the connection is closed?
We need a graceful shutdown that handles each direction separately.
The Four-Way Close
Step 1: Client → Server (FIN)
FIN flag set
"I'm done sending data."
Step 2: Server → Client (ACK)
"OK, I received your FIN."
At this point, the client won't send more data. But the server might still have data to send!
Step 3: Server → Client (FIN)
FIN flag set
"I'm also done sending data."
Step 4: Client → Server (ACK)
"OK, I received your FIN. Connection fully closed."
Four messages. Each direction closes independently.
Why the wait? The server might still be sending the last chunks of a large file. It can't close until it's done. The client waits patiently, receiving data, until the server sends its FIN.
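The four steps can also be read as two small state machines, one per side. The sketch below uses the state names from the TCP specification (written with underscores, as tools like netstat print them) and covers only the simple case where one side closes first; simultaneous close is left out. The `TRANSITIONS` table and `walk` helper are illustrative names, not part of any library.

```python
# (current state, event) -> next state, for a normal non-simultaneous close
TRANSITIONS = {
    # Active closer: the side that sends the first FIN (the client above)
    ("ESTABLISHED", "send FIN"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN, send ACK"): "TIME_WAIT",
    ("TIME_WAIT", "2MSL timer expires"): "CLOSED",
    # Passive closer: the side that receives the first FIN (the server above)
    ("ESTABLISHED", "recv FIN, send ACK"): "CLOSE_WAIT",
    ("CLOSE_WAIT", "send FIN"): "LAST_ACK",
    ("LAST_ACK", "recv ACK"): "CLOSED",
}


def walk(events, state="ESTABLISHED"):
    """Apply events in order and return the list of states visited."""
    path = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        path.append(state)
    return path


client = walk(["send FIN", "recv ACK", "recv FIN, send ACK", "2MSL timer expires"])
server = walk(["recv FIN, send ACK", "send FIN", "recv ACK"])
print("client:", " -> ".join(client))
print("server:", " -> ".join(server))
```

The server can stay in CLOSE_WAIT as long as it still has data to send, which is the "wait" between steps 2 and 3.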
TIME_WAIT: The Annoying Wait
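After sending the final ACK, the client enters a state called TIME_WAIT. It sits there doing nothing for twice the Maximum Segment Lifetime (2 × MSL). RFC 793 sets MSL at 2 minutes, but real systems use shorter values; Linux, for example, holds TIME_WAIT for 60 seconds.
This is infuriating when you're developing. Your server closes a connection first, you try to restart it, and you get "Address already in use." The old connection is still in TIME_WAIT!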
But it exists for good reasons:
- Ensure the final ACK arrives: If the server's FIN gets retransmitted (because our ACK was lost), we need to still be around to ACK it again
- Kill old duplicate packets: Old packets from this connection might still be bouncing around the network. TIME_WAIT ensures they all die before we reuse the port number
Once 2 × Maximum Segment Lifetime has elapsed (anywhere from tens of seconds to a few minutes, depending on the operating system), TIME_WAIT expires and the port is truly free.
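For the "Address already in use" problem during development, the usual workaround on Linux and other Unix-like systems is to set the `SO_REUSEADDR` socket option before binding, which lets a listener bind even while old connections on that port are in TIME_WAIT. A minimal sketch, with `make_listener` as a hypothetical helper name (the option's exact semantics differ on other platforms):

```python
import socket


def make_listener(host, port):
    """Create a listening TCP socket that tolerates lingering TIME_WAIT state."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind() to have any effect
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


listener = make_listener("127.0.0.1", 0)   # port 0: let the OS pick a free port
print("listening on", listener.getsockname())
listener.close()
```

This does not disable TIME_WAIT; the old connection still waits out its timer. It only stops that wait from blocking a new listener on the same port.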
Building It: Python Implementation
Let's implement the sliding window and retransmission logic:
Sliding Window Implementation
import time


class SlidingWindow:
    """TCP sliding window for flow control."""

    def __init__(self, window_size=65535):
        self.window_size = window_size
        self.base_seq = 1000   # Start of window (oldest unACKed byte)
        self.next_seq = 1000   # Next byte to send
        # Track segments in flight
        self.in_flight = {}    # seq -> (data, send_time)
        # Highest cumulative ACK received so far
        self.last_ack = 1000

    def can_send(self, data_len):
        """Check if we can send this much data."""
        # Bytes sent but not yet ACKed
        in_flight_bytes = self.next_seq - self.base_seq
        # Can we fit more data in the window?
        return in_flight_bytes + data_len <= self.window_size

    def send(self, data):
        """Send data if window allows."""
        if not self.can_send(len(data)):
            raise RuntimeError("Window full! Cannot send.")
        seq = self.next_seq
        self.next_seq += len(data)
        # Track this segment until it is ACKed
        self.in_flight[seq] = (data, time.time())
        print(f"SEND: seq={seq}, len={len(data)}, "
              f"window=[{self.base_seq}..{self.next_seq}]")
        return seq

    def receive_ack(self, ack_num):
        """Process acknowledgment. Returns True if the window advanced."""
        if ack_num > self.next_seq:
            # ACK for data we never sent: ignore it
            print(f"ACK: {ack_num} (INVALID, beyond next_seq)")
            return False
        if ack_num <= self.last_ack:
            # Duplicate (or stale) ACK
            print(f"ACK: {ack_num} (DUPLICATE)")
            return False
        print(f"ACK: {ack_num} (NEW, advancing window)")
        # Remove ACKed segments from in-flight (ACKs are cumulative)
        acked_seqs = [seq for seq in self.in_flight if seq < ack_num]
        for seq in acked_seqs:
            del self.in_flight[seq]
        # Slide window forward
        self.base_seq = ack_num
        self.last_ack = ack_num
        print(f" Window now: [{self.base_seq}..{self.next_seq}]")
        print(f" In flight: {len(self.in_flight)} segments")
        return True

    def get_status(self):
        """Get window status."""
        in_flight_bytes = self.next_seq - self.base_seq
        available = self.window_size - in_flight_bytes
        return {
            'base': self.base_seq,
            'next': self.next_seq,
            'window_size': self.window_size,
            'in_flight': in_flight_bytes,
            'available': available
        }


# Example: Sending data with sliding window
window = SlidingWindow(window_size=6000)  # 6KB window
print("=== Sliding Window Demo ===\n")
# Send some segments
window.send(b'A' * 1000)  # Send 1KB
window.send(b'B' * 1000)  # Send 1KB
window.send(b'C' * 1000)  # Send 1KB
print(f"\nStatus: {window.get_status()}")
# Receive ACK for first segment
window.receive_ack(2000)  # ACK first 1KB
# Now we can send more
window.send(b'D' * 1000)  # Send 1KB
window.send(b'E' * 1000)  # Send 1KB
print(f"\nStatus: {window.get_status()}")
# Receive ACK for everything
window.receive_ack(6000)  # ACK all
print(f"\nFinal status: {window.get_status()}")
Retransmission with Timeouts
import time
from collections import OrderedDict


class TCPSender:
    """TCP sender with retransmission."""

    def __init__(self, rto=1.0):  # Retransmission timeout in seconds
        self.rto = rto
        self.window = SlidingWindow()
        self.timers = OrderedDict()  # seq -> timeout_time

    def send(self, data):
        """Send data and start timer."""
        seq = self.window.send(data)
        # Start retransmission timer
        self.timers[seq] = time.time() + self.rto
        print(f" Timer started: {self.rto}s")
        return seq

    def receive_ack(self, ack_num):
        """Process ACK and cancel timers."""
        # Slide window
        advanced = self.window.receive_ack(ack_num)
        if advanced:
            # Cancel timers for ACKed segments
            cancelled = [seq for seq in self.timers if seq < ack_num]
            for seq in cancelled:
                del self.timers[seq]
                print(f" Timer cancelled: seq={seq}")
        return advanced

    def check_timeouts(self):
        """Check for timeouts and retransmit. Returns the number of timeouts."""
        now = time.time()
        # Collect first, so we never modify the dict while iterating over it
        timed_out = [seq for seq, timeout_time in self.timers.items()
                     if now >= timeout_time]
        for seq in timed_out:
            print(f"\nTIMEOUT: seq={seq}")
            data, _ = self.window.in_flight[seq]
            # Retransmit (simulated: a real sender would put the segment on the wire)
            print(f"RETRANSMIT: seq={seq}, len={len(data)}")
            # Reset timer
            self.timers[seq] = now + self.rto
        return len(timed_out)


# Example: Simulate packet loss with retransmission
print("\n=== Retransmission Demo ===\n")
sender = TCPSender(rto=0.5)  # 500ms timeout
# Send 3 segments
sender.send(b'Segment 1')
sender.send(b'Segment 2')
sender.send(b'Segment 3')
# ACK first two (third one "lost")
sender.receive_ack(1009)  # ACK segment 1
sender.receive_ack(1018)  # ACK segment 2
# Wait for timeout
print("\nWaiting for timeout...")
time.sleep(0.6)
# Check for timeouts (segment 3 should timeout)
sender.check_timeouts()
Fast Retransmit (Duplicate ACKs)
class TCPSenderWithFastRetransmit(TCPSender):
    """TCP sender with fast retransmit."""

    def __init__(self, rto=1.0):
        super().__init__(rto)
        self.dup_ack_count = 0
        self.last_ack = 0

    def receive_ack(self, ack_num):
        """Process ACK with duplicate ACK detection."""
        if ack_num == self.last_ack:
            # Duplicate ACK!
            self.dup_ack_count += 1
            print(f"ACK: {ack_num} (DUPLICATE #{self.dup_ack_count})")
            if self.dup_ack_count == 3:
                # Fast retransmit!
                print("\n>>> FAST RETRANSMIT triggered! <<<")
                # The duplicate ACK number is the first byte the receiver is missing
                seq = ack_num
                if seq in self.window.in_flight:
                    data, _ = self.window.in_flight[seq]
                    print(f"RETRANSMIT: seq={seq}, len={len(data)}")
                    # Reset timer
                    self.timers[seq] = time.time() + self.rto
        elif ack_num > self.last_ack:
            # New ACK
            super().receive_ack(ack_num)
            self.last_ack = ack_num
            self.dup_ack_count = 0
        # ACKs older than last_ack are stale and ignored,
        # so last_ack never moves backward


# Example: Fast retransmit demo
print("\n=== Fast Retransmit Demo ===\n")
sender = TCPSenderWithFastRetransmit(rto=5.0)
# Send 5 segments
for i in range(5):
    sender.send(f'Segment {i+1}'.encode())
# Receiver gets 1, 2, then 4, 5 (3 is lost)
sender.receive_ack(1009)  # ACK segment 1
sender.receive_ack(1018)  # ACK segment 2
# Now receiver keeps getting out-of-order packets
# It keeps ACKing 1018 (still waiting for segment 3)
sender.receive_ack(1018)  # Duplicate ACK #1
sender.receive_ack(1018)  # Duplicate ACK #2
sender.receive_ack(1018)  # Duplicate ACK #3 -> FAST RETRANSMIT!
Output:
=== Fast Retransmit Demo ===

SEND: seq=1000, len=9, window=[1000..1009]
 Timer started: 5.0s
SEND: seq=1009, len=9, window=[1000..1018]
 Timer started: 5.0s
SEND: seq=1018, len=9, window=[1000..1027]
 Timer started: 5.0s
SEND: seq=1027, len=9, window=[1000..1036]
 Timer started: 5.0s
SEND: seq=1036, len=9, window=[1000..1045]
 Timer started: 5.0s
ACK: 1009 (NEW, advancing window)
 Window now: [1009..1045]
 In flight: 4 segments
 Timer cancelled: seq=1000
ACK: 1018 (NEW, advancing window)
 Window now: [1018..1045]
 In flight: 3 segments
 Timer cancelled: seq=1009
ACK: 1018 (DUPLICATE #1)
ACK: 1018 (DUPLICATE #2)
ACK: 1018 (DUPLICATE #3)

>>> FAST RETRANSMIT triggered! <<<
RETRANSMIT: seq=1018, len=9
Fast retransmit is much faster than waiting for timeouts! Three duplicate ACKs signal "I got something after this, so this specific packet is missing." The sender can immediately retransmit without waiting seconds for the timeout. This keeps throughput high even when packets are lost.
The Complete Stack
Let's step back and see what we've built across this entire series.
Layer 1: Ethernet (Local Delivery)
- Frames with MAC addresses
- ARP for address discovery
- Broadcast for "ask everyone"
- Works only on the local network
Layer 2: IP (Global Routing)
- Packets with IP addresses
- TTL to prevent loops
- Checksum for error detection
- ICMP for diagnostics (ping, traceroute)
- Best-effort, unreliable delivery
Layer 3: TCP (Reliable Connections)
- Three-way handshake to establish connections
- Sequence numbers for ordering
- Acknowledgments for confirmation
- Sliding window for throughput
- Retransmission for loss recovery
- Flow control to protect the receiver
- Four-way close for graceful shutdown
These three layers work together to provide what applications need: a reliable stream of bytes between two computers anywhere on the internet.
What Applications See
Here's the beautiful part: applications don't see any of this complexity.
When you write code, you call:
- connect(): TCP does the three-way handshake
- send(data): TCP breaks it into packets, adds sequence numbers, manages the window
- recv(): TCP reassembles bytes in order, handles retransmissions
- close(): TCP does the four-way close
The kernel handles everything:
- Building Ethernet frames
- Calculating IP checksums
- Managing TCP state
- Retransmitting lost packets
- Sliding the window
- Handling timeouts
From the application's perspective, there's just a reliable stream of bytes going in and out. The chaos of the unreliable network (lost packets, reordered packets, duplicates) is completely hidden.
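To make that concrete, here is a small sketch of what the application side looks like using Python's standard `socket` module: a one-shot echo server and a client talking over the loopback interface. The helper `echo_once` and the message contents are made up for illustration; everything TCP does underneath is invisible in the code.

```python
import socket
import threading


def echo_once(listener):
    """Accept one connection and echo bytes back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            chunk = conn.recv(4096)
            if not chunk:          # empty read: the peer sent its FIN
                break
            conn.sendall(chunk)


listener = socket.create_server(("127.0.0.1", 0))   # port 0: OS picks a free port
port = listener.getsockname()[1]
threading.Thread(target=echo_once, args=(listener,), daemon=True).start()

# create_connection() returns once the three-way handshake is done
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello over TCP")     # segmentation, sequencing, windowing
    client.shutdown(socket.SHUT_WR)       # send our FIN; we can still receive
    reply = b""
    while True:
        chunk = client.recv(4096)         # in-order bytes; b"" once the server closes
        if not chunk:
            break
        reply += chunk

listener.close()
print(reply)
```

Note the `shutdown(SHUT_WR)` call: it closes only the client's sending direction, which is the half-close that the four-way shutdown makes possible.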
This abstraction is one of the internet's greatest achievements.
You've Built the Internet
When you type ping google.com in your terminal, here's what actually happens:
- DNS lookup resolves google.com to an IP address (e.g., 142.250.185.46)
- Routing decision determines the next hop (your home router)
- ARP request discovers your router's MAC address
- ICMP packet (Echo Request) is created
- IP header is added with destination = Google's IP
- Ethernet frame is built with destination = router's MAC
- Frame sent on the wire to your router
- Router receives, decrements TTL, recalculates checksum, forwards to ISP
- Hops across the internet, router by router, until reaching Google
- Google responds with ICMP Echo Reply
- Reply travels back through the same layers in reverse
- Your stack unwraps the frame → IP packet → ICMP reply
- ping displays "64 bytes from 142.250.185.46: time=23ms"
Every single step uses concepts we've built in this series.
Going Deeper
This is just the beginning. Modern networks add many more layers:
- TLS for encryption
- HTTP/2 and HTTP/3 for efficient web transfers
- QUIC for faster connection setup
- Congestion control algorithms (Reno, CUBIC, BBR)
- Quality of Service (QoS) for prioritizing traffic
But it all builds on the foundation we've created: Ethernet, IP, and TCP.
If you want to go deeper:
- RFC 793: Original TCP specification (since superseded by RFC 9293, which consolidates later updates)
- TCP/IP Illustrated by W. Richard Stevens: The classic book
- Build your own: Use a TUN/TAP interface to implement a real TCP/IP stack
The best way to truly understand networking is to build it yourself. Every line of code teaches you something about how the internet actually works.
You've now built the core of the TCP/IP stack from scratch. You understand how data travels from your computer to anywhere in the world, and back.
That's the internet.