Build Your Own Container
You've written the perfect application. You test it on your laptop. It works flawlessly. You deploy it to the production server. It crashes immediately.
The versions don't match. Your laptop has Python 3.11. The server has Python 2.7. Your numpy is version 1.24. The server has 1.19. The system libraries are different. The timezone settings are different. Even the OpenSSL version is different.
You spend the next six hours debugging environment differences instead of shipping features.
This is not a small problem. This is the problem that has plagued software deployment for decades. Containers solve it by packaging everything: your code, the runtime, the libraries, the entire OS userspace.
But here's the question that matters: What IS a container?
The answer might surprise you. A container is not a virtual machine. It's not a sandbox running in some isolated environment. It's not magic.
A container is just a Linux process with some restrictions applied.
In this post, we're going to build our own container from scratch. We'll start with the most basic Linux primitives (processes, system calls, file descriptors) and incrementally add isolation, resource limits, and filesystem layers until we have something that looks and acts like a Docker container.
Let's get started.
What Is a Process?
Before we can understand containers, we need to understand processes. A process is a running program. When you execute python app.py, the kernel creates a process. When you run nginx, that's a process. When you open a browser, each tab is a process.
Every process has five essential components:
1. PID (Process ID) A unique number identifying this process. For example: PID 1234.
2. Memory A private address space divided into:
- Code segment (the actual program instructions)
- Data segment (global variables)
- Heap (dynamically allocated memory via malloc/new)
- Stack (local variables and function calls)
3. File Descriptors References to open files, sockets, and pipes:
- 0: stdin (standard input)
- 1: stdout (standard output)
- 2: stderr (standard error)
- 3+: any other open files (like /var/log/app.log)
4. Environment Variables Key-value pairs like:
PATH=/usr/bin:/bin
HOME=/home/user
USER=alice
5. Working Directory
Where relative paths are resolved from (like /app).
When the kernel creates a process, it:
- Assigns a unique PID
- Allocates memory pages for code, data, heap, and stack
- Creates a file descriptor table (0=stdin, 1=stdout, 2=stderr)
- Copies environment variables from the parent process
- Sets the working directory (usually inherited from parent)
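All five components are visible through the /proc filesystem. Here's a quick sketch you can run in any shell ($$ expands to the shell's own PID):
# The shell's PID
echo $$
# File descriptors: 0=stdin, 1=stdout, 2=stderr, plus any other open files
ls -l /proc/$$/fd
# Environment variables (NUL-separated in /proc, so translate to newlines)
tr '\0' '\n' < /proc/$$/environ
# Working directory
readlink /proc/$$/cwd
# Memory layout: code segment, heap, stack, shared libraries
head /proc/$$/maps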
Here's what's important: all processes on a Linux system share the same kernel. They all see the same filesystem at /. They all share the same network interfaces. They can all see each other via /proc. If one process uses too much memory, it can crash other processes.
This is a problem.
The Dependency Problem
How do we deploy code reliably when development and production environments have different software versions installed?
Look at the demo above. Your laptop has one set of dependencies. The production server has another. Your app imports numpy and expects certain functions to exist. But the production server has an older numpy where those functions don't exist yet, or worse, exist but behave differently.
This happens because both your laptop and the production server are using the host's installed packages. You have no control over what the server has installed. Different system administrators make different choices. Different OS distributions ship different versions. A server that's been running for two years has packages from two years ago.
The traditional solution? "Works on my machine" becomes a meme. You spend days writing deployment documentation: "Install Python 3.11.3 exactly, then pip install numpy==1.24.2, then...". It breaks anyway because the server has a different glibc version or a different openssl.
Don't rely on the host at all. Package everything your app needs into a single, portable image. The container brings its own Python, its own numpy, its own libc, its own everything. The host only provides the kernel.
Discovering the Container Primitive
Let's do an experiment. Open a terminal and run:
docker run -d nginx
docker ps # Note the container ID
ps aux | grep nginx
You'll see nginx in the output. It's right there in your process list, running on your kernel, using your CPU, using your memory.
Now try the demo above. Start a container and watch what happens. Toggle between the host view and container view.
Notice something strange? The same process appears in both views, but with different PIDs.
Inside the container, nginx thinks it's PID 1 (the first process, the init process). On the host, it's PID 1235 (or some other high number). How is the same process showing up with two different PIDs?
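If you have Docker handy, you can verify the dual numbering yourself. A small sketch (the container ID is whatever docker ps reported):
# Host-side PID of the container's main process
docker inspect --format '{{.State.Pid}}' <container-id>
# Inside the container, the same process reports itself as PID 1
docker exec <container-id> cat /proc/1/comm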
This is the fundamental insight: a container is not a separate machine or VM. It's a regular Linux process with a different view of the system.
Containers vs Virtual Machines
Let's be precise about the difference.
Virtual machines run a complete operating system with its own kernel. When you boot a VM:
- The hypervisor allocates RAM (usually 1-2GB minimum)
- A bootloader runs and loads the kernel into memory
- The kernel initializes drivers for virtual hardware
- Init/systemd starts and launches system services
- Finally, your app starts
This takes seconds to minutes and uses gigabytes of memory before your app even loads.
Containers share the host kernel. When you start a container:
- The kernel creates some namespaces (instant)
- It sets up cgroups (instant)
- It mounts an overlay filesystem (instant)
- Your app starts (milliseconds)
No kernel to boot. No drivers to initialize. No system services to start. You're just executing a binary with some restrictions.
The trade-off: VMs give you stronger isolation because each VM has its own kernel. A kernel vulnerability in one VM doesn't affect others. With containers, they all share the same kernel, so a kernel exploit could escape. Use containers for microservices where you control the code. Use VMs for multi-tenant environments where you're running untrusted code.
Don't take our word for it. Feel the difference yourself:
Click "Start VM" and watch the boot sequence. Now click "Start Container". The difference in startup time is not marginal. It's orders of magnitude.
So how does Linux make a process think it's PID 1 when it's really PID 1235? How do we give it a different view of the system?
Building Isolation: Namespaces
How do we make a process believe it's the only thing running on the system, without actually giving it access to everything?
Let's start with a concrete example. Run this command:
ps aux
You see every process on the system. Hundreds of them. Systemd, SSH, cron, your terminal, everything. Every process can see every other process.
Now run docker run -it ubuntu ps aux:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 4624 1628 pts/0 Ss 10:30 0:00 ps aux
Only one process. The container can't see anything running on the host. It can't see other containers. It's completely isolated.
How?
PID Namespaces: Two Views of Reality
In a normal Linux system, all processes share the same PID numbering. PID 1 is systemd (or init). Your shell is maybe PID 1234. Chrome is PID 5678. They all see each other in the same namespace.
A PID namespace gives a process its own private PID numbering. Let's visualize this:
Click "Create Namespace" in the demo above. Watch carefully. The same process exists in both namespaces simultaneously:
- In the host namespace: PID 3847 (or whatever the kernel assigns)
- In the container namespace: PID 1
The kernel maintains two separate PID tables. When a process in the container namespace calls getpid(), the kernel looks it up in the container's table and returns 1. When the host calls kill 3847, the kernel looks it up in the host's table and finds the same process.
This is not virtualization. This is not emulation. The kernel is just lying to the process about its PID.
Try creating this yourself:
sudo unshare --pid --fork --mount-proc bash
echo $$ # Prints 1
ps aux # Only shows processes in this namespace
The unshare command (a thin wrapper around the unshare system call) creates new namespaces. The --pid flag says "create a new PID namespace". The --fork flag creates a child process that becomes PID 1 in the new namespace. The --mount-proc flag mounts a new /proc that only shows processes in this namespace.
Now, from another terminal on the host, run ps aux | grep bash. You'll see your bash shell with its real PID (probably something like PID 3847). But inside the namespace, echo $$ prints 1.
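Namespace membership is itself visible in /proc. Every namespace has an ID; if two processes show the same ID, they share that namespace. A quick check:
# The namespaces this shell belongs to (pid, net, mnt, uts, ipc, user, ...)
ls -l /proc/$$/ns
# Compare against PID 1; a different pid namespace ID means different worlds
sudo ls -l /proc/1/ns
# lsns (from util-linux) summarizes every namespace on the system
sudo lsns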
The Six Types of Namespaces
PID namespaces are just the beginning. Linux provides six classic namespace types (newer kernels add more, such as cgroup and time namespaces), each isolating a different resource:
Click through each namespace type in the demo. Watch how isolation is built up layer by layer:
- PID namespace: Isolates process numbering. Container thinks it's PID 1.
- Network namespace: Isolates network interfaces, routing tables, firewall rules. Container gets its own lo interface, its own IP address space.
- Mount namespace: Isolates filesystem mounts. Container can mount /proc without seeing host processes.
- UTS namespace: Isolates hostname. Container can set its hostname without affecting the host.
- IPC namespace: Isolates System V IPC and POSIX message queues. Container's shared memory is separate.
- User namespace: Isolates user IDs. Root in the container (UID 0) maps to UID 100000 on the host. Security jackpot.
A typical container uses all six together. Docker creates all these namespaces with a single system call. You can create them individually:
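Here's a sketch of creating them one at a time with the unshare tool; each flag maps to one namespace type:
# UTS only: change the hostname without affecting the host
sudo unshare --uts bash -c 'hostname container1; hostname'
# Mount only: mounts made in here are invisible outside
sudo unshare --mount bash
# Network only: the fresh namespace has no usable interfaces yet
sudo unshare --net bash -c 'ip link'
# Several together, roughly what a container runtime does
sudo unshare --pid --net --mount --uts --ipc --fork --mount-proc bash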
Network Isolation in Action
Network namespaces are particularly interesting because they create complete network stacks. Each namespace has:
- Its own network interfaces (lo, eth0, etc.)
- Its own IP addresses
- Its own routing tables
- Its own iptables firewall rules
- Its own socket connections
In the demo above, create a network namespace. Notice that it starts with absolutely nothing. Not even a loopback interface. You have to:
- Create a virtual ethernet pair (veth)
- Put one end in the namespace
- Set up IP addresses
- Configure routing
- Set up NAT on the host
This is why containers are isolated by default. They literally can't see other containers' network traffic because they're in completely separate network stacks. To connect them, you need to explicitly create bridges or overlay networks.
Try it yourself:
# Create network namespace
sudo ip netns add container1
# Create veth pair
sudo ip link add veth0 type veth peer name veth1
# Move one end into namespace
sudo ip link set veth1 netns container1
# Configure the host side
sudo ip addr add 10.0.0.1/24 dev veth0
sudo ip link set veth0 up
# Configure the container side
sudo ip netns exec container1 ip addr add 10.0.0.2/24 dev veth1
sudo ip netns exec container1 ip link set veth1 up
sudo ip netns exec container1 ping 10.0.0.1
Limiting Resources: Cgroups
We've solved the isolation problem. Namespaces give each container its own view of processes, network, and filesystems. But there's still a problem:
Namespaces isolate what a container can see, but a container could still use 100% of your CPU and all your memory. How do we prevent one container from starving the others?
Imagine you have three containers running. Container A is running your API. Container B is running your database. Container C is running a batch job that someone wrote poorly. It has a memory leak.
Without limits, Container C will consume all available memory. The kernel's OOM killer will start terminating processes. It might kill your database. Your entire system goes down because one container misbehaved.
Control groups (cgroups) solve this by limiting how much CPU, memory, disk I/O, and network bandwidth each process can use.
Memory Limits: Preventing Memory Exhaustion
Let's visualize what happens when a container tries to use more memory than its limit:
In the demo above, set a memory limit of 512MB. Now click "Allocate Memory" repeatedly. Watch what happens:
- Memory usage increases: 100MB → 200MB → 300MB...
- At 512MB, the limit is hit
- The kernel's OOM killer terminates the process
- The container crashes
This is by design. The container dies rather than taking down the whole system. Your other containers keep running. Your host stays stable.
Cgroups are implemented via a special filesystem at /sys/fs/cgroup. To create a memory limit, you write to files:
# Create a cgroup (just mkdir!)
mkdir /sys/fs/cgroup/memory/mycontainer
# Set limit to 512MB
echo 536870912 > /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes
# Add a process to this cgroup
echo $PID > /sys/fs/cgroup/memory/mycontainer/cgroup.procs
Now every memory allocation from that process (and its children) is tracked. When total usage exceeds 512MB, the OOM killer strikes.
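To watch the limit being enforced, put a shell into the cgroup and deliberately over-allocate. A sketch, assuming the stress-ng tool is installed:
# Join the cgroup (root is needed to write to cgroup files)
sudo sh -c "echo $$ > /sys/fs/cgroup/memory/mycontainer/cgroup.procs"
# Try to allocate 1GB against the 512MB limit
stress-ng --vm 1 --vm-bytes 1G --timeout 10s
# The worker is OOM-killed once usage crosses the limit; the kernel logs it
dmesg | tail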
You can also read the current usage at any time:
cat /sys/fs/cgroup/memory/mycontainer/memory.usage_in_bytes
This is how docker stats works. It's just reading cgroup files.
CPU Limits: Throttling Instead of Killing
CPU limits work fundamentally differently than memory. You can't "kill" a process for using too much CPU (that would be absurd). Instead, Linux throttles it.
In the demo, set a CPU limit of 50%. Start a CPU-intensive task. Watch what happens:
The process runs for 50ms, then gets put to sleep for 50ms. This pattern repeats every 100ms. The process never crashes. It just runs at half speed.
The formula for CPU limits:
CPU usage = (quota / period) × 100%
Examples:
- quota=50000, period=100000 → 50% CPU
- quota=25000, period=100000 → 25% CPU
- quota=200000, period=100000 → 200% CPU (can use 2 cores fully)
To set this up:
# Allow 50% CPU (50,000 microseconds out of 100,000)
echo 50000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_period_us
The kernel's Completely Fair Scheduler (CFS) enforces this. Every period microseconds, the process gets quota microseconds of CPU time. When quota runs out, the process sleeps until the next period starts.
This is brilliant for multi-tenant systems. You can guarantee that a runaway container never steals all CPU from other containers. Each container gets its fair share.
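One caveat: the file names above are from cgroup v1. On newer distributions that use the unified cgroup v2 hierarchy, the same limits live in differently named files. A sketch of the v2 equivalents:
# cgroup v2: one unified tree, no per-controller subdirectories
mkdir /sys/fs/cgroup/mycontainer
# Memory limit: a single memory.max file
echo 536870912 > /sys/fs/cgroup/mycontainer/memory.max
# CPU limit: quota and period share one cpu.max file ("quota period")
echo "50000 100000" > /sys/fs/cgroup/mycontainer/cpu.max
# Attach a process exactly as before
echo $PID > /sys/fs/cgroup/mycontainer/cgroup.procs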
Sharing Code: Overlay Filesystems
We've solved isolation. We've solved resource limits. Now we have a new problem:
If 10 containers all use the same Ubuntu base image, do we really need 10 copies of Ubuntu on disk? And how do we let containers write to their filesystem without modifying the original image?
Imagine you have 10 containers. All of them run Python apps. All of them start from the same Ubuntu 22.04 base image. That base image is 80MB.
Without sharing: 10 containers Γ 80MB = 800MB of wasted disk space.
With sharing: 80MB total. All 10 containers share the same base layer.
Container images are built in layers, like a stack of transparent sheets. Each layer contains files that were added or changed:
┌──────────────────────┐
│ Layer 4: Your code   │ ← 5MB
├──────────────────────┤
│ Layer 3: numpy       │ ← 50MB
├──────────────────────┤
│ Layer 2: Python      │ ← 120MB
├──────────────────────┤
│ Layer 1: Ubuntu      │ ← 80MB
└──────────────────────┘
When you pull an image, you only download layers you don't already have. If you already have the Ubuntu layer from another image, Docker skips it. This is how images can be gigabytes large but only download a few megabytes.
But how do you stack read-only layers and still let the container write files?
Overlay Filesystems: Stacking Read-Only Layers
The demo above shows how overlay filesystems work. We have:
- Lower layers: Read-only image layers (Ubuntu, Python, numpy, your code)
- Upper layer: A writable layer unique to this container
- Merged view: What the container actually sees
Try the following in the demo:
1. Read a file from lower layers
- Click on app.py in the merged view
- The file exists in Layer 4 (your code layer)
- Reading from lower layers is instant. No copying happens.
2. Write a new file
- Create output.txt in the merged view
- It appears in the upper layer only
- Lower layers remain unchanged
3. Modify an existing file
- Edit app.py in the merged view
- The original stays in Layer 4 (lower, read-only)
- A copy appears in the upper layer with your changes
- This is called copy-on-write (COW)
4. Delete a file
- Delete app.py in the merged view
- The original still exists in Layer 4 (can't delete from read-only layer)
- A whiteout file appears in upper: .wh.app.py
- The merged view hides the file
The mount command that makes this work:
mount -t overlay overlay \
-o lowerdir=/layer4:/layer3:/layer2:/layer1,upperdir=/upper,workdir=/work \
/merged
When you read a file, the kernel checks layers from top to bottom:
- Check upper layer first
- If not found, check layer 4
- If not found, check layer 3
- If not found, check layer 2
- If not found, check layer 1
- If not found, return "file not found"
When you write or modify a file:
- Always goes to upper layer
- If modifying, kernel copies from lower to upper first (COW)
When you delete a file:
- If it only exists in upper: actually delete it
- If it exists in lower: create .wh.filename in upper to hide it
This is why containers start instantly even with gigabyte-sized images. Nothing is copied. The layers are mounted as-is. Only writes create new data in the upper layer.
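You can reproduce the whole mechanism in a throwaway directory. A minimal sketch (requires root and a kernel with overlayfs, which any modern distribution has):
# One read-only "image" layer plus the overlay scaffolding
mkdir -p /tmp/demo/{lower,upper,work,merged}
echo "from the image" > /tmp/demo/lower/app.py
sudo mount -t overlay overlay \
  -o lowerdir=/tmp/demo/lower,upperdir=/tmp/demo/upper,workdir=/tmp/demo/work \
  /tmp/demo/merged
# Reads come straight from the lower layer
cat /tmp/demo/merged/app.py
# Writes trigger copy-on-write into upper; lower stays untouched
echo "modified" > /tmp/demo/merged/app.py
cat /tmp/demo/lower/app.py   # still "from the image"
cat /tmp/demo/upper/app.py   # "modified"
# Deleting creates a whiteout in upper (live overlayfs encodes it as a
# character device; the .wh. names are the image-tarball convention)
rm /tmp/demo/merged/app.py
ls -la /tmp/demo/upper
sudo umount /tmp/demo/merged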
Hiding the Host: pivot_root
We have namespaces for isolation. We have cgroups for resource limits. We have overlay filesystems for efficient storage. But there's still a gaping security hole:
The container can still see the host's files at /. How do we completely hide the host filesystem from the container?
Think about what the container can currently see:
- /etc/passwd with all the host's users
- /home with all the host's home directories
- /root with the root user's files
- /sys and /proc showing host processes
- /boot with the host's kernel
This is a massive security problem. Even though we have mount namespaces, the container still starts with the host's filesystem as /. We need to replace the root filesystem entirely.
pivot_root: Swapping the Root Filesystem
Click through the demo above. Watch what happens to the filesystem tree:
Before pivot_root:
/ (host root)
├── bin/
├── etc/
├── home/
└── containers/
    └── myapp/        ← container files are here
        ├── bin/
        ├── etc/
        └── app/
After pivot_root:
/ (container root - was /containers/myapp)
├── bin/          ← container's bin
├── etc/          ← container's etc
├── app/          ← your application
└── .old_root/
    ├── bin/      ← host's bin (mounted but hidden)
    ├── etc/      ← host's etc
    └── ...
After unmounting .old_root:
/ (container root)
├── bin/
├── etc/
└── app/
The host filesystem is completely gone. The container's / is the new root. The container cannot access any host files.
Here's how it works:
# Step 1: Prepare the new root
mount --bind /containers/myapp /containers/myapp
cd /containers/myapp
# Step 2: Pivot the root
mkdir .old_root
pivot_root . .old_root
# Now:
# / is /containers/myapp
# /.old_root is the old host root
# Step 3: Clean up
umount -l /.old_root
rmdir /.old_root
The pivot_root system call does two things atomically:
- Makes the current directory (.) the new root filesystem
- Moves the old root filesystem to .old_root
This is fundamentally different from chroot, which just changes one process's view of /. The chroot syscall is famously easy to escape from: a root process can chroot into a subdirectory (its working directory stays outside the new root), then chdir("..") repeatedly to walk back up to the real root. With pivot_root, you're actually changing what the kernel considers to be the root mount point. There's no "parent" to escape to.
Fine-Grained Privileges: Capabilities
We have another problem. Linux has traditionally had exactly two privilege levels:
- root (UID 0): Can do absolutely anything
- non-root (UID > 0): Restricted by permissions
This is binary. You're either all-powerful or you're restricted. There's no middle ground.
Some containers need some root-like powers (like binding to port 80) without having ALL root powers (like loading kernel modules). How do we give partial privileges?
The classic example: nginx needs to bind to port 80. Only root can bind to ports below 1024. So you run nginx as root. But now nginx has all root powers. It can mount filesystems, load kernel modules, access raw disk devices, trace other processes. If nginx gets compromised, the attacker has full root access.
Linux capabilities solve this by splitting root's powers into 40+ individual permissions. You can grant just CAP_NET_BIND_SERVICE (bind to low ports) without granting CAP_SYS_ADMIN (mount filesystems) or CAP_SYS_MODULE (load kernel modules).
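Capabilities can be granted to binaries as well as processes, using the standard libcap tools. A sketch (the web server path is just an example):
# Grant only the low-port capability to a binary; no other root powers
sudo setcap cap_net_bind_service=+ep /usr/local/bin/mywebserver
# Verify what was granted
getcap /usr/local/bin/mywebserver
# Inspect the current shell's capability sets
capsh --print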
What Docker Keeps vs Drops
Docker containers run with a reduced capability set by default. Here's what they keep:
Safe capabilities (Docker keeps these):
- CAP_CHOWN: Change file ownership
- CAP_NET_BIND_SERVICE: Bind to ports < 1024
- CAP_SETUID/CAP_SETGID: Change user/group IDs
- CAP_DAC_OVERRIDE: Bypass file permission checks
- CAP_FOWNER: Bypass permission checks for file operations
- CAP_KILL: Send signals to other processes
These are useful for normal applications and relatively low-risk.
Dangerous capabilities (Docker drops these):
- CAP_SYS_ADMIN: Mount filesystems, trace any process, access raw devices. This is basically root.
- CAP_NET_ADMIN: Reconfigure networking. Could escape network namespace.
- CAP_SYS_PTRACE: Trace other processes, read their memory. Steal secrets.
- CAP_SYS_MODULE: Load kernel modules. Inject code into the kernel itself.
- CAP_SYS_RAWIO: Access I/O ports and memory directly. Bypass all isolation.
- CAP_SYS_BOOT: Reboot the system. Need I say more?
The --privileged flag grants ALL capabilities:
docker run --privileged ubuntu
Never use this unless you absolutely must. I've seen developers add --privileged to "fix" a permission error without realizing they just turned off container security entirely. An attacker who compromises a privileged container has full root access to the host.
The Principle of Least Privilege
Start with zero capabilities and add only what you need:
# Drop all capabilities, add back only what's needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE nginx
For nginx, CAP_NET_BIND_SERVICE is enough. No need for anything else. If the container gets compromised, the attacker can bind to port 80, but they can't mount filesystems, load kernel modules, or escape the container.
You can list a container's capabilities:
docker run --rm alpine sh -c 'apk add -U libcap; capsh --print'
The Last Line of Defense: Seccomp
We've dropped dangerous capabilities. The container can't mount filesystems or load kernel modules. But there's still a problem:
Even with capabilities dropped, a container can still make 300+ system calls to the kernel. A vulnerability in any syscall handler could let the container escape. How do we minimize the attack surface?
Every interaction between userspace and the kernel goes through a system call. Want to open a file? open() syscall. Want to allocate memory? mmap() syscall. Want to create a network socket? socket() syscall.
Linux has over 300 system calls. Each one is a potential attack vector. If there's a bug in the kernel's ptrace implementation, an attacker could exploit it. If there's a bug in keyctl, same story.
Seccomp-BPF (Secure Computing with Berkeley Packet Filter) lets you whitelist or blacklist specific syscalls. It's the last line of defense. Even if an attacker gets root inside the container with all capabilities restored, they still can't make blocked syscalls.
What Docker Blocks
Docker's default seccomp profile blocks about 50 dangerous syscalls out of the ~300+ available:
Obviously dangerous:
- reboot: Container shouldn't be able to crash your host
- swapon/swapoff: Manage swap (could crash the system)
- mount/umount: Mount filesystems (could access host data)
- init_module/delete_module: Load/unload kernel modules (inject code into the kernel)
Less obvious but equally dangerous:
- ptrace: Trace other processes, read their memory. Steal secrets from neighboring containers.
- process_vm_readv/process_vm_writev: Read/write another process's memory directly
- keyctl: Manage kernel keyrings where encryption keys live
- bpf: Load BPF programs into the kernel (could disable seccomp itself!)
- perf_event_open: Access hardware performance counters (side-channel attacks)
- kexec_load: Load a new kernel (bypass everything)
When a container tries a blocked syscall, the kernel either:
- Returns EPERM (permission denied) - the application sees an error
- Sends SIGKILL - the process dies immediately
The default behavior is EPERM, so applications can handle it gracefully.
Attacking the Kernel Through Syscalls
Here's why this matters. Say an attacker exploits a vulnerability in your containerized app and gets a shell. They're root inside the container, but:
- Namespaces limit what they see
- Cgroups limit resources they can use
- Dropped capabilities prevent mounting filesystems or loading modules
- Seccomp blocks dangerous syscalls
They can't call ptrace() to trace processes on the host. They can't call keyctl() to steal encryption keys. They can't call bpf() to load a kernel exploit. The attack surface has been reduced from 300+ syscalls to ~250 safe ones.
Customizing the Profile
# Run with Docker's default seccomp profile (recommended)
docker run myimage
# Run with a custom profile
docker run --security-opt seccomp=profile.json myimage
# Disable seccomp entirely (DANGEROUS - never do this in production)
docker run --security-opt seccomp=unconfined myimage
The default profile is excellent for 99% of applications. Only customize if you have a specific need and you understand the security implications.
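If you do need a custom profile, a sensible starting point is to observe which syscalls your app actually makes. One approach, assuming strace is installed (myapp is a placeholder):
# -f follows child processes, -c tallies syscall counts into a summary
strace -f -c -o syscall-summary.txt ./myapp
# Anything absent from this list is a candidate for blocking
cat syscall-summary.txt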
Distributing Images: Registries
You've built a container image on your laptop. It's 500MB. You need to deploy it to:
- 10 production servers
- Your CI/CD pipeline
- Your teammates' development machines
- Staging environment
How do you distribute this image without copying 5GB of data every time someone needs it? And how do you handle updates when only a few megabytes changed?
Container registries store images as deduplicated layers identified by SHA256 hashes. Push once, pull anywhere. If a layer already exists (same hash), it's not re-uploaded or re-downloaded.
How Registries Work
Container registries (Docker Hub, GitHub Container Registry, AWS ECR, Google Container Registry) are smart about storage and transfer. Here's what happens when you push an image:
Step 1: Calculate SHA256 hashes
Layer 1 (Ubuntu): sha256:a3ed95ca... 80MB
Layer 2 (Python): sha256:b8f3e2d1... 120MB
Layer 3 (numpy): sha256:c9a4b5f2... 50MB
Layer 4 (app):    sha256:d1e5c7a3... 5MB
Step 2: Check what the registry already has
Docker asks: "Do you have a3ed95ca...?" Registry responds: "Yes, skip it"
Docker asks: "Do you have b8f3e2d1...?" Registry responds: "Yes, skip it"
Docker asks: "Do you have c9a4b5f2...?" Registry responds: "No, send it"
Docker asks: "Do you have d1e5c7a3...?" Registry responds: "No, send it"
Step 3: Upload only missing layers
Only layers 3 and 4 get uploaded. 55MB instead of 255MB.
Step 4: Upload manifest
{
"schemaVersion": 2,
"config": { "digest": "sha256:e7a2f1b3..." },
"layers": [
{ "digest": "sha256:a3ed95ca..." },
{ "digest": "sha256:b8f3e2d1..." },
{ "digest": "sha256:c9a4b5f2..." },
{ "digest": "sha256:d1e5c7a3..." }
]
}
The manifest is tiny (a few KB) and describes how to assemble the image from layers.
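You can watch this protocol by hand against Docker Hub's public v2 registry API. A sketch, assuming curl and jq are installed:
# Get an anonymous pull token for the public nginx repository
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/nginx:pull" | jq -r .token)
# Fetch the manifest; the Accept header selects the v2 schema
curl -s -H "Authorization: Bearer $TOKEN" \
  -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
  "https://registry-1.docker.io/v2/library/nginx/manifests/latest" | jq .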
Pulling Images: Same Idea in Reverse
When you pull an image:
- Download the manifest (few KB)
- Check which layers are already cached locally
- Download only missing layers
- Assemble the image from local and downloaded layers
If you pull ten different Python apps that all use python:3.11 as their base, you only download the Python layers once. The other nine pulls skip those layers.
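You can see the layer IDs behind any image you have locally:
# SHA256 IDs of each filesystem layer in the image
docker image inspect --format '{{json .RootFS.Layers}}' python:3.11
# Layer-by-layer build history with sizes
docker history python:3.11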
Content-Addressable Storage
Layers are identified by their SHA256 hash. This is content-addressable storage. Data is named by what it contains, not where it lives.
Here's why this works: if two layers have the same hash, they're byte-for-byte identical for all practical purposes. SHA256 collisions (2^256 possible values) are so improbable they can be ignored. Deduplication is trivial and safe:
if registry.has_layer(sha256_hash):
skip_upload()
else:
upload_layer()
This is the same concept Git uses for commits. Git identifies commits by their SHA1 hash. Same data = same hash = automatic deduplication.
Why SHA256?
SHA256 produces a 256-bit hash. That's 2^256 possible values, which is roughly 10^77. For comparison, there are only 10^50 atoms in the Earth.
The probability of two different layers having the same SHA256 hash is effectively zero. You'd need to generate trillions of layers per second for billions of years to have even a remote chance of a collision. It's not going to happen.
Try it yourself:
# Push to Docker Hub
docker push yourusername/myapp:latest
# Pull from Docker Hub
docker pull yourusername/myapp:latest
# Push to a private registry
docker tag myapp myregistry.com/myapp:v1.0
docker push myregistry.com/myapp:v1.0
Putting It All Together
We've built up all the pieces. Now let's see how they work together when you run a single command:
docker run nginx
Click through the demo step-by-step to watch what happens:
Step 1: Get the Image
- Check if
nginx:latestis cached locally - If not, contact registry (docker.io by default)
- Download manifest
- Download missing layers (deduplicated by SHA256)
- Total download: ~140MB (first time), ~0MB (subsequent)
Step 2: Prepare the Filesystem
# Set up overlay filesystem
mount -t overlay overlay \
-o lowerdir=/var/lib/docker/overlay2/l/ABC:/var/lib/docker/overlay2/l/DEF:...,
upperdir=/var/lib/docker/overlay2/XYZ/diff,
workdir=/var/lib/docker/overlay2/XYZ/work \
/var/lib/docker/overlay2/XYZ/merged
- Stack all read-only image layers
- Add a writable upper layer for this container
- Mount as a unified view
Step 3: Create Isolation
# Create namespaces
unshare --pid --net --mount --uts --ipc --fork
# PID namespace: Container thinks it's PID 1
# Net namespace: Container gets its own network stack
# Mount namespace: Container has its own view of mounts
# UTS namespace: Container can set its own hostname
# IPC namespace: Container's shared memory is isolated
Step 4: Limit Resources
# Create cgroup
mkdir /sys/fs/cgroup/docker/abc123
# Set memory limit (if specified)
echo 512M > /sys/fs/cgroup/docker/abc123/memory.limit_in_bytes
# Set CPU limit (if specified)
echo 50000 > /sys/fs/cgroup/docker/abc123/cpu.cfs_quota_us
# Add process to cgroup
echo $$ > /sys/fs/cgroup/docker/abc123/cgroup.procs
Step 5: Change Root Filesystem
# Pivot to container's filesystem
cd /var/lib/docker/overlay2/XYZ/merged
mkdir .old_root
pivot_root . .old_root
umount -l /.old_root
Now / is the container's filesystem, not the host's.
Step 6: Drop Privileges
# Drop dangerous capabilities
capsh --drop=cap_sys_admin,cap_sys_module,cap_sys_rawio,... --
# Apply seccomp profile (blocks ~50 syscalls)
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &filter)
Step 7: Execute the Entrypoint
exec /docker-entrypoint.sh nginx -g 'daemon off;'
From nginx's perspective:
- It's PID 1
- It has the whole system to itself
- It's running on a fresh Linux install
From the kernel's perspective:
- It's just another process (PID 3847)
- It's in restricted namespaces
- It's limited by cgroups
- It's missing most capabilities
- It's blocked from dangerous syscalls
Total time: ~100 milliseconds (after first pull).
Defense in Depth: Layered Security
Security isn't one feature. It's multiple layers working together:
Click through the demo. Try to "escape" the container at each layer. Watch how each security mechanism blocks a different attack vector:
Layer 1: Namespaces
- Attacker can't see host processes (/proc only shows container processes)
- Attacker can't access host network interfaces
- Attacker can't see host mounts
Layer 2: Pivot Root
- Attacker can't read /etc/passwd (it's the container's, not the host's)
- Attacker can't access /home or /root (they don't exist in the container)
Layer 3: Cgroups
- Attacker can't exhaust all memory (OOM killer stops them at limit)
- Attacker can't steal all CPU (throttled to their quota)
Layer 4: Dropped Capabilities
- Attacker can't call mount() (needs CAP_SYS_ADMIN)
- Attacker can't access raw devices (needs CAP_SYS_RAWIO)
Layer 5: Seccomp
- Even if attacker restores capabilities somehow, they can't call blocked syscalls
- Can't ptrace() other processes
- Can't reboot() the system
- Can't kexec_load() a new kernel
To fully escape, an attacker needs to:
- Find a vulnerability in the app (get initial access)
- Escape the namespace (kernel bug)
- Bypass pivot_root (kernel bug)
- Bypass cgroups (kernel bug)
- Restore capabilities (kernel bug)
- Bypass seccomp (kernel bug)
That's six separate exploits chained together. No single layer is perfect, but together they make container escapes extremely difficult.
Summary: What We've Built
We started with a simple question: What is a container?
The answer: A container is just a Linux process with restrictions applied.
Let's recap what we've learned by building our own container:
1. Processes are the Foundation
Every container is a process. It has a PID, memory, file descriptors, and environment variables. Nothing special. It runs on your kernel, uses your CPU, uses your RAM.
2. Namespaces Provide Isolation
- PID namespace: Container thinks it's PID 1, but it's really PID 3847 on the host
- Network namespace: Container gets its own network stack, isolated from other containers
- Mount namespace: Container has its own view of filesystems
- UTS namespace: Container can set its own hostname
- IPC namespace: Container's shared memory is separate
- User namespace: Root in container maps to non-root on host
The kernel maintains multiple views of the same system. When the container asks "what's my PID?", the kernel looks it up in the container's namespace and lies to it.
3. Cgroups Limit Resources
Memory limit: (usage ≤ limit) → continue | (usage > limit) → OOM kill
CPU limit: usage = (quota / period) × 100%
Without cgroups, one runaway container could crash your entire system. With cgroups, it dies alone.
4. Overlay Filesystems Enable Sharing
┌──────────────────────┐
│ Container A: upper   │ 5MB (unique)
├──────────────────────┤
│ Container B: upper   │ 5MB (unique)
├──────────────────────┤
│ Shared: app layer    │ 5MB (shared)
├──────────────────────┤
│ Shared: numpy        │ 50MB (shared)
├──────────────────────┤
│ Shared: Python       │ 120MB (shared)
├──────────────────────┤
│ Shared: Ubuntu       │ 80MB (shared)
└──────────────────────┘
10 containers sharing the same base: 80MB once, not 800MB.
5. Pivot Root Hides the Host
The container's / is not the host's /. The host filesystem is completely unmounted. The container cannot access /etc/passwd, /home, or /root from the host.
6. Capabilities Split Root Powers
Root used to be all-or-nothing. Now you can grant CAP_NET_BIND_SERVICE (bind to port 80) without granting CAP_SYS_MODULE (load kernel modules).
7. Seccomp Blocks Dangerous Syscalls
Even if an attacker gets root with all capabilities, they still can't call ptrace(), reboot(), or kexec_load(). The attack surface shrinks from 300+ syscalls to ~250 safe ones.
Building Your Own Container
If you wanted to build a container from scratch, here's what you'd do:
# 1. Create namespaces
unshare --pid --net --mount --uts --ipc --user --fork
# 2. Set up overlay filesystem
mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged
# 3. Pivot to new root
cd /merged && mkdir .old_root
pivot_root . .old_root
umount -l /.old_root && rmdir /.old_root
# 4. Set resource limits
echo 512M > /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/mycontainer/cgroup.procs
# 5. Drop capabilities
capsh --drop=cap_sys_admin,cap_sys_module,... --
# 6. Apply seccomp
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &filter)
# 7. Execute your app
exec /app/start.sh
Or you could just run docker run myapp, which does all of this in 100 milliseconds.
The Key Insight
Containers are not magic. They're not VMs. They're not running in a separate environment.
A container is a regular Linux process that the kernel has lied to. It's PID 3847 on the host, but it thinks it's PID 1. It's using the host's kernel, but it can only see its own filesystem. It's on the host's network interface, but it thinks it has its own network stack.
Every piece of container technology (namespaces, cgroups, overlay filesystems, capabilities, seccomp) is a Linux feature that has existed for years. Docker, Podman, and containerd are just automation tools that orchestrate these features.
Understanding what's actually happening (what docker run is really doing) makes containers less mysterious. When something breaks, you can debug it. When you need to tune performance, you know which knobs to turn. When you need to harden security, you know which layers to strengthen.
Containers aren't magic. They're just Linux.