Build Your Own Container
You've written the perfect application. You test it on your laptop. It works flawlessly. You deploy it to the production server. It crashes immediately.
The versions don't match. Your laptop has Python 3.11. The server has Python 2.7. Your numpy is version 1.24. The server has 1.19. The system libraries are different. The timezone settings are different. Even the OpenSSL version is different.
You spend the next six hours debugging environment differences instead of shipping features.
This is not a small problem. This is the problem that has plagued software deployment for decades. Containers solve it by packaging everything: your code, the runtime, the libraries, the entire OS userspace.
But here's the question that matters: What IS a container?
The answer might surprise you. A container is not a virtual machine. It's not a sandbox running in some isolated environment. It's not magic.
A container is just a Linux process with some restrictions applied.
In this post, we're going to build our own container from scratch. We'll start with the most basic Linux primitives (processes, system calls, file descriptors) and incrementally add isolation, resource limits, and filesystem layers until we have something that looks and acts like a Docker container.
Let's get started.
What Is a Process?
Before we can understand containers, we need to understand processes. A process is a running program. When you execute python app.py, the kernel creates a process. When you run nginx, that's a process. When you open a browser, each tab is a process.
Every process has five essential components:
1. PID (Process ID) A unique number identifying this process. For example: PID 1234.
2. Memory A private address space divided into:
- Code segment (the actual program instructions)
- Data segment (global variables)
- Heap (dynamically allocated memory via malloc/new)
- Stack (local variables and function calls)
3. File Descriptors References to open files, sockets, and pipes:
- 0: stdin (standard input)
- 1: stdout (standard output)
- 2: stderr (standard error)
- 3+: any other open files (like /var/log/app.log)
4. Environment Variables Key-value pairs like:
PATH=/usr/bin:/bin
HOME=/home/user
USER=alice
5. Working Directory
Where relative paths are resolved from (like /app).
When the kernel creates a process, it:
- Assigns a unique PID
- Allocates memory pages for code, data, heap, and stack
- Creates a file descriptor table (0=stdin, 1=stdout, 2=stderr)
- Copies environment variables from the parent process
- Sets the working directory (usually inherited from parent)
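All five components are visible through the /proc filesystem. Here's a quick sketch you can run in any shell ($$ expands to the shell's own PID):
# The shell's PID
echo $$
# File descriptors: 0=stdin, 1=stdout, 2=stderr, plus any other open files
ls -l /proc/$$/fd
# Environment variables (NUL-separated in /proc, so translate to newlines)
tr '\0' '\n' < /proc/$$/environ
# Working directory
readlink /proc/$$/cwd
# Memory layout: code segment, heap, stack, shared libraries
head /proc/$$/maps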
Here's what's important: all processes on a Linux system share the same kernel. They all see the same filesystem at /. They all share the same network interfaces. They can all see each other via /proc. If one process uses too much memory, it can crash other processes.
This is a problem.
The Dependency Problem
How do we deploy code reliably when development and production environments have different software versions installed?
Look at the demo above. Your laptop has one set of dependencies. The production server has another. Your app imports numpy and expects certain functions to exist. But the production server has an older numpy where those functions don't exist yet, or worse, exist but behave differently.
This happens because both your laptop and the production server are using the host's installed packages. You have no control over what the server has installed. Different system administrators make different choices. Different OS distributions ship different versions. A server that's been running for two years has packages from two years ago.
The traditional solution? "Works on my machine" becomes a meme. You spend days writing deployment documentation: "Install Python 3.11.3 exactly, then pip install numpy==1.24.2, then...". It breaks anyway because the server has a different glibc version or a different openssl.
Don't rely on the host at all. Package everything your app needs into a single, portable image. The container brings its own Python, its own numpy, its own libc, its own everything. The host only provides the kernel.
Discovering the Container Primitive
Let's do an experiment. Open a terminal and run:
docker run -d nginx
docker ps # Note the container ID
ps aux | grep nginx
You'll see nginx in the output. It's right there in your process list, running on your kernel, using your CPU, using your memory.
Now try the demo above. Start a container and watch what happens. Toggle between the host view and container view.
Notice something strange? The same process appears in both views, but with different PIDs.
Inside the container, nginx thinks it's PID 1 (the first process, the init process). On the host, it's PID 1235 (or some other high number). How is the same process showing up with two different PIDs?
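If you have Docker handy, you can verify the dual numbering yourself. A small sketch (the container ID is whatever docker ps reported):
# Host-side PID of the container's main process
docker inspect --format '{{.State.Pid}}' <container-id>
# Inside the container, the same process reports itself as PID 1
docker exec <container-id> cat /proc/1/comm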
This is the fundamental insight: a container is not a separate machine or VM. It's a regular Linux process with a different view of the system.
Containers vs Virtual Machines
Let's be precise about the difference.
Virtual machines run a complete operating system with its own kernel. When you boot a VM:
- The hypervisor allocates RAM (usually 1-2GB minimum)
- A bootloader runs and loads the kernel into memory
- The kernel initializes drivers for virtual hardware
- Init/systemd starts and launches system services
- Finally, your app starts
This takes seconds to minutes and uses gigabytes of memory before your app even loads.
Containers share the host kernel. When you start a container:
- The kernel creates some namespaces (instant)
- It sets up cgroups (instant)
- It mounts an overlay filesystem (instant)
- Your app starts (milliseconds)
No kernel to boot. No drivers to initialize. No system services to start. You're just executing a binary with some restrictions.
The trade-off: VMs give you stronger isolation because each VM has its own kernel. A kernel vulnerability in one VM doesn't affect others. With containers, they all share the same kernel, so a kernel exploit could escape. Use containers for microservices where you control the code. Use VMs for multi-tenant environments where you're running untrusted code.
Don't take our word for it. Feel the difference yourself:
Click "Start VM" and watch the boot sequence. Now click "Start Container". The difference in startup time is not marginal. It's orders of magnitude.
So how does Linux make a process think it's PID 1 when it's really PID 1235? How do we give it a different view of the system?
Building Isolation: Namespaces
How do we make a process believe it's the only thing running on the system, without actually giving it access to everything?
Let's start with a concrete example. Run this command:
ps aux
You see every process on the system. Hundreds of them. Systemd, SSH, cron, your terminal, everything. Every process can see every other process.
Now run docker run -it ubuntu ps aux:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 4624 1628 pts/0 Ss 10:30 0:00 ps aux
Only one process. The container can't see anything running on the host. It can't see other containers. It's completely isolated.
How?
PID Namespaces: Two Views of Reality
In a normal Linux system, all processes share the same PID numbering. PID 1 is systemd (or init). Your shell is maybe PID 1234. Chrome is PID 5678. They all see each other in the same namespace.
A PID namespace gives a process its own private PID numbering. Let's visualize this:
Click "Create Namespace" in the demo above. Watch carefully. The same process exists in both namespaces simultaneously:
- In the host namespace: PID 3847 (or whatever the kernel assigns)
- In the container namespace: PID 1
The kernel maintains two separate PID tables. When a process in the container namespace calls getpid(), the kernel looks it up in the container's table and returns 1. When the host calls kill 3847, the kernel looks it up in the host's table and finds the same process.
This is not virtualization. This is not emulation. The kernel is just lying to the process about its PID.
Try creating this yourself:
sudo unshare --pid --fork --mount-proc bash
echo $$ # Prints 1
ps aux # Only shows processes in this namespace
The unshare command (a thin wrapper around the unshare system call) creates new namespaces. The --pid flag says "create a new PID namespace". The --fork flag creates a child process that becomes PID 1 in the new namespace. The --mount-proc flag mounts a new /proc that only shows processes in this namespace.
Now, from another terminal on the host, run ps aux | grep bash. You'll see your bash shell with its real PID (probably something like PID 3847). But inside the namespace, echo $$ prints 1.
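Namespace membership is itself visible in /proc. Every namespace has an ID; if two processes show the same ID, they share that namespace. A quick check:
# The namespaces this shell belongs to (pid, net, mnt, uts, ipc, user, ...)
ls -l /proc/$$/ns
# Compare against PID 1; a different pid namespace ID means different worlds
sudo ls -l /proc/1/ns
# lsns (from util-linux) summarizes every namespace on the system
sudo lsns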
The Six Types of Namespaces
PID namespaces are just the beginning. Linux provides six classic namespace types (newer kernels add more, such as cgroup and time namespaces), each isolating a different resource:
Click through each namespace type in the demo. Watch how isolation is built up layer by layer:
- PID namespace: Isolates process numbering. Container thinks it's PID 1.
- Network namespace: Isolates network interfaces, routing tables, firewall rules. Container gets its own lo interface, its own IP address space.
- Mount namespace: Isolates filesystem mounts. Container can mount /proc without seeing host processes.
- UTS namespace: Isolates hostname. Container can set its hostname without affecting the host.
- IPC namespace: Isolates System V IPC and POSIX message queues. Container's shared memory is separate.
- User namespace: Isolates user IDs. Root in the container (UID 0) maps to UID 100000 on the host. Security jackpot.
A typical container uses all six together. Docker creates all these namespaces with a single system call. You can create them individually:
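Here's a sketch of creating them one at a time with the unshare tool; each flag maps to one namespace type:
# UTS only: change the hostname without affecting the host
sudo unshare --uts bash -c 'hostname container1; hostname'
# Mount only: mounts made in here are invisible outside
sudo unshare --mount bash
# Network only: the fresh namespace has no usable interfaces yet
sudo unshare --net bash -c 'ip link'
# Several together, roughly what a container runtime does
sudo unshare --pid --net --mount --uts --ipc --fork --mount-proc bash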
Network Isolation in Action
Network namespaces are particularly interesting because they create complete network stacks. Each namespace has:
- Its own network interfaces (lo, eth0, etc.)
- Its own IP addresses
- Its own routing tables
- Its own iptables firewall rules
- Its own socket connections
In the demo above, create a network namespace. Notice that it starts with absolutely nothing. Not even a loopback interface. You have to:
- Create a virtual ethernet pair (veth)
- Put one end in the namespace
- Set up IP addresses
- Configure routing
- Set up NAT on the host
This is why containers are isolated by default. They literally can't see other containers' network traffic because they're in completely separate network stacks. To connect them, you need to explicitly create bridges or overlay networks.
Try it yourself:
# Create network namespace
sudo ip netns add container1
# Create veth pair
sudo ip link add veth0 type veth peer name veth1
# Move one end into namespace
sudo ip link set veth1 netns container1
# Configure the host side
sudo ip addr add 10.0.0.1/24 dev veth0
sudo ip link set veth0 up
# Configure the container side
sudo ip netns exec container1 ip addr add 10.0.0.2/24 dev veth1
sudo ip netns exec container1 ip link set veth1 up
sudo ip netns exec container1 ping 10.0.0.1
Limiting Resources: Cgroups
We've solved the isolation problem. Namespaces give each container its own view of processes, network, and filesystems. But there's still a problem:
Namespaces isolate what a container can see, but a container could still use 100% of your CPU and all your memory. How do we prevent one container from starving the others?
Imagine you have three containers running. Container A is running your API. Container B is running your database. Container C is running a batch job that someone wrote poorly. It has a memory leak.
Without limits, Container C will consume all available memory. The kernel's OOM killer will start terminating processes. It might kill your database. Your entire system goes down because one container misbehaved.
Control groups (cgroups) solve this by limiting how much CPU, memory, disk I/O, and network bandwidth each process can use.
Memory Limits: Preventing Memory Exhaustion
Let's visualize what happens when a container tries to use more memory than its limit:
In the demo above, set a memory limit of 512MB. Now click "Allocate Memory" repeatedly. Watch what happens:
- Memory usage increases: 100MB → 200MB → 300MB...
- At 512MB, the limit is hit
- The kernel's OOM killer terminates the process
- The container crashes
This is by design. The container dies rather than taking down the whole system. Your other containers keep running. Your host stays stable.
Cgroups are implemented via a special filesystem at /sys/fs/cgroup. To create a memory limit, you write to files:
# Create a cgroup (just mkdir!)
mkdir /sys/fs/cgroup/memory/mycontainer
# Set limit to 512MB
echo 536870912 > /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes
# Add a process to this cgroup
echo $PID > /sys/fs/cgroup/memory/mycontainer/cgroup.procs
Now every memory allocation from that process (and its children) is tracked. When total usage exceeds 512MB, the OOM killer strikes.
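To watch the limit being enforced, put a shell into the cgroup and deliberately over-allocate. A sketch, assuming the stress-ng tool is installed:
# Join the cgroup (root is needed to write to cgroup files)
sudo sh -c "echo $$ > /sys/fs/cgroup/memory/mycontainer/cgroup.procs"
# Try to allocate 1GB against the 512MB limit
stress-ng --vm 1 --vm-bytes 1G --timeout 10s
# The worker is OOM-killed once usage crosses the limit; the kernel logs it
dmesg | tail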
You can also read the current usage at any time:
cat /sys/fs/cgroup/memory/mycontainer/memory.usage_in_bytes
This is how docker stats works. It's just reading cgroup files.
CPU Limits: Throttling Instead of Killing
CPU limits work fundamentally differently than memory. You can't "kill" a process for using too much CPU (that would be absurd). Instead, Linux throttles it.
In the demo, set a CPU limit of 50%. Start a CPU-intensive task. Watch what happens:
The process runs for 50ms, then gets put to sleep for 50ms. This pattern repeats every 100ms. The process never crashes. It just runs at half speed.
The formula for CPU limits:
CPU usage = (quota / period) × 100%
Examples:
- quota=50000, period=100000 → 50% CPU
- quota=25000, period=100000 → 25% CPU
- quota=200000, period=100000 → 200% CPU (can use 2 cores fully)
To set this up:
# Allow 50% CPU (50,000 microseconds out of 100,000)
echo 50000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_period_us
The kernel's Completely Fair Scheduler (CFS) enforces this. Every period microseconds, the process gets quota microseconds of CPU time. When quota runs out, the process sleeps until the next period starts.
This is brilliant for multi-tenant systems. You can guarantee that a runaway container never steals all CPU from other containers. Each container gets its fair share.
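One caveat: the file names above are from cgroup v1. On newer distributions that use the unified cgroup v2 hierarchy, the same limits live in differently named files. A sketch of the v2 equivalents:
# cgroup v2: one unified tree, no per-controller subdirectories
mkdir /sys/fs/cgroup/mycontainer
# Memory limit: a single memory.max file
echo 536870912 > /sys/fs/cgroup/mycontainer/memory.max
# CPU limit: quota and period share one cpu.max file ("quota period")
echo "50000 100000" > /sys/fs/cgroup/mycontainer/cpu.max
# Attach a process exactly as before
echo $PID > /sys/fs/cgroup/mycontainer/cgroup.procs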
Sharing Code: Overlay Filesystems
We've solved isolation. We've solved resource limits. Now we have a new problem:
If 10 containers all use the same Ubuntu base image, do we really need 10 copies of Ubuntu on disk? And how do we let containers write to their filesystem without modifying the original image?
Imagine you have 10 containers. All of them run Python apps. All of them start from the same Ubuntu 22.04 base image. That base image is 80MB.
Without sharing: 10 containers Γ 80MB = 800MB of wasted disk space.
With sharing: 80MB total. All 10 containers share the same base layer.
Container images are built in layers, like a stack of transparent sheets. Each layer contains files that were added or changed:
┌──────────────────────┐
│ Layer 4: Your code   │ ← 5MB
├──────────────────────┤
│ Layer 3: numpy       │ ← 50MB
├──────────────────────┤
│ Layer 2: Python      │ ← 120MB
├──────────────────────┤
│ Layer 1: Ubuntu      │ ← 80MB
└──────────────────────┘
When you pull an image, you only download layers you don't already have. If you already have the Ubuntu layer from another image, Docker skips it. This is how images can be gigabytes large but only download a few megabytes.
But how do you stack read-only layers and still let the container write files?
Overlay Filesystems: Stacking Read-Only Layers
The demo above shows how overlay filesystems work. We have:
- Lower layers: Read-only image layers (Ubuntu, Python, numpy, your code)
- Upper layer: A writable layer unique to this container
- Merged view: What the container actually sees
Try the following in the demo:
1. Read a file from lower layers
- Click on app.py in the merged view
- The file exists in Layer 4 (your code layer)
- Reading from lower layers is instant. No copying happens.
2. Write a new file
- Create output.txt in the merged view
- It appears in the upper layer only
- Lower layers remain unchanged
3. Modify an existing file
- Edit app.py in the merged view
- The original stays in Layer 4 (lower, read-only)
- A copy appears in the upper layer with your changes
- This is called copy-on-write (COW)
4. Delete a file
- Delete app.py in the merged view
- The original still exists in Layer 4 (can't delete from read-only layer)
- A whiteout file appears in upper: .wh.app.py
- The merged view hides the file
The mount command that makes this work:
mount -t overlay overlay \
-o lowerdir=/layer4:/layer3:/layer2:/layer1,upperdir=/upper,workdir=/work \
/merged
When you read a file, the kernel checks layers from top to bottom:
- Check upper layer first
- If not found, check layer 4
- If not found, check layer 3
- If not found, check layer 2
- If not found, check layer 1
- If not found, return "file not found"
When you write or modify a file:
- Always goes to upper layer
- If modifying, kernel copies from lower to upper first (COW)
When you delete a file:
- If it only exists in upper: actually delete it
- If it exists in lower: create .wh.filename in upper to hide it
This is why containers start instantly even with gigabyte-sized images. Nothing is copied. The layers are mounted as-is. Only writes create new data in the upper layer.
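You can reproduce the whole mechanism in a throwaway directory. A minimal sketch (requires root and a kernel with overlayfs, which any modern distribution has):
# One read-only "image" layer plus the overlay scaffolding
mkdir -p /tmp/demo/{lower,upper,work,merged}
echo "from the image" > /tmp/demo/lower/app.py
sudo mount -t overlay overlay \
  -o lowerdir=/tmp/demo/lower,upperdir=/tmp/demo/upper,workdir=/tmp/demo/work \
  /tmp/demo/merged
# Reads come straight from the lower layer
cat /tmp/demo/merged/app.py
# Writes trigger copy-on-write into upper; lower stays untouched
echo "modified" > /tmp/demo/merged/app.py
cat /tmp/demo/lower/app.py   # still "from the image"
cat /tmp/demo/upper/app.py   # "modified"
# Deleting creates a whiteout in upper (live overlayfs encodes it as a
# character device; the .wh. names are the image-tarball convention)
rm /tmp/demo/merged/app.py
ls -la /tmp/demo/upper
sudo umount /tmp/demo/merged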
Hiding the Host: pivot_root
We have namespaces for isolation. We have cgroups for resource limits. We have overlay filesystems for efficient storage. But there's still a gaping security hole:
The container can still see the host's files at /. How do we completely hide the host filesystem from the container?
Think about what the container can currently see:
- /etc/passwd with all the host's users
- /home with all the host's home directories
- /root with the root user's files
- /sys and /proc showing host processes
- /boot with the host's kernel
This is a massive security problem. Even though we have mount namespaces, the container still starts with the host's filesystem as /. We need to replace the root filesystem entirely.
pivot_root: Swapping the Root Filesystem
Click through the demo above. Watch what happens to the filesystem tree:
Before pivot_root:
/ (host root)
├── bin/
├── etc/
├── home/
└── containers/
    └── myapp/        ← container files are here
        ├── bin/
        ├── etc/
        └── app/
After pivot_root:
/ (container root - was /containers/myapp)
├── bin/          ← container's bin
├── etc/          ← container's etc
├── app/          ← your application
└── .old_root/
    ├── bin/      ← host's bin (mounted but hidden)
    ├── etc/      ← host's etc
    └── ...
After unmounting .old_root:
/ (container root)
├── bin/
├── etc/
└── app/
The host filesystem is completely gone. The container's / is the new root. The container cannot access any host files.
Here's how it works:
# Step 1: Prepare the new root
mount --bind /containers/myapp /containers/myapp
cd /containers/myapp
# Step 2: Pivot the root
mkdir .old_root
pivot_root . .old_root
# Now:
# / is /containers/myapp
# /.old_root is the old host root
# Step 3: Clean up
umount -l /.old_root
rmdir /.old_root
The pivot_root system call does two things atomically:
- Makes the current directory (.) the new root filesystem
- Moves the old root filesystem to .old_root
This is fundamentally different from chroot, which just changes one process's view of /. The chroot syscall is famously easy to escape from: a root process can chroot into a subdirectory (its working directory stays outside the new root), then chdir("..") repeatedly to walk back up to the real root. With pivot_root, you're actually changing what the kernel considers to be the root mount point. There's no "parent" to escape to.
Fine-Grained Privileges: Capabilities
We have another problem. Linux has traditionally had exactly two privilege levels:
- root (UID 0): Can do absolutely anything
- non-root (UID > 0): Restricted by permissions
This is binary. You're either all-powerful or you're restricted. There's no middle ground.
Some containers need some root-like powers (like binding to port 80) without having ALL root powers (like loading kernel modules). How do we give partial privileges?
The classic example: nginx needs to bind to port 80. Only root can bind to ports below 1024. So you run nginx as root. But now nginx has all root powers. It can mount filesystems, load kernel modules, access raw disk devices, trace other processes. If nginx gets compromised, the attacker has full root access.
Linux capabilities solve this by splitting root's powers into 40+ individual permissions. You can grant just CAP_NET_BIND_SERVICE (bind to low ports) without granting CAP_SYS_ADMIN (mount filesystems) or CAP_SYS_MODULE (load kernel modules).
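Capabilities can be granted to binaries as well as processes, using the standard libcap tools. A sketch (the web server path is just an example):
# Grant only the low-port capability to a binary; no other root powers
sudo setcap cap_net_bind_service=+ep /usr/local/bin/mywebserver
# Verify what was granted
getcap /usr/local/bin/mywebserver
# Inspect the current shell's capability sets
capsh --print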
What Docker Keeps vs Drops
Docker containers run with a reduced capability set by default. Here's what they keep:
Safe capabilities (Docker keeps these):
- CAP_CHOWN: Change file ownership
- CAP_NET_BIND_SERVICE: Bind to ports < 1024
- CAP_SETUID/CAP_SETGID: Change user/group IDs
- CAP_DAC_OVERRIDE: Bypass file permission checks
- CAP_FOWNER: Bypass permission checks for file operations
- CAP_KILL: Send signals to other processes
These are useful for normal applications and relatively low-risk.
Dangerous capabilities (Docker drops these):
- CAP_SYS_ADMIN: Mount filesystems, trace any process, access raw devices. This is basically root.
- CAP_NET_ADMIN: Reconfigure networking. Could escape network namespace.
- CAP_SYS_PTRACE: Trace other processes, read their memory. Steal secrets.
- CAP_SYS_MODULE: Load kernel modules. Inject code into the kernel itself.
- CAP_SYS_RAWIO: Access I/O ports and memory directly. Bypass all isolation.
- CAP_SYS_BOOT: Reboot the system. Need I say more?
The --privileged flag grants ALL capabilities:
docker run --privileged ubuntu
Never use this unless you absolutely must. I've seen developers add --privileged to "fix" a permission error without realizing they just turned off container security entirely. An attacker who compromises a privileged container has full root access to the host.
The Principle of Least Privilege
Start with zero capabilities and add only what you need:
# Drop all capabilities, add back only what's needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE nginx
For nginx, CAP_NET_BIND_SERVICE is enough. No need for anything else. If the container gets compromised, the attacker can bind to port 80, but they can't mount filesystems, load kernel modules, or escape the container.
You can list a container's capabilities:
docker run --rm alpine sh -c 'apk add -U libcap; capsh --print'
The Last Line of Defense: Seccomp
We've dropped dangerous capabilities. The container can't mount filesystems or load kernel modules. But there's still a problem:
Even with capabilities dropped, a container can still make 300+ system calls to the kernel. A vulnerability in any syscall handler could let the container escape. How do we minimize the attack surface?
Every interaction between userspace and the kernel goes through a system call. Want to open a file? open() syscall. Want to allocate memory? mmap() syscall. Want to create a network socket? socket() syscall.
Linux has over 300 system calls. Each one is a potential attack vector. If there's a bug in the kernel's ptrace implementation, an attacker could exploit it. If there's a bug in keyctl, same story.
Seccomp-BPF (Secure Computing with Berkeley Packet Filter) lets you whitelist or blacklist specific syscalls. It's the last line of defense. Even if an attacker gets root inside the container with all capabilities restored, they still can't make blocked syscalls.
What Docker Blocks
Docker's default seccomp profile blocks about 50 dangerous syscalls out of the ~300+ available:
Obviously dangerous:
- reboot: Container shouldn't be able to crash your host
- swapon/swapoff: Manage swap (could crash the system)
- mount/umount: Mount filesystems (could access host data)
- init_module/delete_module: Load/unload kernel modules (inject code into the kernel)
Less obvious but equally dangerous:
- ptrace: Trace other processes, read their memory. Steal secrets from neighboring containers.
- process_vm_readv/process_vm_writev: Read/write another process's memory directly
- keyctl: Manage kernel keyrings where encryption keys live
- bpf: Load BPF programs into the kernel (could disable seccomp itself!)
- perf_event_open: Access hardware performance counters (side-channel attacks)
- kexec_load: Load a new kernel (bypass everything)
When a container tries a blocked syscall, the kernel either:
- Returns EPERM (permission denied) - the application sees an error
- Sends SIGKILL - the process dies immediately
The default behavior is EPERM, so applications can handle it gracefully.
Attacking the Kernel Through Syscalls
Here's why this matters. Say an attacker exploits a vulnerability in your containerized app and gets a shell. They're root inside the container, but:
- Namespaces limit what they see
- Cgroups limit resources they can use
- Dropped capabilities prevent mounting filesystems or loading modules
- Seccomp blocks dangerous syscalls
They can't call ptrace() to trace processes on the host. They can't call keyctl() to steal encryption keys. They can't call bpf() to load a kernel exploit. The attack surface has been reduced from 300+ syscalls to ~250 safe ones.
Customizing the Profile
# Run with Docker's default seccomp profile (recommended)
docker run myimage
# Run with a custom profile
docker run --security-opt seccomp=profile.json myimage
# Disable seccomp entirely (DANGEROUS - never do this in production)
docker run --security-opt seccomp=unconfined myimage
The default profile is excellent for 99% of applications. Only customize if you have a specific need and you understand the security implications.
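If you do need a custom profile, a sensible starting point is to observe which syscalls your app actually makes. One approach, assuming strace is installed (myapp is a placeholder):
# -f follows child processes, -c tallies syscall counts into a summary
strace -f -c -o syscall-summary.txt ./myapp
# Anything absent from this list is a candidate for blocking
cat syscall-summary.txt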
Distributing Images: Registries
You've built a container image on your laptop. It's 500MB. You need to deploy it to:
- 10 production servers
- Your CI/CD pipeline
- Your teammates' development machines
- Staging environment
How do you distribute this image without copying 5GB of data every time someone needs it? And how do you handle updates when only a few megabytes changed?
Container registries store images as deduplicated layers identified by SHA256 hashes. Push once, pull anywhere. If a layer already exists (same hash), it's not re-uploaded or re-downloaded.
How Registries Work
Container registries (Docker Hub, GitHub Container Registry, AWS ECR, Google Container Registry) are smart about storage and transfer. Here's what happens when you push an image:
Step 1: Calculate SHA256 hashes
Layer 1 (Ubuntu): sha256:a3ed95ca... 80MB
Layer 2 (Python): sha256:b8f3e2d1... 120MB
Layer 3 (numpy): sha256:c9a4b5f2... 50MB
Layer 4 (app):    sha256:d1e5c7a3... 5MB
Step 2: Check what the registry already has
Docker asks: "Do you have a3ed95ca...?" Registry responds: "Yes, skip it"
Docker asks: "Do you have b8f3e2d1...?" Registry responds: "Yes, skip it"
Docker asks: "Do you have c9a4b5f2...?" Registry responds: "No, send it"
Docker asks: "Do you have d1e5c7a3...?" Registry responds: "No, send it"
Step 3: Upload only missing layers
Only layers 3 and 4 get uploaded. 55MB instead of 255MB.
Step 4: Upload manifest
{
"schemaVersion": 2,
"config": { "digest": "sha256:e7a2f1b3..." },
"layers": [
{ "digest": "sha256:a3ed95ca..." },
{ "digest": "sha256:b8f3e2d1..." },
{ "digest": "sha256:c9a4b5f2..." },
{ "digest": "sha256:d1e5c7a3..." }
]
}
The manifest is tiny (a few KB) and describes how to assemble the image from layers.
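You can watch this protocol by hand against Docker Hub's public v2 registry API. A sketch, assuming curl and jq are installed:
# Get an anonymous pull token for the public nginx repository
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/nginx:pull" | jq -r .token)
# Fetch the manifest; the Accept header selects the v2 schema
curl -s -H "Authorization: Bearer $TOKEN" \
  -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
  "https://registry-1.docker.io/v2/library/nginx/manifests/latest" | jq .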
Pulling Images: Same Idea in Reverse
When you pull an image:
- Download the manifest (few KB)
- Check which layers are already cached locally
- Download only missing layers
- Assemble the image from local and downloaded layers
If you pull ten different Python apps that all use python:3.11 as their base, you only download the Python layers once. The other nine pulls skip those layers.
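You can see the layer IDs behind any image you have locally:
# SHA256 IDs of each filesystem layer in the image
docker image inspect --format '{{json .RootFS.Layers}}' python:3.11
# Layer-by-layer build history with sizes
docker history python:3.11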
Content-Addressable Storage
Layers are identified by their SHA256 hash. This is content-addressable storage. Data is named by what it contains, not where it lives.
Here's why this works: if two layers have the same hash, they're byte-for-byte identical for all practical purposes. SHA256 collisions (2^256 possible values) are so improbable they can be ignored. Deduplication is trivial and safe:
if registry.has_layer(sha256_hash):
skip_upload()
else:
upload_layer()
This is the same concept Git uses for commits. Git identifies commits by their SHA1 hash. Same data = same hash = automatic deduplication.
Why SHA256?
SHA256 produces a 256-bit hash. That's 2^256 possible values, which is roughly 10^77. For comparison, there are only 10^50 atoms in the Earth.
The probability of two different layers having the same SHA256 hash is effectively zero. You'd need to generate trillions of layers per second for billions of years to have even a remote chance of a collision. It's not going to happen.
Try it yourself:
# Push to Docker Hub
docker push yourusername/myapp:latest
# Pull from Docker Hub
docker pull yourusername/myapp:latest
# Push to a private registry
docker tag myapp myregistry.com/myapp:v1.0
docker push myregistry.com/myapp:v1.0
Putting It All Together
We've built up all the pieces. Now let's see how they work together when you run a single command:
docker run nginx
Click through the demo step-by-step to watch what happens:
Step 1: Get the Image
- Check if
nginx:latestis cached locally - If not, contact registry (docker.io by default)
- Download manifest
- Download missing layers (deduplicated by SHA256)
- Total download: ~140MB (first time), ~0MB (subsequent)
Step 2: Prepare the Filesystem
# Set up overlay filesystem
mount -t overlay overlay \
-o lowerdir=/var/lib/docker/overlay2/l/ABC:/var/lib/docker/overlay2/l/DEF:...,
upperdir=/var/lib/docker/overlay2/XYZ/diff,
workdir=/var/lib/docker/overlay2/XYZ/work \
/var/lib/docker/overlay2/XYZ/merged
- Stack all read-only image layers
- Add a writable upper layer for this container
- Mount as a unified view
Step 3: Create Isolation
# Create namespaces
unshare --pid --net --mount --uts --ipc --fork
# PID namespace: Container thinks it's PID 1
# Net namespace: Container gets its own network stack
# Mount namespace: Container has its own view of mounts
# UTS namespace: Container can set its own hostname
# IPC namespace: Container's shared memory is isolated
Step 4: Limit Resources
# Create cgroup
mkdir /sys/fs/cgroup/docker/abc123
# Set memory limit (if specified)
echo 512M > /sys/fs/cgroup/docker/abc123/memory.limit_in_bytes
# Set CPU limit (if specified)
echo 50000 > /sys/fs/cgroup/docker/abc123/cpu.cfs_quota_us
# Add process to cgroup
echo $$ > /sys/fs/cgroup/docker/abc123/cgroup.procs
Step 5: Change Root Filesystem
# Pivot to container's filesystem
cd /var/lib/docker/overlay2/XYZ/merged
mkdir .old_root
pivot_root . .old_root
umount -l /.old_root
Now / is the container's filesystem, not the host's.
Step 6: Drop Privileges
# Drop dangerous capabilities
capsh --drop=cap_sys_admin,cap_sys_module,cap_sys_rawio,... --
# Apply seccomp profile (blocks ~50 syscalls)
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &filter)
Step 7: Execute the Entrypoint
exec /docker-entrypoint.sh nginx -g 'daemon off;'
From nginx's perspective:
- It's PID 1
- It has the whole system to itself
- It's running on a fresh Linux install
From the kernel's perspective:
- It's just another process (PID 3847)
- It's in restricted namespaces
- It's limited by cgroups
- It's missing most capabilities
- It's blocked from dangerous syscalls
Total time: ~100 milliseconds (after first pull).
Defense in Depth: Layered Security
Security isn't one feature. It's multiple layers working together:
Click through the demo. Try to "escape" the container at each layer. Watch how each security mechanism blocks a different attack vector:
Layer 1: Namespaces
- Attacker can't see host processes (/proc only shows container processes)
- Attacker can't access host network interfaces
- Attacker can't see host mounts
Layer 2: Pivot Root
- Attacker can't read /etc/passwd (it's the container's, not the host's)
- Attacker can't access /home or /root (they don't exist in the container)
Layer 3: Cgroups
- Attacker can't exhaust all memory (OOM killer stops them at limit)
- Attacker can't steal all CPU (throttled to their quota)
Layer 4: Dropped Capabilities
- Attacker can't call mount() (needs CAP_SYS_ADMIN)
- Attacker can't access raw devices (needs CAP_SYS_RAWIO)
Layer 5: Seccomp
- Even if attacker restores capabilities somehow, they can't call blocked syscalls
- Can't ptrace() other processes
- Can't reboot() the system
- Can't kexec_load() a new kernel
To fully escape, an attacker needs to:
- Find a vulnerability in the app (get initial access)
- Escape the namespace (kernel bug)
- Bypass pivot_root (kernel bug)
- Bypass cgroups (kernel bug)
- Restore capabilities (kernel bug)
- Bypass seccomp (kernel bug)
That's six separate exploits chained together. No single layer is perfect, but together they make container escapes extremely difficult.
Summary: What We've Built
We started with a simple question: What is a container?
The answer: A container is just a Linux process with restrictions applied.
Let's recap what we've learned by building our own container:
1. Processes are the Foundation
Every container is a process. It has a PID, memory, file descriptors, and environment variables. Nothing special. It runs on your kernel, uses your CPU, uses your RAM.
2. Namespaces Provide Isolation
- PID namespace: Container thinks it's PID 1, but it's really PID 3847 on the host
- Network namespace: Container gets its own network stack, isolated from other containers
- Mount namespace: Container has its own view of filesystems
- UTS namespace: Container can set its own hostname
- IPC namespace: Container's shared memory is separate
- User namespace: Root in container maps to non-root on host
The kernel maintains multiple views of the same system. When the container asks "what's my PID?", the kernel looks it up in the container's namespace and lies to it.
3. Cgroups Limit Resources
Memory limit: (usage ≤ limit) → continue | (usage > limit) → OOM kill
CPU limit: usage = (quota / period) × 100%
Without cgroups, one runaway container could crash your entire system. With cgroups, it dies alone.
4. Overlay Filesystems Enable Sharing
┌──────────────────────┐
│ Container A: upper   │ 5MB (unique)
├──────────────────────┤
│ Container B: upper   │ 5MB (unique)
├──────────────────────┤
│ Shared: app layer    │ 5MB (shared)
├──────────────────────┤
│ Shared: numpy        │ 50MB (shared)
├──────────────────────┤
│ Shared: Python       │ 120MB (shared)
├──────────────────────┤
│ Shared: Ubuntu       │ 80MB (shared)
└──────────────────────┘
10 containers sharing the same base: 80MB once, not 800MB.
5. Pivot Root Hides the Host
The container's / is not the host's /. The host filesystem is completely unmounted. The container cannot access /etc/passwd, /home, or /root from the host.
6. Capabilities Split Root Powers
Root used to be all-or-nothing. Now you can grant CAP_NET_BIND_SERVICE (bind to port 80) without granting CAP_SYS_MODULE (load kernel modules).
7. Seccomp Blocks Dangerous Syscalls
Even if an attacker gets root with all capabilities, they still can't call ptrace(), reboot(), or kexec_load(). The attack surface shrinks from 300+ syscalls to ~250 safe ones.
Building Your Own Container
If you wanted to build a container from scratch, here's what you'd do:
# 1. Create namespaces
unshare --pid --net --mount --uts --ipc --user --fork
# 2. Set up overlay filesystem
mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged
# 3. Pivot to new root
cd /merged && mkdir .old_root
pivot_root . .old_root
umount -l /.old_root && rmdir /.old_root
# 4. Set resource limits
echo 512M > /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/mycontainer/cgroup.procs
# 5. Drop capabilities
capsh --drop=cap_sys_admin,cap_sys_module,... --
# 6. Apply seccomp
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &filter)
# 7. Execute your app
exec /app/start.sh
Or you could just run docker run myapp, which does all of this in 100 milliseconds.
The Key Insight
Containers are not magic. They're not VMs. They're not running in a separate environment.
A container is a regular Linux process that the kernel has lied to. It's PID 3847 on the host, but it thinks it's PID 1. It's using the host's kernel, but it can only see its own filesystem. It's on the host's network interface, but it thinks it has its own network stack.
Every piece of container technology (namespaces, cgroups, overlay filesystems, capabilities, seccomp) is a Linux feature that has existed for years. Docker, Podman, and containerd are just automation tools that orchestrate these features.
Understanding what's actually happening (what docker run is really doing) makes containers less mysterious. When something breaks, you can debug it. When you need to tune performance, you know which knobs to turn. When you need to harden security, you know which layers to strengthen.
Containers aren't magic. They're just Linux.