File systems: how your computer organizes data on disk
Open your file manager and you'll see folders, files, names, sizes, dates. It feels structured, almost like the computer was designed to think in terms of files. But underneath all of that, your hard drive is just a long strip of numbered bytes, with no files, no folders, and no names.
So how does your computer turn a strip of bytes into the organized hierarchy you see on screen? That's what a file system does, and we're going to build one from scratch. We'll start with the simplest thing that could work, hit its limitations, and explore how real designs like ext4, NTFS, and APFS solve them.
Want to explore interactively? Check out the full interactive filesystem simulator to build and manage your own file system with a complete visual interface.
A strip of bytes
Start at the bottom. A hard drive (or SSD) is, from the software's perspective, a flat array of bytes (a byte is 8 binary digits, enough to store one character of text). You read or write any byte by specifying its offset: the number of bytes from the start of the disk.
Click "Write File" and watch bytes light up. That's all writing a file really is at the lowest level: putting bytes at some offset. But there's an obvious problem. If you write file A at offset 2 and file B at offset 12, you need to remember those numbers yourself. Where does file A start? How long is it? There's nothing on the disk that tells you.
We need a system for organizing this. The simplest approach: lay files side by side and keep a small table of (name, start, length) entries. Here's how it works.
Laying files side by side
We'll store files contiguously, one right after the other, with no gaps. A small table at the start of the disk records each file's name, starting offset, and length. This is called contiguous allocation.
It works well at first. Files are fast to read because all the bytes are next to each other. Watch what happens when we start deleting files.
Step through the demo. After deleting file B, a 3-byte gap opens between A and C. We have 7 bytes free total, but the largest contiguous gap is only 4 bytes. File D needs 5 contiguous bytes and can't fit anywhere, even though there's enough total space.
This is external fragmentation, and it's the fundamental problem with contiguous allocation. Every time you delete a file, you leave a hole. Over time, the disk becomes a patchwork of small gaps that can't fit new files, even when the total free space is plenty.
One option is to periodically compact the disk, sliding all files toward the beginning to eliminate gaps. But that means moving potentially gigabytes of data, which is extremely slow. This is called "defragmentation," and users used to run it regularly on hard drives to maintain performance. (Note: if you have an SSD, don't defrag it. SSDs handle fragmentation differently and defragging can actually harm them by causing unnecessary writes.)
We need a better approach for the general case.
What if files didn't have to be stored in one contiguous chunk?
Splitting things into blocks
Instead of allocating disk space byte by byte, divide the entire disk into fixed-size blocks (typically 4 KB each). When a file needs space, allocate one block at a time, and they don't need to be next to each other.
Drag the slider and watch how many blocks get allocated. A 10-byte file in a system with 4-byte blocks needs 3 blocks: two full blocks (8 bytes) and one partial block (2 bytes used, 2 bytes wasted).
That wasted space in the last block is called internal fragmentation. It's a small price to pay. With 4 KB blocks, you waste at most 4,095 bytes per file, which is negligible for most files. And you've completely eliminated the external fragmentation problem, because you never need a contiguous run of blocks. Any free block anywhere on disk can be used.
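To make the rounding concrete, here's a quick Python sketch (function names invented for illustration):

```python
def blocks_needed(file_size: int, block_size: int) -> int:
    """Round up: even one byte consumes a whole block."""
    return -(-file_size // block_size)  # ceiling division

def wasted_bytes(file_size: int, block_size: int) -> int:
    """Internal fragmentation: unused space in the file's last block."""
    return blocks_needed(file_size, block_size) * block_size - file_size

print(blocks_needed(10, 4), wasted_bytes(10, 4))                # 3 blocks, 2 bytes wasted
print(blocks_needed(10_000, 4096), wasted_bytes(10_000, 4096))  # 3 blocks, 2288 bytes wasted
```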
But how does the OS know which blocks are free? With thousands or millions of blocks, the OS needs a fast way to find free space.
Finding free blocks
The simplest approach is a bitmap: one bit (a single 0 or 1) per block. A 1 means the block is in use, a 0 means it's free. To allocate a block, scan the bitmap for a 0 bit. To free a block, flip its bit back to 0.
Click bits in the bitmap to toggle them and watch the corresponding disk blocks change. The stats at the bottom update in real time: total free blocks, the index of the first free block, and the longest contiguous run of free blocks (useful when the OS wants to allocate blocks near each other for performance).
How big does the bitmap get? A 1 TB disk with 4 KB blocks has about 250 million blocks (1 trillion bytes / 4,096 bytes per block). One bit per block means 250 million bits, which is about 32 MB (250 million / 8 bits per byte). That's tiny compared to the disk size, and it fits easily in memory. The OS keeps the bitmap cached in RAM and writes it back to disk when it changes.
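Here's a minimal Python sketch of the bitmap itself, on a toy 16-block disk (class and method names invented for illustration). Its allocate method uses the first-fit scan discussed next:

```python
class Bitmap:
    def __init__(self, num_blocks: int):
        self.bits = [0] * num_blocks  # 0 = free, 1 = in use

    def allocate(self) -> int:
        """First-fit: claim the first free block found."""
        for i, bit in enumerate(self.bits):
            if bit == 0:
                self.bits[i] = 1
                return i
        raise OSError("disk full")

    def free(self, block: int) -> None:
        self.bits[block] = 0

bm = Bitmap(16)           # a toy 16-block disk
print(bm.allocate())      # 0
print(bm.allocate())      # 1
bm.free(0)
print(bm.allocate())      # 0 again: first-fit reuses the earliest hole
```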
Which free block gets allocated?
Finding a free block is one thing, but which one should the allocator choose? The simplest strategy is first-fit: scan the bitmap from the beginning and allocate the first free block you find. This is fast, but it leads to performance problems over time because blocks end up scattered randomly across the disk.
Adjust the slider to create multiple files. Notice how blocks get scattered. File 1 starts at block 0, file 2 jumps to a later region, then file 3 has to split and jump again. The real cost is this: when reading a file sequentially, the disk arm has to jump around between distant locations, causing expensive seeks. If your disk blocks are scattered all over, reading a simple 4 MB file might require dozens of separate disk seeks, each taking a millisecond. Those milliseconds add up.
Modern file systems do better with locality-based allocation: instead of scanning from the beginning every time, the allocator remembers where the last allocation happened and tries to place the next block nearby. This keeps blocks from the same file clustered together, minimizing seeks. Some allocators also try to keep files from the same directory close together, so listing a directory doesn't require random seeks across the disk.
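A simplified sketch of that idea, extending the Bitmap class from above: remember where the last allocation landed and scan outward from there. Real allocators are far more sophisticated, and this "next-fit" variant is only one way to get locality:

```python
class LocalityBitmap(Bitmap):
    """Next-fit variant: start scanning where the last allocation landed."""
    def __init__(self, num_blocks: int):
        super().__init__(num_blocks)
        self.last = 0

    def allocate(self) -> int:
        n = len(self.bits)
        for offset in range(n):
            i = (self.last + offset) % n  # scan forward from the last spot
            if self.bits[i] == 0:
                self.bits[i] = 1
                self.last = i
                return i
        raise OSError("disk full")

lb = LocalityBitmap(16)
lb.last = 5               # pretend the previous file ended near block 5
print(lb.allocate())      # 5: the new file's blocks cluster nearby
```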
Create the same number of files and notice the difference. Blocks now group contiguously. File 1 occupies blocks 0–3, file 2 gets blocks 5–8, and file 3 gets blocks 10–13. Reading each file sequentially is much faster because the disk doesn't need to seek far between blocks. The allocator makes this happen by tracking where recent allocations were and preferring nearby blocks.
We can find free blocks, but we still need to track which blocks belong to which file.
Tracking blocks: the file allocation table
One of the earliest solutions was the File Allocation Table, or FAT. The idea is simple: set aside a table at the beginning of the disk with one entry per block. Each entry says "the next block in this file is block X" or "this is the last block" (EOF). To read a file, you look up its first block in the directory, then follow the chain through the FAT.
Step through reading readme.txt. The directory says it starts at block 2. The FAT entry for block 2 says "next is 5." Block 5 says "next is 8." Block 8 says "next is 3." Block 3 says "EOF." So the file is stored in blocks 2, 5, 8, and 3.
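That chain walk is a few lines of Python. A toy sketch using the exact FAT from the example (EOF here is our own sentinel; real FAT reserves a special entry value for it):

```python
EOF = -1  # sentinel marking the last block of a file

# The FAT from the readme.txt example: fat[block] = next block in the chain.
fat = {2: 5, 5: 8, 8: 3, 3: EOF}

def read_chain(fat: dict, first_block: int) -> list:
    """Follow the chain starting from the directory's first-block entry."""
    blocks, current = [], first_block
    while current != EOF:
        blocks.append(current)
        current = fat[current]
    return blocks

print(read_chain(fat, 2))  # [2, 5, 8, 3]
```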
FAT was invented for floppy disks in the late 1970s and became the standard file system for DOS and early Windows. It's still used on USB flash drives and SD cards because of its simplicity.
FAT has limitations. Reading the middle of a large file requires walking the entire chain from the beginning, because each entry only knows the next block, not the previous one. The table itself can become very large on modern disks. There's no built-in support for permissions, ownership, or other metadata either. We need something more flexible.
Inodes: metadata separate from data
Unix file systems took a different approach. Instead of one big table for the whole disk, each file gets its own metadata structure called an inode (index node). An inode stores everything the OS needs to know about a file: its type, size, ownership, timestamps, and a list of pointers to the data blocks. For multi-user systems, inodes also store permissions that control who can read, write, or execute the file.
Click on each field to see what it does. The block pointers tell the OS exactly which disk blocks hold the file's data. Unlike FAT's linked list, they give you direct access: to read the third block of a file, you look at block pointer 2 without walking a chain.
One field worth pausing on: permissions. On a shared computer (like a server with multiple users, or a laptop where you want to protect your private files), you need a way to prevent one user's program from reading or modifying another user's files. The kernel enforces permissions before any program can access a file. Unix file systems track who can do what with three groups of three bits: the file's owner, the owner's group, and everyone else. Each group gets three flags: read (r), write (w), and execute (x). These are written as a compact octal number, where each digit encodes one group.
Here's how the math works. Each flag has a value: read = 4, write = 2, execute = 1. You add them up for each group:
- 7 (4+2+1) = read + write + execute
- 5 (4+0+1) = read + execute (no write)
- 4 (4+0+0) = read only (no write, no execute)
- 6 (4+2+0) = read + write (no execute)
So 755 means: owner gets 7 (rwx), group gets 5 (r-x), everyone else gets 5 (r-x). When you run ls -l and see -rwxr-xr-x, that's exactly 755 spelled out in letters. Similarly, 644 means owner gets 6 (rw-), group gets 4 (r--), everyone else gets 4 (r--): the owner can read and write, but everyone else can only read.
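Here's a small Python sketch of that arithmetic, rendering an octal mode the way ls -l does (the function name is invented for illustration):

```python
def mode_to_string(mode: int) -> str:
    """Render an octal mode like 0o755 as the familiar rwx string."""
    out = ""
    for shift in (6, 3, 0):            # owner, group, everyone else
        bits = (mode >> shift) & 0b111
        out += "r" if bits & 4 else "-"
        out += "w" if bits & 2 else "-"
        out += "x" if bits & 1 else "-"
    return out

print(mode_to_string(0o755))  # rwxr-xr-x
print(mode_to_string(0o644))  # rw-r--r--
```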
The link count field tracks how many directory entries point to this inode. When you delete a file, the OS decrements this count rather than immediately freeing the file. The file's data is only freed when the link count reaches zero. This enables hard links (which we'll see next) and prevents accidental data loss if a file is still open when you delete it.
Allocating inodes: finding free space for metadata
Just as the OS needs to track free data blocks, it needs to track free inodes. An inode bitmap works exactly like the block bitmap: one bit per inode, 1 means allocated, 0 means free. When you create a new file, the OS scans the inode bitmap to find a free inode slot, marks it as allocated, and writes the new inode's metadata there.
Inodes are a fixed resource set when the file system is created. If you have 100,000 inodes total, you can create at most 100,000 files, regardless of how much disk space remains. When you delete a file and its link count reaches zero, the OS marks that inode as free in the inode bitmap, freeing it for reuse.
This separation of inode and block space means a file system can run out of inodes while still having plenty of free blocks. If you create a million tiny files, you might exhaust all 1 million inodes on a system, and then you can't create new files even though you have 500 GB of free disk space left. The superblock stores both "total inodes" and "free inodes" to track both resources separately.
Click bits in the inode bitmap to toggle them. Watch how inodes are allocated and freed. Notice the statistic at the bottom: even with plenty of free disk blocks, if you run out of inodes, you can't create new files.
Permission enforcement: the kernel's gatekeepers
The permission bits stored in the inode are just numbers on disk. The actual enforcement happens in the kernel. When a process tries to open() a file, the kernel:
- Resolves the path to an inode
- Reads the inode's permission bits, owner, and group
- Checks if the current process's user ID matches the owner (you get owner permissions)
- Or if the process's group ID matches the file's group (you get group permissions)
- Or if neither (you get everyone else permissions)
- Checks if those permissions allow the requested operation (read, write, execute)
- Allows or denies the open() call
The kernel checks permissions on every open() call, not just once per file. If permissions change while a file is open, the next call to open() sees the new permissions. These checks happen in the kernel, protected from user interference, so a process can't trick the OS into granting permissions it shouldn't have.
Try changing the process's UID or GID and requesting different operations. Watch how the kernel steps through the permission check algorithm: owner first, then group, then other. The allowed operations depend entirely on which category the process falls into.
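The cascade above is short enough to sketch in Python. Here, an inode is assumed to carry owner_uid, owner_gid, and mode fields (names invented for illustration; the real kernel check also handles the superuser, supplementary groups, and ACLs):

```python
R, W, X = 4, 2, 1  # the per-group permission flag values

def may_access(inode: dict, uid: int, gid: int, want: int) -> bool:
    """Owner first, then group, then other: exactly one category applies."""
    if uid == inode["owner_uid"]:
        bits = (inode["mode"] >> 6) & 0b111   # owner digit
    elif gid == inode["owner_gid"]:
        bits = (inode["mode"] >> 3) & 0b111   # group digit
    else:
        bits = inode["mode"] & 0b111          # everyone-else digit
    return bits & want == want

inode = {"owner_uid": 1000, "owner_gid": 1000, "mode": 0o644}
print(may_access(inode, 1000, 1000, R | W))  # True: the owner may read and write
print(may_access(inode, 2000, 2000, W))      # False: everyone else is read-only
```

Note that exactly one category applies: if the process is the owner, only the owner bits matter, even when the everyone-else bits happen to be more permissive.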
But a typical inode has room for only 12 direct block pointers. With 4 KB blocks, that's only 48 KB of data. Most files are bigger than that. How do we handle large files without making the inode itself huge?
Indirect pointers: scaling to large files
After the 12 direct pointers, the inode has one single indirect pointer. This points to a block that's entirely full of more block pointers, not data. With 4 KB blocks and 4-byte pointers, that's 1,024 additional pointers, covering up to 4 MB.
Still not enough? The next slot is a double indirect pointer, which points to a block of pointers, each of which points to another block of pointers, each of which finally points to data. That's 1,024 x 1,024 = over a million blocks, or about 4 GB. A triple indirect pointer extends this to terabytes.
Toggle between the three levels and watch how the tree of pointers fans out. Small files use only direct pointers, so there's no indirection overhead. Large files pay the cost of pointer traversal, but it's logarithmic: even a 4 GB file only needs three levels of indirection.
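The reach of each level is easy to compute from the numbers in this section (4 KB blocks, 4-byte pointers). A quick Python sketch:

```python
BLOCK = 4096               # bytes per block
PTR = 4                    # bytes per block pointer
PER_BLOCK = BLOCK // PTR   # 1,024 pointers fit in one pointer block

reach = {
    "12 direct pointers": 12 * BLOCK,
    "single indirect":    PER_BLOCK * BLOCK,
    "double indirect":    PER_BLOCK**2 * BLOCK,
    "triple indirect":    PER_BLOCK**3 * BLOCK,
}
for level, size in reach.items():
    print(f"{level:>19}: {size:>17,} bytes")
# 12 direct pointers:            49,152 bytes (48 KB)
#    single indirect:         4,194,304 bytes (4 MB)
#    double indirect:     4,294,967,296 bytes (4 GB)
#    triple indirect: 4,398,046,511,104 bytes (4 TB)
```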
The superblock: file system metadata
Before we move on to directories, there's a question we've glossed over: where does the file system store information about itself? How many blocks does the disk have? What's the block size? How many inodes are there?
This metadata lives in the superblock, a special block (usually block 0 or block 1) that describes the file system as a whole. Here are the fields:
| Field | Value | Purpose |
|---|---|---|
| magic number | 0xEF53 | Identifies this as an ext4 file system. The OS checks this before mounting. |
| block size | 4096 bytes | How many bytes each block holds. Set when the file system is created. Cannot be changed later. |
| total blocks | 2,621,440 | Total number of blocks on this partition. Block size times total blocks gives the partition size (10 GB here). |
| free blocks | 1,843,200 | How many blocks are currently unused. The OS updates this on every allocation and deallocation. |
| total inodes | 655,360 | Maximum number of files this file system can hold. Set at creation time. Running out means no new files even if disk space remains. |
| free inodes | 612,430 | How many inodes are still available for new files. |
| mount count | 42 | How many times this file system has been mounted. After a threshold, fsck runs automatically. |
| state | clean | Whether the file system was properly unmounted. If "dirty," fsck or journal replay is needed. |
The superblock is the first thing the OS reads when it mounts a file system. Most file systems keep backup copies of the superblock scattered across the disk in case the primary copy is corrupted.
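Here's a hedged Python sketch of that first step: read the superblock, check the magic number, refuse to mount otherwise. The packed six-field layout is invented for this sketch; real ext4 stores many more fields at specific byte offsets:

```python
import io
import struct

EXT4_MAGIC = 0xEF53

def read_superblock(disk) -> dict:
    """Mount step one: read the superblock and validate the magic number."""
    disk.seek(0)
    magic, block_size, total_blocks, free_blocks, total_inodes, free_inodes = \
        struct.unpack("<6I", disk.read(24))
    if magic != EXT4_MAGIC:
        raise OSError("bad magic number: refusing to mount")
    return {"block_size": block_size, "total_blocks": total_blocks,
            "free_blocks": free_blocks, "total_inodes": total_inodes,
            "free_inodes": free_inodes}

# A fake 24-byte "disk" holding the values from the table above.
disk = io.BytesIO(struct.pack("<6I", EXT4_MAGIC, 4096,
                              2_621_440, 1_843_200, 655_360, 612_430))
print(read_superblock(disk))
```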
The "state" field tracks mount consistency. When the OS mounts a file system, it sets the state to "dirty." When it properly unmounts, it sets it back to "clean." If the OS finds a dirty state on mount, it knows the previous session didn't shut down cleanly.
Mounting: making a file system visible
When the OS boots, partitions are just raw disk space. Mounting loads a file system from disk and makes it available through the directory tree. Here's what happens:
- The OS reads the superblock from the partition to verify it's a valid file system
- It loads the metadata into memory: the inode bitmap, block bitmap, and inode table
- It reads the root inode (usually inode 2) for that file system
- It attaches this file system to the directory tree at a mount point (a directory like /home or /mnt/usb)
- Now all files in that file system are accessible through paths starting with the mount point
So when you access /home/user/file.txt, the OS might be reading from two different physical disks: if /home is a separate partition, the OS follows the mount point and accesses file.txt through that partition's inode table.
Mounting doesn't copy file contents into RAM. The OS keeps frequently-used metadata in memory (the inode table, the bitmaps) but reads file contents on demand through the buffer cache. When you unmount a file system, the OS flushes all dirty blocks to disk, updates the superblock's state to "clean," and releases the in-memory metadata.
With blocks, inodes, a bitmap, a superblock, and mounting in place, we need a way to name files.
Directories: names for inodes
A directory is just a file whose contents happen to be a list of (name, inode number) pairs. That's it. The root directory is typically inode 2. When you access /home/user/notes.txt, the OS resolves it step by step:
- Read inode 2 (the root directory), which is a directory, so read its data blocks.
- Search the entries for "home," which maps to inode 10.
- Read inode 10, also a directory, and read its data blocks.
- Search for "user," which maps to inode 15.
- Read inode 15 and search for "notes.txt," which maps to inode 42.
- Read inode 42, a regular file, and read its data blocks to get the file's content.
This process is called path resolution and happens every time you open a file.
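Here's that walk as a Python sketch over a toy in-memory "disk" (all structures invented for illustration):

```python
ROOT_INODE = 2

# Toy disk: inode number -> inode. Directory inodes map names to numbers.
inodes = {
    2:  {"type": "dir",  "entries": {"home": 10}},
    10: {"type": "dir",  "entries": {"user": 15}},
    15: {"type": "dir",  "entries": {"notes.txt": 42}},
    42: {"type": "file", "data": "meeting notes\n"},
}

def resolve(path: str) -> int:
    """Walk the path one component at a time, as the OS does."""
    current = ROOT_INODE
    for name in path.strip("/").split("/"):
        inode = inodes[current]
        if inode["type"] != "dir":
            raise NotADirectoryError(name)
        if name not in inode["entries"]:
            raise FileNotFoundError(name)
        current = inode["entries"][name]
    return current

print(resolve("/home/user/notes.txt"))  # 42
```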
Click folders to expand them and see how each directory entry maps a name to an inode number. The tree structure you see in your file manager is built entirely from these name-to-inode mappings.
Directories are just files with a special format, stored and managed exactly like regular files: with inodes, data blocks, and permissions. The file system doesn't need special machinery for the directory hierarchy. The tree structure comes from the data.
Watching path resolution happen
Watch how the OS resolves a path step by step. Each hop reads an inode, discovers it's a directory, scans its entries for the next name in the path, and follows the inode number to the next level.
Step through and notice that every / in the path means another inode read and another directory scan. Deeply nested paths are slightly slower to access because each level requires another lookup. The OS maintains a directory entry cache (dcache) in memory to avoid repeated disk reads for the same paths.
This separation of names from inodes opens a possibility: what if two names pointed to the same inode?
Multiple names, one file
Since directory entries are just (name, inode) pairs, nothing prevents two entries from pointing to the same inode number. When they do, you get a hard link: two names for the same file with the same data blocks and metadata. The inode keeps a link count that tracks how many directory entries reference it.
When you create a hard link, both names are equally valid references to the same inode. There's no "original" or "copy." Deleting one name decrements the link count but doesn't delete the data. The file only disappears when the link count reaches zero and no program has the file open.
This has practical implications: if a program has a file open and you delete it from the directory, the file's data survives on disk until the program closes the file. On Windows, you'll get an error when you try to delete because the OS prevents deletion of open files. On Unix, the deletion succeeds immediately from the user's perspective (the file disappears from the directory), but the OS keeps the blocks allocated internally until the last file descriptor is closed. You can see this if you delete a large file and then check disk usage with df: the space won't be freed until the last program that had it open closes it.
There's also a symbolic link (symlink), which works differently. A symlink is its own file (with its own inode) whose content is a path string. When you access a symlink, the OS reads the path stored inside it and resolves that path to find the real file. This means symlinks can break: if the target is deleted, the symlink points to nothing.
Step through the demo. Watch the link count change as hard links are created and deleted. When a symlink's target disappears, the symlink becomes dangling, pointing to a path that no longer exists.
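On a Unix system you can watch the link count for real with Python's os module (a sketch that works inside a throwaway temp directory):

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())               # work in a throwaway directory

with open("original.txt", "w") as f:
    f.write("hello\n")

os.link("original.txt", "hardlink.txt")    # second name, same inode
os.symlink("original.txt", "symlink.txt")  # separate inode holding a path

stat = os.stat("original.txt")
print(stat.st_ino == os.stat("hardlink.txt").st_ino)  # True: one inode, two names
print(stat.st_nlink)                                  # 2: the link count went up

os.remove("original.txt")                  # decrements the count, data survives
print(open("hardlink.txt").read())         # hello
print(os.path.exists("symlink.txt"))       # False: the symlink is now dangling
```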
Why hard links can't cross file systems
Hard links are limited to files on the same file system. Each file system has its own inode table, and an inode number is only meaningful within it: inode 42 on disk 1 is a completely different file from inode 42 on disk 2. A directory entry stores nothing but that bare number, interpreted against the inode table of the partition it lives on, so a hard link pointing into another partition's inode table would silently reference the wrong file. That's why the OS refuses to create hard links across file systems.
Symlinks work differently. Instead of storing an inode number, they store a path string like /home/user/file.txt. When you follow a symlink, the OS calls path resolution on that string, which can cross mount points and file systems. That's why symlinks can point to files on other file systems, but also why they can break (if the file at that path gets deleted, the symlink becomes dangling).
What happens when you open a file
Opening a file means interacting with the kernel, which manages hardware and enforces rules about who can access what. When a process calls open(), the kernel sets up data structures to track the open file.
Each process has a file descriptor table, a small array of numbered slots that point to open files. Three slots are reserved and automatically available to every program:
- 0 (stdin): standard input. Where the program reads input, usually from your keyboard.
- 1 (stdout): standard output. Where the program prints normal output, usually to your terminal.
- 2 (stderr): standard error. Where the program prints error messages, also usually to your terminal.
When you run a command like python script.py < input.txt > output.txt, the shell redirects stdin to input.txt and stdout to output.txt before the program starts. The program never needs to know. It just reads from fd 0 and writes to fd 1.
When you open() a file in your program, the kernel resolves the path to an inode, creates an entry in the system-wide open file table (which tracks the current read/write position and access mode), and returns the next available file descriptor number to your program.
Step through and watch fd 3 get allocated. Opening the same file twice gives you two independent file descriptors with independent offsets: reading from fd 3 doesn't move fd 4's position. This is why a program can have multiple "cursors" into the same file.
When you close() a file descriptor, the kernel frees that slot and decrements the reference count in the open file table. When the last reference is closed, the open file entry is removed. The inode itself stays on disk.
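You can see the independent offsets for yourself with Python's os module, which exposes raw file descriptors (a sketch; the exact numbers you get back depend on what the process already has open):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("abcdefgh")

fd1 = os.open(path, os.O_RDONLY)   # two open() calls on the same file...
fd2 = os.open(path, os.O_RDONLY)   # ...create two open-file-table entries

print(os.read(fd1, 4))  # b'abcd' -> fd1's offset advances to 4
print(os.read(fd2, 4))  # b'abcd' -> fd2 starts at 0, independently
print(os.read(fd1, 4))  # b'efgh' -> fd1 resumes where it left off

os.close(fd1)
os.close(fd2)
```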
The buffer cache: why reads are fast
Disk access is slow: a hard drive seek takes milliseconds, and even an SSD read takes microseconds. When you cat a file you just edited, it loads from memory instead of disk. That's because the OS keeps a buffer cache (also called the page cache) in RAM: recently read or written disk blocks are kept in memory so future accesses don't need to hit the disk at all.
Click disk blocks to "read" them. The first access is a miss (the block must be loaded from disk), but subsequent accesses to the same block hit the cache. When the cache is full and a new block is needed, the OS evicts a block following the LRU (Least Recently Used) policy: it removes the block that hasn't been used for the longest time. The intuition is simple: if a block hasn't been accessed lately, you're probably not going to need it again soon, so it's safe to discard. Most programs access data in patterns, so once you stop reading a file, you're unlikely to go back to its old blocks. Watch the hit rate climb as you re-access blocks.
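Here's a minimal LRU buffer cache in Python, with a stand-in for the slow disk read (the class name and tiny 2-block capacity are just for illustration; real page caches are far more elaborate):

```python
from collections import OrderedDict

class BufferCache:
    """Tiny LRU cache: most recently used blocks live at the end."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()            # block number -> data

    def read(self, block_no: int) -> bytes:
        if block_no in self.blocks:            # hit: serve from RAM
            self.blocks.move_to_end(block_no)  # mark as recently used
            return self.blocks[block_no]
        data = self._read_from_disk(block_no)  # miss: slow path
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used
        return data

    def _read_from_disk(self, block_no: int) -> bytes:
        return b"\x00" * 4096                  # stand-in for a real disk read

cache = BufferCache(capacity=2)
cache.read(7); cache.read(9)   # two misses fill the cache
cache.read(7)                  # hit: block 7 becomes most recently used
cache.read(4)                  # miss: evicts block 9, the LRU victim
print(list(cache.blocks))      # [7, 4]
```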
The buffer cache is why free on Linux shows most of your RAM as "used" even when you're not running much. The OS fills idle RAM with cached disk blocks because unused RAM is wasted RAM. If a program needs memory, the OS simply evicts cached blocks; no data is lost because the disk still has the original copy.
Writes go through the cache too. When a program writes to a file, the OS updates the cached copy in RAM and marks that block as dirty, meaning it differs from what's on disk. The OS periodically flushes dirty blocks to disk in the background. This write-back strategy is fast because the program doesn't wait for the slow disk, but it means that at any given moment, some recent writes exist only in RAM.
That's why losing power without a proper shutdown can cause data loss: dirty blocks in the cache haven't been written to disk yet.
Click "Write Block" to add blocks to the cache and mark them dirty (shown in red). They stay in RAM until you click "Flush Cache," which writes them to disk and marks them clean. Notice the warning: if power fails with dirty blocks in the cache, those changes are lost forever because they haven't reached permanent storage.
Surviving crashes
Consider what happens if power fails in the middle of writing a file. The OS might have:
- Allocated new blocks but not yet updated the inode to point to them (leaked blocks)
- Updated the inode but not yet written the data (corrupted file)
- Partially updated a directory entry (broken directory)
Any of these leaves the file system in an inconsistent state, where the on-disk structures contradict each other. For example, imagine copying a 5 GB file when power dies:
- The OS allocates blocks 1000-2000 for the file
- It writes data to blocks 1000-1500
- Power fails here
- The inode now points to blocks 1000-2000, but only 1000-1500 have actual data
- Blocks 1501-2000 contain garbage or old data
When you restart, the OS sees an inode promising data the disk never fully received, and nothing on disk records where the write stopped. The file system's internal structures contradict each other.
The fsck approach: scan and repair after the fact
Early file systems like ext2 handled crashes with fsck (file system check). After an unclean shutdown, the OS runs fsck, which:
- Scans the entire inode table looking for inconsistencies
- Checks that block pointers point to valid blocks
- Verifies that the free space bitmap matches reality
- Repairs inconsistencies by undoing partial operations or reclaiming leaked blocks
This is slow. A 1 TB disk with 4 KB blocks means 250 million blocks. Scanning all of them takes hours. During that time, the system is down and the user is staring at a frozen boot screen. For production servers and laptops, this is unacceptable.
The journaling approach: write-ahead logging
Modern file systems use journaling instead. The idea is borrowed from database systems: write a description of the intended changes to a dedicated area called the journal before making any actual changes to the disk. Then apply the changes and clear the journal.
Step through the write operation and click "Crash here?" at each step to see what happens on recovery. At every point, recovery is straightforward. Before commit, the journal has uncommitted data that you can safely discard. After commit but before the disk update, the journal is committed, so recovery replays it to apply the changes. After the disk update, the journal can be safely cleared.
How recovery works with journaling
The journal stores which blocks were changed and what the new data should be. On restart, recovery is fast:
- If the journal is empty: the previous session shut down cleanly, so just mount the file system.
- If the journal has uncommitted entries: these didn't finish, so discard them. The disk wasn't touched.
- If the journal has committed entries but the disk updates aren't visible: recovery replays them by writing the new data from the journal to the correct disk blocks.
- If the journal shows an entry has already been replayed: skip it. Replaying the same write twice produces the same result.
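Those four rules fit in a few lines of Python. In this toy sketch, each journal transaction carries its block writes and a committed flag (the structure is invented for illustration):

```python
disk = {}  # block number -> data, standing in for the real disk

def recover(journal: list) -> None:
    """Replay committed transactions; discard everything else."""
    for txn in journal:
        if not txn["committed"]:
            continue                   # never committed: disk was untouched
        for block, data in txn["writes"]:
            disk[block] = data         # idempotent: replaying twice is safe

journal = [
    {"committed": True,  "writes": [(12, "new inode"), (340, "file data")]},
    {"committed": False, "writes": [(99, "half-finished update")]},  # dropped
]
recover(journal)
print(disk)  # {12: 'new inode', 340: 'file data'}
```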
Recovery with journaling is fundamentally different from fsck. Fsck scans the entire disk looking for inconsistencies, which on a 1 TB disk means checking hundreds of millions of blocks. Journal recovery only scans the journal, a small fixed-size area (maybe 10 MB). A 1 TB file system that takes hours with fsck recovers in seconds with journaling.
ext3, ext4, NTFS, and HFS+ all use journaling (with variations in what exactly gets journaled and how).
But journaling has a cost: every write happens twice (once to the journal, once to the actual location). Some file systems take a different approach entirely.
Copy-on-write: never modify in place
Instead of writing a journal and then writing the real data, what if you never modified existing blocks at all? That's the idea behind copy-on-write (COW), used by ZFS, Btrfs, and APFS.
When you update a file, the file system writes the new data to fresh, unused blocks, creates a new version of the inode pointing to the new blocks, and then changes a single root pointer in a way that takes effect all at once. The old blocks are never touched. This atomic operation means the change happens as a single indivisible moment: either the pointer has been updated and the new data is live, or it hasn't. There's no in-between state where the file is half-updated.
Step through and watch the update happen. The original data is never touched. The switch happens all at once. Either the pointer has been updated and the new data is live, or the old data is still live. You can't have a half-updated state, so you don't need a journal.
There's an additional benefit: if you don't free the old blocks, you get snapshots for free. The old inode still points to the old data, and the new inode points to the new data. Both versions coexist on disk without duplicating blocks that didn't change. That's how ZFS and APFS implement instant, space-efficient snapshots.
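Here's a Python sketch of both ideas at once, using toy inode and block structures invented for illustration. The update writes a fresh block and a new inode, the root pointer flips in a single step, and the old inode quietly becomes a snapshot:

```python
blocks = {1: "chunk A", 2: "chunk B"}    # existing data, never modified

inode_v1 = {"block_list": [1, 2]}
root = inode_v1                          # the one mutable pointer

# Update the second chunk: write a fresh block, build a new inode...
blocks[3] = "chunk B (edited)"
inode_v2 = {"block_list": [1, 3]}        # block 1 is shared, not copied

# ...then flip the root pointer. This single step is the atomic moment.
root = inode_v2

# The old inode still works: that's a snapshot, for free.
print([blocks[b] for b in inode_v1["block_list"]])  # ['chunk A', 'chunk B']
print([blocks[b] for b in root["block_list"]])      # ['chunk A', 'chunk B (edited)']
```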
The trade-off is that COW can cause fragmentation over time (data gets scattered as new versions are written to different locations) and requires a more complex allocator. For many workloads, the benefits outweigh the costs: atomic updates, free snapshots, and no journal overhead.
Extents: tracking blocks efficiently
We've solved crash safety two ways: journaling and copy-on-write. But both approaches have an efficiency problem with how inodes store block pointers. Each inode keeps a separate pointer for every single block. A 1 GB file with 4 KB blocks needs about 262,000 block pointers. Even with indirect blocks, that's a lot of metadata to store and keep in memory.
Most files are stored more or less contiguously on disk, especially when the disk isn't fragmented. If a file occupies blocks 1000 through 1999, why store 1,000 individual pointers? You could just record "start: 1000, length: 1000." That's a 1000x reduction in metadata.
An extent is a (start block, length) pair describing a contiguous run of blocks. Modern file systems like ext4 and XFS use extent-based allocation instead of individual block pointers. This saves memory, reduces disk I/O, and speeds up file lookups.
Toggle between block pointers and extents. With 8 contiguous blocks, block pointers need 8 entries. An extent needs just one. For large, contiguous files, the metadata savings are significant.
Real files might not be perfectly contiguous and may need multiple extents. Even a badly fragmented file rarely has more than a few dozen extents, far fewer than individual block pointers. ext4 stores up to 4 extents directly in the inode and uses an extent tree for files that need more. This makes the common case very fast while still handling fragmented files.
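Collapsing a block list into extents is a simple run-length pass. A Python sketch (function name invented for illustration):

```python
def to_extents(blocks: list) -> list:
    """Collapse a sorted block list into (start, length) extents."""
    extents = []
    for b in blocks:
        if extents and b == extents[-1][0] + extents[-1][1]:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)   # extend the current run
        else:
            extents.append((b, 1))              # start a new run
    return extents

# A file in two contiguous pieces: 1,000 pointers collapse to 2 extents.
file_blocks = list(range(1000, 1500)) + list(range(2000, 2500))
print(to_extents(file_blocks))  # [(1000, 500), (2000, 500)]
```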
Putting it all together
We've built bitmaps, inodes, a superblock, directories, and a journal. Now see how these structures fit together on a single disk.
Click each region to see what it contains and where it lives. The superblock sits at block 0 as the first thing the OS reads when mounting. The bitmaps come next because the OS needs them for every allocation. The inode table follows, then the journal, and finally the data blocks that take up most of the space. This is the layout of ext4kb, the interactive file system below, but real ext4 uses the same idea at larger scale, often replicating the superblock and bitmaps across multiple block groups for redundancy.
What a single command actually touches
With the layout in mind, trace what happens when you create a single empty file. touch /hello.txt needs to coordinate across almost every region of the disk.
Step through and count the regions that light up. Creating an empty file touches the journal, the inode bitmap, the inode table, the root directory's data block, and the superblock. That's 5 regions and at least 9 disk writes. This is the simple case—writing content to the file would also involve the block bitmap and data blocks.
Any one of those 9 writes could be interrupted by a power failure. The journal wraps them into a single atomic transaction: either all take effect, or none do.
Click "Create File" or "Delete File" to watch the superblock update in real time. Free blocks and free inodes decrease, the state becomes "dirty," and the mount count increases.
As we saw in the superblock section, the state only returns to "clean" when a proper unmount flushes all pending writes; a dirty state on the next mount means the previous session didn't shut down cleanly.
The superblock is the metadata about metadata: it tracks the global health and configuration of the file system.
Summary
We started with a flat strip of bytes and built up incrementally:
- Raw bytes to contiguous files: gave us files, but fragmentation made it impractical.
- Contiguous to blocks: eliminated external fragmentation at the cost of small internal waste.
- A free space bitmap gave the OS a way to quickly find available blocks.
- FAT tracked block chains, letting files scatter across disk.
- Inodes gave each file its own metadata with direct block pointers, and indirect pointers extended this to huge files.
- The superblock stored file system-wide metadata.
- Directories mapped names to inode numbers, and path resolution walked the tree.
- Hard links and symlinks gave multiple names to one file.
- File descriptors connected running processes to open files.
- The buffer cache kept hot blocks in RAM to avoid slow disk reads.
- Journaling ensured crash consistency through write-ahead logging.
- Copy-on-write offered an alternative: never modify in place, get free snapshots.
- Extents replaced per-block pointers with contiguous-range tracking.
Real file systems like ext4, NTFS, APFS, and ZFS combine these ideas and more, but the core concepts are exactly what we built here. The next time you run ls or open a file, the OS coordinates inodes, directory lookups, bitmap checks, and cache hits to make it happen in microseconds.