
B-Trees: How Databases Find Your Data

You run SELECT * FROM users WHERE id = 42. The table has 10 million rows.

Without an index, the database reads every row, checking if id = 42. That's 10 million comparisons. With a B-Tree index, it finds the row in about 4 page reads.

An index is a separate data structure the database maintains alongside your table. It maps keys (like id) to the location of the corresponding row on disk, so the database can jump directly to the data instead of scanning everything.

This article explains how that index works.

The simplest approach: arrays

Before trees, consider the obvious way to store data: a list.

An unsorted array lets you append new records instantly. O(1) insert. But finding a record means checking every element. O(n) search. With 10 million rows, that's 10 million comparisons.

Step through a linear scan and watch how many comparisons it takes:

[Interactive demo: searching an unsorted array. Every element must be checked: O(n).]
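
Here's what that scan looks like in code. A minimal sketch, using the demo's values:

    def linear_search(items, target):
        """Check elements one by one until the target turns up: O(n)."""
        for i, value in enumerate(items):
            if value == target:
                return i  # found after i + 1 comparisons
        return -1  # checked all n elements, no match

    linear_search([73, 12, 95, 42, 8, 67], 42)  # returns 3, after 4 comparisons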

A sorted array fixes the search problem. Binary search gives you O(log n) lookups: about 24 comparisons for 10 million records. But now inserts are expensive. Adding one record means shifting all elements after it. O(n) insert.

[Interactive demo: binary search over a sorted array: O(log n).]
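
In code, binary search is a few lines. A sketch using Python's standard bisect module, over the demo's array:

    import bisect

    def binary_search(sorted_items, target):
        """Halve the search range at each step: O(log n)."""
        i = bisect.bisect_left(sorted_items, target)
        if i < len(sorted_items) and sorted_items[i] == target:
            return i
        return -1

    data = [8, 12, 19, 26, 31, 42, 54, 67, 73, 81, 89, 95]
    binary_search(data, 42)   # returns 5
    data.insert(5, 41)        # but inserting shifts everything after index 5: O(n)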

Neither option works well for databases. We need fast reads AND fast writes. Arrays force us to choose one.

What about linked lists?

Linked lists have O(1) insert (just update pointers), but O(n) search (you must traverse from the head). Same problem.

The Problem

We need a data structure where both lookups AND inserts are fast. Ideally O(log n) for both. Arrays can't do both. What can?

Binary search trees: the right idea

A binary search tree (BST) solves the array tradeoff. The rule is simple: each node holds one key. Everything in the left subtree is smaller. Everything in the right subtree is larger.

To find a value, start at the root. If your value is smaller than the node's key, go left. If larger, go right. Repeat until you find it or hit a dead end. Each step eliminates half the remaining nodes, giving O(log n) lookups. Inserts follow the same path: walk down to where the key would be, then add it there. Also O(log n).
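
A sketch of the lookup, with a minimal node type assumed for illustration:

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None   # subtree of smaller keys
            self.right = None  # subtree of larger keys

    def bst_find(root, key):
        """Walk down, discarding half the (balanced) tree per step: O(log n)."""
        node = root
        while node is not None:
            if key == node.key:
                return node
            node = node.left if key < node.key else node.right
        return None  # dead end: key is absent

Insert follows the same walk, then attaches a new Node at the dead end.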

[Interactive demo: binary search tree lookup. At each node, smaller goes left, larger goes right.]

BSTs work well in memory. But databases store data on disk, and that changes everything.

The problem with binary trees on disk

Binary search trees give us O(log n) for both operations, but when data lives on disk, there's a hidden cost: each tree level requires a separate disk read.

In memory, following a pointer to the next node takes about 100 nanoseconds. On disk, it requires a disk seek: the drive head physically moves to that location (on spinning disks) or a page is fetched from flash storage. A spinning-disk seek takes about 10 milliseconds — roughly 100,000 times slower than a memory access. (SSDs are far faster, but a random page read is still thousands of times slower than memory.)

That changes the math completely. The BST lookup in the demo above visited 4 nodes in microseconds. On disk, those same 4 nodes mean 4 disk seeks: 40ms. In a binary tree with 1 million nodes, finding a value requires about 20 levels of traversal. That's 20 disk seeks. At 10ms each, that's 200ms just to find one row.

[Interactive demo: why binary trees fail on disk. Each tree level requires a separate disk seek, at roughly 10ms each.]

The problem isn't the algorithm. Binary search is O(log n), which is good. The problem is that each step requires fetching a new disk page.

If we're going to read a disk page anyway, we might as well put more data in it.

Fanout: more keys per node

Spinning disks and SSDs don't read individual bytes. They read in chunks called pages (typically 4KB or more). Reading 100 bytes costs almost the same as reading 4,000 bytes.

Binary trees waste this. Each node holds one key, but we pay for a full page read.

The insight behind B-Trees: instead of storing one key per node, store many. A node with keys [20, 50, 80] creates four regions — values less than 20, between 20 and 50, between 50 and 80, and greater than 80 — each with a pointer to a child node. More keys per node means fewer levels in the tree. This is called fanout. Higher fanout means fewer disk seeks.

[Interactive demo: fanout. More keys per node means fewer levels to traverse.]
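
The underlying arithmetic is just a change of logarithm base. A quick sketch:

    import math

    def levels(n, fanout):
        """Tree height needed to index n items at a given fanout."""
        return math.ceil(math.log(n, fanout))

    levels(10**9, 2)    # 30 levels for a binary tree
    levels(10**9, 100)  # 5 levels for a B-Tree with fanout 100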

A B-Tree with fanout 100 can index a billion records in just 5 levels. That's 5 disk seeks, not 30.

How B-Tree lookup works

A B-Tree node contains sorted keys that act as signposts. A node with N keys has N+1 child pointers — one for each region between and around the keys. For example, a node with keys [50, 100] has three children: one for values less than 50, one for values between 50 and 100, and one for values greater than 100.

To look up a value X:

  1. Start at the root node
  2. Compare X against the keys in the node
  3. Follow the pointer to the appropriate child
  4. Repeat until reaching a leaf

At each level, you're narrowing down which subtree contains your value. The keys tell you where to go next.
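
As a sketch, assuming each node carries a sorted keys list, a children list with one more entry than keys, and an is_leaf flag:

    import bisect

    def btree_find(node, target):
        """Visit one node per level; bisect picks the child region."""
        while not node.is_leaf:
            i = bisect.bisect_right(node.keys, target)  # first key > target
            node = node.children[i]                     # descend one level
        return target in node.keys

With keys [50, 100] and target 75, bisect_right returns 1, so the search descends into the middle child: the subtree for values between 50 and 100.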

[Interactive demo: B-Tree lookup. Keys in each node guide the search to the correct subtree.]

The search visits exactly one node per level. With 3 levels, you find any value in 3 node accesses. The keys in each node guide you directly to the right subtree.

What happens when a node overflows

B-Trees maintain a maximum number of keys per node. When you insert a new key into a full node, the node splits.

Splitting works like this:

  1. The node has N keys (at max capacity)
  2. Insert the new key (temporarily N+1 keys)
  3. Split the node in half
  4. Promote the middle key to the parent node
  5. The parent now points to two children instead of one

If the parent also overflows, it splits too. This can cascade up to the root. When the root splits, the tree grows one level taller.
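
A sketch of the split itself, on a leaf's key list (max capacity 4, as in the demo below):

    def split(keys):
        """Split an overflowing key list into (left, promoted, right)."""
        mid = len(keys) // 2
        return keys[:mid], keys[mid], keys[mid + 1:]

    # Inserting 25 into the full node [10, 20, 30, 40]:
    split([10, 20, 25, 30, 40])  # ([10, 20], 25, [30, 40])

The promoted key 25 moves into the parent; the two halves become siblings.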

[Interactive demo: B-Tree node split. When a node overflows (max capacity: 4 keys), it splits and promotes the middle key upward.]

Splits are why B-Trees grow "from the bottom up." Unlike binary trees that grow downward from the root, B-Trees push keys upward when leaves overflow.

The opposite happens during deletion. B-Trees enforce a minimum occupancy — each node must be at least half-full. When a node drops below the minimum after a deletion, it first tries to borrow a key from an adjacent sibling. If the sibling is also at minimum occupancy, the two nodes merge into one and the parent loses a key. Like splits, merges can cascade upward.

[Interactive demo: B-Tree deletion. When a node underflows (minimum occupancy: 2 keys), it borrows a key from a sibling through the parent.]
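
A sketch of that fix-up, on plain key lists with hypothetical values (separator key 30 in the parent, minimum occupancy 2):

    MIN_KEYS = 2

    def fix_underflow(separator, left, right):
        """Rebalance after `left` drops below minimum occupancy.

        Returns (new_separator, left, right); a None separator
        signals that the two nodes merged into one.
        """
        if len(right) > MIN_KEYS:
            # Borrow through the parent: the separator moves down into
            # the left node, the sibling's smallest key moves up.
            return right[0], left + [separator], right[1:]
        # Sibling is also at minimum: merge both nodes around the separator.
        return None, left + [separator] + right, []

    fix_underflow(30, [10], [40, 50, 60])  # (40, [10, 30], [50, 60])
    fix_underflow(30, [10], [40, 50])      # (None, [10, 30, 40, 50], [])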

Range queries and the leaf chain

Point queries (WHERE id = 42) traverse from root to leaf. But what about range queries (WHERE id BETWEEN 100 AND 500)?

In a basic B-Tree, you'd need to traverse from the root for each value in the range, or do an awkward in-order traversal that bounces between internal nodes and leaves. Neither is efficient.

The solution: link the leaf nodes together in a chain. Find the start of the range using normal tree traversal, then follow the chain of leaves until you pass the end of the range. No need to re-traverse the tree for each value. Just scan the leaves sequentially.

[Interactive demo: range scan. Linked leaves enable sequential access without re-traversing the tree.]
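
A sketch of the scan, assuming each leaf holds a sorted keys list and a next pointer to its right sibling, and that a normal root-to-leaf descent already found the leaf containing the start of the range:

    def range_scan(leaf, lo, hi):
        """Collect keys in [lo, hi] by walking the leaf chain."""
        results = []
        while leaf is not None:
            for key in leaf.keys:
                if key > hi:
                    return results       # passed the end of the range
                if key >= lo:
                    results.append(key)
            leaf = leaf.next             # sequential hop, no re-traversal
        return results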

This linked-leaf optimization turns range scans from O(log n) per result into O(log n + k), where k is the number of results. It's so important that virtually all database B-Trees use it.

B+-Trees: values only in leaves

Most database B-Trees are actually B+-Trees. The key difference: only leaf nodes contain actual data (row pointers or row data). Internal nodes store only keys — they're pure signposts for navigation.

This has two advantages. First, without values taking up space, internal nodes can fit more keys, which means higher fanout and a shorter tree. Second, every lookup goes all the way to a leaf, making I/O predictable — you always read the same number of pages regardless of which key you're looking for.
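
The fanout gain is easy to put numbers on. A back-of-the-envelope sketch, assuming 8-byte keys, 8-byte child pointers, and 100-byte rows on a 4KB page:

    PAGE = 4096

    # B+-Tree internal node: keys and child pointers only
    PAGE // (8 + 8)        # 256 keys per node

    # If internal nodes also carried 100-byte rows (plain B-Tree)
    PAGE // (8 + 8 + 100)  # 35 keys per node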

Combined with the linked leaf chain, B+-Trees handle both point lookups and range scans efficiently. That's why PostgreSQL, MySQL InnoDB, and SQLite all use them.

In a regular B-Tree, every node stores both keys and values:

[Diagram: a B-Tree in which every node, root and internal nodes included, stores both keys and values.]

In a B+-Tree, internal nodes only store keys. Values live exclusively in the leaves:

[Diagram: a B+-Tree in which internal nodes store only keys; values live exclusively in the leaves, which are linked for range scans.]

When someone says "B-Tree index," they almost always mean B+-Tree.

The hidden cost: write amplification

So far we've focused on reads. But databases also write data, and B-Trees have a cost that becomes clear once you think about how disks work.

When you update one record, you must rewrite its entire page. Disks don't let you change individual bytes. The smallest unit you can write is a block (typically 4KB). To modify 64 bytes, you read the page, change 64 bytes in memory, then write all 4KB back. This is called write amplification: the ratio of bytes written to bytes changed.

[Interactive demo: write amplification. Updating one 64-byte record rewrites the entire 4KB page.]
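
The ratio for the demo's page layout:

    PAGE_SIZE = 4096     # smallest unit the disk will write
    CHANGED = 64         # bytes actually modified

    PAGE_SIZE // CHANGED  # 64x write amplification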

Write amplification wears out SSDs faster (flash has limited write cycles), slows down write-heavy workloads, and complicates crash recovery.

B-Tree variants

The original B-Tree design has inspired many variations that address the write cost.

Copy-on-write B-Trees

Instead of modifying pages in place, copy-on-write (COW) B-Trees copy any page before changing it. Updates create a new version of the page hierarchy.

[Diagram: copy-on-write update. The root-to-leaf path is copied (Root', Parent', Leaf'); the original pages are never modified in place.]

Readers never block because they see a consistent snapshot of the old tree. There's no risk of corruption since original pages stay unchanged until the atomic pointer swap. Crash recovery is simple: if the system crashes mid-update, the old version is still valid. And you get multi-version concurrency control (MVCC) for free — old versions remain available to concurrent readers until they're reclaimed. LMDB uses this approach. SQLite's WAL (write-ahead log) mode is conceptually similar: writes go to a separate log file, and readers continue using the original pages until the log is merged back.
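
A minimal sketch of path copying, using dict-based nodes and ignoring node splits (the layout here is an assumption for illustration, not any engine's actual format):

    import bisect

    def cow_update(node, key, value):
        """Return a new version of `node` with `key` set to `value`.

        Only nodes on the root-to-leaf path are copied; untouched
        subtrees are shared between the old and new trees.
        """
        if 'children' not in node:  # leaf: copy it, then modify the copy
            keys, values = list(node['keys']), list(node['values'])
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                values[i] = value
            else:
                keys.insert(i, key)
                values.insert(i, value)
            return {'keys': keys, 'values': values}
        i = bisect.bisect_right(node['keys'], key)
        children = list(node['children'])
        children[i] = cow_update(children[i], key, value)  # copy one path
        return {'keys': node['keys'], 'children': children}

The old root remains a valid, consistent tree throughout; publishing the update is a single pointer swap to the new root.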

Lazy B-Trees

Lazy B-Trees buffer updates in memory before writing them to disk. Instead of immediately updating pages, changes accumulate in a buffer. When the buffer fills up, it's flushed to disk.

[Interactive demo: lazy B-Tree. Updates accumulate in a memory buffer and flush to the disk page in batches.]

WiredTiger (MongoDB's default storage engine) uses this approach. Each page has an update buffer that holds recent modifications. Reads merge the buffer with the base page. Writes just append to the buffer. This reduces I/O because multiple updates to the same page become a single disk write. The tradeoff: reads must check both the buffer and the base page.
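
A sketch of that page shape (names are illustrative, not WiredTiger's actual API):

    class LazyPage:
        """A page with an in-memory buffer of recent updates."""

        def __init__(self, base):
            self.base = dict(base)  # last state flushed to disk
            self.buffer = {}        # recent updates, memory only

        def write(self, key, value):
            self.buffer[key] = value  # append to buffer, no disk I/O

        def read(self, key):
            # Reads merge the buffer with the base page
            return self.buffer.get(key, self.base.get(key))

        def flush(self):
            # Many buffered updates become one disk write
            self.base.update(self.buffer)
            self.buffer.clear()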

LSM Trees

Log-Structured Merge Trees take buffering to the extreme. All writes go to an in-memory sorted structure called a memtable. When the memtable fills, it's flushed to disk as an immutable sorted file called an SSTable (Sorted String Table).

[Interactive demo: LSM tree. Writes land in an in-memory memtable, which flushes to immutable SSTables on disk.]

Multiple SSTables are periodically merged together in a process called compaction, which keeps read performance reasonable.

LSM Trees trade read performance for write performance. Writes are always fast (just append to the memtable). Reads may need to check multiple SSTables. They're used by LevelDB, RocksDB, Cassandra, and other write-heavy systems.
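
A toy sketch of the write path and the read penalty (no compaction, no real disk I/O; names are illustrative):

    import bisect

    class LSMTree:
        """Memtable plus immutable SSTables; compaction omitted."""

        def __init__(self, memtable_limit=4):
            self.memtable = {}   # most recent writes; sorted on flush
            self.sstables = []   # newest first; each an immutable sorted list
            self.limit = memtable_limit

        def put(self, key, value):
            self.memtable[key] = value        # always fast: memory only
            if len(self.memtable) >= self.limit:
                self.flush()

        def flush(self):
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for table in self.sstables:       # may check several SSTables
                i = bisect.bisect_left(table, (key,))
                if i < len(table) and table[i][0] == key:
                    return table[i][1]        # newest version wins
            return None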

Why not other data structures?

Hash tables give O(1) lookups but don't support range queries. You can find id = 42 but not id > 42.

Skip lists are probabilistic data structures with similar complexity to trees, but they don't have the disk-friendly properties of B-Trees.

Tries work well for string prefixes but waste space for numeric keys.

B-Trees hit the sweet spot: logarithmic height, high fanout, efficient range scans, and good performance on spinning disks and SSDs alike.

Summary

B-Trees solve a specific problem: finding data on disk with minimal I/O. Unsorted arrays give you O(n) search. Sorted arrays give you O(n) insert. Binary trees have too many levels for disk. B-Trees pack many keys per node, reducing tree height and disk seeks. B+-Trees go further by keeping values only in leaves and linking them for range scans. Variants like COW, lazy buffering, and LSM trees address the write cost.

When you create an index, you're building a B-Tree. When you query by primary key, you're traversing one. Understanding how they work helps you design schemas and write queries that use them effectively.