
B-Trees: How Databases Find Your Data

You run SELECT * FROM users WHERE id = 42. The table has 10 million rows.

Without an index, the database reads every row, checking if id = 42. That's 10 million comparisons. With a B-Tree index, it finds the row in about 4 page reads.

An index is a separate data structure the database maintains alongside your table. It maps keys (like id) to the location of the corresponding row on disk, so the database can jump directly to the data instead of scanning everything.

This article explains how that index works.

The simplest approach: arrays

Before trees, consider the obvious way to store data: a list.

An unsorted array lets you append new records instantly. O(1) insert. But finding a record means checking every element. O(n) search. With 10 million rows, that's 10 million comparisons.

Step through a linear scan and watch how many comparisons it takes:

[Interactive demo: searching an unsorted array. Every element must be checked: O(n).]
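
Here's what that scan looks like in code. A minimal sketch, using the demo's values:

    def linear_search(items, target):
        """Check elements one by one until the target turns up: O(n)."""
        for i, value in enumerate(items):
            if value == target:
                return i  # found after i + 1 comparisons
        return -1  # checked all n elements, no match

    linear_search([73, 12, 95, 42, 8, 67], 42)  # returns 3, after 4 comparisons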

A sorted array fixes the search problem. Binary search gives you O(log n) lookups: about 24 comparisons for 10 million records. But now inserts are expensive. Adding one record means shifting all elements after it. O(n) insert.

[Interactive demo: binary search over a sorted array: O(log n).]
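
In code, binary search is a few lines. A sketch using Python's standard bisect module, over the demo's array:

    import bisect

    def binary_search(sorted_items, target):
        """Halve the search range at each step: O(log n)."""
        i = bisect.bisect_left(sorted_items, target)
        if i < len(sorted_items) and sorted_items[i] == target:
            return i
        return -1

    data = [8, 12, 19, 26, 31, 42, 54, 67, 73, 81, 89, 95]
    binary_search(data, 42)   # returns 5
    data.insert(5, 41)        # but inserting shifts everything after index 5: O(n)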

Neither option works well for databases. We need fast reads AND fast writes. Arrays force us to choose one.

What about linked lists?

Linked lists have O(1) insert (just update pointers), but O(n) search (you must traverse from the head). Same problem.

The Problem

We need a data structure where both lookups AND inserts are fast. Ideally O(log n) for both. Arrays can't do both. What can?

Binary search trees: the right idea

A binary search tree (BST) solves the array tradeoff. The rule is simple: each node holds one key. Everything in the left subtree is smaller. Everything in the right subtree is larger.

To find a value, start at the root. If your value is smaller than the node's key, go left. If larger, go right. Repeat until you find it or hit a dead end. Each step eliminates half the remaining nodes, giving O(log n) lookups. Inserts follow the same path: walk down to where the key would be, then add it there. Also O(log n).
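
A sketch of the lookup, with a minimal node type assumed for illustration:

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None   # subtree of smaller keys
            self.right = None  # subtree of larger keys

    def bst_find(root, key):
        """Walk down, discarding half the (balanced) tree per step: O(log n)."""
        node = root
        while node is not None:
            if key == node.key:
                return node
            node = node.left if key < node.key else node.right
        return None  # dead end: key is absent

Insert follows the same walk, then attaches a new Node at the dead end.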

[Interactive demo: binary search tree lookup. At each node, smaller goes left, larger goes right.]

BSTs work well in memory. But databases store data on disk, and that changes everything.

The problem with binary trees on disk

Binary search trees give us O(log n) for both operations, but when data lives on disk, there's a hidden cost: each tree level requires a separate disk read.

In memory, following a pointer to the next node takes about 100 nanoseconds. On disk, it requires a disk seek: the drive head physically moves to that location (on spinning disks) or a page is fetched from flash storage. A spinning-disk seek takes about 10 milliseconds — roughly 100,000 times slower than a memory access. (SSDs are far faster, but a random page read is still thousands of times slower than memory.)

That changes the math completely. The BST lookup in the demo above visited 4 nodes in microseconds. On disk, those same 4 nodes mean 4 disk seeks: 40ms. In a binary tree with 1 million nodes, finding a value requires about 20 levels of traversal. That's 20 disk seeks. At 10ms each, that's 200ms just to find one row.

[Interactive demo: why binary trees fail on disk. Each tree level requires a separate disk seek, at roughly 10ms each.]

The problem isn't the algorithm. Binary search is O(log n), which is good. The problem is that each step requires fetching a new disk page.

If we're going to read a disk page anyway, we might as well put more data in it.

Fanout: more keys per node

Spinning disks and SSDs don't read individual bytes. They read in chunks called pages (typically 4KB or more). Reading 100 bytes costs almost the same as reading 4,000 bytes.

Binary trees waste this. Each node holds one key, but we pay for a full page read.

The insight behind B-Trees: instead of storing one key per node, store many. A node with keys [20, 50, 80] creates four regions — values less than 20, between 20 and 50, between 50 and 80, and greater than 80 — each with a pointer to a child node. More keys per node means fewer levels in the tree. This is called fanout. Higher fanout means fewer disk seeks.

[Interactive demo: fanout. More keys per node means fewer levels to traverse.]
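
The underlying arithmetic is just a change of logarithm base. A quick sketch:

    import math

    def levels(n, fanout):
        """Tree height needed to index n items at a given fanout."""
        return math.ceil(math.log(n, fanout))

    levels(10**9, 2)    # 30 levels for a binary tree
    levels(10**9, 100)  # 5 levels for a B-Tree with fanout 100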

A B-Tree with fanout 100 can index a billion records in just 5 levels. That's 5 disk seeks, not 30.

How B-Tree lookup works

A B-Tree node contains sorted keys that act as signposts. A node with N keys has N+1 child pointers — one for each region between and around the keys. For example, a node with keys [50, 100] has three children: one for values less than 50, one for values between 50 and 100, and one for values greater than 100.

To look up a value X:

  1. Start at the root node
  2. Compare X against the keys in the node
  3. Follow the pointer to the appropriate child
  4. Repeat until reaching a leaf

At each level, you're narrowing down which subtree contains your value. The keys tell you where to go next.
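
As a sketch, assuming each node carries a sorted keys list, a children list with one more entry than keys, and an is_leaf flag:

    import bisect

    def btree_find(node, target):
        """Visit one node per level; bisect picks the child region."""
        while not node.is_leaf:
            i = bisect.bisect_right(node.keys, target)  # first key > target
            node = node.children[i]                     # descend one level
        return target in node.keys

With keys [50, 100] and target 75, bisect_right returns 1, so the search descends into the middle child: the subtree for values between 50 and 100.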

[Interactive demo: B-Tree lookup. Keys in each node guide the search to the correct subtree.]

The search visits exactly one node per level. With 3 levels, you find any value in 3 node accesses. The keys in each node guide you directly to the right subtree.

What happens when a node overflows

B-Trees maintain a maximum number of keys per node. When you insert a new key into a full node, the node splits.

Splitting works like this:

  1. The node has N keys (at max capacity)
  2. Insert the new key (temporarily N+1 keys)
  3. Split the node in half
  4. Promote the middle key to the parent node
  5. The parent now points to two children instead of one

If the parent also overflows, it splits too. This can cascade up to the root. When the root splits, the tree grows one level taller.
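
A sketch of the split itself, on a leaf's key list (max capacity 4, as in the demo below):

    def split(keys):
        """Split an overflowing key list into (left, promoted, right)."""
        mid = len(keys) // 2
        return keys[:mid], keys[mid], keys[mid + 1:]

    # Inserting 25 into the full node [10, 20, 30, 40]:
    split([10, 20, 25, 30, 40])  # ([10, 20], 25, [30, 40])

The promoted key 25 moves into the parent; the two halves become siblings.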

[Interactive demo: B-Tree node split. When a node overflows (max capacity: 4 keys), it splits and promotes the middle key upward.]

Splits are why B-Trees grow "from the bottom up." Unlike binary trees that grow downward from the root, B-Trees push keys upward when leaves overflow.

The opposite happens during deletion. B-Trees enforce a minimum occupancy — each node must be at least half-full. When a node drops below the minimum after a deletion, it first tries to borrow a key from an adjacent sibling. If the sibling is also at minimum occupancy, the two nodes merge into one and the parent loses a key. Like splits, merges can cascade upward.

[Interactive demo: B-Tree deletion. When a node underflows (minimum occupancy: 2 keys), it borrows a key from a sibling through the parent.]
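
A sketch of that fix-up, on plain key lists with hypothetical values (separator key 30 in the parent, minimum occupancy 2):

    MIN_KEYS = 2

    def fix_underflow(separator, left, right):
        """Rebalance after `left` drops below minimum occupancy.

        Returns (new_separator, left, right); a None separator
        signals that the two nodes merged into one.
        """
        if len(right) > MIN_KEYS:
            # Borrow through the parent: the separator moves down into
            # the left node, the sibling's smallest key moves up.
            return right[0], left + [separator], right[1:]
        # Sibling is also at minimum: merge both nodes around the separator.
        return None, left + [separator] + right, []

    fix_underflow(30, [10], [40, 50, 60])  # (40, [10, 30], [50, 60])
    fix_underflow(30, [10], [40, 50])      # (None, [10, 30, 40, 50], [])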

Range queries and the leaf chain

Point queries (WHERE id = 42) traverse from root to leaf. But what about range queries (WHERE id BETWEEN 100 AND 500)?

In a basic B-Tree, you'd need to traverse from the root for each value in the range, or do an awkward in-order traversal that bounces between internal nodes and leaves. Neither is efficient.

The solution: link the leaf nodes together in a chain. Find the start of the range using normal tree traversal, then follow the chain of leaves until you pass the end of the range. No need to re-traverse the tree for each value. Just scan the leaves sequentially.

[Interactive demo: range scan. Linked leaves enable sequential access without re-traversing the tree.]
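
A sketch of the scan, assuming each leaf holds a sorted keys list and a next pointer to its right sibling, and that a normal root-to-leaf descent already found the leaf containing the start of the range:

    def range_scan(leaf, lo, hi):
        """Collect keys in [lo, hi] by walking the leaf chain."""
        results = []
        while leaf is not None:
            for key in leaf.keys:
                if key > hi:
                    return results       # passed the end of the range
                if key >= lo:
                    results.append(key)
            leaf = leaf.next             # sequential hop, no re-traversal
        return results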

This linked-leaf optimization turns range scans from O(log n) per result into O(log n + k), where k is the number of results. It's so important that virtually all database B-Trees use it.

B+-Trees: values only in leaves

Most database B-Trees are actually B+-Trees. The key difference: only leaf nodes contain actual data (row pointers or row data). Internal nodes store only keys — they're pure signposts for navigation.

This has two advantages. First, without values taking up space, internal nodes can fit more keys, which means higher fanout and a shorter tree. Second, every lookup goes all the way to a leaf, making I/O predictable — you always read the same number of pages regardless of which key you're looking for.
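
The fanout gain is easy to put numbers on. A back-of-the-envelope sketch, assuming 8-byte keys, 8-byte child pointers, and 100-byte rows on a 4KB page:

    PAGE = 4096

    # B+-Tree internal node: keys and child pointers only
    PAGE // (8 + 8)        # 256 keys per node

    # If internal nodes also carried 100-byte rows (plain B-Tree)
    PAGE // (8 + 8 + 100)  # 35 keys per node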

Combined with the linked leaf chain, B+-Trees handle both point lookups and range scans efficiently. That's why PostgreSQL, MySQL InnoDB, and SQLite all use them.

In a regular B-Tree, every node stores both keys and values:

[Diagram: a B-Tree in which every node, root and internal nodes included, stores both keys and values.]

In a B+-Tree, internal nodes only store keys. Values live exclusively in the leaves:

[Diagram: a B+-Tree in which internal nodes store only keys; values live exclusively in the leaves, which are linked for range scans.]

When someone says "B-Tree index," they almost always mean B+-Tree.

The hidden cost: write amplification

So far we've focused on reads. But databases also write data, and B-Trees have a cost that becomes clear once you think about how disks work.

When you update one record, you must rewrite its entire page. Disks don't let you change individual bytes. The smallest unit you can write is a block (typically 4KB). To modify 64 bytes, you read the page, change 64 bytes in memory, then write all 4KB back. This is called write amplification: the ratio of bytes written to bytes changed.

[Interactive demo: write amplification. Updating one 64-byte record rewrites the entire 4KB page.]
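
The ratio for the demo's page layout:

    PAGE_SIZE = 4096     # smallest unit the disk will write
    CHANGED = 64         # bytes actually modified

    PAGE_SIZE // CHANGED  # 64x write amplification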

Write amplification wears out SSDs faster (flash has limited write cycles), slows down write-heavy workloads, and complicates crash recovery.

B-Tree variants

The original B-Tree design has inspired many variations that address the write cost.

Copy-on-write B-Trees

Instead of modifying pages in place, copy-on-write (COW) B-Trees copy any page before changing it. Updates create a new version of the page hierarchy.

[Diagram: copy-on-write update. The root-to-leaf path is copied (Root', Parent', Leaf'); the original pages are never modified in place.]

Readers never block because they see a consistent snapshot of the old tree. There's no risk of corruption since original pages stay unchanged until the atomic pointer swap. Crash recovery is simple: if the system crashes mid-update, the old version is still valid. And you get multi-version concurrency control (MVCC) for free — old versions remain available to concurrent readers until they're reclaimed. LMDB uses this approach. SQLite's WAL (write-ahead log) mode is conceptually similar: writes go to a separate log file, and readers continue using the original pages until the log is merged back.
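
A minimal sketch of path copying, using dict-based nodes and ignoring node splits (the layout here is an assumption for illustration, not any engine's actual format):

    import bisect

    def cow_update(node, key, value):
        """Return a new version of `node` with `key` set to `value`.

        Only nodes on the root-to-leaf path are copied; untouched
        subtrees are shared between the old and new trees.
        """
        if 'children' not in node:  # leaf: copy it, then modify the copy
            keys, values = list(node['keys']), list(node['values'])
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                values[i] = value
            else:
                keys.insert(i, key)
                values.insert(i, value)
            return {'keys': keys, 'values': values}
        i = bisect.bisect_right(node['keys'], key)
        children = list(node['children'])
        children[i] = cow_update(children[i], key, value)  # copy one path
        return {'keys': node['keys'], 'children': children}

The old root remains a valid, consistent tree throughout; publishing the update is a single pointer swap to the new root.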

Lazy B-Trees

Lazy B-Trees buffer updates in memory before writing them to disk. Instead of immediately updating pages, changes accumulate in a buffer. When the buffer fills up, it's flushed to disk.

[Interactive demo: lazy B-Tree. Updates accumulate in a memory buffer and flush to the disk page in batches.]

WiredTiger (MongoDB's default storage engine) uses this approach. Each page has an update buffer that holds recent modifications. Reads merge the buffer with the base page. Writes just append to the buffer. This reduces I/O because multiple updates to the same page become a single disk write. The tradeoff: reads must check both the buffer and the base page.
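
A sketch of that page shape (names are illustrative, not WiredTiger's actual API):

    class LazyPage:
        """A page with an in-memory buffer of recent updates."""

        def __init__(self, base):
            self.base = dict(base)  # last state flushed to disk
            self.buffer = {}        # recent updates, memory only

        def write(self, key, value):
            self.buffer[key] = value  # append to buffer, no disk I/O

        def read(self, key):
            # Reads merge the buffer with the base page
            return self.buffer.get(key, self.base.get(key))

        def flush(self):
            # Many buffered updates become one disk write
            self.base.update(self.buffer)
            self.buffer.clear()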

LSM Trees

Log-Structured Merge Trees take buffering to the extreme. All writes go to an in-memory sorted structure called a memtable. When the memtable fills, it's flushed to disk as an immutable sorted file called an SSTable (Sorted String Table).

[Interactive demo: LSM tree. Writes land in an in-memory memtable, which flushes to immutable SSTables on disk.]

Multiple SSTables are periodically merged together in a process called compaction, which keeps read performance reasonable.

LSM Trees trade read performance for write performance. Writes are always fast (just append to the memtable). Reads may need to check multiple SSTables. They're used by LevelDB, RocksDB, Cassandra, and other write-heavy systems.
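
A toy sketch of the write path and the read penalty (no compaction, no real disk I/O; names are illustrative):

    import bisect

    class LSMTree:
        """Memtable plus immutable SSTables; compaction omitted."""

        def __init__(self, memtable_limit=4):
            self.memtable = {}   # most recent writes; sorted on flush
            self.sstables = []   # newest first; each an immutable sorted list
            self.limit = memtable_limit

        def put(self, key, value):
            self.memtable[key] = value        # always fast: memory only
            if len(self.memtable) >= self.limit:
                self.flush()

        def flush(self):
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for table in self.sstables:       # may check several SSTables
                i = bisect.bisect_left(table, (key,))
                if i < len(table) and table[i][0] == key:
                    return table[i][1]        # newest version wins
            return None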

Why not other data structures?

Hash tables give O(1) lookups but don't support range queries. You can find id = 42 but not id > 42.

Skip lists are probabilistic data structures with similar complexity to trees, but they don't have the disk-friendly properties of B-Trees.

Tries work well for string prefixes but waste space for numeric keys.

B-Trees hit the sweet spot: logarithmic height, high fanout, efficient range scans, and good performance on spinning disks and SSDs alike.

Summary

B-Trees solve a specific problem: finding data on disk with minimal I/O. Unsorted arrays give you O(n) search. Sorted arrays give you O(n) insert. Binary trees have too many levels for disk. B-Trees pack many keys per node, reducing tree height and disk seeks. B+-Trees go further by keeping values only in leaves and linking them for range scans. Variants like COW, lazy buffering, and LSM trees address the write cost.

When you create an index, you're building a B-Tree. When you query by primary key, you're traversing one. Understanding how they work helps you design schemas and write queries that use them effectively.