
Log-Structured Storage: Why Your Database Never Erases Data

You run INSERT INTO events VALUES (...) a million times per second. Each insert touches disk. How does the database keep up?

Random disk writes are slow: seeking to one location, updating a few bytes, then seeking somewhere else is what kills performance. So how do write-heavy databases like Cassandra, RocksDB, and LevelDB handle millions of writes per second?

They cheat. They never update anything in place. They only append.

This article builds up the intuition for log-structured storage from first principles. You'll see why "never erase, only append" leads to faster writes, and what tricks are needed to keep reads fast too.

Accountants don't use erasers

Imagine you're an accountant in the 1800s. Your ledger shows a balance of $1,000. A client deposits $200. What do you do?

You don't erase the $1,000 and write $1,200. You add a new line: "Deposit: +$200". To find the current balance, you read through all the entries and sum them up.

Why? Two reasons:

  1. Audit trail. If you can erase, you can hide fraud. If you can only append, every change is traceable.
  2. No seeking. You don't need to find the old entry. You just add to the end.

Databases borrowed this idea. Instead of modifying records in place, log-structured storage appends new versions. To find the current value, you read through the entries and take the most recent one.
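
A toy version of that rule in Python makes it concrete. The names and structure here are purely illustrative, not taken from any real engine: writes append to the end of a list, and reads scan backwards for the most recent entry.

# A toy append-only store: writes append, reads scan newest-to-oldest.
log = []

def put(key, value):
    log.append((key, value))          # never overwrite, just add a new entry

def get(key):
    for k, v in reversed(log):        # the most recent entry wins
        if k == key:
            return v
    return None                       # key was never written

put("balance", 1000)
put("balance", 1200)                  # shadows the old value instead of erasing it
print(get("balance"))                 # -> 1200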

Interactive demo: mutable storage (overwrite in place, history lost) vs. immutable storage (append only, full history preserved).

Try updating the same key multiple times on both sides. On the mutable side, each update overwrites the previous value, gone forever. On the immutable side, you see the complete history. The immutable approach never has to "seek" to find the right location because it just appends to the end.

The mutable approach seems simpler. Why would you want multiple copies of the same data?

Three reasons make append-only storage compelling:

  1. Sequential writes are fast. Disk drives, spinning or flash, handle sequential writes 10-100x faster than random writes. Appending to a log is sequential, while updating a page in a B-Tree means seeking to a random location.
  2. Concurrency is easier. When files are immutable, readers don't need locks. They just read. Writers don't block readers because they're writing to a new file, not modifying the old one.
  3. Crash recovery is simpler. If you crash mid-write to an immutable file, you just discard the incomplete file. The old files are still intact.

The tradeoff is that reads get more expensive. You might have to check multiple files to find a value, but there are tricks to mitigate this.

The first wall: random disk writes are slow

Say every write goes directly to disk:

write("user_1", "alice")  → seek to byte 0, write
write("user_2", "bob")    → seek to byte 100, write
write("user_1", "alice2") → seek to byte 0, overwrite

Each write requires a disk seek. On spinning disks, a seek takes ~10ms; on SSDs, ~0.1ms. At ~10ms per seek, a spinning disk tops out around 100 random writes per second, and even an SSD caps out near 10,000. Either way, random I/O is the bottleneck.

Writing every insert to disk immediately means slow random I/O. But buffering only in memory means a crash loses everything. We need fast sequential writes and durability.

The solution: buffer writes in memory, then flush them all at once.

Buffering writes in memory

Instead of writing to disk immediately, log-structured storage buffers writes in an in-memory sorted structure called a memtable. This is typically a red-black tree or skip list, something that keeps keys sorted as you insert.

Writes go to the memtable. Reads check the memtable first (it has the newest data). When the memtable fills up (typically 64-256MB), it gets flushed to disk as an immutable sorted file called an SSTable (Sorted String Table).

Interactive demo: memtable to SSTable flush. A memtable in RAM fills up, then flushes into sorted, immutable SSTables on disk.

Click "Write random key" until the memtable fills up. Watch the capacity bar grow. When it's full, the flush kicks in:

  1. The memtable entries get sorted by key
  2. They're written sequentially to a new file on disk
  3. The memtable clears, ready for new writes

The flush is fast because:

  • Data is already sorted in memory
  • We write sequentially to a new file (no seeking)
  • The old SSTable is never modified

After flushing, the SSTable is immutable. It never changes. If we need to update a key that's already in an SSTable, we just write a newer version to the memtable. The new version shadows the old one.
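
Here's a minimal sketch of that write-and-flush path in Python. A plain dict stands in for the sorted memtable, JSON files stand in for SSTables, and the tiny size limit is an illustrative assumption rather than any engine's real API.

import json

MEMTABLE_LIMIT = 4          # real engines flush at 64-256MB; tiny here for illustration
memtable = {}               # in-memory buffer of the newest versions
sstables = []               # flushed, immutable files on disk (newest last)

def put(key, value):
    memtable[key] = value                     # newest version shadows older ones
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    global memtable
    path = f"sstable-{len(sstables)}.json"
    with open(path, "w") as f:                # one sequential write of sorted entries
        json.dump(sorted(memtable.items()), f)
    sstables.append(path)                     # this file is never modified again
    memtable = {}                             # clear the buffer for new writes

for i in range(4):
    put(f"user_{i}", f"v{i}")                 # the fourth put triggers a flush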

But what if we crash before flushing?

Memtables live in RAM. RAM is volatile. If the power goes out before flushing, you lose all buffered data. We need a way to make writes durable without giving up the speed of in-memory operations.

The solution: before updating the memtable, every write gets appended to a log file on disk called the Write-Ahead Log (WAL). The WAL is append-only and sequential: writes only go to the end of the file, which is fast.

The write path now looks like this:

  1. Client sends PUT key=100, value="alice"
  2. Append to WAL (disk, sequential)
  3. Update memtable (memory)
  4. Acknowledge to client

The write is durable as soon as step 2 completes. Step 3 is just for serving reads quickly. If we crash between steps 2 and 3, we replay the WAL on startup to rebuild the memtable.
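
A sketch of steps 2-4 plus crash recovery, assuming a line-per-record text WAL; a real engine would use a binary format with checksums and handle torn writes, but the shape is the same.

import json, os

WAL_PATH = "wal.log"
memtable = {}

def put(key, value):
    with open(WAL_PATH, "a") as wal:          # step 2: sequential append on disk
        wal.write(json.dumps({"k": key, "v": value}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())                # durable once this returns
    memtable[key] = value                     # step 3: update the in-memory buffer
    return "ack"                              # step 4: acknowledge to the client

def recover():
    # Rebuild the memtable after a crash by replaying the WAL from the start.
    if os.path.exists(WAL_PATH):
        with open(WAL_PATH) as wal:
            for line in wal:
                record = json.loads(line)
                memtable[record["k"]] = record["v"]

put("user_1", "alice")                        # durable as soon as put() returns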

Interactive demo: write-ahead log. The write path runs client write, append to WAL (disk), update memtable (RAM), ACK; the WAL survives a crash, the memtable does not.

Try this sequence:

  1. Click "Write data" several times to add entries
  2. Click "Simulate crash" and watch the memtable vanish (it was only in RAM)
  3. Click "Recover from WAL" and watch the memtable rebuild from disk

The WAL survives the crash because it's on disk. The memtable is rebuilt by replaying each WAL entry. Once the memtable is flushed to an SSTable, you can truncate the WAL since it's no longer needed.

Wait, doesn't the WAL mean we're writing to disk on every insert?

Yes, but it's a sequential append. We're not seeking around the disk. And we can batch multiple writes into a single WAL flush (group commit) to amortize the I/O cost.

How do you delete from immutable storage?

If files are immutable, how do you delete something? You can't erase data from an SSTable because it's immutable. And you can't just remove the key from the memtable either. If the key exists in an older SSTable, removing it from the memtable would make the old value "resurrect" during reads.

The trick: instead of erasing, you write a special marker called a tombstone that says "this key was deleted at this timestamp." When you delete a key, you write a tombstone to the memtable. During reads, if we encounter a tombstone, we know to return "not found", even if older SSTables contain a value for that key.
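
In code, a tombstone is just a reserved marker value. A minimal sketch, where the memtable and SSTables are plain in-memory dicts for illustration:

TOMBSTONE = object()                          # reserved marker meaning "deleted"

memtable = {}                                 # newest data, checked first
sstables = [{"100": "alice", "200": "bob"}]   # older, immutable data, checked next

def delete(key):
    memtable[key] = TOMBSTONE                 # append a marker; erase nothing

def get(key):
    if key in memtable:
        value = memtable[key]
        return None if value is TOMBSTONE else value
    for table in reversed(sstables):          # fall back to older files, newest first
        if key in table:
            return table[key]
    return None

delete("100")
print(get("100"))                             # None, even though the SSTable still holds "alice"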

Interactive demo: tombstones. A memtable (checked first) sits on top of an older SSTable (checked second) holding 100: alice, 200: bob, 300: carol; step through a delete, then watch queries resolve.

Click a "Delete" button next to one of the SSTable entries. A tombstone marker appears in the memtable. Now query for that key. The search finds the tombstone and returns "DELETED" even though older data still exists in the SSTable below.

Tombstones stick around until compaction removes them along with the values they shadow. Rapidly deleting data in an LSM Tree can temporarily increase disk usage because you're adding tombstones on top of the data you're "deleting."

Now reads have a problem

Writes are fast. But if you've done many flushes, you have many SSTables. A key could be in any of them. In the worst case, you have to check the memtable plus every SSTable to find a value, or to confirm it doesn't exist.

Searching from newest to oldest

The read algorithm is straightforward:

  1. Check the memtable (newest data)
  2. Check SSTable-1 (next newest)
  3. Check SSTable-2
  4. ... keep going until you find the key or run out of files

Interactive demo: read path. Search for a key through a memtable and two SSTables, tracking steps taken and read amplification.

Select different keys and step through the search:

  • Key 250: Found in the memtable (1 step, fast!)
  • Key 50: Must check all three levels (slow)
  • Key 999: Doesn't exist anywhere, must check everything just to confirm "not found"

This is called read amplification: the ratio of data read from disk to data actually returned. With 10 SSTables, a point lookup might require 10 disk reads.
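
A sketch of that newest-to-oldest search, counting how many places get checked per lookup; the memtable and SSTables are assumed to be simple dicts here.

def get(key, memtable, sstables):
    # Search newest to oldest, counting how many places we check along the way.
    checks = 1
    if key in memtable:
        return memtable[key], checks
    for table in reversed(sstables):          # newest SSTable first
        checks += 1
        if key in table:
            return table[key], checks
    return None, checks                       # a miss pays for every level

value, checks = get("50", {"250": "v5"}, [{"50": "v0"}, {"250": "v3"}])
print(value, checks)                          # v0 3: the key only lives in the oldest file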

Skipping files we don't need to read

We can't avoid checking multiple files. But we can avoid reading files that definitely don't contain our key.

A Bloom filter is a probabilistic data structure that answers: "Is key X in this set?"

  • "Definitely not" → 100% accurate. Skip this file entirely.
  • "Maybe" → Might be a false positive. Must check the file.

Each SSTable includes a Bloom filter. Before reading an SSTable from disk, we check its Bloom filter in memory. If it says "definitely not," we skip the file, no disk I/O.
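
A minimal Bloom filter sketch, using salted SHA-256 to stand in for k independent hash functions; the size and hash count are arbitrary illustrative choices, not tuned values.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=7):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):          # k "independent" hashes via salting
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely not"; True means "maybe" (could be a false positive)
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))              # True: must read the SSTable
print(bf.might_contain("zoe"))                # almost certainly False: skip the file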

Interactive demo: Bloom filters. Three SSTables (alice/bob/carol, eve/frank, henry/iris/jack) each have a filter; the demo tracks disk reads and skipped files.

Try searching for different names:

  • "alice" → Found in SSTable-1, Bloom filter says "maybe" (correct)
  • "dave" → Bloom filter says "maybe" but key isn't there (false positive!)
  • "zoe" → All Bloom filters say "definitely not" → zero disk reads

With a 1% false positive rate, we avoid 99% of unnecessary disk reads. Bloom filters are small (~10 bits per key) and live in memory.

Files keep piling up

Left unchecked, the number of SSTables grows forever. Reads get slower (more files to check). Disk usage grows (old values and tombstones never get removed). We need a way to clean up, but cleanup means reading and rewriting data, which has its own cost.

The answer is compaction: read multiple SSTables, merge their contents (keeping only the newest version of each key), and write the result as a new SSTable. The old SSTables are then deleted.

Interactive demo: compaction. Two Level 0 SSTables (100:a, 300:c and 200:b, 300:d) merge into a single Level 1 table where key 300: "d" wins over "c".

Click "Flush new SSTable" several times to add tables to Level 0. Then click "Compact Level 0":

  1. Tables are read from disk
  2. Keys are merged (newest value wins)
  3. A single clean SSTable is written to Level 1
  4. Old tables are deleted

Compaction works like the merge step of merge sort. Since all inputs are already sorted by key, merging is a single O(n) pass that needs to hold only one cursor per input file in memory.

During compaction:

  • Duplicate keys are deduplicated (newest wins)
  • Tombstones and the values they shadow are removed
  • The result is a single, cleaner SSTable
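
A sketch of that merge over sorted runs, keeping the newest version of each key and dropping tombstones. Here heapq.merge plays the role of the merge step and the tables are small in-memory lists; real engines stream from files instead.

import heapq

TOMBSTONE = None                              # sentinel for deleted keys

def compact(sstables):
    # Merge sorted tables (oldest first) into one sorted table:
    # the newest value per key wins; tombstones and shadowed values are dropped.
    tagged = [
        [(key, -age, value) for key, value in table]
        for age, table in enumerate(sstables)
    ]
    merged, last_key = [], object()
    for key, _, value in heapq.merge(*tagged):    # one sorted pass over all inputs
        if key == last_key:
            continue                              # an older, shadowed version
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged

older = [("100", "a"), ("300", "c")]
newer = [("200", "b"), ("300", "d")]
print(compact([older, newer]))                # [('100', 'a'), ('200', 'b'), ('300', 'd')]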

The tradeoff: compaction consumes I/O and CPU. It's a background process that competes with foreground reads and writes.

You can't optimize everything at once

Every design decision in storage involves tradeoffs. The RUM Conjecture says you can optimize for two of these three, but not all:

  • Read overhead: How many places must you check to find data?
  • Update (write) overhead: How much work to insert data?
  • Memory (space) overhead: How much extra storage for redundancy?

Interactive demo: the RUM conjecture. A compaction-frequency slider shows write amplification (data rewritten during compaction), read amplification (files checked per read), and space amplification (disk overhead from duplicates). Write-heavy workloads (logs, events, time-series) suit lazy compaction; read-heavy workloads (user profiles, lookups) suit aggressive compaction.

Drag the slider and watch the tradeoffs shift:

  • Less compaction → Writes are cheap, reads are expensive, space usage is high
  • More compaction → Reads are fast, but writes pay the compaction cost

There's no free lunch. B-Trees optimize for reads (fast lookups, but writes require in-place updates). LSM Trees optimize for writes (append-only, but reads may check multiple files).

Two ways to organize the cleanup

Not all compaction is equal. The two main strategies trade off differently:

Leveled compaction (RocksDB, LevelDB) organizes SSTables into levels. Level 0 holds recently flushed tables (may overlap). Higher levels hold larger, non-overlapping tables. Compaction picks a table from level N and merges it with overlapping tables in level N+1.

  • Pro: Predictable read performance, lower space amplification
  • Con: Higher write amplification (data gets rewritten as it moves through levels)

Size-tiered compaction (Cassandra, HBase) groups tables by size. When you have 4 tables of similar size, merge them into one larger table. Larger tables get merged less frequently.

  • Pro: Lower write amplification, simpler logic
  • Con: Higher space amplification, more files to check during reads
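
A sketch of the size-tiered trigger, assuming an illustrative bucketing rule (tables within 2x of a bucket's average size count as similar) and a threshold of four tables; real engines expose these as tunable options with different defaults.

MIN_THRESHOLD = 4                             # compact once a bucket holds this many tables

def pick_compaction(table_sizes_mb):
    # Group tables into buckets of similar size; return the first bucket worth compacting.
    buckets = []
    for size in sorted(table_sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if 0.5 * avg <= size <= 2 * avg:  # "similar" = within 2x of the bucket average
                bucket.append(size)
                break
        else:
            buckets.append([size])
    for bucket in buckets:
        if len(bucket) >= MIN_THRESHOLD:
            return bucket                     # merge these into one larger table
    return None                               # nothing to do yet

print(pick_compaction([10, 11, 9, 12, 40, 160]))   # [9, 10, 11, 12]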

Leveled compaction organizes SSTables into strict levels with size ratios. Notice how all tables within a level are the same size:

Diagram: leveled compaction. L0 (may overlap): 4 tables x 10MB; L1 (non-overlapping): 10 tables x 10MB; L2 (non-overlapping): 100 tables x 10MB.

Size-tiered compaction groups tables by size instead. Notice how tables grow larger at each tier:

Diagram: size-tiered compaction. Tier 1: 4 tables x ~10MB; Tier 2 (4x larger): 4 tables x ~40MB; Tier 3 (4x larger): 4 tables x ~160MB.

When does this approach make sense?

LSM Trees excel when writes outnumber reads, since append-only writes are inherently fast. They're a natural fit when data arrives in bursts (logs, events, metrics) because the memtable buffer smooths out spikes. They also help with SSD longevity, since sequential writes wear flash cells more evenly than random ones.

LSM Trees struggle when reads vastly outnumber writes (B-Trees may be faster), when you need consistent latency (compaction can cause pauses), or when you're frequently updating the same keys (which causes write amplification during compaction).

In practice, many systems use both. PostgreSQL uses B-Trees for indexes but has a write-ahead log. MongoDB's WiredTiger engine lets you choose B-Tree or LSM per collection.

The full picture

Log-structured storage inverts the traditional database model. Instead of erasing, you append. New versions shadow old ones. Writes land in a memtable in memory, then flush to sorted, immutable SSTables on disk. A write-ahead log makes those in-memory writes durable before they're flushed. Deletions become tombstone markers. Bloom filters let you skip SSTables that definitely don't contain a key. And compaction merges files to remove old versions and keep reads fast.

The core insight is simple: random writes are expensive and sequential writes are cheap. If you can restructure your storage to convert random writes into sequential writes, you can handle much higher throughput. The tradeoff is that reads may need to check multiple files, but with Bloom filters, smart compaction, and caching, this cost is manageable for most write-heavy workloads.

Next time you see a database claim "high write throughput," check if it uses LSM Trees. It probably does.