Transaction processing and recovery
You click "Submit Order" on a checkout page. Behind that button, the system needs to:
1. Debit $50 from your account
2. Reduce inventory by 1
3. Create an order record
4. Notify shipping
What if the power goes out after step 1 but before step 3?
You'd lose $50 with nothing to show for it. This is the fundamental problem databases exist to solve.
Starting simple: one operation
Before we tackle the checkout scenario, consider what happens when you save a single piece of data.
When you write balance = 800 to a database, where does that value actually go?
The write goes to RAM first because RAM is fast. Disk is slow. But here's the catch: RAM is volatile. When power goes out, everything in RAM disappears. How do we make sure our balance = 800 survives a crash?
We could immediately write to disk after every change. But disk writes are slow, 1000x slower than RAM. If we wait for disk on every operation, our database becomes unusable.
We need a better solution. We'll get there, but first let's understand the bigger picture — because this same crash problem gets much worse when multiple operations need to happen together.
Grouping operations: transactions
Back to our checkout. Those four operations aren't independent; they only make sense together. If we debit your account but fail to create the order, the system is in a broken state.
A transaction groups operations that must succeed or fail together. Either all of them happen, or none of them do.
```sql
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 50 WHERE id = 123;
UPDATE inventory SET quantity = quantity - 1 WHERE product = 'widget';
INSERT INTO orders (customer, product, amount) VALUES (123, 'widget', 50);
INSERT INTO notifications (type, order_id) VALUES ('shipping', 1001);
COMMIT;
```

But how do we guarantee "all or nothing"? If we crash after the first UPDATE but before COMMIT, how do we undo that first change?
What exactly must a transaction guarantee?
So we need "all or nothing." But that's not the only thing that can go wrong. What if two transactions run at the same time and step on each other's data? What if the system commits a transaction but the data doesn't survive a reboot? A transaction needs four guarantees, collectively called ACID:
Atomicity: all or nothing. If any step fails, all previous steps are undone. The rest of this post is largely about how databases achieve this.
Consistency: the database moves from one valid state to another. Constraints you've defined (foreign keys, unique constraints, CHECK constraints like "balance >= 0") are enforced, and a transaction that would violate any of them is rejected.
Isolation: concurrent transactions don't interfere with each other. Each sees a consistent view of the data. We'll explore exactly what "consistent view" means later.
Durability: once committed, the data survives crashes. Power failure, kernel panic, the data persists. We'll see how the write-ahead log makes this possible.
Step through this transfer and watch the total balance. Both accounts update together, and the total stays consistent:
Now watch the same transfer without atomicity. The system crashes between the debit and the credit:
The money vanishes. The debit happened, but the credit never did. Real databases wrap both operations in a transaction, so if anything fails, both are rolled back. Atomicity isn't automatic. The database must track every change and be prepared to undo it if something goes wrong.
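To make this concrete, here's a minimal sketch using Python's built-in sqlite3 module; the table, ids, and amounts are invented for illustration. A simulated failure between the debit and the credit rolls the whole transaction back, so the total balance is preserved:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

try:
    with conn:  # opens a transaction: commit on success, rollback on exception
        conn.execute("UPDATE accounts SET balance = balance - 60 WHERE id = 1")
        raise RuntimeError("crash between the debit and the credit")
        conn.execute("UPDATE accounts SET balance = balance + 60 WHERE id = 2")
except RuntimeError:
    pass  # the rollback has already happened

print(conn.execute("SELECT SUM(balance) FROM accounts").fetchone())  # (100,)
```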
Fast writes without losing data
Now we return to our earlier problem: RAM is fast but volatile, disk is slow but durable. How do we get the speed of memory writes without losing data on crash?
The naive solution, write to disk on every operation, is too slow. But keeping everything in RAM means losing data on crash.
The answer is to write to disk, but write differently.
Writing to disk without the pain
Before modifying any data in memory, the database appends a record to a write-ahead log (WAL). The WAL is an append-only file on disk.
Appending to a file is much faster than random writes. When you update data in place, the disk head has to jump to wherever that data lives on the platter. This seeking is what makes random I/O slow. Appending just writes to wherever the head already is, in sequence.
Step through the random writes and watch the disk head jump around:
Now step through sequential writes, the way a WAL works:
This is why the WAL uses sequential appends rather than updating data files directly. The write path looks like this:
The rule is simple: log the change before making the change.
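A toy version of that rule in Python (the file name, record format, and class are all invented): append the change record to an append-only log first, and only then touch the in-memory data.

```python
import json

class TinyWAL:
    def __init__(self, path="wal.log"):
        self.log = open(path, "ab")   # append-only log file
        self.data = {}                # stand-in for in-memory pages

    def set(self, key, value):
        record = {"op": "set", "key": key, "value": value}
        self.log.write((json.dumps(record) + "\n").encode())  # 1. log the change
        self.log.flush()   # hand it to the OS (true durability needs fsync; see below)
        self.data[key] = value                                 # 2. then change the data

db = TinyWAL()
db.set("balance", 800)
```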
What fsync actually does
When you commit a transaction, the database forces the log to disk. But calling write() isn't enough — there's a gap between "written" and "on disk" that trips up many people.
When your application writes data, it goes to the OS page cache first — a RAM buffer managed by the operating system. The OS flushes this buffer to the physical disk on its own schedule, which could be seconds later. If power fails before that flush, your "written" data is gone.
fsync() is the system call that closes this gap. It forces the OS to flush its buffer to the physical storage device and blocks until the data is physically on disk. Step through the layers:
Only after fsync() returns does the database acknowledge the commit to the client. This is the expensive part — but it's the price of durability. Try writing a few changes, then crash the system and walk through recovery:
The WAL is the single source of truth. The page cache is just an optimization. If we lose the cache, we replay the log. If we lose the log, we lose data.
This is why databases are paranoid about WAL durability. PostgreSQL, MySQL, and most production databases fsync the WAL on every commit by default. You can disable this for speed, but you accept the risk of data loss on crash.
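In code, the gap between "written" and "on disk" looks roughly like this; a minimal Python sketch with a placeholder file name and log record:

```python
import os

fd = os.open("wal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

os.write(fd, b"COMMIT txn 42\n")  # now in the OS page cache, not yet on disk
os.fsync(fd)                      # blocks until the device reports the data is persistent
# Only at this point is it safe to acknowledge the commit to the client.
os.close(fd)
```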
What about reads?
So far we've focused on writes, but most database operations are reads. Reading from disk on every query would be painfully slow.
Databases keep frequently-accessed data in memory in a page cache (or buffer pool). When you read a record, the database checks the cache first. If it's there, you get the data immediately. If not, it loads from disk, but now the page is cached for next time.
When you modify a record, the change happens in memory first. The modified page is marked "dirty" because its in-memory version differs from the on-disk version. Dirty pages are flushed to disk periodically, not on every write. Try reading pages, modifying them, and filling the cache to see eviction in action:
But the cache can only hold so many pages. When it's full and we need to load a new page, which one do we evict? And what if the page we're evicting is dirty?
This is where the WAL saves us again. Even if a dirty page gets evicted before being flushed, the WAL has the changes. On crash recovery, we replay the WAL to reconstruct any lost modifications.
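Here's a rough sketch of the idea: a tiny LRU buffer pool where reads go through to "disk" on a miss, modified pages are marked dirty, and a dirty victim is written back when it's evicted. The page ids, capacity, and disk dictionary are all invented.

```python
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> (data, dirty?)
        self.disk = {}               # stand-in for the data files

    def get(self, page_id):
        if page_id in self.pages:                 # cache hit
            self.pages.move_to_end(page_id)
            return self.pages[page_id][0]
        data = self.disk.get(page_id, {})         # cache miss: load from "disk"
        self._put(page_id, data, dirty=False)
        return data

    def modify(self, page_id, data):
        self._put(page_id, data, dirty=True)      # change in memory, mark dirty

    def _put(self, page_id, data, dirty):
        self.pages[page_id] = (data, dirty)
        self.pages.move_to_end(page_id)
        if len(self.pages) > self.capacity:
            victim, (vdata, vdirty) = self.pages.popitem(last=False)  # evict LRU
            if vdirty:
                self.disk[victim] = vdata         # write the dirty page back first
```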
What happens when two transactions run at once?
Everything so far assumes one transaction at a time. But real databases run thousands of transactions concurrently. Without coordination, concurrent transactions can corrupt data.
Here's a concrete example: two transactions both try to withdraw $60 from a $100 balance. Without any coordination, they can interleave their reads and writes so that one withdrawal silently disappears. Step through the interleaving:
This is a lost update, where one transaction's work is silently overwritten by another. The core issue is that T2 made a decision based on a stale read. By the time T2 writes, the value it read is no longer current.
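Written out as straight-line Python (a toy replay of the interleaving, not real concurrency), the problem is easy to see:

```python
balance = 100

t1_read = balance        # T1 reads 100
t2_read = balance        # T2 reads 100 (stale by the time T2 writes)

balance = t1_read - 60   # T1 writes 40
balance = t2_read - 60   # T2 writes 40, silently overwriting T1's update

print(balance)           # 40: both withdrawals "succeeded", but only one debit stuck
```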
Preventing collisions with locks
So how do we stop this? Databases use locks. Before reading, a transaction acquires a shared lock (multiple readers allowed). Before writing, it acquires an exclusive lock (only one writer, no readers).
With locks, T2 would have to wait for T1 to finish before reading the balance, so it would see the updated value. Step through T1 and T2 alternately. Watch how they acquire locks and what happens when they conflict:
When a transaction can't get a lock it needs, it waits. If two transactions are each waiting for a lock the other holds, that's a deadlock. The database detects this and aborts one transaction to let the other proceed.
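A common way to detect it is a wait-for graph: an edge from A to B means transaction A is waiting for a lock that B holds, and any cycle is a deadlock. A minimal sketch, with made-up transaction names:

```python
def has_deadlock(waits_for):
    # waits_for maps a transaction to the transactions it is waiting on
    def reaches(start, target, seen=frozenset()):
        return any(
            nxt == target or (nxt not in seen and reaches(nxt, target, seen | {nxt}))
            for nxt in waits_for.get(start, [])
        )
    return any(reaches(txn, txn) for txn in waits_for)

print(has_deadlock({"T1": ["T2"], "T2": ["T1"]}))  # True: abort one of them
print(has_deadlock({"T1": ["T2"]}))                # False: T1 just waits
```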
The classic approach is two-phase locking (2PL). In the growing phase, the transaction acquires locks as needed but never releases any. In the shrinking phase, it releases locks and acquires no new ones; in practice, databases release everything at once at commit or abort (strict 2PL).
The critical rule is that once you release any lock, you can't acquire new ones. That rule alone guarantees serializability: the concurrent execution produces the same result as if the transactions had run one after another in some order. Holding every lock until commit goes a step further: if a transaction released a lock early, another transaction could sneak in and see its uncommitted, partial state.
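A minimal sketch of that discipline (exclusive locks only; the lock table and resource names are invented): locks are acquired on demand during the transaction and all released together at commit.

```python
import threading

lock_table = {
    "accounts:123": threading.Lock(),
    "inventory:widget": threading.Lock(),
}

class Transaction:
    def __init__(self):
        self.held = []

    def acquire(self, resource):        # growing phase
        lock = lock_table[resource]
        lock.acquire()                  # blocks if another transaction holds it
        self.held.append(lock)

    def commit(self):                   # shrinking phase: release everything at once
        for lock in reversed(self.held):
            lock.release()
        self.held.clear()

t1 = Transaction()
t1.acquire("accounts:123")
t1.acquire("inventory:widget")
# ... perform the updates ...
t1.commit()
```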
But this level of safety comes at a cost.
Can we get away with less locking?
Full serializability is expensive. Every write blocks readers. Every read blocks writers. Throughput suffers.
Most databases offer weaker isolation levels that allow more concurrency at the cost of potential anomalies, cases where concurrent execution produces results that couldn't happen in any serial order. There are three kinds of anomalies to know:
A dirty read happens when you read data that another transaction wrote but hasn't committed yet. If that transaction rolls back, you acted on data that never existed.
A non-repeatable read means reading the same row twice in one transaction and getting different values because another transaction modified and committed it in between.
A phantom read means running the same query twice and getting different rows because another transaction inserted or deleted rows that match your query.
Each isolation level prevents a different set of these anomalies:
| Level | Dirty Read | Non-Repeatable Read | Phantom |
|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | Prevented | Possible | Possible |
| Repeatable Read | Prevented | Prevented | Possible |
| Serializable | Prevented | Prevented | Prevented |
Step through this scenario at Read Committed, the default in PostgreSQL. Watch what T1 sees on its second read:
Now the same scenario at Repeatable Read. T1's second read stays the same because the database keeps a snapshot:
PostgreSQL defaults to Read Committed. MySQL InnoDB defaults to Repeatable Read. Most applications don't need Serializable, and the performance cost is significant. Choose based on your tolerance for anomalies.
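As an example of choosing a level in application code, here's a sketch against PostgreSQL via psycopg2 (the connection string is a placeholder, and the accounts table is borrowed from earlier). Repeatable Read pins the transaction to a snapshot, so the second read matches the first:

```python
import psycopg2

conn = psycopg2.connect("dbname=shop")  # placeholder DSN
cur = conn.cursor()

# psycopg2 opens the transaction implicitly, so SET TRANSACTION must be
# the first statement in it.
cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
cur.execute("SELECT balance FROM accounts WHERE id = %s", (123,))
first = cur.fetchone()
# ... another session commits an update to this row here ...
cur.execute("SELECT balance FROM accounts WHERE id = %s", (123,))
second = cur.fetchone()   # same as `first`: the snapshot doesn't move
conn.commit()
```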
What if the log gets too long?
Before we talk about the full crash recovery algorithm, we need to understand one more problem. If the database has been running for days, replaying the entire WAL on recovery would take hours.
Checkpoints solve this by flushing all dirty pages to disk and writing a checkpoint record to the WAL. During recovery, the database only needs to replay from the most recent checkpoint.
Write several operations, create a checkpoint, then write more and crash. Notice how recovery only replays operations after the checkpoint:
Without checkpoints, a database running for days would need to replay millions of operations on restart. Checkpoints trade occasional slowdown (during the checkpoint) for fast recovery.
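Sketching the checkpoint step on top of the toy WAL and buffer pool from earlier (the record format and data structures are invented): flush every dirty page, then append a checkpoint record so recovery knows where it can start.

```python
import os

def checkpoint(wal_file, buffer_pool, disk):
    # buffer_pool: page_id -> (data, dirty?), disk: page_id -> data
    for page_id, (data, dirty) in list(buffer_pool.items()):
        if dirty:
            disk[page_id] = data                  # flush the dirty page
            buffer_pool[page_id] = (data, False)  # it's clean now
    wal_file.write(b"CHECKPOINT\n")               # recovery starts from here
    wal_file.flush()
    os.fsync(wal_file.fileno())                   # the checkpoint record itself must be durable
```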
Recovering from a real crash
We've covered how WAL protects against crashes, but the actual recovery process is more nuanced. The industry-standard approach is ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), published in 1992.
After a crash, we have a WAL full of operations. Some transactions committed, some didn't. Some changes made it to disk, some didn't. How do we figure out what to redo and what to undo? Recovery has three phases:
Phase 1: Analysis
Scan the WAL from the last checkpoint. Figure out:
- Which transactions were active at crash time (need to be undone)
- Which pages might be dirty (need to be redone)
Phase 2: Redo
Replay the WAL from the earliest relevant point. Reapply all changes, including those from uncommitted transactions. This restores the database to its pre-crash state.
Phase 3: Undo
Roll back all uncommitted transactions by applying their undo records in reverse order. This removes the effects of incomplete work.
Simulate a crash and walk through the three recovery phases. Watch T1's committed changes get preserved while T2's uncommitted changes get rolled back:
We redo everything first (repeat history), then undo the uncommitted parts. This is simpler than trying to selectively replay only committed transactions.
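Here's a toy version of the three phases over a hand-written list of log records. The record format is invented; real ARIES records carry LSNs, page ids, and per-record undo information.

```python
wal = [
    {"txn": "T1", "op": "set", "key": "balance", "old": 100, "new": 40},
    {"txn": "T2", "op": "set", "key": "inventory", "old": 5, "new": 4},
    {"txn": "T1", "op": "commit"},
    # crash here: T2 never committed
]

def recover(wal):
    data = {}

    # Phase 1: Analysis -- which transactions never committed?
    committed = {r["txn"] for r in wal if r["op"] == "commit"}
    losers = {r["txn"] for r in wal if r["op"] != "commit"} - committed

    # Phase 2: Redo -- repeat history, including the losers' changes
    for r in wal:
        if r["op"] == "set":
            data[r["key"]] = r["new"]

    # Phase 3: Undo -- roll back the losers, newest change first
    for r in reversed(wal):
        if r["op"] == "set" and r["txn"] in losers:
            data[r["key"]] = r["old"]

    return data

print(recover(wal))  # {'balance': 40, 'inventory': 5}: T1 preserved, T2 rolled back
```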
The complete write path
Let's put every piece together and trace one write from client to disk:
1. Client sends a write request
2. Database acquires necessary locks
3. Database appends log record to WAL buffer
4. Database modifies page in buffer pool (marks dirty)
5. If commit: force WAL to disk (fsync)
6. Return success to client
7. Later: background writer flushes dirty pages
The page flushes (step 7) happen asynchronously. The data files might be out of date. That's fine: the WAL has everything needed to recover.
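Condensed into one hypothetical commit function (reusing the toy pieces from earlier; release_locks stands in for the lock manager), the ordering is the whole point:

```python
import os

def commit(txn, wal_file, log_records, release_locks):
    for record in log_records:
        wal_file.write(record)        # append to the WAL (step 3)
    wal_file.flush()
    os.fsync(wal_file.fileno())       # force the WAL to disk (step 5)
    release_locks(txn)                # strict 2PL: locks held until commit
    return "ok"                       # acknowledge the client (step 6)
    # Dirty pages are flushed later by the background writer (step 7), not here.
```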
On crash recovery:
- Database restarts
- Find last checkpoint in WAL
- Analysis: identify dirty pages and active transactions
- Redo: replay WAL to restore buffer pool state
- Undo: roll back uncommitted transactions
- Open for business
This architecture (WAL + buffer pool + checkpoints + ARIES-style recovery) has been battle-tested for decades. PostgreSQL, MySQL, Oracle, SQL Server, and most production databases use variations of it.
Putting it all together
Database transactions are deceptively complex. A simple "debit this, credit that" requires atomicity (both operations commit together or neither does), durability (the WAL ensures committed data survives crashes), concurrency control (locks prevent conflicting access), isolation (each transaction sees consistent data), and a recovery algorithm (ARIES restores state after crashes).
Write-ahead logging lets us log changes before making them, taking advantage of fast sequential I/O. The page cache keeps hot data in memory while dirty pages flush in the background. Two-phase locking acquires locks while working and releases all at commit. Checkpoints bound recovery time by periodically flushing everything. And ARIES recovery redoes everything, then undoes the uncommitted parts.
Next time your transaction commits in a few milliseconds, remember what's happening underneath: a WAL was synced, locks were managed, and a recovery algorithm stood ready to restore order after a crash.