Caching: The Art of Remembering
You're building an app. Users click a button, and your code fetches data from a database. Simple enough.
But then you notice something. The same user keeps requesting the same data. Over and over. Each request takes 100 milliseconds (a tenth of a second) to fetch from the database. That doesn't sound like much, but computers can do a lot in that time.
What if you could just... remember the answer?
That's caching. It's one of the most powerful ideas in all of computing, and in this post we're going to build one from scratch, discover why it's surprisingly tricky, and understand the tradeoffs that every caching system must make.
The speed gap
Before we solve the problem, let's understand why it exists. Your computer has layers of storage, each trading off speed for size. Think of it like a kitchen: you keep the salt on the counter because you use it constantly. Spices you use less often go on a nearby shelf. Bulk ingredients live in a pantry down the hall. And if you need something truly obscure, you drive to the warehouse store across town.
Computer storage works the same way:
| Storage | Speed | Size | Cost |
|---|---|---|---|
| CPU Registers | < 1 ns | Bytes | $$$$$ |
| L1 Cache | ~1 ns | 64 KB | $$$$ |
| L2 Cache | ~10 ns | 256 KB | $$$ |
| L3 Cache | ~50 ns | 8 MB | $$ |
| RAM | ~100 ns | 16 GB | $ |
| SSD | ~100,000 ns | 1 TB | ¢ |
| Network | ~100,000,000 ns | Unlimited | ¢ |
A nanosecond (ns) is one billionth of a second. The numbers are in nanoseconds so you can compare them directly. Look at the gap between RAM (100 ns) and network (100,000,000 ns). Network access is one million times slower than reading from RAM. That's the difference between one second and 11 days.
The registers and L1/L2/L3 caches are tiny storage areas built into or near the processor itself, like the salt right on your counter. RAM (random access memory) is your computer's main working memory, like the full kitchen. The SSD (solid-state drive) is long-term storage, like the pantry. And the network is the warehouse across town.
So we're doing the same work over and over, fetching the same data across the slowest link in the chain. How do we avoid this?
Building our first cache
The core idea is simple: keep a copy of frequently accessed data somewhere fast. Serve from the copy when possible. Only go to the slow storage when necessary.
A cache is a lookup table: you give it a name (the key, like "user:123") and it gives you back the data (the value, like the user's profile). It sits between your application and the slow data source, which we'll call the backing store (the database, an API, a disk). When you ask for data, check the cache first. If the key is there, return the data immediately; that's a cache hit. If the key isn't there, fetch from the backing store, save a copy in the cache, and return it; that's a cache miss.
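In code, the whole idea fits in a few lines. Here's a minimal sketch, with `fetchFromDatabase` standing in for whatever slow backing store you have:

```typescript
// The cache is just a lookup table in fast memory.
const cache = new Map<string, unknown>();

// `fetchFromDatabase` is a hypothetical stand-in for your slow backing store.
async function get(key: string, fetchFromDatabase: (key: string) => Promise<unknown>) {
  if (cache.has(key)) {
    return cache.get(key); // cache hit: served from memory
  }
  const value = await fetchFromDatabase(key); // cache miss: take the slow path
  cache.set(key, value); // remember the answer for next time
  return value;
}
```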
Step through and watch the cache fill up:
Notice what happens on the last step. The cache is full. We only allocated 4 slots, and we've used them all. We can't add E without removing something first. This brings us to our first real problem.
When the cache is full
When the cache runs out of space and we need to add something new, what do we throw away? Choose badly and the data we evict gets requested again immediately, turning what would have been a fast cache hit into a slow cache miss.
We need a replacement policy, a strategy for deciding what to evict. Let's look at the three most common approaches.
LRU: evict the oldest access
The intuition behind LRU (Least Recently Used): if you haven't touched something in a while, you probably won't need it soon. Every time you access an item, it moves to the "most recently used" position. When you need to evict, you remove the item at the other end, the one nobody has touched for the longest time.
Step through and watch how accessing A moves it to the MRU position, and how inserting D causes B to get evicted:
LRU is the most popular policy because it works well for most workloads. But it has a weakness: what if a backup job scans through your entire dataset once? It would evict your entire hot working set, replacing it with data you'll never access again.
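Before moving on, here's roughly how LRU can look in code: a sketch that leans on the fact that a JavaScript Map remembers insertion order, so re-inserting a key on every access keeps the least recently used key at the front.

```typescript
// A sketch of LRU. Map iterates in insertion order, so the first key
// is always the least recently used one.
class LRUCache<K, V> {
  private entries = new Map<K, V>();
  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined; // miss
    const value = this.entries.get(key)!;
    this.entries.delete(key); // move the key to the MRU position
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // Evict the least recently used entry: the first key in the Map.
      const lru = this.entries.keys().next().value as K;
      this.entries.delete(lru);
    }
    this.entries.set(key, value);
  }
}
```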
LFU: evict the least popular
LFU (Least Frequently Used) takes a different approach. Instead of tracking when something was last accessed, it tracks how often. The item with the lowest access count gets evicted.
LFU handles the backup scan problem better: a one-time access doesn't increase the frequency count enough to evict frequently-used items. But LFU has its own problem: items that were popular but aren't anymore can stick around forever. If item A was accessed 1000 times yesterday but nobody wants it today, it still has a high count and won't be evicted. This is called cache pollution.
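For comparison, a sketch of LFU with a plain counter per key. Real implementations use frequency buckets to avoid scanning for the minimum, but the idea is the same:

```typescript
// A sketch of LFU: count accesses per key and evict the smallest count.
class LFUCache<K, V> {
  private values = new Map<K, V>();
  private counts = new Map<K, number>();
  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    if (!this.values.has(key)) return undefined; // miss
    this.counts.set(key, this.counts.get(key)! + 1);
    return this.values.get(key);
  }

  set(key: K, value: V): void {
    if (!this.values.has(key) && this.values.size >= this.capacity) {
      // Linear scan for the least frequently used key (this is the
      // per-eviction overhead discussed below).
      let victim!: K;
      let min = Infinity;
      for (const [k, count] of this.counts) {
        if (count < min) { min = count; victim = k; }
      }
      this.values.delete(victim);
      this.counts.delete(victim);
    }
    this.values.set(key, value);
    this.counts.set(key, this.counts.get(key) ?? 1);
  }
}
```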
There's also a practical downside to both LRU and LFU. LRU has to maintain a strict ordering of all items, moving an item to the front on every access. LFU needs to search for the minimum count on every eviction. Both add overhead on every cache operation. Can we do something simpler?
SIEVE: a simpler alternative
SIEVE is a newer algorithm that asks: do we really need perfect ordering? Instead of tracking exact recency or frequency, it uses a single bit per entry. When an item is accessed, its "visited" flag is set to true. When we need to evict, a pointer (called the "hand") scans through the cache. If it finds a visited entry, it clears the flag and moves on, giving that entry a second chance. The first unvisited entry it finds gets evicted.
Why does SIEVE work so well?
SIEVE captures the essence of what makes LRU effective without the overhead. Entries that are truly hot keep getting their visited flag set and survive sweep after sweep. Entries that got lucky once are evicted on the next pass.
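Here's a sketch of that eviction logic, with entries in a plain array and the lookup kept linear for clarity (a real implementation would pair this with a hash map):

```typescript
// A sketch of SIEVE: a FIFO of entries with one "visited" bit each,
// plus a hand that sweeps from older entries toward newer ones.
type SieveEntry<K, V> = { key: K; value: V; visited: boolean };

class SieveCache<K, V> {
  private entries: SieveEntry<K, V>[] = []; // index 0 = oldest, end = newest
  private hand = 0;
  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    const entry = this.entries.find((e) => e.key === key);
    if (!entry) return undefined; // miss
    entry.visited = true; // the only bookkeeping a hit needs
    return entry.value;
  }

  set(key: K, value: V): void {
    if (this.entries.length >= this.capacity) this.evict();
    this.entries.push({ key, value, visited: false }); // new entries join at the newest end
  }

  private evict(): void {
    if (this.hand >= this.entries.length) this.hand = 0;
    // Visited entries get a second chance: clear the flag, move the hand on.
    while (this.entries[this.hand].visited) {
      this.entries[this.hand].visited = false;
      this.hand = (this.hand + 1) % this.entries.length;
    }
    this.entries.splice(this.hand, 1); // evict the first unvisited entry
  }
}
```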
Comparing the three
How do these algorithms perform on the same access pattern? Edit the sequence and step through to see how each algorithm's cache state changes on every access:
How much does caching actually help?
We've picked an eviction policy. But the choice of algorithm only matters if our cache is actually being used. The real question is: how many requests are hits vs. misses?
Hit ratio: the number that matters
The hit ratio is the percentage of requests served from cache. Here's the counterintuitive part: even a "good" hit ratio like 80% means 20% of requests are slow.
Drag the slider from 95% down to 50% and watch the average latency climb:
At 95% hit ratio with 5ms (millisecond, one thousandth of a second) hits and 100ms misses, your average is only 9.75ms. Drop to 80% and it jumps to 24ms. Drop to 50% and it's 52.5ms. The median request (the one in the middle if you sorted all requests by speed) might look great since it's probably a cache hit. But the slowest requests, the ones at the 99th percentile (meaning 99% of requests are faster), are dominated by cache misses. Those slow outliers are what users actually notice.
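The arithmetic behind those numbers is just a weighted average of the two paths:

```typescript
// Average latency = (fraction of hits * hit latency) + (fraction of misses * miss latency)
function averageLatencyMs(hitRatio: number, hitMs: number, missMs: number): number {
  return hitRatio * hitMs + (1 - hitRatio) * missMs;
}

averageLatencyMs(0.95, 5, 100); // 9.75
averageLatencyMs(0.80, 5, 100); // 24
averageLatencyMs(0.50, 5, 100); // 52.5
```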
Sizing your cache to fit
The working set is the set of keys your application actively uses during normal operation. If your cache is larger than your working set, almost everything will be a hit. If it's smaller, you'll constantly evict data you need, a phenomenon called thrashing, where the cache churns through entries without ever getting to reuse them.
Step through different cache sizes for a repeating 4-key pattern and watch the hit ratio jump the moment the cache fits the working set:
You don't need to cache all data, just the data that's actively requested. Most real workloads follow a skewed distribution where a small fraction of keys account for the majority of requests: the top 10% of products might get 90% of the views.
The freshness problem
We've made things fast. But we've created a new problem: what if the backing store has newer data? The cache doesn't know. It happily serves stale data.
Imagine caching a user's email address. They update it in the database. But your cache still has the old one. Not good.
This is the fundamental tradeoff of caching: speed vs. freshness.
TTL: time-to-live
The simplest solution is to put an expiration time on cached data. After the TTL (time-to-live) expires, the entry is considered stale and must be refetched.
Set a short TTL and start the timer. Watch the entry expire:
A short TTL means fresher data but more misses and a lower hit ratio. A long TTL means a higher hit ratio but potentially stale data. There's no universal "right" TTL. It depends on how stale your application can tolerate the data being: stock prices need TTLs of seconds. User profiles can handle hours. Product catalogs can handle days.
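In code, TTL is just a timestamp stored next to the value. A sketch:

```typescript
// A sketch of a TTL cache: each entry carries an expiry timestamp,
// and expired entries are treated as misses.
type TtlEntry<V> = { value: V; expiresAt: number };

class TtlCache<K, V> {
  private entries = new Map<K, TtlEntry<V>>();

  set(key: K, value: V, ttlMs: number): void {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlMs });
  }

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined; // never cached
    if (Date.now() >= entry.expiresAt) {
      this.entries.delete(key); // stale: treat it as a miss
      return undefined;
    }
    return entry.value; // fresh enough to serve
  }
}
```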
The cache invalidation problem
"There are only two hard things in Computer Science: cache invalidation and naming things." : Phil Karlton
TTL is a blunt instrument. What if data changes but the TTL hasn't expired yet? You could invalidate on write (explicitly remove the key from cache when the underlying data changes), write-through (update the cache at the same time you write to the database), or use a messaging system where data changes are broadcast to all cache nodes so they can evict stale entries. Each approach trades off complexity, consistency, and performance differently.
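Invalidate-on-write is the easiest of the three to picture. A sketch, reusing the hypothetical `cache` and `db` objects from the examples below:

```typescript
// A sketch of invalidate-on-write: write to the source of truth, then
// drop the cached copy so the next read repopulates it.
async function updateUserEmail(id: string, email: string) {
  await db.users.update(id, { email }); // hypothetical database call
  await cache.delete(`user:${id}`);     // the next read will miss and refetch fresh data
}
```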
How applications talk to caches
We've been treating the cache as a black box. But there are different patterns for how your application code actually interacts with it. The pattern you choose determines who is responsible for fetching data on a miss, and who handles writes.
Cache-aside (lazy loading)
In the cache-aside pattern, your application manages the cache directly. Step through and notice who fetches from the database: the application, not the cache:
```typescript
async function getUser(id: string) {
  // 1. Check cache first
  const cached = await cache.get(`user:${id}`);
  if (cached) return cached;

  // 2. Cache miss - fetch from database
  const user = await db.users.findById(id);

  // 3. Update cache for next time
  await cache.set(`user:${id}`, user, { ttl: 3600 });
  return user;
}
```

This is the most common pattern. It's simple, the application has full control, and it works with any data store.
Read-through
In read-through caching, the cache itself handles fetching from the backing store. Your application just asks the cache. Step through and notice the difference: the cache fetches from the database, not the application:
This simplifies application code but requires a smarter cache layer.
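A sketch of what that smarter layer might look like: the cache is constructed with a loader function and calls it itself on a miss.

```typescript
// A sketch of read-through: the cache owns the "fetch on miss" logic.
class ReadThroughCache<K, V> {
  private entries = new Map<K, V>();
  constructor(private loader: (key: K) => Promise<V>) {}

  async get(key: K): Promise<V> {
    if (this.entries.has(key)) return this.entries.get(key)!; // hit
    const value = await this.loader(key); // the cache, not the app, goes to the database
    this.entries.set(key, value);
    return value;
  }
}

// Application code shrinks to a single call (db is the hypothetical database client):
// const users = new ReadThroughCache((id: string) => db.users.findById(id));
// const user = await users.get("123");
```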
Write-through
When you write data, write-through sends the update to the cache and the database synchronously. The application waits for both to complete before moving on. Step through and notice the full round trip:
Consistent, because the database is always up to date. But slower, because the application blocks until the database write finishes.
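A sketch of a write in this style, again using the hypothetical `cache` and `db` objects:

```typescript
// A sketch of write-through: both writes complete before the function returns.
async function saveUser(id: string, user: unknown) {
  await db.users.save(id, user);       // durable write to the database...
  await cache.set(`user:${id}`, user); // ...and the cache copy, kept in sync
  // Only now does the caller get its acknowledgment.
}
```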
Write-behind
Write-behind takes the opposite approach. The cache acknowledges the write immediately and flushes to the database later, in the background. Step through and notice how much sooner the application gets its acknowledgment:
Fast, because the application doesn't wait for the database. But if the cache crashes before flushing, the data written between the last flush and the crash is lost.
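A sketch of the same write under write-behind, with a buffer of pending writes flushed by a background timer:

```typescript
// A sketch of write-behind: acknowledge after the cache write, flush to
// the database later. Anything still in the buffer when the process
// crashes is lost.
const pendingWrites = new Map<string, unknown>();

async function saveUser(id: string, user: unknown) {
  await cache.set(`user:${id}`, user); // fast path: cache only
  pendingWrites.set(id, user);         // queue the database write
  // The caller gets its acknowledgment here, before the database is touched.
}

setInterval(async () => {
  for (const [id, user] of pendingWrites) {
    await db.users.save(id, user); // hypothetical database call
    pendingWrites.delete(id);
  }
}, 1000); // flush every second
```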
Multiple caches, multiple problems
Everything so far has assumed one cache. But modern computers have multiple processors, and each processor has its own small, fast hardware cache (those L1/L2 caches from the speed table). What happens when two processors cache the same data?
Say CPU1 reads the value A=10 into its cache. CPU2 also reads A=10 into its cache. Now CPU1 writes A=20, but only to its own cache. CPU2 still has the old copy: A=10. If CPU2 reads A, should it see 10 (stale) or 20 (current)?
This is the cache coherency problem. Hardware solves it with coherence protocols, the best known being MESI, where each cache line is in one of four states: Modified (this cache changed it), Exclusive (the only copy, unmodified), Shared (multiple caches have it), or Invalid (stale, don't use). Step through a scenario where reads and writes cause these state transitions:
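A toy version of that scenario, with the four states as a TypeScript enum and the transitions written out (a sketch of the idea, not a faithful hardware implementation):

```typescript
// The four MESI states a cache line can be in.
enum MesiState { Modified, Exclusive, Shared, Invalid }

// Walking through the A=10 / A=20 scenario:
// CPU1 reads A   (no other cache has it)    -> CPU1: Exclusive
// CPU2 reads A   (CPU1 shares its copy)     -> CPU1: Shared,   CPU2: Shared
// CPU1 writes 20 (other copies invalidated) -> CPU1: Modified, CPU2: Invalid
// CPU2 reads A   (CPU1 supplies the data)   -> CPU1: Shared,   CPU2: Shared, and CPU2 sees 20
```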
In distributed systems (multiple servers instead of multiple CPUs), coherency is harder. There's no shared wire for instant invalidation. Common solutions include broadcasting invalidation messages through a message queue, using short TTLs to bound staleness, and including version numbers so stale reads can be detected and rejected.
Practical gotchas
Before you go forth and cache everything, here are common pitfalls that catch people in production.
Cache stampede
When a popular cache entry expires, many requests might simultaneously try to regenerate it, all hitting the database at once. This is a cache stampede (sometimes called "thundering herd"). Step through and watch how a single TTL expiration can overwhelm the database, and how locking fixes it:
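One way the locking idea can look in application code: an in-process sketch where concurrent misses for the same key share a single database fetch (across multiple servers you'd need a lock in a shared place, such as the cache itself). It reuses the hypothetical `cache` and `db` objects from earlier:

```typescript
// A sketch of stampede protection: deduplicate concurrent regenerations
// of the same key within one process.
const inFlight = new Map<string, Promise<unknown>>();

async function getUser(id: string) {
  const key = `user:${id}`;
  const cached = await cache.get(key);
  if (cached) return cached;

  // If another request is already regenerating this key, wait for its
  // result instead of hitting the database again.
  if (!inFlight.has(key)) {
    const regenerate = (async () => {
      const user = await db.users.findById(id);
      await cache.set(key, user, { ttl: 3600 });
      return user;
    })();
    inFlight.set(key, regenerate.finally(() => inFlight.delete(key)));
  }
  return inFlight.get(key);
}
```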
Other solutions include probabilistic early expiration (each request has a small random chance of refreshing before the TTL expires, spreading the regeneration load) and background refresh (a background job proactively refreshes popular entries before they expire, so the cache never goes cold).
Cache warming
A cold cache (empty, or after a restart) has 0% hit ratio. All requests are misses, potentially overwhelming the backing store. Cache warming pre-populates the cache with expected hot data before traffic arrives, like pre-heating an oven before you bake.
Negative caching
What if you query for a user that doesn't exist? Without negative caching, every request for that non-existent user hits the database. The fix: cache the "not found" result too, with a shorter TTL.
```typescript
async function getUser(id: string) {
  // Assumes cache.get returns undefined on a miss, so a stored null
  // ("user doesn't exist") is distinguishable from "not cached yet".
  const cached = await cache.get(`user:${id}`);

  // Check for negative cache entry
  if (cached === null) return null;        // Cached "not found"
  if (cached !== undefined) return cached; // Cached user

  const user = await db.users.findById(id);
  if (user) {
    await cache.set(`user:${id}`, user, { ttl: 3600 });
  } else {
    // Cache the "not found" with shorter TTL
    await cache.set(`user:${id}`, null, { ttl: 300 });
  }
  return user;
}
```

What we built
We started with a speed problem (networks are slow, memory is fast) and built a simple lookup cache to solve it. That immediately led to the eviction problem (the cache fills up), which we solved with replacement policies: LRU for recency, LFU for frequency, SIEVE for simplicity.
Then we discovered that caching creates a freshness problem (the cache doesn't know when data changes). TTL bounds staleness, but cache invalidation remains genuinely hard. We looked at how applications talk to caches (cache-aside, read-through, write-through, write-behind), each making different tradeoffs between simplicity, consistency, and performance.
Finally, we saw that multiple caches create coherency problems, and that production systems need to handle stampedes, cold starts, and negative lookups.
The best cache is one you don't have to think about. But when things go wrong, understanding these fundamentals helps you debug and fix the issues.