
Wait-Free Synchronization

When locks are too slow, you need a different approach. Learn how to build concurrent data structures that guarantee every thread makes progress.


Here's a scenario that keeps systems programmers up at night:

You have a counter. Ten threads want to increment it. Each thread increments it 100 times. Simple math says the final value should be 1000. But when you run the code, you get... 934? 967? 891? What's going on?

This is the shared memory problem. When multiple threads touch the same data without coordination, things go wrong in subtle, unpredictable ways.

The traditional fix is a lock: only one thread touches the data at a time. Problem solved! But locks have their own problems. Threads wait. Performance tanks. In the worst case, your application freezes completely (hello, deadlock).

This article explores a different path: wait-free synchronization. Instead of making threads wait, we use clever atomic operations that let every thread make progress simultaneously. It's trickier to get right, but when latency matters, it's the only way.

By the end, you'll understand exactly why shared memory is dangerous, how locks solve the problem (and what new problems they create), and how to build synchronization that never blocks.

What we'll cover
  • Why shared data needs synchronization
  • How mutexes, spinlocks, and read-write locks work
  • The problems with locks: contention, priority inversion, deadlocks
  • Atomic operations and memory barriers
  • Lock-free vs wait-free: throughput vs latency tradeoffs
  • The helping technique: the secret to wait-free algorithms
  • Fast-path / Slow-path pattern: practical implementation
  • Combining and flat-combining: batching operations
  • Wait-free stacks and queues from scratch
  • Memory reclamation: hazard pointers and epoch-based reclamation

The Problem with Shared State

The Problem

When multiple threads read and write the same memory location without coordination, the result is undefined. You might get the right answer. You might not. You won't know until it's too late.

Consider a simple counter. Ten threads each increment it 100 times. The expected result is 1000. Let's see what actually happens:

race_condition.py
import threading

counter = 0

def increment():
  global counter
  for _ in range(100):
      counter += 1  # This is NOT atomic!

threads = [threading.Thread(target=increment) for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()

print(f"Expected: 1000, Got: {counter}")
# Output varies: 934, 967, 1000, 891...

The problem is that counter += 1 is not atomic. Under the hood, it's three operations: read the value, add one, write back. If two threads read the same value before either writes, one increment is lost.

Let's watch this happen step by step:

Race Condition: Step by Step
Try It Out

Step through the demo slowly. Watch what happens when Thread 2 reads the counter before Thread 1 has written its result. Both threads think they're incrementing from 0, so both write 1. One increment is lost forever.

Key Insight

A race condition occurs because read-modify-write is not a single operation. Between reading and writing, another thread can sneak in and change the value. The solution must make this entire sequence indivisible (atomic).

Want to see the chaos at scale? Here's what happens when 10 threads race:

Race Condition Simulation

Run the demo a few times without locks. Notice how the final value is almost never 1000. Now enable the lock and run again. The lock serializes access, guaranteeing correctness.

But wait... the lock fixes correctness, but it introduces new problems. Let's understand what we're dealing with.


Mutual Exclusion: The Traditional Solution

Mutual exclusion ensures that only one thread can access a shared resource at a time. The code section protected by a lock is called the critical section.

mutex_example.py
import threading

counter = 0
lock = threading.Lock()

def increment():
  global counter
  for _ in range(100):
      with lock:  # Acquire lock
          counter += 1  # Critical section
      # Lock released automatically

# Now counter will always be 1000

There are several types of locks, each with different tradeoffs:

Mutexes

A mutex (mutual exclusion lock) blocks a thread if it can't acquire the lock. The thread is put to sleep by the operating system and woken up when the lock becomes available. This is efficient when waits are long, but the sleep/wake cycle adds latency.

Spinlocks

A spinlock keeps the CPU busy in a loop, repeatedly checking if the lock is available. No OS involvement, no sleep/wake overhead. Great for very short critical sections. Terrible if you spin for too long.

Read-Write Locks

A read-write lock allows multiple readers OR a single writer. Useful when reads vastly outnumber writes. Readers don't block each other.
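
Most languages expose this directly. As a rough C++17 sketch (names like rwlock_example.cpp and the lookup table are just for illustration), std::shared_mutex gives shared access to readers and exclusive access to writers:

rwlock_example.cpp
#include <shared_mutex>
#include <string>
#include <unordered_map>

std::shared_mutex rw_lock;
std::unordered_map<std::string, int> table;

int read_value(const std::string& key) {
    std::shared_lock lock(rw_lock);   // Many readers may hold this at the same time
    auto it = table.find(key);
    return it == table.end() ? -1 : it->second;
}

void write_value(const std::string& key, int value) {
    std::unique_lock lock(rw_lock);   // A writer gets exclusive access
    table[key] = value;
}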

Here's how mutex contention plays out in practice. Step through to watch threads queue up and wait:

Mutex Acquisition Flow
Try It Out

Step through the demo. Notice how Thread 2 and Thread 3 are blocked while Thread 1 holds the lock. They can't do anything useful. This is the cost of mutual exclusion: correctness at the price of parallelism.


The Problems with Locks

Locks solve the correctness problem, but they introduce new problems that can be just as bad. In latency-sensitive systems, these problems can be deal-breakers.

Problem 1: Inefficiency

Lock acquisition isn't free. There's overhead for the atomic operations, memory barriers, and potential OS calls. Under contention, this overhead explodes.

Lock Contention: The Pileup
Average wait for the lock: 1 thread ≈ 7 ns; 10 threads ≈ 600 ns (86x); 100 threads ≈ 5 µs (714x).
Try It Out

Drag the slider from 1 to 100 threads. Watch the thread queue grow and the latency bar explode. At 100 threads, you're waiting 714x longer than with a single thread. That's the cost of contention.

Key Insight

Lock contention doesn't scale linearly. It grows superlinearly. Doubling the threads more than doubles the wait time. This is why locks become bottlenecks in high-throughput systems.

Problem 2: Priority Inversion

Priority inversion occurs when a high-priority thread is blocked waiting for a lock held by a low-priority thread. It gets worse: a medium-priority thread can preempt the low-priority thread, effectively inverting the priority scheme entirely.

Priority Inversion

Step through the demo. Notice how the High priority thread (the most important one!) ends up waiting longer than anyone because Medium preempts Low while Low holds the lock that High needs. This is priority inversion: your priorities are meaningless.

The Mars Pathfinder Bug

In 1997, NASA's Mars Pathfinder spacecraft experienced repeated system resets due to priority inversion. A high-priority bus management task was blocked by a low-priority meteorological task. A medium-priority communication task kept preempting the low-priority task, preventing it from releasing the lock. The solution was to enable priority inheritance, where a thread holding a lock temporarily inherits the priority of the highest-priority thread waiting for it.
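
On POSIX systems, priority inheritance is something you opt into when creating the mutex. A minimal sketch (error handling omitted; compiles as C or C++):

priority_inheritance.cpp
#include <pthread.h>

pthread_mutex_t shared_lock;

void init_pi_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // A low-priority holder temporarily inherits the priority of the
    // highest-priority thread currently blocked on this mutex.
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&shared_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}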

Problem 3: Convoying

Convoying happens when a thread holding a lock gets delayed (perhaps by a page fault or context switch), causing all other threads waiting for that lock to pile up behind it. When the lock is finally released, there's a thundering herd of threads all competing for it.

The result: cascading delays and poor throughput. One slow thread affects everyone.

Problem 4: Deadlocks

A deadlock occurs when two or more threads are waiting for resources held by each other, forming a cycle where nobody can proceed. The classic example is the ABBA problem.

deadlock.py
import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()

def thread_1():
    with lock_a:          # Acquire A
        time.sleep(0.1)   # Give thread_2 time to acquire B
        with lock_b:      # Wait for B... forever
            do_work()

def thread_2():
    with lock_b:          # Acquire B
        time.sleep(0.1)   # Give thread_1 time to acquire A
        with lock_a:      # Wait for A... forever
            do_work()

# DEADLOCK: Thread 1 holds A, waits for B
#           Thread 2 holds B, waits for A
ABBA Deadlock
Hint: T1 gets A, T2 gets B, then each tries for the other.

Try causing a deadlock in the demo. Have Thread 1 acquire A, then Thread 2 acquire B, then have each try to acquire the other lock. Neither can proceed. The application is frozen.
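
The classic ways out are to always acquire locks in one global order, or to acquire them as a group. In C++, std::scoped_lock does the latter with a built-in deadlock-avoidance algorithm. A sketch (do_work mirrors the placeholder above):

deadlock_fix.cpp
#include <mutex>

std::mutex lock_a, lock_b;
void do_work();  // Placeholder, as in the example above

void thread_1() {
    std::scoped_lock both(lock_a, lock_b);   // Locks both; no ABBA deadlock possible
    do_work();
}

void thread_2() {
    std::scoped_lock both(lock_b, lock_a);   // Argument order doesn't matter
    do_work();
}

This removes the deadlock, but the contention and priority-inversion problems above remain.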

Solution

Wait-free synchronization avoids all of these problems by eliminating blocking entirely. No thread ever waits for another. Every operation completes in a bounded number of steps.


Atomic Operations: The Building Blocks

To build synchronization without locks, we need operations that are atomic: indivisible, uninterruptible. Modern CPUs provide special instructions that guarantee atomicity at the hardware level.

Types of Atomic Operations

Load and Store: Read or write a value atomically. On modern CPUs, aligned reads and writes of machine-word-sized values are already atomic. But that's not enough for most synchronization needs.

Fetch-and-Modify: Read a value, modify it, and write it back, all atomically. Examples: fetch-and-add, fetch-and-subtract. Useful for counters.
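
As a sketch, here's the opening counter example redone with a hardware fetch-and-add via C++ std::atomic: no lock, no lost updates.

atomic_counter.cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> counter{0};

void increment() {
    for (int i = 0; i < 100; ++i) {
        counter.fetch_add(1, std::memory_order_relaxed);  // One indivisible read-modify-write
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 10; ++i) threads.emplace_back(increment);
    for (auto& t : threads) t.join();
    std::printf("Expected: 1000, Got: %d\n", counter.load());  // Always 1000
}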

Compare-and-Swap (CAS): The king of atomic operations. Compare a memory location to an expected value; if they match, swap in a new value. If they don't match, fail. This operation is the foundation of almost all lock-free and wait-free algorithms.

Let's step through exactly how CAS works:

Compare-And-Swap: Step by Step
Try It Out

First, walk through the "Success Path" to see CAS work normally. Then switch to "Failure Path" and watch what happens when another thread interferes. The key insight: CAS detects interference and fails gracefully.

Key Insight

CAS is the secret weapon of lock-free programming. Unlike locks, it never blocks. If another thread interferes, CAS simply fails and you retry. This turns "waiting" into "trying again", a crucial difference for latency.
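
In code, the "fail and retry" shape looks like this. A minimal C++ sketch of a lock-free increment built only from CAS (compare_exchange_weak refreshes the expected value whenever it fails):

cas_retry.cpp
#include <atomic>

std::atomic<int> counter{0};

void increment_with_cas() {
    int expected = counter.load();
    // If another thread changed counter since we read it, the CAS fails,
    // `expected` is reloaded with the current value, and we simply try again.
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
        // No blocking: every failure means some other thread made progress.
    }
}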

Want to experiment interactively? Try this playground:

Compare-And-Swap Playground

Click "Simulate Other Thread" to see what happens when someone else modifies the value between your read and your CAS. This is exactly the scenario the step-through demo illustrated.

Building a Spinlock from Atomics

To understand how atomics work in practice, let's build a spinlock. The core idea: an atomic boolean that indicates whether the lock is held. To acquire, keep trying to swap true until you get back false (meaning it was unlocked).

spinlock.py
class Spinlock:
  def __init__(self):
      self.locked = AtomicBool(False)

  def acquire(self):
      # Keep spinning until we successfully set locked to True
      while self.locked.swap(True):
          pass  # Spin!

  def release(self):
      self.locked.store(False)

Step through to watch a spinlock in action:

Spinlock Implementation
Try It Out

Step through the demo. Watch Thread 2 spin repeatedly while Thread 1 holds the lock. Notice the code highlighting that tracks exactly which line is executing at each step.

Warning

Spinlocks burn CPU cycles while waiting. Thread 2 is doing nothing useful but consuming 100% of a CPU core. Only use spinlocks for very short critical sections (less than ~100 instructions).
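
For reference, here's the same spinlock sketched with real hardware atomics in C++, using std::atomic_flag (test_and_set is the swap-to-true from the Python pseudocode):

spinlock.cpp
#include <atomic>

class Spinlock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;

public:
    void acquire() {
        // test_and_set returns the previous value: keep spinning while it was already true
        while (locked.test_and_set(std::memory_order_acquire)) {
            // Spin!
        }
    }

    void release() {
        locked.clear(std::memory_order_release);
    }
};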


Memory Barriers: Ordering Chaos

The Problem

Modern CPUs reorder instructions for performance. Your code says "write A, then write B", but the CPU might execute "write B, then write A". In single-threaded code, this is invisible. In multi-threaded code, it causes chaos.

Consider two threads. Thread 1 sets a value and then sets a flag indicating the value is ready. Thread 2 checks the flag and then reads the value. Without memory barriers, Thread 2 might see the flag as true but read an old value!

reordering.py
# Thread 1 (Writer)
value = 42
done = True    # CPU might reorder this before value = 42!

# Thread 2 (Reader)
if done:       # Might see done = True
  use(value)  # But value might still be 0!

The problem: the CPU sees no dependency between value and done, so it might reorder them. From Thread 2's perspective, done can become true before value is written.

Types of Memory Barriers

Memory barriers (or fences) prevent reordering across the barrier:

  • Acquire barrier: Operations after the barrier cannot be reordered before it
  • Release barrier: Operations before the barrier cannot be reordered after it
  • Full barrier (SeqCst): No reordering in either direction

The pattern: pair a release barrier on the write with an acquire barrier on the read. This ensures that all writes before the release are visible after the acquire.
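
Here's that pattern as a C++ sketch: the release store on done publishes the earlier write to value, and the acquire load on the reader's side is guaranteed to see it.

release_acquire.cpp
#include <atomic>

int value = 0;                    // Plain data, protected by the flag's ordering
std::atomic<bool> done{false};

void writer() {
    value = 42;                                   // (1) write the data
    done.store(true, std::memory_order_release);  // (2) publish: nothing above moves below this
}

void reader() {
    while (!done.load(std::memory_order_acquire)) {
        // Wait for the flag; nothing below moves above this load
    }
    int v = value;   // Guaranteed to see 42, never the stale 0
    (void)v;
}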

Memory Reordering
Try It Out

Run the demo several times without barriers enabled. Watch the violation counter. You'll occasionally see cases where Thread 2 reads done = true but value = 0. Then enable barriers and run again. Violations disappear.

Key Insight

Memory barriers enforce ordering. A release barrier says "all my previous writes must be visible before this one." An acquire barrier says "I must see all writes that happened before the release." Together, they synchronize memory across threads.

Memory ordering in Python

Python's Global Interpreter Lock (GIL) provides implicit memory ordering for most Python operations. However, when using ctypes, multiprocessing with shared memory, or native extensions, you need to think about memory ordering. Languages like Rust, C++, and C expose explicit memory ordering through their atomic types.


Wait-Free Synchronization

Now we arrive at the main event. Wait-free synchronization is a paradigm where every thread completes its operation in a bounded number of steps, regardless of what other threads are doing.

Progress Conditions: Throughput vs. Latency

When designing synchronization algorithms, we must choose what we optimize for: system-wide throughput or individual thread latency.

Lock-free algorithms optimize for throughput. They guarantee that at least one thread in the system makes progress in finite time. But here's the catch: if 100 threads all try to update a variable with CAS loops, one might get lucky and succeed immediately. The other 99 might retry over and over, effectively starved as they fight over the same memory location. From the system's perspective, progress is being made. From an individual thread's perspective, you're experiencing terrible tail latency.

Wait-free algorithms optimize for latency. They guarantee that every operation completes in a bounded number of steps, regardless of how many other threads are competing. No thread waits for another. No thread spins indefinitely. This is critical for real-time systems where consistent, predictable latency matters more than raw throughput.

Here's the formal distinction:

  • Blocking: A thread might wait forever (e.g., waiting for a mutex)
  • Lock-free: At least one thread makes progress in finite time. Individual threads might starve (system throughput guaranteed).
  • Wait-free: Every thread makes progress in bounded time. No starvation possible (individual latency guaranteed).
Progress Conditions
Try It Out

Click through Blocking → Lock-Free → Wait-Free. In blocking mode, two threads are stuck at 0%. In lock-free, threads might be "retrying" indefinitely. In wait-free, every thread makes progress. That's the difference.

Key Insight

Lock-free = "someone always wins" (good for throughput). Wait-free = "everyone always makes progress" (good for latency). The difference matters when you care about the slowest operation, not just the average.

The Secret Sauce: Helping

So how do we turn lock-free into wait-free? The answer is a technique called helping. Instead of letting threads fight over a resource indefinitely, faster threads help slower threads complete their work. Once the slow thread's operation is done, the fast thread proceeds with its own.

Here's the mental model:

  • Lock-free thinking: You try to open a door. It's stuck. You wait and try again and again.
  • Wait-free thinking: You see someone struggling to open the door. You help them push. They go through. Then you go through.

The mechanism works like this:

  1. Thread A wants to perform operation Op A.
  2. Thread A checks a shared state array to see if any other thread is in progress.
  3. If Thread B is in the middle of Op B, Thread A doesn't compete with B by retrying. Instead, Thread A reads B's announcement and helps complete Op B.
  4. Once Op B is done, Thread A performs Op A.
  5. All faster threads in the system eventually notice and help finish Thread B's work.
  6. This guarantees that even the slowest thread eventually completes in bounded time.

The key insight: Convert contention into cooperation. Instead of threads fighting over the same memory, they cooperate to get everyone's work done. This transforms the problem from "individual thread progress" (which is hard) to "system coordination" (which is manageable).

Wait-Free: Announcement & Helping

Step through the demo to see the helping mechanism in action. Thread 2 (the slow one) announces its operation, and Thread 1 (fast) sees the announcement and completes the operation on its behalf. Thread 2 never has to win a race: it announces once, and someone always finishes the work. This is why wait-free guarantees bounded completion time.

Consensus Numbers

Not all primitives are created equal. Each synchronization primitive has a consensus number: the maximum number of processes for which it can solve the consensus problem (getting all processes to agree on a value).

Consensus Numbers
Primitive            Consensus #
Atomic registers     1
Test-and-set         2
Fetch-and-add        2
Queues / Stacks      2
Compare-and-swap     ∞
LL/SC                ∞
Try It Out

Click on each row to see details. Notice the pattern: simple operations like read/write only work for 1 thread. CAS and LL/SC have an infinite consensus number: they can coordinate any number of threads.

Key Insight

CAS has infinite consensus number. This is why CAS is the workhorse of lock-free and wait-free programming. Any data structure you can imagine can be built with CAS, for any number of threads.
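
To see why, here's a minimal C++ sketch of consensus for any number of threads using a single CAS. EMPTY is an assumed sentinel value that no thread ever proposes:

cas_consensus.cpp
#include <atomic>

constexpr int EMPTY = -1;            // Assumed sentinel: no thread proposes -1
std::atomic<int> decision{EMPTY};

// Every thread calls decide() with its own proposal.
// Exactly one CAS succeeds; every caller returns that same winning value.
int decide(int my_proposal) {
    int expected = EMPTY;
    decision.compare_exchange_strong(expected, my_proposal);
    return decision.load();          // One load, one CAS: bounded steps, wait-free
}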


The Fast-Path / Slow-Path Pattern

The practical way to implement wait-free algorithms is with a two-tier approach: a fast path for the common case and a slow path for contention.

The Fast Path: Lock-Free with Low Contention

Most of the time, threads don't compete. A thread tries a standard atomic operation (like CAS) once or a few times. If it succeeds immediately, great! You're done with minimal overhead. This is the optimistic case.

fast_path.cpp
bool enqueue(T value) {
    // Fast path: optimistically try a plain CAS a few times
    for (int retries = 0; retries < 3; ++retries) {
        int tail = tail_.load(std::memory_order_acquire);
        int next = (tail + 1) % capacity;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // Queue full
        }
        if (tail_.compare_exchange_weak(tail, next,
                std::memory_order_release)) {
            // Sketch simplification: a production MPMC queue also needs a
            // per-slot "ready" flag so consumers never read a slot before
            // the value below has actually been stored.
            buffer[tail] = value;
            return true;  // Success!
        }
    }
    // Fast path failed repeatedly: contention is high, move to the slow path
    return enqueue_slow(value);
}

The Slow Path: Announcement and Helping

If the fast path fails repeatedly, contention is high. The thread moves to the slow path, where it announces its operation in a shared state array. Every thread periodically checks this array. When a thread sees an announced operation, it helps complete it, even if it's not its own operation.

slow_path.cpp
struct Operation {
    enum Type { NONE, ENQUEUE, DEQUEUE };
    Type type = NONE;
    T value{};
    std::atomic<bool> done{false};
};

// One announcement slot per thread; nullptr means "no pending operation"
std::atomic<Operation*> announced[MAX_THREADS];

bool enqueue_slow(T value) {
    int tid = get_thread_id();
    Operation op;
    op.type = Operation::ENQUEUE;
    op.value = value;

    announced[tid].store(&op);  // Announce the operation

    // Help every announced operation, including our own
    for (int i = 0; i < MAX_THREADS; ++i) {
        Operation* other = announced[i].load();
        if (other != nullptr && !other->done.load()) {
            help_operation(other, i);
        }
    }

    // Wait until someone (possibly ourselves) marks our operation done
    while (!op.done.load()) {
        // Spin or yield
    }

    announced[tid].store(nullptr);  // Clear the announcement
    return true;
}

This mechanism ensures fairness: even the slowest thread eventually gets helped by other threads in the system. The slowest thread completes in bounded time proportional to the number of other threads, not to the amount of contention.


Wait-Free Queues

The most common wait-free data structure is the queue. Queues are everywhere: job queues, message passing between threads, event systems. Let's build one.

The Circular Buffer

Wait-free queues are typically implemented as circular buffers: a fixed-size array with head and tail pointers. The producer writes at the tail; the consumer reads from the head. Both pointers wrap around when they reach the end of the array.

spsc_queue.py
class SPSCQueue:
  def __init__(self, capacity):
      self.buffer = [None] * capacity
      self.head = AtomicInt(0)  # Consumer reads here
      self.tail = AtomicInt(0)  # Producer writes here
      self.capacity = capacity

  def enqueue(self, item):
      tail = self.tail.load()
      next_tail = (tail + 1) % self.capacity
      if next_tail == self.head.load():
          return False  # Queue full
      self.buffer[tail] = item
      self.tail.store(next_tail)  # Release barrier
      return True

  def dequeue(self):
      head = self.head.load()
      if head == self.tail.load():  # Acquire barrier
          return None  # Queue empty
      item = self.buffer[head]
      self.head.store((head + 1) % self.capacity)
      return item

Single Producer Single Consumer (SPSC)

The simplest case: one thread produces, one thread consumes. No CAS needed! The producer only writes to tail; the consumer only writes to head. No contention.

Single Producer Single Consumer Queue

Enqueue and dequeue items in the demo. Watch the head and tail pointers move around the circular buffer. Notice that we keep one slot empty to distinguish "full" from "empty".

Multiple Producers Multiple Consumers (MPMC)

When multiple threads produce or consume, we need CAS to coordinate. A thread reads the current index, tries to CAS it to the next value. If the CAS fails, another thread got there first, and we retry.

Multi-Producer Multi-Consumer Queue

Try having multiple producers and consumers operate on the queue. Watch the CAS retry counter. When threads compete, CAS operations fail and must be retried. This is the cost of lock-free: instead of waiting on a lock, you retry atomic operations.

Bounded vs Unbounded Queues

Our examples use bounded queues with fixed capacity. Unbounded queues can grow dynamically but are harder to make truly wait-free because memory allocation itself might block. In practice, bounded queues are preferred for latency-sensitive applications.

Wait-Free Stack with Announcement

To fully understand the announcement and helping pattern, let's look at a concrete wait-free stack implementation. The stack uses a shared head pointer and an announcement array where threads declare their operations.

waitfree_stack.cpp
struct Node {
    T value;
    Node* next;
};

struct Operation {
    enum Type { NONE, PUSH, POP };
    Type type = NONE;
    T value{};                     // For PUSH
    T result{};                    // For POP
    std::atomic<bool> done{false};
};

class WaitFreeStack {
    std::atomic<Node*> head{nullptr};
    std::atomic<Operation*> announced[MAX_THREADS];  // nullptr = no pending operation

public:
    void push(T value) {
        int tid = get_thread_id();
        Operation op;
        op.type = Operation::PUSH;
        op.value = value;

        announced[tid].store(&op);  // Announce

        // Help everyone, including ourselves
        for (int i = 0; i < MAX_THREADS; ++i) {
            Operation* current = announced[i].load();
            if (current != nullptr && !current->done.load()) {
                help_operation(current);
            }
        }

        // Wait for our operation to be marked done (by us or by a helper)
        while (!op.done.load()) {
            // Spin or yield
        }

        announced[tid].store(nullptr);  // Clear announcement
    }

    bool pop(T& result) {
        int tid = get_thread_id();
        Operation op;
        op.type = Operation::POP;

        announced[tid].store(&op);  // Announce

        // Help everyone
        for (int i = 0; i < MAX_THREADS; ++i) {
            Operation* current = announced[i].load();
            if (current != nullptr && !current->done.load()) {
                help_operation(current);
            }
        }

        // Wait for our operation to be marked done
        while (!op.done.load()) {
            // Spin or yield
        }

        announced[tid].store(nullptr);  // Clear announcement

        result = op.result;
        return op.type == Operation::POP;  // Helper sets type to NONE if the stack was empty
    }

private:
    void help_operation(Operation* op) {
        if (op->type == Operation::PUSH) {
            // Help with a push
            Node* new_node = new Node{op->value, nullptr};
            Node* old_head;
            do {
                old_head = head.load();
                new_node->next = old_head;
            } while (!head.compare_exchange_weak(old_head, new_node));
        } else if (op->type == Operation::POP) {
            // Help with a pop
            Node* old_head;
            do {
                old_head = head.load();
                if (old_head == nullptr) {
                    op->type = Operation::NONE;  // Record "stack was empty"
                    op->done.store(true);
                    return;
                }
            } while (!head.compare_exchange_weak(old_head, old_head->next));
            op->result = old_head->value;
            // old_head is not freed here; see the memory reclamation section below
        }
        // Sketch simplification: a production version must CAS the done flag (or
        // version the operation) so concurrent helpers apply each op exactly once.
        op->done.store(true);
    }
};

Notice the pattern: every operation goes through the same flow:

  1. Announce the operation in the shared array
  2. Loop through all announced operations and help complete them
  3. Wait for your own operation to be marked as done (by yourself or another thread)
  4. Clear the announcement

The magic: even if you're the slowest thread, the faster threads will see your announcement and help you finish. Your operation completes in bounded time proportional to the number of threads, not to the contention level.

Combining: Turning Contention into Cooperation

Here's a clever optimization: when multiple threads announce operations simultaneously, instead of helping them one-by-one, a single combiner thread grabs all the announced operations, applies them to the data structure in a batch, and marks them all as done.

This transforms the problem from "N threads fighting over the data structure" to "N threads cooperating to let one thread do bulk work." The combiner batches multiple enqueues and dequeues into a single traversal of the queue, dramatically reducing memory contention and cache line bouncing.

combining.cpp
// Before: each thread tries to modify the queue itself.
// Result: high cache-line contention, many CAS failures.

// After: one combiner thread batches all announced operations.
bool combine_and_apply() {
    std::vector<Operation*> ops;

    // Collect all announced operations
    for (int i = 0; i < MAX_THREADS; ++i) {
        Operation* op = announced[i].load();
        if (op != nullptr && !op->done.load()) {
            ops.push_back(op);
        }
    }

    // Apply them all in one pass over the queue
    for (auto* op : ops) {
        if (op->type == Operation::ENQUEUE) {
            queue.push_back(op->value);
        } else if (op->type == Operation::DEQUEUE && !queue.empty()) {
            op->result = queue.front();
            queue.pop_front();
        }
        op->done.store(true);
    }

    return !ops.empty();
}

The beauty of combining: it doesn't change the wait-free guarantees. Every thread still completes in bounded time. But the actual performance under high contention becomes much better because threads cooperate rather than fight.


The Hidden Challenge: Memory Reclamation

Here's the problem that keeps wait-free programmers awake at night: in a lock-free data structure like a linked list, you can't delete a node just because you think no one is using it. Another thread might be in the middle of reading that node, following pointers through it, or helping you traverse it. If you free the memory, you've just created a use-after-free bug.

The Problem

Classic mutexes never had this problem. You lock, delete the node, unlock. Done. But wait-free algorithms have no locks, so you need a way to safely reclaim memory without blocking.

There are two main approaches:

Hazard Pointers

Each thread maintains a small array of hazard pointers, pointers to nodes it is currently using. Before deleting a node, a thread checks all hazard pointers in the system. If no thread has a hazard pointer to that node, it's safe to delete.

hazard_pointers.cpp
// Global table: one row of hazard-pointer slots per thread, readable by all threads
std::atomic<Node*> hazard_ptrs[MAX_THREADS][NUM_HAZARD_PTRS];

thread_local std::vector<Node*> retired_nodes;  // Nodes this thread wants to free later

void retire_node(Node* node) {
    // Collect every hazard pointer currently published by any thread
    std::vector<Node*> in_use;
    for (int t = 0; t < MAX_THREADS; ++t) {
        for (int h = 0; h < NUM_HAZARD_PTRS; ++h) {
            Node* ptr = hazard_ptrs[t][h].load();
            if (ptr != nullptr) in_use.push_back(ptr);
        }
    }

    // Delete the node only if nobody has a hazard pointer to it
    if (std::find(in_use.begin(), in_use.end(), node) == in_use.end()) {
        delete node;  // Safe!
    } else {
        retired_nodes.push_back(node);  // Keep it and try again later
    }
}

Node* read_from_list(int tid) {
    Node* ptr;
    do {
        ptr = head.load();
        hazard_ptrs[tid][0].store(ptr);  // Publish: "I am using this node"
    } while (ptr != head.load());        // Re-check: head may have changed in between
    // ... safely read from ptr ...
    hazard_ptrs[tid][0].store(nullptr);  // Done using it
    return ptr;
}

Epoch-Based Reclamation (EBR)

Divide time into epochs. Each thread announces which epoch it's in. A node can be reclaimed once all threads have passed the epoch in which it was retired. This works because if a thread is in epoch N+1, it can't have pointers to nodes retired in epoch N.

epoch_reclamation.cpp
std::atomic<uint64_t> global_epoch{0};
std::atomic<uint64_t> thread_epoch[MAX_THREADS];  // Last epoch each thread has observed
std::vector<Node*> retired[3];                    // Retired nodes, bucketed by epoch % 3

void retire_node(Node* node) {
    uint64_t e = global_epoch.load();
    retired[e % 3].push_back(node);
}

void try_advance_epoch() {
    uint64_t e = global_epoch.load();

    // Only advance once every thread has observed the current epoch
    for (int t = 0; t < MAX_THREADS; ++t) {
        if (thread_epoch[t].load() < e) return;
    }
    global_epoch.store(e + 1);

    // The bucket about to be reused holds nodes retired two epochs ago.
    // No thread can still hold a pointer to them, so they can be freed.
    for (Node* node : retired[(e + 1) % 3]) {
        delete node;
    }
    retired[(e + 1) % 3].clear();
}
C++26: Help is Coming

C++26 is likely to include std::hazard_pointer in the standard library, making safe memory reclamation much easier. Until then, you'll need to implement hazard pointers or epoch-based reclamation yourself, or use a library like Concord.

Memory reclamation is why many wait-free data structures use fixed-size pools or avoid deletion entirely. It's also why lock-free is sometimes preferred in practice: the memory management is often harder than the actual algorithm!


When to Use Wait-Free

Wait-free synchronization is powerful but complex. Use it when:

  • Strict latency requirements: No operation can afford to wait
  • High contention: Many threads accessing shared data frequently
  • Real-time systems: Bounded worst-case execution time is mandatory
  • Fault tolerance: A crashed thread shouldn't block others

Use traditional locks when:

  • Simplicity matters: Lock-based code is easier to reason about
  • Low contention: If threads rarely compete, lock overhead is minimal
  • Long critical sections: Complex operations that can't be made wait-free

Not sure which to use? Walk through this decision tree:

Which Synchronization to Use?
Do you need to share data between threads?
Try It Out

Answer the questions based on your use case. The tree will guide you to the right synchronization approach for your situation.


Summary

Whew! If you made it all the way here, thank you for sticking with it. Concurrent programming is genuinely hard, and wait-free synchronization is the deep end of the pool. Here's what we covered:

Key Insight

The Core Ideas

  1. Shared memory + multiple threads = chaos without synchronization.
  2. Locks work, but they cause waiting, deadlocks, and priority inversion.
  3. CAS (Compare-and-Swap) is the magic primitive that lets you update memory without blocking.
  4. Lock-free = "someone always makes progress" (throughput).
  5. Wait-free = "everyone always makes progress" (latency).
  6. The helping technique is what makes wait-free possible: fast threads help slow threads.

The mental model that matters: convert competition into cooperation. Instead of threads fighting over the same memory, they announce what they want to do and help each other finish. That's the essence of wait-free.

Wait-free synchronization isn't magic. It trades the simplicity of locks for the complexity of careful atomic programming. But when you need guaranteed low latency, when you can't afford to wait, it's the only game in town.


Want to go deeper? "The Art of Multiprocessor Programming" by Herlihy and Shavit is the bible. It covers everything from the theory of consensus numbers to practical implementations of wait-free data structures.

And if you found this article helpful, I'd love to hear about it. Good luck with your concurrent adventures!