Flynn's taxonomy: how computers process data in parallel

Let's say you're designing a processor from scratch. You start with the simplest thing that could work: a machine that reads one instruction at a time, operates on one piece of data at a time, and moves to the next instruction. It's slow, but it works.

Now you want to make it faster. What would you change?

That's essentially the story of computer architecture over the last 60 years. In 1966, Michael Flynn realized you could classify every approach to this problem with just two questions: How many instruction streams are there? How many data streams are there? The answers form a 2×2 grid that still describes every processor today.

We're going to rediscover that grid ourselves. We'll start with the simplest processor, run into its limitations, and invent our way toward faster architectures, the same path hardware designers actually followed. By the end, we'll encounter a problem so tricky that it forced engineers to invent a fifth category that Flynn never imagined.

The building blocks

Every concept in the rest of this article builds on four ideas.

What is an instruction?

An instruction is a single command that tells the processor to do something. Here are some examples:

  • ADD: Add two numbers together
  • LOAD: Get a value from memory
  • STORE: Save a value to memory
  • COMPARE: Check if two values are equal

Every program you've ever used (a web browser, a game, a text editor) is just a long sequence of these tiny instructions. Your processor chews through them one after another, millions or billions per second.

What is data?

Data is the actual values your program works with. Things like numbers, text, or pixels of an image. When you run ADD, you're adding two pieces of data. When you LOAD, you're fetching data from memory into a register.

A register is a small storage slot built directly into the processor, like a sticky note on a chef's workstation. It holds a single value that the processor needs right now. A typical processor has only a few dozen registers, but they're the fastest storage available.

What is a clock cycle?

Processors run on a clock, a signal that ticks at a fixed rate. Each tick is one cycle, and on each cycle the processor can do a small amount of work, like fetching an instruction, performing an addition, or reading from memory.

When you hear that a processor runs at "3 GHz," that means its clock ticks 3 billion times per second. Each tick is one opportunity to do work.

[Interactive demo: a clock signal ticking from 0 to 5 while the processor works through 3 + 4, one small step per tick.]

The clock is what makes processors predictable. Everything happens in lockstep with the ticking. When we say an operation takes "1 cycle" or "100 cycles," we're counting how many ticks the processor needs.

What is a stream?

A stream just means a sequence that flows through the processor over time, like a conveyor belt:

  • An instruction stream is the sequence of commands being executed (ADD, LOAD, STORE...)
  • A data stream is the sequence of values being processed (5, 12, 7, 23...)

Serial vs. parallel

When things happen serially (or sequentially), they happen one after another. You finish step 1, then start step 2, like a single checkout line at a store.

When things happen in parallel, they happen at the same time, like multiple checkout lines with multiple customers being helped simultaneously.

With these four ideas (instructions, data, clock cycles, and streams) plus the distinction between serial and parallel, you can classify every computer architecture we'll cover. The question is always: how many instruction streams and data streams can the processor handle at the same time?

Let's start with the simplest answer: one and one.


SISD: one thing at a time

We'll begin with the tightest constraint: just one instruction stream and one data stream. That's the simplest processor you could build. One instruction operates on one piece of data, then the next instruction executes, then the next.

This is Single Instruction, Single Data (SISD), and it's exactly the processor we described at the top of this article. Think of it like following a recipe. You do step 1, then step 2, then step 3. There's no skipping ahead, no doing two things at once. This is how the earliest computers worked, and it's still the fundamental model for understanding computation.

The von Neumann architecture

Nearly all SISD processors follow a design called von Neumann architecture, named after the computer pioneer John von Neumann. The idea behind it is pretty straightforward. Both the program's instructions and the data it works on are stored in the same memory. The processor doesn't "know" the difference. It just reads from whatever memory address it's told to.

The processor repeats a simple three-step cycle, over and over:

  1. Fetch: read the next instruction from memory
  2. Decode: figure out what that instruction means (is it an ADD? a LOAD?)
  3. Execute: carry out the instruction

A special register called the program counter (PC) keeps track of which memory address to fetch from next. After each instruction executes, the PC advances to the next one.

[Interactive demo: Von Neumann architecture — a CPU (control unit with program counter, ALU, and a register) connected by a bus to a memory holding both the program (LOAD 8, ADD 9, STORE 10, HALT) and the data at addresses 8–10.]

Step through this and watch the cycle repeat. The program counter marches through memory, and the processor treats each address as an instruction to execute. This "stored-program" concept (the idea that programs and data live in the same memory) is what makes modern computing possible. You can write a program, save it, and run it, all because the computer doesn't treat instructions as anything special.
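If you like seeing ideas as code, here's a minimal sketch of that fetch-decode-execute loop in C++. It's a toy machine with made-up opcodes (not a real instruction set), but it runs the same little program as the demo above, with instructions and data sitting side by side in one memory:

#include <cstdio>

// Toy opcodes for a pretend processor (not a real instruction set).
enum { LOAD = 1, ADD = 2, STORE = 3, HALT = 4 };

int main() {
    // One memory holds both the program and the data, von Neumann style.
    // Each instruction takes two slots here: an opcode and an address.
    int memory[16] = {
        LOAD,  8,   // address 0: copy the value at address 8 into the register
        ADD,   9,   // address 2: add the value at address 9 to the register
        STORE, 10,  // address 4: write the register back to address 10
        HALT,  0,   // address 6: stop
        5, 3, 0,    // addresses 8, 9, 10: the data
    };

    int pc  = 0;    // program counter: where to fetch the next instruction
    int reg = 0;    // the processor's single register

    for (;;) {
        int op   = memory[pc];          // FETCH the opcode...
        int addr = memory[pc + 1];      // ...and its operand
        pc += 2;                        // advance the program counter
        switch (op) {                   // DECODE and EXECUTE
            case LOAD:  reg = memory[addr];       break;
            case ADD:   reg = reg + memory[addr]; break;
            case STORE: memory[addr] = reg;       break;
            case HALT:  std::printf("memory[10] = %d\n", memory[10]);  // prints 8
                        return 0;
        }
    }
}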

Running into the wall

Now let's see what happens when our SISD processor needs to do the same operation many times. Say we want to add two arrays of numbers, element by element.

[Interactive demo: SISD sequential execution — adding the four elements of arrays A and B one pair at a time, alternating LOAD and ADD for eight cycles total.]

To add four pairs of numbers, we need eight cycles. We load, add, load, add, and so on. Every operation waits for the previous one to complete.

The nice thing about SISD is that it's simple and predictable. The same inputs always produce the same sequence of events, in the same order. Debugging is straightforward because there's no parallelism to reason about.

But look at the pattern. We're doing the exact same operation (addition) four separate times. The only thing that changes is the data. We load a pair of numbers, add them, store the result... then do it all over again with the next pair.

This feels wasteful. If every pair gets the same operation, why not do all four additions at once?

Hardware designers asked the same question. Let's relax our first constraint.


SIMD: same operation, many data at once

Now we have a new constraint: one instruction stream, but multiple data streams. Instead of applying an instruction to one piece of data at a time, we'll apply it to several pieces simultaneously. It's the same operation, just wider.

This is Single Instruction, Multiple Data (SIMD), and it was born directly from the frustration we just saw. If every element gets the same operation, there's no reason to process them one by one.

[Interactive demo: SIMD parallel data processing — all four pairs from A and B are loaded into vector registers and added with a single vector instruction.]

What makes this work is a wider kind of register. Remember that in SISD, a register holds a single number. SIMD processors introduce vector registers, which have multiple compartments that each hold a separate number side by side. A regular register is a box that holds one item; a vector register is that box divided into slots that each hold their own item.

With vector registers, a single VADD instruction adds four pairs of numbers in one cycle. The same work that took our SISD processor eight cycles now takes two. One cycle to load the vectors, and one to add them.

SIMD in your computer right now

You don't need special hardware to use SIMD because your CPU already has it built in. These capabilities have grown wider over the years through special sets of instructions:

  • SSE uses 128-bit vector registers. Since a standard number takes up 32 bits, that means 4 numbers processed at once.
  • AVX doubles the register width to 256 bits, so 8 numbers at once.
  • AVX-512 doubles it again to 512 bits, so 16 numbers at once.

The larger the vector register, the more data you can process per instruction. When your compiler detects a loop that applies the same operation to every element of an array, it may automatically rewrite it to use these SIMD instructions, a process called auto-vectorization. You write a normal loop, and the compiler silently converts it into vector operations behind the scenes.
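To make that concrete, here's a rough C++ sketch: a plain loop of the kind compilers love to auto-vectorize, next to the hand-written AVX version it roughly turns into. (This assumes an x86 CPU with AVX support; the exact instructions your compiler picks will vary.)

#include <immintrin.h>   // Intel/AMD SIMD intrinsics (requires an AVX-capable CPU)

// A plain loop: same operation on every element, no dependencies between them.
// With optimizations on, compilers will usually vectorize this automatically.
void add_arrays(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// Roughly what the vectorized version looks like when written by hand:
// each iteration handles 8 floats (256 bits / 32 bits per float) at once.
void add_arrays_avx(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats from a
        __m256 vb = _mm256_loadu_ps(b + i);   // load 8 floats from b
        __m256 vc = _mm256_add_ps(va, vb);    // 8 additions, one instruction
        _mm256_storeu_ps(c + i, vc);          // store 8 results
    }
    for (; i < n; i++) {                      // handle any leftover elements serially
        c[i] = a[i] + b[i];
    }
}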

Running into the next wall

So we've made our processor faster by widening it. But we introduced a new constraint in the process. Every element in the vector gets the same instruction.

If element 3 needs addition but element 7 needs multiplication, SIMD can't do both at once. You'd need to split the work into separate passes (one for the additions, another for the multiplications), losing the speed advantage.

This constraint (same operation, applied uniformly) is both SIMD's strength and its limitation. It works well when your data is uniform, but it falls apart when different elements need different treatment.

So what do we do when different pieces of data genuinely need different operations? We need to relax another constraint.


When does parallelism actually help?

SIMD just showed us that parallelism can speed things up. But does that mean we should parallelize everything?

It depends on one thing: are the operations independent?

Parallel-friendly workloads

// Each pixel is independent!
for (let i = 0; i < pixels.length; i++) {
  pixels[i] = pixels[i] * 1.5;
}

[Interactive demo: each of eight input pixels is multiplied by 1.5 independently to produce the output.]

Parallelism works well when the same operation applies to many independent pieces of data. These are exactly the kind of problems SIMD was designed for. In image processing, each pixel can be brightened, blurred, or color-adjusted independently. In matrix multiplication, each output element is a dot product that doesn't depend on other output elements. Physics simulations update each particle's position independently (at each timestep). And in machine learning inference, each input in a batch can be processed independently.

Sequential dependencies

// Each value depends on previous two!
for (let i = 2; i < n; i++) {
  fib[i] = fib[i-1] + fib[i-2];
}

[Interactive demo: Fibonacci — starting from fib[0] = 1 and fib[1] = 1, each new value can only be computed after the previous two exist.]

But some problems resist parallelization because each step depends on the result of the previous one. Fibonacci sequences need F(n-1) and F(n-2) before you can compute F(n). Parsing requires knowing what came before: is + addition or the unary plus operator? That depends on context. In graph traversal, which node you visit next depends on which node you're currently at. And in compression, patterns you find depend on patterns you already found.

No amount of parallel hardware helps if the problem is inherently sequential. In practice, high-performance computing means restructuring algorithms to expose parallelism, and accepting the sequential bottleneck when you can't.

Let's get back to relaxing constraints.


MIMD: independent processors

With SIMD, we relaxed the "one data stream" constraint but kept the "one instruction stream" constraint. Every processing unit still had to do the same thing. Now let's relax both constraints. That means multiple instruction streams and multiple data streams. Each processor gets its own instructions and its own data.

This is Multiple Instruction, Multiple Data (MIMD), the most flexible architecture. Multiple processors execute their own instruction streams on their own data, completely independently. Unlike SIMD, where all units move in lockstep, MIMD processors can each run a completely different program.

[Interactive demo: MIMD — four processors P0–P3 each compute one addition independently; P0 finishes in 3 cycles while P3 needs 5.]

Step through this and notice something different from the SIMD demo. Processors finish at different times. P0 completes in 3 cycles while P3 needs 5. In SIMD, all lanes move together because they're locked to the same instruction. In MIMD, each processor runs at its own pace, doing its own thing.

MIMD is everywhere

If you're using a modern computer, you're using MIMD right now. Your laptop likely has 4, 8, or more independent cores. One core might be running your web browser while another handles music playback. Beyond a single machine, MIMD also describes computer clusters (groups of machines working together on the same problem) and distributed systems (servers across data centers coordinating over a network).

The cost of independence

Here's the catch: all that independence creates a coordination problem. When multiple processors work on shared data, you need mechanisms to prevent them from stepping on each other. If two processors try to update the same value at the same time, the result is unpredictable.

There are a few common ways to handle this. Locks let only one processor access a piece of data at a time. Atomic operations complete fully without interruption, so no other processor can see a half-finished state. And message passing avoids shared memory entirely, so processors just send data to each other explicitly.
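As a small sketch of the middle option, here's what an atomic counter looks like in C++. Two threads (standing in for two MIMD cores) each add to a shared total; with a plain int this would be a data race, but the atomic version always comes out right:

#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<int> counter{0};         // atomic: each update completes fully,
                                         // so no thread ever sees a half-finished value

    auto work = [&counter] {
        for (int i = 0; i < 100000; i++) {
            counter.fetch_add(1);        // an atomic read-modify-write
        }
    };

    std::thread t1(work);                // two independent instruction streams...
    std::thread t2(work);                // ...operating on the same shared data
    t1.join();
    t2.join();

    std::printf("%d\n", counter.load()); // always 200000; a plain int could lose updates
    return 0;
}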

This coordination adds complexity that SISD and SIMD don't have, but it's what lets each processor do something completely different.


MISD: the odd one out

For completeness, there's one more combination in our 2×2 grid. Multiple Instruction, Single Data, or MISD. That means multiple processors execute different instructions on the same data.

This is rare because it's hard to find practical use cases. Why would you want multiple processors all working on the exact same piece of data?

There is one good reason though, and that's fault tolerance. You run the same computation on multiple processors using different implementations. If they produce different results, something went wrong. The Space Shuttle's flight computer used exactly this approach. Four computers ran identical calculations, and the system compared results to detect hardware failures. When lives depend on correctness, redundancy beats efficiency.

We won't dwell on MISD since it's uncommon in everyday computing. But there's something we've been quietly ignoring.


Why raw parallelism isn't everything

So far, we've been comparing architectures by counting how many operations they can do at once. SIMD processes 4 elements simultaneously. MIMD uses 4 independent cores. More parallelism equals more speed, right?

Not so fast. There's a bottleneck we've been sweeping under the rug: all those execution units need to be fed data, and getting it there is slow.

You can have thousands of parallel execution units, but if they're all waiting for data to arrive from memory, they sit idle. To understand why this happens, we need to look at where data actually lives and how fast it can travel.

The memory hierarchy

[Diagram: the memory hierarchy — registers (1 cycle, ~100 bytes), L1/L2/L3 cache (3–40 cycles, KB to MB), RAM (100–200 cycles, GB), and disk/SSD (millions of cycles, TB). Each level down is slower but larger.]

Processors are much faster than memory. A modern CPU can perform billions of operations per second, but fetching a piece of data from main memory (RAM) takes 100–200 clock cycles. Remember the register, the one we compared to a chef's sticky note? Imagine that chef can prepare a meal in seconds, but every time they need an ingredient from the storeroom, they have to wait minutes. That's the processor-memory speed gap.

To cope, computers use a hierarchy of progressively larger but slower storage. The closer storage is to the processor, the faster but smaller it is:

  • Registers (1 cycle): the tiny storage slots inside the processor we saw earlier. A few dozen of them. The sticky notes on the chef's workstation.
  • L1 cache (3–4 cycles): a small pool of fast memory (32–64 KB) that automatically holds recently used data. The chef's counter, where ingredients used a moment ago are still within arm's reach.
  • L2 cache (10–20 cycles): larger (256 KB–1 MB) but slower. A nearby shelf.
  • L3 cache (40–75 cycles): shared across all cores (8–32 MB). A shared pantry in the kitchen.
  • Main memory / RAM (100–200 cycles): gigabytes of storage. Where your program and its data live when running. The storeroom down the hall.
  • Disk / SSD (millions of cycles): terabytes. So slow we measure access time in milliseconds, not cycles. A warehouse across town.

When the processor needs data, it checks the closest storage first: L1, then L2, then L3, then RAM. Each time the data isn't found (a cache miss), the processor has to look further away, wasting more cycles waiting. This is why programs that access data sequentially (one element after the next in memory) run much faster than programs that jump around randomly. Sequential access keeps data in cache, while random access causes frequent misses.
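Here's a rough way to see this for yourself in C++ (exact numbers depend on your machine): sum the same array twice, once visiting elements in order and once in a shuffled order. The work is identical; only the memory access pattern changes.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Sum the array, visiting elements in the given order, and report how long it took.
static void timed_sum(const std::vector<int>& data, const std::vector<size_t>& order,
                      const char* label) {
    auto start = std::chrono::steady_clock::now();
    long long sum = 0;
    for (size_t i : order) sum += data[i];
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("%s: sum = %lld in %lld ms\n", label, sum, (long long)ms);
}

int main() {
    const size_t n = 1 << 24;                  // 16M ints: far bigger than any cache
    std::vector<int> data(n, 1);

    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), size_t{0});
    timed_sum(data, order, "sequential");      // data stays in cache as we stream through it

    std::shuffle(order.begin(), order.end(), std::mt19937{42});
    timed_sum(data, order, "random");          // same additions, but frequent cache misses
    return 0;
}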

What this means for parallel architectures

Memory matters for our story about parallelism because even if you have 8 cores running in parallel (MIMD) or 8 values being processed at once (SIMD), they all share the same path to main memory. This shared path has a limited capacity called memory bandwidth, which is how much data can flow from memory to the processor per second.

If your parallel workload needs more data than the memory path can deliver, the extra cores or vector lanes simply wait. It's like having 8 checkout lanes but only one door into the store. Adding more lanes doesn't help if customers are stuck waiting to get in.

This is why real-world speedups from parallelism are often less than the theoretical maximum. The bottleneck is how fast you can feed data to the processors, not how many operations they can do.

Seeing it all together

[Interactive demo: performance comparison — for 16 elements, SISD takes 16 cycles while SIMD and MIMD each take 4, a 4.0× speedup for both.]

Adjust the data size and watch the race. SIMD processes 4 elements per instruction, while MIMD uses 4 independent processors. Both achieve speedups over SISD, but real performance depends on more than just parallelism. Can you feed data to the processor fast enough? That's memory bandwidth. Is your data laid out sequentially or scattered randomly? That's cache behavior. How much time does MIMD spend coordinating between processors? And do threads in a GPU architecture take the same path through the code? That's divergence, which we'll see next.


Beyond the four categories

We've now explored four architectures by starting with constraints and relaxing them one at a time:

  • SISD uses one instruction and one data stream. It's the simplest starting point.
  • SIMD uses one instruction but many data streams. We relaxed the data constraint for uniform workloads.
  • MIMD uses many instructions and many data streams. We relaxed both constraints for full flexibility.
  • MISD uses many instructions on one data stream. It's the rare special case for fault tolerance.

That covers Flynn's original 2×2 grid from 1966. But in the decades since, a new kind of processor emerged that didn't fit neatly into any of these boxes: the GPU.

GPUs were designed to solve a specific problem, rendering 3D graphics in real time. To display a 1920×1080 image at 60 frames per second, you need to compute over 124 million pixels per second. Each pixel needs calculations for color, lighting, shadows, and textures. And most pixels can be computed independently, which makes this a massively parallel-friendly workload.

Now, which of our architectures should we use?

  • SISD is far too slow. One pixel at a time won't get us to 60 frames per second.
  • SIMD is close, but pixels sometimes need different calculations ("draw this pixel" vs. "skip this transparent pixel"). SIMD's "everyone does the same thing" constraint is too rigid.
  • MIMD is too expensive. Coordinating thousands of fully independent processors adds complexity and hardware cost we don't need for a mostly-uniform workload.

We need something in between: the efficiency of SIMD, where one instruction is broadcast to many units, but with some of MIMD's flexibility, where units can handle branching. The solution requires one more concept.

What is a thread?

Up to now, we've talked about processors executing a sequence of instructions. In SISD, one processor runs one sequence. In MIMD, multiple processors each run their own sequence. A thread is just the name we give to one of these independent sequences, a separate "train of thought" that a processor can follow.

[Interactive demo: two threads, each with its own register and its own instruction sequence — Thread 1 computes x = 5 + 1 and prints 6, while Thread 2 computes y = 10 × 2 and prints 20.]

Each thread has its own program counter (which instruction it's currently on), its own registers (scratch space for data it's working with), and its own stack (memory for tracking function calls and local variables).
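In code, the two threads from the demo might look like this (a C++ sketch; the two printouts can appear in either order, since the threads run independently):

#include <cstdio>
#include <thread>

int main() {
    std::thread t1([] {
        int x = 5;                            // lives on thread 1's own stack
        x = x + 1;
        std::printf("thread 1: x = %d\n", x); // prints 6
    });

    std::thread t2([] {
        int y = 10;                           // thread 2 has its own, separate state
        y = y * 2;
        std::printf("thread 2: y = %d\n", y); // prints 20
    });

    t1.join();                                // wait for both threads to finish
    t2.join();
    return 0;
}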

In a multi-core CPU (MIMD), each core runs a different thread. And threads are also the foundation of how GPUs work.


SIMT: the GPU model

Here's the compromise GPU designers came up with. You take SIMD's approach and broadcast one instruction to many units. But instead of processing dumb data lanes that must all do the same thing, make each unit a thread with its own program counter. That way, when threads encounter a branch (like an if statement), they can at least remember where they are, even if they can't all go different directions at the same time.

NVIDIA coined the name SIMT (Single Instruction, Multiple Threads) for this hybrid approach.

[Interactive demo: SIMT uniform execution — eight threads T0–T7 each run LOAD x[tid] on their own element at the same time.]

How warps work

GPUs organize threads into groups of 32 called warps (NVIDIA's term), and all 32 threads in a warp execute the same instruction at the same time. This sounds exactly like SIMD so far. So where does the MIMD-like flexibility come in?

It shows up when those 32 threads hit a decision point, an if statement where some threads need to go one way and others need to go another. In pure SIMD, you'd have to stop, split the data into two groups, and run each group separately. SIMT handles it through a technique called masking:

  1. Threads that take the if branch execute while the others are temporarily disabled (masked off)
  2. Then threads that take the else branch execute while the first group is masked
  3. After both paths complete, all threads sync back up and continue together

[Interactive demo: SIMT divergent execution — the same eight threads load their data, then hit a branch: some are masked off while the others execute, and then the groups swap.]

Step through the demo and watch what happens when threads diverge. This splitting of paths is what's called divergence, and it's expensive. When threads diverge, some sit idle while waiting for others, wasting execution cycles. In the worst case, if all 32 threads take different paths, you're effectively running 32x slower than if they all took the same path.

Why this trade-off is worth it

Graphics workloads are almost perfectly parallel. Most pixels can be computed independently with the same code. But graphics programs (called shaders) sometimes need conditionals:

if this pixel is transparent:
  skip it
otherwise:
  compute its color based on lighting

Pure SIMD would force you to split this into separate passes. SIMT handles it with masking. You still run a single pass, but divergent threads wait their turn. For workloads that are mostly uniform with occasional divergence, this turns out to be a practical sweet spot between SIMD's efficiency and MIMD's flexibility.
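Here's a toy model of masking in C++. Real GPUs do this in hardware across a 32-thread warp; this sketch shrinks the warp to 8 threads and plays both sides of a branch one after the other, which is exactly the cost divergence imposes:

#include <array>
#include <cstdio>

int main() {
    constexpr int WARP = 8;                       // real warps have 32 threads
    std::array<int, WARP> x = {0, 2, 4, 6, 8, 10, 12, 14};
    std::array<int, WARP> y{};
    std::array<bool, WARP> mask{};

    // One instruction, all threads: evaluate the condition "x > 4".
    for (int tid = 0; tid < WARP; tid++)
        mask[tid] = (x[tid] > 4);

    // Pass 1: only threads where the condition holds execute the "if" side.
    for (int tid = 0; tid < WARP; tid++)
        if (mask[tid]) y[tid] = x[tid] * 2;       // the rest sit idle, masked off

    // Pass 2: the masked-off threads run the "else" side while the others idle.
    for (int tid = 0; tid < WARP; tid++)
        if (!mask[tid]) y[tid] = x[tid] + 1;

    // Both paths done: the warp reconverges and continues together.
    for (int tid = 0; tid < WARP; tid++)
        std::printf("T%d: %d\n", tid, y[tid]);
    return 0;
}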


Real hardware: mixing and matching

We've been treating these architectures as distinct categories, but real processors aren't so clean. In practice, modern chips actually combine multiple approaches.

Modern CPU: MIMD + SIMD

[Interactive demo: a four-core CPU — each core has its own ALU, its own AVX vector unit, and its own L1/L2 cache.]

Your laptop's CPU is MIMD at the core level (multiple independent cores, each doing its own thing) with SIMD units inside each core that process multiple data elements per instruction. So when you hear "8-core CPU with AVX-512," that means 8 independent processors, each capable of processing 16 numbers simultaneously. That's two levels of parallelism stacked together.
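A hedged sketch of what using both levels looks like in C++: OpenMP spreads the loop across cores (the MIMD level), and within each core's chunk the compiler is free to vectorize the arithmetic (the SIMD level). This assumes a compiler with OpenMP enabled, e.g. built with -fopenmp and optimizations on.

#include <vector>

// Scale every element of a large array.
void scale(std::vector<float>& data, float factor) {
    // MIMD level: OpenMP hands each core its own slice of the loop.
    // (Without OpenMP the pragma is simply ignored and the loop runs serially.)
    #pragma omp parallel for
    for (long i = 0; i < (long)data.size(); i++) {
        // SIMD level: inside each slice, the compiler can vectorize this
        // multiplication so one instruction handles several floats at once.
        data[i] *= factor;
    }
}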

GPU: SIMT architecture

[Interactive demo: a GPU with four streaming multiprocessors (SMs), each managing multiple warps of threads.]

A GPU contains many Streaming Multiprocessors (SMs), each of which is a mini-processor that manages hundreds of threads organized into warps. A modern GPU might have 50–100+ SMs, so it can run tens of thousands of threads simultaneously. All threads in a warp execute the same instruction at the same time, with masking to handle divergence.

Vector processor: pure SIMD

[Interactive demo: a pure SIMD vector processor — VLOAD fills vector register V0 from array A and V1 from array B, then a vector ALU with 8 parallel adders produces all of V2 in a single instruction.]

Classic Cray supercomputers used pure SIMD with vector registers that could hold 64 or more elements. A single vector instruction pushes all the elements through parallel ALUs (Arithmetic Logic Units, the circuits inside a processor that perform calculations like addition and multiplication), with one ALU per vector element, all computing simultaneously.

The lines blur

The boundaries between categories aren't always clean. A modern CPU core might use a technique called pipelining. The idea is to overlap the fetch-decode-execute stages of different instructions, so that multiple instructions are in various stages of completion at the same time, like an assembly line in a factory. A "single thread" on a GPU isn't truly independent either: it executes in lockstep with the 31 other threads in its warp. But Flynn's taxonomy gives us the vocabulary to reason about how a processor approaches parallelism, even when the reality is a blend.


Summary

We started with the simplest processor (one instruction, one data) and kept relaxing constraints to build faster architectures:

                        Single Data          Multiple Data
Single Instruction      SISD (sequential)    SIMD (vector)
Multiple Instructions   MISD (rare)          MIMD (multi-core)

When even these four categories weren't enough, GPUs gave us a fifth. SIMT blends SIMD's efficiency with the ability for each thread to have its own control flow, using masking.

The taxonomy is 60 years old, but it still explains a lot. It tells you why your CPU has multiple cores, because MIMD lets it run independent tasks in parallel. It explains why compilers auto-vectorize your loops, because SIMD processes multiple data elements with one instruction. It's why GPUs are fast at graphics but slow at branchy code, because SIMT wastes cycles when threads diverge. And it's why some algorithms parallelize easily while others don't. It all comes down to whether the operations are independent.

The next time you're wondering why a GPU crushes your matrix multiply but crawls on a branchy algorithm, or why your 8-core CPU doesn't give you an 8x speedup, this grid is the place to start.