The CPU is probably the second most complex device known to humanity, after the human brain. Modern processors contain billions of transistors and execute billions of instructions per second. But underneath all that complexity is a simple idea: fetch instructions from memory, decode what they mean, and execute them.
We've already seen this pattern in Babbage's Analytical Engine. Now we'll watch it play out in a digital electronic CPU, using the Manchester Baby as our example: one of the first working stored-program computers, built from vacuum tubes in 1948. Modern machines are much more complicated, but they're still based on the same principles.
By the end, you'll understand the answer to one central question: how do we make a digital electronic machine automatically execute a sequence of instructions without manually flipping switches for each step?
The Analytical Engine solved this with mechanical gears and control cards. But a mechanical machine is inherently slow: moving levers, rotating gears, reading cards. A digital electronic machine, built from transistors and logic gates, can operate at the speed of electricity.
Instead of gears, we use flip-flops to store numbers. Instead of mechanical switches, we use AND/OR gates to route signals. And instead of a person reading a control card, a clock signal automatically steps through the execution of each instruction.
The plan: build a machine that repeatedly fetches instructions from memory, decodes them to understand what they mean, and executes them, all coordinated by a clock that ticks at regular intervals.
Before we look at the hardware, let's understand what programs look like from the programmer's perspective. Like the Analytical Engine, the Baby stores instructions and data in the same memory space. Unlike the Analytical Engine, it doesn't have a tape reader; instead, we manually load a program into RAM before turning the machine on.
The Baby has a complete, minimal instruction set: just seven instructions (JMP, JRP, LDN, STO, SUB, SKN, HLT). That's enough to compute anything computable, but writing programs is... distinctive.
Notice something odd? There's no plain LOAD instruction. The Baby has LDN, "load negated", which loads a value from memory AND negates it (arithmetic negation, not bit inversion). This was a hardware limitation of the time, but it became a famous quirk. Programs have to work around it: if you want to load 10, you store -10, and LDN loads it as +10.
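In Python terms (a toy illustration, with a dict standing in for RAM, not anything the Baby actually runs):

mem = {20: -10}   # store the negative of the value you actually want
acc = -mem[20]    # LDN 20 -> acc == 10
print(acc)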
The simplest program just stops:
01: HLT

When the Baby boots, the program counter steps to 1 and the machine fetches the HLT instruction. HLT stops execution immediately. Done. (The first instruction lives at address 1, not 0, because the PC is incremented before each fetch, a quirk of the Baby's timing.)
Since instructions and data share memory, we need to be careful. NUM isn't an instruction; it's a directive to place raw data in memory when the program loads.
01: HLT
02: NUM 10
03: NUM 5
04: NUM 0

When this loads into RAM, address 1 contains the HLT instruction. Addresses 2, 3, and 4 contain the numbers 10, 5, and 0. The HLT at address 1 stops execution immediately, so the CPU never tries to execute the data values as if they were instructions.
Modern "buffer overflow" attacks exploit the fact that instructions and data share memory. If you can trick a program into executing data as instructions, you might be able to run arbitrary code. This ancient quirk of von Neumann architectures is still a critical security concern in 2026.
The Baby can only subtract. Seriously. No add, no multiply. But subtraction is enough to do anything: you can implement addition by subtracting a negative number.
Here's a program to compute 10 − 3 = 7:
01: LDN 20
02: SUB 21
03: STO 22
04: LDN 22
05: HLT
20: NUM -10
21: NUM 3
22: NUM 0

Let's trace through this step by step:
1. LDN 20 loads mem[20] = -10 and negates it: Acc = 10.
2. SUB 21 subtracts mem[21] = 3: Acc = 7.
3. STO 22 writes the result to memory: mem[22] = 7.
4. LDN 22 loads the result back negated (Acc = -7); the answer, 7, remains in memory at address 22.
5. HLT stops the machine.
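To make the cycle concrete, here's a minimal Baby simulator sketched in Python. It's a toy under simplifying assumptions: instructions live in a dict already decoded into (mnemonic, operand) pairs instead of 32-bit words, and JRP is left out.

def run(mem, code):
    acc, pc = 0, 0
    while True:
        pc += 1                    # increment BEFORE each fetch
        op, operand = code[pc]     # fetch + decode
        if op == 'LDN':
            acc = -mem[operand]    # load negated
        elif op == 'SUB':
            acc -= mem[operand]
        elif op == 'STO':
            mem[operand] = acc
        elif op == 'JMP':
            pc = mem[operand]      # indirect: the target comes from memory
        elif op == 'SKN':
            if acc < 0:
                pc += 1            # skip the next instruction
        elif op == 'HLT':
            return acc, mem

# The 10 - 3 program from above
code = {1: ('LDN', 20), 2: ('SUB', 21), 3: ('STO', 22),
        4: ('LDN', 22), 5: ('HLT', 0)}
mem = {20: -10, 21: 3, 22: 0}
acc, mem = run(mem, code)
print(acc, mem[22])   # -> -7 7

Running it matches the trace above: the accumulator ends at -7, while the true answer, 7, sits in memory at address 22.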
Without jumps, programs run linearly from the top. With jumps, we can loop, skip, and branch based on conditions.
The Baby uses indirect jumps. The operand isn't the target address; it's the address of the target address.
01: LDN 20
02: SUB 21
03: SUB 22
04: STO 20
05: JMP 23
06: HLT
20: NUM 0
21: NUM 0
22: NUM -1
23: NUM 0

The JMP 23 instruction says: "Jump to the address stored at memory location 23." Address 23 contains 0, so the CPU jumps to address 1 (0 + 1, because the PC is incremented before the next fetch). This creates an infinite loop.
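The same indirect-jump arithmetic, as a tiny Python sketch (the dict stands in for RAM):

mem = {23: 0}   # address 23 holds the jump target
pc = mem[23]    # JMP 23 -> pc = 0
pc += 1         # the PC increments before the next fetch
print(pc)       # -> 1: execution continues at address 1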
SKN (skip next) tests if the accumulator is negative. If so, it skips the next instruction. Combined with JMP, this creates if-statements.
01: LDN 20
02: STO 23
03: SKN
04: JMP 21
05: LDN 22
06: SUB 23
07: STO 23
08: HLT
20: NUM -10
21: NUM 6
22: NUM 0
23: NUM 0

This computes the absolute value. Trace both cases: mem[20] holds -10, so LDN 20 negates it and Acc = 10; STO 23 saves it; SKN sees a non-negative accumulator and does not skip, so JMP 21 jumps (via mem[21] = 6) to address 7, which stores the positive value and halts. If mem[20] held a positive number instead, LDN would make Acc negative, SKN would skip the jump, and lines 5-7 would compute 0 - Acc to flip it back to positive before storing. Either way, mem[23] ends up holding the absolute value.
Programmers write assembly (human-readable mnemonics). The CPU executes machine code (binary). An assembler translates between them.
The Baby uses 32-bit words. Each instruction is encoded with 3 bits for the opcode and 13 bits for the operand; the remaining bits are unused. The Manchester Baby displayed bits in an unusual order (bit 0 on the left, not the right), so its binary looks strange compared to modern conventions, but the encoding itself is straightforward.
Here's how you translate assembly to machine code:
dct_inst = {
    'JMP': 0, 'JRP': 1, 'LDN': 2, 'STO': 3,
    'SUB': 4, 'SKN': 6, 'HLT': 7
}

def sext(num, width):
    # Sign-extend to handle negative numbers (two's complement)
    if num < 0:
        return bin((1 << (width + 1)) + num)[3:]
    return bin(num)[2:].zfill(width)

with open('program.asm') as f:
    for line in f:
        parts = line.split()
        addr = int(parts[0][:-1])  # Remove trailing colon
        if parts[1] == 'NUM':
            # Raw data: encode the value as a full 32-bit word
            code = sext(int(parts[2]), 32)
        else:
            # Instruction: 3-bit opcode followed by 13-bit operand
            opcode = bin(dct_inst[parts[1]])[2:].zfill(3)
            operand = sext(int(parts[2]) if len(parts) > 2 else 0, 13)
            code = (opcode + operand).zfill(32)
        print(addr, hex(int(code, 2)))

The CPU has fast memory built in: registers. The Baby has three main registers:
The accumulator (Acc). This is where arithmetic happens. Load a number, subtract from it, store the result. The accumulator is both an input to operations and the output.
The program counter (PC). This tracks where we are in the program, pointing to the current instruction's address. At the start of each cycle (t0) the PC is incremented; then it's used to fetch the next instruction.
The instruction register (IR). When the CPU fetches an instruction from memory, it stores a copy here. The IR is then decoded to figure out what to do.
Every CPU, whether it was built in 1948 or 2025, follows the same three steps, repeated forever: fetch, decode, execute.
The control unit orchestrates this with timing signals. At tick 1 of the cycle, the program counter is connected to the RAM address lines. At tick 2, the RAM outputs the instruction, which is copied into the IR.
In register transfer language (RTL), this looks like:
t1: RAM_A ← PC // PC value sent to RAM
t2: IR ← RAM_Dout // RAM data copied to IR
The IR is split into fields. The opcode bits are decoded to figure out which instruction was fetched. The operand bits are extracted for use during execution.
A 3-bit decoder (like the one we studied in digital logic) has 8 outputs, enough for one per opcode; the Baby uses 7 of them. When the opcode is 2, the decoder activates the LDN output. This signal is then used to control what happens next.
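Sketched in Python (list indexing stands in for the decoder's gates):

def decode(opcode):
    # 3-bit decoder: exactly one of the 8 output lines goes high
    return [1 if i == opcode else 0 for i in range(8)]

print(decode(2))  # LDN's line is active: [0, 0, 1, 0, 0, 0, 0, 0]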
This is where things differ based on what instruction was decoded. A load instruction connects RAM to the accumulator. A store instruction connects the accumulator to RAM. A jump instruction modifies the program counter.
All of this is synchronized by the clock. The control unit sends timing signals (t0, t1, t2, ...) that activate multiplexers to make temporary connections between registers, RAM, and ALU.
The control unit is where the magic happens. It sequences all the other components at the right times, making sure reads happen before writes, loads happen before arithmetic, and so on.
The simplest design uses a counter that steps through timing states (t0, t1, t2, ...) and combinational logic that activates the right connections at each state.
For the fetch stage, the same connections happen every cycle: PC to RAM address, then RAM output to IR. For the execute stage, different connections happen depending on the decoded instruction.
The Baby uses a 3-bit counter, giving 2³ = 8 possible values. That's enough states to fetch (2 ticks), decode (1 tick), execute (3–4 ticks), and update for the next cycle. More complex CPUs might use larger counters or skip the concept entirely in favor of pipelined architectures.
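Here's that sequencing sketched in Python. The signal strings mirror the RTL above, but the exact names and tick assignments are illustrative, not the Baby's real control lines:

def control_signals(t, opcode):
    # Fetch stage: the same connections on every cycle
    if t == 0:
        return ['PC <- PC + 1']       # increment before fetch
    if t == 1:
        return ['RAM_A <- PC']        # PC addresses the RAM
    if t == 2:
        return ['IR <- RAM_Dout']     # latch the instruction
    # Execute stage: depends on the decoded opcode (4 = SUB)
    if opcode == 4:
        if t == 3:
            return ['RAM_A <- IR.operand']
        if t == 4:
            return ['Acc <- Acc - RAM_Dout']
    return []                         # idle tick

for t in range(8):                    # the 3-bit counter wraps after 8 ticks
    print(t, control_signals(t, opcode=4))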
The Baby's ALU is trivial: it contains only a subtractor. Modern CPUs have adders, multipliers, dividers, bitwise operators, all running in parallel. But the principle is the same: send two operands in, compute the result, send it out.
For subtraction:
t3, SUB: RAM_A ← IR[operand] // Operand sent to RAM
t4, SUB: Acc ← Acc - RAM_Dout // Subtraction result stored
The operand tells the RAM which address to read. The RAM returns the value. The subtractor computes Acc − value. The result is written back to Acc on the next tick.
Alan Turing tested the Baby when it was first built. He wrote a program to divide 36 by 5. It's a beautiful mess, using only subtraction, negation, and jumps to implement a classic algorithm.
00: NUM 19 -- Jump target
01: LDN 31 -- Load dividend (negated)
02: STO 31 -- Store as positive
03: LDN 31 -- Load again (negated back to original)
04: SUB 30 -- Subtract divisor (shifted)
05: SKN -- If negative, we overshot
06: JMP 0 -- Otherwise, jump to line 20
07: LDN 31 -- Restore
08: STO 31 -- The dividend
09: LDN 28 -- Load quotient (negated)
10: SUB 28 -- Double it
11: STO 28 -- Store
12: LDN 31 -- Load dividend
13: SUB 31 -- Double it
14: STO 31 -- Store
15: LDN 28 -- Load quotient
16: STO 28 -- Restore shifted quotient
17: SKN -- Check if done
18: JMP 26 -- If not, loop back
19: HLT -- Stop; result in line 28
...
28: NUM 0 -- Quotient (appears here, shifted)
29: NUM 536870912 -- 2^29
30: NUM 20 -- Divisor * 2^n
31: NUM 36 -- Dividend

This 32-line program implements a shift-and-subtract division algorithm using only the Baby's primitive operations. The result (7, from 36 ÷ 5) gets stored in address 28.
Notice how the quirks force clever programming: the double-negation trick (lines 1-3) to get the dividend into the accumulator with the right sign, doubling (a left shift) by repeated subtraction (lines 10 and 13), and memory cells used as a jump-target table (addresses 0 and 26). Modern languages abstract all this away, but Turing had to think in terms of the machine itself.
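Here's the same shift-and-subtract idea sketched in Python; the *= and //= operators stand in for the doubling and un-doubling the Baby builds out of repeated subtraction:

def divide(dividend, divisor):
    # Shift the divisor left until one more doubling would overshoot
    shifted = divisor
    while shifted * 2 <= dividend:
        shifted *= 2
    # Walk back down, subtracting wherever the shifted divisor fits
    quotient = 0
    while shifted >= divisor:
        quotient *= 2              # shift the quotient left
        if dividend >= shifted:
            dividend -= shifted    # it fits: subtract and set the low bit
            quotient += 1
        shifted //= 2              # un-shift the divisor
    return quotient, dividend      # quotient and remainder

print(divide(36, 5))  # -> (7, 1)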
The Manchester Baby is a starting point. Once you understand how a minimal CPU works, the next question is: how do you make it faster, more powerful, and easier to program? Modern CPUs answer this by adding layers of sophistication on top of the basic architecture.
The Baby has only one user-accessible register: the accumulator. Every operation must move data in and out of RAM. Modern CPUs have dozens of registers. This lets programmers keep data inside the CPU instead of making expensive trips to RAM.
But more registers mean more complexity. Instructions now need to specify which register to use. A load in the Baby takes one operand (the address). With multiple registers, a load might look like LOAD R1 $50A3, specifying both the register and the address.
CPU architects face a fundamental question: should the hardware stay simple, with a small instruction set, or should the instruction set be rich and complex, at the cost of more complex hardware?
CISC (complex instruction set computing) puts the emphasis on rich instruction sets with many variations:

// CISC: a single instruction does multiple things
ADDM $A4B5 $50A3 $463F
// Adds the values from two addresses, stores the result at a third
// (this one instruction = a whole RISC program: two loads, an add, a store)

This debate raged for decades. CISC (x86 processors) tried to make programming easier by providing instructions that do complex work. RISC (reduced instruction set computing: MIPS, ARM) tried to make hardware simpler by keeping instructions basic. Modern CPUs blur the line: they use RISC-like simple instruction sets internally, but add complex instructions on top for compatibility.
In the Baby, a jump instruction goes to a hard-coded address. But what if you want to jump to a subroutine (a reusable piece of code) and then return to where you came from? You need to save the return address somewhere.
Early CPUs had a single return-address register; this allowed one level of subroutine calling, but not recursion. Modern CPUs have a stack: a region of memory where return addresses are pushed when calling a subroutine and popped when returning. This enables nested calls and recursion.
The stack is managed by a special stack pointer register that points to the top of the stack. Call and return instructions automatically push/pop addresses. Some architectures even protect the stack, preventing user code from accidentally corrupting it.
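A minimal sketch of the mechanism in Python; the memory size, addresses, and helper names are all illustrative:

ram = [0] * 32
sp = 31            # stack pointer: starts at the top, grows downward
pc = 5

def call(target):
    global pc, sp
    ram[sp] = pc + 1   # push the return address
    sp -= 1
    pc = target

def ret():
    global pc, sp
    sp += 1
    pc = ram[sp]       # pop the return address

call(20)   # enter a subroutine at address 20
call(25)   # a nested call pushes a second return address
ret()      # back to 21, inside the first subroutine
ret()      # back to 6, just after the original call
print(pc)  # -> 6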
The Baby can only do integer arithmetic. Real-world problems (scientific computing, graphics, signal processing) need fractional numbers.
Floating-point numbers (sign, exponent, mantissa) require complex hardware to add, multiply, and divide correctly. Early CPUs didn't have this; you had to buy a separate optional chip (the Intel 8087 co-processor). Modern CPUs integrate the floating-point unit (FPU) directly, with specialized registers and instructions.
Once you have a working CPU, the next challenge is speed. The Baby runs at 1 MHz. Modern CPUs run at 3+ GHz. But simply making the clock faster isn't enough anymore, physical limits prevent indefinite speedup. Instead, modern CPUs use instruction-level parallelism: doing multiple instructions' work simultaneously.
Imagine a factory assembly line. Instead of building one car completely before starting the next, workers process multiple cars simultaneously. Each worker specializes in one task (painting, assembly, testing) and does it on whichever car is currently at their station.
CPUs use the same idea. The fetch-decode-execute cycle can be split into stages. While one instruction is executing, the next can be decoded, and the next can be fetched. With a 4-stage pipeline, you can have 4 instructions in flight simultaneously.
Watch how instructions flow through the pipeline. Each stage is independent, allowing multiple instructions to be processed simultaneously.
In cycle 4, all stages are working simultaneously: Fetch on Instr 4, Decode on Instr 3, Execute on Instr 2, Write on Instr 1.
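You can reproduce that schedule with a few lines of Python (the stage names and the 4-instruction program are illustrative):

stages = ['Fetch', 'Decode', 'Execute', 'Write']
n_instr = 4

for cycle in range(1, n_instr + len(stages)):
    busy = []
    for s, stage in enumerate(stages):
        i = cycle - s                  # which instruction sits in this stage
        if 1 <= i <= n_instr:
            busy.append(f'{stage}: Instr {i}')
    print(f'Cycle {cycle}:', ' | '.join(busy))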
Pipelining roughly quadruples throughput (instructions per second) without making the clock faster. Modern CPUs have pipelines with 10+ stages, and can retire (complete) multiple instructions per cycle.
Pipelining works great for straight-line computations with no branches. But real programs branch constantly. When a branch is reached, you don't know which way it will go until you execute it, and by then you've already fetched and partially decoded instructions from what may be the wrong path.
Branching hazards: You guess which way a branch will go, prefetch those instructions, then have to throw them away if you guessed wrong.
Data hazards: Instruction 1 writes to a register, and instruction 2 reads from it. If instruction 2 starts before instruction 1 finishes writing, it gets the old value, not the new one.
Structural hazards: Two instructions want to use the same hardware (like the ALU) at the same time.
The simplest approach: assume all branches are NOT taken (fall through). For if-statements this is right only about half the time. A better strategy: most branches in real code come from loops, and loops iterate many times, so most branches ARE taken. If you always guess "taken," you're right 50-90% of the time.
The best approach: keep statistics on each branch. If a particular branch usually jumps, predict it will jump. Modern CPUs use machine learning classifiers built into hardware to predict branch outcomes with 90%+ accuracy.
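Here's the classic hardware scheme, a 2-bit saturating counter, sketched in Python; the simulated outcomes are made up to mimic a loop branch:

state = 2   # 2-bit counter: 0-1 predict not-taken, 2-3 predict taken

def predict():
    return state >= 2

def update(taken):
    global state
    state = min(state + 1, 3) if taken else max(state - 1, 0)

hits = 0
for taken in [True] * 9 + [False]:   # a loop branch: taken 9x, then exits
    hits += predict() == taken
    update(taken)
print(f'{hits}/10 correct')          # -> 9/10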
Sometimes instruction 1 takes a long time (like division), but instruction 3 is independent of it. Why wait for instruction 1 to finish before starting instruction 3?
Out-of-order execution lets the CPU reorder instructions as long as dependencies are respected. If instructions don't depend on each other, they can run in any order. A dependency graph tracks which instructions must finish before others can start. The CPU finds a schedule that keeps all its hardware as busy as possible.
This is complex: the CPU must track instruction dependencies, manage multiple in-flight instructions, and handle exceptions correctly. But it can double or triple performance for real programs.
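Here's a toy issue scheduler in Python that captures the idea; the instruction tuples, register names, and latencies are all invented for illustration:

program = [
    ('DIV r1 <- r2 / r3', {'r2', 'r3'}, 'r1', 10),  # slow divide
    ('ADD r4 <- r1 + r5', {'r1', 'r5'}, 'r4', 1),   # depends on the DIV
    ('SUB r6 <- r7 - r8', {'r7', 'r8'}, 'r6', 1),   # independent
]
ready = {'r2', 'r3', 'r5', 'r7', 'r8'}   # registers holding valid values
waiting, in_flight, cycle = list(program), [], 0

while waiting or in_flight:
    cycle += 1
    # Retire any result whose latency has elapsed
    for item in in_flight[:]:
        finish, dest = item
        if finish <= cycle:
            ready.add(dest)
            in_flight.remove(item)
    # Issue the first waiting instruction whose inputs are all ready
    for instr in waiting:
        name, reads, dest, latency = instr
        if reads <= ready:
            print(f'cycle {cycle}: issue {name}')
            in_flight.append((cycle + latency, dest))
            waiting.remove(instr)
            break

The SUB issues on cycle 2, long before the ADD that precedes it in program order, which is exactly the reordering an out-of-order core performs in hardware.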
Even with pipelining and out-of-order execution, some hardware sits idle. The fetch unit is busy during fetch but idle during execute. Why not have it fetch instructions for a second program?
Hyperthreading runs two independent instruction streams on the same physical CPU. Each has its own program counter and registers. When one stalls waiting for memory, the other can execute. From the OS's perspective, it looks like two CPUs. It isn't: the two threads share memory, caches, and the ALU. But it's a clever use of otherwise wasted resources.
The Manchester Baby was revolutionary, but it's wildly inefficient compared to modern CPUs. It runs at 1 MHz (1 million ticks per second). A modern CPU runs at 3+ GHz (3 billion ticks per second). It has 32 addresses of RAM. A modern computer has gigabytes. It can only subtract. Modern CPUs have specialized hardware for arithmetic, cryptography, and graphics.
But the core idea hasn't changed: fetch, decode, execute, forever.
Modern CPUs optimize aggressively. They predict which branches you'll take and prefetch instructions. They run multiple instructions per cycle (superscalar execution). They cache frequently-used data. They speculate aggressively and roll back when wrong. But underneath is still the same loop.
The Baby used vacuum tubes. Each tube was a switch that could be on or off, controlled by electrical signals. Transistors replaced tubes (smaller, more reliable, used less power). Modern CPUs use transistors so small that quantum effects start to matter, and future computers might exploit quantum behavior directly.
But whether your computer uses tubes, transistors, or qubits, the fundamental idea remains: a clock signal, a memory, registers, and a control unit orchestrating computation.
You've learned that a CPU isn't magic. It's a collection of simple pieces: registers made of flip-flops, multiplexers that route signals, decoders that turn bit patterns into control signals, and a clock that synchronizes everything.
You can trace through a program and predict exactly what the machine will do.
The Baby is simple enough to understand completely. Modern CPUs are billions of times more complex, but they're built from the same basic components. Understanding the simple case gives you the foundation to reason about the complex ones.
Want to build one? The Manchester Baby (and extensions like the Ferranti Mark 1) can be simulated in LogiSim. Start with registers, an ALU, and a simple control unit. Build up until you can run real programs. It's easier than you'd think, and deeply satisfying.