Digital CPU Design: From Gates to Programs

The CPU is probably the second most complex device known to humanity, after the human brain. Modern processors contain billions of transistors and execute billions of instructions per second. But underneath all that complexity is a simple idea: fetch instructions from memory, decode what they mean, and execute them.

We've already seen this pattern in Babbage's Analytical Engine. Now we'll watch it play out in a digital electronic CPU, using the Manchester Baby as our example: one of the first working computers, built from vacuum tubes in 1948. Modern machines are much more complicated, but they're still based on these same principles.

What we're exploring

By the end, you'll understand:

  • The seven instructions of the Baby and what each one does
  • How machine code encodes instructions into binary
  • The fetch-decode-execute cycle that powers all CPUs
  • CPU registers: accumulators, program counters, instruction registers
  • How control units coordinate all the pieces with timing signals
  • Why the Baby's quirks (negation, only subtraction) shaped its programming
  • Real programs: loops, arithmetic, conditionals on the Baby

The Problem: Making a Machine Compute

Problem

How do we make a digital electronic machine automatically execute a sequence of instructions without manually flipping switches for each step?

The Analytical Engine solved this with mechanical gears and control cards. But a mechanical machine is inherently slow: moving levers, rotating gears, reading cards. A digital electronic machine, built from transistors and logic gates, can operate at the speed of electricity.

Instead of gears, we use flip-flops to store numbers. Instead of mechanical switches, we use AND/OR gates to route signals. And instead of a person reading a control card, a clock signal automatically steps through the execution of each instruction.

Solution

Build a machine that repeatedly fetches instructions from memory, decodes them to understand what they mean, and executes them, all coordinated by a clock that ticks at regular intervals.


The Manchester Baby's Programmer Interface

Before we look at the hardware, let's understand what programs look like from the programmer's perspective. Like the Analytical Engine, the Baby stores instructions and data in the same memory space. Unlike the Analytical Engine, it doesn't have a tape reader; instead, the program is loaded into RAM by hand before the machine starts running.

The Seven Instructions

The Baby has a complete, minimal instruction set. Only seven instructions. That's enough to compute anything computable, but writing programs is... distinctive.

Manchester Baby Instruction Set

  Opcode  Mnemonic  Description
  0       JMP S     Jump to the address stored in memory cell S (indirect jump)
  1       JRP S     Add the value in cell S to the program counter (relative jump)
  2       LDN S     Load the value in cell S into the accumulator, negated
  3       STO S     Store the accumulator into cell S
  4       SUB S     Subtract the value in cell S from the accumulator
  6       SKN       Skip the next instruction if the accumulator is negative
  7       HLT       Halt the Baby and light the stop lamp

Notice something odd? There's no plain LOAD instruction. The Baby has LDN, "load negated", which loads a value from memory and negates it (flips its sign). This was a hardware limitation of the time, but it became a famous quirk. Programs have to work around it: if you want 10 in the accumulator, you store -10 in memory, and LDN loads it as +10.

A Simple Program: Hello Halt

The simplest program just stops:

01: HLT

When the Baby boots, the program counter holds 0, but it's incremented before each fetch, so the first instruction fetched is the HLT at address 1. HLT stops execution immediately. Done. (That's why Baby programs start at address 1 rather than 0: a quirk of the machine's timing.)

Constants: Data and Instructions Together

Since instructions and data share memory, we need to be careful. NUM isn't an instruction; it's an assembler directive that places raw data in memory when the program loads.

01: HLT
02: NUM 10
03: NUM 5
04: NUM 0

When this loads into RAM, address 1 contains the HLT instruction. Addresses 2, 3, and 4 contain the numbers 10, 5, and 0. The HLT at address 1 stops execution immediately, so the CPU never tries to execute the data values as if they were instructions.

Why this matters for security

Modern "buffer overflow" attacks exploit the fact that instructions and data share memory. If you can trick a program into executing data as instructions, you might be able to run arbitrary code. This ancient quirk of von Neumann architectures is still a critical security concern in 2026.


Arithmetic: The Baby Way

The Baby can only subtract. Seriously. No add, no multiply. But subtraction is enough to do anything: you can implement addition by subtracting a negative number (a − (−b) = a + b).

Here's a program to compute 10 − 3 = 7:

01: LDN 20
02: SUB 21
03: STO 22
04: LDN 22
05: HLT
20: NUM -10
21: NUM 3
22: NUM 0

Let's trace through this step by step:

  1. Line 20 places -10 in memory at address 20
  2. Line 1 (LDN 20): Load from address 20, negating it. -10 becomes +10 in the accumulator.
  3. Line 2 (SUB 21): Subtract address 21 (value 3) from the accumulator. 10 − 3 = 7.
  4. Line 3 (STO 22): Store the accumulator (7) into address 22.
  5. Line 4 (LDN 22): Load address 22 (value 7), negating it. Gives -7. (This is odd, but it's a pattern we'll need in later examples.)
  6. Line 5: Halt.
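
If you'd like to check that trace mechanically, here's a minimal Python sketch of the same data-path operations, with RAM as a plain dictionary and the negating load included. The names and layout are just for illustration, not a model of the real hardware.

# Sketch of the Baby's data path for the 10 - 3 example (illustrative only)
mem = {20: -10, 21: 3, 22: 0}

acc = -mem[20]        # LDN 20: load negated, -(-10) = 10
acc -= mem[21]        # SUB 21: 10 - 3 = 7
mem[22] = acc         # STO 22: store the result
acc = -mem[22]        # LDN 22: reading it back negates it again, giving -7

print(mem[22], acc)   # 7 -7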

Control Flow: Jumps and Branches

Without jumps, programs run linearly from the top. With jumps, we can loop, skip, and branch based on conditions.

Direct and Indirect Jumps

The Baby uses indirect jumps. The operand isn't the target address; it's the address of a memory cell that holds the target address.

01: LDN 20
02: SUB 21
03: SUB 22
04: STO 20
05: JMP 23
06: HLT
20: NUM 0
21: NUM 0
22: NUM -1
23: NUM 0

The JMP 23 instruction says: "Jump to the address stored at memory location 23." Address 23 contains 0, so the CPU jumps to address 1 (0 + 1, because the PC is incremented before the next fetch). This creates an infinite loop.
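
In Python-flavoured pseudocode (a sketch of just this one instruction, with memory as a dictionary), the indirect jump resolves like this:

# Illustrative sketch of the Baby's indirect JMP (not a full simulator)
mem = {23: 0}

pc = mem[23]   # JMP 23: the PC is loaded from RAM[23], i.e. 0
pc += 1        # the PC is incremented at the start of the next cycle
print(pc)      # 1 -> the next fetch comes from address 1, so the loop repeats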

Conditional Branches: SKN

SKN (skip if negative) tests whether the accumulator is negative. If so, it skips the next instruction. Combined with JMP, this creates if-statements.

01: LDN 20
02: STO 23
03: SKN
04: JMP 21
05: LDN 22
06: SUB 23
07: STO 23
08: HLT
20: NUM 10
21: NUM 6
22: NUM 0
23: NUM 0

This computes the absolute value:

  1. Load 10 from address 20; LDN negates it, so the accumulator now holds -10
  2. Store it (-10) into address 23
  3. Test if it's negative (SKN). It is, so skip the next instruction.
  4. Skip JMP 21, so continue to line 5. (Had the accumulator been positive, JMP 21 would have jumped, via address 21, straight to the store-and-halt at lines 7–8.)
  5. Load 0, subtract address 23 (-10), giving 10
  6. Store the result (10) into address 23
  7. Halt. Address 23 now contains 10, the absolute value.

Machine Code: From Assembly to Binary

Programmers write assembly (human-readable mnemonics). The CPU executes machine code (binary). An assembler translates between them.

The Baby uses 32-bit words. Each instruction is encoded as:

Machine Code Visualization

A 32-bit Manchester Baby instruction word, here encoding HLT (shown most-significant bit first, split into its three fields):

  111  0000000000000000  0000000000000

  Opcode  (bits 29-31): 111 = 7 (HLT)
  Unused  (bits 13-28): 0000000000000000, ignored by the CPU
  Operand (bits 0-12):  0000000000000 = 0 (HLT takes no operand)

The Manchester Baby itself displayed bits in an unusual order (bit 0 on the left, not the right), so its binary looked strange compared to modern conventions; the word above is shown most-significant bit first. The encoding itself is straightforward: 3 bits for the opcode, 13 bits for the operand, and the remaining 16 bits unused.

An Assembler in Python

Here's how you translate assembly to machine code:

import sys

# Opcodes for the Baby's seven instructions
dct_inst = {
    'JMP': 0, 'JRP': 1, 'LDN': 2, 'STO': 3,
    'SUB': 4, 'SKN': 6, 'HLT': 7
}

def sext(num, width):
    # Encode num as a width-bit two's-complement binary string
    if num < 0:
        return bin((1 << (width + 1)) + num)[3:]
    return bin(num)[2:].zfill(width)

with open(sys.argv[1]) as f:
    for line in f:
        parts = line.split()
        if not parts:
            continue
        addr = int(parts[0][:-1])  # strip the colon from "01:" (where the word goes in RAM)

        if parts[1] == 'NUM':
            # Raw data: the value fills the whole 32-bit word
            code = sext(int(parts[2]), 32)
        else:
            # Instruction: opcode in bits 29-31, operand in bits 0-12,
            # the 16 bits in between unused (matching the layout above)
            opcode = bin(dct_inst[parts[1]])[2:].zfill(3)
            operand = sext(int(parts[2]) if len(parts) > 2 else 0, 13)
            code = opcode + '0' * 16 + operand

        print(hex(int(code, 2)))  # e.g. "01: LDN 20" assembles to 0x40000014

Inside the Baby: Registers

The CPU has fast memory built-in: registers. The Baby has three main registers:

The Accumulator (Acc)

This is where arithmetic happens. Load a number, subtract from it, store the result. The accumulator is both an input to operations and the output.

The Program Counter (PC)

This tracks where we are in the program. It points to the current instruction's address. At the start of each cycle, the PC is incremented (t0). Then it's used to fetch the next instruction.

Program Counter in Action

With the program counter at 1 (pointing at the LDN instruction), memory holds:

  @1:  LDN 20   (instruction)
  @2:  SUB 21   (instruction)
  @3:  STO 22   (instruction)
  @20: -10      (data)
  @21: 3        (data)

The PC increments at the start of each cycle (t0), pointing to the next instruction to fetch.

The Instruction Register (IR)

When the CPU fetches an instruction from memory, it stores a copy here. The IR is then decoded to figure out what to do.

Register State Through Execution

When the program starts, the CPU registers and the relevant RAM cells look like this:

  CPU registers                Memory (RAM)
  PC:  0x01                    RAM[20]: -10 (0xFFFFFFF6)
  IR:  0x00000000              RAM[21]:   3 (0x00000003)
  Acc: 0x00000000              RAM[22]:   0 (0x00000000)

The Fetch-Decode-Execute Cycle

Every CPU, whether it was built in 1948 or 2025, follows the same three steps, repeated forever:

1. Fetch: Get the Instruction

The control unit orchestrates this with timing signals. At tick 1 of the cycle, the program counter is connected to the RAM address lines. At tick 2, the RAM outputs the instruction, which is copied into the IR.

Before the fetch begins, the program counter holds 0x01, pointing at memory address 1; the IR and accumulator don't yet hold anything meaningful.

In register transfer language (RTL), this looks like:

t1: RAM_A ← PC    // PC value sent to RAM
t2: IR ← RAM_Dout  // RAM data copied to IR

2. Decode: Understand the Instruction

The IR is split into bits. The opcode bits are decoded to figure out which instruction was fetched. The operand bits are extracted for use during execution.

A 3-bit decoder (like the one we studied in digital logic) has 8 outputs, one for each possible opcode; the Baby uses 7 of them. When the opcode is 2, the decoder activates the LDN output. This signal is then used to control what happens next.

3. Execute: Perform the Instruction

This is where things differ based on what instruction was decoded. A load instruction connects RAM to the accumulator. A store instruction connects the accumulator to RAM. A jump instruction modifies the program counter.

All of this is synchronized by the clock. The control unit sends timing signals (t0, t1, t2, ...) that activate multiplexers to make temporary connections between registers, RAM, and ALU.
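
To make the cycle concrete, here's a minimal Python sketch of the whole fetch-decode-execute loop, running the earlier subtraction program. RAM is a dictionary, instructions are kept as readable (mnemonic, operand) pairs rather than 32-bit words, and the individual clock ticks and timing signals are abstracted away, so treat it as an illustration of the loop, not a faithful model of the Baby's hardware.

# Minimal fetch-decode-execute sketch (illustrative only)
ram = {
    1: ('LDN', 20), 2: ('SUB', 21), 3: ('STO', 22),
    4: ('LDN', 22), 5: ('HLT', 0),
    20: -10, 21: 3, 22: 0,
}

pc, acc, running = 0, 0, True
while running:
    pc += 1                    # PC incremented at the start of each cycle (t0)
    op, operand = ram[pc]      # fetch: copy the instruction into the "IR"
    # decode and execute
    if op == 'LDN':
        acc = -ram[operand]    # load negated
    elif op == 'SUB':
        acc -= ram[operand]
    elif op == 'STO':
        ram[operand] = acc
    elif op == 'JMP':
        pc = ram[operand]      # indirect jump: PC <- RAM[operand]
    elif op == 'SKN':
        if acc < 0:
            pc += 1            # skip the next instruction
    elif op == 'HLT':
        running = False

print(ram[22], acc)            # 7 -7, matching the earlier trace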


The Control Unit: The Brain Coordinates the Body

The control unit is where the magic happens. It sequences all the other components at the right times, making sure reads happen before writes, loads happen before arithmetic, and so on.

The simplest design uses:

  1. A counter that counts from 0 to 7 (8 ticks per instruction cycle)
  2. A decoder that converts the counter value into timing signals
  3. Multiplexers that make temporary connections based on the timing signal and instruction type

For the fetch stage, the same connections happen every cycle: PC to RAM address, then RAM output to IR. For the execute stage, different connections happen depending on the decoded instruction.

Why 8 ticks?

The Baby uses a 3-bit counter, giving 2³ = 8 possible values. That's enough states to fetch (2 ticks), decode (1 tick), execute (3–4 ticks), and update for the next cycle. More complex CPUs might use larger counters or skip the concept entirely in favor of pipelined architectures.


The Arithmetic Logic Unit: Computation

The Baby's ALU is trivial: it contains only a subtractor. Modern CPUs have adders, multipliers, dividers, bitwise operators, all running in parallel. But the principle is the same: send two operands in, compute the result, send it out.

For subtraction:

t3, SUB: RAM_A ← IR[operand]    // Operand sent to RAM
t4, SUB: Acc ← Acc - RAM_Dout   // Subtraction result stored

The operand tells the RAM which address to read. The RAM returns the value. The subtractor computes Acc − value. The result is written back to Acc on the next tick.


A Real Program: Turing's Long Division

Alan Turing tested the Baby when it was first built. He wrote a program to divide 36 by 5. It's a beautiful mess, using only subtraction, negation, and jumps to implement a classic algorithm.

00: NUM 19   -- Jump target
01: LDN 31   -- Load dividend (negated)
02: STO 31   -- Store as positive
03: LDN 31   -- Load again (negated back to original)
04: SUB 30   -- Subtract divisor (shifted)
05: SKN      -- If negative, we overshot
06: JMP 0    -- Otherwise, jump to line 20
07: LDN 31   -- Restore
08: STO 31   -- The dividend
09: LDN 28   -- Load quotient (negated)
10: SUB 28   -- Double it
11: STO 28   -- Store
12: LDN 31   -- Load dividend
13: SUB 31   -- Double it
14: STO 31   -- Store
15: LDN 28   -- Load quotient
16: STO 28   -- Restore shifted quotient
17: SKN      -- Check if done
18: JMP 26   -- If not, loop back
19: HLT      -- Stop; result in line 28
...
28: NUM 0    -- Quotient (appears here, shifted)
29: NUM 536870912  -- 2^31
30: NUM 20   -- Divisor * 2^n
31: NUM 36   -- Dividend

This 32-line program implements a shift-and-subtract division algorithm using only the Baby's primitive operations. The result (7 from 36 ÷ 5) gets stored in address 28.

Notice how the quirks force clever programming: the double-negation trick (lines 1–3) to convert -36 to +36, the bit-shifting by repeated subtraction (lines 10, 13), and the use of memory cells as a jump target table (lines 0, 26, 19). Modern languages abstract all this away, but Turing had to think in terms of the machine itself.
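
For comparison, here's what the underlying algorithm looks like when you're allowed shifts and comparisons: a plain-Python sketch of shift-and-subtract (restoring) division, not a transcription of Turing's Baby code.

# Shift-and-subtract (restoring) division: the algorithm Turing's program
# builds out of LDN, SUB, SKN and jumps, written here with ordinary operators
def divide(dividend, divisor, width=32):
    quotient, remainder = 0, 0
    for i in range(width - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # bring down the next bit
        quotient <<= 1
        if remainder >= divisor:   # the Baby tests this by subtracting and checking the sign
            remainder -= divisor
            quotient |= 1
    return quotient, remainder

print(divide(36, 5))  # (7, 1)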


Beyond the Baby: Advanced CPU Design

The Manchester Baby is a starting point. Once you understand how a minimal CPU works, the next question is: how do you make it faster, more powerful, and easier to program? Modern CPUs answer this by adding layers of sophistication on top of the basic architecture.

More Registers, More Flexibility

The Baby has only one user-accessible register: the accumulator. Every operation must move data in and out of RAM. Modern CPUs have dozens of registers. This lets programmers keep data inside the CPU instead of making expensive trips to RAM.

But more registers mean more complexity. Instructions now need to specify which register to use. A LOAD instruction in the Baby takes one operand (the address). With multiple registers, a LOAD might look like LOAD R1 $50A3, specifying both the register and address.

CISC vs RISC: A Fundamental Trade-off

CPU architects face a fundamental question: should we make the CPU hardware simple and the instruction set small, or vice versa?

CISC vs RISC Philosophy

Complex Instruction Set Computing (CISC): emphasis on rich instruction sets with many variations.

Advantages
  • Fewer, shorter programs
  • Less RAM usage
  • Easier to write code

Disadvantages
  • Complex CPU design
  • Harder to debug
  • More silicon needed

// CISC: a single instruction does multiple things
ADDM  $A4B5 $50A3 $463F
// Adds the values at two addresses and stores the result at a third
// (this one CISC instruction does the work of a short RISC program)

This debate raged for decades. CISC (x86 processors) tried to make programming easier by providing instructions that do complex work. RISC (MIPS, ARM) tried to make hardware simpler by keeping instructions basic. Modern CPUs blur the line: internally they break complex instructions into simple RISC-like operations, while keeping the complex instruction set on top for compatibility.

Subroutines and the Stack

In the Baby, a jump instruction goes to a hard-coded address. But what if you want to jump to a subroutine (a reusable piece of code) and then return to where you came from? You need to save the return address somewhere.

Early CPUs had a single return-address register, which allowed one level of subroutine calling but not recursion. Modern CPUs have a stack: a region of memory where return addresses are pushed when calling a subroutine and popped when returning. This enables nested calls and recursion.

The stack is managed by a special stack pointer register that points to the top of the stack. Call and return instructions automatically push/pop addresses. Some architectures even protect the stack, preventing user code from accidentally corrupting it.
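
Here's a toy Python sketch of the idea. The miniature instruction set below is made up for illustration, but it shows why pushing the return address onto a stack makes nested (and recursive) calls work.

# Toy illustration of call/return with a stack of return addresses.
# The "program" is a list of (op, arg) pairs; addresses are list indices.
program = [
    ('CALL', 3),                         # 0: call the subroutine at address 3
    ('PRINT', 'back in main'),           # 1
    ('HALT', None),                      # 2
    ('PRINT', 'inside the subroutine'),  # 3: subroutine body
    ('RET', None),                       # 4
]

pc, stack, running = 0, [], True
while running:
    op, arg = program[pc]
    if op == 'CALL':
        stack.append(pc + 1)   # push the return address
        pc = arg
    elif op == 'RET':
        pc = stack.pop()       # pop it to resume the caller
    elif op == 'PRINT':
        print(arg)
        pc += 1
    elif op == 'HALT':
        running = False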

Floating-Point Units

The Baby can only do integer arithmetic. Real-world problems like scientific computing, graphics, and signal processing need numbers with fractional parts.

Floating-point numbers (sign, exponent, mantissa) require complex hardware to add, multiply, and divide correctly. Early CPUs didn't have this; you had to buy a separate optional chip (the Intel 8087 co-processor). Modern CPUs integrate the floating-point unit (FPU) directly, with specialized registers and instructions.
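
You can peek at those three fields from Python. This small sketch reinterprets a value as 32-bit IEEE 754 single precision (the format the 8087's descendants implement) and splits out the sign, exponent, and mantissa bits.

import struct

def float_fields(x):
    # Reinterpret x as a 32-bit IEEE 754 float and split the bit fields
    bits = int.from_bytes(struct.pack('>f', x), 'big')
    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF   # stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # fraction bits; the leading 1 is implicit
    return sign, exponent, mantissa

print(float_fields(6.5))   # (0, 129, 5242880): 6.5 = +1.625 * 2^(129 - 127)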


Making CPUs Go Faster: Pipelining and Beyond

Once you have a working CPU, the next challenge is speed. The Baby runs at 1 MHz. Modern CPUs run at 3+ GHz. But simply making the clock faster isn't enough anymore, physical limits prevent indefinite speedup. Instead, modern CPUs use instruction-level parallelism: doing multiple instructions' work simultaneously.

Pipelining: The Assembly Line

Imagine a factory assembly line. Instead of building one car completely before starting the next, workers process multiple cars simultaneously. Each worker specializes in one task (painting, assembly, testing) and does it on whichever car is currently at their station.

CPUs use the same idea. The fetch-decode-execute cycle can be split into stages. While one instruction is executing, the next can be decoded, and the next can be fetched. With a 4-stage pipeline, you can have 4 instructions in flight simultaneously.

Instruction Pipeline

The table below shows how instructions flow through a 4-stage pipeline. Each stage is independent, allowing multiple instructions to be processed simultaneously.

  Cycle   Fetch     Decode    Execute   Write
  1       Instr 1   -         -         -
  2       Instr 2   Instr 1   -         -
  3       Instr 3   Instr 2   Instr 1   -
  4       Instr 4   Instr 3   Instr 2   Instr 1
  5       -         Instr 4   Instr 3   Instr 2
  6       -         -         Instr 4   Instr 3
  7       -         -         -         Instr 4

In cycle 4, all stages are working simultaneously: Fetch on Instr 4, Decode on Instr 3, Execute on Instr 2, Write on Instr 1.

With four stages, pipelining roughly quadruples throughput (instructions per second) without making the clock faster. Modern CPUs have pipelines with 10+ stages and can retire (complete) multiple instructions per cycle.
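
A quick back-of-the-envelope sketch shows where that factor comes from (idealized: no stalls or hazards, one instruction entering the pipeline per cycle):

# Idealized pipeline timing: with S stages and N instructions, one instruction
# finishes per cycle once the pipeline is full, so the total is S + N - 1 cycles
def pipeline_cycles(n_instructions, n_stages):
    return n_stages + n_instructions - 1

n, s = 1000, 4
unpipelined = n * s                           # each instruction takes all S cycles on its own
speedup = unpipelined / pipeline_cycles(n, s)
print(round(speedup, 2))                      # 3.99 -- approaching the 4x ideal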

Pipeline Hazards: When Things Go Wrong

Pipelining works great for streaming computations with no branches. But real programs branch constantly. When a branch is reached, you don't know which way it will go until you execute it, but by then, you've already fetched and partially decoded the wrong instructions.

Branching hazards: You guess which way a branch will go, prefetch those instructions, then have to throw them away if you guessed wrong.

Data hazards: Instruction 1 writes to a register, and instruction 2 reads from it. If instruction 2 starts before instruction 1 finishes writing, it gets the old value, not the new one.

Structural hazards: Two instructions want to use the same hardware (like the ALU) at the same time.

Branch Prediction: Guessing Right

The simplest approach: assume all branches are NOT taken (fall through). This is right for if-statements 50% of the time. A better strategy: most branches in real code come from loops, and loops iterate multiple times, so most branches ARE taken. If you always guess "taken," you're right 50-90% of the time.

The best approach: keep statistics on each branch. If a particular branch usually jumps, predict it will jump. Modern CPUs use machine learning classifiers built into hardware to predict branch outcomes with 90%+ accuracy.
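
The "keep statistics" idea can be as simple as a two-bit saturating counter per branch. Here's a small sketch of that classic textbook scheme (not any particular CPU's predictor):

# Two-bit saturating counter: states 0-1 predict "not taken", states 2-3
# predict "taken"; each outcome nudges the counter one step, so a single
# fluke doesn't flip the prediction.
class TwoBitPredictor:
    def __init__(self):
        self.state = 2                      # start weakly predicting "taken"

    def predict(self):
        return self.state >= 2              # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True] * 9 + [False]   # a loop branch: taken 9 times, then falls out
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "of", len(outcomes))  # 9 of 10: only the final loop exit is mispredicted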


Out-of-Order Execution

Sometimes instruction 1 takes a long time (like division), but instruction 3 is independent of it. Why wait for instruction 1 to finish before starting instruction 3?

Out-of-order execution lets the CPU reorder instructions as long as dependencies are respected. If instructions don't depend on each other, they can run in any order. A dependency graph tracks which instructions must finish before others can start. The CPU finds a schedule that keeps all its hardware as busy as possible.

This is complex: the CPU must track instruction dependencies, manage multiple in-flight instructions, and handle exceptions correctly. But it can double or triple performance for real programs.
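
Here's a toy sketch of the dependency idea, with made-up instructions and latencies and unlimited execution units assumed: each instruction starts as soon as its inputs are ready, not when its turn comes in program order.

# Toy dataflow schedule: an instruction starts once its dependencies finish
# (latencies are invented; real CPUs also juggle limited execution units)
instrs = [
    ('i1', [],     8),   # a slow divide
    ('i2', ['i1'], 1),   # needs i1's result
    ('i3', [],     1),   # independent of i1
    ('i4', ['i3'], 1),   # needs i3's result
]

finish = {}
for name, deps, latency in instrs:
    start = max((finish[d] for d in deps), default=0)
    finish[name] = start + latency
    print(f"{name}: starts cycle {start}, finishes cycle {finish[name]}")
# i3 and i4 finish at cycles 1 and 2, long before i2 at cycle 9,
# even though they come later in program order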

Hyperthreading: Two Threads, One CPU

Even with pipelining and out-of-order execution, some hardware sits idle. The fetch unit is busy during fetch but idle during execute. Why not have it fetch instructions for a second program?

Hyperthreading runs two independent instruction streams on the same physical CPU. Each has its own program counter and registers. When one stalls waiting for memory, the other can execute. From the OS's perspective, it looks like two CPUs. It isn't: the two threads share memory, caches, and the ALU. But it's a clever use of otherwise wasted resources.


Why This Matters: Then and Now

The Manchester Baby was revolutionary, but it's wildly inefficient compared to modern CPUs. It runs at 1 MHz (1 million ticks per second). A modern CPU runs at 3+ GHz (3 billion ticks per second). It has 32 addresses of RAM. A modern computer has gigabytes. It can only subtract. Modern CPUs have specialized hardware for arithmetic, cryptography, and graphics.

But the core idea hasn't changed: fetch an instruction, decode it, execute it, and repeat.

Modern CPUs optimize aggressively. They predict which branches you'll take and prefetch instructions. They run multiple instructions per cycle (superscalar execution). They cache frequently-used data. They speculate aggressively and roll back when wrong. But underneath is still the same loop.

From vacuum tubes to transistors to quantum?

The Baby used vacuum tubes. Each tube was a switch that could be on or off, controlled by electrical signals. Transistors replaced tubes (smaller, more reliable, used less power). Modern CPUs use transistors so small that quantum effects start to matter, and future computers might exploit quantum behavior directly.

But whether your computer uses tubes, transistors, or qubits, the fundamental idea remains: a clock signal, a memory, registers, and a control unit orchestrating computation.


What You Now Understand

You've learned how a CPU is more than magic. It's a collection of simple pieces: registers made of flip-flops, multiplexers that route signals, decoders that turn bit patterns into control signals, and a clock that synchronizes everything.

You can trace through a program and predict exactly what the machine will do: which instruction is fetched, how it's decoded, and how the registers and RAM change at each step.

The Baby is simple enough to understand completely. Modern CPUs are billions of times more complex, but they're built from the same basic components. Understanding the simple case gives you the foundation to reason about the complex ones.

Want to build one? The Manchester Baby (and extensions like the Ferranti Mark 1) can be simulated in Logisim. Start with registers, an ALU, and a simple control unit. Build up until you can run real programs. It's easier than you'd think, and deeply satisfying.