Build a Computer from NAND Gates
Part 3: Building an Assembler
We've built the hardware (Part 1) and learned the machine language (Part 2). Now for the final piece: a tool that translates human-readable assembly into machine code.
Writing binary by hand is error-prone and tedious. An assembler solves this by letting us write:
@100
D=A
@sum
M=D
Instead of:
0000000001100100
1110110000010000
0000000000010000
1110001100001000
Let's build one from scratch.
Assembly vs machine code
First, let's see why assembly language matters. Compare these two representations of the same program:
The assembly version tells us what we're doing: loading values, adding, storing. The binary version is just numbers. An assembler bridges this gap.
What the assembler must do
The assembler performs three main tasks:
- Parse: Break each line into components (dest, comp, jump)
- Resolve symbols: Convert labels and variables to addresses
- Generate code: Convert parsed components to binary
Let's tackle each step.
The two-pass algorithm
Here's the challenge: when we encounter @LOOP, we don't yet know what address LOOP refers to. The label might be defined later in the code.
The solution: two passes.
| Label | ROM Addr |
|---|---|
| LOOP | 4 |
| END | 16 |
Scan through the code and record where each label (LOOP) appears. Labels don't generate machine code—they're just markers for ROM addresses.
Pass 1: Collect labels
In the first pass, we scan through the code and record every label's ROM address. We don't generate any code. We just build the symbol table.
When we see (LOOP):
- Note that
LOOPmaps to the current ROM address - Don't increment the address counter (labels produce no code)
When we see @value or dest=comp:
- Increment the ROM address counter
- These instructions will occupy one word each
Pass 2: Resolve and generate
In the second pass, we generate actual machine code:
- When we see
@symbol, look it up in the symbol table - If not found, it's a new variable. Allocate RAM starting at address 16
- Convert each instruction to its 16-bit binary form
This two-pass approach handles forward references elegantly.
The parser
The parser breaks each line into structured components:
Parsing rules
A-instructions start with @:
- If followed by a number, it's a constant:
@100→ value 100 - If followed by a name, it's a symbol:
@LOOP→ look up in table
C-instructions have the form dest=comp;jump:
destis optional (defaults to null)compis requiredjumpis optional (defaults to null)
Labels are names in parentheses:
(LOOP)defines a label at the current ROM address- Labels don't generate code
Comments and whitespace are stripped:
- Everything after
//is ignored - Leading/trailing spaces are trimmed
The parser algorithm
function parseLine(line):
remove comments (everything after //)
trim whitespace
if line is empty:
return EMPTY
if line starts with '(' and ends with ')':
return LABEL with name
if line starts with '@':
value = rest of line
if value is a number:
return A-INSTRUCTION with value
else:
return A-INSTRUCTION with symbol
else:
parse as C-instruction:
- if '=' exists, split on '=' for dest
- if ';' exists, split on ';' for jump
- middle part is comp
return C-INSTRUCTION with dest, comp, jump
Code generation
Once we've parsed the instruction, we need to generate the corresponding binary:
A-instruction encoding
A-instructions are simple: the value becomes the lower 15 bits.
@value → 0vvv vvvv vvvv vvvv
The leading 0 bit distinguishes A-instructions from C-instructions.
C-instruction encoding
C-instructions are more complex:
111a cccc ccdd djjj
- 111: C-instruction prefix (3 bits)
- a: Use M instead of A? (1 bit)
- cccccc: Computation code (6 bits)
- ddd: Destination code (3 bits)
- jjj: Jump code (3 bits)
Each component maps to a lookup table:
| Comp | Binary | Dest | Binary | Jump | Binary | ||
|---|---|---|---|---|---|---|---|
| 0 | 0101010 | null | 000 | null | 000 | ||
| 1 | 0111111 | M | 001 | JGT | 001 | ||
| D | 0001100 | D | 010 | JEQ | 010 | ||
| A | 0110000 | MD | 011 | JGE | 011 | ||
| D+A | 0000010 | A | 100 | JLT | 100 | ||
| ... | ... | ... | ... | JMP | 111 |
The assembler looks up each component and concatenates the bits.
The assembler playground
Now let's put it all together. The playground below shows the complete assembler in action:
Try editing the source code:
- Change values and see the binary update
- Add labels and watch them appear in the symbol table
- Introduce errors and see the error messages
Predefined symbols
The assembler comes with built-in symbols:
| Symbol | Value | Purpose |
|---|---|---|
| R0-R15 | 0-15 | Virtual registers |
| SP | 0 | Stack pointer |
| LCL | 1 | Local segment |
| ARG | 2 | Argument segment |
| THIS | 3 | This segment |
| THAT | 4 | That segment |
| SCREEN | 16384 | Screen memory base |
| KBD | 24576 | Keyboard memory |
These predefined symbols make programs more readable and portable.
Integrated development
With the assembler complete, we can now write, assemble, and run programs in one environment:
Watch your assembly code execute step by step. The highlighted line shows which instruction is about to execute, while the registers and memory display the current state.
Error handling
A good assembler doesn't just translate. It catches mistakes:
Syntax errors:
- Invalid characters in symbols
- Missing computation in C-instruction
- Unbalanced parentheses in labels
Semantic errors:
- Duplicate label definitions
- Invalid computation mnemonics
- Invalid destination or jump codes
Value errors:
- A-instruction value > 32767
- Negative values in A-instructions
Each error should report:
- The line number
- The problematic source text
- A clear description of what's wrong
A complete example
Let's trace through assembling a multiplication program:
// Multiply R0 by R1, store in R2
@R2
M=0
(LOOP)
@R0
D=M
@END
D;JEQ
@R1
D=M
@R2
M=D+M
@R0
M=M-1
@LOOP
0;JMP
(END)
@END
0;JMP
Pass 1 builds the symbol table:
LOOP→ ROM address 2END→ ROM address 14
Pass 2 generates machine code:
| Line | Assembly | Binary | Explanation |
|---|---|---|---|
| 1 | @R2 | 0000000000000010 | A = 2 |
| 2 | M=0 | 1110101010001000 | RAM[2] = 0 |
| 3 | @R0 | 0000000000000000 | A = 0 |
| 4 | D=M | 1111110000010000 | D = RAM[0] |
| 5 | @END | 0000000000001110 | A = 14 (END) |
| 6 | D;JEQ | 1110001100000010 | if D=0 jump |
| ... | ... | ... | ... |
The result: 16 instructions that multiply any two numbers.
What we've built
Over these three parts, we've constructed:
-
Hardware (Part 1)
- Logic gates from NAND
- Arithmetic circuits (adders, ALU)
- Memory (registers, RAM)
-
Architecture (Part 2)
- Machine language encoding
- CPU fetch-decode-execute cycle
- Memory-mapped I/O
-
Tools (Part 3)
- Parser for assembly syntax
- Symbol table with two-pass resolution
- Code generator for binary output
Starting from a single NAND gate, we've built a complete computer system. The hardware executes instructions. The assembler translates human-readable code to those instructions.
Next steps
This is just the beginning. From here, you could explore:
- High-level languages: Build a compiler that translates a language like Jack to assembly
- Operating systems: Write software that manages memory and processes
- Hardware description languages: Describe circuits in HDL and simulate them
- Real hardware: Implement this architecture on an FPGA
The principles we've covered (logic gates, machine language, translation) are the foundation of all computing. Every program you write, every app you use, ultimately becomes bits flowing through circuits we now understand.
The computer is no longer a black box.
This concludes the 3-part series on building a computer from first principles. The complete code for all simulators and demos is available in this project.