Build a Computer from NAND Gates
Part 7: Compiler I - Syntax Analysis
We've designed the Jack language. Now we build a compiler that translates Jack to VM code.
Compilation happens in stages:
- Tokenization: Break source into tokens
- Parsing: Build a syntax tree
- Code Generation: Emit VM instructions
This part covers the frontend (tokenization and parsing). Part 8 covers code generation.
The tokenizer
The first step transforms a stream of characters into a stream of tokens:
Token types
| Type | Examples | Pattern |
|---|---|---|
| KEYWORD | class, if, let | Reserved words |
| SYMBOL | + - ; | Single characters |
| INT_CONST | 0, 42, 32767 | Digit sequences |
| STRING_CONST | "hello" | Text in quotes |
| IDENTIFIER | x, Main, foo | Names |
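For example, the single statement `if (x < 10) { let y = "ok"; }` exercises all five token types:

```
KEYWORD(if) SYMBOL(() IDENTIFIER(x) SYMBOL(<) INT_CONST(10) SYMBOL())
SYMBOL({) KEYWORD(let) IDENTIFIER(y) SYMBOL(=) STRING_CONST(ok)
SYMBOL(;) SYMBOL(})
```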
Tokenization algorithm
```
while not at end:
    skip whitespace and comments
    if current char is symbol:
        emit SYMBOL token
    else if current char is digit:
        read all digits
        emit INT_CONST token
    else if current char is quote:
        read until closing quote
        emit STRING_CONST token
    else if current char is letter or underscore:
        read word (letters, digits, underscores)
        if word is keyword:
            emit KEYWORD token
        else:
            emit IDENTIFIER token
```
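Here's a minimal TypeScript sketch of that loop. Comment handling is deferred to the next section; `KEYWORDS` and `SYMBOLS` are assumed lookup sets, and the `Token` shape is simplified from the one under Implementation details below:

```typescript
// Jack's reserved words and symbol characters (assumed lookup sets).
const KEYWORDS = new Set([
  'class', 'constructor', 'function', 'method', 'field', 'static',
  'var', 'int', 'char', 'boolean', 'void', 'true', 'false', 'null',
  'this', 'let', 'do', 'if', 'else', 'while', 'return',
]);
const SYMBOLS = new Set('{}()[].,;+-*/&|<>=~');

interface Token { type: string; value: string; }

function tokenize(src: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < src.length) {
    const c = src[i];
    if (/\s/.test(c)) { i++; continue; }  // skip whitespace (comments: next section)
    if (SYMBOLS.has(c)) {
      tokens.push({ type: 'SYMBOL', value: c });
      i++;
    } else if (/[0-9]/.test(c)) {
      let j = i;
      while (j < src.length && /[0-9]/.test(src[j])) j++;
      tokens.push({ type: 'INT_CONST', value: src.slice(i, j) });
      i = j;
    } else if (c === '"') {
      const close = src.indexOf('"', i + 1);
      if (close < 0) throw new Error('Unterminated string');
      tokens.push({ type: 'STRING_CONST', value: src.slice(i + 1, close) });
      i = close + 1;
    } else if (/[A-Za-z_]/.test(c)) {
      let j = i;
      while (j < src.length && /[A-Za-z0-9_]/.test(src[j])) j++;
      const word = src.slice(i, j);
      tokens.push({ type: KEYWORDS.has(word) ? 'KEYWORD' : 'IDENTIFIER', value: word });
      i = j;
    } else {
      throw new Error(`Invalid character: ${c}`);
    }
  }
  return tokens;
}
```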
Handling comments
Jack supports two comment styles:
```
// Single-line comment (skip to end of line)
/* Multi-line comment
   spans multiple lines */
```
The tokenizer strips comments entirely; the parser never sees them.
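A sketch of the skipping logic, assuming we're scanning with an index into the source as in the tokenizer sketch above:

```typescript
// Advance past whitespace and comments; returns the next index to scan from.
// Assumes `src` is the full source and `i` the current scan position.
function skipWhitespaceAndComments(src: string, i: number): number {
  for (;;) {
    while (i < src.length && /\s/.test(src[i])) i++;
    if (src.startsWith('//', i)) {
      // Single-line comment: skip to end of line.
      const nl = src.indexOf('\n', i);
      i = nl < 0 ? src.length : nl + 1;
    } else if (src.startsWith('/*', i)) {
      // Multi-line comment: skip to the closing */.
      const end = src.indexOf('*/', i + 2);
      if (end < 0) throw new Error('Unterminated comment');
      i = end + 2;
    } else {
      return i;  // neither whitespace nor a comment: real token ahead
    }
  }
}
```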
String constants
Strings are enclosed in double quotes. They cannot:
- Contain double quotes
- Span multiple lines (no embedded newlines)
The tokenizer stores the content without the quotes.
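The reader for this is short; a sketch that also enforces the single-line rule:

```typescript
// Read a string constant; `i` points at the opening quote.
// Returns the contents (quotes stripped) and the index after the closing quote.
function readString(src: string, i: number): [string, number] {
  const close = src.indexOf('"', i + 1);
  if (close < 0) throw new Error('Unterminated string');
  const contents = src.slice(i + 1, close);
  if (contents.includes('\n')) throw new Error('String constants cannot span lines');
  return [contents, close + 1];
}
```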
The parser
The parser takes tokens and builds a syntax tree that represents the program's structure:
```
parseClass():
    consume('class')
    className = consume(IDENTIFIER)
    consume('{')
    while peek() in ['static', 'field']:
        parseClassVarDec()
    while peek() in ['constructor', 'function', 'method']:
        parseSubroutineDec()
    consume('}')
```
Recursive descent
We use recursive descent parsing, where each grammar rule becomes a function:
| Grammar Rule | Parser Function |
|---|---|
| class | parseClass() |
| subroutineDec | parseSubroutineDec() |
| statements | parseStatements() |
| expression | parseExpression() |
| term | parseTerm() |
Functions call each other following the grammar's structure.
The parsing loop
Each parse function follows this pattern:
- Check what token we're looking at (peek)
- Consume expected tokens (advance and validate)
- Recurse into sub-rules as needed
- Build and return an AST node
Example for parseLetStatement:
```
parseLetStatement():
    consume(KEYWORD, 'let')
    varName = consume(IDENTIFIER)
    if peek() == '[':
        consume('[')
        indexExpr = parseExpression()
        consume(']')
    consume('=')
    value = parseExpression()
    consume(';')
    return LetStatement(varName, indexExpr, value)
```
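The `peek` and `consume` helpers themselves are small. A sketch, assuming the parser keeps the token array and a position index (the state described under Implementation details below); this variant folds type and value checks into a single expectation:

```typescript
type Token = { type: string; value: string; line: number };

class Parser {
  private pos = 0;
  constructor(private tokens: Token[]) {}

  // Look at the current token without consuming it.
  peek(): Token {
    return this.tokens[this.pos];
  }

  // Advance past the current token; if `expected` is given, it must match
  // either the token's value ('{', 'let') or its type ('IDENTIFIER').
  consume(expected?: string): Token {
    const tok = this.tokens[this.pos];
    if (expected !== undefined && tok.value !== expected && tok.type !== expected) {
      throw new Error(`Line ${tok.line}: expected ${expected}, got '${tok.value}'`);
    }
    this.pos++;
    return tok;
  }
}
```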
Lookahead
Sometimes we need to look ahead to decide which rule to apply:
```
parseTerm():
    if peek() is IDENTIFIER:
        name = consume()
        if peek() == '[':
            // Array access: arr[i]
            return parseArrayAccess(name)
        else if peek() == '(' or peek() == '.':
            // Subroutine call: func() or obj.method()
            return parseSubroutineCall(name)
        else:
            // Simple variable
            return VarTerm(name)
```
Jack's grammar is almost LL(1): a single token of lookahead picks the right rule everywhere except term, where the token after an identifier decides between a plain variable, an array access, and a subroutine call.
Abstract syntax tree
What's in the AST
The AST captures the program's semantic structure:
- ClassNode: class name, variable declarations, subroutines
- SubroutineDecNode: kind, return type, parameters, body
- StatementNode: let, if, while, do, return with their components
- ExpressionNode: term and operations
- TermNode: constants, variables, calls, expressions
Each node type has fields for its components.
Parse tree vs AST
Parse tree for `x + 5`:

```
expression
├── term
│   └── varName: "x"
├── op: "+"
└── term
    └── intConst: 5
```

AST for the same expression:

```
BinaryOp
├── op: "+"
├── left: VarRef("x")
└── right: IntConst(5)
```

A parse tree (concrete syntax tree) includes everything from the grammar, even punctuation and syntactic details.
An AST (abstract syntax tree) keeps only semantically important information:
- No semicolons or braces (structure is implicit)
- Operators are nodes, not intermediate rules
- Simpler to traverse and generate code from
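In TypeScript, this AST shape falls out naturally as a discriminated union. A sketch, with node names matching the `BinaryOp` example above (the exact names are illustrative):

```typescript
type Expression =
  | { kind: 'intConst'; value: number }
  | { kind: 'varRef'; name: string }
  | { kind: 'binaryOp'; op: string; left: Expression; right: Expression };

// x + 5 as an AST value:
const xPlus5: Expression = {
  kind: 'binaryOp',
  op: '+',
  left: { kind: 'varRef', name: 'x' },
  right: { kind: 'intConst', value: 5 },
};
```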
Symbol table
During parsing, we build a symbol table that tracks variable information:
| Name | Type | Kind | Index | Segment |
|---|---|---|---|---|
| count | int | static | 0 | static 0 |
| size | int | static | 1 | static 1 |
| x | int | field | 0 | this 0 |
| y | int | field | 1 | this 1 |
| data | Array | field | 2 | this 2 |
Two scopes
Class scope (persists for the class):
- static variables → static segment
- field variables → this segment
Subroutine scope (reset for each subroutine):
- argument variables → argument segment
- local variables → local segment
Symbol table operations
```
class SymbolTable:
    define(name, type, kind)   // Add a variable
    varCount(kind)             // Count variables of kind
    kindOf(name)               // Get variable's kind
    typeOf(name)               // Get variable's type
    indexOf(name)              // Get variable's index
```
When we see `let x = 5`:
- Look up `x` in the symbol table
- Get its kind (e.g., "var") and index (e.g., 0)
- Generate `pop local 0`
Scope lookup
Variables are looked up in order:
- Subroutine scope first
- Class scope if not found
- Error if still not found
This allows local variables to shadow class variables.
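A minimal sketch of the two-scope table in TypeScript, following the API above; the private `lookup` method implements the subroutine-then-class order just described:

```typescript
type Kind = 'static' | 'field' | 'argument' | 'var';

interface SymbolInfo { type: string; kind: Kind; index: number; }

class SymbolTable {
  private classScope = new Map<string, SymbolInfo>();
  private subScope = new Map<string, SymbolInfo>();
  private counts: Record<Kind, number> = { static: 0, field: 0, argument: 0, var: 0 };

  // Entering a new subroutine resets the subroutine-level state.
  startSubroutine(): void {
    this.subScope.clear();
    this.counts.argument = 0;
    this.counts.var = 0;
  }

  define(name: string, type: string, kind: Kind): void {
    const scope = kind === 'static' || kind === 'field' ? this.classScope : this.subScope;
    scope.set(name, { type, kind, index: this.counts[kind]++ });
  }

  varCount(kind: Kind): number { return this.counts[kind]; }

  // Subroutine scope first, then class scope: locals shadow class variables.
  private lookup(name: string): SymbolInfo | undefined {
    return this.subScope.get(name) ?? this.classScope.get(name);
  }

  kindOf(name: string): Kind | undefined { return this.lookup(name)?.kind; }
  typeOf(name: string): string | undefined { return this.lookup(name)?.type; }
  indexOf(name: string): number | undefined { return this.lookup(name)?.index; }
}
```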
Error handling
Good error messages
An error message should include:
- Line number (and column if possible)
- What was expected (based on grammar)
- What was found (actual token)
```
Line 5: Expected ')', got ';'
Line 8: Undefined variable 'count'
Line 12: Type mismatch: expected int, got String
```
Error categories
Lexical errors (tokenizer):
- Unterminated string
- Invalid character
- Integer too large
Syntax errors (parser):
- Unexpected token
- Missing closing bracket
- Invalid statement
Semantic errors (later):
- Undefined variable
- Type mismatch
- Wrong number of arguments
Error recovery
Simple compilers stop at the first error. Better compilers:
- Report the error
- Skip to a synchronization point (like `;` or `}`)
- Continue parsing to find more errors
Our compiler reports errors but may not recover gracefully.
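A sketch of the common recovery strategy ("panic mode"), operating on the token array and position from the parser-state sketch earlier:

```typescript
type Token = { value: string };  // only the value is needed here

// Panic-mode recovery: after reporting an error, discard tokens until just
// past a likely statement boundary, then resume parsing from there.
function synchronize(tokens: Token[], pos: number): number {
  while (pos < tokens.length) {
    const value = tokens[pos++].value;
    if (value === ';' || value === '}') return pos;
  }
  return pos;  // reached end of input without finding a boundary
}
```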
Complete frontend pipeline
To see every stage at once, trace the statement `let x = 5 + 3;` through the frontend:
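```
Source:  let x = 5 + 3;

Tokens:  KEYWORD(let) IDENTIFIER(x) SYMBOL(=)
         INT_CONST(5) SYMBOL(+) INT_CONST(3) SYMBOL(;)

AST:     LetStatement
         ├── varName: "x"
         └── value: BinaryOp("+")
             ├── left: IntConst(5)
             └── right: IntConst(3)
```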
Pipeline summary
```
Source Code
    ↓  Tokenizer
Token Stream
    ↓  Parser
Abstract Syntax Tree
    ↓  (Symbol Table built during parsing)
AST + Symbol Table
```
The frontend transforms text into a structured representation that the code generator can process.
Implementation details
Token structure
```typescript
interface Token {
  type: 'KEYWORD' | 'SYMBOL' | 'INT_CONST' | 'STRING_CONST' | 'IDENTIFIER';
  value: string;
  line: number;
  column: number;
}
```
AST node example
```typescript
interface LetStatementNode {
  type: 'letStatement';
  varName: string;
  indexExpr?: ExpressionNode;  // For array access
  value: ExpressionNode;
  line?: number;
}
```
Parser state
The parser maintains:
- tokens: Array of tokens from tokenizer
- pos: Current position in token array
- errors: Accumulated error messages
Testing the frontend
Unit tests
Test tokenizer with edge cases:
"" → empty string
"a" → single identifier
"123" → integer constant
"if else" → two keywords
"x+y*z" → no spaces
"/* */ a" → comment before token
Test parser with valid programs:
```
class Empty { }
class OneField { field int x; }
function void f() { return; }
```
Test with invalid programs:
```
class { }  → missing class name
let x      → missing = and expression
if (x { }  → missing )
```
Integration tests
Parse complete programs and verify AST structure:
- Correct node types
- Correct nesting
- All identifiers captured
- All expressions preserved
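A sketch of one such check, assuming a hypothetical `parse` entry point; the field names here are illustrative, matching the ClassNode described earlier:

```typescript
import assert from 'node:assert';

// Assumed entry point and field names (illustrative): source in, ClassNode out.
declare function parse(src: string): {
  className: string;
  classVarDecs: { name: string; type: string; kind: string }[];
};

const ast = parse('class Point { field int x; }');
assert.strictEqual(ast.className, 'Point');         // identifier captured
assert.strictEqual(ast.classVarDecs.length, 1);     // one declaration node
assert.strictEqual(ast.classVarDecs[0].name, 'x');  // correct nesting
```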
What's next
We have an AST representing the program's structure. In Part 8, we'll traverse this tree and generate VM code:
- Expressions → push/pop and arithmetic operations
- Statements → control flow with labels and gotos
- Subroutines → function declarations and calls
- Classes → proper handling of this and static
The frontend gave us structure. The backend gives us execution.
Next: Part 8 implements the code generator, translating AST to VM code.