Build a Computer from NAND Gates
Part 7: Compiler I - Syntax Analysis
We've designed the Jack language. Now we build a compiler that translates Jack to VM code.
Compilation happens in stages:
- Tokenization: Break source into tokens
- Parsing: Build a syntax tree
- Code Generation: Emit VM instructions
This part covers the frontend (tokenization and parsing). Part 8 covers code generation.
The tokenizer
The first step transforms a stream of characters into a stream of tokens:
Token types
| Type | Examples | Pattern |
|---|---|---|
| KEYWORD | class, if, let | Reserved words |
| SYMBOL | + - ; | Single characters |
| INT_CONST | 0, 42, 32767 | Digit sequences |
| STRING_CONST | "hello" | Text in quotes |
| IDENTIFIER | x, Main, foo | Names |
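For example, the single statement `if (x < 10) { let y = "ok"; }` exercises all five token types:

```
KEYWORD(if) SYMBOL(() IDENTIFIER(x) SYMBOL(<) INT_CONST(10) SYMBOL())
SYMBOL({) KEYWORD(let) IDENTIFIER(y) SYMBOL(=) STRING_CONST(ok)
SYMBOL(;) SYMBOL(})
```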
Tokenization algorithm
```
while not at end:
    skip whitespace and comments
    if current char is symbol:
        emit SYMBOL token
    else if current char is digit:
        read all digits
        emit INT_CONST token
    else if current char is quote:
        read until closing quote
        emit STRING_CONST token
    else if current char is letter or underscore:
        read word (letters, digits, underscores)
        if word is keyword:
            emit KEYWORD token
        else:
            emit IDENTIFIER token
```
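Here's a minimal TypeScript sketch of that loop. Comment handling is deferred to the next section; `KEYWORDS` and `SYMBOLS` are assumed lookup sets, and the `Token` shape is simplified from the one under Implementation details below:

```typescript
// Jack's reserved words and symbol characters (assumed lookup sets).
const KEYWORDS = new Set([
  'class', 'constructor', 'function', 'method', 'field', 'static',
  'var', 'int', 'char', 'boolean', 'void', 'true', 'false', 'null',
  'this', 'let', 'do', 'if', 'else', 'while', 'return',
]);
const SYMBOLS = new Set('{}()[].,;+-*/&|<>=~');

interface Token { type: string; value: string; }

function tokenize(src: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < src.length) {
    const c = src[i];
    if (/\s/.test(c)) { i++; continue; }  // skip whitespace (comments: next section)
    if (SYMBOLS.has(c)) {
      tokens.push({ type: 'SYMBOL', value: c });
      i++;
    } else if (/[0-9]/.test(c)) {
      let j = i;
      while (j < src.length && /[0-9]/.test(src[j])) j++;
      tokens.push({ type: 'INT_CONST', value: src.slice(i, j) });
      i = j;
    } else if (c === '"') {
      const close = src.indexOf('"', i + 1);
      if (close < 0) throw new Error('Unterminated string');
      tokens.push({ type: 'STRING_CONST', value: src.slice(i + 1, close) });
      i = close + 1;
    } else if (/[A-Za-z_]/.test(c)) {
      let j = i;
      while (j < src.length && /[A-Za-z0-9_]/.test(src[j])) j++;
      const word = src.slice(i, j);
      tokens.push({ type: KEYWORDS.has(word) ? 'KEYWORD' : 'IDENTIFIER', value: word });
      i = j;
    } else {
      throw new Error(`Invalid character: ${c}`);
    }
  }
  return tokens;
}
```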
Handling comments
Jack supports two comment styles:
```
// Single-line comment (skip to end of line)
/* Multi-line comment
   spans multiple lines */
```
The tokenizer strips comments entirely; the parser never sees them.
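A sketch of the skipping logic, assuming we're scanning with an index into the source as in the tokenizer sketch above:

```typescript
// Advance past whitespace and comments; returns the next index to scan from.
// Assumes `src` is the full source and `i` the current scan position.
function skipWhitespaceAndComments(src: string, i: number): number {
  for (;;) {
    while (i < src.length && /\s/.test(src[i])) i++;
    if (src.startsWith('//', i)) {
      // Single-line comment: skip to end of line.
      const nl = src.indexOf('\n', i);
      i = nl < 0 ? src.length : nl + 1;
    } else if (src.startsWith('/*', i)) {
      // Multi-line comment: skip to the closing */.
      const end = src.indexOf('*/', i + 2);
      if (end < 0) throw new Error('Unterminated comment');
      i = end + 2;
    } else {
      return i;  // neither whitespace nor a comment: real token ahead
    }
  }
}
```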
String constants
Strings are enclosed in double quotes. They cannot:
- Contain double quotes
- Span multiple lines (no embedded newlines)
The tokenizer stores the content without the quotes.
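The reader for this is short; a sketch that also enforces the single-line rule:

```typescript
// Read a string constant; `i` points at the opening quote.
// Returns the contents (quotes stripped) and the index after the closing quote.
function readString(src: string, i: number): [string, number] {
  const close = src.indexOf('"', i + 1);
  if (close < 0) throw new Error('Unterminated string');
  const contents = src.slice(i + 1, close);
  if (contents.includes('\n')) throw new Error('String constants cannot span lines');
  return [contents, close + 1];
}
```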
The parser
The parser takes tokens and builds a syntax tree that represents the program's structure:
```
parseClass():
    consume('class')
    className = consume(IDENTIFIER)
    consume('{')
    while peek() in ['static', 'field']:
        parseClassVarDec()
    while peek() in ['constructor', 'function', 'method']:
        parseSubroutineDec()
    consume('}')
```
Recursive descent
We use recursive descent parsing, where each grammar rule becomes a function:
| Grammar Rule | Parser Function |
|---|---|
| class | parseClass() |
| subroutineDec | parseSubroutineDec() |
| statements | parseStatements() |
| expression | parseExpression() |
| term | parseTerm() |
Functions call each other following the grammar's structure.
The parsing loop
Each parse function follows this pattern:
- Check what token we're looking at (peek)
- Consume expected tokens (advance and validate)
- Recurse into sub-rules as needed
- Build and return an AST node
Example for parseLetStatement:
```
parseLetStatement():
    consume(KEYWORD, 'let')
    varName = consume(IDENTIFIER)
    if peek() == '[':
        consume('[')
        indexExpr = parseExpression()
        consume(']')
    consume('=')
    value = parseExpression()
    consume(';')
    return LetStatement(varName, indexExpr, value)
```
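The `peek` and `consume` helpers themselves are small. A sketch, assuming the parser keeps the token array and a position index (the state described under Implementation details below); this variant folds type and value checks into a single expectation:

```typescript
type Token = { type: string; value: string; line: number };

class Parser {
  private pos = 0;
  constructor(private tokens: Token[]) {}

  // Look at the current token without consuming it.
  peek(): Token {
    return this.tokens[this.pos];
  }

  // Advance past the current token; if `expected` is given, it must match
  // either the token's value ('{', 'let') or its type ('IDENTIFIER').
  consume(expected?: string): Token {
    const tok = this.tokens[this.pos];
    if (expected !== undefined && tok.value !== expected && tok.type !== expected) {
      throw new Error(`Line ${tok.line}: expected ${expected}, got '${tok.value}'`);
    }
    this.pos++;
    return tok;
  }
}
```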
Lookahead
Sometimes we need to look ahead to decide which rule to apply:
```
parseTerm():
    if peek() is IDENTIFIER:
        name = consume()
        if peek() == '[':
            // Array access: arr[i]
            return parseArrayAccess(name)
        else if peek() == '(' or peek() == '.':
            // Subroutine call: func() or obj.method()
            return parseSubroutineCall(name)
        else:
            // Simple variable
            return VarTerm(name)
```
Jack's grammar is almost LL(1): a single token of lookahead picks the right rule everywhere except term, where the token after an identifier decides between a plain variable, an array access, and a subroutine call.
Abstract syntax tree
What's in the AST
The AST captures the program's semantic structure:
- ClassNode: class name, variable declarations, subroutines
- SubroutineDecNode: kind, return type, parameters, body
- StatementNode: let, if, while, do, return with their components
- ExpressionNode: term and operations
- TermNode: constants, variables, calls, expressions
Each node type has fields for its components.
Parse tree vs AST
Parse tree for `x + 5`:

```
expression
├── term
│   └── varName: "x"
├── op: "+"
└── term
    └── intConst: 5
```

AST for the same expression:

```
BinaryOp
├── op: "+"
├── left: VarRef("x")
└── right: IntConst(5)
```

A parse tree (concrete syntax tree) includes everything from the grammar, even punctuation and syntactic details.
An AST (abstract syntax tree) keeps only semantically important information:
- No semicolons or braces (structure is implicit)
- Operators are nodes, not intermediate rules
- Simpler to traverse and generate code from
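In TypeScript, this AST shape falls out naturally as a discriminated union. A sketch, with node names matching the `BinaryOp` example above (the exact names are illustrative):

```typescript
type Expression =
  | { kind: 'intConst'; value: number }
  | { kind: 'varRef'; name: string }
  | { kind: 'binaryOp'; op: string; left: Expression; right: Expression };

// x + 5 as an AST value:
const xPlus5: Expression = {
  kind: 'binaryOp',
  op: '+',
  left: { kind: 'varRef', name: 'x' },
  right: { kind: 'intConst', value: 5 },
};
```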
Symbol table
During parsing, we build a symbol table that tracks variable information:
| Name | Type | Kind | Index | Segment |
|---|---|---|---|---|
| count | int | static | 0 | static 0 |
| size | int | static | 1 | static 1 |
| x | int | field | 0 | this 0 |
| y | int | field | 1 | this 1 |
| data | Array | field | 2 | this 2 |
Two scopes
Class scope (persists for the class):
- static variables → static segment
- field variables → this segment
Subroutine scope (reset for each subroutine):
- argument variables → argument segment
- local variables → local segment
Symbol table operations
```
class SymbolTable:
    define(name, type, kind)   // Add a variable
    varCount(kind)             // Count variables of kind
    kindOf(name)               // Get variable's kind
    typeOf(name)               // Get variable's type
    indexOf(name)              // Get variable's index
```
When we see `let x = 5`:
- Look up `x` in the symbol table
- Get its kind (e.g., "var") and index (e.g., 0)
- Generate `pop local 0`
Scope lookup
Variables are looked up in order:
- Subroutine scope first
- Class scope if not found
- Error if still not found
This allows local variables to shadow class variables.
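A minimal sketch of the two-scope table in TypeScript, following the API above; the private `lookup` method implements the subroutine-then-class order just described:

```typescript
type Kind = 'static' | 'field' | 'argument' | 'var';

interface SymbolInfo { type: string; kind: Kind; index: number; }

class SymbolTable {
  private classScope = new Map<string, SymbolInfo>();
  private subScope = new Map<string, SymbolInfo>();
  private counts: Record<Kind, number> = { static: 0, field: 0, argument: 0, var: 0 };

  // Entering a new subroutine resets the subroutine-level state.
  startSubroutine(): void {
    this.subScope.clear();
    this.counts.argument = 0;
    this.counts.var = 0;
  }

  define(name: string, type: string, kind: Kind): void {
    const scope = kind === 'static' || kind === 'field' ? this.classScope : this.subScope;
    scope.set(name, { type, kind, index: this.counts[kind]++ });
  }

  varCount(kind: Kind): number { return this.counts[kind]; }

  // Subroutine scope first, then class scope: locals shadow class variables.
  private lookup(name: string): SymbolInfo | undefined {
    return this.subScope.get(name) ?? this.classScope.get(name);
  }

  kindOf(name: string): Kind | undefined { return this.lookup(name)?.kind; }
  typeOf(name: string): string | undefined { return this.lookup(name)?.type; }
  indexOf(name: string): number | undefined { return this.lookup(name)?.index; }
}
```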
Error handling
Good error messages
An error message should include:
- Line number (and column if possible)
- What was expected (based on grammar)
- What was found (actual token)
```
Line 5: Expected ')', got ';'
Line 8: Undefined variable 'count'
Line 12: Type mismatch: expected int, got String
```
Error categories
Lexical errors (tokenizer):
- Unterminated string
- Invalid character
- Integer too large
Syntax errors (parser):
- Unexpected token
- Missing closing bracket
- Invalid statement
Semantic errors (later):
- Undefined variable
- Type mismatch
- Wrong number of arguments
Error recovery
Simple compilers stop at the first error. Better compilers:
- Report the error
- Skip to a synchronization point (like `;` or `}`)
- Continue parsing to find more errors
Our compiler reports errors but may not recover gracefully.
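A sketch of the common recovery strategy ("panic mode"), operating on the token array and position from the parser-state sketch earlier:

```typescript
type Token = { value: string };  // only the value is needed here

// Panic-mode recovery: after reporting an error, discard tokens until just
// past a likely statement boundary, then resume parsing from there.
function synchronize(tokens: Token[], pos: number): number {
  while (pos < tokens.length) {
    const value = tokens[pos++].value;
    if (value === ';' || value === '}') return pos;
  }
  return pos;  // reached end of input without finding a boundary
}
```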
Complete frontend pipeline
To see every stage at once, trace the statement `let x = 5 + 3;` through the frontend:
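```
Source:  let x = 5 + 3;

Tokens:  KEYWORD(let) IDENTIFIER(x) SYMBOL(=)
         INT_CONST(5) SYMBOL(+) INT_CONST(3) SYMBOL(;)

AST:     LetStatement
         ├── varName: "x"
         └── value: BinaryOp("+")
             ├── left: IntConst(5)
             └── right: IntConst(3)
```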
Pipeline summary
```
Source Code
    ↓  Tokenizer
Token Stream
    ↓  Parser
Abstract Syntax Tree
    ↓  (Symbol Table built during parsing)
AST + Symbol Table
```
The frontend transforms text into a structured representation that the code generator can process.
Implementation details
Token structure
```typescript
interface Token {
  type: 'KEYWORD' | 'SYMBOL' | 'INT_CONST' | 'STRING_CONST' | 'IDENTIFIER';
  value: string;
  line: number;
  column: number;
}
```
AST node example
```typescript
interface LetStatementNode {
  type: 'letStatement';
  varName: string;
  indexExpr?: ExpressionNode;  // For array access
  value: ExpressionNode;
  line?: number;
}
```
Parser state
The parser maintains:
- tokens: Array of tokens from tokenizer
- pos: Current position in token array
- errors: Accumulated error messages
Testing the frontend
Unit tests
Test tokenizer with edge cases:
"" → empty string
"a" → single identifier
"123" → integer constant
"if else" → two keywords
"x+y*z" → no spaces
"/* */ a" → comment before token
Test parser with valid programs:
```
class Empty { }
class OneField { field int x; }
function void f() { return; }
```
Test with invalid programs:
```
class { }  → missing class name
let x      → missing = and expression
if (x { }  → missing )
```
Integration tests
Parse complete programs and verify AST structure:
- Correct node types
- Correct nesting
- All identifiers captured
- All expressions preserved
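A sketch of one such check, assuming a hypothetical `parse` entry point; the field names here are illustrative, matching the ClassNode described earlier:

```typescript
import assert from 'node:assert';

// Assumed entry point and field names (illustrative): source in, ClassNode out.
declare function parse(src: string): {
  className: string;
  classVarDecs: { name: string; type: string; kind: string }[];
};

const ast = parse('class Point { field int x; }');
assert.strictEqual(ast.className, 'Point');         // identifier captured
assert.strictEqual(ast.classVarDecs.length, 1);     // one declaration node
assert.strictEqual(ast.classVarDecs[0].name, 'x');  // correct nesting
```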
What's next
We have an AST representing the program's structure. In Part 8, we'll traverse this tree and generate VM code:
- Expressions → push/pop and arithmetic operations
- Statements → control flow with labels and gotos
- Subroutines → function declarations and calls
- Classes → proper handling of this and static
The frontend gave us structure. The backend gives us execution.
Next: Part 8 implements the code generator, translating AST to VM code.