
Build a Computer from NAND Gates

Part 7: Compiler I - Syntax Analysis

We've designed the Jack language. Now we build a compiler that translates Jack to VM code.

Compilation happens in stages:

  1. Tokenization: Break source into tokens
  2. Parsing: Build a syntax tree
  3. Code Generation: Emit VM instructions

This part covers the frontend (tokenization and parsing). Part 8 covers code generation.

The tokenizer

The first step transforms a stream of characters into a stream of tokens:

For example, the statement let x = 5; becomes five tokens: the keyword let, the identifier x, the symbol =, the integer constant 5, and the symbol ;.

Token types

Type          Examples         Pattern
KEYWORD       class, if, let   Reserved words
SYMBOL        + - ;            Single characters
INT_CONST     0, 42, 32767     Digit sequences
STRING_CONST  "hello"          Text in quotes
IDENTIFIER    x, Main, foo     Names

Tokenization algorithm

while not at end:
    skip whitespace and comments

    if current char is symbol:
        emit SYMBOL token

    else if current char is digit:
        read all digits
        emit INT_CONST token

    else if current char is quote:
        read until closing quote
        emit STRING_CONST token

    else if current char is letter or underscore:
        read word (letters, digits, underscores)
        if word is keyword:
            emit KEYWORD token
        else:
            emit IDENTIFIER token
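
Here is that loop as a minimal TypeScript sketch. All names (tokenize, Tok, KEYWORDS, SYMBOLS) are illustrative, not a fixed API; comment stripping and line/column tracking are omitted here and covered below.

type TokenType = 'KEYWORD' | 'SYMBOL' | 'INT_CONST' | 'STRING_CONST' | 'IDENTIFIER';
interface Tok { type: TokenType; value: string }

const KEYWORDS = new Set(['class', 'constructor', 'function', 'method',
    'field', 'static', 'var', 'int', 'char', 'boolean', 'void', 'true',
    'false', 'null', 'this', 'let', 'do', 'if', 'else', 'while', 'return']);
const SYMBOLS = new Set('{}()[].,;+-*/&|<>=~');

function tokenize(src: string): Tok[] {
    const tokens: Tok[] = [];
    let pos = 0;
    while (pos < src.length) {
        const ch = src[pos];
        if (/\s/.test(ch)) { pos++; continue; }  // whitespace (comments: see below)
        if (SYMBOLS.has(ch)) {
            tokens.push({ type: 'SYMBOL', value: ch });
            pos++;
        } else if (/[0-9]/.test(ch)) {
            let end = pos + 1;
            while (end < src.length && /[0-9]/.test(src[end])) end++;
            tokens.push({ type: 'INT_CONST', value: src.slice(pos, end) });
            pos = end;
        } else if (ch === '"') {
            const close = src.indexOf('"', pos + 1);
            if (close < 0) throw new Error('Unterminated string constant');
            tokens.push({ type: 'STRING_CONST', value: src.slice(pos + 1, close) });
            pos = close + 1;
        } else if (/[A-Za-z_]/.test(ch)) {
            let end = pos + 1;
            while (end < src.length && /[A-Za-z0-9_]/.test(src[end])) end++;
            const word = src.slice(pos, end);
            tokens.push({ type: KEYWORDS.has(word) ? 'KEYWORD' : 'IDENTIFIER', value: word });
            pos = end;
        } else {
            throw new Error(`Invalid character '${ch}'`);
        }
    }
    return tokens;
}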

Handling comments

Jack supports two comment styles:

// Single-line comment (skip to end of line)

/* Multi-line comment
   spans multiple lines */

The tokenizer strips comments, so the parser never sees them.
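
A sketch of that stripping, as a helper the tokenizer loop can call before examining the next character (the name skipWhitespaceAndComments is illustrative):

// Advance past whitespace and both comment styles; returns the new position.
function skipWhitespaceAndComments(src: string, pos: number): number {
    while (pos < src.length) {
        if (/\s/.test(src[pos])) {
            pos++;                                    // whitespace
        } else if (src.startsWith('//', pos)) {
            const nl = src.indexOf('\n', pos);        // skip to end of line
            pos = nl < 0 ? src.length : nl + 1;
        } else if (src.startsWith('/*', pos)) {
            const close = src.indexOf('*/', pos + 2); // skip to closing */
            if (close < 0) throw new Error('Unterminated comment');
            pos = close + 2;
        } else {
            break;                                    // next real character
        }
    }
    return pos;
}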

String constants

Strings are enclosed in double quotes. A string constant cannot:

  • Contain a double quote (Jack has no escape syntax)
  • Contain a newline, i.e., span multiple lines

The tokenizer stores the content without the quotes.
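
A sketch of a reader that enforces these rules, assuming pos points at the opening quote:

// Returns the string content (without quotes) and the position just
// after the closing quote.
function readStringConstant(src: string, pos: number): [string, number] {
    let end = pos + 1;
    while (end < src.length && src[end] !== '"') {
        if (src[end] === '\n') throw new Error('String constant spans multiple lines');
        end++;
    }
    if (end >= src.length) throw new Error('Unterminated string constant');
    return [src.slice(pos + 1, end), end + 1];
}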

The parser

The parser takes tokens and builds a syntax tree that represents the program's structure:

Here is the top-level rule, parseClass, in pseudocode:

parseClass():
  consume('class')
  className = consume(IDENTIFIER)
  consume('{')
  while peek() in ['static', 'field']:
    parseClassVarDec()
  while peek() in ['constructor', 'function', 'method']:
    parseSubroutineDec()
  consume('}')

Recursive descent

We use recursive descent parsing, where each grammar rule becomes a function:

Grammar Rule    Parser Function
class           parseClass()
subroutineDec   parseSubroutineDec()
statements      parseStatements()
expression      parseExpression()
term            parseTerm()

Functions call each other following the grammar's structure.
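
The parseClass pseudocode above might render into TypeScript like this (a sketch: the cursor helpers and sub-parsers are only declared, and the returned node is trimmed to its name field):

declare function peek(): string;
declare function consume(expected?: string): { value: string };
declare function parseClassVarDec(): unknown;
declare function parseSubroutineDec(): unknown;

function parseClass(): { name: string } {
    consume('class');
    const className = consume().value;   // the class name IDENTIFIER
    consume('{');
    while (peek() === 'static' || peek() === 'field') {
        parseClassVarDec();
    }
    while (['constructor', 'function', 'method'].includes(peek())) {
        parseSubroutineDec();
    }
    consume('}');
    return { name: className };          // var decs and subroutines elided
}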

The parsing loop

Each parse function follows this pattern:

  1. Check what token we're looking at (peek)
  2. Consume expected tokens (advance and validate)
  3. Recurse into sub-rules as needed
  4. Build and return an AST node

Example for parseLetStatement:

parseLetStatement():
    consume(KEYWORD, 'let')
    varName = consume(IDENTIFIER)

    if peek() == '[':
        consume('[')
        indexExpr = parseExpression()
        consume(']')

    consume('=')
    value = parseExpression()
    consume(';')

    return LetStatement(varName, indexExpr, value)
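
The same function in TypeScript, as a sketch. peek and consume are the cursor helpers every parse function shares; parseExpression is assumed to exist, and the types are trimmed to essentials.

interface Token { type: string; value: string; line: number }
type ExpressionNode = unknown;                       // placeholder
interface LetStatementNode {
    type: 'letStatement';
    varName: string;
    indexExpr?: ExpressionNode;
    value: ExpressionNode;
}
declare function parseExpression(): ExpressionNode;  // defined elsewhere

let tokens: Token[] = [];
let pos = 0;

// Look at the current token's value without consuming it.
function peek(): string {
    return tokens[pos]?.value ?? '';
}

// Advance past the current token, optionally validating its value.
function consume(expected?: string): Token {
    const tok = tokens[pos];
    if (!tok) throw new Error('Unexpected end of input');
    if (expected !== undefined && tok.value !== expected) {
        throw new Error(`Line ${tok.line}: Expected ${expected}, got ${tok.value}`);
    }
    pos++;
    return tok;
}

function parseLetStatement(): LetStatementNode {
    consume('let');
    const varName = consume().value;      // the IDENTIFIER
    let indexExpr: ExpressionNode | undefined;
    if (peek() === '[') {                 // optional array index
        consume('[');
        indexExpr = parseExpression();
        consume(']');
    }
    consume('=');
    const value = parseExpression();
    consume(';');
    return { type: 'letStatement', varName, indexExpr, value };
}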

Lookahead

Sometimes we need to look ahead to decide which rule to apply:

parseTerm():
    if peek() is IDENTIFIER:
        name = consume()
        if peek() == '[':
            // Array access: arr[i]
            return parseArrayAccess(name)
        else if peek() == '(' or peek() == '.':
            // Subroutine call: func() or obj.method()
            return parseSubroutineCall(name)
        else:
            // Simple variable
            return VarTerm(name)

The Jack grammar is almost LL(1): one token of lookahead suffices everywhere except in term, where, as above, we consume the identifier and then peek at one more token to tell a plain variable, an array access, and a subroutine call apart.

Abstract syntax tree

Abstract Syntax Tree
SOURCE CODE
AST (Abstract Syntax Tree)
class
name: "Main"
subroutineDecs:
subroutineDec
kind: "function"
returnType: "void"
name: "main"
body:
subroutineBody
varDecs:
varDec
varType: "int"
names:
"x"
statements:
letStatement
varName: "x"
value:
expression
term:
intConst
value: 5
returnStatement

What's in the AST

The AST captures the program's semantic structure:

  • ClassNode: class name, variable declarations, subroutines
  • SubroutineDecNode: kind, return type, parameters, body
  • StatementNode: let, if, while, do, return with their components
  • ExpressionNode: term and operations
  • TermNode: constants, variables, calls, expressions

Each node type has fields for its components.

Parse tree vs AST

For the expression x + 5:

Parse tree (concrete):

expression
├── term
│   └── varName: "x"
├── op: "+"
└── term
    └── intConst: 5

AST (abstract):

BinaryOp
├── op: "+"
├── left: VarRef("x")
└── right: IntConst(5)

The parse tree mirrors the grammar exactly, recording every rule it passed through. The AST simplifies it, keeping only what's needed for code generation. Our compiler produces an AST.

A parse tree (concrete syntax tree) includes everything from the grammar, even punctuation and syntactic details.

An AST (abstract syntax tree) keeps only semantically important information:

  • No semicolons or braces (structure is implicit)
  • Operators are nodes, not intermediate rules
  • Simpler to traverse and generate code from

Symbol table

During parsing, we build a symbol table that tracks variable information:

Name   Type   Kind    Index  Segment
count  int    static  0      static 0
size   int    static  1      static 1
x      int    field   0      this 0
y      int    field   1      this 1
data   Array  field   2      this 2

Kind → segment mapping: static → static, field → this, argument → argument, var → local.

Two scopes

Class scope (persists for the class):

  • static variables → static segment
  • field variables → this segment

Subroutine scope (reset for each function):

  • argument variables → argument segment
  • local variables → local segment

Symbol table operations

class SymbolTable:
    define(name, type, kind)  // Add a variable
    varCount(kind)            // Count variables of kind
    kindOf(name)              // Get variable's kind
    typeOf(name)              // Get variable's type
    indexOf(name)             // Get variable's index

When we see let x = 5:

  1. Look up x in symbol table
  2. Get its kind (e.g., "var") and index (e.g., 0)
  3. Generate: pop local 0

Scope lookup

Variables are looked up in order:

  1. Subroutine scope first
  2. Class scope if not found
  3. Error if still not found

This allows local variables to shadow class variables.
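
A compact TypeScript sketch of such a table, with both scopes and the shadowing lookup order. It folds kindOf, typeOf, and indexOf into a single lookup that returns the whole entry; names are illustrative, not the exact API above.

type Kind = 'static' | 'field' | 'argument' | 'var';
interface Entry { type: string; kind: Kind; index: number }

class SymbolTable {
    private classScope = new Map<string, Entry>();
    private subroutineScope = new Map<string, Entry>();
    private counts: Record<Kind, number> = { static: 0, field: 0, argument: 0, var: 0 };

    // Reset subroutine-level state when compilation enters a new subroutine.
    startSubroutine(): void {
        this.subroutineScope.clear();
        this.counts.argument = 0;
        this.counts.var = 0;
    }

    define(name: string, type: string, kind: Kind): void {
        const scope = (kind === 'static' || kind === 'field')
            ? this.classScope : this.subroutineScope;
        scope.set(name, { type, kind, index: this.counts[kind]++ });
    }

    varCount(kind: Kind): number { return this.counts[kind]; }

    // Subroutine scope first, then class scope; undefined means undeclared.
    lookup(name: string): Entry | undefined {
        return this.subroutineScope.get(name) ?? this.classScope.get(name);
    }
}

After define('x', 'int', 'var'), lookup('x') returns kind var at index 0, which the code generator turns into pop local 0 for the let x = 5 example above.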

Error handling

Given a source file with several mistakes, the parser should report each one with its location:

Line 3: Unexpected token in term: ;
Line 4: Expected ;, got let
Line 4: Expected =, got 10
Line 6: Expected ), got {

Good error messages

An error message should include:

  • Line number (and column if possible)
  • What was expected (based on grammar)
  • What was found (actual token)
Line 5: Expected ')', got ';'
Line 8: Undefined variable 'count'
Line 12: Type mismatch: expected int, got String
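
A sketch of a formatter producing messages in that shape, assuming tokens carry line and column as in the Token interface under Implementation details:

interface Token { value: string; line: number; column: number }

// Build a syntax error from what the grammar expected and what we found.
function syntaxError(expected: string, found: Token): string {
    return `Line ${found.line}:${found.column}: Expected ${expected}, got '${found.value}'`;
}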

Error categories

Lexical errors (tokenizer):

  • Unterminated string
  • Invalid character
  • Integer too large

Syntax errors (parser):

  • Unexpected token
  • Missing closing bracket
  • Invalid statement

Semantic errors (later):

  • Undefined variable
  • Type mismatch
  • Wrong number of arguments

Error recovery

Simple compilers stop at the first error. Better compilers:

  1. Report the error
  2. Skip to a synchronization point (like ; or })
  3. Continue parsing to find more errors

Our compiler reports errors but may not recover gracefully.
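
A sketch of such recovery ("panic mode"), using the tokens/pos/errors parser state described under Implementation details:

declare let pos: number;
declare const tokens: { value: string }[];
declare const errors: string[];
declare function parseStatement(): unknown;

// Skip tokens until just past a likely statement boundary (; or }).
function synchronize(): void {
    while (pos < tokens.length) {
        const v = tokens[pos++].value;
        if (v === ';' || v === '}') return;
    }
}

// Report the error, resynchronize, and let the caller keep parsing.
function parseStatementSafely(): unknown {
    try {
        return parseStatement();
    } catch (e) {
        errors.push(String(e));
        synchronize();
        return undefined;
    }
}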

Complete frontend pipeline

Source Code
    ↓ Tokenizer
Token Stream
    ↓ Parser
Abstract Syntax Tree
    ↓ (Symbol Table built during parsing)
AST + Symbol Table

The frontend transforms text into a structured representation that the code generator can process.
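
As code, the whole frontend is a short composition (a sketch; tokenize and Parser are the illustrative names from this post's other sketches):

declare function tokenize(src: string): unknown[];
declare class Parser {
    constructor(tokens: unknown[]);
    parseClass(): unknown;
    errors: string[];
}

function compileFrontend(source: string): unknown {
    const tokens = tokenize(source);      // text → token stream
    const parser = new Parser(tokens);
    const ast = parser.parseClass();      // tokens → AST (+ symbol table)
    if (parser.errors.length > 0) {
        throw new Error(parser.errors.join('\n'));
    }
    return ast;
}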

Implementation details

Token structure

interface Token {
    type: 'KEYWORD' | 'SYMBOL' | 'INT_CONST' | 'STRING_CONST' | 'IDENTIFIER';
    value: string;
    line: number;
    column: number;
}

AST node example

interface LetStatementNode {
    type: 'letStatement';
    varName: string;
    indexExpr?: ExpressionNode;  // For array access
    value: ExpressionNode;
    line?: number;
}

Parser state

The parser maintains:

  • tokens: Array of tokens from tokenizer
  • pos: Current position in token array
  • errors: Accumulated error messages

Testing the frontend

Unit tests

Test tokenizer with edge cases:

""           → empty string
"a"          → single identifier
"123"        → integer constant
"if else"    → two keywords
"x+y*z"      → no spaces
"/* */ a"    → comment before token

Test parser with valid programs:

class Empty { }
class OneField { field int x; }
function void f() { return; }

Test with invalid programs:

class { }      → missing class name
let x          → missing = and expression
if (x { }      → missing )

Integration tests

Parse complete programs and verify AST structure:

  • Correct node types
  • Correct nesting
  • All identifiers captured
  • All expressions preserved

What's next

We have an AST representing the program's structure. In Part 8, we'll traverse this tree and generate VM code:

  • Expressions → push/pop and arithmetic operations
  • Statements → control flow with labels and gotos
  • Subroutines → function declarations and calls
  • Classes → proper handling of this and static

The frontend gave us structure. The backend gives us execution.


Next: Part 8 implements the code generator, translating AST to VM code.