Sunday, February 4, 2018

[Compiler] Inside My Implementation of the Jack Compiler

A compiler would take a program written in a high level language and translate the instruction to the instruction for a Virtual Machine(VM).

#Modules
A compiler has the following modules
0. Tokenizer
1. Parser
2. Symbol table generator
3. VM code generator

# Implementation Detail
0. Tokenizer
A language defines its basic building blocks (tokens). These basic building blocks contains comment structure, keywords, symbols, identifier logic,  literals(integer, string etc). A parser would take a program and generate stream of tokens as defined by a given language.

The tokenizer gives the following apis
i. hasMoreTokens()
ii. advance()
iii. getCurrentToken()

Algorithm:
i. Load given program in to memory
ii. Remove all the comments
iii. Add a space where 2 tokens are joined (eg. convert 5; to 5 <Space> ;)
iv. Look for a character
             if (character != ")
                curIndex = i
                j = index of next space
                subString = extract(i, j)
                using regular expression, check if the subString is a proper token, if so store in a list raise                     an exception otherwise
            if (character == ")
                look for next " and extract the substring and store in the token list


1. Parser
Uses a tokenizer and checks if the token sequence is in harmony with the grammar of the language.

Implementation:
Lets consider rule for class
class   >>>
    'class' identifer '{'
              classVarDec*
              subroutineDec*
     '}'

A set of function related to each grammar construct and bunch of helper function helped me writing the parser module.
Example:
For the rule above, I wrote the following method in the parser

void CompileClass(){
  keyword(CLASS);
  identifier();
  symbol('{');
  classVarDec_zero_more();
  subroutineDec_zero_more();
  symbol('}');  
}
Here compileClass is the method related to the grammar rule and classVarDec_zero_more() is a helper function.
the method symbol(char) was implemented as follows

void symbol(s)
  curTok = tokenizer.getCurrentToken();
  if(curTok.type != SYMBOL)
    raise exception
   
  if(curTok.val != s)
    raise exception
 
  tokenizer.advance()
end

2. Symbol Table
A symbol table is used to translate a variable declared in a high level language to a (segment, offset) pair of a virtual machine. Symbol table is divided in two spaces, one for variables declared in class level another one for subroutine level

API
i. define(varName, kind(static, field etc), type(String, CustomClass etc))
  called when a variable declaratrion is encountered
ii. startMethodLevelSymbolTable()
  called when a method declaration is encountered
iii. getInfo(varName)
  used to resolve a variable name encountered

An instance of symbol table needs to be added on the parser. As the parser process any variable declaration, an entry needs to be created in the symbol table.

3. VM Code generator
A VM Code generator needs to be augmented within the parser. As the parser parses through a given source code, a VM Code Generator would gather context and when appropriate will generate proper VM Code. This module keeps an eye on the following cases
i. Function declartion handling
A VM only cares about subroutines. A compiler needs to translate and write all the subroutines in a output file. To resolve name collision, when a subroutine subRot is encountered in class MyClass, the VM code generator would give the subroutine the name "MyClass.subRot". Here is how it deals with different subroutine types.

#Static
Observes how many local variables declared in the static method (lets call it n) and then generate the following code
function  MyClass.subRot n

#Construtor
Observes how many local variables are declared(localCount) from parser and how many fields the MyClass has(fieldCount) from symbol table. Then generates the following code-

function MyClass.new localCount
  call Memory.Alloc(fieldCount)
  pop pointer 0 //pointer 0 is used for current object also known as this
 
#Method
A method implicitly expects the object to act on as its first argument. It observes how many local variables are declared(localCount) from parser then generates the following code-

function MyClass.subRot localCount+1
  push arg 0
  pop pointer 0 //sets this pointer
 
The above code essentially sets this pointer to the object on which the method is called.

# Variable declaration
Create an entry in appropraite space of the symbol table with proper context.

# Expressions
An expression is evaluated and is value is pushed to stack.
## for an int value, generate -> push value.
## True -> generate -> push -1 //notice -1 means 11111... in memory
## False, null -> push 0
## "ab" generate the following
  call Memory.alloc 3 //3 characters in the string
  call String.appendChar 97 //ascii value of 'a'
  call String.appendChar 98
## variable, a variable is resolved using the symbol table. For example, lets say a variable var was declared in a method and it was the 2nd variable. So this variable exists in local segemnt's 2nd index. it will be resolved as following
  push local 2
A variable declared in a class, or declared as staic or declared as an argument would resolved with a proper segment and a proper offset.
 
## binary operator, an expression with the form (exp1 op exp1) is translated in the following way
  evaluate exp1 //exp1's value would be on top of the stack
  evaluate exp2
  op //apply the operator on the top two values of the stack
 
## unary operator, op exp
  evalute exp
  op
 
## for a method call myMethod(1, 2) of class MyClass, the following code will be generated
  push pointer 0 //push this object's reference to stack
  push constant 1
  push constant 2
  call MyClass.myMethod 3 //3 is the number of arguments the caller has passed to the function
 
## To translate a call of myObj.myMethod(1,2) using the symbol table, Code generator module finds its type lets say MyClass. Then it generates the following-
  push myObj //pseudo code, myObj needs to be resolved using symbol table as described earlier.
  push 1
  push 2
  call MyClass.myMethod 3
 
## To translate a static method MyClass.myStatic(1,2) the Code generator does not have to push any objects reference as the first argument. Other than that, it produces VM code as following the logic earlier.

## Access array index, myObj[exp] is translated as follows
  evaluate exp
  push segment offset //myObj variable's segment offset
  add
  pop pointer 1 //set that pointer (pointer 1), with the address of myObj[exp]
  push that 0 // push the value of myObj[exp] to stack

# Statements
## return : Every subroutine is expected to produce a result and push it to the stack.
return; is translated as
  push constant 0
  return
 
return exp; is translated as -
  evaluate exp //the value of exp will be on top of the stack
  return
 
## method call. A method call statement, do myMethodCall(1,2); would be translated as follows
  generate code for myMethodCall(1,2) discussed in the Expression section
  pop temp 0 // discard the value generated by the function as it will not be used
 
## assignment
  ### let var = exp; using symbol table first finds segment, offset of the var. Then the following code gets generated
    evaluate exp
    pop segment offset
   
  ### let var[exp1] = exp2
    evaluate exp2
    evaluate exp1
    push segment offset
    add
    pop pointer 1
    pop that 0
   
## If else
  if(cond) { stmts1 } else { stmts2 } is translated as follows
 
  if-goto IF-TRUE
  goto IF-FALSE
  label IF-TRUE
 
  generate code for stmts1
 
  goto IF-END
  label IF-FALSE
 
  generate code for stmts2
 
  label IF-END

## while
  while(cond) { stmts } is translated as follows
 
  label WHILE-BEGIN
  evaluate cond
  not
  if-goto WHILE-END
 
  generate code for stmts
 
  goto WHILE-BEGIN
  label END

Conclusion:
To build a compiler from the scratch, start with the tokenizer module. Build parser module with the the help of tokenizer. Write symbol table module and integrate it to the Parser. Write VM Code generator module, then augment it to the Parser module; also add proper context capturing code in the Parser module.

Reference:
0. My compiler implementation
1. Elements of computing: Building a modern computer from the first principle
2. A modern compiler generation in Java
3. Stanford Compiler course by Dr. Alex Aiken
4. GATE lecture on Compilers by RavindraBabu

No comments:

Post a Comment