A compiler would take a program written in a high level language and translate the instruction to the instruction for a Virtual Machine(VM).
#Modules
A compiler has the following modules
0. Tokenizer
1. Parser
2. Symbol table generator
3. VM code generator
# Implementation Detail
0. Tokenizer
A language defines its basic building blocks (tokens). These basic building blocks contains comment structure, keywords, symbols, identifier logic, literals(integer, string etc). A parser would take a program and generate stream of tokens as defined by a given language.
The tokenizer gives the following apis
i. hasMoreTokens()
ii. advance()
iii. getCurrentToken()
Algorithm:
i. Load given program in to memory
ii. Remove all the comments
iii. Add a space where 2 tokens are joined (eg. convert 5; to 5 <Space> ;)
iv. Look for a character
if (character != ")
curIndex = i
j = index of next space
subString = extract(i, j)
using regular expression, check if the subString is a proper token, if so store in a list raise an exception otherwise
if (character == ")
look for next " and extract the substring and store in the token list
1. Parser
Uses a tokenizer and checks if the token sequence is in harmony with the grammar of the language.
Implementation:
Lets consider rule for class
class >>>
'class' identifer '{'
classVarDec*
subroutineDec*
'}'
A set of function related to each grammar construct and bunch of helper function helped me writing the parser module.
Example:
For the rule above, I wrote the following method in the parser
void CompileClass(){
keyword(CLASS);
identifier();
symbol('{');
classVarDec_zero_more();
subroutineDec_zero_more();
symbol('}');
}
Here compileClass is the method related to the grammar rule and classVarDec_zero_more() is a helper function.
the method symbol(char) was implemented as follows
void symbol(s)
curTok = tokenizer.getCurrentToken();
if(curTok.type != SYMBOL)
raise exception
if(curTok.val != s)
raise exception
tokenizer.advance()
end
2. Symbol Table
A symbol table is used to translate a variable declared in a high level language to a (segment, offset) pair of a virtual machine. Symbol table is divided in two spaces, one for variables declared in class level another one for subroutine level
API
i. define(varName, kind(static, field etc), type(String, CustomClass etc))
called when a variable declaratrion is encountered
ii. startMethodLevelSymbolTable()
called when a method declaration is encountered
iii. getInfo(varName)
used to resolve a variable name encountered
An instance of symbol table needs to be added on the parser. As the parser process any variable declaration, an entry needs to be created in the symbol table.
3. VM Code generator
A VM Code generator needs to be augmented within the parser. As the parser parses through a given source code, a VM Code Generator would gather context and when appropriate will generate proper VM Code. This module keeps an eye on the following cases
i. Function declartion handling
A VM only cares about subroutines. A compiler needs to translate and write all the subroutines in a output file. To resolve name collision, when a subroutine subRot is encountered in class MyClass, the VM code generator would give the subroutine the name "MyClass.subRot". Here is how it deals with different subroutine types.
#Static
Observes how many local variables declared in the static method (lets call it n) and then generate the following code
function MyClass.subRot n
#Construtor
Observes how many local variables are declared(localCount) from parser and how many fields the MyClass has(fieldCount) from symbol table. Then generates the following code-
function MyClass.new localCount
call Memory.Alloc(fieldCount)
pop pointer 0 //pointer 0 is used for current object also known as this
#Method
A method implicitly expects the object to act on as its first argument. It observes how many local variables are declared(localCount) from parser then generates the following code-
function MyClass.subRot localCount+1
push arg 0
pop pointer 0 //sets this pointer
The above code essentially sets this pointer to the object on which the method is called.
# Variable declaration
Create an entry in appropraite space of the symbol table with proper context.
# Expressions
An expression is evaluated and is value is pushed to stack.
## for an int value, generate -> push value.
## True -> generate -> push -1 //notice -1 means 11111... in memory
## False, null -> push 0
## "ab" generate the following
call Memory.alloc 3 //3 characters in the string
call String.appendChar 97 //ascii value of 'a'
call String.appendChar 98
## variable, a variable is resolved using the symbol table. For example, lets say a variable var was declared in a method and it was the 2nd variable. So this variable exists in local segemnt's 2nd index. it will be resolved as following
push local 2
A variable declared in a class, or declared as staic or declared as an argument would resolved with a proper segment and a proper offset.
## binary operator, an expression with the form (exp1 op exp1) is translated in the following way
evaluate exp1 //exp1's value would be on top of the stack
evaluate exp2
op //apply the operator on the top two values of the stack
## unary operator, op exp
evalute exp
op
## for a method call myMethod(1, 2) of class MyClass, the following code will be generated
push pointer 0 //push this object's reference to stack
push constant 1
push constant 2
call MyClass.myMethod 3 //3 is the number of arguments the caller has passed to the function
## To translate a call of myObj.myMethod(1,2) using the symbol table, Code generator module finds its type lets say MyClass. Then it generates the following-
push myObj //pseudo code, myObj needs to be resolved using symbol table as described earlier.
push 1
push 2
call MyClass.myMethod 3
## To translate a static method MyClass.myStatic(1,2) the Code generator does not have to push any objects reference as the first argument. Other than that, it produces VM code as following the logic earlier.
## Access array index, myObj[exp] is translated as follows
evaluate exp
push segment offset //myObj variable's segment offset
add
pop pointer 1 //set that pointer (pointer 1), with the address of myObj[exp]
push that 0 // push the value of myObj[exp] to stack
# Statements
## return : Every subroutine is expected to produce a result and push it to the stack.
return; is translated as
push constant 0
return
return exp; is translated as -
evaluate exp //the value of exp will be on top of the stack
return
## method call. A method call statement, do myMethodCall(1,2); would be translated as follows
generate code for myMethodCall(1,2) discussed in the Expression section
pop temp 0 // discard the value generated by the function as it will not be used
## assignment
### let var = exp; using symbol table first finds segment, offset of the var. Then the following code gets generated
evaluate exp
pop segment offset
### let var[exp1] = exp2
evaluate exp2
evaluate exp1
push segment offset
add
pop pointer 1
pop that 0
## If else
if(cond) { stmts1 } else { stmts2 } is translated as follows
if-goto IF-TRUE
goto IF-FALSE
label IF-TRUE
generate code for stmts1
goto IF-END
label IF-FALSE
generate code for stmts2
label IF-END
## while
while(cond) { stmts } is translated as follows
label WHILE-BEGIN
evaluate cond
not
if-goto WHILE-END
generate code for stmts
goto WHILE-BEGIN
label END
Conclusion:
To build a compiler from the scratch, start with the tokenizer module. Build parser module with the the help of tokenizer. Write symbol table module and integrate it to the Parser. Write VM Code generator module, then augment it to the Parser module; also add proper context capturing code in the Parser module.
Reference:
0. My compiler implementation
1. Elements of computing: Building a modern computer from the first principle
2. A modern compiler generation in Java
3. Stanford Compiler course by Dr. Alex Aiken
4. GATE lecture on Compilers by RavindraBabu
#Modules
A compiler has the following modules
0. Tokenizer
1. Parser
2. Symbol table generator
3. VM code generator
# Implementation Detail
0. Tokenizer
A language defines its basic building blocks (tokens). These basic building blocks contains comment structure, keywords, symbols, identifier logic, literals(integer, string etc). A parser would take a program and generate stream of tokens as defined by a given language.
The tokenizer gives the following apis
i. hasMoreTokens()
ii. advance()
iii. getCurrentToken()
Algorithm:
i. Load given program in to memory
ii. Remove all the comments
iii. Add a space where 2 tokens are joined (eg. convert 5; to 5 <Space> ;)
iv. Look for a character
if (character != ")
curIndex = i
j = index of next space
subString = extract(i, j)
using regular expression, check if the subString is a proper token, if so store in a list raise an exception otherwise
if (character == ")
look for next " and extract the substring and store in the token list
1. Parser
Uses a tokenizer and checks if the token sequence is in harmony with the grammar of the language.
Implementation:
Lets consider rule for class
class >>>
'class' identifer '{'
classVarDec*
subroutineDec*
'}'
A set of function related to each grammar construct and bunch of helper function helped me writing the parser module.
Example:
For the rule above, I wrote the following method in the parser
void CompileClass(){
keyword(CLASS);
identifier();
symbol('{');
classVarDec_zero_more();
subroutineDec_zero_more();
symbol('}');
}
Here compileClass is the method related to the grammar rule and classVarDec_zero_more() is a helper function.
the method symbol(char) was implemented as follows
void symbol(s)
curTok = tokenizer.getCurrentToken();
if(curTok.type != SYMBOL)
raise exception
if(curTok.val != s)
raise exception
tokenizer.advance()
end
2. Symbol Table
A symbol table is used to translate a variable declared in a high level language to a (segment, offset) pair of a virtual machine. Symbol table is divided in two spaces, one for variables declared in class level another one for subroutine level
API
i. define(varName, kind(static, field etc), type(String, CustomClass etc))
called when a variable declaratrion is encountered
ii. startMethodLevelSymbolTable()
called when a method declaration is encountered
iii. getInfo(varName)
used to resolve a variable name encountered
An instance of symbol table needs to be added on the parser. As the parser process any variable declaration, an entry needs to be created in the symbol table.
3. VM Code generator
A VM Code generator needs to be augmented within the parser. As the parser parses through a given source code, a VM Code Generator would gather context and when appropriate will generate proper VM Code. This module keeps an eye on the following cases
i. Function declartion handling
A VM only cares about subroutines. A compiler needs to translate and write all the subroutines in a output file. To resolve name collision, when a subroutine subRot is encountered in class MyClass, the VM code generator would give the subroutine the name "MyClass.subRot". Here is how it deals with different subroutine types.
#Static
Observes how many local variables declared in the static method (lets call it n) and then generate the following code
function MyClass.subRot n
#Construtor
Observes how many local variables are declared(localCount) from parser and how many fields the MyClass has(fieldCount) from symbol table. Then generates the following code-
function MyClass.new localCount
call Memory.Alloc(fieldCount)
pop pointer 0 //pointer 0 is used for current object also known as this
#Method
A method implicitly expects the object to act on as its first argument. It observes how many local variables are declared(localCount) from parser then generates the following code-
function MyClass.subRot localCount+1
push arg 0
pop pointer 0 //sets this pointer
The above code essentially sets this pointer to the object on which the method is called.
# Variable declaration
Create an entry in appropraite space of the symbol table with proper context.
# Expressions
An expression is evaluated and is value is pushed to stack.
## for an int value, generate -> push value.
## True -> generate -> push -1 //notice -1 means 11111... in memory
## False, null -> push 0
## "ab" generate the following
call Memory.alloc 3 //3 characters in the string
call String.appendChar 97 //ascii value of 'a'
call String.appendChar 98
## variable, a variable is resolved using the symbol table. For example, lets say a variable var was declared in a method and it was the 2nd variable. So this variable exists in local segemnt's 2nd index. it will be resolved as following
push local 2
A variable declared in a class, or declared as staic or declared as an argument would resolved with a proper segment and a proper offset.
## binary operator, an expression with the form (exp1 op exp1) is translated in the following way
evaluate exp1 //exp1's value would be on top of the stack
evaluate exp2
op //apply the operator on the top two values of the stack
## unary operator, op exp
evalute exp
op
## for a method call myMethod(1, 2) of class MyClass, the following code will be generated
push pointer 0 //push this object's reference to stack
push constant 1
push constant 2
call MyClass.myMethod 3 //3 is the number of arguments the caller has passed to the function
## To translate a call of myObj.myMethod(1,2) using the symbol table, Code generator module finds its type lets say MyClass. Then it generates the following-
push myObj //pseudo code, myObj needs to be resolved using symbol table as described earlier.
push 1
push 2
call MyClass.myMethod 3
## To translate a static method MyClass.myStatic(1,2) the Code generator does not have to push any objects reference as the first argument. Other than that, it produces VM code as following the logic earlier.
## Access array index, myObj[exp] is translated as follows
evaluate exp
push segment offset //myObj variable's segment offset
add
pop pointer 1 //set that pointer (pointer 1), with the address of myObj[exp]
push that 0 // push the value of myObj[exp] to stack
# Statements
## return : Every subroutine is expected to produce a result and push it to the stack.
return; is translated as
push constant 0
return
return exp; is translated as -
evaluate exp //the value of exp will be on top of the stack
return
## method call. A method call statement, do myMethodCall(1,2); would be translated as follows
generate code for myMethodCall(1,2) discussed in the Expression section
pop temp 0 // discard the value generated by the function as it will not be used
## assignment
### let var = exp; using symbol table first finds segment, offset of the var. Then the following code gets generated
evaluate exp
pop segment offset
### let var[exp1] = exp2
evaluate exp2
evaluate exp1
push segment offset
add
pop pointer 1
pop that 0
## If else
if(cond) { stmts1 } else { stmts2 } is translated as follows
if-goto IF-TRUE
goto IF-FALSE
label IF-TRUE
generate code for stmts1
goto IF-END
label IF-FALSE
generate code for stmts2
label IF-END
## while
while(cond) { stmts } is translated as follows
label WHILE-BEGIN
evaluate cond
not
if-goto WHILE-END
generate code for stmts
goto WHILE-BEGIN
label END
Conclusion:
To build a compiler from the scratch, start with the tokenizer module. Build parser module with the the help of tokenizer. Write symbol table module and integrate it to the Parser. Write VM Code generator module, then augment it to the Parser module; also add proper context capturing code in the Parser module.
Reference:
0. My compiler implementation
1. Elements of computing: Building a modern computer from the first principle
2. A modern compiler generation in Java
3. Stanford Compiler course by Dr. Alex Aiken
4. GATE lecture on Compilers by RavindraBabu
No comments:
Post a Comment