Wiki Compiler
Let's have a look at the functions that a compiler should be able to build.
- lexer(Lexical analysis)
- generate tokens from source code
- parser
- construct abstract syntax tree(AST) from tokens
- code generation
- generate low-level code, such as assembly code or machine code
Parser
BNF syntax is used to for computer to understand the expression, and is a critical concept to be followed to sequently parse the tokens to AST tree. Certainly, only one loop on the tokens is enough, which makes it very efficient.
BNF syntax for arithmetic operations grammar, including +, -, *, \, and u-(unary -) .
expression : term
| expression `+` expression
| expression `-` expression
term : factor
| term `*` term
| term `/` term
factor : NUMBER
| `(` expression `)`
| `u-`factor
| `u+`factor
BNF syntax for arithmetic operations and variable assignment.
expression : term
| expression `+` expression
| expression `-` expression
term : factor
| term `*` term
| term `/` term
| term `%` term
factor : NUMBER
| ID
| `(` expression `)`
| `u-`factor
| `u+`factor
| assignment
assignment : ID `=` expression
In addition, BNF can also be applied to define regular expressions:
expression : term
| term `|` term
term : factor
| term term
factor : atom
| atom `*`
atom : CHAR
| `(` expression `)`
CHAR : any valid character except meta characters (e.g., "*", "|", "(")
Difference between Compiler and Interpreter
an interpreter also does lexer and parser jobs as a compiler does in step 1 and 2, but instead of generating low-level code, the interpreter generates the results directly.
Bootstrap a compiler
A new programming language and a compiler written also in the new language is supposed to develop from an existing language. The progress is called bootstrapping, which can be summarized as,
C1 + L1 -> C20
C20 + L2u -> C21
C21 + L2 -> C22
C22 + L2 -> C23
C23 + L2 -> C24
L1 : an existing language
C1 : an existing compiler for language L1
C20: a compiler written in language L1 for language L2u
C21: a compiler written in language L2u for language L2
L2u: is subset of language L2
Bootstrapping stage:
- Write a bootstrap compiler
C20to understand languageL2u(a subset of languageL2) in using existed languageL1and its corresponding compilerC1. - Use the compiler
C20and languageL2uto write the compilerC21to understand languageL2. - Now
C21is a fully self-hosted compiler, as well as its descendantsC22,C23, andC24.
Where did the existing compiler C1 come from?
There is no need to use a compiler C1 + L1 if you write the bootstrap compiler C20 in machine code. This solves the chicken-and-egg problem totally for programming languages.
- Bootstrapping initial compiler
C20:- A small and simple compiler is created manually in machine code or written in
assemblylanguage. - [Option*] Translate the
assemblylanguage into machine code manually if it's not written in machine code. - The initial compiler is just capable enough to understand a subset of the target language
Cit is supposed to compile.
- A small and simple compiler is created manually in machine code or written in
- Use the initial compiler
C20to compile the compilerC21written in languageCwhile theC21is also supposed to compile languageC. - Now compiler
C21a fully self-compilation.
Strange Loops: Ken Thompson and the Self-referencing C Compiler | ScienceBlogs
Bootstrapping (compilers) - Wikipedia
Compilers: Principles, Techniques, and Tools - Wikipedia
Implementations
Interpreter:
- GitHub - rswier/c4: C in four functions
- GitHub - lotabout/write-a-C-interpreter: Write a simple interpreter of C. Inspired by c4 and largely based on it.
Self-hosted Compiler:
- GitHub - DoctorWkt/acwj: A Compiler Writing Journey
- GitHub - certik/bcompile: Bootstrapping a simple compiler from nothing
The basic knowledge of lexer and parser is critical and necessary for developing a programming language,
flex/lexyacc/parser
Compiler for a subset of C language bootstrapping from Python
Recently, I am becoming interested in building a lexer, parser and code generator to try to create a mini language and deep insight of how GCC or Clang/LLVM do their jobs.
For educational purposes, learning in practice is my favorite approach to grasp an overview.
Let's do it!
Prerequisites:
- Python for writing the bootstrap compiler
I use ply, a pure Python implementation of the lex and yacc tools to facilitate me to write the bootstrap compiler for the subset of C language.
Compiler for a subset of C language bootstrapping from C
Prerequisites:
- An existing
GCCfor writing the bootstrap compiler
Here are some popular tutorials from GitHub - DoctorWkt/acwj: A Compiler Writing Journey.
You can also refer GitHub - lotabout/write-a-C-interpreter although I prefer classifying it as interpreter not a complete compiler.
Compiler bootstrapping from assembly
Compiler bootstrapping from HEX
GitHub - certik/bcompile: Bootstrapping a simple compiler from nothing