Wiki Compiler
Overview
Let's have a look at the functions that a compiler should be able to build.
- lexer(Lexical analysis)
- generate tokens from source code
- parser
- construct abstract syntax tree(AST) from tokens
- code generation
- generate low-level code, such as assembly code or machine code
Parser
BNF
syntax is used to for computer to understand the expression, and is a critical concept to be followed to sequently parse the tokens to AST tree. Certainly, only one loop on the tokens is enough, which makes it very efficient.
BNF
syntax for arithmetic operations grammar, including +
, -
, *
, \
, and u-
(unary -
) .
expression : term
| expression `+` expression
| expression `-` expression
term : factor
| term `*` term
| term `/` term
factor : NUMBER
| `(` expression `)`
| `u-`factor
| `u+`factor
BNF
syntax for arithmetic operations and variable assignment.
expression : term
| expression `+` expression
| expression `-` expression
term : factor
| term `*` term
| term `/` term
| term `%` term
factor : NUMBER
| ID
| `(` expression `)`
| `u-`factor
| `u+`factor
| assignment
assignment : ID `=` expression
In addition, BNF
can also be applied to define regular expressions:
expression : term
| term `|` term
term : factor
| term term
factor : atom
| atom `*`
atom : CHAR
| `(` expression `)`
CHAR : any valid character except meta characters (e.g., "*", "|", "(")
Difference between Compiler and Interpreter
an interpreter also does lexer
and parser
jobs as a compiler does in step 1 and 2, but instead of generating low-level code, the interpreter generates the results directly.
Bootstrap a compiler
A new programming language and a compiler written also in the new language is supposed to develop from an existing language. The progress is called bootstrapping, which can be summarized as,
C1 + L1 -> C20
C20 + L2u -> C21
C21 + L2 -> C22
C22 + L2 -> C23
C23 + L2 -> C24
L1
: an existing language
C1
: an existing compiler for language L1
C20
: a compiler written in language L1
for language L2u
C21
: a compiler written in language L2u
for language L2
L2u
: is subset of language L2
Bootstrapping stage:
- Write a bootstrap compiler
C20
to understand languageL2u
(a subset of languageL2
) in using existed languageL1
and its corresponding compilerC1
. - Use the compiler
C20
and languageL2u
to write the compilerC21
to understand languageL2
. - Now
C21
is a fully self-hosted compiler, as well as its descendantsC22
,C23
, andC24
.
Where did the existing compiler C1
come from?
There is no need to use a compiler C1
+ L1
if you write the bootstrap compiler C20
in machine code. This solves the chicken-and-egg problem totally for programming languages.
- Bootstrapping initial compiler
C20
:- A small and simple compiler is created manually in machine code or written in
assembly
language. - [Option*] Translate the
assembly
language into machine code manually if it's not written in machine code. - The initial compiler is just capable enough to understand a subset of the target language
C
it is supposed to compile.
- A small and simple compiler is created manually in machine code or written in
- Use the initial compiler
C20
to compile the compilerC21
written in languageC
while theC21
is also supposed to compile languageC
. - Now compiler
C21
a fully self-compilation.
Strange Loops: Ken Thompson and the Self-referencing C Compiler | ScienceBlogs
Bootstrapping (compilers) - Wikipedia
Compilers: Principles, Techniques, and Tools - Wikipedia
Implementations
Interpreter:
- GitHub - rswier/c4: C in four functions
- GitHub - lotabout/write-a-C-interpreter: Write a simple interpreter of C. Inspired by c4 and largely based on it.
Self-hosted Compiler:
- GitHub - DoctorWkt/acwj: A Compiler Writing Journey
- GitHub - certik/bcompile: Bootstrapping a simple compiler from nothing
The basic knowledge of lexer
and parser
is critical and necessary for developing a programming language,
flex/lex
yacc/parser
Compiler for a subset of C language bootstrapping from Python
Recently, I am becoming interested in building a lexer
, parser
and code generator
to try to create a mini language and deep insight of how GCC
or Clang/LLVM
do their jobs.
For educational purposes, learning in practice is my favorite approach to grasp an overview.
Let's do it!
Prerequisites:
- Python for writing the bootstrap compiler
I use ply, a pure Python implementation of the lex
and yacc
tools to facilitate me to write the bootstrap compiler for the subset of C language.
Compiler for a subset of C language bootstrapping from C
Prerequisites:
- An existing
GCC
for writing the bootstrap compiler
Here are some popular tutorials from GitHub - DoctorWkt/acwj: A Compiler Writing Journey.
You can also refer GitHub - lotabout/write-a-C-interpreter although I prefer classifying it as interpreter not a complete compiler.
Compiler bootstrapping from assembly
Compiler bootstrapping from HEX
GitHub - certik/bcompile: Bootstrapping a simple compiler from nothing