Tools for Compiler Construction
We are in the process of creating and implementing
(and eventually using) a new programming language for
the implementation of logical algorithms and computer algebra.
Logic is special because it operates on tree structures that
can have many different forms. My current (2022) view is that
logic, in an extreme form, reveals a weakness that all current
programming languages have: We don't know how to handle
unions, and we don't know how to handle partiality.
These problems occur everywhere, but logic has more of them.
All existing approaches (union, std::variant, inheritance, type casts,
std::optional, std::expected, use of nullptr, use of enumeration types,
use of special values like std::string::npos) are unsatisfactory.
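To make the complaint concrete, here is a small sketch (our own illustration, not code from our language) of how a union together with a partial operation looks in the std::variant/std::optional style:

```cpp
#include <optional>
#include <string>
#include <variant>

// A value that is one of several alternatives (a union):
using Value = std::variant<long, double, std::string>;

// A partial operation: a string has no numeric value. The caller must
// remember to check the optional, and every alternative must be handled
// by hand; nothing forces this code to stay complete when a new
// alternative is added to Value.
std::optional<double> as_double(const Value& v)
{
    if (std::holds_alternative<long>(v))
        return static_cast<double>(std::get<long>(v));
    if (std::holds_alternative<double>(v))
        return std::get<double>(v);
    return {};   // partiality: no result for the remaining alternatives
}
```

The union and the partiality are expressible, but both remain the caller's responsibility, which is exactly the dissatisfaction expressed above.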
As more parts of the compiler are being completed, they will gradually appear
on this page. In addition to the materials (code and documents) that
are directly related to the development of our programming language,
we will also post material that may be useful for developers
of compilers/interpreters of other languages, as well as theoretical papers.
At this moment, we have a parser generator and a tokenizer
generator for C++, slides about parsing, and slides about intermediate code
generation for C. Most of the slides were created while teaching
compiler construction at
Instytut Informatyki Uniwersytetu and at the
School of Engineering and Digital Sciences at Nazarbayev University.
Most of the code on this page
was written while working at Nazarbayev University.
Presentation at C++ Now
I gave a lecture at
C++ Now about
parsing, tokenizing, and Maphoon.
The slides and the code fragments that were shown in the lecture are available here.
Parser Generation for C++ (Maphoon)
After having studied existing parser generators, we concluded
that none of them fulfilled our needs, and we decided to write our
own. We aimed to design it in such a way that it would be useful
for other compiler implementors as well.
The parser generator is called Maphoon.
While designing Maphoon, we took
CUP as our main example
of a nice parser generator. We wanted the same in C++, and added a few
features of our own. Maphoon has the following features:
- The constructed parser is bottom-up.
Bottom-up parsing is theoretically and practically superior to
top-down parsing. There is no need to adapt the grammar,
and attribute computation can be specified with the grammar rules.
- It is possible to use modern C++ in action code.
Concretely, it is possible to use attribute types with resource
invariants (copy constructor, copying assignment, destructors)
without effort. (Think of STL-containers, smart pointers, etc.)
If all attribute types are movable, the parser will take
advantage of this.
- The parser generator and the constructed parsers use
portable C++, without the need for additional libraries.
- It is easy to maintain source information in symbols for the
generation of error messages.
- It is possible to define operators at run time.
Our own language does not use this, but for example Prolog depends on it.
Even if one does not use this feature, it often results in simpler
parsers that can be more easily modified.
- It is possible to define meaningful error messages.
Defining error messages is based on regular expressions
that trigger expectations. The error messages
explain what was expected, and what was obtained instead.
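As an illustration of the point about attribute types with resource invariants (a sketch of our own, not code generated by Maphoon), an attribute that owns resources can be moved off the parser stack during a reduction instead of being copied:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// An attribute type with resource invariants: it owns its children
// through move-only smart pointers, so it cannot be copied by accident.
struct Expr {
    std::string op;
    std::vector<std::unique_ptr<Expr>> children;
};

// What action code for a rule like E -> E '+' E amounts to: the
// attributes of the right-hand side are moved, never copied.
std::unique_ptr<Expr> reduce_plus(std::unique_ptr<Expr> left,
                                  std::unique_ptr<Expr> right)
{
    auto e = std::make_unique<Expr>();
    e->op = "+";
    e->children.push_back(std::move(left));
    e->children.push_back(std::move(right));
    return e;
}
```

Because Expr is move-only, a parser that keeps such attributes on its stack is forced to move them, which is the behavior described above for movable attribute types.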
The sources of Maphoon are here and
the manual is here.
Maphoon is published under the
3-clause BSD license.
Below are examples of use of Maphoon:
- The obligatory first example, which parses arithmetical expressions.
- A Prolog parser.
There is no interpreter, only a parser.
In order to add operators, type op( unsigned, type, names ),
where unsigned is the priority, type
is one of fx, fy, xfx, xfy, yfx, xf, yf,
and names is either a single operator, or a list of operators.
In order to quit, type halt.
- A simulator for non-deterministic Turing machines.
Turing machines are among the simplest models of computation,
and are frequently used in teaching.
Our Turing machine simulator is fast, easy to use, and
it can handle non-deterministic Turing machines.
There exist online Turing machine simulators, but they
are too slow for my patience, and cannot handle non-determinism.
Our simulator uses a machine model that is a slight generalization
of the model in Sipser (Introduction to the Theory of Computation).
One example input is a Turing machine that recognizes words of the form a^i.b^i.c^i.
Another is a Turing machine that recognizes words of the form v#w,
where v is a substring of w. It does this by non-deterministically
throwing away a prefix of w, and then checking that the rest of w starts with v.
The simulator reads an input file, which must contain a description of the TM
together with a set of input words.
For each input word, the simulator either prints an accepting computation,
or the set of reached configurations.
The simulator never computes more than 10000 configurations, but this
number can be easily changed in the code.
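The exploration of configurations can be sketched as follows (a simplified sketch of our own, with hypothetical names; the actual simulator differs in its tape model, input format, and output):

```cpp
#include <map>
#include <set>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// A configuration is (state, tape contents, head position).
using Config = std::tuple<std::string, std::string, size_t>;

// delta maps (state, scanned symbol) to alternatives (state, write, move),
// where move is -1 (left) or +1 (right). Several alternatives for the
// same pair make the machine non-deterministic.
using Delta = std::map<std::pair<std::string, char>,
                       std::vector<std::tuple<std::string, char, int>>>;

// Explore the set of reached configurations, up to a bound, and report
// whether the accepting state is reachable.
bool accepts(const Delta& delta, const std::string& accept,
             const std::string& input, size_t bound = 10000)
{
    std::set<Config> reached;
    std::vector<Config> frontier;
    frontier.push_back(Config("q0", input + '_', 0));   // '_' is the blank
    while (!frontier.empty() && reached.size() < bound) {
        Config c = frontier.back();
        frontier.pop_back();
        if (!reached.insert(c).second)
            continue;                   // configuration was seen before
        const std::string& state = std::get<0>(c);
        if (state == accept)
            return true;
        const std::string& tape = std::get<1>(c);
        size_t head = std::get<2>(c);
        auto it = delta.find({state, tape[head]});
        if (it == delta.end())
            continue;                   // this branch of the computation dies
        for (const auto& alt : it->second) {
            std::string t = tape;
            t[head] = std::get<1>(alt);
            size_t h = head;
            if (std::get<2>(alt) == 1) { ++h; if (h == t.size()) t += '_'; }
            else if (h > 0) --h;
            frontier.push_back(Config(std::get<0>(alt), t, h));
        }
    }
    return false;
}
```

For example, a machine that, on reading a, either stays in q0 or guesses that the next letter is b and moves to q1, accepts exactly the words over {a,b} that contain ab; the guess is resolved by exploring both successor configurations.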
The source code is here.
If you are not interested in the parser tools and
need only the simulator, the sources can be compiled without
Maphoon. The simulator reads its input from stdin, and
prints its output to stdout.
The Turing machine simulator was written with the help of
Dinislam Madikhanov, Kenessary Myrzakul, and Dastan Sultanov.
In addition to Maphoon, we also created a tokenizer generator.
We tried to meet the following goals: the constructed tokenizers must be
flexible, it must be easy to start using the generator, and it must be
possible to construct efficient tokenizers.
Our experience with tokenizer
generators is that they are borderline useful. The main problem is that
the constructed tokenizers are usually too rigid:
often there are missing features that one wants to add but cannot,
the language may have a few non-regular symbols, and the user has
no control over the way source information is stored in tokens.
In order to allow flexibility, we don't generate a complete tokenizer,
but only a function that reads input into a buffer and reports
which regular expression (symbol type) it belongs to. The user
can add other functions that read other things into the buffer,
or a function that reads a comment without storing it in the buffer.
In order to make the generator easy to start using, regular expressions
are initially not compiled. One constructs pairs of
regular expressions and symbol types in code,
and calls readandclassify(). This is easy, but not efficient,
because the regular expressions are translated into automata every time
the program is started.
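The uncompiled style can be sketched as follows (our own simplified sketch with hypothetical names, built on std::regex; Maphoon's real interface differs):

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Pairs of regular expressions and symbol types, kept as data and
// matched against the start of the input. Simple to set up, but the
// expressions are re-translated into automata on every program run.
enum class Symbol { Identifier, Number, Whitespace, Error };

Symbol classify(const std::string& input, size_t& length)
{
    static const std::vector<std::pair<std::regex, Symbol>> rules = {
        { std::regex("[A-Za-z_][A-Za-z0-9_]*"), Symbol::Identifier },
        { std::regex("[0-9]+"),                 Symbol::Number },
        { std::regex("[ \\t\\n]+"),             Symbol::Whitespace },
    };
    Symbol best = Symbol::Error;
    length = 0;
    for (const auto& rule : rules) {
        std::smatch m;
        // match_continuous anchors the match at the start of the input;
        // the longest match wins, ties go to the earlier rule.
        if (std::regex_search(input, m, rule.first,
                              std::regex_constants::match_continuous)
            && static_cast<size_t>(m.length(0)) > length) {
            length = static_cast<size_t>(m.length(0));
            best = rule.second;
        }
    }
    return best;
}
```

Printing the corresponding automaton as C++ code, as described next, removes the repeated translation cost while keeping the same classification behavior.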
In order to obtain efficient tokenizers, it is possible to print
the automaton as C++ code (readandclassify applied to the given automaton),
and include that C++ code. This can be done when the
tokenizer is finished, and the resulting tokenizer is efficient.
(The idea of directly generating code instead of tables
was taken from re2c.)
To summarize the goals: the constructed tokenizers must be flexible,
the generator must be easy to start using, and it must be possible
to construct efficient tokenizers.
The sources of the tokenizer generator are here,
and here is a manual.
Below are some slides that we used
when teaching compiler construction.
Slides about attribute grammars.
One usually has to redesign the grammar in order to make it suitable
for top-down parsing. One must merge common prefixes in rules,
and remove left recursion.
Slides about bottom-up parsing.
Read these, if you want to understand the theory behind Maphoon.
Slides about type checking give an overview of type checking for C.
The slides are a bit simplified, but contain all the main
features of C, like the lvalue/rvalue distinction and the implicit
conversions from array to pointer.
Translation of C into Stack Machine
The first set of slides
introduces an untyped stack machine.
The second set of slides
adds types to the stack machine, and explains how to translate
C into the stack machine.
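As a taste of what the slides cover (a minimal sketch of our own; the instruction set in the slides differs), here is an untyped stack machine and the translation of the C assignment x = 2 + 3 * 4:

```cpp
#include <map>
#include <string>
#include <vector>

// A minimal untyped stack machine: instructions operate on a stack of
// integers, and Store moves the top of the stack into a variable.
enum class Op { Push, Load, Store, Add, Mul };
struct Instr { Op op; long arg = 0; std::string var; };

void run(const std::vector<Instr>& prog, std::map<std::string, long>& mem)
{
    std::vector<long> stack;
    for (const Instr& i : prog) {
        switch (i.op) {
        case Op::Push:  stack.push_back(i.arg); break;
        case Op::Load:  stack.push_back(mem[i.var]); break;
        case Op::Store: mem[i.var] = stack.back(); stack.pop_back(); break;
        case Op::Add: { long b = stack.back(); stack.pop_back();
                        stack.back() += b; break; }
        case Op::Mul: { long b = stack.back(); stack.pop_back();
                        stack.back() *= b; break; }
        }
    }
}
```

The assignment x = 2 + 3 * 4; translates into the sequence Push 2, Push 3, Push 4, Mul, Add, Store x, after which the variable x holds 14; operator precedence is encoded purely in the order of the instructions.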
The following people contributed to the materials on this page:
Danel Batyrbek, Witold Charatonik, Aleksandra Kireeva, Tatyana Korotkova,
Dina Muktubayeva, Cláudia Nalon, Olzhas Zhangeldinov.
This research is being sponsored by
Nazarbayev University through the Faculty Development Competitive
Research Grant Program (FDCRGP), through grant number 021220FD1651.