Tools for Compiler Construction

We are in the process of creating and implementing (and eventually using) a new programming language for the implementation of logical algorithms and computer algebra. Logic is special because it operates on tree structures that can have many different forms. My current (2022) view is that logic, in an extreme form, reveals a weakness that all current programming languages have: We don't know how to handle union and we don't know how to handle partiality. This occurs everywhere, but logic has more of it. All existing approaches (union, std::variant, inheritance, type casts, std::optional, std::expected, use of nullptr, use of enumeration types, use of special value like std::string::npos) are unsatisfactory. As more parts of the compiler are being completed, they will gradually appear on this page. In addition to the materials (code and documents) that are directly related to the development of our programming language, we will also post material that may be useful for developers of compilers/interpeters of other languages and theoretical papers. At this moment, we have a C++ parser and tokenizer generator, slides about parsing, and slides about intermediate code generation for C. Most of the slides were created when teaching compiler construction at Instytut Informatyki Uniwersytetu Wrocławskiego and the School of Engineering and Digital Sciences at Nazarbayev University. Most of the code on this page was written while working at Nazarbayev University.

Presentation at C++ Now

I gave a lecture at C++ Now about parsing, tokenizing and Maphoon. The slides and the code fragments that were shown in the lecture are here.

Parser Generation for C++ (Maphoon)

After having studied existing parser generators, we concluded that none of them fulfilled our needs, and we decided to write our own. We aimed to design it in such a way that it will be useful for other compiler implementors as well. The parser generator is called Maphoon . It has the following features:

Features

  1. The constructed parser is bottom-up. Bottom-up parsing is theoretically and practically superior to top-down parsing. There is no need to adapt the grammar, and attribute computation can be specified with the grammar rules.
  2. It is possible to use modern C++ in action code. Concretely, it is possible to use attribute types with resource invariants (copy constructor, copying assignment, destructors) without effort. (Think of STL-containers, smart pointers, etc.) If all attribute types are movable, the parser will take advantage of this.
  3. The parser generator and the constructed parsers use portable C++ without need for additional libraries.
  4. It is easy to maintain source information in symbols for the generation of error messages.
  5. It is possible to define operators at run time. Our own language does not use this, but for example Prolog does. Even if one does not use it, it often results in simpler parsers that can be more easily modified.
  6. It is possible to define meaningful error messages. Defining error messages is based on regular expressions that trigger expectations. The error messages explain what was expected, and what was obtained instead.
While designing Maphoon, we took CUP as main example of a nice parser generator. We wanted the same in C++ and added a few extra features.

Download

The sources of Maphoon are here and the manual is here. Maphoon is published under the 3-clause BSD license.

Below are examples of use of Maphoon:

Tokenizer Generation

In addition to Maphoon, we also created a tokenizer generator. We tried to meet the following goals:
  1. The constructed tokenizers must be flexible.
  2. It must be easy to start using it.
  3. It must be possible to construct efficient tokenizers.
Our experience with tokenizer generators is that they are borderline useful. Main problem is that the constructed tokenizers are usually too rigid. Often there are some missing features that one wants to add but cannot, the language may have a few non-regular symbols, and the user has no control over the way source information is stored in tokens. In order to allow flexibility, we don't generate a complete tokenizer, but only a function that reads input into a buffer and reports which regular expression (symbol) it belongs too. The user can add other functions that read other things into the buffer (or a function that reads a comment without storing it in the buffer.) In order to make it easy to start using it, regular expressions are initially not compiled. One constructs pairs of regular expressions and symboltypes in code, and calls readandclassify( ). This is easy, but not efficient, because regular expressions are translated into automata every time the code is started. In order to obtain efficient tokenizers, it is possible to print the automaton as C++ code (readandclassify applied on the given automata), and include the C++ code. This can be done when the tokenizer is finished, and the resulting tokenizer is efficient. (The idea of directly generating code instead of tables was taken from re2c.)

Download

The sources of the lexer generator are here, and here is a manual.

Presentations

Below are some slides about compiler construction that we used when teaching compiler construction.

Attribute Grammars

Slides about attribute grammars.

Top-Down Parsing

One usually has to redesign the grammar in order to make it suitable for top-down parsing. One must merge common prefixes in rules, and remove left recursion. Slides

Bottom-Up Parsing

Slides about bottom-up parsing. Read these, if you want to understand the theory behind Maphoon.

Semantic Analysis

These slides give an overview of type checking for C. The slides are a bit simplified, but contain all the main features of C, like the lval/rval distinction and the implicit conversions from array to pointer.

Translation of C into Stack Machine

The first set of slides introduces an untyped stack machine. The second set of slides adds types to the stack machine, and explains how to translate C into the stack machine.

Acknowledgements

Thanks to Danel Batyrbek, Witold Charatonik, Aleksandra Kireeva, Tatyana Korotkova, Dina Muktubayeva, Cláudia Nalon, Olzhas Zhangeldinov.

This research is being sponsored by Nazarbayev University through the Faculty Development Competitive Research Grant Program (FDCRGP), through grant number 021220FD1651.