Maphoon

Maphoon is a parser generator for C++. It is written in C++ and generates parsers in C++. It consists of two parts: a tokenizer generator and a parser generator. The parser generator works like Yacc, but supports C++ and RAII. It creates LALR parsers and allows extending the language at run time, which is needed, for example, by Prolog.

The tokenizer generator does not create a complete tokenizer, but something that we call a classifier. The classifier reads part of the input, and determines to which symbol class it belongs. Generating a full tokenizer would result in a tokenizer that is too rigid to be useful. The generated classifier can be extended by hand. This is useful for non-regular tokens, or for tokenizers that need to take indentation into account.

Design Goals of the Parser Generator

The parser generator has the following design goals:
  1. It uses bottom-up parsing. Bottom-up parsing is theoretically and practically superior to top-down parsing. There is usually no need to adapt the grammar, and attribute computation rules can be specified directly with the grammar rules.
  2. It is possible to use C++23 in action code. Concretely, it is possible to use attribute types with resource invariants (copy constructors, copy assignment, destructors) without effort. (Think of STL containers, smart pointers, etc.) If all attribute types are movable, the parser will take advantage of this. It is possible to use move-only attributes, i.e. attributes that cannot be copied.
  3. The parser generator and the constructed parsers use portable C++, without needing any libraries beyond the C++ standard library.
  4. It is easy to maintain source information in symbols for the generation of error messages.
  5. It is possible to define operators at run time. This is useful if one wants to implement a parser for Prolog. Even if one does not want to extend the language at run time, using the mechanism often results in simpler parsers that can be modified more easily. The same mechanism can be used for defining context-sensitive keywords.
  6. It is possible to create meaningful error messages. Defining error messages is based on regular expressions that trigger expectations. The error messages explain what was expected, and what was obtained instead.
While designing Maphoon, we took CUP as our main example of a nice parser generator. We wanted the same in C++, and added a few extra features.
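
To illustrate what design goal 2 means by move-only attributes, here is a minimal sketch of a reduction step on an attribute stack. It is not code generated by Maphoon and does not use Maphoon's interface: the names Expr, Attr, leaf, reduce_plus and eval are hypothetical, and the stack is assumed to be a plain std::vector. The point is only that attributes held in std::unique_ptr can be moved out of the stack and into a new tree node without ever being copied.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical AST node. It is owned through std::unique_ptr,
// so an attribute of this type can be moved but not copied.
struct Expr {
    char op;     // '+' for a sum, 'n' for a number leaf
    int value;   // only meaningful when op == 'n'
    std::unique_ptr<Expr> left, right;
};

using Attr = std::unique_ptr<Expr>;   // assumed attribute type

Attr leaf(int v) {
    return std::make_unique<Expr>(Expr{'n', v, nullptr, nullptr});
}

// Reduction for a rule of the form  E -> E + E  (illustrative only).
// Both operands are moved off the attribute stack and into the new node.
void reduce_plus(std::vector<Attr>& stack) {
    assert(stack.size() >= 2);
    Attr rhs = std::move(stack.back()); stack.pop_back();
    Attr lhs = std::move(stack.back()); stack.pop_back();
    stack.push_back(
        std::make_unique<Expr>(Expr{'+', 0, std::move(lhs), std::move(rhs)}));
}

// Evaluates the tree, to check that ownership was transferred intact.
int eval(const Expr& e) {
    return e.op == 'n' ? e.value : eval(*e.left) + eval(*e.right);
}
```

Pushing leaf(1) and leaf(2) and then calling reduce_plus leaves a single attribute on the stack that owns the whole tree. When the stack is destroyed, the destructors release everything, which is the RAII behaviour the design goal asks for.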

Reflection

The main disadvantage of bottom-up parsing is that it needs a preprocessor (a source file generator). It is likely that, using reflection, the source code transformation can be moved to compile time. I will start looking into this when I have access to a compiler that supports reflection.

Design Goals of the Tokenizer Generator

The tokenizer generator has the following design goals:
  1. The constructed tokenizer must be flexible. Our experience with other tokenizer generators is that they are borderline useful. The main problem is that the constructed tokenizers are too rigid. Often there are a few missing features that one wants to add but cannot, the language may have a few non-regular symbols, and the user has no control over the way source information is stored in tokens. In order to allow flexibility, we don't generate a complete tokenizer, but only a function that reads input into a buffer and reports the symbol type that it belongs to. The user can add other functions that read non-regular symbols into the buffer, or define a function that reads whitespace without storing it in the buffer.
  2. It must be easy to start using it. To achieve this, pairs of regular expressions and their classifications are initially written in code. A collection of such pairs is stored in a container called classifier. Using the classifier, a function called readandclassify( ) is called, which matches the input against the regular expressions. Function readandclassify( ) returns a pair consisting of the number of characters read together with the classification. After that, one can do a case analysis on the classification.
  3. It is possible to construct efficient tokenizers. Building a classifier in code is easy, but not efficient, because regular expressions are translated into automata every time the classifier is constructed. In order to obtain efficient tokenizers, it is possible to print the classifier as C++ code, and use the C++ code. This can be done when the classifier is finished. The resulting tokenizer is efficient. The idea of directly generating code instead of tables was taken from re2c.
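
The workflow described in goal 2 can be sketched as follows. This is not Maphoon's implementation or interface: it uses std::regex instead of Maphoon's own regular expressions and automata, and apart from the name readandclassify, which comes from the text above, every name and signature here (SymClass, Classifier, make_classifier, Token, tokenize) is an assumption made for illustration. The longest-match policy is also assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical symbol classes, not Maphoon's own names.
enum class SymClass { error, whitespace, identifier, number };

// Assumed representation: a classifier is a list of
// (regular expression, classification) pairs.
using Classifier = std::vector<std::pair<std::regex, SymClass>>;

Classifier make_classifier() {
    Classifier cls;
    cls.emplace_back(std::regex("[ \\t\\n]+"), SymClass::whitespace);
    cls.emplace_back(std::regex("[A-Za-z_][A-Za-z0-9_]*"), SymClass::identifier);
    cls.emplace_back(std::regex("[0-9]+"), SymClass::number);
    return cls;
}

// Returns the number of characters matched at position pos, together with
// their classification. The longest match wins; on a tie the earlier pair
// wins. If nothing matches, the result is (0, SymClass::error).
std::pair<std::size_t, SymClass>
readandclassify(const Classifier& cls, const std::string& input, std::size_t pos) {
    assert(pos <= input.size());
    std::pair<std::size_t, SymClass> best{0, SymClass::error};
    for (const auto& [re, sc] : cls) {
        std::smatch m;
        // match_continuous forces the match to start exactly at pos.
        if (std::regex_search(input.begin() + pos, input.end(), m, re,
                              std::regex_constants::match_continuous)) {
            auto len = static_cast<std::size_t>(m.length(0));
            if (len > best.first) best = {len, sc};
        }
    }
    return best;
}

struct Token { SymClass type; std::string text; };

// Hand-written part: a case analysis on the classification.
std::vector<Token> tokenize(const Classifier& cls, const std::string& input) {
    std::vector<Token> out;
    std::size_t pos = 0;
    while (pos < input.size()) {
        auto [len, sc] = readandclassify(cls, input, pos);
        switch (sc) {
        case SymClass::whitespace:
            break;                                  // skipped, not stored
        case SymClass::error:
            len = 1;                                // consume one character
            out.push_back({sc, input.substr(pos, len)});
            break;
        default:
            out.push_back({sc, input.substr(pos, len)});
        }
        pos += len;
    }
    return out;
}
```

Because the case analysis lives in ordinary hand-written code, it is the natural place to add what goal 1 asks for, such as a separate function for a non-regular symbol or custom handling of whitespace. As goal 3 points out, constructing the regular expressions each time the classifier is built is the slow part, which is what printing the finished classifier as C++ code avoids.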

Download

Both the tokenizer generator and the parser generator are published under the Aladdin Free Public License.

The sources of the lexer generator are here, and the manual is here.

The sources of the parser generator are here, and the manual is here. In order to compile Maphoon, edit the Makefile to set the directories. Set Lexing to a path containing directory lexing2023, and set Maph to a path containing directory maphoon. The paths need not be different. The lexer has to be compiled together with the project that uses it, hence it has no separate Makefile.