Maphoon
Maphoon is a parser generator for C++.
It is written in
C++, and it creates parsers in C++.
It consists of two parts, a tokenizer generator and
a parser generator.
The parser generator works like Yacc, but supports
C++ and RAII. It creates LALR parsers, and
allows extending the language at run time, which
is needed for example by Prolog.
The tokenizer generator does not create a complete tokenizer,
but something that we call a classifier.
The classifier reads part of the input, and determines
to which symbol class it belongs.
Generating a full tokenizer would result in a tokenizer that
is too rigid to be useful. The generated classifier
can be extended by hand.
This is useful for non-regular
tokens, or for tokenizers that need to take indentation into account.
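As an illustration of a non-regular token, here is a minimal sketch of a hand-written reader for nested comments of the form (* ... (* ... *) ... *). Nesting requires counting, which a regular expression cannot do. The function name read_nested_comment and its signature are assumptions made for this example; they are not part of Maphoon.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical hand-written extension, not Maphoon's API.
// Returns the length of the nested comment that starts at position pos,
// or 0 if no comment starts there or the comment is not terminated.
std::size_t read_nested_comment(std::string_view in, std::size_t pos) {
    if (pos >= in.size() || in.substr(pos, 2) != "(*")
        return 0;

    std::size_t depth = 0;   // current nesting depth
    std::size_t i = pos;
    while (i + 1 < in.size()) {
        if (in[i] == '(' && in[i + 1] == '*') {
            ++depth;         // a comment opens
            i += 2;
        } else if (in[i] == '*' && in[i + 1] == ')') {
            --depth;         // a comment closes
            i += 2;
            if (depth == 0)
                return i - pos;
        } else {
            ++i;
        }
    }
    return 0;                // input ended inside the comment
}
```

A tokenizer could try such a function first, and fall back on the generated classifier when it returns 0.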
Design Goals of the Parser Generator
The parser generator has the following design goals:
- It uses bottom-up parsing.
Bottom-up parsing is theoretically and practically superior to
top-down parsing. There is usually no need to adapt the grammar,
and attribute computation rules can be specified directly with the grammar
rules.
- It is possible to use C++23 in action code.
Concretely, it is possible to use attribute types with resource
invariants (copy constructor, copy assignment, destructor)
without effort. (Think of STL-containers, smart pointers, etc.)
If all attribute types are movable, the parser will take
advantage of this. It is possible to use
move-only attributes, i.e. attributes that cannot be copied.
- The parser generator and the constructed parsers use
portable C++ without need for libraries not in STL.
- It is easy to maintain source information in symbols for the
generation of error messages.
- It is possible to define operators at run time.
This is useful if one wants to implement a parser for
Prolog.
Even if one does not want to extend the language at runtime,
using the mechanism often results in simpler
parsers that can be more easily modified.
The same mechanism can be used for defining context-sensitive
keywords.
- It is possible to create meaningful error messages.
Defining error messages is based on regular expressions
that trigger expectations. The error messages
explain what was expected, and what was obtained instead.
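To make the point about move-only attributes concrete, here is a minimal sketch of an attribute stack as a bottom-up parser might maintain it. It is not code generated by Maphoon: the types Node and Attr and the functions shift_number, reduce_binary, eval and demo are assumptions made for this example. The attribute is a syntax tree held by std::unique_ptr, so it cannot be copied, and every reduction moves its operands off the stack.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical attribute type: a syntax tree node that owns its children.
// The std::unique_ptr members make it move-only.
struct Node {
    char op;        // '#' for a number leaf, otherwise '+' or '*'
    double value;   // used only by leaves
    std::unique_ptr<Node> left;
    std::unique_ptr<Node> right;
};
static_assert(!std::is_copy_constructible_v<Node>, "Node is move-only");

using Attr = std::unique_ptr<Node>;

// Shift: push the attribute of a number token.
void shift_number(std::vector<Attr>& stack, double v) {
    stack.push_back(std::make_unique<Node>(Node{'#', v, nullptr, nullptr}));
}

// Reduce by E -> E op E: move both operands off the stack, build the
// parent node, and push it back. No attribute is copied.
void reduce_binary(std::vector<Attr>& stack, char op) {
    Attr rhs = std::move(stack.back());
    stack.pop_back();
    Attr lhs = std::move(stack.back());
    stack.pop_back();
    stack.push_back(std::make_unique<Node>(
        Node{op, 0.0, std::move(lhs), std::move(rhs)}));
}

// Evaluate the tree, to check that it was built as intended.
double eval(const Node& n) {
    if (n.op == '#')
        return n.value;
    double l = eval(*n.left);
    double r = eval(*n.right);
    return n.op == '+' ? l + r : l * r;
}

// Replays the shift/reduce sequence for the input 2 + 3 * 4,
// with * binding more tightly than +.
double demo() {
    std::vector<Attr> stack;
    shift_number(stack, 2);
    shift_number(stack, 3);
    shift_number(stack, 4);
    reduce_binary(stack, '*');
    reduce_binary(stack, '+');
    return eval(*stack.back());
}
```

Because destructors run automatically, a stack that is abandoned halfway through a failed parse releases all partially built trees without extra cleanup code.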
While designing Maphoon, we took
CUP as our main example
of a nice parser generator. We wanted the same in C++ and added a few
extra features.
Reflection
The main disadvantage of bottom-up parsing is the fact that it needs
a preprocessor (a source file generator). It is likely that,
using reflection, the source code transformation can be moved to
compile time. I will start looking into this when I have
access to a compiler that supports reflection.
Design Goals of the Tokenizer Generator
The tokenizer generator has the following design goals:
- The constructed tokenizer must be flexible.
Our experience with other tokenizer
generators is that they are only borderline useful. The main problem is that
the constructed tokenizers are too rigid.
Often there are a few missing features that one wants to add but cannot,
or the language has a few non-regular symbols, or the user has
no control over the way source information is stored in tokens.
In order to allow flexibility, we don't generate a complete tokenizer,
but only a function that reads input into a buffer and reports
the symbol type that it belongs to. The user
can add other functions that read non-regular symbols into the buffer,
or
define a function that reads whitespace without storing it in the buffer.
- It must be easy to start using it.
To achieve this, pairs of regular expressions and
their classifications are initially written in code.
A collection of such pairs is stored in a container
called classifier.
Given the classifier, one calls the function
readandclassify( ), which matches
the input against the regular expressions. Function readandclassify( )
returns a pair consisting of the number of read characters
together with the classification. After that, one can do
a case analysis on the classification.
- It is possible to construct efficient tokenizers.
Building a classifier in code is easy,
but not efficient,
because regular expressions are translated into automata every time
the classifier is constructed.
In order to obtain efficient tokenizers, it is possible to print
the classifier as C++ code, and use
the C++ code. This can be done when the
classifier is finished. The resulting tokenizer is efficient.
The idea of directly generating code instead of tables
was taken from re2c.
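The workflow described above (pairs of regular expressions and classifications, a function that returns the number of characters read together with the classification, followed by a case analysis) can be sketched as follows. This is not Maphoon's interface: the sketch uses std::regex as a stand-in for Maphoon's own regular expressions, and the names SymClass, Classifier, read_and_classify, make_classifier and tokenize are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

enum class SymClass { error, whitespace, identifier, number };

// Hypothetical stand-in for a classifier: a container of
// (regular expression, classification) pairs, built in code.
struct Classifier {
    std::vector<std::pair<std::regex, SymClass>> rules;
    void add(const std::string& pattern, SymClass c) {
        rules.emplace_back(std::regex(pattern), c);
    }
};

// Tries every rule at position pos and returns the length and the
// classification of the longest match, or {0, error} if nothing matches.
std::pair<std::size_t, SymClass>
read_and_classify(const Classifier& cls, const std::string& input,
                  std::size_t pos) {
    std::pair<std::size_t, SymClass> best{0, SymClass::error};
    auto start = input.begin() + static_cast<std::ptrdiff_t>(pos);
    for (const auto& [re, c] : cls.rules) {
        std::smatch m;
        // match_continuous forces the match to begin exactly at start.
        if (std::regex_search(start, input.end(), m, re,
                              std::regex_constants::match_continuous)) {
            auto len = static_cast<std::size_t>(m.length(0));
            if (len > best.first)
                best = {len, c};
        }
    }
    return best;
}

Classifier make_classifier() {
    Classifier cls;
    cls.add("[ \\t\\n]+", SymClass::whitespace);
    cls.add("[A-Za-z_][A-Za-z0-9_]*", SymClass::identifier);
    cls.add("[0-9]+", SymClass::number);
    return cls;
}

// The hand-written part: a case analysis on the classification.
std::vector<std::pair<SymClass, std::string>>
tokenize(const Classifier& cls, const std::string& input) {
    std::vector<std::pair<SymClass, std::string>> out;
    std::size_t pos = 0;
    while (pos < input.size()) {
        auto [len, c] = read_and_classify(cls, input, pos);
        switch (c) {
        case SymClass::whitespace:
            break;                          // read, but not stored
        case SymClass::error:
            len = 1;                        // report one bad character
            out.emplace_back(c, input.substr(pos, 1));
            break;
        default:
            out.emplace_back(c, input.substr(pos, len));
            break;
        }
        pos += len;
    }
    return out;
}
```

Like the classifier built in code that the text describes, this sketch compiles its regular expressions every time make_classifier runs, which is the cost that printing the classifier as C++ code avoids.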
Download
Both the tokenizer and the parser generator
are published under the
Aladdin Free Public License.
The sources of the lexer generator are
here, and the manual
is here.
The sources of the parser generator are here, and
the manual is here.
In order to compile Maphoon,
edit the Makefile to set the directories.
Set Lexing to a path containing
directory lexing2023,
and set Maph to a path containing
directory maphoon.
The paths need not be different.
The lexer has to be compiled together with the project that uses
it, hence it has no separate Makefile.