I only vaguely remember my first encounter with a parser generator which must by dated back to the late 1990s. I guess it was Spark by John Aycock, an Early parser. What puzzled me back then was the need to be explicit down to the character level. Regular expression engines, albeit cryptic, were a revelation because one could specify the structural information one needed and match the rest using a ‘.’ which is the wildcard pattern that matches any character.
I came back to Any in the Trail parser generator lately. I was motivated by writing a reversible C preprocessor. Unlike conventional C preprocessors which are used in the compilation chain of C code, a reversible C preprocessor can used to refactor C code, while retaining the preprocessor directives and the macro calls. This is basically done by storing the #define directive along with the code to be substituted and the substitution. The substituted code and the substitution are exchanged after the refactoring step, such that it looks like no substitution happened at all.
A comprehensive C preprocessor grammar can be found on the following MSDN site. What is most interesting to us are the following two EBNF productions:
#define identifier[( identifieropt, … , identifieropt )] token-stringopt
- token-string :
- String of tokens
The String of tokens phrase this is Any+.
Suppose one defines two macros
#define min(a,b) ((a)<(b)?(a):(b)) #define max(a,b) ((a)<(b)?(b):(a))
Obviously the defining string of the `min` macro can be recognized using token-string but how can we prevent that token-string eats the `max` macro as well? Once in motion token-string has a sound appetite and will eat the rest. The solution to this problem in case of regular expressions is to make Any non-greedy. The non-greediness can easily be expressed using the following requirement:
If `S | Any` is a pattern with `S!=Any`. If S can match a character, `S` will be preferred over `Any`.
In the production rule
R: ... Any* S ...
we can be sure that if S matches in R then Any won’t be used to match – although it would match if we leave it greedy. Same goes with
R: ... (Any* | S) ...
Non greediness in grammars
Grammars are more complicated than regular expressions and we have to take more care about our greediness rules. To illustrate some of the problems we take a look on an example
R: A B | C A: a Any* B: b C: c
`Any` causes a follow/first conflict between A and B. Making `Any` non-greedy alone won’t help because a grammar rule or its corresponding NFA is always greedy! It follows a longest match policy and an NFA will be traversed as long as possible. So once the NFA of A is entered it won’t be left because of the trailing `Any*`.
Detecting the trailing `Any` in A is easy though. We solve the follow/first conflict with a trailing `Any` by embedding `A` into `R`. Embedding strategies are the centerpiece of Trail and they shall not be recapitulated here. Just so much: embedding A in R doesn’t destroy any information relevant for parsing. If A has been embedded `Any*` will be delimited by `B` to the right and we can safely apply R without the danger of `Any` consuming a token ‘b’.
Eventually we have to re-apply our embedding strategy: if `A` is a rule with a trailing `Any` and `A` is embedded in `B` and `B` has a trailing `Any` after this embedding then `B` will be embedded wherever possible.
A short experiment
Here is a mini-Python grammar is used to detect Python class definitions.
file_input: (NEWLINE | stmt)* ENDMARKER classdef: 'class' NAME ['(' Any+ ')'] ':' suite suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT simple_stmt: Any* NEWLINE stmt: simple_stmt | compound_stmt compound_stmt: classdef | Any* suite
Unlike a real Python grammar it is fairly easy to build. All rules are taken from the Python 2.7 grammar but only `file_input`, `suite` and `stmt` remained unchanged. In all other cases we have replaced terminal/non-terminal information that isn’t needed by `Any`.