Greedy grammars and Any

. in your regular expression

I only vaguely remember my first encounter with a parser generator which must by dated back to the late 1990s. I guess it was Spark by John Aycock, an Early parser. What puzzled me back then was the need to be explicit down to the character level.  Regular expression engines, albeit cryptic, were a revelation because one could specify the structural information one needed and match the rest using a ‘.’ which is the wildcard pattern that matches any character.

I came back to Any in the Trail parser generator lately. I was motivated by writing a reversible C preprocessor. Unlike conventional C preprocessors which are used in the compilation chain of C code, a reversible C preprocessor can used to refactor C code, while retaining the preprocessor directives and the macro calls. This is basically done by storing the #define directive along with the code to be substituted and the substitution. The substituted code and the substitution are exchanged after the refactoring step, such that it looks like no substitution happened at all.

A comprehensive C preprocessor grammar can be found on the following MSDN site. What is most interesting to us are the following two EBNF productions:

# define identifier[( identifieropt, , identifieropt )] token-stringopt
token-string :
String of tokens 

The String of tokens phrase this is Any+.

Bon appétit

Suppose one defines two macros

#define min(a,b) ((a)<(b)?(a):(b))

#define max(a,b) ((a)<(b)?(b):(a))

Obviously the defining string of the min macro can be recognized using token-string but how can we prevent that token-string eats the max macro as well? Once in motion token-string has a sound appetite and will eat the rest. The solution to this problem in case of regular expressions is to make Any non-greedy. The non-greediness can easily be expressed using the following requirement:

If S | Any is a pattern with S!=Any. If S can match a character, S will be preferred over Any.

In the production rule

R: ... Any* S ...
we can be sure that if S matches in R then Any won’t be used to match – although it would match if we leave it greedy. Same goes with
R: ... (Any* | S) ...

Non greediness in grammars

Grammars are more complicated than regular expressions and we have to take more care about our greediness rules. To illustrate some of the problems we take a look on an example

R: A B | C
A: a Any*
B: b
C: c
Any causes a follow/first conflict between A and B. Making Any non-greedy alone won’t help because a grammar rule or its corresponding NFA is always greedy! It follows a longest match policy and an NFA will be traversed as long as possible. So once the NFA of A is entered it won’t be left because of the trailing Any*.

Detecting the trailing Any in A is easy though. We solve the follow/first conflict with a trailing Any by embedding A into R. Embedding strategies are the centerpiece of Trail and they shall not be recapitulated here. Just so much: embedding A in R doesn’t destroy any information relevant for parsing. If A has been embedded Any* will be delimited by B to the right and we can safely apply R without the danger of Any consuming a token ‘b’.

Eventually we have to re-apply our embedding strategy: if A is a rule with a trailing Any and A is embedded in B and B has a trailing Any after this embedding  then B will be embedded wherever possible.

A short experiment

Here is a mini-Python grammar is used to detect Python class definitions.

file_input: (NEWLINE | stmt)* ENDMARKER
classdef: 'class' NAME ['(' Any+ ')'] ':' suite
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
simple_stmt: Any* NEWLINE
stmt: simple_stmt | compound_stmt
compound_stmt: classdef | Any* suite

Unlike a real Python grammar it is fairly easy to build. All rules are taken from the Python 2.7 grammar but only file_input, suite and stmt remained unchanged. In all other cases we have replaced terminal/non-terminal information that isn’t needed by Any.



Leave a Reply