Greedy grammars and Any

Posted in Grammars, Parsing, Python on May 20th, 2012 by kay – Be the first to comment

. in your regular expression

I only vaguely remember my first encounter with a parser generator which must by dated back to the late 1990s. I guess it was Spark by John Aycock, an Early parser. What puzzled me back then was the need to be explicit down to the character level.  Regular expression engines, albeit cryptic, were a revelation because one could specify the structural information one needed and match the rest using a ‘.’ which is the wildcard pattern that matches any character.

I came back to Any in the Trail parser generator lately. I was motivated by writing a reversible C preprocessor. Unlike conventional C preprocessors which are used in the compilation chain of C code, a reversible C preprocessor can used to refactor C code, while retaining the preprocessor directives and the macro calls. This is basically done by storing the #define directive along with the code to be substituted and the substitution. The substituted code and the substitution are exchanged after the refactoring step, such that it looks like no substitution happened at all.

A comprehensive C preprocessor grammar can be found on the following MSDN site. What is most interesting to us are the following two EBNF productions:

# define identifier[( identifieropt, , identifieropt )] token-stringopt
token-string :
String of tokens 

The String of tokens phrase this is Any+.

Bon appétit

Suppose one defines two macros

#define min(a,b) ((a)<(b)?(a):(b))

#define max(a,b) ((a)<(b)?(b):(a))

Obviously the defining string of the min macro can be recognized using token-string but how can we prevent that token-string eats the max macro as well? Once in motion token-string has a sound appetite and will eat the rest. The solution to this problem in case of regular expressions is to make Any non-greedy. The non-greediness can easily be expressed using the following requirement:

If S | Any is a pattern with S!=Any. If S can match a character, S will be preferred over Any.

In the production rule

R: ... Any* S ...
we can be sure that if S matches in R then Any won’t be used to match – although it would match if we leave it greedy. Same goes with
R: ... (Any* | S) ...

Non greediness in grammars

Grammars are more complicated than regular expressions and we have to take more care about our greediness rules. To illustrate some of the problems we take a look on an example

R: A B | C
A: a Any*
B: b
C: c
Any causes a follow/first conflict between A and B. Making Any non-greedy alone won’t help because a grammar rule or its corresponding NFA is always greedy! It follows a longest match policy and an NFA will be traversed as long as possible. So once the NFA of A is entered it won’t be left because of the trailing Any*.

Detecting the trailing Any in A is easy though. We solve the follow/first conflict with a trailing Any by embedding A into R. Embedding strategies are the centerpiece of Trail and they shall not be recapitulated here. Just so much: embedding A in R doesn’t destroy any information relevant for parsing. If A has been embedded Any* will be delimited by B to the right and we can safely apply R without the danger of Any consuming a token ‘b’.

Eventually we have to re-apply our embedding strategy: if A is a rule with a trailing Any and A is embedded in B and B has a trailing Any after this embedding  then B will be embedded wherever possible.

A short experiment

Here is a mini-Python grammar is used to detect Python class definitions.

file_input: (NEWLINE | stmt)* ENDMARKER
classdef: 'class' NAME ['(' Any+ ')'] ':' suite
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
simple_stmt: Any* NEWLINE
stmt: simple_stmt | compound_stmt
compound_stmt: classdef | Any* suite

Unlike a real Python grammar it is fairly easy to build. All rules are taken from the Python 2.7 grammar but only file_input, suite and stmt remained unchanged. In all other cases we have replaced terminal/non-terminal information that isn’t needed by Any.



The state of the Trail parser generator

Posted in Algorithms, Grammars, Langscape, Parsing, TBP, Trail on April 10th, 2012 by kay – Be the first to comment

White Easter

Snow came back, if only for a brief moment, to remind me laying Trail to rest until next winter…

 x + + + + + + + + +
 - - - - - - - - - x
 - - - - - - - - x +
 - - - - - - - x + +
 - - - - - - x + + +
 - - - - - x + + + +
 - - - - x + + + + +
 - - - x + + + + + +
 - - x + + + + + + +
 - x + + + + + + + + 

Flat Automatons

This story begins, where the last blog article ended, in Sept. 2011. At that time I realized, contrary to my expectations, that careful embedding of finite automatons into each other could yield higher execution speed for parsers then without those embeddings, such as LL(1) and this despite a far more complex machinery and additional efforts to reconstruct a parse tree from state sequences, generated while traversing the automaton. It wasn’t that I believed that this speed advantage wouldn’t go away when moving towards an optimized, for-speed implementation and running my samples on PyPy confirmed this assumption but it was an encouraging sign that the general direction was promising and more ambitious goals could be realized. I wasn’t entirely mistaken but what came out in the end is also scary, if not monstrous. Anyway, let’s begin.

If the automaton embeddings I just mentioned were perfect the grammar which is translated into a set of those automatons, rule by rule, would dissolve into a single finite automaton. It was completely flat then and contained only terminal symbols which referred to each other . An intrinsic hurdle is recursion. A recursive rule like

X: X X | a
would give us an automaton, that has to be embedded into itself which is not finitely possible. Instead we could consider successive self-embedding as a generative process:
X X | a
X X X X | X X a | a X X | a X | X a | a a | a

Working embeddings

The difficulties of creating a flat, finite automaton from a grammar are twofold. Perfect embedding/inlining leads to information loss and recursive rules to cyclic or infinite expansion. Of both problems I solved the first and easier one in the early versions of the Trail parser generator. This was done by preparing automaton states and the introduction of special ɛ-states. In automata theory an ɛ-state corresponds to an empty word, i.e. it won’t be used to recognize a character / token. It solely regulates transitions within an automaton. In BNF we may write:

X: X a
X: ɛ
which means that X can produce the empty word. Since ɛa = a the rule X accepts all finite sequences of a. In EBNF we can rewrite the X-rule as
X: [X] a
which summarizes the optionality of the inner X well. We write the automaton of the X-rule as a transition table
(X,0): (X,1) (a,2)
(X,1): (a,2)
(a,2): (FIN,-1)
Each state carries a unique index, which allows us to distinguish arbitrary many different X and a states in the automaton. If we further want to embedd X within another rule, say
Y: X b
which is defined by the table
(Y,0): (X,1)
(X,1): (b,2)
(a,2): (FIN,-1)
the single index is not sufficient and we need a second index which individualizes each state by taking a reference to the containing automaton:
(X,0,X): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): (FIN,-1,X)
With this notion we can embed X in Y:
(Y,0,Y): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): (FIN,-1,X)
(FIN,-1,X): (b,2,Y)
(b,2,Y): (FIN,-1,Y)
The initial state (X,0,X) of X was completely removed. The automaton is still erroneous though. The final state (FIN,-1,X) of X is not a final state in Y. It even has a transition! We could try to remove it completely and instead write
(Y,0,Y): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X):   (b,2,Y)
(b,2,Y): (FIN,-1,Y)
But suppose Y had the form:
Y: X X b
then the embedding of X had the effect of removing the boundary between the two X which was again a loss of structure. What we do instead is to transform the final state (FIN, -1, X) when embedded in Y into an ɛ-state in Y:
(FIN,-1,X) => (X, 3, TRAIL_DOT, Y)
The tuple which describes automaton states is Trail is grown again by one entry. A state which is no ɛ-state has the shape (_, _, 0, _). Finally the fully and correctly embedded automaton X in Y looks like this:
(Y,0,0,Y): (X,1,0,X) (a,2,0,X)
(X,1,0,X): (a,2,0,X)
(a,2,0,X): (X,3,TRAIL_DOT,Y)
(X,3,TRAIL_DOT,Y): (b,2,0,Y)
(b,2,0,Y): (FIN,-1,0,Y)
The TRAIL_DOT marks the transition between X and Y in Y. In principle we are free to define infinitely many ɛ-states. In the end we will define exactly 5 types.

Rules of Embedding

At this point it is allowed to ask if this is not entirely recreational. Why should anyone care about automaton embeddings? Don’t we have anything better to do with our time? This certainly not but demanding a little more motivation is justified. Consider the following grammar:

R: A | B
A: a* c
B: a* d
In this grammar we encounter a so called FIRST/FIRST conflict. Given a string “aaa…” we cannot decide which of the rules A or B we have to choose, unless we observe a ‘c’ or ‘d’ event i.e. our string becomes “aa…ac” or “aa…ad”. What we basically want is to defer the choice of a rule, making a late choice instead of checking out rules by trial and error. By careful storing and recalling intermediate results we can avoid the consequences of an initial bad choice, to an extent that parsing in O(n) time with string length n becomes possible. Now the same can be achieved through automaton embeddings which gives us:
R: a* c | a* d
but in a revised form as seen in the previous section. On automaton level the information about the containing rules A and B is still present. If we use R for parsing we get state sets { (a,1,0,A), (a,2,0,B) } which recognize the same character “a”. Those state sets will be stored during parsing. In case of a “c”-event which will be recognized by the state (c, 3, 0, A) we only have to dig into the state-set sequence and follow the states(a, 1 , 0, A) back to the first element of the sequence. Since (A, 4, TRAIL_DOT, R) is the only follow state of (c, 3, 0, A)we will actually see the sequence:
from this sequence we can easily reconstruct contexts and build the tree
[R, [A, a, a, ..., c]]
All of this is realized by late choice. Until a “c” or “d” event we move within A and B at the same time because. The embedding of A and B in R solves the FIRST/FIRST conflict above. This is the meaning.

FOLLOW/FIRST conflicts

So far the article didn’t contain anything new. I’ve written about all of this before.

The FIRST/FIRST conflicts between FIRST-sets of a top down parser is not the only one we have to deal with. We also need to take left recursions into account, which can be considered as a special case of a FIRST/FIRST conflict but also FIRST/FOLLOW or better FOLLOW/FIRST conflicts which will be treated yet. A FOLLOW/FIRST conflict can be illustrated using the following grammar:

R: U*
U: A | B
A: a+ (B c)*
B: b
There is no FIRST/FIRST conflict between A and B and we can’t factor out a common prefix. But now suppose we want to parse the string “abb”. Obviously A recognizes the two initial characters “ab” and then fails at the 2nd “b” because “c” was expected. Now A can recognize “a” alone and then cancel the parse because (B c)* is an optional multiple of (B c). This is not a violation of the rules. After “a” has been recognized by A the rule B may take over and match “b” two times:
[R, [U, [A, a]], [U, [B, b]], [U, [B, b]]]
Trail applies a “longest match” recognition approach, which means here that A is greedy and matches as much characters in the string as possible. But according to the rule definition A can also terminate the parse after a and at that point the parser sets a so called checkpoint dynamically. Trail allows backtracking to this checkpoint, supposed the longest match approach fails after this checkpoint. Setting exactly one checkpoint for a given rule is still compliant with the longest match approach. If the given input string is “abcbb” then A will match “abc”, if it is “abcbcbb” then it is “abcbc” and so on.

The FOLLOW/FIRST conflict leads to a proper ambiguity and checkpoints are currently the approach used by Trail to deal with them. I also tried to handle FOLLOW/FIRST conflicts in an automaton embedding style but encountered fragmentation effects. The ambiguities were uncovered but paid with a loss of direction and longest match was effectively disabled.

The inevitability of left recursions

It is easy in top down parsing to recognize and remove or transform left recursive rule like this one

X: X a | ɛ
The phenomenology seems straightforward. But making those exclusions is like drawing political boundaries in colonialist Africa. Desert, vegetation, animals and humans don’t entirely respect decisions made by once rivaling French and English occupants. If embedding comes into play one has we can count on uncovering left recursions we didn’t expected them. I’d like to go even one step further which is conjectural: we can’t even know for sure that none will be uncovered. The dynamics of FIRST/FIRST conflicts that are uncovered by embeddings, this clash dynamics, as I like to call it might lead to algorithmically undecidable problems. It’s nothing I’ve systematically thought about but I wouldn’t be too surprised.

For almost any left recursive rule there is a FIRST/FIRST conflict of this rule with itself. Exceptions are cases which are uninteresting such as

X: X
X: X a
In both cases the FIRST-sets of X don’t contain any terminal symbol and they can’t recognize anything. They are like ɛ-states but also non-terminals. Very confusing. Trail rips them off and issues a warning. An interesting rule like
E: E '*' E | NUMBER
contains a FIRST/FIRST conflict between E and NUMBER. They cannot be removed through self embedding of E. Same goes with rules which hide a left recursion but leads to an embedding to embedding cycles, such as
T: u v [T] u w
which are quite common. We could try to work around them as we did with FOLLOW/FIRST conflicts, instead of downright solving them. Of course one can also give up top down parsing in favor for bottom up parsers of Earley type or GLR, but that’s entirely changing the game. The question is do we must tack backtracking solutions onto Trail which are deeper involved than checkpoints?

After 6 months of tinkering I wished the answer was no. Actually I believe that it is unavoidable but it occurs at places were I didn’t originally expected it and even in that cases I often observed/measured parsing efforts which is proportional to string length. Parse tree reconstruction from state-set traces, which was once straightforward becomes a particularly hairy affair.


Before I discuss left recursion problems in Trail in a follow up article I’ll present some results as a teaser.

Grammars for which parsing in Trail is O(n):

a) E: '(' E ')' | E '' E | NUMBER
b) E: '(' E+ ')' | E '' E | NUMBER
Other grammars in the same category are
c) G: 'u' 'v' [G] 'u' 'w'
d) G: G (G | 'c') | 'c'
e) G: G G | 'a'
However for the following grammars the parser is in O(2^n)
f) G: G G (G | 'h') | 'h'
g) G: G [G G] (G | 'h') | 'h'
If we combine  d) and f) we get
h) G: G G (G | 'h') | G (G | 'h') | 'h'
In this case Trail will deny service and throw a ParserConstructionError exception. Some pretty polynomial grammars will be lost.




LL(*) faster than LL(1)?

Posted in Algorithms, Grammars, Parsing, Python, TBP on September 12th, 2011 by kay – 4 Comments

No jumps

When I began to work on EasyExtend in 2006 I grabbed a Python parser from the web, written by Jonathan Riehl ( it doesn’t seem to be available anymore ). It was a pure Python parser following closely the example of CPython’s pgen. The implementation was very dense and the generated parser probably as fast as a Python parser could be. It was restricted to LL(1) though which was a severe limitation when I stepped deeper into the framework.

In mid 2007 I created a parse tree checker. The problem was that a parse tree could be the return value of a transformation of another parse tree: T : P -> P. How do we know that P is still compliant with a given syntax? This can be easily be solved by chasing NFAs of the target grammar, both horizontally i.e. within an NFA as well as vertically: calling checkers recursively for each parse tree node which belong to a non-terminal. This checker generator was only a tiny step apart from a parser generator which I started to work on in summer 2007.

What I initially found when I worked on the parse tree checker was that horizontal NFA chasing never has to take into account that there are two alternative branches in rules like this

R: a* b | a* c
The algorithm never checks out the first branch, runs through a sequence of  ‘a’ until it hits ‘b’ and when this fails, jumps back and checks out the other branch. There was no backtracking involved, also no backtracking with memoization. There was simply never any jump. Instead both branches are traversed simultaneously until they become distinct. It’s easy to express this on grammar level by applying left factoring to the rule
R: a* ( b | c )
However there was never any rule transformation to simplify the problem.

From rules to grammars

It’s actually an old approach to regular expression matching which is attributed to Ken Thompson. Russ Cox refreshed the collective memory about it a few years ago. This approach never seemed to make the transition from regular expressions to context free grammars – or it did and was given up again, I don’t know. I wanted a parser generator based on the algorithms I worked out for parse tree checkers. So I had to invent a conflict resolution strategy which is specific for CFGs. Take the following grammar

R: A | B
A: a* c
B: a* d
Again we have two branches, marked by the names of the non-terminals A and B and we want to decide late which one to choose.

First we turn the grammar into a regular expression:

R: a* c | a* d
but now we have lost context/structural information which needs to be somehow added:
R: a* c <A>| a* d <B>
The symbols <A> and <B> do not match a character or token. They merely represent the rules which would have been used when the matching algorithm scans beyond ‘c’ or ‘d’. So once the scanner enters <A> it will be finally decided that rule A was used. The same is true for <B>. Our example grammar is LL() and in order to figure out if either A or B is used we need, in principle at least, infinite lookahead. This hasn’t been changed through rule embedding but now we can deal with the LL() grammar as-if it was an LL(1) grammar + a small context marker.


What is lacking in the above representation is information about the precise scope of A and B once they are embedded into R. We rewrite the grammar slightly by indexing each of the symbols on the RHS of a rule by the name of the rule:

R: A[R] | B[R]
A: a[A]* c[A]
B: a[B]* d[A]
Now we can embed A and B into R while being able to preserve the context:
R: a[A]* c[A] <A[R]>| a[B]* d[B] <B[R]>
Matching now the string aad yields the following sequence of sets of matching symbols:
{a[A], a[B]}, {a[A], a[B]}, {d[B]}, {<B[R]>}
All of the indexed symbols in a set matches the same symbol. The used index has no impact on the matching behavior, so a[X], a[Y], … will alway match a.

Constructing a parse tree from the above set-sequence is done by reading the sequence from right to left and interpret it appropriately.

We start the interpretation by translating the rightmost symbol

<B[R]> -> [R,[B, .]]
The  dot ‘.’ is a placeholder for a sequence of symbols indexed with B. It remains adjacent to B and is removed when the construction is completed:
[R, [B, .]]
[R, [B, ., d]]
[R, [B, ., a, d]]
[R, [B, ., a, a, d]]
[R, [B, a, a, d]]


We can read the embedding process as ’embed rules A and B into R’ or dually ‘expand R using rules A and B’.  I’ve chosen the latter expression for the Trail parser generator because an expanded rule R has its own characteristics and is distinguished from an unexpanded rule.

The drawback of this method is that its implementation turns out to be rather complicated. It is also limited because it may run into cyclic embeddings which need to be detected. Finally successive embeddings can blow up the expanded rule to an extent that it makes sense to artificially terminate the process and fall back to a more general and less efficient solution. So we have to mess with it. Finally isn’t there are performance penalty due to the process of reconstruction?


To my surprise I found that an LL(*) grammar that uses expansion quite heavily ( expanded NFAs are created with about 1000 states ) performs slightly better than a simple LL(1) grammar without any expansion in CPython. For comparison I used a conservative extension language P4D of Python i.e. a superset of Python: every string accepted by Python shall also be accepted by P4D.

In order to measure performance I created the following simple script

import time
import decimal
import langscape
text = open(decimal.__file__.replace(".pyc", ".py")).read()
print "textlen", len(text)
python = langscape.load_langlet("python")
p4d = langscape.load_langlet("p4d")
def test(langlet):
    tokens = langlet.tokenize(text)
    a = time.time()
    for i in range(10):
    print "parser", langlet.config.langlet_name, (time.time() - a)/10

It imports a reasonably big Python module ( ) and parses it with two different parsers generated by Trail. Running it using CPython 2.7 yields the following result:

parser python 2.39329998493
parser p4d 2.25759999752
This shows that P4D is about 5% faster on average! Of course the overall performance is abysmal, but keep in mind that the parser is a pure Python prototype implementation and I’m mainly interested in qualitative results and algorithms at this point.

I’ve also checked out the script with PyPy, both with activated and deactivated JIT.

PyPy with option –JIT off:

parser python 6.5631000042
parser p4d 5.66440000534
Now the LL(*) parser of P4D is about 13-14 % faster than the LL(1) parser, which is much clearer. Activating the JIT reverses the pattern though and intense caching of function calls will pay of:

PyPy with JIT:

parser python 0.791500020027
parser p4d 1.06089999676
Here the Python parser is about 1/3 faster than the P4D parser.

The result of the competition depends on the particular implementation and the compiler/runtime optimizations or the lack thereof. The counter-intuitive result that an LL(*) parser is faster than an LL(1) parser could not be stabilized but also not clearly refuted. It’s still an interesting hypothesis though and rule expansion may turn out to be a valid optimization technique – also for LL(1) parsers which do not require it as a conflict resolution strategy. I will examine this in greater detail once I’ve implemented an ANSI C version of Trail.

Python26 expressions

Posted in Grammars, Langscape, Python on November 26th, 2010 by kay – Be the first to comment

When you look at the following listing you might think it’s just a sequence of nonsense statements in Python 26,maybe created for testing purposes:

raise a, b, c
import d
from e import*
import f
from .g import(a)
from b import c
from .import(e)
from f import(g)
from .a import(b, c as d,)
import e, f as g
from .a import(b, c,)

( Here is the complete listing ).

which is of course correct. It is a somewhat peculiar listing because you find expressions like

c[:,: d:,]**e


[c for d in e, lambda f = g, (a, b,) = c,: d, for e in f]

and a few others such as

b(*c, d, **e)**f

which will be rejected by the Python compiler because the *c argument precedes the name d but it is nevertheless “grammar correct” by which I mean it is consistent with Pythons context free LL(1) grammar.

The nice thing about the listing: it is automatically generated and it is complete in a certain sense.

Generative Grammars

When you look at a grammar rule like

for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]

it can be understood as an advice for producing exactly 2 expressions, namely:

'for' exprlist 'in' testlist ':' suite
'for' exprlist 'in' testlist ':' suite 'else' ':' suite

Other rules like

dictmaker: test ':' test (',' test ':' test)* [',']

has an infinite number of productions

test ':' test
test ':' test ','
test ':' test ',' test ':' test
test ':' test ',' test ':' test ','

When I created the listing I selected a small number of productions for each grammar rule. Each symbol in the rule should be covered and have at least one occurrence in the set of productions. Despite for_rule being finite and dictmaker being infinite the algorithm creates two productions for each.

After having enough productions to cover all syntactical subtleties ( expressed by the grammar ) I had to built one big rule containing all productions. This was actually the most demanding step in the design of the algorithm and I did it initially wrong.

Embedding of productions

Intuitively we can interpret all non-terminal symbols in our productions as variables which may be substituted. We expand test in dictmaker by selecting and inserting one production we got for the rule test. Unfortunately a grammar isn’t a tree but a directed, cyclic graph so we have to be extremely careful for not running into an infinite replacement loop. This is only a technical problem though and it can be handled using memoization. Here is a bigger one.

Look at the following two rules:

expr_stmt: testlist (augassign (yield_expr|testlist) |
           ('=' (yield_expr|testlist))*)
augassign: ('+=' | '-=' | '*=' | '/=' | '%=' | '&amp;=' | '|=' | '^=' |
            '&lt;&lt;=' | '&gt;&gt;=' | '**=' | '//=')

The only place where augassign occurs  in the grammar is in expr_stmt but counting the number of productions for augassign we get 12 whereas we only count 3 productions for expr_stmt and there is just a single production which contains expr_stmt. It is obviously impossible using a naive top down substitution without leaving a rest of productions which can’t be integrated. We have a system of dependencies which has to be resolved and the initial set of production rules must be adapted without introducing new productions which also cause new problems. This is possible but in my attempts the expressions became large and unreadable, so I tried something else.

Observe, that the most import start rule of the grammar ( Python has actually 4! Can you see which ones? ) is:

file_input: (NEWLINE | stmt)* ENDMARKER

I expect that each language has a rule of such a kind on a certain level of nesting. It produces a sequence of statements and newlines. I tried the following Ansatz:

Wrap each initially determined production which is not a production of a start rule into a stmt

Take the production ‘+=’ of augassign as an example. We find that augassign exists in expr_stmt. So we take one expr_stmt and embedd augassign in the concrete form  ‘+=’.

testlist '+=' yield_expr

The subsequent embedding steps

expr_stmt   -&gt; small_stmt
small_stmt  -&gt; simple_stmt
simple_stmt -&gt; stmt

When embedding small_stmt into simple_stmt one has to add a trailing NEWLINE. So our final result is:

testlist '+=' yield_expr NEWLINE

Any rule we used during successive embedding  doesn’t have to be used again as an initial rule of another embedding because it was already built into file_input. It can be reused though when needed. I did not attempted to minimize the number of embeddings.

Substitute non-terminals

Now since we got a single sequence of terminals and non-terminals which contains all our productions in a consistent way we are going to substitute the non-terminals. This is done s.t. that a minimum number of terminal symbols is required which explains some of the redundancies: we find import f and import d among the listed statements. I suspect one of them is a shortened form of import d.e but since the rule for building d.e allows using d only and it is shorter, it will be chosen.

Detecting Grammar flaws

Generating the above expressions also shows some flaws in the grammar which have to be corrected using the bytecode compiler ( or AST transformer ). This doesn’t mean that Pythons grammar isn’t carefully crafted, quite the contrary is true, but highlights some of the limitations of using an LL(1) grammar. For example, it is quite simple although a little cumbersome to express argument orderings in variable arguments lists using non-LL(1) grammars:

file_input: (NEWLINE | stmt)* ENDMARKER
simpleargs: fpdef (',' fpdef)*
defaultargs: fpdef '=' test (',' fpdef '=' test)*
starargs: '*' NAME
dstarargs: '**' NAME
varargslist: ( simpleargs [',' defaultargs] [',' starargs] [','dstarargs] |
               defaultargs [',' starargs] [','dstarargs] |
               starargs [','dstarargs] |
               dstarargs) [',']

So when you craft your own grammar, automatic expression generation might aid design decisions. Detecting flaws early can spare lots of code used to add additional checks later on.


In case of Langscape the primary goal was to safeguard grammar refactorings. It is not generally possible to proof that two context free grammars are equal i.e. recognize the same language. But the same holds for any two programs in the general case in even more powerful, Turing complete, languages. This doesn’t imply we never change any code. It is a standard practice to safeguard refactorings using unit tests and so we start to do here.

If we assume that two different grammars G1, G2 recognize the same language L then their parsers P(G1), P(G2) must at least be able to parse the grammar generated expression of the other grammar respectively: P(G1)(Expr(G2)) -> OK;  P(G2)(Expr(G1)) -> OK.

Of course we can refine this criterion by including bad case tests or comparing the selection sequences of TokenTracers for Expr(G1), Expr(G2) which must be equal. Last but not least we can use higher approximations.

Higher approximations

Doesn’t the listing give us a 1st order approximation of the language? It’s a fun idea to think of all those listing expressions living in the “tangential space” of the language. “Higher approximation” would simply mean longer traces though ( if they are possible due to the presence of a Kleene star ). This yields a simpler idea: we create the set Tr(K, nfa) of traces of length K for a given nfa. Tr(K, nfa) may be empty for some K.Unfortunately we can’t infer from Tr(K) = {} that Tr(K+1) = {}. So what is a good stop criterion then?

The algorithm for creating Tr(K, nfa) is quite simple. The following functions are Langscape implementations:

def compute_tr(K, nfa):
    Computes the set Tr(K, nfa) of traces of length K for a given nfa.
    The return value may be [] if no trace of length K exists.
    _, start, trans = nfa
    return compute_subtraces(K, 0, start, [], trans)
def compute_subtraces(K, k, S, trace, trans):
    Computes complete traces of a given length.
    :param K: The prescribed length a trace shall have.
    :param k: The current length of a trace ( used by recursive calls ).
    :param trace: the current trace.
    :param trans: the {state:[follow-states]} dictionary which characterizes
                  one NFA.
    traces = []
    follow = trans[S]
    for F in follow:
        if F[0] is None:
            # termination condition fulfilled?
            if k == K:
            m = trace.count(F)
            # impossible to terminate trace under this condition
            if m == K:
                traces+=compute_subtraces(K, max(k,m+1), F, trace+[F], trans)
    return traces

Syntax algebra – first steps

Posted in Algorithms, Grammars on January 30th, 2010 by kay – Be the first to comment

Principle of relativity

I started to revisit syntactic mappings defined in EasyExtend 3 which are closely tied to the Python grammar being  in use. Those are functions like varargs2arglist, normalize, split_file_input or exprlist2testlist defined in the module. One of the major challenges of future releases of EasyExtend ( or a definite successor project – i don’t know ) is to abstract from Python as a target language. In EE3 only Python can be targeted by a langlet transformation whereas in EE 4 langlets are symmetric: each langlet can be a target langlet for any other langlet. All langlets exist on the same footing and also: each langlet can be used as a parent langlet of a newly defined langlet which means a proper inheritance relationship and maximum reuse.

The relativity among langlets calls for raising the bar of abstraction. All Python specific dependencies, and there are a lot in EE3, have to be removed from the basic CST libraries. Indeed not even nodes of special importance like expr, stmt or atom shall be allowed in places other than modules which are specific to a particular langlet. The generalization of the mentioned functions leads to a study of abstract syntactic forms and some sort of “syntax algebra”. It isn’t studied rigorously in this article but shall be at least motivated. As a prerequisite I recommend to read my article about CST interpolation which discusses concepts of major relevance.


The exprlist2testlist function turns a node of type exprlist defined as

exprlist: expr (',' expr)* [',']

into a node of type testlistdefined as

testlist: test (',' test)* [',']

This works without adding information because {test, expr} is a valid interpolation i.e. there is a nested sequence

[test, [or_test, [and_test, [not_test, [comparison , expr]]]]]
which is a representation of a valid CST. In terms of CST interpolations test(expr) yields a node of type test and induces a homomorphism exprlist->testlist. More generally an interpolation {A, B} induces an embedding B (x B)* -> {A,B} (y {A,B})* = A (y A)* if x and y are constant terminal symbols i.e. terminal symbols where the corresponding token have a uniquely determined token string.

Blocks or no blocks

Another relevant example is the normalize function. The idea behind normalize is that statements like if x: y or def foo(): pass are semantically equivalent to block statements:

if x:
def foo():

Those block statements can be used in a more general fashion because we can add other block statements in the thunked block. In Python blocks are expressed by the suite grammar rule:

suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

Since {stmt, simple_stmt} is a valid interpolation, we can substitute all occurences of

suite: simple_stmt

with suites of the form


The general syntactic embedding is of the kind

B -> a b… {A,B} c d… with a, b, c, d … being constant terminal symbols.

Notice that INDENT is assumed to be a constant terminal despite the fact that the token string may vary. INDENT is treated special because in practical applications the number of used spaces for an indentation is fixed.  EasyExtend always uses 4 spaces.

Splitting nodes

The function split_file_input is the prototype of a node splitting function which can be thought in analogy to string splits. In this particular case we have a node file_input of the form

file_input: (NEWLINE | stmt)* ENDMARKER

and want to generate a sequence of nodes file_input: stmt ENDMARKER, file_input: NEWLINE ENDMARKER and file_input: ENDMARKER – one for each NEWLINE, stmt and ENDMARKER in the original file_input node. It doesn’t matter that file_input: NEWLINE ENDMARKER and file_input: ENDMARKER are likely be thrown away by an application because this can be decided by the nodes consuming function. The general split function is defined by

R: x (A|B…)* y -> [(R: x A y), (R: x B y), …, (R: x y)]

Node rewritings

The mappings considered above were all rather straightforward. Now we want to discuss a rule transformation which is less obvious, namely that of a function signature into an argument tuple of a function call. In Python 2.6 the function signature is defined as

varargslist: ((fpdef ['=' test] ',')*
('*' NAME [',' '**' NAME] | '**' NAME) |
fpdef ['=' test] (',' fpdef ['=' test])* [','])
fpdef: NAME | '(' fplist ')'
fplist: fpdef (',' fpdef)* [',']

and an argument list of a function call by

arglist: (argument ',')* (argument [',']| '*' test [',' '**' test] | '**' test)
argument: test [gen_for] | test '=' test

Can we transform each varargslist into an arglist?

Let’s start our treatment of varargslist with fpdef. If we insert the RHS of fplist in fpdef we get

fpdef: NAME | '(' fpdef (',' fpdef)* [','] ')'

We show that this rule is a special form of  the node atom and since {test, atom} is a valid interpolation it is also a test node. The atom node is defined by

atom: NAME | '(' [yield_expr|testlist_gexp] ')' |  '[' [listmaker] ']' | ...

which can be specialized to

atom: NAME | '(' testlist_gexp ')'

Next we consider the testlist_gexp definition

testlist_gexp: test ( gen_for | (',' test)* [','] )

which can be specialized to

testlist_gexp: test (',' test)* [',']

We insert testlist_gexp in atom which yields

atom: NAME | '(' test (',' test)* [','] ')'

If we reduce test to atom we get a rule

atom: NAME | '(' atom (',' atom)* [','] ')'

which is isomorphic to fpdef. So we just need to substitute all occurrences of fpdef in fpdef with atom, then replace atom with test(atom) and finally replace the whole of atom again with test(atom). This procedure substitutes fpdef with test.

When we substitute each occurrence of NAME with testin varargslist we get:

(test ['=' test] ',')* ('*' test [',' '**' test] | '**' test) |
                       test ['=' test] (',' test ['=' test])* [',']

which can be reduced to

(argument ',')* ('*' test [',' '**' test] | '**' test) |
                 argument (',' argument)* [',']

which is the same as

(argument ',')* (argument [','] | '*' test [',' '**' test] | '**' test)


Syntax algebra

We have done some informal steps into syntax algebra with some real functions defined in EE 3 as a starting point. For the first three functions we have found general syntactical transformations which might be universally applicable. The last transformation is very specific though and it might be more interesting to determine an algorithm used to find a rule transformation of a particular kind. Although the search algorithm might be NP complete I assume that the found transformation – if one exists – has linear time complexity which is what we want. Such an algorithm would be another great achievement of EasyExtend which does not cease to surprise me.

Is parsing Perl really impossible?

Posted in Grammars, Parsing on June 12th, 2009 by kay – 5 Comments

I just read this article about the apparent inability to parse Perl 5.

There is an underlying assumption that a parser has to resolve all ambiguities and derive a single parse tree from an expression ( giving a unique interpretation ). But instead of demanding the value of an expression in order to parse it the parser might derive multiple parse trees and insert them as child nodes into a new parse tree representing a conditional expression. So the parser can just inflate the parse tree structure and do its job.

Whether or not the conditional expression can be computed at runtime doesn’t have to bother the compiler developer. It is just another function call and no more dubious then an arbitrary function call in a conditional expression

    printf("foo() is true\n");
    printf("foo() is false\n");

So it makes sense to fix requirements to a parser before giving proofs of their impossibility.

CodeTemplates – a fresh look on code transformations

Posted in DSL, Grammars, TBP on June 7th, 2009 by kay – Be the first to comment

This is the first part of a two part article series about CodeTemplates. CodeTemplates are a code transformation technique that suggests an alternative to common syntactic as well as lexical macros for whole language transformations.

Syntactic macros

About three years ago I experimented with a “syntactic” macro system for Python incorporated in EasyExtend 2. In the meantime another project emerged called MetaPython striving for a similar goal under the sign of the Ouroboros.

Fundamentally a macro system defines functions called macros that act as compile time code generators. It is not surprising that code is data – how else shall a compiler work? – however few languages specify functions that treat source code as source code and operators that evaluate source code at compile time. By quasi-quotation some piece of code in a macro body is mentioned rather than evaluated. Complementary to quasi-quoting there is a splice-operator that evaluates code and returns other code preferably in AST form that becomes aligned with the AST of the quasi-quoted code.

The difference between source code and ASTs is going to vanish in so called homoiconic languages like Lisp where the source code that is written by the user is also a data-structure of the language. So a macro is essentially a function that transforms those data-structures meant to be code. It doesn’t mean though that homoiconicity is prerequisite for syntactic macros. The cited MetaPython but also MetaLua are examples for macro systems in non-homoiconic languages. A nice treatment of the foundations of macro systems in non-homoiconic languages can also be found in the documentation of the Converge language.

Syntactic macros in EE 2

EasyExtend 2.0 defined a macro system in an own langlet, simply called the macro langlet. The role of macros in EasyExtend is somewhat peculiar though because EE is a system used for whole language transformations with language syntax specified in grammars. EE binds node handlers to parse tree node types and those node handlers emit parse tree nodes of the target language. They were just meant to facilitate code transformations within the framework.

Unfortunately it turned out that they didn’t. A bug in a macro caused hard to detect language transformation errors. Error recovery is a major concern if not the single major concern with language transformers and the complicated quasi-quoting / splicing semantics of macros made this even harder. More often than I ever wanted I debugged through framework when attempting to detect a bug in the macro. Furthermore the macro langlet operated within the system and was transformed as well. So two languages were transformed at the same time whereas one of them was additionally engaged in transforming the other. EasyExtend 2 made this possible without any conflicts implementing a joint transformation machinery and separated ranges of node types node handlers could bind to. The macro transformation was implemented as a set of node handlers in the LangletTransformer of the macro langlet and it was an ugly beast of code. I intended to rewrite this for EE 3 but it at some point of time I decided to terminate this intellectual exercise and dropped the macro facility completely.

Why not just textual macros?

A textual substitution macro occasionally also called “lexical” macro is a simple way to do code transformations. The most important programming language ever, C, implements them by means of the C preprocessor. Lexical macros are also known as poor-mans macros and using them in EasyExtend was like leaving civilization because city-life is complicated and corrupt and one better strives for a subcomplex life as a farmer or hunter-gatherer. Furthermore EE is in deep love with syntax so a radical conceptualization of the macro idea is required. Fortunately the intellectual effort for this is low and can be summarized in one sentence

Enable arbitrary string transformations guarded by syntax

This is universally applicable for all context free languages and doesn’t require any additional macro syntax.

If one attempts to transform one string S1 into another one S2 all that is needed are the following three operators

  • insert
  • delete
  • subst

The edit distance between S1 and S2 is the minimal number of operations of this kind used to transform S1 into S2. It can be computed by the popular algorithm of Levenshtein.

EasyExtend 4 will use a slightly different set of operators

  • replicate
  • delete
  • subst

It is almost equally expressive except for the case of an empty string S1 that requires an initial insert and there is nothing yet to replicate, delete or substitute. But once you have a single character in a string, one can replicate this character and apply substitutions which yields the same as applying multiple insertions.

CodeTemplate objects in EE 4

EasyExtend 4 defines a module called that defines 4 types of objects called CodeTemplate, CodeMarker, CodeSelection and SelectionTree.

A CodeTemplate is an object that wraps some piece of source code in some language. The source code is parsed on construction of the CodeTemplate.

A CodeMarker marks sections in the source by means of search pattern. One can bind arbitrary many CodeMarker objects to a CodeTemplate. It is possible to also bind CodeMarkers to other CodeMarkers building a CodeMarker tree. When a CodeMarker B is bound to another CodeMarker A the search pattern of B is only applied to the code marked by A.

A CodeTemplate implements a matchmethod. If it is applied the pattern specified by all bound CodeMarker objects are matched against the code of the CodeTemplate. The result is a SelectionTree object. The tree holds a list of CodeSelection objects as well as a list of subtrees.

All transformations of the source code wrapped by the CodeTemplate are applied through CodeSelection objects. It is the CodeSelection object that substitutes, replicates or deletes code that is bound to. Note that internally all those operations still happen on a parse tree. Invalid operations will immediately be rejected and an exceptions is raised. It is not a complete syntax check on the whole code that is applied on each operation and it wouldn’t make sense to perform one either because some piece of code can morphed into code of another language and it will be a hybrid piece of code except at transformations start and the endpoint.

The SelectionTree is clonable and can be reused for other substitutions.


In the next article I’ll show CodeTemplates in action.

VHDL grammars and the parser sandwich

Posted in Grammars on May 20th, 2009 by kay – 4 Comments

Eli Bendersky has mentioned some problems of parsing VHDL. I was quite pleased though finding the Postlexer idea being discussed in the following paper which provides a more thorough treatment of VHDL parsing issues.

They call it “Auxiliary Terminal Generator” which is a more precise notation of what I call a “Postlexer”. In that case a Postlexer resolves ambiguities caused by context sensitivity. The solution is to introduce auxiliary terminal symbols in between two context free passes – the one for lexical analysis and parsing.

What I distaste about their approach is that they coupled the grammar with the Postlexer using parsing actions. Apparently generations of computing scientists enjoyed squeezing as much information as possible into their grammars instead of acknowledging the obvious: a separate parsing phase is often required for non-trivial grammars and since the problems are specific one can leave it entirely to a powerful general purpose language and the programmer to solve them. As a consequence the parser isn’t portable across languages but at least the grammars are.

Generally speaking one has to consider three functions now instead of two:

The Parser Sandwich

Lexer: Parsetable String -&gt; Tokenstream
Postlexer: Tokenstream -&gt; Tokenstream
Parser: Parsetable Tokenstream -&gt; Tree

Lexer and Parser shall be universally applicable for any language whereas the Postlexer is always specific and can be omitted if the whole language can be described in a context free way.

Trace Based Parsing (Part V) – From Grammars to NFAs

Posted in EasyExtend, Grammars, TBP on May 7th, 2009 by kay – Be the first to comment

In this article I will demonstrate the translation process by which an EBNF grammar will be translated into a Trail NFA. There are several different notations for EBNF that differ only in minor details. I’ve chosen the one of Pythons Grammar which can roughly be described using following set of EBNF rules:

file_input: ( RULE | NEWLINE )* ENDMARKER
RHS: ALT ( '|' ALT )*
ITEM: '[' RHS ']' | ATOM [ '*' | '+' ]

The grammar is self-describing. Starting with the non-terminal file_input you can parse the grammar using its own production rules. The multiplicities the EBNF grammar uses are:

  • [X] – zero or one time X
  • X* – zero or more times X
  • X+ – one or more times X

EasyExtend defines a langlet called grammar_langlet used to parse grammars. With the grammar_langlet EBNF grammars become just ordinary languages supported by EE. The parse trees created from grammars will be transformed into NFAs.

From EBNF parse trees to Trail NFAs

The grammar_langlet defines transformations on parse trees. More concretely it wraps pieces of the grammar into Rule objects. Rule objects are basically list wrappers with an additional copy method which is not of interest here.

class Rule(object):
    def __init__(self, lst):
        self.lst = lst
    def copy(self):

From Rule we derive several subclasses which can be reflected by issubclass:

class ConstRule(Rule): ...
class SequenceRule(Rule): ...
class AltRule(Rule): ...
class EmptyRule(Rule): ...

What you don’t find are rules like MaybeRule or ZeroOrMoreRule because we eliminate multiplicities entirely. This will discussed later.

Names and strings are wrapped wrapped into ConstRule objects as follows:

ConstRule([('file_name', 0)])
ConstRule([('"("', 1)])
ConstRule([('RULE', 2)])

In this scheme we already recover the indices that directly translates to states of Trail NFAs. A special case of a ConstRule is the rule that wraps None:

ConstRule([('None', '-')])

The rules of type SequenceRule and AltRule serve as containers for other rules.

Elimination of multiplicities

We apply a trick to get rid of rules with multiplicities i.e. rules defined by X*, X+ and [X]. All of them can be reduced to special combinations of our basic rule set. Before we show the details we need to introduce the EmptyRule type. Apart from the unique exit state (None, ‘-‘) there might be several other empty states (None, nullidx). Just like the constant rules those empty states get enumerated but with an own index and just like any but the exit state they might occur on the left-hand side of a rule. The idea of using those empty states might be demonstrated using a simple example. Consider the following rule:

R: A [B]

We can manually translate it into the NFA

(R, 0): [(A, 1)]
(A, 1): [(B, 2), (None, '-')]
(B, 2): [(None, '-')]

However the translation becomes much simpler when we reify empty productions and insert them first:

(R, 0): [(A, 1)]
(A, 1): [(B, 2), (None, 1)]
(None, 1): [(None, '-')]
(B, 2): [(None, '-')]

What we have effectively done here is to interpret [B] as B | None. In a second step we eliminate the indexed empty states again which will lead to NFAs of the first form. Removing A* and A+ is similar but a little more tricky. We write

A*  → None | A | A A
A+  → A | A A

This looks wrong but only when we think about those transitions in terms of an an indexing scheme that assigns different numerals to each symbol that occurs in the rule. But now all A on the right hand side shall actually be indistinguishable and lead to identical states. With this modification of the translation semantics in mind we translate the grammar rule R: A* into

(R, 0): [(A, 1), (None, 1)]
(A, 1): [(A, 1), (None, '-')]
(None, 1): [(None, '-')]

Eliminating the indexed empty state leads to this NFA

(R, 0): [(A, 1), (None, '-')]
(A, 1): [(A, 1), (None, '-')]

which is just perfect.

The reduction of rules with multiplicities to sequences and alternatives can be applied to Rule objects as well:

[A] →  AltRule( EmptyRule([(None, nullidx)]), A)
A*  →  AltRule( EmptyRule([(None, nullidx)]), A, SequenceRule([A, A]))
A+  →  AltRule( A, SequenceRule([A, A]))

Since we have found a simple a way to express grammars by rule trees containing only the mentioned rule types we can go on and flatten those trees into a tabular form.

Computing NFAs from rule trees

First we define an auxiliary function called end(). For a rule object R this function computes a set of constant or empty rules that terminate R. This means end(R) contains the last symbols of R wrapped into rule objects.

def end(R):
    if not R.lst:
        return set()
    if isinstance(R, (ConstRule, EmptyRule)):
        return set([R])
    elif isinstance(R, AltRule):
        return reduce(lambda x, y: x.union(y), [end(U) for U in R.lst])
    else:  # SequenceRule
        S = R.lst[-1]
        return end(S)

With this function we can implement the main algorithm.

Suppose you have a SequenceRule(S1, …, Sk). The end points of each S[i] are computed by application of end(S[i]) and shall be connected with the start points of S[i+1]. Those starting points will build the follow sets of S[i]’s end points. If S[i+1] is a constant or empty rule we are done. If S[i+1] is an AltRule we compute the connection of S[i] with each entry in S[i+1]. If S[i+1] is a SequenceRule we recursively call our connection building algorithm with S[i] being prepended to the sub-rules of S[i+1]. In finitely many steps we always find the complete set of constant/empty rules S[i] can be connected with. Here is the Python implementation:

def build_nfa(rule, start = None, nfa = None):
    if not start:
        nfa = {}
        # rule.lst[0] == (rule name, 0)
        # rule.lst[1] == SequenceRule(...)
        start = set([ConstRule([rule.lst[0]])])
        return build_nfa(rule.lst[1], start, nfa)
    if isinstance(rule, SequenceRule):
        for i, R in enumerate(rule.lst):
            build_nfa(R, start, nfa)
            start = end(R)
    elif isinstance(rule, AltRule):
        for R in rule.lst:
            build_nfa(R, start, nfa)
    else: # ConstRule or EmptyRule
        r = rule.lst[0]
        for s in start:
            state = s.lst[0]
            follow = nfa.get(state, set())
            nfa[state] = follow
    return nfa

The last step is to remove indexed empty states (None, idx).

def nfa_reduction(nfa):
    removable = []
    for (K, idx) in nfa:
        if K is None and idx!="-":
            F = nfa[(K, idx)]
            for follow in nfa.values():
                if (K, idx) in follow:
                    follow.remove((K, idx))
            removable.append((K, idx))
    for R in removable:
        del nfa[R]
    return nfa

This yields an NFA. In order to make it suitable for Trail one additionally has to map each state (A, idx) to (A, idx, R) with the rule id R.

What’s next?

So far all our discussions were placed in the realm of nice context free languages and their aspects. However the real world is hardly that nice and real language syntax can only be approximated by context free formalisms. Context sensitive languages are pervasive and they range from Python over Javascript to C, C++, Ruby and Perl. Instead of augmenting grammars by parser actions EasyExtend factors context sensitive actions out and let the programmer hacking in Python.

Trace Based Parsing ( Part IV ) – Recursive NFAs

Posted in EasyExtend, Grammars, TBP on May 6th, 2009 by kay – Be the first to comment

Not long ago I inserted the following piece of code in the EasyExtend/trail/ module which defines NFA tracers and cursors and deals with the reconstruction of parse trees from state set sequences.

class NFAInterpreter(object):
    def __init__(self, nfa):
        for state in nfa:
            if state[1] == 0:
                self.start = state
            raise ValueError("Incorrect NFA - start state not found")
        self.nfa = ["", "", self.start, nfa]
    def run(self, tokstream):
        cursor = NFACursor(self.nfa, {})
        selection = cursor.move(self.start[0])
        for t in tokstream:
        return cursor.derive_tree([])

This little interpreter is basically for experimentation with advancements of Trail. The idea is to initialize a a single NFAInterpreter object with the transition table of an NFA. This table can be handcrafted and represents a concept. The run method keeps a token stream and uses it to iterate through the NFA. Finally a parse tree will be derived from a state sequence. This derivation process is defined in the method derive_treeof the class NFAStateSetSequence of the same module and it has to be adapted. I’ll show you an interesting example.

Recursive rules

Recursively defined rules are most likely a cause for Trail to turn into backtracking mode. I’ve had some ideas in the past on how to deal with recursion in Trail based on Trail NFAs. Here are some of my findings.

For the following discussion I’ll consider a single grammar rule which might be among the most simple albeit interesting rules that cannot be expanded:

R: a b [R] a c
The inner R causes a First/First conflict because a b … and a c are both possible continuations at the location of [R]. This conflict can’t be resolved by expansion because it would just be reproduced:
R: a b [a b [R] a c] a c
We do now translate R into an NFA.

{(R, 0, R): [(a, 1, R)],
 (a, 1, R): [(b, 2, R)],
 (b, 2, R): [(R, 3, R), (a, 4, R)],
 (R, 3, R): [(a, 4, R)],
 (a, 4, R): [(c, 5, R)],
 (c, 5, R): [(None, '-', R)]

What I intend to introduce now is some sort of tail recursion elimination ( despite Guidos verdict about it ). This is done by the introduction of two new control characters for NFA states. Remember that a state (A, ‘.’, 23, B) contains the control character ‘.’ which is created when an NFA A is embedded in another NFA B. Transient states like this containing control characters don’t produce own nid selections but get nevertheless recorded and stored in state set sequences and are important in the derivation process of parse trees. The idea is to replace the inner state (R, 3, R) by a pair of transient states (R, ‘(‘, 0, R) and (R, ‘)’, 0, R).

The transient state (R, ‘(‘, 0, R) acts like a pointer to the beginning of the rule whereas the state (R, ‘)’, 0, R) is close to the end only followed by the exit state (None, ‘-‘, R) and the states following the formerly embedded (R, 3, R). Here is what we get:

{(R, 0, R): [(R, '(', 0, R)],
(R, '(', 0, R): [(a, 1, R)],
(a, 1, R): [(b, 2, R)],
(b, 2, R): [(R, '(', 0, R), (a, 4, R)],
(a, 4, R): [(c, 5, R)],
(c, 5, R): [(R, ')', 0, R)],
(R, ')', 0, R): [(None, '-', R), (a, 4, R)]

A correct parse created with this rule implies an exact pairing of opening and closing parentheses. If we have passed through (R, ‘(‘, 0, R) n-times we also need to pass n-times though (R, ‘)’, 0, R). But the NFA is constructed such that it can be left any time we have passed through (R, ‘)’, 0, R) because the exit state is its follow state. NFAs cannot count and the tracer can’t be used to rule out all incorrect parse trees.

The way parse trees created by those kind of rules are checked is directly at their (re-)construction from state sequences. This leads us into the next step after having introduced a new NFA scheme and new transient NFA states.

Customization of derive_tree method

The derive_tree() method of the NFAStateSetSequence is the location where parse trees are created from state lists. Here is also the place to define the behavior associated with transient states. We give a listing of the relevant section that deals with ‘(‘ and ‘)’ control characters in transient states.

def derive_tree(self, sub_trees):
    states = self.unwind() # creates state list from state set sequence
    root = states[0][0][0]
    tree = []
    rule = [root]
    cycle = []
    std_err_msg = "Failed to derive tree."
                  "Cannot build node [%s, ...]"%root
    for state, tok in states[1:]:
        nid  = state[0]
        link = state[-1]
        IDX  = state[1]
        if IDX == SKIP:
        # begin of '(' and ')' control character treatment
        # and their transient states
        elif IDX == '(':
        elif IDX == ')':
            if cycle:
                rec = cycle.pop()
                if (rec[0], rec[2], rec[3]) != (state[0], state[2], state[3]):
                    raise ValueError(std_err_msg)
                raise ValueError(std_err_msg)
            for i in xrange(len(tree)-1, -1, -1):
                t_i = tree[i]
                if type(t_i) == tuple:
                    if t_i[1] == '(':
                        tree, T = tree[:i], tree[i+1:]
                        if tree:
                            T.insert(0, link)
                            tree = T
                raise ValueError(std_err_msg)
        elif nid is None:
            if cycle:  # there must not be cycles left
                raise ValueError(std_err_msg)
            if type(tree[0]) == int:
                return tree
                tree.insert(0, link)
                return tree
        # end of '(' , ')'

Here you can download the complete code within its context. At the top the nfatracing module CONTROL characters are defined. The list can be extended by your own characters.

Testing our NFA

Let’s write some tests for our NFA interpreter:

R = 'R'
a, b, c = 'a', 'b', 'c'
nfa = {(R, 0, R): [(R, '(', 0, R)],
       (R, '(', 0, R): [(a, 1, R)],
       (a, 1, R): [(b, 2, R)],
       (b, 2, R): [(R, '(', 0, R), (a, 4, R)],
       (a, 4, R): [(c, 5, R)],
       (c, 5, R): [(R, ')', 0, R)],
       (R, ')', 0, R): [(None, '-', R), (a, 4, R)]
interpreter = NFAInterpreter(nfa)
assert'abac') == ['R', 'a', 'b', 'a', 'c']
assert'ababacac') == ['R', 'a', 'b',
                                           ['R', 'a', 'b', 'a', 'c'],
                                            'a', 'c']

With those modifications being applied the parser still exists within O(n) bounds which is quite encouraging. How to go from here to a fully general treatment of recursion in NFAs? Here are some grammars which represent challenges:

Mutually recursive expansions:

A: a b [B] a c
B: a b [A] a d

Immediate left recursion:

A: A '+' A | a

Multiple occurrences of a rule in immediate succession

A: (A '+' A A) | a

In all the of mentioned cases left recursion conflicts can be removed quite trivially. In complex grammars, however, they can lurk somewhere and with NFA expansion you can’t be sure that they do not emerge at places where you didn’t expect them.

What’s next?

In the next article I’ll move a few steps back and show how to create Trail NFAs from EBNF grammars. EasyExtend 3 contained a rather complicated and obscure method due to my apparent inability to find a simple simple algorithm at the time of its implementation. Recently I found a much simpler and more reliable generation scheme that shall be demonstrated.