TBP

The state of the Trail parser generator

Posted in Algorithms, Grammars, Langscape, Parsing, TBP, Trail on April 10th, 2012 by kay – Be the first to comment

White Easter

Snow came back, if only for a brief moment, to remind me to lay Trail to rest until next winter…

 x + + + + + + + + +
 - - - - - - - - - x
 - - - - - - - - x +
 - - - - - - - x + +
 - - - - - - x + + +
 - - - - - x + + + +
 - - - - x + + + + +
 - - - x + + + + + +
 - - x + + + + + + +
 - x + + + + + + + + 

Flat Automatons

This story begins where the last blog article ended, in Sept. 2011. At that time I realized, contrary to my expectations, that careful embedding of finite automatons into each other could yield higher execution speed for parsers than without those embeddings, such as LL(1), and this despite a far more complex machinery and the additional effort of reconstructing a parse tree from the state sequences generated while traversing the automaton. It wasn't that I believed this speed advantage wouldn't go away when moving towards an optimized, for-speed implementation (running my samples on PyPy confirmed this assumption), but it was an encouraging sign that the general direction was promising and more ambitious goals could be realized. I wasn't entirely mistaken, but what came out in the end is also scary, if not monstrous. Anyway, let's begin.

If the automaton embeddings I just mentioned were perfect, the grammar, which is translated into a set of those automatons rule by rule, would dissolve into a single finite automaton. It would then be completely flat and contain only terminal symbols referring to each other. An intrinsic hurdle is recursion. A recursive rule like

X: X X | a
would give us an automaton that has to be embedded into itself, which is not finitely possible. Instead we could consider successive self-embedding as a generative process:
X X | a
X X X X | X X a | a X X | a X | X a | a a | a
...

Working embeddings

The difficulties of creating a flat, finite automaton from a grammar are twofold: perfect embedding/inlining leads to information loss, and recursive rules lead to cyclic or infinite expansion. Of these two problems I solved the first and easier one in the early versions of the Trail parser generator. This was done by preparing automaton states and introducing special ɛ-states. In automata theory an ɛ-state corresponds to an empty word, i.e. it won't be used to recognize a character / token. It solely regulates transitions within an automaton. In BNF we may write:

X: X a
X: ɛ
which means that X can produce the empty word. Since ɛa = a the rule X accepts all finite sequences of a. In EBNF we can rewrite the X-rule as
X: [X] a
which summarizes the optionality of the inner X well. We write the automaton of the X-rule as a transition table
(X,0): (X,1) (a,2)
(X,1): (a,2)
(a,2): (FIN,-1)
Each state carries a unique index, which allows us to distinguish arbitrarily many different X and a states in the automaton. If we further want to embed X within another rule, say
Y: X b
which is defined by the table
(Y,0): (X,1)
(X,1): (b,2)
(b,2): (FIN,-1)
the single index is not sufficient and we need a second index which individualizes each state by taking a reference to the containing automaton:
(X,0,X): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): (FIN,-1,X)
With this notion we can embed X in Y:
(Y,0,Y): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): (FIN,-1,X)
(FIN,-1,X): (b,2,Y)
(b,2,Y): (FIN,-1,Y)
The initial state (X,0,X) of X was completely removed. The automaton is still erroneous though. The final state (FIN,-1,X) of X is not a final state in Y. It even has a transition! We could try to remove it completely and instead write
(Y,0,Y): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X):   (b,2,Y)
(b,2,Y): (FIN,-1,Y)
But suppose Y had the form:
Y: X X b
then the embedding of X would have the effect of removing the boundary between the two X's, which would again be a loss of structure. What we do instead is transform the final state (FIN, -1, X), when embedded in Y, into an ɛ-state in Y:
(FIN,-1,X) => (X, 3, TRAIL_DOT, Y)
The tuple which describes automaton states in Trail thus grows by one entry. A state which is not an ɛ-state has the shape (_, _, 0, _). Finally the fully and correctly embedded automaton X in Y looks like this:
(Y,0,0,Y): (X,1,0,X) (a,2,0,X)
(X,1,0,X): (a,2,0,X)
(a,2,0,X): (X,3,TRAIL_DOT,Y)
(X,3,TRAIL_DOT,Y): (b,2,0,Y)
(b,2,0,Y): (FIN,-1,0,Y)
    
The TRAIL_DOT marks the transition between X and Y in Y. In principle we are free to define infinitely many ɛ-states. In the end we will define exactly 5 types.
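
To make the mechanics concrete, here is a small sketch of such an embedding step on transition tables written as Python dicts. It is a simplification for illustration; the helper name embed and the concrete TRAIL_DOT marker value are assumptions, not Trail's actual code.

TRAIL_DOT = 'TRAIL_DOT'   # marker for this kind of epsilon-state

def embed(outer, inner, at, new_idx):
    # embed automaton `inner` into `outer` at the state `at` that refers to it
    start = next(s for s in inner if s[1] == 0)              # e.g. ('X', 0, 0, 'X')
    dot = (start[3], new_idx, TRAIL_DOT, at[3])              # epsilon-state replacing X's FIN
    result = {}
    for state, follow in outer.items():
        if state == at:
            continue                                         # ('X', 1, 0, 'Y') disappears
        # entering X in Y now means entering X's first states directly
        result[state] = [g for f in follow
                           for g in (inner[start] if f == at else [f])]
    for state, follow in inner.items():
        if state == start:
            continue                                         # X's start state is dropped
        result[state] = [dot if f[1] == -1 else f for f in follow]
    result[dot] = outer[at]                                  # leaving X continues in Y
    return result

X = {('X', 0, 0, 'X'): [('X', 1, 0, 'X'), ('a', 2, 0, 'X')],
     ('X', 1, 0, 'X'): [('a', 2, 0, 'X')],
     ('a', 2, 0, 'X'): [('FIN', -1, 0, 'X')]}
Y = {('Y', 0, 0, 'Y'): [('X', 1, 0, 'Y')],
     ('X', 1, 0, 'Y'): [('b', 2, 0, 'Y')],
     ('b', 2, 0, 'Y'): [('FIN', -1, 0, 'Y')]}

embed(Y, X, at=('X', 1, 0, 'Y'), new_idx=3)
# yields the table shown above, with ('X', 3, TRAIL_DOT, 'Y') sitting between X and b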

Rules of Embedding

At this point it is fair to ask whether this is not entirely recreational. Why should anyone care about automaton embeddings? Don't we have anything better to do with our time? Certainly we do, but demanding a little more motivation is justified. Consider the following grammar:

R: A | B
A: a* c
B: a* d
In this grammar we encounter a so-called FIRST/FIRST conflict. Given a string "aaa…" we cannot decide which of the rules A or B we have to choose, unless we observe a 'c' or 'd' event, i.e. our string becomes "aa…ac" or "aa…ad". What we basically want is to defer the choice of a rule, making a late choice instead of checking out rules by trial and error. By carefully storing and recalling intermediate results we can avoid the consequences of an initial bad choice, to the extent that parsing in O(n) time with string length n becomes possible. Now the same can be achieved through automaton embeddings, which gives us:
R: a* c | a* d
but in a revised form, as seen in the previous section. On the automaton level the information about the containing rules A and B is still present. If we use R for parsing we get state sets { (a,1,0,A), (a,2,0,B) } which recognize the same character "a". Those state sets will be stored during parsing. In case of a "c"-event, which will be recognized by the state (c, 3, 0, A), we only have to dig into the state-set sequence and follow the states (a, 1, 0, A) back to the first element of the sequence. Since (A, 4, TRAIL_DOT, R) is the only follow state of (c, 3, 0, A) we will actually see the sequence:
(A,4,TRAIL_DOT,R)
 \
  '...
      \
(c,3,0,A)
(a,1,0,A)
(a,1,0,A)
...
(a,1,0,A)
    
From this sequence we can easily reconstruct contexts and build the tree
[R, [A, a, a, ..., c]]
All of this is realized by late choice. Until a "c" or "d" event we move within A and B at the same time. The embedding of A and B in R is what solves the FIRST/FIRST conflict above, and that is its whole point.
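
For illustration, here is a tiny Python sketch of this forward pass over a handcrafted version of the embedded R-automaton. The indices of the d- and B-states are made up and the code is a simplification of what Trail actually does.

NFA_R = {
    ('R', 0, 0, 'R'): [('a', 1, 0, 'A'), ('c', 3, 0, 'A'),
                       ('a', 2, 0, 'B'), ('d', 5, 0, 'B')],
    ('a', 1, 0, 'A'): [('a', 1, 0, 'A'), ('c', 3, 0, 'A')],
    ('a', 2, 0, 'B'): [('a', 2, 0, 'B'), ('d', 5, 0, 'B')],
    ('c', 3, 0, 'A'): [('A', 4, 'TRAIL_DOT', 'R')],
    ('d', 5, 0, 'B'): [('B', 6, 'TRAIL_DOT', 'R')],
}

def trace(nfa, start, tokens):
    # record, for each token, the set of states that recognize it
    selection = nfa[start]
    sets = []
    for tok in tokens:
        active = [s for s in selection if s[0] == tok]
        if not active:
            raise SyntaxError("unexpected token %r" % tok)
        sets.append(active)
        selection = [f for s in active for f in nfa.get(s, [])]
    return sets

for states in trace(NFA_R, ('R', 0, 0, 'R'), "aaac"):
    print(states)
# [('a', 1, 0, 'A'), ('a', 2, 0, 'B')]   <- both branches are still alive
# [('a', 1, 0, 'A'), ('a', 2, 0, 'B')]
# [('a', 1, 0, 'A'), ('a', 2, 0, 'B')]
# [('c', 3, 0, 'A')]                     <- the 'c' event decides for A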

FOLLOW/FIRST conflicts

So far the article didn’t contain anything new. I’ve written about all of this before.

The FIRST/FIRST conflict between FIRST-sets of a top down parser is not the only kind we have to deal with. We also need to take left recursions into account, which can be considered a special case of a FIRST/FIRST conflict, but also FIRST/FOLLOW or, better, FOLLOW/FIRST conflicts, which will be treated now. A FOLLOW/FIRST conflict can be illustrated using the following grammar:

R: U*
U: A | B
A: a+ (B c)*
B: b
There is no FIRST/FIRST conflict between A and B and we can't factor out a common prefix. But now suppose we want to parse the string "abb". Obviously A recognizes the two initial characters "ab" and then fails at the 2nd "b" because "c" was expected. However, A can also recognize "a" alone and then stop, because (B c)* is an optional repetition of (B c). This is not a violation of the rules. After "a" has been recognized by A the rule B may take over and match "b" two times:
[R, [U, [A, a]], [U, [B, b]], [U, [B, b]]]
Trail applies a "longest match" recognition approach, which means here that A is greedy and matches as many characters in the string as possible. But according to the rule definition A can also terminate the match after "a", and at that point the parser dynamically sets a so-called checkpoint. Trail allows backtracking to this checkpoint, provided the longest match approach fails after it. Setting exactly one checkpoint for a given rule is still compliant with the longest match approach. If the given input string is "abcbb" then A will match "abc"; if it is "abcbcbb" then it matches "abcbc", and so on.

The FOLLOW/FIRST conflict leads to a proper ambiguity, and checkpoints are currently the approach used by Trail to deal with it. I also tried to handle FOLLOW/FIRST conflicts in an automaton embedding style but encountered fragmentation effects: the ambiguities were uncovered, but this was paid for with a loss of direction, and longest match was effectively disabled.

The inevitability of left recursions

It is easy in top down parsing to recognize and remove or transform left recursive rules like this one

X: X a | ɛ
The phenomenology seems straightforward. But making those exclusions is like drawing political boundaries in colonialist Africa. Desert, vegetation, animals and humans don't entirely respect decisions made by once rivaling French and English occupants. If embedding comes into play we can count on uncovering left recursions where we didn't expect them. I'd like to go even one step further, which is conjectural: we can't even know for sure that none will be uncovered. The dynamics of FIRST/FIRST conflicts that are uncovered by embeddings, this clash dynamics as I like to call it, might lead to algorithmically undecidable problems. It's nothing I've systematically thought about but I wouldn't be too surprised.

For almost any left recursive rule there is a FIRST/FIRST conflict of this rule with itself. Exceptions are cases which are uninteresting such as

X: X
or
X: X a
In both cases the FIRST-sets of X don’t contain any terminal symbol and they can’t recognize anything. They are like ɛ-states but also non-terminals. Very confusing. Trail rips them off and issues a warning. An interesting rule like
E: E '*' E | NUMBER
contains a FIRST/FIRST conflict between E and NUMBER. It cannot be removed through self embedding of E. The same goes for rules which hide a left recursion but lead to embedding cycles, such as
T: u v [T] u w
which are quite common. We could try to work around them as we did with FOLLOW/FIRST conflicts, instead of solving them outright. Of course one can also give up top down parsing in favor of bottom up parsers of the Earley type or GLR, but that's changing the game entirely. The question is: must we tack backtracking solutions onto Trail which are more deeply involved than checkpoints?

After 6 months of tinkering I wished the answer were no. Actually I believe that it is unavoidable, but it occurs at places where I didn't originally expect it, and even in those cases I often observed/measured parsing effort which is proportional to string length. Parse tree reconstruction from state-set traces, which was once straightforward, becomes a particularly hairy affair.

Teaser

Before I discuss left recursion problems in Trail in a follow up article I’ll present some results as a teaser.

Grammars for which parsing in Trail is O(n):

a) E: '(' E ')' | E '*' E | NUMBER
b) E: '(' E+ ')' | E '*' E | NUMBER
Other grammars in the same category are
c) G: 'u' 'v' [G] 'u' 'w'
d) G: G (G | 'c') | 'c'
e) G: G G | 'a'
However for the following grammars the parser is in O(2^n)
f) G: G G (G | 'h') | 'h'
g) G: G [G G] (G | 'h') | 'h'
If we combine  d) and f) we get
h) G: G G (G | 'h') | G (G | 'h') | 'h'
In this case Trail will deny service and throw a ParserConstructionError exception. Some pretty polynomial grammars will be lost.

LL(*) faster than LL(1)?

Posted in Algorithms, Grammars, Parsing, Python, TBP on September 12th, 2011 by kay – 7 Comments

No jumps

When I began to work on EasyExtend in 2006 I grabbed a Python parser from the web, written by Jonathan Riehl ( it doesn’t seem to be available anymore ). It was a pure Python parser following closely the example of CPython’s pgen. The implementation was very dense and the generated parser probably as fast as a Python parser could be. It was restricted to LL(1) though which was a severe limitation when I stepped deeper into the framework.

In mid 2007 I created a parse tree checker. The problem was that a parse tree could be the return value of a transformation of another parse tree: T : P -> P. How do we know that P is still compliant with a given syntax? This can easily be solved by chasing NFAs of the target grammar, both horizontally, i.e. within an NFA, as well as vertically: calling checkers recursively for each parse tree node which belongs to a non-terminal. This checker generator was only a tiny step away from a parser generator, which I started to work on in summer 2007.

What I initially found when I worked on the parse tree checker was that horizontal NFA chasing never has to take into account that there are two alternative branches in rules like this

R: a* b | a* c
The algorithm never checks out the first branch, runs through a sequence of 'a' until it hits 'b', and when this fails, jumps back and checks out the other branch. There was no backtracking involved, also no backtracking with memoization. There was simply never any jump. Instead both branches are traversed simultaneously until they become distinct. It's easy to express this on grammar level by applying left factoring to the rule
R: a* ( b | c )
However there was never any rule transformation to simplify the problem.

From rules to grammars

It’s actually an old approach to regular expression matching which is attributed to Ken Thompson. Russ Cox refreshed the collective memory about it a few years ago. This approach never seemed to make the transition from regular expressions to context free grammars – or it did and was given up again, I don’t know. I wanted a parser generator based on the algorithms I worked out for parse tree checkers. So I had to invent a conflict resolution strategy which is specific for CFGs. Take the following grammar

R: A | B
A: a* c
B: a* d
Again we have two branches, marked by the names of the non-terminals A and B and we want to decide late which one to choose.

First we turn the grammar into a regular expression:

R: a* c | a* d
but now we have lost context/structural information which needs to be somehow added:
R: a* c <A>| a* d <B>
The symbols <A> and <B> do not match a character or token. They merely represent the rules which would have been used when the matching algorithm scans beyond 'c' or 'd'. So once the scanner enters <A> it is finally decided that rule A was used. The same is true for <B>. Our example grammar is LL(*) and in order to figure out whether A or B is used we need, in principle at least, infinite lookahead. This hasn't been changed through rule embedding, but now we can deal with the LL(*) grammar as if it were an LL(1) grammar + a small context marker.

Reconstruction

What is lacking in the above representation is information about the precise scope of A and B once they are embedded into R. We rewrite the grammar slightly by indexing each of the symbols on the RHS of a rule by the name of the rule:

R: A[R] | B[R]
A: a[A]* c[A]
B: a[B]* d[B]
Now we can embed A and B into R while being able to preserve the context:
R: a[A]* c[A] <A[R]>| a[B]* d[B] <B[R]>
Matching now the string aad yields the following sequence of sets of matching symbols:
{a[A], a[B]}, {a[A], a[B]}, {d[B]}, {<B[R]>}
All of the indexed symbols in a set match the same symbol. The index has no impact on the matching behavior, so a[X], a[Y], … will always match a.

Constructing a parse tree from the above set-sequence is done by reading the sequence from right to left and interpret it appropriately.

We start the interpretation by translating the rightmost symbol

<B[R]> -> [R,[B, .]]
The  dot ‘.’ is a placeholder for a sequence of symbols indexed with B. It remains adjacent to B and is removed when the construction is completed:
[R, [B, .]]
[R, [B, ., d]]
[R, [B, ., a, d]]
[R, [B, ., a, a, d]]
[R, [B, a, a, d]]
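
A small Python sketch of this right-to-left reading, for the flat example above; the representation of the set-sequence as sets of (symbol, rule) pairs is chosen for illustration only.

def reconstruct(tokens, set_sequence):
    # the rightmost entry is the context marker <B[R]>, i.e. (rule, parent)
    rule, parent = next(iter(set_sequence[-1]))
    node = [rule]                      # the '.' placeholder is implicit here
    # walk right to left and keep the symbols indexed with that rule
    for tok, states in reversed(list(zip(tokens, set_sequence))):
        if any(r == rule for (sym, r) in states):
            node.insert(1, tok)
    return [parent, node]

reconstruct(['a', 'a', 'd'],
            [{('a', 'A'), ('a', 'B')},
             {('a', 'A'), ('a', 'B')},
             {('d', 'B')},
             {('B', 'R')}])
# -> ['R', ['B', 'a', 'a', 'd']]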

Drawbacks

We can read the embedding process as ’embed rules A and B into R’ or dually ‘expand R using rules A and B’.  I’ve chosen the latter expression for the Trail parser generator because an expanded rule R has its own characteristics and is distinguished from an unexpanded rule.

The drawback of this method is that its implementation turns out to be rather complicated. It is also limited because it may run into cyclic embeddings which need to be detected. Furthermore, successive embeddings can blow up the expanded rule to an extent that it makes sense to artificially terminate the process and fall back to a more general and less efficient solution. So we have to mess with it. And isn't there a performance penalty due to the process of reconstruction?

Performance

To my surprise I found that an LL(*) grammar that uses expansion quite heavily ( expanded NFAs are created with about 1000 states ) performs slightly better than a simple LL(1) grammar without any expansion in CPython. For comparison I used a conservative extension language P4D of Python i.e. a superset of Python: every string accepted by Python shall also be accepted by P4D.

In order to measure performance I created the following simple script

import time
import decimal
import langscape
 
text = open(decimal.__file__.replace(".pyc", ".py")).read()
print "textlen", len(text)
 
python = langscape.load_langlet("python")
p4d = langscape.load_langlet("p4d")
 
def test(langlet):
    tokens = langlet.tokenize(text)
    a = time.time()
    for i in range(10):
        langlet.parse(tokens)
        tokens.reset()
    print "parser", langlet.config.langlet_name, (time.time() - a)/10
 
test(python)
test(p4d)

It imports a reasonably big Python module ( decimal.py ) and parses it with two different parsers generated by Trail. Running it using CPython 2.7 yields the following result:

parser python 2.39329998493
parser p4d 2.25759999752
This shows that P4D is about 5% faster on average! Of course the overall performance is abysmal, but keep in mind that the parser is a pure Python prototype implementation and I’m mainly interested in qualitative results and algorithms at this point.

I’ve also checked out the script with PyPy, both with activated and deactivated JIT.

PyPy with option –JIT off:

parser python 6.5631000042
parser p4d 5.66440000534
Now the LL(*) parser of P4D is about 13-14 % faster than the LL(1) parser, which is much clearer. Activating the JIT reverses the pattern though and intense caching of function calls pays off:

PyPy with JIT:

parser python 0.791500020027
parser p4d 1.06089999676
Here the Python parser is about 1/3 faster than the P4D parser.

The result of the competition depends on the particular implementation and the compiler/runtime optimizations or the lack thereof. The counter-intuitive result that an LL(*) parser is faster than an LL(1) parser could not be stabilized but also not clearly refuted. It’s still an interesting hypothesis though and rule expansion may turn out to be a valid optimization technique – also for LL(1) parsers which do not require it as a conflict resolution strategy. I will examine this in greater detail once I’ve implemented an ANSI C version of Trail.

Token Tracers

Posted in EasyExtend, Parsing, TBP on June 3rd, 2010 by kay – Be the first to comment

When I started programming EasyExtend in 2006 one of the major problems was the correct grammar -> NFA translation. I used big grammars and testing for correctness required lots of source code. The first heuristic I used was ugly and complex, and it took about 2 or so years to find a neat trick which finally led to replacing it completely. The basic problem of systematic phrase or expression generation for testing purposes persisted though – until last week when I implemented a TokenTracer.

Tracers

A typical production rule in the Trail parser generator is translated into a single NFA which might look as in the following example

 1005: ["funcdef: [decorators] 'def' NAME parameters ':' suite",
        (1005, 0, 1005),
        {(1, 3, 1005): [(1006, 4, 1005)],
         (11, 5, 1005): [(1043, 6, 1005)],
         ('def', 2, 1005): [(1, 3, 1005)],
         (1004, 1, 1005): [('def', 2, 1005)],
         (1005, 0, 1005): [('def', 2, 1005), (1004, 1, 1005)],
         (1006, 4, 1005): [(11, 5, 1005)],
         (1043, 6, 1005): [(None, '-', 1005)]}],

It is not created for readability but it is nevertheless easy to decode. The funcdef grammar rule is assigned a numerical value, a rule identifier – here 1005. Associated with the rule identifier is a 3-list consisting of

  1. The rule in plain text
  2. The start state of a finite automaton (1005, 0, 1005)
  3. A finite automaton encoded as a dictionary of transitions.

Starting with (1005, 0, 1005) one can step through the automaton. The follow states are [('def', 2, 1005), (1004, 1, 1005)]. The first one obviously represents the def keyword whereas the second is a representation of the decorators non-terminal which has the rule identifier 1004. When you select the (1004, 1, 1005) state there is a single follow state, which is again the state of the def keyword; otherwise you get the follow state (1, 3, 1005) of ('def', 2, 1005). The state (None, '-', 1005) doesn't have a follow state and it is the only one that doesn't.

You can now define a function that keeps track of this stepping process through a rule. This function is called a Tracer.

A Tracer acts as follows:

>>> tracer = Tracer(rules)
>>> tracer.select(1005)   # selects automaton 1005 and returns the rule ids of the
['def', 1004]             # possible follow states
>>> tracer.select('def')
[1]
>>> tracer.select(1)
[1006]
...

It is possible that a Tracer has to keep track of multiple traces at once. For example the exprlist rule

 1069: ["exprlist: expr (',' expr)* [',']",
        (1069, 0, 1069),
        {(12, 2, 1069): [(1053, 3, 1069)],
         (12, 4, 1069): [(None, '-', 1069)],
         (1053, 1, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)],
         (1053, 3, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)],
         (1069, 0, 1069): [(1053, 1, 1069)]}],

defines transitions of the kind

(1053, 1, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)]

with two states of rule id 12 in the follow set. When 12 is selected in the Tracer the follow sets of all states with rule id 12 are unified:

>>> tracer.select(1069)
[1053]
>>> tracer.select(1053)
[12, None]
>>> tracer.select(12)
[1053, None]
...
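
A minimal Tracer that reproduces the behaviour shown above could look roughly like this; it is a sketch over the rules table format shown earlier, not the actual EasyExtend implementation.

class SimpleTracer(object):
    def __init__(self, rules):
        self.rules = rules        # rule id -> [rule text, start state, transition dict]
        self.transitions = {}     # transitions of the selected automaton
        self.selectable = []      # states that may be selected next

    def select(self, nid):
        if not self.selectable:                    # first call: nid is a rule id
            _, start, self.transitions = self.rules[nid]
            self.selectable = self.transitions[start]
        else:
            follow = []
            for state in self.selectable:
                if state[0] == nid:                # unify follow sets of all states with this nid
                    follow.extend(self.transitions.get(state, []))
            self.selectable = follow
        # report the nids of the possible follow states; None marks the rule exit
        return sorted(set(s[0] for s in self.selectable), key=str)

With the funcdef table above, select(1005) yields the nids of ('def', 2, 1005) and (1004, 1, 1005), and feeding 12 on the exprlist table unions the follow sets of both 12-states, just as described.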

TokenTracers

This kind of tracing functionality is central to EasyExtend's implementation of Trace Based Parsing (TBP). For single grammar rules TBP coincides with "Thompson NFA" style parsing discussed at length by Russ Cox or more recently by Carl Friedrich Bolz who gave a Python implementation.

We want to consider now a different sort of tracer which is more complicated to create than those for single grammar rules. Those tracers have to meet the following requirement:

The list of rule id’s returned from tracer.select() shall contain only None or rule id’s of terminal symbols.

The rule id’s of terminals are exactly the  token types. The select function of a TokenTracer returns a list of token types and gets fed with a single token type. In the following example we step through the token stream of a simple function

def foo():
    print 42

Here we go

>>> tracer = TokenTracer(rules)
>>> tracer.select(1001)  # a single select using a top level non-terminal
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select('def')
[1]
>>> tracer.select(1)     # foo
[7]
>>> tracer.select(7)     # (
[1, 7, 8, 16, 36]
>>> tracer.select(8)     # )
[11]
>>> tracer.select(11)    # :
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select(4)     # \n
[5]
>>> tracer.select(5)     # INDENT
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select('print')
[1, 2, 3, 4, 7, 9, 13, 13, 14, 15, 25, 26, 32, 35, 'lambda', 'not']
>>> tracer.select(2)     # 42
[4, 7, 9, 12, ..., 36, 48, '<>', 'and', 'if', 'in', 'is', 'is', 'not', 'or']
>>> tracer.select(4)     # \n
[1, 2, 3, 6, 7, ... , 'try', 'while', 'with', 'yield']
>>> tracer.select(6)     # DEDENT
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select(0)     # ENDMARKER

Application 1 – error detection

Using a TokenTracer it is dead simple to localize a syntax error which is – in the context free case – always an unexpected token. In principle Trail could delegate error recovery entirely to a TokenTracer.

Application 2 – autocorrection

A constant token is a token with a constant token string, e.g. ';' or ':'. Closely related are tokens like INDENT where the token string can be derived from context and a prescribed indentation. In sharp contrast are tokens like NAME, NUMBER and STRING where the token string is not language- but user-determined. In the select() sequence above we find constant token lists of length 1 like [11] or [7]. If one of those tokens is omitted it can be inserted without guessing.
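
A sketch of this idea, assuming the select() interface shown above and a hypothetical mapping CONSTANT_STRINGS from constant token types to their fixed token strings:

CONSTANT_STRINGS = {11: ':', 7: '(', 8: ')'}   # hypothetical excerpt

def feed(tracer, selection, tok_type):
    # feed tok_type, silently inserting a constant token if exactly one is admissible
    while tok_type not in selection:
        if len(selection) == 1 and selection[0] in CONSTANT_STRINGS:
            # e.g. a forgotten ':' after the parameter list is inserted without guessing,
            # using the fixed string from CONSTANT_STRINGS for the synthesized token
            selection = tracer.select(selection[0])
        else:
            raise SyntaxError("unexpected token type: %r" % tok_type)
    return tracer.select(tok_type)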

Application 3 – expression generation

The most intriguing aspect of TokenTracers is that each random token sequence which is constrained by a TokenTracer is syntactically correct. This can be used to create expression generators: first write a grammar G to describe the language syntax, then you derive a TokenTracer(G). Finally an expression generator ExprGen(TokenTracer(G)) is created which is used to build random token sequences being compliant with G by means of the TokenTracer. Those token-sequences can either be turned into valid parse trees and get compiled or un-tokenized into source code.

A valuation function fitness(expr) -> float on expressions motivates the use of genetic programming for breeding expressions of a certain kind. For example I'm strongly interested in compact grammars which create big NFA expansions in Trail. It is not easy to see how those can be built by hand. Using GP one could set an arbitrary threshold like n = 1000 for the number of states in a single expanded NFA and try to minimize the size of a grammar, where the size is measured in the number of tokens used for the grammar description in some meta-grammar ( e.g. EBNF ).
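
A rough sketch of the generator idea in Python, assuming a TokenTracer with the select() interface shown above and a top level non-terminal such as 1001; picking actual token strings for NAME, NUMBER and STRING and the fitness-driven search are left out.

import random

def gen_token_types(TokenTracer, rules, start_symbol, max_len=50):
    # build a random token-type sequence that is admissible at every step
    tracer = TokenTracer(rules)
    selection = tracer.select(start_symbol)
    seq = []
    while len(seq) < max_len and selection:
        tok = random.choice(selection)
        if tok in (None, 0):          # rule exit or ENDMARKER: the sequence is complete
            break
        seq.append(tok)
        selection = tracer.select(tok)
    return seq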

Restricted backmatching

Posted in TBP on March 13th, 2010 by kay – Be the first to comment

In practice we often encounter situations when our preferred approach to problem solving breaks down. Just look at the recent Google implementation of a regexp engine, RE2, created by Russ Cox who wrote a revival paper for Thompson NFAs a while ago with a few follow-ups which build on those ideas. Once again backmatching is greyed out in the feature matrix, which means: no implementation. The project page intro states:

The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently.

So backmatching can't be implemented efficiently, but why? What is it that prevents an efficient implementation, and can a subset be defined which exists within the constraints of O(n) lexers or parsers? In this article we will find that backmatching for common use cases is not a problematic feature in linear space and time regexp engines. Some backreferences obviously fall away, like exotic applications of regexps for solving NP-complete problems such as 3-SAT – but could it be that in the end only esoteric applications of backmatching are excluded from trace based parsing (TBP)?

General backmatching

When we think about backmatching in regular expressions we might define expressions like this

a)  "... (P) ... \1"

where (P) defines a simple pattern and \1 refers to the string value matched by the pattern. We assume that there is a functional relationship between (P) and \1. First it is (P) that matches then \1 will match what (P) has matched before.

Actually this perspective is a little simplistic and in the general case backmatching can be more powerful. Consider the following simple regexp:

b)  ([ab]*)b*\1

Here the match of  ([ab]*) depends on what \1 will match but it is also highly ambiguous. If we match the following string

s = “bb”

the first b can be matched with ([ab]*) and the last “b” with \1 but the whole string can also be matched with b*.

Here is another more complicated example

c)  (([ab]*)b*\2)*[ab]*\1b*

It stems from Thomas Lord with whom I discussed this topic on LtU and who corrected my initial naive assumptions about backmatching. Not only does the match of ([ab]*) depend on the match of \1 but also on the match of \2, which depends on the match of \1 as well. Of course \1 depends on both of the matches of ([ab]*) and (([ab]*)b*\2). It is all tangled.

General backmatching as in examples b) and c) can be used to solve NP-complete problems by exploiting these tangles and finding resolutions. See this article for more examples. With NP-complete backtracking algorithms built into regexp engines one gets such solutions for free. This is cool and clever but also unnecessary most of the time.

Functional backmatching

If we restrict backmatching to simple functional relations between (P) and \1 as in case a) we can still express a broad range of practically relevant use cases. Here we give an approach to formalize those restrictions which can be checked by a regexp compiler.

In an expression

    ... (P) ... \1

the back-reference \1 can be separated from P when the following conditions are satisfied:

  1. P doesn’t contain back-references which means it is self contained.

  2. It is possible to write the expression in the way

    ... L(P)R ... \1

where L and R are left and right delimiters of P which means P has no characters in common with L and R. L can be empty when (P) is at the start of an expression.

The first condition can be checked syntactically. The second condition can be expressed using the following two equations on sets

2.1 LAST-SET(L)  /\ FIRST-SET(P) = {}
2.2 LAST-SET(P)  /\ FIRST-SET(R) = {}

If additionally following condition is true

2.3 FIRST-SET(P) /\ LAST-SET(P) = {}

R can be empty and an expression

    ... L(P)\1 ...

is permitted.

End Cycles

The conditions 2.1 – 2.3 are still too restrictive. For example the regexp (a)\1 violates condition (2.3) but shall be permitted. What we really want to exclude is that \1 is adjacent to what I call a non-empty endcycle.

An endcycle of P has the following definition:

END-CYCLE(P) = FOLLOW-SET( LAST-SET(P) )

Take for example the regexp P = (a*|b|c). Here LAST-SET(P) = {a, b, c} and FOLLOW-SET({a,b,c}) = {a}, which means that a is in the endcycle of P.

With endcycles in mind we can weaken the conditions of (2) considerably:

If P has no endcycle i.e.

    END-CYCLE(P) = {}

we permit

    ... L(P)\1 ...

if the following holds:

    END-CYCLE(L) /\ FIRST-SET(P) = {}

If on the other hand

    END-CYCLE(P) != {}

we permit

    ... L(P)R ... \1 ...

if the following is valid:

    END-CYCLE(L) /\ FIRST-SET(P) = {}
    END-CYCLE(P) /\ FIRST-SET(R) = {}
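
As a sketch, these final conditions can be checked mechanically once the FIRST-SETs and END-CYCLEs of the sub-expressions are known; here they are simply given as plain Python sets and the function name is made up for illustration.

def backref_is_separable(first, endcycle, L, P, R=None):
    # decide whether '... L(P)R ... \1' falls into the restricted subset
    if endcycle[P]:
        # P has a non-empty endcycle: an explicit right delimiter R is required
        return (R is not None
                and not (endcycle[L] & first[P])
                and not (endcycle[P] & first[R]))
    # P has no endcycle: \1 may directly follow (P)
    return not (endcycle[L] & first[P])

# example: P = (a*|b|c) with END-CYCLE(P) = {a}
first    = {'L': {'x'}, 'P': {'a', 'b', 'c'}, 'R': {'y'}}
endcycle = {'L': set(), 'P': {'a'},           'R': set()}
backref_is_separable(first, endcycle, 'L', 'P', 'R')   # -> True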

Final Remarks

No matter how the conditions are defined it has to be guaranteed that matching of (P) has terminated before backmatching starts. If this isn't checked statically during regexp compilation one can still defer checks until runtime. Much like any other dynamic check it is less clear what will happen to an expression, but there isn't much mental overhead and the implementation is kept simpler.

About CST interpolation

Posted in Algorithms, Parsing, TBP on December 7th, 2009 by kay – 3 Comments

Eli Bendersky has written a short overview article about Python's _ast module which is supposed to make working with parse trees simpler by transforming them into other trees, i.e. abstract syntax trees.

In this article I want to talk a bit about my reservations against this approach, which is mostly justified by common wisdom, "convenience" and what not. IMO ASTs are a stumbling block in the advancement of language front ends and you can do many things more elegantly and more generically without them. They are there for a reason though and it is not easy to see immediately why you are better off with plain old concrete syntax trees ( CSTs ).

There are three major reasons I love CSTs.

  1. Parse trees are unambiguously represented by a grammar. So once you know the grammar you also know how to find and arrange nodes. For context free languages the grammar contains all syntactical information you'll ever need.
  2. The reversion to source code is trivial. Just traverse the parse tree inorder, visit the leaf nodes containing terminals and concatenate their string content. Only whitespace has to be inserted.
  3. For parsing purposes grammars are translated into finite state machines. Those machines can also be used for non-parsing purposes like parse tree interpolation which provides most of the benefits of ASTs but won’t cause any additional translation overhead.

I assume 1. and 2. won't tell the reader anything new, and it might be 3. that contains novel information. The major argument can be summarized as follows:

A grammar gives rise to a number of finite state machines – in fact one machine for one grammar rule. Not only are those machines used to parse source code into CSTs but they can also be used to operate on them. In particular they are used to

  • check correctness of CSTs under transformations
  • connect individual CST nodes through sequences of other CST nodes ( interpolation ).
  • insert CST nodes into sequences of CST nodes which make up the content of a particular CST node (autocompletion)

All of those tools/services depend only on the given grammar and are universally applicable to all languages. This is not unlike regexps and regexp engines, which uniformly apply matching and searching to all strings.

Only in the combination of verification, interpolation and autocompletion do CSTs actually become handy for manipulation tasks. They also serve as a foundation for tools which are closer to the actual source code and define transformations in code without any additional syntax. That's also why EasyExtend and its successors will never see a particular macro language. Just like ASTs, macros are an obsolete technology in the light of proper language oriented transformation tools.

Parse tree representations

Take a look at the following AST constructor kept as an example from Eli’s article

Expression(
  body=BinOp(
         left=Str(s='xy'),
         op=Mult(),
         right=Num(n=3)))

The Expression constructor takes a node of type BinOp and produces a node of type Expression. It is used to represent the actual Python expression "xy"*3.

Now take a look at the following kludge which represents the same information in the form of a concrete syntax tree:

>>> import parser
>>> parser.expr('"xy"*3').tolist()
[258, [326, [303, [304, [305, [306, [307, [309, [310,
[311, [312, [313, [314, [315, [316, [317, [3, '"xy"']]]],
[16, '*'],
[315, [316, [317, [2, '3']]]]]]]]]]]]]]]], [4, ''], [0, '']]

The concrete parse tree is represented in the form of a nested list and yields all sorts of numerical tags which identify the grammar rules applied in top down parsing. The numerical tags shall be called node identifiers, or node ids for short.

The formatting can be done a little nicer by translating the node ids into node names and displaying the tree in tree form:

eval_input  -- NT`258
  testlist  -- NT`326
    test  -- NT`303
      or_test  -- NT`304
        and_test  -- NT`305
          not_test  -- NT`306
            comparison  -- NT`307
              expr  -- NT`309
                xor_expr  -- NT`310
                  and_expr  -- NT`311
                    shift_expr  -- NT`312
                      arith_expr  -- NT`313
                        term  -- NT`314
                          factor  -- NT`315
                            power  -- NT`316
                              atom  -- NT`317
                                STRING  -- T`3     L`1
                                  "xy"
                          STAR  -- T`16     L`1
                            *
                          factor  -- NT`315
                            power  -- NT`316
                              atom  -- NT`317
                                NUMBER  -- T`2     L`1
                                  3
  ENDMARKER  -- T`0     L`2
    ''

It doesn't change much in principle though. The AST is an order of magnitude more concise, more readable and easier to write.

Searching nodes

Searching within a CST isn't much of a problem and it is actually quite easy when we know the grammar. All that is needed are two functions find_first and find_all which take a node and a node id as arguments. So when we look for a particular node, e.g. term, in the syntax tree we just call find_first(node, symbol.term) where symbol.term is the node id of term encoded in symbol.py, which is a standard library module. So for

nd = parser.expr('"xy"*3').tolist() we can apply find_first(nd, symbol.term) which returns

term  -- NT`314
  factor  -- NT`315
    power  -- NT`316
      atom  -- NT`317
        STRING  -- T`3     L`1
          "xy"

Traces

We want to name CST constructors just after the nodes they create. So expr creates a node of type symbol.expr, STRING a node of token.STRING and so on. In order to create a correct expr we have to call lots of node constructors. In source code this would be something like

expr(xor_expr(…(term(factor(…(STRING(“xy”)…), STAR(“*”), factor(…(NUMBER(“3”)…))…))

This doesn't look much like noise reduction, but now consider this: when expr is created by the parser, the parser starts with nothing but a sequence A = (STRING("xy"), STAR("*"), NUMBER("3")). So why isn't it possible to start with A and expr and build expr(*A)? We want to face a slightly more general problem, namely having a sequence of nodes A = (a1, a2, …, ak) which are not necessarily tokens, and a node constructor expr. Can we build a node expr(a1, …, ak) of type symbol.expr?

What is needed to identify an admissible sequence A with this property?

First of all let’s take a look at the grammar rule description of expr

expr: xor_expr ('|' xor_expr)*

Any sequence of CST nodes which fits into this description shall be called a trace. So a sequence of nodes xor_expr VBAR xor_expr VBAR xor_expr is a trace of expr. But also xor_expr alone is a trace. So what is needed is to wrap a given sequence A = (a1, a2, …, ak) into a trace. We might start with the most simple case of a single node A= (a) which shall be wrapped into expr. As an example we consider A = (term).

Interpolation

In order to wrap term into expr we need a sequence of intermediate nodes xor_expr, and_expr, shift_expr, arith_expr and then build

[expr, [xor_expr, [and_expr, [shift_expr, [arith_expr, term]]]]]

This sequence is uniquely determined by expr and term. In order to build one we must be sure there is no non-trivial information that has to be added like a STRING, NAME or NUMBER token which contains actual information.

So when there is no choice in building a wrapper of type N around M we write {N, M} and call it an interpolation between N and M. Interpolations can always be constructed algorithmically using syntactical information provided by the language grammar alone. If N = M, we identify {N, N} with N.

We have found already a valid interpolation {expr, term}. Other valid interpolations are {factor, STRING(“xy”)} and {factor, NUMBER(“3”)}. For term this already suffices to build a trace:

{factor, STRING(“xy”)}, STAR(“*”), {factor, NUMBER(“3”)}

and with this trace we get

{expr, term({factor, STRING(“xy”)}, STAR(“*”), {factor, NUMBER(“3”)})}
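
A minimal sketch of how such an interpolation chain could be computed, assuming a hypothetical table of "trivial" successors, i.e. non-terminals whose rule admits a single child without any additional token; the excerpt of Python's grammar below is hard-coded for illustration only.

TRIVIAL_SUCCESSOR = {
    'expr': 'xor_expr', 'xor_expr': 'and_expr', 'and_expr': 'shift_expr',
    'shift_expr': 'arith_expr', 'arith_expr': 'term', 'term': 'factor',
    'factor': 'power', 'power': 'atom',
}

def interpolate(N, M):
    # build the uniquely determined chain of wrappers from N down to M, or None
    chain = [N]
    while chain[-1] != M:
        succ = TRIVIAL_SUCCESSOR.get(chain[-1])
        if succ is None:
            return None
        chain.append(succ)
    tree = [M]                              # innermost element stands for the M node
    for name in reversed(chain[:-1]):
        tree = [name, tree]
    return tree

interpolate('expr', 'term')
# -> ['expr', ['xor_expr', ['and_expr', ['shift_expr', ['arith_expr', ['term']]]]]]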

Now we are prepared to define an algorithm:

Let N be a node and A = (a1, ..., ak) a sequence of nodes.
Consider also the set of all nodes set(M1, ..., Mn) with
{N, Mi}, i=1,...,n being a valid interpolation starting with N.

For each Mi, i=1,...,n we try to build a trace
TA = ({A1, a1}, {A2, a2}, ..., {Ak, ak}).

If we have found a trace for Mi we get the result
{N, Mi({A1, a1}, {A2, a2}, ..., {Ak, ak})}

Autocompletion

Sometimes our algorithm might fail to find a trace for a node N and a sequence A but the error can still be corrected in a fully determinate fashion. Take the following rule for example:

dotted_name: NAME (‘.’ NAME)*

together with a sequence A = (NAME, NAME) of two nodes. Obviously we cannot build a valid trace NAME DOT NAME from A directly but the insertion of DOT into the trace is fully determined by the structure of the rule. Moreover there is no degree of freedom in the selection of the token string for DOT. It can always only be “.”. So it is possible to omit the DOT in A and still get a uniquely determined trace for dotted_name.

Applications

We’ve come to an end already. With the prerequisites given above it is perfectly possible to write

expr(STRING(“xy”), STAR(“*”), NUMBER(“2”))

and get a valid parse or even shorten it and write

expr(‘”xy”‘, “*”, 2)

which suffices to identify the token. Remember that this construction yields a parse tree which can be converted immediately to source code.

fn = lambda name, val: expr_stmt(name, “=”, val)

This lambda expression yields bindings of val to name. For example

fn(“a”, expr(‘”xy”‘, “*”, 2)) is the parse tree equivalent of a = “xy”*2.

Notice that in any case the parse tree is syntactically correct by construction.

Wrapper Functions

Sometimes it is not easy to see how some expression with a particular semantics can be built. Take a function call for example. The Python grammar doesn’t provide a special node for it but just uses a special form of power which is defined as

power: atom trailer* [‘**’ factor]

This is very noisy and one better builds a functional wrapper which can be used for all sorts of calls:

def CST_Call(names, args = None, star_args = None, dstar_args = None):
    Names = [atom(names[0])]
    for name in names[1:]:
        Names.append(trailer('.', name))
    ARGS = list(args) if args else []
    if star_args:
        ARGS+=['*', star_args]
    if dstar_args:
        ARGS+=['**', dstar_args]
    if ARGS:
        return [symbol.power] + Names + [trailer('(', arglist(*ARGS), ')')]
    else:
        return [symbol.power] + Names + [trailer('(', ')')]

CodeTemplates – a fresh look on code transformations

Posted in DSL, Grammars, TBP on June 7th, 2009 by kay – Be the first to comment

This is the first part of a two part article series about CodeTemplates. CodeTemplates are a code transformation technique that suggests an alternative to common syntactic as well as lexical macros for whole language transformations.

Syntactic macros

About three years ago I experimented with a “syntactic” macro system for Python incorporated in EasyExtend 2. In the meantime another project emerged called MetaPython striving for a similar goal under the sign of the Ouroboros.

Fundamentally a macro system defines functions called macros that act as compile time code generators. It is not surprising that code is data – how else shall a compiler work? – however few languages specify functions that treat source code as source code and operators that evaluate source code at compile time. By quasi-quotation some piece of code in a macro body is mentioned rather than evaluated. Complementary to quasi-quoting there is a splice-operator that evaluates code and returns other code preferably in AST form that becomes aligned with the AST of the quasi-quoted code.

The difference between source code and ASTs vanishes in so-called homoiconic languages like Lisp, where the source code written by the user is also a data-structure of the language. So a macro is essentially a function that transforms those data-structures meant to be code. That doesn't mean though that homoiconicity is a prerequisite for syntactic macros. The cited MetaPython but also MetaLua are examples of macro systems in non-homoiconic languages. A nice treatment of the foundations of macro systems in non-homoiconic languages can also be found in the documentation of the Converge language.

Syntactic macros in EE 2

EasyExtend 2.0 defined a macro system in its own langlet, simply called the macro langlet. The role of macros in EasyExtend is somewhat peculiar though, because EE is a system used for whole language transformations with language syntax specified in grammars. EE binds node handlers to parse tree node types and those node handlers emit parse tree nodes of the target language. Macros were just meant to facilitate code transformations within the framework.

Unfortunately it turned out that they didn't. A bug in a macro caused hard to detect language transformation errors. Error recovery is a major concern, if not the single major concern, with language transformers, and the complicated quasi-quoting / splicing semantics of macros made this even harder. More often than I ever wanted I debugged through the framework when attempting to detect a bug in a macro. Furthermore the macro langlet operated within the system and was transformed as well. So two languages were transformed at the same time, while one of them was additionally engaged in transforming the other. EasyExtend 2 made this possible without any conflicts by implementing a joint transformation machinery and separate ranges of node types that node handlers could bind to. The macro transformation was implemented as a set of node handlers in the LangletTransformer of the macro langlet and it was an ugly beast of code. I intended to rewrite this for EE 3 but at some point I decided to terminate this intellectual exercise and dropped the macro facility completely.

Why not just textual macros?

A textual substitution macro, occasionally also called a "lexical" macro, is a simple way to do code transformations. The most important programming language ever, C, implements them by means of the C preprocessor. Lexical macros are also known as poor-man's macros and using them in EasyExtend was like leaving civilization because city-life is complicated and corrupt and one better strives for a subcomplex life as a farmer or hunter-gatherer. Furthermore EE is in deep love with syntax so a radical conceptualization of the macro idea is required. Fortunately the intellectual effort for this is low and can be summarized in one sentence

Enable arbitrary string transformations guarded by syntax

This is universally applicable for all context free languages and doesn’t require any additional macro syntax.

If one attempts to transform one string S1 into another one S2 all that is needed are the following three operators

  • insert
  • delete
  • subst

The edit distance between S1 and S2 is the minimal number of operations of this kind used to transform S1 into S2. It can be computed by the popular algorithm of Levenshtein.
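
For illustration, the standard dynamic-programming computation of the Levenshtein distance between two strings (ordinary textbook code, not EasyExtend's):

def levenshtein(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # delete
                           cur[j - 1] + 1,              # insert
                           prev[j - 1] + (c1 != c2)))   # subst (or keep)
        prev = cur
    return prev[-1]

levenshtein("kitten", "sitting")   # -> 3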

EasyExtend 4 will use a slightly different set of operators

  • replicate
  • delete
  • subst

It is almost equally expressive except for the case of an empty string S1 that requires an initial insert and there is nothing yet to replicate, delete or substitute. But once you have a single character in a string, one can replicate this character and apply substitutions which yields the same as applying multiple insertions.

CodeTemplate objects in EE 4

EasyExtend 4 defines a module called csttemplate.py that defines 4 types of objects called CodeTemplate, CodeMarker, CodeSelection and SelectionTree.

A CodeTemplate is an object that wraps some piece of source code in some language. The source code is parsed on construction of the CodeTemplate.

A CodeMarker marks sections in the source by means of search patterns. One can bind arbitrarily many CodeMarker objects to a CodeTemplate. It is possible to also bind CodeMarkers to other CodeMarkers, building a CodeMarker tree. When a CodeMarker B is bound to another CodeMarker A the search pattern of B is only applied to the code marked by A.

A CodeTemplate implements a match method. When it is applied, the patterns specified by all bound CodeMarker objects are matched against the code of the CodeTemplate. The result is a SelectionTree object. The tree holds a list of CodeSelection objects as well as a list of subtrees.

All transformations of the source code wrapped by the CodeTemplate are applied through CodeSelection objects. It is the CodeSelection object that substitutes, replicates or deletes the code it is bound to. Note that internally all those operations still happen on a parse tree. Invalid operations will immediately be rejected and an exception is raised. It is not a complete syntax check on the whole code that is applied on each operation, and it wouldn't make sense to perform one either, because a piece of code can be morphed into code of another language and will be a hybrid except at the start and end points of the transformation.

The SelectionTree is clonable and can be reused for other substitutions.

Outlook

In the next article I’ll show CodeTemplates in action.

Trace Based Parsing (Part V) – From Grammars to NFAs

Posted in EasyExtend, Grammars, TBP on May 7th, 2009 by kay – Be the first to comment

In this article I will demonstrate the process by which an EBNF grammar is translated into a Trail NFA. There are several different notations for EBNF that differ only in minor details. I've chosen the one of Python's Grammar file, which can roughly be described using the following set of EBNF rules:

file_input: ( RULE | NEWLINE )* ENDMARKER
RULE: NAME ':' RHS NEWLINE
RHS: ALT ( '|' ALT )*
ALT: ITEM+
ITEM: '[' RHS ']' | ATOM [ '*' | '+' ]
ATOM: '(' RHS ')' | NAME | STRING

The grammar is self-describing. Starting with the non-terminal file_input you can parse the grammar using its own production rules. The multiplicities the EBNF grammar uses are:

  • [X] – zero or one time X
  • X* – zero or more times X
  • X+ – one or more times X

EasyExtend defines a langlet called grammar_langlet used to parse grammars. With the grammar_langlet EBNF grammars become just ordinary languages supported by EE. The parse trees created from grammars will be transformed into NFAs.

From EBNF parse trees to Trail NFAs

The grammar_langlet defines transformations on parse trees. More concretely it wraps pieces of the grammar into Rule objects. Rule objects are basically list wrappers with an additional copy method which is not of interest here.

class Rule(object):
    def __init__(self, lst):
        self.lst = lst
 
    def copy(self):
        ...

From Rule we derive several subclasses which can be reflected by issubclass:

class ConstRule(Rule): ...
class SequenceRule(Rule): ...
class AltRule(Rule): ...
class EmptyRule(Rule): ...

What you don't find are rules like MaybeRule or ZeroOrMoreRule because we eliminate multiplicities entirely. This will be discussed later.

Names and strings are wrapped into ConstRule objects as follows:

ConstRule([('file_name', 0)])
ConstRule([('"("', 1)])
ConstRule([('RULE', 2)])
...

In this scheme we already recover the indices that directly translate to states of Trail NFAs. A special case of a ConstRule is the rule that wraps None:

ConstRule([('None', '-')])

The rules of type SequenceRule and AltRule serve as containers for other rules.

Elimination of multiplicities

We apply a trick to get rid of rules with multiplicities, i.e. rules defined by X*, X+ and [X]. All of them can be reduced to special combinations of our basic rule set. Before we show the details we need to introduce the EmptyRule type. Apart from the unique exit state (None, '-') there might be several other empty states (None, nullidx). Just like the constant rules those empty states get enumerated, but with their own index, and just like any state but the exit state they might occur on the left-hand side of a rule. The idea of using those empty states can be demonstrated with a simple example. Consider the following rule:

R: A [B]

We can manually translate it into the NFA

(R, 0): [(A, 1)]
(A, 1): [(B, 2), (None, '-')]
(B, 2): [(None, '-')]

However the translation becomes much simpler when we reify empty productions and insert them first:

(R, 0): [(A, 1)]
(A, 1): [(B, 2), (None, 1)]
(None, 1): [(None, '-')]
(B, 2): [(None, '-')]

What we have effectively done here is to interpret [B] as B | None. In a second step we eliminate the indexed empty states again which will lead to NFAs of the first form. Removing A* and A+ is similar but a little more tricky. We write

A*  → None | A | A A
A+  → A | A A

This looks wrong but only when we think about those transitions in terms of an indexing scheme that assigns different numerals to each symbol that occurs in the rule. But now all A on the right hand side shall actually be indistinguishable and lead to identical states. With this modification of the translation semantics in mind we translate the grammar rule R: A* into

(R, 0): [(A, 1), (None, 1)]
(A, 1): [(A, 1), (None, '-')]
(None, 1): [(None, '-')]

Eliminating the indexed empty state leads to this NFA

(R, 0): [(A, 1), (None, '-')]
(A, 1): [(A, 1), (None, '-')]

which is just perfect.

The reduction of rules with multiplicities to sequences and alternatives can be applied to Rule objects as well:

[A] →  AltRule( EmptyRule([(None, nullidx)]), A)
A*  →  AltRule( EmptyRule([(None, nullidx)]), A, SequenceRule([A, A]))
A+  →  AltRule( A, SequenceRule([A, A]))
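
In code this reduction could look roughly like this; new_empty is a hypothetical helper producing EmptyRule([(None, fresh_index)]) and the function itself is a sketch, not the grammar_langlet's actual code.

def reduce_multiplicity(A, op, new_empty):
    # rewrite [A], A* and A+ in terms of AltRule / SequenceRule / EmptyRule
    if op == '[]':                                   # [A] -> None | A
        return AltRule([new_empty(), A])
    if op == '*':                                    # A*  -> None | A | A A
        return AltRule([new_empty(), A, SequenceRule([A, A])])
    if op == '+':                                    # A+  -> A | A A
        return AltRule([A, SequenceRule([A, A])])
    raise ValueError("unknown multiplicity: %r" % op)

Both occurrences of A in the sequence are deliberately the same object, so that they lead to identical states as explained above.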

Since we have found a simple way to express grammars by rule trees containing only the mentioned rule types, we can go on and flatten those trees into a tabular form.

Computing NFAs from rule trees

First we define an auxiliary function called end(). For a rule object R this function computes a set of constant or empty rules that terminate R. This means end(R) contains the last symbols of R wrapped into rule objects.

def end(R):
    if not R.lst:
        return set()
    if isinstance(R, (ConstRule, EmptyRule)):
        return set([R])
    elif isinstance(R, AltRule):
        return reduce(lambda x, y: x.union(y), [end(U) for U in R.lst])
    else:  # SequenceRule
        S = R.lst[-1]
        return end(S)

With this function we can implement the main algorithm.

Suppose you have a SequenceRule(S1, …, Sk). The end points of each S[i] are computed by application of end(S[i]) and shall be connected with the start points of S[i+1]. Those starting points will build the follow sets of S[i]’s end points. If S[i+1] is a constant or empty rule we are done. If S[i+1] is an AltRule we compute the connection of S[i] with each entry in S[i+1]. If S[i+1] is a SequenceRule we recursively call our connection building algorithm with S[i] being prepended to the sub-rules of S[i+1]. In finitely many steps we always find the complete set of constant/empty rules S[i] can be connected with. Here is the Python implementation:

def build_nfa(rule, start = None, nfa = None):
    if not start:
        nfa = {}
        # rule.lst[0] == (rule name, 0)
        # rule.lst[1] == SequenceRule(...)
        start = set([ConstRule([rule.lst[0]])])
        return build_nfa(rule.lst[1], start, nfa)
    if isinstance(rule, SequenceRule):
        for i, R in enumerate(rule.lst):
            build_nfa(R, start, nfa)
            start = end(R)
    elif isinstance(rule, AltRule):
        for R in rule.lst:
            build_nfa(R, start, nfa)
    else: # ConstRule or EmptyRule
        r = rule.lst[0]
        for s in start:
            state = s.lst[0]
            follow = nfa.get(state, set())
            follow.add(r)
            nfa[state] = follow
    return nfa

The last step is to remove indexed empty states (None, idx).

def nfa_reduction(nfa):
    removable = []
    for (K, idx) in nfa:
        if K is None and idx!="-":
            F = nfa[(K, idx)]
            for follow in nfa.values():
                if (K, idx) in follow:
                    follow.remove((K, idx))
                    follow.update(F)
            removable.append((K, idx))
    for R in removable:
        del nfa[R]
    return nfa

This yields an NFA. In order to make it suitable for Trail one additionally has to map each state (A, idx) to (A, idx, R) with the rule id R.

What’s next?

So far all our discussions were placed in the realm of nice context free languages and their aspects. However the real world is hardly that nice and real language syntax can only be approximated by context free formalisms. Context sensitive languages are pervasive; they range from Python over Javascript to C, C++, Ruby and Perl. Instead of augmenting grammars with parser actions, EasyExtend factors context sensitive actions out and lets the programmer hack in Python.

Trace Based Parsing ( Part IV ) – Recursive NFAs

Posted in EasyExtend, Grammars, TBP on May 6th, 2009 by kay – Be the first to comment

Not long ago I inserted the following piece of code in the EasyExtend/trail/nfatracing.py module which defines NFA tracers and cursors and deals with the reconstruction of parse trees from state set sequences.

class NFAInterpreter(object):
    def __init__(self, nfa):
        for state in nfa:
            if state[1] == 0:
                self.start = state
                break
        else:
            raise ValueError("Incorrect NFA - start state not found")
        self.nfa = ["", "", self.start, nfa]
 
    def run(self, tokstream):
        cursor = NFACursor(self.nfa, {})
        selection = cursor.move(self.start[0])
        for t in tokstream:
            cursor.set_token(t)
            cursor.move(t)
        return cursor.derive_tree([])

This little interpreter is basically for experimentation with advancements of Trail. The idea is to initialize a single NFAInterpreter object with the transition table of an NFA. This table can be handcrafted and represents a concept. The run method takes a token stream and uses it to iterate through the NFA. Finally a parse tree will be derived from the state sequence. This derivation process is defined in the method derive_tree of the class NFAStateSetSequence of the same module and it has to be adapted. I’ll show you an interesting example.

Recursive rules

Recursively defined rules are the most likely cause for Trail to fall back into backtracking mode. I’ve had some ideas in the past on how to deal with recursion in Trail based on Trail NFAs. Here are some of my findings.

For the following discussion I’ll consider a single grammar rule which might be among the simplest, yet still interesting, rules that cannot be expanded:

R: a b [R] a c
The inner R causes a First/First conflict because a b … and a c are both possible continuations at the location of [R]. This conflict can’t be resolved by expansion because it would just be reproduced:
R: a b [a b [R] a c] a c
We now translate R into an NFA.

{(R, 0, R): [(a, 1, R)],
 (a, 1, R): [(b, 2, R)],
 (b, 2, R): [(R, 3, R), (a, 4, R)],
 (R, 3, R): [(a, 4, R)],
 (a, 4, R): [(c, 5, R)],
 (c, 5, R): [(None, '-', R)]
}

What I intend to introduce now is some sort of tail recursion elimination (despite Guido's verdict about it). This is done by introducing two new control characters for NFA states. Remember that a state (A, '.', 23, B) contains the control character '.' which is created when an NFA A is embedded in another NFA B. Transient states like this, which contain control characters, don't produce their own nid selections but are nevertheless recorded and stored in state set sequences and are important in the derivation process of parse trees. The idea is to replace the inner state (R, 3, R) by a pair of transient states (R, '(', 0, R) and (R, ')', 0, R).

The transient state (R, '(', 0, R) acts like a pointer to the beginning of the rule, whereas the state (R, ')', 0, R) sits close to the end, followed only by the exit state (None, '-', R) and the states that followed the formerly embedded (R, 3, R). Here is what we get:

{(R, 0, R): [(R, '(', 0, R)],
(R, '(', 0, R): [(a, 1, R)],
(a, 1, R): [(b, 2, R)],
(b, 2, R): [(R, '(', 0, R), (a, 4, R)],
(a, 4, R): [(c, 5, R)],
(c, 5, R): [(R, ')', 0, R)],
(R, ')', 0, R): [(None, '-', R), (a, 4, R)]
}

A correct parse created with this rule implies an exact pairing of opening and closing parentheses. If we have passed through (R, '(', 0, R) n times we also need to pass n times through (R, ')', 0, R). But the NFA is constructed such that it can be left any time we have passed through (R, ')', 0, R), because the exit state is its follow state. NFAs cannot count, and the tracer can't be used to rule out all incorrect parse trees.

Parse trees created by this kind of rule are therefore checked directly at their (re-)construction from state sequences. This leads us to the next step, after having introduced a new NFA scheme and new transient NFA states.

Customization of derive_tree method

The derive_tree() method of the NFAStateSetSequence is the location where parse trees are created from state lists. This is also the place to define the behavior associated with transient states. We give a listing of the relevant section that deals with the '(' and ')' control characters in transient states.

def derive_tree(self, sub_trees):
    states = self.unwind() # creates state list from state set sequence
    root = states[0][0][0]
    tree = []
    rule = [root]
    cycle = []
    std_err_msg = "Failed to derive tree."
                  "Cannot build node [%s, ...]"%root
    for state, tok in states[1:]:
        nid  = state[0]
        link = state[-1]
        IDX  = state[1]
 
        if IDX == SKIP:
            ...
        # begin of '(' and ')' control character treatment
        # and their transient states
        elif IDX == '(':
            tree.append(state)
            cycle.append(state)
        elif IDX == ')':
            if cycle:
                rec = cycle.pop()
                if (rec[0], rec[2], rec[3]) != (state[0], state[2], state[3]):
                    raise ValueError(std_err_msg)
            else:
                raise ValueError(std_err_msg)
            for i in xrange(len(tree)-1, -1, -1):
                t_i = tree[i]
                if type(t_i) == tuple:
                    if t_i[1] == '(':
                        tree, T = tree[:i], tree[i+1:]
                        if tree:
                            T.insert(0, link)
                            tree.append(T)
                        else:
                            tree = T
                        break
            else:
                raise ValueError(std_err_msg)
        elif nid is None:
            if cycle:  # there must not be cycles left
                raise ValueError(std_err_msg)
            if type(tree[0]) == int:
                return tree
            else:
                tree.insert(0, link)
                return tree
 
        # end of '(' , ')'
        else:
            ...

Here you can download the complete code within its context. At the top of the nfatracing module the CONTROL characters are defined. The list can be extended by your own characters.

Testing our NFA

Let’s write some tests for our NFA interpreter:

R = 'R'
a, b, c = 'a', 'b', 'c'
nfa = {(R, 0, R): [(R, '(', 0, R)],
       (R, '(', 0, R): [(a, 1, R)],
       (a, 1, R): [(b, 2, R)],
       (b, 2, R): [(R, '(', 0, R), (a, 4, R)],
       (a, 4, R): [(c, 5, R)],
       (c, 5, R): [(R, ')', 0, R)],
       (R, ')', 0, R): [(None, '-', R), (a, 4, R)]
    }
interpreter = NFAInterpreter(nfa)
assert interpreter.run('abac') == ['R', 'a', 'b', 'a', 'c']
assert interpreter.run('ababacac') == ['R', 'a', 'b',
                                           ['R', 'a', 'b', 'a', 'c'],
                                            'a', 'c']

With those modifications applied the parser still operates within O(n) bounds, which is quite encouraging. How do we get from here to a fully general treatment of recursion in NFAs? Here are some grammars which represent challenges:

Mutually recursive expansions:

A: a b [B] a c
B: a b [A] a d

Immediate left recursion:

A: A '+' A | a

Multiple occurrences of a rule in immediate succession

A: (A '+' A A) | a

In all of the mentioned cases left recursion conflicts can be removed quite trivially. In complex grammars, however, they can lurk anywhere, and with NFA expansion you can't be sure that they do not emerge at places where you didn't expect them.

What’s next?

In the next article I'll move a few steps back and show how to create Trail NFAs from EBNF grammars. EasyExtend 3 contained a rather complicated and obscure method, due to my apparent inability to find a simple algorithm at the time of its implementation. Recently I found a much simpler and more reliable generation scheme that shall be demonstrated.

Trace Based Parsing ( Part III ) – NFALexer

Posted in EasyExtend, Grammars, TBP on May 4th, 2009 by kay – Be the first to comment

The NFALexer gives me an uneasy feeling. It is the most questionable of Trail's components – one that cannot decide whether it wants to be a lexer that produces a token stream or a parser that produces parse trees. Right now it creates both and degrades parse trees to token streams.

Then there is also another component, called the Postlexer, that actually does not belong to the Trail core anymore. The Postlexer originates from the observation that Python's tokenize.py module implements a two-pass lexical analysis. In the first pass the source code is chunked into pieces using regular expressions, whereas in the second pass whitespace is extracted and NEWLINE, INDENT and DEDENT tokens are produced. In a certain sense the split into NFALexer and Postlexer can be understood as a generalization and clarification of that module. But why create a trace based lexer at all? Why not just stay faithful to regular expressions?

Issues with popular regexp implementations

When you take a look at tokenize.py you will observe code like this

...
Triple = group("[uU]?[rR]?'''", '[uU]?[rR]?"""')
String = group(r"[uU]?[rR]?'[^\n'\\]*(?:\\.[^\n'\\]*)*'",
               r'[uU]?[rR]?"[^\n"\\]*(?:\\.[^\n"\\]*)*"')
 
Operator = group(r"\*\*=?", r"&gt;&gt;=?", r"&lt;&lt;=?", r"&lt;&gt;", r"!=",
                 r"//=?",
                 r"[+\-*/%&amp;|^=&lt;&gt;]=?",
                 r"~")
Bracket = '[][(){}]'
Special = group(r'\r?\n', r'[:;.,`@]')
...

The group function will turn a sequence of regular expressions s1, s2, … into a new regexp (s1 | s2 | …). This group building process will yield a big regular expression of this kind containing about 60-70 components.
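
For reference, the group helper in tokenize.py is essentially a one-liner; the examples below reuse it:

def group(*choices):
    return '(' + '|'.join(choices) + ')'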

What’s wrong with that kind of regular expressions for lexical analysis?

Python uses longest match or greedy tokenization. It matches as many characters as possible. So do regular expressions.

Do they?

The basic problem with Python's regular expression implementation is that s1|s2|… could match something entirely different from s2|s1|….

For example a simple Float regexp can be defined as

Float = group(r'\d+\.\d*', r'\.\d+')

whereas an Int is just

Int = r'[1-9]\d*'

Finally we have a Dot token

Dot = r"\."

When we group those regular expressions as group(Int, Dot, Float) a float never gets matched as a Float: tokenizing 7.5 yields the token stream Int, Dot, Int, not Float. Same with group(Int, Float, Dot). But group(Float, Int, Dot) works as expected.
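
The effect is easy to reproduce. The tokenize_greedy helper below is made up for this demonstration; it simply applies the grouped pattern repeatedly from left to right, which is roughly what tokenize.py does:

import re

Float = group(r'\d+\.\d*', r'\.\d+')
Int   = r'[1-9]\d*'
Dot   = r'\.'

def tokenize_greedy(pattern, text):
    rx, pos, chunks = re.compile(pattern), 0, []
    while pos < len(text):
        m = rx.match(text, pos)
        if not m:
            break
        chunks.append(m.group())
        pos = m.end()
    return chunks

assert tokenize_greedy(group(Int, Dot, Float), "7.5") == ['7', '.', '5']   # Int, Dot, Int
assert tokenize_greedy(group(Float, Int, Dot), "7.5") == ['7.5']           # Float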

This behavior is called ordered choice and it is more of a misfeature than a feature. Ordered choice is OK for small regular expressions (or PEGs) where the programmer has complete oversight, but it is not appropriate for 60 or more tokens and it is a design failure for systems like EasyExtend.

Regular expressions and trace based parsing

I briefly noted in part I of this article series that the original work on trace based parsing goes back to the late 60s and work by Ken Thompson on regular expression engines. So there is no intrinsic flaw in regular expressions, only in some of their implementations. The implicit parallelism of TBP will just scan the source code using Int and Float at the same time, and the matched results will always correspond, no matter how we group our matching rules.
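
The following toy scanner is not Trail, but it mimics this principle under the stated assumption that every rule is tried at the current position and the longest match wins; with that policy the order of the rules becomes irrelevant:

import re

RULES = [('Int',   re.compile(r'[1-9]\d*')),
         ('Dot',   re.compile(r'\.')),
         ('Float', re.compile(r'\d+\.\d*|\.\d+'))]

def scan(text):
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        # try every rule "in parallel" at the current position
        for name, rx in RULES:
            m = rx.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError("no rule matches at position %d" % pos)
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens

assert scan("7.5") == [('Float', '7.5')]   # independent of the order in RULES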

Expansion in traced based lexers

The Int, Float example highlights a First/First conflict among different rules. Trail uses expansion to eliminate such conflicts. Actually expansion works as a conflict resolution strategy for all lexers I've examined. For that reason the NFALexer lacks backtracking as a fallback.

Why the current NFALexer implementation is flawed

The NFALexer is slow. For each character the NFALexer produces a token and often these tokens are just thrown away again. Strings and comments might have a rich internal token structure, but in the NFAParser we just want to use a single STRING token, and comments are at best preserved to enable pretty printing.

The NFAParser for Python demands a single NUMBER token for all kinds of numbers. The token grammar defines about 12 different rules for numbers. That's perfectly OK as a structuring principle, but we get far more structure than needed. It would be good if the NFA production for the lexer could adapt itself to the demands of the parser.

The weirdest aspect is the flattening of parse trees to tokens. This is caused by the fact that we want to pass Python parse trees to the compiler, and the compiler can't deal with terminal symbols that are in fact no terminals at all but parse trees.

Conclusion: rewrite the NFALexer s.t. it becomes a proper lexer, i.e. a function that takes a string and returns a token stream without producing any complex intermediate data structures. A trace based lexer should fly!

In the rest of the article we discuss some design aspects of the NFALexer.

Parser Token and Lexer Symbols

In the previous article about NFA expansion we have already touched on the concept of a node id (nid). For each rule with rule name R there is a mapping R → nid(R), where nid(R) is a number used to identify R. The mapping expresses a unique relationship and it is not only unique for a particular EasyExtend langlet but is intended to be unique across all of them. This means that each langlet has its own domain or range of node ids:

   L = LANGLET_OFFSET
   nid_range = range(L, L+1024)
The LANGLET_OFFSET is a number L = offset_cnt*1024, offset_cnt = 0, 1, 2, …. The offset_cnt will be read/stored in the file EasyExtend\fs and it gets incremented for each new langlet that is created.

A range of size 1024 might appear small but big programming languages have about 80 terminal symbols and 200-300 non-terminals. Even COBOL has not more than 230 rules. COBOL has a few hundred keywords but this won’t matter, since keywords are all mapped to a single token. Maybe Perl is bigger but I guess the problems with Perl are those of context sensitivity and everything that can be encoded about Perl in a context free way can also be expressed in EasyExtend.

The node id range is again split into two ranges: one for terminals (token) and another one for non-terminals (symbols):

    L = LANGLET_OFFSET
    token_range  = range(L, 256 + L)
    symbol_range = range(256 + L, 1024 + L)
The names "symbol" and "token" go back to Python's standard library modules token.py and symbol.py.

From lex_symbol to parse_token

Both NFALexer and NFAParser are parsers and use the same LANGLET_OFFSET and the same nid ranges. Since the ranges coincide and we want to use the symbols of the NFALexer as the tokens of the NFAParser, we have to apply a shift that turns NFALexer symbols into NFAParser tokens:

token-nid(NFAParser) = symbol-nid(NFALexer) - 256

Comparing lexdef/lex_symbol.py with parsedef/parse_token.py for arbitrary langlets lets you check this relation:

(table comparing lex_symbol and parse_token nids omitted)

Lexer Terminals

When the symbols of the lexer are the tokens of the parser, how can we characterize the tokens of the lexer? The answer is simple: by individual characters. One can just define:

DIGIT: '0123456789'
CHAR: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
...

The disadvantage of this explicit notation is that it blows up the lexer NFAs and adds a new state for each character. Moreover we cannot write down the set of all characters and define a symbol that behaves like a regular expression dot that matches anything. So EasyExtend lets one define character sets separately.

Character sets

Character sets include sets for alphanumeric characters, whitespace, hexdigits and octdigits but also the set ANY used to represent any character. Unlike individual characters character sets have a numerical token-id. Character sets are defined in the module EasyExtend/lexertoken.py.

Here is an example:

BaseLexerToken.tokenid.A_CHAR  = 11
BaseLexerToken.charset.A_CHAR  = set(string.ascii_letters+"_")

Each lexdef/lex_token.py contains a LexerToken object. LexerToken is a copy of the BaseLexerToken prototype. During the copying process the token-ids defined at BaseLexerToken.tokenid.xxx are shifted according to the langlet offset. LexerToken objects will be directly accessed and extracted by the NFALexer.
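
Read literally, the copying and shifting step could look roughly like the following sketch; the helper name and the namespace class are made up for this illustration, this is not the lexertoken.py code:

class _NS(object):
    """Simple attribute container, standing in for the tokenid/charset namespaces."""

def copy_lexer_token(base, offset):
    tok = _NS()
    tok.tokenid, tok.charset = _NS(), _NS()
    for name, nid in vars(base.tokenid).items():
        setattr(tok.tokenid, name, nid + offset)   # shift by the langlet offset
    for name, chars in vars(base.charset).items():
        setattr(tok.charset, name, set(chars))     # charsets are copied unchanged
    return tok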

Below we give an example of a LexerToken extension. The langlet we use here is Teuton and the character sets we define are German umlauts.

# -*- coding: iso-8859-1 -*-
# You might create here a new LexerToken object and set or overwrite
# properties of the BaseLexerToken object defined in lexertoken.py
 
LANGLET_OFFSET = 38912
from EasyExtend.lexertoken import NewLexerToken
LexerToken = NewLexerToken(LANGLET_OFFSET)
LexerToken.tokenid.A_UMLAUT = LANGLET_OFFSET+20
LexerToken.charset.A_UMLAUT = set('\xe4\xf6\xfc\xc4\xd6\xdc\xdf\x84\x8e\x94\x99\x81\x9a\xe1')
LexerToken.tokenid.A_ue     = LANGLET_OFFSET+21
LexerToken.charset.A_ue     = set(['\x81\xfc'])

Beyond character sets

Character sets are proper set objects. However it is possible to use other set-like objects as well. An interesting yet unused set-like type is the universal set described in this article.

Postlexer tokens

There is another important class of tokens we have not touched yet. This class of tokens is already present in Python and some of them account for Python's fame, most prominently INDENT and DEDENT. But there are also NEWLINE, ENDMARKER, NL and ERRORTOKEN. What they have in common is that there is no pattern describing them. When the source code is chunked into pieces using regular expressions, none of those chunks is assigned to the mentioned tokens. These tokens are produced in a secondary post-processing step.

Same for EasyExtend. There are grammar rules for INDENT, DEDENT, NEWLINE etc., but on the right-hand side we find terminals like T_INDENT, T_DEDENT and T_NEWLINE that are neither characters nor charsets. They are terminal symbols only in a formal sense, used to pacify the parser generator. In EasyExtend postlexer tokens are produced and inserted into the token stream generated by the NFALexer. Their production is located in the Postlexer, which is a function Postlexer: TokenStream → TokenStream.
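
To make the idea concrete, here is a deliberately simplified sketch of such a Postlexer function for the indentation case. The (kind, value) token format and the function name are assumptions for this illustration, not the EasyExtend API:

def simple_postlexer(tokens):
    # derive NEWLINE / INDENT / DEDENT from whitespace INTRON tokens
    # and drop the INTRONs themselves
    out, levels = [], [0]
    for kind, value in tokens:
        if kind == 'INTRON' and '\n' in value:
            out.append(('NEWLINE', '\n'))
            indent = len(value.rsplit('\n', 1)[1])   # columns after the last newline
            if indent > levels[-1]:
                levels.append(indent)
                out.append(('INDENT', ''))
            while indent < levels[-1]:
                levels.pop()
                out.append(('DEDENT', ''))
        elif kind != 'INTRON':
            out.append((kind, value))
    return out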

Apart from the Postlexer there is another component called the Interceptor that is used to manipulate state sets on the fly during NFA parsing. Together the Postlexer and the Interceptor enable context sensitive parsing. It is interesting to note here that they both act on the level of lexical analysis and token production and there is no interference with the NFAParser. From a design point of view this is very pleasant because parsing is usually harder and more error prone than lexing and if context sensitivity can be captured during lexical or post-lexical analysis pressure is taken away from the parser.

Special tokens

In a final section we want to discuss three special tokens called ANY, STOP and INTRON.

INTRON

INTRON – on other occasions people also call it JUNK; I used a slightly less pejorative name. Although INTRONs do not encode any token used by the parser, it can happen that the Postlexer still extracts information from them. For example Python's Postlexer can derive INDENT, DEDENT and NEWLINE tokens from an INTRON. Other languages like C have no use for them and just throw them away. An INTRON can be attached to a token as an attribute for the purpose of exact source code reconstruction from parse trees.

ANY

ANY is the counterpart of a regexp dot. It just matches any character. ANY is implemented as a charset – the empty set. Since it matches any character one could expect to run into First/First conflicts with every other character set, and indeed it does! However the NFALexer has a hardcoded disambiguation strategy, and according to this strategy ANY always has the lowest priority. So if {ANY, 'X', ';'} is in the selection and the current character is 'c' it is matched by ANY. But if it is 'X' it is matched by 'X' and the ANY trace will be discontinued.
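
A rough model of this priority rule, with made-up names and data layout (the real NFALexer works on traces, not on a bare selection list):

ANY = 'ANY'   # stand-in for the ANY charset token

def match_in_selection(selection, char, charsets):
    # explicit characters and charsets win; ANY is only taken if nothing
    # else in the selection matches the current character
    explicit = [s for s in selection
                if s != ANY and (char == s or char in charsets.get(s, set()))]
    if explicit:
        return explicit               # the ANY trace is discontinued
    return [ANY] if ANY in selection else []

assert match_in_selection([ANY, 'X', ';'], 'c', {}) == [ANY]
assert match_in_selection([ANY, 'X', ';'], 'X', {}) == ['X']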

STOP

Sometimes a production rule accepts a proper subset of the productions of another one. A popular example is the NAME rule

NAME: A_CHAR+

and it covers every production that specifies a particular name as a token:

DEF: "def"

So DEF is always also a NAME, and if the trace terminates with "def" and does not continue with another character we have two traces and need to make a decision. The STOP terminal serves for disambiguation purposes. We can write

DEF: "def" STOP

If this happens and NAME cannot be continued STOP is selected. STOP does not match any character but extends the trace of DEF which means the DEF trace is selected, not NAME.

What’s next?

In the next article I'll move to the borders of Trail and discuss recursion in NFAs and possible methods to deal with it.