White Easter
Snow came back, if only for a brief moment, to remind me laying Trail to rest until next winter…
x + + + + + + + + +
- - - - - - - - - x
- - - - - - - - x +
- - - - - - - x + +
- - - - - - x + + +
- - - - - x + + + +
- - - - x + + + + +
- - - x + + + + + +
- - x + + + + + + +
- x + + + + + + + +
Flat Automatons
This story begins, where the last blog article ended, in Sept. 2011. At that time I realized, contrary to my expectations, that careful embedding of finite automatons into each other could yield higher execution speed for parsers then without those embeddings, such as LL(1) and this despite a far more complex machinery and additional efforts to reconstruct a parse tree from state sequences, generated while traversing the automaton. It wasn’t that I believed that this speed advantage wouldn’t go away when moving towards an optimized, for-speed implementation and running my samples on PyPy confirmed this assumption but it was an encouraging sign that the general direction was promising and more ambitious goals could be realized. I wasn’t entirely mistaken but what came out in the end is also scary, if not monstrous. Anyway, let’s begin.
If the automaton embeddings I just mentioned were perfect the grammar which is translated into a set of those automatons, rule by rule, would dissolve into a single finite automaton. It was completely flat then and contained only terminal symbols which referred to each other . An intrinsic hurdle is recursion. A recursive rule like
X: X X | a
would give us an automaton, that has to be embedded into itself which is not finitely possible. Instead we could consider successive self-embedding as a generative process:
X X | a
X X X X | X X a | a X X | a X | X a | a a | a
...
Working embeddings
The difficulties of creating a flat, finite automaton from a grammar are twofold. Perfect embedding/inlining leads to information loss and recursive rules to cyclic or infinite expansion. Of both problems I solved the first and easier one in the early versions of the Trail parser generator. This was done by preparing automaton states and the introduction of special ɛ-states. In automata theory an ɛ-state corresponds to an empty word, i.e. it won’t be used to recognize a character / token. It solely regulates transitions within an automaton. In BNF we may write:
X: X a
X: ɛ
which means that `X` can produce the empty word. Since ɛ`a = a` the rule `X` accepts all finite sequences of `a`. In EBNF we can rewrite the X-rule as
X: [X] a
which summarizes the optionality of the inner X well. We write the automaton of the X-rule as a transition table
(X,0): (X,1) (a,2)
(X,1): (a,2)
(a,2): (FIN,-1)
Each state carries a unique index, which allows us to distinguish arbitrary many different `X` and `a` states in the automaton. If we further want to embedd `X` within another rule, say
Y: X b
which is defined by the table
(Y,0): (X,1)
(X,1): (b,2)
(a,2): (FIN,-1)
the single index is not sufficient and we need a second index which individualizes each state by taking a reference to the containing automaton:
(X,0,X): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): (FIN,-1,X)
With this notion we can embed `X` in `Y`:
(Y,0,Y): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): (FIN,-1,X)
(FIN,-1,X): (b,2,Y)
(b,2,Y): (FIN,-1,Y)
The initial state `(X,0,X)` of `X` was completely removed. The automaton is still erroneous though. The final state `(FIN,-1,X)` of `X` is not a final state in `Y`. It even has a transition! We could try to remove it completely and instead write
(Y,0,Y): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): (b,2,Y)
(b,2,Y): (FIN,-1,Y)
But suppose Y had the form:
Y: X X b
then the embedding of X had the effect of removing the boundary between the two X which was again a loss of structure. What we do instead is to transform the final state `(FIN, -1, X)` when embedded in `Y` into an ɛ-state in `Y`:
(FIN,-1,X) => (X, 3, TRAIL_DOT, Y)
The tuple which describes automaton states is Trail is grown again by one entry. A state which is no ɛ-state has the shape `(_, _, 0, _)`. Finally the fully and correctly embedded automaton `X` in `Y` looks like this:
(Y,0,0,Y): (X,1,0,X) (a,2,0,X)
(X,1,0,X): (a,2,0,X)
(a,2,0,X): (X,3,TRAIL_DOT,Y)
(X,3,TRAIL_DOT,Y): (b,2,0,Y)
(b,2,0,Y): (FIN,-1,0,Y)
The `TRAIL_DOT` marks the transition between `X` and `Y` in `Y`. In principle we are free to define infinitely many ɛ-states. In the end we will define exactly 5 types.
Rules of Embedding
At this point it is allowed to ask if this is not entirely recreational. Why should anyone care about automaton embeddings? Don’t we have anything better to do with our time? This certainly not but demanding a little more motivation is justified. Consider the following grammar:
R: A | B
A: a* c
B: a* d
In this grammar we encounter a so called FIRST/FIRST conflict. Given a string “aaa…” we cannot decide which of the rules A or B we have to choose, unless we observe a ‘c’ or ‘d’ event i.e. our string becomes “aa…ac” or “aa…ad”. What we basically want is to defer the choice of a rule, making a late choice instead of checking out rules by trial and error. By careful storing and recalling intermediate results we can avoid the consequences of an initial bad choice, to an extent that parsing in O(n) time with string length n becomes possible. Now the same can be achieved through automaton embeddings which gives us:
R: a* c | a* d
but in a revised form as seen in the previous section. On automaton level the information about the containing rules A and B is still present. If we use R for parsing we get state sets `{ (a,1,0,A), (a,2,0,B) }` which recognize the same character “a”. Those state sets will be stored during parsing. In case of a “c”-event which will be recognized by the state (c, 3, 0, A) we only have to dig into the state-set sequence and follow the states(a, 1 , 0, A) back to the first element of the sequence. Since (A, 4, TRAIL_DOT, R) is the only follow state of (c, 3, 0, A)we will actually see the sequence:
(A,4,TRAIL_DOT,R)
\
'...
\
(c,3,0,A)
(a,2,0,A)
(a,2,0,A)
...
(a,2,0,A)
from this sequence we can easily reconstruct contexts and build the tree
[R, [A, a, a, ..., c]]
All of this is realized by late choice. Until a “c” or “d” event we move within A and B at the same time because. The embedding of A and B in R solves the FIRST/FIRST conflict above. This is the meaning.
FOLLOW/FIRST conflicts
So far the article didn’t contain anything new. I’ve written about all of this before.
The FIRST/FIRST conflicts between FIRST-sets of a top down parser is not the only one we have to deal with. We also need to take left recursions into account, which can be considered as a special case of a FIRST/FIRST conflict but also FIRST/FOLLOW or better FOLLOW/FIRST conflicts which will be treated yet. A FOLLOW/FIRST conflict can be illustrated using the following grammar:
R: U*
U: A | B
A: a+ (B c)*
B: b
There is no FIRST/FIRST conflict between A and B and we can’t factor out a common prefix. But now suppose we want to parse the string “abb”. Obviously A recognizes the two initial characters “ab” and then fails at the 2nd “b” because “c” was expected. Now A can recognize “a” alone and then cancel the parse because (B c)* is an optional multiple of (B c). This is not a violation of the rules. After “a” has been recognized by A the rule B may take over and match “b” two times:
[R, [U, [A, a]], [U, [B, b]], [U, [B, b]]]
Trail applies a “longest match” recognition approach, which means here that A is greedy and matches as much characters in the string as possible. But according to the rule definition A can also terminate the parse after `a` and at that point the parser sets a so called checkpoint dynamically. Trail allows backtracking to this checkpoint, supposed the longest match approach fails after this checkpoint. Setting exactly one checkpoint for a given rule is still compliant with the longest match approach. If the given input string is “abcbb” then A will match “abc”, if it is “abcbcbb” then it is “abcbc” and so on.
The FOLLOW/FIRST conflict leads to a proper ambiguity and checkpoints are currently the approach used by Trail to deal with them. I also tried to handle FOLLOW/FIRST conflicts in an automaton embedding style but encountered fragmentation effects. The ambiguities were uncovered but paid with a loss of direction and longest match was effectively disabled.
The inevitability of left recursions
It is easy in top down parsing to recognize and remove or transform left recursive rule like this one
X: X a | ɛ
The phenomenology seems straightforward. But making those exclusions is like drawing political boundaries in colonialist Africa. Desert, vegetation, animals and humans don’t entirely respect decisions made by once rivaling French and English occupants. If embedding comes into play one has we can count on uncovering left recursions we didn’t expected them. I’d like to go even one step further which is conjectural: we can’t even know for sure that none will be uncovered. The dynamics of FIRST/FIRST conflicts that are uncovered by embeddings, this clash dynamics, as I like to call it might lead to algorithmically undecidable problems. It’s nothing I’ve systematically thought about but I wouldn’t be too surprised.
For almost any left recursive rule there is a FIRST/FIRST conflict of this rule with itself. Exceptions are cases which are uninteresting such as
X: X
or
X: X a
In both cases the FIRST-sets of X don’t contain any terminal symbol and they can’t recognize anything. They are like ɛ-states but also non-terminals. Very confusing. Trail rips them off and issues a warning. An interesting rule like
E: E '*' E | NUMBER
contains a FIRST/FIRST conflict between `E` and `NUMBER`. They cannot be removed through self embedding of E. Same goes with rules which hide a left recursion but leads to an embedding to embedding cycles, such as
T: u v [T] u w
which are quite common. We could try to work around them as we did with FOLLOW/FIRST conflicts, instead of downright solving them. Of course one can also give up top down parsing in favor for bottom up parsers of Earley type or GLR, but that’s entirely changing the game. The question is do we must tack backtracking solutions onto Trail which are deeper involved than checkpoints?
After 6 months of tinkering I wished the answer was no. Actually I believe that it is unavoidable but it occurs at places were I didn’t originally expected it and even in that cases I often observed/measured parsing efforts which is proportional to string length. Parse tree reconstruction from state-set traces, which was once straightforward becomes a particularly hairy affair.
Teaser
Before I discuss left recursion problems in Trail in a follow up article I’ll present some results as a teaser.
Grammars for which parsing in Trail is O(n):
a) E: '(' E ')' | E '*' E | NUMBER
b) E: '(' E+ ')' | E '*' E | NUMBER
Other grammars in the same category are
c) G: 'u' 'v' [G] 'u' 'w'
d) G: G (G | 'c') | 'c'
e) G: G G | 'a'
However for the following grammars the parser is in O(2^n)
f) G: G G (G | 'h') | 'h'
g) G: G [G G] (G | 'h') | 'h'
If we combine d) and f) we get
h) G: G G (G | 'h') | G (G | 'h') | 'h'
In this case Trail will deny service and throw a `ParserConstructionError` exception. Some pretty polynomial grammars will be lost.