This is the first in a series of articles about Trace Based Parsing ( TBP ). Trace based parsing is a powerful top down, table driven parsing style. TBP is the technique underlying Trail which is the parser generator of EasyExtend (EE). At its core Trail uses a variant of regular expression matching engines of the Thompson style described in an excellent article by Russ Cox. However Trail extends the scope of those engines and goes beyond regular expressions and covers context free grammars and programming languages.
Originally I didn’t intend to create a parser generator let alone a novel one that wasn’t described by the literature that was available to me. EasyExtend used a very efficient LL(1) parser generator written by Jonathan Riehl and what I needed was something more modest namely a parse tree validator for checking the correctness of parse tree transformations. Those validators are simpler to create than parsers and I’ll start my introductory notes about TBP describing them.
Parse Tree Validators
Suppose an EasyExtend user has defined a few new grammar rules in an extension language ExtLPython of Python. The ExtLPython source code is parsed correctly into a parse tree `PT(`ExtLPython`)` and now the user wants to transform `PT(`ExtLPython`)` back into a Python parse tree `PT(Python)` which can be compiled to bytecode by means of the usual compiler machinery. In essence the function `PT(`ExtLPython`)` → `PT(Python)` is a preprocessor of the users language and the preprocessors, not the bytecode compiler, are defined within EE and need to be checked.So the task we need to solve can be stated as
Given a parse tree `PT` and a grammar `G` check that `PT` is conformant with G.
An EBNF grammar of Pythons parser is already available in the Python distribution and can be found in the distributions `/Grammar` folder or here ( for Python 2.5 ).
We will start the derivation of a parse tree validator considering simple grammars. Take the following grammar `G1` for example:
Grammar G1 ---------- G1: A B | A C |
The possible parse trees of a language described by `G1` have the following shape :
`[G1, A, B]` or `[G1, A, C]`. Our next grammar is just slightly more involved
Grammar G2 ---------- G2: A R R: B B C | B B D |
The corresponding trees are:
TG1 = `[G2, A, [R, B,B, C]]`
TG2 = `[G2, A, [R, B, B, D]]`.
The function ( or object method ) I seek steps through the parse tree and determines for every node in the tree the set of all possible subsequent nodes. This function shall be called a tracer. Given the grammar `G2` and the symbol `A` the tracer shall yield the symbol `R`. If the symbol is `B` the yielded follow symbol is either `B` or those are the symbols `C` and `D` depending on the state of the tracer.
A tracer can be best implemented using a finite state automaton. For the grammar `G2` we derive following states and state transitions in the automaton:
(G2,0) : (A, 1) (A, 1) : (R, 2) (R, 2) : (B, 3) , (B, 4) (B, 3) : (B, 5) (B, 4) : (B, 6) (B, 5) : (C, 7) (B, 6) : (D, 8) (C, 7) : (None, '-') (D, 8) : (None, '-') |
In this automaton we have encoded the symbols of the grammar rules as 2-tuples `(symbol-name, index)`. The index is arbitrary but important to distinguish different states in the automaton originating from the same symbol that is used at different locations. In `G2` the symbol `B` gives rise to four different states `(B, 3)`, `(B, 4)`, `(B, 5)` and `(B, 6)`.
The finite state automaton is non-deterministic. That’s because each transition is described by the symbol name only. So if the automaton is in the state `(A, 1)` the symbol `R` is the label of the transition: `(A, 1) → R → (R, 2)`. The subsequent transition is
`(R, 2) → B → [(B, 3), (B,4)]`.
This particular transition doesn’t yield a unique state but a set of states. Hence the non-determinism. Nondeterministic finite state automatons are abbreviated by NFA.
For all states `S` in a state set we can safely assume that `S[0] = symbol-name` for a fixed symbol name. In essence our tracing function produces state sets on each transition and those sets are uniquely determined by symbol names.
State views and selection views
The parse trees `TG1` and `TG2` from the previous section correspond with following traces which describe state transition sequences:
Trace(TG1) = `(G2,0) → A → (A, 1) → R → (R, 2) → B → [(B, 3), (B, 4)] → B → [(B, 5), (B, 6)] → C → (C, 5) → (None, ‘-‘)`
Trace2(TG2) = `(G2,0) → A → (A, 1) → R → (R, 2) → B →[(B, 3), (B, 4)] → B →[(B, 5), (B, 6)] → D → (D, 6) → (None, ‘-‘)`
Our intended tracer would step through `Trace(TG1)` and `Trace(TG2)` at the same time. This makes the tracer is implicitly parallel! The actual state sets of the NFA are hidden from client objects that use the tracer. All they consume is the sequence of labels/symbols:
`G2 → A → R → B → B → ( C, D ) → None`
We assume now that the tracer is an object that stores the current state-set of the NFA as a private member and enables tracing though the NFA using a `select` method which returns a set of labels that can be used for the next `select` call ( unless the label is `None`). A sequence of `select`calls looks like this:
[A] = tracer.select(G2) [R] = tracer.select(A) [B] = tracer.select(R) [B] = tracer.select(B) [C,D] = tracer.select(B) [None] = tracer.select(C) # or [None] = tracer.select(D) |
Now we just have to bring the NFA tracers and the parse trees together. If the nested list `[G2, A, [R, B, B, C]]` represents a parse tree one has to step through the list and call the `select` method of the tracer with the current symbol found in the list. If we enter a sublist we spawn a new tracer a call the validator recursively. The parse tree is correct if each selection was successful. Otherwise an exception is raised.
The following Python function gives us an implementation:
def validate_parse_tree(tree, tracer): selection = [] for N in tree: if istree(N): validate_parse_tree(N, tracer.clone()) else: selection = tracer.select(N) if not selection: raise ValueError("parse tree validation failed at node %s"%N) if None in selection: return else: raise ValueError("parse tree validation failed at node %s"%tree) |
A conflict free traced based LL(1) parser
LL(1) parsers are a class of efficient top down parsers. Wikipedia statesAn LL parser is called an LL(k) parser if it uses k tokens of lookahead when parsing a sentence. If such a parser exists for a certain grammar and it can parse sentences of this grammar without backtracking then it is called an LL(k) grammar. Of these grammars, LL(1) grammars, although fairly restrictive, are very popular because the corresponding LL parsers only need to look at the next token to make their parsing decisions. Languages based on grammars with a high value of k require considerable effort to parse.
Notice that this section is pretty confusing because the fact that only 1 token of lookahead is required for a parser doesn’t imply anything about the characteristic features of the grammar but about the implementation technique. Backtracking is mentioned as such a technique and it is a general goal of parser generator design to avoid it because backtracking has an O(2^n) asymptotic complexity.
Later in the article WP mentions three types of LL(1) conflicts:
- First/First conflict – overlapping of non-terminals
- First/Follow conflict – first and follow sets overlap
- Left recursion
The implicit parallelism of trace based parsers is able to solve many of the mentioned First/First and First/Follow conflicts. TBPs are nevertheless LL(1) which means that they don’t look deep into the token stream for decision making or run into backtracking. Right now it is unknown how powerful TBPs actually are. So if you know a computing science PhD student who looks for an interesting challenge in parsing theory you can recommend this work. If the problems discussed here are already solved you can recommend that work to me.
For the rest of the article we want to assume that LL(1) grammars are EBNF grammars free of the conflicts mentioned above. Traditionally LL(1) parsers are constrained by their ability to parse language of those grammars.
Reachables
The more common term found in the literature for reachable is first set. I prefer the term reachable in the context of EE though simply because the common distinction between first set and follow set is not so important in Trail. We watch parsers through the lens of Trail NFAs and everything that is on the RHS of a state is just a follow set. So the first set is also a follow set.
We give an explanation of the term reachable in the context of Trail NFAs.
Let’s reconsider the grammar G2 which defines two grammar rules named as G2 and R. The follow sets of the initial state of R and G2 are given by:
(G2,0) : (A, 1) (R, 2) : (B, 3) , (B, 4) |
The terminals or non-terminals according to those states are reachable i.e. `A` is reachable from `G2` and `B` is reachable from `R`. Actually we do not only look at the immediate follow sets of the initial state but also compute the transitive closure. Reachability is transitive which means: if `X` is reachable from `B` and `B` is reachable from `R` then `X` is also reachable from `R`.
Each token in a token stream that is passed to a parser is characterized by a terminal symbol. We call this terminal symbol also the token type. The central idea of an LL parsers is to derive terminals ( token types ) T as leftmost derivations of nonterminals N. Leftmost derivation is yet another characterization of reachability in the above sense. So we start with a terminal N and if T can be reached from N we can also create a parse tree containing N as a root node and T as a leaf node. We want to elaborate this now:
Suppose T0 is reachable from N and there is no other non-terminal M that is reachable from N which can also reach T0. So there is no non-terminal between N and T0. If N0 is a start symbol of our grammar and N1 is reachable from N0 and N2 from N1 etc. we get a nested chain of symbols [N0, [N1 , [… [Nk , … [N, T0]]] which is our initial parse tree.
That was simple but how to move on and derive the next token T1? Let’s examine the Trail NFA of N which has following structure:
(N, 0) : (T0, 1), ... (T0,1) : (M, 2), ... ... |
and make following case distinction:
- (T1, i) is a follow state of (T0, 1). Then our new parse tree becomes [N0, [N1 , [… [Nk , … [N, T0, T1 *]]] and we can proceed with token T2 and the follow states of (T1, i) in N. The * symbol indicates the location where the parser proceeds. These locations correspond to NFA states.
- (T1, i) is reachable from (M, 2). Then we recursively call the parser with the token iterator and M as a start symbol and trace through the NFA of M. The parser returns a new parse tree [M, …] and we embed this tree like [N0, [N1 , [… [Nk , … [N, T0, [M,…] *]]] and proceed with the follow states of (M, 2) and the current token Tk yielded by the token iterator.
- (None, ‘-‘) is in the follow states of (T0, 1). This means that that N is allowed to terminate. The parser with N as a start symbol returns the parse tree [N, T0]. So we get [N0, [N1 , [… [Nk , … [N, T0]*]] and proceed with T1 and Nk.
- If all those conditions fail raise an error message.
The following code is an implementation of the algorithm.
def parse(tokenstream, symbol): tracer = self.new_tracer(symbol) selection = tracer.select(symbol) tree = [symbol] token = tokenstream.current() while token: token_type = get_tokentype(token) for nid in selection: if nid is not None: if istoken(nid): if token_type == nid: tokenstream.shift_read_position() tree.append(token) break elif token_type in reachable(nid): sub = parse(tokenstream, nid) if sub: tree.append(sub) break else: continue else: if None in selection: return tree else: raise ParserError selection = tracer.select(nid) try: token = tokstream.next() except StopIteration: token = None return tree |
You will have noticed that the production of the NFAs from EBNF grammars was just presumed. It would take another article to describe the translation. However for simple grammars you can do such a translation easily by hand and the translation process doesn’t add much to our understanding of parse tree validation and language parsing.
What’s next?
In the next article we will touch grammars with First/First and First/Follow conflicts. They are not LL(1) grammars according to the Wikipedia definition. The way we’ve dealt with those grammars is neither increasing the number of lookaheads ( i.e. going from LL(1) to LL(k) or LL(*)), nor running into backtracking, nor modifying the grammar itself. Instead we carefully modify the NFAs derived from the grammar. “Careful” means that parse trees we create with TBP and modified NFAs correspond to those created with unmodified NFAs. That’s much like distort and restore the grammar itself. This might be another area of academic research for young computing scientists interested in parser technology. I’ve developed a particular algorithm for solving the mentioned problem and I haven’t the slightest idea if one could do so far better.Notice that I even tried to attack left recursion conflicts that tantalize top down parsers but I gave up on these approaches because I couldn’t understand the NFAs anymore, didn’t capture all use cases and debugging became such a pain. However there is still the promise of creating an O(n) top down parser that is able to parse all context free languages. And please, no tricks and no PEGs. Sweeping grammar conflicts under the rug using ordered choice is a foul and a burden to the programmer who wants to use the parser generator. Creating such a parser isn’t quite proving P!=NP but it would be still really, really cool and of much more practical relevance.