In the previous article about trace-based parsing (TBP) I explained the basic concepts of TBP. I motivated TBP by parse tree validators that check the correctness of parse trees against grammars, and I derived a trace-based LL(1) parser. In this article I will leave familiar territory and talk about the most important disambiguation technique in Trail, which is called NFA expansion. Before I can get into this I have to refine the notation of Trail NFAs.
Trail NFAs
A Trail NFA is a dictionary of states. The keys of this dictionary are single states and the values are lists of states. A single state is either a 3-tuple
`(node-type, index, rule-node-type)`
or a 4-tuple
`(node-type, control-symbol, index, rule-node-type)`.
(Notice that the only reason for this somewhat peculiar implementation of states is that they are easy to read and dump to text files. Furthermore, field access on tuples is more efficient in Python than attribute access on objects.)
A `node-type` or `node-id` or `nid` is an integer value that represents a terminal or non-terminal in a parse tree. A parse tree is full of node ids. A node-type can also be `None` for the exit state of an NFA, or it can be a string. All exit states are equal. If the nid is a string it is a language keyword. The index is just an arbitrary number that uniquely identifies a symbol in a grammar rule. If you have a grammar rule like the following
`R: (A | B ';' c*)+`
you can enumerate the symbols as follows: `(R, 0), (A, 1), (B, 2), (';', 3), (c, 4)`. The index of the rule symbol (R in our example) is always 0.
The `rule-node-type` is just the node id of the rule itself. The control-symbol will be discussed later. So the states are
`(R, 0, R), (A, 1, R), (B, 2, R), (';', 3, R), (c, 4, R)`.
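As Python values these states are nothing more than a handful of tuples. A minimal sketch with made-up node ids (in Trail the node ids come from the grammar's symbol tables):

```python
# Made-up node ids; in Trail these come from the grammar's symbol tables.
R, A, B, c = 300, 301, 302, 1

# States of R: (A | B ';' c*)+ enumerated left to right. The string ';'
# appears literally as the node-type, as in the enumeration above.
states = [(R, 0, R), (A, 1, R), (B, 2, R), (';', 3, R), (c, 4, R)]

# Field access happens by index, which is why tuples are used instead of
# objects with named attributes:
node_type, index, rule_node_type = states[1]
```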
In EasyExtend the Trail NFAs of grammar rules are themselves organized into dictionaries. The rule-node-type is the dictionary key and a list of entries is the value
```
257: ['file_input: ( NEWLINE | stmt)* ENDMARKER',
      '',
      (257, 0, 257),
      {(0, 3, 257): [(None, '-', 257)],
       (4, 1, 257): [(266, 2, 257), (4, 1, 257), (0, 3, 257)],
       (257, 0, 257): [(266, 2, 257), (4, 1, 257), (0, 3, 257)],
       (266, 2, 257): [(266, 2, 257), (4, 1, 257), (0, 3, 257)]}]
```
The meaning of those entries is
- the textual grammar rule description – `file_input: ( NEWLINE | stmt)* ENDMARKER`
- the second entry is a relic and it might go away in EE 4.0
- the start symbol of the NFA – `(257, 0, 257)`
- the NFA itself
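Accessing the parts of such an entry is plain indexing. A minimal sketch, where `nfas` is a hypothetical name for the top-level dictionary shown above:

```python
# 'nfas' is a hypothetical name for the top-level dict keyed by rule-node-type.
rule_text, _relic, start, nfa = nfas[257]

rule_text    # 'file_input: ( NEWLINE | stmt)* ENDMARKER'
start        # (257, 0, 257)
nfa[start]   # the follow states of the start state
```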
The NFA looks awkward at first sight but it is easy to verify that it correctly expresses the grammar rule once we have translated node ids to rule names. Since the rule is a Python rule we can decode the nids using the standard library modules `symbol.py` for non-terminals and `token.py` for terminals.
```
>>> import symbol
>>> map(symbol.sym_name.get, [257, 266])
['file_input', 'stmt']
>>> import token
>>> map(token.tok_name.get, [0, 4])
['ENDMARKER', 'NEWLINE']
```
When we replace the states by the rule names mapped from the node ids we will get the following dict:
```
{ENDMARKER:  [None],
 NEWLINE:    [stmt, NEWLINE, ENDMARKER],
 file_input: [stmt, NEWLINE, ENDMARKER],
 stmt:       [stmt, NEWLINE, ENDMARKER]}
```
With `file_input` as the start and `None` as the exit symbol it obviously expresses the correct transitions.
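The translation step can be written down as a small helper. A sketch, assuming the Python 2 era `symbol` and `token` modules used above; the function name is made up:

```python
import symbol
import token

def nid_to_name(nid):
    """Translate a node id into a readable name (hypothetical helper)."""
    if nid is None:                    # exit state marker
        return None
    if isinstance(nid, str):           # strings are language keywords
        return nid
    # Non-terminals live in symbol.sym_name, terminals in token.tok_name.
    return symbol.sym_name.get(nid, token.tok_name.get(nid))
```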
Here is a slightly more advanced NFA showing why using indices is quite reasonable
271: ["print_stmt: 'print' ([test (',' test)* [',']] | '>>' test [(',' test)+ [',']])", '', (271, 0, 271), {(12, 3, 271): [(303, 4, 271)], (12, 5, 271): [(None, '-', 271)], (12, 8, 271): [(303, 9, 271)], (12, 10, 271): [(None, '-', 271)], (35, 6, 271): [(303, 7, 271)], (271, 0, 271): [('print', 1, 271)], (303, 2, 271): [(12, 3, 271), (12, 5, 271), (None, '-', 271)], (303, 4, 271): [(12, 3, 271), (12, 5, 271), (None, '-', 271)], (303, 7, 271): [(None, '-', 271), (12, 8, 271)], (303, 9, 271): [(12, 8, 271), (12, 10, 271), (None, '-', 271)], ('print', 1, 271): [(None, '-', 271), (303, 2, 271), (35, 6, 271)]}], |
Expansion
Trail works just fine for single self-contained rules. Because of implicit parallelism a rule like

```
R: A* B | A* C
```

requires neither advanced lookahead schemes, nor special left-factoring algorithms, nor backtracking. This is also the reason why TBP is used for regular expression matching. This comfort gets lost when more than one grammar rule is used:
```
R: A* B | D* C
D: A
```
Now we have a First/First conflict: although `A != D`, all A-reachables are also reachable from D. Both A and D are in the selection of R and the parser has to somehow decide which one to choose. The trick we apply is to embed D carefully in R, or as we say: expand R by D. Careful means that we substitute D in R by the right hand side of D, i.e. A, but in such a way that the link to D is preserved. As we will see, this is the deeper cause for the rule-node-type being the last entry of each state.
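Before going through the expansion itself, here is one way such a First/First check could be detected mechanically. Both function names and the assumption that a nid without an NFA entry of its own is a terminal are mine, not Trail's actual code:

```python
def first_set(nid, nfas, visited=None):
    """Terminals that can start a phrase derived from nid (a sketch)."""
    if visited is None:
        visited = set()
    if nid not in nfas:                   # assumption: no NFA => terminal
        return {nid}
    if nid in visited:                    # guard against recursive rules
        return set()
    visited.add(nid)
    start, nfa = nfas[nid][2], nfas[nid][3]
    first = set()
    for state in nfa[start]:
        if state[0] is not None:          # skip exit states
            first |= first_set(state[0], nfas, visited)
    return first

def first_first_conflict(selection, nfas):
    """Two states in a selection conflict if their first sets overlap."""
    for i, s in enumerate(selection):
        for t in selection[i + 1:]:
            if s[0] != t[0] and first_set(s[0], nfas) & first_set(t[0], nfas):
                return s, t
    return None
```

For the selection `[(A, 1, R), (D, 2, R)]` of R above, `first_set` yields `{A}` for both A and D, so the conflict is reported.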
In a somewhat idealized Trail NFA style where the node ids are symbols we derive the following two NFAs:
```
Trail NFA of R
--------------------
(R, 0, R): (A, 1, R), (D, 2, R)
(A, 1, R): (A, 1, R), (B, 3, R)
(D, 2, R): (D, 2, R), (C, 4, R)
(B, 3, R): (None, '-', R)
(C, 4, R): (None, '-', R)

Trail NFA of D
--------------------
(D, 0, D): (A, 1, D)
(A, 1, D): (None, '-', D)
```
Now we substitute the R state `(D, 2, R)` by `(A, 1, D)`. However, we have to embed the complete NFA, not just the states following `(D, 0, D)`.
```
Expanded Trail NFA of R
------------------------
(R, 0, R): (A, 1, R), (A, 1, D)
(A, 1, R): (A, 1, R), (B, 3, R)
(A, 1, D): (A, 1, D), (None, '-', D)
(B, 3, R): (None, '-', R)
(C, 4, R): (None, '-', R)
(None, '-', D): (C, 4, R)
```
This is still not correct. The exit symbol `(None, '-', D)` of D must not occur in R because we want to continue in R after D, not leave it. Instead we create a new type of state, a transient or glue state, which has the form `(D, '.', 5, R)`.
The dot '.' indicates that the state is a transient state. The follow states of `(D, '.', 5, R)` are precisely the follow states of `(D, 2, R)`. So the correct embedding of D in R looks like
```
Expanded Trail NFA of R
------------------------
(R, 0, R): (A, 1, R), (A, 1, D)
(A, 1, R): (A, 1, R), (B, 3, R)
(A, 1, D): (A, 1, D), (D, '.', 5, R)
(B, 3, R): (None, '-', R)
(C, 4, R): (None, '-', R)
(D, '.', 5, R): (C, 4, R)
```
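The whole substitution can be condensed into a few lines of Python. This is a minimal sketch of the embedding step under the data representation described above, not EasyExtend's actual implementation; in this version the D* loop is routed through the transient state instead of being attached to `(A, 1, D)` directly, which yields an equivalent automaton:

```python
def expand(nfa_r, nfa_d, d_state, new_index):
    """Embed the NFA of rule D into the NFA of rule R at d_state,
    e.g. d_state = (D, 2, R). Both NFAs map states to follow lists."""
    D, R = d_state[0], d_state[-1]
    transient = (D, '.', new_index, R)
    d_start = next(s for s in nfa_d if s[1] == 0)
    entry = nfa_d[d_start]                  # states following D's start state

    def subst(follow):
        # Replace d_state by D's entry states in a follow list.
        new = []
        for s in follow:
            if s == d_state:
                new.extend(entry)
            else:
                new.append(s)
        return new

    expanded = {}
    for state, follow in nfa_r.items():
        if state != d_state:                # d_state itself disappears
            expanded[state] = subst(follow)
    for state, follow in nfa_d.items():
        if state != d_start:                # D's start state is not embedded
            # reroute D's exit state to the transient state
            expanded[state] = [transient if s[0] is None else s
                               for s in follow]
    # the transient state inherits the follow states of d_state
    expanded[transient] = subst(nfa_r[d_state])
    return expanded
```

Running this on the two NFAs above yields `(A, 1, D): [(D, '.', 5, R)]` and `(D, '.', 5, R): [(A, 1, D), (C, 4, R)]`, with all other transitions unchanged.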
Reconstruction of parse trees from traces
Transient states never cause First/First conflicts and they also do not produce new selections. They are nevertheless preserved in the state set traces and used for reconstructing parse trees from traces.
Assume we use the expanded NFA we have created above to parse the sentence `A A C`. This yields the following sequence of state sets:
```
(R, 0, R)
  → R → [(A, 1, R), (A, 1, D)]
  → A → [(A, 1, R), (B, 3, R), (A, 1, D), ((C, 4, R) ← (D, '.', 5, R))]
  → A → [(A, 1, R), (B, 3, R), (A, 1, D), ((C, 4, R) ← (D, '.', 5, R))]
  → C → (None, '-', R)
```
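For completeness, here is roughly how one such state set transition could be computed; transient states are traversed immediately, as stated above, but a real implementation also records through which transient a state was reached (the `←` links in the trace). A sketch, not Trail's code:

```python
def step(state_set, nid, nfa):
    """One TBP step (sketch): select the states matching the shifted nid
    and move on to their follow states."""
    selected = [s for s in state_set if s[0] == nid]
    next_set = []
    for s in selected:
        for f in nfa[s]:
            if len(f) == 4:               # transient state: pass through;
                next_set.extend(nfa[f])   # Trail keeps the link in the trace
            else:
                next_set.append(f)
    return next_set
```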
From this stateset-selection sequence we can reconstruct the actual state sequence. We know that the last entry of a state is the node-type of a rule. Hence we also know the containing NFA of a state even when it is embedded. We now move backwards in the state set sequence from `(None, '-', R)` to `(R, 0, R)`. By changing the orientation in the proposed manner we enter the sub-NFA D through the transient state `(D, '.', 5, R)` and leave it only when no state of rule-node-type D is left in any of the state sets, which happens at the initial state `(R, 0, R)` in this case.
So we get
```
(None, '-', R) ← (C, 4, R) ← (D, '.', 5, R) ← (A, 1, D) ← (A, 1, D) ← (R, 0, R)
```
From this state sequence we reconstruct the parse tree. First we mark all states that are D states (set off with asterisks below) and wrap their node ids into D:
```
(None, '-', R) ← (C, 4, R) ← *(D, '.', 5, R)* ← *(A, 1, D)* ← *(A, 1, D)* ← (R, 0, R)
(None, '-', R) ← (C, 4, R) ← [D, A, A] ← (R, 0, R)
```
Then we wrap the D-tree and the remaining C state into R:
```
[R, [D, A, A], C]
```
Voila!
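For the curious, the wrapping steps can be played through in a few lines. This is a toy version that handles exactly this example, with my own naming; EasyExtend's real reconstruction has to cope with nested and repeated embeddings:

```python
def reconstruct(backward_states):
    """Build a nested list from a backward state sequence (toy sketch)."""
    states = list(reversed(backward_states))    # start state first
    stack = [[states[0][-1]]]                   # open a tree for rule R
    for state in states[1:]:
        if state[0] is None:                    # exit state: we are done
            break
        if len(state) == 4:                     # transient, e.g. (D,'.',5,R)
            sub = stack.pop()                   # close the embedded D tree
            stack[-1].append(sub)
            continue
        if state[-1] != stack[-1][0]:           # entering an embedded rule
            stack.append([state[-1]])
        stack[-1].append(state[0])              # add the shifted node id
    return stack[0]

trace = [(None, '-', 'R'), ('C', 4, 'R'), ('D', '.', 5, 'R'),
         ('A', 1, 'D'), ('A', 1, 'D'), ('R', 0, 'R')]
print(reconstruct(trace))                       # ['R', ['D', 'A', 'A'], 'C']
```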
Limitations of expansion
We have seen how to apply NFA expansion and how to reconstruct parse trees from state set sequences. This process is linear in the length of the input, but the actual effort is hard to estimate in advance because it depends on the number of states in the state sets.
Notice also that NFA embeddings require an additional shift of the state indices. That's because an NFA can be embedded multiple times and the embeddings must not overlap.
This technique raises the question of its limitations quite naturally.
One obvious limitation is cyclic or infinite expansion. Suppose we expand D in R but this time D is a recursive rule and contains a state `(D, k, D)`. This is not necessarily harmful as long as `(D, k, D)` doesn't cause a First/First conflict that leads to another expansion of D in R. Expansion cycles can easily be detected and in those cases expansion is aborted and the original NFA restored.
Another situation is more comparable to stack overflows: the expansion process continues on and on. Without running into a cycle it produces thousands of states and it is not clear when and how it terminates. Trail sets an artificial limit of 1500 states per NFA. If this number is exceeded, expansion is terminated and the original NFA restored.
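Both safeguards can be pictured as a thin wrapper around the `expand()` sketch from above. The bookkeeping here is my guess at one possible implementation, not Trail's actual code:

```python
MAX_STATES = 1500              # Trail's artificial per-NFA limit

def guarded_expand(nfa_r, nfa_d, d_state, new_index, expanded_rules):
    """Abort expansion on cycles or runaway growth (hypothetical wrapper)."""
    D = d_state[0]
    if D in expanded_rules:    # cycle: D was already expanded into this NFA
        return None            # caller restores the original NFA
    expanded_rules.add(D)
    result = expand(nfa_r, nfa_d, d_state, new_index)
    if len(result) > MAX_STATES:   # runaway growth, the "stack overflow" case
        return None
    return result
```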
Another disadvantage of expansion is that error reporting becomes harder because states become more difficult to localize within the original grammar.
Brute force as a last resort
If expansion is interrupted and the original unexpanded NFA is restored, the only chance to deal with First/First conflicts without changing the grammar manually is to use backtracking. Trail will have to check out every possible trace and select the longest one. In the past I tried to break `(D, k, D)` into two new transient states `(D, '(', k, D)` and `(D, ')', k, D)`. The approach was interesting but made maintenance and debugging harder and there were still corner cases that were not treated properly.
What’s next?
In the next article I'll leave the level of groundwork and talk about the NFALexer. The NFALexer is actually just another context-free TBP. It replaces regular expression based lexical analysis in Trail, which has some positive but also some negative aspects. We have to talk about both.