Trace Based Parsing ( Part IV ) – Recursive NFAs

Not long ago I inserted the following piece of code in the `EasyExtend/trail/nfatracing.py` module which defines NFA tracers and cursors and deals with the reconstruction of parse trees from state set sequences.

class NFAInterpreter(object):
    def __init__(self, nfa):
        for state in nfa:
            if state[1] == 0:
                self.start = state
                break
        else:
            raise ValueError("Incorrect NFA - start state not found")
        self.nfa = ["", "", self.start, nfa]
 
    def run(self, tokstream):
        cursor = NFACursor(self.nfa, {})
        selection = cursor.move(self.start[0])
        for t in tokstream:
            cursor.set_token(t)
            cursor.move(t)
        return cursor.derive_tree([])

This little interpreter is basically for experimentation with advancements of Trail. The idea is to initialize a a single `NFAInterpreter` object with the transition table of an NFA. This table can be handcrafted and represents a concept. The `run` method keeps a token stream and uses it to iterate through the NFA. Finally a parse tree will be derived from a state sequence. This derivation process is defined in the method `derive_tree`of the class `NFAStateSetSequence` of the same module and it has to be adapted. I’ll show you an interesting example.

Recursive rules

Recursively defined rules are most likely a cause for Trail to turn into backtracking mode. I’ve had some ideas in the past on how to deal with recursion in Trail based on Trail NFAs. Here are some of my findings.

For the following discussion I’ll consider a single grammar rule which might be among the most simple albeit interesting rules that cannot be expanded:

R: a b [R] a c

The inner `R` causes a First/First conflict because `a b …` and `a c` are both possible continuations at the location of `[R]`. This conflict can’t be resolved by expansion because it would just be reproduced:

R: a b [a b [R] a c] a c

We do now translate R into an NFA.

{(R, 0, R): [(a, 1, R)],
 (a, 1, R): [(b, 2, R)],
 (b, 2, R): [(R, 3, R), (a, 4, R)],
 (R, 3, R): [(a, 4, R)],
 (a, 4, R): [(c, 5, R)],
 (c, 5, R): [(None, '-', R)]
}

What I intend to introduce now is some sort of tail recursion elimination ( despite Guidos verdict about it ). This is done by the introduction of two new control characters for NFA states. Remember that a state `(A, ‘.’, 23, B)` contains the control character ‘.’ which is created when an NFA `A` is embedded in another NFA `B`. Transient states like this containing control characters don’t produce own nid selections but get nevertheless recorded and stored in state set sequences and are important in the derivation process of parse trees. The idea is to replace the inner state `(R, 3, R)` by a pair of transient states `(R, ‘(‘, 0, R)` and `(R, ‘)’, 0, R)`.

The transient state `(R, ‘(‘, 0, R)` acts like a pointer to the beginning of the rule whereas the state `(R, ‘)’, 0, R)` is close to the end only followed by the exit state `(None, ‘-‘, R)` and the states following the formerly embedded `(R, 3, R)`. Here is what we get:

{(R, 0, R): [(R, '(', 0, R)],
(R, '(', 0, R): [(a, 1, R)],
(a, 1, R): [(b, 2, R)],
(b, 2, R): [(R, '(', 0, R), (a, 4, R)],
(a, 4, R): [(c, 5, R)],
(c, 5, R): [(R, ')', 0, R)],
(R, ')', 0, R): [(None, '-', R), (a, 4, R)]
}

A correct parse created with this rule implies an exact pairing of opening and closing parentheses. If we have passed through `(R, ‘(‘, 0, R)` n-times we also need to pass n-times though `(R, ‘)’, 0, R)`. But the NFA is constructed such that it can be left any time we have passed through `(R, ‘)’, 0, R)` because the exit state is its follow state. NFAs cannot count and the tracer can’t be used to rule out all incorrect parse trees.

The way parse trees created by those kind of rules are checked is directly at their (re-)construction from state sequences. This leads us into the next step after having introduced a new NFA scheme and new transient NFA states.

Customization of derive_tree method

The derive_tree() method of the `NFAStateSetSequence` is the location where parse trees are created from state lists. Here is also the place to define the behavior associated with transient states. We give a listing of the relevant section that deals with ‘(‘ and ‘)’ control characters in transient states.

def derive_tree(self, sub_trees):
    states = self.unwind() # creates state list from state set sequence
    root = states[0][0][0]
    tree = []
    rule = [root]
    cycle = []
    std_err_msg = "Failed to derive tree."
                  "Cannot build node [%s, ...]"%root
    for state, tok in states[1:]:
        nid  = state[0]
        link = state[-1]
        IDX  = state[1]
 
        if IDX == SKIP:
            ...
        # begin of '(' and ')' control character treatment
        # and their transient states
        elif IDX == '(':
            tree.append(state)
            cycle.append(state)
        elif IDX == ')':
            if cycle:
                rec = cycle.pop()
                if (rec[0], rec[2], rec[3]) != (state[0], state[2], state[3]):
                    raise ValueError(std_err_msg)
            else:
                raise ValueError(std_err_msg)
            for i in xrange(len(tree)-1, -1, -1):
                t_i = tree[i]
                if type(t_i) == tuple:
                    if t_i[1] == '(':
                        tree, T = tree[:i], tree[i+1:]
                        if tree:
                            T.insert(0, link)
                            tree.append(T)
                        else:
                            tree = T
                        break
            else:
                raise ValueError(std_err_msg)
        elif nid is None:
            if cycle:  # there must not be cycles left
                raise ValueError(std_err_msg)
            if type(tree[0]) == int:
                return tree
            else:
                tree.insert(0, link)
                return tree
 
        # end of '(' , ')'
        else:
            ...

Here you can download the complete code within its context. At the top the `nfatracing` module `CONTROL` characters are defined. The list can be extended by your own characters.

Testing our NFA

Let’s write some tests for our NFA interpreter:

R = 'R'
a, b, c = 'a', 'b', 'c'
nfa = {(R, 0, R): [(R, '(', 0, R)],
       (R, '(', 0, R): [(a, 1, R)],
       (a, 1, R): [(b, 2, R)],
       (b, 2, R): [(R, '(', 0, R), (a, 4, R)],
       (a, 4, R): [(c, 5, R)],
       (c, 5, R): [(R, ')', 0, R)],
       (R, ')', 0, R): [(None, '-', R), (a, 4, R)]
    }
interpreter = NFAInterpreter(nfa)
assert interpreter.run('abac') == ['R', 'a', 'b', 'a', 'c']
assert interpreter.run('ababacac') == ['R', 'a', 'b',
                                           ['R', 'a', 'b', 'a', 'c'],
                                            'a', 'c']

With those modifications being applied the parser still exists within O(n) bounds which is quite encouraging. How to go from here to a fully general treatment of recursion in NFAs? Here are some grammars which represent challenges:

Mutually recursive expansions:

A: a b [B] a c
B: a b [A] a d

Immediate left recursion:

A: A '+' A | a

Multiple occurrences of a rule in immediate succession

A: (A '+' A A) | a

In all the of mentioned cases left recursion conflicts can be removed quite trivially. In complex grammars, however, they can lurk somewhere and with NFA expansion you can’t be sure that they do not emerge at places where you didn’t expect them.

What’s next?

In the next article I’ll move a few steps back and show how to create Trail NFAs from EBNF grammars. EasyExtend 3 contained a rather complicated and obscure method due to my apparent inability to find a simple simple algorithm at the time of its implementation. Recently I found a much simpler and more reliable generation scheme that shall be demonstrated.

This entry was posted in EasyExtend, Grammars, TBP. Bookmark the permalink.