Trace Based Parsing (Part V) – From Grammars to NFAs

In this article I will demonstrate the translation process by which an EBNF grammar is turned into a Trail NFA. There are several notations for EBNF that differ only in minor details. I've chosen the one of Python's Grammar file, which can roughly be described using the following set of EBNF rules:

file_input: ( RULE | NEWLINE )* ENDMARKER
RULE: NAME ':' RHS NEWLINE
RHS: ALT ( '|' ALT )*
ALT: ITEM+
ITEM: '[' RHS ']' | ATOM [ '*' | '+' ]
ATOM: '(' RHS ')' | NAME | STRING

The grammar is self-describing. Starting with the non-terminal `file_input` you can parse the grammar using its own production rules. The multiplicities the EBNF grammar uses are:

  • [X] – zero or one occurrence of X
  • X* – zero or more occurrences of X
  • X+ – one or more occurrences of X

EasyExtend defines a langlet called `grammar_langlet` used to parse grammars. With the `grammar_langlet` EBNF grammars become just ordinary languages supported by EE. The parse trees created from grammars will be transformed into NFAs.

From EBNF parse trees to Trail NFAs

The `grammar_langlet` defines transformations on parse trees. More concretely, it wraps pieces of the grammar into `Rule` objects. `Rule` objects are basically list wrappers with an additional `copy` method, which is not of interest here.

class Rule(object):
    def __init__(self, lst):
        self.lst = lst
 
    def copy(self):
        ...

From `Rule` we derive several subclasses, which can be distinguished using `issubclass`:

class ConstRule(Rule): ...
class SequenceRule(Rule): ...
class AltRule(Rule): ...
class EmptyRule(Rule): ...

What you won't find are rules like `MaybeRule` or `ZeroOrMoreRule`, because we eliminate multiplicities entirely. This will be discussed later.

Names and strings are wrapped into `ConstRule` objects as follows:

ConstRule([('file_input', 0)])
ConstRule([('"("', 1)])
ConstRule([('RULE', 2)])
...

In this scheme we already recover the indices that directly translate to states of Trail NFAs. A special case of a `ConstRule` is the rule that wraps `None`:

ConstRule([(None, '-')])

The rules of type `SequenceRule` and `AltRule` serve as containers for other rules.
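
To illustrate the nesting, here is how the right-hand side of a hypothetical rule `R: A ( B | C )` could be represented with these containers. This is only a sketch; the actual wrapping is produced by the `grammar_langlet` transformations, and the trailing exit state is an assumption consistent with the NFAs shown below:

rhs = SequenceRule([
    ConstRule([('A', 1)]),              # first symbol of the sequence
    AltRule([ConstRule([('B', 2)]),     # the two alternatives
             ConstRule([('C', 3)])]),
    ConstRule([(None, '-')]),           # exit state terminating the rule
])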

Elimination of multiplicities

We apply a trick to get rid of rules with multiplicities, i.e. rules defined by X*, X+ and [X]. All of them can be reduced to special combinations of our basic rule set. Before we show the details we need to introduce the `EmptyRule` type. Apart from the unique exit state `(None, '-')` there may be several other empty states `(None, nullidx)`. Just like the constant rules those empty states get enumerated, but with an index of their own, and just like any state except the exit state they may occur on the left-hand side of an NFA transition. The idea of using those empty states is best demonstrated with a simple example. Consider the following rule:

`R: A [B]`

We can manually translate it into the NFA

(R, 0): [(A, 1)]
(A, 1): [(B, 2), (None, '-')]
(B, 2): [(None, '-')]

However, the translation becomes much simpler when we reify empty productions and insert them first:

(R, 0): [(A, 1)]
(A, 1): [(B, 2), (None, 1)]
(None, 1): [(None, '-')]
(B, 2): [(None, '-')]

What we have effectively done here is to interpret `[B]` as `B | None`. In a second step we eliminate the indexed empty states again, which leads back to NFAs of the first form. Removing `A*` and `A+` is similar but a little more tricky. We write

A*  → None | A | A A
A+  → A | A A

This looks wrong, but only if we think about those transitions in terms of an indexing scheme that assigns a different numeral to each symbol occurring in the rule. Here, however, all `A` on the right-hand side shall be indistinguishable and lead to identical states. With this modification of the translation semantics in mind we translate the grammar rule `R: A*` into

(R, 0): [(A, 1), (None, 1)]
(A, 1): [(A, 1), (None, '-')]
(None, 1): [(None, '-')]

Eliminating the indexed empty state leads to this NFA

(R, 0): [(A, 1), (None, '-')]
(A, 1): [(A, 1), (None, '-')]

which is just perfect.

The reduction of rules with multiplicities to sequences and alternatives can be applied to `Rule` objects as well:

[A] →  AltRule([EmptyRule([(None, nullidx)]), A])
A*  →  AltRule([EmptyRule([(None, nullidx)]), A, SequenceRule([A, A])])
A+  →  AltRule([A, SequenceRule([A, A])])
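
These reductions are easy to express as functions over `Rule` objects. The following is a minimal sketch (the function names are mine, not EasyExtend's); note that both occurrences of `A` inside `SequenceRule([A, A])` are the very same object, which is exactly what makes them translate to identical states:

def expand_maybe(A, nullidx):
    # [A]  ->  None | A
    return AltRule([EmptyRule([(None, nullidx)]), A])

def expand_star(A, nullidx):
    # A*  ->  None | A | A A ; the shared object A yields a self loop
    return AltRule([EmptyRule([(None, nullidx)]), A, SequenceRule([A, A])])

def expand_plus(A):
    # A+  ->  A | A A
    return AltRule([A, SequenceRule([A, A])])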

Since we have found a simple way to express grammars by rule trees containing only the mentioned rule types, we can go on and flatten those trees into a tabular form.

Computing NFAs from rule trees

First we define an auxiliary function called `end()`. For a rule object `R` this function computes the set of constant or empty rules that terminate `R`. This means `end(R)` contains the last symbols of `R`, wrapped into rule objects.

from functools import reduce  # not needed on Python 2, where reduce is a builtin

def end(R):
    # the set of constant/empty rules that terminate R
    if not R.lst:
        return set()
    if isinstance(R, (ConstRule, EmptyRule)):
        return set([R])
    elif isinstance(R, AltRule):
        # every alternative contributes its own end points
        return reduce(lambda x, y: x.union(y), [end(U) for U in R.lst])
    else:  # SequenceRule: only the last element terminates the sequence
        return end(R.lst[-1])
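
A quick check with some hypothetical rules: for a sequence the end points come from its last element, while an alternative contributes the end points of all its branches.

A = ConstRule([('A', 1)])
B = ConstRule([('B', 2)])
C = ConstRule([('C', 3)])
S = SequenceRule([A, AltRule([B, C])])
assert end(S) == set([B, C])   # the sequence may terminate with B or with C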

With this function we can implement the main algorithm.

Suppose you have a `SequenceRule([S1, ..., Sk])`. The end points of each `S[i]` are computed by applying `end(S[i])` and shall be connected with the start points of `S[i+1]`; those start points form the follow sets of `S[i]`'s end points. If `S[i+1]` is a constant or empty rule we are done. If `S[i+1]` is an `AltRule` we compute the connection of `S[i]` with each entry of `S[i+1]`. If `S[i+1]` is a `SequenceRule` we call the connection-building algorithm recursively with `S[i]` prepended to the sub-rules of `S[i+1]`. In finitely many steps we always find the complete set of constant/empty rules `S[i]` can be connected with. Here is the Python implementation:

def build_nfa(rule, start=None, nfa=None):
    if not start:
        nfa = {}
        # rule.lst[0] == (rule name, 0)
        # rule.lst[1] == SequenceRule(...)
        start = set([ConstRule([rule.lst[0]])])
        return build_nfa(rule.lst[1], start, nfa)
    if isinstance(rule, SequenceRule):
        # connect each sub-rule with the end points of its predecessor
        for R in rule.lst:
            build_nfa(R, start, nfa)
            start = end(R)
    elif isinstance(rule, AltRule):
        # every alternative is connected with the same start set
        for R in rule.lst:
            build_nfa(R, start, nfa)
    else:  # ConstRule or EmptyRule
        # enter this rule's state into the follow set of each start state
        r = rule.lst[0]
        for s in start:
            state = s.lst[0]
            follow = nfa.get(state, set())
            follow.add(r)
            nfa[state] = follow
    return nfa
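
We can verify the implementation against the `R: A [B]` example from above. The following sketch assumes, as stated in the comments of `build_nfa`, that the top-level rule object carries the head state as its first element and the exit state `(None, '-')` as the last element of its sequence; `[B]` has already been expanded via `expand_maybe`:

A = ConstRule([('A', 1)])
B = ConstRule([('B', 2)])
R = Rule([('R', 0),
          SequenceRule([A,
                        expand_maybe(B, 1),           # [B] -> None | B
                        ConstRule([(None, '-')])])])  # exit state

nfa = build_nfa(R)
# {('R', 0):  set([('A', 1)]),
#  ('A', 1):  set([('B', 2), (None, 1)]),
#  (None, 1): set([(None, '-')]),
#  ('B', 2):  set([(None, '-')])}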

The last step is to remove indexed empty states `(None, idx)`.

def nfa_reduction(nfa):
    removable = []
    for (K, idx) in nfa:
        if K is None and idx != "-":
            # redirect all transitions into (None, idx) to its follow set
            F = nfa[(K, idx)]
            for follow in nfa.values():
                if (K, idx) in follow:
                    follow.remove((K, idx))
                    follow.update(F)
            removable.append((K, idx))
    # the indexed empty states are now unreachable and can be deleted
    for R in removable:
        del nfa[R]
    return nfa
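
Applied to the NFA built above, `nfa_reduction` removes `(None, 1)` and redirects `(A, 1)` straight to the exit state, reproducing the manual translation:

nfa = nfa_reduction(nfa)
# {('R', 0): set([('A', 1)]),
#  ('A', 1): set([('B', 2), (None, '-')]),
#  ('B', 2): set([(None, '-')])}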

This yields an NFA. In order to make it suitable for Trail one additionally has to map each state `(A, idx)` to `(A, idx, R)`, where `R` is the rule id.
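
A minimal sketch of this last step, assuming the rule id is simply attached as a third component (the helper name is hypothetical):

def attach_rule_id(nfa, R):
    # hypothetical helper: extend each state (A, idx) to (A, idx, R)
    result = {}
    for (A, idx), follow in nfa.items():
        result[(A, idx, R)] = set((B, jdx, R) for (B, jdx) in follow)
    return result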

What’s next?

So far all our discussions took place in the realm of nice context-free languages and their aspects. However, the real world is hardly that nice, and real language syntax can only be approximated by context-free formalisms. Context-sensitive languages are pervasive, ranging from Python and JavaScript to C, C++, Ruby and Perl. Instead of augmenting grammars with parser actions, EasyExtend factors context-sensitive actions out and lets the programmer hack in Python.
