Pattern matching with TupleTrees

It seems that advanced dispatch schemes are currently being discussed under the pattern matching label. In this article I want to discuss another solution using a data structure called the `TupleTree` ( also known as prefix tree or trie ). The tuple tree solution comes closer to PEAK rules than to Marius Eriksen's pattern matching engine. I find it more elegant than the PEAK rules solution and there is less boilerplate. I can't say much about the generality of PEAK rules though and they might cover a lot more.

A `TupleTree` is an efficient way to store tuples by factoring out common prefixes. Suppose you have a set of tuples:

`{(a, b, c), (a, b, d), (a, c, d), (b, d, d)}` then you can store the same information using a tree structure

`{(a, (b, ((c,), (d,))), (c, d)), (b, d, d)}`

Searching in the tuple tree has complexity `O(log(n))` and can degenerate to `O(n)` if there isn't much to factorize.
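
One way to picture this storage scheme ( a minimal sketch that is independent of the `TupleTree` class defined below ) is a nesting of dicts, one level per tuple position:

# the four tuples above, stored as nested dicts
tuple_tree = {
    'a': {'b': {'c': {}, 'd': {}},
          'c': {'d': {}}},
    'b': {'d': {'d': {}}},
}

def contains(tree, tup):
    # walk one dict level per tuple element; the cost is bounded by len(tup)
    # ( a real implementation also needs an end marker per node; the
    #   TupleTree below uses its `key` attribute for that )
    for x in tup:
        if x not in tree:
            return False
        tree = tree[x]
    return True

print contains(tuple_tree, ('a', 'b', 'd'))   # True
print contains(tuple_tree, ('a', 'd', 'd'))   # False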

This wouldn't be too interesting in the discussion of pattern matching schemes if we didn't introduce two different kinds of wildcards or symbolic patterns called `ANY` and `ALL`.

ANY – a pattern that matches any symbol but with the lowest priority: if there is a tuple `(ANY, Y, …)` in the tuple tree then `(X, Y, …)` is matched by `(ANY, Y, …)` iff there isn’t a more specific matching tuple `(X, Y, … )` in the tree.

ALL – a pattern that matches any symbol but with the same priority as a more specific symbol. If there is a tuple `(ALL, Y, …)` in the tuple tree then `(X, Y, …)` is matched by `(ALL, Y, …)` and by a more specific tuple `(X, Y, … )` if present. This means `ALL` creates an ambiguity.

We can consider `ALL` as a variable and we eliminate the ambiguity using value substitution. Let's say we have a set of tuples `{(ANY, Y), (X, Y), (ALL, Z)}` then elimination of `ALL` leads to `{(ANY, Y), (ANY, Z), (X, Y), (X, Z)}` and the tuple tree `{(ANY, (Y,), (Z,)), (X, (Y,), (Z,))}`.

TupleTree implementation

First we define the mentioned patterns `ANY` and `ALL`

class Pattern:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return "<P:%s>"%self.name
 
ANY = Pattern("ANY")
ALL = Pattern("ALL")

Now we create the `TupleTree` object. The `TupleTree` implements two methods `insert` and `find`. The `insert` method takes a tuple and a key as parameters. It inserts the tuple and stores a `key` at the location of the tuple. The `find` method takes a tuple and returns a key if it was inserted at the location of the tuple.

class TupleTree(object):
    def __init__(self):
        self.branches = {}
        self.key = None
        self.all = None
 
    def insert(self, args, value):
        if len(args) == 0:
            self.key = value
            return
        first = args[0]
        if first == ALL:
            for node in self.branches.values():
                node.insert(args[1:], value)
            self.all = (args[1:], value)
        elif first in self.branches:
            node = self.branches[first]
            node.insert(args[1:], value)
            if self.all:
                node.insert(*self.all)
        else:
            tree  = TupleTree()
            self.branches[first] = tree
            tree.insert(args[1:], value)
            if self.all:
                tree.insert(*self.all)
 
    def find(self, args):
        first = args[0]
        if first in self.branches:
            node = self.branches[first]
        elif ANY in self.branches:
            node = self.branches[ANY]
        else:
            # no matching branch: signal the failure to the caller
            return None
        if len(args) == 1:
            return node.key
        else:
            return node.find(args[1:])
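
A quick usage sketch of the `TupleTree` on its own ( the inserted values here are arbitrary keys; the `Dispatcher` below stores handler functions instead ):

tt = TupleTree()
tt.insert(("a", "b"), 1)
tt.insert((ANY, "b"), 2)

print tt.find(("a", "b"))    # 1 - the more specific tuple wins
print tt.find(("x", "b"))    # 2 - falls back to the ANY branch
print tt.find(("x", "c"))    # None - no matching tuple at all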

The Dispatcher

It is easy to define a dispatcher that matches argument tuples against a tuple in a `TupleTree`. Handler functions which are decorated by a `Dispatcher` object are stored as tuple tree keys. Those handler functions are retrieved from the `TupleTree` when the `apply` method is called with concrete arguments.

class Dispatcher(object):
    def __init__(self):
        self.ttree = TupleTree()
 
    def __call__(self, *args):
        def handler(f):
            self.ttree.insert(args, f)
            return f
        return handler
 
    def apply(self, *args):
        handler = self.ttree.find(args)
        if not handler:
            raise ValueError("Failed to find handler that matches arguments")
        else:
            return handler(*args)

Example

As an example we create a new `Dispatcher` object and decorate handler functions using it.

alt = Dispatcher()
 
@alt("/", ANY)
def not_a_resource(path, method):
    print "not a resource"
 
@alt(ANY, "GET")
def retrieve_resource(path, method):
    print "retrieve resource"
 
@alt(ANY, "POST")
def update_resource(path, method):
    print "update resource", path
 
@alt(ALL, "PUT")
def create_new_resource(path, method):
    print "create new resource", path
 
@alt(ANY, ANY)
def invalid_request(path, method):
    print "invalid request", path

Notice that the `create_new_resource` handler is called whenever the HTTP method `PUT` is passed, even when the path is the root path "/". This is caused by the `ALL` pattern in the first argument. For all other methods on the root path a "not a resource" message is printed.

>>> alt.apply("/home", "PUT")
create new resource /home
>>> alt.apply("/", "PUT")
create new resource /
>>> alt.apply("/", "GET")
not a resource
>>> alt.apply("/home", "GET")
retrieve resource /home
>>> alt.apply("/home", "PAUSE")
invalid request /home

Trace Based Parsing (Part V) – From Grammars to NFAs

In this article I will demonstrate the translation process by which an EBNF grammar is translated into a Trail NFA. There are several different notations for EBNF that differ only in minor details. I've chosen the one of Python's Grammar file, which can roughly be described using the following set of EBNF rules:

file_input: ( RULE | NEWLINE )* ENDMARKER
RULE: NAME ':' RHS NEWLINE
RHS: ALT ( '|' ALT )*
ALT: ITEM+
ITEM: '[' RHS ']' | ATOM [ '*' | '+' ]
ATOM: '(' RHS ')' | NAME | STRING

The grammar is self-describing. Starting with the non-terminal `file_input` you can parse the grammar using its own production rules. The multiplicities the EBNF grammar uses are:

  • [X] – zero or one time X
  • X* – zero or more times X
  • X+ – one or more times X

EasyExtend defines a langlet called `grammar_langlet` used to parse grammars. With the `grammar_langlet` EBNF grammars become just ordinary languages supported by EE. The parse trees created from grammars will be transformed into NFAs.

From EBNF parse trees to Trail NFAs

The `grammar_langlet` defines transformations on parse trees. More concretely it wraps pieces of the grammar into `Rule` objects. `Rule` objects are basically list wrappers with an additional `copy` method which is not of interest here.

class Rule(object):
    def __init__(self, lst):
        self.lst = lst
 
    def copy(self):
        ...

From `Rule` we derive several subclasses which can be reflected by `issubclass`:

class ConstRule(Rule): ...
class SequenceRule(Rule): ...
class AltRule(Rule): ...
class EmptyRule(Rule): ...

What you don't find are rules like `MaybeRule` or `ZeroOrMoreRule` because we eliminate multiplicities entirely. This will be discussed later.

Names and strings are wrapped into `ConstRule` objects as follows:

ConstRule([('file_input', 0)])
ConstRule([('"("', 1)])
ConstRule([('RULE', 2)])
...

In this scheme we already recover the indices that directly translate to states of Trail NFAs. A special case of a `ConstRule` is the rule that wraps `None`:

ConstRule([('None', '-')])

The rules of type `SequenceRule` and `AltRule` serve as containers for other rules.

Elimination of multiplicities

We apply a trick to get rid of rules with multiplicities i.e. rules defined by X*, X+ and [X]. All of them can be reduced to special combinations of our basic rule set. Before we show the details we need to introduce the `EmptyRule` type. Apart from the unique exit state `(None, '-')` there might be several other empty states `(None, nullidx)`. Just like the constant rules those empty states get enumerated but with their own index, and just like any state but the exit state they might occur on the left-hand side of a rule. The idea of using those empty states can be demonstrated with a simple example. Consider the following rule:

`R: A [B]`

We can manually translate it into the NFA

(R, 0): [(A, 1)]
(A, 1): [(B, 2), (None, '-')]
(B, 2): [(None, '-')]

However the translation becomes much simpler when we reify empty productions and insert them first:

(R, 0): [(A, 1)]
(A, 1): [(B, 2), (None, 1)]
(None, 1): [(None, '-')]
(B, 2): [(None, '-')]

What we have effectively done here is to interpret `[B]` as `B | None`. In a second step we eliminate the indexed empty states again which will lead to NFAs of the first form. Removing `A*` and `A+` is similar but a little more tricky. We write

A*  → None | A | A A
A+  → A | A A

This looks wrong but only when we think about those transitions in terms of an indexing scheme that assigns different numerals to each symbol that occurs in the rule. But now all `A` on the right hand side shall actually be indistinguishable and lead to identical states. With this modification of the translation semantics in mind we translate the grammar rule `R: A*` into

(R, 0): [(A, 1), (None, 1)]
(A, 1): [(A, 1), (None, '-')]
(None, 1): [(None, '-')]

Eliminating the indexed empty state leads to this NFA

(R, 0): [(A, 1), (None, '-')]
(A, 1): [(A, 1), (None, '-')]

which is just perfect.

The reduction of rules with multiplicities to sequences and alternatives can be applied to `Rule` objects as well:

[A] →  AltRule( EmptyRule([(None, nullidx)]), A)
A*  →  AltRule( EmptyRule([(None, nullidx)]), A, SequenceRule([A, A]))
A+  →  AltRule( A, SequenceRule([A, A]))
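
A minimal sketch of this rewriting step could look as follows ( the helper name and the string tags are made up for illustration; the `Rule` subclasses are the ones defined above, and both occurrences of `A` in the sequences deliberately refer to the same object so that they are later mapped onto identical NFA states ):

def eliminate_multiplicity(kind, A, nullidx):
    empty = EmptyRule([(None, nullidx)])
    if kind == 'maybe':                                  # [A]
        return AltRule([empty, A])
    elif kind == 'star':                                 # A*
        return AltRule([empty, A, SequenceRule([A, A])])
    elif kind == 'plus':                                 # A+
        return AltRule([A, SequenceRule([A, A])])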

Since we have found a simple way to express grammars by rule trees containing only the mentioned rule types we can go on and flatten those trees into a tabular form.

Computing NFAs from rule trees

First we define an auxiliary function called `end()`. For a rule object `R` this function computes a set of constant or empty rules that terminate R. This means `end(R)` contains the last symbols of `R` wrapped into rule objects.

def end(R):
    if not R.lst:
        return set()
    if isinstance(R, (ConstRule, EmptyRule)):
        return set([R])
    elif isinstance(R, AltRule):
        return reduce(lambda x, y: x.union(y), [end(U) for U in R.lst])
    else:  # SequenceRule
        S = R.lst[-1]
        return end(S)
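
For example, for the rule `R: A [B]` from above ( with made up indices ) the rule tree can terminate either with `B` or with the empty rule:

A     = ConstRule([('A', 1)])
B     = ConstRule([('B', 2)])
empty = EmptyRule([(None, 1)])
R_rhs = SequenceRule([A, AltRule([B, empty])])

end(R_rhs)    # set([B, empty])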

With this function we can implement the main algorithm.

Suppose you have a `SequenceRule(S1, …, Sk)`. The end points of each `S[i]` are computed by application of `end(S[i])` and shall be connected with the start points of `S[i+1]`. Those starting points will build the follow sets of S[i]’s end points. If `S[i+1]` is a constant or empty rule we are done. If `S[i+1]` is an `AltRule` we compute the connection of `S[i]` with each entry in `S[i+1]`. If `S[i+1]` is a `SequenceRule` we recursively call our connection building algorithm with `S[i]` being prepended to the sub-rules of `S[i+1]`. In finitely many steps we always find the complete set of constant/empty rules `S[i]` can be connected with. Here is the Python implementation:

def build_nfa(rule, start = None, nfa = None):
    if not start:
        nfa = {}
        # rule.lst[0] == (rule name, 0)
        # rule.lst[1] == SequenceRule(...)
        start = set([ConstRule([rule.lst[0]])])
        return build_nfa(rule.lst[1], start, nfa)
    if isinstance(rule, SequenceRule):
        for i, R in enumerate(rule.lst):
            build_nfa(R, start, nfa)
            start = end(R)
    elif isinstance(rule, AltRule):
        for R in rule.lst:
            build_nfa(R, start, nfa)
    else: # ConstRule or EmptyRule
        r = rule.lst[0]
        for s in start:
            state = s.lst[0]
            follow = nfa.get(state, set())
            follow.add(r)
            nfa[state] = follow
    return nfa

The last step is to remove indexed empty states `(None, idx)`.

def nfa_reduction(nfa):
    removable = []
    for (K, idx) in nfa:
        if K is None and idx!="-":
            F = nfa[(K, idx)]
            for follow in nfa.values():
                if (K, idx) in follow:
                    follow.remove((K, idx))
                    follow.update(F)
            removable.append((K, idx))
    for R in removable:
        del nfa[R]
    return nfa
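
Applied to the NFA of `R: A [B]` from above ( sketched here with plain tuples as node ids ) the reified empty state disappears again:

nfa = {('R', 0): set([('A', 1)]),
       ('A', 1): set([('B', 2), (None, 1)]),
       (None, 1): set([(None, '-')]),
       ('B', 2): set([(None, '-')])}

nfa_reduction(nfa)
# {('R', 0): set([('A', 1)]),
#  ('A', 1): set([('B', 2), (None, '-')]),
#  ('B', 2): set([(None, '-')])}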

This yields an NFA. In order to make it suitable for Trail one additionally has to map each state `(A, idx)` to `(A, idx, R)` with the rule id `R`.
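
A tiny sketch of that last step ( the function name is made up; it just decorates every state and follow state with the rule id ):

def attach_rule_id(nfa, R):
    return dict(((A, idx, R), [(B, j, R) for (B, j) in follow])
                for (A, idx), follow in nfa.items())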

What’s next?

So far all our discussions were placed in the realm of nice context free languages and their aspects. However the real world is hardly that nice and real language syntax can only be approximated by context free formalisms. Context sensitive languages are pervasive and they range from Python over Javascript to C, C++, Ruby and Perl. Instead of augmenting grammars with parser actions EasyExtend factors context sensitive actions out and lets the programmer hack in Python.


Trace Based Parsing ( Part IV ) – Recursive NFAs

Not long ago I inserted the following piece of code in the `EasyExtend/trail/nfatracing.py` module which defines NFA tracers and cursors and deals with the reconstruction of parse trees from state set sequences.

class NFAInterpreter(object):
    def __init__(self, nfa):
        for state in nfa:
            if state[1] == 0:
                self.start = state
                break
        else:
            raise ValueError("Incorrect NFA - start state not found")
        self.nfa = ["", "", self.start, nfa]
 
    def run(self, tokstream):
        cursor = NFACursor(self.nfa, {})
        selection = cursor.move(self.start[0])
        for t in tokstream:
            cursor.set_token(t)
            cursor.move(t)
        return cursor.derive_tree([])

This little interpreter is basically for experimentation with advancements of Trail. The idea is to initialize a single `NFAInterpreter` object with the transition table of an NFA. This table can be handcrafted and represents a concept. The `run` method takes a token stream and uses it to iterate through the NFA. Finally a parse tree will be derived from a state sequence. This derivation process is defined in the method `derive_tree` of the class `NFAStateSetSequence` of the same module and it has to be adapted. I'll show you an interesting example.

Recursive rules

Recursively defined rules are the most likely cause for Trail to fall back into backtracking mode. I've had some ideas in the past on how to deal with recursion in Trail based on Trail NFAs. Here are some of my findings.

For the following discussion I'll consider a single grammar rule which might be among the simplest albeit interesting rules that cannot be expanded:

R: a b [R] a c

The inner `R` causes a First/First conflict because `a b …` and `a c` are both possible continuations at the location of `[R]`. This conflict can’t be resolved by expansion because it would just be reproduced:

R: a b [a b [R] a c] a c

We do now translate R into an NFA.

{(R, 0, R): [(a, 1, R)],
 (a, 1, R): [(b, 2, R)],
 (b, 2, R): [(R, 3, R), (a, 4, R)],
 (R, 3, R): [(a, 4, R)],
 (a, 4, R): [(c, 5, R)],
 (c, 5, R): [(None, '-', R)]
}

What I intend to introduce now is some sort of tail recursion elimination ( despite Guido's verdict about it ). This is done by the introduction of two new control characters for NFA states. Remember that a state `(A, '.', 23, B)` contains the control character '.' which is created when an NFA `A` is embedded in another NFA `B`. Transient states like this containing control characters don't produce own nid selections but get nevertheless recorded and stored in state set sequences and are important in the derivation process of parse trees. The idea is to replace the inner state `(R, 3, R)` by a pair of transient states `(R, '(', 0, R)` and `(R, ')', 0, R)`.

The transient state `(R, '(', 0, R)` acts like a pointer to the beginning of the rule whereas the state `(R, ')', 0, R)` is close to the end, only followed by the exit state `(None, '-', R)` and the states following the formerly embedded `(R, 3, R)`. Here is what we get:

{(R, 0, R): [(R, '(', 0, R)],
(R, '(', 0, R): [(a, 1, R)],
(a, 1, R): [(b, 2, R)],
(b, 2, R): [(R, '(', 0, R), (a, 4, R)],
(a, 4, R): [(c, 5, R)],
(c, 5, R): [(R, ')', 0, R)],
(R, ')', 0, R): [(None, '-', R), (a, 4, R)]
}

A correct parse created with this rule implies an exact pairing of opening and closing parentheses. If we have passed through `(R, '(', 0, R)` n-times we also need to pass n-times through `(R, ')', 0, R)`. But the NFA is constructed such that it can be left any time we have passed through `(R, ')', 0, R)` because the exit state is its follow state. NFAs cannot count and the tracer can't be used to rule out all incorrect parse trees.

The way parse trees created by those kind of rules are checked is directly at their (re-)construction from state sequences. This leads us into the next step after having introduced a new NFA scheme and new transient NFA states.

Customization of derive_tree method

The derive_tree() method of the `NFAStateSetSequence` is the location where parse trees are created from state lists. This is also the place to define the behavior associated with transient states. We give a listing of the relevant section that deals with '(' and ')' control characters in transient states.

def derive_tree(self, sub_trees):
    states = self.unwind() # creates state list from state set sequence
    root = states[0][0][0]
    tree = []
    rule = [root]
    cycle = []
    std_err_msg = "Failed to derive tree."
                  "Cannot build node [%s, ...]"%root
    for state, tok in states[1:]:
        nid  = state[0]
        link = state[-1]
        IDX  = state[1]
 
        if IDX == SKIP:
            ...
        # begin of '(' and ')' control character treatment
        # and their transient states
        elif IDX == '(':
            tree.append(state)
            cycle.append(state)
        elif IDX == ')':
            if cycle:
                rec = cycle.pop()
                if (rec[0], rec[2], rec[3]) != (state[0], state[2], state[3]):
                    raise ValueError(std_err_msg)
            else:
                raise ValueError(std_err_msg)
            for i in xrange(len(tree)-1, -1, -1):
                t_i = tree[i]
                if type(t_i) == tuple:
                    if t_i[1] == '(':
                        tree, T = tree[:i], tree[i+1:]
                        if tree:
                            T.insert(0, link)
                            tree.append(T)
                        else:
                            tree = T
                        break
            else:
                raise ValueError(std_err_msg)
        elif nid is None:
            if cycle:  # there must not be cycles left
                raise ValueError(std_err_msg)
            if type(tree[0]) == int:
                return tree
            else:
                tree.insert(0, link)
                return tree
 
        # end of '(' , ')'
        else:
            ...

Here you can download the complete code within its context. At the top of the `nfatracing` module the `CONTROL` characters are defined. The list can be extended with your own characters.

Testing our NFA

Let’s write some tests for our NFA interpreter:

R = 'R'
a, b, c = 'a', 'b', 'c'
nfa = {(R, 0, R): [(R, '(', 0, R)],
       (R, '(', 0, R): [(a, 1, R)],
       (a, 1, R): [(b, 2, R)],
       (b, 2, R): [(R, '(', 0, R), (a, 4, R)],
       (a, 4, R): [(c, 5, R)],
       (c, 5, R): [(R, ')', 0, R)],
       (R, ')', 0, R): [(None, '-', R), (a, 4, R)]
    }
interpreter = NFAInterpreter(nfa)
assert interpreter.run('abac') == ['R', 'a', 'b', 'a', 'c']
assert interpreter.run('ababacac') == ['R', 'a', 'b',
                                           ['R', 'a', 'b', 'a', 'c'],
                                            'a', 'c']

With those modifications applied the parser still operates within O(n) bounds which is quite encouraging. How do we go from here to a fully general treatment of recursion in NFAs? Here are some grammars which represent challenges:

Mutually recursive expansions:

A: a b [B] a c
B: a b [A] a d

Immediate left recursion:

A: A '+' A | a

Multiple occurrences of a rule in immediate succession:

A: (A '+' A A) | a

In all of the mentioned cases left recursion conflicts can be removed quite trivially. In complex grammars, however, they can lurk somewhere and with NFA expansion you can't be sure that they do not emerge at places where you didn't expect them.

What’s next?

In the next article I'll move a few steps back and show how to create Trail NFAs from EBNF grammars. EasyExtend 3 contained a rather complicated and obscure method due to my apparent inability to find a simple algorithm at the time of its implementation. Recently I found a much simpler and more reliable generation scheme that shall be demonstrated.


Trace Based Parsing ( Part III ) – NFALexer

The NFALexer gives me an uneasy feeling. It is the most questionable of Trail's components – one that cannot decide whether it wants to be a lexer that produces a token stream or a parser that produces parse trees. Right now it creates both and degrades parse trees to token streams.

Then there is also another component that actually does not belong to the Trail core anymore which is called the Postlexer. The Postlexer originates from the observation that Python's `tokenize.py` module implements a two pass lexical analysis. In the first pass the source code is chunked into pieces using regular expressions whereas in the second pass whitespace is extracted and `NEWLINE`, `INDENT` and `DEDENT` tokens are produced. In a certain sense the split into NFALexer and Postlexer can be understood as a generalization and clarification of that module. But why create a trace based lexer at all? Why not just stay faithful to regular expressions?

Issues with popular regexp implementations

When you take a look at `tokenize.py` you will observe code like this

...
Triple = group("[uU]?[rR]?'''", '[uU]?[rR]?"""')
String = group(r"[uU]?[rR]?'[^\n'\\]*(?:\\.[^\n'\\]*)*'",
               r'[uU]?[rR]?"[^\n"\\]*(?:\\.[^\n"\\]*)*"')
 
Operator = group(r"\*\*=?", r">>=?", r"<<=?", r"<>", r"!=",
                 r"//=?",
                 r"[+\-*/%&|^=<>]=?",
                 r"~")
Bracket = '[][(){}]'
Special = group(r'\r?\n', r'[:;.,`@]')
...

The group function will turn a sequence of regular expressions `s1`, `s2`, … into a new regexp `(s1 | s2 | …)`. This group building process will yield a big regular expression of this kind containing about 60-70 components.

What’s wrong with that kind of regular expressions for lexical analysis?

Python uses longest match or greedy tokenization. It matches as many characters as possible. So do regular expressions.

Do they?

The basic problem with Python's regular expression implementation is that `s1|s2|…` could match something entirely different than `s2|s1|…`.

For example a simple `Float` regexp can be defined as

`Float = group(r'\d+\.\d*', r'\.\d+')`

whereas an `Int` is just

`Int = r'[1-9]\d*'`

Finally we have a `Dot` token

`Dot = r"\."`

When we group those regular expressions as `group(Int, Dot, Float)` a float never gets matched as a `Float`: tokenizing `7.5` yields the token stream `Int, Dot, Int` not `Float`. Same with `group(Int, Float, Dot)`. But `group(Float, Int, Dot)` works as expected.
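
The effect is easy to reproduce ( a small sketch; `group` is written here like the helper in `tokenize.py`, and the orderings are simply run against the string "7.5" ):

import re

def group(*choices):
    return '(' + '|'.join(choices) + ')'

Int   = r'[1-9]\d*'
Dot   = r'\.'
Float = group(r'\d+\.\d*', r'\.\d+')

def tokenize(pattern, text):
    return [m.group(0) for m in re.finditer(pattern, text)]

print tokenize(group(Int, Dot, Float), "7.5")    # ['7', '.', '5']
print tokenize(group(Float, Int, Dot), "7.5")    # ['7.5']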

This behavior is called ordered choice and it is more of a misfeature than a feature. Ordered choice is o.k. for small regular expressions ( or PEGs ) where the programmer has complete oversight but it is not appropriate for 60 or more tokens and it is a design failure for systems like EasyExtend.

Regular expressions and trace based parsing

I briefly noted in part I of this article series that the original work on trace based parsing goes back to the late 60s and work by Ken Thompson on regular expression engines. So there is no intrinsic flaw to regular expressions but only some of their implementations. The implicit parallelism of TBP will just scan source code using `Int` and `Float` at the same time and the matched results will always correspond no matter how we group our matching rules.

Expansion in traced based lexers

The `Int`, `Float` example highlights a First/First conflict among different rules. Trail uses expansion to eliminate them. Actually expansion works as a conflict resolution strategy for all lexers I’ve examined. For that reason NFALexer lacks backtracking as a fall back.

Why the current NFALexer implementation is flawed

The NFALexer is slow. For each character the NFALexer produces a token and often the tokens are just thrown away again. Strings and comments might have a rich internal token structure but in the NFAParser we just want to use a single STRING token and comments are at best preserved to enable pretty printing.

The NFAParser for Python demands a single `NUMBER` token for all kinds of numbers. The token grammar defines about 12 different rules for numbers. That's perfectly o.k. as a structuring principle but we get far more structure than needed. It would be good if the NFA production for the lexer could adapt itself to the demand of the parser.

The weirdest aspect is the flattening of parse trees to tokens. This is caused by the fact that we want to pass Python parse trees to the compiler and the compiler can't deal with terminal symbols that are in fact not terminals at all but parse trees.

Conclusion: rewrite the NFALexer s.t. it becomes a proper lexer i.e. a function that takes a string and returns a token stream without producing any complex intermediate data structures. A trace based lexer should fly!

In the rest of the article we discuss some design aspects of the NFALexer.

Parser Token and Lexer Symbols

In the previous article about NFA expansion we have already touched the concept of a node id ( nid ). For each rule with rule name R there is a mapping R → nid(R) where nid(R) is a number used to identify R. The mapping expresses a unique relationship and it is not only unique for a particular EasyExtend langlet but it is intended to be unique across all of them. This means that each langlet has its own domain or range of node ids:

   L = LANGLET_OFFSET
   nid_range = range(L, L+1024)

The `LANGLET_OFFSET` is a number `L = offset_cnt*1024, offset_cnt = 0, 1, 2, … `. The `offset_cnt` will be read/stored in the file `EasyExtend\fs` and it gets incremented for each new langlet that is created.

A range of size 1024 might appear small but big programming languages have about 80 terminal symbols and 200-300 non-terminals. Even COBOL has not more than 230 rules. COBOL has a few hundred keywords but this won’t matter, since keywords are all mapped to a single token. Maybe Perl is bigger but I guess the problems with Perl are those of context sensitivity and everything that can be encoded about Perl in a context free way can also be expressed in EasyExtend.

The node id range is again split into two ranges: one for terminals ( tokens ) and another one for non-terminals ( symbols ):

    L = LANGLET_OFFSET
    token_range  = range(L, 256 + L)
    symbol_range = range(256 + L, 1024 + L)

The names "symbol" and "token" go back to Python's standard library modules `token.py` and `symbol.py`.

From lex_symbol to parse_token

Both NFALexer and NFAParser are parsers and use the same `LANGLET_OFFSET` and the same nid ranges. Since the terminal and non-terminal ranges coincide and we want to use the symbols of the NFALexer as the tokens of the NFAParser we have to apply a shift on the NFALexer symbols to turn them into NFAParser tokens:

token-nid NFAParser = symbol-nid NFALexer - 256

Comparing `lexdef/lex_symbol.py` with `parsedef/parse_token.py` for arbitrary langlets lets you check this relation.


Lexer Terminals

When symbols of the lexer are tokens of the parser how can we characterize the tokens of the lexer? The answer is simple: by individual characters. One can just define:

DIGIT: '0123456789'
CHAR: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
...

The disadvantage of this explicit notation is that it blows up the lexer NFAs and adds a new state for each character. Moreover we cannot write down the set of all characters and define a symbol that behaves like a regular expression dot that matches anything. So EasyExtend lets one define character sets separately.

Character sets

Character sets include sets for alphanumeric characters, whitespace, hexdigits and octdigits but also the set ANY used to represent any character. Unlike individual characters, character sets have a numerical token-id. Character sets are defined in the module `EasyExtend/lexertoken.py`.

Here is an example:

BaseLexerToken.tokenid.A_CHAR  = 11
BaseLexerToken.charset.A_CHAR  = set(string.ascii_letters+"_")

Each `lexdef/lex_token.py` contains a `LexerToken` object. `LexerToken` is a copy of the `BaseLexerToken` prototype. During the copying process the token-ids defined at `BaseLexerToken.tokenid.xxx` are shifted according to the langlet offset. `LexerToken` objects will be directly accessed and extracted by the NFALexer.

Below we give an example of a `LexerToken` extension. The langlet we use here is Teuton and the character sets we define are German umlauts.

# -*- coding: iso-8859-1 -*-
# You might create here a new LexerToken object and set or overwrite
# properties of the BaseLexerToken object defined in lexertoken.py
 
LANGLET_OFFSET = 38912
from EasyExtend.lexertoken import NewLexerToken
LexerToken = NewLexerToken(LANGLET_OFFSET)
LexerToken.tokenid.A_UMLAUT = LANGLET_OFFSET+20
LexerToken.charset.A_UMLAUT = set('\xe4\xf6\xfc\xc4\xd6\xdc\xdf\x84\x8e\x94\x99\x81\x9a\xe1')
LexerToken.tokenid.A_ue     = LANGLET_OFFSET+21
LexerToken.charset.A_ue     = set(['\x81\xfc'])

Beyond character sets

Character sets are proper set objects. However it is possible to use other set-like objects as well. An interesting yet unused set-like type is the universal set described in this article.

Postlexer token

There is another important class of tokens we have not touched yet. This class of tokens is already present in Python and some of these tokens contribute to Python's fame. Most prominently `INDENT` and `DEDENT`. But there are also `NEWLINE`, `ENDMARKER`, `NL` and `ERRORTOKEN`. What they have in common is that there is no pattern describing them. When the source code is chunked into pieces using regular expressions none of those chunks is assigned to the mentioned tokens. These tokens are produced in a secondary post processing step.

Same for EasyExtend. There are grammar rules for `INDENT`, `DEDENT`, `NEWLINE` etc. but on the right-hand-side we find terminals like `T_INDENT`, `T_DEDENT` and `T_NEWLINE` that are neither characters nor charsets. They are terminal symbols only in a formal sense used to pacify the parser generator. In EasyExtend postlexer tokens are produced and inserted into the token stream generated by the NFALexer. Their production is located in the Postlexer which is a function `Postlexer: TokenStream → TokenStream`.
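
As an illustration of the concept ( a toy sketch only, this is not EasyExtend's actual Postlexer API ) a postlexer pass that derives `NEWLINE`, `INDENT` and `DEDENT` tokens from the leading whitespace of each line could look like this:

def postlex(lines):
    indents = [0]
    out = []
    for line in lines:
        stripped = line.lstrip(' ')
        if not stripped:                    # blank line: nothing to emit
            continue
        width = len(line) - len(stripped)
        if width > indents[-1]:
            indents.append(width)
            out.append(('INDENT', ''))
        while width < indents[-1]:
            indents.pop()
            out.append(('DEDENT', ''))
        out.append(('LINE', stripped))      # stand-in for the real tokens of the line
        out.append(('NEWLINE', '\n'))
    while len(indents) > 1:                 # close all open blocks at the end
        indents.pop()
        out.append(('DEDENT', ''))
    return out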

Apart from the Postlexer there is another component called the Interceptor that is used to manipulate state sets on the fly during NFA parsing. Together the Postlexer and the Interceptor enable context sensitive parsing. It is interesting to note here that they both act on the level of lexical analysis and token production and there is no interference with the NFAParser. From a design point of view this is very pleasant because parsing is usually harder and more error prone than lexing and if context sensitivity can be captured during lexical or post-lexical analysis pressure is taken away from the parser.

Special token

In a final section we want to discuss three special tokens called `ANY`, `STOP` and `INTRON`.

INTRON

`INTRON` – on other occasions people also call it `JUNK`. I used a slightly less pejorative name. Although `INTRONs` do not encode any token used by the parser it can happen that the Postlexer still extracts information from them. For example Python's Postlexer can derive `INDENT`, `DEDENT` and `NEWLINE` tokens from an `INTRON`. Other languages like C have no use for them and just throw them away. An `INTRON` can be annotated to a token as an attribute for the purpose of exact source code reconstruction from parse trees.

ANY

`ANY` is the counterpart of a regexp dot. It just matches any character. `ANY` is implemented as a charset – the empty set. Since it is the empty set one could expect to run into First/First conflicts with every other character set and indeed it does! However the `NFALexer` has a hardcoded disambiguation strategy and according to this strategy `ANY` is always endowed with the lowest priority. So if {ANY, 'X', ';'} is in the selection and the current character is 'c' it is matched by ANY. But if it is 'X' it is matched by 'X' and the ANY trace will be discontinued.

STOP

Sometimes a production rule accepts a proper subset of the productions of another one. A popular example is the `NAME` rule

`NAME: A_CHAR+`

and it covers every production that specifies a particular name as a token:

`DEF: "def"`

So `DEF` is always also a `NAME` and if the trace terminates with "def" and does not continue with another character we have two traces and need to make a decision. The `STOP` terminal serves for disambiguation purposes. We can write

`DEF: "def" STOP`

If this happens and `NAME` cannot be continued `STOP` is selected. `STOP` does not match any character but extends the trace of `DEF` which means the `DEF` trace is selected, not `NAME`.

What’s next?

In the next article I’ll move to the borders of Trail and discuss recursion in NFAs and possible methods to deal with them.


Trace Based Parsing ( Part II ) – NFA Expansion

In the previous article about trace based parsing (TBP) I explained the basic concepts of TBP. I motivated TBP by parse tree validators that check the correctness of parse trees against grammars and I derived a trace based LL(1) parser. In this article I will leave familiar territory and will talk about the most important disambiguation technique in Trail which is called NFA expansion. Before I can get into this I have to advance the notation of Trail NFAs.

Trail NFAs

A Trail NFA is a dictionary of states. The keys of those dictionaries are single states and the values are lists of states. A single state is either a 3-tuple

`(node-type, index, rule-node-type)`

or a 4-tuple

`(node-type, control-symbol, index, rule-node-type)`.

( Notice that the only reason for this somewhat peculiar implementation of states is that they are easy to read and dump to text files. Furthermore field access on tuples is more efficient in Python than attribute access on objects. )

A `node-type` or `node-id` or `nid` is an integer value that represents a terminal or non-terminal in a parse tree. A parse tree is full of node ids. A node-type can also be `None` for the exit state of an NFA or it can be a string. All exit states are equal. If the nid is a string it is a language keyword. The index is just an arbitrary number that uniquely identifies a symbol in a grammar rule. If you have a grammar rule like the following

`R: (A | B ';' c*)+`

you can enumerate the symbols as such `(R, 0), (A, 1), (B, 2), (';', 3), (c, 4)`. The index of the rule symbol ( R in our example ) is always 0.

The `rule-node-type` is just the node id of the rule itself. The control-symbol will be discussed later. So the states are

`(R, 0, R), (A, 1, R), (B, 2, R), (';', 3, R), (c, 4, R)`.

In EasyExtend the Trail NFAs of grammar rules are again organized into dictionaries. The rule-node-type is the dictionary key and a list of entries is the value

 257: ['file_input: ( NEWLINE | stmt)* ENDMARKER',
       '',
       (257, 0, 257),
       {(0, 3, 257): [(None, '-', 257)],
        (4, 1, 257): [(266, 2, 257), (4, 1, 257), (0, 3, 257)],
        (257, 0, 257): [(266, 2, 257), (4, 1, 257), (0, 3, 257)],
        (266, 2, 257): [(266, 2, 257), (4, 1, 257), (0, 3, 257)]}],

The meaning of those entries is

  • the textual grammar rule description – `file_input: ( NEWLINE | stmt)* ENDMARKER`
  • the second entry is a relic and it might go away in EE 4.0
  • the start symbol of the NFA – `(257, 0, 257)`
  • the NFA itself

The NFA looks awkward at first sight but it is easy to verify that it correctly expresses the grammar rule once we have translated node ids to rule names. Since the rule is a Python rule we can decode the nids using the std library modules `symbol.py` for non-terminals and `token.py` for terminals.

>>> import symbol
>>> map(symbol.sym_name.get, [257, 266])
['file_input', 'stmt']
>>> import token
>>> map(token.tok_name.get, [0, 4])
['ENDMARKER', 'NEWLINE']

When we replace the states by the rule names mapped from the node ids we will get the following dict:

{ ENDMARKER: [None],
  NEWLINE: [stmt, NEWLINE, ENDMARKER],
  file_input: [stmt, NEWLINE, ENDMARKER],
  stmt: [stmt, NEWLINE, ENDMARKER]
}

With `file_input` as the start and None as the `exit` symbol it obviously expresses the correct transitions.

Here is a little more advanced NFA showing why using indices is quite reasonable

 271: ["print_stmt: 'print' ([test (',' test)* [',']] | '>>' test [(',' test)+ [',']])",
       '',
       (271, 0, 271),
       {(12, 3, 271): [(303, 4, 271)],
        (12, 5, 271): [(None, '-', 271)],
        (12, 8, 271): [(303, 9, 271)],
        (12, 10, 271): [(None, '-', 271)],
        (35, 6, 271): [(303, 7, 271)],
        (271, 0, 271): [('print', 1, 271)],
        (303, 2, 271): [(12, 3, 271), (12, 5, 271), (None, '-', 271)],
        (303, 4, 271): [(12, 3, 271), (12, 5, 271), (None, '-', 271)],
        (303, 7, 271): [(None, '-', 271), (12, 8, 271)],
        (303, 9, 271): [(12, 8, 271), (12, 10, 271), (None, '-', 271)],
        ('print', 1, 271): [(None, '-', 271), (303, 2, 271), (35, 6, 271)]}],

Expansion

Trail works just fine for single self-contained rules. Because of implicit parallelism a rule like

R: A* B | A* C

neither requires advanced lookahead schemes, special left factoring algorithms nor backtracking. This is also the reason why TBP is used for regular expression matching. This comfort gets lost when more than one grammar rule is used.

R: A* B | D* C
D: A

Now we have a First/First conflict because `A!=D` but all A-reachables are also reachable from D. Both A and D are in the selection of R and the parser has to somehow decide which one to choose. The trick we apply is to embed D carefully in R or as we say: expand R by D. Careful means that we substitute D in R by the right hand side of D i.e. A but in such a way that the link to D is preserved. As we will see this is the deeper cause for the rule-node-type being the last entry of each state.

In a somewhat idealized Trail NFA style where the node ids are symbols we derive the following two NFAs:

Trail NFA of R
--------------------
(R, 0, R): (A, 1, R), (D, 2, R)
(A, 1, R): (A, 1, R), (B, 3, R)
(D, 2, R): (D, 2, R), (C, 4, R)
(B, 3, R): (None, '-', R)
(C, 4, R): (None, '-', R)
 
Trail NFA of D
--------------------
(D, 0, D): (A, 1, D)
(A, 1, D): (None, '-', D)

Now we substitute the R state `(D, 2, R)` by `(A, 1, D)`. However we have to embed the complete NFA not just the states following `(D, 0, D)`.

Expanded Trail NFA of R
------------------------
(R, 0, R): (A, 1, R), (A, 1, D)
(A, 1, R): (A, 1, R), (B, 3, R)
(A, 1, D): (A, 1, D), (None, '-', D)
(B, 3, R): (None, '-', R)
(C, 4, R): (None, '-', R)
(None, '-', D): (C, 4, R)

This is still not correct. The exit symbol `(None, '-', D)` of D must not occur in R because we want to continue in R after D and do not leave it. Instead we create a new type of state – a transient or glue state which has the form: `(D, '.', 5, R)`

The dot '.' indicates that the state is a transient state. The follow states of `(D, '.', 5, R)` are precisely the follow states of `(D, 2, R)`. So the correct embedding of D in R looks like

Expanded Trail NFA of R
------------------------
(R, 0, R): (A, 1, R), (A, 1, D)
(A, 1, R): (A, 1, R), (B, 3, R)
(A, 1, D): (A, 1, D), (D, '.', 5, R)
(B, 3, R): (None, '-', R)
(C, 4, R): (None, '-', R)
(D, '.', 5, R): (C, 4, R)

Reconstruction of parse trees from traces

Transient states never cause First/First conflicts and they also do not produce new selections. They are nevertheless preserved in the state set traces and used for reconstructing parse trees from traces.

Assume we use the expanded NFA we have created above to parse the sentence `A A C`. This yields the following sequence of state sets:

(R, 0, R) → R → [(A, 1, R), (A, 1, D)]
          → A → [(A, 1, R), (B, 3, R), (A, 1, D), ((C, 4, R) ← (D, '.', 5, R))]
          → A → [(A, 1, R), (B, 3, R), (A, 1, D), ((C, 4, R) ← (D, '.', 5, R))]
          → C → (None, '-', R)

From this stateset-selection sequence we can reconstruct the actual state sequence. We know that the last entry of a state is the node-type of a rule. Hence we also know the containing NFA of a state even though it is embedded. We now move backwards in the state set sequence from `(None, '-', R)` to `(R, 0, R)`. By changing the orientation in this manner we enter the sub NFA D through the transient state `(D, '.', 5, R)` and leave it only when there is no state of rule-node-type D left anymore in any of the state sets, which happens at the final state `(R, 0, R)` in this case.

So we get

(None, '-', R) ← (C, 4, R) ← (D, '.', 5, R) ← (A, 1, D) ← (A, 1, D) ← (R, 0, R)

From this state sequence we reconstruct the parse tree. First we mark all states that are D states and wrap the node ids into D

(None, '-', R) ← (C, 4, R) ← (D, '.', 5, R) ← (A, 1, D) ← (A, 1, D) ← (R, 0, R)
(None, '-', R) ← (C, 4, R) ← [D, A, A] ← (R, 0, R)

Then we wrap the D-tree and the remaining C state into R:

[R, [D, A, A], C]

Voila!

Limitations of expansion

We have seen how to apply NFA expansion and how to reconstruct parse trees from state set sequences. This process is linear but the actual effort is hard to estimate because it depends on the number of states in the state sets.

Notice also that NFA embeddings require an additional shift of the state indices. That’s because it can happen that an NFA is embedded multiple times and the embeddings must not overlap.

This technique raises the question of its limitations quite naturally.

One obvious limitation is cyclic or infinite expansion. Suppose we expand D in R but this time D is a recursive rule and contains a state `(D, k, D)`. This is not necessarily harmful as long as `(D, k, D)` doesn't cause a First/First conflict that leads to another expansion of D in R. Expansion cycles can be easily detected and in those cases expansion is aborted and the original NFA restored.

Another situation is more comparable to stack overflows. The expansion process continues on and on. Without running into a cycle it produces thousands of states and it is not clear when and how it terminates. Trail sets an artificial limit of 1500 states per NFA. If this number is exceeded expansion is terminated and the original NFA restored.

Another disadvantage of expansion is that error reporting becomes harder because localization of states becomes worse.

Brute force as a last resort

If expansion is interrupted and the original unexpanded NFA is restored the only chance to deal with First/First conflicts without changing the grammar manually is to use backtracking. Trail will have to check out any possible trace and select the longest one. In the past I tried to break `(D, k, D)` into two new transient states `(D, '(', k, D)` and `(D, ')', k, D)`. The approach was interesting but made maintenance and debugging harder and there were still corner cases that were not properly treated.

What’s next?

In the next article I’ll leave the level of groundwork and will talk about the NFALexer. The NFALexer is actually just another context free TBP. The NFALexer replaces regular expression based lexical analysis in Trail which has some positive but also negative aspects. We have to talk about both.


Trace Based Parsing (Part I) – Introduction

This is the first in a series of articles about Trace Based Parsing ( TBP ). Trace based parsing is a powerful top down, table driven parsing style. TBP is the technique underlying Trail which is the parser generator of EasyExtend (EE). At its core Trail uses a variant of regular expression matching engines of the Thompson style described in an excellent article by Russ Cox. However Trail extends the scope of those engines and goes beyond regular expressions and covers context free grammars and programming languages.

Originally I didn’t intend to create a parser generator let alone a novel one that wasn’t described by the literature that was available to me. EasyExtend used a very efficient LL(1) parser generator written by Jonathan Riehl and what I needed was something more modest namely a parse tree validator for checking the correctness of parse tree transformations. Those validators are simpler to create than parsers and I’ll start my introductory notes about TBP describing them.

Parse Tree Validators

Suppose an EasyExtend user has defined a few new grammar rules in an extension language ExtLPython of Python. The ExtLPython source code is parsed correctly into a parse tree `PT(ExtLPython)` and now the user wants to transform `PT(ExtLPython)` back into a Python parse tree `PT(Python)` which can be compiled to bytecode by means of the usual compiler machinery. In essence the function `PT(ExtLPython) → PT(Python)` is a preprocessor of the user's language and the preprocessors, not the bytecode compiler, are defined within EE and need to be checked.

So the task we need to solve can be stated as

Given a parse tree `PT` and a grammar `G` check that `PT` is conformant with G.

An EBNF grammar of Python's parser is already available in the Python distribution and can be found in the distribution's `/Grammar` folder or here ( for Python 2.5 ).

We will start the derivation of a parse tree validator considering simple grammars. Take the following grammar `G1` for example:

Grammar G1
----------
G1: A B | A C

The possible parse trees of a language described by `G1` have the following shape: `[G1, A, B]` or `[G1, A, C]`. Our next grammar is just slightly more involved

Grammar G2
----------
G2: A R
R: B B C | B B D

The corresponding trees are:

TG1 = `[G2, A, [R, B, B, C]]`
TG2 = `[G2, A, [R, B, B, D]]`.

The function ( or object method ) I seek steps through the parse tree and determines for every node in the tree the set of all possible subsequent nodes. This function shall be called a tracer. Given the grammar `G2` and the symbol `A` the tracer shall yield the symbol `R`. If the symbol is `B` the yielded follow symbols are either `B` or `C` and `D`, depending on the state of the tracer.

A tracer can be best implemented using a finite state automaton. For the grammar `G2` we derive following states and state transitions in the automaton:

(G2,0) : (A, 1)
(A, 1) : (R, 2)
(R, 2) : (B, 3) , (B, 4)
(B, 3) : (B, 5)
(B, 4) : (B, 6)
(B, 5) : (C, 7)
(B, 6) : (D, 8)
(C, 7) : (None, '-')
(D, 8) : (None, '-')

In this automaton we have encoded the symbols of the grammar rules as 2-tuples `(symbol-name, index)`. The index is arbitrary but important to distinguish different states in the automaton originating from the same symbol that is used at different locations. In `G2` the symbol `B` gives rise to four different states `(B, 3)`, `(B, 4)`, `(B, 5)` and `(B, 6)`.

The finite state automaton is non-deterministic. That's because each transition is described by the symbol name only. So if the automaton is in the state `(A, 1)` the symbol `R` is the label of the transition: `(A, 1) → R → (R, 2)`. The subsequent transition is `(R, 2) → B → [(B, 3), (B, 4)]`. This particular transition doesn't yield a unique state but a set of states. Hence the non-determinism. Non-deterministic finite state automata are abbreviated as NFAs.

For all states `S` in a state set we can safely assume that `S[0] = symbol-name` for a fixed symbol name. In essence our tracing function produces state sets on each transition and those sets are uniquely determined by symbol names.

State views and selection views

The parse trees `TG1` and `TG2` from the previous section correspond with following traces which describe state transition sequences:

Trace(TG1) = `(G2,0) → A → (A, 1) → R → (R, 2) → B → [(B, 3), (B, 4)] → B → [(B, 5), (B, 6)] → C → (C, 7) → (None, '-')`

Trace(TG2) = `(G2,0) → A → (A, 1) → R → (R, 2) → B → [(B, 3), (B, 4)] → B → [(B, 5), (B, 6)] → D → (D, 8) → (None, '-')`

Our intended tracer would step through `Trace(TG1)` and `Trace(TG2)` at the same time. This makes the tracer implicitly parallel! The actual state sets of the NFA are hidden from client objects that use the tracer. All they consume is the sequence of labels/symbols:

`G2 → A → R → B → B → ( C, D ) → None`

We assume now that the tracer is an object that stores the current state-set of the NFA as a private member and enables tracing through the NFA using a `select` method which returns a set of labels that can be used for the next `select` call ( unless the label is `None` ). A sequence of `select` calls looks like this:

[A] = tracer.select(G2)
[R] = tracer.select(A)
[B] = tracer.select(R)
[B] = tracer.select(B)
[C,D] = tracer.select(B)
[None] = tracer.select(C)     # or [None] = tracer.select(D)
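
A minimal tracer sketch that reproduces this behaviour ( the class name `Tracer` and its construction are made up for illustration, node names are written as strings, and the real Trail tracer lives in `EasyExtend/trail/nfatracing.py` ):

class Tracer(object):
    def __init__(self, nfa, start):
        self.nfa = nfa             # dict: state -> list of follow states
        self.states = [start]      # current candidate states, e.g. [('G2', 0)]

    def clone(self):
        c = Tracer(self.nfa, None)
        c.states = list(self.states)
        return c

    def select(self, name):
        # keep the candidates labelled with `name` ...
        hits = [s for s in self.states if s[0] == name]
        # ... and make their follow states the new candidates
        self.states = [f for s in hits for f in self.nfa.get(s, [])]
        # the client only sees the set of possible next symbols
        return list(set(s[0] for s in self.states))

nfa_G2 = {('G2', 0): [('A', 1)],
          ('A', 1):  [('R', 2)],
          ('R', 2):  [('B', 3), ('B', 4)],
          ('B', 3):  [('B', 5)],
          ('B', 4):  [('B', 6)],
          ('B', 5):  [('C', 7)],
          ('B', 6):  [('D', 8)],
          ('C', 7):  [(None, '-')],
          ('D', 8):  [(None, '-')]}

tracer = Tracer(nfa_G2, ('G2', 0))
print tracer.select('G2')    # ['A']
print tracer.select('A')     # ['R']
print tracer.select('R')     # ['B']
print tracer.select('B')     # ['B']
print tracer.select('B')     # ['C', 'D']  ( order may vary )
print tracer.select('C')     # [None]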

Now we just have to bring the NFA tracers and the parse trees together. If the nested list `[G2, A, [R, B, B, C]]` represents a parse tree one has to step through the list and call the `select` method of the tracer with the current symbol found in the list. If we enter a sublist we spawn a new tracer and call the validator recursively. The parse tree is correct if each selection was successful. Otherwise an exception is raised.

The following Python function gives us an implementation:

def validate_parse_tree(tree, tracer):
    selection = []
    for N in tree:
        if istree(N):
            validate_parse_tree(N, tracer.clone())
        else:
            selection = tracer.select(N)
            if not selection:
                raise ValueError("parse tree validation failed at node %s"%N)
    if None in selection:
       return
    else:
       raise ValueError("parse tree validation failed at node %s"%tree)

A conflict free trace based LL(1) parser

LL(1) parsers are a class of efficient top down parsers. Wikipedia states

An LL parser is called an LL(k) parser if it uses k tokens of lookahead when parsing a sentence. If such a parser exists for a certain grammar and it can parse sentences of this grammar without backtracking then it is called an LL(k) grammar. Of these grammars, LL(1) grammars, although fairly restrictive, are very popular because the corresponding LL parsers only need to look at the next token to make their parsing decisions. Languages based on grammars with a high value of k require considerable effort to parse.

Notice that this section is pretty confusing because the fact that only 1 token of lookahead is required for a parser doesn’t imply anything about the characteristic features of the grammar but about the implementation technique. Backtracking is mentioned as such a technique and it is a general goal of parser generator design to avoid it because backtracking has an O(2^n) asymptotic complexity.

Later in the article WP mentions three types of LL(1) conflicts:

  • First/First conflict – overlapping of non-terminals
  • First/Follow conflict – first and follow sets overlap
  • Left recursion

The implicit parallelism of trace based parsers is able to solve many of the mentioned First/First and First/Follow conflicts. TBPs are nevertheless LL(1) which means that they don’t look deep into the token stream for decision making or run into backtracking. Right now it is unknown how powerful TBPs actually are. So if you know a computing science PhD student who looks for an interesting challenge in parsing theory you can recommend this work. If the problems discussed here are already solved you can recommend that work to me.

For the rest of the article we want to assume that LL(1) grammars are EBNF grammars free of the conflicts mentioned above. Traditionally LL(1) parsers are constrained by their ability to parse language of those grammars.

Reachables

The more common term found in the literature for reachable is first set. I prefer the term reachable in the context of EE though simply because the common distinction between first set and follow set is not so important in Trail. We watch parsers through the lens of Trail NFAs and everything that is on the RHS of a state is just a follow set. So the first set is also a follow set.

We give an explanation of the term reachable in the context of Trail NFAs.

Let's reconsider the grammar G2 which defines two grammar rules named G2 and R. The follow sets of the initial states of G2 and R are given by:

(G2,0) : (A, 1)
(R, 2) : (B, 3) , (B, 4)

The terminals or non-terminals corresponding to those states are reachable i.e. `A` is reachable from `G2` and `B` is reachable from `R`. Actually we do not only look at the immediate follow sets of the initial state but also compute the transitive closure. Reachability is transitive which means: if `X` is reachable from `B` and `B` is reachable from `R` then `X` is also reachable from `R`.

Each token in a token stream that is passed to a parser is characterized by a terminal symbol. We call this terminal symbol also the token type. The central idea of an LL parser is to derive terminals ( token types ) T as leftmost derivations of nonterminals N. Leftmost derivation is yet another characterization of reachability in the above sense. So we start with a non-terminal N and if T can be reached from N we can also create a parse tree containing N as a root node and T as a leaf node. We want to elaborate this now:

Suppose T0 is reachable from N and there is no other non-terminal M that is reachable from N which can also reach T0. So there is no non-terminal between N and T0. If N0 is a start symbol of our grammar and N1 is reachable from N0 and N2 from N1 etc. we get a nested chain of symbols [N0, [N1 , [… [Nk , … [N, T0]]] which is our initial parse tree.

That was simple but how to move on and derive the next token T1? Let's examine the Trail NFA of N which has the following structure:

(N, 0) : (T0, 1), ...
(T0,1) : (M, 2), ...
...

and make the following case distinction:

  • (T1, i) is a follow state of (T0, 1). Then our new parse tree becomes [N0, [N1 , [… [Nk , … [N, T0, T1 *]]] and we can proceed with token T2 and the follow states of (T1, i) in N. The * symbol indicates the location where the parser proceeds. These locations correspond to NFA states.
  • (T1, i) is reachable from (M, 2). Then we recursively call the parser with the token iterator and M as a start symbol and trace through the NFA of M. The parser returns a new parse tree [M, …] and we embed this tree like [N0, [N1 , [… [Nk , … [N, T0, [M,…] *]]] and proceed with the follow states of (M, 2) and the current token Tk yielded by the token iterator.
  • (None, '-') is in the follow states of (T0, 1). This means that N is allowed to terminate. The parser with N as a start symbol returns the parse tree [N, T0]. So we get [N0, [N1 , [… [Nk , [N, T0]*]] and proceed with T1 and Nk.
  • If all those conditions fail raise an error message.

The following code is an implementation of the algorithm.

def parse(tokenstream, symbol):
    tracer    = new_tracer(symbol)   # helper assumed to create a tracer for `symbol`
    selection = tracer.select(symbol)
    tree      = [symbol]
    token     = tokenstream.current()
    while token:
        token_type = get_tokentype(token)
        for nid in selection:
            if nid is not None:
                if istoken(nid):
                    if token_type == nid:
                        tokenstream.shift_read_position()
                        tree.append(token)
                        break
                elif token_type in reachable(nid):
                    sub = parse(tokenstream, nid)
                    if sub:
                        tree.append(sub)
                        break
                    else:
                        continue
        else:
            if None in selection:
                return tree
            else:
                raise ParserError
        selection = tracer.select(nid)
        try:
            token = tokstream.next()
        except StopIteration:
            token = None
    return tree

You will have noticed that the production of the NFAs from EBNF grammars was simply presumed. It would take another article to describe the translation. However, for simple grammars you can do such a translation easily by hand, and the translation process doesn’t add much to our understanding of parse tree validation and language parsing.

What’s next?

In the next article we will touch grammars with First/First and First/Follow conflicts. They are not LL(1) grammars according to the Wikipedia definition. The way we’ve dealt with those grammars is neither to increase the number of lookaheads (i.e. going from LL(1) to LL(k) or LL(*)), nor to fall back on backtracking, nor to modify the grammar itself. Instead we carefully modify the NFAs derived from the grammar. “Careful” means that the parse trees we create with TBP and modified NFAs correspond to those created with unmodified NFAs. That’s much like distorting and then restoring the grammar itself. This might be another area of academic research for young computing scientists interested in parser technology. I’ve developed a particular algorithm for solving the mentioned problem and I haven’t the slightest idea whether one could do far better.

Notice that I even tried to attack the left recursion conflicts that tantalize top-down parsers, but I gave up on these approaches: I couldn’t understand the NFAs anymore, they didn’t capture all use cases, and debugging became a pain. However there is still the promise of creating an O(n) top-down parser that is able to parse all context-free languages. And please, no tricks and no PEGs. Sweeping grammar conflicts under the rug using ordered choice is a foul and a burden on the programmer who wants to use the parser generator. Creating such a parser isn’t quite proving P != NP, but it would still be really, really cool and of much more practical relevance.

Posted in EasyExtend, Grammars | Comments Off on Trace Based Parsing (Part I) – Introduction

Python vs TTCN-3

Some time ago the computing scientists Bernard Stepien and Liam Peyton from the University of Ottawa compared Python with TTCN-3. TTCN-3 stands for Testing and Test Control Notation and it is a domain-specific language designed for writing tests in the telecommunication domain. In almost every category poor Python just loses the comparison.

A few things in the presentation are just odd. In several places it contains Python code that is not even consistent. Examples: on slide 42 an “else if” construct is used instead of the correct “elif”; on slide 13 fields are not assigned to self, which leads to code that fails to run. But leaving this carelessness aside, there is something more important that estranges an experienced Python programmer. Take slide 12 for example. Here the authors state:

TTCN-3 templates to Python

  • TTCN-3 templates could be mapped to Python object instances.
  • However, there are serious limitations using the above technique with Python.
  • Python objects are practically typeless.

The presentation then spends 18 slides explaining how badly plain Python classes serve the same purposes as TTCN-3 templates. Although this is correct, why should anyone care? If you want to experience an exhilarating effect you have to turn grapes into wine; you cannot expect the same fun from grape juice. Only then can you compare cheap wine with a premium one. That’s also why people compare Django to Ruby on Rails and not Django to Ruby or Rails to Python. That’s why libraries and frameworks exist in the first place.

So what about modeling a `Template` class that has same functionality as a TTCN-3 template? Here is a first attempt:

class Template(object):
    def __init__(self):
        self._frozen = False
        self._fields = []        
 
    def __setattr__(self, name, value):
        if name in ('_fields', '_frozen'):
            object.__setattr__(self, name, value)
            return
        if self._frozen:
            raise RuntimeError("Cannot set attribute value on a frozen template")
        if name not in self._fields:
            self._fields.append(name)
        object.__setattr__(self, name, value)
 
    def __eq__(self, other):
        if len(self._fields)!=len(other._fields):
            return False
        else:
            for f_self, f_other in zip(self._fields, other._fields):
                val_self  = getattr(self, f_self)
                val_other = getattr(other, f_other)
                if val_self != val_other:
                    return False
            return True
 
    def __ne__(self, other):
        return not self.__eq__(other)
 
    def freeze(self):
        self._frozen = True
 
    def clone(self):
        T = Template()
        T._fields = self._fields[:]
        for field in T._fields:
            setattr(T, field, getattr(self, field))
        return T

The `Template` class manages an internal `_fields` attribute that contains the attribute names created dynamically by `__setattr__`. This is done to preserve the sequential order of the fields on template comparison. In the current implementation there aren’t any fields marked as optional; introducing them would take additional effort and a modification of `__eq__` (a possible direction is sketched below).
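
One possible direction, sketched here under my own assumptions: the `OPTIONAL` sentinel and the subclass `TemplateWithOptionals` are not part of the class above, they only illustrate how optional fields could be handled.

OPTIONAL = object()   # hypothetical sentinel marking a field as optional

class TemplateWithOptionals(Template):
    def __eq__(self, other):
        # Compare field by field; a field that is missing on one side or
        # explicitly set to OPTIONAL is skipped instead of causing a mismatch.
        for name in set(self._fields) | set(other._fields):
            val_self  = getattr(self, name, OPTIONAL)
            val_other = getattr(other, name, OPTIONAL)
            if val_self is OPTIONAL or val_other is OPTIONAL:
                continue
            if val_self != val_other:
                return False
        return True

    def __ne__(self, other):
        return not self.__eq__(other)

Note that this sketch compares fields by name rather than in sequential order, so it is only one of several possible designs.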

It is now easy to demonstrate the behavior of wildcards. First we have to introduce another class for this purpose:

class Wildcard(object):
    def __eq__(self, other):
        return True
 
    def __ne__(self, other):
        return False

For any object `X`, the comparison `X == wildcard` with an instance `wildcard = Wildcard()` always yields `True`. So the following template instances are equal:

wildcard = Wildcard()
templ_1 = Template()
templ_1.field_1 = wildcard
templ_1.field_2 = "abc"
 
templ_2 = Template()
templ_2.field_1 = "xyz"
templ_2.field_2 = wildcard
 
assert templ_1 == templ_2

A `Pattern` class can be created in the same way as the Wildcard class:

import re

class Pattern(object):
    def __init__(self, regexp):
        self.pattern = re.compile(regexp)
 
    def __eq__(self, other):
        try:
            return bool(self.pattern.match(other))
        except TypeError:
            return True
 
    def __ne__(self, other):
        return not self.__eq__(other)

An `==` comparison of a `Pattern` instance with a string matches the string against the regular expression and yields `True` if the string matches and `False` otherwise.
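
For illustration, here is how a `Pattern` field can be combined with the classes above, using the `wildcard` instance defined earlier. The field names and the regular expression are made up for this example:

templ_expected = Template()
templ_expected.msg_id  = Pattern(r"\d{3}-\d{4}")   # expected format of the id
templ_expected.payload = wildcard                  # don't care about the payload

templ_received = Template()
templ_received.msg_id  = "555-0123"
templ_received.payload = "whatever arrived"

assert templ_expected == templ_received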

So it is perfectly possible to reproduce most aspects of TTCN-3 templates with just a few lines of Python code. On the downside for `TTCN-3`, it isn’t possible to do things the other way round: no matter how hard you work in TTCN-3, it is impossible to create new TTCN-3 types showing this kind of interesting behavior. Python’s expressiveness is miles ahead.

Posted in DSL, Python, Testing | 4 Comments

Tail recursion decorator revisited

A while ago some of us tried to find the best solution for a tail recursion decorator. The story began when Crutcher Dunnavant came up with a surprising decorator that was able to eliminate tail recursion. His solution used stack inspections and could flatten the call stack such that the maximum height of the stack became 2. I gave an alternative, faster implementation that omitted stack inspections. This solution was further improved by Michele Simionato and George Sakkis. The story ended there, because the performance penalty of those decorators was still considerable: on the samples we used, the undecorated recursive function was more than twice as fast as the decorated one, so the decorator added more than 100% overhead.

Today I tried something new. I reimplemented George’s decorator in Cython, which is considerably faster than Python, and then compared the performance of undecorated code, code decorated with the pure Python decorator, and the Cython version of it.

Here you can download the relevant source code for the decorators.

The following implementation shows the Cython decorator defined in `tail_recursive.pyx`

cdef class tail_recursive:
    cdef int firstcall
    cdef int CONTINUE
    cdef object argskwd
    cdef object func
 
    def __init__(self, func):
        self.func = func
        self.firstcall = True
        self.CONTINUE = id(object())
 
    def __call__(self, *args, **kwd):
        if self.firstcall:
            self.firstcall = False
            try:
                while True:
                    result = self.func(*args, **kwd)
                    if result == self.CONTINUE: # update arguments
                        args, kwd = self.argskwd
                    else: # last call
                        return result
            finally:
                self.firstcall = True
        else: # return the arguments of the tail call
            self.argskwd = args, kwd
            return self.CONTINUE
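
To try the extension without writing a setup script, one way among others is pyximport, which compiles the `.pyx` file on import:

import pyximport
pyximport.install()   # compiles tail_recursive.pyx on demand

from tail_recursive import tail_recursive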

Here are some performance tests.

As a sample function I used a tail recursive factorial

def factorial(n, acc=1):
    "calculate a factorial"
    return (acc if n == 0 else factorial(n-1, n*acc))

The timing function is defined by

import time
def mtime(foo):
    a = time.time()
    for j in range(10):
        for i in range(900):
            foo(i)
    return time.time()-a
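
The rows in the table below were presumably produced along these lines. The function names `factorial_plain` and `factorial_cython` are mine, and I assume the Cython decorator has been imported as shown further above; the pure Python decorator from the downloadable source would be timed the same way:

# undecorated baseline
def factorial_plain(n, acc=1):
    return (acc if n == 0 else factorial_plain(n-1, n*acc))

# decorated with the Cython tail_recursive decorator
@tail_recursive
def factorial_cython(n, acc=1):
    return (acc if n == 0 else factorial_cython(n-1, n*acc))

print(mtime(factorial_plain))    # corresponds to the "undecorated" row
print(mtime(factorial_cython))   # corresponds to the "Cython decorator" row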

The results I got were:

8.484 -- undecorated
9.405 -- with Cython decorator   + 10%
17.93 -- with Python decorator   + 111%

Next I checked a pair of mutually recursive functions. Notice that in such a pair only one function may be decorated with `tail_recursive` (see the sketch after the listing).

def even(n):
    if n == 0:
        return True
    else:
        return odd(n-1)
 
def odd(n):
    if n == 0:
        return False
    else:
        return even(n-1)
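
As mentioned above, only one function of the pair gets decorated; which one is a matter of choice. A minimal way to set this up, decorating `even` only:

even = tail_recursive(even)   # decorate only one function of the pair

Rebinding `even` is enough, since `odd` looks up the name `even` at call time.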

Here are the results:

2.969 -- undecorated
3.312 -- with Cython decorator   + 11%
7.437 -- with Python decorator   + 150%

These are about the same proportions as for the `factorial` example.

My conclusion is that one can expect about a 10% performance penalty for the Cython `tail_recursive` decorator. This is quite a good result and I don’t shy away from recommending it.

Posted in Python | 6 Comments

Universal Set in Python

Building sets from set() and ANY

When defining a regular expression there is a well-known duality between a set of characters, specified by enclosing characters or character ranges within square brackets `[…]`, and the pattern of the complement set `[^…]`, specified by a leading caret symbol `^`.

  • [0-9] — matches all digits
  • [^0-9] — matches all non-digits

The nature of the complement set mentioned here is less precisely defined and is implementation-dependent in practice. If a regexp matcher knows about `ASCII` only, then `[^0-9]` is the set of all `ASCII` characters which are not digits.
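
For instance, with Python’s `re` module (where exactly what the complement ranges over depends on the string type and flags):

import re

assert re.match(r'[0-9]',  '5')            # a digit matches [0-9]
assert re.match(r'[^0-9]', 'a')            # a non-digit matches the complement
assert re.match(r'[^0-9]', '5') is None    # but a digit does not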

Suppose now we want to build a character-matching set from sets in the sense of Python’s builtin `set` data type. This is easy because we start from the empty `set()` and just add more elements. Each finite set can be constructed in this way and we can establish a set ~ pattern correspondence:

`set(range(10)) ~ [0-9]`

But how do we find a corresponding set for `[^0-9]`?

This cannot be finitely constructed from the bottom up using `set()` as a start value but has to be finitely de-constructed from the set of all elements. This looks like we are running into unlimited comprehension and the Russell paradox, but actually the set theory we are using is quite weak and element membership can be decided just by doing hashtable lookups.

Instead of constructing a set from `set()` we want to (de-)construct it from `ANY`. The name `ANY` is not chosen by chance because we want to express the correspondence with the `.` pattern:

`ANY ~ .`

( Actually `ANY` is a little stronger because `.` doesn’t match newline characters unless explicitly specified by the `DOTALL` flag in the `re.compile` function. )

Using `ANY` we can easily state the above correspondence:

`ANY - set(range(10)) ~ [^0-9]`

Implementation of ANY

We define `ANY` as an instance of `univset`. The `univset` has a set-like interface and implements set operations consistently on sets of the type `ANY-A`, `ANY-B`,… where `A`, `B`… are finite sets constructed from `set()`. For example

`(ANY-A) ∩ (ANY-B) = ANY - (A ∪ B)`
`(ANY-A) ∪ B = ANY - (A - B)`

class univset(object):
    def __init__(self):
        self._diff = set()
 
    def __sub__(self, other):
        if type(other) == set:
            # (ANY - A) minus a finite set B is ANY - (A | B)
            S = univset()
            S._diff = self._diff | other
            return S
        else:
            # (ANY - A) - (ANY - B) keeps exactly the elements of B
            # that are not in A, which is a plain finite set
            return other._diff - self._diff
 
    def __rsub__(self, other):
        return other & self._diff
 
    def __contains__(self, obj):
        return not obj in self._diff
 
    def __and__(self, other):
        return other - self._diff
 
    def __rand__(self, other):
        return other - self._diff
 
    def __repr__(self):
        if self._diff == set():
            return "ANY"
        else:
            return "ANY - %s"%self._diff
 
    def __or__(self, other):
        S = univset()
        S._diff = self._diff - other
        return S
 
    def __xor__(self, other):
        return (self - other) | (other - self)
 
    def add(self, elem):
        if elem in self._diff:
            self._diff.remove(elem)
 
    def update(self, elems):
        # adding a collection of elements removes them from the difference set
        self._diff = self._diff - set(elems)
 
    def __ror__(self, other):
        return self.__or__(other)
 
    def union(self, other):
        return self.__or__(other)
 
    def difference(self, other):
        return self.__sub__(other)
 
    def intersection(self, other):
        return self.__and__(other)
 
    def symmetric_difference(self, other):
        return self.__xor__(other)
 
    def issubset(self, other):
        if type(other) == set:
            return False
        # ANY - A is a subset of ANY - B iff B is a subset of A
        return other._diff.issubset(self._diff)
 
    def issuperset(self, other):
        if self._diff & other:
            return False
        return True
 
    def __lt__(self, other):
        return self.issubset(other)
 
    def __eq__(self, other):
        if type(other) == set:
            return False
        try:
            return self._diff == other._diff
        except AttributeError:
            return False
 
    def __ne__(self, other):
        return not self.__eq__(other)
 
    def __le__(self, other):
        return self.__lt__(other) or self.__eq__(other)
 
    def __gt__(self, other):
        return self.issuperset(other)
 
    def __ge__(self, other):
        return self.issuperset(other) or self == other

Examples

ANY = univset()
NON_DIGITS = ANY - set(range(10))
assert 8 not in NON_DIGITS
assert 'a' in NON_DIGITS
assert 0 in NON_DIGITS | set([0])
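
The two set identities stated earlier can be checked directly against the implementation; the sets `A` and `B` are arbitrary examples:

A = set('abc')
B = set('bcd')
assert (ANY - A) & (ANY - B) == ANY - (A | B)
assert (ANY - A) | B == ANY - (A - B)
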
Posted in Mathematics, Python | Tagged | 5 Comments

The statement that never was

The Python release archive is a nice repository for archaeological findings. My biggest surprise was the existence of an access statement and `access` keyword in the pre-history of Python 1.5.0. The syntax was already specified in release 1.0 with an accompanying `accessobject.c` and it began to fade in Python 1.4.

The access statement

That is how the access statement was defined in 1.0

access_stmt: 'access' ('*' | NAME (',' NAME)*) ':' accesstype  (',' accesstype)*
accesstype: NAME+
# accesstype should be ('public' | 'protected' | 'private') ['read'] ['write']
# but can't be because that would create undesirable reserved words!

The `access_stmt` was a small statement, i.e. one that doesn’t contain a block, and could be used like `print`, `del` or `exec`.

So the proper use of the access statement would be like

class Point:
    access x, y, z: private write, public read

or more compact:

class Point:
    access *: private write, public read

The word `public` wasn’t a language keyword which made it possible to use it in this way:

class AccessModifier:
    access public: public
    access private: public
    access protected: public

Note that the access statement was purely declarative, even more so than Java’s, where you can write

public int x = 10;

and define `x` while specifying its type and access modifier. It was also more elaborate in terms of intended protection, differentiating between `read` and `write` access.

How it never took off and went away

In the `HISTORY` file of Python 1.0 we find the notice

There’s a new reserved word: “access”. The syntax and semantics are
still subject of research and debate (as well as undocumented), but
the parser knows about the keyword so you must not use it as a
variable, function, or attribute name.

In Python 1.4 the access statement was already commented out in the `Grammar` file.

The story ends with the following remark in the `HISTORY` of Python 1.5:

‘access’ is no longer a reserved word, and all code related to its
implementation is gone (or at least #ifdef’ed out). This should make
Python a little speedier too!

Posted in General, Python | Tagged | 1 Comment