Algorithms

Token Tracers

Posted in EasyExtend, Parsing, TBP, Trail on June 3rd, 2010 by kay – Be the first to comment

When I started programming EasyExtend in 2006 one of the major problems was the correct grammar -> NFA translation. I used big grammars and testing for correctness required lots of source code. The first heuristics I used was ugly and complex and it took about 2 or so years to find a neat trick which finally lead to replace it completely. The basic problem of systematic phrase or expression generation for testing purpose persisted though - until last week when I implemented a TokenTracer.

Tracers

A typical production rule in the Trail parser generator is translated into a single NFA which might look as in the following example

 1005: ["funcdef: [decorators] 'def' NAME parameters ':' suite",
        (1005, 0, 1005),
        {(1, 3, 1005): [(1006, 4, 1005)],
         (11, 5, 1005): [(1043, 6, 1005)],
         ('def', 2, 1005): [(1, 3, 1005)],
         (1004, 1, 1005): [('def', 2, 1005)],
         (1005, 0, 1005): [('def', 2, 1005), (1004, 1, 1005)],
         (1006, 4, 1005): [(11, 5, 1005)],
         (1043, 6, 1005): [(None, '-', 1005)]}],

It is not created for readability but it is nevertheless easy to decode. The funcdef grammar rule is assigned a numerical value, a rule identifier - here 1005. Asscociated with the rule identifier is a 3-list consisting of

  1. The rule in plain text
  2. The start state of a finite automaton (1005, 0, 1005)
  3. A finite automaton encoded as a dictionary of transitions.

Starting with (1005, 0, 1005) one can step through the automaton. The follow states are  [('def', 2, 1005), (1004, 1, 1005)]. The first one obviously represents the def keyword whereas the second is a representation of the decorators non-terminal which has the rule identifier 1004. When you select the (1004, 1, 1005) state there is a single follow state, which is again the state of the def keyword otherwise you get the follow state (1, 3, 2005) of  (’def’, 2, 1005). The state (None, ‘-’, 1005) doesn’t have a follow state and it is the only one.

You can now define a function that keeps track of this stepping process through a rule. This function is called a Tracer.

A Tracer acts as follows:

>>> tracer = Tracer(rules)
>>> tracer.select(1005)   # selects automaton 1005 and returns the rule ids of the
['def', 1004]             # possible follow states
>>> tracer.select('def')
[1]
>>> tracer.select(1)
[1006]
...

It is possible that a Tracer has to keep track of multiple traces at once. For example the exprlistrule

 1069: ["exprlist: expr (',' expr)* [',']",
        (1069, 0, 1069),
        {(12, 2, 1069): [(1053, 3, 1069)],
         (12, 4, 1069): [(None, '-', 1069)],
         (1053, 1, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)],
         (1053, 3, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)],
         (1069, 0, 1069): [(1053, 1, 1069)]}],

defines transitions of the kind

(1053, 1, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)]

with two rules of rule id 12 in the follow set. When 12 is selected in the Tracer all follow sets of all rules with rule id = 12 are unified:

>>> tracer.select(1069)
[1053]
>>> tracer.select(1053)
[12, None]
>>> tracer.select(12)
[1053, None]
...

TokenTracers

This kind of tracing functionality is central to EasyExtends implementation of Trace Based Parsing (TBP). For single grammar rules TBP coincides with “Thompson NFA” style parsing discussed at length by Russ Cox or more recently by Carl Friedrich Bolz who gave a Python implementation.

We want to consider now a different sort of tracer which is more complicated to create than those for single grammar rules. Those tracers have to meet the following requirement:

The list of rule id’s returned from tracer.select() shall contain only None or rule id’s of terminal symbols.

The rule id’s of terminals are exactly the  token types. The select function of a TokenTracer returns a list of token types and gets fed with a single token type. In the following example we step through the token stream of a simple function

def foo():
    print 42

Here we go

>>> tracer = TokenTracer(rules)
>>> tracer.select(1001)  # a single select using a top level non-terminal
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select('def')
[1]
>>> tracer.select(1)     # foo
[7]
>>> tracer.select(7)     # (
[1, 7, 8, 16, 36]
>>> tracer.select(8)     # )
[11]
>>> tracer.select(11)    # :
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select(4)     # \n
[5]
>>> tracer.select(5)     # INDENT
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select('print')
[1, 2, 3, 4, 7, 9, 13, 13, 14, 15, 25, 26, 32, 35, 'lambda', 'not']
>>> tracer.select(2)     # 42
[4, 7, 9, 12, ..., 36, 48, '<>', 'and', 'if', 'in', 'is', 'is', 'not', 'or']
>>> tracer.select(4)     # \n
[1, 2, 3, 6, 7, ... , 'try', 'while', 'with', 'yield']
>>> tracer.select(6)     # DEDENT
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select(0)     # ENDMARKER

Application 1 - error detection

Using a TokenTracer it is dead simple to localize a syntax error which is - in the context free case - always an unexpected token. In principle Trail could delegate error recovery entirely to a TokenTracer.

Application 2 - autocorrection

A constant token is a token with a constant token string e.g. ‘;’ or ‘:’. Closely related are token like INDENT where the token string can be derived from context and a prescribed indentation. In sharp contrast are token like NAME, NUMBER and STRING where the token string is not language but user determined. In the select() sequence above we find constant token lists of length = 1 like [11] or [7]. If one of those token is omitted it can be inserted without guessing.

Application 3 - expression generation

The most intriguing aspect of TokenTracers is that each random token sequence which is constrained by a TokenTracer is syntactically correct. This can be used to create expression generators: first write a grammar G to describe the language syntax, then you derive a TokenTracer(G). Finally an expression generator ExprGen(TokenTracer(G)) is created which is used to build random token sequences being compliant with G by means of the TokenTracer. Those token-sequences can either be turned into valid parse trees and get compiled or un-tokenized into source code.

A valuation function fitness(expr) -> float on expressions motivates the use of genetic programming for breeding expressions of a certain kind. For example I’m strongly  interested in compact grammars which create big NFA expansions in Trail. It is not easy to see how those can be built by hand. Using GP one could set an arbitrary threshold like n = 1000 for the number of states in a single expanded NFA and tries to minimize the size of a grammar, where the size is measured in the number of tokens used for a grammar description in some meta-grammar ( e.g. EBNF ).

Inheritance and the C preprocessor

Posted in Algorithms, C on March 24th, 2010 by kay – 3 Comments

Defining n-ary trees using the C preprocessor

In this article I introduce a compile time C technique used to define inheritance. Instead of giving a lengthy motivation I’ll jump directly to the algorithm and discuss it later. I hope lovers of C and its preprocessor find it useful. #defines first!

#define TOP 0
#define SET_CHILD(n,parent) ( parent==TOP ? n: \
                            ( parent<(1<<4) ? (n<<4) + parent : \
                            ( parent<(1<<8) ? (n<<8) + parent : (n<<12)+parent)))
 
#define IS_SUBNODE(child, parent) ((child & parent) == parent)
 
#define SELECT(X, a, best) ( a > best && IS_SUBNODE(X, a)? a : best)
 
#define SELECT_FROM_5(X, a, b, c, d, e) SELECT(X, a, \
                                        SELECT(X, b, \
                                        SELECT(X, c, \
                                        SELECT(X, d, \
                                        SELECT(X, e, 0)))))
 
#define SELECT_FROM_4(X, a, b, c, d) SELECT_FROM_5(X, a, b, c, d, 0)
#define SELECT_FROM_3(X, a, b, c)    SELECT_FROM_5(X, a, b, c, 0, 0)
#define SELECT_FROM_2(X, a, b)       SELECT_FROM_5(X, a, b, 0, 0, 0)

The SET_CHILD macro is used to define up to 15 child nodes of a given root for a n-ary tree of depth 5 with a single root node, named TOP. This is encoded within a single number of type word which is adequate for most embedded compilers. For 32 or 64 bit processors one can either support more child nodes or a deeper tree.

SET_CHILD is assigning a name to n-th child of a given parent. One starts with TOP as the parent of all nodes and recurses down:

#define A SET_CHILD(1, TOP)
#define B SET_CHILD(2, TOP)
...
#define A1 SET_CHILD(1, A)
#define A2 SET_CHILD(2, A)
...
#define B1 SET_CHILD(1, B)
#define B2 SET_CHILD(2, B)
...
#define A11 SET_CHILD(1, A1)
#define A12 SET_CHILD(2, A1)
...
#define A21 SET_CHILD(1, A2)
#define A22 SET_CHILD(2, A2)
...

By construction no more than 15 child nodes for a given parent are permitted. If more are used, macros like IS_CHILD will fail to work correctly.

Once a tree is created with the appropriate nodes, one can use IS_CHILD to check for child/parent relationships. The tree is constructed s.t. IS_CHILD(A, B) returns 1 iff A is a direct child of B or a grandchild of B etc. otherwise 0. So IS_CHILD(A22, A) evaluates to 1 just like IS_CHILD(A22, A2) or IS_CHILD(A22, TOP) but IS_CHILD(A22, A1) is 0.

The C preprocessor doesn’t support overloading and the flavors I checked didn’t support varargs wich wouldn’t probably be much helpful in this case either. So I defined a group of 5 SELECT_FROM_xx macros being distinguished only be the number of arguments. The number 5 isn’t magic and one can extend the range of SELECT_FROM_xx macros by need.

How is SELECT_FROM_xx used? The first argument X is an arbitary node of the tree. If one of the susequent nodes a, b, … c is identical with X, X will be the value of SELECT_FROM_xx(X, a, b, …, c). Otherwise the most-direct-parent of X among the nodes a, …c will be returned. If none of them is a parent of X the return value is TOP.

Example:

If we set

#define X A22

then we get

SELECT_FROM_2(X, A, B)        // = A
SELECT_FROM_3(X, A, B, A1)    // = A
SELECT_FROM_3(X, A, B, A2)    // = A2
SELECT_FROM_3(X, A2, B, A)    // = A2
SELECT_FROM_3(X, A2, A, A22)  // = A2
SELECT_FROM_2(X, A1, B)       // = TOP

Inheritance

With the definitions above we can influence conditional compilation:

#if SELECT_FROM_3(X,A2,A,B) == A2
        const int a = 0;
#else if SELECT_FROM_3(X,A2,A,B) == A
        const int a = 1;
#else if SELECT_FROM_3(X,A2,A,B) == B
        const int a = 2;
#else
        const int a = -1;
#endif

The virtue of the construction lies in its robustness. Suppose X is A22 then the first branch is selected but this remains true also if we build a “subclass” A22k , k = 1, …, 9, A, …, F of A22 and assign e.g.

#define X A225

So if we use conditional compilation for a given System S and create a subsystem T of S e.g. a new version of S, we have to adapt our C code only in places where T differs explicitely from S. This robustness is also the major virtue of using inheritance / polymorphism in OOP. It has led to disrespect of using case-statements in OOP since those do not exploit polymorphism and cause in turn less robust code. We see that case- or if-else statements can be confined with the very same idea and robustness even on the level of the C preprocessor. The additional charme of using the C preprocessor is that child/parent relationships are computed at compile time and do not cause any runtime performance penalty.

Restricted backmatching

Posted in TBP on March 13th, 2010 by kay – Be the first to comment

In practice we often encounter situations when our preferred approach to problem solving breaks down. Just look at the recent Google implementation of  a regexp engine RE2, created by Russ Cox, the guy who has written a revival paper for Thompson NFAs  a while ago with a few follow-ups which build on those ideas.  Now once again backmatching is greyed from the feature matrix which means: no implementation. The project page intro states:

The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently.

When one takes a closer look at backmatching one starts to wonder what the restrictions to backmatching in trace based parsing ( TBP ) approaches to pattern matching really are?  One can ask the question also with a more positive intent: what cases of backmatching are compliant with TBP? We’ll find there are quite a lot. Some are obviously falling apart like exotic applications of regexps for solving NP-complete problems like 3-SAT - but could it be in the end that only esoteric applications of backmatching are excluded from TBP? The reader might be the final judge.

General backmatching

When we think about backmatching in regular expressions we might have expressions like this

 "... (P) ... \1"

in mind where (P) defines a simple pattern and \1 refers to the value of the matched pattern. So there is a functional relationship between (P) and \1.

Actually this perspective is a little simplistic in the general case. Consider the following simple regexp:

([ab]*)b*\1

Here the match of  ([ab]*) depends on what \1 will match but it is also highly ambiguous. If we match the following string

s = “bb”

the first b can be matched by ([ab]*) and the last “b” by \1 but the whole string can also be matched by b*.

Here is another more complicated example

(([ab]*)b*\2)*[ab]*\1b*

It stems from Thomas Lord with whom I discussed this topic on LtU and who corrected my initial naive assumptions about backmatching. Not only depends the match of ([ab]*) on the match of \1 but also on the match of \2 which depends on the match of \1 as well. Of course \1 depends on both of the matches of ([ab]*) and (([ab]*)b*\2). It’s all tangled.

General backmatching like the one above can be used to solve NP-complete problems which exploits these tanglements and find resolutions. See this article for more examples. With exponential time backtracking algorithms built into regexp engines one gets solutions for free or by magic. In some sense this is cool but if you’d write a requirements document for a regexp engine would you demand that it shall solve 3-SAT problems? If you say “yes”, show me the real world application which uses this power.

Functional backmatching

If we restrict backmatching to simple functional relations between (P) and \1 we can still express a broad range of practically relevant use cases. Here we give an approach to formalize those restrictions which can be checked by a regexp compiler.

In an expression

    ... (P) ... \1

the back-reference \1 can be separated from P when following holds:

  1. P doesn’t contain back-references which means it is self contained.

  2. It is possible to write the expression in the way

    ... L(P)R ... \1

where L and R are left and right delimiters of P which means P has no characters in common with L and R. L can be empty when (P) is at the start of an expression.

The first condition can be checked syntactically. The second condition can be expressed using the following two equations on sets

2.1 LAST-SET(L)  /\ FIRST-SET(P) = {}
2.2 LAST-SET(P)  /\ FIRST-SET(R) = {}

If additionally following condition is true

2.3 FIRST-SET(P) /\ LAST-SET(P) = {}

R might be empty and an expression

    ... L(P)\1 ...

is permitted.

End Cycles

The current conditions are still to restrictive. For example regexp (a)\1 violates condition (2.3) but shall be permitted. What we really want to exclude is that \1 is adjecent to what I call a non empty endcycle.

An endcycle of P has the following definition:

END-CYCLE(P) = FOLLOW-SET( LAST-SET(P) )

Take for example the regexpP = (a*|b|c). Here LAST-SET(P) = {a, b, c} and  FOLLOW-SET({a,b,c}) = {a} which means that a is in the endcycle of P.

With endcycles in mind we can weaken the conditions of (2) considerably:

If P has no endcycle i.e.

    END-CYCLE(P) = {}

we permit

    ... L(P)\1 ...

if the following holds:

    END-CYCLE(L) /\ FIRST-SET(P) = {}

If on the other hand

    END-CYCLE(P) != {}

we permit

    ... L(P)R ... \1 ...

if the following holds:

    END-CYCLE(L) /\ FIRST-SET(P) = {}
    END-CYCLE(P) /\ FIRST-SET(R) = {}

Final  Remarks

No matter how the conditions are defined it has to be granted that matching (P) is terminated before backmatching. If this isn’t checked statically during regexp compilation one can still defer checks until runtime. Much like any other dynamic check it is less clear what will happen to an expression but there isn’t much mental overhead and the implementation is kept simpler.

Syntax algebra - first steps

Posted in Algorithms, Grammars on January 30th, 2010 by kay – Be the first to comment

Principle of relativity

I started to revisit syntactic mappings defined in EasyExtend 3 which are closely tied to the Python grammar being  in use. Those are functions like varargs2arglist, normalize, split_file_input or exprlist2testlist defined in the csttools.py module. One of the major challenges of future releases of EasyExtend ( or a definite successor project - i don’t know ) is to abstract from Python as a target language. In EE3 only Python can be targeted by a langlet transformation whereas in EE 4 langlets are symmetric: each langlet can be a target langlet for any other langlet. All langlets exist on the same footing and also: each langlet can be used as a parent langlet of a newly defined langlet which means a proper inheritance relationship and maximum reuse.

The relativity among langlets calls for raising the bar of abstraction. All Python specific dependencies, and there are a lot in EE3, have to be removed from the basic CST libraries. Indeed not even nodes of special importance like expr, stmt or atom shall be allowed in places other than modules which are specific to a particular langlet. The generalization of the mentioned functions leads to a study of abstract syntactic forms and some sort of “syntax algebra”. It isn’t studied rigorously in this article but shall be at least motivated. As a prerequisite I recommend to read my article about CST interpolation which discusses concepts of major relevance.

Embeddings

The exprlist2testlist function turns a node of type exprlist defined as

exprlist: expr (',' expr)* [',']

into a node of type testlistdefined as

testlist: test (',' test)* [',']

This works without adding information because {test, expr} is a valid interpolation i.e. there is a nested sequence

[test, [or_test, [and_test, [not_test, [comparison , expr]]]]]
which is a representation of a valid CST. In terms of CST interpolations test(expr) yields a node of type test and induces a homomorphism exprlist->testlist. More generally an interpolation {A, B} induces an embedding B (x B)* -> {A,B} (y {A,B})* = A (y A)* if x and y are constant terminal symbols i.e. terminal symbols where the corresponding token have a uniquely determined token string.

Blocks or no blocks

Another relevant example is the normalize function. The idea behind normalize is that statements like if x: y or def foo(): pass are semantically equivalent to block statements:

if x:
    y
def foo():
    pass

Those block statements can be used in a more general fashion because we can add other block statements in the thunked block. In Python blocks are expressed by the suite grammar rule:

suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

Since {stmt, simple_stmt} is a valid interpolation, we can substitute all occurences of

suite: simple_stmt

with suites of the form

suite: NEWLINE INDENT stmt+ DEDENT

The general syntactic embedding is of the kind

B -> a b… {A,B} c d… with a, b, c, d … being constant terminal symbols.

Notice that INDENT is assumed to be a constant terminal despite the fact that the token string may vary. INDENT is treated special because in practical applications the number of used spaces for an indentation is fixed.  EasyExtend always uses 4 spaces.

Splitting nodes

The function split_file_input is the prototype of a node splitting function which can be thought in analogy to string splits. In this particular case we have a node file_input of the form

file_input: (NEWLINE | stmt)* ENDMARKER

and want to generate a sequence of nodes file_input: stmt ENDMARKER, file_input: NEWLINE ENDMARKER and file_input: ENDMARKER - one for each NEWLINE, stmt and ENDMARKER in the original file_input node. It doesn’t matter that file_input: NEWLINE ENDMARKER and file_input: ENDMARKER are likely be thrown away by an application because this can be decided by the nodes consuming function. The general split function is defined by

R: x (A|B…)* y -> [(R: x A y), (R: x B y), ..., (R: x y)]

Node rewritings

The mappings considered above were all rather straightforward. Now we want to discuss a rule transformation which is less obvious, namely that of a function signature into an argument tuple of a function call. In Python 2.6 the function signature is defined as

varargslist: ((fpdef ['=' test] ',')*
('*' NAME [',' '**' NAME] | '**' NAME) |
fpdef ['=' test] (',' fpdef ['=' test])* [','])
fpdef: NAME | '(' fplist ')'
fplist: fpdef (',' fpdef)* [',']

and an argument list of a function call by

arglist: (argument ',')* (argument [',']| '*' test [',' '**' test] | '**' test)
argument: test [gen_for] | test '=' test

Can we transform each varargslist into an arglist?

Let’s start our treatment of varargslist with fpdef. If we insert the RHS of fplist in fpdef we get

fpdef: NAME | '(' fpdef (',' fpdef)* [','] ')'

We show that this rule is a special form of  the node atom and since {test, atom} is a valid interpolation it is also a test node. The atom node is defined by

atom: NAME | '(' [yield_expr|testlist_gexp] ')' |  '[' [listmaker] ']' | ...

which can be specialized to

atom: NAME | '(' testlist_gexp ')'

Next we consider the testlist_gexp definition

testlist_gexp: test ( gen_for | (',' test)* [','] )

which can be specialized to

testlist_gexp: test (',' test)* [',']

We insert testlist_gexp in atom which yields

atom: NAME | '(' test (',' test)* [','] ')'

If we reduce test to atom we get a rule

atom: NAME | '(' atom (',' atom)* [','] ')'

which is isomorphic to fpdef. So we just need to substitute all occurrences of fpdef in fpdef with atom, then replace atom with test(atom) and finally replace the whole of atom again with test(atom). This procedure substitutes fpdef with test.

When we substitute each occurrence of NAME with testin varargslist we get:

(test ['=' test] ',')* ('*' test [',' '**' test] | '**' test) |
                       test ['=' test] (',' test ['=' test])* [',']

which can be reduced to

(argument ',')* ('*' test [',' '**' test] | '**' test) |
                 argument (',' argument)* [',']

which is the same as

(argument ',')* (argument [','] | '*' test [',' '**' test] | '**' test)

Voilà!

Syntax algebra

We have done some informal steps into syntax algebra with some real functions defined in EE 3 as a starting point. For the first three functions we have found general syntactical transformations which might be universally applicable. The last transformation is very specific though and it might be more interesting to determine an algorithm used to find a rule transformation of a particular kind. Although the search algorithm might be NP complete I assume that the found transformation - if one exists - has linear time complexity which is what we want. Such an algorithm would be another great achievement of EasyExtend which does not cease to surprise me.

About CST interpolation

Posted in Algorithms, Parsing, TBP, Trail on December 7th, 2009 by kay – 3 Comments

Eli Bendersky has written a short overview article about Pythons _ast module which is supposed to make working with parse trees simpler by transforming them into other trees i.e. abstract syntax trees.

In this article I want to talk a bit about my reservations against this approach which is mostly justified by common wisdom, “convenience” and what not. IMO AST’s are a stumbling stone in the advancement of language front ends and you can do many things elegantly and more generic without them. They are there for a reason though and it is not easy to see immediately why you better get away with plain old concrete syntax trees ( CSTs ).

There are two major reasons I love CSTs.

  1. Parse trees are unambiguously represented by a grammar. So once you know the grammar you also no how to find and arrange nodes. For context free languages the grammar contains all syntactical information you’ll ever need.
  2. The reversion to source code is trivial. Just traverse the parse tree inorder, visit the leaf nodes containing terminals and concatenate their string content. Only whitespace has to be inserted.
  3. For parsing purposes grammars are translated into finite state machines. Those machines can also be used for non-parsing purposes like parse tree interpolation which provides most of the benefits of ASTs but won’t cause any additional translation overhead.

I assume 1. and 2. won’t tell anything new to the reader and it might be 3. which might contain novel information. The major argument can be summarized using the following diagram

A grammar gives rise to a number of finite state machines - in fact one machine for one grammar rule. Not only are those machines used to parse source code into CSTs but they can also be used to operate on them. In particular they are used to

  • check correctness of CSTs under transformations
  • connect individual CST nodes through sequences of other CST nodes ( interpolation ).
  • insert CST nodes into sequences of CSTs nodes which make up the content of a particular CST node (autocompletion)

Any of those tools/services only depend on the given grammar and are universally applicable to all languages. It is not much unlike regexps and regexp engines which uniformly apply matching and searching to all strings.

Only in the combination of verification, interpolation and autocompletion CSTs actually become handy for manipulation tasks. They also serve as a foundation for tools which are more close to the actual source code and define transformations in code without any additional syntax. That’s also why EasyExtend and successors will never see a particular macro language. Just like ASTs, macros are an obsolete technology in the light of proper language oriented transformation tools.

Parse tree representations

Take a look at the following AST constructor kept as an example from Eli’s article

Expression(
  body=BinOp(
         left=Str(s='xy'),
         op=Mult(),
         right=Num(n=3)))

The Expression constructor takes a node of type BinOp and produces a node of type Expression. It is used to represent that actual Python expression “xy”*3.

Now take take a look at the following kludge which represents the same information in the form of a concrete syntax tree:

>>> import parser
>>> parser.expr('"xy"*3').tolist()
[258, [326, [303, [304, [305, [306, [307, [309, [310,
[311, [312, [313, [314, [315, [316, [317, [3, '"xy"']]]],
[16, '*'],
[315, [316, [317, [2, '3']]]]]]]]]]]]]]]], [4, ''], [0, '']]

The concrete parse tree is represented in the form of a nested list and yields all sorts of numerical tags which identify grammar rules being applied in top down parsing. The numerical tags shall be called node identifiers or short node ids.

The formatting can be done a little nicer by translating the node ids into node names and displaying the tree in tree form:

eval_input  -- NT`258
  testlist  -- NT`326
    test  -- NT`303
      or_test  -- NT`304
        and_test  -- NT`305
          not_test  -- NT`306
            comparison  -- NT`307
              expr  -- NT`309
                xor_expr  -- NT`310
                  and_expr  -- NT`311
                    shift_expr  -- NT`312
                      arith_expr  -- NT`313
                        term  -- NT`314
                          factor  -- NT`315
                            power  -- NT`316
                              atom  -- NT`317
                                STRING  -- T`3     L`1
                                  "xy"
                          STAR  -- T`16     L`1
                            *
                          factor  -- NT`315
                            power  -- NT`316
                              atom  -- NT`317
                                NUMBER  -- T`2     L`1
                                  3
  ENDMARKER  -- T`0     L`2
    ''

It doesn’t change much in principle though. The AST is an order of magnitude more concise, more readable and better writable.

Searching nodes

Searching within a CST isn’t much of a problem and it is actually quite easy when we know the grammar. All that is needed are two functions find_first and find_all which keep a node and a node id as arguments. So when we seek for a particular node e.g. term in the syntax tree we just call find_first( node, symbol.term) where symbol.term is the node id of term encoded in symbol.py which is a standard library module. So for

nd = parser.expr(’”xy”*3′).tolist() we can apply find_first(nd, symbol.term)which returns

term  -- NT`314
  factor  -- NT`315
    power  -- NT`316
      atom  -- NT`317
        STRING  -- T`3     L`1
          "xy"

Traces

We want to name CST constructors just of the nodes they create. So expr creates a node of type symbol.expr, STRING a node of token.STRING and so on. In order to create a correct expr we have to call lots of node constructors. In source code this would be something like

expr(xor_expr(…(term(factor(…(STRING(”xy”)…), STAR(”*”), factor(…(NUMBER(”3″)…))…))

This doesn’t look much like noise reduction, but now consider this: when expr is created by the parser the parser starts with nothing but a sequence A = (STRING(”xy”), STAR(”*”), NUMBER(”3″)). So why isn’t it possible to start with A and expr and build expr(*A) ? We want to face a slightly more general problem namely having a sequence of nodes A = (a1, a2, …, ak) which are not necessarily token and a node constructor expr. Can we build a node expr(a1, …,ak) of type symbol.expr?

What is needed to identify an admissible sequence A with this property?

First of all let’s take a look at the grammar rule description of expr

expr: xor_expr ('|' xor_expr)*

Any sequence of CST nodes which fits into this description shall be called a trace. So a sequence of nodes xor_expr VBAR xor_expr VBAR xor_expr is a trace of expr. But also xor_expr alone is a trace. So what is needed is to wrap a given sequence A = (a1, a2, …, ak) into a trace. We might start with the most simple case of a single node A= (a) which shall be wrapped into expr. As an example we consider A = (term).

Interpolation

In order to wrap term into expr we need a sequence of intermediate nodes xor_expr, and_expr,shift_expr, arith_expr` and then build

[expr, [xor_expr, [and_expr, [shift_expr, [arith_expr, term]]]]]

This sequence is uniquely determined by expr and term. In order to build one we must be sure there is no non-trivial information that has to be added like a STRING, NAME or NUMBER token which contains actual information.

So when there is no choice in building a wrapper of type N around M we write {N, M} and call it an interpolation between N and M. Interpolations can always be constructed algorithmically using syntactical information provided by the language grammar alone. If N = M, we identify {N, N} with N.

We have found already a valid interpolation {expr, term}. Other valid interpolations are {factor, STRING(”xy”)} and {factor, NUMBER(”3″)}. For term this already suffices to build a trace:

{factor, STRING(”xy”)}, STAR(”*”), {factor, NUMBER(”3″)}

and with this trace we get

{expr, term({factor, STRING(”xy”)}, STAR(”*”), {factor, NUMBER(”3″)})}

Now we are prepared to define an algorithm:

Let N be a node and A = (a1, ..., ak) a sequence of nodes.
Consider also the set of all nodes set(M1, ..., Mn) with
{N, Mi}, i=1,...,n being a valid interpolation starting with N.
 
For each Mi, i=1,...,n we try to build a trace
TA = ({A1, a1}, {A2, a1}, ..., {Ak, ak}).
 
If we have a found a trace for M is we get the result
{N, M({A1, a1}, {A2, a1}, ..., {Ak, ak})}

Autocompletion

Sometimes our algorithm might fail to find a trace for a node N and a sequence A but the error can still be corrected in a fully determinate fashion. Take the following rule for example:

dotted_name: NAME (’.’ NAME)*

together with a sequence A = (NAME, NAME) of two nodes. Obviously we cannot build a valid trace NAME DOT NAME from A directly but the insertion of DOT into the trace is fully determined by the structure of the rule. Moreover there is no degree of freedom in the selection of the token string for DOT. It can always only be “.”. So it is possible to omit the DOT in A and still get a uniquely determined trace for dotted_name.

Applications

We’ve come to an end already. With the prerequisites given above it is perfectly possible to write

expr(STRING(”xy”), STAR(”*”), NUMBER(”2″))

and get a valid parse or even shorten it and write

expr(’”xy”‘, “*”, 2)

which suffices to identify the token. Remind that this construction yields a parse tree which can be converted immediately to source code.

fn = lambda name, val: expr_stmt(name, “=”, val)

This lambda expression yields bindings of val to name. For example

fn(”a”, expr(’”xy”‘, “*”, 2)) is the parse tree equivalent of a = “xy”*2.

Notice that in any case the parse tree is syntactically correct by construction.

Wrapper Functions

Sometimes it is not easy to see how some expression with a particular semantics can be built. Take a function call for example. The Python grammar doesn’t provide a special node for it but just uses a special form of power which is defined as

power: atom trailer* ['**' factor]

This is very noisy and one better builds a functional wrapper which can be used for all sorts of calls:

def CST_Call(names, args = None, star_args = None, dstar_args = None):
    Names = [atom(names[0])]
    for name in names[1:]:
        Names.append(trailer('.', name))
    ARGS = list(args) if args else []
    if star_args:
        ARGS+=['*', star_args]
    if dstar_args:
        ARGS+=['**', dstar_args]
    if ARGS:
        return [symbol.power] + Names + [trailer('(', arglist(*ARGS), ')')]
    else:
        return [symbol.power] + Names + [trailer('(', ')')]

Choosers and ChooserMixins in C++ and Python

Posted in CPP, Chooser, Python, Testing on September 12th, 2009 by kay – 2 Comments

Chooser Objects

From time to time I’m amazed to find a simple algorithm which seemed to be a low hanging fruit which was just overlooked. In this particular case it is about generating and utilizing test data in a both simple and flexible manner. Mark Seaborn described the method in his outstanding blog article How to do model checking of Python code. He distilled what we might call the Chooser Algorithm from a scientific paper which buries the message under all sorts of methodological considerations and special case treatments and other bloat. This is sad because good algorithms are the crown jewels of programming. It also helped that he provided an implementation in Python and not in C or some sloppy computing-scientist-only pseudo code notation which changes from author to author.

We can motivate Chooser objects as follows.

Suppose you have a control flow statement defined in a function f. The path the flow control takes is determined by the value of some variable x:

def f(*args):
    ...
    x = g(*args)
    if x>0:
        ...
    else:
        ...

When we want to test the if-statement alone we can ignore the value of x computed by g. A simple method to achieve this is to introduce a for-loop in the code which iterates over a range of values which represent jumps to the individual if-statement branches:

def f(*args):
    ...
    x = g(*args)
    for x in (1,-1):
        if x>0:
            ...
        else:
            ...

However, this is a quite heavy change and we would likely not want to repeat this at another place. Instead of adding a for-loop we can introduce a non-deterministic choice over the values 1 and -1 and pull the iteration, represented by the loop, out of the function:

def test(chooser):
    def f(*args):
        ...
        x = g(*args)
        x = chooser.choose([1,-1])
        if x>0:
            ...
        else:
            ...
    f(1,2,3)  # call f with appropriate arguments

Here we inserted a call to choose which represents a set of choices. No new control flow is introduced. The function f must be called as many times as there are choices passed to choose.

The repeated call of f is managed by a new function check which is part of the Chooser Algorithm. It actually calls the test function which has a uniform interface and keeps a single chooser parameter.

class ModelCheckEscape(Exception): pass
 
def check(func):
    stack = [[]]
    while stack:
        chosen = stack.pop()
        try:
            func(Chooser(chosen, stack))
        except ModelCheckEscape:
            pass

The check function creates a Chooser object and passes it to func which is represents the system under test. The Chooser constructor takes two arguments. One is a list called chosen popped from a stack of such lists, the other one is the stack itself which might be filled with new lists.

class ModelCheckEscape(Exception): pass
 
class Chooser(object):
    def __init__(self, chosen, stack):
        self._chosen = chosen
        self._stack  = stack
        self._it     = iter(chosen)
 
    def choose(self, choices):
        try:
            choice = self._it.next()
            if choice not in choices:
                raise Exception("Program is not deterministic")
            return choice
        except StopIteration:
            self._stack+=[self._chosen + [choice] for choice in choices]
            raise ModelCheckEscape()

This is the definition of the Chooser object. It is a tiny bit of elementary but ingenuous code. In order to understand what it does consider the following test function with its three calls of choose:

def test(chooser):
    x = chooser.choose([True, False])
    if x:
        y = chooser.choose(["a", "b"])
    else:
        z = chooser.choose(["y", "z"])

On each choose call a value is returned from the _it iterator. Those values must conform to the choices passed to choose for every call of choose. Otherwise a ChooserException is raised. So we expect _it to be an iterator wrapped around lists like [True, "a"], [True, "b"], [False, "y"], [False, "z"]. Those lists are associated with the choices being made at (x, y) or (x, z).

In fact we observe some more of those lists, starting with the empty list [] and the incompletely filled lists [True] and [False]. When _it is wrapped around an incomplete list one of the choose calls will raise a StopIteration exception at _it.next(). Assume for example that _it = iter([True]) then _it is already exhausted after x and choose and will raise StopIteration at the definition of y. At this point each of the choices at y i.e. “a” and “b” will produce a new list. Those lists are [True, "a"] and [True, "b"] which are now complete. New lists are pushed on the stack as long as incomplete lists are popped from the stack incheck().

As a special case we consider a simple linear sequence of choose calls

def test(chooser):
    x = chooser.choose([True, False])
    y = chooser.choose(["a", "b"])
    z = chooser.choose(["y", "z"])

The set of complete lists according to this sequence will be the Cartesian product of the choices: {True, False} x {”a”, “b”} x {”y”, “z”}. If you just want Cartesian products there are more efficient alternatives to create them though.

These are the Chooser basics. For Python you can download the used code here.

Choosers in C++

I gave a C++ and STL based implementation of Chooser objects. The Chooser C++ API closely follows the Python example. You can download the code from the linked document.

In its most general form the choose method has following signature:

    template <typename Container>
    typename Container::value_type choose(Container& choices)

The return type is derived from the containers value_type attribute. Other than this the algorithm only relies on iterators which means that any STL container can be used. We can rewrite the simple test function above in C++:

void test(Chooser& chooser) {
    int x = chooser.choose(2);
    if x {
        string s = "ab";
        char y = chooser.choose(s);
    }
    else {
        string s = "yz";
        char z = chooser.choose(s);
    }
}

This is not all that much overhead. In case of the x definition we use an overloaded version of choose which takes a single integer parameter k. This is equivalent to a choice of values within the range {0, 1, …, k-1}. The most relevant case may be choose(2) which is the boolean choice.

The string type is an STL container type as well. More precisely it is a typedef for basic_string<char>. We can create a string object with a string literal but we cannot pass a string literal directly to choose which expects an explicit reference to a container from which the return type is derived ( char in this case ).

ChooserMixin classes

Suppose we want to introduce Chooser objects into arbitrary methods of an existing class. The Chooser Algorithm is implemented s.t. a Chooser object is explicitly passed as a parameter but this would require changes in a methods interface, something we try to avoid.

Visibility of Chooser instances in the local scope of a method can also be achieved by making them global or member variables. An inexpensive method which is safer than using globals is to use a mixin class. The mixin class defines aChooser instance and if some class wants to use it, it derives from the mixin.

class ChooserMixin {
protected:
    Chooser chooser;
public:
    void test() = 0;
 
    void check()
    {
        ...
        this->chooser = Chooser(chosen, queue);
        test();
        ...
    }
}

The test method is abstract. If f is the method we want to check, then the implementation of test would just invoke f with appropriate parameters:

void test() {
    f(arg1, arg2, ...);
}
It’s easy to change test without touching any other source code.

More advantages of ChooserMixins

When we use ChooserMixin we can define the choices C being used in chooser.choose(C) also as member variables. This makes choices configurable. A subclass of a ChooserMixin might read data from an external file or a database and populate the C container.

I wonder if it’s even possible to get rid of T x = chooser.choose(C)assignments in method source when using data binding techniques? In JavaFX we can restate the assigment in the form

var x = bind chooser.choose(C)

The bound variable x is updated whenever C is changed. Instead of creating a new instance of Chooser on each iteration, we replace the members defined in a single instance and trigger updates of C which in turn causes chooser.choose(C) to produce a new value. It remains to be examined if this idea is somehow practical.

Linear bounds for backtracking parsers with memoization

Posted in Parsing on June 12th, 2009 by kay – 2 Comments

On the comp.compilers mailing list I asked for a confirmation of the O(n) complexity claim for packrat parsers. A packrat parser is a particular top down recursive descendant parser which applies backtracking on failure of consuming a token by application of a parsing expression. Now a backtracking parser usually exhibits O(2n) complexity and the essential claim is that memoization brings it down to O(n) on the cost of expanding storage. The work of Bryan Ford about packrat parsers refers to an original article written by Birman & Ullmann that gave a proof. The cited article costs 19$ at IEEE which I felt I would spent when I failed to give a proof by myself. Fortunately the proof is elementary and not complicated. So I could save my money.

Here we go.

Each non-terminal N of a grammar G has a finite first set {T1, .., Tn} which are terminal symbols that can be recursively determined by examining the start of the rule N. So if N has the shape

N → R1 … | R2 …

then

FirstSet(N) = FirstSet(R1) ⋃ FirstSet(R2) …

with FirstSet(Ri ) = { Ri } if Ri is a terminal symbol.

Next we consider the preimages of terminals with regard to FirstSets. We call them ancestors and for a terminal T the set of ancestors is defined by:

Ancestors(T) = {T} ∪ {N ∈ NonTerminals(G)| T ∈ FirstSet(N)}

Each terminal T has at least one ancestor ( itself ) and at most K = #NT(G)+1 where #NT(G) is the number of non-terminals of the grammar G.

Suppose now we feed a token stream T1 T2 … Tn into a top down parser. For each Ti in the stream we can apply all of the rules in Ancestors(Ti) and store the results in a set

Trees(i) = {(i, Tree(R))| R ∈ Ancestors(Ti) }

Since for all i the inequality#Ancestors(Ti) ≤ K holds the set

Trees = i∈{1, …, n}Trees(i)

has at most K*n elements. If a new tree needs to be computed the sub-trees may be looked up in Trees. If one is missing that one is computed as well etc. In any case no more than K*n computations need to be performed to determine all elements of Trees. Since K was a constant that depended only on the grammar we can conclude that the asymptotic complexity of computing Trees is O(n). Since Trees contains a solution to our problem it can be computed in O(n) as well.

Is parsing Perl really impossible?

Posted in Grammars, Parsing on June 12th, 2009 by kay – 5 Comments

I just read this article about the apparent inability to parse Perl 5.

There is an underlying assumption that a parser has to resolve all ambiguities and derive a single parse tree from an expression ( giving a unique interpretation ). But instead of demanding the value of an expression in order to parse it the parser might derive multiple parse trees and insert them as child nodes into a new parse tree representing a conditional expression. So the parser can just inflate the parse tree structure and do its job.

Whether or not the conditional expression can be computed at runtime doesn’t have to bother the compiler developer. It is just another function call and no more dubious then an arbitrary function call in a conditional expression

if(foo())
    printf("foo() is true\n");
else
    printf("foo() is false\n");

So it makes sense to fix requirements to a parser before giving proofs of their impossibility.

CodeTemplates - a fresh look on code transformations

Posted in DSL, Grammars, TBP on June 7th, 2009 by kay – Be the first to comment

This is the first part of a two part article series about CodeTemplates. CodeTemplates are a code transformation technique that suggests an alternative to common syntactic as well as lexical macros for whole language transformations.

Syntactic macros

About three years ago I experimented with a “syntactic” macro system for Python incorporated in EasyExtend 2. In the meantime another project emerged called MetaPython striving for a similar goal under the sign of the Ouroboros.

Fundamentally a macro system defines functions called macros that act as compile time code generators. It is not surprising that code is data - how else shall a compiler work? - however few languages specify functions that treat source code as source code and operators that evaluate source code at compile time. By quasi-quotation some piece of code in a macro body is mentioned rather than evaluated. Complementary to quasi-quoting there is a splice-operator that evaluates code and returns other code preferably in AST form that becomes aligned with the AST of the quasi-quoted code.

The difference between source code and ASTs is going to vanish in so called homoiconic languages like Lisp where the source code that is written by the user is also a data-structure of the language. So a macro is essentially a function that transforms those data-structures meant to be code. It doesn’t mean though that homoiconicity is prerequisite for syntactic macros. The cited MetaPython but also MetaLua are examples for macro systems in non-homoiconic languages. A nice treatment of the foundations of macro systems in non-homoiconic languages can also be found in the documentation of the Converge language.

Syntactic macros in EE 2

EasyExtend 2.0 defined a macro system in an own langlet, simply called the macro langlet. The role of macros in EasyExtend is somewhat peculiar though because EE is a system used for whole language transformations with language syntax specified in grammars. EE binds node handlers to parse tree node types and those node handlers emit parse tree nodes of the target language. They were just meant to facilitate code transformations within the framework.

Unfortunately it turned out that they didn’t. A bug in a macro caused hard to detect language transformation errors. Error recovery is a major concern if not the single major concern with language transformers and the complicated quasi-quoting / splicing semantics of macros made this even harder. More often than I ever wanted I debugged through framework when attempting to detect a bug in the macro. Furthermore the macro langlet operated within the system and was transformed as well. So two languages were transformed at the same time whereas one of them was additionally engaged in transforming the other. EasyExtend 2 made this possible without any conflicts implementing a joint transformation machinery and separated ranges of node types node handlers could bind to. The macro transformation was implemented as a set of node handlers in the LangletTransformer of the macro langlet and it was an ugly beast of code. I intended to rewrite this for EE 3 but it at some point of time I decided to terminate this intellectual exercise and dropped the macro facility completely.

Why not just textual macros?

A textual substitution macro occasionally also called “lexical” macro is a simple way to do code transformations. The most important programming language ever, C, implements them by means of the C preprocessor. Lexical macros are also known as poor-mans macros and using them in EasyExtend was like leaving civilization because city-life is complicated and corrupt and one better strives for a subcomplex life as a farmer or hunter-gatherer. Furthermore EE is in deep love with syntax so a radical conceptualization of the macro idea is required. Fortunately the intellectual effort for this is low and can be summarized in one sentence

Enable arbitrary string transformations guarded by syntax

This is universally applicable for all context free languages and doesn’t require any additional macro syntax.

If one attempts to transform one string S1 into another one S2 all that is needed are the following three operators

  • insert
  • delete
  • subst

The edit distance between S1 and S2 is the minimal number of operations of this kind used to transform S1 into S2. It can be computed by the popular algorithm of Levenshtein.

EasyExtend 4 will use a slightly different set of operators

  • replicate
  • delete
  • subst

It is almost equally expressive except for the case of an empty string S1 that requires an initial insert and there is nothing yet to replicate, delete or substitute. But once you have a single character in a string, one can replicate this character and apply substitutions which yields the same as applying multiple insertions.

CodeTemplate objects in EE 4

EasyExtend 4 defines a module called csttemplate.py that defines 4 types of objects called CodeTemplate, CodeMarker, CodeSelection and SelectionTree.

A CodeTemplate is an object that wraps some piece of source code in some language. The source code is parsed on construction of the CodeTemplate.

A CodeMarker marks sections in the source by means of search pattern. One can bind arbitrary many CodeMarker objects to a CodeTemplate. It is possible to also bind CodeMarkers to other CodeMarkers building a CodeMarker tree. When a CodeMarker B is bound to another CodeMarker A the search pattern of B is only applied to the code marked by A.

A CodeTemplate implements a matchmethod. If it is applied the pattern specified by all bound CodeMarker objects are matched against the code of the CodeTemplate. The result is a SelectionTree object. The tree holds a list of CodeSelection objects as well as a list of subtrees.

All transformations of the source code wrapped by the CodeTemplate are applied through CodeSelection objects. It is the CodeSelection object that substitutes, replicates or deletes code that is bound to. Note that internally all those operations still happen on a parse tree. Invalid operations will immediately be rejected and an exceptions is raised. It is not a complete syntax check on the whole code that is applied on each operation and it wouldn’t make sense to perform one either because some piece of code can morphed into code of another language and it will be a hybrid piece of code except at transformations start and the endpoint.

The SelectionTree is clonable and can be reused for other substitutions.

Outlook

In the next article I’ll show CodeTemplates in action.

Trace Based Parsing (Part V) - From Grammars to NFAs

Posted in EasyExtend, Grammars, TBP, Trail on May 7th, 2009 by kay – Be the first to comment

In this article I will demonstrate the translation process by which an EBNF grammar will be translated into a Trail NFA. There are several different notations for EBNF that differ only in minor details. I’ve chosen the one of Pythons Grammar which can roughly be described using following set of EBNF rules:

file_input: ( RULE | NEWLINE )* ENDMARKER
RULE: NAME ':' RHS NEWLINE
RHS: ALT ( '|' ALT )*
ALT: ITEM+
ITEM: '[' RHS ']' | ATOM [ '*' | '+' ]
ATOM: '(' RHS ')' | NAME | STRING

The grammar is self-describing. Starting with the non-terminal file_input you can parse the grammar using its own production rules. The multiplicities the EBNF grammar uses are:

  • [X] - zero or one time X
  • X* - zero or more times X
  • X+ - one or more times X

EasyExtend defines a langlet called grammar_langlet used to parse grammars. With the grammar_langlet EBNF grammars become just ordinary languages supported by EE. The parse trees created from grammars will be transformed into NFAs.

From EBNF parse trees to Trail NFAs

The grammar_langlet defines transformations on parse trees. More concretely it wraps pieces of the grammar into Rule objects. Rule objects are basically list wrappers with an additional copy method which is not of interest here.

class Rule(object):
    def __init__(self, lst):
        self.lst = lst
 
    def copy(self):
        ...

From Rule we derive several subclasses which can be reflected by issubclass:

class ConstRule(Rule): ...
class SequenceRule(Rule): ...
class AltRule(Rule): ...
class EmptyRule(Rule): ...

What you don’t find are rules like MaybeRule or ZeroOrMoreRule because we eliminate multiplicities entirely. This will discussed later.

Names and strings are wrapped wrapped into ConstRule objects as follows:

ConstRule([('file_name', 0)])
ConstRule([('"("', 1)])
ConstRule([('RULE', 2)])
...

In this scheme we already recover the indices that directly translates to states of Trail NFAs. A special case of a ConstRule is the rule that wraps None:

ConstRule([('None', '-')])

The rules of type SequenceRule and AltRule serve as containers for other rules.

Elimination of multiplicities

We apply a trick to get rid of rules with multiplicities i.e. rules defined by X*, X+ and [X]. All of them can be reduced to special combinations of our basic rule set. Before we show the details we need to introduce the EmptyRule type. Apart from the unique exit state (None, ‘-’) there might be several other empty states (None, nullidx). Just like the constant rules those empty states get enumerated but with an own index and just like any but the exit state they might occur on the left-hand side of a rule. The idea of using those empty states might be demonstrated using a simple example. Consider the following rule:

R: A [B]

We can manually translate it into the NFA

(R, 0): [(A, 1)]
(A, 1): [(B, 2), (None, '-')]
(B, 2): [(None, '-')]

However the translation becomes much simpler when we reify empty productions and insert them first:

(R, 0): [(A, 1)]
(A, 1): [(B, 2), (None, 1)]
(None, 1): [(None, '-')]
(B, 2): [(None, '-')]

What we have effectively done here is to interpret [B] as B | None. In a second step we eliminate the indexed empty states again which will lead to NFAs of the first form. Removing A* and A+ is similar but a little more tricky. We write

A*  → None | A | A A
A+  → A | A A

This looks wrong but only when we think about those transitions in terms of an an indexing scheme that assigns different numerals to each symbol that occurs in the rule. But now all A on the right hand side shall actually be indistinguishable and lead to identical states. With this modification of the translation semantics in mind we translate the grammar rule R: A* into

(R, 0): [(A, 1), (None, 1)]
(A, 1): [(A, 1), (None, '-')]
(None, 1): [(None, '-')]

Eliminating the indexed empty state leads to this NFA

(R, 0): [(A, 1), (None, '-')]
(A, 1): [(A, 1), (None, '-')]

which is just perfect.

The reduction of rules with multiplicities to sequences and alternatives can be applied to Rule objects as well:

[A] →  AltRule( EmptyRule([(None, nullidx)]), A)
A*  →  AltRule( EmptyRule([(None, nullidx)]), A, SequenceRule([A, A]))
A+  →  AltRule( A, SequenceRule([A, A]))

Since we have found a simple a way to express grammars by rule trees containing only the mentioned rule types we can go on and flatten those trees into a tabular form.

Computing NFAs from rule trees

First we define an auxiliary function called end(). For a rule object R this function computes a set of constant or empty rules that terminate R. This means end(R) contains the last symbols of R wrapped into rule objects.

def end(R):
    if not R.lst:
        return set()
    if isinstance(R, (ConstRule, EmptyRule)):
        return set([R])
    elif isinstance(R, AltRule):
        return reduce(lambda x, y: x.union(y), [end(U) for U in R.lst])
    else:  # SequenceRule
        S = R.lst[-1]
        return end(S)

With this function we can implement the main algorithm.

Suppose you have a SequenceRule(S1, …, Sk). The end points of each S[i] are computed by application of end(S[i]) and shall be connected with the start points of S[i+1]. Those starting points will build the follow sets of S[i]’s end points. If S[i+1] is a constant or empty rule we are done. If S[i+1] is an AltRule we compute the connection of S[i] with each entry in S[i+1]. If S[i+1] is a SequenceRule we recursively call our connection building algorithm with S[i] being prepended to the sub-rules of S[i+1]. In finitely many steps we always find the complete set of constant/empty rules S[i] can be connected with. Here is the Python implementation:

def build_nfa(rule, start = None, nfa = None):
    if not start:
        nfa = {}
        # rule.lst[0] == (rule name, 0)
        # rule.lst[1] == SequenceRule(...)
        start = set([ConstRule([rule.lst[0]])])
        return build_nfa(rule.lst[1], start, nfa)
    if isinstance(rule, SequenceRule):
        for i, R in enumerate(rule.lst):
            build_nfa(R, start, nfa)
            start = end(R)
    elif isinstance(rule, AltRule):
        for R in rule.lst:
            build_nfa(R, start, nfa)
    else: # ConstRule or EmptyRule
        r = rule.lst[0]
        for s in start:
            state = s.lst[0]
            follow = nfa.get(state, set())
            follow.add(r)
            nfa[state] = follow
    return nfa

The last step is to remove indexed empty states (None, idx).

def nfa_reduction(nfa):
    removable = []
    for (K, idx) in nfa:
        if K is None and idx!="-":
            F = nfa[(K, idx)]
            for follow in nfa.values():
                if (K, idx) in follow:
                    follow.remove((K, idx))
                    follow.update(F)
            removable.append((K, idx))
    for R in removable:
        del nfa[R]
    return nfa

This yields an NFA. In order to make it suitable for Trail one additionally has to map each state (A, idx) to (A, idx, R) with the rule id R.

What’s next?

So far all our discussions were placed in the realm of nice context free languages and their aspects. However the real world is hardly that nice and real language syntax can only be approximated by context free formalisms. Context sensitive languages are pervasive and they range from Python over Javascript to C, C++, Ruby and Perl. Instead of augmenting grammars by parser actions EasyExtend factors context sensitive actions out and let the programmer hacking in Python.