Python26 expressions

When you look at the following listing you might think it’s just a sequence of nonsense statements in Python 2.6, maybe created for testing purposes:

raise a, b, c
import d
from e import*
import f
from .g import(a)
from b import c
from .import(e)
from f import(g)
from .a import(b, c as d,)
import e, f as g
from .a import(b, c,)
...

( Here is the complete listing ).

That’s of course correct. It is a somewhat peculiar listing, because you find expressions like

c[:,: d:,]**e

or

[c for d in e, lambda f = g, (a, b,) = c,: d, for e in f]

and a few others such as

b(*c, d, **e)**f

which will be rejected by the Python compiler because the `*c` argument precedes the name `d`, but it is nevertheless “grammar correct”, by which I mean it is consistent with Python’s context-free LL(1) grammar.

The nice thing about the listing: it is automatically generated and it is complete in a certain sense.

Generative Grammars

When you look at a grammar rule like

for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]

it can be understood as a recipe for producing exactly two expressions, namely:

'for' exprlist 'in' testlist ':' suite
'for' exprlist 'in' testlist ':' suite 'else' ':' suite

Other rules like

dictmaker: test ':' test (',' test ':' test)* [',']

have an infinite number of productions:

test ':' test
test ':' test ','
test ':' test ',' test ':' test
test ':' test ',' test ':' test ','
...
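
To make the infinity concrete, here is a throwaway generator which enumerates these productions in the listed order ( purely illustrative, it is not the selection algorithm discussed below ):

def dictmaker_productions(max_reps):
    # dictmaker: test ':' test (',' test ':' test)* [',']
    head = ["test", "':'", "test"]
    for n in range(max_reps + 1):
        body = head + ["','", "test", "':'", "test"] * n
        yield " ".join(body)             # without trailing comma
        yield " ".join(body + ["','"])   # with trailing comma

for p in dictmaker_productions(1):
    print p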

When I created the listing I selected a small number of productions for each grammar rule. Each symbol in the rule should be covered and have at least one occurrence in the set of productions. Despite `for_stmt` being finite and `dictmaker` being infinite, the algorithm creates two productions for each.

After having enough productions to cover all syntactical subtleties ( expressed by the grammar ) I had to build one big rule containing all productions. This was actually the most demanding step in the design of the algorithm and I initially got it wrong.

Embedding of productions

Intuitively we can interpret all non-terminal symbols in our productions as variables which may be substituted. We expand `test` in `dictmaker` by selecting and inserting one of the productions we got for the rule `test`. Unfortunately a grammar isn’t a tree but a directed, cyclic graph, so we have to be extremely careful not to run into an infinite replacement loop. This is only a technical problem though and it can be handled using memoization. Here is a bigger one.

Look at the following two rules:

expr_stmt: testlist (augassign (yield_expr|testlist) |
           ('=' (yield_expr|testlist))*)
augassign: ('+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' |
            '<<=' | '>>=' | '**=' | '//=')

The only place where `augassign` occurs in the grammar is in `expr_stmt`, but counting the number of productions for `augassign` we get 12, whereas we only count 3 productions for `expr_stmt`, and there is just a single production which contains `expr_stmt`. A naive top-down substitution will obviously leave over productions which can’t be integrated. We have a system of dependencies which has to be resolved, and the initial set of productions must be adapted without introducing new productions which in turn cause new problems. This is possible, but in my attempts the expressions became large and unreadable, so I tried something else.

Observe that the most important start rule of the grammar ( Python actually has 4! Can you see which ones? ) is:

file_input: (NEWLINE | stmt)* ENDMARKER

I expect that each language has a rule of such a kind on a certain level of nesting. It produces a sequence of statements and newlines. I tried the following Ansatz:

*Wrap each initially determined production which is not a production of a start rule into a stmt*

Take the production '+=' of `augassign` as an example. We find that `augassign` occurs in `expr_stmt`. So we take one `expr_stmt` and *embed* `augassign` in the concrete form '+=':

testlist '+=' yield_expr

The subsequent embedding steps are

expr_stmt   -> small_stmt
small_stmt  -> simple_stmt
simple_stmt -> stmt

When embedding `small_stmt` into `simple_stmt` one has to add a trailing NEWLINE. So our final result is:

testlist '+=' yield_expr NEWLINE

Any rule we used during successive embedding doesn’t have to be used again as the initial rule of another embedding, because it was already built into `file_input`. It can be reused though when needed. I did not attempt to minimize the number of embeddings.

Substitute non-terminals

Now that we have a single sequence of terminals and non-terminals which contains all our productions in a consistent way, we are going to substitute the non-terminals. This is done such that a minimum number of terminal symbols is required, which explains some of the redundancies: we find `import f` and `import d` among the listed statements. I suspect one of them is a shortened form of `import d.e`, but since the rule for building `d.e` also allows plain `d`, and that is shorter, it will be chosen.

Detecting Grammar flaws

Generating the above expressions also exposes some flaws in the grammar which have to be corrected by the bytecode compiler ( or an AST transformer ). This doesn’t mean that Python’s grammar isn’t carefully crafted, quite the contrary is true, but it highlights some of the limitations of using an LL(1) grammar. For example, it is quite simple, although a little cumbersome, to express argument orderings in variable argument lists using non-LL(1) grammars:

file_input: (NEWLINE | stmt)* ENDMARKER
simpleargs: fpdef (',' fpdef)*
defaultargs: fpdef '=' test (',' fpdef '=' test)*
starargs: '*' NAME
dstarargs: '**' NAME
varargslist: ( simpleargs [',' defaultargs] [',' starargs] [','dstarargs] |
               defaultargs [',' starargs] [','dstarargs] |
               starargs [','dstarargs] |
               dstarargs) [',']

So when you craft your own grammar, automatic expression generation might aid design decisions. Detecting flaws early can spare lots of code used to add additional checks later on.

Refactorings

In the case of Langscape the primary goal was to safeguard grammar refactorings. It is not generally possible to prove that two context-free grammars are equivalent, i.e. recognize the same language. But the same holds for any two programs in even more powerful, Turing-complete languages, and this doesn’t imply we never change any code. It is standard practice to safeguard refactorings using unit tests, and that is what we start to do here.

If we assume that two different grammars G1, G2 recognize the same language L, then their parsers P(G1), P(G2) must at least be able to parse the grammar-generated expressions of the other grammar: P(G1)(Expr(G2)) -> OK;  P(G2)(Expr(G1)) -> OK.

Of course we can refine this criterion by including bad case tests or comparing the selection sequences of TokenTracers for Expr(G1), Expr(G2) which must be equal. Last but not least we can use higher approximations.

Higher approximations

Doesn’t the listing give us a 1st order approximation of the language? It’s a fun idea to think of all those listing expressions as living in the "tangential space" of the language. “Higher approximation” would simply mean longer traces though ( if they are possible due to the presence of a Kleene star ). This yields a simpler idea: we create the set `Tr(K, nfa)` of traces of length `K` for a given nfa. `Tr(K, nfa)` may be empty for some K. Unfortunately we can’t infer from `Tr(K) = {}` that `Tr(K+1) = {}`. So what is a good stop criterion then?

The algorithm for creating `Tr(K, nfa)` is quite simple. The following functions are Langscape implementations:

def compute_tr(K, nfa):
    '''
    Computes the set Tr(K, nfa) of traces of length K for a given nfa.
    The return value may be [] if no trace of length K exists.
    '''
    _, start, trans = nfa
    return compute_subtraces(K, 0, start, [], trans)
 
def compute_subtraces(K, k, S, trace, trans):
    '''
    Computes complete traces of a given length.

    :param K: The prescribed length a trace shall have.
    :param k: The current length of a trace ( used by recursive calls ).
    :param S: The current state from which the follow states are computed.
    :param trace: the current trace.
    :param trans: the {state:[follow-states]} dictionary which characterizes
                  one NFA.
    '''
    traces = []
    follow = trans[S]
    for F in follow:
        if F[0] is None:
            # termination condition fulfilled?
            if k == K:
                traces.append(trace+[F])
        else:
            m = trace.count(F)
            # impossible to terminate trace under this condition
            if m == K:
                continue
            else:
                traces+=compute_subtraces(K, max(k,m+1), F, trace+[F], trans)
    return traces
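
As a quick check, the functions can be fed with a hand-made toy NFA in the same ( rule, start state, transition dictionary ) format as above; the NFA below is made up for illustration and doesn't stem from a real grammar:

toy_nfa = ["X: 'a'+",
           ('X', 0, 'X'),
           {('X', 0, 'X'): [('a', 1, 'X')],
            ('a', 1, 'X'): [('a', 1, 'X'), (None, '-', 'X')]}]

print compute_tr(2, toy_nfa)
# -> [[('a', 1, 'X'), ('a', 1, 'X'), (None, '-', 'X')]]  i.e. one trace of length 2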
Posted in Grammars, Langscape, Python | Comments Off on Python26 expressions

Open source saturation

Reading the following post by Jimmy Schementi, who explained his exit from Microsoft with the loss of MS’s interest in IronRuby, I start to wonder if this isn’t a sign of the times? Open source projects get started by a small team of employees and killed when they don’t attract a community which carries them forward, which rarely ever happens, because everyone in OSS is already busy and either engaged with a major project, a brand which was established a few years ago like (C)Python, Rails, Linux or Django, or doing solo acts as in my own case. Same with Google Wave, which was promising, but the only wave it produced was a tsunami of initial attention in the wikiredditblogosphere. Everyone expected Google would carry it forward just like any other commodity. I guess the same would happen to their Go language, which was started by a superstar team of veteran programmers and would immediately go away if Google discontinued investment.

There are very few brands which are both new and doing well, like Clojure and Scala, which seem to follow Python’s BDFL model and are – unsurprisingly? – programming languages. Are there other examples of OSS projects that peaked in the last 2-3 years and established a community of regular committers who are not interns of a single company, or do we see an almost inevitable saturation?

Posted in General, Programming Culture | 7 Comments

Langscape

Trails in a Langscape

Welcome to Trails in a Langscape, which is the new title of this blog. It is a minor change since URLs are not affected and the character of the blog will also remain the same. Langscape is the successor project of EasyExtend and is publicly hosted at Google Code.

Since I created this WordPress blog instance I have slowly worked on a new EasyExtend release. I published lots of related technical ideas but never released any code. Now the code is out. It lives in an Hg repository, it is filed under a new label, and hopefully a first packaged Langscape 0.1 release will follow soon. There is no project documentation at the moment and I am still thinking about its organization. Another open issue is packaging and distribution, but I have no idea what is up-to-date in this area, how Langscape will possibly be used, or whether anyone will ever create langlets or just use the growing toolbox applicable to Python, including syntactically guarded search and replace.

Europython 2010

Of course the time of publication is not arbitrarily chosen. I am attending Europython 2010 next week in Birmingham and give a talk about EasyExtend/Langscape on Wednesday, late in the afternoon, before we leave for the conference dinner. I hope many of you go to Europython as well and I’ll find a few of my casual readers in the audience. If the talk is as good as the fun I had preparing my slides, you’ll enjoy it as well.

Posted in General | Comments Off on Langscape

Token Tracers

When I started programming EasyExtend in 2006 one of the major problems was the correct grammar -> NFA translation. I used big grammars and testing for correctness required lots of source code. The first heuristics I used were ugly and complex and it took about two years to find a neat trick which finally led to their complete replacement. The basic problem of systematic phrase or expression generation for testing purposes persisted though – until last week, when I implemented a TokenTracer.

Tracers

A typical production rule in the Trail parser generator is translated into a single NFA which might look as in the following example

 1005: ["funcdef: [decorators] 'def' NAME parameters ':' suite",
        (1005, 0, 1005),
        {(1, 3, 1005): [(1006, 4, 1005)],
         (11, 5, 1005): [(1043, 6, 1005)],
         ('def', 2, 1005): [(1, 3, 1005)],
         (1004, 1, 1005): [('def', 2, 1005)],
         (1005, 0, 1005): [('def', 2, 1005), (1004, 1, 1005)],
         (1006, 4, 1005): [(11, 5, 1005)],
         (1043, 6, 1005): [(None, '-', 1005)]}],

It is not created for readability but it is nevertheless easy to decode. The `funcdef` grammar rule is assigned a numerical value, a rule identifier – here `1005`. Associated with the rule identifier is a 3-list consisting of

  1. The rule in plain text
  2. The start state of a finite automaton (1005, 0, 1005)
  3. A finite automaton encoded as a dictionary of transitions.

Starting with `(1005, 0, 1005)` one can step through the automaton. The follow states are `[('def', 2, 1005), (1004, 1, 1005)]`. The first one obviously represents the `def` keyword whereas the second is a representation of the `decorators` non-terminal which has the rule identifier `1004`. When you select the `(1004, 1, 1005)` state there is a single follow state, which is again the state of the `def` keyword; otherwise you get the follow state `(1, 3, 1005)` of `('def', 2, 1005)`. The state `(None, '-', 1005)` doesn’t have a follow state and it is the only one.

You can now define a function that keeps track of this stepping process through a rule. This function is called a Tracer.

A Tracer acts as follows:

>>> tracer = Tracer(rules)
>>> tracer.select(1005)   # selects automaton 1005 and returns the rule ids of the
['def', 1004]             # possible follow states
>>> tracer.select('def')
[1]
>>> tracer.select(1)
[1006]
...
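
To make the stepping mechanism concrete, here is a minimal tracer for the single `funcdef` automaton shown above ( an illustrative sketch, not the Trail implementation ):

funcdef_nfa = ["funcdef: [decorators] 'def' NAME parameters ':' suite",
               (1005, 0, 1005),
               {(1, 3, 1005): [(1006, 4, 1005)],
                (11, 5, 1005): [(1043, 6, 1005)],
                ('def', 2, 1005): [(1, 3, 1005)],
                (1004, 1, 1005): [('def', 2, 1005)],
                (1005, 0, 1005): [('def', 2, 1005), (1004, 1, 1005)],
                (1006, 4, 1005): [(11, 5, 1005)],
                (1043, 6, 1005): [(None, '-', 1005)]}]

class MiniTracer(object):
    def __init__(self, nfa):
        _, self.start, self.trans = nfa
        self.states = []

    def select(self, rule_id):
        if not self.states:              # the first selection enters the rule
            self.states = [self.start]
        follow = []
        for state in self.states:        # keep only the traces matching the selection
            if state[0] == rule_id:
                follow.extend(self.trans[state])
        self.states = follow
        labels = []                      # deduplicated labels of the follow states
        for S in follow:
            if S[0] not in labels:
                labels.append(S[0])
        return labels

Stepping through it reproduces the selections shown above:

>>> tracer = MiniTracer(funcdef_nfa)
>>> tracer.select(1005)
['def', 1004]
>>> tracer.select('def')
[1]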

It is possible that a Tracer has to keep track of multiple traces at once. For example the `exprlist` rule

 1069: ["exprlist: expr (',' expr)* [',']",
        (1069, 0, 1069),
        {(12, 2, 1069): [(1053, 3, 1069)],
         (12, 4, 1069): [(None, '-', 1069)],
         (1053, 1, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)],
         (1053, 3, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)],
         (1069, 0, 1069): [(1053, 1, 1069)]}],

defines transitions of the kind

(1053, 1, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)]

with two states of rule id `12` in the follow set. When `12` is selected in the Tracer, the follow sets of all states with rule id 12 are unified:

>>> tracer.select(1069)
[1053]
>>> tracer.select(1053)
[12, None]
>>> tracer.select(12)
[1053, None]
...

TokenTracers

This kind of tracing functionality is central to EasyExtend’s implementation of Trace Based Parsing (TBP). For single grammar rules TBP coincides with “Thompson NFA” style parsing, discussed at length by Russ Cox or more recently by Carl Friedrich Bolz, who gave a Python implementation.

We want to consider now a different sort of tracer which is more complicated to create than those for single grammar rules. Those tracers have to meet the following requirement:

The list of rule id’s returned from tracer.select() shall contain only None or rule id’s of terminal symbols.

The rule id’s of terminals are exactly the  token types. The `select` function of a TokenTracer returns a list of token types and gets fed with a single token type. In the following example we step through the token stream of a simple function

def foo():
    print 42

Here we go

>>> tracer = TokenTracer(rules)
>>> tracer.select(1001)  # a single select using a top level non-terminal
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select('def')
[1]
>>> tracer.select(1)     # foo
[7]
>>> tracer.select(7)     # (
[1, 7, 8, 16, 36]
>>> tracer.select(8)     # )
[11]
>>> tracer.select(11)    # :
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select(4)     # \n
[5]
>>> tracer.select(5)     # INDENT
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select('print')
[1, 2, 3, 4, 7, 9, 13, 13, 14, 15, 25, 26, 32, 35, 'lambda', 'not']
>>> tracer.select(2)     # 42
[4, 7, 9, 12, ..., 36, 48, '<>', 'and', 'if', 'in', 'is', 'is', 'not', 'or']
>>> tracer.select(4)     # \n
[1, 2, 3, 6, 7, ... , 'try', 'while', 'with', 'yield']
>>> tracer.select(6)     # DEDENT
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select(0)     # ENDMARKER

Application 1 – error detection

Using a TokenTracer it is dead simple to localize a syntax error which is – in the context free case – always an unexpected token. In principle Trail could delegate error recovery entirely to a TokenTracer.

Application 2 – autocorrection

A constant token is a token with a constant token string, e.g. ';' or ':'. Closely related are tokens like INDENT where the token string can be derived from context and a prescribed indentation. In sharp contrast are tokens like NAME, NUMBER and STRING where the token string is not language- but user-determined. In the select() sequence above we find constant token lists of length 1 like [11] or [7]. If one of those tokens is omitted it can be inserted without guessing.

Application 3 – expression generation

The most intriguing aspect of TokenTracers is that each random token sequence which is constrained by a TokenTracer is syntactically correct. This can be used to create expression generators: first write a grammar G to describe the language syntax, then derive a TokenTracer(G). Finally an expression generator `ExprGen(TokenTracer(G))` is created which uses the TokenTracer to build random token sequences compliant with G. Those token sequences can either be turned into valid parse trees and compiled, or un-tokenized into source code.
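
A naive sketch of such a random walk on top of the TokenTracer’s `select()` interface might look like this ( illustrative only; a real ExprGen additionally has to substitute concrete token strings for NAME, NUMBER and STRING and to choose termination more cleverly ):

import random

def random_token_types(tracer, start_symbol, max_len=100):
    # random walk through the grammar, constrained by the TokenTracer
    selection = tracer.select(start_symbol)
    seq = []
    while selection and len(seq) < max_len:
        tok_type = random.choice(selection)
        if tok_type in (None, 0):      # rule exhausted / ENDMARKER selected
            break
        seq.append(tok_type)
        selection = tracer.select(tok_type)
    return seq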

A valuation function `fitness(expr)` -> `float` on expressions motivates the use of genetic programming for breeding expressions of a certain kind. For example I’m strongly interested in compact grammars which create big NFA expansions in Trail. It is not easy to see how those can be built by hand. Using GP one could set an arbitrary threshold like n = 1000 for the number of states in a single expanded NFA and try to minimize the size of the grammar, where the size is measured in the number of tokens used for the grammar description in some meta-grammar ( e.g. EBNF ).

Posted in EasyExtend, Parsing, TBP | Comments Off on Token Tracers

Shaky Python future

Mark Pilgrim says:

Anyway, I’m really proud of how well DiP3 [Dive into Python 3, ks] came out. The only problem is that no one is using Python 3. I took a gamble last year that large libraries would port to Python 3 while I was writing. That didn’t happen. I think it’s pretty clear by now that that’s not going to happen anytime soon. Everyone who gambled on the glorious non-backward-compatible future got burned. Given my experience with HTML, you’d think I’d learn. Ah well.

So what are realistic expectations? Python 2 as the future of a research language called Python 3?

Posted in Python | 10 Comments

Inheritance and the C preprocessor

Defining n-ary trees using the C preprocessor

In this article I introduce a compile-time C technique for expressing inheritance. Instead of giving a lengthy motivation I’ll jump directly to the algorithm and discuss it later. I hope lovers of C and its preprocessor find it useful. #defines first!

#define TOP 0
#define SET_CHILD(n,parent) ( parent==TOP ? n: \
                            ( parent<(1<<4) ? (n<<4) + parent : \
                            ( parent<(1<<8) ? (n<<8) + parent : (n<<12)+parent)))
 
#define IS_SUBNODE(child, parent) ((child & parent) == parent)
 
#define SELECT(X, a, best) ( a > best && IS_SUBNODE(X, a)? a : best)
 
#define SELECT_FROM_5(X, a, b, c, d, e) SELECT(X, a, \
                                        SELECT(X, b, \
                                        SELECT(X, c, \
                                        SELECT(X, d, \
                                        SELECT(X, e, 0)))))
 
#define SELECT_FROM_4(X, a, b, c, d) SELECT_FROM_5(X, a, b, c, d, 0)
#define SELECT_FROM_3(X, a, b, c)    SELECT_FROM_5(X, a, b, c, 0, 0)
#define SELECT_FROM_2(X, a, b)       SELECT_FROM_5(X, a, b, 0, 0, 0)

The `SET_CHILD` macro is used to define up to 15 child nodes of a given node for an n-ary tree of depth 5 with a single root node, named `TOP`. This is encoded within a single number of type `word`, which is adequate for most embedded compilers. For 32 or 64 bit processors one can either support more child nodes or a deeper tree.

`SET_CHILD` assigns a name to the n-th child of a given `parent`. One starts with `TOP` as the parent of all nodes and recurses down:

#define A SET_CHILD(1, TOP)
#define B SET_CHILD(2, TOP)
...
#define A1 SET_CHILD(1, A)
#define A2 SET_CHILD(2, A)
...
#define B1 SET_CHILD(1, B)
#define B2 SET_CHILD(2, B)
...
#define A11 SET_CHILD(1, A1)
#define A12 SET_CHILD(2, A1)
...
#define A21 SET_CHILD(1, A2)
#define A22 SET_CHILD(2, A2)
...

By construction no more than 15 child nodes are permitted for a given parent. If more are used, macros like `IS_SUBNODE` will fail to work correctly.

Once a tree is created with the appropriate nodes, one can use `IS_SUBNODE` to check for child/parent relationships. The tree is constructed s.t. `IS_SUBNODE(A, B)` returns 1 iff `A` is a direct child of `B`, or a grandchild of `B` etc., otherwise 0. So `IS_SUBNODE(A22, A)` evaluates to 1, just like `IS_SUBNODE(A22, A2)` or `IS_SUBNODE(A22, TOP)`, but `IS_SUBNODE(A22, A1)` is 0.
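
To see what the nibble encoding does, here is a small Python model of the macros which can be used to sanity-check the child/parent relations ( an illustration only; the real thing is of course the preprocessor code above ):

TOP = 0

def set_child(n, parent):
    # Python model of SET_CHILD
    if parent == TOP:
        return n
    if parent < (1 << 4):
        return (n << 4) + parent
    if parent < (1 << 8):
        return (n << 8) + parent
    return (n << 12) + parent

def is_subnode(child, parent):
    # Python model of IS_SUBNODE
    return (child & parent) == parent

A   = set_child(1, TOP)
A1  = set_child(1, A)
A2  = set_child(2, A)
A22 = set_child(2, A2)

assert is_subnode(A22, A2) and is_subnode(A22, A) and is_subnode(A22, TOP)
assert not is_subnode(A22, A1)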

The C preprocessor doesn’t support overloading and the flavors I checked didn’t support varargs, which probably wouldn’t be much help in this case either. So I defined a group of 5 `SELECT_FROM_xx` macros distinguished only by the number of arguments. The number 5 isn’t magic and one can extend the range of `SELECT_FROM_xx` macros as needed.

How is `SELECT_FROM_xx` used? The first argument `X` is an arbitrary node of the tree. If one of the subsequent nodes `a`, `b`, … `c` is identical with `X`, then `X` will be the value of `SELECT_FROM_xx(X, a, b, …, c)`. Otherwise the most direct parent of `X` among the nodes `a`, …, `c` will be returned. If none of them is a parent of `X` the return value is `TOP`.

Example:

If we set

#define X A22

then we get

SELECT_FROM_2(X, A, B)        // = A
SELECT_FROM_3(X, A, B, A1)    // = A
SELECT_FROM_3(X, A, B, A2)    // = A2
SELECT_FROM_3(X, A2, B, A)    // = A2
SELECT_FROM_3(X, A2, A, A22)  // = A2
SELECT_FROM_2(X, A1, B)       // = TOP

Inheritance

With the definitions above we can influence conditional compilation:

#if SELECT_FROM_3(X,A2,A,B) == A2
        const int a = 0;
#elif SELECT_FROM_3(X,A2,A,B) == A
        const int a = 1;
#elif SELECT_FROM_3(X,A2,A,B) == B
        const int a = 2;
#else
        const int a = -1;
#endif

The virtue of the construction lies in its robustness. Suppose X is `A22`; then the first branch is selected, but this remains true also if we build a "subclass" `A22k`, k = 1, …, 9, A, …, F of `A22` and assign e.g.

#define X A225

So if we use conditional compilation for a given system `S` and create a subsystem `T` of `S`, e.g. a new version of `S`, we have to adapt our C code only in places where `T` differs explicitly from `S`. This robustness is also the major virtue of using inheritance / polymorphism in OOP. It has led to a disregard for case statements in OOP, since those do not exploit polymorphism and in turn cause less robust code. We see that case or if-else statements can be endowed with the very same idea and robustness even on the level of the C preprocessor. The additional charm of using the C preprocessor is that child/parent relationships are computed at compile time and do not cause any runtime performance penalty.

Posted in Algorithms, C | 3 Comments

Restricted backmatching

In practice we often encounter situations where our preferred approach to problem solving breaks down. Just look at the recent Google implementation of a regexp engine, RE2, created by Russ Cox, who wrote a revival paper for Thompson NFAs a while ago with a few follow-ups which build on those ideas. Once again backmatching is greyed out in the feature matrix, which means: no implementation. The project page intro states:

The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently.

So backmatching can’t be implemented efficiently, but why? What is it that prevents an efficient implementation, and can a subset be defined which exists within the constraints of O(n) lexers or parsers? In this article we will find that backmatching for convenient use cases is not a problematic feature in linear space and time regexp engines. Some backreferences obviously fall outside, like exotic applications of regexps for solving NP-complete problems such as 3-SAT – but could it be that in the end only esoteric applications of backmatching are excluded from trace based parsing (TBP)?

General backmatching

When we think about backmatching in regular expressions we might define expressions like this

a)  "... (P) ... \1"

where (P) defines a simple pattern and \1 refers to the string value matched by the pattern. We assume that there is a functional relationship between (P) and \1: first (P) matches, then \1 matches what (P) has matched before.

Actually this perspective is a little simplistic and in the general case backmatching can be more powerful. Consider the following simple regexp:

b)  ([ab]*)b*\1

Here the match of  `([ab]*)` depends on what \1 will match but it is also highly ambiguous. If we match the following string

s = "bb"

the first b can be matched with `([ab]*)` and the last "b" with `\1`, but the whole string can also be matched with `b*`.
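
Python's re module, which implements general backmatching by backtracking, happens to pick the first of these resolutions:

import re

m = re.match(r"([ab]*)b*\1", "bb")
print m.group(0), m.group(1)    # prints: bb b  -- ([ab]*) took the first b, \1 the last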

Here is another more complicated example

c)  (([ab]*)b*\2)*[ab]*\1b*

It stems from Thomas Lord, with whom I discussed this topic on LtU and who corrected my initial naive assumptions about backmatching. Not only does the match of `([ab]*)` depend on the match of \1, but also on the match of \2, which depends on the match of \1 as well. Of course \1 depends on both of the matches of `([ab]*)` and `(([ab]*)b*\2)`. It is all tangled.

General backmatching as in examples b) and c) can be used to solve NP-complete problems, which exploits these tangles and finds resolutions. See this article for more examples. With NP-complete backtracking algorithms built into regexp engines one gets such solutions for free. This is cool and clever but also unnecessary most of the time.

Functional backmatching

If we restrict backmatching to simple functional relations between (P) and \1 as in case a) we can still express a broad range of practically relevant use cases. Here we give an approach to formalize those restrictions which can be checked by a regexp compiler.

In an expression

    ... (P) ... \1

the back-reference \1 can be separated from P when the following conditions are satisfied:

1. P doesn’t contain back-references which means it is self contained.

2. It is possible to write the expression in the way

    ... L(P)R ... \1

where L and R are left and right delimiters of P which means P has no characters in common with L and R. L can be empty when (P) is at the start of an expression.

The first condition can be checked syntactically. The second condition can be expressed using the following two equations on sets

2.1 LAST-SET(L)  /\ FIRST-SET(P) = {}
2.2 LAST-SET(P)  /\ FIRST-SET(R) = {}

If additionally following condition is true

2.3 FIRST-SET(P) /\ LAST-SET(P) = {}

R can be empty and an expression

    ... L(P)\1 ...

is permitted.

End Cycles

The conditions 2.1 – 2.3 are still too restrictive. For example the regexp `(a)\1` violates condition (2.3) but shall be permitted. What we really want to exclude is that \1 is adjacent to what I call a non-empty endcycle.

An endcycle of P has the following definition:

END-CYCLE(P) = FOLLOW-SET( LAST-SET(P) )

Take for example the regexp `P = (a*|b|c)`. Here `LAST-SET(P) = {a, b, c}` and `FOLLOW-SET({a,b,c}) = {a}`, which means that `a` is in the endcycle of `P`.

With endcycles in mind we can weaken the conditions of (2) considerably:

If P has no endcycle i.e.

    END-CYCLE(P) = {}

we permit

    ... L(P)\1 ...

if the following holds:

    END-CYCLE(L) /\ FIRST-SET(P) = {}

If on the other hand

    END-CYCLE(P) != {}

we permit

    ... L(P)R ... \1 ...

if the following is valid:

    END-CYCLE(L) /\ FIRST-SET(P) = {}
    END-CYCLE(P) /\ FIRST-SET(R) = {}

Final  Remarks

No matter how the conditions are defined, it has to be guaranteed that matching (P) has terminated before backmatching starts. If this isn’t checked statically during regexp compilation one can still defer the checks until runtime. Much like with any other dynamic check it is less clear what will happen to an expression, but there isn’t much mental overhead and the implementation is kept simpler.

Posted in TBP | Comments Off on Restricted backmatching

reverb – a revival

Sometimes software is given up by people and you realize it only a few years later.  Large packages or libraries will inevitably be flagged as legacy and die but tiny modules might have a chance to survive and find a maintainer. I have done the latter now for reverb.py.

Posted in General | 3 Comments

Syntax algebra – first steps

Principle of relativity

I started to revisit the syntactic mappings defined in EasyExtend 3 which are closely tied to the Python grammar in use. Those are functions like `varargs2arglist`, `normalize`, `split_file_input` or `exprlist2testlist` defined in the `csttools.py` module. One of the major challenges for future releases of EasyExtend ( or a definite successor project – I don’t know ) is to abstract from Python as a target language. In EE3 only Python can be targeted by a langlet transformation, whereas in EE 4 langlets are symmetric: each langlet can be a target langlet for any other langlet. All langlets exist on the same footing, and also: each langlet can be used as a parent langlet of a newly defined langlet, which means a proper inheritance relationship and maximum reuse.

The relativity among langlets calls for raising the bar of abstraction. All Python-specific dependencies, and there are a lot in EE3, have to be removed from the basic CST libraries. Indeed not even nodes of special importance like `expr`, `stmt` or `atom` shall be allowed in places other than modules which are specific to a particular langlet. The generalization of the mentioned functions leads to a study of abstract syntactic forms and some sort of “syntax algebra”. It isn’t studied rigorously in this article but shall at least be motivated. As a prerequisite I recommend reading my article about CST interpolation, which discusses concepts of major relevance.

Embeddings

The `exprlist2testlist` function turns a node of type `exprlist` defined as

exprlist: expr (',' expr)* [',']

into a node of type `testlist` defined as

testlist: test (',' test)* [',']

This works without adding information because `{test, expr}` is a valid interpolation i.e. there is a nested sequence

[test, [or_test, [and_test, [not_test, [comparison , expr]]]]]

which is a representation of a valid CST. In terms of CST interpolations `test(expr)` yields a node of type `test` and induces a homomorphism `exprlist`->`testlist`. More generally an interpolation `{A, B}` induces an embedding `B (x B)*` -> `{A,B} (y {A,B})* = A (y A)*` if `x` and `y` are constant terminal symbols, i.e. terminal symbols where the corresponding token has a uniquely determined token string.

Blocks or no blocks

Another relevant example is the `normalize` function. The idea behind `normalize` is that statements like `if x: y` or `def foo(): pass` are semantically equivalent to block statements:

if x:
    y
def foo():
    pass

Those block statements can be used in a more general fashion because we can add other block statements in the thunked block. In Python blocks are expressed by the `suite` grammar rule:

suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

Since `{stmt, simple_stmt}` is a valid interpolation, we can substitute all occurrences of

suite: simple_stmt

with suites of the form

suite: NEWLINE INDENT stmt+ DEDENT

The general syntactic embedding is of the kind

`B` -> `a b… {A,B} c d…` with a, b, c, d … being constant terminal symbols.

Notice that `INDENT` is assumed to be a constant terminal despite the fact that the token string may vary. `INDENT` is treated specially because in practical applications the number of spaces used for an indentation is fixed. EasyExtend always uses 4 spaces.

Splitting nodes

The function `split_file_input` is the prototype of a node splitting function, which can be thought of in analogy to string splits. In this particular case we have a node `file_input` of the form

file_input: (NEWLINE | stmt)* ENDMARKER

and want to generate a sequence of nodes `file_input: stmt ENDMARKER`, `file_input: NEWLINE ENDMARKER` and `file_input: ENDMARKER` – one for each `NEWLINE`, `stmt` and `ENDMARKER` in the original `file_input` node. It doesn’t matter that `file_input: NEWLINE ENDMARKER` and `file_input: ENDMARKER` are likely to be thrown away by an application, because this can be decided by the node-consuming function. The general split function is defined by

`R: x (A|B…)* y` -> `[(R: x A y), (R: x B y), …, (R: x y)]`
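
For the `file_input` case the idea looks roughly as follows on nested-list CSTs ( a sketch, not the EE3 implementation; it assumes the last child of the `file_input` node is the ENDMARKER ):

import symbol

def split_file_input(tree):
    # one new file_input node per NEWLINE / stmt child, plus one for the ENDMARKER alone
    assert tree[0] == symbol.file_input
    endmarker = tree[-1]
    splits = [[symbol.file_input, child, endmarker] for child in tree[1:-1]]
    splits.append([symbol.file_input, endmarker])
    return splits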

Node rewritings

The mappings considered above were all rather straightforward. Now we want to discuss a rule transformation which is less obvious, namely that of a function signature into an argument tuple of a function call. In Python 2.6 the function signature is defined as

varargslist: ((fpdef ['=' test] ',')*
('*' NAME [',' '**' NAME] | '**' NAME) |
fpdef ['=' test] (',' fpdef ['=' test])* [','])
fpdef: NAME | '(' fplist ')'
fplist: fpdef (',' fpdef)* [',']

and an argument list of a function call by

arglist: (argument ',')* (argument [',']| '*' test [',' '**' test] | '**' test)
argument: test [gen_for] | test '=' test

Can we transform each `varargslist` into an `arglist`?

Let’s start our treatment of varargslist with `fpdef`. If we insert the RHS of `fplist` in `fpdef` we get

fpdef: NAME | '(' fpdef (',' fpdef)* [','] ')'

We show that this rule is a special form of  the node `atom` and since `{test, atom}` is a valid interpolation it is also a `test` node. The `atom` node is defined by

atom: NAME | '(' [yield_expr|testlist_gexp] ')' |  '[' [listmaker] ']' | ...

which can be specialized to

atom: NAME | '(' testlist_gexp ')'

Next we consider the `testlist_gexp` definition

testlist_gexp: test ( gen_for | (',' test)* [','] )

which can be specialized to

testlist_gexp: test (',' test)* [',']

We insert `testlist_gexp` in `atom` which yields

atom: NAME | '(' test (',' test)* [','] ')'

If we reduce `test` to `atom` we get a rule

atom: NAME | '(' atom (',' atom)* [','] ')'

which is isomorphic to `fpdef`. So we just need to substitute all occurrences of `fpdef` in `fpdef` with `atom`, then replace `atom` with `test(atom)` and finally replace the whole of `atom` again with `test(atom)`. This procedure substitutes `fpdef` with `test`.

When we substitute each occurrence of `NAME` with `test` in `varargslist` we get:

(test ['=' test] ',')* ('*' test [',' '**' test] | '**' test) |
                       test ['=' test] (',' test ['=' test])* [',']

which can be reduced to

(argument ',')* ('*' test [',' '**' test] | '**' test) |
                 argument (',' argument)* [',']

which is the same as

(argument ',')* (argument [','] | '*' test [',' '**' test] | '**' test)

Voilà!

Syntax algebra

We have taken some informal first steps into syntax algebra, with some real functions defined in EE 3 as a starting point. For the first three functions we have found general syntactical transformations which might be universally applicable. The last transformation is very specific though, and it might be more interesting to determine an algorithm used to find a rule transformation of a particular kind. Although the search algorithm might be NP-complete, I assume that the found transformation – if one exists – has linear time complexity, which is what we want. Such an algorithm would be another great achievement of EasyExtend, which does not cease to surprise me.

Posted in Algorithms, Grammars | Comments Off on Syntax algebra – first steps

About CST interpolation

Eli Bendersky has written a short overview article about Python’s _ast module, which is supposed to make working with parse trees simpler by transforming them into other trees, i.e. abstract syntax trees.

In this article I want to talk a bit about my reservations against this approach, which is mostly justified by common wisdom, “convenience” and what not. IMO ASTs are a stumbling block in the advancement of language front ends and you can do many things more elegantly and more generically without them. They are there for a reason though, and it is not easy to see immediately why you are better off with plain old concrete syntax trees ( CSTs ).

There are three major reasons I love CSTs.

  1. Parse trees are unambiguously represented by a grammar. So once you know the grammar you also know how to find and arrange nodes. For context free languages the grammar contains all syntactical information you’ll ever need.
  2. The reversion to source code is trivial. Just traverse the parse tree inorder, visit the leaf nodes containing terminals and concatenate their string content. Only whitespace has to be inserted ( a sketch follows below this list ).
  3. For parsing purposes grammars are translated into finite state machines. Those machines can also be used for non-parsing purposes like parse tree interpolation which provides most of the benefits of ASTs but won’t cause any additional translation overhead.
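
Here is the sketch referred to in point 2, assuming the nested-list CST format produced by `parser.expr(...).tolist()` that is shown later in this article ( whitespace handling kept deliberately naive ):

import token

def to_source(node):
    # terminal nodes are [tok_id, tok_string, ...], non-terminal nodes are [sym_id, child1, child2, ...]
    if node[0] < token.NT_OFFSET:
        return node[1]
    return " ".join(to_source(child) for child in node[1:])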

I assume 1. and 2. won’t tell the reader anything new, and it is 3. which might contain novel information. The major argument can be summarized as follows:

A grammar gives rise to a number of finite state machines – in fact one machine for one grammar rule. Not only are those machines used to parse source code into CSTs but they can also be used to operate on them. In particular they are used to

  • check correctness of CSTs under transformations
  • connect individual CST nodes through sequences of other CST nodes ( interpolation ).
  • insert CST nodes into sequences of CST nodes which make up the content of a particular CST node (autocompletion)

Any of those tools/services depends only on the given grammar and is universally applicable to all languages. It is not unlike regexps and regexp engines, which uniformly apply matching and searching to all strings.

Only in the combination of verification, interpolation and autocompletion do CSTs actually become handy for manipulation tasks. They also serve as a foundation for tools which are closer to the actual source code and define transformations in code without any additional syntax. That’s also why EasyExtend and its successors will never see a particular macro language. Just like ASTs, macros are an obsolete technology in the light of proper language oriented transformation tools.

Parse tree representations

Take a look at the following AST constructor, taken as an example from Eli’s article

Expression(
  body=BinOp(
         left=Str(s='xy'),
         op=Mult(),
         right=Num(n=3)))

The `Expression` constructor takes a node of type `BinOp` and produces a node of type `Expression`. It is used to represent the actual Python expression `"xy"*3`.

Now take a look at the following kludge which represents the same information in the form of a concrete syntax tree:

>>> import parser
>>> parser.expr('"xy"*3').tolist()
[258, [326, [303, [304, [305, [306, [307, [309, [310,
[311, [312, [313, [314, [315, [316, [317, [3, '"xy"']]]],
[16, '*'],
[315, [316, [317, [2, '3']]]]]]]]]]]]]]]], [4, ''], [0, '']]

The concrete parse tree is represented in the form of a nested list and yields all sorts of numerical tags which identify the grammar rules applied in top down parsing. The numerical tags shall be called node identifiers, or node ids for short.

The formatting can be done a little nicer by translating the node ids into node names and displaying the tree in tree form:

eval_input  -- NT`258
  testlist  -- NT`326
    test  -- NT`303
      or_test  -- NT`304
        and_test  -- NT`305
          not_test  -- NT`306
            comparison  -- NT`307
              expr  -- NT`309
                xor_expr  -- NT`310
                  and_expr  -- NT`311
                    shift_expr  -- NT`312
                      arith_expr  -- NT`313
                        term  -- NT`314
                          factor  -- NT`315
                            power  -- NT`316
                              atom  -- NT`317
                                STRING  -- T`3     L`1
                                  "xy"
                          STAR  -- T`16     L`1
                            *
                          factor  -- NT`315
                            power  -- NT`316
                              atom  -- NT`317
                                NUMBER  -- T`2     L`1
                                  3
  ENDMARKER  -- T`0     L`2
    ''

It doesn’t change much in principle though. The AST is an order of magnitude more concise, more readable and easier to write.
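
Such a formatter is itself only a few lines. Here is a sketch using the stdlib `symbol` and `token` tables ( the line info shown in the listing above is omitted ):

import parser, symbol, token

def display(node, indent=0):
    nid = node[0]
    if nid >= token.NT_OFFSET:                  # non-terminal node
        print "  " * indent + "%s  -- NT`%s" % (symbol.sym_name[nid], nid)
        for child in node[1:]:
            display(child, indent + 1)
    else:                                       # terminal node
        print "  " * indent + "%s  -- T`%s" % (token.tok_name[nid], nid)
        print "  " * (indent + 1) + repr(node[1])

display(parser.expr('"xy"*3').tolist())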

Searching nodes

Searching within a CST isn’t much of a problem and is actually quite easy when we know the grammar. All that is needed are two functions `find_first` and `find_all` which take a node and a node id as arguments. So when we search for a particular node, e.g. `term`, in the syntax tree we just call `find_first(node, symbol.term)` where `symbol.term` is the node id of `term` encoded in `symbol.py`, which is a standard library module. So for

`nd = parser.expr('"xy"*3').tolist()` we can apply `find_first(nd, symbol.term)` which returns

term  -- NT`314
  factor  -- NT`315
    power  -- NT`316
      atom  -- NT`317
        STRING  -- T`3     L`1
          "xy"

Traces

We want to name CST constructors after the nodes they create. So `expr` creates a node of type `symbol.expr`, `STRING` a node of type `token.STRING` and so on. In order to create a correct `expr` we have to call lots of node constructors. In source code this would be something like

`expr(xor_expr(…(term(factor(…(STRING("xy")…), STAR("*"), factor(…(NUMBER("3")…))…))`

This doesn’t look much like noise reduction, but now consider this: when `expr` is created by the parser, the parser starts with nothing but a sequence A = (STRING("xy"), STAR("*"), NUMBER("3")). So why isn’t it possible to start with `A` and `expr` and build `expr(*A)`? We want to face a slightly more general problem, namely having a sequence of nodes `A = (a1, a2, …, ak)` which are not necessarily tokens, and a node constructor `expr`. Can we build a node `expr(a1, …, ak)` of type `symbol.expr`?

What is needed to identify an admissible sequence A with this property?

First of all let’s take a look at the grammar rule description of `expr`

expr: xor_expr ('|' xor_expr)*

Any sequence of CST nodes which fits this description shall be called a trace. So a sequence of nodes `xor_expr VBAR xor_expr VBAR xor_expr` is a trace of `expr`. But `xor_expr` alone is also a trace. So what is needed is to wrap a given sequence `A = (a1, a2, …, ak)` into a trace. We might start with the most simple case of a single node `A = (a)` which shall be wrapped into `expr`. As an example we consider `A = (term)`.

Interpolation

In order to wrap `term` into `expr` we need a sequence of intermediate nodes `xor_expr`, `and_expr`, `shift_expr`, `arith_expr` and then build

`[expr, [xor_expr, [and_expr, [shift_expr, [arith_expr, term]]]]]`

This sequence is uniquely determined by `expr` and `term`. In order to build one we must be sure there is no non-trivial information that has to be added like a STRING, NAME or NUMBER token which contains actual information.

So when there is no choice in building a wrapper of type `N` around `M` we write `{N, M}` and call it an interpolation between `N` and `M`. Interpolations can always be constructed algorithmically using syntactical information provided by the language grammar alone. If N = M, we identify {N, N} with N.

We have already found a valid interpolation `{expr, term}`. Other valid interpolations are `{factor, STRING("xy")}` and `{factor, NUMBER("3")}`. For `term` this already suffices to build a trace:

`{factor, STRING("xy")}, STAR("*"), {factor, NUMBER("3")}`

and with this trace we get

`{expr, term({factor, STRING("xy")}, STAR("*"), {factor, NUMBER("3")})}`

Now we are prepared to define an algorithm:

Let N be a node and A = (a1, ..., ak) a sequence of nodes.
Consider also the set of all nodes set(M1, ..., Mn) with
{N, Mi}, i=1,...,n being a valid interpolation starting with N.

For each Mi, i=1,...,n we try to build a trace
TA = ({A1, a1}, {A2, a2}, ..., {Ak, ak}).

If we have found a trace for Mi we get the result
{N, Mi({A1, a1}, {A2, a2}, ..., {Ak, ak})}

Autocompletion

Sometimes our algorithm might fail to find a trace for a node `N` and a sequence `A` but the error can still be corrected in a fully determinate fashion. Take the following rule for example:

`dotted_name: NAME ('.' NAME)*`

together with a sequence `A = (NAME, NAME)` of two nodes. Obviously we cannot build a valid trace `NAME DOT NAME` from A directly, but the insertion of `DOT` into the trace is fully determined by the structure of the rule. Moreover there is no degree of freedom in the selection of the token string for `DOT`. It can only ever be ".". So it is possible to omit the `DOT` in A and still get a uniquely determined trace for `dotted_name`.

Applications

We’ve come to an end already. With the prerequisites given above it is perfectly possible to write

`expr(STRING("xy"), STAR("*"), NUMBER("2"))`

and get a valid parse or even shorten it and write

`expr('"xy"', "*", 2)`

which suffices to identify the token. Remember that this construction yields a parse tree which can be converted immediately to source code.

`fn = lambda name, val: expr_stmt(name, "=", val)`

This lambda expression yields bindings of `val` to `name`. For example

`fn("a", expr('"xy"', "*", 2))` is the parse tree equivalent of `a = "xy"*2`.

Notice that in any case the parse tree is syntactically correct by construction.

Wrapper Functions

Sometimes it is not easy to see how some expression with a particular semantics can be built. Take a function call for example. The Python grammar doesn’t provide a special node for it but just uses a special form of `power` which is defined as

`power: atom trailer* ['**' factor]`

This is very noisy and it is better to build a functional wrapper which can be used for all sorts of calls:

def CST_Call(names, args = None, star_args = None, dstar_args = None):
    # builds a `power` node for a call of the dotted name sequence `names`
    # with optional arguments, *args and **kwargs sections
    Names = [atom(names[0])]
    for name in names[1:]:
        Names.append(trailer('.', name))
    ARGS = list(args) if args else []
    if star_args:
        ARGS+=['*', star_args]
    if dstar_args:
        ARGS+=['**', dstar_args]
    if ARGS:
        return [symbol.power] + Names + [trailer('(', arglist(*ARGS), ')')]
    else:
        return [symbol.power] + Names + [trailer('(', ')')]
Posted in Algorithms, Parsing, TBP | 3 Comments