Open source saturation

Posted in General, Programming Culture on August 7th, 2010 by kay – 7 Comments

Reading the following post of Jimmy Schementi who explained his exit at Microsoft with the loss of MS’s interest in IronRuby I start to wonder if this isn’t a sign of the times? Open source projects get started by a small team of employees and killed when they don’t attract a community which brings them forth what rarely ever happens because everyone in OSS is already busy and either engaged with a major project, a brand which has been established a few years ago like (C)Python, Rails, Linux or Django  or doing solo acts as in my own case. Same with Google Wave which was promising but the only wave it produced was a Tsunami of initial attention in the wikiredditblogosphere. Everyone expected Google would bring it forth just like any other commodity. I guess the same would happen to their Go language which was started by a superstar team of veteran programmers and would immediately go away if Google discontinues investment.

There are very few brands which are both new and do well like Clojure and Scala which seem to follow Pythons BDFL model and they are - unsurprisingly? - programming languages. Are there other examples of OSS projects that peaked in the last 2-3 years and established a community of regular committers who are not interns of a single company or do we see an almost inevitable saturation?

Langscape

Posted in General on July 16th, 2010 by kay – Be the first to comment

Trails in a Langscape

Welcome to Trails in a Langscape which is the new title of this blog. It is a minor change since URLs are not affected and the character of the blog will also remain the same. Langscape is the successor project of EasyExtend and is publically hosted at Google Code.

Since I created this Wordpress blog instance I slowly worked on a new EasyExtend release. I published lots of related technical ideas but never released any code. Now the code is out. It lives in an Hg repository, it is filed under a new label and hopefully a first packaged Langscape 0.1 release will follow soon. There is no project documentation at the moment and I still think about its organization. Another open issue is packaging and distribution, but I have no idea what is up-to-date in this area, how Langscape is possibly used, if anyone will ever create langlets or just use the growing toolbox applicable to Python, including syntactically guarded search and replace.

Europython 2010

Of course the time of publication is not arbitrarily chosen. I attend to Europython 2010 next week in Birmingham and have a talk about EasyExtend/Langscape at Wednesday late in the afternoon before we leave for Conference dinner. I hope many of you go to Europython as well and I’ll find a few of my casual readers in the audience. If the talk is so good as the fun I had preparing my slides, you’ll enjoy it as well.

Ready Mades - African football culture

Posted in General on June 12th, 2010 by kay – 14 Comments

It took just two football games yesterday evening to create a worldwide recognizable sound icon which does now specify “African football culture” - a loud, monotonous sounding plastic horn, called the Vuvuzela. Since many spectators, in particular in Europe, complained about the hornets nest kind of noise over the stadium, it is now a symbol of African difference and tradition even though its mass production goes back for only 4 years.

Overnight the Vuvuzela has become the ready-made of a particular “culture”. It doesn’t matter much if there has been a real tradition i.e. one of singing and drumming because this was just a localized variant of what was already common on other continents. The sound-image of the Vuvuzela makes a difference and this difference can be fed back into a powerful symbolic machinery which retroactively creates identity and difference.

I’m very curious how everyone is going to deal with it over the next couple of weeks. Nothing against football but I find this far more thrilling than the games themselves.

Token Tracers

Posted in EasyExtend, Parsing, TBP, Trail on June 3rd, 2010 by kay – Be the first to comment

When I started programming EasyExtend in 2006 one of the major problems was the correct grammar -> NFA translation. I used big grammars and testing for correctness required lots of source code. The first heuristics I used was ugly and complex and it took about 2 or so years to find a neat trick which finally lead to replace it completely. The basic problem of systematic phrase or expression generation for testing purpose persisted though - until last week when I implemented a TokenTracer.

Tracers

A typical production rule in the Trail parser generator is translated into a single NFA which might look as in the following example

 1005: ["funcdef: [decorators] 'def' NAME parameters ':' suite",
        (1005, 0, 1005),
        {(1, 3, 1005): [(1006, 4, 1005)],
         (11, 5, 1005): [(1043, 6, 1005)],
         ('def', 2, 1005): [(1, 3, 1005)],
         (1004, 1, 1005): [('def', 2, 1005)],
         (1005, 0, 1005): [('def', 2, 1005), (1004, 1, 1005)],
         (1006, 4, 1005): [(11, 5, 1005)],
         (1043, 6, 1005): [(None, '-', 1005)]}],

It is not created for readability but it is nevertheless easy to decode. The funcdef grammar rule is assigned a numerical value, a rule identifier - here 1005. Asscociated with the rule identifier is a 3-list consisting of

  1. The rule in plain text
  2. The start state of a finite automaton (1005, 0, 1005)
  3. A finite automaton encoded as a dictionary of transitions.

Starting with (1005, 0, 1005) one can step through the automaton. The follow states are  [('def', 2, 1005), (1004, 1, 1005)]. The first one obviously represents the def keyword whereas the second is a representation of the decorators non-terminal which has the rule identifier 1004. When you select the (1004, 1, 1005) state there is a single follow state, which is again the state of the def keyword otherwise you get the follow state (1, 3, 2005) of  (’def’, 2, 1005). The state (None, ‘-’, 1005) doesn’t have a follow state and it is the only one.

You can now define a function that keeps track of this stepping process through a rule. This function is called a Tracer.

A Tracer acts as follows:

>>> tracer = Tracer(rules)
>>> tracer.select(1005)   # selects automaton 1005 and returns the rule ids of the
['def', 1004]             # possible follow states
>>> tracer.select('def')
[1]
>>> tracer.select(1)
[1006]
...

It is possible that a Tracer has to keep track of multiple traces at once. For example the exprlistrule

 1069: ["exprlist: expr (',' expr)* [',']",
        (1069, 0, 1069),
        {(12, 2, 1069): [(1053, 3, 1069)],
         (12, 4, 1069): [(None, '-', 1069)],
         (1053, 1, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)],
         (1053, 3, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)],
         (1069, 0, 1069): [(1053, 1, 1069)]}],

defines transitions of the kind

(1053, 1, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)]

with two rules of rule id 12 in the follow set. When 12 is selected in the Tracer all follow sets of all rules with rule id = 12 are unified:

>>> tracer.select(1069)
[1053]
>>> tracer.select(1053)
[12, None]
>>> tracer.select(12)
[1053, None]
...

TokenTracers

This kind of tracing functionality is central to EasyExtends implementation of Trace Based Parsing (TBP). For single grammar rules TBP coincides with “Thompson NFA” style parsing discussed at length by Russ Cox or more recently by Carl Friedrich Bolz who gave a Python implementation.

We want to consider now a different sort of tracer which is more complicated to create than those for single grammar rules. Those tracers have to meet the following requirement:

The list of rule id’s returned from tracer.select() shall contain only None or rule id’s of terminal symbols.

The rule id’s of terminals are exactly the  token types. The select function of a TokenTracer returns a list of token types and gets fed with a single token type. In the following example we step through the token stream of a simple function

def foo():
    print 42

Here we go

>>> tracer = TokenTracer(rules)
>>> tracer.select(1001)  # a single select using a top level non-terminal
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select('def')
[1]
>>> tracer.select(1)     # foo
[7]
>>> tracer.select(7)     # (
[1, 7, 8, 16, 36]
>>> tracer.select(8)     # )
[11]
>>> tracer.select(11)    # :
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select(4)     # \n
[5]
>>> tracer.select(5)     # INDENT
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select('print')
[1, 2, 3, 4, 7, 9, 13, 13, 14, 15, 25, 26, 32, 35, 'lambda', 'not']
>>> tracer.select(2)     # 42
[4, 7, 9, 12, ..., 36, 48, '<>', 'and', 'if', 'in', 'is', 'is', 'not', 'or']
>>> tracer.select(4)     # \n
[1, 2, 3, 6, 7, ... , 'try', 'while', 'with', 'yield']
>>> tracer.select(6)     # DEDENT
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select(0)     # ENDMARKER

Application 1 - error detection

Using a TokenTracer it is dead simple to localize a syntax error which is - in the context free case - always an unexpected token. In principle Trail could delegate error recovery entirely to a TokenTracer.

Application 2 - autocorrection

A constant token is a token with a constant token string e.g. ‘;’ or ‘:’. Closely related are token like INDENT where the token string can be derived from context and a prescribed indentation. In sharp contrast are token like NAME, NUMBER and STRING where the token string is not language but user determined. In the select() sequence above we find constant token lists of length = 1 like [11] or [7]. If one of those token is omitted it can be inserted without guessing.

Application 3 - expression generation

The most intriguing aspect of TokenTracers is that each random token sequence which is constrained by a TokenTracer is syntactically correct. This can be used to create expression generators: first write a grammar G to describe the language syntax, then you derive a TokenTracer(G). Finally an expression generator ExprGen(TokenTracer(G)) is created which is used to build random token sequences being compliant with G by means of the TokenTracer. Those token-sequences can either be turned into valid parse trees and get compiled or un-tokenized into source code.

A valuation function fitness(expr) -> float on expressions motivates the use of genetic programming for breeding expressions of a certain kind. For example I’m strongly  interested in compact grammars which create big NFA expansions in Trail. It is not easy to see how those can be built by hand. Using GP one could set an arbitrary threshold like n = 1000 for the number of states in a single expanded NFA and tries to minimize the size of a grammar, where the size is measured in the number of tokens used for a grammar description in some meta-grammar ( e.g. EBNF ).

The tangled hierarchy of nonsense - computing science vision

Posted in General on April 26th, 2010 by kay – Be the first to comment
  • English - meaningless, because it doesn’t have a formal semantics
  • Haskell - somewhat meaningless but less than English
  • ML - meaningful because it has a formal semantics, which means nothing though without a human reader who thinks in meaningless terms of the English language

Shaky Python future

Posted in Python on April 24th, 2010 by kay – 10 Comments

Mark Pilgrim says:

Anyway, I’m really proud of how well DiP3 [Dive into Python 3, ks] came out. The only problem is that no one is using Python 3. I took a gamble last year that large libraries would port to Python 3 while I was writing. That didn’t happen. I think it’s pretty clear by now that that’s not going to happen anytime soon. Everyone who gambled on the glorious non-backward-compatible future got burned. Given my experience with HTML, you’d think I’d learn. Ah well.

So what are realist expectations? Python 2 as the future of a research language called Python 3?

Rebuilding a ship on the ocean

Posted in Programming Culture on April 3rd, 2010 by kay – Be the first to comment

The contemporary hero? The one who changes a running system without causing havoc.

Usually you start with the energy of a reformer but end up as a connoisseur of delicate causalities and dependencies. Jaron Lanier writes about such “lock ins” in his latest book, which I will review here later. Reinventing the wheel might be a simple strategy to detract from dependencies. They have not had time yet to spread and strangle innovation.  But if the (re-)invention becomes successful, it gets involved in the tangle, spreads its own dependencies and adds to the overall complexity. We observe an excess of relatedness with lots of unknowns.

Far from being “principled conservatives” we use to be frustrated reformers for whom the modernist call for “embracing change” becomes a moral/emotional supplement i.e. the  right attitude without much of a noticeable effect.

Inheritance and the C preprocessor

Posted in Algorithms, C on March 24th, 2010 by kay – 3 Comments

Defining n-ary trees using the C preprocessor

In this article I introduce a compile time C technique used to define inheritance. Instead of giving a lengthy motivation I’ll jump directly to the algorithm and discuss it later. I hope lovers of C and its preprocessor find it useful. #defines first!

#define TOP 0
#define SET_CHILD(n,parent) ( parent==TOP ? n: \
                            ( parent<(1<<4) ? (n<<4) + parent : \
                            ( parent<(1<<8) ? (n<<8) + parent : (n<<12)+parent)))
 
#define IS_SUBNODE(child, parent) ((child & parent) == parent)
 
#define SELECT(X, a, best) ( a > best && IS_SUBNODE(X, a)? a : best)
 
#define SELECT_FROM_5(X, a, b, c, d, e) SELECT(X, a, \
                                        SELECT(X, b, \
                                        SELECT(X, c, \
                                        SELECT(X, d, \
                                        SELECT(X, e, 0)))))
 
#define SELECT_FROM_4(X, a, b, c, d) SELECT_FROM_5(X, a, b, c, d, 0)
#define SELECT_FROM_3(X, a, b, c)    SELECT_FROM_5(X, a, b, c, 0, 0)
#define SELECT_FROM_2(X, a, b)       SELECT_FROM_5(X, a, b, 0, 0, 0)

The SET_CHILD macro is used to define up to 15 child nodes of a given root for a n-ary tree of depth 5 with a single root node, named TOP. This is encoded within a single number of type word which is adequate for most embedded compilers. For 32 or 64 bit processors one can either support more child nodes or a deeper tree.

SET_CHILD is assigning a name to n-th child of a given parent. One starts with TOP as the parent of all nodes and recurses down:

#define A SET_CHILD(1, TOP)
#define B SET_CHILD(2, TOP)
...
#define A1 SET_CHILD(1, A)
#define A2 SET_CHILD(2, A)
...
#define B1 SET_CHILD(1, B)
#define B2 SET_CHILD(2, B)
...
#define A11 SET_CHILD(1, A1)
#define A12 SET_CHILD(2, A1)
...
#define A21 SET_CHILD(1, A2)
#define A22 SET_CHILD(2, A2)
...

By construction no more than 15 child nodes for a given parent are permitted. If more are used, macros like IS_CHILD will fail to work correctly.

Once a tree is created with the appropriate nodes, one can use IS_CHILD to check for child/parent relationships. The tree is constructed s.t. IS_CHILD(A, B) returns 1 iff A is a direct child of B or a grandchild of B etc. otherwise 0. So IS_CHILD(A22, A) evaluates to 1 just like IS_CHILD(A22, A2) or IS_CHILD(A22, TOP) but IS_CHILD(A22, A1) is 0.

The C preprocessor doesn’t support overloading and the flavors I checked didn’t support varargs wich wouldn’t probably be much helpful in this case either. So I defined a group of 5 SELECT_FROM_xx macros being distinguished only be the number of arguments. The number 5 isn’t magic and one can extend the range of SELECT_FROM_xx macros by need.

How is SELECT_FROM_xx used? The first argument X is an arbitary node of the tree. If one of the susequent nodes a, b, … c is identical with X, X will be the value of SELECT_FROM_xx(X, a, b, …, c). Otherwise the most-direct-parent of X among the nodes a, …c will be returned. If none of them is a parent of X the return value is TOP.

Example:

If we set

#define X A22

then we get

SELECT_FROM_2(X, A, B)        // = A
SELECT_FROM_3(X, A, B, A1)    // = A
SELECT_FROM_3(X, A, B, A2)    // = A2
SELECT_FROM_3(X, A2, B, A)    // = A2
SELECT_FROM_3(X, A2, A, A22)  // = A2
SELECT_FROM_2(X, A1, B)       // = TOP

Inheritance

With the definitions above we can influence conditional compilation:

#if SELECT_FROM_3(X,A2,A,B) == A2
        const int a = 0;
#else if SELECT_FROM_3(X,A2,A,B) == A
        const int a = 1;
#else if SELECT_FROM_3(X,A2,A,B) == B
        const int a = 2;
#else
        const int a = -1;
#endif

The virtue of the construction lies in its robustness. Suppose X is A22 then the first branch is selected but this remains true also if we build a “subclass” A22k , k = 1, …, 9, A, …, F of A22 and assign e.g.

#define X A225

So if we use conditional compilation for a given System S and create a subsystem T of S e.g. a new version of S, we have to adapt our C code only in places where T differs explicitely from S. This robustness is also the major virtue of using inheritance / polymorphism in OOP. It has led to disrespect of using case-statements in OOP since those do not exploit polymorphism and cause in turn less robust code. We see that case- or if-else statements can be confined with the very same idea and robustness even on the level of the C preprocessor. The additional charme of using the C preprocessor is that child/parent relationships are computed at compile time and do not cause any runtime performance penalty.

Restricted backmatching

Posted in TBP on March 13th, 2010 by kay – Be the first to comment

In practice we often encounter situations when our preferred approach to problem solving breaks down. Just look at the recent Google implementation of  a regexp engine RE2, created by Russ Cox, the guy who has written a revival paper for Thompson NFAs  a while ago with a few follow-ups which build on those ideas.  Now once again backmatching is greyed from the feature matrix which means: no implementation. The project page intro states:

The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently.

When one takes a closer look at backmatching one starts to wonder what the restrictions to backmatching in trace based parsing ( TBP ) approaches to pattern matching really are?  One can ask the question also with a more positive intent: what cases of backmatching are compliant with TBP? We’ll find there are quite a lot. Some are obviously falling apart like exotic applications of regexps for solving NP-complete problems like 3-SAT - but could it be in the end that only esoteric applications of backmatching are excluded from TBP? The reader might be the final judge.

General backmatching

When we think about backmatching in regular expressions we might have expressions like this

 "... (P) ... \1"

in mind where (P) defines a simple pattern and \1 refers to the value of the matched pattern. So there is a functional relationship between (P) and \1.

Actually this perspective is a little simplistic in the general case. Consider the following simple regexp:

([ab]*)b*\1

Here the match of  ([ab]*) depends on what \1 will match but it is also highly ambiguous. If we match the following string

s = “bb”

the first b can be matched by ([ab]*) and the last “b” by \1 but the whole string can also be matched by b*.

Here is another more complicated example

(([ab]*)b*\2)*[ab]*\1b*

It stems from Thomas Lord with whom I discussed this topic on LtU and who corrected my initial naive assumptions about backmatching. Not only depends the match of ([ab]*) on the match of \1 but also on the match of \2 which depends on the match of \1 as well. Of course \1 depends on both of the matches of ([ab]*) and (([ab]*)b*\2). It’s all tangled.

General backmatching like the one above can be used to solve NP-complete problems which exploits these tanglements and find resolutions. See this article for more examples. With exponential time backtracking algorithms built into regexp engines one gets solutions for free or by magic. In some sense this is cool but if you’d write a requirements document for a regexp engine would you demand that it shall solve 3-SAT problems? If you say “yes”, show me the real world application which uses this power.

Functional backmatching

If we restrict backmatching to simple functional relations between (P) and \1 we can still express a broad range of practically relevant use cases. Here we give an approach to formalize those restrictions which can be checked by a regexp compiler.

In an expression

    ... (P) ... \1

the back-reference \1 can be separated from P when following holds:

  1. P doesn’t contain back-references which means it is self contained.

  2. It is possible to write the expression in the way

    ... L(P)R ... \1

where L and R are left and right delimiters of P which means P has no characters in common with L and R. L can be empty when (P) is at the start of an expression.

The first condition can be checked syntactically. The second condition can be expressed using the following two equations on sets

2.1 LAST-SET(L)  /\ FIRST-SET(P) = {}
2.2 LAST-SET(P)  /\ FIRST-SET(R) = {}

If additionally following condition is true

2.3 FIRST-SET(P) /\ LAST-SET(P) = {}

R might be empty and an expression

    ... L(P)\1 ...

is permitted.

End Cycles

The current conditions are still to restrictive. For example regexp (a)\1 violates condition (2.3) but shall be permitted. What we really want to exclude is that \1 is adjecent to what I call a non empty endcycle.

An endcycle of P has the following definition:

END-CYCLE(P) = FOLLOW-SET( LAST-SET(P) )

Take for example the regexpP = (a*|b|c). Here LAST-SET(P) = {a, b, c} and  FOLLOW-SET({a,b,c}) = {a} which means that a is in the endcycle of P.

With endcycles in mind we can weaken the conditions of (2) considerably:

If P has no endcycle i.e.

    END-CYCLE(P) = {}

we permit

    ... L(P)\1 ...

if the following holds:

    END-CYCLE(L) /\ FIRST-SET(P) = {}

If on the other hand

    END-CYCLE(P) != {}

we permit

    ... L(P)R ... \1 ...

if the following holds:

    END-CYCLE(L) /\ FIRST-SET(P) = {}
    END-CYCLE(P) /\ FIRST-SET(R) = {}

Final  Remarks

No matter how the conditions are defined it has to be granted that matching (P) is terminated before backmatching. If this isn’t checked statically during regexp compilation one can still defer checks until runtime. Much like any other dynamic check it is less clear what will happen to an expression but there isn’t much mental overhead and the implementation is kept simpler.

reverb - a revival

Posted in General on February 28th, 2010 by kay – 3 Comments

Sometimes software is given up by people and you realize it only a few years later.  Large packages or libraries will inevitably be flagged as legacy and die but tiny modules might have a chance to survive and find a maintainer. I have done the latter now for reverb.py.