Greedy grammars and Any

Posted in Grammars, Parsing, Python on May 20th, 2012 by kay – Be the first to comment

. in your regular expression

I only vaguely remember my first encounter with a parser generator which must by dated back to the late 1990s. I guess it was Spark by John Aycock, an Early parser. What puzzled me back then was the need to be explicit down to the character level.  Regular expression engines, albeit cryptic, were a revelation because one could specify the structural information one needed and match the rest using a ‘.’ which is the wildcard pattern that matches any character.

I came back to Any in the Trail parser generator lately. I was motivated by writing a reversible C preprocessor. Unlike conventional C preprocessors which are used in the compilation chain of C code, a reversible C preprocessor can used to refactor C code, while retaining the preprocessor directives and the macro calls. This is basically done by storing the #define directive along with the code to be substituted and the substitution. The substituted code and the substitution are exchanged after the refactoring step, such that it looks like no substitution happened at all.

A comprehensive C preprocessor grammar can be found on the following MSDN site. What is most interesting to us are the following two EBNF productions:

# define identifier[( identifieropt, , identifieropt )] token-stringopt
token-string :
String of tokens 

The String of tokens phrase this is Any+.

Bon appétit

Suppose one defines two macros

#define min(a,b) ((a)<(b)?(a):(b))

#define max(a,b) ((a)<(b)?(b):(a))

Obviously the defining string of the min macro can be recognized using token-string but how can we prevent that token-string eats the max macro as well? Once in motion token-string has a sound appetite and will eat the rest. The solution to this problem in case of regular expressions is to make Any non-greedy. The non-greediness can easily be expressed using the following requirement:

If S | Any is a pattern with S!=Any. If S can match a character, S will be preferred over Any.

In the production rule

R: ... Any* S ...
we can be sure that if S matches in R then Any won’t be used to match – although it would match if we leave it greedy. Same goes with
R: ... (Any* | S) ...

Non greediness in grammars

Grammars are more complicated than regular expressions and we have to take more care about our greediness rules. To illustrate some of the problems we take a look on an example

R: A B | C
A: a Any*
B: b
C: c
Any causes a follow/first conflict between A and B. Making Any non-greedy alone won’t help because a grammar rule or its corresponding NFA is always greedy! It follows a longest match policy and an NFA will be traversed as long as possible. So once the NFA of A is entered it won’t be left because of the trailing Any*.

Detecting the trailing Any in A is easy though. We solve the follow/first conflict with a trailing Any by embedding A into R. Embedding strategies are the centerpiece of Trail and they shall not be recapitulated here. Just so much: embedding A in R doesn’t destroy any information relevant for parsing. If A has been embedded Any* will be delimited by B to the right and we can safely apply R without the danger of Any consuming a token ‘b’.

Eventually we have to re-apply our embedding strategy: if A is a rule with a trailing Any and A is embedded in B and B has a trailing Any after this embedding  then B will be embedded wherever possible.

A short experiment

Here is a mini-Python grammar is used to detect Python class definitions.

file_input: (NEWLINE | stmt)* ENDMARKER
classdef: 'class' NAME ['(' Any+ ')'] ':' suite
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
simple_stmt: Any* NEWLINE
stmt: simple_stmt | compound_stmt
compound_stmt: classdef | Any* suite

Unlike a real Python grammar it is fairly easy to build. All rules are taken from the Python 2.7 grammar but only file_input, suite and stmt remained unchanged. In all other cases we have replaced terminal/non-terminal information that isn’t needed by Any.



LL(*) faster than LL(1)?

Posted in Algorithms, Grammars, Parsing, Python, TBP on September 12th, 2011 by kay – 7 Comments

No jumps

When I began to work on EasyExtend in 2006 I grabbed a Python parser from the web, written by Jonathan Riehl ( it doesn’t seem to be available anymore ). It was a pure Python parser following closely the example of CPython’s pgen. The implementation was very dense and the generated parser probably as fast as a Python parser could be. It was restricted to LL(1) though which was a severe limitation when I stepped deeper into the framework.

In mid 2007 I created a parse tree checker. The problem was that a parse tree could be the return value of a transformation of another parse tree: T : P -> P. How do we know that P is still compliant with a given syntax? This can be easily be solved by chasing NFAs of the target grammar, both horizontally i.e. within an NFA as well as vertically: calling checkers recursively for each parse tree node which belong to a non-terminal. This checker generator was only a tiny step apart from a parser generator which I started to work on in summer 2007.

What I initially found when I worked on the parse tree checker was that horizontal NFA chasing never has to take into account that there are two alternative branches in rules like this

R: a* b | a* c
The algorithm never checks out the first branch, runs through a sequence of  ‘a’ until it hits ‘b’ and when this fails, jumps back and checks out the other branch. There was no backtracking involved, also no backtracking with memoization. There was simply never any jump. Instead both branches are traversed simultaneously until they become distinct. It’s easy to express this on grammar level by applying left factoring to the rule
R: a* ( b | c )
However there was never any rule transformation to simplify the problem.

From rules to grammars

It’s actually an old approach to regular expression matching which is attributed to Ken Thompson. Russ Cox refreshed the collective memory about it a few years ago. This approach never seemed to make the transition from regular expressions to context free grammars – or it did and was given up again, I don’t know. I wanted a parser generator based on the algorithms I worked out for parse tree checkers. So I had to invent a conflict resolution strategy which is specific for CFGs. Take the following grammar

R: A | B
A: a* c
B: a* d
Again we have two branches, marked by the names of the non-terminals A and B and we want to decide late which one to choose.

First we turn the grammar into a regular expression:

R: a* c | a* d
but now we have lost context/structural information which needs to be somehow added:
R: a* c <A>| a* d <B>
The symbols <A> and <B> do not match a character or token. They merely represent the rules which would have been used when the matching algorithm scans beyond ‘c’ or ‘d’. So once the scanner enters <A> it will be finally decided that rule A was used. The same is true for <B>. Our example grammar is LL() and in order to figure out if either A or B is used we need, in principle at least, infinite lookahead. This hasn’t been changed through rule embedding but now we can deal with the LL() grammar as-if it was an LL(1) grammar + a small context marker.


What is lacking in the above representation is information about the precise scope of A and B once they are embedded into R. We rewrite the grammar slightly by indexing each of the symbols on the RHS of a rule by the name of the rule:

R: A[R] | B[R]
A: a[A]* c[A]
B: a[B]* d[A]
Now we can embed A and B into R while being able to preserve the context:
R: a[A]* c[A] <A[R]>| a[B]* d[B] <B[R]>
Matching now the string aad yields the following sequence of sets of matching symbols:
{a[A], a[B]}, {a[A], a[B]}, {d[B]}, {<B[R]>}
All of the indexed symbols in a set matches the same symbol. The used index has no impact on the matching behavior, so a[X], a[Y], … will alway match a.

Constructing a parse tree from the above set-sequence is done by reading the sequence from right to left and interpret it appropriately.

We start the interpretation by translating the rightmost symbol

<B[R]> -> [R,[B, .]]
The  dot ‘.’ is a placeholder for a sequence of symbols indexed with B. It remains adjacent to B and is removed when the construction is completed:
[R, [B, .]]
[R, [B, ., d]]
[R, [B, ., a, d]]
[R, [B, ., a, a, d]]
[R, [B, a, a, d]]


We can read the embedding process as ’embed rules A and B into R’ or dually ‘expand R using rules A and B’.  I’ve chosen the latter expression for the Trail parser generator because an expanded rule R has its own characteristics and is distinguished from an unexpanded rule.

The drawback of this method is that its implementation turns out to be rather complicated. It is also limited because it may run into cyclic embeddings which need to be detected. Finally successive embeddings can blow up the expanded rule to an extent that it makes sense to artificially terminate the process and fall back to a more general and less efficient solution. So we have to mess with it. Finally isn’t there are performance penalty due to the process of reconstruction?


To my surprise I found that an LL(*) grammar that uses expansion quite heavily ( expanded NFAs are created with about 1000 states ) performs slightly better than a simple LL(1) grammar without any expansion in CPython. For comparison I used a conservative extension language P4D of Python i.e. a superset of Python: every string accepted by Python shall also be accepted by P4D.

In order to measure performance I created the following simple script

import time
import decimal
import langscape
text = open(decimal.__file__.replace(".pyc", ".py")).read()
print "textlen", len(text)
python = langscape.load_langlet("python")
p4d = langscape.load_langlet("p4d")
def test(langlet):
    tokens = langlet.tokenize(text)
    a = time.time()
    for i in range(10):
    print "parser", langlet.config.langlet_name, (time.time() - a)/10

It imports a reasonably big Python module ( ) and parses it with two different parsers generated by Trail. Running it using CPython 2.7 yields the following result:

parser python 2.39329998493
parser p4d 2.25759999752
This shows that P4D is about 5% faster on average! Of course the overall performance is abysmal, but keep in mind that the parser is a pure Python prototype implementation and I’m mainly interested in qualitative results and algorithms at this point.

I’ve also checked out the script with PyPy, both with activated and deactivated JIT.

PyPy with option –JIT off:

parser python 6.5631000042
parser p4d 5.66440000534
Now the LL(*) parser of P4D is about 13-14 % faster than the LL(1) parser, which is much clearer. Activating the JIT reverses the pattern though and intense caching of function calls will pay of:

PyPy with JIT:

parser python 0.791500020027
parser p4d 1.06089999676
Here the Python parser is about 1/3 faster than the P4D parser.

The result of the competition depends on the particular implementation and the compiler/runtime optimizations or the lack thereof. The counter-intuitive result that an LL(*) parser is faster than an LL(1) parser could not be stabilized but also not clearly refuted. It’s still an interesting hypothesis though and rule expansion may turn out to be a valid optimization technique – also for LL(1) parsers which do not require it as a conflict resolution strategy. I will examine this in greater detail once I’ve implemented an ANSI C version of Trail.

Patching tracebacks

Posted in Algorithms, Langscape, Python on April 11th, 2011 by kay – Be the first to comment

One of the problems I early ran into  when working on EasyExtend ( and later on Langscape ) was to get error messages from code execution which were not corrupt.

The situation is easily explained: you have a program P written in some language L. There is a source-to-source translation of P into another program Q of a target language, preferably Python. So you write P but the code which is executed is Q. When Q fails Python produces a traceback. A traceback is a stack of execution frames – a snapshot of the current computation – and all we need to know here is that each frame holds data about the file, the function, and the line which is executed. This is all you need to generate a stacktrace message such as:

Traceback (most recent call last):
  File "tests\", line 12, in
  File "tests\", line 10, in bar
  File "tests\", line 5, in foo
NameError: global name 'b' is not defined

The problem here is that line number information in the frames is from Q whereas the file and the lines being displayed are from P – the only file there is!

Hacking line numbers into parse trees

My first attempt to fix the problem of wrong line information ( I worked with Python 2.4 at that time and I am unaware about changes for later versions of Python ) was to manipulate Q or rather the parse tree corresponding to Q which got updated with the line numbers I used to expect. When the growth of line numbers in Q was non-monotonic, using CPythons internal line number table, lnotab, failed to assign line numbers correctly. Furthermore the CPython compiler has the habit of ignoring some line information but reconstructs them, so you cannot be sure that your own won’t be overwritten. There is a  hacking prevention built into the compiler as it seems and I gave up on that problem for a couple of years.

From token streams to string pattern

Recently I started to try out another idea. For code which is not optimized or obfuscated and preserves name, number and string information in a quantitative way ( turning some statements of P into expressions in Q or vice versa, break  P token into token sequences in Q etc. ) we can checkout the following construction. Let be

T(V, Q) = {T in TS_Q| V = T.Value }
the set of token in the token stream TS_Qof Q with the prescribed token value V. Analog to this we can build build a set T(V, P) for P and TS_P.

The basic idea is now to construct a mapping between T(V,Q)  and  T(V,P). In order to get the value V we examine the byte code of a traceback frame up to the last executed instruction f_lasti. We assume that executing f_lasti leads to the error. Now the instruction may not be coupled to a particular name, so we examine f_lasti or the last instruction preceding f_lasti for which the instruction type is in the set

From the value related to one of those instructions, the type of the value which is one of  {NAME, STRING, NUMBER} and the execution line f_lineno we create a new token T_q = [tokentype, value, lineno]. For V we set V = value. Actually things are a little more complicated because the dedicated token T_q and the line in Q are not necessarily in a 1-1 relationship. So there might in fact be n>1 token being equal to T_q which originate from different lines in P. So let Token be the list of all token we found on line T_q.Line and k = Token.count(T_q). We add k to the data characterizing T_q.

So assume having found T_q. The map we want to build is T(T_q.Value, Q) -> T(T_q.Value, P). How can we do that?

In the first step we assign a character to each token in the  stream TS_Q and turn TS_Q into a string S_TS_Q. The character is arbitrary and the relationship between the character and the token string shall be 1-1. Among the mappings is T_q -> c. For each T in T(T_q.Value, Q) we determine then a substring of S_TS_Q with c as a midpoint:

    class Pattern:
        def __init__(self, index, token, pattern):
            self.index = index
            self.token = token
            self.pattern = pattern
    S_TS_Q = u''
    m = 0x30
    k = 5
    tcmap = {}    
    # create a string from the token stream
    for T in TS_Q:
        if T.Value in tcmap:
            s = unichr(m)
            tcmap[T.Value] = s
    # create string pattern
    pattern = []
    for T in TVQ:
        n = TS_Q.index(T)
        S = S_TS_Q[max(0, n-k): n+k]
        pattern.append(Pattern(n, T, S))

The same construction is used for the creation of target patterns from T(V, P). In that case we use the tcmap dictionary built during the creation of pattern from T(V, Q): when two token in TS_Q and TS_P have the same token value, the corresponding characters shall coincide.

The token mapping matrix

In the next step we create a distance matrix between source and target string pattern.

    n = len(source_pattern)
    m = len(target_pattern)
    Rows = range(n)
    Cols = range(m)
    M = []
    for i in range(n):
    for i, SP in enumerate(source_pattern):
        for j, TP in enumerate(target_pattern):
            M[i][j] = levenshtein(SP.pattern, TP.pattern)

As a distance metrics we use the edit- or levenshtein distance.

Having that matrix we compute an index pair (I,J) with M[I][J] = min{ M[i][j] | i in Rows and j in Cols}. Our interpretation of (I,J) is that we map source_pattern[I].token onto target_pattern[J].token. Since there is an I for which source_pattern[I].token == T_q the corresponding T_p =target_pattern[J].token is exactly the token in P we searched for.

The line in the current traceback is the line T_q.Line of Q. Now we have found T_p.Line of P which is the corrected line which shall be displayed in the patched traceback. Let’s take a brief look on the index selection algorithm for which M was prepared:

while True:
    k, I = 1000, -1
    if n&gt;m and len(Cols) == 1:
        return target_pattern[Cols[0]].token[2]
        for r in Rows:
            d = min(M[r])
            if d&lt;k:
                k = d
                I = r
        J = M[I].index(k)
        for row in M:
            row[J] = 100
    SP = source_pattern[I]
    if SP.token == T_q:
        tok = target_pattern[J].token
        return tok.Line

If there is only one column left i.e. one token in T(V, P) its line will be chosen. If the J-column was selected we avoid re-selection by setting row[J] = 100 on each row. In fact it would suffice to consider only the rows left in Rows.


One example I modified over and over again for testing purposes was following one: [P]
def foo():
    a = ("c",
        (lambda x: 0+(lambda y: y+0)(2))(1),
def bar():

You can comment out b.p turn a + in the lambda expression into / provoking another ZeroDivision exception etc. This is so interesting because when parsed and transformed through Langscape and then unparsed I get

import langscape; __langlet__ = langscape.load_langlet("python")
def foo():
    a = ("c", 0, (lambda x: 0+(lambda y: y+0)(2))(1), b.p, 0, 1/0, b.p)
def bar():

So it is this code which is executed using the command line

which runs in the Python langlet through So the execution process sees but the code which is compiled by Python will be Q.

See the mess that happens when the traceback is not patched:

Traceback (most recent call last):
  File "", line 9, in &lt;module&gt;
  File "langscape\langlets\python\tests\", line 6, in &lt;module&gt;
  File "langscape\langlets\python\tests\", line 5, in bar
  File "langscape\langlets\python\tests\", line 3, in foo
NameError: global name 'b' is not defined

There is even a strange coincidence because bar() is executed on line 5 in the transformed program and b.p is on line 5 in the original program but all the other line information is complete garbage. When we plug in, via sys.excepthook,  the traceback patching mechanism whose major algorithm we’ve developed above we get

Traceback (most recent call last):
  File "", line 9, in &lt;module&gt;
  File "langscape\langlets\python\tests\", line 13, in &lt;module&gt;
  File "langscape\langlets\python\tests\", line 11, in bar
  File "langscape\langlets\python\tests\", line 5, in foo
NameError: global name 'b' is not defined

which is exactly right!


The algorithm described in this article is merely a heuristics and it won’t work accurately in all cases. In fact it is impossible to even define conclusively what those cases are because source-to-source transformations can be arbitrary. It is a bit like a first-order approximation of a code transformation relying on the idea that the code won’t change too much.

An implementation note. I was annoyed by bad tracebacks when testing the current Langscape code base for a first proper 0.1 release. I don’t think it is too far away because I have some time now to work on it. It will still be under tested when it’s released and documentation is even more fragmentary. However at some point everybody must jump, no matter of the used methodology.

Fuzzy string matching II – matching wordlists

Posted in Algorithms, Python on January 7th, 2011 by kay – 3 Comments

Small misspellings

An anonymous programming reddit commenter wrote about my fuzzy string matching article:

A maximum edit distance of 2 or 3 is reasonable for most applications of edit distance. For example, cat and dog are only 3 edits away from each other and look nothing alike. Likewise, the original “damn cool algorithm” matches against sets of strings at the same time, where as the algorithms in the article all only compare two strings against each other.

This is a valid objection.

However, for the most common case, which is an edit distance of 1 you don’t need a Levenshtein automaton either. Here is the recipe:

Let a wordlist and an alphabet be given. An alphabet can be for example the attribute string.letters of the string module. For a string S all string variants of S with an edit distance <=1 over the alphabet can be computed as follows:

def string_variants(S, alphabet):
    variants = set()
    for i in range(len(S)+1):
        variants.add(S[:i]+S[i+1:])       # delete char at i
        for c in alphabet:
            variants.add(S[:i]+c+S[i:])   # insert char at i
            variants.add(S[:i]+c+S[i+1:]) # subst char at i
    return variants

The set of words that shall be matched is given by:

set(load(wordlist)) &amp; string_variants(S, alphabet)

The used alphabet can be directly extracted from the wordlist in preparation of the algorithm. So it is not that we are running into trouble when non ASCII characters come up.

When you want to build string variants of edit distance = 2, just take the result of string_variants and apply string_variants on it again.

The complexity of is


where n is the string length and k the edit distance.

Alternative Approaches

For k = 1 we are essentially done with the simple algorithm above. For k=2 and small strings the results are still very good using an iterative application of string_variants to determine for a given S all strings with edit-distance <=2 over an alphabet. So the most simple approaches probably serve you well in practise!

For k>2 and big alphabets we create word lists which are as large or larger than the wordlist we check against.The effort runs soon out of control. In the rest of the article we want to treat an approach which is fully general and doesn’t make specific assertions. It is overall not as efficient as more specialized solutions can be but it might be more interesting for sophisticated problems I can’t even imagine.

The basic idea is to organize our wordlist into an n-ary tree, the so called PrefixTree, and implement an algorithm which is variant of fuzzy_compare to match a string against this tree with a prescribed maximum edit distance of k for the words we extract from the tree during the match.

Prefix Trees

For a set of words we can factor common prefixes. For example {aa, ab, ca} can be rewritten as {a[a,b], c[a]}. Systematic factoring yields an n-ary tree – we call it a PrefixTree. Leaf nodes and words do not correspond in a 1-1 relationship though, because a word can be a proper prefix of another word. This means that we have to tag PrefixTree nodes with an additional boolean is_word field.

class PrefixTree(object):
    def __init__(self, char = '', parent = None):
        self.char     = char
        self.parent   = parent
        self.children = {}
        self.is_word  = False
    def _tolist(self):
        if self.is_word:
            yield self.trace()
        for p in self.children.values():
            for s in p._tolist():
                yield s
    def __iter__(self):
        return self._tolist()
    def insert(self, value):
        if value:
            c = value[0]
            tree = self.children.get(c)
            if tree is None:
                tree = PrefixTree(c, self)
                self.children[c] = tree
            self.is_word = True
    def __contains__(self, value):
        if value:
            c = value[0]
            if c in self.children:
                return value[1:] in self.children[c]
            return False
        return True
    def __len__(self):
        if self.parent is not None:
            return len(self.parent)+1
        return 0
    def trace(self):
        if self.parent is not None:
            return self.parent.trace()+self.char
        return self.char

Reading a wordlist into a PrefixTree can be simply done like this:

pt = PrefixTree()
for word in wordlist:

Before we criticise and modify the PrefixTree let us take a look at the matching algorithm.

Matching the PrefixTree

The algorithm is inspired by our fuzzy_compare algorithm. It uses the same recursive structure and memoization as fuzzy_compare.

def update_visited(ptree, visited):
    visited[ptree][-1] = 0
    T = ptree.parent
    while T is not None and T.char!='':
        if len(T.children) == 1:
            visited[T][-1] = 0
            T = T.parent
def is_visited(i, T, k, visited):
    d = visited.get(T, {})
    if -1 in d:
        return True
    m = d.get(i,-1)
    if k&gt;m:
        d[i] = k
        visited[T] = d
        return False
    return True
def fuzzy_match(S, ptree, k, i=0, visited = None, N = 0):
    Computes all strings T contained in ptree with a distance dist(T, S)&lt;=k.
    trees = set()
    # handles root node of a PrefixTree
    if ptree.char == '' and ptree.children:
        N = len(S)
        visited = {}
        for pt in ptree.children.values():
            trees.update(fuzzy_match(S, pt, k, i, visited, N))
        return trees
    # already tried
    if is_visited(i, ptree, k, visited):
        return []
    # can't match ...
    if k == -1 or (k == 0 and S[i] != ptree.char):
        return []
    if ptree.is_word and (N-i&lt;=k or (N-i&lt;=k+1 and ptree.char == S[i])):
        if not ptree.children:
            update_visited(ptree, visited)
            return trees
    if ptree.char!=S[i]:
        trees.update(fuzzy_match(S, ptree, k-1, i+1, visited, N))
    for c in ptree.children:
        if ptree.char == S[i]:
            trees.update(fuzzy_match(S, ptree.children[c], k, i+1, visited, N))
            trees.update(fuzzy_match(S, ptree.children[c], k-1, i+1, visited, N))
        trees.update(fuzzy_match(S, ptree.children[c], k-1, i, visited, N))
    return trees

Lazy PrefixTree construction

The major disadvantage of the construction is the time it takes upfront to create the PrefixTree. I checked it out for a wordlist of 158.989 entries and it took about 10 sec. With psyco activated it still takes 7.5 sec.

A few trivia for the curious. I reimplemented PrefixTree in VC++ using STL hash_map and got a worse result: 14 sec of execution time – about twice as much as Python + Psyco. The language designed with uncompromised performance characteristics in mind doesn’t cease to surprise me. Of course I feel bad because I haven’t build a specialized memory management for this function and so on. Java behaves better with 1.2 sec on average.

A possible solution for Python ( and C++ 😉 ) but also for Java, when wordlists grow ever bigger, is to create the PrefixTree only partially and let it grow when needed. So the load time gets balanced over several queries and a performance can be avoided.

Here is the modified code:

class PrefixTree(object):
    def __init__(self, char = '', parent = None):
        self.char      = char
        self.parent    = parent
        self.is_word   = False
        self._children = {}
        self._words    = set()
    def _get_children(self):
        if self._words:
        return self._children
    children = property(_get_children)
    def _create_children(self):
        for tree, word in self._words:
        self._words = set()
    def _tolist(self):
        if self.is_word:
            yield self.trace()
        for p in self.children.values():
            for s in p._tolist():
                yield s
    def __iter__(self):
        return self._tolist()
    def insert(self, value):
        if value:
            c = value[0]
            tree = self._children.get(c)
            if tree is None:
                tree = PrefixTree(c, self)
                self._children[c] = tree
            if len(value) == 1:
                tree.is_word = True
            self.is_word = True
    def __contains__(self, value):
        if value:
            if value in self._words:
                return True
            c = value[0]
            if c in self._children:
                return value[1:] in self._children[c]
            return False
        return True
    def __len__(self):
        if self.parent is not None:
            return len(self.parent)+1
        return 0
    def trace(self):
        if self.parent is not None:
            return self.parent.trace()+self.char
        return self.char

Some numbers

The numbers presented here should be taken with a grain of salt and not confused with a benchmark but still provide a quantitative profile which allows drawing conclusions and making decisions.

Load time of wordlist of size 158.989 into pt = PrefixTree(): 0.61 sec
Execution time of fuzzy_match("haus", pt, 1) - 1st run: 1.03 sec
Execution time of fuzzy_match("haus", pt, 1) - 2nd run: 0.03 sec
Execution time of fuzzy_match("haus", pt, 2) - 1st run: 1.95 sec
Execution time of fuzzy_match("haus", pt, 2) - 2nd run: 0.17 sec
Execution time of fuzzy_match("haus", pt, 3) - 1st run: 3.58 sec
Execution time of fuzzy_match("haus", pt, 3) - 2nd run: 0.87 sec
We see that the second run is always significantly faster because in the first run the PrefixTree gets partially built while in the second run the built nodes are just visited.

Finally here are the numbers using string variants:

Execution time of string_variants("haus", string.letters): 0.0 sec
Execution time of 2-iterated of string_variants("haus", string.letters): 0.28 sec
Execution time of 3-iterated of string_variants("haus", string.letters): 188.90 sec
The 0.0 seconds result simply means that for a single run it is below a threshold. The other results can possibly be improved by a factor of 2 using a less naive strategy to create string variants avoiding duplicates. The bottom line is that or k = 1 and k = 2 using PrefixTrees, Levenshtein automata and other sophisticated algorithms aren’t necessary and for k >=3 PrefixTree based approaches doesn’t run amok.


The code for fuzzy_compare and fuzzy_match can be downloaded here. It also contains tests, some timing measurements and a German sample wordlist.

Fuzzy string matching

Posted in Algorithms, Python on December 21st, 2010 by kay – 5 Comments

A damn hot algorithm

I found the following article written by Nick Johnson about the use of finite state machines for approximate  string matches i.e. string matches which are not exact but bound by a given edit distance. The algorithm is based on so called “Levenshtein automatons”. Those automatons are inspired by the construction of the Levenshtein matrix used for edit distance computations. For more information start reading the WP-article about the Levenshtein algorithm which provides sufficient background for Nicks article.

I downloaded the code from github and checked it out but was very stunned about the time it took for the automaton construction once strings grow big. It took almost 6 minutes on my 1.5 GHz notebook to construct the following Levenshtein automaton:

k = 6= "".join(str(s) for s in range(10)*k)
lev = levenshtein_automata(S, k).to_dfa()

The algorithm is advertised as a “damn cool algorithm” by the author but since the major effect on my notebook was producing heat I wonder if “cool” shouldn’t be replaced by “hot”?

In the following article I’m constructing an approximate string matching algorithm from scratch.

Recursive rules for approximate string matching

Let ‘S be a string with len(S)=n and k a positive number with k<=n. By “?” we denote a wildcard symbol that matches any character including no character ( expressing a contraction ). Since S has length n we can select arbitrary k indexes in the set {0,…,n-1} and substitute the characters of S at those indexes using a wildcard symbol. If for example (S = “food” , k = 1 and index = 2) we get “fo?d”.

We describe the rule which describes all possible character substitutions in “food” like this:

pattern(food, 1) = ?ood | f?od | fo?d  | foo?
Applying successive left factorings yields:
pattern(food, 1) = ?ood | f  ( ?od | o (?d  | o? ) )
This inspires a recursive notation which roughly looks like this:
pattern(food, 1) = ?ood | f pattern(ood, 1)
or more precisely:
pattern(c, 1) = ?
pattern(S, 1) = ?S[1:] | S[0] pattern(S[1:], 1)
where we have used a string variable S instead of the concrete string “food”.

When using an arbitrary k instead of a fixed k = 1 we get the following recursive equations:

pattern(c, k) = ?
pattern(S, k) = ?pattern(S[1:], k-1) | S[0] pattern(S[1:], k)

Consuming or not consuming?

When we try to find an efficient implementation for the pattern function described above we need an interpretation of the ? wildcard action. It can consume a character and feed the rest of the string into a new call of pattern or skip a character and do the same with the rest. Since we cannot decide the choice for every string by default we eventually need backtracking but since we can memoize intermediate results we can also lower efforts. But step by step …

The basic idea when matching a string S1 against a string S2 is that we attempt to match S1[0] against S2[0] and when we succeed, we continue matching S[1:] against S2[1:] using the same constant k. If we fail, we have several choices depending on the interpretation of the wildcard action: it can consume a character of S2 or leave S2 as it is. Finally S1 and S2 are exchangeable, so we are left with the following choices:

fuzzy_compare(S1, S2[1:], k-1)
fuzzy_compare(S2, S1[1:], k-1)
fuzzy_compare(S1[1:], S2[1:], k-1)

All of those choices are valid and if one fails we need to check out another one. This is sufficient for starting a first implementation.

A first implementation

The following implementation is a good point to start with:

def fuzzy_compare(S1, S2, k, i=0, j=0):
    N1 = len(S1)
    N2 = len(S2)
    while True:
        if N1-i&lt;=k and N2-j&lt;=k:
            return True
            if S1[i] == S2[j]:
        except IndexError:
            return False
        if k == 0:
            return False
            if fuzzy_compare(S1, S2, k-1, i+1, j):
                return True
            if fuzzy_compare(S1, S2, k-1, i, j+1):
                return True
            if fuzzy_compare(S1, S2, k-1, i+1, j+1):
                return True
            return False

The algorithm employs full backtracking and it is also limited to medium sized strings ( in Python ) because of recursion. But it shows the central ideas and is simple.

A second implementation using memoization

Our second implementation still uses recursion but introduces a dictionary which records all (i,j) index pairs that have been encountered and stores the current value of k. If the algorithm finds a value k’ at (i,j) with k’<=k it will immediately return False because this particular trace has been unsuccessfully checked out before. Using ann x n matrix to memoize results will reduce the complexity of the algorithm which becomes O(n^2) where n is the length of the string. In fact it will be even O(n) because only a stripe of width 2k along the diagonal of the (i,j)-matrix is checked out. Of course the effort depends on the constant k.

def is_visited(i, j, k, visited):
    m = visited.get((i,j),-1)
    if k&gt;m:
        visited[(i,j)] = k
        return False
    return True
def fuzzy_compare(S1, S2, k, i = 0, j = 0, visited = None):
    N1 = len(S1)
    N2 = len(S2)
    if visited is None:
        visited = {}
    while True:
        if N1-i&lt;=k and N2-j&lt;=k:
            return True
            if S1[i] == S2[j]:
                visited[(i,j)] = k
        except IndexError:
            return False
        if k == 0:
            return False
            if not is_visited(i+1, j, k-1, visited):
                if fuzzy_compare(S1, S2, k-1, i+1, j, visited):
                    return True
            if not is_visited(i, j+1, k-1, visited):
                if fuzzy_compare(S1, S2, k-1, i, j+1, visited):
                    return True
            if not is_visited(i+1, j+1, k-1, visited):
                if fuzzy_compare(S1, S2, k-1, i+1, j+1, visited):
                    return True
            return False

A third implementation eliminating recursion

In our third and final implementation we eliminate the recursive call to fuzzy_compare and replace it using a stack containing tuples (i, j, k) recording the current state.

def is_visited(i, j, k, visited):
    m = visited.get((i,j),-1)
    if k&gt;m:
        visited[(i,j)] = k
        return False
    return True
def fuzzy_compare(S1, S2, k):
    N1 = len(S1)
    N2 = len(S2)
    i = j = 0
    visited = {}
    stack = []
    while True:
        if N1-i&lt;=k and N2-j&lt;=k:
            return True
            if S1[i] == S2[j]:
                visited[(i,j)] = k
        except IndexError:
            if stack:
                i, j, k = stack.pop()
                return False
        if k == 0:
            if stack:
                i, j, k = stack.pop()
                return False
            if not is_visited(i+1, j, k-1, visited):
                i, j, k = i+1, j, k-1
            elif not is_visited(i, j+1, k-1, visited):
                i, j, k = i, j+1, k-1
            elif not is_visited(i+1, j+1, k-1, visited):
                i, j, k = i+1, j+1, k-1
            elif stack:                               
                i, j, k = stack.pop()
                return False

This is still a nice algorithm and it should be easy to translate it into C or into JavaScript for using it in your browser. Notice that the sequence of ifelif branches can impact performance of the algorithm. Do you see a way to improve it?

Checking the algorithm

When D is the Levenshtein distance between two strings S1 and S2 then fuzzy_compare(S1, S2, k) shall be True for k>=D and False otherwise. So when you want to test fuzzy_compare compute the Levenshtein distance and check fuzzy_compare with the boundary values k = D and k = D-1.

def levenshtein(s1, s2):
    l1 = len(s1)
    l2 = len(s2)
    matrix = [range(l1 + 1)] * (l2 + 1)
    for zz in range(l2 + 1):
      matrix[zz] = range(zz,zz + l1 + 1)
    for zz in range(0,l2):
      for sz in range(0,l1):
        if s1[sz] == s2[zz]:
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
                                   matrix[zz][sz+1] + 1,
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
                                   matrix[zz][sz+1] + 1,
                                   matrix[zz][sz] + 1)
    return matrix[l2][l1]

For exhaustive testing we define a set of strings as follows:

Given a prescribed n we define the set of strings of length = n which consists of “a” and “b” characters only. The number of those strings is 2^n. If we consider all pairs of strings in that set we get 2^(2n) of such pairs. Of course we could exploit symmetries to remove redundant pairs but in order to keep it simple we examine only small strings e.g. n = 6 which leads to 4096 pairs altogether.

def string_set(S = None, k = 0, strings = None, n = 6):
    if S is None:
        strings = []
        S = ["a"]*n
    for i in range(k, n):
        S1 = S[:]
        S1[i] = "b"
        string_set(S1, i+1, strings, n)
    return strings
def string_pairs(n):
    L1 = string_set(n=n)
    pairs = []
    for i in range(len(L1)):
        for k in range(1, n+1):
            L2 = string_set(n=k)
            for j in range(len(L2)):
                pairs.append((L1[i],L2[j],levenshtein(L1[i], L2[j])))
                pairs.append((L2[j],L1[i],levenshtein(L2[j], L1[i])))
    return pairs

Our test function is short:

def test(n):
    for S1, S2, D in string_pairs(n):
        assert fuzzy_compare(S1, S2, D) == True, (S1, S2, D)
        assert fuzzy_compare(S1, S2, D-1) == False, (S1, S2, D-1)

Have much fun!

Python26 expressions

Posted in Grammars, Langscape, Python on November 26th, 2010 by kay – Be the first to comment

When you look at the following listing you might think it’s just a sequence of nonsense statements in Python 26,maybe created for testing purposes:

raise a, b, c
import d
from e import*
import f
from .g import(a)
from b import c
from .import(e)
from f import(g)
from .a import(b, c as d,)
import e, f as g
from .a import(b, c,)

( Here is the complete listing ).

which is of course correct. It is a somewhat peculiar listing because you find expressions like

c[:,: d:,]**e


[c for d in e, lambda f = g, (a, b,) = c,: d, for e in f]

and a few others such as

b(*c, d, **e)**f

which will be rejected by the Python compiler because the *c argument precedes the name d but it is nevertheless “grammar correct” by which I mean it is consistent with Pythons context free LL(1) grammar.

The nice thing about the listing: it is automatically generated and it is complete in a certain sense.

Generative Grammars

When you look at a grammar rule like

for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]

it can be understood as an advice for producing exactly 2 expressions, namely:

'for' exprlist 'in' testlist ':' suite
'for' exprlist 'in' testlist ':' suite 'else' ':' suite

Other rules like

dictmaker: test ':' test (',' test ':' test)* [',']

has an infinite number of productions

test ':' test
test ':' test ','
test ':' test ',' test ':' test
test ':' test ',' test ':' test ','

When I created the listing I selected a small number of productions for each grammar rule. Each symbol in the rule should be covered and have at least one occurrence in the set of productions. Despite for_rule being finite and dictmaker being infinite the algorithm creates two productions for each.

After having enough productions to cover all syntactical subtleties ( expressed by the grammar ) I had to built one big rule containing all productions. This was actually the most demanding step in the design of the algorithm and I did it initially wrong.

Embedding of productions

Intuitively we can interpret all non-terminal symbols in our productions as variables which may be substituted. We expand test in dictmaker by selecting and inserting one production we got for the rule test. Unfortunately a grammar isn’t a tree but a directed, cyclic graph so we have to be extremely careful for not running into an infinite replacement loop. This is only a technical problem though and it can be handled using memoization. Here is a bigger one.

Look at the following two rules:

expr_stmt: testlist (augassign (yield_expr|testlist) |
           ('=' (yield_expr|testlist))*)
augassign: ('+=' | '-=' | '*=' | '/=' | '%=' | '&amp;=' | '|=' | '^=' |
            '&lt;&lt;=' | '&gt;&gt;=' | '**=' | '//=')

The only place where augassign occurs  in the grammar is in expr_stmt but counting the number of productions for augassign we get 12 whereas we only count 3 productions for expr_stmt and there is just a single production which contains expr_stmt. It is obviously impossible using a naive top down substitution without leaving a rest of productions which can’t be integrated. We have a system of dependencies which has to be resolved and the initial set of production rules must be adapted without introducing new productions which also cause new problems. This is possible but in my attempts the expressions became large and unreadable, so I tried something else.

Observe, that the most import start rule of the grammar ( Python has actually 4! Can you see which ones? ) is:

file_input: (NEWLINE | stmt)* ENDMARKER

I expect that each language has a rule of such a kind on a certain level of nesting. It produces a sequence of statements and newlines. I tried the following Ansatz:

Wrap each initially determined production which is not a production of a start rule into a stmt

Take the production ‘+=’ of augassign as an example. We find that augassign exists in expr_stmt. So we take one expr_stmt and embedd augassign in the concrete form  ‘+=’.

testlist '+=' yield_expr

The subsequent embedding steps

expr_stmt   -&gt; small_stmt
small_stmt  -&gt; simple_stmt
simple_stmt -&gt; stmt

When embedding small_stmt into simple_stmt one has to add a trailing NEWLINE. So our final result is:

testlist '+=' yield_expr NEWLINE

Any rule we used during successive embedding  doesn’t have to be used again as an initial rule of another embedding because it was already built into file_input. It can be reused though when needed. I did not attempted to minimize the number of embeddings.

Substitute non-terminals

Now since we got a single sequence of terminals and non-terminals which contains all our productions in a consistent way we are going to substitute the non-terminals. This is done s.t. that a minimum number of terminal symbols is required which explains some of the redundancies: we find import f and import d among the listed statements. I suspect one of them is a shortened form of import d.e but since the rule for building d.e allows using d only and it is shorter, it will be chosen.

Detecting Grammar flaws

Generating the above expressions also shows some flaws in the grammar which have to be corrected using the bytecode compiler ( or AST transformer ). This doesn’t mean that Pythons grammar isn’t carefully crafted, quite the contrary is true, but highlights some of the limitations of using an LL(1) grammar. For example, it is quite simple although a little cumbersome to express argument orderings in variable arguments lists using non-LL(1) grammars:

file_input: (NEWLINE | stmt)* ENDMARKER
simpleargs: fpdef (',' fpdef)*
defaultargs: fpdef '=' test (',' fpdef '=' test)*
starargs: '*' NAME
dstarargs: '**' NAME
varargslist: ( simpleargs [',' defaultargs] [',' starargs] [','dstarargs] |
               defaultargs [',' starargs] [','dstarargs] |
               starargs [','dstarargs] |
               dstarargs) [',']

So when you craft your own grammar, automatic expression generation might aid design decisions. Detecting flaws early can spare lots of code used to add additional checks later on.


In case of Langscape the primary goal was to safeguard grammar refactorings. It is not generally possible to proof that two context free grammars are equal i.e. recognize the same language. But the same holds for any two programs in the general case in even more powerful, Turing complete, languages. This doesn’t imply we never change any code. It is a standard practice to safeguard refactorings using unit tests and so we start to do here.

If we assume that two different grammars G1, G2 recognize the same language L then their parsers P(G1), P(G2) must at least be able to parse the grammar generated expression of the other grammar respectively: P(G1)(Expr(G2)) -> OK;  P(G2)(Expr(G1)) -> OK.

Of course we can refine this criterion by including bad case tests or comparing the selection sequences of TokenTracers for Expr(G1), Expr(G2) which must be equal. Last but not least we can use higher approximations.

Higher approximations

Doesn’t the listing give us a 1st order approximation of the language? It’s a fun idea to think of all those listing expressions living in the “tangential space” of the language. “Higher approximation” would simply mean longer traces though ( if they are possible due to the presence of a Kleene star ). This yields a simpler idea: we create the set Tr(K, nfa) of traces of length K for a given nfa. Tr(K, nfa) may be empty for some K.Unfortunately we can’t infer from Tr(K) = {} that Tr(K+1) = {}. So what is a good stop criterion then?

The algorithm for creating Tr(K, nfa) is quite simple. The following functions are Langscape implementations:

def compute_tr(K, nfa):
    Computes the set Tr(K, nfa) of traces of length K for a given nfa.
    The return value may be [] if no trace of length K exists.
    _, start, trans = nfa
    return compute_subtraces(K, 0, start, [], trans)
def compute_subtraces(K, k, S, trace, trans):
    Computes complete traces of a given length.
    :param K: The prescribed length a trace shall have.
    :param k: The current length of a trace ( used by recursive calls ).
    :param trace: the current trace.
    :param trans: the {state:[follow-states]} dictionary which characterizes
                  one NFA.
    traces = []
    follow = trans[S]
    for F in follow:
        if F[0] is None:
            # termination condition fulfilled?
            if k == K:
            m = trace.count(F)
            # impossible to terminate trace under this condition
            if m == K:
                traces+=compute_subtraces(K, max(k,m+1), F, trace+[F], trans)
    return traces

Shaky Python future

Posted in Python on April 24th, 2010 by kay – 10 Comments

Mark Pilgrim says:

Anyway, I’m really proud of how well DiP3 [Dive into Python 3, ks] came out. The only problem is that no one is using Python 3. I took a gamble last year that large libraries would port to Python 3 while I was writing. That didn’t happen. I think it’s pretty clear by now that that’s not going to happen anytime soon. Everyone who gambled on the glorious non-backward-compatible future got burned. Given my experience with HTML, you’d think I’d learn. Ah well.

So what are realist expectations? Python 2 as the future of a research language called Python 3?

“Resolver Games” are alive!

Posted in Python on October 24th, 2009 by kay – 3 Comments

Resolver One competition

I admit I participated in the Resolver One competition in the first round in January as one of the losers. When Resolver Systems announced their challenge I got the impression they encouraged using their spreadsheet almost like a medium for expressing ideas and thinking out of the box. However, Resolver Systems is yet another company which exists solely to sell stuff and consulting, not a patron of modern art or hacking experiments. So the awarded spreadsheets are looking a bit conventional and are technically unsophisticated. Their sophistication lies in external factors like the scientific ideas which are exercised. Some make extensive use of graphic capabilities of .NET which is also outside of the Resolver One API. It’s good demo stuff nevertheless and this might be their main purpose in the end.

Resolver Games

I’m glad to see Resolver Games being online now which was my own contribution. Resolver Games is about simple learning games for word lists. The examples I used were from “Teuton” ( Teuton is a German Python dialect, inspired by an Artima posting of Andy Dent, which replaces Pythons English keywords and builtins by German translations of them – Teuton is one of my fun projects and a langlet demo for EasyExtend ), the other one is IPA – the international phonetic alphabet – which is just a great learning target.

A Resolver Game consists of a pair of Resolver One spreadsheets. One for the word-list/game data and the other one for the game board. The game data spreadsheet is conventional and stateless. The game board is designed to be generic and independent of the particular game. The game board was tricky to program because it uses the spreadsheet for event handling and user interactions. Writing an event handler is a bit like scanning a screen and notify changes by comparing the actual with the previous image point by point. Resolver One stores data in the background and those data affect re-computations. Sometimes I wanted to cause a user event doing re-computations without changing the displayed data. I used the following trick: add and remove a blank to the cell-data and swap between the two representations periodically.

"+" -&gt; "+ " -&gt; "+" -&gt; "+ " -&gt; ...

When the cell content is “+” change it to “+ ” and vice versa. This goes unnoticed because there is no visual effect associated with the blank. Once I got into whitespace oriented programming the hard problems with state changes in Resolver Games became solvable.

One could argue that Resolver One is simply not the right-tool-for-the-job and it is overstretched in this application. I don’t disagree but this line of argument always appeared philistine to me and I reserve it to those who simply don’t know it better. A more serious objection to Resolver Games might be the fun aspect. Is it really fun to play? Resolver Games are surely a bit poor in game dramaturgy and visual effects. So I’d rather say NO, but I’m not a gamer anyway.

Choosers and ChooserMixins in C++ and Python

Posted in Chooser, CPP, Python, Testing on September 12th, 2009 by kay – 2 Comments

Chooser Objects

From time to time I’m amazed to find a simple algorithm which seemed to be a low hanging fruit which was just overlooked. In this particular case it is about generating and utilizing test data in a both simple and flexible manner. Mark Seaborn described the method in his outstanding blog article How to do model checking of Python code. He distilled what we might call the Chooser Algorithm from a scientific paper which buries the message under all sorts of methodological considerations and special case treatments and other bloat. This is sad because good algorithms are the crown jewels of programming. It also helped that he provided an implementation in Python and not in C or some sloppy computing-scientist-only pseudo code notation which changes from author to author.

We can motivate Chooser objects as follows.

Suppose you have a control flow statement defined in a function f. The path the flow control takes is determined by the value of some variable x:

def f(*args):
    x = g(*args)
    if x&gt;0:

When we want to test the if-statement alone we can ignore the value of x computed by g. A simple method to achieve this is to introduce a for-loop in the code which iterates over a range of values which represent jumps to the individual if-statement branches:

def f(*args):
    x = g(*args)
    for x in (1,-1):
        if x&gt;0:

However, this is a quite heavy change and we would likely not want to repeat this at another place. Instead of adding a for-loop we can introduce a non-deterministic choice over the values 1 and -1 and pull the iteration, represented by the loop, out of the function:

def test(chooser):
    def f(*args):
        x = g(*args)
        x = chooser.choose([1,-1])
        if x&gt;0:
    f(1,2,3)  # call f with appropriate arguments

Here we inserted a call to choose which represents a set of choices. No new control flow is introduced. The function f must be called as many times as there are choices passed to choose.

The repeated call of f is managed by a new function check which is part of the Chooser Algorithm. It actually calls the test function which has a uniform interface and keeps a single chooser parameter.

class ModelCheckEscape(Exception): pass
def check(func):
    stack = [[]]
    while stack:
        chosen = stack.pop()
            func(Chooser(chosen, stack))
        except ModelCheckEscape:

The check function creates a Chooser object and passes it to func which is represents the system under test. The Chooser constructor takes two arguments. One is a list called chosen popped from a stack of such lists, the other one is the stack itself which might be filled with new lists.

class ModelCheckEscape(Exception): pass
class Chooser(object):
    def __init__(self, chosen, stack):
        self._chosen = chosen
        self._stack  = stack
        self._it     = iter(chosen)
    def choose(self, choices):
            choice =
            if choice not in choices:
                raise Exception("Program is not deterministic")
            return choice
        except StopIteration:
            self._stack+=[self._chosen + [choice] for choice in choices]
            raise ModelCheckEscape()

This is the definition of the Chooser object. It is a tiny bit of elementary but ingenuous code. In order to understand what it does consider the following test function with its three calls of choose:

def test(chooser):
    x = chooser.choose([True, False])
    if x:
        y = chooser.choose(["a", "b"])
        z = chooser.choose(["y", "z"])

On each choose call a value is returned from the _it iterator. Those values must conform to the choices passed to choose for every call of choose. Otherwise a ChooserException is raised. So we expect _it to be an iterator wrapped around lists like [True, “a”], [True, “b”], [False, “y”], [False, “z”]. Those lists are associated with the choices being made at (x, y) or (x, z).

In fact we observe some more of those lists, starting with the empty list [] and the incompletely filled lists [True] and [False]. When _it is wrapped around an incomplete list one of the choose calls will raise a StopIteration exception at Assume for example that _it = iter([True]) then _it is already exhausted after x and choose and will raise StopIteration at the definition of y. At this point each of the choices at y i.e. “a” and “b” will produce a new list. Those lists are [True, “a”] and [True, “b”] which are now complete. New lists are pushed on the stack as long as incomplete lists are popped from the stack incheck().

As a special case we consider a simple linear sequence of choose calls

def test(chooser):
    x = chooser.choose([True, False])
    y = chooser.choose(["a", "b"])
    z = chooser.choose(["y", "z"])

The set of complete lists according to this sequence will be the Cartesian product of the choices: {True, False} x {“a”, “b”} x {“y”, “z”}. If you just want Cartesian products there are more efficient alternatives to create them though.

These are the Chooser basics. For Python you can download the used code here.

Choosers in C++

I gave a C++ and STL based implementation of Chooser objects. The Chooser C++ API closely follows the Python example. You can download the code from the linked document.

In its most general form the choose method has following signature:

    template &lt;typename Container&gt;
    typename Container::value_type choose(Container&amp; choices)

The return type is derived from the containers value_type attribute. Other than this the algorithm only relies on iterators which means that any STL container can be used. We can rewrite the simple test function above in C++:

void test(Chooser&amp; chooser) {
    int x = chooser.choose(2);
    if x {
        string s = "ab";
        char y = chooser.choose(s);
    else {
        string s = "yz";
        char z = chooser.choose(s);

This is not all that much overhead. In case of the x definition we use an overloaded version of choose which takes a single integer parameter k. This is equivalent to a choice of values within the range {0, 1, …, k-1}. The most relevant case may be choose(2) which is the boolean choice.

The string type is an STL container type as well. More precisely it is a typedef for basic_string<char>. We can create a string object with a string literal but we cannot pass a string literal directly to choose which expects an explicit reference to a container from which the return type is derived ( char in this case ).

ChooserMixin classes

Suppose we want to introduce Chooser objects into arbitrary methods of an existing class. The Chooser Algorithm is implemented s.t. a Chooser object is explicitly passed as a parameter but this would require changes in a methods interface, something we try to avoid.

Visibility of Chooser instances in the local scope of a method can also be achieved by making them global or member variables. An inexpensive method which is safer than using globals is to use a mixin class. The mixin class defines aChooser instance and if some class wants to use it, it derives from the mixin.

class ChooserMixin {
    Chooser chooser;
    void test() = 0;
    void check()
        this-&gt;chooser = Chooser(chosen, queue);

The test method is abstract. If f is the method we want to check, then the implementation of test would just invoke f with appropriate parameters:

void test() {
    f(arg1, arg2, ...);
It’s easy to change test without touching any other source code.

More advantages of ChooserMixins

When we use ChooserMixin we can define the choices C being used in chooser.choose(C) also as member variables. This makes choices configurable. A subclass of a ChooserMixin might read data from an external file or a database and populate the C container.

I wonder if it’s even possible to get rid of T x = chooser.choose(C)assignments in method source when using data binding techniques? In JavaFX we can restate the assigment in the form

var x = bind chooser.choose(C)

The bound variable x is updated whenever C is changed. Instead of creating a new instance of Chooser on each iteration, we replace the members defined in a single instance and trigger updates of C which in turn causes chooser.choose(C) to produce a new value. It remains to be examined if this idea is somehow practical.

Python – Hibernate – Jynx

Posted in Hibernate, Jynx, Jython on September 4th, 2009 by kay – 2 Comments

Jynx 0.4 goes Hibernate

In Jynx 0.4 JPA/Hibernate annotations are supported. Although this is still work in progress some of the more complex nested annotations were tested as well as Hibernate extension annotations which cannot be single-name imported along with the corresponding JPA annotations without conflicts.

Jynx 0.4 provides other new features as well. One can now use @signature decorators to express Java method overloading. A simple Java parser is integrated. A Java parser was necessary to improve the Java class detection heuristics used to determine required imports when a Java proxy is created from a Jython class and compiled dynamically. Finally there is a new @bean_property decorator which creates a private attribute foo along with public getters and setters given a bean_property decorated method def foo(_):_. Full documentation of Jynx as well as its changes can be found here.

Using Hibernate from Jython

Starting and closing sessions and managing simple transactions is not difficult in Hibernate. In Jynx two context managers for with-statements are defined which hide open+close and begin+commit/rollback boilerplate from the programmer. Code for Hibernate sessions and transactions lives then in with-statement blocks.

class hn_session(object):
    Context manager which opens/closes hibernate sessions.
    def __init__(self, *classes):
        sessionFactory = buildSessionFactory(*classes)
        self.session   = sessionFactory.openSession()
    def __enter__(self):
        return self.session
    def __exit__(self, *exc_info):
class hn_transact(object):
    Context manager which begins and performs commits/rollbacks hibernate transactions.
    def __init__(self, session):
        self.tx = session.beginTransaction()
    def __enter__(self):
        return self.tx
    def __exit__(self, type, value, traceback):
        if type is None:

A simple session using a single Entity Bean may then look like:

from __future__ import with_statement
from jynx.lib.hibernate import*
class Course(Serializable):
    @signature("public int _()")
    def getCourseId(self):
        return self.courseId
    @Column(name="COURSE_NAME", nullable = False, length=50)
    @signature("public String _()")
    def getCourseName(self):
        return self.courseName
    @signature("public void _(String)")
    def setCourseName(self, value):
        self.courseName = value
    @signature("public void _(int)")
    def setCourseId(self, value):
        self.courseId = value
with hn_session(Course) as session:
    course  = Course()
    with hn_transact(session):

Boilerplate Reduction

The standard class decorator for creating a Java class from a Jython class in Jynx is @JavaClass. In Jynx 0.4 some slightly extended decorators are introduced in particular @Entity and @Embeddable. Not only do they make Jython code more concise because one doesn’t have to stack @Entity and @JavaClass but translating with @Entity turns some automatically generated Java attributes into transient ones i.e. a @Transient annotation is applied which prevents those attributes to be mapped to table columns.

The massive boilerplate needed for defining a valid Entity Bean in the preceding example can be reduced using the @bean_property decorator:

class Course(Serializable):
    def courseId(self): pass
    @Column(name="COURSE_NAME", nullable = False, length=50)
    def courseName(self): pass

Applied to def courseId(self): pass the @bean_property decorator will cause the following Java code translation

    @Id @Column(name="COURSE_ID") private int courseId;
    int getCourseId() { return courseId; }
    int setCourseId(int value) { courseId = value; }

which specifies a complete Java Bean property.


In the following example two Entities are associated using a one-to-one mapping between primary keys.

class Heart(Serializable):
    def id(self):pass
class Body(Serializable):
    def id(self):pass
    @OneToOne(cascade = CascadeType.ALL)
    def heart(self):pass

Now we can check the behavior:

# session 1
with hn_session(Heart, Body) as session:
    body = Body()
    heart = Heart()
    body.heart = heart = 1 =
    with hn_transact(session):
# session 2
with hn_session(Heart, Body) as session:
    with hn_transact(session):
        b = session.get(Body, 1)
        assert b
        assert b.heart
        assert == 1


With Hibernate support in Jython we notice another clear departure from the CPython world and its web frameworks and components. Hibernate is distinctively Java and special techniques are needed to create compile time Java properties in a dynamic language. Jython has long been a second citizen in Python land. I suspect this is going to change with support of Java frameworks which alone have as many users/downloads as Python.