Algorithms

Kites on a zebra crossing: an algorithmic challenge

Posted in Algorithms on June 9th, 2012 by kay – 3 Comments

Forget Sudoku

A couple of moons ago I stumbled across an algorithmic puzzle which I want to share. It is not so simple that an efficient solution is easily seen, but also not so complicated that finding a solution in polynomial time seems beyond hope – at least not yet, or not for me.

So my effort in describing this problem can also be considered a crowdsourcing attempt at finding a solution, and I invite you to contribute. The problem is an idealization of a real algorithmic problem I was dealing with, but I stripped away some additional complications to make it work better as a puzzle.

Let’s go in medias res quickly.

The kite and zebra crossing

The basic setup is that of a directed acyclic graph which is layered or stratified as depicted in the diagram below.

The "kite on a zebra crossing" diagram

If we enumerate the stripes in ascending order, a vertex of stripe i can be connected with one or more vertices in stripe i+1. There is a single vertex in the bottom stripe and a single vertex in the top stripe. Each path from the bottom vertex to the top vertex has the same length, and each move upwards is as good as any other.

In order to create a challenge we complicate the situation. For a given positive number K we assign to each vertex V in the kite a possibly empty list of numbers with elements in {-K, -K+1, … , K-1, K} – {0}. So for example we set K = 2 and make the assignments

V[0].numlist = []
V[1].numlist = [-1, -2]
V[2].numlist = [-1, -1]
V[3].numlist = [-1]
V[4].numlist = [2]
V[5].numlist = [2]
V[6].numlist = [1]
V[7].numlist = [-2]
V[8].numlist = []
Number lists can be assigned to paths of the kite as well. Let Pt(i, j) be an arbitrary path from V[i] to V[j]. The number list of Pt(i, j) is the concatenation of the number lists of its vertices. Examples of such paths and their associated number lists are
Pt(V[0], V[2], V[4], V[7], V[8]).numlist = [-1, -1, 2, -2]
Pt(V[0], V[1], V[4], V[6], V[8]).numlist = [-1, -2, 2, 1]
On number lists we define a reduction rule which states:

Remove adjacent numbers x, y in a number list if x+y = 0

In the number list [-1, -2, 2, 1] we can first remove -2 and 2 and get [-1, 1] which can be further reduced to []. The number list [-1, -1, 2, -2] can be reduced to [-1, -1] but not any further.

We write

L1 => L2

if there is a sequence of reductions of the number list L1 yielding L2, and we say that a complete path Pt(V[0], V[N]) connecting the bottom with the top of the kite is valid iff Pt(V[0], V[N]).numlist => []. In other words, the number list of a valid path can be reduced to the empty list.
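The reduction is exactly free-group word reduction (x and -x cancel), which is confluent, so the order of reductions does not matter and a single left-to-right pass with a stack decides reducibility. A minimal sketch (the function name is mine):

def reduce_numlist(numlist):
    # push numbers; cancel whenever the new number annihilates the top of the stack
    stack = []
    for x in numlist:
        if stack and stack[-1] + x == 0:
            stack.pop()
        else:
            stack.append(x)
    return stack

A path Pt is then valid iff reduce_numlist(Pt.numlist) == [].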

Now the challenge

Assume that K is fixed and let the number N >= 2 of kite vertices be variable. So we consider the set of all kites with N vertices, and this for arbitrary N >= 2. Is there an algorithm which determines a valid path ( or detects its absence ) and which is polynomial in N?
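For orientation, the obvious baseline is exponential: walk the kite from bottom to top and carry the reduced number list of the path walked so far as part of the search state. A rough sketch, assuming the kite is given as an adjacency dict edges (vertex -> successors in the next stripe) and a dict numlist (vertex -> its number list); both names are mine and reduce_numlist is the helper from above:

def find_valid_path(edges, numlist, bottom, top):
    # exhaustive depth-first search over (vertex, reduced number list) states
    def dfs(v, stack, path):
        stack = reduce_numlist(stack + numlist[v])
        path = path + [v]
        if v == top:
            return path if not stack else None
        for w in edges[v]:
            found = dfs(w, stack, path)
            if found:
                return found
        return None
    return dfs(bottom, [], [])

Memoizing on (vertex, tuple(stack)) avoids revisiting identical states, but the number of distinct reduced lists can still grow exponentially with N – which is exactly why the polynomial question is interesting.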


Greedy grammars and Any

Posted in Grammars, Parsing, Python on May 20th, 2012 by kay – Be the first to comment

. in your regular expression

I only vaguely remember my first encounter with a parser generator, which must be dated back to the late 1990s. I guess it was Spark by John Aycock, an Earley parser. What puzzled me back then was the need to be explicit down to the character level. Regular expression engines, albeit cryptic, were a revelation because one could specify the structural information one needed and match the rest using a ‘.’, the wildcard pattern that matches any character.

I came back to Any in the Trail parser generator lately. I was motivated by writing a reversible C preprocessor. Unlike conventional C preprocessors, which are used in the compilation chain of C code, a reversible C preprocessor can be used to refactor C code while retaining the preprocessor directives and the macro calls. This is basically done by storing the #define directive along with the code to be substituted and the substitution. The substituted code and the substitution are exchanged after the refactoring step, such that it looks like no substitution happened at all.

A comprehensive C preprocessor grammar can be found on the following MSDN site. What is most interesting to us are the following two EBNF productions:

#define identifier [( identifier_opt, ..., identifier_opt )] token-string_opt
token-string :
    String of tokens

The String of tokens phrase is just Any+.

Bon appétit

Suppose one defines two macros

#define min(a,b) ((a)<(b)?(a):(b))

#define max(a,b) ((a)<(b)?(b):(a))

Obviously the defining string of the min macro can be recognized using token-string, but how can we prevent token-string from eating the max macro as well? Once in motion, token-string has a healthy appetite and will eat the rest. The solution to this problem in the case of regular expressions is to make Any non-greedy. The non-greediness can easily be expressed using the following requirement:

If S | Any is a pattern with S != Any and S can match a character, then S will be preferred over Any.

In the production rule

R: ... Any* S ...
we can be sure that if S matches in R then Any won’t be used to match – although it would match if we left it greedy. The same goes for
R: ... (Any* | S) ...
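The effect is easy to check with Python's re module: the greedy wildcard swallows the max macro as well, while the non-greedy one stops at the end of each definition (a small regex demo, not Trail code):

import re

text = '''#define min(a,b) ((a)<(b)?(a):(b))
#define max(a,b) ((a)<(b)?(b):(a))'''

# greedy '.' eats across both macro definitions: one single match
print re.findall(r"#define \w+\(a,b\) .*", text, re.DOTALL)
# non-greedy '.' stops at the first line end: one match per macro
print re.findall(r"#define \w+\(a,b\) .*?$", text, re.DOTALL | re.MULTILINE)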

Non-greediness in grammars

Grammars are more complicated than regular expressions and we have to take more care with our greediness rules. To illustrate some of the problems we take a look at an example

R: A B | C
A: a Any*
B: b
C: c
Any causes a follow/first conflict between A and B. Making Any non-greedy alone won’t help because a grammar rule or its corresponding NFA is always greedy! It follows a longest match policy and an NFA will be traversed as long as possible. So once the NFA of A is entered it won’t be left because of the trailing Any*.

Detecting the trailing Any in A is easy though. We solve the follow/first conflict with a trailing Any by embedding A into R. Embedding strategies are the centerpiece of Trail and they shall not be recapitulated here. Just so much: embedding A in R doesn’t destroy any information relevant for parsing. If A has been embedded Any* will be delimited by B to the right and we can safely apply R without the danger of Any consuming a token ‘b’.

Eventually we have to re-apply our embedding strategy: if A is a rule with a trailing Any and A is embedded in B and B has a trailing Any after this embedding  then B will be embedded wherever possible.

A short experiment

Here is a mini-Python grammar used to detect Python class definitions.

file_input: (NEWLINE | stmt)* ENDMARKER
classdef: 'class' NAME ['(' Any+ ')'] ':' suite
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
simple_stmt: Any* NEWLINE
stmt: simple_stmt | compound_stmt
compound_stmt: classdef | Any* suite

Unlike a real Python grammar it is fairly easy to build. All rules are taken from the Python 2.7 grammar, but only file_input, suite and stmt remained unchanged. In all other cases we have replaced terminal/non-terminal information that isn’t needed with Any.


The state of the Trail parser generator

Posted in Algorithms, Grammars, Langscape, Parsing, TBP, Trail on April 10th, 2012 by kay – Be the first to comment

White Easter

Snow came back, if only for a brief moment, to remind me laying Trail to rest until next winter…

 x + + + + + + + + +
 - - - - - - - - - x
 - - - - - - - - x +
 - - - - - - - x + +
 - - - - - - x + + +
 - - - - - x + + + +
 - - - - x + + + + +
 - - - x + + + + + +
 - - x + + + + + + +
 - x + + + + + + + + 

Flat Automatons

This story begins where the last blog article ended, in September 2011. At that time I realized, contrary to my expectations, that careful embedding of finite automatons into each other could yield higher execution speed for parsers than without those embeddings, such as LL(1) – and this despite a far more complex machinery and additional efforts to reconstruct a parse tree from state sequences generated while traversing the automaton. I didn’t believe that this speed advantage would survive the move towards an optimized, for-speed implementation – running my samples on PyPy confirmed this assumption – but it was an encouraging sign that the general direction was promising and more ambitious goals could be realized. I wasn’t entirely mistaken, but what came out in the end is also scary, if not monstrous. Anyway, let’s begin.

If the automaton embeddings I just mentioned were perfect, the grammar which is translated into a set of those automatons, rule by rule, would dissolve into a single finite automaton. It would then be completely flat and contain only terminal symbols which refer to each other. An intrinsic hurdle is recursion. A recursive rule like

X: X X | a
would give us an automaton, that has to be embedded into itself which is not finitely possible. Instead we could consider successive self-embedding as a generative process:
X X | a
X X X X | X X a | a X X | a X | X a | a a | a
...

Working embeddings

The difficulties of creating a flat, finite automaton from a grammar are twofold. Perfect embedding/inlining leads to information loss, and recursive rules lead to cyclic or infinite expansion. Of both problems I solved the first and easier one in the early versions of the Trail parser generator. This was done by preparing automaton states and introducing special ɛ-states. In automata theory an ɛ-state corresponds to an empty word, i.e. it won’t be used to recognize a character / token. It solely regulates transitions within an automaton. In BNF we may write:

X: X a
X: ɛ
which means that X can produce the empty word. Since ɛa = a the rule X accepts all finite sequences of a. In EBNF we can rewrite the X-rule as
X: [X] a
which summarizes the optionality of the inner X well. We write the automaton of the X-rule as a transition table
(X,0): (X,1) (a,2)
(X,1): (a,2)
(a,2): (FIN,-1)
Each state carries a unique index, which allows us to distinguish arbitrarily many different X and a states in the automaton. If we further want to embed X within another rule, say
Y: X b
which is defined by the table
(Y,0): (X,1)
(X,1): (b,2)
(b,2): (FIN,-1)
the single index is not sufficient and we need a second index which individualizes each state by taking a reference to the containing automaton:
(X,0,X): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): (FIN,-1,X)
With this notion we can embed X in Y:
(Y,0,Y): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): (FIN,-1,X)
(FIN,-1,X): (b,2,Y)
(b,2,Y): (FIN,-1,Y)
The initial state (X,0,X) of X was completely removed. The automaton is still erroneous though. The final state (FIN,-1,X) of X is not a final state in Y. It even has a transition! We could try to remove it completely and instead write
(Y,0,Y): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X):   (b,2,Y)
(b,2,Y): (FIN,-1,Y)
But suppose Y had the form:
Y: X X b
then the embedding of X would have the effect of removing the boundary between the two X, which would again be a loss of structure. What we do instead is to transform the final state (FIN, -1, X), when embedded in Y, into an ɛ-state in Y:
(FIN,-1,X) => (X, 3, TRAIL_DOT, Y)
The tuple which describes automaton states in Trail has grown again by one entry. A state which is not an ɛ-state has the shape (_, _, 0, _). Finally the fully and correctly embedded automaton X in Y looks like this:
(Y,0,0,Y): (X,1,0,X) (a,2,0,X)
(X,1,0,X): (a,2,0,X)
(a,2,0,X): (X,3,TRAIL_DOT,Y)
(X,3,TRAIL_DOT,Y): (b,2,0,Y)
(b,2,0,Y): (FIN,-1,0,Y)
    
The TRAIL_DOT marks the transition between X and Y in Y. In principle we are free to define infinitely many ɛ-states. In the end we will define exactly 5 types.
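For concreteness, the embedded automaton can be written down as a plain Python dict of transition lists; this is only a notational sketch, not Trail's internal data structure, and TRAIL_DOT is assumed to be some non-zero marker constant:

TRAIL_DOT = 1

X_in_Y = {
    ('Y', 0, 0, 'Y'):         [('X', 1, 0, 'X'), ('a', 2, 0, 'X')],
    ('X', 1, 0, 'X'):         [('a', 2, 0, 'X')],
    ('a', 2, 0, 'X'):         [('X', 3, TRAIL_DOT, 'Y')],
    ('X', 3, TRAIL_DOT, 'Y'): [('b', 2, 0, 'Y')],
    ('b', 2, 0, 'Y'):         [('FIN', -1, 0, 'Y')],
}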

Rules of Embedding

At this point one is allowed to ask if this is not entirely recreational. Why should anyone care about automaton embeddings? Don’t we have anything better to do with our time? Certainly not, but demanding a little more motivation is justified. Consider the following grammar:

R: A | B
A: a* c
B: a* d
In this grammar we encounter a so-called FIRST/FIRST conflict. Given a string “aaa…” we cannot decide which of the rules A or B we have to choose, unless we observe a ‘c’ or ‘d’ event, i.e. our string becomes “aa…ac” or “aa…ad”. What we basically want is to defer the choice of a rule, making a late choice instead of checking out rules by trial and error. By carefully storing and recalling intermediate results we can avoid the consequences of an initial bad choice, to an extent that parsing in O(n) time with string length n becomes possible. Now the same can be achieved through automaton embeddings, which gives us:
R: a* c | a* d
but in a revised form as seen in the previous section. On automaton level the information about the containing rules A and B is still present. If we use R for parsing we get state sets { (a,1,0,A), (a,2,0,B) } which recognize the same character “a”. Those state sets will be stored during parsing. In case of a “c”-event which will be recognized by the state (c, 3, 0, A) we only have to dig into the state-set sequence and follow the states (a, 1, 0, A) back to the first element of the sequence. Since (A, 4, TRAIL_DOT, R) is the only follow state of (c, 3, 0, A) we will actually see the sequence:
(A,4,TRAIL_DOT,R)
 \
  '...
      \
(c,3,0,A)
(a,2,0,A)
(a,2,0,A)
...
(a,2,0,A) 
    
From this sequence we can easily reconstruct contexts and build the tree
[R, [A, a, a, ..., c]]
All of this is realized by late choice. Until a “c” or “d” event we move within A and B at the same time. The embedding of A and B in R solves the FIRST/FIRST conflict above. This is the meaning.

FOLLOW/FIRST conflicts

So far the article didn’t contain anything new. I’ve written about all of this before.

FIRST/FIRST conflicts between the FIRST-sets of a top-down parser are not the only ones we have to deal with. We also need to take left recursion into account, which can be considered a special case of a FIRST/FIRST conflict, but also FIRST/FOLLOW – or better FOLLOW/FIRST – conflicts, which still have to be treated. A FOLLOW/FIRST conflict can be illustrated using the following grammar:

R: U*
U: A | B
A: a+ (B c)*
B: b
There is no FIRST/FIRST conflict between A and B and we can’t factor out a common prefix. But now suppose we want to parse the string “abb”. Obviously A recognizes the two initial characters “ab” and then fails at the 2nd “b” because “c” was expected. Now A can recognize “a” alone and then cancel the parse because (B c)* is an optional multiple of (B c). This is not a violation of the rules. After “a” has been recognized by A the rule B may take over and match “b” two times:
[R, [U, [A, a]], [U, [B, b]], [U, [B, b]]]
Trail applies a “longest match” recognition approach, which means here that A is greedy and matches as many characters in the string as possible. But according to the rule definition A can also terminate the parse after “a”, and at that point the parser dynamically sets a so-called checkpoint. Trail allows backtracking to this checkpoint, provided the longest match approach fails after this checkpoint. Setting exactly one checkpoint for a given rule is still compliant with the longest match approach. If the given input string is “abcbb” then A will match “abc”; if it is “abcbcbb” then it is “abcbc”, and so on.

The FOLLOW/FIRST conflict leads to a proper ambiguity, and checkpoints are currently the approach used by Trail to deal with it. I also tried to handle FOLLOW/FIRST conflicts in an automaton embedding style but encountered fragmentation effects. The ambiguities were uncovered, but the price was a loss of direction, and longest match was effectively disabled.

The inevitability of left recursions

In top-down parsing it is easy to recognize and remove or transform a left recursive rule like this one

X: X a | ɛ
The phenomenology seems straightforward. But making those exclusions is like drawing political boundaries in colonialist Africa. Desert, vegetation, animals and humans don’t entirely respect decisions made by once rivaling French and English occupants. If embedding comes into play we can count on uncovering left recursions where we didn’t expect them. I’d like to go even one step further, which is conjectural: we can’t even know for sure that none will be uncovered. The dynamics of FIRST/FIRST conflicts that are uncovered by embeddings – this clash dynamics, as I like to call it – might lead to algorithmically undecidable problems. It’s nothing I’ve systematically thought about, but I wouldn’t be too surprised.

For almost any left recursive rule there is a FIRST/FIRST conflict of this rule with itself. Exceptions are cases which are uninteresting such as

X: X
or
X: X a
In both cases the FIRST-sets of X don’t contain any terminal symbol and they can’t recognize anything. They are like ɛ-states but also non-terminals. Very confusing. Trail rips them off and issues a warning. An interesting rule like
E: E '*' E | NUMBER
contains a FIRST/FIRST conflict between E and NUMBER. It cannot be removed through self-embedding of E. The same goes for rules which hide a left recursion but lead to embedding cycles, such as
T: u v [T] u w
which are quite common. We could try to work around them, as we did with FOLLOW/FIRST conflicts, instead of downright solving them. Of course one can also give up top-down parsing in favor of bottom-up parsers of Earley type or GLR, but that’s entirely changing the game. The question is: must we tack backtracking solutions onto Trail which are more deeply involved than checkpoints?

After 6 months of tinkering I wish the answer were no. Actually I believe that it is unavoidable, but it occurs at places where I didn’t originally expect it, and even in those cases I often observed/measured parsing effort proportional to string length. Parse tree reconstruction from state-set traces, which was once straightforward, becomes a particularly hairy affair.

Teaser

Before I discuss left recursion problems in Trail in a follow up article I’ll present some results as a teaser.

Grammars for which parsing in Trail is O(n):

a) E: '(' E ')' | E '*' E | NUMBER
b) E: '(' E+ ')' | E '*' E | NUMBER
Other grammars in the same category are
c) G: 'u' 'v' [G] 'u' 'w'
d) G: G (G | 'c') | 'c'
e) G: G G | 'a'
However for the following grammars the parser is in O(2^n)
f) G: G G (G | 'h') | 'h'
g) G: G [G G] (G | 'h') | 'h'
If we combine  d) and f) we get
h) G: G G (G | 'h') | G (G | 'h') | 'h'
In this case Trail will deny service and throw a ParserConstructionError exception. Some pretty polynomial grammars will be lost.


LL(*) faster than LL(1)?

Posted in Algorithms, Grammars, Parsing, Python, TBP on September 12th, 2011 by kay – 7 Comments

No jumps

When I began to work on EasyExtend in 2006 I grabbed a Python parser from the web, written by Jonathan Riehl ( it doesn’t seem to be available anymore ). It was a pure Python parser following closely the example of CPython’s pgen. The implementation was very dense and the generated parser probably as fast as a Python parser could be. It was restricted to LL(1) though which was a severe limitation when I stepped deeper into the framework.

In mid 2007 I created a parse tree checker. The problem was that a parse tree could be the return value of a transformation of another parse tree: T : P -> P. How do we know that P is still compliant with a given syntax? This can easily be solved by chasing NFAs of the target grammar, both horizontally, i.e. within an NFA, as well as vertically: calling checkers recursively for each parse tree node which belongs to a non-terminal. This checker generator was only a tiny step away from a parser generator, which I started to work on in summer 2007.

What I initially found when I worked on the parse tree checker was that horizontal NFA chasing never has to take into account that there are two alternative branches in rules like this

R: a* b | a* c
The algorithm never checks out the first branch, running through a sequence of ‘a’ until it hits ‘b’ and, when this fails, jumping back to check out the other branch. There was no backtracking involved, also no backtracking with memoization. There was simply never any jump. Instead both branches are traversed simultaneously until they become distinct. It’s easy to express this on the grammar level by applying left factoring to the rule
R: a* ( b | c )
However there was never any rule transformation to simplify the problem.

From rules to grammars

It’s actually an old approach to regular expression matching which is attributed to Ken Thompson. Russ Cox refreshed the collective memory about it a few years ago. This approach never seemed to make the transition from regular expressions to context free grammars – or it did and was given up again, I don’t know. I wanted a parser generator based on the algorithms I worked out for parse tree checkers. So I had to invent a conflict resolution strategy which is specific for CFGs. Take the following grammar

R: A | B
A: a* c
B: a* d
Again we have two branches, marked by the names of the non-terminals A and B and we want to decide late which one to choose.

First we turn the grammar into a regular expression:

R: a* c | a* d
but now we have lost context/structural information which needs to be somehow added:
R: a* c <A> | a* d <B>
The symbols <A> and <B> do not match a character or token. They merely represent the rules which would have been used when the matching algorithm scans beyond ‘c’ or ‘d’. So once the scanner enters <A> it is finally decided that rule A was used. The same is true for <B>. Our example grammar is LL(*) and in order to figure out whether A or B is used we need, in principle at least, infinite lookahead. This hasn’t been changed through rule embedding, but now we can deal with the LL(*) grammar as if it was an LL(1) grammar + a small context marker.

Reconstruction

What is lacking in the above representation is information about the precise scope of A and B once they are embedded into R. We rewrite the grammar slightly by indexing each of the symbols on the RHS of a rule by the name of the rule:

R: A[R] | B[R]
A: a[A]* c[A]
B: a[B]* d[B]
Now we can embed A and B into R while being able to preserve the context:
R: a[A]* c[A] <A[R]> | a[B]* d[B] <B[R]>
Matching now the string aad yields the following sequence of sets of matching symbols:
{a[A], a[B]}, {a[A], a[B]}, {d[B]}, {<B[R]>}
All of the indexed symbols in a set match the same symbol. The used index has no impact on the matching behavior, so a[X], a[Y], … will always match a.

Constructing a parse tree from the above set-sequence is done by reading the sequence from right to left and interpreting it appropriately.

We start the interpretation by translating the rightmost symbol

<B[R]> -> [R,[B, .]]
The  dot ‘.’ is a placeholder for a sequence of symbols indexed with B. It remains adjacent to B and is removed when the construction is completed:
[R, [B, .]]
[R, [B, ., d]]
[R, [B, ., a, d]]
[R, [B, ., a, a, d]]
[R, [B, a, a, d]]
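The right-to-left interpretation can be mimicked with a toy function; this is only an illustration of the bookkeeping, not Trail's reconstruction code:

def build_tree(trace):
    # trace is read from right to left: first the marker <B[R]> as ('B', 'R'),
    # then the indexed terminals d[B], a[B], a[B] as (symbol, rule) pairs
    inner, outer = trace[0]
    node = [inner]                  # [B, .] -- position 1 is the fill point
    for sym, rule in trace[1:]:
        node.insert(1, sym)         # fill in symbols from right to left
    return [outer, node]

print build_tree([('B', 'R'), ('d', 'B'), ('a', 'B'), ('a', 'B')])
# -> ['R', ['B', 'a', 'a', 'd']]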

Drawbacks

We can read the embedding process as ’embed rules A and B into R’ or dually ‘expand R using rules A and B’.  I’ve chosen the latter expression for the Trail parser generator because an expanded rule R has its own characteristics and is distinguished from an unexpanded rule.

The drawback of this method is that its implementation turns out to be rather complicated. It is also limited because it may run into cyclic embeddings, which need to be detected. Successive embeddings can also blow up the expanded rule to an extent that it makes sense to artificially terminate the process and fall back to a more general and less efficient solution. So we have to mess with it. Finally, isn’t there a performance penalty due to the process of reconstruction?

Performance

To my surprise I found that an LL(*) grammar that uses expansion quite heavily ( expanded NFAs are created with about 1000 states ) performs slightly better than a simple LL(1) grammar without any expansion in CPython. For comparison I used a conservative extension language P4D of Python i.e. a superset of Python: every string accepted by Python shall also be accepted by P4D.

In order to measure performance I created the following simple script

import time
import decimal
import langscape
 
text = open(decimal.__file__.replace(".pyc", ".py")).read()
print "textlen", len(text)
 
python = langscape.load_langlet("python")
p4d = langscape.load_langlet("p4d")
 
def test(langlet):
    tokens = langlet.tokenize(text)
    a = time.time()
    for i in range(10):
        langlet.parse(tokens)
        tokens.reset()
    print "parser", langlet.config.langlet_name, (time.time() - a)/10
 
test(python)
test(p4d)

It imports a reasonably big Python module ( decimal.py ) and parses it with two different parsers generated by Trail. Running it using CPython 2.7 yields the following result:

parser python 2.39329998493
parser p4d 2.25759999752
This shows that P4D is about 5% faster on average! Of course the overall performance is abysmal, but keep in mind that the parser is a pure Python prototype implementation and I’m mainly interested in qualitative results and algorithms at this point.

I’ve also checked out the script with PyPy, both with activated and deactivated JIT.

PyPy with option --jit off:

parser python 6.5631000042
parser p4d 5.66440000534
Now the LL(*) parser of P4D is about 13-14% faster than the LL(1) parser, which is much clearer. Activating the JIT reverses the pattern though, and intense caching of function calls will pay off:

PyPy with JIT:

parser python 0.791500020027
parser p4d 1.06089999676
Here the Python parser is about 1/3 faster than the P4D parser.

The result of the competition depends on the particular implementation and the compiler/runtime optimizations or the lack thereof. The counter-intuitive result that an LL(*) parser is faster than an LL(1) parser could not be stabilized but also not clearly refuted. It’s still an interesting hypothesis though and rule expansion may turn out to be a valid optimization technique – also for LL(1) parsers which do not require it as a conflict resolution strategy. I will examine this in greater detail once I’ve implemented an ANSI C version of Trail.

Maptrackers

Posted in Algorithms on April 12th, 2011 by kay – 2 Comments

From graphs to maps

A Maptracker is a special backtracking algorithm used to check the equivalence of certain maps which can be represented as connected, directed graphs or finite state machines. It shall be described in this article.

The original motivation was to find an algorithm for reconstruction of grammars from finite state-machines with the following property: suppose you have a state-machine M0 and a function P which turns M0 into a grammar rule: G = P(M0). When we translate G back again into a state-machine we get M1 = T(P(M0)). Generally T o P != Id and M0 != M1. But how different are M0 and M1 actually?

Consider the two graphs GR1 and GR2 above. When we abstract from their particular drawings and focus on the nodes and edges only, we can describe them using the following dictionaries:

GR1 = {0: set([1, 3, 4, -1, 7]),
 1: set([2]),
 2: set([1, 3, 4, -1, 7]),
 3: set([-1]),
 4: set([6, 4, 5, -1, 7]),
 5: set([5, 6]),
 6: set([4, -1, 7]),
 7: set([-1])}
GR2 = {0: set([1, 3, 6, -1, 7]),
 1: set([2]),
 2: set([1, 3, 6, -1, 7]),
 3: set([6, 3, 4, 5, -1]),
 4: set([4, 5]),
 5: set([3, 6, -1]),
 6: set([-1]),
 7: set([-1])}

A pair i: [j1, j2, … jn] describes the set of edges i -> j1,  i -> j2, …, i -> jn.

Checking for equivalence

We say that GR1 and GR2 are equivalent if there is a permutation P of {-1, 0, 1, …, 7} such that

GR2 == dict( (P(key), set(map(P, value))) for (key, value) in GR1.items() )

Maptracker is merely a cute name for an algorithm which constructs the permutation P from map representations of the kind GR1 and GR2. P itself will be described as a dictionary. Since the value -1 is a fixed point it will be omitted:

class Maptracker(object):
    def __init__(self, gr1, gr2):
        self.gr1 = gr1
        self.gr2 = gr2
 
    def accept(self, value, stack):
        e1, e2 = value  # e1 -> e2
        V1 = self.gr1[e1]
        V2 = self.gr2[e2]
        #
        # e1 -> e2 => v1 -> v2
        #
        # check consistency of the choice of the mapping
        if len(V1)!=len(V2):
            return False
        m = dict(p for (p,q) in stack)
        if e2 in m.values():
            return False
        for v1 in V1:
            if v1 == e1:
                if e2 not in V2:
                    return False
            if v1 in m:
                if m[v1] not in V2:
                    return False
        for s in m:
            if e1 in self.gr1[s]:
                if e2 not in self.gr2[m[s]]:
                    return False
        return True
 
    def run(self):
        stack = []
        if len(self.gr1) != len(self.gr2):
            return {}
        sig1 = sorted(len(v) for v in self.gr1.values())
        sig2 = sorted(len(v) for v in self.gr2.values())
        if sig1!=sig2:
            return {}
 
        L1 = self.gr1.keys()
        L2 = self.gr2.keys()
        i = j = 0
        while i<len(L1):
            e1 = L1[i]
            while j<len(L2):
                e2 = L2[j]
                if self.accept((e1,e2),stack):
                    stack.append(((e1,e2),(i,j)))
                    j = 0
                    break
                j+=1
            else:
                if stack:
                    _, (i,j) = stack.pop()
                    if j == -1:
                        return {}
                    j+=1
                    continue
                else:
                    return {}
            i+=1
        return dict(elem[0] for elem in stack)

If no permutation could be constructed an empty dictionary {} is returned.

Let’s look at the dict which is computed by the Maptracker for GR1 and GR2:

>>> M = Maptracker(GR1, GR2).run()
>>> M
{0: 0, 1: 1, 2: 2, 3: 7, 4: 3, 5: 4, 6: 5, 7: 6}

We can check the correctness of M manually or by setting

P = lambda k: (-1 if k == -1 else M[k] )

and checking the equality we have defined above.
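Spelled out as code, that check reads (using GR1, GR2 and Maptracker from above):

M = Maptracker(GR1, GR2).run()
P = lambda k: -1 if k == -1 else M[k]
# the permuted GR1 must coincide with GR2
assert GR2 == dict((P(key), set(map(P, value))) for (key, value) in GR1.items())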

Patching tracebacks

Posted in Algorithms, Langscape, Python on April 11th, 2011 by kay – Be the first to comment

One of the problems I ran into early when working on EasyExtend ( and later on Langscape ) was getting error messages from code execution which were not corrupt.

The situation is easily explained: you have a program P written in some language L. There is a source-to-source translation of P into another program Q of a target language, preferably Python. So you write P but the code which is executed is Q. When Q fails Python produces a traceback. A traceback is a stack of execution frames – a snapshot of the current computation – and all we need to know here is that each frame holds data about the file, the function, and the line which is executed. This is all you need to generate a stacktrace message such as:

Traceback (most recent call last):
  File "tests\pythonexpr.py", line 12, in
    bar()
  File "tests\pythonexpr.py", line 10, in bar
    foo()
  File "tests\pythonexpr.py", line 5, in foo
    b.p,
NameError: global name 'b' is not defined

The problem here is that line number information in the frames is from Q whereas the file and the lines being displayed are from P – the only file there is!

Hacking line numbers into parse trees

My first attempt to fix the problem of wrong line information ( I worked with Python 2.4 at that time and I am unaware of changes in later versions of Python ) was to manipulate Q, or rather the parse tree corresponding to Q, which got updated with the line numbers I expected. When the growth of line numbers in Q was non-monotonic, using CPython's internal line number table, lnotab, failed to assign line numbers correctly. Furthermore the CPython compiler has the habit of ignoring some line information and reconstructing it, so you cannot be sure that your own won't be overwritten. There seems to be a hacking prevention built into the compiler, and I gave up on that problem for a couple of years.

From token streams to string pattern

Recently I started to try out another idea. For code which is not optimized or obfuscated and preserves name, number and string information in a quantitative way ( turning some statements of P into expressions in Q or vice versa, breaking P tokens into token sequences in Q etc. ) we can check out the following construction. Let

T(V, Q) = {T in TS_Q | V = T.Value}
be the set of tokens in the token stream TS_Q of Q with the prescribed token value V. Analogously we can build a set T(V, P) for P and TS_P.

The basic idea is now to construct a mapping between T(V,Q)  and  T(V,P). In order to get the value V we examine the byte code of a traceback frame up to the last executed instruction f_lasti. We assume that executing f_lasti leads to the error. Now the instruction may not be coupled to a particular name, so we examine f_lasti or the last instruction preceding f_lasti for which the instruction type is in the set

{LOAD_CONST, LOAD_FAST, LOAD_NAME, LOAD_ATTR, LOAD_GLOBAL, IMPORT_NAME }
From the value related to one of those instructions, the type of the value, which is one of {NAME, STRING, NUMBER}, and the execution line f_lineno we create a new token T_q = [tokentype, value, lineno]. For V we set V = value. Actually things are a little more complicated because the dedicated token T_q and the line in Q are not necessarily in a 1-1 relationship. So there might in fact be n>1 tokens equal to T_q which originate from different lines in P. So let Token be the list of all tokens we found on line T_q.Line and k = Token.count(T_q). We add k to the data characterizing T_q.
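As a rough sketch of this step, the last interesting LOAD instruction before f_lasti can be recovered by walking the frame's bytecode; this assumes the CPython 2.x bytecode layout, the helper name is mine, and EXTENDED_ARG is ignored for brevity:

import dis

LOAD_OPS = set(dis.opmap[name] for name in
               ("LOAD_CONST", "LOAD_FAST", "LOAD_NAME",
                "LOAD_ATTR", "LOAD_GLOBAL", "IMPORT_NAME"))

def last_load_value(frame):
    # walk the bytecode up to f_lasti and remember the value/name referenced
    # by the most recent LOAD_* / IMPORT_NAME instruction
    code = frame.f_code
    co = code.co_code
    i, found = 0, None
    while i <= frame.f_lasti and i < len(co):
        op = ord(co[i])
        if op >= dis.HAVE_ARGUMENT:
            arg = ord(co[i+1]) + (ord(co[i+2]) << 8)
            if op in LOAD_OPS:
                if op == dis.opmap["LOAD_CONST"]:
                    found = code.co_consts[arg]
                elif op == dis.opmap["LOAD_FAST"]:
                    found = code.co_varnames[arg]
                else:
                    found = code.co_names[arg]
            i += 3
        else:
            i += 1
    return found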

So assume having found T_q. The map we want to build is T(T_q.Value, Q) -> T(T_q.Value, P). How can we do that?

In the first step we assign a character to each token in the stream TS_Q and turn TS_Q into a string S_TS_Q. The character is arbitrary, and the relationship between the character and the token value shall be 1-1. Among the mappings is T_q -> c. For each T in T(T_q.Value, Q) we then determine a substring of S_TS_Q with c as its midpoint:

    class Pattern:
        def __init__(self, index, token, pattern):
            self.index = index
            self.token = token
            self.pattern = pattern
 
    S_TS_Q = u''
    m = 0x30
    k = 5
    tcmap = {}    
 
    # create a string from the token stream
    for T in TS_Q:
        if T.Value in tcmap:
            S_TS_Q+=tcmap[T.Value]
        else:
            s = unichr(m)
            tcmap[T.Value] = s
            S_TS_Q+=s
            m+=1
 
    # create string pattern
    pattern = []
    for T in TVQ:
        n = TS_Q.index(T)
        S = S_TS_Q[max(0, n-k): n+k]
        pattern.append(Pattern(n, T, S))

The same construction is used for the creation of target patterns from T(V, P). In that case we use the tcmap dictionary built during the creation of the patterns from T(V, Q): when two tokens in TS_Q and TS_P have the same token value, the corresponding characters shall coincide.

The token mapping matrix

In the next step we create a distance matrix between source and target string pattern.

    n = len(source_pattern)
    m = len(target_pattern)
    Rows = range(n)
    Cols = range(m)
    M = []
    for i in range(n):
        M.append([-1]*m)
    for i, SP in enumerate(source_pattern):
        for j, TP in enumerate(target_pattern):
            M[i][j] = levenshtein(SP.pattern, TP.pattern)

As a distance metric we use the edit or Levenshtein distance.

Having that matrix we compute an index pair (I,J) with M[I][J] = min{ M[i][j] | i in Rows and j in Cols }. Our interpretation of (I,J) is that we map source_pattern[I].token onto target_pattern[J].token. Since there is an I for which source_pattern[I].token == T_q, the corresponding T_p = target_pattern[J].token is exactly the token in P we searched for.

The line in the current traceback is the line T_q.Line of Q. Now we have found T_p.Line of P, which is the corrected line that shall be displayed in the patched traceback. Let's take a brief look at the index selection algorithm for which M was prepared:

while True:
    k, I = 1000, -1
    if n>m and len(Cols) == 1:
        return target_pattern[Cols[0]].token[2]
    else:
        for r in Rows:
            d = min(M[r])
            if d<k:
                k = d
                I = r
        J = M[I].index(k)
        for row in M:
            row[J] = 100
    SP = source_pattern[I]
    if SP.token == T_q:
        tok = target_pattern[J].token
        return tok.Line
    else:
        Rows.remove(I)
        Cols.remove(J)

If there is only one column left i.e. one token in T(V, P) its line will be chosen. If the J-column was selected we avoid re-selection by setting row[J] = 100 on each row. In fact it would suffice to consider only the rows left in Rows.

Example

One example I modified over and over again for testing purposes was the following one:

pythonexpr.py [P]
-----------------
def foo():
    a = ("c",
        0,
        (lambda x: 0+(lambda y: y+0)(2))(1),
        b.p,
        0,
        1/0,
        b.p)
 
def bar():
    foo()
 
bar()

You can comment out b.p, turn a + in the lambda expression into / to provoke another ZeroDivision exception, etc. This is interesting because when parsed and transformed through Langscape and then unparsed I get

[Q]
---
import langscape; __langlet__ = langscape.load_langlet("python")
def foo():
    a = ("c", 0, (lambda x: 0+(lambda y: y+0)(2))(1), b.p, 0, 1/0, b.p)
def bar():
    foo()
bar()

So it is this code which is executed using the command line

python run_python.py pythonexpr.py
which runs it in the Python langlet through run_python.py. The execution process sees pythonexpr.py, but the code which is compiled by Python will be Q.

See the mess that happens when the traceback is not patched:

Traceback (most recent call last):
  File "run_python.py", line 9, in <module>
    langlet_obj.run_module(module)
  ...
  File "langscape\langlets\python\tests\pythonexpr.py", line 6, in <module>
    0,
  File "langscape\langlets\python\tests\pythonexpr.py", line 5, in bar
    b.p,
  File "langscape\langlets\python\tests\pythonexpr.py", line 3, in foo
    0,
NameError: global name 'b' is not defined

There is even a strange coincidence because bar() is executed on line 5 in the transformed program and b.p is on line 5 in the original program but all the other line information is complete garbage. When we plug in, via sys.excepthook,  the traceback patching mechanism whose major algorithm we’ve developed above we get

Traceback (most recent call last):
  File "run_python.py", line 9, in <module>
    langlet_obj.run_module(module)
  ...
  File "langscape\langlets\python\tests\pythonexpr.py", line 13, in <module>
    bar(),
  File "langscape\langlets\python\tests\pythonexpr.py", line 11, in bar
    foo(),
  File "langscape\langlets\python\tests\pythonexpr.py", line 5, in foo
    b.p,
NameError: global name 'b' is not defined

which is exactly right!

Conclusion

The algorithm described in this article is merely a heuristic and it won't work accurately in all cases. In fact it is impossible to even define conclusively what those cases are, because source-to-source transformations can be arbitrary. It is a bit like a first-order approximation of a code transformation, relying on the idea that the code won't change too much.

An implementation note: I was annoyed by bad tracebacks when testing the current Langscape code base for a first proper 0.1 release. I don't think it is too far away because I have some time now to work on it. It will still be under-tested when it's released, and the documentation even more fragmentary. However at some point everybody must jump, no matter which methodology is used.

Fuzzy string matching II – matching wordlists

Posted in Algorithms, Python on January 7th, 2011 by kay – 3 Comments

Small misspellings

An anonymous programming reddit commenter wrote about my fuzzy string matching article:

A maximum edit distance of 2 or 3 is reasonable for most applications of edit distance. For example, cat and dog are only 3 edits away from each other and look nothing alike. Likewise, the original “damn cool algorithm” matches against sets of strings at the same time, where as the algorithms in the article all only compare two strings against each other.

This is a valid objection.

However, for the most common case, which is an edit distance of 1, you don't need a Levenshtein automaton either. Here is the recipe:

Let a wordlist and an alphabet be given. An alphabet can be for example the attribute string.letters of the string module. For a string S all string variants of S with an edit distance <=1 over the alphabet can be computed as follows:

def string_variants(S, alphabet):
    variants = set()
    for i in range(len(S)+1):
        variants.add(S[:i]+S[i+1:])       # delete char at i
        for c in alphabet:
            variants.add(S[:i]+c+S[i:])   # insert char at i
            variants.add(S[:i]+c+S[i+1:]) # subst char at i
    return variants

The set of words that shall be matched is given by:

set(load(wordlist)) & string_variants(S, alphabet)

The used alphabet can be directly extracted from the wordlist in preparation of the algorithm. So we do not run into trouble when non-ASCII characters come up.

When you want to build string variants of edit distance = 2, just take the result of string_variants and apply string_variants on it again.
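In code the iteration might look like this (the helper name is mine); note that string_variants(S, alphabet) contains S itself, so applying it k times yields all variants within edit distance <= k:

def string_variants_k(S, alphabet, k):
    # iterate string_variants; since S is among its own variants the result
    # covers every edit distance from 0 up to k
    variants = set([S])
    for _ in range(k):
        variants = set(v for s in variants for v in string_variants(s, alphabet))
    return variants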

The complexity is

O((n*len(alphabet))^k)

where n is the string length and k the edit distance.

Alternative Approaches

For k = 1 we are essentially done with the simple algorithm above. For k = 2 and small strings the results are still very good using an iterative application of string_variants to determine for a given S all strings with edit distance <= 2 over an alphabet. So the most simple approaches probably serve you well in practice!

For k>2 and big alphabets we create word lists which are as large as or larger than the wordlist we check against. The effort soon runs out of control. In the rest of the article we want to treat an approach which is fully general and doesn't make specific assertions. It is overall not as efficient as more specialized solutions can be, but it might be more interesting for sophisticated problems I can't even imagine.

The basic idea is to organize our wordlist into an n-ary tree, the so-called PrefixTree, and implement an algorithm which is a variant of fuzzy_compare to match a string against this tree with a prescribed maximum edit distance of k for the words we extract from the tree during the match.

Prefix Trees

For a set of words we can factor common prefixes. For example {aa, ab, ca} can be rewritten as {a[a,b], c[a]}. Systematic factoring yields an n-ary tree – we call it a PrefixTree. Leaf nodes and words do not correspond in a 1-1 relationship though, because a word can be a proper prefix of another word. This means that we have to tag PrefixTree nodes with an additional boolean is_word field.

class PrefixTree(object):
    def __init__(self, char = '', parent = None):
        self.char     = char
        self.parent   = parent
        self.children = {}
        self.is_word  = False
 
    def _tolist(self):
        if self.is_word:
            yield self.trace()
        for p in self.children.values():
            for s in p._tolist():
                yield s
 
    def __iter__(self):
        return self._tolist()
 
    def insert(self, value):
        if value:
            c = value[0]
            tree = self.children.get(c)
            if tree is None:
                tree = PrefixTree(c, self)
                self.children[c] = tree
            tree.insert(value[1:])
        else:
            self.is_word = True
 
    def __contains__(self, value):
        if value:
            c = value[0]
            if c in self.children:
                return value[1:] in self.children[c]
            return False
        return True
 
    def __len__(self):
        if self.parent is not None:
            return len(self.parent)+1
        return 0
 
    def trace(self):
        if self.parent is not None:
            return self.parent.trace()+self.char
        return self.char

Reading a wordlist into a PrefixTree can be simply done like this:

pt = PrefixTree()
for word in wordlist:
    pt.insert(word)

Before we criticise and modify the PrefixTree let us take a look at the matching algorithm.

Matching the PrefixTree

The algorithm is inspired by our fuzzy_compare algorithm. It uses the same recursive structure and memoization as fuzzy_compare.

def update_visited(ptree, visited):
    visited[ptree][-1] = 0
    T = ptree.parent
    while T is not None and T.char!='':
        if len(T.children) == 1:
            visited[T][-1] = 0
            T = T.parent
        else:
            return
 
def is_visited(i, T, k, visited):
    d = visited.get(T, {})
    if -1 in d:
        return True
    m = d.get(i,-1)
    if k>m:
        d[i] = k
        visited[T] = d
        return False
    return True
 
def fuzzy_match(S, ptree, k, i=0, visited = None, N = 0):
    '''
    Computes all strings T contained in ptree with a distance dist(T, S)<=k.
    '''
    trees = set()
    # handles root node of a PrefixTree
    if ptree.char == '' and ptree.children:
        N = len(S)
        S+='\0'*(k+1)
        visited = {}
        for pt in ptree.children.values():
            trees.update(fuzzy_match(S, pt, k, i, visited, N))
        return trees
 
    # already tried
    if is_visited(i, ptree, k, visited):
        return []
 
    # can't match ...
    if k == -1 or (k == 0 and S[i] != ptree.char):
        return []
 
    if ptree.is_word and (N-i<=k or (N-i<=k+1 and ptree.char == S[i])):
        trees.add(ptree.trace())
        if not ptree.children:
            update_visited(ptree, visited)
            return trees
 
    if ptree.char!=S[i]:
        trees.update(fuzzy_match(S, ptree, k-1, i+1, visited, N))
 
    for c in ptree.children:
        if ptree.char == S[i]:
            trees.update(fuzzy_match(S, ptree.children[c], k, i+1, visited, N))
        else:
            trees.update(fuzzy_match(S, ptree.children[c], k-1, i+1, visited, N))
        trees.update(fuzzy_match(S, ptree.children[c], k-1, i, visited, N))
    return trees

Lazy PrefixTree construction

The major disadvantage of the construction is the time it takes upfront to create the PrefixTree. I checked it out for a wordlist of 158,989 entries and it took about 10 sec. With Psyco activated it still takes 7.5 sec.

A few trivia for the curious. I reimplemented PrefixTree in VC++ using STL hash_map and got a worse result: 14 sec of execution time – about twice as much as Python + Psyco. The language designed with uncompromised performance characteristics in mind doesn't cease to surprise me. Of course I feel bad because I haven't built a specialized memory management for this function and so on. Java behaves better with 1.2 sec on average.

A possible solution for Python ( and C++ 😉 ) but also for Java, when wordlists grow ever bigger, is to create the PrefixTree only partially and let it grow when needed. So the load time gets balanced over several queries and a performance hit can be avoided.

Here is the modified code:

class PrefixTree(object):
    def __init__(self, char = '', parent = None):
        self.char      = char
        self.parent    = parent
        self.is_word   = False
        self._children = {}
        self._words    = set()
 
    def _get_children(self):
        if self._words:
            self._create_children()
        return self._children
 
    children = property(_get_children)
 
    def _create_children(self):
        for tree, word in self._words:
            tree.insert(word)
        self._words = set()
 
    def _tolist(self):
        if self.is_word:
            yield self.trace()
        for p in self.children.values():
            for s in p._tolist():
                yield s
 
    def __iter__(self):
        return self._tolist()
 
    def insert(self, value):
        if value:
            c = value[0]
            tree = self._children.get(c)
            if tree is None:
                tree = PrefixTree(c, self)
                self._children[c] = tree
            if len(value) == 1:
                tree.is_word = True
            tree._words.add((tree,value[1:]))
        else:
            self.is_word = True
 
    def __contains__(self, value):
        if value:
            # deferred suffixes are stored as (node, suffix) pairs
            if (self, value) in self._words:
                return True
            c = value[0]
            if c in self._children:
                return value[1:] in self._children[c]
            return False
        return True
 
    def __len__(self):
        if self.parent is not None:
            return len(self.parent)+1
        return 0
 
    def trace(self):
        if self.parent is not None:
            return self.parent.trace()+self.char
        return self.char

Some numbers

The numbers presented here should be taken with a grain of salt and not confused with a benchmark but still provide a quantitative profile which allows drawing conclusions and making decisions.

Load time of wordlist of size 158,989 into pt = PrefixTree(): 0.61 sec
Execution time of fuzzy_match("haus", pt, 1) - 1st run: 1.03 sec
Execution time of fuzzy_match("haus", pt, 1) - 2nd run: 0.03 sec
Execution time of fuzzy_match("haus", pt, 2) - 1st run: 1.95 sec
Execution time of fuzzy_match("haus", pt, 2) - 2nd run: 0.17 sec
Execution time of fuzzy_match("haus", pt, 3) - 1st run: 3.58 sec
Execution time of fuzzy_match("haus", pt, 3) - 2nd run: 0.87 sec
We see that the second run is always significantly faster because in the first run the PrefixTree gets partially built while in the second run the built nodes are just visited.

Finally here are the numbers using string variants:

Execution time of string_variants("haus", string.letters): 0.0 sec
Execution time of 2-iterated of string_variants("haus", string.letters): 0.28 sec
Execution time of 3-iterated of string_variants("haus", string.letters): 188.90 sec
The 0.0 seconds result simply means that for a single run it is below a measurable threshold. The other results can possibly be improved by a factor of 2 using a less naive strategy to create string variants which avoids duplicates. The bottom line is that for k = 1 and k = 2 PrefixTrees, Levenshtein automata and other sophisticated algorithms aren't necessary, and for k >= 3 PrefixTree based approaches don't run amok.

Code

The code for fuzzy_compare and fuzzy_match can be downloaded here. It also contains tests, some timing measurements and a German sample wordlist.

Fuzzy string matching

Posted in Algorithms, Python on December 21st, 2010 by kay – 5 Comments

A damn hot algorithm

I found the following article written by Nick Johnson about the use of finite state machines for approximate string matching, i.e. string matches which are not exact but bounded by a given edit distance. The algorithm is based on so-called “Levenshtein automata”. Those automata are inspired by the construction of the Levenshtein matrix used for edit distance computations. For more information start reading the WP article about the Levenshtein algorithm, which provides sufficient background for Nick's article.

I downloaded the code from github and checked it out, but was very stunned by the time it took for the automaton construction once strings grow big. It took almost 6 minutes on my 1.5 GHz notebook to construct the following Levenshtein automaton:

k = 6
S = "".join(str(s) for s in range(10)*k)
lev = levenshtein_automata(S, k).to_dfa()

The algorithm is advertised as a “damn cool algorithm” by the author but since the major effect on my notebook was producing heat I wonder if “cool” shouldn’t be replaced by “hot”?

In the following article I’m constructing an approximate string matching algorithm from scratch.

Recursive rules for approximate string matching

Let S be a string with len(S) = n and k a positive number with k <= n. By "?" we denote a wildcard symbol that matches any character including no character ( expressing a contraction ). Since S has length n we can select k arbitrary indexes in the set {0,…,n-1} and substitute the characters of S at those indexes by a wildcard symbol. If for example (S = "food", k = 1 and index = 2) we get "fo?d".

We write the rule which describes all possible character substitutions in "food" like this:

pattern(food, 1) = ?ood | f?od | fo?d  | foo?
Applying successive left factorings yields:
pattern(food, 1) = ?ood | f  ( ?od | o (?d  | o? ) )
This inspires a recursive notation which roughly looks like this:
pattern(food, 1) = ?ood | f pattern(ood, 1)
or more precisely:
pattern(c, 1) = ?
pattern(S, 1) = ?S[1:] | S[0] pattern(S[1:], 1)
where we have used a string variable S instead of the concrete string “food”.

When using an arbitrary k instead of a fixed k = 1 we get the following recursive equations:

pattern(c, k) = ?
pattern(S, k) = ?pattern(S[1:], k-1) | S[0] pattern(S[1:], k)
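A literal transcription of these equations into Python, with the implicit base case pattern(S, 0) = S made explicit, might look as follows; it merely prints the pattern and is not a matcher:

def pattern(S, k):
    # pattern(S, 0) = S, pattern(c, k) = '?', otherwise the recursion above
    if k == 0:
        return S
    if len(S) == 1:
        return "?"
    return "?%s | %s(%s)" % (pattern(S[1:], k-1), S[0], pattern(S[1:], k))

print pattern("food", 1)   # ?ood | f(?od | o(?d | o(?)))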

Consuming or not consuming?

When we try to find an efficient implementation of the pattern function described above we need an interpretation of the ? wildcard action. It can consume a character and feed the rest of the string into a new call of pattern, or skip a character and do the same with the rest. Since we cannot decide the choice for every string by default we eventually need backtracking, but since we can memoize intermediate results we can also lower the effort. But step by step …

The basic idea when matching a string S1 against a string S2 is that we attempt to match S1[0] against S2[0] and, when we succeed, we continue matching S1[1:] against S2[1:] using the same constant k. If we fail, we have several choices depending on the interpretation of the wildcard action: it can consume a character of S2 or leave S2 as it is. Finally S1 and S2 are exchangeable, so we are left with the following choices:

fuzzy_compare(S1, S2[1:], k-1)
fuzzy_compare(S2, S1[1:], k-1)
fuzzy_compare(S1[1:], S2[1:], k-1)

All of those choices are valid and if one fails we need to check out another one. This is sufficient for starting a first implementation.

A first implementation

The following implementation is a good point to start with:

def fuzzy_compare(S1, S2, k, i=0, j=0):
    N1 = len(S1)
    N2 = len(S2)
    while True:
        if N1-i<=k and N2-j<=k:
            return True
        try:
            if S1[i] == S2[j]:
                i+=1
                j+=1
                continue
        except IndexError:
            return False
        if k == 0:
            return False
        else:
            if fuzzy_compare(S1, S2, k-1, i+1, j):
                return True
            if fuzzy_compare(S1, S2, k-1, i, j+1):
                return True
            if fuzzy_compare(S1, S2, k-1, i+1, j+1):
                return True
            return False

The algorithm employs full backtracking and it is also limited to medium sized strings ( in Python ) because of recursion. But it shows the central ideas and is simple.

A second implementation using memoization

Our second implementation still uses recursion but introduces a dictionary which records all (i,j) index pairs that have been encountered and stores the current value of k. If the algorithm finds a value k' at (i,j) with k <= k' it will immediately return False, because this particular trace has been unsuccessfully checked out before. Using an n x n matrix to memoize results will reduce the complexity of the algorithm, which becomes O(n^2) where n is the length of the string. In fact it will even be O(n) because only a stripe of width 2k along the diagonal of the (i,j)-matrix is checked out. Of course the effort depends on the constant k.

def is_visited(i, j, k, visited):
    m = visited.get((i,j),-1)
    if k>m:
        visited[(i,j)] = k
        return False
    return True
 
def fuzzy_compare(S1, S2, k, i = 0, j = 0, visited = None):
    N1 = len(S1)
    N2 = len(S2)
    if visited is None:
        visited = {}
    while True:
        if N1-i<=k and N2-j<=k:
            return True
        try:
            if S1[i] == S2[j]:
                visited[(i,j)] = k
                i+=1
                j+=1
                continue
        except IndexError:
            return False
        if k == 0:
            return False
        else:
            if not is_visited(i+1, j, k-1, visited):
                if fuzzy_compare(S1, S2, k-1, i+1, j, visited):
                    return True
            if not is_visited(i, j+1, k-1, visited):
                if fuzzy_compare(S1, S2, k-1, i, j+1, visited):
                    return True
            if not is_visited(i+1, j+1, k-1, visited):
                if fuzzy_compare(S1, S2, k-1, i+1, j+1, visited):
                    return True
            return False

A third implementation eliminating recursion

In our third and final implementation we eliminate the recursive call to fuzzy_compare and replace it using a stack containing tuples (i, j, k) recording the current state.

def is_visited(i, j, k, visited):
    m = visited.get((i,j),-1)
    if k>m:
        visited[(i,j)] = k
        return False
    return True
 
def fuzzy_compare(S1, S2, k):
    N1 = len(S1)
    N2 = len(S2)
    i = j = 0
    visited = {}
    stack = []
    while True:
        if N1-i<=k and N2-j<=k:
            return True
        try:
            if S1[i] == S2[j]:
                visited[(i,j)] = k
                i+=1
                j+=1
                continue
        except IndexError:
            if stack:
                i, j, k = stack.pop()
            else:
                return False
        if k == 0:
            if stack:
                i, j, k = stack.pop()
            else:
                return False
        else:
            if not is_visited(i+1, j, k-1, visited):
                stack.append((i,j,k))
                i, j, k = i+1, j, k-1
            elif not is_visited(i, j+1, k-1, visited):
                stack.append((i,j,k))
                i, j, k = i, j+1, k-1
            elif not is_visited(i+1, j+1, k-1, visited):
                i, j, k = i+1, j+1, k-1
            elif stack:                               
                i, j, k = stack.pop()
            else:
                return False

This is still a nice algorithm and it should be easy to translate into C or into JavaScript for use in the browser. Notice that the order of the if-elif branches can impact the performance of the algorithm. Do you see a way to improve it?

Checking the algorithm

If D is the Levenshtein distance between two strings S1 and S2, then fuzzy_compare(S1, S2, k) shall be True for k >= D and False otherwise. So to test fuzzy_compare, compute the Levenshtein distance and check fuzzy_compare at the boundary values k = D and k = D-1.

def levenshtein(s1, s2):
    l1 = len(s1)
    l2 = len(s2)
    # prefill the (l2+1) x (l1+1) distance matrix: row 0 and column 0
    # hold the distances against the empty string
    matrix = [range(zz, zz + l1 + 1) for zz in range(l2 + 1)]
    for zz in range(l2):
        for sz in range(l1):
            if s1[sz] == s2[zz]:
                matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
                                         matrix[zz][sz+1] + 1,
                                         matrix[zz][sz])
            else:
                matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
                                         matrix[zz][sz+1] + 1,
                                         matrix[zz][sz] + 1)
    return matrix[l2][l1]

For exhaustive testing we define a set of strings as follows:

Given a prescribed n we define the set of strings of length n consisting of "a" and "b" characters only. The number of those strings is 2^n. If we consider all pairs of strings in that set we get 2^(2n) such pairs. Of course we could exploit symmetries to remove redundant pairs, but in order to keep it simple we examine only small strings, e.g. n = 6, which leads to 4096 pairs altogether.

def string_set(S = None, k = 0, strings = None, n = 6):
    if S is None:
        strings = []
        S = ["a"]*n
        strings.append("".join(S))
    for i in range(k, n):
        S1 = S[:]
        S1[i] = "b"
        strings.append("".join(S1))
        string_set(S1, i+1, strings, n)
    return strings
 
def string_pairs(n):
    L1 = string_set(n=n)
    pairs = []
    for i in range(len(L1)):
        for k in range(1, n+1):
            L2 = string_set(n=k)
            for j in range(len(L2)):
                pairs.append((L1[i],L2[j],levenshtein(L1[i], L2[j])))
                pairs.append((L2[j],L1[i],levenshtein(L2[j], L1[i])))
    return pairs

Our test function is short:

def test(n):
    for S1, S2, D in string_pairs(n):
        assert fuzzy_compare(S1, S2, D) == True, (S1, S2, D)
        assert fuzzy_compare(S1, S2, D-1) == False, (S1, S2, D-1)

Have much fun!

Token Tracers

Posted in EasyExtend, Parsing, TBP on June 3rd, 2010 by kay – Be the first to comment

When I started programming EasyExtend in 2006 one of the major problems was the correct grammar -> NFA translation. I used big grammars and testing for correctness required lots of source code. The first heuristic I used was ugly and complex, and it took about two years to find a neat trick which finally led to replacing it completely. The basic problem of systematically generating phrases or expressions for testing purposes persisted though – until last week, when I implemented a TokenTracer.

Tracers

A typical production rule in the Trail parser generator is translated into a single NFA which might look as in the following example

 1005: ["funcdef: [decorators] 'def' NAME parameters ':' suite",
        (1005, 0, 1005),
        {(1, 3, 1005): [(1006, 4, 1005)],
         (11, 5, 1005): [(1043, 6, 1005)],
         ('def', 2, 1005): [(1, 3, 1005)],
         (1004, 1, 1005): [('def', 2, 1005)],
         (1005, 0, 1005): [('def', 2, 1005), (1004, 1, 1005)],
         (1006, 4, 1005): [(11, 5, 1005)],
         (1043, 6, 1005): [(None, '-', 1005)]}],

It is not created for readability but it is nevertheless easy to decode. The funcdef grammar rule is assigned a numerical value, a rule identifier – here 1005. Associated with the rule identifier is a 3-list consisting of

  1. The rule in plain text
  2. The start state of a finite automaton (1005, 0, 1005)
  3. A finite automaton encoded as a dictionary of transitions.

Starting with (1005, 0, 1005) one can step through the automaton. The follow states are [('def', 2, 1005), (1004, 1, 1005)]. The first one obviously represents the def keyword whereas the second is a representation of the decorators non-terminal which has the rule identifier 1004. When you select the (1004, 1, 1005) state there is a single follow state, which is again the state of the def keyword; otherwise you get the follow state (1, 3, 1005) of ('def', 2, 1005). The state (None, '-', 1005) doesn't have a follow state and it is the only such state.

You can now define a function that keeps track of this stepping process through a rule. This function is called a Tracer.

A Tracer acts as follows:

>>> tracer = Tracer(rules)
>>> tracer.select(1005)   # selects automaton 1005 and returns the rule ids of the
['def', 1004]             # possible follow states
>>> tracer.select('def')
[1]
>>> tracer.select(1)
[1006]
...
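
For illustration only, here is a minimal sketch of how such a tracer could work on top of the NFA dictionaries shown above; it is a toy that reproduces the select() behaviour of the session, not the actual Trail/EasyExtend implementation:

class Tracer(object):
    def __init__(self, rules):
        self.rules = rules        # rule id -> [rule text, start state, transitions]
        self.states = []          # NFA states currently under consideration
        self.transitions = {}

    def select(self, label):
        if not self.states:
            # first call: 'label' is a rule id, enter its automaton at the start state
            _, start, self.transitions = self.rules[label]
            follow = self.transitions[start]
        else:
            # unify the follow sets of all current states carrying 'label'
            follow = []
            for state in self.states:
                if state[0] == label:
                    follow.extend(self.transitions[state])
        self.states = follow
        # report the labels (rule ids, token types or None) of the follow states
        labels = []
        for state in follow:
            if state[0] not in labels:
                labels.append(state[0])
        return labels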

It is possible that a Tracer has to keep track of multiple traces at once. For example the exprlist rule

 1069: ["exprlist: expr (',' expr)* [',']",
        (1069, 0, 1069),
        {(12, 2, 1069): [(1053, 3, 1069)],
         (12, 4, 1069): [(None, '-', 1069)],
         (1053, 1, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)],
         (1053, 3, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)],
         (1069, 0, 1069): [(1053, 1, 1069)]}],

defines transitions of the kind

(1053, 1, 1069): [(12, 4, 1069), (12, 2, 1069), (None, '-', 1069)]

with two states of rule id 12 in the follow set. When 12 is selected in the Tracer, the follow sets of all states with rule id 12 are unified:

>>> tracer.select(1069)
[1053]
>>> tracer.select(1053)
[12, None]
>>> tracer.select(12)
[1053, None]
...

TokenTracers

This kind of tracing functionality is central to EasyExtend's implementation of Trace Based Parsing (TBP). For single grammar rules TBP coincides with “Thompson NFA” style parsing, discussed at length by Russ Cox or, more recently, by Carl Friedrich Bolz who gave a Python implementation.

We now want to consider a different sort of tracer which is more complicated to create than those for single grammar rules. These tracers have to meet the following requirement:

The list of rule id’s returned from tracer.select() shall contain only None or rule id’s of terminal symbols.

The rule id’s of terminals are exactly the token types. The select function of a TokenTracer returns a list of token types and gets fed with a single token type. In the following example we step through the token stream of a simple function

def foo():
    print 42

Here we go

>>> tracer = TokenTracer(rules)
>>> tracer.select(1001)  # a single select using a top level non-terminal
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select('def')
[1]
>>> tracer.select(1)     # foo
[7]
>>> tracer.select(7)     # (
[1, 7, 8, 16, 36]
>>> tracer.select(8)     # )
[11]
>>> tracer.select(11)    # :
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select(4)     # \n
[5]
>>> tracer.select(5)     # INDENT
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select('print')
[1, 2, 3, 4, 7, 9, 13, 13, 14, 15, 25, 26, 32, 35, 'lambda', 'not']
>>> tracer.select(2)     # 42
[4, 7, 9, 12, ..., 36, 48, '<>', 'and', 'if', 'in', 'is', 'is', 'not', 'or']
>>> tracer.select(4)     # \n
[1, 2, 3, 6, 7, ... , 'try', 'while', 'with', 'yield']
>>> tracer.select(6)     # DEDENT
[0, 1, 2, 3, 4, 7, ... , 'assert', 'break', 'class', 'continue', 'def', ...]
>>> tracer.select(0)     # ENDMARKER

Application 1 – error detection

Using a TokenTracer it is dead simple to localize a syntax error which is – in the context free case – always an unexpected token. In principle Trail could delegate error recovery entirely to a TokenTracer.
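
A rough sketch of this idea, assuming tokens are given as (token_type, token_string) pairs and that the tracer follows the select() protocol shown above; the function name and its signature are made up for illustration and are not EasyExtend interfaces:

def find_syntax_error(tokens, tracer, top_rule):
    # prime the tracer with a top level non-terminal
    expected = tracer.select(top_rule)
    for pos, (tok_type, tok_string) in enumerate(tokens):
        if tok_type not in expected:
            return pos, tok_string, expected   # the unexpected token localizes the error
        expected = tracer.select(tok_type)
    return None                                # no context free syntax error found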

Application 2 – autocorrection

A constant token is a token with a constant token string, e.g. ';' or ':'. Closely related are tokens like INDENT where the token string can be derived from the context and a prescribed indentation. In sharp contrast are tokens like NAME, NUMBER and STRING whose token string is determined by the user, not by the language. In the select() sequence above we find constant token lists of length 1 like [11] or [7]. If one of those tokens is omitted it can be inserted without guessing.
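
A hypothetical sketch of such an autocorrection pass; the helper name, the token format and the constant_strings mapping (constant token types to their fixed token strings) are assumptions for illustration, not EasyExtend interfaces:

def insert_missing_constants(tokens, tracer, top_rule, constant_strings):
    repaired = []
    expected = tracer.select(top_rule)
    for tok_type, tok_string in tokens:
        if tok_type not in expected and len(expected) == 1 and expected[0] in constant_strings:
            # exactly one constant token is admissible here: insert it without guessing
            missing = expected[0]
            repaired.append((missing, constant_strings[missing]))
            expected = tracer.select(missing)
        repaired.append((tok_type, tok_string))
        expected = tracer.select(tok_type)
    return repaired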

Application 3 – expression generation

The most intriguing aspect of TokenTracers is that each random token sequence which is constrained by a TokenTracer is syntactically correct. This can be used to create expression generators: first write a grammar G to describe the language syntax, then derive a TokenTracer(G). Finally an expression generator ExprGen(TokenTracer(G)) is created which builds random token sequences compliant with G by means of the TokenTracer. Those token sequences can either be turned into valid parse trees and compiled, or be un-tokenized into source code.
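
The core of such a generator can be sketched in a few lines, assuming that select() eventually lists None among the admissible token types when the trace may terminate, as required above; the function name and parameters are made up for illustration:

import random

def random_token_types(tracer, top_rule, max_len=100):
    sequence = []
    follow = tracer.select(top_rule)
    while follow and len(sequence) < max_len:
        if None in follow and random.random() < 0.2:
            break                              # the trace is allowed to stop here
        choices = [t for t in follow if t is not None]
        if not choices:
            break                              # termination is the only option left
        tok_type = random.choice(choices)
        sequence.append(tok_type)
        follow = tracer.select(tok_type)
    return sequence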

A valuation function fitness(expr) -> float on expressions motivates the use of genetic programming for breeding expressions of a certain kind. For example I'm strongly interested in compact grammars which create big NFA expansions in Trail. It is not easy to see how those can be built by hand. Using GP one could set an arbitrary threshold like n = 1000 for the number of states in a single expanded NFA and try to minimize the size of a grammar, where the size is measured in the number of tokens used for the grammar description in some meta-grammar (e.g. EBNF).

Inheritance and the C preprocessor

Posted in Algorithms, C on March 24th, 2010 by kay – 3 Comments

Defining n-ary trees using the C preprocessor

In this article I introduce a compile time C technique used to define inheritance. Instead of giving a lengthy motivation I’ll jump directly to the algorithm and discuss it later. I hope lovers of C and its preprocessor find it useful. #defines first!

#define TOP 0
#define SET_CHILD(n,parent) ( parent==TOP ? n: \
                            ( parent<(1<<4) ? (n<<4) + parent : \
                            ( parent<(1<<8) ? (n<<8) + parent : (n<<12)+parent)))
 
#define IS_SUBNODE(child, parent) ((child & parent) == parent)
 
#define SELECT(X, a, best) ( a > best && IS_SUBNODE(X, a)? a : best)
 
#define SELECT_FROM_5(X, a, b, c, d, e) SELECT(X, a, \
                                        SELECT(X, b, \
                                        SELECT(X, c, \
                                        SELECT(X, d, \
                                        SELECT(X, e, 0)))))
 
#define SELECT_FROM_4(X, a, b, c, d) SELECT_FROM_5(X, a, b, c, d, 0)
#define SELECT_FROM_3(X, a, b, c)    SELECT_FROM_5(X, a, b, c, 0, 0)
#define SELECT_FROM_2(X, a, b)       SELECT_FROM_5(X, a, b, 0, 0, 0)

The SET_CHILD macro is used to define up to 15 child nodes of a given parent in an n-ary tree of depth 5 with a single root node, named TOP. Each node is encoded within a single word-sized number, which is adequate for most embedded compilers. For 32 or 64 bit processors one can either support more child nodes or a deeper tree.

SET_CHILD assigns a name to the n-th child of a given parent. One starts with TOP as the parent of all nodes and recurses down:

#define A SET_CHILD(1, TOP)
#define B SET_CHILD(2, TOP)
...
#define A1 SET_CHILD(1, A)
#define A2 SET_CHILD(2, A)
...
#define B1 SET_CHILD(1, B)
#define B2 SET_CHILD(2, B)
...
#define A11 SET_CHILD(1, A1)
#define A12 SET_CHILD(2, A1)
...
#define A21 SET_CHILD(1, A2)
#define A22 SET_CHILD(2, A2)
...

By construction no more than 15 child nodes for a given parent are permitted. If more are used, macros like IS_SUBNODE will fail to work correctly.

Once a tree is created with the appropriate nodes, one can use IS_SUBNODE to check for child/parent relationships. The tree is constructed s.t. IS_SUBNODE(A, B) returns 1 iff A is B itself, a direct child of B, a grandchild of B etc., otherwise 0. So IS_SUBNODE(A22, A) evaluates to 1, just like IS_SUBNODE(A22, A2) or IS_SUBNODE(A22, TOP), but IS_SUBNODE(A22, A1) is 0. As a worked example of the encoding: A is 1, A1 is (1<<4)+1 = 17, A2 is (2<<4)+1 = 33 and A22 is (2<<8)+33 = 545, so IS_SUBNODE(A22, A2) checks (545 & 33) == 33, which holds, while (545 & 17) == 17 does not.

The C preprocessor doesn't support overloading, and the flavors I checked didn't support varargs, which probably wouldn't be much help in this case either. So I defined a group of SELECT_FROM_xx macros distinguished only by the number of arguments. The number 5 isn't magic and one can extend the range of SELECT_FROM_xx macros as needed.

How is SELECT_FROM_xx used? The first argument X is an arbitrary node of the tree. If one of the subsequent nodes a, b, …, c is identical with X, X will be the value of SELECT_FROM_xx(X, a, b, …, c). Otherwise the most direct parent of X among the nodes a, …, c will be returned. If none of them is a parent of X the return value is TOP.

Example:

If we set

#define X A22

then we get

SELECT_FROM_2(X, A, B)        // = A
SELECT_FROM_3(X, A, B, A1)    // = A
SELECT_FROM_3(X, A, B, A2)    // = A2
SELECT_FROM_3(X, A2, B, A)    // = A2
SELECT_FROM_3(X, A2, A, A22)  // = A22
SELECT_FROM_2(X, A1, B)       // = TOP

Inheritance

With the definitions above we can influence conditional compilation:

#if SELECT_FROM_3(X,A2,A,B) == A2
        const int a = 0;
#elif SELECT_FROM_3(X,A2,A,B) == A
        const int a = 1;
#elif SELECT_FROM_3(X,A2,A,B) == B
        const int a = 2;
#else
        const int a = -1;
#endif

The virtue of the construction lies in its robustness. Suppose X is A22; then the first branch is selected, but this remains true also if we build a “subclass” A22k, k = 1, …, 9, A, …, F, of A22 and assign e.g.

#define X A225

So if we use conditional compilation for a given system S and create a subsystem T of S, e.g. a new version of S, we have to adapt our C code only in places where T differs explicitly from S. This robustness is also the major virtue of using inheritance/polymorphism in OOP. It has led to a disrespect of case statements in OOP, since those do not exploit polymorphism and in turn cause less robust code. We see that case or if-else statements can be given the very same robustness with this idea, even at the level of the C preprocessor. The additional charm of using the C preprocessor is that the child/parent relationships are computed at compile time and do not cause any runtime performance penalty.