<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Trails in a Langscape</title>
	<atom:link href="http://fiber-space.de/wordpress/feed/" rel="self" type="application/rss+xml" />
	<link>http://fiber-space.de/wordpress</link>
	<description>Projects and projections</description>
	<lastBuildDate>Tue, 10 Apr 2012 09:53:16 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.2</generator>
		<item>
		<title>The state of the Trail parser generator</title>
		<link>http://fiber-space.de/wordpress/2012/04/10/the-state-of-the-trail-parser-generator/</link>
		<comments>http://fiber-space.de/wordpress/2012/04/10/the-state-of-the-trail-parser-generator/#comments</comments>
		<pubDate>Tue, 10 Apr 2012 09:53:16 +0000</pubDate>
		<dc:creator>kay</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Grammars]]></category>
		<category><![CDATA[Langscape]]></category>
		<category><![CDATA[Parsing]]></category>
		<category><![CDATA[TBP]]></category>
		<category><![CDATA[Trail]]></category>

		<guid isPermaLink="false">http://fiber-space.de/wordpress/?p=1964</guid>
		<description><![CDATA[White Easter Snow came back, if only for a brief moment, to remind me laying Trail to rest until next winter&#8230; x + + + + + + + + + - - - - - - - - - x - - - - - - - - x + - - - - [...]]]></description>
			<content:encoded><![CDATA[<h3>White Easter</h3>

<p>Snow came back, if only for a brief moment, to remind me to lay Trail to rest until next winter&#8230;
<pre> x + + + + + + + + +
 - - - - - - - - - x
 - - - - - - - - x +
 - - - - - - - x + +
 - - - - - - x + + +
 - - - - - x + + + +
 - - - - x + + + + +
 - - - x + + + + + +
 - - x + + + + + + +
 - x + + + + + + + +</pre></p>

<h3>Flat Automatons</h3>

<p>This story begins where the last blog article ended, in Sept. 2011. At that time I realized, contrary to my expectations, that carefully embedding finite automatons into each other could yield higher execution speed for parsers than running without those embeddings, as in plain LL(1), and this despite a far more complex machinery and the additional effort of reconstructing a parse tree from the state sequences generated while traversing the automaton. I did expect this speed advantage to go away when moving towards an implementation optimized for speed, and running my samples on PyPy confirmed this assumption, but it was an encouraging sign that the general direction was promising and that more ambitious goals could be realized. I wasn&#8217;t entirely mistaken but what came out in the end is also scary, if not monstrous. Anyway, let&#8217;s begin.</p>

<p>If the automaton embeddings I just mentioned were perfect, the grammar, which is translated rule by rule into a set of those automatons, would dissolve into a single finite automaton. It would then be completely <em>flat</em> and contain only terminal symbols which refer to each other. An intrinsic hurdle is <em>recursion</em>. A recursive rule like
<pre>X: X X | a</pre>
would give us an automaton that has to be embedded into itself, which is not possible in finitely many steps. Instead we could consider successive self-embedding as a generative process:
<pre>X X | a
X X X X | X X a | a X X | a X | X a | a a | a
...</pre></p>

<h3>Working embeddings</h3>

<p>The difficulties of creating a flat, finite automaton from a grammar are twofold. Perfect embedding/inlining leads to information loss, and recursive rules lead to cyclic or infinite expansion. Of the two problems I solved the first and easier one in the early versions of the Trail parser generator. This was done by preparing automaton states and introducing special ɛ-states. In automata theory an ɛ-state corresponds to the empty word, i.e. it won&#8217;t be used to recognize a character / token. It solely regulates transitions within an automaton. In BNF we may write:
<pre>X: X a
X: ɛ</pre>
which means that <span style="font-family: Courier New,Courier,monospace;">X</span> can produce the empty word. Since ɛ<span style="font-family: Courier New,Courier,monospace;">a = a</span> the rule <span style="font-family: Courier New,Courier,monospace;">X</span> accepts all finite sequences of <span style="font-family: Courier New,Courier,monospace;">a</span>. In EBNF we can rewrite the X-rule as
<pre>X: [X] a</pre>
which summarizes the optionality of the inner X well. We write the automaton of the X-rule as a transition table
<pre><span style="color: #3366ff;"><strong>(X,0)</strong></span>: (X,1) (a,2)
(X,1): (a,2)
(a,2): <strong><span style="color: #3366ff;">(FIN,-1)</span></strong></pre>
Each state carries a unique index, which allows us to distinguish arbitrarily many different <span style="font-family: Courier New,Courier,monospace;">X</span> and <span style="font-family: Courier New,Courier,monospace;">a</span> states in the automaton. If we further want to embed <span style="font-family: Courier New,Courier,monospace;">X</span> within another rule, say
<pre>Y: X b</pre>
which is defined by the table
<pre><span style="color: #3366ff;"><strong>(Y,0)</strong></span>: (X,1)
(X,1): (b,2)
(b,2): <span style="color: #3366ff;"><strong>(FIN,-1)</strong></span></pre>
the single index is not sufficient and we need a second index which individualizes each state by taking a reference to the containing automaton:
<pre><span style="color: #3366ff;"><strong>(X,0,X)</strong></span>: (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): <span style="color: #3366ff;"><strong>(FIN,-1,X)</strong></span></pre>
With this notion we can embed <span style="font-family: Courier New,Courier,monospace;">X</span> in <span style="font-family: Courier New,Courier,monospace;">Y</span>:
<pre><strong><span style="color: #3366ff;">(Y,0,Y)</span></strong>: (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): <strong><span style="color: #993366;">(FIN,-1,X)</span></strong>
<strong><span style="color: #993366;">(FIN,-1,X)</span></strong>: (b,2,Y)
(b,2,Y): <span style="color: #3366ff;"><strong>(FIN,-1,Y)</strong></span></pre>
The initial state <span style="font-family: Courier New,Courier,monospace;">(X,0,X)</span> of <span style="font-family: Courier New,Courier,monospace;">X</span> was completely removed. The automaton is still erroneous though. The final state <span style="font-family: Courier New,Courier,monospace;">(FIN,-1,X)</span> of <span style="font-family: Courier New,Courier,monospace;">X</span> is not a final state in <span style="font-family: Courier New,Courier,monospace;">Y</span>. It even has a transition! We could try to remove it completely and instead write
<pre><span style="color: #3366ff;"><strong>(Y,0,Y)</strong></span>: (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): (b,2,Y)
(b,2,Y): <span style="color: #3366ff;"><strong>(FIN,-1,Y)</strong></span></pre>
But suppose Y had the form:
<pre>Y: X X b</pre>
then embedding X would remove the boundary between the two X, which again is a loss of structure. What we do instead is to transform the final state <span style="font-family: Courier New,Courier,monospace;">(FIN, -1, X)</span>, when embedded in <span style="font-family: Courier New,Courier,monospace;">Y</span>, into an ɛ-state in <span style="font-family: Courier New,Courier,monospace;">Y</span>:
<pre>(FIN,-1,X) =&gt; (X, 3, TRAIL_DOT, Y)</pre>
The tuple which describes automaton states in Trail grows again by one entry. A state which is not an ɛ-state has the shape <span style="font-family: Courier New,Courier,monospace;">(_, _, 0, _)</span>. Finally the fully and correctly embedded automaton <span style="font-family: Courier New,Courier,monospace;">X</span> in <span style="font-family: Courier New,Courier,monospace;">Y</span> looks like this:
<pre><strong><span style="color: #3366ff;">(Y,0,0,Y)</span></strong>: (X,1,0,X) (a,2,0,X)
(X,1,0,X): (a,2,0,X)
(a,2,0,X): (X,3,TRAIL_DOT,Y)
(X,3,TRAIL_DOT,Y): (b,2,0,Y)
(b,2,0,Y): <strong><span style="color: #3366ff;">(FIN,-1,0,Y)</span></strong>
</pre>
The <span style="font-family: Courier New,Courier,monospace;">TRAIL_DOT</span> marks the transition between <span style="font-family: Courier New,Courier,monospace;">X</span> and <span style="font-family: Courier New,Courier,monospace;">Y</span> <em>in</em> <span style="font-family: Courier New,Courier,monospace;">Y</span>. In principle we are free to define infinitely many ɛ-states. In the end we will define exactly 5 types.</p>
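<p>The embedding step described above can be sketched in a few lines of Python. This is only an illustration, not Trail&#8217;s actual code: the dictionary encoding of the transition tables, the function name <span style="font-family: Courier New,Courier,monospace;">embed</span>, the value chosen for <span style="font-family: Courier New,Courier,monospace;">TRAIL_DOT</span> and the tagging of the X state of Y with the rule name Y are all assumptions made for the example.</p>

```python
# Sketch only: the state layout (name, index, kind, rule) follows the article;
# embed() and the value of TRAIL_DOT are assumptions made for illustration.
TRAIL_DOT = "."

X_NFA = {
    ("X", 0, 0, "X"): [("X", 1, 0, "X"), ("a", 2, 0, "X")],
    ("X", 1, 0, "X"): [("a", 2, 0, "X")],
    ("a", 2, 0, "X"): [("FIN", -1, 0, "X")],
}
Y_NFA = {
    ("Y", 0, 0, "Y"): [("X", 1, 0, "Y")],
    ("X", 1, 0, "Y"): [("b", 2, 0, "Y")],
    ("b", 2, 0, "Y"): [("FIN", -1, 0, "Y")],
}


def embed(outer, inner, target, dot_index):
    """Inline the automaton `inner` at the state `target` of `outer`."""
    inner_start = next(s for s in inner if s[1] == 0)
    dot = (target[0], dot_index, TRAIL_DOT, target[3])

    def rename(state):
        # The final state of the inner automaton becomes an epsilon-state.
        return dot if state[0] == "FIN" else state

    result = {}
    for state, follows in outer.items():
        if state == target:
            continue  # the embedded non-terminal state disappears
        new_follows = []
        for follow in follows:
            if follow == target:
                # Enter the inner automaton through the follows of its start.
                new_follows.extend(rename(s) for s in inner[inner_start])
            else:
                new_follows.append(follow)
        result[state] = new_follows
    for state, follows in inner.items():
        if state != inner_start:  # the inner start state is dropped
            result[state] = [rename(f) for f in follows]
    # The epsilon-state continues where the replaced state continued.
    result[dot] = list(outer[target])
    return result


embedded = embed(Y_NFA, X_NFA, ("X", 1, 0, "Y"), dot_index=3)
for state, follows in embedded.items():
    print(state, "->", follows)
```

<p>Under these assumptions the resulting table has the same five entries as the embedded automaton shown above, with the final state of X replaced by the ɛ-state marked with <span style="font-family: Courier New,Courier,monospace;">TRAIL_DOT</span>.</p>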

<h3>Rules of Embedding</h3>

<p>At this point it is fair to ask whether all of this is purely recreational. Why should anyone care about automaton embeddings? Don&#8217;t we have anything better to do with our time? Certainly not, but asking for a little more motivation is justified. Consider the following grammar:
<pre>R: A | B
A: a* c
B: a* d</pre>
In this grammar we encounter a so-called FIRST/FIRST conflict. Given a string &#8220;aaa&#8230;&#8221; we cannot decide which of the rules A or B we have to choose unless we observe a &#8216;c&#8217; or &#8216;d&#8217; event, i.e. our string becomes &#8220;aa&#8230;ac&#8221; or &#8220;aa&#8230;ad&#8221;. What we basically want is to defer the choice of a rule, making a <em>late choice</em> instead of checking out rules by trial and error. By carefully storing and recalling intermediate results we can avoid the consequences of an initial bad choice, to the extent that parsing in O(n) time for a string of length n becomes possible. The same can be achieved through automaton embeddings, which gives us:
<pre>R: a* c | a* d</pre>
but in a revised form as seen in the previous section. On the automaton level the information about the containing rules A and B is still present. If we use R for parsing we get state sets <span style="font-family: Courier New,Courier,monospace;">{ (a,1,0,A), (a,2,0,B) }</span> whose members all recognize the same character &#8220;a&#8221;. Those state sets are stored during parsing. In case of a &#8220;c&#8221; event, which is recognized by the state <span style="font-family: Courier New,Courier,monospace;">(c, 3, 0, <strong><span style="color: #3366ff;">A</span></strong>)</span>, we only have to dig into the state-set sequence and follow the states <span style="font-family: Courier New,Courier,monospace;">(a, 1, 0, <strong><span style="color: #3366ff;">A</span></strong>)</span> back to the first element of the sequence. Since <span style="font-family: Courier New,Courier,monospace;">(<strong><span style="color: #3366ff;">A</span></strong>, 4, TRAIL_DOT, R)</span> is the only follow state of <span style="font-family: Courier New,Courier,monospace;">(c, 3, 0, <span style="color: #3366ff;"><strong>A</strong></span>)</span> we will actually see the sequence:
<pre>(<strong><span style="color: #3366ff;">A</span></strong>,4,TRAIL_DOT,R)
 \
  '...
      \
(c,3,0,<span style="color: #3366ff;"><strong>A</strong></span>)
(a,1,0,<strong><span style="color: #3366ff;">A</span></strong>)
(a,1,0,<span style="color: #3366ff;"><strong>A</strong></span>)
...
(a,1,0,<strong><span style="color: #3366ff;">A</span></strong>)</pre>
From this sequence we can easily reconstruct contexts and build the tree
<pre>[R, [A, a, a, ..., c]]</pre>
All of this is realized by <em>late choice</em>. Until a &#8220;c&#8221; or &#8220;d&#8221; event we move within A and B at the same time. The embedding of A and B in R <em>solves</em> the FIRST/FIRST conflict above. That is the point of the exercise.</p>

<h3>FOLLOW/FIRST conflicts</h3>

<p>So far the article didn&#8217;t contain anything new. I&#8217;ve written about all of this before.</p>

<p>The FIRST/FIRST conflicts between FIRST-sets of a top down parser are not the only ones we have to deal with. We also need to take left recursions into account, which can be considered a special case of a FIRST/FIRST conflict, and also FIRST/FOLLOW or, better, FOLLOW/FIRST conflicts, which have not been treated yet. A FOLLOW/FIRST conflict can be illustrated using the following grammar:
<pre>R: U*
U: A | B
A: a+ (B c)*
B: b</pre>
There is no FIRST/FIRST conflict between A and B and we can&#8217;t factor out a common prefix. But now suppose we want to parse the string &#8220;abb&#8221;. Obviously A recognizes the two initial characters &#8220;ab&#8221; and then fails at the 2nd &#8220;b&#8221; because &#8220;c&#8221; was expected. Now A can recognize &#8220;a&#8221; alone and then cancel the parse because (B c)* is an optional multiple of (B c). This is not a violation of the rules. After &#8220;a&#8221; has been recognized by A the rule B may take over and match &#8220;b&#8221; two times:
<pre>[R, [U, [A, a]], [U, [B, b]], [U, [B, b]]]</pre>
Trail applies a &#8220;longest match&#8221; recognition approach, which means here that A is greedy and matches as many characters in the string as possible. But according to the rule definition A can also terminate the parse after <span style="font-family: Courier New,Courier,monospace;">a</span>, and at that point the parser dynamically sets a so-called <em>checkpoint</em>. Trail allows backtracking to this checkpoint, provided the longest match approach fails after the checkpoint. Setting exactly one checkpoint for a given rule is still compliant with the longest match approach. If the given input string is &#8220;abcbb&#8221; then A will match &#8220;abc&#8221;, if it is &#8220;abcbcbb&#8221; then A will match &#8220;abcbc&#8221;, and so on.</p>
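<p>The checkpoint idea can be illustrated with a small hand-written matcher for the example grammar. This is a sketch under assumptions, not Trail&#8217;s mechanism: the functions <span style="font-family: Courier New,Courier,monospace;">match_A</span> and <span style="font-family: Courier New,Courier,monospace;">parse_R</span> are hypothetical, the grammar is hard-coded, and the single checkpoint is simply moved forward each time a complete (B c) group has been matched.</p>

```python
# Hypothetical hand-coded matcher for
#   R: U*    U: A | B    A: a+ (B c)*    B: b
def match_A(text, pos):
    """Greedy match of A at `pos`; returns (end, tree) or None."""
    i, children = pos, []
    while i < len(text) and text[i] == "a":
        children.append("a")
        i += 1
    if not children:
        return None
    # A may legally stop here: remember position and number of children.
    checkpoint = (i, len(children))
    while i < len(text) and text[i] == "b":
        children.append(["B", "b"])
        i += 1
        if i < len(text) and text[i] == "c":
            children.append("c")
            i += 1
            checkpoint = (i, len(children))  # a complete (B c) was matched
        else:
            break  # 'c' was expected: fall back to the checkpoint
    end, keep = checkpoint
    return end, ["A"] + children[:keep]


def parse_R(text):
    tree, pos = ["R"], 0
    while pos < len(text):
        matched = match_A(text, pos)
        if matched is not None:
            pos, subtree = matched
            tree.append(["U", subtree])
        elif text[pos] == "b":
            tree.append(["U", ["B", "b"]])
            pos += 1
        else:
            raise SyntaxError("unexpected %r at position %d" % (text[pos], pos))
    return tree


print(parse_R("abb"))
print(parse_R("abcbb"))
```

<p>For &#8220;abb&#8221; the greedy attempt to continue A with (B c) fails at the second &#8220;b&#8221;, the matcher returns to the checkpoint after &#8220;a&#8221;, and B takes over twice, which gives the tree shown above.</p>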

<p>The FOLLOW/FIRST conflict leads to a proper ambiguity and checkpoints are currently the approach used by Trail to deal with them. I also tried to handle FOLLOW/FIRST conflicts in an automaton embedding style but encountered fragmentation effects. The ambiguities were uncovered, but this was paid for with a loss of direction, and longest match was effectively disabled.</p>

<h3>The inevitability of left recursions</h3>

<p>It is easy in top down parsing to recognize and remove or transform a left recursive rule like this one
<pre>X: X a | ɛ</pre>
The phenomenology seems straightforward. But making those exclusions is like drawing political boundaries in colonialist Africa. Desert, vegetation, animals and humans don&#8217;t entirely respect decisions made by once rivaling French and English occupants. If embedding comes into play we can count on uncovering left recursions where we didn&#8217;t expect them. I&#8217;d like to go even one step further, which is conjectural: we can&#8217;t even know for sure that none will be uncovered. The dynamics of FIRST/FIRST conflicts that are uncovered by embeddings, this <em>clash dynamics</em> as I like to call it, might lead to algorithmically undecidable problems. It&#8217;s nothing I&#8217;ve systematically thought about but I wouldn&#8217;t be too surprised.</p>

<p>For almost any left recursive rule there is a FIRST/FIRST conflict of this rule with itself. Exceptions are cases which are uninteresting such as
<pre>X: X
</pre>
or
<pre>X: X a</pre>
In both cases the FIRST-sets of X don&#8217;t contain any terminal symbol and they can&#8217;t recognize anything. They are like ɛ-states but also non-terminals. Very confusing. Trail rips them off and issues a warning. An interesting rule like
<pre>E: E '*' E | NUMBER</pre>
contains a FIRST/FIRST conflict between <span style="font-family: Courier New,Courier,monospace;">E</span> and <span style="font-family: Courier New,Courier,monospace;">NUMBER</span>. It cannot be removed through self embedding of E. The same goes for rules which hide a left recursion but lead to embedding cycles, such as
<pre>T: u v [T] u w</pre>
which are quite common. We could try to work around them as we did with FOLLOW/FIRST conflicts, instead of downright solving them. Of course one can also give up top down parsing in favor of bottom up parsers of Earley type or GLR, but that&#8217;s entirely changing the game. The question is: <em>must</em> we tack backtracking solutions onto Trail which are more deeply involved than checkpoints?</p>

<p>After 6 months of tinkering I wished the answer was <em>no</em>. Actually I believe that it is unavoidable, but it occurs at places where I didn&#8217;t originally expect it, and even in those cases I often observed/measured a parsing effort which is proportional to string length. Parse tree reconstruction from state-set traces, which was once straightforward, becomes a particularly hairy affair.</p>

<h3>Teaser</h3>

<p>Before I discuss left recursion problems in Trail in a follow up article I&#8217;ll present some results as a teaser.</p>

<p>Grammars for which parsing in Trail is O(n):
<pre>a) E: '(' E ')' | E '*' E | NUMBER
b) E: '(' E+ ')' | E '*' E | NUMBER</pre>
Other grammars in the same category are
<pre>c) G: 'u' 'v' [G] 'u' 'w'
d) G: G (G | 'c') | 'c'
e) G: G G | 'a'
</pre>
However for the following grammars the parser is in O(2^n)
<pre>f) G: G G (G | 'h') | 'h'
g) G: G [G G] (G | 'h') | 'h'
</pre>
If we combine  d) and f) we get
<pre>h) G: G G (G | 'h') | G (G | 'h') | 'h'
</pre>
In this case Trail will deny service and throw a <span style="font-family: Courier New,Courier,monospace;">ParserConstructionError</span> exception. Some pretty polynomial grammars will be lost.</p>

]]></content:encoded>
			<wfw:commentRss>http://fiber-space.de/wordpress/2012/04/10/the-state-of-the-trail-parser-generator/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>LL(*) faster than LL(1)?</title>
		<link>http://fiber-space.de/wordpress/2011/09/12/ll-faster-than-ll1/</link>
		<comments>http://fiber-space.de/wordpress/2011/09/12/ll-faster-than-ll1/#comments</comments>
		<pubDate>Mon, 12 Sep 2011 18:45:54 +0000</pubDate>
		<dc:creator>kay</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Grammars]]></category>
		<category><![CDATA[Parsing]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[TBP]]></category>

		<guid isPermaLink="false">http://fiber-space.de/wordpress/?p=1876</guid>
		<description><![CDATA[No jumps When I began to work on EasyExtend in 2006 I grabbed a Python parser from the web, written by Jonathan Riehl ( it doesn&#8217;t seem to be available anymore ). It was a pure Python parser following closely the example of CPython&#8217;s pgen. The implementation was very dense and the generated parser probably [...]]]></description>
			<content:encoded><![CDATA[<h3>No jumps</h3>

<p>When I began to work on EasyExtend in 2006 I grabbed a Python parser from the web, written by Jonathan Riehl ( it doesn&#8217;t seem to be available anymore ). It was a pure Python parser following closely the example of CPython&#8217;s <em>pgen</em>. The implementation was very dense and the generated parser probably as fast as a Python parser could be. It was restricted to LL(1) though which was a severe limitation when I stepped deeper into the framework.</p>

<p>In mid 2007 I created a parse tree checker. The problem was that a parse tree could be the return value of a transformation of another parse tree: T : P -&gt; P*. How do we know that P* is still compliant with a given syntax? This can easily be solved by chasing NFAs of the target grammar, both <em>horizontally</em>, i.e. within an NFA, as well as <em>vertically</em>: calling checkers recursively for each parse tree node which belongs to a non-terminal. This checker generator was only a tiny step apart from a parser generator which I started to work on in summer 2007.</p>

<p>What I initially found when I worked on the parse tree checker was that horizontal NFA chasing never has to take into account that there are two alternative branches in rules like this
<pre>R: a* b | a* c
</pre>
The algorithm <em>never</em> checks out the first branch, runs through a sequence of  &#8216;a&#8217; until it hits &#8216;b&#8217; and when this fails, jumps back and checks out the other branch. There was no backtracking involved, also no backtracking with memoization. There was simply never any jump. Instead both branches are traversed simultaneously until they become distinct. It&#8217;s easy to express this on grammar level by applying left factoring to the rule
<pre>R: a* ( b | c )</pre>
However there was never any rule transformation to simplify the problem.</p>
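<p>The simultaneous traversal of both branches can be sketched as a plain state-set simulation. The state names and the dictionary encoding of the NFA below are assumptions made for the example; the point is only that every input symbol is read once and no branch is ever abandoned and retried.</p>

```python
# Assumed NFA encoding for  R: a* b | a* c
# state -> {symbol: set of follow states}
NFA = {
    "branch_b": {"a": {"branch_b"}, "b": {"accept"}},  # a* b
    "branch_c": {"a": {"branch_c"}, "c": {"accept"}},  # a* c
    "accept": {},
}
START = {"branch_b", "branch_c"}


def run(text):
    """Advance all live states in lockstep; return (accepted, trace)."""
    current = set(START)
    trace = [set(current)]
    for symbol in text:
        current = {
            follow
            for state in current
            for follow in NFA[state].get(symbol, ())
        }
        trace.append(set(current))
        if not current:
            break  # no state can continue: the input is rejected
    return "accept" in current, trace


accepted, trace = run("aac")
print(accepted)
for step in trace:
    print(sorted(step))
```

<p>While only &#8216;a&#8217; is read, both branches stay in the current set. They become distinct at &#8216;b&#8217; or &#8216;c&#8217;, where one of them drops out by itself.</p>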

<h3>From rules to grammars</h3>

<p>It&#8217;s actually an old approach to regular expression matching which is  attributed to Ken Thompson. Russ Cox refreshed the collective memory  about it a few years ago. This approach never seemed to make the transition from regular expressions to context free grammars &#8211; or it did and was given up again, I don&#8217;t know. I wanted a parser generator based on the algorithms I worked out for parse tree checkers. So I had to invent a conflict resolution strategy which is specific for CFGs. Take the following grammar
<pre>R: A | B
A: a* c
B: a* d</pre>
Again we have two branches, marked by the names of the non-terminals <span style="font-family: Courier New,Courier,monospace;">A</span> and <span style="font-family: Courier New,Courier,monospace;">B</span> and we want to decide late which one to choose.</p>

<p>First we turn the grammar into a regular expression:
<pre>R: a* c | a* d</pre>
but now we have lost context/structural information which needs to be somehow added:
<pre>R: a* c &lt;A&gt;| a* d &lt;B&gt;</pre>
The symbols &lt;A&gt; and &lt;B&gt; do not match a character or token. They merely represent the rules which <em>would have been used</em> when the matching algorithm scans beyond &#8216;c&#8217; or &#8216;d&#8217;. So once the scanner enters &lt;A&gt; it is finally decided that rule A was used. The same is true for &lt;B&gt;. Our example grammar is LL(*) and in order to figure out whether A or B is used we need, in principle at least, infinite lookahead. This hasn&#8217;t been changed by rule embedding but now we can deal with the LL(*) grammar <em>as-if</em> it was an LL(1) grammar + a small context marker.</p>

<h3>Reconstruction</h3>

<p>What is lacking in the above representation is information about the precise scope of A and B once they are embedded into R. We rewrite the grammar slightly by indexing each of the symbols on the RHS of a rule by the name of the rule:
<pre>R: A[R] | B[R]
A: a[A]* c[A]
B: a[B]* d[B]</pre>
Now we can embed A and B into R while being able to preserve the context:
<pre>R: a[A]* c[A] &lt;A[R]&gt;| a[B]* d[B] &lt;B[R]&gt;</pre>
Matching now the string <strong>aad</strong> yields the following sequence of sets of matching symbols:
<pre>{a[A], a[B]}, {a[A], a[B]}, {d[B]}, {&lt;B[R]&gt;}</pre>
All of the indexed symbols in a set match the same symbol. The index used has no impact on the matching behavior, so a[X], a[Y], &#8230; will always match <strong>a</strong>.</p>

<p>Constructing a parse tree from the above set-sequence is done by reading the sequence from right to left and interpreting it appropriately.</p>

<p>We start the interpretation by translating the rightmost symbol
<pre>&lt;B[R]&gt; -&gt; [R,[B, .]]
</pre>
The  dot &#8216;.&#8217; is a placeholder for a sequence of symbols indexed with B. It remains adjacent to B and is removed when the construction is completed:
<pre>[R, [B, .]]
</pre>
<pre>[R, [B, ., d]]
</pre>
<pre>[R, [B, ., a, d]]</pre>
<pre>[R, [B, ., a, a, d]]
</pre>
<pre>[R, [B, a, a, d]]</pre></p>
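<p>The right-to-left reading can be sketched as follows. The encoding is an assumption made for the example: an indexed symbol a[B] is written as the tuple <span style="font-family: Courier New,Courier,monospace;">("a", "B")</span>, the marker &lt;B[R]&gt; as <span style="font-family: Courier New,Courier,monospace;">("B", "R")</span>, and the hypothetical function <span style="font-family: Courier New,Courier,monospace;">reconstruct</span> handles only a single level of embedding. Instead of keeping a dot placeholder it prepends each symbol directly.</p>

```python
# Assumed encoding of the set sequence for the input "aad":
#   a[A] -> ("a", "A"),  d[B] -> ("d", "B"),  <B[R]> -> ("B", "R")
trace = [
    {("a", "A"), ("a", "B")},
    {("a", "A"), ("a", "B")},
    {("d", "B")},
    {("B", "R")},
]


def reconstruct(trace):
    """Build the parse tree by reading the set sequence from right to left."""
    *symbol_sets, marker_set = trace
    rule, parent = next(iter(marker_set))  # the marker decides the rule
    children = []
    for symbols in reversed(symbol_sets):
        # Keep only the symbol indexed with the decided rule.
        (name,) = [n for (n, owner) in symbols if owner == rule]
        children.insert(0, name)
    return [parent, [rule] + children]


print(reconstruct(trace))
```

<p>The symbols indexed with A are simply ignored once the marker has decided for B, which is the late choice described above.</p>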

<h3>Drawbacks</h3>

<p>We can read the embedding process as <em>&#8216;embed rules A and B into R&#8217;</em> or dually <em>&#8216;expand R using rules A and B&#8217;</em>.  I&#8217;ve chosen the latter expression for the Trail parser generator because an <em>expanded rule R</em> has its own characteristics and is distinguished from an unexpanded rule.</p>

<p>The drawback of this method is that its implementation turns out to be rather complicated. It is also limited because it may run into cyclic embeddings which need to be detected. Finally, successive embeddings can blow up the expanded rule to an extent that it makes sense to artificially terminate the process and fall back to a more general and less efficient solution. So we have to mess with it. And isn&#8217;t there a performance penalty due to the process of reconstruction?</p>

<h3>Performance</h3>

<p>To my surprise I found that an LL(*) grammar that uses expansion quite heavily ( expanded NFAs are created with about 1000 states ) performs slightly better than a simple LL(1) grammar without any expansion in CPython. For comparison I used a conservative extension language P4D of Python i.e. a superset of Python: every string accepted by Python shall also be accepted by P4D.</p>

<p>In order to measure performance I created the following simple script</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">import</span> <span style="">time</span>
<span style="color: #000066;font-weight:bold;">import</span> <span style="">decimal</span>
<span style="color: #000066;font-weight:bold;">import</span> langscape
&nbsp;
text = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="">decimal</span>.__file__.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;.pyc&quot;</span>, <span style="color: #483d8b;">&quot;.py&quot;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
<span style="color: #000066;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;textlen&quot;</span>, <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>text<span style="color: black;">&#41;</span>
&nbsp;
python = langscape.<span style="color: black;">load_langlet</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;python&quot;</span><span style="color: black;">&#41;</span>
p4d = langscape.<span style="color: black;">load_langlet</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;p4d&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #000066;font-weight:bold;">def</span> <span style="">test</span><span style="color: black;">&#40;</span>langlet<span style="color: black;">&#41;</span>:
    tokens = langlet.<span style="">tokenize</span><span style="color: black;">&#40;</span>text<span style="color: black;">&#41;</span>
    a = <span style="">time</span>.<span style="">time</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">for</span> i <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">10</span><span style="color: black;">&#41;</span>:
        langlet.<span style="color: black;">parse</span><span style="color: black;">&#40;</span>tokens<span style="color: black;">&#41;</span>
        tokens.<span style="color: black;">reset</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;parser&quot;</span>, langlet.<span style="color: black;">config</span>.<span style="color: black;">langlet_name</span>, <span style="color: black;">&#40;</span><span style="">time</span>.<span style="">time</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> - a<span style="color: black;">&#41;</span>/<span style="color: #ff4500;">10</span>
&nbsp;
<span style="">test</span><span style="color: black;">&#40;</span>python<span style="color: black;">&#41;</span>
<span style="">test</span><span style="color: black;">&#40;</span>p4d<span style="color: black;">&#41;</span></pre></div></div>


<p>It imports a reasonably big Python module ( decimal.py ) and parses it with two different parsers generated by Trail. Running it using CPython 2.7 yields the following result:
<pre>parser python 2.39329998493
parser p4d 2.25759999752</pre>
This shows that P4D is about 5% faster on average! Of course the overall performance is abysmal, but keep in mind that the parser is a pure Python prototype implementation and I&#8217;m mainly interested in qualitative results and algorithms at this point.</p>

<p>I&#8217;ve also checked out the script with PyPy, both with activated and deactivated JIT.</p>

<p>PyPy with option --jit off:
<pre>parser python 6.5631000042
parser p4d 5.66440000534</pre>
Now the LL(*) parser of P4D is about 13-14 % faster than the LL(1) parser, which is a much clearer difference. Activating the JIT reverses the pattern though, and intense caching of function calls pays off:</p>

<p>PyPy with JIT:
<pre>parser python 0.791500020027
parser p4d 1.06089999676</pre>
Here the Python parser is about 1/3 faster than the P4D parser.</p>
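<p>The relative differences quoted for the three runs can be recomputed from the timings listed above. The helper <span style="font-family: Courier New,Courier,monospace;">compare</span> is hypothetical and reports two measures, since &#8220;faster&#8221; can be read either as time saved relative to the slower parser or as the ratio of the two times.</p>

```python
# Timings in seconds per parse, copied from the results above:
# (python langlet, p4d langlet)
timings = {
    "CPython 2.7": (2.39329998493, 2.25759999752),
    "PyPy, JIT off": (6.5631000042, 5.66440000534),
    "PyPy, JIT on": (0.791500020027, 1.06089999676),
}


def compare(python_time, p4d_time):
    """Return (winner, fraction of time saved, slow/fast ratio)."""
    slow, fast = max(python_time, p4d_time), min(python_time, p4d_time)
    winner = "p4d" if p4d_time < python_time else "python"
    return winner, (slow - fast) / slow, slow / fast


for runtime, (python_time, p4d_time) in timings.items():
    winner, saved, ratio = compare(python_time, p4d_time)
    print("%-14s %-6s saves %.1f%% of the time, ratio %.2f"
          % (runtime, winner, 100 * saved, ratio))
```

<p>Read as a ratio of the two times, the JIT run is the one that corresponds to the &#8220;about 1/3&#8221; figure.</p>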

<p>The result of the competition depends on the particular implementation and the compiler/runtime optimizations or the lack thereof. The counter-intuitive result that an LL(*) parser is faster than an LL(1) parser could not be stabilized but also not clearly refuted. It&#8217;s still an interesting hypothesis though and rule expansion may turn out to be a valid optimization technique &#8211; also for LL(1) parsers which do not require it as a conflict resolution strategy. I will examine this in greater detail once I&#8217;ve implemented an ANSI C version of Trail.</p>
]]></content:encoded>
			<wfw:commentRss>http://fiber-space.de/wordpress/2011/09/12/ll-faster-than-ll1/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Spirits &#8211; for M.</title>
		<link>http://fiber-space.de/wordpress/2011/04/26/spirits-for-m/</link>
		<comments>http://fiber-space.de/wordpress/2011/04/26/spirits-for-m/#comments</comments>
		<pubDate>Tue, 26 Apr 2011 20:28:13 +0000</pubDate>
		<dc:creator>kay</dc:creator>
				<category><![CDATA[Vision]]></category>

		<guid isPermaLink="false">http://fiber-space.de/wordpress/?p=1824</guid>
		<description><![CDATA[]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"></p>

<div class="wp-caption alignnone" style="width: 544px"><img title="Das Haus des Gärtners" src="http://www.fiber-space.de/misc/Das Haus des Gärtners.jpg" alt="Das Haus des Gärtners" width="534" height="352" /><p class="wp-caption-text">Ledoux - Das Haus des Gärtners</p></div>

<div class="wp-caption alignnone" style="width: 545px"><img title="Erdfunkstelle Raisting" src="http://www.fiber-space.de/misc/Erdfunkstelle Raisting.JPG" alt="Erdfunkstelle Raisting" width="535" height="401" /><p class="wp-caption-text">Erdfunkstelle Raisting</p></div>

<p style="text-align: center;"></p>
]]></content:encoded>
			<wfw:commentRss>http://fiber-space.de/wordpress/2011/04/26/spirits-for-m/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Maptrackers</title>
		<link>http://fiber-space.de/wordpress/2011/04/12/maptrackers/</link>
		<comments>http://fiber-space.de/wordpress/2011/04/12/maptrackers/#comments</comments>
		<pubDate>Tue, 12 Apr 2011 02:41:08 +0000</pubDate>
		<dc:creator>kay</dc:creator>
				<category><![CDATA[Algorithms]]></category>

		<guid isPermaLink="false">http://fiber-space.de/wordpress/?p=1719</guid>
		<description><![CDATA[From graphs to maps A Maptracker is a special backtracking algorithm used to check the equivalence of certain maps which can be represented as connected, directed graphs or finite state machines. It shall be described in this article. The original motivation was to find an algorithm for reconstruction of grammars from finite state-machines with the [...]]]></description>
			<content:encoded><![CDATA[<h3>From graphs to maps</h3>

<p>A <em>Maptracker</em> is a special backtracking algorithm used to check the equivalence of certain maps which can be represented as connected, directed graphs or finite state machines. It shall be described in this article.</p>

<p>The original motivation was to find an algorithm for reconstruction of grammars from finite state-machines with the following property: suppose you have a state-machine <span style="font-family: Courier New,Courier,monospace;">M0</span> and a function P which turns <span style="font-family: Courier New,Courier,monospace;">M0</span> into a grammar rule: <span style="font-family: Courier New,Courier,monospace;">G = P(M0)</span>. When we translate G back again into a state-machine we get <span style="font-family: Courier New,Courier,monospace;">M1 = T(P(M0))</span>. Generally <span style="font-family: Courier New,Courier,monospace;">T o P != Id</span> and <span style="font-family: Courier New,Courier,monospace;">M0 != M1</span>. But how different are <span style="font-family: Courier New,Courier,monospace;">M0</span> and <span style="font-family: Courier New,Courier,monospace;">M1</span> actually?</p>

<p><img class="alignnone" title="MaptrackGraphs" src="http://www.fiber-space.de/misc/MaptrackGraphs.PNG" alt="" width="474" height="610" /></p>

<p>Consider the two graphs GR1 and GR2 above. When we abstract from their particular drawings and focus only on the nodes and edges, we can describe them using the following dictionaries:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">GR1 = <span style="color: black;">&#123;</span><span style="color: #ff4500;">0</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">3</span>, <span style="color: #ff4500;">4</span>, -<span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">7</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">1</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">2</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">3</span>, <span style="color: #ff4500;">4</span>, -<span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">7</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">3</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">4</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">6</span>, <span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">5</span>, -<span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">7</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">5</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">5</span>, <span style="color: #ff4500;">6</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">6</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">4</span>, -<span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">7</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">7</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: black;">&#125;</span></pre></div></div>



<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">GR2 = <span style="color: black;">&#123;</span><span style="color: #ff4500;">0</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">3</span>, <span style="color: #ff4500;">6</span>, -<span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">7</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">1</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">2</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">3</span>, <span style="color: #ff4500;">6</span>, -<span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">7</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">3</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">6</span>, <span style="color: #ff4500;">3</span>, <span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">5</span>, -<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">4</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">5</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">5</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span>, <span style="color: #ff4500;">6</span>, -<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">6</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
 <span style="color: #ff4500;">7</span>: <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: black;">&#125;</span></pre></div></div>


<p>A pair <span style="font-family: Courier New,Courier,monospace;">i: [j1, j2, ... jn]</span> describes the set of edges i -&gt; j1,  i -&gt; j2, &#8230;, i -&gt; jn.</p>

<h3>Checking for equivalence</h3>

<p>We say that GR1 and GR2 are <em>equivalent</em> if there is a permutation <span style="font-family: Courier New,Courier,monospace;">P</span> of <span style="font-family: Courier New,Courier,monospace;">{-1, 0, 1, &#8230;, 7}</span> such that</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">GR2 == <span style="color: #008000;">dict</span><span style="color: black;">&#40;</span> <span style="color: black;">&#40;</span>P<span style="color: black;">&#40;</span>key<span style="color: black;">&#41;</span>, <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: #008000;">map</span><span style="color: black;">&#40;</span>P, value<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <span style="color: #000066;font-weight:bold;">for</span> <span style="color: black;">&#40;</span>key, value<span style="color: black;">&#41;</span> <span style="color: #000066;font-weight:bold;">in</span> GR1.<span style="color: black;">items</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span></pre></div></div>


<p><span style="font-family: Courier New,Courier,monospace;">Maptracker</span> is merely a cute name for an algorithm which constructs the permutation <span style="font-family: Courier New,Courier,monospace;">P</span> from map representations of the kind <span style="font-family: Courier New,Courier,monospace;">GR1</span> and <span style="font-family: Courier New,Courier,monospace;">GR2</span>. <span style="font-family: Courier New,Courier,monospace;">P</span> itself will be described as a dictionary. Since the value <span style="font-family: Courier New,Courier,monospace;">-1</span> is a fixed point it will be omitted:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">class</span> Maptracker<span style="color: black;">&#40;</span><span style="color: #008000;">object</span><span style="color: black;">&#41;</span>:
    <span style="color: #000066;font-weight:bold;">def</span> <span style="">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, gr1, gr2<span style="color: black;">&#41;</span>:
        <span style="color: #008000;">self</span>.<span style="color: black;">gr1</span> = gr1
        <span style="color: #008000;">self</span>.<span style="color: black;">gr2</span> = gr2
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> accept<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, value, stack<span style="color: black;">&#41;</span>:
        e1, e2 = value  <span style="color: #808080; font-style: italic;"># e1 -&gt; e2</span>
        V1 = <span style="color: #008000;">self</span>.<span style="color: black;">gr1</span><span style="color: black;">&#91;</span>e1<span style="color: black;">&#93;</span>
        V2 = <span style="color: #008000;">self</span>.<span style="color: black;">gr2</span><span style="color: black;">&#91;</span>e2<span style="color: black;">&#93;</span>
        <span style="color: #808080; font-style: italic;">#</span>
        <span style="color: #808080; font-style: italic;"># e1 -&gt; e2 =&gt; v1 -&gt; v2</span>
        <span style="color: #808080; font-style: italic;">#</span>
        <span style="color: #808080; font-style: italic;"># check consistency of the choice of the mapping</span>
        <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>V1<span style="color: black;">&#41;</span><span style="color: #306f30;">!</span>=<span style="color: #008000;">len</span><span style="color: black;">&#40;</span>V2<span style="color: black;">&#41;</span>:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
        m = <span style="color: #008000;">dict</span><span style="color: black;">&#40;</span>p <span style="color: #000066;font-weight:bold;">for</span> <span style="color: black;">&#40;</span>p,q<span style="color: black;">&#41;</span> <span style="color: #000066;font-weight:bold;">in</span> stack<span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">if</span> e2 <span style="color: #000066;font-weight:bold;">in</span> m.<span style="color: black;">values</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
        <span style="color: #000066;font-weight:bold;">for</span> v1 <span style="color: #000066;font-weight:bold;">in</span> V1:
            <span style="color: #000066;font-weight:bold;">if</span> v1 == e1:
                <span style="color: #000066;font-weight:bold;">if</span> e2 <span style="color: #000066;font-weight:bold;">not</span> <span style="color: #000066;font-weight:bold;">in</span> V2:
                    <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
            <span style="color: #000066;font-weight:bold;">if</span> v1 <span style="color: #000066;font-weight:bold;">in</span> m:
                <span style="color: #000066;font-weight:bold;">if</span> m<span style="color: black;">&#91;</span>v1<span style="color: black;">&#93;</span> <span style="color: #000066;font-weight:bold;">not</span> <span style="color: #000066;font-weight:bold;">in</span> V2:
                    <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
        <span style="color: #000066;font-weight:bold;">for</span> s <span style="color: #000066;font-weight:bold;">in</span> m:
            <span style="color: #000066;font-weight:bold;">if</span> e1 <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">gr1</span><span style="color: black;">&#91;</span>s<span style="color: black;">&#93;</span>:
                <span style="color: #000066;font-weight:bold;">if</span> e2 <span style="color: #000066;font-weight:bold;">not</span> <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">gr2</span><span style="color: black;">&#91;</span>m<span style="color: black;">&#91;</span>s<span style="color: black;">&#93;</span><span style="color: black;">&#93;</span>:
                    <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> run<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        stack = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
        <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">gr1</span><span style="color: black;">&#41;</span> <span style="color: #306f30;">!</span>= <span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">gr2</span><span style="color: black;">&#41;</span>:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
        sig1 = <span style="color: #008000;">sorted</span><span style="color: black;">&#40;</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span>v<span style="color: black;">&#41;</span> <span style="color: #000066;font-weight:bold;">for</span> v <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">gr1</span>.<span style="color: black;">values</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        sig2 = <span style="color: #008000;">sorted</span><span style="color: black;">&#40;</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span>v<span style="color: black;">&#41;</span> <span style="color: #000066;font-weight:bold;">for</span> v <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">gr2</span>.<span style="color: black;">values</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">if</span> sig1<span style="color: #306f30;">!</span>=sig2:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
&nbsp;
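        <span style="color: #808080; font-style: italic;"># lists can be indexed below; the key views of Python 3 cannot</span>
        L1 = <span style="color: #008000;">list</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">gr1</span>.<span style="color: black;">keys</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        L2 = <span style="color: #008000;">list</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">gr2</span>.<span style="color: black;">keys</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>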
        i = j = <span style="color: #ff4500;">0</span>
        <span style="color: #000066;font-weight:bold;">while</span> i<span style="color: #306f30;">&lt;</span>len<span style="color: black;">&#40;</span>L1<span style="color: black;">&#41;</span>:
            e1 = L1<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span>
            <span style="color: #000066;font-weight:bold;">while</span> j<span style="color: #306f30;">&lt;</span>len<span style="color: black;">&#40;</span>L2<span style="color: black;">&#41;</span>:
                e2 = L2<span style="color: black;">&#91;</span>j<span style="color: black;">&#93;</span>
                <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #008000;">self</span>.<span style="color: black;">accept</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>e1,e2<span style="color: black;">&#41;</span>,stack<span style="color: black;">&#41;</span>:
                    stack.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>e1,e2<span style="color: black;">&#41;</span>,<span style="color: black;">&#40;</span>i,j<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                    j = <span style="color: #ff4500;">0</span>
                    <span style="color: #000066;font-weight:bold;">break</span>
                j+=<span style="color: #ff4500;">1</span>
            <span style="color: #000066;font-weight:bold;">else</span>:
                <span style="color: #000066;font-weight:bold;">if</span> stack:
                    _, <span style="color: black;">&#40;</span>i,j<span style="color: black;">&#41;</span> = stack.<span style="color: black;">pop</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                    <span style="color: #000066;font-weight:bold;">if</span> j == -<span style="color: #ff4500;">1</span>:
                        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
                    j+=<span style="color: #ff4500;">1</span>
                    <span style="color: #000066;font-weight:bold;">continue</span>
                <span style="color: #000066;font-weight:bold;">else</span>:
                    <span style="color: #000066;font-weight:bold;">return</span> <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
            i+=<span style="color: #ff4500;">1</span>
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">dict</span><span style="color: black;">&#40;</span>elem<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> <span style="color: #000066;font-weight:bold;">for</span> elem <span style="color: #000066;font-weight:bold;">in</span> stack<span style="color: black;">&#41;</span></pre></div></div>


<p>If no permutation could be constructed an empty dictionary <span style="font-family: Courier New,Courier,monospace;">{}</span> is returned.</p>

<p>Let&#8217;s look at the dict which the Maptracker computes for <span style="font-family: Courier New,Courier,monospace;">GR1</span> and <span style="font-family: Courier New,Courier,monospace;">GR2</span>:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #306f30;">&gt;&gt;&gt;</span> M = Maptracker<span style="color: black;">&#40;</span>GR1, GR2<span style="color: black;">&#41;</span>.<span style="color: black;">run</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
<span style="color: #306f30;">&gt;&gt;&gt;</span> M
<span style="color: black;">&#123;</span><span style="color: #ff4500;">0</span>: <span style="color: #ff4500;">0</span>, <span style="color: #ff4500;">1</span>: <span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">2</span>: <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">3</span>: <span style="color: #ff4500;">7</span>, <span style="color: #ff4500;">4</span>: <span style="color: #ff4500;">3</span>, <span style="color: #ff4500;">5</span>: <span style="color: #ff4500;">4</span>, <span style="color: #ff4500;">6</span>: <span style="color: #ff4500;">5</span>, <span style="color: #ff4500;">7</span>: <span style="color: #ff4500;">6</span><span style="color: black;">&#125;</span></pre></div></div>


<p>We can check the correctness of <span style="font-family: Courier New,Courier,monospace;">M</span> manually or by setting</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">P = <span style="color: #000066;font-weight:bold;">lambda</span> k: <span style="color: black;">&#40;</span>-<span style="color: #ff4500;">1</span> <span style="color: #000066;font-weight:bold;">if</span> k == -<span style="color: #ff4500;">1</span> <span style="color: #000066;font-weight:bold;">else</span> M<span style="color: black;">&#91;</span>k<span style="color: black;">&#93;</span> <span style="color: black;">&#41;</span></pre></div></div>


<p>and checking the equality defined above.</p>
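
<p>The check can be spelled out as follows. This is only a minimal sketch: the helper name <span style="font-family: Courier New,Courier,monospace;">is_equivalence</span> is made up for this example, the graphs are written with set literals, and <span style="font-family: Courier New,Courier,monospace;">M</span> is simply copied from the output shown above rather than recomputed:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">GR1 = {0: {1, 3, 4, -1, 7}, 1: {2}, 2: {1, 3, 4, -1, 7}, 3: {-1},
       4: {6, 4, 5, -1, 7}, 5: {5, 6}, 6: {4, -1, 7}, 7: {-1}}
GR2 = {0: {1, 3, 6, -1, 7}, 1: {2}, 2: {1, 3, 6, -1, 7}, 3: {6, 3, 4, 5, -1},
       4: {4, 5}, 5: {3, 6, -1}, 6: {-1}, 7: {-1}}

# The mapping reported by the Maptracker above; -1 is the omitted fixed point.
M = {0: 0, 1: 1, 2: 2, 3: 7, 4: 3, 5: 4, 6: 5, 7: 6}


def is_equivalence(gr1, gr2, m):
    """Return True if m, with -1 mapped to itself, turns gr1 into gr2."""
    def P(k):
        return -1 if k == -1 else m[k]

    # m has to map the nodes of gr1 one-to-one onto the nodes of gr2
    if sorted(m) != sorted(gr1) or sorted(m.values()) != sorted(gr2):
        return False
    # the equality from the definition of equivalence
    return gr2 == {P(key): {P(v) for v in value} for key, value in gr1.items()}


if __name__ == "__main__":
    print(is_equivalence(GR1, GR2, M))
</pre></div></div>

<p>An empty dictionary, which the Maptracker returns on failure, is rejected by the first test because it does not cover the nodes of <span style="font-family: Courier New,Courier,monospace;">GR1</span>.</p>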
]]></content:encoded>
			<wfw:commentRss>http://fiber-space.de/wordpress/2011/04/12/maptrackers/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Patching tracebacks</title>
		<link>http://fiber-space.de/wordpress/2011/04/11/patching-tracebacks/</link>
		<comments>http://fiber-space.de/wordpress/2011/04/11/patching-tracebacks/#comments</comments>
		<pubDate>Mon, 11 Apr 2011 05:13:53 +0000</pubDate>
		<dc:creator>kay</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Langscape]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://fiber-space.de/wordpress/?p=1791</guid>
		<description><![CDATA[One of the problems I ran into early when working on EasyExtend (and later on Langscape) was getting error messages from code execution which were not corrupt. The situation is easily explained: you have a program P written in some language L. There is a source-to-source translation of P into another program [...]]]></description>
			<content:encoded><![CDATA[<p>One of the problems I ran into early when working on EasyExtend (and later on Langscape) was getting error messages from code execution which were not corrupt.</p>

<p>The situation is easily explained: you have a program <span style="font-family: Courier New,Courier,monospace;">P</span> written in some language <span style="font-family: Courier New,Courier,monospace;">L</span>. There is a source-to-source translation of <span style="font-family: Courier New,Courier,monospace;">P</span> into another program <span style="font-family: Courier New,Courier,monospace;">Q</span> of a target language, preferably Python. So you write <span style="font-family: Courier New,Courier,monospace;">P</span> but the code which is executed is <span style="font-family: Courier New,Courier,monospace;">Q</span>. When <span style="font-family: Courier New,Courier,monospace;">Q</span> fails Python produces a traceback. A traceback is a stack of execution frames &#8211; a snapshot of the current computation &#8211; and all we need to know here is that each frame holds data about the file, the function, and the line which is executed. This is all you need to generate a stacktrace message such as:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">Traceback <span style="color: black;">&#40;</span>most recent call last<span style="color: black;">&#41;</span>:
  File <span style="color: #483d8b;">&quot;tests<span style="color: #000099; font-weight: bold;">\p</span>ythonexpr.py&quot;</span>, line <span style="color: #ff4500;">12</span>, <span style="color: #000066;font-weight:bold;">in</span>
    bar<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
  File <span style="color: #483d8b;">&quot;tests<span style="color: #000099; font-weight: bold;">\p</span>ythonexpr.py&quot;</span>, line <span style="color: #ff4500;">10</span>, <span style="color: #000066;font-weight:bold;">in</span> bar
    foo<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
  File <span style="color: #483d8b;">&quot;tests<span style="color: #000099; font-weight: bold;">\p</span>ythonexpr.py&quot;</span>, line <span style="color: #ff4500;">5</span>, <span style="color: #000066;font-weight:bold;">in</span> foo
    b.<span style="color: black;">p</span>,
<span style="color: #FF0000;font-weight:bold;">NameError</span>: <span style="color: #000066;font-weight:bold;">global</span> name <span style="color: #483d8b;">'b'</span> <span style="color: #000066;font-weight:bold;">is</span> <span style="color: #000066;font-weight:bold;">not</span> defined</pre></div></div>


<p>The problem here is that line number information in the frames is from <span style="font-family: Courier New,Courier,monospace;">Q</span> whereas the file and the lines being displayed are from <span style="font-family: Courier New,Courier,monospace;">P</span> &#8211; the only file there is!</p>

<h3>Hacking line numbers into parse trees</h3>

<p>My first attempt to fix the problem of wrong line information (I worked with Python 2.4 at that time and am unaware of changes in later versions of Python) was to manipulate <span style="font-family: Courier New,Courier,monospace;">Q</span>, or rather the parse tree corresponding to <span style="font-family: Courier New,Courier,monospace;">Q</span>, which I updated with the line numbers I expected. When the line numbers in <span style="font-family: Courier New,Courier,monospace;">Q</span> did not grow monotonically, using CPython&#8217;s internal line number table, <span style="font-family: Courier New,Courier,monospace;">lnotab</span>, failed to assign line numbers correctly. Furthermore the CPython compiler has the habit of ignoring some line information and reconstructing it, so you cannot be sure that your own won&#8217;t be overwritten. It seems there is some kind of hacking prevention built into the compiler, and I gave up on that problem for a couple of years.</p>

<h3>From token streams to string patterns</h3>

<p>Recently I started to try out another idea. For code which is not optimized or obfuscated and preserves name, number and string information in a quantitative way (turning some statements of <span style="font-family: Courier New,Courier,monospace;">P</span> into expressions in <span style="font-family: Courier New,Courier,monospace;">Q</span> or vice versa, breaking <span style="font-family: Courier New,Courier,monospace;">P</span> tokens into token sequences in <span style="font-family: Courier New,Courier,monospace;">Q</span>, etc.) we can try the following construction. Let
<pre>T(V, Q) = {T in TS_Q | V = T.Value}</pre>
be the set of tokens in the token stream <span style="font-family: Courier New,Courier,monospace;">TS_Q</span> of <span style="font-family: Courier New,Courier,monospace;">Q</span> with the prescribed token value <span style="font-family: Courier New,Courier,monospace;">V</span>. Analogously we can build a set <span style="font-family: Courier New,Courier,monospace;">T(V, P)</span> for <span style="font-family: Courier New,Courier,monospace;">P</span> and <span style="font-family: Courier New,Courier,monospace;">TS_P</span>.</p>
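
<p>As a rough illustration of the set <span style="font-family: Courier New,Courier,monospace;">T(V, Q)</span>, here is a minimal sketch. It assumes that <span style="font-family: Courier New,Courier,monospace;">Q</span> is plain Python source and uses the <span style="font-family: Courier New,Courier,monospace;">tokenize</span> module of the standard library; the function name <span style="font-family: Courier New,Courier,monospace;">tokens_with_value</span> is invented for this example and is not taken from Langscape:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">import io
import tokenize


def tokens_with_value(source, value):
    """Return all tokens of source whose string equals value."""
    readline = io.StringIO(source).readline
    return [tok for tok in tokenize.generate_tokens(readline)
            if tok.string == value]


if __name__ == "__main__":
    Q = "a = b.p\nb = a\n"
    for tok in tokens_with_value(Q, "b"):
        # tok.start is a (line, column) pair
        print(tokenize.tok_name[tok.type], tok.string, tok.start[0])
</pre></div></div>

<p>Each token carries its own line number, so the same function applied to <span style="font-family: Courier New,Courier,monospace;">P</span> would yield the candidates on the other side of the mapping, provided a tokenizer for the language of <span style="font-family: Courier New,Courier,monospace;">P</span> is available.</p>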

<p>The basic idea is now to construct a mapping between <span style="font-family: Courier New,Courier,monospace;">T(V,Q)</span>  and  <span style="font-family: Courier New,Courier,monospace;">T(V,P)</span>. In order to get the value <span style="font-family: Courier New,Courier,monospace;">V</span> we examine the byte code of a traceback frame up to the last executed instruction <span style="font-family: Courier New,Courier,monospace;">f_lasti</span>. We assume that executing <span style="font-family: Courier New,Courier,monospace;">f_lasti</span> leads to the error. Now the instruction may not be coupled to a particular name, so we examine <span style="font-family: Courier New,Courier,monospace;">f_lasti</span> or the last instruction preceding <span style="font-family: Courier New,Courier,monospace;">f_lasti</span> for which the instruction type is in the set
<pre>{LOAD_CONST, LOAD_FAST, LOAD_NAME, LOAD_ATTR, LOAD_GLOBAL, IMPORT_NAME }</pre>
From the value related to one of those instructions, the type of the value, which is one of {<span style="font-family: Courier New,Courier,monospace;">NAME</span>, <span style="font-family: Courier New,Courier,monospace;">STRING</span>, <span style="font-family: Courier New,Courier,monospace;">NUMBER</span>}, and the execution line <span style="font-family: Courier New,Courier,monospace;">f_lineno</span> we create a new token <span style="font-family: Courier New,Courier,monospace;">T_q = [tokentype, value, lineno]</span>. For <span style="font-family: Courier New,Courier,monospace;">V</span> we set <span style="font-family: Courier New,Courier,monospace;">V = value</span>. Actually things are a little more complicated because the dedicated token <span style="font-family: Courier New,Courier,monospace;">T_q</span> and the line in <span style="font-family: Courier New,Courier,monospace;">Q</span> are not necessarily in a 1-1 relationship. There might in fact be <span style="font-family: Courier New,Courier,monospace;">n</span>&gt;<span style="font-family: Courier New,Courier,monospace;">1</span> tokens equal to <span style="font-family: Courier New,Courier,monospace;">T_q</span> which originate from different lines in <span style="font-family: Courier New,Courier,monospace;">P</span>. So let <span style="font-family: Courier New,Courier,monospace;">Token</span> be the list of all tokens we found on line <span style="font-family: Courier New,Courier,monospace;">T_q.Line</span> and <span style="font-family: Courier New,Courier,monospace;">k = Token.count(T_q)</span>. We add <span style="font-family: Courier New,Courier,monospace;">k</span> to the data characterizing <span style="font-family: Courier New,Courier,monospace;">T_q</span>.</p>
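
<p>The search for the relevant instruction might be sketched like this. The sketch assumes a current Python 3, where <span style="font-family: Courier New,Courier,monospace;">dis.get_instructions</span> is available, and reads the offset from the traceback entry (<span style="font-family: Courier New,Courier,monospace;">tb_lasti</span>). The names <span style="font-family: Courier New,Courier,monospace;">LOAD_OPS</span>, <span style="font-family: Courier New,Courier,monospace;">last_load</span> and <span style="font-family: Courier New,Courier,monospace;">failing_load</span> are invented for the example, and opcode names vary between CPython versions, so the set may need adjustment:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">import dis

# Instruction types named in the text; an assumption for newer CPython versions.
LOAD_OPS = {"LOAD_CONST", "LOAD_FAST", "LOAD_NAME",
            "LOAD_ATTR", "LOAD_GLOBAL", "IMPORT_NAME"}


def last_load(code, lasti):
    """Return the last instruction from LOAD_OPS at or before offset lasti."""
    found = None
    for ins in dis.get_instructions(code):
        if ins.offset not in range(lasti + 1):
            break  # we are past the instruction that failed
        if ins.opname in LOAD_OPS:
            found = ins
    return found


def failing_load(tb):
    """Walk to the innermost traceback entry and examine its byte code."""
    while tb.tb_next is not None:
        tb = tb.tb_next
    return last_load(tb.tb_frame.f_code, tb.tb_lasti)
</pre></div></div>

<p>The attribute <span style="font-family: Courier New,Courier,monospace;">argval</span> of the returned instruction holds the name or constant which serves as <span style="font-family: Courier New,Courier,monospace;">value</span>. If no matching instruction precedes the failure, <span style="font-family: Courier New,Courier,monospace;">None</span> is returned.</p>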

<p>Assume we have found <span style="font-family: Courier New,Courier,monospace;">T_q</span>. The map we want to build is <span style="font-family: Courier New,Courier,monospace;">T(T_q.Value, Q)</span> -&gt; <span style="font-family: Courier New,Courier,monospace;">T(T_q.Value, P)</span>. How can we do that?</p>

<p>In the first step we assign a character to each token in the stream <span style="font-family: Courier New,Courier,monospace;">TS_Q</span> and turn <span style="font-family: Courier New,Courier,monospace;">TS_Q</span> into a string <span style="font-family: Courier New,Courier,monospace;">S_TS_Q</span>. The choice of character is arbitrary, but the relationship between characters and token values shall be 1-1. Among the mappings is <span style="font-family: Courier New,Courier,monospace;">T_q</span> -&gt; <span style="font-family: Courier New,Courier,monospace;">c</span>. For each <span style="font-family: Courier New,Courier,monospace;">T</span> in <span style="font-family: Courier New,Courier,monospace;">T(T_q.Value, Q)</span> we then determine a substring of <span style="font-family: Courier New,Courier,monospace;">S_TS_Q</span> with <span style="font-family: Courier New,Courier,monospace;">c</span> as a midpoint:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">    <span style="color: #000066;font-weight:bold;">class</span> Pattern:
        <span style="color: #000066;font-weight:bold;">def</span> <span style="">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, index, <span style="">token</span>, pattern<span style="color: black;">&#41;</span>:
            <span style="color: #008000;">self</span>.<span style="color: black;">index</span> = index
            <span style="color: #008000;">self</span>.<span style="">token</span> = <span style="">token</span>
            <span style="color: #008000;">self</span>.<span style="color: black;">pattern</span> = pattern
&nbsp;
    S_TS_Q = u<span style="color: #483d8b;">''</span>
    m = 0x30
    k = <span style="color: #ff4500;">5</span>
    tcmap = <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>    
&nbsp;
    <span style="color: #808080; font-style: italic;"># create a string from the token stream</span>
    <span style="color: #000066;font-weight:bold;">for</span> T <span style="color: #000066;font-weight:bold;">in</span> TS_Q:
        <span style="color: #000066;font-weight:bold;">if</span> T.<span style="color: black;">Value</span> <span style="color: #000066;font-weight:bold;">in</span> tcmap:
            S_TS_Q+=tcmap<span style="color: black;">&#91;</span>T.<span style="color: black;">Value</span><span style="color: black;">&#93;</span>
        <span style="color: #000066;font-weight:bold;">else</span>:
            s = <span style="color: #008000;">unichr</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#41;</span>
            tcmap<span style="color: black;">&#91;</span>T.<span style="color: black;">Value</span><span style="color: black;">&#93;</span> = s
            S_TS_Q+=s
            m+=<span style="color: #ff4500;">1</span>
&nbsp;
    <span style="color: #808080; font-style: italic;"># create string pattern</span>
    pattern = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
    <span style="color: #000066;font-weight:bold;">for</span> T <span style="color: #000066;font-weight:bold;">in</span> TVQ:
        n = TS_Q.<span style="color: black;">index</span><span style="color: black;">&#40;</span>T<span style="color: black;">&#41;</span>
        S = S_TS_Q<span style="color: black;">&#91;</span><span style="color: #008000;">max</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, n-k<span style="color: black;">&#41;</span>: n+k<span style="color: black;">&#93;</span>
        pattern.<span style="color: black;">append</span><span style="color: black;">&#40;</span>Pattern<span style="color: black;">&#40;</span>n, T, S<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span></pre></div></div>


<p>The same construction is used for the creation of target patterns from <span style="font-family: Courier New,Courier,monospace;">T(V, P)</span>. In that case we reuse the <span style="font-family: Courier New,Courier,monospace;">tcmap</span> dictionary built during the creation of the patterns from <span style="font-family: Courier New,Courier,monospace;">T(V, Q)</span>: when two tokens in <span style="font-family: Courier New,Courier,monospace;">TS_Q</span> and <span style="font-family: Courier New,Courier,monospace;">TS_P</span> have the same token value, the corresponding characters shall coincide.</p>
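<p>To make the shared mapping concrete, here is a small sketch in Python 3 (so <span style="font-family: Courier New,Courier,monospace;">chr</span> instead of <span style="font-family: Courier New,Courier,monospace;">unichr</span>). The functions <span style="font-family: Courier New,Courier,monospace;">encode</span> and <span style="font-family: Courier New,Courier,monospace;">window</span> and the two token value lists are invented for illustration; real token streams would carry type and line information as well.</p>

```python
def encode(values, tcmap, start=0x30):
    """Map each token value to one character, reusing entries already in tcmap."""
    out = []
    for v in values:
        if v not in tcmap:
            tcmap[v] = chr(start + len(tcmap))
        out.append(tcmap[v])
    return "".join(out)

def window(s, n, k=5):
    """Substring of s around position n, using the same slice as the article."""
    return s[max(0, n - k): n + k]

tcmap = {}
# hypothetical token values of Q (one line) and P (several lines)
Q = ["a", "=", "(", "b", ".", "p", ",", "0", ",", "b", ".", "p", ")"]
P = ["a", "=", "(", "b", ".", "p", ",", "NEWLINE", "0", ",", "b", ".", "p", ")"]
S_Q = encode(Q, tcmap)
S_P = encode(P, tcmap)   # shares tcmap, so equal values get equal characters
```

<p>Because both calls share one dictionary, the two strings differ only where the token streams differ, which is what makes an edit distance between their windows meaningful.</p>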

<h3>The token mapping matrix</h3>

<p>In the next step we create a distance matrix between the source and target string patterns.</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">    n = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>source_pattern<span style="color: black;">&#41;</span>
    m = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>target_pattern<span style="color: black;">&#41;</span>
    Rows = <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>n<span style="color: black;">&#41;</span>
    Cols = <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>m<span style="color: black;">&#41;</span>
    M = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
    <span style="color: #000066;font-weight:bold;">for</span> i <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>n<span style="color: black;">&#41;</span>:
        M.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: #306f30;">*</span>m<span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">for</span> i, SP <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">enumerate</span><span style="color: black;">&#40;</span>source_pattern<span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">for</span> j, TP <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">enumerate</span><span style="color: black;">&#40;</span>target_pattern<span style="color: black;">&#41;</span>:
            M<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span><span style="color: black;">&#91;</span>j<span style="color: black;">&#93;</span> = levenshtein<span style="color: black;">&#40;</span>SP.<span style="color: black;">pattern</span>, TP.<span style="color: black;">pattern</span><span style="color: black;">&#41;</span></pre></div></div>


<p>As the distance metric we use the edit distance, also known as the Levenshtein distance.</p>
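<p>The <span style="font-family: Courier New,Courier,monospace;">levenshtein</span> function is not shown in this article. A minimal sketch of the textbook dynamic programming version, with unit costs and not necessarily the implementation used in Langscape, could look like this:</p>

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = cur
    return prev[-1]
```

<p>It keeps only two rows of the table at a time, which is enough for the short windows compared here.</p>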

<p>Given that matrix we compute an index pair <span style="font-family: Courier New,Courier,monospace;">(I,J)</span> with <span style="font-family: Courier New,Courier,monospace;">M[I][J] = min{ M[i][j] | i in Rows and j in Cols}</span>. Our interpretation of <span style="font-family: Courier New,Courier,monospace;">(I,J)</span> is that we map <span style="font-family: Courier New,Courier,monospace;">source_pattern[I].token</span> onto <span style="font-family: Courier New,Courier,monospace;">target_pattern[J].token</span>. Since there is an I for which <span style="font-family: Courier New,Courier,monospace;">source_pattern[I].token == T_q</span>, the corresponding <span style="font-family: Courier New,Courier,monospace;">T_p = target_pattern[J].token</span> is exactly the token in <span style="font-family: Courier New,Courier,monospace;">P</span> we searched for.</p>

<p>The line in the current traceback is the line <span style="font-family: Courier New,Courier,monospace;">T_q.Line</span> of <span style="font-family: Courier New,Courier,monospace;">Q</span>. Now we have found <span style="font-family: Courier New,Courier,monospace;">T_p.Line</span> of <span style="font-family: Courier New,Courier,monospace;">P</span>, which is the corrected line that shall be displayed in the patched traceback. Let&#8217;s take a brief look at the index selection algorithm for which <span style="font-family: Courier New,Courier,monospace;">M</span> was prepared:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">while</span> <span style="color: #008000;">True</span>:
    k, I = <span style="color: #ff4500;">1000</span>, -<span style="color: #ff4500;">1</span>
    <span style="color: #000066;font-weight:bold;">if</span> n<span style="color: #306f30;">&gt;</span>m <span style="color: #000066;font-weight:bold;">and</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>Cols<span style="color: black;">&#41;</span> == <span style="color: #ff4500;">1</span>:
        <span style="color: #000066;font-weight:bold;">return</span> target_pattern<span style="color: black;">&#91;</span>Cols<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#93;</span>.<span style="">token</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span>
    <span style="color: #000066;font-weight:bold;">else</span>:
        <span style="color: #000066;font-weight:bold;">for</span> r <span style="color: #000066;font-weight:bold;">in</span> Rows:
            d = <span style="color: #008000;">min</span><span style="color: black;">&#40;</span>M<span style="color: black;">&#91;</span>r<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
            <span style="color: #000066;font-weight:bold;">if</span> d<span style="color: #306f30;">&lt;</span>k:
                k = d
                I = r
        J = M<span style="color: black;">&#91;</span>I<span style="color: black;">&#93;</span>.<span style="color: black;">index</span><span style="color: black;">&#40;</span>k<span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">for</span> row <span style="color: #000066;font-weight:bold;">in</span> M:
            row<span style="color: black;">&#91;</span>J<span style="color: black;">&#93;</span> = <span style="color: #ff4500;">100</span>
    SP = source_pattern<span style="color: black;">&#91;</span>I<span style="color: black;">&#93;</span>
    <span style="color: #000066;font-weight:bold;">if</span> SP.<span style="">token</span> == T_q:
        tok = target_pattern<span style="color: black;">&#91;</span>J<span style="color: black;">&#93;</span>.<span style="">token</span>
        <span style="color: #000066;font-weight:bold;">return</span> tok.<span style="color: black;">Line</span>
    <span style="color: #000066;font-weight:bold;">else</span>:
        Rows.<span style="color: black;">remove</span><span style="color: black;">&#40;</span>I<span style="color: black;">&#41;</span>
        Cols.<span style="color: black;">remove</span><span style="color: black;">&#40;</span>J<span style="color: black;">&#41;</span></pre></div></div>


<p>If there is only one column left, i.e. one token in <span style="font-family: Courier New,Courier,monospace;">T(V, P)</span>, its line will be chosen. Once the J-column has been selected we avoid re-selection by setting <span style="font-family: Courier New,Courier,monospace;">row[J] = 100</span> on each row. In fact it would suffice to consider only the rows left in <span style="font-family: Courier New,Courier,monospace;">Rows</span>.</p>
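<p>The same greedy idea can be written in a more compact, self-contained form. The following is a simplified variant and not the code above: it tracks the remaining rows and columns as sets instead of overwriting matrix entries with a sentinel, and it leaves out the single-column shortcut. The name <span style="font-family: Courier New,Courier,monospace;">greedy_match</span> is made up for this sketch.</p>

```python
def greedy_match(M, wanted_row):
    """Pair rows with columns by repeatedly taking the smallest remaining
    entry of M; return the column paired with wanted_row, or None."""
    rows = set(range(len(M)))
    cols = set(range(len(M[0]))) if M else set()
    while rows and cols:
        # ties are broken by the smallest row index, then column index
        d, i, j = min((M[i][j], i, j) for i in rows for j in cols)
        if i == wanted_row:
            return j
        rows.discard(i)
        cols.discard(j)
    return None

# rows are source patterns, columns are target patterns
M = [[3, 1],
     [2, 5]]
```

<p>With this matrix, row 0 grabs column 1 first because 1 is the global minimum, so row 1 ends up with column 0 even though 2 is not the smallest entry overall.</p>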

<h3>Example</h3>

<p>One example I modified over and over again for testing purposes was the following one:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">pythonexpr.<span style="color: black;">py</span> <span style="color: black;">&#91;</span>P<span style="color: black;">&#93;</span>
-----------------
<span style="color: #000066;font-weight:bold;">def</span> foo<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
    a = <span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;c&quot;</span>,
        <span style="color: #ff4500;">0</span>,
        <span style="color: black;">&#40;</span><span style="color: #000066;font-weight:bold;">lambda</span> x: <span style="color: #ff4500;">0</span>+<span style="color: black;">&#40;</span><span style="color: #000066;font-weight:bold;">lambda</span> y: y+<span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>,
        b.<span style="color: black;">p</span>,
        <span style="color: #ff4500;">0</span>,
        <span style="color: #ff4500;">1</span>/<span style="color: #ff4500;">0</span>,
        b.<span style="color: black;">p</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #000066;font-weight:bold;">def</span> bar<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
    foo<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
bar<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>


<p>You can comment out <span style="font-family: Courier New,Courier,monospace;">b.p</span>, turn a <span style="font-family: Courier New,Courier,monospace;">+</span> in the lambda expression into <span style="font-family: Courier New,Courier,monospace;">/</span> to provoke another ZeroDivisionError, and so on. This is interesting because when the program is parsed and transformed through Langscape and then unparsed I get</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: black;">&#91;</span>Q<span style="color: black;">&#93;</span>
---
<span style="color: #000066;font-weight:bold;">import</span> langscape<span style="color: #306f30;">;</span> __langlet__ = langscape.<span style="color: black;">load_langlet</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;python&quot;</span><span style="color: black;">&#41;</span>
<span style="color: #000066;font-weight:bold;">def</span> foo<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
    a = <span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;c&quot;</span>, <span style="color: #ff4500;">0</span>, <span style="color: black;">&#40;</span><span style="color: #000066;font-weight:bold;">lambda</span> x: <span style="color: #ff4500;">0</span>+<span style="color: black;">&#40;</span><span style="color: #000066;font-weight:bold;">lambda</span> y: y+<span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>, b.<span style="color: black;">p</span>, <span style="color: #ff4500;">0</span>, <span style="color: #ff4500;">1</span>/<span style="color: #ff4500;">0</span>, b.<span style="color: black;">p</span><span style="color: black;">&#41;</span>
<span style="color: #000066;font-weight:bold;">def</span> bar<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
    foo<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
bar<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>


<p>So it is this code which is executed using the command line
<pre>python run_python.py pythonexpr.py</pre>
which runs in the Python langlet through <span style="font-family: Courier New,Courier,monospace;">run_python.py</span>. So the execution process sees <span style="font-family: Courier New,Courier,monospace;">pythonexpr.py</span> but the code which is compiled by Python will be <span style="font-family: Courier New,Courier,monospace;">Q</span>.</p>

<p>See the mess that happens when the traceback is not patched:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">Traceback <span style="color: black;">&#40;</span>most recent call last<span style="color: black;">&#41;</span>:
  File <span style="color: #483d8b;">&quot;run_python.py&quot;</span>, line <span style="color: #ff4500;">9</span>, <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #306f30;">&lt;</span>module<span style="color: #306f30;">&gt;</span>
    langlet_obj.<span style="color: black;">run_module</span><span style="color: black;">&#40;</span>module<span style="color: black;">&#41;</span>
  ...
  <span style="color: black;">File</span> <span style="color: #483d8b;">&quot;langscape<span style="color: #000099; font-weight: bold;">\l</span>anglets<span style="color: #000099; font-weight: bold;">\p</span>ython<span style="color: #000099; font-weight: bold;">\t</span>ests<span style="color: #000099; font-weight: bold;">\p</span>ythonexpr.py&quot;</span>, line <span style="color: #ff4500;">6</span>, <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #306f30;">&lt;</span>module<span style="color: #306f30;">&gt;</span>
    <span style="color: #ff4500;">0</span>,
  File <span style="color: #483d8b;">&quot;langscape<span style="color: #000099; font-weight: bold;">\l</span>anglets<span style="color: #000099; font-weight: bold;">\p</span>ython<span style="color: #000099; font-weight: bold;">\t</span>ests<span style="color: #000099; font-weight: bold;">\p</span>ythonexpr.py&quot;</span>, line <span style="color: #ff4500;">5</span>, <span style="color: #000066;font-weight:bold;">in</span> bar
    b.<span style="color: black;">p</span>,
  File <span style="color: #483d8b;">&quot;langscape<span style="color: #000099; font-weight: bold;">\l</span>anglets<span style="color: #000099; font-weight: bold;">\p</span>ython<span style="color: #000099; font-weight: bold;">\t</span>ests<span style="color: #000099; font-weight: bold;">\p</span>ythonexpr.py&quot;</span>, line <span style="color: #ff4500;">3</span>, <span style="color: #000066;font-weight:bold;">in</span> foo
    <span style="color: #ff4500;">0</span>,
<span style="color: #FF0000;font-weight:bold;">NameError</span>: <span style="color: #000066;font-weight:bold;">global</span> name <span style="color: #483d8b;">'b'</span> <span style="color: #000066;font-weight:bold;">is</span> <span style="color: #000066;font-weight:bold;">not</span> defined</pre></div></div>


<p>There is even a strange coincidence because <span style="font-family: Courier New,Courier,monospace;">bar()</span> is executed on line 5 in the transformed program and <span style="font-family: Courier New,Courier,monospace;">b.p</span> is on line 5 in the original program, but all the other line information is complete garbage. When we plug in, via <span style="font-family: Courier New,Courier,monospace;">sys.excepthook</span>, the traceback patching mechanism whose major algorithm we&#8217;ve developed above, we get</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">Traceback <span style="color: black;">&#40;</span>most recent call last<span style="color: black;">&#41;</span>:
  File <span style="color: #483d8b;">&quot;run_python.py&quot;</span>, line <span style="color: #ff4500;">9</span>, <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #306f30;">&lt;</span>module<span style="color: #306f30;">&gt;</span>
    langlet_obj.<span style="color: black;">run_module</span><span style="color: black;">&#40;</span>module<span style="color: black;">&#41;</span>
  ...
  <span style="color: black;">File</span> <span style="color: #483d8b;">&quot;langscape<span style="color: #000099; font-weight: bold;">\l</span>anglets<span style="color: #000099; font-weight: bold;">\p</span>ython<span style="color: #000099; font-weight: bold;">\t</span>ests<span style="color: #000099; font-weight: bold;">\p</span>ythonexpr.py&quot;</span>, line <span style="color: #ff4500;">13</span>, <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #306f30;">&lt;</span>module<span style="color: #306f30;">&gt;</span>
    bar<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>,
  File <span style="color: #483d8b;">&quot;langscape<span style="color: #000099; font-weight: bold;">\l</span>anglets<span style="color: #000099; font-weight: bold;">\p</span>ython<span style="color: #000099; font-weight: bold;">\t</span>ests<span style="color: #000099; font-weight: bold;">\p</span>ythonexpr.py&quot;</span>, line <span style="color: #ff4500;">11</span>, <span style="color: #000066;font-weight:bold;">in</span> bar
    foo<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>,
  File <span style="color: #483d8b;">&quot;langscape<span style="color: #000099; font-weight: bold;">\l</span>anglets<span style="color: #000099; font-weight: bold;">\p</span>ython<span style="color: #000099; font-weight: bold;">\t</span>ests<span style="color: #000099; font-weight: bold;">\p</span>ythonexpr.py&quot;</span>, line <span style="color: #ff4500;">5</span>, <span style="color: #000066;font-weight:bold;">in</span> foo
    b.<span style="color: black;">p</span>,
<span style="color: #FF0000;font-weight:bold;">NameError</span>: <span style="color: #000066;font-weight:bold;">global</span> name <span style="color: #483d8b;">'b'</span> <span style="color: #000066;font-weight:bold;">is</span> <span style="color: #000066;font-weight:bold;">not</span> defined</pre></div></div>


<p>which is exactly right!</p>
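<p>How such a hook can be plugged in is sketched below for Python 3, using only the standard <span style="font-family: Courier New,Courier,monospace;">traceback</span> module. The functions <span style="font-family: Courier New,Courier,monospace;">format_patched</span> and <span style="font-family: Courier New,Courier,monospace;">install</span> and the callback <span style="font-family: Courier New,Courier,monospace;">remap_line</span> are hypothetical names; <span style="font-family: Courier New,Courier,monospace;">remap_line(filename, lineno)</span> stands for the token matching described above and is assumed to return the corrected line number in <span style="font-family: Courier New,Courier,monospace;">P</span>.</p>

```python
import sys
import traceback

def format_patched(exc_type, exc, tb, remap_line):
    """Format a traceback after passing every line number through remap_line."""
    frames = []
    for fs in traceback.extract_tb(tb):
        new_lineno = remap_line(fs.filename, fs.lineno)
        # FrameSummary looks the source line up again from the file on disk
        frames.append(traceback.FrameSummary(fs.filename, new_lineno, fs.name))
    lines = ["Traceback (most recent call last):\n"]
    lines += traceback.format_list(frames)
    lines += traceback.format_exception_only(exc_type, exc)
    return "".join(lines)

def install(remap_line):
    """Route uncaught exceptions through the patched formatter."""
    def hook(exc_type, exc, tb):
        sys.stderr.write(format_patched(exc_type, exc, tb, remap_line))
    sys.excepthook = hook
```

<p>Since the file on disk is the original program, looking up the source text at the remapped line number yields the line of <span style="font-family: Courier New,Courier,monospace;">P</span> and not that of <span style="font-family: Courier New,Courier,monospace;">Q</span>.</p>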

<h3>Conclusion</h3>

<p>The algorithm described in this article is merely a heuristic and it won&#8217;t work accurately in all cases. In fact it is impossible to even define conclusively what those cases are because source-to-source transformations can be arbitrary. It is a bit like a first-order approximation of a code transformation, relying on the idea that the code won&#8217;t change too much.</p>

<p>An implementation note. I was annoyed by bad tracebacks when testing the current Langscape code base for a first proper 0.1 release. I don&#8217;t think it is too far away because I have some time now to work on it. It will still be under-tested when it&#8217;s released and the documentation is even more fragmentary. However, at some point everybody must jump, regardless of the methodology used.</p>
]]></content:encoded>
			<wfw:commentRss>http://fiber-space.de/wordpress/2011/04/11/patching-tracebacks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fuzzy string matching II &#8211; matching wordlists</title>
		<link>http://fiber-space.de/wordpress/2011/01/07/fuzzy-string-matching-ii-matching-wordlists/</link>
		<comments>http://fiber-space.de/wordpress/2011/01/07/fuzzy-string-matching-ii-matching-wordlists/#comments</comments>
		<pubDate>Fri, 07 Jan 2011 08:20:28 +0000</pubDate>
		<dc:creator>kay</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://fiber-space.de/wordpress/?p=1737</guid>
		<description><![CDATA[Small misspellings An anonymous programming reddit commenter wrote about my fuzzy string matching article: A maximum edit distance of 2 or 3 is reasonable for most applications of edit distance. For example, cat and dog are only 3 edits away from each other and look nothing alike. Likewise, the original &#8220;damn cool algorithm&#8221; matches against [...]]]></description>
			<content:encoded><![CDATA[<h3>Small misspellings</h3>

<p>An anonymous programming reddit commenter wrote about my fuzzy string matching <a href="http://fiber-space.de/wordpress/?p=1579">article</a>:</p>

<blockquote>A maximum edit distance of 2 or 3 is reasonable for most applications of edit distance. For example, cat and dog are only 3 edits away from each other and look nothing alike. Likewise, the original &#8220;damn cool algorithm&#8221; matches against sets of strings at the same time, where as the algorithms in the article all only compare two strings against each other.</blockquote>

<p>This is a valid objection.</p>

<p>However, for the most common case, which is an edit distance of 1, you don&#8217;t need a Levenshtein automaton either. Here is the recipe:</p>

<p>Let a <span style="font-family: Courier New,Courier,monospace;">wordlist</span> and an <span style="font-family: Courier New,Courier,monospace;">alphabet</span> be given. An alphabet can be for example the attribute <span style="font-family: Courier New,Courier,monospace;">string.letters</span> of the string module. For a string S all string variants of S with an edit distance &lt;=1 over the <span style="font-family: Courier New,Courier,monospace;">alphabet</span> can be computed as follows:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">def</span> string_variants<span style="color: black;">&#40;</span>S, alphabet<span style="color: black;">&#41;</span>:
    variants = <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">for</span> i <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span>S<span style="color: black;">&#41;</span>+<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>:
        variants.<span style="color: black;">add</span><span style="color: black;">&#40;</span>S<span style="color: black;">&#91;</span>:i<span style="color: black;">&#93;</span>+S<span style="color: black;">&#91;</span>i+<span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>       <span style="color: #808080; font-style: italic;"># delete char at i</span>
        <span style="color: #000066;font-weight:bold;">for</span> c <span style="color: #000066;font-weight:bold;">in</span> alphabet:
            variants.<span style="color: black;">add</span><span style="color: black;">&#40;</span>S<span style="color: black;">&#91;</span>:i<span style="color: black;">&#93;</span>+c+S<span style="color: black;">&#91;</span>i:<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>   <span style="color: #808080; font-style: italic;"># insert char at i</span>
            variants.<span style="color: black;">add</span><span style="color: black;">&#40;</span>S<span style="color: black;">&#91;</span>:i<span style="color: black;">&#93;</span>+c+S<span style="color: black;">&#91;</span>i+<span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;"># subst char at i</span>
    <span style="color: #000066;font-weight:bold;">return</span> variants</pre></div></div>


<p>The set of words that shall be matched is given by:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #008000;">set</span><span style="color: black;">&#40;</span>load<span style="color: black;">&#40;</span>wordlist<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <span style="color: #306f30;">&amp;</span> string_variants<span style="color: black;">&#40;</span>S, alphabet<span style="color: black;">&#41;</span></pre></div></div>


<p>The alphabet used can be extracted directly from the wordlist in preparation of the algorithm, so we do not run into trouble when non-ASCII characters come up.</p>

<p>When you want to build string variants of edit distance = 2, just take the result of <span style="font-family: Courier New,Courier,monospace;">string_variants</span> and apply <span style="font-family: Courier New,Courier,monospace;">string_variants</span> to each of its elements again.</p>
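<p>Put together, the recipe might look like the sketch below (Python 3). <span style="font-family: Courier New,Courier,monospace;">string_variants</span> is repeated from above so that the example runs on its own; <span style="font-family: Courier New,Courier,monospace;">variants_within</span> and <span style="font-family: Courier New,Courier,monospace;">fuzzy_lookup</span> are names invented here, and the wordlist is passed in as a plain list rather than loaded from a file.</p>

```python
def string_variants(S, alphabet):
    """All strings within edit distance 1 of S (same recipe as above)."""
    variants = set()
    for i in range(len(S) + 1):
        variants.add(S[:i] + S[i + 1:])            # delete char at i
        for c in alphabet:
            variants.add(S[:i] + c + S[i:])        # insert char at i
            variants.add(S[:i] + c + S[i + 1:])    # substitute char at i
    return variants

def variants_within(S, alphabet, k):
    """All strings within edit distance k of S, by k rounds of string_variants."""
    result = {S}
    for _ in range(k):
        result = set().union(*(string_variants(s, alphabet) for s in result))
    return result

def fuzzy_lookup(S, words, k=1):
    """Words whose edit distance to S is at most k."""
    alphabet = set("".join(words))    # alphabet taken from the wordlist itself
    return set(words).intersection(variants_within(S, alphabet, k))

words = ["cat", "cart", "carts", "dog", "coat"]
```

<p>For example, <span style="font-family: Courier New,Courier,monospace;">fuzzy_lookup("cat", words)</span> finds the words one edit away, and passing <span style="font-family: Courier New,Courier,monospace;">k=2</span> additionally picks up <span style="font-family: Courier New,Courier,monospace;">carts</span>.</p>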

<p>The complexity is</p>

<p><span style="font-family: Courier New,Courier,monospace;">O((n*len(alphabet))^k)</span></p>

<p>where <span style="font-family: Courier New,Courier,monospace;">n</span> is the string length and <span style="font-family: Courier New,Courier,monospace;">k</span> the edit distance.</p>

<h3>Alternative Approaches</h3>

<p>For k = 1 we are essentially done with the simple algorithm above. For k = 2 and small strings the results are still very good using an iterative application of <span style="font-family: Courier New,Courier,monospace;">string_variants</span> to determine for a given S all strings with edit-distance &lt;=2 over an alphabet. So the simplest approaches probably serve you well in practice!</p>

<p>For k&gt;2 and big alphabets we create word lists which are as large as or larger than the wordlist we check against. The effort soon runs out of control. In the rest of the article we want to treat an approach which is fully general and doesn&#8217;t make specific assumptions. It is overall not as efficient as more specialized solutions can be, but it might be more interesting for sophisticated problems I can&#8217;t even imagine.</p>

<p>The basic idea is to organize our wordlist into an n-ary tree, the so-called <span style="font-family: Courier New,Courier,monospace;">PrefixTree</span>, and implement an algorithm which is a variant of <span style="font-family: Courier New,Courier,monospace;">fuzzy_compare</span> to match a string against this tree with a prescribed maximum edit distance of k for the words we extract from the tree during the match.</p>

<h3>Prefix Trees</h3>

<p>For a set of words we can factor common prefixes. For example {aa, ab, ca} can be rewritten as {a[a,b], c[a]}. Systematic factoring yields an n-ary tree &#8211; we call it a <em>PrefixTree</em>. Leaf nodes and words do not correspond in a 1-1 relationship though, because a word can be a proper prefix of another word. This means that we have to tag PrefixTree nodes with an additional boolean <span style="font-family: Courier New,Courier,monospace;">is_word</span> field.</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">class</span> PrefixTree<span style="color: black;">&#40;</span><span style="color: #008000;">object</span><span style="color: black;">&#41;</span>:
    <span style="color: #000066;font-weight:bold;">def</span> <span style="">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, char = <span style="color: #483d8b;">''</span>, parent = <span style="color: #008000;">None</span><span style="color: black;">&#41;</span>:
        <span style="color: #008000;">self</span>.<span style="color: black;">char</span>     = char
        <span style="color: #008000;">self</span>.<span style="color: black;">parent</span>   = parent
        <span style="color: #008000;">self</span>.<span style="color: black;">children</span> = <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
        <span style="color: #008000;">self</span>.<span style="color: black;">is_word</span>  = <span style="color: #008000;">False</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> _tolist<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #008000;">self</span>.<span style="color: black;">is_word</span>:
            <span style="color: #000066;font-weight:bold;">yield</span> <span style="color: #008000;">self</span>.<span style="color: black;">trace</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">for</span> p <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">children</span>.<span style="color: black;">values</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
            <span style="color: #000066;font-weight:bold;">for</span> s <span style="color: #000066;font-weight:bold;">in</span> p._tolist<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
                <span style="color: #000066;font-weight:bold;">yield</span> s
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> <span style="">__iter__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">self</span>._tolist<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> insert<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, value<span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">if</span> value:
            c = value<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
            tree = <span style="color: #008000;">self</span>.<span style="color: black;">children</span>.<span style="color: black;">get</span><span style="color: black;">&#40;</span>c<span style="color: black;">&#41;</span>
            <span style="color: #000066;font-weight:bold;">if</span> tree <span style="color: #000066;font-weight:bold;">is</span> <span style="color: #008000;">None</span>:
                tree = PrefixTree<span style="color: black;">&#40;</span>c, <span style="color: #008000;">self</span><span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">children</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span> = tree
            tree.<span style="color: black;">insert</span><span style="color: black;">&#40;</span>value<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">else</span>:
            <span style="color: #008000;">self</span>.<span style="color: black;">is_word</span> = <span style="color: #008000;">True</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> <span style="">__contains__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, value<span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">if</span> value:
            c = value<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
            <span style="color: #000066;font-weight:bold;">if</span> c <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">children</span>:
                <span style="color: #000066;font-weight:bold;">return</span> value<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span> <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">children</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">self</span>.<span style="color: black;">is_word</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> <span style="">__len__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #008000;">self</span>.<span style="color: black;">parent</span> <span style="color: #000066;font-weight:bold;">is</span> <span style="color: #000066;font-weight:bold;">not</span> <span style="color: #008000;">None</span>:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">parent</span><span style="color: black;">&#41;</span>+<span style="color: #ff4500;">1</span>
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #ff4500;">0</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> trace<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #008000;">self</span>.<span style="color: black;">parent</span> <span style="color: #000066;font-weight:bold;">is</span> <span style="color: #000066;font-weight:bold;">not</span> <span style="color: #008000;">None</span>:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">self</span>.<span style="color: black;">parent</span>.<span style="color: black;">trace</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>+<span style="color: #008000;">self</span>.<span style="color: black;">char</span>
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">self</span>.<span style="color: black;">char</span></pre></div></div>


<p>Reading a wordlist into a <span style="font-family: Courier New,Courier,monospace;">PrefixTree</span> can be simply done like this:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">pt = PrefixTree<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
<span style="color: #000066;font-weight:bold;">for</span> word <span style="color: #000066;font-weight:bold;">in</span> wordlist:
    pt.<span style="color: black;">insert</span><span style="color: black;">&#40;</span>word<span style="color: black;">&#41;</span></pre></div></div>
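
<p>To see the factoring and the role of the <span style="font-family: Courier New,Courier,monospace;">is_word</span> flag in isolation, here is a minimal sketch that uses plain nested dictionaries instead of the class. The names <span style="font-family: Courier New,Courier,monospace;">END</span>, <span style="font-family: Courier New,Courier,monospace;">factor</span> and <span style="font-family: Courier New,Courier,monospace;">unfold</span> are made up for this illustration, and the sketch assumes that no word contains the marker character.</p>

```python
# Sketch only: nested dicts stand in for PrefixTree nodes, and the
# hypothetical END key plays the role of the is_word flag.
END = "$"

def factor(words):
    """Factor common prefixes of words into a nested dict."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True          # a word ends at this node
    return root

def unfold(node, prefix=""):
    """Yield all words stored below node, in sorted order."""
    if END in node:
        yield prefix
    for ch in sorted(key for key in node if key != END):
        for word in unfold(node[ch], prefix + ch):
            yield word

# "a" is a proper prefix of "aa" and "ab", so the inner node for "a"
# carries the END marker although it is not a leaf.
tree = factor(["aa", "ab", "ca", "a"])
print(tree)
print(list(unfold(tree)))
```

<p>Without the marker on the inner node there would be no way to tell whether "a" was inserted as a word or only occurs as a prefix.</p>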


<p>Before we criticise and modify the <span style="font-family: Courier New,Courier,monospace;">PrefixTree</span> let us take a look at the matching algorithm.</p>

<h3>Matching the PrefixTree</h3>

<p>The algorithm is inspired by our <span style="font-family: Courier New,Courier,monospace;">fuzzy_compare</span> algorithm. It uses the same recursive structure and memoization as <span style="font-family: Courier New,Courier,monospace;">fuzzy_compare</span>.</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">def</span> update_visited<span style="color: black;">&#40;</span>ptree, visited<span style="color: black;">&#41;</span>:
    visited<span style="color: black;">&#91;</span>ptree<span style="color: black;">&#93;</span><span style="color: black;">&#91;</span>-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> = <span style="color: #ff4500;">0</span>
    T = ptree.<span style="color: black;">parent</span>
    <span style="color: #000066;font-weight:bold;">while</span> T <span style="color: #000066;font-weight:bold;">is</span> <span style="color: #000066;font-weight:bold;">not</span> <span style="color: #008000;">None</span> <span style="color: #000066;font-weight:bold;">and</span> T.<span style="color: black;">char</span><span style="color: #306f30;">!</span>=<span style="color: #483d8b;">''</span>:
        <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>T.<span style="color: black;">children</span><span style="color: black;">&#41;</span> == <span style="color: #ff4500;">1</span>:
            visited<span style="color: black;">&#91;</span>T<span style="color: black;">&#93;</span><span style="color: black;">&#91;</span>-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> = <span style="color: #ff4500;">0</span>
            T = T.<span style="color: black;">parent</span>
        <span style="color: #000066;font-weight:bold;">else</span>:
            <span style="color: #000066;font-weight:bold;">return</span>
&nbsp;
<span style="color: #000066;font-weight:bold;">def</span> is_visited<span style="color: black;">&#40;</span>i, T, k, visited<span style="color: black;">&#41;</span>:
    d = visited.<span style="color: black;">get</span><span style="color: black;">&#40;</span>T, <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span><span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">if</span> -<span style="color: #ff4500;">1</span> <span style="color: #000066;font-weight:bold;">in</span> d:
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
    m = d.<span style="color: black;">get</span><span style="color: black;">&#40;</span>i,-<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">if</span> k<span style="color: #306f30;">&gt;</span>m:
        d<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span> = k
        visited<span style="color: black;">&#91;</span>T<span style="color: black;">&#93;</span> = d
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
    <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
&nbsp;
<span style="color: #000066;font-weight:bold;">def</span> fuzzy_match<span style="color: black;">&#40;</span>S, ptree, k, i=<span style="color: #ff4500;">0</span>, visited = <span style="color: #008000;">None</span>, N = <span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span>:
    <span style="color: #483d8b;">''</span><span style="color: #483d8b;">'
    Computes all strings T contained in ptree with a distance dist(T, S)&lt;=k.
    '</span><span style="color: #483d8b;">''</span>
    trees = <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    <span style="color: #808080; font-style: italic;"># handles root node of a PrefixTree</span>
    <span style="color: #000066;font-weight:bold;">if</span> ptree.<span style="color: black;">char</span> == <span style="color: #483d8b;">''</span> <span style="color: #000066;font-weight:bold;">and</span> ptree.<span style="color: black;">children</span>:
        N = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>S<span style="color: black;">&#41;</span>
        S+=<span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\0</span>'</span><span style="color: #306f30;">*</span><span style="color: black;">&#40;</span>k+<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
        visited = <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
        <span style="color: #000066;font-weight:bold;">for</span> pt <span style="color: #000066;font-weight:bold;">in</span> ptree.<span style="color: black;">children</span>.<span style="color: black;">values</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
            trees.<span style="color: black;">update</span><span style="color: black;">&#40;</span>fuzzy_match<span style="color: black;">&#40;</span>S, pt, k, i, visited, N<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">return</span> trees
&nbsp;
    <span style="color: #808080; font-style: italic;"># already tried</span>
    <span style="color: #000066;font-weight:bold;">if</span> is_visited<span style="color: black;">&#40;</span>i, ptree, k, visited<span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
&nbsp;
    <span style="color: #808080; font-style: italic;"># can't match ...</span>
    <span style="color: #000066;font-weight:bold;">if</span> k == -<span style="color: #ff4500;">1</span> <span style="color: #000066;font-weight:bold;">or</span> <span style="color: black;">&#40;</span>k == <span style="color: #ff4500;">0</span> <span style="color: #000066;font-weight:bold;">and</span> S<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span> <span style="color: #306f30;">!</span>= ptree.<span style="color: black;">char</span><span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">if</span> ptree.<span style="color: black;">is_word</span> <span style="color: #000066;font-weight:bold;">and</span> <span style="color: black;">&#40;</span>N-i<span style="color: #306f30;">&lt;</span>=k <span style="color: #000066;font-weight:bold;">or</span> <span style="color: black;">&#40;</span>N-i<span style="color: #306f30;">&lt;</span>=k+<span style="color: #ff4500;">1</span> <span style="color: #000066;font-weight:bold;">and</span> ptree.<span style="color: black;">char</span> == S<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>:
        trees.<span style="color: black;">add</span><span style="color: black;">&#40;</span>ptree.<span style="color: black;">trace</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #000066;font-weight:bold;">not</span> ptree.<span style="color: black;">children</span>:
            update_visited<span style="color: black;">&#40;</span>ptree, visited<span style="color: black;">&#41;</span>
            <span style="color: #000066;font-weight:bold;">return</span> trees
&nbsp;
    <span style="color: #000066;font-weight:bold;">if</span> ptree.<span style="color: black;">char</span><span style="color: #306f30;">!</span>=S<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span>:
        trees.<span style="color: black;">update</span><span style="color: black;">&#40;</span>fuzzy_match<span style="color: black;">&#40;</span>S, ptree, k-<span style="color: #ff4500;">1</span>, i+<span style="color: #ff4500;">1</span>, visited, N<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">for</span> c <span style="color: #000066;font-weight:bold;">in</span> ptree.<span style="color: black;">children</span>:
        <span style="color: #000066;font-weight:bold;">if</span> ptree.<span style="color: black;">char</span> == S<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span>:
            trees.<span style="color: black;">update</span><span style="color: black;">&#40;</span>fuzzy_match<span style="color: black;">&#40;</span>S, ptree.<span style="color: black;">children</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>, k, i+<span style="color: #ff4500;">1</span>, visited, N<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">else</span>:
            trees.<span style="color: black;">update</span><span style="color: black;">&#40;</span>fuzzy_match<span style="color: black;">&#40;</span>S, ptree.<span style="color: black;">children</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>, k-<span style="color: #ff4500;">1</span>, i+<span style="color: #ff4500;">1</span>, visited, N<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        trees.<span style="color: black;">update</span><span style="color: black;">&#40;</span>fuzzy_match<span style="color: black;">&#40;</span>S, ptree.<span style="color: black;">children</span><span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>, k-<span style="color: #ff4500;">1</span>, i, visited, N<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">return</span> trees</pre></div></div>
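
<p>A simple way to gain confidence in a matcher like this is to compare its result with a brute force scan of the wordlist. The following sketch assumes that dist(T, S) is the plain Levenshtein distance; <span style="font-family: Courier New,Courier,monospace;">levenshtein</span> and <span style="font-family: Courier New,Courier,monospace;">brute_force_match</span> are names introduced here for illustration and are not part of the downloadable code.</p>

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one matrix row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,           # delete ca
                           cur[j - 1] + 1,        # insert cb
                           prev[j - 1] + cost))   # substitute or keep
        prev = cur
    return prev[-1]

def brute_force_match(S, words, k):
    """All words whose distance to S is at most k (linear scan)."""
    return set(w for w in words if levenshtein(S, w) <= k)

words = ["haus", "maus", "hau", "house", "baum"]
print(sorted(brute_force_match("haus", words, 1)))
```

<p>For a small wordlist the set returned by the scan should be equal to the one returned by <span style="font-family: Courier New,Courier,monospace;">fuzzy_match</span> for the same S and k, if the assumption about the distance holds.</p>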


<h3>Lazy PrefixTree construction</h3>

<p>The major disadvantage of the construction is the time it takes upfront to create the PrefixTree. I tried it with a wordlist of 158,989 entries and it took about 10 sec. With psyco activated it still takes 7.5 sec.</p>

<p>Some trivia for the curious. I reimplemented PrefixTree in VC++ using STL <span style="font-family: Courier New,Courier,monospace;">hash_map</span> and got a worse result: 14 sec of execution time &#8211; about twice as much as Python + Psyco. The language designed with uncompromised performance characteristics in mind doesn&#8217;t cease to surprise me. Of course I feel bad because I haven&#8217;t built a specialized memory management for this function and so on. Java behaves better with 1.2 sec on average.</p>

<p>A possible solution for Python ( and C++ <img src='http://fiber-space.de/wordpress/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  ) but also for Java, when wordlists grow ever bigger, is to create the PrefixTree only partially and let it grow when needed. This way the load time is spread over several queries and a performance hit at startup can be avoided.</p>

<p>Here is the modified code:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">class</span> PrefixTree<span style="color: black;">&#40;</span><span style="color: #008000;">object</span><span style="color: black;">&#41;</span>:
    <span style="color: #000066;font-weight:bold;">def</span> <span style="">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, char = <span style="color: #483d8b;">''</span>, parent = <span style="color: #008000;">None</span><span style="color: black;">&#41;</span>:
        <span style="color: #008000;">self</span>.<span style="color: black;">char</span>      = char
        <span style="color: #008000;">self</span>.<span style="color: black;">parent</span>    = parent
        <span style="color: #008000;">self</span>.<span style="color: black;">is_word</span>   = <span style="color: #008000;">False</span>
        <span style="color: #008000;">self</span>._children = <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
        <span style="color: #008000;">self</span>._words    = <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> _get_children<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #008000;">self</span>._words:
            <span style="color: #008000;">self</span>._create_children<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">self</span>._children
&nbsp;
    children = <span style="color: #008000;">property</span><span style="color: black;">&#40;</span>_get_children<span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> _create_children<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">for</span> tree, word <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">self</span>._words:
            tree.<span style="color: black;">insert</span><span style="color: black;">&#40;</span>word<span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>._words = <span style="color: #008000;">set</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> _tolist<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #008000;">self</span>.<span style="color: black;">is_word</span>:
            <span style="color: #000066;font-weight:bold;">yield</span> <span style="color: #008000;">self</span>.<span style="color: black;">trace</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">for</span> p <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">self</span>.<span style="color: black;">children</span>.<span style="color: black;">values</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
            <span style="color: #000066;font-weight:bold;">for</span> s <span style="color: #000066;font-weight:bold;">in</span> p._tolist<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
                <span style="color: #000066;font-weight:bold;">yield</span> s
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> <span style="">__iter__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">self</span>._tolist<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> insert<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, value<span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">if</span> value:
            c = value<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
            tree = <span style="color: #008000;">self</span>._children.<span style="color: black;">get</span><span style="color: black;">&#40;</span>c<span style="color: black;">&#41;</span>
            <span style="color: #000066;font-weight:bold;">if</span> tree <span style="color: #000066;font-weight:bold;">is</span> <span style="color: #008000;">None</span>:
                tree = PrefixTree<span style="color: black;">&#40;</span>c, <span style="color: #008000;">self</span><span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>._children<span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span> = tree
            <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>value<span style="color: black;">&#41;</span> == <span style="color: #ff4500;">1</span>:
                tree.<span style="color: black;">is_word</span> = <span style="color: #008000;">True</span>
            tree._words.<span style="color: black;">add</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>tree,value<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">else</span>:
            <span style="color: #008000;">self</span>.<span style="color: black;">is_word</span> = <span style="color: #008000;">True</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> <span style="">__contains__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, value<span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">if</span> value:
            c = value<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
            <span style="color: #808080; font-style: italic;"># go through the children property so that pending words are expanded first</span>
            children = <span style="color: #008000;">self</span>.<span style="color: black;">children</span>
            <span style="color: #000066;font-weight:bold;">if</span> c <span style="color: #000066;font-weight:bold;">in</span> children:
                <span style="color: #000066;font-weight:bold;">return</span> value<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span> <span style="color: #000066;font-weight:bold;">in</span> children<span style="color: black;">&#91;</span>c<span style="color: black;">&#93;</span>
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">self</span>.<span style="color: black;">is_word</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> <span style="">__len__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #008000;">self</span>.<span style="color: black;">parent</span> <span style="color: #000066;font-weight:bold;">is</span> <span style="color: #000066;font-weight:bold;">not</span> <span style="color: #008000;">None</span>:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">parent</span><span style="color: black;">&#41;</span>+<span style="color: #ff4500;">1</span>
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #ff4500;">0</span>
&nbsp;
    <span style="color: #000066;font-weight:bold;">def</span> trace<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #008000;">self</span>.<span style="color: black;">parent</span> <span style="color: #000066;font-weight:bold;">is</span> <span style="color: #000066;font-weight:bold;">not</span> <span style="color: #008000;">None</span>:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">self</span>.<span style="color: black;">parent</span>.<span style="color: black;">trace</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>+<span style="color: #008000;">self</span>.<span style="color: black;">char</span>
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">self</span>.<span style="color: black;">char</span></pre></div></div>
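
<p>The effect of the deferred expansion can be made visible by counting nodes. The sketch below is a stripped-down, hypothetical stand-in for the lazy PrefixTree (the class <span style="font-family: Courier New,Courier,monospace;">LazyNode</span>, its <span style="font-family: Courier New,Courier,monospace;">built</span> counter and the <span style="font-family: Courier New,Courier,monospace;">_pending</span> list are assumptions made for this example). It keeps unexpanded suffixes on the child node and pushes them one level further down only when the children of that node are requested.</p>

```python
class LazyNode(object):
    """Hypothetical, stripped-down stand-in for the lazy PrefixTree."""
    built = 0                      # counts all nodes created so far

    def __init__(self):
        LazyNode.built += 1
        self.is_word = False
        self._children = {}
        self._pending = []         # suffixes not yet pushed further down

    def insert(self, word):
        if not word:
            self.is_word = True
            return
        child = self._children.get(word[0])
        if child is None:
            child = self._children[word[0]] = LazyNode()
        if len(word) == 1:
            child.is_word = True
        else:
            child._pending.append(word[1:])   # defer the rest of the word

    @property
    def children(self):
        # expand exactly one more level on access
        pending, self._pending = self._pending, []
        for suffix in pending:
            self.insert(suffix)
        return self._children

    def __contains__(self, word):
        if not word:
            return self.is_word
        child = self.children.get(word[0])
        return child is not None and word[1:] in child

root = LazyNode()
for w in ["haus", "hausboot", "maus", "hau"]:
    root.insert(w)
after_load = LazyNode.built        # only the root and the first level exist
found = "hausboot" in root         # expands the nodes along this path only
after_query = LazyNode.built
print(after_load, after_query, found)
```

<p>Loading touches only the first character of each word, and a query builds only the branch it walks through. The subtree below "m" stays unexpanded until a query reaches it.</p>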


<h3>Some numbers</h3>

<p>The numbers presented here should be taken with a grain of salt and not confused with a benchmark but still provide a quantitative profile which allows drawing conclusions and making decisions.
<pre>Load time of wordlist of size 158,989 into pt = PrefixTree(): 0.61 sec</pre>
<pre>Execution time of fuzzy_match("haus", pt, 1) - 1st run: 1.03 sec</pre>
<pre>Execution time of fuzzy_match("haus", pt, 1) - 2nd run: 0.03 sec</pre>
<pre>Execution time of fuzzy_match("haus", pt, 2) - 1st run: 1.95 sec</pre>
<pre>Execution time of fuzzy_match("haus", pt, 2) - 2nd run: 0.17 sec</pre>
<pre>Execution time of fuzzy_match("haus", pt, 3) - 1st run: 3.58 sec</pre>
<pre>Execution time of fuzzy_match("haus", pt, 3) - 2nd run: 0.87 sec</pre>
We see that the second run is always significantly faster because in the first run the PrefixTree gets partially built while in the second run the built nodes are just visited.</p>
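
<p>Numbers of this kind can be collected with a very small helper. This is a sketch and not the timing code from the download: <span style="font-family: Courier New,Courier,monospace;">timed</span> and <span style="font-family: Courier New,Courier,monospace;">warm_sum</span> are hypothetical names, <span style="font-family: Courier New,Courier,monospace;">time.perf_counter</span> assumes Python 3, and the cached function only imitates a query that does its expensive work on the first call.</p>

```python
import time

def timed(func, *args):
    """Return (result, elapsed seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

_cache = {}

def warm_sum(n):
    # stand-in for a query that builds parts of the tree on first use
    if n not in _cache:
        _cache[n] = sum(n for _ in range(n))
    return _cache[n]

first = timed(warm_sum, 200000)
second = timed(warm_sum, 200000)
print("1st run: %.4f sec, 2nd run: %.4f sec" % (first[1], second[1]))
```

<p>Measuring the first and the second call separately is what separates the cost of building nodes from the cost of merely visiting them.</p>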

<p>Finally here are the numbers using string variants:
<pre>Execution time of string_variants("haus", string.letters): 0.0 sec</pre>
<pre>Execution time of 2-iterated string_variants("haus", string.letters): 0.28 sec</pre>
<pre>Execution time of 3-iterated string_variants("haus", string.letters): 188.90 sec</pre>
The 0.0 seconds result simply means that the time of a single run is below the measurement threshold. The other results can possibly be improved by a factor of 2 using a less naive strategy to create string variants that avoids duplicates. The bottom line is that for k = 1 and k = 2 PrefixTrees, Levenshtein automata and other sophisticated algorithms aren&#8217;t necessary, and that for k &gt;= 3 PrefixTree based approaches don&#8217;t run amok.</p>
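
<p>The jump from 0.28 sec to 188.90 sec is plausible once the candidates are counted. A word of length n over an alphabet of m characters has (n+1)*m insertions, n*m substitutions and n deletions, which is 472 strings for "haus" and 52 letters before duplicates are removed, and every iteration multiplies the set by roughly that factor again. The sketch below illustrates the growth; <span style="font-family: Courier New,Courier,monospace;">edit1_variants</span> and <span style="font-family: Courier New,Courier,monospace;">iterated_variants</span> are assumed stand-ins and not the <span style="font-family: Courier New,Courier,monospace;">string_variants</span> function measured above, and the alphabet is reduced to lowercase letters to keep the run short.</p>

```python
import string

def edit1_variants(word, alphabet):
    """All strings reachable from word by one insert, delete or substitute."""
    variants = set()
    for i in range(len(word) + 1):
        for c in alphabet:
            variants.add(word[:i] + c + word[i:])          # insertion
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])              # deletion
        for c in alphabet:
            variants.add(word[:i] + c + word[i + 1:])      # substitution
    return variants

def iterated_variants(word, alphabet, k):
    """Strings within k single-character edits of word (word included)."""
    result = {word}
    frontier = {word}
    for _ in range(k):
        step = set()
        for w in frontier:
            step |= edit1_variants(w, alphabet)
        frontier = step - result     # only expand strings not seen before
        result |= frontier
    return result

v1 = iterated_variants("haus", string.ascii_lowercase, 1)
v2 = iterated_variants("haus", string.ascii_lowercase, 2)
print(len(v1), len(v2))
```

<p>Every candidate still has to be looked up in the wordlist, so the size of the candidate set is a lower bound for the work done per query.</p>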

<h3>Code</h3>

<p>The code for <span style="font-family: Courier New,Courier,monospace;">fuzzy_compare</span> and <span style="font-family: Courier New,Courier,monospace;">fuzzy_match</span> can be downloaded <a href="http://www.fiber-space.de/fuzzystring/fuzzystring.zip">here</a>. It also contains tests, some timing measurements and a German sample wordlist.</p>
]]></content:encoded>
			<wfw:commentRss>http://fiber-space.de/wordpress/2011/01/07/fuzzy-string-matching-ii-matching-wordlists/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Fuzzy string matching</title>
		<link>http://fiber-space.de/wordpress/2010/12/21/fuzzy-string-matching-and-grammars/</link>
		<comments>http://fiber-space.de/wordpress/2010/12/21/fuzzy-string-matching-and-grammars/#comments</comments>
		<pubDate>Tue, 21 Dec 2010 05:55:03 +0000</pubDate>
		<dc:creator>kay</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://fiber-space.de/wordpress/?p=1579</guid>
		<description><![CDATA[A damn hot algorithm I found the following article written by Nick Johnson about the use of finite state machines for approximate  string matches i.e. string matches which are not exact but bound by a given edit distance. The algorithm is based on so called &#8220;Levenshtein automatons&#8221;. Those automatons are inspired by the construction of [...]]]></description>
			<content:encoded><![CDATA[<h3>A damn hot algorithm</h3>

<p>I found the following <a href="http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata">article</a> written by Nick Johnson about the use of finite state machines for approximate string matches, i.e. string matches which are not exact but bound by a given edit distance. The algorithm is based on so-called &#8220;Levenshtein automata&#8221;. Those automata are inspired by the construction of the Levenshtein matrix used for edit distance computations. For more information start reading the WP-article about the <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein algorithm</a> which provides sufficient background for Nick&#8217;s article.</p>

<p>I downloaded the code from github and tried it out but was stunned by the time the automaton construction takes once strings grow big. It took almost 6 minutes on my 1.5 GHz notebook to construct the following Levenshtein automaton:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">k = <span style="color: #ff4500;">6</span>
S   = <span style="color: #483d8b;">&quot;&quot;</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span><span style="color: #008000;">str</span><span style="color: black;">&#40;</span>s<span style="color: black;">&#41;</span> <span style="color: #000066;font-weight:bold;">for</span> s <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">10</span><span style="color: black;">&#41;</span><span style="color: #306f30;">*</span>k<span style="color: black;">&#41;</span>
lev = levenshtein_automata<span style="color: black;">&#40;</span>S, k<span style="color: black;">&#41;</span>.<span style="color: black;">to_dfa</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>


<p>The algorithm is advertised as a &#8220;damn cool algorithm&#8221; by the author but since the major effect on my notebook was producing heat I wonder if &#8220;cool&#8221; shouldn&#8217;t be replaced by &#8220;hot&#8221;?</p>

<p>In the remainder of this article I&#8217;m constructing an approximate string matching algorithm from scratch.</p>

<h3>Recursive rules for approximate string matching</h3>

<p>Let <span style="font-family: Courier New,Courier,monospace;">S</span> be a string with <span style="font-family: Courier New,Courier,monospace;">len(S)=n</span> and <span style="font-family: Courier New,Courier,monospace;">k</span> a positive number with <span style="font-family: Courier New,Courier,monospace;">k</span>&lt;=<span style="font-family: Courier New,Courier,monospace;">n</span>. By &#8220;?&#8221; we denote a wildcard symbol that matches any character, including no character (expressing a contraction). Since S has length <span style="font-family: Courier New,Courier,monospace;">n</span> we can select any <span style="font-family: Courier New,Courier,monospace;">k</span> indexes in the set <span style="font-family: Courier New,Courier,monospace;">{0,&#8230;,n-1}</span> and replace the characters of <span style="font-family: Courier New,Courier,monospace;">S</span> at those indexes with the wildcard symbol. For example, S = &#8220;food&#8221;, k = 1 and index = 2 yields &#8220;fo?d&#8221;.</p>
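<p>The index selection described above can be sketched in a few lines of Python. The helper name <span style="font-family: Courier New,Courier,monospace;">wildcard_variants</span> is made up for this illustration and is not part of the downloadable code; it simply enumerates all choices of k indexes with <span style="font-family: Courier New,Courier,monospace;">itertools.combinations</span>.</p>

```python
from itertools import combinations

def wildcard_variants(S, k):
    """Yield every copy of S with exactly k positions replaced by '?'.

    Hypothetical helper for illustration only: it enumerates all
    k-element index subsets of {0, ..., len(S)-1}.
    """
    for indexes in combinations(range(len(S)), k):
        chars = list(S)
        for i in indexes:
            chars[i] = "?"
        yield "".join(chars)

# The four single-wildcard variants of "food", in index order.
variants = list(wildcard_variants("food", 1))
```

<p>For S = &#8220;food&#8221; and k = 1 the third variant produced is &#8220;fo?d&#8221;, the example given above.</p>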

<p>We write the rule which describes all possible single character substitutions in &#8220;food&#8221; like this:
<pre>pattern(food, 1) = ?ood | f?od | fo?d  | foo?</pre>
Applying successive left factorings yields:
<pre>pattern(food, 1) = ?ood | f  ( ?od | o (?d  | o? ) )</pre>
This inspires a recursive notation which roughly looks like this:
<pre>pattern(food, 1) = ?ood | f pattern(ood, 1)</pre>
or more precisely:
<pre>pattern(c, 1) = ?
pattern(S, 1) = ?S[1:] | S[0] pattern(S[1:], 1)</pre>
where we have used a string variable S instead of the concrete string &#8220;food&#8221;.</p>

<p>When using an arbitrary <span style="font-family: Courier New,Courier,monospace;">k</span> instead of a fixed k = 1 we get the following recursive equations:
<pre>pattern(c, k) = ?
pattern(S, k) = ?pattern(S[1:], k-1) | S[0] pattern(S[1:], k)</pre></p>
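<p>To see what the recursive equations generate, here is a minimal sketch that expands them into explicit wildcard strings. The function name <span style="font-family: Courier New,Courier,monospace;">patterns</span> is hypothetical, and the equations above do not state what happens for k = 0 or for an empty string, so the sketch assumes that in both cases the string simply stands for itself.</p>

```python
def patterns(S, k):
    """Expand pattern(S, k) into a list of explicit wildcard strings."""
    if k == 0 or not S:
        # Assumption (not spelled out in the equations): with no
        # wildcard left, or nothing left to match, S stands for itself.
        return [S]
    if len(S) == 1:
        # pattern(c, k) = ?
        return ["?"]
    # ? pattern(S[1:], k-1): spend one wildcard on the first character
    wild = ["?" + rest for rest in patterns(S[1:], k - 1)]
    # S[0] pattern(S[1:], k): keep the first character, budget unchanged
    kept = [S[0] + rest for rest in patterns(S[1:], k)]
    return wild + kept

food_1 = patterns("food", 1)
```

<p>For k = 1 this reproduces the four alternatives <span style="font-family: Courier New,Courier,monospace;">?ood | f?od | fo?d | foo?</span> from the beginning of the section.</p>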

<h3>Consuming or not consuming?</h3>

<p>When we try to find an efficient implementation for the <span style="font-family: Courier New,Courier,monospace;">pattern</span> function described above we need an interpretation of the <span style="font-family: Courier New,Courier,monospace;">?</span> wildcard action. It can consume a character and feed the rest of the string into a new call of <span style="font-family: Courier New,Courier,monospace;">pattern</span> or skip a character and do the same with the rest. Since we cannot decide up front which choice is the right one for every string, we eventually need backtracking, but because we can memoize intermediate results we can also reduce the effort. But step by step &#8230;</p>

<p>The basic idea when matching a string <span style="font-family: Courier New,Courier,monospace;">S1</span> against a string <span style="font-family: Courier New,Courier,monospace;">S2</span> is that we attempt to match <span style="font-family: Courier New,Courier,monospace;">S1[0]</span> against <span style="font-family: Courier New,Courier,monospace;">S2[0]</span> and when we succeed, we continue matching <span style="font-family: Courier New,Courier,monospace;">S1[1:]</span> against <span style="font-family: Courier New,Courier,monospace;">S2[1:]</span> using the same constant <span style="font-family: Courier New,Courier,monospace;">k</span>. If we fail, we have several choices depending on the interpretation of the wildcard action: it can consume a character of S2 or leave S2 as it is. Finally S1 and S2 are interchangeable, so we are left with the following choices:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">fuzzy_compare<span style="color: black;">&#40;</span>S1, S2<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span>, k-<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
fuzzy_compare<span style="color: black;">&#40;</span>S2, S1<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span>, k-<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
fuzzy_compare<span style="color: black;">&#40;</span>S1<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span>, S2<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span>, k-<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span></pre></div></div>


<p>All of those choices are valid and if one fails we need to check out another one. This is sufficient for starting a first implementation.</p>

<h3>A first implementation</h3>

<p>The following implementation is a good point to start with:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">def</span> fuzzy_compare<span style="color: black;">&#40;</span>S1, S2, k, i=<span style="color: #ff4500;">0</span>, j=<span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span>:
    N1 = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>S1<span style="color: black;">&#41;</span>
    N2 = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>S2<span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">while</span> <span style="color: #008000;">True</span>:
        <span style="color: #000066;font-weight:bold;">if</span> N1-i<span style="color: #306f30;">&lt;</span>=k <span style="color: #000066;font-weight:bold;">and</span> N2-j<span style="color: #306f30;">&lt;</span>=k:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
        <span style="color: #000066;font-weight:bold;">try</span>:
            <span style="color: #000066;font-weight:bold;">if</span> S1<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span> == S2<span style="color: black;">&#91;</span>j<span style="color: black;">&#93;</span>:
                i+=<span style="color: #ff4500;">1</span>
                j+=<span style="color: #ff4500;">1</span>
                <span style="color: #000066;font-weight:bold;">continue</span>
        <span style="color: #000066;font-weight:bold;">except</span> <span style="color: #FF0000;font-weight:bold;">IndexError</span>:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
        <span style="color: #000066;font-weight:bold;">if</span> k == <span style="color: #ff4500;">0</span>:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
        <span style="color: #000066;font-weight:bold;">else</span>:
            <span style="color: #000066;font-weight:bold;">if</span> fuzzy_compare<span style="color: black;">&#40;</span>S1, S2, k-<span style="color: #ff4500;">1</span>, i+<span style="color: #ff4500;">1</span>, j<span style="color: black;">&#41;</span>:
                <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
            <span style="color: #000066;font-weight:bold;">if</span> fuzzy_compare<span style="color: black;">&#40;</span>S1, S2, k-<span style="color: #ff4500;">1</span>, i, j+<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>:
                <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
            <span style="color: #000066;font-weight:bold;">if</span> fuzzy_compare<span style="color: black;">&#40;</span>S1, S2, k-<span style="color: #ff4500;">1</span>, i+<span style="color: #ff4500;">1</span>, j+<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>:
                <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span></pre></div></div>


<p>The algorithm employs full backtracking and it is also limited to medium sized strings ( in Python ) because of recursion. But it shows the central ideas and is simple.</p>

<h3>A second implementation using memoization</h3>

<p>Our second implementation still uses recursion but introduces a dictionary which records all <span style="font-family: Courier New,Courier,monospace;">(i,j)</span> index pairs that have been encountered and stores the value of <span style="font-family: Courier New,Courier,monospace;">k</span> they were reached with. If the algorithm finds a stored value <span style="font-family: Courier New,Courier,monospace;">k&#8217;</span> at <span style="font-family: Courier New,Courier,monospace;">(i,j)</span> with <span style="font-family: Courier New,Courier,monospace;">k&#8217;</span>&gt;=<span style="font-family: Courier New,Courier,monospace;">k</span>, where <span style="font-family: Courier New,Courier,monospace;">k</span> is the current value, it will not follow that branch again because this particular trace has been unsuccessfully checked out before with at least as many edits left. Using an <span style="font-family: Courier New,Courier,monospace;">n x n</span> matrix to memoize results will reduce the complexity of the algorithm which becomes <span style="font-family: Courier New,Courier,monospace;">O(n^2)</span> where n is the length of the string. In fact it will be even <span style="font-family: Courier New,Courier,monospace;">O(n)</span> because only a stripe of width 2k along the diagonal of the (i,j)-matrix is checked out. Of course the effort depends on the constant k.</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">def</span> is_visited<span style="color: black;">&#40;</span>i, j, k, visited<span style="color: black;">&#41;</span>:
    m = visited.<span style="color: black;">get</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>i,j<span style="color: black;">&#41;</span>,-<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">if</span> k<span style="color: #306f30;">&gt;</span>m:
        visited<span style="color: black;">&#91;</span><span style="color: black;">&#40;</span>i,j<span style="color: black;">&#41;</span><span style="color: black;">&#93;</span> = k
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
    <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
&nbsp;
<span style="color: #000066;font-weight:bold;">def</span> fuzzy_compare<span style="color: black;">&#40;</span>S1, S2, k, i = <span style="color: #ff4500;">0</span>, j = <span style="color: #ff4500;">0</span>, visited = <span style="color: #008000;">None</span><span style="color: black;">&#41;</span>:
    N1 = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>S1<span style="color: black;">&#41;</span>
    N2 = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>S2<span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">if</span> visited <span style="color: #000066;font-weight:bold;">is</span> <span style="color: #008000;">None</span>:
        visited = <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
    <span style="color: #000066;font-weight:bold;">while</span> <span style="color: #008000;">True</span>:
        <span style="color: #000066;font-weight:bold;">if</span> N1-i<span style="color: #306f30;">&lt;</span>=k <span style="color: #000066;font-weight:bold;">and</span> N2-j<span style="color: #306f30;">&lt;</span>=k:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
        <span style="color: #000066;font-weight:bold;">try</span>:
            <span style="color: #000066;font-weight:bold;">if</span> S1<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span> == S2<span style="color: black;">&#91;</span>j<span style="color: black;">&#93;</span>:
                visited<span style="color: black;">&#91;</span><span style="color: black;">&#40;</span>i,j<span style="color: black;">&#41;</span><span style="color: black;">&#93;</span> = k
                i+=<span style="color: #ff4500;">1</span>
                j+=<span style="color: #ff4500;">1</span>
                <span style="color: #000066;font-weight:bold;">continue</span>
        <span style="color: #000066;font-weight:bold;">except</span> <span style="color: #FF0000;font-weight:bold;">IndexError</span>:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
        <span style="color: #000066;font-weight:bold;">if</span> k == <span style="color: #ff4500;">0</span>:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
        <span style="color: #000066;font-weight:bold;">else</span>:
            <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #000066;font-weight:bold;">not</span> is_visited<span style="color: black;">&#40;</span>i+<span style="color: #ff4500;">1</span>, j, k-<span style="color: #ff4500;">1</span>, visited<span style="color: black;">&#41;</span>:
                <span style="color: #000066;font-weight:bold;">if</span> fuzzy_compare<span style="color: black;">&#40;</span>S1, S2, k-<span style="color: #ff4500;">1</span>, i+<span style="color: #ff4500;">1</span>, j, visited<span style="color: black;">&#41;</span>:
                    <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
            <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #000066;font-weight:bold;">not</span> is_visited<span style="color: black;">&#40;</span>i, j+<span style="color: #ff4500;">1</span>, k-<span style="color: #ff4500;">1</span>, visited<span style="color: black;">&#41;</span>:
                <span style="color: #000066;font-weight:bold;">if</span> fuzzy_compare<span style="color: black;">&#40;</span>S1, S2, k-<span style="color: #ff4500;">1</span>, i, j+<span style="color: #ff4500;">1</span>, visited<span style="color: black;">&#41;</span>:
                    <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
            <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #000066;font-weight:bold;">not</span> is_visited<span style="color: black;">&#40;</span>i+<span style="color: #ff4500;">1</span>, j+<span style="color: #ff4500;">1</span>, k-<span style="color: #ff4500;">1</span>, visited<span style="color: black;">&#41;</span>:
                <span style="color: #000066;font-weight:bold;">if</span> fuzzy_compare<span style="color: black;">&#40;</span>S1, S2, k-<span style="color: #ff4500;">1</span>, i+<span style="color: #ff4500;">1</span>, j+<span style="color: #ff4500;">1</span>, visited<span style="color: black;">&#41;</span>:
                    <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span></pre></div></div>


<h3>A third implementation eliminating recursion</h3>

<p>In our third and final implementation we eliminate the recursive call to <span style="font-family: Courier New,Courier,monospace;">fuzzy_compare</span> and replace it using a stack containing tuples <span style="font-family: Courier New,Courier,monospace;">(i, j, k)</span> recording the current state.</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">def</span> is_visited<span style="color: black;">&#40;</span>i, j, k, visited<span style="color: black;">&#41;</span>:
    m = visited.<span style="color: black;">get</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>i,j<span style="color: black;">&#41;</span>,-<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">if</span> k<span style="color: #306f30;">&gt;</span>m:
        visited<span style="color: black;">&#91;</span><span style="color: black;">&#40;</span>i,j<span style="color: black;">&#41;</span><span style="color: black;">&#93;</span> = k
        <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
    <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
&nbsp;
<span style="color: #000066;font-weight:bold;">def</span> fuzzy_compare<span style="color: black;">&#40;</span>S1, S2, k<span style="color: black;">&#41;</span>:
    N1 = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>S1<span style="color: black;">&#41;</span>
    N2 = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>S2<span style="color: black;">&#41;</span>
    i = j = <span style="color: #ff4500;">0</span>
    visited = <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
    stack = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
    <span style="color: #000066;font-weight:bold;">while</span> <span style="color: #008000;">True</span>:
        <span style="color: #000066;font-weight:bold;">if</span> N1-i<span style="color: #306f30;">&lt;</span>=k <span style="color: #000066;font-weight:bold;">and</span> N2-j<span style="color: #306f30;">&lt;</span>=k:
            <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">True</span>
        <span style="color: #000066;font-weight:bold;">try</span>:
            <span style="color: #000066;font-weight:bold;">if</span> S1<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span> == S2<span style="color: black;">&#91;</span>j<span style="color: black;">&#93;</span>:
                visited<span style="color: black;">&#91;</span><span style="color: black;">&#40;</span>i,j<span style="color: black;">&#41;</span><span style="color: black;">&#93;</span> = k
                i+=<span style="color: #ff4500;">1</span>
                j+=<span style="color: #ff4500;">1</span>
                <span style="color: #000066;font-weight:bold;">continue</span>
        <span style="color: #000066;font-weight:bold;">except</span> <span style="color: #FF0000;font-weight:bold;">IndexError</span>:
            <span style="color: #000066;font-weight:bold;">if</span> stack:
                i, j, k = stack.<span style="color: black;">pop</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #000066;font-weight:bold;">else</span>:
                <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
        <span style="color: #000066;font-weight:bold;">if</span> k == <span style="color: #ff4500;">0</span>:
            <span style="color: #000066;font-weight:bold;">if</span> stack:
                i, j, k = stack.<span style="color: black;">pop</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #000066;font-weight:bold;">else</span>:
                <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
        <span style="color: #000066;font-weight:bold;">else</span>:
            <span style="color: #000066;font-weight:bold;">if</span> <span style="color: #000066;font-weight:bold;">not</span> is_visited<span style="color: black;">&#40;</span>i+<span style="color: #ff4500;">1</span>, j, k-<span style="color: #ff4500;">1</span>, visited<span style="color: black;">&#41;</span>:
                stack.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>i,j,k<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                i, j, k = i+<span style="color: #ff4500;">1</span>, j, k-<span style="color: #ff4500;">1</span>
            <span style="color: #000066;font-weight:bold;">elif</span> <span style="color: #000066;font-weight:bold;">not</span> is_visited<span style="color: black;">&#40;</span>i, j+<span style="color: #ff4500;">1</span>, k-<span style="color: #ff4500;">1</span>, visited<span style="color: black;">&#41;</span>:
                stack.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>i,j,k<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                i, j, k = i, j+<span style="color: #ff4500;">1</span>, k-<span style="color: #ff4500;">1</span>
            <span style="color: #000066;font-weight:bold;">elif</span> <span style="color: #000066;font-weight:bold;">not</span> is_visited<span style="color: black;">&#40;</span>i+<span style="color: #ff4500;">1</span>, j+<span style="color: #ff4500;">1</span>, k-<span style="color: #ff4500;">1</span>, visited<span style="color: black;">&#41;</span>:
                i, j, k = i+<span style="color: #ff4500;">1</span>, j+<span style="color: #ff4500;">1</span>, k-<span style="color: #ff4500;">1</span>
            <span style="color: #000066;font-weight:bold;">elif</span> stack:
                i, j, k = stack.<span style="color: black;">pop</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #000066;font-weight:bold;">else</span>:
                <span style="color: #000066;font-weight:bold;">return</span> <span style="color: #008000;">False</span></pre></div></div>


<p>This is still a nice algorithm and it should be easy to translate it into C or into JavaScript for using it in your browser. Notice that the sequence of <span style="font-family: Courier New,Courier,monospace;">if</span> &#8230; <span style="font-family: Courier New,Courier,monospace;">elif</span> branches can impact performance of the algorithm. Do you see a way to improve it?</p>

<h3>Checking the algorithm</h3>

<p>When D is the Levenshtein distance between two strings S1 and S2 then <span style="font-family: Courier New,Courier,monospace;">fuzzy_compare(S1, S2, k)</span> shall be <span style="font-family: Courier New,Courier,monospace;">True</span> for <span style="font-family: Courier New,Courier,monospace;">k &gt;= D</span> and <span style="font-family: Courier New,Courier,monospace;">False</span> otherwise. So when you want to test <span style="font-family: Courier New,Courier,monospace;">fuzzy_compare</span>, compute the Levenshtein distance and check <span style="font-family: Courier New,Courier,monospace;">fuzzy_compare</span> with the boundary values <span style="font-family: Courier New,Courier,monospace;">k = D</span> and <span style="font-family: Courier New,Courier,monospace;">k = D-1</span>.</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">def</span> levenshtein<span style="color: black;">&#40;</span>s1, s2<span style="color: black;">&#41;</span>:
    l1 = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>s1<span style="color: black;">&#41;</span>
    l2 = <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>s2<span style="color: black;">&#41;</span>
    matrix = <span style="color: black;">&#91;</span><span style="color: #008000;">range</span><span style="color: black;">&#40;</span>l1 + <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#93;</span> <span style="color: #306f30;">*</span> <span style="color: black;">&#40;</span>l2 + <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">for</span> zz <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>l2 + <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>:
      matrix<span style="color: black;">&#91;</span>zz<span style="color: black;">&#93;</span> = <span style="color: #008000;">list</span><span style="color: black;">&#40;</span><span style="color: #008000;">range</span><span style="color: black;">&#40;</span>zz,zz + l1 + <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">for</span> zz <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>,l2<span style="color: black;">&#41;</span>:
      <span style="color: #000066;font-weight:bold;">for</span> sz <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>,l1<span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">if</span> s1<span style="color: black;">&#91;</span>sz<span style="color: black;">&#93;</span> == s2<span style="color: black;">&#91;</span>zz<span style="color: black;">&#93;</span>:
          matrix<span style="color: black;">&#91;</span>zz+<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span>sz+<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> = <span style="color: #008000;">min</span><span style="color: black;">&#40;</span>matrix<span style="color: black;">&#91;</span>zz+<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span>sz<span style="color: black;">&#93;</span> + <span style="color: #ff4500;">1</span>,
                                   matrix<span style="color: black;">&#91;</span>zz<span style="color: black;">&#93;</span><span style="color: black;">&#91;</span>sz+<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> + <span style="color: #ff4500;">1</span>,
                                   matrix<span style="color: black;">&#91;</span>zz<span style="color: black;">&#93;</span><span style="color: black;">&#91;</span>sz<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">else</span>:
          matrix<span style="color: black;">&#91;</span>zz+<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span>sz+<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> = <span style="color: #008000;">min</span><span style="color: black;">&#40;</span>matrix<span style="color: black;">&#91;</span>zz+<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span>sz<span style="color: black;">&#93;</span> + <span style="color: #ff4500;">1</span>,
                                   matrix<span style="color: black;">&#91;</span>zz<span style="color: black;">&#93;</span><span style="color: black;">&#91;</span>sz+<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> + <span style="color: #ff4500;">1</span>,
                                   matrix<span style="color: black;">&#91;</span>zz<span style="color: black;">&#93;</span><span style="color: black;">&#91;</span>sz<span style="color: black;">&#93;</span> + <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">return</span> matrix<span style="color: black;">&#91;</span>l2<span style="color: black;">&#93;</span><span style="color: black;">&#91;</span>l1<span style="color: black;">&#93;</span></pre></div></div>
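<p>The boundary check described above can be sketched as a small self-contained script. Both functions below are compact restatements written for this example and make no claim to be the code from the download: <span style="font-family: Courier New,Courier,monospace;">fuzzy_compare</span> follows the idea of the first, purely recursive implementation, and <span style="font-family: Courier New,Courier,monospace;">levenshtein</span> keeps only two rows of the matrix. The helper <span style="font-family: Courier New,Courier,monospace;">check_boundary</span> is a hypothetical name.</p>

```python
def levenshtein(s1, s2):
    """Edit distance, computed row by row (two rows kept in memory)."""
    previous = list(range(len(s1) + 1))
    for row, c2 in enumerate(s2, 1):
        current = [row]
        for col, c1 in enumerate(s1, 1):
            cost = 0 if c1 == c2 else 1
            current.append(min(current[col - 1] + 1,      # insertion
                               previous[col] + 1,         # deletion
                               previous[col - 1] + cost)) # match / substitution
        previous = current
    return previous[-1]

def fuzzy_compare(S1, S2, k, i=0, j=0):
    """Recursive matcher in the spirit of the first implementation."""
    # Consume the common run of equal characters.
    while i < len(S1) and j < len(S2) and S1[i] == S2[j]:
        i += 1
        j += 1
    # Both remainders fit into the remaining edit budget.
    if len(S1) - i <= k and len(S2) - j <= k:
        return True
    # No budget left, or one string exhausted with too much left over.
    if k == 0 or i == len(S1) or j == len(S2):
        return False
    return (fuzzy_compare(S1, S2, k - 1, i + 1, j) or
            fuzzy_compare(S1, S2, k - 1, i, j + 1) or
            fuzzy_compare(S1, S2, k - 1, i + 1, j + 1))

def check_boundary(S1, S2):
    """True if fuzzy_compare accepts at k = D and rejects at k = D-1."""
    D = levenshtein(S1, S2)
    accepted = fuzzy_compare(S1, S2, D)
    # For D == 0 there is no smaller non-negative k to test.
    rejected = D == 0 or not fuzzy_compare(S1, S2, D - 1)
    return accepted and rejected

ok = all(check_boundary(a, b)
         for a, b in [("kitten", "sitting"), ("haus", "maus"), ("food", "food")])
```

<p>A handful of word pairs is of course no substitute for a systematic test over a whole set of strings.</p>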


<p>For exhaustive testing we define a set of strings as follows:</p>

<p>Given a prescribed n we define the set of strings of length n which consist of &#8220;a&#8221; and &#8220;b&#8221; characters only. The number of those strings is 2^n. If we consider all pairs of strings in that set we get 2^(2n) such pairs. Of course we could exploit symmetries to remove redundant pairs but in order to keep it simple we examine only small strings e.g. n = 6 which leads to 4096 pairs altogether.</p>
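<p>As a quick cross-check of these counts, independent of any hand-written recursion, the same set can be produced with <span style="font-family: Courier New,Courier,monospace;">itertools.product</span>. The names <span style="font-family: Courier New,Courier,monospace;">ab_strings</span> and <span style="font-family: Courier New,Courier,monospace;">ab_pairs</span> are chosen for this sketch only.</p>

```python
from itertools import product

def ab_strings(n):
    """All strings of length n over the alphabet {'a', 'b'}."""
    return ["".join(chars) for chars in product("ab", repeat=n)]

def ab_pairs(n):
    """All ordered pairs of such strings, symmetric duplicates included."""
    strings = ab_strings(n)
    return [(s1, s2) for s1 in strings for s2 in strings]

n_strings = len(ab_strings(6))   # 2^6 strings
n_pairs = len(ab_pairs(6))       # 2^12 pairs
```

<p>For n = 6 this gives 64 strings and 4096 pairs, matching the numbers above.</p>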


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">def</span> string_set<span style="color: black;">&#40;</span>S = <span style="color: #008000;">None</span>, k = <span style="color: #ff4500;">0</span>, strings = <span style="color: #008000;">None</span>, n = <span style="color: #ff4500;">6</span><span style="color: black;">&#41;</span>:
    <span style="color: #000066;font-weight:bold;">if</span> S <span style="color: #000066;font-weight:bold;">is</span> <span style="color: #008000;">None</span>:
        strings = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
        S = <span style="color: black;">&#91;</span><span style="color: #483d8b;">&quot;a&quot;</span><span style="color: black;">&#93;</span><span style="color: #306f30;">*</span>n
        strings.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;&quot;</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>S<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">for</span> i <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>k, n<span style="color: black;">&#41;</span>:
        S1 = S<span style="color: black;">&#91;</span>:<span style="color: black;">&#93;</span>
        S1<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span> = <span style="color: #483d8b;">&quot;b&quot;</span>
        strings.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;&quot;</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>S1<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        string_set<span style="color: black;">&#40;</span>S1, i+<span style="color: #ff4500;">1</span>, strings, n<span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">return</span> strings
&nbsp;
<span style="color: #000066;font-weight:bold;">def</span> string_pairs<span style="color: black;">&#40;</span>n<span style="color: black;">&#41;</span>:
    L1 = string_set<span style="color: black;">&#40;</span>n=n<span style="color: black;">&#41;</span>
    pairs = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
    <span style="color: #000066;font-weight:bold;">for</span> i <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span>L1<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">for</span> k <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span>, n+<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>:
            L2 = string_set<span style="color: black;">&#40;</span>n=k<span style="color: black;">&#41;</span>
            <span style="color: #000066;font-weight:bold;">for</span> j <span style="color: #000066;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span>L2<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>:
                pairs.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>L1<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span>,L2<span style="color: black;">&#91;</span>j<span style="color: black;">&#93;</span>,levenshtein<span style="color: black;">&#40;</span>L1<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span>, L2<span style="color: black;">&#91;</span>j<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                pairs.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>L2<span style="color: black;">&#91;</span>j<span style="color: black;">&#93;</span>,L1<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span>,levenshtein<span style="color: black;">&#40;</span>L2<span style="color: black;">&#91;</span>j<span style="color: black;">&#93;</span>, L1<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">return</span> pairs</pre></div></div>
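
<p>As a cross-check of the counts given above, here is a minimal sketch that builds the same string set with <span style="font-family: Courier New,Courier,monospace;">itertools.product</span>. The helper name <span style="font-family: Courier New,Courier,monospace;">ab_strings</span> is made up for this illustration. If I read <span style="font-family: Courier New,Courier,monospace;">string_pairs</span> correctly, it additionally pairs each string of length n with all strings of length 1..n in both orders, so it returns more than 2^(2n) tuples; the last line computes that number.</p>

```python
from itertools import product

def ab_strings(n):
    """All strings of length n over the alphabet {"a", "b"} (hypothetical helper)."""
    return ["".join(chars) for chars in product("ab", repeat=n)]

n = 6
strings = ab_strings(n)

# All ordered pairs of strings of equal length n: 2^n * 2^n = 2^(2n)
same_length_pairs = [(s1, s2) for s1 in strings for s2 in strings]

# Tuples produced when every string of length n is paired with every string
# of length 1..n, once in each order (assumed reading of string_pairs).
mixed_length_tuples = 2 * len(strings) * sum(len(ab_strings(k)) for k in range(1, n + 1))

print(len(strings), len(same_length_pairs), mixed_length_tuples)
```

<p>For n = 6 this gives 64 strings and 4096 same-length pairs, while the mixed-length count is 16128.</p>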


<p>Our test function is short:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">def</span> <span style="">test</span><span style="color: black;">&#40;</span>n<span style="color: black;">&#41;</span>:
    <span style="color: #000066;font-weight:bold;">for</span> S1, S2, D <span style="color: #000066;font-weight:bold;">in</span> string_pairs<span style="color: black;">&#40;</span>n<span style="color: black;">&#41;</span>:
        <span style="color: #000066;font-weight:bold;">assert</span> fuzzy_compare<span style="color: black;">&#40;</span>S1, S2, D<span style="color: black;">&#41;</span> == <span style="color: #008000;">True</span>, <span style="color: black;">&#40;</span>S1, S2, D<span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">assert</span> fuzzy_compare<span style="color: black;">&#40;</span>S1, S2, D-<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span> == <span style="color: #008000;">False</span>, <span style="color: black;">&#40;</span>S1, S2, D-<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span></pre></div></div>


<p>Have fun!</p>
]]></content:encoded>
			<wfw:commentRss>http://fiber-space.de/wordpress/2010/12/21/fuzzy-string-matching-and-grammars/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Python26 expressions</title>
		<link>http://fiber-space.de/wordpress/2010/11/26/python26-expressions/</link>
		<comments>http://fiber-space.de/wordpress/2010/11/26/python26-expressions/#comments</comments>
		<pubDate>Fri, 26 Nov 2010 04:28:19 +0000</pubDate>
		<dc:creator>kay</dc:creator>
				<category><![CDATA[Grammars]]></category>
		<category><![CDATA[Langscape]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://fiber-space.de/wordpress/?p=1688</guid>
		<description><![CDATA[When you look at the following listing you might think it&#8217;s just a sequence of nonsense statements in Python 26,maybe created for testing purposes: raise a, b, c import d from e import* import f from .g import&#40;a&#41; from b import c from .import&#40;e&#41; from f import&#40;g&#41; from .a import&#40;b, c as d,&#41; import e, [...]]]></description>
			<content:encoded><![CDATA[<p>When you look at the following listing you might think it&#8217;s just a sequence of nonsense statements in Python 2.6, maybe created for testing purposes:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">raise</span> a, b, c
<span style="color: #000066;font-weight:bold;">import</span> d
<span style="color: #000066;font-weight:bold;">from</span> e <span style="color: #000066;font-weight:bold;">import</span><span style="color: #306f30;">*</span>
<span style="color: #000066;font-weight:bold;">import</span> f
<span style="color: #000066;font-weight:bold;">from</span> .<span style="color: black;">g</span> <span style="color: #000066;font-weight:bold;">import</span><span style="color: black;">&#40;</span>a<span style="color: black;">&#41;</span>
<span style="color: #000066;font-weight:bold;">from</span> b <span style="color: #000066;font-weight:bold;">import</span> c
<span style="color: #000066;font-weight:bold;">from</span> .<span style="color: #000066;font-weight:bold;">import</span><span style="color: black;">&#40;</span>e<span style="color: black;">&#41;</span>
<span style="color: #000066;font-weight:bold;">from</span> f <span style="color: #000066;font-weight:bold;">import</span><span style="color: black;">&#40;</span>g<span style="color: black;">&#41;</span>
<span style="color: #000066;font-weight:bold;">from</span> .<span style="color: black;">a</span> <span style="color: #000066;font-weight:bold;">import</span><span style="color: black;">&#40;</span>b, c <span style="color: #000066;font-weight:bold;">as</span> d,<span style="color: black;">&#41;</span>
<span style="color: #000066;font-weight:bold;">import</span> e, f <span style="color: #000066;font-weight:bold;">as</span> g
<span style="color: #000066;font-weight:bold;">from</span> .<span style="color: black;">a</span> <span style="color: #000066;font-weight:bold;">import</span><span style="color: black;">&#40;</span>b, c,<span style="color: black;">&#41;</span>
...</pre></div></div>


<p>(<a href="http://www.fiber-space.de/misc/python26expr.py">Here</a> is the complete listing.)</p>

<p>That impression is of course correct. Still, it is a somewhat peculiar listing because you find expressions like</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">c<span style="color: black;">&#91;</span>:,: d:,<span style="color: black;">&#93;</span><span style="color: #306f30;">**</span>e</pre></div></div>


<p>or</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: black;">&#91;</span>c <span style="color: #000066;font-weight:bold;">for</span> d <span style="color: #000066;font-weight:bold;">in</span> e, <span style="color: #000066;font-weight:bold;">lambda</span> f = g, <span style="color: black;">&#40;</span>a, b,<span style="color: black;">&#41;</span> = c,: d, <span style="color: #000066;font-weight:bold;">for</span> e <span style="color: #000066;font-weight:bold;">in</span> f<span style="color: black;">&#93;</span></pre></div></div>


<p>and a few others such as</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">b<span style="color: black;">&#40;</span><span style="color: #306f30;">*</span>c, d, <span style="color: #306f30;">**</span>e<span style="color: black;">&#41;</span><span style="color: #306f30;">**</span>f</pre></div></div>


<p>which will be rejected by the Python compiler because the <span style="font-family: Courier New,Courier,monospace;">*c</span> argument precedes the name <span style="font-family: Courier New,Courier,monospace;">d</span>. It is nevertheless &#8220;grammar correct&#8221;, by which I mean it is consistent with Python&#8217;s context-free LL(1) grammar.</p>

<p>The nice thing about the listing: it is automatically generated and it is complete in a certain sense.</p>

<h3>Generative Grammars</h3>

<p>When you look at a grammar rule like</p>


<div class="wp_syntax"><div class="code"><pre class="bnf" style="font-family:monospace;">for_stmt<span style="color: #FF0000; font-weight: bold;">:</span> <span style="color: #a00;">'for'</span> exprlist <span style="color: #a00;">'in'</span> testlist <span style="color: #a00;">':'</span> suite <span style="">&#91;</span><span style="color: #a00;">'else'</span> <span style="color: #a00;">':'</span> suite<span style="">&#93;</span></pre></div></div>


<p>it can be understood as an instruction for producing exactly two expressions, namely:</p>


<div class="wp_syntax"><div class="code"><pre class="bnf" style="font-family:monospace;"><span style="color: #a00;">'for'</span> exprlist <span style="color: #a00;">'in'</span> testlist <span style="color: #a00;">':'</span> suite
<span style="color: #a00;">'for'</span> exprlist <span style="color: #a00;">'in'</span> testlist <span style="color: #a00;">':'</span> suite <span style="color: #a00;">'else'</span> <span style="color: #a00;">':'</span> suite</pre></div></div>


<p>Other rules like</p>


<div class="wp_syntax"><div class="code"><pre class="bnf" style="font-family:monospace;">dictmaker<span style="color: #FF0000; font-weight: bold;">:</span> test <span style="color: #a00;">':'</span> test <span style="">&#40;</span><span style="color: #a00;">','</span> test <span style="color: #a00;">':'</span> test<span style="">&#41;</span><span style="color: #000066; font-weight: bold;">*</span> <span style="">&#91;</span><span style="color: #a00;">','</span><span style="">&#93;</span></pre></div></div>


<p>have an infinite number of productions:</p>


<div class="wp_syntax"><div class="code"><pre class="bnf" style="font-family:monospace;">test <span style="color: #a00;">':'</span> test
test <span style="color: #a00;">':'</span> test <span style="color: #a00;">','</span>
test <span style="color: #a00;">':'</span> test <span style="color: #a00;">','</span> test <span style="color: #a00;">':'</span> test
test <span style="color: #a00;">':'</span> test <span style="color: #a00;">','</span> test <span style="color: #a00;">':'</span> test <span style="color: #a00;">','</span>
...</pre></div></div>


<p>When I created the listing I selected a small number of productions for each grammar rule. Each symbol in the rule should be covered and have at least one occurrence in the set of productions. Although <span style="font-family: Courier New,Courier,monospace;">for_stmt</span> has finitely many productions and <span style="font-family: Courier New,Courier,monospace;">dictmaker</span> infinitely many, the algorithm creates two productions for each.</p>
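
<p>To make the notion of &#8220;productions of a rule&#8221; concrete, here is a minimal sketch that enumerates them for a rule body, unrolling every Kleene star at most <span style="font-family: Courier New,Courier,monospace;">max_repeat</span> times. The nested-tuple representation and the function name <span style="font-family: Courier New,Courier,monospace;">expansions</span> are assumptions made for this illustration and are not Langscape&#8217;s data structures. It also does not perform the selection of a covering subset described above.</p>

```python
from itertools import product

# Hypothetical mini-representation of a rule body (not Langscape's own):
#   "name"              a terminal or non-terminal symbol
#   ("seq", a, b, ...)  a sequence of parts
#   ("opt", a)          an optional part      [a]
#   ("star", a)         a repeated part       (a)*

def expansions(node, max_repeat=1):
    """Return all symbol sequences of node, unrolling each star 0..max_repeat times."""
    if isinstance(node, str):
        return [[node]]
    kind, parts = node[0], node[1:]
    if kind == "seq":
        result = [[]]
        for part in parts:
            result = [left + right
                      for left in result
                      for right in expansions(part, max_repeat)]
        return result
    if kind == "opt":
        return [[]] + expansions(parts[0], max_repeat)
    if kind == "star":
        inner = expansions(parts[0], max_repeat)
        result = [[]]
        for n in range(1, max_repeat + 1):
            for combo in product(inner, repeat=n):
                result.append([symbol for seq in combo for symbol in seq])
        return result
    raise ValueError("unknown node kind: %r" % (kind,))

for_stmt = ("seq", "'for'", "exprlist", "'in'", "testlist", "':'", "suite",
            ("opt", ("seq", "'else'", "':'", "suite")))
dictmaker = ("seq", "test", "':'", "test",
             ("star", ("seq", "','", "test", "':'", "test")),
             ("opt", "','"))

for rule in (for_stmt, dictmaker):
    for symbols in expansions(rule):
        print(" ".join(symbols))
```

<p>With <span style="font-family: Courier New,Courier,monospace;">max_repeat=1</span> this yields the two productions of <span style="font-family: Courier New,Courier,monospace;">for_stmt</span> and the first four productions of <span style="font-family: Courier New,Courier,monospace;">dictmaker</span> shown above. Raising <span style="font-family: Courier New,Courier,monospace;">max_repeat</span> produces longer ones.</p>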

<p>After having enough productions to cover all syntactical subtleties (expressed by the grammar) I had to build one big rule containing all productions. This was actually the most demanding step in the design of the algorithm and I initially got it wrong.</p>

<h3>Embedding of productions</h3>

<p>Intuitively we can interpret all non-terminal symbols in our productions as variables which may be substituted. We expand <span style="font-family: Courier New,Courier,monospace;">test</span> in <span style="font-family: Courier New,Courier,monospace;">dictmaker</span> by selecting and inserting one production we got for the rule <span style="font-family: Courier New,Courier,monospace;">test</span>. Unfortunately a grammar isn&#8217;t a tree but a directed, cyclic graph, so we have to be extremely careful not to run into an infinite replacement loop. This is only a technical problem though, and it can be handled using memoization. Here is a bigger one.</p>

<p>Look at the following two rules:</p>


<div class="wp_syntax"><div class="code"><pre class="bnf" style="font-family:monospace;">expr_stmt<span style="color: #FF0000; font-weight: bold;">:</span> testlist <span style="">&#40;</span>augassign <span style="">&#40;</span>yield_expr<span style="color: #000066; font-weight: bold;">|</span>testlist<span style="">&#41;</span> <span style="color: #000066; font-weight: bold;">|</span>
           <span style="">&#40;</span><span style="color: #a00;">'='</span> <span style="">&#40;</span>yield_expr<span style="color: #000066; font-weight: bold;">|</span>testlist<span style="">&#41;</span><span style="">&#41;</span><span style="color: #000066; font-weight: bold;">*</span><span style="">&#41;</span>
augassign<span style="color: #FF0000; font-weight: bold;">:</span> <span style="">&#40;</span><span style="color: #a00;">'+='</span> <span style="color: #000066; font-weight: bold;">|</span> <span style="color: #a00;">'-='</span> <span style="color: #000066; font-weight: bold;">|</span> <span style="color: #a00;">'*='</span> <span style="color: #000066; font-weight: bold;">|</span> <span style="color: #a00;">'/='</span> <span style="color: #000066; font-weight: bold;">|</span> <span style="color: #a00;">'%='</span> <span style="color: #000066; font-weight: bold;">|</span> <span style="color: #a00;">'&amp;='</span> <span style="color: #000066; font-weight: bold;">|</span> <span style="color: #a00;">'|='</span> <span style="color: #000066; font-weight: bold;">|</span> <span style="color: #a00;">'^='</span> <span style="color: #000066; font-weight: bold;">|</span>
            <span style="color: #a00;">'&lt;&lt;='</span> <span style="color: #000066; font-weight: bold;">|</span> <span style="color: #a00;">'&gt;&gt;='</span> <span style="color: #000066; font-weight: bold;">|</span> <span style="color: #a00;">'**='</span> <span style="color: #000066; font-weight: bold;">|</span> <span style="color: #a00;">'//='</span><span style="">&#41;</span></pre></div></div>


<p>The only place where <span style="font-family: Courier New,Courier,monospace;">augassign</span> occurs in the grammar is in <span style="font-family: Courier New,Courier,monospace;">expr_stmt</span>, but counting the number of productions for <span style="font-family: Courier New,Courier,monospace;">augassign</span> we get 12 whereas we only count 3 productions for <span style="font-family: Courier New,Courier,monospace;">expr_stmt</span>, and there is just a single production which contains <span style="font-family: Courier New,Courier,monospace;">expr_stmt</span>. It is obviously impossible to use a naive top-down substitution without leaving a remainder of productions which can&#8217;t be integrated. We have a system of dependencies which has to be resolved, and the initial set of production rules must be adapted without introducing new productions which in turn cause new problems. This is possible, but in my attempts the expressions became large and unreadable, so I tried something else.</p>

<p>Observe that the most important start rule of the grammar (Python actually has 4! Can you see which ones?) is:</p>


<div class="wp_syntax"><div class="code"><pre class="bnf" style="font-family:monospace;">file_input<span style="color: #FF0000; font-weight: bold;">:</span> <span style="">&#40;</span>NEWLINE <span style="color: #000066; font-weight: bold;">|</span> stmt<span style="">&#41;</span><span style="color: #000066; font-weight: bold;">*</span> ENDMARKER</pre></div></div>


<p>I expect that each language has a rule of such a kind on a certain level of nesting. It produces a sequence of statements and newlines. I tried the following Ansatz:</p>

<p><em>Wrap each initially determined production which is not a production of a start rule into a stmt</em></p>

<p>Take the production &#8216;+=&#8217; of <span style="font-family: Courier New,Courier,monospace;">augassign</span> as an example. We find that <span style="font-family: Courier New,Courier,monospace;">augassign</span> exists in <span style="font-family: Courier New,Courier,monospace;">expr_stmt</span>. So we take one <span style="font-family: Courier New,Courier,monospace;">expr_stmt</span> and <em>embed</em> <span style="font-family: Courier New,Courier,monospace;">augassign</span> in the concrete form &#8216;+=&#8217;.</p>


<div class="wp_syntax"><div class="code"><pre class="bnf" style="font-family:monospace;">testlist <span style="color: #a00;">'+='</span> yield_expr</pre></div></div>


<p>The subsequent embedding steps are:</p>


<div class="wp_syntax"><div class="code"><pre class="bnf" style="font-family:monospace;">expr_stmt   -<span style="color: #000066; font-weight: bold;">&gt;</span> small_stmt
small_stmt  -<span style="color: #000066; font-weight: bold;">&gt;</span> simple_stmt
simple_stmt -<span style="color: #000066; font-weight: bold;">&gt;</span> stmt</pre></div></div>


<p>When embedding <span style="font-family: Courier New,Courier,monospace;">small_stmt</span> into <span style="font-family: Courier New,Courier,monospace;">simple_stmt</span> one has to add a trailing NEWLINE. So our final result is:</p>


<div class="wp_syntax"><div class="code"><pre class="bnf" style="font-family:monospace;">testlist <span style="color: #a00;">'+='</span> yield_expr NEWLINE</pre></div></div>


<p>Any rule we used during successive embedding doesn&#8217;t have to be used again as an initial rule of another embedding because it was already built into <span style="font-family: Courier New,Courier,monospace;">file_input</span>. It can be reused though when needed. I did not attempt to minimize the number of embeddings.</p>
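
<p>The embedding chain for &#8216;+=&#8217; can be sketched in a few lines. This is a simplified illustration under stated assumptions: the grammar is a hand-written toy fragment stored as a dictionary of flat productions, the function name <span style="font-family: Courier New,Courier,monospace;">embed</span> is made up, and the search greedily takes the first parent rule it finds, so it may fail on grammars where that choice leads into a dead end.</p>

```python
# Toy fragment: rule name -> list of productions (each a list of symbols).
# Hand-simplified for illustration; not the real Python 2.6 grammar.
grammar = {
    "stmt":        [["simple_stmt"], ["compound_stmt"]],
    "simple_stmt": [["small_stmt", "NEWLINE"]],
    "small_stmt":  [["expr_stmt"], ["pass_stmt"]],
    "expr_stmt":   [["testlist", "augassign", "yield_expr"], ["testlist"]],
    "augassign":   [["'+='"], ["'-='"]],
}

def embed(symbols, rule, grammar, target="stmt"):
    """Wrap `symbols`, one production of `rule`, into parent rules up to `target`."""
    seen = {rule}
    while rule != target:
        for parent, productions in grammar.items():
            if parent in seen:
                continue
            # first production of this parent that mentions the current rule
            hit = next((p for p in productions if rule in p), None)
            if hit is not None:
                break
        else:
            raise ValueError("no path from %s to %s" % (rule, target))
        i = hit.index(rule)
        # replace the occurrence of `rule` by what we have built so far
        symbols = hit[:i] + symbols + hit[i + 1:]
        seen.add(parent)
        rule = parent
    return symbols

print(" ".join(embed(["'+='"], "augassign", grammar)))
```

<p>On this fragment the result is the sequence derived above, including the trailing NEWLINE picked up in <span style="font-family: Courier New,Courier,monospace;">simple_stmt</span>.</p>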

<h3>Substitute non-terminals</h3>

<p>Now that we have a single sequence of terminals and non-terminals which contains all our productions in a consistent way, we are going to substitute the non-terminals. This is done such that a minimum number of terminal symbols is required, which explains some of the redundancies: we find <span style="font-family: Courier New,Courier,monospace;">import f</span> and <span style="font-family: Courier New,Courier,monospace;">import d</span> among the listed statements. I suspect one of them is a shortened form of <span style="font-family: Courier New,Courier,monospace;">import d.e</span>, but since the rule for building <span style="font-family: Courier New,Courier,monospace;">d.e</span> allows using <span style="font-family: Courier New,Courier,monospace;">d</span> only and that is shorter, it will be chosen.</p>
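
<p>One way to find a shortest terminal expansion for every non-terminal, even when the grammar is cyclic, is a fixed-point iteration. The following is a minimal sketch of that idea and not necessarily what Langscape does; the toy grammar and the name <span style="font-family: Courier New,Courier,monospace;">shortest_expansions</span> are assumptions for this illustration.</p>

```python
def shortest_expansions(grammar):
    """Map each non-terminal to one shortest sequence of terminals it can derive.

    `grammar` maps a rule name to a list of productions (lists of symbols).
    Every symbol that is not a key of `grammar` counts as a terminal.
    """
    best = {}
    changed = True
    while changed:
        changed = False
        for name, productions in grammar.items():
            for production in productions:
                expanded = []
                for symbol in production:
                    if symbol not in grammar:
                        expanded.append(symbol)        # terminal
                    elif symbol in best:
                        expanded.extend(best[symbol])  # already resolved
                    else:
                        break                          # not resolvable yet
                else:
                    if name not in best or len(expanded) < len(best[name]):
                        best[name] = expanded
                        changed = True
    return best

# Toy fragment with a cycle (expr refers to itself); not the real Python grammar.
toy = {
    "import_name": [["import", "dotted_name"]],
    "dotted_name": [["NAME", ".", "NAME"], ["NAME"]],
    "expr":        [["expr", "+", "atom"], ["atom"]],
    "atom":        [["(", "expr", ")"], ["NAME"]],
}

for name, symbols in sorted(shortest_expansions(toy).items()):
    print(name, "->", " ".join(symbols))
```

<p>The loop terminates because a stored expansion is only ever replaced by a strictly shorter one. In the toy fragment <span style="font-family: Courier New,Courier,monospace;">import_name</span> resolves to <span style="font-family: Courier New,Courier,monospace;">import NAME</span>, which mirrors how <span style="font-family: Courier New,Courier,monospace;">import d</span> wins over <span style="font-family: Courier New,Courier,monospace;">import d.e</span>.</p>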

<h3>Detecting Grammar flaws</h3>

<p>Generating the above expressions also shows some flaws in the grammar which have to be corrected using the bytecode compiler (or AST transformer). This doesn&#8217;t mean that Python&#8217;s grammar isn&#8217;t carefully crafted, quite the contrary is true, but it highlights some of the limitations of using an LL(1) grammar. For example, it is quite simple, although a little cumbersome, to express argument orderings in variable argument lists using non-LL(1) grammars:</p>


<div class="wp_syntax"><div class="code"><pre class="bnf" style="font-family:monospace;">file_input<span style="color: #FF0000; font-weight: bold;">:</span> <span style="">&#40;</span>NEWLINE <span style="color: #000066; font-weight: bold;">|</span> stmt<span style="">&#41;</span><span style="color: #000066; font-weight: bold;">*</span> ENDMARKER
simpleargs<span style="color: #FF0000; font-weight: bold;">:</span> fpdef <span style="">&#40;</span><span style="color: #a00;">','</span> fpdef<span style="">&#41;</span><span style="color: #000066; font-weight: bold;">*</span>
defaultargs<span style="color: #FF0000; font-weight: bold;">:</span> fpdef <span style="color: #a00;">'='</span> test <span style="">&#40;</span><span style="color: #a00;">','</span> fpdef <span style="color: #a00;">'='</span> test<span style="">&#41;</span><span style="color: #000066; font-weight: bold;">*</span>
starargs<span style="color: #FF0000; font-weight: bold;">:</span> <span style="color: #a00;">'*'</span> NAME
dstarargs<span style="color: #FF0000; font-weight: bold;">:</span> <span style="color: #a00;">'**'</span> NAME
varargslist<span style="color: #FF0000; font-weight: bold;">:</span> <span style="">&#40;</span> simpleargs <span style="">&#91;</span><span style="color: #a00;">','</span> defaultargs<span style="">&#93;</span> <span style="">&#91;</span><span style="color: #a00;">','</span> starargs<span style="">&#93;</span> <span style="">&#91;</span><span style="color: #a00;">','</span>dstarargs<span style="">&#93;</span> <span style="color: #000066; font-weight: bold;">|</span>
               defaultargs <span style="">&#91;</span><span style="color: #a00;">','</span> starargs<span style="">&#93;</span> <span style="">&#91;</span><span style="color: #a00;">','</span>dstarargs<span style="">&#93;</span> <span style="color: #000066; font-weight: bold;">|</span>
               starargs <span style="">&#91;</span><span style="color: #a00;">','</span>dstarargs<span style="">&#93;</span> <span style="color: #000066; font-weight: bold;">|</span>
               dstarargs<span style="">&#41;</span> <span style="">&#91;</span><span style="color: #a00;">','</span><span style="">&#93;</span></pre></div></div>


<p>So when you craft your own grammar, automatic expression generation might aid design decisions. Detecting flaws early can spare lots of code used to add additional checks later on.</p>

<h3>Refactorings</h3>

<p>In the case of Langscape the primary goal was to safeguard grammar refactorings. It is not generally possible to prove that two context-free grammars are equal, i.e. recognize the same language. But the same holds for any two programs in the general case in even more powerful, Turing-complete languages. This doesn&#8217;t imply we never change any code. It is standard practice to safeguard refactorings using unit tests and so we do the same here.</p>

<p>If we assume that two different grammars G1, G2 recognize the same language L then their parsers P(G1), P(G2) must at least be able to parse the grammar generated expression of the other grammar respectively: P(G1)(Expr(G2)) -&gt; OK;  P(G2)(Expr(G1)) -&gt; OK.</p>
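
<p>The criterion can be written down as a small test helper. In this sketch regular expressions stand in for the grammars and their parsers, and short hand-picked strings stand in for the generated expressions; <span style="font-family: Courier New,Courier,monospace;">cross_check</span> and <span style="font-family: Courier New,Courier,monospace;">regex_parser</span> are hypothetical names and not part of Langscape.</p>

```python
import re

def cross_check(parse_1, samples_1, parse_2, samples_2):
    """Return the (parser, sample) combinations that fail to parse.

    parse_1 must accept the samples generated from G2 and vice versa.
    """
    failures = []
    failures += [("P(G1)", text) for text in samples_2 if not parse_1(text)]
    failures += [("P(G2)", text) for text in samples_1 if not parse_2(text)]
    return failures

def regex_parser(pattern):
    """Stand-in for a real parser: accept a text iff the whole text matches."""
    compiled = re.compile(pattern)
    return lambda text: compiled.fullmatch(text) is not None

p_g1 = regex_parser(r"a(ba)*")   # original "grammar"
p_g2 = regex_parser(r"(ab)*a")   # refactored, same language
p_g3 = regex_parser(r"(ab)+a")   # faulty refactoring: no longer accepts "a"

expr_g1 = ["a", "aba"]
expr_g2 = ["a", "ababa"]
expr_g3 = ["aba", "ababa"]

ok = cross_check(p_g1, expr_g1, p_g2, expr_g2)
bad = cross_check(p_g1, expr_g1, p_g3, expr_g3)
print(ok, bad)
```

<p>An empty result does not prove equality of the two languages. It only says that the samples did not expose a difference, which is the usual situation with unit tests.</p>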

<p>Of course we can refine this criterion by including bad case tests or comparing the selection sequences of TokenTracers for Expr(G1), Expr(G2) which must be equal. Last but not least we can use higher approximations.</p>

<h3>Higher approximations</h3>

<p>Doesn&#8217;t the listing give us a 1st order approximation of the language? It&#8217;s a fun idea to think of all those listing expressions living in the &#8220;tangential space&#8221; of the language. &#8220;Higher approximation&#8221; would simply mean longer traces though ( if they are possible due to the presence of a Kleene star ). This yields a simpler idea: we create the set <span style="font-family: Courier New,Courier,monospace;">Tr(K, nfa)</span> of traces of length <span style="font-family: Courier New,Courier,monospace;">K</span> for a given nfa. <span style="font-family: Courier New,Courier,monospace;">Tr(K, nfa)</span> may be empty for some K. Unfortunately we can&#8217;t infer from <span style="font-family: Courier New,Courier,monospace;">Tr(K) = {}</span> that <span style="font-family: Courier New,Courier,monospace;">Tr(K+1) = {}</span>. So what is a good stop criterion then?</p>

<p>The algorithm for creating <span style="font-family: Courier New,Courier,monospace;">Tr(K, nfa)</span> is quite simple. The following functions are Langscape implementations:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #000066;font-weight:bold;">def</span> compute_tr<span style="color: black;">&#40;</span>K, nfa<span style="color: black;">&#41;</span>:
    <span style="color: #483d8b;">''</span><span style="color: #483d8b;">'
    Computes the set Tr(K, nfa) of traces of length K for a given nfa.
    The return value may be [] if no trace of length K exists.
    '</span><span style="color: #483d8b;">''</span>
    _, start, trans = nfa
    <span style="color: #000066;font-weight:bold;">return</span> compute_subtraces<span style="color: black;">&#40;</span>K, <span style="color: #ff4500;">0</span>, start, <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>, trans<span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #000066;font-weight:bold;">def</span> compute_subtraces<span style="color: black;">&#40;</span>K, k, S, trace, trans<span style="color: black;">&#41;</span>:
    <span style="color: #483d8b;">''</span><span style="color: #483d8b;">'
    Computes complete traces of a given length.
&nbsp;
    :param K: The prescribed length a trace shall have.
    :param k: The current length of a trace ( used by recursive calls ).
    :param S: The current state whose follow-states are examined.
    :param trace: the current trace.
    :param trans: the {state:[follow-states]} dictionary which characterizes
                  one NFA.
    '</span><span style="color: #483d8b;">''</span>
    traces = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
    follow = trans<span style="color: black;">&#91;</span>S<span style="color: black;">&#93;</span>
    <span style="color: #000066;font-weight:bold;">for</span> F <span style="color: #000066;font-weight:bold;">in</span> follow:
        <span style="color: #000066;font-weight:bold;">if</span> F<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> <span style="color: #000066;font-weight:bold;">is</span> <span style="color: #008000;">None</span>:
            <span style="color: #808080; font-style: italic;"># termination condition fulfilled?</span>
            <span style="color: #000066;font-weight:bold;">if</span> k == K:
                traces.<span style="color: black;">append</span><span style="color: black;">&#40;</span>trace+<span style="color: black;">&#91;</span>F<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
        <span style="color: #000066;font-weight:bold;">else</span>:
            m = trace.<span style="color: black;">count</span><span style="color: black;">&#40;</span>F<span style="color: black;">&#41;</span>
            <span style="color: #808080; font-style: italic;"># impossible to terminate trace under this condition</span>
            <span style="color: #000066;font-weight:bold;">if</span> m == K:
                <span style="color: #000066;font-weight:bold;">continue</span>
            <span style="color: #000066;font-weight:bold;">else</span>:
                traces+=compute_subtraces<span style="color: black;">&#40;</span>K, <span style="color: #008000;">max</span><span style="color: black;">&#40;</span>k,m+<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>, F, trace+<span style="color: black;">&#91;</span>F<span style="color: black;">&#93;</span>, trans<span style="color: black;">&#41;</span>
    <span style="color: #000066;font-weight:bold;">return</span> traces</pre></div></div>
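
<p>Here is a small usage sketch. It repeats the two functions in plain form so that it runs on its own, and applies them to a hand-built NFA for the toy rule <span style="font-family: Courier New,Courier,monospace;">R: 'a' ('b')*</span>. The concrete state tuples are assumptions made for this example; the code only relies on the exit state having <span style="font-family: Courier New,Courier,monospace;">None</span> as its first entry. As far as I read the code, the &#8220;length&#8221; <span style="font-family: Courier New,Courier,monospace;">k</span> of a trace is the highest number of times any single state occurs in it.</p>

```python
def compute_tr(K, nfa):
    _, start, trans = nfa
    return compute_subtraces(K, 0, start, [], trans)

def compute_subtraces(K, k, S, trace, trans):
    traces = []
    for F in trans[S]:
        if F[0] is None:
            # exit state reached: keep the trace only if it has length K
            if k == K:
                traces.append(trace + [F])
        else:
            m = trace.count(F)
            # F already occurs K times: extending would exceed the length
            if m == K:
                continue
            traces += compute_subtraces(K, max(k, m + 1), F, trace + [F], trans)
    return traces

# Hand-built NFA for the toy rule  R: 'a' ('b')*  (state layout is assumed)
START = ("R", 0, "R")
A = ("a", 1, "R")
B = ("b", 2, "R")
EXIT = (None, "-", "R")
nfa = ("R", START, {START: [A], A: [B, EXIT], B: [B, EXIT]})

for K in range(0, 4):
    print(K, [[state[0] for state in trace] for trace in compute_tr(K, nfa)])
```

<p>For this NFA <span style="font-family: Courier New,Courier,monospace;">Tr(0)</span> is empty, <span style="font-family: Courier New,Courier,monospace;">Tr(1)</span> contains the traces for <span style="font-family: Courier New,Courier,monospace;">a b</span> and <span style="font-family: Courier New,Courier,monospace;">a</span>, and each larger K yields exactly one trace in which <span style="font-family: Courier New,Courier,monospace;">b</span> is repeated K times.</p>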

]]></content:encoded>
			<wfw:commentRss>http://fiber-space.de/wordpress/2010/11/26/python26-expressions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Open source saturation</title>
		<link>http://fiber-space.de/wordpress/2010/08/07/open-source-saturation/</link>
		<comments>http://fiber-space.de/wordpress/2010/08/07/open-source-saturation/#comments</comments>
		<pubDate>Sat, 07 Aug 2010 02:34:32 +0000</pubDate>
		<dc:creator>kay</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Programming Culture]]></category>

		<guid isPermaLink="false">http://fiber-space.de/wordpress/?p=1592</guid>
		<description><![CDATA[Reading the following post of Jimmy Schementi who explained his exit at Microsoft with the loss of MS&#8217;s interest in IronRuby I start to wonder if this isn&#8217;t a sign of the times? Open source projects get started by a small team of employees and killed when they don&#8217;t attract a community which brings them [...]]]></description>
			<content:encoded><![CDATA[<p>Reading the post by <a href="http://blog.jimmy.schementi.com/2010/08/start-spreading-news-future-of-jimmy.html">Jimmy Schementi</a>, who explained his exit from Microsoft with MS&#8217;s loss of interest in IronRuby, I start to wonder if this isn&#8217;t a sign of the times. Open source projects get started by a small team of employees and killed when they don&#8217;t attract a community which carries them forward. That rarely ever happens because everyone in OSS is already busy: either engaged with a major project, a brand which was established a few years ago like (C)Python, Rails, Linux or Django, or doing solo acts as in my own case. The same goes for <a href="http://googleblog.blogspot.com/2010/08/update-on-google-wave.html">Google Wave</a>, which was promising, but the only wave it produced was a tsunami of initial attention in the wikiredditblogosphere. Everyone expected Google would carry it forward just like any other commodity. I guess the same would happen to their Go language, which was started by a superstar team of veteran programmers and would immediately go away if Google discontinued investment.</p>

<p>There are very few brands which are both new and doing well, like Clojure and Scala, which seem to follow Python&#8217;s BDFL model, and they are &#8211; unsurprisingly? &#8211; programming languages. Are there other examples of OSS projects that peaked in the last 2-3 years and established a community of regular committers who are not interns of a single company, or do we see an almost inevitable saturation?</p>
]]></content:encoded>
			<wfw:commentRss>http://fiber-space.de/wordpress/2010/08/07/open-source-saturation/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Langscape</title>
		<link>http://fiber-space.de/wordpress/2010/07/16/langscape/</link>
		<comments>http://fiber-space.de/wordpress/2010/07/16/langscape/#comments</comments>
		<pubDate>Fri, 16 Jul 2010 16:15:47 +0000</pubDate>
		<dc:creator>kay</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://fiber-space.de/wordpress/?p=1571</guid>
		<description><![CDATA[Trails in a Langscape Welcome to Trails in a Langscape, which is the new title of this blog. It is a minor change, since URLs are not affected and the character of the blog will remain the same. Langscape is the successor project of EasyExtend and is publicly hosted at Google Code. Since I [...]]]></description>
			<content:encoded><![CDATA[<h3>Trails in a Langscape</h3>

<p>Welcome to <strong>Trails in a Langscape</strong>, which is the new title of this blog. It is a minor change, since URLs are not affected and the character of the blog will remain the same. <strong>Langscape</strong> is the successor project of EasyExtend and is publicly <a href="http://code.google.com/p/langscape/">hosted</a> at Google Code.</p>

<p>Ever since I created this WordPress blog instance I have been slowly working on a new EasyExtend release. I published lots of related technical ideas but never released any code. Now the code <a href="http://code.google.com/p/langscape/source/checkout">is out</a>. It lives in an Hg repository, it is filed under a new label, and hopefully a first packaged Langscape 0.1 release will follow soon. There is no project documentation at the moment and I am still thinking about how to organize it. Another open issue is packaging and distribution: I have no idea what is up to date in this area, how Langscape might be used, or whether anyone will ever create langlets rather than just use the growing toolbox applicable to Python, including syntactically guarded search and replace.</p>

<h3>Europython 2010</h3>

<p>Of course, the timing of this publication is no accident. I will attend <a href="http://www.europython.eu/">Europython 2010</a> next week in Birmingham and give a talk about EasyExtend/Langscape on Wednesday, <a href="http://www.europython.eu/talks/timetable/">late in the afternoon</a>, before we leave for the conference dinner. I hope many of you will be at Europython as well and that I&#8217;ll find a few of my casual readers in the audience. If the talk is as good as the fun I had preparing my slides, you&#8217;ll enjoy it as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://fiber-space.de/wordpress/2010/07/16/langscape/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

