## Archive for April, 2011

### Spirits – for M.

Posted in Vision on April 26th, 2011 by kay

Ledoux - Das Haus des Gärtners

Erdfunkstelle Raisting

### Maptrackers

Posted in Algorithms on April 12th, 2011 by kay

### From graphs to maps

A Maptracker is a special backtracking algorithm used to check the equivalence of certain maps which can be represented as connected, directed graphs or finite state machines. This article describes it.

The original motivation was to find an algorithm for reconstructing grammars from finite state machines with the following property: suppose you have a state machine M0 and a function P which turns M0 into a grammar rule: G = P(M0). When we translate G back into a state machine we get M1 = T(P(M0)). In general T o P != Id and M0 != M1. But how different are M0 and M1 actually?

Consider the two graphs GR1 and GR2 above. When we abstract from their particular drawings and focus only on the nodes and edges, we can describe them using the following dictionaries:

```python
GR1 = {0: set([1, 3, 4, -1, 7]), 1: set([2]), 2: set([1, 3, 4, -1, 7]),
       3: set([-1]), 4: set([6, 4, 5, -1, 7]), 5: set([5, 6]),
       6: set([4, -1, 7]), 7: set([-1])}

GR2 = {0: set([1, 3, 6, -1, 7]), 1: set([2]), 2: set([1, 3, 6, -1, 7]),
       3: set([6, 3, 4, 5, -1]), 4: set([4, 5]), 5: set([3, 6, -1]),
       6: set([-1]), 7: set([-1])}
```

A pair i: [j1, j2, …, jn] describes the set of edges i -> j1, i -> j2, …, i -> jn.

### Checking for equivalence

We say that GR1 and GR2 are equivalent if there is a permutation P of {-1, 0, 1, …, 7} such that

`GR2 == dict( (P(key), set(map(P, value))) for (key, value) in GR1.items() )`

Maptracker is merely a cute name for an algorithm which constructs the permutation P from map representations of the kind GR1 and GR2. P itself will be described as a dictionary. Since the value -1 is a fixed point, it will be omitted:

```python
class Maptracker(object):
    def __init__(self, gr1, gr2):
        self.gr1 = gr1
        self.gr2 = gr2

    def accept(self, value, stack):
        e1, e2 = value  # e1 -> e2
        V1 = self.gr1[e1]
        V2 = self.gr2[e2]
        #
        # e1 -> e2  =>  v1 -> v2
        #
        # check consistency of the choice of the mapping
        if len(V1) != len(V2):
            return False
        m = dict(p for (p, q) in stack)
        if e2 in m.values():
            return False
        for v1 in V1:
            if v1 == e1:
                if e2 not in V2:
                    return False
            if v1 in m:
                if m[v1] not in V2:
                    return False
        for s in m:
            if e1 in self.gr1[s]:
                if e2 not in self.gr2[m[s]]:
                    return False
        return True

    def run(self):
        stack = []
        if len(self.gr1) != len(self.gr2):
            return {}
        sig1 = sorted(len(v) for v in self.gr1.values())
        sig2 = sorted(len(v) for v in self.gr2.values())
        if sig1 != sig2:
            return {}
        L1 = self.gr1.keys()
        L2 = self.gr2.keys()
        i = j = 0
        while i < len(L1):
            e1 = L1[i]
            while j < len(L2):
                e2 = L2[j]
                if self.accept((e1, e2), stack):
                    stack.append(((e1, e2), (i, j)))
                    j = 0
                    break
                j += 1
            else:
                if stack:
                    _, (i, j) = stack.pop()
                    if j == -1:
                        return {}
                    j += 1
                    continue
                else:
                    return {}
            i += 1
        return dict(elem[0] for elem in stack)
```

If no permutation could be constructed an empty dictionary {} is returned.

Let’s look at the dict computed by the Maptracker for GR1 and GR2:

```python
>>> M = Maptracker(GR1, GR2).run()
>>> M
{0: 0, 1: 1, 2: 2, 3: 7, 4: 3, 5: 4, 6: 5, 7: 6}
```

We can check the correctness of M manually or by setting

`P = lambda k: (-1 if k == -1 else M[k] )`

and check the equality we have defined above.
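Putting the pieces together, the check can be run directly. The following self-contained snippet uses the dictionaries GR1, GR2 and the mapping M from above:

```python
GR1 = {0: {1, 3, 4, -1, 7}, 1: {2}, 2: {1, 3, 4, -1, 7}, 3: {-1},
       4: {6, 4, 5, -1, 7}, 5: {5, 6}, 6: {4, -1, 7}, 7: {-1}}
GR2 = {0: {1, 3, 6, -1, 7}, 1: {2}, 2: {1, 3, 6, -1, 7}, 3: {6, 3, 4, 5, -1},
       4: {4, 5}, 5: {3, 6, -1}, 6: {-1}, 7: {-1}}

# the permutation computed by the Maptracker; -1 is a fixed point
M = {0: 0, 1: 1, 2: 2, 3: 7, 4: 3, 5: 4, 6: 5, 7: 6}
P = lambda k: (-1 if k == -1 else M[k])

# apply P to the keys and edge sets of GR1 and compare with GR2
image = dict((P(key), set(map(P, value))) for (key, value) in GR1.items())
assert image == GR2  # GR1 and GR2 are equivalent
```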

### Patching tracebacks

Posted in Algorithms, Langscape, Python on April 11th, 2011 by kay

One of the problems I ran into early when working on EasyExtend (and later on Langscape) was getting error messages from code execution which were not corrupt.

The situation is easily explained: you have a program P written in some language L. There is a source-to-source translation of P into another program Q of a target language, preferably Python. So you write P but the code which is executed is Q. When Q fails Python produces a traceback. A traceback is a stack of execution frames – a snapshot of the current computation – and all we need to know here is that each frame holds data about the file, the function, and the line which is executed. This is all you need to generate a stacktrace message such as:

```
Traceback (most recent call last):
  File "tests\pythonexpr.py", line 12, in <module>
    bar()
  File "tests\pythonexpr.py", line 10, in bar
    foo()
  File "tests\pythonexpr.py", line 5, in foo
    b.p,
NameError: global name 'b' is not defined
```

The problem here is that line number information in the frames is from Q whereas the file and the lines being displayed are from P – the only file there is!
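To make the frame data concrete, here is a minimal sketch of how the (file, line, function) triples can be read off a traceback object in plain Python:

```python
import sys

def frame_data(tb):
    # walk the traceback chain and collect (filename, lineno, function)
    frames = []
    while tb is not None:
        code = tb.tb_frame.f_code
        frames.append((code.co_filename, tb.tb_lineno, code.co_name))
        tb = tb.tb_next
    return frames

def foo():
    raise NameError("global name 'b' is not defined")

try:
    foo()
except NameError:
    frames = frame_data(sys.exc_info()[2])
# frames ends with the frame of foo, where the error occurred
```

A traceback patcher has to rewrite exactly the lineno entries of these frames, or at least of the message rendered from them.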

### Hacking line numbers into parse trees

My first attempt to fix the problem of wrong line information (I worked with Python 2.4 at that time and am unaware of changes in later versions of Python) was to manipulate Q, or rather the parse tree corresponding to Q, which I updated with the line numbers I expected. When the growth of line numbers in Q was non-monotonic, assigning line numbers via CPython's internal line number table, lnotab, failed. Furthermore, the CPython compiler has the habit of ignoring some line information and reconstructing it, so you cannot be sure that your own won't be overwritten. There seems to be hacking prevention built into the compiler, and I gave up on that problem for a couple of years.

### From token streams to string pattern

Recently I started to try out another idea. For code which is not optimized or obfuscated and preserves name, number and string information in a quantitative way (turning some statements of P into expressions in Q or vice versa, breaking a P token into token sequences in Q, etc.) we can check out the following construction. Let

`T(V, Q) = {T in TS_Q | V = T.Value }`

be the set of tokens in the token stream TS_Q of Q with the prescribed token value V. Analogously we can build a set T(V, P) for P and TS_P.
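Assuming a token is an object with Type, Value and Line attributes, as the notation here suggests, T(V, ·) is just a filter over the token stream. A sketch with a hypothetical minimal Token type:

```python
from collections import namedtuple

# hypothetical minimal token: (Type, Value, Line)
Token = namedtuple("Token", "Type Value Line")

def token_set(V, TS):
    # T(V, TS): all tokens in the stream TS with token value V
    return [T for T in TS if T.Value == V]

TS_Q = [Token("NAME", "foo", 1), Token("NAME", "b", 3),
        Token("NAME", "b", 7), Token("NUMBER", "0", 3)]

TVQ = token_set("b", TS_Q)  # both occurrences of the name 'b'
```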

The basic idea is now to construct a mapping between T(V,Q)  and  T(V,P). In order to get the value V we examine the byte code of a traceback frame up to the last executed instruction f_lasti. We assume that executing f_lasti leads to the error. Now the instruction may not be coupled to a particular name, so we examine f_lasti or the last instruction preceding f_lasti for which the instruction type is in the set

`{LOAD_CONST, LOAD_FAST, LOAD_NAME, LOAD_ATTR, LOAD_GLOBAL, IMPORT_NAME }`
From the value related to one of those instructions, the type of the value, which is one of {NAME, STRING, NUMBER}, and the execution line f_lineno, we create a new token T_q = [tokentype, value, lineno]. For V we set V = value. Actually, things are a little more complicated because the dedicated token T_q and the line in Q are not necessarily in a 1-1 relationship: there might in fact be n > 1 tokens equal to T_q which originate from different lines in P. So let Token be the list of all tokens we found on line T_q.Line and k = Token.count(T_q). We add k to the data characterizing T_q.
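In current Python versions this inspection can be sketched with the dis module. This is an illustration in today's API, not the Python 2.4-era code the text describes:

```python
import dis
import sys

LOAD_OPS = {"LOAD_CONST", "LOAD_FAST", "LOAD_NAME",
            "LOAD_ATTR", "LOAD_GLOBAL", "IMPORT_NAME"}

def last_load_instruction(code, f_lasti):
    # the last LOAD-type instruction at or before the failing offset f_lasti
    found = None
    for ins in dis.get_instructions(code):
        if ins.offset > f_lasti:
            break
        if ins.opname in LOAD_OPS:
            found = ins
    return found

def crash():
    return some_global  # raises NameError at runtime

try:
    crash()
except NameError:
    tb = sys.exc_info()[2].tb_next  # frame of crash()
    frame = tb.tb_frame
    ins = last_load_instruction(frame.f_code, frame.f_lasti)
    # ins.argval is the value V from which the token sets are built
```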

So assume we have found T_q. The map we want to build is T(T_q.Value, Q) -> T(T_q.Value, P). How can we do that?

In the first step we assign a character to each token in the stream TS_Q and turn TS_Q into a string S_TS_Q. The character is arbitrary, and the relationship between the character and the token value shall be 1-1. Among the mappings is T_q -> c. For each T in T(T_q.Value, Q) we then determine a substring of S_TS_Q with c as its midpoint:

```python
class Pattern:
    def __init__(self, index, token, pattern):
        self.index = index
        self.token = token
        self.pattern = pattern

S_TS_Q = u''
m = 0x30
k = 5
tcmap = {}

# create a string from the token stream
for T in TS_Q:
    if T.Value in tcmap:
        S_TS_Q += tcmap[T.Value]
    else:
        s = unichr(m)
        tcmap[T.Value] = s
        S_TS_Q += s
        m += 1

# create string pattern
pattern = []
for T in TVQ:
    n = TS_Q.index(T)
    S = S_TS_Q[max(0, n-k): n+k]
    pattern.append(Pattern(n, T, S))
```

The same construction is used for the creation of target patterns from T(V, P). In that case we reuse the tcmap dictionary built during the creation of patterns from T(V, Q): when two tokens in TS_Q and TS_P have the same token value, the corresponding characters shall coincide.

### The token mapping matrix

In the next step we create a distance matrix between source and target string pattern.

```python
n = len(source_pattern)
m = len(target_pattern)
Rows = range(n)
Cols = range(m)
M = []
for i in range(n):
    M.append([-1]*m)
for i, SP in enumerate(source_pattern):
    for j, TP in enumerate(target_pattern):
        M[i][j] = levenshtein(SP.pattern, TP.pattern)
```

As a distance metric we use the edit or Levenshtein distance.
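The `levenshtein` function used above is not shown in the article; a standard dynamic-programming implementation looks like this:

```python
def levenshtein(a, b):
    # classic DP edit distance: prev holds the previous matrix row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j-1] + 1,         # insertion
                           prev[j-1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```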

Having that matrix, we compute an index pair (I, J) with M[I][J] = min{ M[i][j] | i in Rows and j in Cols }. Our interpretation of (I, J) is that we map source_pattern[I].token onto target_pattern[J].token. Since there is an I for which source_pattern[I].token == T_q, the corresponding T_p = target_pattern[J].token is exactly the token in P we searched for.

The line in the current traceback is the line T_q.Line of Q. Now we have found T_p.Line of P, which is the corrected line that shall be displayed in the patched traceback. Let's take a brief look at the index selection algorithm for which M was prepared:

```python
while True:
    k, I = 1000, -1
    if n > m and len(Cols) == 1:
        return target_pattern[Cols[0]].token[2]
    else:
        for r in Rows:
            d = min(M[r])
            if d < k:
                k = d
                I = r
    J = M[I].index(k)
    for row in M:
        row[J] = 100
    SP = source_pattern[I]
    if SP.token == T_q:
        tok = target_pattern[J].token
        return tok.Line
    else:
        Rows.remove(I)
        Cols.remove(J)
```

If there is only one column left, i.e. one token in T(V, P), its line will be chosen. Once the J-column has been selected, we avoid re-selection by setting row[J] = 100 on each row. In fact it would suffice to consider only the rows left in Rows.

### Example

One example I modified over and over again for testing purposes was the following:

```
pythonexpr.py [P]
-----------------
def foo():
    a = ("c",
         0,
         (lambda x: 0+(lambda y: y+0)(2))(1),
         b.p,
         0,
         1/0,
         b.p)

def bar():
    foo()

bar()
```

You can comment out b.p, turn a + in the lambda expression into / to provoke a ZeroDivisionError instead, etc. This is so interesting because when P is parsed and transformed through Langscape and then unparsed I get

```
[Q]
---
import langscape; __langlet__ = langscape.load_langlet("python")
def foo():
    a = ("c", 0, (lambda x: 0+(lambda y: y+0)(2))(1), b.p, 0, 1/0, b.p)
def bar():
    foo()
bar()
```

So it is this code which is executed using the command line

`python run_python.py pythonexpr.py`
which runs pythonexpr.py in the Python langlet through run_python.py. So the execution process sees pythonexpr.py, but the code which is compiled by Python will be Q.

See the mess that happens when the traceback is not patched:

```
Traceback (most recent call last):
  File "run_python.py", line 9, in <module>
    langlet_obj.run_module(module)
  ...
  File "langscape\langlets\python\tests\pythonexpr.py", line 6, in <module>
    0,
  File "langscape\langlets\python\tests\pythonexpr.py", line 5, in bar
    b.p,
  File "langscape\langlets\python\tests\pythonexpr.py", line 3, in foo
    0,
NameError: global name 'b' is not defined
```

There is even a strange coincidence: the bar frame is at line 5 in the transformed program and b.p is on line 5 in the original program, but all the other line information is complete garbage. When we plug in, via sys.excepthook, the traceback patching mechanism whose major algorithm we've developed above, we get

```
Traceback (most recent call last):
  File "run_python.py", line 9, in <module>
    langlet_obj.run_module(module)
  ...
  File "langscape\langlets\python\tests\pythonexpr.py", line 13, in <module>
    bar()
  File "langscape\langlets\python\tests\pythonexpr.py", line 11, in bar
    foo()
  File "langscape\langlets\python\tests\pythonexpr.py", line 5, in foo
    b.p,
NameError: global name 'b' is not defined
```

which is exactly right!
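For completeness, such a patcher is plugged in via sys.excepthook roughly as follows. This is a skeleton only; `correct_lineno` is a hypothetical stand-in for the pattern-matching line recovery developed above:

```python
import sys
import traceback

def correct_lineno(filename, lineno):
    # hypothetical: map a line number of Q back to a line of P
    # using the token-pattern matching described above
    return lineno

def patching_excepthook(exc_type, exc_value, tb):
    # re-render the traceback with corrected line numbers
    print("Traceback (most recent call last):")
    for fs in traceback.extract_tb(tb):
        print('  File "%s", line %d, in %s'
              % (fs.filename, correct_lineno(fs.filename, fs.lineno), fs.name))
    print("%s: %s" % (exc_type.__name__, exc_value))

sys.excepthook = patching_excepthook
```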

### Conclusion

The algorithm described in this article is merely a heuristic and won't work accurately in all cases. In fact it is impossible to even define conclusively what those cases are, because source-to-source transformations can be arbitrary. It is a bit like a first-order approximation of a code transformation, relying on the idea that the code won't change too much.

An implementation note: I was annoyed by bad tracebacks when testing the current Langscape code base for a first proper 0.1 release. I don't think it is too far away, because I have some time now to work on it. It will still be undertested when released, and the documentation is even more fragmentary. However, at some point everybody must jump, no matter the methodology used.