One of the problems I ran into early when working on EasyExtend (and later on Langscape) was getting error messages from code execution that were not corrupt.
The situation is easily explained: you have a program `P` written in some language `L`. There is a source-to-source translation of `P` into another program `Q` in a target language, preferably Python. So you write `P`, but the code which is executed is `Q`. When `Q` fails, Python produces a traceback. A traceback is a stack of execution frames – a snapshot of the current computation – and all we need to know here is that each frame holds data about the file, the function, and the line being executed. This is all you need to generate a stacktrace message such as:
```
Traceback (most recent call last):
  File "tests\pythonexpr.py", line 12, in <module>
    bar()
  File "tests\pythonexpr.py", line 10, in bar
    foo()
  File "tests\pythonexpr.py", line 5, in foo
    b.p,
NameError: global name 'b' is not defined
```
The problem here is that the line number information in the frames comes from `Q`, whereas the file and the lines being displayed are from `P` – the only file there is!
Hacking line numbers into parse trees
My first attempt to fix the problem of wrong line information (I worked with Python 2.4 at the time and am unaware of changes in later versions of Python) was to manipulate `Q`, or rather the parse tree corresponding to `Q`, by updating it with the line numbers I expected. This failed whenever the growth of line numbers in `Q` was non-monotonic, because CPython's internal line number table, `lnotab`, cannot then assign line numbers correctly. Furthermore, the CPython compiler has the habit of ignoring some line information and reconstructing it, so you cannot be sure that your own won't be overwritten. It seems there is hacking prevention built into the compiler, and I gave up on the problem for a couple of years.
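To see why the non-monotonic case fails, here is a minimal sketch of how CPython 2.x encodes line numbers; the helper name `decode_lnotab` is mine. `co_lnotab` stores pairs of unsigned byte deltas, so decoded line numbers can only grow:

```python
def decode_lnotab(code):
    # co_lnotab (CPython 2.x) holds pairs of unsigned bytes:
    # (bytecode offset increment, line number increment)
    addr_incrs = [ord(c) for c in code.co_lnotab[0::2]]
    line_incrs = [ord(c) for c in code.co_lnotab[1::2]]
    addr, line = 0, code.co_firstlineno
    pairs = [(addr, line)]
    for addr_incr, line_incr in zip(addr_incrs, line_incrs):
        addr += addr_incr
        line += line_incr   # increments are non-negative: lines never decrease
        pairs.append((addr, line))
    return pairs
```

Since both increments are unsigned, a parse tree whose line numbers decrease somewhere along the bytecode cannot be represented faithfully.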
From token streams to string patterns
Recently I started to try out another idea. For code which is not optimized or obfuscated and which preserves name, number, and string information quantitatively (turning some statements of `P` into expressions in `Q` or vice versa, breaking a `P` token into a token sequence in `Q`, etc.) we can try the following construction. Let
T(V, Q) = { T in TS_Q | V = T.Value }
be the set of tokens in the token stream `TS_Q` of `Q` with the prescribed token value `V`. Analogously, we can build a set `T(V, P)` for `P` and `TS_P`.
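In code this is just a filter over a token stream – a sketch, assuming token objects with a `Value` attribute as in the listings below:

```python
def token_set(V, TS):
    # T(V, TS): all tokens in the stream TS whose value equals V
    return [T for T in TS if T.Value == V]

TVQ = token_set(V, TS_Q)   # T(V, Q)
TVP = token_set(V, TS_P)   # T(V, P)
```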
The basic idea is now to construct a mapping between `T(V, Q)` and `T(V, P)`. In order to get the value `V` we examine the bytecode of a traceback frame up to the last executed instruction `f_lasti`. We assume that executing `f_lasti` leads to the error. Now this instruction may not be coupled to a particular name, so we examine `f_lasti`, or the last instruction preceding `f_lasti`, whose instruction type is in the set
{ LOAD_CONST, LOAD_FAST, LOAD_NAME, LOAD_ATTR, LOAD_GLOBAL, IMPORT_NAME }
From the value related to one of those instructions, the type of the value, which is one of {`NAME`, `STRING`, `NUMBER`}, and the execution line `f_lineno`, we create a new token `T_q = [tokentype, value, lineno]`. For `V` we set `V = value`. Actually things are a little more complicated, because the dedicated token `T_q` and the line in `Q` are not necessarily in a 1-1 relationship. So there might in fact be `n > 1` tokens equal to `T_q` which originate from different lines in `P`. So let `Token` be the list of all tokens we found on line `T_q.Line` and `k = Token.count(T_q)`. We add `k` to the data characterizing `T_q`.
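A sketch of this extraction step for CPython 2.x bytecode, where instructions with arguments are three bytes wide; `EXTENDED_ARG` is ignored for brevity, and the helper name `last_load_token` is mine, not Langscape's API:

```python
import opcode

LOAD_OPS = set(opcode.opmap[name] for name in
               ("LOAD_CONST", "LOAD_FAST", "LOAD_NAME",
                "LOAD_ATTR", "LOAD_GLOBAL", "IMPORT_NAME"))

def last_load_token(frame):
    # scan the bytecode up to f_lasti and remember the last
    # LOAD_* / IMPORT_NAME instruction seen on the way
    code = frame.f_code
    co_code = code.co_code
    i, found = 0, None
    while i <= frame.f_lasti:
        op = ord(co_code[i])
        if op >= opcode.HAVE_ARGUMENT:
            arg = ord(co_code[i+1]) + (ord(co_code[i+2]) << 8)
            if op in LOAD_OPS:
                found = (op, arg)
            i += 3
        else:
            i += 1
    if found is None:
        return None
    op, arg = found
    # resolve the instruction argument to a value and a token type
    if op == opcode.opmap["LOAD_CONST"]:
        const = code.co_consts[arg]
        ttype = "STRING" if isinstance(const, basestring) else "NUMBER"
        return [ttype, const, frame.f_lineno]
    elif op == opcode.opmap["LOAD_FAST"]:
        return ["NAME", code.co_varnames[arg], frame.f_lineno]
    else:  # LOAD_NAME, LOAD_ATTR, LOAD_GLOBAL, IMPORT_NAME
        return ["NAME", code.co_names[arg], frame.f_lineno]
```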
So assume we have found `T_q`. The map we want to build is `T(T_q.Value, Q) -> T(T_q.Value, P)`. How can we do that?
In the first step we assign a character to each token in the stream `TS_Q` and turn `TS_Q` into a string `S_TS_Q`. The characters are arbitrary, but the relationship between characters and token values shall be 1-1. Among the mappings is `T_q.Value -> c`. For each `T` in `T(T_q.Value, Q)` we then determine a substring of `S_TS_Q` with `c` as its midpoint:
```python
class Pattern:
    def __init__(self, index, token, pattern):
        self.index   = index
        self.token   = token
        self.pattern = pattern

S_TS_Q = u''
m = 0x30
k = 5
tcmap = {}

# create a string from the token stream
for T in TS_Q:
    if T.Value in tcmap:
        S_TS_Q += tcmap[T.Value]
    else:
        s = unichr(m)
        tcmap[T.Value] = s
        S_TS_Q += s
        m += 1

# create string pattern
pattern = []
for T in TVQ:
    n = TS_Q.index(T)
    S = S_TS_Q[max(0, n-k): n+k]
    pattern.append(Pattern(n, T, S))
```
The same construction is used for the creation of target patterns from `T(V, P)`. In that case we reuse the `tcmap` dictionary built during the creation of patterns from `T(V, Q)`: when two tokens in `TS_Q` and `TS_P` have the same token value, the corresponding characters shall coincide.
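The symmetric step might look like this – a sketch along the lines of the listing above, where `TVP` stands for `T(T_q.Value, P)` and `m`, `k`, `tcmap` continue from there:

```python
# turn TS_P into a string, reusing tcmap: equal token values get equal chars
S_TS_P = u''
for T in TS_P:
    if T.Value in tcmap:
        S_TS_P += tcmap[T.Value]
    else:
        s = unichr(m)
        tcmap[T.Value] = s
        S_TS_P += s
        m += 1

# create target patterns around each occurrence of the value in P
target_pattern = []
for T in TVP:
    n = TS_P.index(T)
    target_pattern.append(Pattern(n, T, S_TS_P[max(0, n-k): n+k]))
```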
The token mapping matrix
In the next step we create a distance matrix between the source and target string patterns.
```python
n = len(source_pattern)
m = len(target_pattern)
Rows = range(n)
Cols = range(m)
M = []
for i in range(n):
    M.append([-1]*m)
for i, SP in enumerate(source_pattern):
    for j, TP in enumerate(target_pattern):
        M[i][j] = levenshtein(SP.pattern, TP.pattern)
```
As a distance metric we use the edit or Levenshtein distance.
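For completeness, a standard two-row dynamic programming implementation of `levenshtein` as called above:

```python
def levenshtein(s1, s2):
    # classic edit distance, keeping only the previous and current row
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current = [i + 1]
        for j, c2 in enumerate(s2):
            insertion    = previous[j + 1] + 1
            deletion     = current[j] + 1
            substitution = previous[j] + (c1 != c2)
            current.append(min(insertion, deletion, substitution))
        previous = current
    return previous[-1]
```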
Having that matrix we compute an index pair `(I, J)` with `M[I][J] = min { M[i][j] | i in Rows and j in Cols }`. Our interpretation of `(I, J)` is that we map `source_pattern[I].token` onto `target_pattern[J].token`. Since there is an `I` for which `source_pattern[I].token == T_q`, the corresponding `T_p = target_pattern[J].token` is exactly the token in `P` we searched for.
The line in the current traceback is the line `T_q.Line` of `Q`. Now we have found `T_p.Line` of `P`, which is the corrected line that shall be displayed in the patched traceback. Let's take a brief look at the index selection algorithm for which `M` was prepared:
```python
while True:
    k, I = 1000, -1
    if n > m and len(Cols) == 1:
        return target_pattern[Cols[0]].token[2]
    else:
        for r in Rows:
            d = min(M[r])
            if d < k:
                k = d
                I = r
        J = M[I].index(k)
        for row in M:
            row[J] = 100
        SP = source_pattern[I]
        if SP.token == T_q:
            tok = target_pattern[J].token
            return tok.Line
        else:
            Rows.remove(I)
            Cols.remove(J)
```
If there is only one column left, i.e. one token in `T(V, P)`, its line will be chosen. Once the `J`-column has been selected we avoid re-selection by setting `row[J] = 100` on each row. In fact it would suffice to consider only the rows left in `Rows`.
Example
One example I modified over and over again for testing purposes was the following one:
```
pythonexpr.py [P]
-----------------
def foo():
    a = ("c",
         0,
         (lambda x: 0+(lambda y: y+0)(2))(1),
         b.p,
         0,
         1/0,
         b.p)

def bar():
    foo()

bar()
```
You can comment out `b.p`, turn a `+` in the lambda expression into a `/` to provoke another ZeroDivision exception, etc. This is so interesting because when parsed and transformed through Langscape and then unparsed I get
```
[Q]
---
import langscape; __langlet__ = langscape.load_langlet("python")
def foo():
    a = ("c", 0, (lambda x: 0+(lambda y: y+0)(2))(1), b.p, 0, 1/0, b.p)
def bar():
    foo()
bar()
```
So it is this code which is executed using the command line
```
python run_python.py pythonexpr.py
```
which runs in the Python langlet through `run_python.py`. So the execution process sees `pythonexpr.py`, but the code which is compiled by Python will be `Q`.
See the mess that happens when the traceback is not patched:
```
Traceback (most recent call last):
  File "run_python.py", line 9, in <module>
    langlet_obj.run_module(module)
  ...
  File "langscape\langlets\python\tests\pythonexpr.py", line 6, in <module>
    0,
  File "langscape\langlets\python\tests\pythonexpr.py", line 5, in bar
    b.p,
  File "langscape\langlets\python\tests\pythonexpr.py", line 3, in foo
    0,
NameError: global name 'b' is not defined
```
There is even a strange coincidence, because `bar()` is executed on line 5 in the transformed program and `b.p` is on line 5 in the original program, but all the other line information is complete garbage. When we plug in, via `sys.excepthook`, the traceback patching mechanism whose major algorithm we've developed above, we get
```
Traceback (most recent call last):
  File "run_python.py", line 9, in <module>
    langlet_obj.run_module(module)
  ...
  File "langscape\langlets\python\tests\pythonexpr.py", line 13, in <module>
    bar()
  File "langscape\langlets\python\tests\pythonexpr.py", line 11, in bar
    foo()
  File "langscape\langlets\python\tests\pythonexpr.py", line 5, in foo
    b.p,
NameError: global name 'b' is not defined
```
which is exactly right!
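For completeness, here is roughly how such a hook can be plugged in. This is a sketch, not Langscape's actual code; `correct_line` stands for the pattern-matching algorithm developed above and is passed in to keep the wiring independent of its implementation:

```python
import sys

def make_excepthook(correct_line):
    # correct_line(frame) -> line number in P, e.g. T_p.Line as computed
    # by the token/pattern matching algorithm developed above
    def patched_excepthook(exc_type, exc_value, tb):
        print >> sys.stderr, "Traceback (most recent call last):"
        while tb is not None:
            frame = tb.tb_frame
            code  = frame.f_code
            print >> sys.stderr, '  File "%s", line %d, in %s' % (
                code.co_filename, correct_line(frame), code.co_name)
            tb = tb.tb_next
        print >> sys.stderr, "%s: %s" % (exc_type.__name__, exc_value)
    return patched_excepthook

# plug it in; this trivial fallback just uses the uncorrected f_lineno
sys.excepthook = make_excepthook(lambda frame: frame.f_lineno)
```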
Conclusion
The algorithm described in this article is merely a heuristic and it won't work accurately in all cases. In fact it is impossible to even define conclusively what those cases are, because source-to-source transformations can be arbitrary. It is a bit like a first-order approximation of a code transformation, relying on the idea that the code won't change too much.
An implementation note: I was annoyed by bad tracebacks when testing the current Langscape code base for a first proper 0.1 release. I don't think the release is too far away, because I have some time now to work on it. It will still be under-tested when it's released, and the documentation is even more fragmentary. However, at some point everybody must jump, no matter the methodology used.