Syntax algebra - first steps

Posted in Algorithms, Grammars on January 30th, 2010 by kay – Be the first to comment

Principle of relativity

I started to revisit syntactic mappings defined in EasyExtend 3 which are closely tied to the Python grammar being  in use. Those are functions like varargs2arglist, normalize, split_file_input or exprlist2testlist defined in the csttools.py module. One of the major challenges of future releases of EasyExtend ( or a definite successor project - i don’t know ) is to abstract from Python as a target language. In EE3 only Python can be targeted by a langlet transformation whereas in EE 4 langlets are symmetric: each langlet can be a target langlet for any other langlet. All langlets exist on the same footing and also: each langlet can be used as a parent langlet of a newly defined langlet which means a proper inheritance relationship and maximum reuse.

The relativity among langlets calls for raising the bar of abstraction. All Python specific dependencies, and there are a lot in EE3, have to be removed from the basic CST libraries. Indeed not even nodes of special importance like expr, stmt or atom shall be allowed in places other than modules which are specific to a particular langlet. The generalization of the mentioned functions leads to a study of abstract syntactic forms and some sort of “syntax algebra”. It isn’t studied rigorously in this article but shall be at least motivated. As a prerequisite I recommend to read my article about CST interpolation which discusses concepts of major relevance.

Embeddings

The exprlist2testlist function turns a node of type exprlist defined as

exprlist: expr (',' expr)* [',']

into a node of type testlistdefined as

testlist: test (',' test)* [',']

This works without adding information because {test, expr} is a valid interpolation i.e. there is a nested sequence

[test, [or_test, [and_test, [not_test, [comparison , expr]]]]]
which is a representation of a valid CST. In terms of CST interpolations test(expr) yields a node of type test and induces a homomorphism exprlist->testlist. More generally an interpolation {A, B} induces an embedding B (x B)* -> {A,B} (y {A,B})* = A (y A)* if x and y are constant terminal symbols i.e. terminal symbols where the corresponding token have a uniquely determined token string.

Blocks or no blocks

Another relevant example is the normalize function. The idea behind normalize is that statements like if x: y or def foo(): pass are semantically equivalent to block statements:

if x:
    y
def foo():
    pass

Those block statements can be used in a more general fashion because we can add other block statements in the thunked block. In Python blocks are expressed by the suite grammar rule:

suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

Since {stmt, simple_stmt} is a valid interpolation, we can substitute all occurences of

suite: simple_stmt

with suites of the form

suite: NEWLINE INDENT stmt+ DEDENT

The general syntactic embedding is of the kind

B -> a b… {A,B} c d… with a, b, c, d … being constant terminal symbols.

Notice that INDENT is assumed to be a constant terminal despite the fact that the token string may vary. INDENT is treated special because in practical applications the number of used spaces for an indentation is fixed.  EasyExtend always uses 4 spaces.

Splitting nodes

The function split_file_input is the prototype of a node splitting function which can be thought in analogy to string splits. In this particular case we have a node file_input of the form

file_input: (NEWLINE | stmt)* ENDMARKER

and want to generate a sequence of nodes file_input: stmt ENDMARKER, file_input: NEWLINE ENDMARKER and file_input: ENDMARKER - one for each NEWLINE, stmt and ENDMARKER in the original file_input node. It doesn’t matter that file_input: NEWLINE ENDMARKER and file_input: ENDMARKER are likely be thrown away by an application because this can be decided by the nodes consuming function. The general split function is defined by

R: x (A|B…)* y -> [(R: x A y), (R: x B y), ..., (R: x y)]

Node rewritings

The mappings considered above were all rather straightforward. Now we want to discuss a rule transformation which is less obvious, namely that of a function signature into an argument tuple of a function call. In Python 2.6 the function signature is defined as

varargslist: ((fpdef ['=' test] ',')*
('*' NAME [',' '**' NAME] | '**' NAME) |
fpdef ['=' test] (',' fpdef ['=' test])* [','])
fpdef: NAME | '(' fplist ')'
fplist: fpdef (',' fpdef)* [',']

and an argument list of a function call by

arglist: (argument ',')* (argument [',']| '*' test [',' '**' test] | '**' test)
argument: test [gen_for] | test '=' test

Can we transform each varargslist into an arglist?

Let’s start our treatment of varargslist with fpdef. If we insert the RHS of fplist in fpdef we get

fpdef: NAME | '(' fpdef (',' fpdef)* [','] ')'

We show that this rule is a special form of  the node atom and since {test, atom} is a valid interpolation it is also a test node. The atom node is defined by

atom: NAME | '(' [yield_expr|testlist_gexp] ')' |  '[' [listmaker] ']' | ...

which can be specialized to

atom: NAME | '(' testlist_gexp ')'

Next we consider the testlist_gexp definition

testlist_gexp: test ( gen_for | (',' test)* [','] )

which can be specialized to

testlist_gexp: test (',' test)* [',']

We insert testlist_gexp in atom which yields

atom: NAME | '(' test (',' test)* [','] ')'

If we reduce test to atom we get a rule

atom: NAME | '(' atom (',' atom)* [','] ')'

which is isomorphic to fpdef. So we just need to substitute all occurrences of fpdef in fpdef with atom, then replace atom with test(atom) and finally replace the whole of atom again with test(atom). This procedure substitutes fpdef with test.

When we substitute each occurrence of NAME with testin varargslist we get:

(test ['=' test] ',')* ('*' test [',' '**' test] | '**' test) |
                       test ['=' test] (',' test ['=' test])* [',']

which can be reduced to

(argument ',')* ('*' test [',' '**' test] | '**' test) |
                 argument (',' argument)* [',']

which is the same as

(argument ',')* (argument [','] | '*' test [',' '**' test] | '**' test)

Voilà!

Syntax algebra

We have done some informal steps into syntax algebra with some real functions defined in EE 3 as a starting point. For the first three functions we have found general syntactical transformations which might be universally applicable. The last transformation is very specific though and it might be more interesting to determine an algorithm used to find a rule transformation of a particular kind. Although the search algorithm might be NP complete I assume that the found transformation - if one exists - has linear time complexity which is what we want. Such an algorithm would be another great achievement of EasyExtend which does not cease to surprise me.

About CST interpolation

Posted in Algorithms, Parsing, TBP, Trail on December 7th, 2009 by kay – 3 Comments

Eli Bendersky has written a short overview article about Pythons _ast module which is supposed to make working with parse trees simpler by transforming them into other trees i.e. abstract syntax trees.

In this article I want to talk a bit about my reservations against this approach which is mostly justified by common wisdom, “convenience” and what not. IMO AST’s are a stumbling stone in the advancement of language front ends and you can do many things elegantly and more generic without them. They are there for a reason though and it is not easy to see immediately why you better get away with plain old concrete syntax trees ( CSTs ).

There are two major reasons I love CSTs.

  1. Parse trees are unambiguously represented by a grammar. So once you know the grammar you also no how to find and arrange nodes. For context free languages the grammar contains all syntactical information you’ll ever need.
  2. The reversion to source code is trivial. Just traverse the parse tree inorder, visit the leaf nodes containing terminals and concatenate their string content. Only whitespace has to be inserted.
  3. For parsing purposes grammars are translated into finite state machines. Those machines can also be used for non-parsing purposes like parse tree interpolation which provides most of the benefits of ASTs but won’t cause any additional translation overhead.

I assume 1. and 2. won’t tell anything new to the reader and it might be 3. which might contain novel information. The major argument can be summarized using the following diagram

A grammar gives rise to a number of finite state machines - in fact one machine for one grammar rule. Not only are those machines used to parse source code into CSTs but they can also be used to operate on them. In particular they are used to

  • check correctness of CSTs under transformations
  • connect individual CST nodes through sequences of other CST nodes ( interpolation ).
  • insert CST nodes into sequences of CSTs nodes which make up the content of a particular CST node (autocompletion)

Any of those tools/services only depend on the given grammar and are universally applicable to all languages. It is not much unlike regexps and regexp engines which uniformly apply matching and searching to all strings.

Only in the combination of verification, interpolation and autocompletion CSTs actually become handy for manipulation tasks. They also serve as a foundation for tools which are more close to the actual source code and define transformations in code without any additional syntax. That’s also why EasyExtend and successors will never see a particular macro language. Just like ASTs, macros are an obsolete technology in the light of proper language oriented transformation tools.

Parse tree representations

Take a look at the following AST constructor kept as an example from Eli’s article

Expression(
  body=BinOp(
         left=Str(s='xy'),
         op=Mult(),
         right=Num(n=3)))

The Expression constructor takes a node of type BinOp and produces a node of type Expression. It is used to represent that actual Python expression “xy”*3.

Now take take a look at the following kludge which represents the same information in the form of a concrete syntax tree:

>>> import parser
>>> parser.expr('"xy"*3').tolist()
[258, [326, [303, [304, [305, [306, [307, [309, [310,
[311, [312, [313, [314, [315, [316, [317, [3, '"xy"']]]],
[16, '*'],
[315, [316, [317, [2, '3']]]]]]]]]]]]]]]], [4, ''], [0, '']]

The concrete parse tree is represented in the form of a nested list and yields all sorts of numerical tags which identify grammar rules being applied in top down parsing. The numerical tags shall be called node identifiers or short node ids.

The formatting can be done a little nicer by translating the node ids into node names and displaying the tree in tree form:

eval_input  -- NT`258
  testlist  -- NT`326
    test  -- NT`303
      or_test  -- NT`304
        and_test  -- NT`305
          not_test  -- NT`306
            comparison  -- NT`307
              expr  -- NT`309
                xor_expr  -- NT`310
                  and_expr  -- NT`311
                    shift_expr  -- NT`312
                      arith_expr  -- NT`313
                        term  -- NT`314
                          factor  -- NT`315
                            power  -- NT`316
                              atom  -- NT`317
                                STRING  -- T`3     L`1
                                  "xy"
                          STAR  -- T`16     L`1
                            *
                          factor  -- NT`315
                            power  -- NT`316
                              atom  -- NT`317
                                NUMBER  -- T`2     L`1
                                  3
  ENDMARKER  -- T`0     L`2
    ''

It doesn’t change much in principle though. The AST is an order of magnitude more concise, more readable and better writable.

Searching nodes

Searching within a CST isn’t much of a problem and it is actually quite easy when we know the grammar. All that is needed are two functions find_first and find_all which keep a node and a node id as arguments. So when we seek for a particular node e.g. term in the syntax tree we just call find_first( node, symbol.term) where symbol.term is the node id of term encoded in symbol.py which is a standard library module. So for

nd = parser.expr(’”xy”*3′).tolist() we can apply find_first(nd, symbol.term)which returns

term  -- NT`314
  factor  -- NT`315
    power  -- NT`316
      atom  -- NT`317
        STRING  -- T`3     L`1
          "xy"

Traces

We want to name CST constructors just of the nodes they create. So expr creates a node of type symbol.expr, STRING a node of token.STRING and so on. In order to create a correct expr we have to call lots of node constructors. In source code this would be something like

expr(xor_expr(…(term(factor(…(STRING(”xy”)…), STAR(”*”), factor(…(NUMBER(”3″)…))…))

This doesn’t look much like noise reduction, but now consider this: when expr is created by the parser the parser starts with nothing but a sequence A = (STRING(”xy”), STAR(”*”), NUMBER(”3″)). So why isn’t it possible to start with A and expr and build expr(*A) ? We want to face a slightly more general problem namely having a sequence of nodes A = (a1, a2, …, ak) which are not necessarily token and a node constructor expr. Can we build a node expr(a1, …,ak) of type symbol.expr?

What is needed to identify an admissible sequence A with this property?

First of all let’s take a look at the grammar rule description of expr

expr: xor_expr ('|' xor_expr)*

Any sequence of CST nodes which fits into this description shall be called a trace. So a sequence of nodes xor_expr VBAR xor_expr VBAR xor_expr is a trace of expr. But also xor_expr alone is a trace. So what is needed is to wrap a given sequence A = (a1, a2, …, ak) into a trace. We might start with the most simple case of a single node A= (a) which shall be wrapped into expr. As an example we consider A = (term).

Interpolation

In order to wrap term into expr we need a sequence of intermediate nodes xor_expr, and_expr,shift_expr, arith_expr` and then build

[expr, [xor_expr, [and_expr, [shift_expr, [arith_expr, term]]]]]

This sequence is uniquely determined by expr and term. In order to build one we must be sure there is no non-trivial information that has to be added like a STRING, NAME or NUMBER token which contains actual information.

So when there is no choice in building a wrapper of type N around M we write {N, M} and call it an interpolation between N and M. Interpolations can always be constructed algorithmically using syntactical information provided by the language grammar alone. If N = M, we identify {N, N} with N.

We have found already a valid interpolation {expr, term}. Other valid interpolations are {factor, STRING(”xy”)} and {factor, NUMBER(”3″)}. For term this already suffices to build a trace:

{factor, STRING(”xy”)}, STAR(”*”), {factor, NUMBER(”3″)}

and with this trace we get

{expr, term({factor, STRING(”xy”)}, STAR(”*”), {factor, NUMBER(”3″)})}

Now we are prepared to define an algorithm:

Let N be a node and A = (a1, ..., ak) a sequence of nodes.
Consider also the set of all nodes set(M1, ..., Mn) with
{N, Mi}, i=1,...,n being a valid interpolation starting with N.
 
For each Mi, i=1,...,n we try to build a trace
TA = ({A1, a1}, {A2, a1}, ..., {Ak, ak}).
 
If we have a found a trace for M is we get the result
{N, M({A1, a1}, {A2, a1}, ..., {Ak, ak})}

Autocompletion

Sometimes our algorithm might fail to find a trace for a node N and a sequence A but the error can still be corrected in a fully determinate fashion. Take the following rule for example:

dotted_name: NAME (’.’ NAME)*

together with a sequence A = (NAME, NAME) of two nodes. Obviously we cannot build a valid trace NAME DOT NAME from A directly but the insertion of DOT into the trace is fully determined by the structure of the rule. Moreover there is no degree of freedom in the selection of the token string for DOT. It can always only be “.”. So it is possible to omit the DOT in A and still get a uniquely determined trace for dotted_name.

Applications

We’ve come to an end already. With the prerequisites given above it is perfectly possible to write

expr(STRING(”xy”), STAR(”*”), NUMBER(”2″))

and get a valid parse or even shorten it and write

expr(’”xy”‘, “*”, 2)

which suffices to identify the token. Remind that this construction yields a parse tree which can be converted immediately to source code.

fn = lambda name, val: expr_stmt(name, “=”, val)

This lambda expression yields bindings of val to name. For example

fn(”a”, expr(’”xy”‘, “*”, 2)) is the parse tree equivalent of a = “xy”*2.

Notice that in any case the parse tree is syntactically correct by construction.

Wrapper Functions

Sometimes it is not easy to see how some expression with a particular semantics can be built. Take a function call for example. The Python grammar doesn’t provide a special node for it but just uses a special form of power which is defined as

power: atom trailer* ['**' factor]

This is very noisy and one better builds a functional wrapper which can be used for all sorts of calls:

def CST_Call(names, args = None, star_args = None, dstar_args = None):
    Names = [atom(names[0])]
    for name in names[1:]:
        Names.append(trailer('.', name))
    ARGS = list(args) if args else []
    if star_args:
        ARGS+=['*', star_args]
    if dstar_args:
        ARGS+=['**', dstar_args]
    if ARGS:
        return [symbol.power] + Names + [trailer('(', arglist(*ARGS), ')')]
    else:
        return [symbol.power] + Names + [trailer('(', ')')]

“Resolver Games” are alive!

Posted in Python on October 24th, 2009 by kay – 3 Comments

Resolver One competition

I admit I participated in the Resolver One competition in the first round in January as one of the losers. When Resolver Systems announced their challenge I got the impression they encouraged using their spreadsheet almost like a medium for expressing ideas and thinking out of the box. However, Resolver Systems is yet another company which exists solely to sell stuff and consulting, not a patron of modern art or hacking experiments. So the awarded spreadsheets are looking a bit conventional and are technically unsophisticated. Their sophistication lies in external factors like the scientific ideas which are exercised. Some make extensive use of graphic capabilities of .NET which is also outside of the Resolver One API. It’s good demo stuff nevertheless and this might be their main purpose in the end.

Resolver Games

I’m glad to see Resolver Games being online now which was my own contribution. Resolver Games is about simple learning games for word lists. The examples I used were from “Teuton” ( Teuton is a German Python dialect, inspired by an Artima posting of Andy Dent, which replaces Pythons English keywords and builtins by German translations of them - Teuton is one of my fun projects and a langlet demo for EasyExtend ), the other one is IPA - the international phonetic alphabet - which is just a great learning target.

A Resolver Game consists of a pair of Resolver One spreadsheets. One for the word-list/game data and the other one for the game board. The game data spreadsheet is conventional and stateless. The game board is designed to be generic and independent of the particular game. The game board was tricky to program because it uses the spreadsheet for event handling and user interactions. Writing an event handler is a bit like scanning a screen and notify changes by comparing the actual with the previous image point by point. Resolver One stores data in the background and those data affect re-computations. Sometimes I wanted to cause a user event doing re-computations without changing the displayed data. I used the following trick: add and remove a blank to the cell-data and swap between the two representations periodically.

"+" -> "+ " -> "+" -> "+ " -> ...

When the cell content is “+” change it to “+ ” and vice versa. This goes unnoticed because there is no visual effect associated with the blank. Once I got into whitespace oriented programming the hard problems with state changes in Resolver Games became solvable.

One could argue that Resolver One is simply not the right-tool-for-the-job and it is overstretched in this application. I don’t disagree but this line of argument always appeared philistine to me and I reserve it to those who simply don’t know it better. A more serious objection to Resolver Games might be the fun aspect. Is it really fun to play? Resolver Games are surely a bit poor in game dramaturgy and visual effects. So I’d rather say NO, but I’m not a gamer anyway.

The future of EasyExtend

Posted in General on October 3rd, 2009 by kay – 4 Comments

The state of EasyExtend

Maybe an autumnal look on EasyExtend is justified. EE was and is an alien project which never resonated well with the Python community and its possible users. Actually up to now I don’t no anyone who has ever used it ( besides myself, of course ) and I wouldn’t wonder if this isn’t going to change in the future. For a Python programmer there are numerous alternatives now like python4ply, MetaPython and also 2to3 - why not? - which can be used to extend Python. None of them were available when I started with EE in 2006. Some people might also attempt to revive Logix which is among the more famous “dead” projects in the Python community. Logix might be in style and ambition precisely what Python users are looking for. EasyExtend isn’t even tangential.

Whenever I thought EE becomes stable I challenged it with bigger, more difficult problems: simultaneous transformations of multiple langlets, context sensitive languages, quirky real world grammars, online syntax definitions, source directed transformations, more expressible grammar syntax, language agnosticism etc.

Another major issue is performance. In the past I’ve used Psyco and also Cython. They boosted performance quite well and I got 3-5 times speedup for lexer+parser but I have clearly no performance model and I don’t see why those speedups shall be the limit? Python isn’t the right tool for the right job here and I suspect this had been an impediment for the current implementation already, since I overused containers like tuples and dicts in favor for classes and objects and their slow attribute access.

From EasyExtend to Langscape

The most likely path into the future of EasyExtend is to factor out components like the parser generator, the langlet transformer and most of the csttools and rewrite them in C++. I’ll probably start a completely new project which I intend to call “Langscape”. By means of SWIG it shall be possible to use Langscape also from environments like the JVM or the CLR. As a Python front end I’ll use the code I’ve developed for EasyExtend 4 which will probably never go public in the current form. I’ll still consider doing the functional testing in Python and I also want to preserve interactivity. Both the language front-end as well as back-end bindings become separated from Langscape. Langscape only deals with source code, grammars and CSTs.

Bickering about unit testing

Posted in Testing on September 25th, 2009 by kay – 1 Comment

Doubts on the effectiveness of unit testing

Unit testing has entered the programming mainstream with XUnit packages and derivations of them. They are available for all mainstream programming languages. It is not normal today shipping an OSS project without any tests. Programmers can read test cases like behavioral specifications of APIs and they often learn a lot about a system from this sort of code reading ( at least I do ).

Still unit testing is disputed as a reasonable practice by many respected programmers and I wonder if guys like Joel Spolsky or James Coplien aren’t basically right? Isn’t it true that UTs have to be permanently adapted as our code base changes and doesn’t this imply a significant maintenance overhead even and foremost in early phases? Coplien suggests design-by-contract as a more lightweight and DRY alternative to writing UTs: place pre and post-conditions directly into the code and check the available units i.e. the interface specifications. Isn’t this far more agile and won’t better coding practices make UTs go away just like many of the once celebrated design pattern go away when using powerful language level concepts like multimethods and higher order functions?

Black box testing

When you work as a tester in the industry you essentially specify and implement test-suites according to specifications. Your product is not the system under test ( SUT ). You are not interested in the inner working of a system and its components. The SUT is a black-box and the SUT code might change arbitrarily. If any code is exposed it is SUT API code being accessible by clients application like your test app. The API might even be fully away though and instead you’ll test in- and outgoing commands sent for and back between your test app and the SUT according to a specified command protocol. All of those tests are functional- or system level tests and the tested units remain hidden. As a tester you don’t care about the way the system is built but only how it behaves.

Can we use our standard UT frameworks to implement black box tests? Well, isn’t this actually their most frequent use?

Are there any UTs around?

What if the most common unit tests we are finding in the wild are functional or system blackbox tests applied to API level functions/classes, implemented in one of the available unit testing frameworks? Some of the system components are abstracted away and get replaced by mock objects representing networks or C/S databases but this just avoids system integration tests. A close reading of unit testing might indeed lead to Jim Copliens conclusion that they are better implemented as pre and post-conditions but you won’t test a system on such a fine grained level. Using UT frameworks for functional tests has short comings but it doesn’t mean they are not used for them. When the interface is kept small the likelihood that it gets badly broken when you evolve your system is manageable. This is the prime reason why programmers do not suffer from writing UTs and maintenance costs are kept under control. Every software tester in the industry knows that writing tests takes much effort and is very costly but changes in public APIs isn’t a major reason.

UTs and beyond

The missing link between between current UT systems and a test-system for all kinds of SUTs is a dataflow connection which triggers tests in a particular order. By this I mean that each test can produce data as a side-effect which can be required within another setup of a test-case. In Junit4 we have @before and @after annotations for running setups and tear-downs unconditionally. When adding two more annotations @require and @provide it becomes possible to specify conditions on running tests by means of the need of data. A test-runner has to match the @required data against the @provided ones and determines a schedule.

In case of Java this can be checked at compile time using an annotation processor. In .NET one might apply those checks once the assemblies are loaded during initialization of the test-runner. The only disadvantage of load-time checks is that all available test-modules have to be loaded initially and not on demand.

Choosers and ChooserMixins in C++ and Python

Posted in CPP, Chooser, Python, Testing on September 12th, 2009 by kay – 2 Comments

Chooser Objects

From time to time I’m amazed to find a simple algorithm which seemed to be a low hanging fruit which was just overlooked. In this particular case it is about generating and utilizing test data in a both simple and flexible manner. Mark Seaborn described the method in his outstanding blog article How to do model checking of Python code. He distilled what we might call the Chooser Algorithm from a scientific paper which buries the message under all sorts of methodological considerations and special case treatments and other bloat. This is sad because good algorithms are the crown jewels of programming. It also helped that he provided an implementation in Python and not in C or some sloppy computing-scientist-only pseudo code notation which changes from author to author.

We can motivate Chooser objects as follows.

Suppose you have a control flow statement defined in a function f. The path the flow control takes is determined by the value of some variable x:

def f(*args):
    ...
    x = g(*args)
    if x>0:
        ...
    else:
        ...

When we want to test the if-statement alone we can ignore the value of x computed by g. A simple method to achieve this is to introduce a for-loop in the code which iterates over a range of values which represent jumps to the individual if-statement branches:

def f(*args):
    ...
    x = g(*args)
    for x in (1,-1):
        if x>0:
            ...
        else:
            ...

However, this is a quite heavy change and we would likely not want to repeat this at another place. Instead of adding a for-loop we can introduce a non-deterministic choice over the values 1 and -1 and pull the iteration, represented by the loop, out of the function:

def test(chooser):
    def f(*args):
        ...
        x = g(*args)
        x = chooser.choose([1,-1])
        if x>0:
            ...
        else:
            ...
    f(1,2,3)  # call f with appropriate arguments

Here we inserted a call to choose which represents a set of choices. No new control flow is introduced. The function f must be called as many times as there are choices passed to choose.

The repeated call of f is managed by a new function check which is part of the Chooser Algorithm. It actually calls the test function which has a uniform interface and keeps a single chooser parameter.

class ModelCheckEscape(Exception): pass
 
def check(func):
    stack = [[]]
    while stack:
        chosen = stack.pop()
        try:
            func(Chooser(chosen, stack))
        except ModelCheckEscape:
            pass

The check function creates a Chooser object and passes it to func which is represents the system under test. The Chooser constructor takes two arguments. One is a list called chosen popped from a stack of such lists, the other one is the stack itself which might be filled with new lists.

class ModelCheckEscape(Exception): pass
 
class Chooser(object):
    def __init__(self, chosen, stack):
        self._chosen = chosen
        self._stack  = stack
        self._it     = iter(chosen)
 
    def choose(self, choices):
        try:
            choice = self._it.next()
            if choice not in choices:
                raise Exception("Program is not deterministic")
            return choice
        except StopIteration:
            self._stack+=[self._chosen + [choice] for choice in choices]
            raise ModelCheckEscape()

This is the definition of the Chooser object. It is a tiny bit of elementary but ingenuous code. In order to understand what it does consider the following test function with its three calls of choose:

def test(chooser):
    x = chooser.choose([True, False])
    if x:
        y = chooser.choose(["a", "b"])
    else:
        z = chooser.choose(["y", "z"])

On each choose call a value is returned from the _it iterator. Those values must conform to the choices passed to choose for every call of choose. Otherwise a ChooserException is raised. So we expect _it to be an iterator wrapped around lists like [True, "a"], [True, "b"], [False, "y"], [False, "z"]. Those lists are associated with the choices being made at (x, y) or (x, z).

In fact we observe some more of those lists, starting with the empty list [] and the incompletely filled lists [True] and [False]. When _it is wrapped around an incomplete list one of the choose calls will raise a StopIteration exception at _it.next(). Assume for example that _it = iter([True]) then _it is already exhausted after x and choose and will raise StopIteration at the definition of y. At this point each of the choices at y i.e. “a” and “b” will produce a new list. Those lists are [True, "a"] and [True, "b"] which are now complete. New lists are pushed on the stack as long as incomplete lists are popped from the stack incheck().

As a special case we consider a simple linear sequence of choose calls

def test(chooser):
    x = chooser.choose([True, False])
    y = chooser.choose(["a", "b"])
    z = chooser.choose(["y", "z"])

The set of complete lists according to this sequence will be the Cartesian product of the choices: {True, False} x {”a”, “b”} x {”y”, “z”}. If you just want Cartesian products there are more efficient alternatives to create them though.

These are the Chooser basics. For Python you can download the used code here.

Choosers in C++

I gave a C++ and STL based implementation of Chooser objects. The Chooser C++ API closely follows the Python example. You can download the code from the linked document.

In its most general form the choose method has following signature:

    template <typename Container>
    typename Container::value_type choose(Container& choices)

The return type is derived from the containers value_type attribute. Other than this the algorithm only relies on iterators which means that any STL container can be used. We can rewrite the simple test function above in C++:

void test(Chooser& chooser) {
    int x = chooser.choose(2);
    if x {
        string s = "ab";
        char y = chooser.choose(s);
    }
    else {
        string s = "yz";
        char z = chooser.choose(s);
    }
}

This is not all that much overhead. In case of the x definition we use an overloaded version of choose which takes a single integer parameter k. This is equivalent to a choice of values within the range {0, 1, …, k-1}. The most relevant case may be choose(2) which is the boolean choice.

The string type is an STL container type as well. More precisely it is a typedef for basic_string<char>. We can create a string object with a string literal but we cannot pass a string literal directly to choose which expects an explicit reference to a container from which the return type is derived ( char in this case ).

ChooserMixin classes

Suppose we want to introduce Chooser objects into arbitrary methods of an existing class. The Chooser Algorithm is implemented s.t. a Chooser object is explicitly passed as a parameter but this would require changes in a methods interface, something we try to avoid.

Visibility of Chooser instances in the local scope of a method can also be achieved by making them global or member variables. An inexpensive method which is safer than using globals is to use a mixin class. The mixin class defines aChooser instance and if some class wants to use it, it derives from the mixin.

class ChooserMixin {
protected:
    Chooser chooser;
public:
    void test() = 0;
 
    void check()
    {
        ...
        this->chooser = Chooser(chosen, queue);
        test();
        ...
    }
}

The test method is abstract. If f is the method we want to check, then the implementation of test would just invoke f with appropriate parameters:

void test() {
    f(arg1, arg2, ...);
}
It’s easy to change test without touching any other source code.

More advantages of ChooserMixins

When we use ChooserMixin we can define the choices C being used in chooser.choose(C) also as member variables. This makes choices configurable. A subclass of a ChooserMixin might read data from an external file or a database and populate the C container.

I wonder if it’s even possible to get rid of T x = chooser.choose(C)assignments in method source when using data binding techniques? In JavaFX we can restate the assigment in the form

var x = bind chooser.choose(C)

The bound variable x is updated whenever C is changed. Instead of creating a new instance of Chooser on each iteration, we replace the members defined in a single instance and trigger updates of C which in turn causes chooser.choose(C) to produce a new value. It remains to be examined if this idea is somehow practical.

Python - Hibernate - Jynx

Posted in Hibernate, Jynx, Jython on September 4th, 2009 by kay – Be the first to comment

Jynx 0.4 goes Hibernate

In Jynx 0.4 JPA/Hibernate annotations are supported. Although this is still work in progress some of the more complex nested annotations were tested as well as Hibernate extension annotations which cannot be single-name imported along with the corresponding JPA annotations without conflicts.

Jynx 0.4 provides other new features as well. One can now use @signature decorators to express Java method overloading. A simple Java parser is integrated. A Java parser was necessary to improve the Java class detection heuristics used to determine required imports when a Java proxy is created from a Jython class and compiled dynamically. Finally there is a new @bean_property decorator which creates a private attribute foo along with public getters and setters given a bean_property decorated method def foo(_):_. Full documentation of Jynx as well as its changes can be found here.

Using Hibernate from Jython

Starting and closing sessions and managing simple transactions is not difficult in Hibernate. In Jynx two context managers for with-statements are defined which hide open+close and begin+commit/rollback boilerplate from the programmer. Code for Hibernate sessions and transactions lives then in with-statement blocks.

class hn_session(object):
    '''
    Context manager which opens/closes hibernate sessions.
    '''
    def __init__(self, *classes):
        sessionFactory = buildSessionFactory(*classes)
        self.session   = sessionFactory.openSession()
 
    def __enter__(self):
        return self.session
 
    def __exit__(self, *exc_info):
        self.session.close()
 
class hn_transact(object):
    '''
    Context manager which begins and performs commits/rollbacks hibernate transactions.
    '''
    def __init__(self, session):
        self.tx = session.beginTransaction()
 
    def __enter__(self):
        return self.tx
 
    def __exit__(self, type, value, traceback):
        if type is None:
            self.tx.commit()
        else:
            self.tx.rollback()

A simple session using a single Entity Bean may then look like:

from __future__ import with_statement
 
from jynx.lib.hibernate import*
 
@Entity
class Course(Serializable):
    @Id
    @Column(name="COURSE_ID")
    @signature("public int _()")
    def getCourseId(self):
        return self.courseId
 
    @Column(name="COURSE_NAME", nullable = False, length=50)
    @signature("public String _()")
    def getCourseName(self):
        return self.courseName
 
    @signature("public void _(String)")
    def setCourseName(self, value):
        self.courseName = value
 
    @signature("public void _(int)")
    def setCourseId(self, value):
        self.courseId = value
 
with hn_session(Course) as session:
    course  = Course()
    course.setCourseId(121)
    course.setCourseName(str(range(5)))
    with hn_transact(session):
        session.saveOrUpdate(course)

Boilerplate Reduction

The standard class decorator for creating a Java class from a Jython class in Jynx is @JavaClass. In Jynx 0.4 some slightly extended decorators are introduced in particular @Entity and @Embeddable. Not only do they make Jython code more concise because one doesn’t have to stack @Entity and @JavaClass but translating with @Entity turns some automatically generated Java attributes into transient ones i.e. a @Transient annotation is applied which prevents those attributes to be mapped to table columns.

The massive boilerplate needed for defining a valid Entity Bean in the preceding example can be reduced using the @bean_property decorator:

@Entity
class Course(Serializable):
    @Id
    @Column(name="COURSE_ID")
    @bean_property(int)
    def courseId(self): pass
 
    @Column(name="COURSE_NAME", nullable = False, length=50)
    @bean_property(String)
    def courseName(self): pass

Applied to def courseId(self): pass the @bean_property decorator will cause the following Java code translation

    @Id @Column(name="COURSE_ID") private int courseId;
    int getCourseId() { return courseId; }
    int setCourseId(int value) { courseId = value; }

which specifies a complete Java Bean property.

Example

In the following example two Entities are associated using a one-to-one mapping between primary keys.

@Entity
class Heart(Serializable):
    @Id
    @bean_property(int)
    def id(self):pass
 
@Entity
class Body(Serializable):
    @Id
    @bean_property(int)
    def id(self):pass
 
    @OneToOne(cascade = CascadeType.ALL)
    @PrimaryKeyJoinColumn
    @bean_property(Heart)
    def heart(self):pass

Now we can check the behavior:

# session 1
with hn_session(Heart, Body) as session:
    body = Body()
    heart = Heart()
    body.heart = heart
    body.id = 1
    heart.id = body.id
    with hn_transact(session):
        session.saveOrUpdate(body)
        session.saveOrUpdate(heart)
 
# session 2
with hn_session(Heart, Body) as session:
    with hn_transact(session):
        b = session.get(Body, 1)
        assert b
        assert b.heart
        assert b.heart.id == 1

Summary

With Hibernate support in Jython we notice another clear departure from the CPython world and its web frameworks and components. Hibernate is distinctively Java and special techniques are needed to create compile time Java properties in a dynamic language. Jython has long been a second citizen in Python land. I suspect this is going to change with support of Java frameworks which alone have as many users/downloads as Python.

Jynx 0.3 - how to fix custom class loaders for use with Jython

Posted in Jynx, Jython on August 12th, 2009 by kay – Be the first to comment

Broken class loaders

Jynx 0.2 contained an ugly workaround for a bug I couldn’t fix for quite a while. The bug can be described as follows: suppose you defined code of a Java class A and compiled it dynamically:

A = JavaCompiler().createClass("A", A_source)

When you attempt to build a subclass

class B(A): pass
a NoClassDefFoundError exception was raised:

Traceback (most recent call last):
  File "C:\lang\Jython\jcompile.py", line 185, in <module>
    class B(A):pass
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
java.lang.NoClassDefFoundError: org/python/core/PyProxy (wrong name: A)

In that case the Jython runtime failed to create a proxy class for B while locating PyProxy which is a Jython core interface. From the traceback it wasn’t clear how to locate the error and I started to debug into Jython from Netbeans.

This is what happened: Jynx defines a ByteClassLoader class which is custom class loader for dynamic compilation of A. When A is loaded with loadClass a findClass method is called to locate A and this method had to be overwritten. The ByteClassLoader was bound to A automatically and used by Jython to locate interfaces such as org.python.core.PyProxy. This didn’t work and explains the failure. A possible fix is to respond to classes which cannot be dealt with from ByteClassLoader and delegate a findClass call to the parent class loader.

Curiously Jython stopped using ByteClassLoader after I changed the inheritance hierarchy from

class ByteClassLoader(ClassLoader):
    def __init__(self, code):
        super(ByteClassLoader, self).__init__(ClassLoader.getClassLoader())
        ...

to

class ByteClassLoader(URLClassLoader):
    def __init__(self, code):
        super(ByteClassLoader, self).__init__([], ClassLoader.getSystemClassLoader())
        ...

The URLClassLoader provides the opportunity to add URLs at runtime and therefore modifying the CLASSPATH dynamically.

No disk dumps in Jynx 0.3

Prior to Jynx 0.3 a workaround has been dumping A to disk and load the class from there. We discussed the subtle nuances of selecting the right class loader and loading A from disk moved the machinery into a correct state. This wasn’t only cumbersome but a hurdle when a programmer intended to work within a Java sandbox. With Jynx 0.3 I feel prepared to explore Java integration with Jynx on GAE-J.

Jynx 0.2 released

Posted in Java, Jynx, Jython on July 27th, 2009 by kay – Be the first to comment

I’ve released Jynx 0.2. Jynx is a Jython package which utilizes dynamic Java compilation from Jython and improves on Java scripting. With Jynx 0.2 two major new features are implemented now.

Annotation Extraction

In the initial Jynx 0.1 release an annotation object was defined which can be used as a decorator. A Python class such as

@JavaClass
class TestClass(Object):
    @annotation("Test")
    @signature("public void _()")
    def test_report_test_failure(self):
        assertTrue("length of empty list is 0", len([]) != 0)

equipped with the JavaClass decorator is compiled into a Java class on the fly which acts as a proxy for a Python object and provides the correct interface for being used within a Java framework which expects methods of a particular type signature and annotations. The class defined above can be used within JUnit 4.X.

Jynx 0.2 provides a new classmethod extract of the annotation class which can be used to extract Java annotation classes and acts as a factory function for Jython annotation objects.

# import Test annotation in JUnit 4.X
from org.junit import Test      
 
# a Python annotation object
Test = annotation.extract(Test) 
 
# keep a signature object as a parameter and returns a new Jython
# annotation object. The Java code generator will create a method
# with the correct signature and the @Test annotation
Test = Test(signature("public void _()")  
 
@JavaClass
class TestClass(Object):
    @Test
    def test_report_test_failure(self):
       assertTrue("length of empty list is 0", len([]) != 0)

As we see there is no overhead left here. When programming against a Java API / framework, Jython annotations can be defined within a single file and used application wide.

Classpath Manipulation

For reasons which are not completely transparent to me Java doesn’t permit runtime classpath manipulations. The JDK defines an addURL method in a special classloader called URLClassLoader. This method is protected and cannot generally be accessed without reflection. Internally the Sun JVM uses such a loader class ( or a subclass of it ) and when you are willing to accept a hack and programming against an implementation detail you can use the JVMs default class loader and add new paths to a classpath:

from java.lang import ClassLoader
systemLoader = ClassLoader.getSystemClassLoader()
systemLoader.addURL("file:///C|junit-4.6.jar")

Jynx defines a ClassPath class and a new sys module attribute classpath. Adding a file system path P to sys.classpath results in a method call

systemloader.addURL(URL("file:"+pathname2url(pth)))

which converts the file system path into a Java URL object and adds it to the classpath. Additionally the same path is added to the PYTHONPATH via sys.path:

sys.classpath.append(r"C:\junit-4.6.jar")

The advantage is that each Python package can maintain the Java packages it depends upon and no global CLASSPATH environment variable has to be adapted unless a Java or Jython class defines its own class loader.

Four things I’d change in Python - and a little more

Posted in Python on July 25th, 2009 by kay – 4 Comments

1. Import system

Replace the flat module cache by a set of ModuleTree objects rooted in nodes living on the PYTHONPATH. Apply relative path semantics by default and treat absolute paths as special cases. Internal paths which are used in import statements or for traversing ModuleTree objects and external ones ( file-system, zip-files, URLs etc. ) are related through representations of internal paths [1]. Representations shall be user definable. For ModuleTree objects custom import semantics may be defined. This replaces “import hooks” and provides similar functionality in a much safer and object oriented manner. Further effects: no physical module is imported twice for two different import paths; each module can be used as a script no matter how the path is written. No changes to the language syntax.

[1] What I mean here is a representation of a path algebra in systems which can be considered as the “environment” of Python. This sounds more grandiose than it actually is.

2. Decorators everywhere

This basically reflects my interest in improving Jython compliance with Java and lifting Jython classes to Java classes turning Java classes into Jython class proxies - everything at runtime. This doesn’t work without specifying Java interfaces in Jython. Those consist of two parts: type signatures + annotations. For functions and classes this works in Python without much hassle. With Python 3.0 style function annotations one can even remove a decorator for type signatures. It doesn’t work for members though. In Java you can write

public class CusomerId {
    @Id
    @Column(name = "CustId", nullable = false)
    private Integer cust_id;
}

In Python I want to write similarly

class CusomerId:
    @Id
    @Column(name = "CustId", nullable = False)
    cust_id = jproperty("private int")

which translates into

class CusomerId:
    cust_id = Id(Column(name="CustId", nullable=False)(jproperty("private int")))

This requires that assignment statements ( grammatically expr_stmt’s ) may be decorated, not just functions and classes.

3. A new opcode for code monitoring

I know Ned Batchelders coverage tool and I have written one by myself using EasyExtend. EasyExtends is more powerful in that it doesn’t only provide the simplest type of coverage namely statement coverage. However it uses source code weaving which might affect functionality in a highly reflective language. It would be far better to introduce a new opcode which is weaved into Pythons bytecode and acts as a sensor. The weaving can be activated using a command line option. The overall achievement is improved code monitoring. This solution might also be applied to improve debuggers by setting breakpoints within expressions.

4. Function annotation and the nonlocal statement backports

I wish to see function argument annotations and the nonlocal statement in Python 2.x.

Other things

Admittedly I felt a little depression after the huge disappointment which was Python 3. Instead of a bright future it looked much like a minor legacy transformation which essentially missed the point of relevant trends in language design which are marked by concurrency orientation and unification of declarative dataflow and OO in frameworks + languages like WPF/Silverlight, Flex and JavaFX. The best thing which can be said about Python 3 is that it didn’t turn into a running gag and actually shipped code.

However there are lots of positive news and much progress in many other areas. At many fronts Python performance is targeted: PyPy, Unladen Swallow, Psyco 2, Cython, Shedskin. Package distribution and deployment is addressed just like renovation of the standard library. With PyPy, Unladen Swallow, Jython and IronPython Python becomes or is already GIL free and fit for multicore. The one improvement I’m personally most pleased about is that of Jython. Aside from my eternal pets ( Trail + EasyExtend ) I enjoy exploring the Javaverse, which is incredibly rich, from the Jython + scripting angle with some promising first results, new challenges and also some disappointments. I actually expect the next 600 Python web frameworks of interest will not be written in CPython anymore but in Jython and IronPython using Java/.Net components. When will we see a Jython Enterprise Framework on the JVM which will be as powerful as Spring but as lightweight as Pylons?