Archive for April, 2009

Python vs TTCN-3

Posted in DSL, Python, Testing on April 26th, 2009 by kay – 5 Comments

Some time ago computing scientists Bernard Stepien and Liam Peyton from the University of Ottawa compared Python with TTCN-3. TTCN-3 means Testing and Test Control Notation and it is domain specific language specifically designed for writing tests in the domain of telecommunication. In almost any category poor Python just loses in comparison.

A few things of the presentation are just odd. At several places it contains Python code that is not even consistent. Examples: on slide 42 an “else if” construct is used instead of the correct “elif”. On slide 13 fields are not assigned to self which leads to code that fails to run. But leaving this carelessnes aside there is something important that estranges an experienced Python programmer. Take slide 12 for example. Here the authors state:

TTCN-3 templates to Python
  • TTCN-3 templates could be mapped to Python object instances.
  • However, there are serious limitations using the above technique with Python.
  • Python objects are practically typeless.

Then the presentation goes over 18 slides just to explain how bad plain Python classes are for serving the same purposes as TTCN-3 templates. Although this is correct why should anyone care? If you want to experience an exhilarating effect you have to turn grapes into wine and do not expect the same fun from grape juice. Then you can compare cheap wine with premium one. That’s also why people are used to compare Django to Ruby On Rails and not Django to Ruby or Rails to Python. That’s why libraries and frameworks exist in the first place.

So what about modeling a Template class that has same functionality as a TTCN-3 template? Here is a first attempt:

class Template(object):
    def __init__(self):
        self._frozen = False
        self._fields = []        
 
    def __setattr__(self, name, value):
        if name in ('_fields', '_frozen'):
            object.__setattr__(self, name, value)
            return
        if self._frozen:
            raise RuntimeError("Cannot set attribute value on a frozen template")
        if name not in self._fields:
            self._fields.append(name)
        object.__setattr__(self, name, value)
 
    def __eq__(self, other):
        if len(self._fields)!=len(other._fields):
            return False
        else:
            for f_self, f_other in zip(self._fields, other._fields):
                val_self  = getattr(self, f_self)
                val_other = getattr(self, f_other)
                if val_self != val_other:
                    return False
            return True
 
    def __ne__(self, other):
        return not self.__eq__(other)
 
    def freeze(self):
        self._frozen = True
 
    def clone(self):
        T = Template()
        T._fields = self._fields[:]
        for field in T._fields:
            setattr(T, field, getattr(self, field))
        return T

The Template class manages an internal _fields attribute that contains the attribute names created by __setattr__ dynamically. This is done to preserve the sequential order of the fields on template comparison. In the current implementation there aren’t any fields marked as optional and it will take additional effort to introduce them and modify __eq__.

It is now easy to demonstrate the behavior of wildcards. First we have to introduce another class for this purpose:

class Wildcard(object):
    def __eq__(self, other):
        return True
 
    def __ne__(self, other):
        return False

For any object X which is compared to wildcard it is always X == wildcard == True. So following template instances are equal:

templ_1 = Template()
templ_1.field_1 = wildcard
templ_1.field_2 = "abc"
 
templ_2 = Template()
templ_2.field_1 = "xyz"
templ_2.field_2 = wildcard
 
assert templ_1 == templ_2

A Pattern class can be created in the same way as the Wildcard class:

class Pattern(object):
    def __init__(self, regexp):
        self.pattern = re.compile(regexp)
 
    def __eq__(self, other):
        try:
            return bool(self.pattern.match(other))
        except TypeError:
            return True
 
    def __ne__(self, other):
        return not self.__eq__(other)

An ‘==’ comparison of a Pattern instance with a string will match the string against the pattern and yield True if the string could be matched and False otherwise.

So it is perfectly possible to reproduce most aspects of TTCN-3 templates by just a few lines of Python code. On the downside of TTCN-3 it isn’t possible to do things the other way round. No matter how hard you work in TTCN-3 it is impossible to create new TTCN-3 types showing interesting behavior. Pythons expressiveness is miles ahead.

Tail recursion decorator revisited

Posted in Python on April 20th, 2009 by kay – 6 Comments

A while ago some of us tried to find the best solution for a tail recursion decorator. The story began when Crutcher Dunnavant came up with a surprising decorator that was able to eliminate tail recursion. His solution used stack inspections and could flatten the call stack s.t. the maximum height of the stack became 2. I gave an alternative, faster implementation that omitted stack inspections. This solution was further improved by Michele Simionato and George Sakkis. The story already ends here because the performance penalty of those decorators was still quite considerable. On the samples we used the undecorated recursive function was more than twice as fast as the decorated one. So the decorator added more than 100% overhead.

Today I tried something new. I reimplemented Georges decorator in Cython which is considerably faster than Python and then I compared the performance of undecorated code, code that is decorated with the pure Python decorator and the Cython version of it.

Here you can download the relevant source code for the decorators.

The following implementation shows the Cython decorator defined in tail_recursive.pyx

cdef class tail_recursive:
    cdef int firstcall
    cdef int CONTINUE
    cdef object argskwd
    cdef object func
 
    def __init__(self, func):
        self.func = func
        self.firstcall = True
        self.CONTINUE = id(object())
 
    def __call__(self, *args, **kwd):
        if self.firstcall:
            self.firstcall = False
            try:
                while True:
                    result = self.func(*args, **kwd)
                    if result == self.CONTINUE: # update arguments
                        args, kwd = self.argskwd
                    else: # last call
                        return result
            finally:
                self.firstcall = True
        else: # return the arguments of the tail call
            self.argskwd = args, kwd
            return self.CONTINUE

Here are some performance tests.

As a sample function I used a tail recursive factorial

def factorial(n, acc=1):
    "calculate a factorial"
    return (acc if n == 0 else factorial(n-1, n*acc))

The timining function is defined by

import time
def mtime(foo):
    a = time.time()
    for j in range(10):
        for i in range(900):
            foo(i)
    return time.time()-a

The results I got were:

8.484 -- undecorated
9.405 -- with Cython decorator   + 10%
17.93 -- with Python decorator   + 111%
Next I checked out a pair of mutual recursive functions. Notice that in those pairs only one function may be decorated by tail_recursive.

def even(n):
    if n == 0:
        return True
    else:
        return odd(n-1)
 
def odd(n):
    if n == 0:
        return False
    else:
        return even(n-1)

Here are the results:

2.969 -- undecorated
3.312 -- with Cython decorator   + 11%
7.437 -- with Python decorator   + 150%
These are about the same proportions as for factorial example.

My conclusion is that one can expect about 10% performance penalty for Cythons tail_recursive decorator. This is a quite good result and I don’t shy away from recommending it.

Universal Set in Python

Posted in Mathematics, Python on April 16th, 2009 by kay – 7 Comments

Building sets from set() and ANY

When defining a regular expression there is a well known duality between a set of characters specified by enclosing characters or character ranges within square brackets […] and defining the pattern of the complement set [^…] by preceding the caret symbol ^ .

  • [0-9] — matches all digits
  • [^0-9] — matches all non-digits

The nature of the complement set mentioned here is less precisely defined and is implementation dependent in practice. If a regexp matcher knows about ASCII only then [^0-9] is the set of all ASCII characters which are not digits.

Suppose now we want to build a character matching set from sets in the sense of a set data-type as inhabited in Python as a builtin. This is easy because we start from the empty set() and add just more elements. Each finite set can be constructed in this way and we can establish a set ~ pattern correspondence:

set(range(9)) ~ [0-9]

But how do we find a corresponding set for [^0-9]?

This cannot be finitely constructed from the bottom up using set() as a start value but it has to be finitely de-constructed from the set of all elements. This looks like we are running into unlimited comprehension and the Russel paradox but actually the set theory we are using is quite weak and element membership can be decided just by doing hashtable lookups.

Instead of constructing a set from set() we want to (de-)construct it from ANY. The name ANY is not chosen by chance because we want to express the correspondence with . pattern:

ANY ~ .

( Actually ANY is a little stronger because . doesn’t match newline characters unless explicitly specified by the DOTALL flag in the re.compile function. )

Using ANY we can easily state the above correspondence:

ANY – set(range(9)) ~ [^0-9]

Implementation of ANY

We define ANY as an instance of univset. The univset has a set-like interface and implements set operations consistently on sets of the type ANY-A, ANY-B,… where A, B… are finite sets constructed from set(). For example

ANY-A ∩ ANY-B = ANY – ( A ∪ B )
ANY-A ∪ B = ANY – ( A – B )

class univset(object):
    def __init__(self):
        self._diff = set()
 
    def __sub__(self, other):
        S = univset()
        if type(other) == set:
            S._diff = self._diff | other
            return S
        else:
            S._diff = self._diff | other._diff
            return S
 
    def __rsub__(self, other):
        return other & self._diff
 
    def __contains__(self, obj):
        return not obj in self._diff
 
    def __and__(self, other):
        return other - self._diff
 
    def __rand__(self, other):
        return other - self._diff
 
    def __repr__(self):
        if self._diff == set():
            return "ANY"
        else:
            return "ANY - %s"%self._diff
 
    def __or__(self, other):
        S = univset()
        S._diff = self._diff - other
        return S
 
    def __xor__(self, other):
        return (self - other) | (other - self)
 
    def add(self, elem):
        if elem in self._diff:
            self._diff.remove(elem)
 
    def update(self, elem):
        self._diff = self._diff - other
 
    def __ror__(self, other):
        return self.__or__(other)
 
    def union(self, other):
        return self.__or__(other)
 
    def difference(self, other):
        return self.__sub__(other)
 
    def intersection(self, other):
        return self.__and__(other)
 
    def symmetric_difference(self, other):
        return self.__xor__(other)
 
    def issubset(self, other):
        if type(other) == set:
            return False
        if issubset(other._diff, self._diff):
            return True
        return False
 
    def issuperset(self, other):
        if self._diff & other:
            return False
        return True
 
    def __lt__(self, other):
        return self.issubset(other)
 
    def __eq__(self, other):
        if type(other) == set:
            return False
        try:
            return self._diff == other._diff
        except AttributeError:
            return False
 
    def __ne__(self, other):
        return not self.__eq__(other)
 
    def __le__(self, other):
        return self.__lt__(other) or self.__eq__(other)
 
    def __gt__(self, other):
        return self.issuperset(other)
 
    def __gt__(self, other):
        return self.issuperset(other) or self == other

Examples

ANY = univset()
NON_DIGITS = ANY - set(range(9))
assert 8 not in NON_DIGITS
assert 'a' in NON_DIGITS
assert 0 in NON_DIGITS | set([0])

The statement that never was

Posted in General, Python on April 15th, 2009 by kay – 1 Comment

The Python release archive is a nice repository for archaeological findings. My biggest surprise was the existence of an access statement and access keyword in the pre-history of Python 1.5.0. The syntax was already specified in release 1.0 with an accompanying accessobject.c and it began to fade in Python 1.4.

The access statement

That is how the access statement was defined in 1.0

access_stmt: 'access' ('*' | NAME (',' NAME)*) ':' accesstype  (',' accesstype)*
accesstype: NAME+
# accesstype should be ('public' | 'protected' | 'private') ['read'] ['write']
# but can't be because that would create undesirable reserved words!

The access_stmt was a small statement i.e. one that doesn’t contain a block and could be used like print, del or exec.

So the proper use of the access statement would be like

class Point:
    access x, y, z: private write, public read

or more compact:

class Point:
    access *: private write, public read

The word public wasn’t a language keyword which made it possible to use it in this way:

class AccessModifier:
    access public: public
    access private: public
    access protected: public

Note that the access statement was purely declarative even more so than Java’s where you can write

public int x = 10;
and define x while specifying its type and access modifier. It was also more elaborated in terms of intended protections by means of differentiating between read and write.

How it never took off and gone away

In the HISTORY file of Python 1.0 we find the notice

There’s a new reserved word: “access”. The syntax and semantics are still subject of of research and debate (as well as undocumented), but the parser knows about the keyword so you must not use it as a variable, function, or attribute name.

In Python 1.4 the access statement was already uncommented from the Grammar.

The story ends with the following remark in the HISTORY of Python 1.5:

‘access’ is no longer a reserved word, and all code related to its implementation is gone (or at least #ifdef’ed out). This should make Python a little speedier too!

Pyload part I ( Path objects )

Posted in Python on April 7th, 2009 by kay – Be the first to comment

This article is the first in a series that studies the design of a module import system. Although this work is situated in the Python context it can stay on its own to a large extent and ideas may be transferred to other systems and languages. We will provide as much integration with Python as necessary but keep the structure as general as possible. The article series will roughly cover following topics:

  1. Path objects – module paths and their representation
  2. ModuleNode objects – loading and caching modules using Path objects
  3. Import system – binding ModuleNode objects to import statements
  4. Implementation – putting it all together in an EasyExtend langlet as a reference implementation

In this article we will discuss Path objects. This topic is foundational and a bit dry but I hope the reader will be compensated by the elegance of some its constructions. Path objects are used to establish a relationship between internally used name spaces and external path structures in physical or logical “media” like file-systems, zip-files or the web. Those path structures can also be fully abstract and represented by special data-structures only. We’ll provide several examples.

First some terminology is introduced. Many of the notions given here a rather abstract but they mostly capture what people know about Python modules, packages and file system paths anyway. They provide a conceptual background for specifications given later.

Terminology

A module name is an ordinary Python name: a finite sequence of alphanumeric characters and underscores starting with an alphabetic character or underscore. A module path is a dot-separated sequence of module names. It is possible for a module path to be preceded by dots. In that case a module path is called a relative path. A.B.C and ..A.B.C are both module paths but only the latter one is relative. The names of a module path are also called its components. So A, B, C are the components of the module path A.B.C.

Besides module paths we consider external paths. The intuitive meaning of an external path is that of a pointer to a location of a module in some medium. Most commonly file system paths are used as external paths: modules are represented as files and the dot separators are mapped onto file-system separators. Throughout this article we use a slash “/” as an external path separator. So A.B.C is a module path and A/B/C is an external path. A proper external path definition is given below using a Path abstract base class.

A module can be loaded from an external path which yields an interpreter level <module> object. Each <module> object shall have a unique module name. If M is a known module name we write <M>. It is also possible to load <module> objects from builtins or even create fresh <module> objects on the fly. In any case we still consider a <module> being loaded from a path. If no such path is available we associate the <module> with the empty path.

A module path A.B.C… is valid if an external path …/A/B/C/… exists and <A> can be loaded from …/A, <B> can be loaded from …A/B etc.

We call a module P a package if there is a module M and if P1.M and P2.M are valid module paths then P1 = P2 = P. So each module has at most one package. If a module has no package we call it an unpackaged or a free module. For any module the chain of packages P1.P2….M containing M shall be finite. This implies that each of such chains has a maximum length. If P0.P1.P2…M is a a module path of maximal chain of packages we call P0 a top level module and the module path a full path. Each unpackaged module is by definition also a top level module.

Notice that the concept of a namespace is a little more general than the ones we have defined. Each <module> has an associated namespace. This is usually a __dict__ for user defined modules. This namespace can contain other modules as well and might be changed dynamically. We rather intend to have a close reference of the module path concept with the way it is used in an import statement.

With PYTHONPATH we denote a set of external paths from which modules can be found. Those paths can represent file-system paths, zip-files or other entities. The PYTHONPATH may contain external paths that are paths of some package. In this case the modules we can reach from the PYTHONPATH are not all top-level modules. In this situation the top-level module P0 of a full path P0.P1…..M may not be reachable from PYTHONPATH. We call such a full path a finger.

The intuitive idea of an external path as a file path is actually a bit too simplistic. In practice there might be various files that might correspond to a module. For example A.py, A.pyc, A.pyd are all files that correspond to a Python module A. An external path is a class of file paths and they are equivalent in the sense that they all describe a single module entity. In this sense a set of file suffixes is an equivalence class of file paths.

Path objects

Path objects are defined as abstract base classes. Concrete subclasses of Path are FSPath which uses file system operations to implement Pathmethods, ZipPath which combines file system path operations with those provided by the <zipfile> module. We have also defined a silly Path subclass called TuplePath. All of those Path objects are used to represent external paths or path-like objects.

from abc import *
class Path:
    __metaclass__ = ABCMeta
    def __init__(self, pathobj):
        self.path = pathobj
 
    @abstractmethod
    def fullpath(self):
        '''
        Derives full module path from given path.
        '''
 
    def find(self, modulepath):
        '''
        Interprets the modulepath as a sequence of parent / joinpath operations.
        '''
        P = self
        components = modulepath.split(".")
        if components[-1] == '':
            components.pop()
        for component in components:
            if component is "":
                P = P.parent()
            else:
                P = P.joinpath(component)
            if not P.exists():
                raise ValueError("invalid path object: '%s'"%P.path)
        return P
 
    @abstractmethod
    def ispackage(self):
        '''
        Returns true if path belongs to a package and false otherwise.
        '''
 
    @abstractmethod
    def ismodule(self):
        '''
        Returns True if path corresponds with a Python module and False otherwise.
        '''
 
    @abstractmethod
    def isempty(self):
        '''
        Returns True if this path object is empty and False otherwise.
        '''
 
    @abstractmethod
    def exists(self):
        '''
        Returns True if the path object is valid and False otherwise.
        '''
        return self.path in self.files
 
    @abstractmethod
    def parent(self):
        '''
        If path is .../A/B/C a new path object initialized by .../A/B will be
        created and returned.
        '''
 
    @abstractmethod
    def children(self):
        '''
        All valid one element extensions of path. If there are no such children return
        None.
        '''
 
    @abstractmethod
    def joinpath(self, *args):
        '''
        Joins the current path object with those provided by args in the sequence of
        occurrence in args. So if this path is ../A/B then self.joinpath('C/D', 'E') will
        return ../A/B/C/D/E.
        '''
 
    @abstractmethod
    def base(self):
        '''
        Returns the rightmost element of a path. If ../A/B/C is a path it C will be
        returned.
        '''
 
    @abstractmethod
    def split(self):
        '''
        Returns a tuple containing path components.
        '''
 
    @abstractmethod
    def splitbase(self):
        '''
        Returns the tuple (parent, base) for a given path.
        '''
 
    def __repr__(self):
        return "<%s : %s>"%(self.__class__.__name__, self.path)

The methods find and fullpath establish the relationship between Path objects and module paths. The fullpath method is intended to return the module path that corresponds to the Path object. The concrete find method is a representation of a module path in terms of the parent and joinpath methods which are also defined in the Path class. So each module path has a an unambiguous interpretation as a sequence of operations on a Path object.

FSPath objects

Now we start looking at concrete examples. The first and most important example is that of the FSPath class which relies on file system operations defined in os.path. The FSPath object enables configurable file suffixes. In Python file-suffix data are hardcoded. They can be accessed using the <imp> module function get_suffixes(). In our implementation we provide a class wrapper called SuffixInfo for the 3-tuples that describes a suffix.

import os
import imp
 
class SuffixInfo(object):
    def __init__(self, suffix_data):
        self.suffix    = suffix_data[0]
        self.read_mode = suffix_data[1]
        self.priority  = suffix_data[2]
 
class FSPath(Path):
    modulesuffixes = set()
    def __init__(self, pathobj):
        self.path = pathobj
 
    @classmethod
    def add_suffix(cls, suffixinfo):
        cls.modulesuffixes.add(suffixinfo)
 
    def fullpath(self):
        if self.ismodule():
           module = self.base()
           name, ext = os.path.splitext(module)
           P = self.parent().fullpath()
           return P+"."+name if P else name
        elif self.ispackage():
           directory, name = self.splitbase()
           P = directory.fullpath()
           return P+"."+name if P else name
        else:
           return ""
 
    def ispackage(self):
        if os.path.isdir(self.path):
            if os.path.isfile(self.path+os.sep+"__init__.py"):
                return True
        return False
 
    def isempty(self):
        return self.path == ""
 
    def exists(self):
        return self.ispackage() or self.ismodule()
 
    def ismodule(self):
        suffixes = [""]+[suffix.suffix for suffix in self.modulesuffixes]
        for suffix in suffixes:
            if os.path.isfile(self.path+suffix):
                return True
        return False
 
    def parent(self):
        return self.__class__(os.path.dirname(self.path))
 
    def children(self):
        if os.path.isdir(self.path):
            return [self.__class__(self.path.join(f)) for f in os.path.listdir(self.path)]
 
    def joinpath(self, *args):
        return self.__class__(os.sep.join([self.path]+list(args)))
 
    def base(self):
        return os.path.basename(self.path)
 
    def split(self):
        return self.path.split(os.sep)
 
    def splitbase(self):
        return self.parent(), self.base()

The next two path objects have separate listings.

ZipPath objects

The ZipPath object is similar to the FSPath object. The main difference is that there is no natural way to walk within a zip-file and the ZipPath class pulls out the packed directory structure and makes it explicit. This particular step has to be made only once per zip-file which means that for all parent(), joinpath() operations that create new ZipPath objects from an existing ZipPath no new unzip operations are performed and no new inspections are needed. Those will become relevant at different places where we have to load module content – but load operations don’t affect Path objects.

TuplePath objects

The TuplePath object may be entirely useless but since we enter here the level of plain data structures it might inspire more exotic Path objects that could be useful for some.

Broken beyond repair – Pythons import system

Posted in Python on April 3rd, 2009 by kay – 9 Comments

It often happens that language features we have once distasted become comfortable after a while. We learn to live with them, take their advantages and forget about their annoyance. For many Python developers things like the explicit self in the argument list of a method is among them or len as a function. Experienced Python programmers are usually weakly excited about alternative proposals and the longer this quirk is discussed their mood turns towards -0 which is encoded as “why bothering?”.

Other features are the cause of continuous struggle and we never get used to them but don’t talk much about them because they aren’t low hanging fruits. They might even turn from bad to worse in the course of improving them in some respects. Those features feel like being broken beyond repair which means that it’s not easy to patch them but they need a fundamental redesign. In my opinion Pythons import machinery has this quality and it was regrettably forgotten in the many “cleanups” of Python 3.

Pythons import machinery is complicated to understand but this alone isn’t an argument against it. Many systems are far more complex but they are well thought out. The import machinery seems to be designed by implementation and cobbled together. This has finally lead to accidental design artifacts like the module/script distinction which makes sense only in the light of Pythons import implementation.

Python always suffered from the problem that two modules M1.py and M2.py which import a third module M3.py from different import paths also receive different module objects <M3>. This is as if we gave up coordinate independence but believed as we are moving through the space all objects are changing continuously. The objects themselves not their perspective of them. We really get different objects and we can’t compare our data with each other.

Not only are module objects path dependent but the success of a module import may depend on the liveliness of other modules. Suppose M1.py defines the following relative import: from . import M2. This import fails if the package P containing both M1.py and M2.py has not been imported already. When running M1.py from the command line it’s not relevant that we declared P as a package by placing an __init__.py file into it we still get the increasingly famous error message:

ValueError: Attempted relative import in non-package

This is not a Monty Python skit. The package exists and exists not. The interpreter doesn’t make any use of the availability of package P as long as it hasn’t been imported explicitly. But why does it has to be imported only to find M2.py? There is no obvious answer. It’s just the way it is implemented.

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. (Donald E. Knuth)

Back to the first problem. Once M1.py has imported M3.py the <M3> will be accessible on M1’s import path in the module cache sys.modules which is a flat dictionary of {module-path: <module>} pairs. The module M2.pyuses a different import path and a new and different module <M3> is created which is cached using another module-path.

This lookup with a module-path string is fast and it is usually faster than ordinary attribute lookup in objects which has to jump through a nested dictionary. I do think that’s basically it. If we implement a proper module cache providing the moderate performance of attribute lookups in objects we can get rid of the tedious import machinery that Python has right now. But I’d like to go even a little further and claim that a whole import system can be designed directly from a tiny set of objects and their properties. No Finders, Loaders, Importers and all the clunky stuff which pretends to be objects but is actually a service which serves the void.

I’ll flesh this idea out in a series of articles on this blog soon. This can be understood then as the constructive part of my criticism.

Newsflash: chances are that higher Python officers will succeed in confusing the situation even more. PEP 382 is on the way. Good luck with explaining why Python needs two notions of packages whereas it is difficult enough to justify even one. It’s not a coincidence though that this is placed in the context of setuptools. When Pythons import system is a tragedy than distutils and the layers above are the accompanying farce.