Tail recursion decorator revisited

A while ago some of us tried to find the best solution for a tail recursion decorator. The story began when Crutcher Dunnavant came up with a surprising decorator that was able to eliminate tail recursion. His solution used stack inspections and could flatten the call stack such that its maximum height became 2. I gave an alternative, faster implementation that omitted the stack inspections. This solution was further improved by Michele Simionato and George Sakkis. The story ended there because the performance penalty of those decorators was still considerable: on the samples we used, the undecorated recursive function was more than twice as fast as the decorated one, so the decorator added more than 100% overhead.

Today I tried something new. I reimplemented George's decorator in Cython, which is considerably faster than pure Python, and then compared the performance of undecorated code, code decorated with the pure Python decorator, and code decorated with the Cython version.

Here you can download the relevant source code for the decorators.

The following listing shows the Cython decorator, defined in tail_recursive.pyx

cdef class tail_recursive:
    cdef int firstcall
    cdef int CONTINUE
    cdef object argskwd
    cdef object func
 
    def __init__(self, func):
        self.func = func
        self.firstcall = True
        self.CONTINUE = id(object())
 
    def __call__(self, *args, **kwd):
        if self.firstcall:
            self.firstcall = False
            try:
                while True:
                    result = self.func(*args, **kwd)
                    if result == self.CONTINUE: # update arguments
                        args, kwd = self.argskwd
                    else: # last call
                        return result
            finally:
                self.firstcall = True
        else: # return the arguments of the tail call
            self.argskwd = args, kwd
            return self.CONTINUE

Here are some performance tests.

As a sample function I used a tail recursive factorial

def factorial(n, acc=1):
    "calculate a factorial"
    return (acc if n == 0 else factorial(n-1, n*acc))
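For reference, the pure Python decorator used in the comparison can be sketched roughly as follows and applied to this factorial. This is a sketch, not the exact code from the downloadable file; it mirrors the Cython class above, but as an assumption of the sketch it uses a unique object sentinel compared with `is` instead of the `id()`-based integer:

```python
class tail_recursive:
    """Trampoline decorator sketch of the pure Python variant.

    Mirrors the Cython class above; the object sentinel compared with
    `is` is an assumption of this sketch, not necessarily the original.
    """
    def __init__(self, func):
        self.func = func
        self.firstcall = True
        self.CONTINUE = object()  # unique marker for a pending tail call

    def __call__(self, *args, **kwd):
        if self.firstcall:
            # Outermost call: run the trampoline loop ourselves.
            self.firstcall = False
            try:
                while True:
                    result = self.func(*args, **kwd)
                    if result is self.CONTINUE:   # tail call requested:
                        args, kwd = self.argskwd  # pick up its arguments
                    else:                         # genuine result
                        return result
            finally:
                self.firstcall = True
        else:
            # Inner (recursive) call: record the arguments and unwind
            # immediately, keeping the stack flat.
            self.argskwd = args, kwd
            return self.CONTINUE

@tail_recursive
def factorial(n, acc=1):
    "calculate a factorial"
    return acc if n == 0 else factorial(n - 1, n * acc)

print(factorial(5))  # 120
```

With the decorator in place, `factorial(5000)` also runs without hitting the interpreter's recursion limit, since every recursive step unwinds through the trampoline.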

The timing function is defined by

import time
def mtime(foo):
    a = time.time()
    for j in range(10):
        for i in range(900):
            foo(i)
    return time.time()-a

The results I got were:

8.484 -- undecorated
9.405 -- with Cython decorator   + 10%
17.93 -- with Python decorator   + 111%

Next I checked a pair of mutually recursive functions. Note that in such a pair only one function may be decorated with tail_recursive.

def even(n):
    if n == 0:
        return True
    else:
        return odd(n-1)
 
def odd(n):
    if n == 0:
        return False
    else:
        return even(n-1)

Here are the results:

2.969 -- undecorated
3.312 -- with Cython decorator   + 11%
7.437 -- with Python decorator   + 150%

These are roughly the same proportions as in the factorial example.

My conclusion is that one can expect about a 10% performance penalty for Cython's tail_recursive decorator. That is quite a good result, and I don't shy away from recommending it.

  1. I suggest the timeit standard library module for more accurate performance measurement.

  2. It looks like today is Tail Recursion day. I just posted the following comment in http://www.teamrubber.com/blog/python-tail-optimisation-using-byteplay/ :

“Even though I believe tail-recursive calls are pretty rare in real-world scenarios, it would be interesting to know what percentage of the calls are eligible for TCO in a typical Python program. Would you happen to have the numbers at hand?

Also, reusing the frame is not going to be possible when it is not wide enough to hold all the variables of the callee. I'll probably toy a bit with the idea in our Python JIT when I get the opportunity; tell me if you would like to hear about the results.”

The same question applies: do you know if somebody already went through the work of figuring out the impact general TCO could have on representative code?

  3. kay says:

Marius, I get the same qualitative results and the same distribution of values when using time.clock() instead of time.time() on Windows. timeit doesn't do anything else and won't bring more accuracy by magic (just take a look at the function inner defined in timeit).

Damien, yeah I noticed the byteplay solution popped up as well. So we are bigger than Sun and Oracle together 🙂

I ran at least twice into a situation that forced a rewrite of a recursive function. In both cases a large data structure was created and the rewrite wasn't trivial. (One can also raise the stack recursion limit, but I'm a little scared of that.) That's also why politico-economic arguments about use-case statistics, popularity etc. leave me somewhat cold.

Here are a few more numbers on the differences in factorial execution. There are no numbers for the byteplay version because the author didn't make his code available.

    0.421  --  with Psyco
    0.480  --  + 14%  Cython function
    0.643  --  + 52%  Cython function + Cython deco
    0.805  --  + 91%  undecorated
    0.885  --  + 110%  with Cython decorator
    1.089  --  + 158%  with Psyco + Cython decorator
    1.685  --  + 300%  with Python decorator
    

    I was surprised that Psyco yields better results than the Cython version.

  4. Leon P Smith says:

This story is (slightly) incomplete: the original decorator was critiqued in "Tail call elimination decorator in Python", a front-page story posted by Kay Schluehr, with myself playing the role of the curmudgeon, and Michel Salim and Kay offering suggestions about how to fix things.

Nice to hear how things have progressed. I played around with your file a bit, and it looks like the suggestion that only one of the two mutually-recursive functions be decorated is obsolete. I'm still not convinced that this properly handles tail calls, although it doesn't appear to be semantically unsound like the original.

    Also, Damien, the prevalence of tail calls in “real life” depends on whether or not they are properly handled. So your suggestion that they don’t appear very often is rather circular logic.

  5. kay says:

Leon, I am Kay Schluehr and I posted the LtU story 🙂

Your criticism was and still is an important contribution, and living proof that implementers and testers should not be the same person. Thanks a lot.

BTW my results concerning mutually-recursive functions that are both decorated are negative for the Cython decorator (TypeError exception) but positive for the Python decorator.

You might have noticed the TCO storm in the wiki-reddit-blogosphere after Guido blogged three articles about this topic. I feel those things should settle somewhat before I reconsider the modest workarounds I do like. Ironically, my interest in the decorator was revitalized by a concrete use case I had, but the algorithm has since been completely redesigned for entirely different reasons.

  6. Leon P Smith says:

Haha, I suspected that you were, but I didn't catch on that this was your blog at first! For whatever reason the author isn't listed when you view the page directly. I didn't try out the Cython decorator, so I'll take your word for it. And yes, I have noticed that Guido's post has sparked quite a lot of discussion.

    The reason I still don’t like this from an implementation point of view is that functions are not tail recursive, rather call sites are. Now, I haven’t taken the time to understand all the ins and outs of decorators, but it looks to me like the current decorator, if you decorate both even and odd, temporarily has two stack frames. It’s not until the function re-enters itself that the superfluous frames are dropped.

    I’ll admit I prefer tail recursion to writing loops, but I’m crazy and I don’t particularly expect anybody to agree with me. Thanks to Python’s parallel assignment operator, writing loops isn’t terribly worse than tail recursion, in my opinion.

The real beauty comes when you understand the basics of how a good Scheme or ML compiler works (and Scheme in particular has a lot of not-so-great implementations available). A lot of people seem to think that compilers rewrite tail recursion into a loop, when really the implementation methodology achieves almost exactly the same effect in those specialized cases where tail recursion could be rewritten as a loop.
