No jumps
When I began to work on EasyExtend in 2006 I grabbed a Python parser from the web, written by Jonathan Riehl (it doesn’t seem to be available anymore). It was a pure Python parser following closely the example of CPython’s pgen. The implementation was very dense and the generated parser probably as fast as a Python parser could be. It was restricted to LL(1) though, which became a severe limitation as I stepped deeper into the framework.
In mid 2007 I created a parse tree checker. The problem was that a parse tree could be the return value of a transformation of another parse tree: T : P -> P*. How do we know that P* is still compliant with a given syntax? This can easily be solved by chasing the NFAs of the target grammar, both horizontally, i.e. within an NFA, and vertically: calling checkers recursively for each parse tree node which belongs to a non-terminal. This checker generator was only a tiny step away from a parser generator, which I started to work on in summer 2007.
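To make the idea concrete, here is a minimal sketch of such a checker, assuming parse trees are nested lists of the form [rule_id, child, ...] and every non-terminal comes with an NFA given as a transition table. The names and the data layout are illustrative, not Langscape’s actual API.

def check(tree, nfas, accepting):
    """Recursively check a parse tree node against the NFA of its rule.

    tree:      nested list [rule_id, child, ...]; terminal children are
               plain token strings
    nfas:      nfas[rule][state] -> iterable of (label, next_state) pairs
    accepting: accepting[rule]   -> set of accepting states
    """
    rule, children = tree[0], tree[1:]
    states = {0}                                  # conventional start state
    for child in children:
        label = child[0] if isinstance(child, list) else child
        # horizontal chase: follow every transition labelled with the child
        states = {nxt for s in states
                      for lab, nxt in nfas[rule].get(s, ())
                      if lab == label}
        if not states:
            return False                          # child not admissible here
        # vertical chase: recurse into children that are non-terminals
        if isinstance(child, list) and not check(child, nfas, accepting):
            return False
    return bool(states & accepting[rule])

# Hypothetical toy grammar  R: 'a' 'b'  with states 0 -a-> 1 -b-> 2
nfas      = {"R": {0: [("a", 1)], 1: [("b", 2)]}}
accepting = {"R": {2}}
print(check(["R", "a", "b"], nfas, accepting))    # True
print(check(["R", "a", "a"], nfas, accepting))    # False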
What I initially found when I worked on the parse tree checker was that horizontal NFA chasing never has to take into account that there are two alternative branches in rules like this one:
R: a* b | a* c
The algorithm never checks out the first branch, runs through a sequence of ‘a’s until it hits ‘b’ and then, when this fails, jumps back and checks out the other branch. There was no backtracking involved, and no backtracking with memoization either. There was simply never any jump. Instead both branches are traversed simultaneously until they become distinct. It’s easy to express this on the grammar level by applying left factoring to the rule:
R: a* ( b | c )
However, there was never any rule transformation to simplify the problem.
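The behaviour can be illustrated with a small Thompson-style state-set simulation; the NFA below is hand-written for exactly this rule and is not what Trail generates.

# Hand-written NFA for  R: a* b | a* c .  States 1 and 2 belong to the two
# branches; state 3 is the single accepting state.
NFA = {
    0: [("a", 1), ("a", 2), ("b", 3), ("c", 3)],   # start: both branches open
    1: [("a", 1), ("b", 3)],                        # branch  a* b
    2: [("a", 2), ("c", 3)],                        # branch  a* c
}
ACCEPT = {3}

def matches(tokens):
    states = {0}
    for tok in tokens:
        # follow all live branches at once -- no jump back, no memoization
        states = {nxt for s in states
                      for lab, nxt in NFA.get(s, ())
                      if lab == tok}
        if not states:
            return False
    return bool(states & ACCEPT)

print(matches(list("aaab")))   # True -- the  a* b  branch wins
print(matches(list("aaac")))   # True -- the  a* c  branch wins
print(matches(list("aabc")))   # False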
From rules to grammars
It’s actually an old approach to regular expression matching which is attributed to Ken Thompson. Russ Cox refreshed the collective memory about it a few years ago. This approach never seemed to make the transition from regular expressions to context-free grammars – or it did and was given up again, I don’t know. I wanted a parser generator based on the algorithms I had worked out for parse tree checkers, so I had to invent a conflict resolution strategy specific to CFGs. Take the following grammar:
R: A | B
A: a* c
B: a* d
Again we have two branches, marked by the names of the non-terminals `A` and `B`, and we want to decide late which one to choose.
First we turn the grammar into a regular expression:
R: a* c | a* d
but now we have lost context/structural information, which somehow needs to be added back:
R: a* c <A> | a* d <B>
The symbols A and B do not match a character or token. They merely represent the rules which would have been used when the matching algorithm scans beyond ‘c’ or ‘d’. So once the scanner reaches <A> it is finally decided that rule A was used; the same is true for B. Our example grammar is LL(*), and in order to figure out whether A or B is used we need, in principle at least, infinite lookahead. This hasn’t changed through rule embedding, but now we can deal with the LL(*) grammar as if it were an LL(1) grammar plus a small context marker.
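A rough sketch of how such a marker can be exploited: the embedded rule is again simulated with a set of states, and the chosen rule is read off from the transitions that survive once ‘c’ or ‘d’ has been seen. The NFA below is hand-written for this example and the data layout is mine, not Trail’s.

# Hand-written NFA for the embedded rule  R: a* c <A> | a* d <B>.
# Each transition carries the token it consumes and the rule it came from.
NFA = {
    0: [("a", "A", 1), ("a", "B", 2), ("c", "A", 3), ("d", "B", 3)],
    1: [("a", "A", 1), ("c", "A", 3)],    # the a* c part contributed by A
    2: [("a", "B", 2), ("d", "B", 3)],    # the a* d part contributed by B
}
ACCEPT = {3}

def parse_R(tokens):
    # (state, rule that contributed the last transition); None at the start
    states = {(0, None)}
    for tok in tokens:
        states = {(nxt, rule)
                  for s, _ in states
                  for lab, rule, nxt in NFA.get(s, ())
                  if lab == tok}
        if not states:
            return None                   # syntax error
    decided = {rule for s, rule in states if s in ACCEPT}
    # exactly one rule survives once 'c' or 'd' has been read
    return decided.pop() if len(decided) == 1 else None

print(parse_R(list("aac")))   # 'A' -- decided only when 'c' was read
print(parse_R(list("aad")))   # 'B'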
Reconstruction
What is lacking in the above representation is information about the precise scope of A and B once they are embedded into R. We rewrite the grammar slightly by indexing each of the symbols on the RHS of a rule by the name of the rule:
R: A[R] | B[R]
A: a[A]* c[A]
B: a[B]* d[B]
Now we can embed A and B into R while being able to preserve the context:
R: a[A]* c[A] <A[R]> | a[B]* d[B] <B[R]>
Matching the string aad now yields the following sequence of sets of matching symbols:
{a[A], a[B]}, {a[A], a[B]}, {d[B]}, {<B[R]>}
All of the indexed symbols in a set match the same symbol. The index has no impact on the matching behavior, so a[X], a[Y], … will always match a.
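Here is a sketch that produces such a trace, using the same state-set simulation as before; the encoding of indexed symbols as strings like a[A] and the NFA layout are my own choices for readability, not Trail’s internal representation.

# Hand-written NFA for  R: a[A]* c[A] <A[R]> | a[B]* d[B] <B[R]>.
# Transitions carry (token, indexed symbol, next state); the exit markers
# <A[R]> and <B[R]> do not consume a token (label None).
NFA = {
    0: [("a", "a[A]", 1), ("a", "a[B]", 2), ("c", "c[A]", 3), ("d", "d[B]", 4)],
    1: [("a", "a[A]", 1), ("c", "c[A]", 3)],
    2: [("a", "a[B]", 2), ("d", "d[B]", 4)],
    3: [(None, "<A[R]>", 5)],
    4: [(None, "<B[R]>", 5)],
}

def trace(tokens):
    """Return the sequence of sets of indexed symbols that matched per token."""
    states, sets = {0}, []
    for tok in tokens:
        step = {(sym, nxt) for s in states
                           for lab, sym, nxt in NFA.get(s, ())
                           if lab == tok}
        sets.append({sym for sym, _ in step})
        states = {nxt for _, nxt in step}
    # finally collect the exit markers reachable from the surviving states
    sets.append({sym for s in states
                     for lab, sym, _ in NFA.get(s, ())
                     if lab is None})
    return sets

print(trace(list("aad")))
# [{'a[A]', 'a[B]'}, {'a[A]', 'a[B]'}, {'d[B]'}, {'<B[R]>'}]  (set order may vary)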
Constructing a parse tree from the above set-sequence is done by reading the sequence from right to left and interpreting it appropriately.
We start the interpretation by translating the rightmost symbol:
<B[R]> -> [R, [B, .]]
The dot ‘.’ is a placeholder for a sequence of symbols indexed with B. It remains adjacent to B and is removed when the construction is completed:
[R, [B, .]]
[R, [B, ., d]]
[R, [B, ., a, d]]
[R, [B, ., a, a, d]]
[R, [B, a, a, d]]
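The reconstruction can be sketched in a few lines of Python for the trace above; the bracket handling is plain string fiddling, and the general case, where sets still contain competing symbols, is left out.

def reconstruct(trace):
    """Read the set-sequence from right to left and rebuild the parse tree."""
    # rightmost element: the exit marker, e.g. '<B[R]>' = rule B embedded in R
    marker = next(iter(trace[-1]))
    inner, outer = marker.strip("<>").rstrip("]").split("[")
    children = []
    # walk leftwards; keep the symbol indexed with the inner rule in each set
    for symbols in reversed(trace[:-1]):
        for s in symbols:
            tok, idx = s.rstrip("]").split("[")
            if idx == inner:
                children.insert(0, tok)      # fills the '.' placeholder
                break
    return [outer, [inner] + children]

trace_ = [{"a[A]", "a[B]"}, {"a[A]", "a[B]"}, {"d[B]"}, {"<B[R]>"}]
print(reconstruct(trace_))    # ['R', ['B', 'a', 'a', 'd']]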
Drawbacks
We can read the embedding process as ‘embed rules A and B into R’ or, dually, ‘expand R using rules A and B’. I’ve chosen the latter expression for the Trail parser generator because an expanded rule R has its own characteristics and is distinguished from an unexpanded rule.
The drawback of this method is that its implementation turns out to be rather complicated. It is also limited, because it may run into cyclic embeddings, which need to be detected. Moreover, successive embeddings can blow up the expanded rule to an extent that it makes sense to artificially terminate the process and fall back to a more general and less efficient solution. So we have to mess with it. And finally, isn’t there a performance penalty due to the process of reconstruction?
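For the cyclic case, here is a rough sketch of the kind of check that is needed; the representation of the grammar as a mapping from each rule to the non-terminals on its right-hand side is made up for this illustration.

def embeds(grammar, rule, target, seen=None):
    """Return True if expanding `rule` eventually requires embedding `target`
    again, i.e. the expansion is cyclic and has to be cut off."""
    seen = set() if seen is None else seen
    for nt in grammar.get(rule, ()):
        if nt == target:
            return True
        if nt not in seen:
            seen.add(nt)
            if embeds(grammar, nt, target, seen):
                return True
    return False

# A rule that indirectly embeds itself cannot be expanded completely:
grammar = {"R": {"A"}, "A": {"B"}, "B": {"R"}}
print(embeds(grammar, "R", "R"))   # True -> terminate expansion, fall back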
Performance
To my surprise I found that, in CPython, an LL(*) grammar that uses expansion quite heavily (expanded NFAs with about 1000 states are created) performs slightly better than a simple LL(1) grammar without any expansion. For comparison I used P4D, a conservative extension language of Python, i.e. a superset of Python: every string accepted by Python shall also be accepted by P4D.
In order to measure performance I created the following simple script:
import time
import decimal
import langscape

text = open(decimal.__file__.replace(".pyc", ".py")).read()
print "textlen", len(text)

python = langscape.load_langlet("python")
p4d = langscape.load_langlet("p4d")

def test(langlet):
    tokens = langlet.tokenize(text)
    a = time.time()
    for i in range(10):
        langlet.parse(tokens)
        tokens.reset()
    print "parser", langlet.config.langlet_name, (time.time() - a)/10

test(python)
test(p4d)
It imports a reasonably big Python module (decimal.py) and parses it with two different parsers generated by Trail. Running it using CPython 2.7 yields the following result:
parser python 2.39329998493
parser p4d 2.25759999752
This shows that P4D is about 5-6% faster on average! Of course the overall performance is abysmal, but keep in mind that the parser is a pure Python prototype implementation and I’m mainly interested in qualitative results and algorithms at this point.
I’ve also run the script with PyPy, both with the JIT activated and deactivated.
PyPy with option --jit off:
parser python 6.5631000042
parser p4d 5.66440000534
Now the LL(*) parser of P4D is about 13-14% faster than the LL(1) parser, which is a much clearer difference. Activating the JIT reverses the pattern though, and the intense caching of function calls pays off:
PyPy with JIT:
parser python 0.791500020027
parser p4d 1.06089999676
Here the P4D parser is about 1/3 slower than the Python parser.
The result of the competition depends on the particular implementation and the compiler/runtime optimizations or the lack thereof. The counter-intuitive result that an LL(*) parser is faster than an LL(1) parser could not be stabilized, but it also could not be clearly refuted. It’s still an interesting hypothesis though, and rule expansion may turn out to be a valid optimization technique – also for LL(1) parsers, which do not require it as a conflict resolution strategy. I will examine this in greater detail once I’ve implemented an ANSI C version of Trail.
Is it this Jon Riehl by any chance? http://log.jonriehl.com/ He is still doing parsers in Python.
Yes, Michael. The parser I’ve mentioned is part of the Basil project, which seems to be a reference implementation of PEP 269, written about a decade ago. I remember that I found the PyPgen parser in an early version of PyPy. It was dropped there later.
PyPgen doesn’t separate a tokenization phase from a parsing phase. Instead the parser works lazily and reads the next token from the tokenizer, which can produce it from a string on demand. One could do the tokenization in advance though and pass in a list of tokens wrapped into a handler with the necessary interface. Then it would be possible to compare the parser components speed-wise. Comparing the tokenizer used by Trail with tokenizer.py is not particularly fair, because the essential string matching in tokenizer.py is done using Python’s re module. So one would compare a pure Python solution against a C implementation – with a predictable result.