About CST interpolation
Posted in Algorithms, Parsing, TBP, Trail on December 7th, 2009 by kay – 3 CommentsEli Bendersky has written a short overview article about Pythons _ast module which is supposed to make working with parse trees simpler by transforming them into other trees i.e. abstract syntax trees.
In this article I want to talk a bit about my reservations against this approach which is mostly justified by common wisdom, “convenience” and what not. IMO AST’s are a stumbling stone in the advancement of language front ends and you can do many things elegantly and more generic without them. They are there for a reason though and it is not easy to see immediately why you better get away with plain old concrete syntax trees ( CSTs ).
There are two major reasons I love CSTs.
- Parse trees are unambiguously represented by a grammar. So once you know the grammar you also no how to find and arrange nodes. For context free languages the grammar contains all syntactical information you’ll ever need.
- The reversion to source code is trivial. Just traverse the parse tree inorder, visit the leaf nodes containing terminals and concatenate their string content. Only whitespace has to be inserted.
- For parsing purposes grammars are translated into finite state machines. Those machines can also be used for non-parsing purposes like parse tree interpolation which provides most of the benefits of ASTs but won’t cause any additional translation overhead.
I assume 1. and 2. won’t tell anything new to the reader and it might be 3. which might contain novel information. The major argument can be summarized using the following diagram
A grammar gives rise to a number of finite state machines - in fact one machine for one grammar rule. Not only are those machines used to parse source code into CSTs but they can also be used to operate on them. In particular they are used to
- check correctness of CSTs under transformations
- connect individual CST nodes through sequences of other CST nodes ( interpolation ).
- insert CST nodes into sequences of CSTs nodes which make up the content of a particular CST node (autocompletion)
Any of those tools/services only depend on the given grammar and are universally applicable to all languages. It is not much unlike regexps and regexp engines which uniformly apply matching and searching to all strings.
Only in the combination of verification, interpolation and autocompletion CSTs actually become handy for manipulation tasks. They also serve as a foundation for tools which are more close to the actual source code and define transformations in code without any additional syntax. That’s also why EasyExtend and successors will never see a particular macro language. Just like ASTs, macros are an obsolete technology in the light of proper language oriented transformation tools.
Parse tree representations
Take a look at the following AST constructor kept as an example from Eli’s article
Expression( body=BinOp( left=Str(s='xy'), op=Mult(), right=Num(n=3)))
The Expression constructor takes a node of type BinOp and produces a node of type Expression. It is used to represent that actual Python expression “xy”*3.
Now take take a look at the following kludge which represents the same information in the form of a concrete syntax tree:
>>> import parser >>> parser.expr('"xy"*3').tolist() [258, [326, [303, [304, [305, [306, [307, [309, [310, [311, [312, [313, [314, [315, [316, [317, [3, '"xy"']]]], [16, '*'], [315, [316, [317, [2, '3']]]]]]]]]]]]]]]], [4, ''], [0, '']]
The concrete parse tree is represented in the form of a nested list and yields all sorts of numerical tags which identify grammar rules being applied in top down parsing. The numerical tags shall be called node identifiers or short node ids.
The formatting can be done a little nicer by translating the node ids into node names and displaying the tree in tree form:
eval_input -- NT`258 testlist -- NT`326 test -- NT`303 or_test -- NT`304 and_test -- NT`305 not_test -- NT`306 comparison -- NT`307 expr -- NT`309 xor_expr -- NT`310 and_expr -- NT`311 shift_expr -- NT`312 arith_expr -- NT`313 term -- NT`314 factor -- NT`315 power -- NT`316 atom -- NT`317 STRING -- T`3 L`1 "xy" STAR -- T`16 L`1 * factor -- NT`315 power -- NT`316 atom -- NT`317 NUMBER -- T`2 L`1 3 ENDMARKER -- T`0 L`2 ''
It doesn’t change much in principle though. The AST is an order of magnitude more concise, more readable and better writable.
Searching nodes
Searching within a CST isn’t much of a problem and it is actually quite easy when we know the grammar. All that is needed are two functions find_first and find_all which keep a node and a node id as arguments. So when we seek for a particular node e.g. term in the syntax tree we just call find_first( node, symbol.term) where symbol.term is the node id of term encoded in symbol.py which is a standard library module. So for
nd = parser.expr(’”xy”*3′).tolist() we can apply find_first(nd, symbol.term)which returns
term -- NT`314 factor -- NT`315 power -- NT`316 atom -- NT`317 STRING -- T`3 L`1 "xy"
Traces
We want to name CST constructors just of the nodes they create. So expr creates a node of type symbol.expr, STRING a node of token.STRING and so on. In order to create a correct expr we have to call lots of node constructors. In source code this would be something like
expr(xor_expr(…(term(factor(…(STRING(”xy”)…), STAR(”*”), factor(…(NUMBER(”3″)…))…))
This doesn’t look much like noise reduction, but now consider this: when expr is created by the parser the parser starts with nothing but a sequence A = (STRING(”xy”), STAR(”*”), NUMBER(”3″)). So why isn’t it possible to start with A and expr and build expr(*A) ? We want to face a slightly more general problem namely having a sequence of nodes A = (a1, a2, …, ak) which are not necessarily token and a node constructor expr. Can we build a node expr(a1, …,ak) of type symbol.expr?
What is needed to identify an admissible sequence A with this property?
First of all let’s take a look at the grammar rule description of expr
expr: xor_expr ('|' xor_expr)*
Any sequence of CST nodes which fits into this description shall be called a trace. So a sequence of nodes xor_expr VBAR xor_expr VBAR xor_expr is a trace of expr. But also xor_expr alone is a trace. So what is needed is to wrap a given sequence A = (a1, a2, …, ak) into a trace. We might start with the most simple case of a single node A= (a) which shall be wrapped into expr. As an example we consider A = (term).
Interpolation
In order to wrap term into expr we need a sequence of intermediate nodes xor_expr, and_expr,shift_expr, arith_expr` and then build
[expr, [xor_expr, [and_expr, [shift_expr, [arith_expr, term]]]]]
This sequence is uniquely determined by expr and term. In order to build one we must be sure there is no non-trivial information that has to be added like a STRING, NAME or NUMBER token which contains actual information.
So when there is no choice in building a wrapper of type N around M we write {N, M} and call it an interpolation between N and M. Interpolations can always be constructed algorithmically using syntactical information provided by the language grammar alone. If N = M, we identify {N, N} with N.
We have found already a valid interpolation {expr, term}. Other valid interpolations are {factor, STRING(”xy”)} and {factor, NUMBER(”3″)}. For term this already suffices to build a trace:
{factor, STRING(”xy”)}, STAR(”*”), {factor, NUMBER(”3″)}
and with this trace we get
{expr, term({factor, STRING(”xy”)}, STAR(”*”), {factor, NUMBER(”3″)})}
Now we are prepared to define an algorithm:
Let N be a node and A = (a1, ..., ak) a sequence of nodes. Consider also the set of all nodes set(M1, ..., Mn) with {N, Mi}, i=1,...,n being a valid interpolation starting with N. For each Mi, i=1,...,n we try to build a trace TA = ({A1, a1}, {A2, a1}, ..., {Ak, ak}). If we have a found a trace for M is we get the result {N, M({A1, a1}, {A2, a1}, ..., {Ak, ak})}
Autocompletion
Sometimes our algorithm might fail to find a trace for a node N and a sequence A but the error can still be corrected in a fully determinate fashion. Take the following rule for example:
dotted_name: NAME (’.’ NAME)*
together with a sequence A = (NAME, NAME) of two nodes. Obviously we cannot build a valid trace NAME DOT NAME from A directly but the insertion of DOT into the trace is fully determined by the structure of the rule. Moreover there is no degree of freedom in the selection of the token string for DOT. It can always only be “.”. So it is possible to omit the DOT in A and still get a uniquely determined trace for dotted_name.
Applications
We’ve come to an end already. With the prerequisites given above it is perfectly possible to write
expr(STRING(”xy”), STAR(”*”), NUMBER(”2″))
and get a valid parse or even shorten it and write
expr(’”xy”‘, “*”, 2)
which suffices to identify the token. Remind that this construction yields a parse tree which can be converted immediately to source code.
fn = lambda name, val: expr_stmt(name, “=”, val)
This lambda expression yields bindings of val to name. For example
fn(”a”, expr(’”xy”‘, “*”, 2)) is the parse tree equivalent of a = “xy”*2.
Notice that in any case the parse tree is syntactically correct by construction.
Wrapper Functions
Sometimes it is not easy to see how some expression with a particular semantics can be built. Take a function call for example. The Python grammar doesn’t provide a special node for it but just uses a special form of power which is defined as
power: atom trailer* ['**' factor]
This is very noisy and one better builds a functional wrapper which can be used for all sorts of calls:
def CST_Call(names, args = None, star_args = None, dstar_args = None): Names = [atom(names[0])] for name in names[1:]: Names.append(trailer('.', name)) ARGS = list(args) if args else [] if star_args: ARGS+=['*', star_args] if dstar_args: ARGS+=['**', dstar_args] if ARGS: return [symbol.power] + Names + [trailer('(', arglist(*ARGS), ')')] else: return [symbol.power] + Names + [trailer('(', ')')]