About CST interpolation

Eli Bendersky has written a short overview article about Python's `_ast` module, which is supposed to make working with parse trees simpler by transforming them into other trees, i.e. abstract syntax trees.

In this article I want to talk a bit about my reservations against this approach, which is mostly justified by common wisdom, “convenience” and the like. In my opinion ASTs are a stumbling block in the advancement of language front ends, and many things can be done elegantly and more generically without them. They are there for a reason though, and it is not easy to see immediately why you are better off with plain old concrete syntax trees (CSTs).

There are three major reasons I love CSTs.

  1. Parse trees are unambiguously represented by a grammar. So once you know the grammar you also know how to find and arrange nodes. For context-free languages the grammar contains all syntactical information you’ll ever need.
  2. The reversion to source code is trivial. Just traverse the parse tree depth first, left to right, visit the leaf nodes containing terminals and concatenate their string content. Only whitespace has to be inserted (a minimal sketch follows the list).
  3. For parsing purposes grammars are translated into finite state machines. Those machines can also be used for non-parsing purposes like parse tree interpolation which provides most of the benefits of ASTs but won’t cause any additional translation overhead.
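
Point 2 can be made concrete in a few lines of Python. The following is a minimal unparser sketch over the nested-list CSTs returned by `parser.expr(...).tolist()`; whitespace handling is deliberately naive (a single space between tokens), so the output is equivalent to, but not layout-preserving of, the original source:

import parser, token

def cst_to_source(node):
    # Terminal nodes look like [token_id, string]; nonterminal nodes
    # look like [node_id, child, child, ...].
    if token.ISTERMINAL(node[0]):
        return node[1]
    return " ".join(filter(None, (cst_to_source(child) for child in node[1:])))

print(cst_to_source(parser.expr('"xy"*3').tolist()))   # "xy" * 3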

I assume 1. and 2. won’t tell the reader anything new; it is 3. that may contain novel information. The major argument can be summarized as follows.

A grammar gives rise to a number of finite state machines – in fact one machine per grammar rule. Those machines are not only used to parse source code into CSTs, they can also be used to operate on them. In particular they are used to

  • check the correctness of CSTs under transformations,
  • connect individual CST nodes through sequences of other CST nodes (interpolation),
  • insert CST nodes into sequences of CST nodes which make up the content of a particular CST node (autocompletion).

Any of those tools/services depends only on the given grammar and is universally applicable to all languages. This is not unlike `regexps` and `regexp` engines, which uniformly apply matching and searching to all strings.

Only in the combination of verification, interpolation and autocompletion do CSTs actually become handy for manipulation tasks. They also serve as a foundation for tools which are closer to the actual source code and define transformations in code without any additional syntax. That’s also why EasyExtend and its successors will never see a particular macro language. Just like ASTs, macros are an obsolete technology in the light of proper language oriented transformation tools.

Parse tree representations

Take a look at the following AST constructor, taken as an example from Eli’s article:

Expression(
  body=BinOp(
         left=Str(s='xy'),
         op=Mult(),
         right=Num(n=3)))

The `Expression` constructor takes a node of type `BinOp` and produces a node of type `Expression`. It is used to represent the actual Python expression `"xy"*3`.
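
For reference, this dump can be reproduced with the stdlib `ast` module on the Python versions contemporary with this article (from Python 3.8 onwards `Str` and `Num` are reported as `Constant` nodes instead):

import ast
print(ast.dump(ast.parse('"xy"*3', mode='eval')))
# Expression(body=BinOp(left=Str(s='xy'), op=Mult(), right=Num(n=3)))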

Now take a look at the following kludge which represents the same information in the form of a concrete syntax tree:

>>> import parser
>>> parser.expr('"xy"*3').tolist()
[258, [326, [303, [304, [305, [306, [307, [309, [310,
[311, [312, [313, [314, [315, [316, [317, [3, '"xy"']]]],
[16, '*'],
[315, [316, [317, [2, '3']]]]]]]]]]]]]]]], [4, ''], [0, '']]

The concrete parse tree is represented in the form of a nested list and yields all sorts of numerical tags which identify the grammar rules applied in top-down parsing. The numerical tags shall be called node identifiers, or node ids for short.

The formatting can be done a little nicer by translating the node ids into node names and displaying the tree in tree form:

eval_input  -- NT`258
  testlist  -- NT`326
    test  -- NT`303
      or_test  -- NT`304
        and_test  -- NT`305
          not_test  -- NT`306
            comparison  -- NT`307
              expr  -- NT`309
                xor_expr  -- NT`310
                  and_expr  -- NT`311
                    shift_expr  -- NT`312
                      arith_expr  -- NT`313
                        term  -- NT`314
                          factor  -- NT`315
                            power  -- NT`316
                              atom  -- NT`317
                                STRING  -- T`3     L`1
                                  "xy"
                          STAR  -- T`16     L`1
                            *
                          factor  -- NT`315
                            power  -- NT`316
                              atom  -- NT`317
                                NUMBER  -- T`2     L`1
                                  3
  ENDMARKER  -- T`0     L`2
    ''

It doesn’t change much in principle though. The AST is an order of magnitude more concise, more readable and easier to write.
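
A small sketch of such a pretty printer, translating the ids via `symbol.sym_name` and `token.tok_name` (the helper is my own and omits the line numbers shown in the listing above):

import parser, symbol, token

def dump_cst(node, indent=0):
    node_id = node[0]
    if token.ISNONTERMINAL(node_id):
        print("  " * indent + "%s  -- NT`%d" % (symbol.sym_name[node_id], node_id))
        for child in node[1:]:
            dump_cst(child, indent + 1)
    else:
        # Terminal node: print the token name, then its string content.
        print("  " * indent + "%s  -- T`%d" % (token.tok_name[node_id], node_id))
        print("  " * (indent + 1) + repr(node[1]))

dump_cst(parser.expr('"xy"*3').tolist())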

Searching nodes

Searching within a CST isn’t much of a problem and it is actually quite easy when we know the grammar. All that is needed are two functions `find_first` and `find_all` which take a node and a node id as arguments. So when we look for a particular node, e.g. `term`, in the syntax tree we just call `find_first(node, symbol.term)` where `symbol.term` is the node id of `term` encoded in `symbol.py`, which is a standard library module. So for

`nd = parser.expr('"xy"*3').tolist()` we can apply `find_first(nd, symbol.term)` which returns

term  -- NT`314
  factor  -- NT`315
    power  -- NT`316
      atom  -- NT`317
        STRING  -- T`3     L`1
          "xy"

Traces

We want to name CST constructors after the nodes they create. So `expr` creates a node of type `symbol.expr`, `STRING` a node of type `token.STRING` and so on. In order to create a correct `expr` we have to call lots of node constructors. In source code this would be something like

`expr(xor_expr(…(term(factor(…(STRING("xy")…), STAR("*"), factor(…(NUMBER("3")…))…))`

This doesn’t look much like noise reduction, but now consider this: when `expr` is created by the parser, the parser starts with nothing but a sequence A = (STRING("xy"), STAR("*"), NUMBER("3")). So why isn’t it possible to start with `A` and `expr` and build `expr(*A)`? We want to face a slightly more general problem, namely having a sequence of nodes `A = (a1, a2, …, ak)` which are not necessarily tokens, and a node constructor `expr`. Can we build a node `expr(a1, …, ak)` of type `symbol.expr`?

What is needed to identify an admissible sequence A with this property?

First of all let’s take a look at the grammar rule description of `expr`

expr: xor_expr ('|' xor_expr)*

Any sequence of CST nodes which fits this description shall be called a trace. So a sequence of nodes `xor_expr VBAR xor_expr VBAR xor_expr` is a trace of `expr`. But `xor_expr` alone is also a trace. So what is needed is to wrap a given sequence `A = (a1, a2, …, ak)` into a trace. We might start with the simplest case of a single node `A = (a)` which shall be wrapped into `expr`. As an example we consider `A = (term)`.

Interpolation

In order to wrap `term` into `expr` we need a sequence of intermediate nodes `xor_expr, and_expr, shift_expr, arith_expr` and then build

`[expr, [xor_expr, [and_expr, [shift_expr, [arith_expr, term]]]]]`

This sequence is uniquely determined by `expr` and `term`. In order to build it we must be sure that no non-trivial information has to be added, such as a STRING, NAME or NUMBER token, which would carry actual information.

So when there is no choice in building a wrapper of type `N` around `M` we write `{N, M}` and call it an interpolation between `N` and `M`. Interpolations can always be constructed algorithmically using syntactical information provided by the language grammar alone. If N = M, we identify {N, N} with N.

We have already found a valid interpolation `{expr, term}`. Other valid interpolations are `{factor, STRING("xy")}` and `{factor, NUMBER("3")}`. For `term` this already suffices to build a trace:

`{factor, STRING("xy")}, STAR("*"), {factor, NUMBER("3")}`

and with this trace we get

`{expr, term({factor, STRING("xy")}, STAR("*"), {factor, NUMBER("3")})}`
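
To make the interpolation `{expr, term}` concrete: with the node ids from the `symbol` module the wrapper chain can be spelled out by hand. This is a hand-written illustration of this single case, not the general algorithm, and `term_node` is assumed to be an existing CST node of type `symbol.term`:

import symbol

def wrap_term_in_expr(term_node):
    # The chain expr -> xor_expr -> and_expr -> shift_expr -> arith_expr -> term
    # is fully determined by the grammar; no information has to be invented.
    node = term_node
    for node_id in (symbol.arith_expr, symbol.shift_expr, symbol.and_expr, symbol.xor_expr):
        node = [node_id, node]
    return [symbol.expr, node]

# yields [expr, [xor_expr, [and_expr, [shift_expr, [arith_expr, term_node]]]]]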

Now we are prepared to define an algorithm:

Let N be a node and A = (a1, ..., ak) a sequence of nodes.
Consider also the set of all nodes {M1, ..., Mn} such that
{N, Mi}, i=1,...,n is a valid interpolation starting with N.

For each Mi, i=1,...,n we try to build a trace
TA = ({A1, a1}, {A2, a2}, ..., {Ak, ak}) of Mi.

If we have found such a trace for some Mi we get the result
{N, Mi({A1, a1}, {A2, a2}, ..., {Ak, ak})}

Autocompletion

Sometimes our algorithm might fail to find a trace for a node `N` and a sequence `A` but the error can still be corrected in a fully determinate fashion. Take the following rule for example:

`dotted_name: NAME ('.' NAME)*`

together with a sequence `A = (NAME, NAME)` of two nodes. Obviously we cannot build a valid trace `NAME DOT NAME` from A directly, but the insertion of `DOT` into the trace is fully determined by the structure of the rule. Moreover there is no degree of freedom in the selection of the token string for `DOT`: it can only ever be ".". So it is possible to omit the `DOT` in A and still get a uniquely determined trace for `dotted_name`.
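
A sketch of how this kind of autocompletion looks for the `dotted_name` case, with the omitted `DOT` tokens filled in automatically (my own illustration, not EasyExtend's implementation):

import symbol, token

def dotted_name_from(name_nodes):
    # Insert the DOT tokens that were omitted from A; '.' is the only
    # possible choice at these positions, so nothing is guessed.
    node = [symbol.dotted_name, name_nodes[0]]
    for name in name_nodes[1:]:
        node += [[token.DOT, '.'], name]
    return node

# os.path as a dotted_name:
dotted_name_from([[token.NAME, 'os'], [token.NAME, 'path']])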

Applications

We’ve come to an end already. With the prerequisites given above it is perfectly possible to write

`expr(STRING("xy"), STAR("*"), NUMBER("2"))`

and get a valid parse or even shorten it and write

`expr('"xy"', "*", 2)`

which suffices to identify the tokens. Remember that this construction yields a parse tree which can be converted immediately to source code.
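
For illustration, terminal constructors of this kind could be as simple as the following (hypothetical helpers sketched by me; the actual constructors are more capable, e.g. they also accept the shortened plain-string form shown above):

import token

def STRING(s):
    return [token.STRING, s]

def NUMBER(n):
    return [token.NUMBER, str(n)]

def STAR(s='*'):
    return [token.STAR, s]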

`fn = lambda name, val: expr_stmt(name, "=", val)`

This lambda expression yields bindings of `val` to `name`. For example

`fn("a", expr('"xy"', "*", 2))` is the parse tree equivalent of `a = "xy"*2`.

Notice that in any case the parse tree is syntactically correct by construction.

Wrapper Functions

Sometimes it is not easy to see how some expression with a particular semantics can be built. Take a function call for example. The Python grammar doesn’t provide a special node for it but just uses a special form of `power` which is defined as

`power: atom trailer* ['**' factor]`

This is very noisy, and one is better off building a functional wrapper which can be used for all sorts of calls:

# Relies on the CST node constructors (atom, trailer, arglist) described above
# and on the stdlib symbol module for node ids.
def CST_Call(names, args=None, star_args=None, dstar_args=None):
    # Build the (possibly dotted) call target, e.g. ["a", "b"] -> a.b
    Names = [atom(names[0])]
    for name in names[1:]:
        Names.append(trailer('.', name))
    # Collect positional, *star_args and **dstar_args arguments.
    ARGS = list(args) if args else []
    if star_args:
        ARGS += ['*', star_args]
    if dstar_args:
        ARGS += ['**', dstar_args]
    if ARGS:
        return [symbol.power] + Names + [trailer('(', arglist(*ARGS), ')')]
    else:
        return [symbol.power] + Names + [trailer('(', ')')]

3 Responses to About CST interpolation

  1. Martin says:

    Interesting. I’m not quite sure in what cases one would be interested in constructing the concrete syntax tree and using it to generate source code though. Can you mention some kind of application where this would be needed?

  2. kay says:

    Hi Martin, the considered transformation chain always has the form

         parse            transform              fn
    src --------> cst ----------------> cst' ------> fn(cst')
    

    with the following chain as a variant

         parse            transform              fn
    src --------> cst ----------------> cst' ------> fn(cst')
                                          |
                                          |
                                          V
                                         src' 
    

    The pair (src, cst) corresponds to an arbitrary source language and (cst’, src’) to a destination language with an existing compiler or tool set. For example src is Python source and src’ is JavaScript source. It is very unlikely that JS accepts objects represented as cst’, so one has to unparse cst’ into src’ before a JS compiler can handle it. src and src’ can also correspond to the same language, as in refactorings. Moreover, when one transforms small pieces of source as syntax trees it is easy to check the results visually. Syntactical correctness is guaranteed once the transformation succeeds, but it might be incomplete, e.g. some nodes still have to be transformed; besides this, syntax is not everything. Finally one might want to eliminate source of the original language from a code base while checking in only source of the destination language. This requires src’ to be created as well.

    Unparsing cst’ into src’ means it is feasible to fully isolate the language frontend/transformer from the compiler machinery. It can be used much like a Unix tool which takes a text file of one language and produces a text file of another language which is piped elsewhere. This is surely less efficient than using cst’ directly, but that isn’t always possible, and re-parsing might be acceptable for scripts.

  3. Pingback: Trails of EasyExtend » Blog Archive » Syntax algebra - first steps - Projects and projections
