Syntax algebra – first steps

Principle of relativity

I started to revisit syntactic mappings defined in EasyExtend 3 which are closely tied to the Python grammar being  in use. Those are functions like `varargs2arglist`, `normalize`, `split_file_input` or `exprlist2testlist` defined in the `csttools.py` module. One of the major challenges of future releases of EasyExtend ( or a definite successor project – i don’t know ) is to abstract from Python as a target language. In EE3 only Python can be targeted by a langlet transformation whereas in EE 4 langlets are symmetric: each langlet can be a target langlet for any other langlet. All langlets exist on the same footing and also: each langlet can be used as a parent langlet of a newly defined langlet which means a proper inheritance relationship and maximum reuse.

The relativity among langlets calls for raising the bar of abstraction. All Python specific dependencies, and there are a lot in EE3, have to be removed from the basic CST libraries. Indeed not even nodes of special importance like `expr`, `stmt` or `atom` shall be allowed in places other than modules which are specific to a particular langlet. The generalization of the mentioned functions leads to a study of abstract syntactic forms and some sort of “syntax algebra”. It isn’t studied rigorously in this article but shall be at least motivated. As a prerequisite I recommend to read my article about CST interpolation which discusses concepts of major relevance.

Embeddings

The `exprlist2testlist` function turns a node of type `exprlist` defined as

exprlist: expr (',' expr)* [',']

into a node of type `testlist`defined as

testlist: test (',' test)* [',']

This works without adding information because `{test, expr}` is a valid interpolation i.e. there is a nested sequence

[test, [or_test, [and_test, [not_test, [comparison , expr]]]]]

which is a representation of a valid CST. In terms of CST interpolations `test(expr)` yields a node of type `test` and induces a homomorphism `exprlist`->`testlist`. More generally an interpolation `{A, B}` induces an embedding `B (x B)*` -> `{A,B} (y {A,B})* = A (y A)*` if `x` and `y` are constant terminal symbols i.e. terminal symbols where the corresponding token have a uniquely determined token string.

Blocks or no blocks

Another relevant example is the `normalize` function. The idea behind `normalize` is that statements like `if x: y` or `def foo(): pass` are semantically equivalent to block statements:

if x:
    y
def foo():
    pass

Those block statements can be used in a more general fashion because we can add other block statements in the thunked block. In Python blocks are expressed by the `suite` grammar rule:

suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

Since `{stmt, simple_stmt}` is a valid interpolation, we can substitute all occurences of

suite: simple_stmt

with suites of the form

suite: NEWLINE INDENT stmt+ DEDENT

The general syntactic embedding is of the kind

`B` -> `a b… {A,B} c d…` with a, b, c, d … being constant terminal symbols.

Notice that `INDENT` is assumed to be a constant terminal despite the fact that the token string may vary. `INDENT` is treated special because in practical applications the number of used spaces for an indentation is fixed.  EasyExtend always uses 4 spaces.

Splitting nodes

The function `split_file_input` is the prototype of a node splitting function which can be thought in analogy to string splits. In this particular case we have a node `file_input` of the form

file_input: (NEWLINE | stmt)* ENDMARKER

and want to generate a sequence of nodes `file_input: stmt ENDMARKER`, `file_input: NEWLINE ENDMARKER` and `file_input: ENDMARKER` – one for each `NEWLINE`, `stmt` and `ENDMARKER` in the original `file_input` node. It doesn’t matter that `file_input: NEWLINE ENDMARKER` and `file_input: ENDMARKER` are likely be thrown away by an application because this can be decided by the nodes consuming function. The general split function is defined by

`R: x (A|B…)* y` -> `[(R: x A y), (R: x B y), …, (R: x y)]`

Node rewritings

The mappings considered above were all rather straightforward. Now we want to discuss a rule transformation which is less obvious, namely that of a function signature into an argument tuple of a function call. In Python 2.6 the function signature is defined as

varargslist: ((fpdef ['=' test] ',')*
('*' NAME [',' '**' NAME] | '**' NAME) |
fpdef ['=' test] (',' fpdef ['=' test])* [','])
fpdef: NAME | '(' fplist ')'
fplist: fpdef (',' fpdef)* [',']

and an argument list of a function call by

arglist: (argument ',')* (argument [',']| '*' test [',' '**' test] | '**' test)
argument: test [gen_for] | test '=' test

Can we transform each `varargslist` into an `arglist`?

Let’s start our treatment of varargslist with `fpdef`. If we insert the RHS of `fplist` in `fpdef` we get

fpdef: NAME | '(' fpdef (',' fpdef)* [','] ')'

We show that this rule is a special form of  the node `atom` and since `{test, atom}` is a valid interpolation it is also a `test` node. The `atom` node is defined by

atom: NAME | '(' [yield_expr|testlist_gexp] ')' |  '[' [listmaker] ']' | ...

which can be specialized to

atom: NAME | '(' testlist_gexp ')'

Next we consider the `testlist_gexp` definition

testlist_gexp: test ( gen_for | (',' test)* [','] )

which can be specialized to

testlist_gexp: test (',' test)* [',']

We insert `testlist_gexp` in `atom` which yields

atom: NAME | '(' test (',' test)* [','] ')'

If we reduce `test` to `atom` we get a rule

atom: NAME | '(' atom (',' atom)* [','] ')'

which is isomorphic to `fpdef`. So we just need to substitute all occurrences of `fpdef` in `fpdef` with `atom`, then replace `atom` with `test(atom)` and finally replace the whole of `atom` again with `test(atom)`. This procedure substitutes `fpdef` with `test`.

When we substitute each occurrence of `NAME` with `test`in `varargslist` we get:

(test ['=' test] ',')* ('*' test [',' '**' test] | '**' test) |
                       test ['=' test] (',' test ['=' test])* [',']

which can be reduced to

(argument ',')* ('*' test [',' '**' test] | '**' test) |
                 argument (',' argument)* [',']

which is the same as

(argument ',')* (argument [','] | '*' test [',' '**' test] | '**' test)

Voilà!

Syntax algebra

We have done some informal steps into syntax algebra with some real functions defined in EE 3 as a starting point. For the first three functions we have found general syntactical transformations which might be universally applicable. The last transformation is very specific though and it might be more interesting to determine an algorithm used to find a rule transformation of a particular kind. Although the search algorithm might be NP complete I assume that the found transformation – if one exists – has linear time complexity which is what we want. Such an algorithm would be another great achievement of EasyExtend which does not cease to surprise me.

This entry was posted in Algorithms, Grammars. Bookmark the permalink.