When you look at the following listing you might think it’s just a sequence of nonsense statements in Python 2.6, maybe created for testing purposes:
```python
raise a, b, c
import d
from e import*
import f
from .g import(a)
from b import c
from .import(e)
from f import(g)
from .a import(b, c as d,)
import e, f as g
from .a import(b, c,)
...
```
(Here is the complete listing.)
which is, of course, correct. It is a somewhat peculiar listing, because you find expressions like
```python
c[:,: d:,]**e
```
or
```python
[c for d in e, lambda f = g, (a, b,) = c,: d, for e in f]
```
and a few others such as
```python
b(*c, d, **e)**f
```
which will be rejected by the Python compiler because the `*c` argument precedes the name `d`, but which is nevertheless “grammar correct”, by which I mean consistent with Python’s context-free LL(1) grammar.
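To see the difference between “grammar correct” and “accepted by the compiler” for yourself, a small check like the following works with a Python 2.6/2.7 interpreter (the grammar under discussion; newer Python 3 releases may accept arguments after `*c`):

```python
# The expression passes the LL(1) parser but is rejected in a later
# compilation stage, so compile() raises a SyntaxError.
try:
    compile("b(*c, d, **e)**f", "<listing>", "eval")
except SyntaxError as exc:
    print("rejected by the compiler: %s" % exc)
```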
The nice thing about the listing: it is automatically generated and it is complete in a certain sense.
Generative Grammars
When you look at a grammar rule like
```
for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]
```
it can be understood as a recipe for producing exactly two expressions, namely:
```
'for' exprlist 'in' testlist ':' suite
'for' exprlist 'in' testlist ':' suite 'else' ':' suite
```
Other rules like
```
dictmaker: test ':' test (',' test ':' test)* [',']
```
have an infinite number of productions:
```
test ':' test
test ':' test ','
test ':' test ',' test ':' test
test ':' test ',' test ':' test ','
...
```
When I created the listing I selected a small number of productions for each grammar rule. Each symbol in a rule should be covered and have at least one occurrence in the set of productions. Despite `for_stmt` having finitely many productions and `dictmaker` infinitely many, the algorithm creates two productions for each.
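Here is a minimal sketch of such a selection step, not the Langscape implementation: rules are modelled as nested tuples, and for a sequence the variants of its children are paired index-wise, so the number of productions stays small while every optional or repeated part is exercised at least once. Applied to `for_stmt` and `dictmaker`, it yields two productions each.

```python
def productions(node):
    """Return a small list of productions (lists of symbols) that together
    cover every symbol and every optional or repeated part of a rule."""
    if isinstance(node, str):                  # terminal or non-terminal symbol
        return [[node]]
    tag, children = node[0], node[1:]
    if tag == 'alt':                           # cover each branch
        result = []
        for child in children:
            result.extend(productions(child))
        return result
    if tag in ('opt', 'star'):                 # cover 'present' and 'absent'
        return productions(children[0]) + [[]]
    if tag == 'seq':                           # pair child variants index-wise
        child_prods = [productions(child) for child in children]
        width = max(len(p) for p in child_prods)
        result = []
        for i in range(width):
            prod = []
            for p in child_prods:
                prod.extend(p[min(i, len(p) - 1)])
            result.append(prod)
        return result
    raise ValueError("unknown node type: %r" % (tag,))

for_stmt = ('seq', "'for'", 'exprlist', "'in'", 'testlist', "':'", 'suite',
            ('opt', ('seq', "'else'", "':'", 'suite')))
dictmaker = ('seq', 'test', "':'", 'test',
             ('star', ('seq', "','", 'test', "':'", 'test')),
             ('opt', "','"))

print(productions(for_stmt))    # two productions: with and without 'else'
print(productions(dictmaker))   # two productions as well
```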
After having enough productions to cover all syntactical subtleties (as expressed by the grammar) I had to build one big rule containing all productions. This was actually the most demanding step in the design of the algorithm, and I initially got it wrong.
Embedding of productions
Intuitively we can interpret all non-terminal symbols in our productions as variables which may be substituted. We expand `test` in `dictmaker` by selecting and inserting one of the productions we got for the rule `test`. Unfortunately a grammar isn’t a tree but a directed, cyclic graph, so we have to be extremely careful not to run into an infinite replacement loop. This is only a technical problem though, and it can be handled using memoization.
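To illustrate just the cycle problem, here is a small sketch (not the Langscape algorithm): view the grammar as a graph of rule references and traverse it with a memoized depth-first search. The rule graph below is an invented fragment, and seeding a rule’s memo entry before following its references is what keeps the traversal finite.

```python
RULE_REFS = {                        # invented fragment: rule -> referenced rules
    'test':     ['or_test', 'lambdef'],
    'or_test':  ['and_test'],
    'and_test': ['not_test'],
    'not_test': ['not_test', 'comparison'],
    'comparison': ['expr'],
    'expr':     ['atom'],
    'atom':     ['testlist_gexp', 'NAME'],
    'testlist_gexp': ['test'],
    'lambdef':  ['varargslist', 'test'],
    'varargslist': ['fpdef'],
    'fpdef':    ['NAME'],
    'NAME':     [],
}

def reachable(rule, refs, memo=None):
    """Collect the rules reachable from `rule` with a memoized DFS.
    Seeding the memo entry before following the references keeps the
    traversal finite on the cyclic rule graph (entries of rules inside
    a cycle may be partial; only the top-level result is used here)."""
    if memo is None:
        memo = {}
    if rule in memo:
        return memo[rule]
    memo[rule] = set()
    for ref in refs.get(rule, []):
        memo[rule].add(ref)
        memo[rule] |= reachable(ref, refs, memo)
    return memo[rule]

# `test` reaches itself, so naive substitution of non-terminals would loop.
print('test' in reachable('test', RULE_REFS))   # True
```

There is a bigger problem, though.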
Look at the following two rules:
```
expr_stmt: testlist (augassign (yield_expr|testlist) |
                     ('=' (yield_expr|testlist))*)
augassign: ('+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' |
            '<<=' | '>>=' | '**=' | '//=')
```
The only place where `augassign` occurs in the grammar is in `expr_stmt`, but counting the number of productions for `augassign` we get 12, whereas we count only 3 productions for `expr_stmt`, and there is just a single production which contains `expr_stmt`. It is obviously impossible to embed all of them using a naive top-down substitution without leaving a rest of productions which can’t be integrated. We have a system of dependencies which has to be resolved, and the initial set of productions must be adapted without introducing new productions, which would in turn cause new problems. This is possible, but in my attempts the expressions became large and unreadable, so I tried something else.
Observe that the most important start rule of the grammar (Python actually has 4! Can you see which ones?) is:
```
file_input: (NEWLINE | stmt)* ENDMARKER
```
I expect that every language has a rule of this kind at some level of nesting. It produces a sequence of statements and newlines. I tried the following Ansatz:
*Wrap each initially determined production which is not itself a production of a start rule into a `stmt`.*
Take the production ‘+=’ of `augassign` as an example. We find that `augassign` occurs in `expr_stmt`. So we take one production of `expr_stmt` and *embed* `augassign` in the concrete form ‘+=’:
```
testlist '+=' yield_expr
```
The subsequent embedding steps are:
```
expr_stmt   -> small_stmt
small_stmt  -> simple_stmt
simple_stmt -> stmt
```
When embedding `small_stmt` into `simple_stmt` one has to add a trailing NEWLINE. So our final result is:
```
testlist '+=' yield_expr NEWLINE
```
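One embedding step is little more than list surgery on the productions quoted above. The sketch below is illustrative and not Langscape code; the host productions are taken from Python’s grammar (`expr_stmt` is one alternative of `small_stmt`, `simple_stmt` appends the NEWLINE, `stmt` wraps `simple_stmt`).

```python
def embed(concrete, nonterminal, host):
    """Replace the first occurrence of `nonterminal` in the host
    production by the symbols of the concrete form."""
    i = host.index(nonterminal)
    return host[:i] + concrete + host[i+1:]

step = ["'+='"]                                          # augassign
step = embed(step, 'augassign',
             ['testlist', 'augassign', 'yield_expr'])    # into expr_stmt
step = embed(step, 'expr_stmt', ['expr_stmt'])           # into small_stmt
step = embed(step, 'small_stmt',
             ['small_stmt', 'NEWLINE'])                  # into simple_stmt
step = embed(step, 'simple_stmt', ['simple_stmt'])       # into stmt
print(' '.join(step))   # testlist '+=' yield_expr NEWLINE
```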
Any rule we used during successive embedding doesn’t have to be used again as the initial rule of another embedding, because it has already been built into `file_input`. It can be reused when needed, though. I did not attempt to minimize the number of embeddings.
Substitute non-terminals
Now that we have a single sequence of terminals and non-terminals which contains all our productions in a consistent way, we are going to substitute the non-terminals. This is done such that a minimum number of terminal symbols is required, which explains some of the redundancies: we find `import f` and `import d` among the listed statements. I suspect one of them is a shortened form of `import d.e`, but since the rule for building `d.e` also allows using `d` alone, and that is shorter, it is chosen.
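This preference for the shortest form can be made concrete with a small fixpoint sketch (this is not the Langscape substitution code, and the two-rule grammar fragment is invented for illustration): each non-terminal is mapped to a terminal expansion of minimal length, and cycles in the grammar graph are harmless because a rule only receives a value once one of its alternatives is fully resolved.

```python
GRAMMAR = {
    'import_name': [['import', 'dotted_name']],
    'dotted_name': [['NAME'], ['NAME', '.', 'dotted_name']],
}

def min_expansions(grammar):
    """Map each non-terminal to a terminal expansion of minimal length,
    computed by fixpoint iteration over the grammar's alternatives."""
    best = {}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                # An alternative is usable once every symbol in it is either
                # a terminal or already has a known minimal expansion.
                if all(sym in best or sym not in grammar for sym in alt):
                    expansion = []
                    for sym in alt:
                        expansion.extend(best.get(sym, [sym]))
                    if nt not in best or len(expansion) < len(best[nt]):
                        best[nt] = expansion
                        changed = True
    return best

print(min_expansions(GRAMMAR)['import_name'])   # ['import', 'NAME']
```

Here the shorter `import NAME` wins over `import NAME . NAME`, which mirrors why `import d` shows up in the listing rather than `import d.e`.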
Detecting Grammar flaws
Generating the above expressions also reveals some flaws in the grammar which have to be corrected by the bytecode compiler (or an AST transformer). This doesn’t mean that Python’s grammar isn’t carefully crafted, quite the contrary is true, but it highlights some of the limitations of using an LL(1) grammar. For example, it is quite simple, although a little cumbersome, to express argument orderings in variable argument lists using a non-LL(1) grammar (the rule set below is not LL(1) because, for example, `simpleargs` and `defaultargs` both start with `fpdef`):
```
simpleargs: fpdef (',' fpdef)*
defaultargs: fpdef '=' test (',' fpdef '=' test)*
starargs: '*' NAME
dstarargs: '**' NAME
varargslist: ( simpleargs [',' defaultargs] [',' starargs] [',' dstarargs] |
               defaultargs [',' starargs] [',' dstarargs] |
               starargs [',' dstarargs] |
               dstarargs ) [',']
```
So when you craft your own grammar, automatic expression generation might aid design decisions. Detecting flaws early can spare you lots of code for additional checks later on.
Refactorings
In the case of Langscape the primary goal was to safeguard grammar refactorings. It is not generally possible to prove that two context-free grammars are equivalent, i.e. that they recognize the same language. But the same holds for any two programs in even more powerful, Turing-complete languages, and that doesn’t imply we never change any code. It is standard practice to safeguard refactorings using unit tests, and that is what we start to do here.
If we assume that two different grammars G1, G2 recognize the same language L, then their parsers P(G1), P(G2) must at least be able to parse the grammar-generated expressions of the other grammar: P(G1)(Expr(G2)) -> OK; P(G2)(Expr(G1)) -> OK.
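As a sketch, this criterion can be encoded as an ordinary unit test. The helpers and grammar handles below are hypothetical placeholders rather than Langscape’s API; wire in the expression generator and parser of whatever grammar toolkit you use.

```python
import unittest

def generate_expressions(grammar):   # hypothetical stand-in for Expr(G)
    return ["pass\n"]

def parses(grammar, source):         # hypothetical stand-in for P(G)(source)
    return True

G1, G2 = "grammar1", "grammar2"      # hypothetical grammar handles

class GrammarRefactoringTest(unittest.TestCase):
    def test_cross_parsing(self):
        # P(G1)(Expr(G2)) -> OK and P(G2)(Expr(G1)) -> OK
        for source in generate_expressions(G2):
            self.assertTrue(parses(G1, source))
        for source in generate_expressions(G1):
            self.assertTrue(parses(G2, source))

if __name__ == '__main__':
    unittest.main()
```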
Of course we can refine this criterion by including bad-case tests, or by comparing the selection sequences of TokenTracers for Expr(G1) and Expr(G2), which must be equal. Last but not least, we can use higher approximations.
Higher approximations
Doesn’t the listing give us a first-order approximation of the language? It’s a fun idea to think of all those listing expressions as living in the “tangential space” of the language. “Higher approximation” would then simply mean longer traces (if they are possible at all, which requires the presence of a Kleene star). This suggests a simpler idea: we create the set `Tr(K, nfa)` of traces of length `K` for a given NFA. `Tr(K, nfa)` may be empty for some K. Unfortunately we can’t infer from `Tr(K) = {}` that `Tr(K+1) = {}`. So what is a good stop criterion then?
The algorithm for creating `Tr(K, nfa)` is quite simple. The following functions are the Langscape implementation:
```python
def compute_tr(K, nfa):
    '''
    Computes the set Tr(K, nfa) of traces of length K for a given nfa.
    The return value may be [] if no trace of length K exists.
    '''
    _, start, trans = nfa
    return compute_subtraces(K, 0, start, [], trans)

def compute_subtraces(K, k, S, trace, trans):
    '''
    Computes complete traces of a given length.

    :param K: The prescribed length a trace shall have.
    :param k: The current length of a trace ( used by recursive calls ).
    :param S: The current state of the NFA.
    :param trace: the current trace.
    :param trans: the {state: [follow-states]} dictionary which
                  characterizes one NFA.
    '''
    traces = []
    follow = trans[S]
    for F in follow:
        if F[0] is None:
            # termination condition fulfilled?
            if k == K:
                traces.append(trace+[F])
        else:
            m = trace.count(F)
            # impossible to terminate trace under this condition
            if m == K:
                continue
            else:
                traces += compute_subtraces(K, max(k, m+1), F, trace+[F], trans)
    return traces
```
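For a quick impression of how `compute_tr` behaves, here is a toy NFA shaped only by what the function expects: a triple whose last component maps each state to its follow states, with a follow state whose first entry is `None` marking the exit. The state labels are made up for illustration and are not Langscape’s actual NFA format.

```python
A, B, EXIT = ('a', 1), ('b', 2), (None, 2)
toy_nfa = ('toy', A, {A: [B, EXIT], B: [EXIT]})

print(compute_tr(1, toy_nfa))   # [[('b', 2), (None, 2)]]
print(compute_tr(2, toy_nfa))   # [] -- no trace of length 2 exists
```

The second call illustrates the point above: `Tr(K, nfa)` can be empty for some K.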