Langscape

The state of the Trail parser generator

Posted in Algorithms, Grammars, Langscape, Parsing, TBP, Trail on April 10th, 2012 by kay

White Easter

Snow came back, if only for a brief moment, to remind me to lay Trail to rest until next winter…

 x + + + + + + + + +
 - - - - - - - - - x
 - - - - - - - - x +
 - - - - - - - x + +
 - - - - - - x + + +
 - - - - - x + + + +
 - - - - x + + + + +
 - - - x + + + + + +
 - - x + + + + + + +
 - x + + + + + + + + 

Flat Automatons

This story begins where the last blog article ended, in Sept. 2011. At that time I realized, contrary to my expectations, that carefully embedding finite automata into each other could yield higher execution speed for parsers than parsing without those embeddings, such as LL(1), and this despite a far more complex machinery and the additional effort of reconstructing a parse tree from the state sequences generated while traversing the automaton. It wasn’t that I believed this speed advantage wouldn’t go away when moving towards an optimized, for-speed implementation (running my samples on PyPy confirmed that assumption), but it was an encouraging sign that the general direction was promising and more ambitious goals could be realized. I wasn’t entirely mistaken, but what came out in the end is also scary, if not monstrous. Anyway, let’s begin.

If the automaton embeddings I just mentioned were perfect, the grammar, which is translated into a set of those automata rule by rule, would dissolve into a single finite automaton. It would then be completely flat and contain only terminal symbols which refer to each other. An intrinsic hurdle is recursion. A recursive rule like

X: X X | a
would give us an automaton that has to be embedded into itself, which is not finitely possible. Instead we could consider successive self-embedding as a generative process:
X X | a
X X X X | X X a | a X X | a X | X a | a a | a
...
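The following little Python sketch, which is not Trail code but just an illustration, performs one such generative step by substituting every X in a sentential form by each alternative of X: X X | a. The expansion above also lists forms where only one of the two X is rewritten; a variant of the same idea would produce those as well.

from itertools import product

def expand(form, alternatives=(("X", "X"), ("a",))):
    """One generative step: substitute every 'X' in the sentential form
    by each alternative of the rule X: X X | a, in all combinations."""
    slots = [alternatives if sym == "X" else ((sym,),) for sym in form]
    return {sum(choice, ()) for choice in product(*slots)}

forms = {("X",)}
for _ in range(2):
    forms = set().union(*(expand(f) for f in forms))
# forms now contains ('X','X','X','X'), ('X','X','a'), ('a','X','X'),
# ('a','a') and ('a',); the terminal forms are exactly the strings of a's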

Working embeddings

The difficulties of creating a flat, finite automaton from a grammar are twofold: perfect embedding/inlining leads to information loss, and recursive rules lead to cyclic or infinite expansion. Of the two problems I solved the first and easier one in the early versions of the Trail parser generator. This was done by preparing automaton states and introducing special ɛ-states. In automata theory an ɛ-state corresponds to the empty word, i.e. it won’t be used to recognize a character/token; it solely regulates transitions within an automaton. In BNF we may write:

X: X a
X: ɛ
which means that X can produce the empty word. Since ɛa = a the rule X accepts all finite sequences of a. In EBNF we can rewrite the X-rule as
X: [X] a
which summarizes the optionality of the inner X well. We write the automaton of the X-rule as a transition table
(X,0): (X,1) (a,2)
(X,1): (a,2)
(a,2): (FIN,-1)
Each state carries a unique index, which allows us to distinguish arbitrarily many different X and a states in the automaton. If we further want to embed X within another rule, say
Y: X b
which is defined by the table
(Y,0): (X,1)
(X,1): (b,2)
(b,2): (FIN,-1)
the single index is not sufficient and we need a second index which individualizes each state by taking a reference to the containing automaton:
(X,0,X): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): (FIN,-1,X)
With this notion we can embed X in Y:
(Y,0,Y): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X): (FIN,-1,X)
(FIN,-1,X): (b,2,Y)
(b,2,Y): (FIN,-1,Y)
The initial state (X,0,X) of X was completely removed. The automaton is still erroneous though. The final state (FIN,-1,X) of X is not a final state in Y. It even has a transition! We could try to remove it completely and instead write
(Y,0,Y): (X,1,X) (a,2,X)
(X,1,X): (a,2,X)
(a,2,X):   (b,2,Y)
(b,2,Y): (FIN,-1,Y)
But suppose Y had the form:
Y: X X b
then the embedding of X would have the effect of removing the boundary between the two X, which would again be a loss of structure. What we do instead is to transform the final state (FIN, -1, X), when embedded in Y, into an ɛ-state in Y:
(FIN,-1,X) => (X, 3, TRAIL_DOT, Y)
The tuple which describes automaton states in Trail thus grows again by one entry. A state which is not an ɛ-state has the shape (_, _, 0, _). Finally the fully and correctly embedded automaton X in Y looks like this:
(Y,0,0,Y): (X,1,0,X) (a,2,0,X)
(X,1,0,X): (a,2,0,X)
(a,2,0,X): (X,3,TRAIL_DOT,Y)
(X,3,TRAIL_DOT,Y): (b,2,0,Y)
(b,2,0,Y): (FIN,-1,0,Y)
    
The TRAIL_DOT marks the transition between X and Y in Y. In principle we are free to define infinitely many ɛ-states. In the end we will define exactly 5 types.
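For concreteness, the embedded automaton can be written down as an ordinary Python dictionary mapping each state to its follow states, the same {state: [follow-states]} shape that shows up in the NFA code of a later post. This is only a sketch of the notation, not Trail’s actual data structures, and the numeric value chosen for TRAIL_DOT here is arbitrary:

TRAIL_DOT = 1   # placeholder value; one of the 5 kinds of epsilon-states
FIN = "FIN"

# X embedded in Y, as in the table above
Y_nfa = {
    ("Y", 0, 0, "Y"): [("X", 1, 0, "X"), ("a", 2, 0, "X")],
    ("X", 1, 0, "X"): [("a", 2, 0, "X")],
    ("a", 2, 0, "X"): [("X", 3, TRAIL_DOT, "Y")],
    ("X", 3, TRAIL_DOT, "Y"): [("b", 2, 0, "Y")],
    ("b", 2, 0, "Y"): [(FIN, -1, 0, "Y")],
}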

Rules of Embedding

At this point one may ask whether this isn’t entirely recreational. Why should anyone care about automaton embeddings? Don’t we have anything better to do with our time? Certainly this isn’t just recreation, but demanding a little more motivation is justified. Consider the following grammar:

R: A | B
A: a* c
B: a* d
In this grammar we encounter a so-called FIRST/FIRST conflict. Given a string “aaa…” we cannot decide which of the rules A or B we have to choose, unless we observe a ‘c’ or ‘d’ event, i.e. our string becomes “aa…ac” or “aa…ad”. What we basically want is to defer the choice of a rule, making a late choice instead of checking out rules by trial and error. By carefully storing and recalling intermediate results we can avoid the consequences of an initial bad choice, to the extent that parsing in O(n) time for string length n becomes possible. Now the same can be achieved through automaton embeddings, which gives us:
R: a* c | a* d
but in a revised form as seen in the previous section. On the automaton level the information about the containing rules A and B is still present. If we use R for parsing we get state sets { (a,1,0,A), (a,2,0,B) } which recognize the same character “a”. Those state sets will be stored during parsing. In case of a “c” event, which will be recognized by the state (c,3,0,A), we only have to dig into the state-set sequence and follow the states (a,1,0,A) back to the first element of the sequence. Since (A,4,TRAIL_DOT,R) is the only follow state of (c,3,0,A) we will actually see the sequence:
(A,4,TRAIL_DOT,R)
 \
  '...
      \
(c,3,0,A)
(a,1,0,A)
(a,1,0,A)
...
(a,1,0,A) 
    
From this sequence we can easily reconstruct contexts and build the tree
[R, [A, a, a, ..., c]]
All of this is realized by late choice. Until a “c” or “d” event we move within A and B at the same time. The embedding of A and B in R solves the FIRST/FIRST conflict above. That is its meaning.
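To illustrate the late-choice idea, here is a deliberately simplified sketch of the reconstruction step, not Trail’s actual algorithm: given the selected state sequence and the consumed characters, the owning rule of each state plus the TRAIL_DOT marker suffice to nest the tokens into a tree. It only handles this single level of embedding.

TRAIL_DOT = 1   # placeholder value, as before

def reconstruct(root_rule, states, tokens):
    """states: one selected state per consumed token, in input order,
    followed by epsilon-states such as (A, 4, TRAIL_DOT, R);
    tokens: the consumed characters, in input order."""
    tree = [root_rule]
    node = None                     # the rule node currently being filled
    toks = iter(tokens)
    for sym, idx, eps, owner in states:
        if eps == TRAIL_DOT:
            tree.append(node)       # the embedded rule `sym` is complete
            node = None
        else:
            if node is None:
                node = [owner]
            node.append(next(toks))
    return tree

states = [("a", 1, 0, "A"), ("a", 1, 0, "A"), ("c", 3, 0, "A"),
          ("A", 4, TRAIL_DOT, "R")]
print(reconstruct("R", states, ["a", "a", "c"]))
# -> ['R', ['A', 'a', 'a', 'c']]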

FOLLOW/FIRST conflicts

So far the article hasn’t contained anything new; I’ve written about all of this before.

FIRST/FIRST conflicts between the FIRST sets of a top-down parser are not the only kind we have to deal with. We also need to take left recursions into account, which can be considered a special case of FIRST/FIRST conflict, and also FIRST/FOLLOW, or better FOLLOW/FIRST, conflicts, which have yet to be treated. A FOLLOW/FIRST conflict can be illustrated using the following grammar:

R: U*
U: A | B
A: a+ (B c)*
B: b
There is no FIRST/FIRST conflict between A and B, and we can’t factor out a common prefix. But now suppose we want to parse the string “abb”. Obviously A recognizes the two initial characters “ab” and then fails at the second “b” because “c” was expected. However, A can also recognize “a” alone and then stop, because (B c)* is an optional repetition of (B c). This is not a violation of the rules. After “a” has been recognized by A, the rule B may take over and match “b” two times:
[R, [U, [A, a]], [U, [B, b]], [U, [B, b]]]
Trail applies a “longest match” recognition approach, which means here that A is greedy and matches as many characters in the string as possible. But according to the rule definition A can also terminate the match after “a”, and at that point the parser dynamically sets a so-called checkpoint. Trail allows backtracking to this checkpoint, provided the longest-match approach fails after it. Setting exactly one checkpoint for a given rule is still compliant with the longest-match approach. If the given input string is “abcbb” then A will match “abc”; if it is “abcbcbb” then A matches “abcbc”, and so on.
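The checkpoint idea can be illustrated with a toy matcher for A: a+ (B c)*. This is again just an illustration under simplifying assumptions, not Trail internals: it returns the candidate end positions of A, longest first, and the position right after a+ plays the role of the checkpoint the parser may fall back to.

def match_A(s, start=0):
    """Candidate end positions of A: a+ (B c)* in s, longest match first."""
    i = start
    while i < len(s) and s[i] == "a":
        i += 1
    if i == start:
        return []                    # A needs at least one 'a'
    stops = [i]                      # checkpoint: A may stop right after a+
    while s[i:i+2] == "bc":          # greedily extend with (B c) groups
        i += 2
        stops.append(i)
    return sorted(stops, reverse=True)

print(match_A("abb"))    # [1]    -> A keeps only "a", B matches the two b's
print(match_A("abcbb"))  # [3, 1] -> longest match lets A consume "abc"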

A FOLLOW/FIRST conflict leads to a proper ambiguity, and checkpoints are currently Trail’s approach for dealing with it. I also tried to handle FOLLOW/FIRST conflicts in an automaton-embedding style but encountered fragmentation effects: the ambiguities were uncovered, but this was paid for with a loss of direction, and longest match was effectively disabled.

The inevitability of left recursions

It is easy in top-down parsing to recognize and remove or transform a left recursive rule like this one:

X: X a | ɛ
The phenomenology seems straightforward. But making those exclusions is like drawing political boundaries in colonialist Africa: desert, vegetation, animals and humans don’t entirely respect decisions made by once rivaling French and English occupants. If embedding comes into play we can count on uncovering left recursions where we didn’t expect them. I’d like to go even one step further, which is conjectural: we can’t even know for sure that none will be uncovered. The dynamics of FIRST/FIRST conflicts that are uncovered by embeddings (this clash dynamics, as I like to call it) might lead to algorithmically undecidable problems. It’s nothing I’ve systematically thought about, but I wouldn’t be too surprised.

For almost any left recursive rule there is a FIRST/FIRST conflict of this rule with itself. Exceptions are cases which are uninteresting such as

X: X
or
X: X a
In both cases the FIRST set of X doesn’t contain any terminal symbol, so X can’t recognize anything. Such rules are like ɛ-states but also non-terminals: very confusing. Trail rips them out and issues a warning. An interesting rule like
E: E '*' E | NUMBER
contains a FIRST/FIRST conflict between E and NUMBER. It cannot be removed through self-embedding of E. The same goes for rules which hide a left recursion but lead to embedding cycles, such as
T: u v [T] u w
which are quite common. We could try to work around them, as we did with FOLLOW/FIRST conflicts, instead of solving them outright. Of course one can also give up top-down parsing in favor of bottom-up parsers of Earley type or GLR, but that changes the game entirely. The question is: must we tack backtracking solutions onto Trail which are more deeply involved than checkpoints?

After six months of tinkering I wished the answer was no. Actually I believe that it is unavoidable, but it occurs at places where I didn’t originally expect it, and even in those cases I often observed/measured parsing effort proportional to the string length. Parse tree reconstruction from state-set traces, which was once straightforward, becomes a particularly hairy affair.

Teaser

Before I discuss left recursion problems in Trail in a follow up article I’ll present some results as a teaser.

Grammars for which parsing in Trail is O(n):

a) E: '(' E ')' | E '*' E | NUMBER
b) E: '(' E+ ')' | E '*' E | NUMBER
Other grammars in the same category are
c) G: 'u' 'v' [G] 'u' 'w'
d) G: G (G | 'c') | 'c'
e) G: G G | 'a'
However for the following grammars the parser is in O(2^n)
f) G: G G (G | 'h') | 'h'
g) G: G [G G] (G | 'h') | 'h'
If we combine d) and f) we get
h) G: G G (G | 'h') | G (G | 'h') | 'h'
In this case Trail will deny service and throw a ParserConstructionError exception. Some pretty polynomial grammars will be lost.


Patching tracebacks

Posted in Algorithms, Langscape, Python on April 11th, 2011 by kay

One of the problems I ran into early when working on EasyExtend (and later on Langscape) was getting error messages from code execution which were not corrupt.

The situation is easily explained: you have a program P written in some language L. There is a source-to-source translation of P into another program Q of a target language, preferably Python. So you write P but the code which is executed is Q. When Q fails Python produces a traceback. A traceback is a stack of execution frames – a snapshot of the current computation – and all we need to know here is that each frame holds data about the file, the function, and the line which is executed. This is all you need to generate a stacktrace message such as:

Traceback (most recent call last):
  File "tests\pythonexpr.py", line 12, in
    bar()
  File "tests\pythonexpr.py", line 10, in bar
    foo()
  File "tests\pythonexpr.py", line 5, in foo
    b.p,
NameError: global name 'b' is not defined

The problem here is that line number information in the frames is from Q whereas the file and the lines being displayed are from P – the only file there is!

Hacking line numbers into parse trees

My first attempt to fix the problem of wrong line information (I worked with Python 2.4 at that time and I am unaware of changes in later versions of Python) was to manipulate Q, or rather the parse tree corresponding to Q, updating it with the line numbers I expected. When the growth of line numbers in Q was non-monotonic, using CPython’s internal line number table, lnotab, failed to assign line numbers correctly. Furthermore the CPython compiler has the habit of ignoring some line information and reconstructing it, so you cannot be sure that your own won’t be overwritten. There seems to be a hacking prevention built into the compiler, and I gave up on that problem for a couple of years.

From token streams to string pattern

Recently I started to try out another idea. For code which is not optimized or obfuscated and preserves name, number and string information in a quantitative way (turning some statements of P into expressions in Q or vice versa, breaking a P token into token sequences in Q, etc.) we can check out the following construction. Let

T(V, Q) = { T in TS_Q | V = T.Value }
be the set of tokens in the token stream TS_Q of Q with the prescribed token value V. Analogously we can build a set T(V, P) for P and TS_P.

The basic idea is now to construct a mapping between T(V,Q) and T(V,P). In order to get the value V we examine the bytecode of a traceback frame up to the last executed instruction f_lasti. We assume that executing f_lasti leads to the error. Now that instruction may not be coupled to a particular name, so we examine f_lasti or the last instruction preceding f_lasti whose instruction type is in the set

{LOAD_CONST, LOAD_FAST, LOAD_NAME, LOAD_ATTR, LOAD_GLOBAL, IMPORT_NAME }
From the value related to one of those instructions, the type of the value, which is one of {NAME, STRING, NUMBER}, and the execution line f_lineno, we create a new token T_q = [tokentype, value, lineno]. For V we set V = value. Actually things are a little more complicated because the dedicated token T_q and the line in Q are not necessarily in a 1-1 relationship. There might in fact be n>1 tokens equal to T_q which originate from different lines in P. So let Token be the list of all tokens we found on line T_q.Line and k = Token.count(T_q). We add k to the data characterizing T_q.
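For the record, here is a sketch of how one might locate the relevant instruction. It uses the dis module of current Python versions for brevity, whereas the Python 2.4 code this article is based on would have to decode f_code.co_code by hand; last_load is just an illustrative helper, not Langscape API.

import dis

LOAD_LIKE = {"LOAD_CONST", "LOAD_FAST", "LOAD_NAME",
             "LOAD_ATTR", "LOAD_GLOBAL", "IMPORT_NAME"}

def last_load(frame):
    """Return (opname, value) of the last LOAD-like instruction at or
    before frame.f_lasti, or None if there is no such instruction."""
    found = None
    for ins in dis.get_instructions(frame.f_code):
        if ins.offset > frame.f_lasti:
            break
        if ins.opname in LOAD_LIKE:
            found = (ins.opname, ins.argval)
    return found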

So assume we have found T_q. The map we want to build is T(T_q.Value, Q) -> T(T_q.Value, P). How can we do that?

In the first step we assign a character to each token in the stream TS_Q and turn TS_Q into a string S_TS_Q. The character is arbitrary, and the relationship between the character and the token value shall be 1-1. Among the mappings is T_q -> c. For each T in T(T_q.Value, Q) we then determine a substring of S_TS_Q with c as its midpoint:

    class Pattern:
        def __init__(self, index, token, pattern):
            self.index = index        # position of the token in TS_Q
            self.token = token
            self.pattern = pattern    # characters surrounding the token

    S_TS_Q = u''
    m = 0x30      # code point of the next fresh character
    k = 5         # radius of the pattern around a token
    tcmap = {}    # token value -> character

    # create a string from the token stream
    for T in TS_Q:
        if T.Value in tcmap:
            S_TS_Q += tcmap[T.Value]
        else:
            s = unichr(m)
            tcmap[T.Value] = s
            S_TS_Q += s
            m += 1

    # create a string pattern for each T in TVQ = T(V, Q)
    pattern = []
    for T in TVQ:
        n = TS_Q.index(T)
        S = S_TS_Q[max(0, n-k): n+k]
        pattern.append(Pattern(n, T, S))

The same construction is used for the creation of the target patterns from T(V, P). In that case we reuse the tcmap dictionary built during the creation of the patterns from T(V, Q): when two tokens in TS_Q and TS_P have the same token value, the corresponding characters shall coincide.
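Spelled out, the mirror construction might look like this (just a sketch; TS_P denotes the token stream of P, and TVP stands for the set T(V, P), both names chosen here for symmetry with the block above):

    # create a string from the token stream of P, reusing tcmap so that
    # equal token values map to equal characters in both strings
    S_TS_P = u''
    for T in TS_P:
        if T.Value in tcmap:
            S_TS_P += tcmap[T.Value]
        else:
            s = unichr(m)
            tcmap[T.Value] = s
            S_TS_P += s
            m += 1

    # create a string pattern for each T in TVP = T(V, P)
    target_pattern = []
    for T in TVP:
        n = TS_P.index(T)
        S = S_TS_P[max(0, n-k): n+k]
        target_pattern.append(Pattern(n, T, S))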

The token mapping matrix

In the next step we create a distance matrix between source and target string pattern.

    n = len(source_pattern)
    m = len(target_pattern)
    Rows = range(n)
    Cols = range(m)
    M = []
    for i in range(n):
        M.append([-1]*m)
    for i, SP in enumerate(source_pattern):
        for j, TP in enumerate(target_pattern):
            M[i][j] = levenshtein(SP.pattern, TP.pattern)

As a distance metric we use the edit or Levenshtein distance.
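The levenshtein function itself is not shown in the article; any standard dynamic-programming implementation of the edit distance will do, for example:

def levenshtein(a, b):
    """Edit distance between the strings a and b
    (insertions, deletions and substitutions all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j-1] + 1,                 # insertion
                           prev[j-1] + (ca != cb)))      # substitution
        prev = cur
    return prev[len(b)]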

Having that matrix we compute an index pair (I,J) with M[I][J] = min{ M[i][j] | i in Rows and j in Cols }. Our interpretation of (I,J) is that we map source_pattern[I].token onto target_pattern[J].token. Since there is an I for which source_pattern[I].token == T_q, the corresponding T_p = target_pattern[J].token is exactly the token in P we searched for.

The line in the current traceback is the line T_q.Line of Q. Now we have found T_p.Line of P, which is the corrected line that shall be displayed in the patched traceback. Let’s take a brief look at the index selection algorithm for which M was prepared:

while True:
    k, I = 1000, -1
    if n>m and len(Cols) == 1:
        # only one target token left: take its line
        return target_pattern[Cols[0]].token[2]
    else:
        # find the row/column pair (I, J) with the smallest edit distance
        for r in Rows:
            d = min(M[r])
            if d<k:
                k = d
                I = r
        J = M[I].index(k)
        # block column J against re-selection
        for row in M:
            row[J] = 100
    SP = source_pattern[I]
    if SP.token == T_q:
        tok = target_pattern[J].token
        return tok.Line
    else:
        Rows.remove(I)
        Cols.remove(J)

If there is only one column left, i.e. one token in T(V, P), its line will be chosen. Once the J-column has been selected we avoid re-selection by setting row[J] = 100 on each row. In fact it would suffice to consider only the rows left in Rows.

Example

One example I modified over and over again for testing purposes was the following one:

pythonexpr.py [P]
-----------------
def foo():
    a = ("c",
        0,
        (lambda x: 0+(lambda y: y+0)(2))(1),
        b.p,
        0,
        1/0,
        b.p)
 
def bar():
    foo()
 
bar()

You can comment out b.p, turn a + in the lambda expression into / to provoke another ZeroDivisionError, etc. This is interesting because when it is parsed and transformed through Langscape and then unparsed I get

[Q]
---
import langscape; __langlet__ = langscape.load_langlet("python")
def foo():
    a = ("c", 0, (lambda x: 0+(lambda y: y+0)(2))(1), b.p, 0, 1/0, b.p)
def bar():
    foo()
bar()

So it is this code which is executed using the command line

python run_python.py pythonexpr.py
which runs it in the Python langlet through run_python.py. So the execution process sees pythonexpr.py, but the code which is compiled by Python will be Q.

See the mess that happens when the traceback is not patched:

Traceback (most recent call last):
  File "run_python.py", line 9, in <module>
    langlet_obj.run_module(module)
  ...
  File "langscape\langlets\python\tests\pythonexpr.py", line 6, in <module>
    0,
  File "langscape\langlets\python\tests\pythonexpr.py", line 5, in bar
    b.p,
  File "langscape\langlets\python\tests\pythonexpr.py", line 3, in foo
    0,
NameError: global name 'b' is not defined

There is even a strange coincidence: bar() is executed on line 5 of the transformed program and b.p is on line 5 of the original program, but all the other line information is complete garbage. When we plug in, via sys.excepthook, the traceback patching mechanism whose major algorithm we’ve developed above, we get

Traceback (most recent call last):
  File "run_python.py", line 9, in <module>
    langlet_obj.run_module(module)
  ...
  File "langscape\langlets\python\tests\pythonexpr.py", line 13, in <module>
    bar(),
  File "langscape\langlets\python\tests\pythonexpr.py", line 11, in bar
    foo(),
  File "langscape\langlets\python\tests\pythonexpr.py", line 5, in foo
    b.p,
NameError: global name 'b' is not defined

which is exactly right!
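The article doesn’t show how the patching mechanism is hooked in. A minimal sketch, written against the modern traceback module and with correct_line standing in for the token-mapping heuristic developed above (and therefore hypothetical), could look like this:

import linecache
import sys
import traceback

def make_patching_excepthook(correct_line):
    """correct_line(filename, lineno) -> corrected line number in P."""
    def hook(exc_type, exc_value, tb):
        frames = []
        for fs in traceback.extract_tb(tb):
            fixed = correct_line(fs.filename, fs.lineno)
            source = linecache.getline(fs.filename, fixed).strip()
            frames.append((fs.filename, fixed, fs.name, source))
        sys.stderr.write("Traceback (most recent call last):\n")
        sys.stderr.write("".join(traceback.format_list(frames)))
        sys.stderr.write("".join(
            traceback.format_exception_only(exc_type, exc_value)))
    return hook

# sys.excepthook = make_patching_excepthook(correct_line)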

Conclusion

The algorithm described in this article is merely a heuristic, and it won’t work accurately in all cases. In fact it is impossible to even define conclusively what those cases are, because source-to-source transformations can be arbitrary. It is a bit like a first-order approximation of a code transformation, relying on the idea that the code won’t change too much.

An implementation note: I was annoyed by bad tracebacks when testing the current Langscape code base for a first proper 0.1 release. I don’t think it is too far away, because I have some time now to work on it. It will still be undertested when it’s released, and the documentation is even more fragmentary. However, at some point everybody must jump, no matter the methodology used.

Python26 expressions

Posted in Grammars, Langscape, Python on November 26th, 2010 by kay

When you look at the following listing you might think it’s just a sequence of nonsense statements in Python 2.6, maybe created for testing purposes:

raise a, b, c
import d
from e import*
import f
from .g import(a)
from b import c
from .import(e)
from f import(g)
from .a import(b, c as d,)
import e, f as g
from .a import(b, c,)
...

( Here is the complete listing ).

which is of course correct. It is a somewhat peculiar listing because you find expressions like

c[:,: d:,]**e

or

[c for d in e, lambda f = g, (a, b,) = c,: d, for e in f]

and a few others such as

b(*c, d, **e)**f

which will be rejected by the Python compiler because the *c argument precedes the name d, but it is nevertheless “grammar correct”, by which I mean it is consistent with Python’s context-free LL(1) grammar.

The nice thing about the listing: it is automatically generated and it is complete in a certain sense.

Generative Grammars

When you look at a grammar rule like

for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]

it can be understood as advice for producing exactly two expressions, namely:

'for' exprlist 'in' testlist ':' suite
'for' exprlist 'in' testlist ':' suite 'else' ':' suite

Other rules like

dictmaker: test ':' test (',' test ':' test)* [',']

have an infinite number of productions:

test ':' test
test ':' test ','
test ':' test ',' test ':' test
test ':' test ',' test ':' test ','
...

When I created the listing I selected a small number of productions for each grammar rule. Each symbol in the rule should be covered and have at least one occurrence in the set of productions. Although for_stmt is finite and dictmaker is infinite, the algorithm creates two productions for each.

After having enough productions to cover all syntactical subtleties (expressed by the grammar) I had to build one big rule containing all productions. This was actually the most demanding step in the design of the algorithm, and I initially got it wrong.

Embedding of productions

Intuitively we can interpret all non-terminal symbols in our productions as variables which may be substituted. We expand test in dictmaker by selecting and inserting one production we got for the rule test. Unfortunately a grammar isn’t a tree but a directed, cyclic graph, so we have to be extremely careful not to run into an infinite replacement loop. This is only a technical problem though, and it can be handled using memoization. Here is a bigger one.

Look at the following two rules:

expr_stmt: testlist (augassign (yield_expr|testlist) |
           ('=' (yield_expr|testlist))*)
augassign: ('+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' |
            '<<=' | '>>=' | '**=' | '//=')

The only place where augassign occurs in the grammar is in expr_stmt, but counting the number of productions for augassign we get 12, whereas we only count 3 productions for expr_stmt, and there is just a single production which contains expr_stmt. A naive top-down substitution is obviously impossible without leaving a remainder of productions which can’t be integrated. We have a system of dependencies which has to be resolved, and the initial set of productions must be adapted without introducing new productions which in turn cause new problems. This is possible, but in my attempts the expressions became large and unreadable, so I tried something else.

Observe that the most important start rule of the grammar (Python actually has 4! Can you see which ones?) is:

file_input: (NEWLINE | stmt)* ENDMARKER

I expect that each language has a rule of such a kind on a certain level of nesting. It produces a sequence of statements and newlines. I tried the following Ansatz:

Wrap each initially determined production which is not a production of a start rule into a stmt

Take the production ‘+=’ of augassign as an example. We find that augassign occurs in expr_stmt. So we take one expr_stmt production and embed augassign in the concrete form ‘+=’:

testlist '+=' yield_expr

The subsequent embedding steps are:

expr_stmt   -> small_stmt
small_stmt  -> simple_stmt
simple_stmt -> stmt

When embedding small_stmt into simple_stmt one has to add a trailing NEWLINE. So our final result is:

testlist '+=' yield_expr NEWLINE

Any rule we used during successive embedding doesn’t have to be used again as the initial rule of another embedding, because it was already built into file_input. It can be reused though when needed. I did not attempt to minimize the number of embeddings.

Substitute non-terminals

Now that we have a single sequence of terminals and non-terminals which contains all our productions in a consistent way, we are going to substitute the non-terminals. This is done such that a minimum number of terminal symbols is required, which explains some of the redundancies: we find import f and import d among the listed statements. I suspect one of them is a shortened form of import d.e, but since the rule for building d.e also allows using d alone, and that is shorter, it will be chosen.

Detecting Grammar flaws

Generating the above expressions also reveals some flaws in the grammar which have to be corrected by the bytecode compiler (or AST transformer). This doesn’t mean that Python’s grammar isn’t carefully crafted, quite the contrary is true, but it highlights some of the limitations of using an LL(1) grammar. For example, it is quite simple, although a little cumbersome, to express argument orderings in variable argument lists using a non-LL(1) grammar:

file_input: (NEWLINE | stmt)* ENDMARKER
simpleargs: fpdef (',' fpdef)*
defaultargs: fpdef '=' test (',' fpdef '=' test)*
starargs: '*' NAME
dstarargs: '**' NAME
varargslist: ( simpleargs [',' defaultargs] [',' starargs] [','dstarargs] |
               defaultargs [',' starargs] [','dstarargs] |
               starargs [','dstarargs] |
               dstarargs) [',']

So when you craft your own grammar, automatic expression generation might aid design decisions. Detecting flaws early can spare lots of code used to add additional checks later on.

Refactorings

In the case of Langscape the primary goal was to safeguard grammar refactorings. It is not generally possible to prove that two context-free grammars are equal, i.e. that they recognize the same language. But the same holds in the general case for any two programs in even more powerful, Turing-complete languages. This doesn’t imply we never change any code. It is standard practice to safeguard refactorings using unit tests, and so we start to do here.

If we assume that two different grammars G1, G2 recognize the same language L, then their parsers P(G1), P(G2) must at least be able to parse the grammar-generated expressions of the other grammar: P(G1)(Expr(G2)) -> OK; P(G2)(Expr(G1)) -> OK.

Of course we can refine this criterion by including bad case tests or comparing the selection sequences of TokenTracers for Expr(G1), Expr(G2) which must be equal. Last but not least we can use higher approximations.

Higher approximations

Doesn’t the listing give us a first-order approximation of the language? It’s a fun idea to think of all those listing expressions as living in the “tangential space” of the language. “Higher approximation” would simply mean longer traces though (if they are possible due to the presence of a Kleene star). This yields a simpler idea: we create the set Tr(K, nfa) of traces of length K for a given NFA. Tr(K, nfa) may be empty for some K. Unfortunately we can’t infer from Tr(K) = {} that Tr(K+1) = {}. So what is a good stop criterion then?

The algorithm for creating Tr(K, nfa) is quite simple. The following functions are Langscape implementations:

def compute_tr(K, nfa):
    '''
    Computes the set Tr(K, nfa) of traces of length K for a given nfa.
    The return value may be [] if no trace of length K exists.
    '''
    _, start, trans = nfa
    return compute_subtraces(K, 0, start, [], trans)
 
def compute_subtraces(K, k, S, trace, trans):
    '''
    Computes complete traces of a given length.
 
    :param K: The prescribed length a trace shall have.
    :param k: The current length of a trace ( used by recursive calls ).
    :param S: the current state.
    :param trace: the current trace.
    :param trans: the {state:[follow-states]} dictionary which characterizes
                  one NFA.
    '''
    traces = []
    follow = trans[S]
    for F in follow:
        if F[0] is None:
            # termination condition fulfilled?
            if k == K:
                traces.append(trace+[F])
        else:
            m = trace.count(F)
            # impossible to terminate trace under this condition
            if m == K:
                continue
            else:
                traces+=compute_subtraces(K, max(k,m+1), F, trace+[F], trans)
    return traces
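
The NFA format is only implicit in the code above: a triple whose last two entries are the start state and the transition dictionary, with final markers carrying None as their first component. Under that assumption a toy run for a rule like X: a+ might look as follows; the concrete state tuples are made up for the example.

# transition dictionary for something like X: a+
trans = {
    ("X", 0): [("a", 1)],
    ("a", 1): [("a", 1), (None, "-")],   # loop on 'a' or finish
}
nfa = ("X: a+", ("X", 0), trans)

print(compute_tr(1, nfa))   # [[('a', 1), (None, '-')]]
print(compute_tr(2, nfa))   # [[('a', 1), ('a', 1), (None, '-')]]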