Trace Based Parsing ( Part III ) – NFALexer

The NFALexer gives me an uneasy feeling. It is the most questionable of Trail's components – one that cannot decide whether it wants to be a lexer that produces a token stream or a parser that produces parse trees. Right now it creates both and then degrades the parse trees to token streams.

Then there is another component, the Postlexer, which actually does not belong to the Trail core anymore. The Postlexer originates from the observation that Python's `tokenize.py` module implements a two-pass lexical analysis. In the first pass the source code is chunked into pieces using regular expressions; in the second pass whitespace is extracted and `NEWLINE`, `INDENT` and `DEDENT` tokens are produced. In a certain sense the split into NFALexer and Postlexer can be understood as a generalization and clarification of that module. But why create a trace based lexer at all? Why not just stay faithful to regular expressions?

Issues with popular regexp implementations

When you take a look at `tokenize.py` you will observe code like this:

...
Triple = group("[uU]?[rR]?'''", '[uU]?[rR]?"""')
String = group(r"[uU]?[rR]?'[^\n'\\]*(?:\\.[^\n'\\]*)*'",
               r'[uU]?[rR]?"[^\n"\\]*(?:\\.[^\n"\\]*)*"')
 
Operator = group(r"\*\*=?", r">>=?", r"<<=?", r"<>", r"!=",
                 r"//=?",
                 r"[+\-*/%&|^=<>]=?",
                 r"~")
Bracket = '[][(){}]'
Special = group(r'\r?\n', r'[:;.,`@]')
...

The `group` function turns a sequence of regular expressions `s1`, `s2`, … into a new regexp `(s1 | s2 | …)`. This group building process finally yields one big regular expression of this kind containing about 60-70 components.
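
For reference, `group` is essentially a one-liner that joins its arguments into a single alternation:

# group() as it is essentially defined in tokenize.py
def group(*choices):
    return '(' + '|'.join(choices) + ')'

# group(r"\*\*=?", r">>=?")  ->  '(\*\*=?|>>=?)'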

What’s wrong with that kind of regular expressions for lexical analysis?

Python uses longest match or greedy tokenization: it matches as many characters as possible. So do regular expressions.

Do they?

The basic problem with Python's regular expression implementation is that `s1|s2|…` can match something entirely different than `s2|s1|…`.

For example a simple `Float` regexp can be defined as

`Float = group(r'\d+\.\d*', r'\.\d+')`

whereas an `Int` is just

`Int = r'[1-9]\d*'`

Finally we have a `Dot` token

`Dot = r"\."`

When we group those regular expressions as `group(Int, Dot, Float)`, a float never gets matched as a `Float`: tokenizing `7.5` yields the token stream `Int, Dot, Int`, not `Float`. The same holds for `group(Int, Float, Dot)`. But `group(Float, Int, Dot)` works as expected.
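
A small experiment with Python's `re` module (reusing the `group` helper sketched above) makes the order dependence visible:

import re

def group(*choices):
    return '(' + '|'.join(choices) + ')'

Float = group(r'\d+\.\d*', r'\.\d+')
Int   = r'[1-9]\d*'
Dot   = r'\.'

def scan(pattern, text):
    # chop text into lexemes using one big alternation
    return [m.group(0) for m in re.finditer(pattern, text)]

print(scan(group(Int, Dot, Float), '7.5'))   # ['7', '.', '5']  ->  Int, Dot, Int
print(scan(group(Float, Int, Dot), '7.5'))   # ['7.5']          ->  Float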

This behavior is called ordered choice and it is more of a misfeature than a feature. Ordered choice is o.k. for small regular expressions ( or PEGs ) where the programmer has complete oversight, but it is not appropriate for 60 or more tokens and it is a design failure for systems like EasyExtend.

Regular expressions and trace based parsing

I briefly noted in part I of this article series that the original work on trace based parsing goes back to the late 60s and Ken Thompson's work on regular expression engines. So there is no intrinsic flaw in regular expressions, only in some of their implementations. The implicit parallelism of TBP simply scans the source code using `Int` and `Float` at the same time, and the match results always correspond no matter how we group our matching rules.

Expansion in trace based lexers

The `Int`/`Float` example highlights a First/First conflict between different rules. Trail uses expansion to eliminate them. In fact expansion works as a conflict resolution strategy for all lexers I've examined; for that reason the NFALexer lacks backtracking as a fallback.

Why the current NFALexer implementation is flawed

The NFALexer is slow. For each character the NFALexer produces a token, and often the tokens are just thrown away again. Strings and comments might have a rich internal token structure, but in the NFAParser we just want a single `STRING` token, and comments are at best preserved to enable pretty printing.

The NFAParser for Python demands a single `NUMBER` token for all kinds of numbers. The token grammar defines about 12 different rules for numbers. That's perfectly o.k. as a structuring principle, but we get far more structure than needed. It would be good if the NFA production for the lexer could adapt itself to the demands of the parser.

The weirdest aspect is the flattening of parse trees to tokens. This is caused by the fact that we want to pass Python parse trees to the compiler, and the compiler can't deal with terminal symbols that are in fact not terminals but parse trees.

Conclusion: rewrite the NFALexer such that it becomes a proper lexer, i.e. a function that takes a string and returns a token stream without producing any complex intermediate data structures. A trace based lexer should fly!

In the rest of the article we discuss some design aspects of the NFALexer.

Parser Token and Lexer Symbols

In the previous article about NFA expansion we have already touched the concept of a node id ( nid ). For each rule with rule name R there is a mapping R → nid(R) where nid(R) is a number used to identify R. The mapping is not only unique for a particular EasyExtend langlet but is intended to be unique across all of them. This means that each langlet has its own domain or range of node ids:

   L = LANGLET_OFFSET
   nid_range = range(L, L+1024)

The `LANGLET_OFFSET` is a number `L = offset_cnt*1024, offset_cnt = 0, 1, 2, …`. The `offset_cnt` is read from and stored in the file `EasyExtend\fs` and gets incremented for each new langlet that is created.

A range of size 1024 might appear small, but big programming languages have about 80 terminal symbols and 200-300 non-terminals. Even COBOL has no more than 230 rules. COBOL has a few hundred keywords, but this doesn't matter since keywords are all mapped to a single token. Maybe Perl is bigger, but I guess the problems with Perl are those of context sensitivity, and everything about Perl that can be encoded in a context free way can also be expressed in EasyExtend.

The node id range is again split into two ranges: one for terminals ( token ) and one for non-terminals ( symbols ):

    L = LANGLET_OFFSET
    token_range  = range(L, 256 + L)
    symbol_range = range(256 + L, 1024 + L)
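
For illustration, a nid can be classified relative to a langlet offset like this (hypothetical helpers, not part of the EasyExtend API):

def is_token(nid, L):
    # terminals ( token ) occupy the first 256 slots of a langlet's nid range
    return L <= nid < L + 256

def is_symbol(nid, L):
    # non-terminals ( symbols ) occupy the remaining slots
    return L + 256 <= nid < L + 1024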

The names "symbol" and "token" go back to Python's standard library modules `token.py` and `symbol.py`.

From lex_symbol to parse_token

Both NFALexer and NFAParser are parsers and use the same `LANGLET_OFFSET` and the same nid ranges. Since the terminal and non-terminal ranges coincide and we want to use the symbols of the NFALexer as the tokens of the NFAParser, we have to shift the NFALexer symbols to turn them into NFAParser tokens:

token-nid NFAParser = symbol-nid NFALexer - 256

Comparing `lexdef/lex_symbol.py` with `parsedef/parse_token.py` for arbitrary langlets lets you check this relation:

[TABLE=5]
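
In code the relation amounts to a shift by 256 (a hedged sketch with made-up nids; the real numbers are the ones found in the two generated modules):

def lex_symbol_to_parse_token(nid):
    # a non-terminal of the NFALexer reappears as a terminal of the NFAParser
    return nid - 256

L = 38912                                    # e.g. the Teuton langlet offset used below
assert lex_symbol_to_parse_token(L + 257) == L + 1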

Lexer Terminals

When the symbols of the lexer are the tokens of the parser, how can we characterize the tokens of the lexer? The answer is simple: by individual characters. One can just define:

DIGIT: '0123456789'
CHAR: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
...

The disadvantage of this explicit notation is that it blows up the lexer NFAs and adds a new state for each character. Moreover we cannot write down the set of all characters and define a symbol that behaves like the regular expression dot, which matches anything. So EasyExtend lets one define character sets separately.

Character sets

Character sets include sets for alphanumeric characters, whitespace, hexdigits and octdigits, but also the set ANY used to represent any character. Unlike individual characters, character sets have a numerical token-id. Character sets are defined in the module `EasyExtend/lexertoken.py`.

Here is an example:

BaseLexerToken.tokenid.A_CHAR  = 11
BaseLexerToken.charset.A_CHAR  = set(string.ascii_letters+"_")

Each `lexdef/lex_token.py` contains a `LexerToken` object. `LexerToken` is a copy of the `BaseLexerToken` prototype. During the copying process the token-ids defined at `BaseLexerToken.tokenid.xxx` are shifted according to the langlet offset. `LexerToken` objects are directly accessed and extracted by the NFALexer.
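
To make the shift concrete, here is a hedged illustration with plain dictionaries instead of the actual `LexerToken` class:

import string

# prototype definitions as in lexertoken.py
base_tokenid = {"A_CHAR": 11}
base_charset = {"A_CHAR": set(string.ascii_letters + "_")}

# copying for a concrete langlet shifts every token-id into its nid range
LANGLET_OFFSET = 38912      # 38 * 1024, the Teuton offset used below
tokenid = {name: LANGLET_OFFSET + nid for name, nid in base_tokenid.items()}
charset = dict(base_charset)

print(tokenid["A_CHAR"])    # 38923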

Below we give an example of a `LexerToken` extension. The langlet used here is Teuton and the character sets we define cover German umlauts.

# -*- coding: iso-8859-1 -*-
# You might create here a new LexerToken object and set or overwrite
# properties of the BaseLexerToken object defined in lexertoken.py
 
LANGLET_OFFSET = 38912
from EasyExtend.lexertoken import NewLexerToken
LexerToken = NewLexerToken(LANGLET_OFFSET)
LexerToken.tokenid.A_UMLAUT = LANGLET_OFFSET+20
LexerToken.charset.A_UMLAUT = set('\xe4\xf6\xfc\xc4\xd6\xdc\xdf\x84\x8e\x94\x99\x81\x9a\xe1')
LexerToken.tokenid.A_ue     = LANGLET_OFFSET+21
LexerToken.charset.A_ue     = set(['\x81\xfc'])

Beyond character sets

Character sets are proper set objects. However, it is possible to use other set-like objects as well. An interesting yet unused set-like type is the universal set described in this article.

Postlexer token

There is another important class of tokens we have not touched yet. This class of tokens is already present in Python, and some of them contribute to Python's fame – most prominently `INDENT` and `DEDENT`. But there are also `NEWLINE`, `ENDMARKER`, `NL` and `ERRORTOKEN`. What they have in common is that there is no pattern describing them. When the source code is chunked into pieces using regular expressions, none of those chunks is assigned to one of the mentioned tokens. These tokens are produced in a secondary post-processing step.

The same holds for EasyExtend. There are grammar rules for `INDENT`, `DEDENT`, `NEWLINE` etc., but on the right-hand side we find terminals like `T_INDENT`, `T_DEDENT` and `T_NEWLINE` that are neither characters nor charsets. They are terminal symbols only in a formal sense, used to pacify the parser generator. In EasyExtend postlexer tokens are produced and inserted into the token stream generated by the NFALexer. Their production is located in the Postlexer, which is a function `Postlexer: TokenStream → TokenStream`.
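
To give an impression of what such a function does, here is a minimal sketch of an indentation-tracking post-pass; the token format (a list of `(tokenid, value)` pairs per logical line, with leading whitespace arriving as a `WS` token) is made up for the example and is not the actual EasyExtend interface:

NEWLINE, INDENT, DEDENT, WS = "NEWLINE", "INDENT", "DEDENT", "WS"

def postlex(lines):
    # lines: one list of (tokenid, value) pairs per logical line
    indents = [0]
    for line in lines:
        ws = line[0][1] if line and line[0][0] == WS else ""
        col = len(ws.expandtabs())
        if col > indents[-1]:
            indents.append(col)
            yield (INDENT, ws)
        while col < indents[-1]:
            indents.pop()
            yield (DEDENT, "")
        for tok in line:
            if tok[0] != WS:
                yield tok
        yield (NEWLINE, "\n")
    while len(indents) > 1:          # close all open blocks at the end of input
        indents.pop()
        yield (DEDENT, "")

A real Postlexer additionally has to deal with blank lines, comments and the final `ENDMARKER`, but the basic bookkeeping is the same.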

Apart from the Postlexer there is another component, called the Interceptor, that is used to manipulate state sets on the fly during NFA parsing. Together the Postlexer and the Interceptor enable context sensitive parsing. It is interesting to note that they both act on the level of lexical analysis and token production; there is no interference with the NFAParser. From a design point of view this is very pleasant, because parsing is usually harder and more error prone than lexing, and if context sensitivity can be captured during lexical or post-lexical analysis, pressure is taken away from the parser.

Special token

In this final section we discuss three special tokens called `ANY`, `STOP` and `INTRON`.

INTRON

`INTRON` – on other occasions people also call it `JUNK`; I used a slightly less pejorative name. Although `INTRONs` do not encode any token used by the parser, it can happen that the Postlexer still extracts information from them. For example Python's Postlexer can derive `INDENT`, `DEDENT` and `NEWLINE` tokens from an `INTRON`. Other languages like C have no use for them and just throw them away. An `INTRON` can be attached to a token as an attribute for the purpose of exact source code reconstruction from parse trees.

ANY

`ANY` is the counterpart of the regexp dot: it matches any single character. `ANY` is implemented as a charset – the empty set. Since ANY matches everything one would expect it to run into First/First conflicts with every other character set, and indeed it does! However, the NFALexer has a hardcoded disambiguation strategy, and according to this strategy `ANY` always has the lowest priority. So if {ANY, 'X', ';'} is in the selection and the current character is 'c', it is matched by ANY. But if it is 'X', it is matched by 'X' and the ANY trace is discontinued.
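
The priority rule can be sketched like this (a hypothetical helper, not the actual NFALexer code; selection elements are literal characters or names of character sets):

ANY = "ANY"

def select_matches(selection, char, charsets):
    # ANY has the lowest priority: it only survives when nothing else
    # in the selection matches the current character
    winners = [s for s in selection
               if s != ANY and (char in charsets[s] if s in charsets else s == char)]
    if winners:
        return winners               # the ANY trace is discontinued
    return [ANY] if ANY in selection else []

# select_matches({ANY, 'X', ';'}, 'c', {})  ->  ['ANY']
# select_matches({ANY, 'X', ';'}, 'X', {})  ->  ['X']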

STOP

Sometimes one production rule accepts a proper subset of the productions of another one. A popular example is the `NAME` rule

`NAME: A_CHAR+`

and it covers every production that specifies a particular name as a token:

`DEF: "def"`

So a `DEF` is always also a `NAME`, and if the trace terminates with "def" and does not continue with another character, we have two traces and need to make a decision. The `STOP` terminal serves for disambiguation purposes. We can write

`DEF: "def" STOP`

If this happens and `NAME` cannot be continued, `STOP` is selected. `STOP` does not match any character but extends the trace of `DEF`, which means the `DEF` trace is selected, not the `NAME` trace.

What’s next?

In the next article I'll move to the borders of Trail and discuss recursion in NFAs and possible methods to deal with it.
