Trace Based Parsing (Part III) – NFALexer

The NFALexer gives me an uneasy feeling. It is the most questionable of Trail's components – one that cannot decide whether it wants to be a lexer that produces a token stream or a parser that produces parse trees. Right now it creates both and degrades parse trees to token streams.

Then there is also another component, the Postlexer, that actually does not belong to the Trail core anymore. The Postlexer originates from the observation that Python's tokenize.py module implements a two-pass lexical analysis. In the first pass the source code is chunked into pieces using regular expressions, whereas in the second pass whitespace is extracted and NEWLINE, INDENT and DEDENT token are produced. In a certain sense the split into NFALexer and Postlexer can be understood as a generalization and clarification of that module. But why create a trace based lexer at all? Why not just stay faithful to regular expressions?

Issues with popular regexp implementations

When you take a look at tokenize.py you will observe code like this:

...
Triple = group("[uU]?[rR]?'''", '[uU]?[rR]?"""')
String = group(r"[uU]?[rR]?'[^\n'\\]*(?:\\.[^\n'\\]*)*'",
               r'[uU]?[rR]?"[^\n"\\]*(?:\\.[^\n"\\]*)*"')
 
Operator = group(r"\*\*=?", r">>=?", r"<<=?", r"<>", r"!=",
                 r"//=?",
                 r"[+\-*/%&|^=<>]=?",
                 r"~")
Bracket = '[][(){}]'
Special = group(r'\r?\n', r'[:;.,`@]')
...

The group function turns a sequence of regular expressions s1, s2, … into a new regexp (s1 | s2 | …). This group building process yields one big regular expression of this kind, containing about 60-70 components.
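
For reference, the group helper used above is essentially a one-liner that wraps its alternatives in a single group:

def group(*choices):
    # joins regexp alternatives into a single group: (s1|s2|...)
    return '(' + '|'.join(choices) + ')'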

What’s wrong with that kind of regular expressions for lexical analysis?

Python uses longest match or greedy tokenization. It matches as many characters as possible. So do regular expressions.

Do they?

The basic problem with Python's regular expression implementation is that s1|s2|… can match something entirely different than s2|s1|….

For example a simple Float regexp can be defined as

Float = group(r'\d+\.\d*', r'\.\d+')

whereas an Int is just

Int = r'[1-9]\d*'

Finally we have a Dot token

Dot = r'\.'

When we group those regular expressions as group(Int, Dot, Float) a float never gets matched as a Float: tokenizing 7.5 yields the token stream Int, Dot, Int rather than Float. The same problem occurs with group(Int, Float, Dot). But group(Float, Int, Dot) works as expected.
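
The effect is easy to reproduce with Python's re module. The snippet below (using the same group helper as above) tokenizes 7.5 with both orderings:

import re

def group(*choices):
    return '(' + '|'.join(choices) + ')'

Float = group(r'\d+\.\d*', r'\.\d+')
Int   = r'[1-9]\d*'
Dot   = r'\.'

bad  = re.compile(group(Int, Dot, Float))
good = re.compile(group(Float, Int, Dot))

print([m.group() for m in bad.finditer("7.5")])    # ['7', '.', '5'] -> Int, Dot, Int
print([m.group() for m in good.finditer("7.5")])   # ['7.5']         -> Float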

This behavior is called ordered choice and it is more of a misfeature than a feature. Ordered choice is o.k. for small regular expressions (or PEGs) where the programmer has complete oversight, but it is not appropriate for 60 or more token and it is a design failure for systems like EasyExtend.

Regular expressions and trace based parsing

I briefly noted in part I of this article series that the original work on trace based parsing goes back to the late 60s and Ken Thompson's work on regular expression engines. So there is no intrinsic flaw in regular expressions, only in some of their implementations. The implicit parallelism of TBP just scans the source code using Int and Float at the same time, and the matched results always correspond, no matter how we group our matching rules.
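
As a rough illustration of the idea (not Trail's actual implementation) here is a toy tokenizer that advances all rules at the same position and keeps the longest match; its result does not depend on the order of the rules:

import re

# toy rule set; the order of the entries does not matter
RULES = [("Int",   re.compile(r'[1-9]\d*')),
         ("Dot",   re.compile(r'\.')),
         ("Float", re.compile(r'\d+\.\d*|\.\d+'))]

def longest_match(text, pos):
    # try every rule at the current position and keep the longest match
    best = None
    for name, rx in RULES:
        m = rx.match(text, pos)
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        name, end = longest_match(text, pos)
        tokens.append((name, text[pos:end]))
        pos = end
    return tokens

print(tokenize("7.5"))   # [('Float', '7.5')] with any ordering of RULES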

Expansion in trace based lexers

The Int, Float example highlights a First/First conflict between different rules. Trail uses expansion to eliminate them. Actually expansion works as a conflict resolution strategy for all lexers I've examined. For that reason the NFALexer lacks backtracking as a fallback.

Why the current NFALexer implementation is flawed

The NFALexer is slow. For each character the NFALexer produces a token and often these token are just thrown away again. Strings and comments might have a rich internal token structure, but in the NFAParser we just want to use a single STRING token, and comments are at best preserved to enable pretty printing.

The NFAParser for Python demands a single NUMBER token for all kinds of numbers. The token grammar defines about 12 different rules for numbers. That's perfectly o.k. as a structuring principle, but we get far more structure than needed. It would be good if the NFA production for the lexer could adapt itself to the demands of the parser.

The weirdest aspect is the flattening of parse trees to token. This is caused by the fact that we want to pass Python parse trees to the compiler, and the compiler can't deal with terminal symbols that are in fact not terminals at all but parse trees.

Conclusion: rewrite the NFALexer s.t. it becomes a proper lexer, i.e. a function that takes a string and returns a token stream without producing any complex intermediate data structures. A trace based lexer should fly!

In the rest of the article we discuss some design aspects of the NFALexer.

Parser Token and Lexer Symbols

In the previous article about NFA expansion we have already touched the concept of a node id (nid). For each rule with rule name R there is a mapping R → nid(R), where nid(R) is a number used to identify R. The mapping expresses a unique relationship, and it is not only unique for a particular EasyExtend langlet but intended to be unique across all of them. This means that each langlet has its own domain or range of node ids:

   L = LANGLET_OFFSET
   nid_range = range(L, L+1024)

The LANGLET_OFFSET is a number L = offset_cnt*1024, with offset_cnt = 0, 1, 2, …. The offset_cnt is read from and stored in the file EasyExtend\fs and it gets incremented for each new langlet that is created.

A range of size 1024 might appear small, but big programming languages have about 80 terminal symbols and 200-300 non-terminals. Even COBOL has no more than 230 rules. COBOL has a few hundred keywords but this doesn't matter, since keywords are all mapped to a single token. Maybe Perl is bigger, but I guess the problems with Perl are those of context sensitivity, and everything that can be encoded about Perl in a context free way can also be expressed in EasyExtend.

The node id range is again split into two ranges: one for terminals (token) and another one for non-terminals (symbols):

    L = LANGLET_OFFSET
    token_range  = range(L, 256 + L)
    symbol_range = range(256 + L, 1024 + L)

The names "symbol" and "token" go back to Python's standard library modules token.py and symbol.py.

From lex_symbol to parse_token

Both the NFALexer and the NFAParser are parsers and use the same LANGLET_OFFSET and the same nid ranges. Since the terminal and non-terminal ranges coincide and we want to use the symbols of the NFALexer as the token of the NFAParser, we have to apply a shift to the NFALexer symbols to turn them into NFAParser token:

token-nid NFAParser = symbol-nid NFALexer – 256
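
In code the relation reads as follows (an illustrative helper, not part of EasyExtend's API):

LANGLET_OFFSET = 38912            # offset_cnt*1024 with offset_cnt = 38

def parse_token_nid(lex_symbol_nid):
    # a lexer symbol nid (>= LANGLET_OFFSET + 256) is shifted down into
    # the parser's token range
    return lex_symbol_nid - 256

assert parse_token_nid(LANGLET_OFFSET + 257) in range(LANGLET_OFFSET, LANGLET_OFFSET + 256)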

Comparing lexdef/lex_symbol.py with parsedef/parse_token.py for arbitrary langlets lets you check this relation:

[Table omitted: node ids from lexdef/lex_symbol.py listed side by side with the corresponding node ids from parsedef/parse_token.py, differing by 256.]

Lexer Terminals

When the symbols of the lexer are the token of the parser, how can we characterize the token of the lexer? The answer is simple: by individual characters. One can just define:

DIGIT: '0123456789'
CHAR: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
...

The disadvantage of this explicit notation is that it blows up the lexer NFAs and adds a new state for each character. Moreover we cannot write down the set of all characters and define a symbol that behaves like a regular expression dot that matches anything. So EasyExtend lets one define character sets separately.

Character sets

Character sets include sets for alphanumeric characters, whitespace, hexdigits and octdigits, but also the set ANY used to represent any character. Unlike individual characters, character sets have a numerical token-id. Character sets are defined in the module EasyExtend/lexertoken.py.

Here is an example:

BaseLexerToken.tokenid.A_CHAR  = 11
BaseLexerToken.charset.A_CHAR  = set(string.ascii_letters+"_")

Each lexdef/lex_token.py contains a LexerToken object. LexerToken is a copy of the BaseLexerToken prototype. During the copying process the token-ids defined at BaseLexerToken.tokenid.xxx are shifted according to the langlet offset. LexerToken objects are directly accessed and extracted by the NFALexer.
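
The copying and shifting can be sketched roughly like this (hypothetical code, not EasyExtend's actual implementation):

import copy
import string

class Namespace(object):
    # plain attribute container standing in for the real token structures
    pass

class BaseLexerToken(object):
    def __init__(self):
        self.tokenid = Namespace()
        self.charset = Namespace()
        self.tokenid.A_CHAR = 11
        self.charset.A_CHAR = set(string.ascii_letters + "_")

def NewLexerToken(langlet_offset):
    # copy the prototype and shift every token-id by the langlet offset
    token = copy.deepcopy(BaseLexerToken())
    for name, tid in vars(token.tokenid).items():
        setattr(token.tokenid, name, tid + langlet_offset)
    return token

LexerToken = NewLexerToken(38912)
print(LexerToken.tokenid.A_CHAR)   # 38923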

Below we give an example of a LexerToken extension. The langlet we use here is Teuton and the character sets we define are German umlauts.

# -*- coding: iso-8859-1 -*-
# You might create here a new LexerToken object and set or overwrite
# properties of the BaseLexerToken object defined in lexertoken.py
 
LANGLET_OFFSET = 38912
from EasyExtend.lexertoken import NewLexerToken
LexerToken = NewLexerToken(LANGLET_OFFSET)
LexerToken.tokenid.A_UMLAUT = LANGLET_OFFSET+20
LexerToken.charset.A_UMLAUT = set('\xe4\xf6\xfc\xc4\xd6\xdc\xdf\x84\x8e\x94\x99\x81\x9a\xe1')
LexerToken.tokenid.A_ue     = LANGLET_OFFSET+21
LexerToken.charset.A_ue     = set('\x81\xfc')

Beyond character sets

Character sets are proper set objects. However it is possible to use other set-like objects as well. An interesting yet unused set-like type is the universal set described in this article.

Postlexer token

There is another important class of token we have not touched yet. This class of token is already present in Python and some of these token make Python's fame, most prominently INDENT and DEDENT. But there are also NEWLINE, ENDMARKER, NL and ERRORTOKEN. What they have in common is that there is no pattern describing them. When the source code is chunked into pieces using regular expressions, none of those chunks is assigned to one of the mentioned token. These token are produced in a secondary post-processing step.

The same holds for EasyExtend. There are grammar rules for INDENT, DEDENT, NEWLINE etc., but on the right-hand side we find terminals like T_INDENT, T_DEDENT and T_NEWLINE that are neither characters nor charsets. They are terminal symbols only in a formal sense, used to pacify the parser generator. In EasyExtend postlexer token are produced and inserted into the token stream generated by the NFALexer. Their production is located in the Postlexer, which is a function Postlexer: TokenStream → TokenStream.
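
As a toy model of such a function (not EasyExtend's Postlexer, and working on plain lines instead of a token stream for brevity) the following sketch derives INDENT, DEDENT and NEWLINE token from leading whitespace:

def postlex(lines):
    # compare each line's indentation with a stack of previous
    # indentation widths and emit INDENT/DEDENT token accordingly
    indents, tokens = [0], []
    for line in lines:
        width = len(line) - len(line.lstrip(' '))
        if width > indents[-1]:
            indents.append(width)
            tokens.append(('INDENT', line[:width]))
        while width < indents[-1]:
            indents.pop()
            tokens.append(('DEDENT', ''))
        tokens.append(('LINE', line.strip()))
        tokens.append(('NEWLINE', '\n'))
    while len(indents) > 1:
        indents.pop()
        tokens.append(('DEDENT', ''))
    return tokens

print(postlex(["if x:", "    y = 1", "z = 2"]))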

Apart from the Postlexer there is another component called the Interceptor that is used to manipulate state sets on the fly during NFA parsing. Together the Postlexer and the Interceptor enable context sensitive parsing. It is interesting to note that they both act on the level of lexical analysis and token production and that there is no interference with the NFAParser. From a design point of view this is very pleasant, because parsing is usually harder and more error prone than lexing, and if context sensitivity can be captured during lexical or post-lexical analysis, pressure is taken away from the parser.

Special token

In this final section we want to discuss three special token called ANY, STOP and INTRON.

INTRON

INTRON – on other occasions people also call it JUNK; I used a slightly less pejorative name. Although INTRONs do not encode any token used by the parser, it can happen that the Postlexer still extracts information from them. For example Python's Postlexer can derive INDENT, DEDENT and NEWLINE token from an INTRON. Other languages like C have no use for them and just throw them away. An INTRON can be annotated to a token as an attribute for the purpose of exact source code reconstruction from parse trees.

ANY

ANY is the counterpart of a regexp dot. It just matches any character. ANY is implemented as a charset – the empty set. Since it is the empty set one could expect to run into First/First conflicts with every other character set, and indeed it does! However, the NFALexer has a hardcoded disambiguation strategy, and according to this strategy ANY is always endowed with the lowest priority. So if {ANY, 'X', ';'} is in the selection and the current character is 'c', it is matched by ANY. But if it is 'X' it is matched by 'X' and the ANY trace is discontinued.
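
The priority rule can be pictured with a small selection function (illustrative only, not the NFALexer's code):

def select(selection, char):
    # explicit characters win; ANY only matches as a last resort
    explicit = [s for s in selection if s != 'ANY']
    if char in explicit:
        return char
    return 'ANY' if 'ANY' in selection else None

print(select({'ANY', 'X', ';'}, 'c'))   # ANY
print(select({'ANY', 'X', ';'}, 'X'))   # X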

STOP

Sometimes a production rule accepts a proper subset of the productions of another one. A popular example is the NAME rule

NAME: A_CHAR+

and it covers every production that specifies a particular name as a token:

DEF: "def"

So a DEF is always also a NAME, and if the trace terminates with "def" and does not continue with another character, we have two traces and need to make a decision. The STOP terminal serves for disambiguation purposes. We can write

DEF: "def" STOP

If this happens and NAME cannot be continued, STOP is selected. STOP does not match any character but extends the trace of DEF, which means the DEF trace is selected, not NAME.
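
A toy version of this disambiguation (again not Trail's actual code) might look as follows; the zero-width STOP step makes the DEF trace one step longer than the NAME trace when the input is exactly "def":

import string

A_CHAR = set(string.ascii_letters + "_")

def trace_NAME(word):
    # NAME: A_CHAR+  -- the trace length is the number of matched characters
    return len(word) if word and all(c in A_CHAR for c in word) else 0

def trace_DEF(word):
    # DEF: "def" STOP  -- the trailing STOP adds one zero-width step
    return len("def") + 1 if word == "def" else 0

def classify(word):
    return "DEF" if trace_DEF(word) > trace_NAME(word) else "NAME"

print(classify("def"))      # DEF
print(classify("define"))   # NAME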

What’s next?

In the next article I’ll move to the borders of Trail and discuss recursion in NFAs and possible methods to deal with them.
