VHDL grammars and the parser sandwich

Eli Bendersky has mentioned some problems of parsing VHDL. I was quite pleased though finding the Postlexer idea being discussed in the following paper which provides a more thorough treatment of VHDL parsing issues.

They call it “Auxiliary Terminal Generator” which is a more precise notation of what I call a “Postlexer”. In that case a Postlexer resolves ambiguities caused by context sensitivity. The solution is to introduce auxiliary terminal symbols in between two context free passes – the one for lexical analysis and parsing.

What I distaste about their approach is that they coupled the grammar with the Postlexer using parsing actions.
Apparently generations of computing scientists enjoyed squeezing as much information as possible into their grammars instead of acknowledging the obvious: a separate parsing phase is often required for non-trivial grammars and since the problems are specific one can leave it entirely to a powerful general purpose language and the programmer to solve them. As a consequence the parser isn’t portable across languages but at least the grammars are.

Generally speaking one has to consider three functions now instead of two:

The Parser Sandwich

Lexer: Parsetable String -> Tokenstream
Postlexer: Tokenstream -> Tokenstream
Parser: Parsetable Tokenstream -> Tree

`Lexer` and `Parser` shall be universally applicable for any language whereas the `Postlexer` is always specific and can be omitted if the whole language can be described in a context free way.

This entry was posted in Grammars. Bookmark the permalink.

4 Responses to VHDL grammars and the parser sandwich

  1. Eli says:

    IMHO it’s more complicated than this – it ain’t a one-way road from the Lexer to the Tree. In non context-free languages (like C and VHDL), sometimes analysis of the Tree is required to update the Lexer. In C it’s the famous ‘typedef’ context-sensitivity. On the highest level the parser understands a typedef, and updates a symbol table which the Lexer is looking at to make lexing decisions. It can also be the Postlexer in your terminology, but it creates a necessary back-link.

    I’ve also referred to the paper you link to in my post. I found it to be quite a bad paper – all they say is “VHDL is hard, there are some ways to make parsing a bit easier, we tried some, got very partial results, and concluded VHDL is hard”.

  2. kay says:

    One can use the parser for analyzing the typedef but essentially the Postlexer doesn’t have to be that smart and fully understand it. The structure of a typedef is as follows

    typedef decl IDENTIFIER ;

    Once the Postlexer has found “typedef” it has to identify the correct semicolon delimiter. Then it can keep the IDENTIFIER and put it into a symbol table. In order to identify the semicolon it has to ensure that decl is skipped. So it has to count opening ( increment ) and closing ( decrement ) curly braces. If it enters a semicolon and the counter is zero the Postlexer has found the correct one.

    There are also known cases when this is going to fail. Ruby style string interpolations for example. Here one can insert a full Ruby expression within the curly braces of a string “string #{…} interpolation”. The contained expression can be another string and so on. I have not yet completely made up my mind on how to deal with it. I’d suggest a similar strategy as above but now one has to intercept the token stream on creation.

    Notice that I would modify the C grammar and introduce a TYPENAME terminal that cannot be created by the Lexer. This isn’t too different from the situation in Python where INDENT and DEDENT token are present in the grammar which cannot be built by the Lexer either.

  3. Paddy3118 says:

    What do languages gain by having context sensitive grammars?

    Are all context sensitive languages easier to read and maintain? Easier to write other types of parsers for?

    • paddy.
  4. kay says:

    Paddy, context sensitive languages are always harder to parse and maintain. They have worse localization properties but they do not suffer from readability. An example of a language that is context free all the way down is Pascal and it was intentionally designed with this property in mind. Another example is Lisp. I’m not entirely sure about Java, Erlang and C# but I believe they are context free too. Context sensitive are XML, Python, Haskell, Javascript, Ruby, Perl, C, VHDL. So I don’t have reasons to believe that one syntax class is superior to another one from a users point of view.

    However for those of us who create tools that deal with languages as a whole context sensitivity is quirky.

Leave a Reply

Your email address will not be published. Required fields are marked *

*