{"id":2045,"date":"2012-05-20T17:01:39","date_gmt":"2012-05-20T16:01:39","guid":{"rendered":"http:\/\/fiber-space.de\/wordpress\/?p=2045"},"modified":"2022-10-12T12:30:45","modified_gmt":"2022-10-12T11:30:45","slug":"greedy-grammars-and-any","status":"publish","type":"post","link":"http:\/\/fiber-space.de\/wordpress\/2012\/05\/20\/greedy-grammars-and-any\/","title":{"rendered":"Greedy grammars and Any"},"content":{"rendered":"<p>I only vaguely remember my first encounter with a parser generator which must by dated back to the late 1990s. I guess it was <a href=\"http:\/\/pages.cpsc.ucalgary.ca\/~aycock\/spark\/\">Spark<\/a> by John Aycock, an Early parser. What puzzled me back then was the need to be explicit down to the character level.\u00a0 Regular expression engines, albeit cryptic, were a revelation because one could specify the structural information one needed and match the rest using a &#8216;.&#8217; which is the wildcard pattern that matches <em>any<\/em> character<em>.<\/em><\/p>\n<p>I came back to <em>Any<\/em> in the Trail parser generator lately. I was motivated by writing a reversible C preprocessor. Unlike conventional C preprocessors which are used in the compilation chain of C code, a reversible C preprocessor can used to refactor C code, while retaining the preprocessor directives and the macro calls. This is basically done by storing the #define directive along with the code to be substituted and the substitution. The substituted code and the substitution are exchanged after the refactoring step, such that it looks like no substitution happened at all.<\/p>\n<p>A comprehensive C preprocessor grammar can be found on the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/2scxys89%28v=vs.80%29.aspx\">following<\/a> MSDN site. What is most interesting to us are the following two EBNF productions:<br \/>\n#define identifier<\/em>[<strong>(<\/strong> <em>identifier<\/em><sub>opt<\/sub><strong>, <\/strong><em>&#8230;<\/em> <strong>,<\/strong> <em>identifier<\/em><sub>opt<\/sub> <strong>)<\/strong>] <em>token-string<\/em><sub>opt<\/sub><\/p>\n<dl>\n<dt> <em>token-string <\/em>: <\/dt>\n<dd>String of tokens&nbsp;<\/p>\n<\/dd>\n<\/dl>\n<\/blockquote>\n<p>The <em>String of tokens<\/em> phrase this is <em>Any+<\/em>.<\/p>\n<h3>Bon app\u00e9tit<\/h3>\n<p>Suppose one defines two macros<\/p>\n<pre lang=\"C\">#define min(a,b) ((a)<(b)?(a):(b))\r\n\r\n#define max(a,b) ((a)<(b)?(b):(a))<\/pre>\n<p>Obviously the defining string of the `min` macro can be recognized using <em>token-string<\/em> but how can we prevent that <em>token-string<\/em> eats the `max` macro as well? Once in motion token-string has a sound appetite and will eat the rest. The solution to this problem in case of regular expressions is to make <em>Any<\/em> non-greedy. The non-greediness can easily be expressed using the following requirement:<\/p>\n<p>If `S | Any` is a pattern with `S!=Any`. If S can match a character, `S` will be preferred over `Any`.<\/p>\n<p>In the production rule<\/p>\n<pre>R: ... Any* S ...<\/pre>\n<p>we can be sure that if S matches in R then Any won't be used to match - although it would match if we leave it greedy. Same goes with<\/p>\n<pre>R: ... (Any* | S) ...<\/pre>\n<h3>Non greediness in grammars<\/h3>\n<p>Grammars are more complicated than regular expressions and we have to take more care about our greediness rules. To illustrate some of the problems we take a look on an example<\/p>\n<pre lang=\"bnf\">R: A B | C\r\nA: a Any*\r\nB: b\r\nC: c<\/pre>\n<p>`Any` causes a follow\/first conflict between A and B. Making `Any` non-greedy alone won't help because a grammar rule or its corresponding NFA is always greedy! It follows a longest match policy and an NFA will be traversed as long as possible. So once the NFA of A is entered it won't be left because of the trailing `Any*`.<\/p>\n<p>Detecting the trailing `Any` in A is easy though. We solve the follow\/first conflict with a trailing `Any` by embedding `A` into `R`. Embedding strategies are the centerpiece of Trail and they shall not be recapitulated here. Just so much: embedding A in R doesn't destroy any information relevant for parsing. If A has been embedded `Any*` will be delimited by `B` to the right and we can safely apply R without the danger of `Any` consuming a token 'b'.<\/p>\n<p>Eventually we have to re-apply our embedding strategy: if `A` is a rule with a trailing `Any` and `A` is embedded in `B` and `B` has a trailing `Any` <em>after<\/em> this embedding\u00a0 then `B` will be embedded wherever possible.<\/p>\n<h3>A short experiment<\/h3>\n<p>Here is a mini-Python grammar is used to detect Python class definitions.<\/p>\n<pre lang=\"bnf\">file_input: (NEWLINE | stmt)* ENDMARKER\r\nclassdef: 'class' NAME ['(' Any+ ')'] ':' suite\r\nsuite: simple_stmt | NEWLINE INDENT stmt+ DEDENT\r\nsimple_stmt: Any* NEWLINE\r\nstmt: simple_stmt | compound_stmt\r\ncompound_stmt: classdef | Any* suite<\/pre>\n<p><em> <\/em><\/p>\n<p>Unlike a real Python grammar it is fairly easy to build. All rules are taken from the Python 2.7 grammar but only `file_input`, `suite` and `stmt` remained unchanged. In all other cases we have replaced terminal\/non-terminal information that isn't needed by `Any`.<\/p>\n<p>&nbsp;<\/p>\n<p><em><br \/>\n<\/em><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I only vaguely remember my first encounter with a parser generator which must by dated back to the late 1990s. I guess it was Spark by John Aycock, an Early parser. What puzzled me back then was the need to &hellip; <a href=\"http:\/\/fiber-space.de\/wordpress\/2012\/05\/20\/greedy-grammars-and-any\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[12,23,4],"tags":[],"_links":{"self":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/2045"}],"collection":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/comments?post=2045"}],"version-history":[{"count":36,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/2045\/revisions"}],"predecessor-version":[{"id":2321,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/2045\/revisions\/2321"}],"wp:attachment":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/media?parent=2045"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/categories?post=2045"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/tags?post=2045"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}