{"id":1441,"date":"2010-03-13T09:44:37","date_gmt":"2010-03-13T08:44:37","guid":{"rendered":"http:\/\/fiber-space.de\/wordpress\/?p=1441"},"modified":"2012-05-28T06:32:50","modified_gmt":"2012-05-28T05:32:50","slug":"restricted-backmatching","status":"publish","type":"post","link":"http:\/\/fiber-space.de\/wordpress\/2010\/03\/13\/restricted-backmatching\/","title":{"rendered":"Restricted backmatching"},"content":{"rendered":"<p>In practice we often encounter situations when our preferred approach to problem solving breaks down. Just look at the recent Google implementation of\u00a0 a regexp engine<a href=\"http:\/\/code.google.com\/p\/re2\/\"> RE2<\/a>, created by Russ Cox who has written a <a href=\"http:\/\/swtch.com\/~rsc\/regexp\/regexp1.html\">revival paper<\/a> for Thompson NFAs\u00a0 a while ago with a few follow-ups which build on those ideas.\u00a0 Once again backmatching is greyed from the feature matrix which means: no implementation. The project page intro states:<\/p>\n<blockquote><p>The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently.<\/p><\/blockquote>\n<p>So backmatching can&#8217;t be implemented efficiently, but why? What is it that prevents an efficient implementation and can a subset be defined which exist within the constraints of O(n) lexers or parsers? In this article we will find that backmatching for convenient use cases is not a problematic feature in linear space and time regexp engines. Some backreferences are obviously falling apart like exotic applications of regexps for solving NP-complete problems like 3-SAT &#8211; but could it be in the end that only esoteric applications of backmatching are excluded from trace based parsing (TBP)?<\/p>\n<h3>General backmatching<\/h3>\n<p>When we think about backmatching in regular expressions we might define expressions like this<\/p>\n<pre lang=\"C\">a)  \"... (P) ... \\1\"<\/pre>\n<p>where (P) defines a simple pattern and \\1 refers to the string <em>value<\/em> matched by the pattern. We assume that there is a functional relationship between (P) and \\1. First it is (P) that matches then \\1 will match what (P) has matched before.<\/p>\n<p>Actually this perspective is a little simplistic and in the general case backmatching can be more powerful. Consider the following simple regexp:<\/p>\n<pre lang=\"C\">b)  ([ab]*)b*\\1<\/pre>\n<p>Here the match of\u00a0 `([ab]*)` depends on what \\1 will match but it is also highly ambiguous. If we match the following string<\/p>\n<p>s = &#8220;bb&#8221;<\/p>\n<p>the first b can be matched with `([ab]*)` and the last &#8220;b&#8221; with `\\1` but the whole string can also be matched with `b*`.<\/p>\n<p>Here is another more complicated example<\/p>\n<pre lang=\"C\">c)  (([ab]*)b*\\2)*[ab]*\\1b*<\/pre>\n<p>It stems from Thomas Lord with whom I discussed this topic on LtU and who corrected my initial naive assumptions about backmatching. Not only depends the match of `([ab]*)` on the match of \\1 but also on the match of \\2 which depends on the match of \\1 as well. Of course \\1 depends on both of the matches of `([ab]*)` and `(([ab]*)b*\\2)`. It is all tangled.<\/p>\n<p>General backmatching as in examples b) and c)\u00a0 can be used to solve NP-complete problems which exploits these tangles and finds resolutions. See <a href=\"http:\/\/perl.plover.com\/NPC\/NPC-3COL.html\">this<\/a> article for more examples. With NP complete backtracking algorithms built into regexp engines one gets such solutions for free. This is cool and clever but also unnecessary most of time.<\/p>\n<h3>Functional backmatching<\/h3>\n<p>If we restrict backmatching to simple functional relations between (P) and \\1 as in case a) we can still express a broad range of practically relevant use cases. Here we give an approach to formalize those restrictions which can be checked by a regexp compiler.<\/p>\n<p>In an expression<\/p>\n<pre lang=\"C\">    ... (P) ... \\1<\/pre>\n<p>the back-reference \\1 can be <em>separated<\/em> from P when the following conditions are satisfied:<\/p>\n<p>1. P doesn&#8217;t contain back-references which means it is self contained.<\/p>\n<p>2. It is possible to write the expression in the way<\/p>\n<pre lang=\"C\">    ... L(P)R ... \\1<\/pre>\n<p>where L and R are left and right <em>delimiters<\/em> of P which means P has no characters in common with L and R. L can be empty when (P) is at the start of an expression.<\/p>\n<p>The first condition can be checked syntactically. The second condition can be expressed using the following two equations on sets<\/p>\n<pre lang=\"C\">2.1 LAST-SET(L)  \/\\ FIRST-SET(P) = {}\r\n2.2 LAST-SET(P)  \/\\ FIRST-SET(R) = {}<\/pre>\n<p>If additionally following condition is true<\/p>\n<pre lang=\"C\">2.3 FIRST-SET(P) \/\\ LAST-SET(P) = {}<\/pre>\n<p>R can be empty and an expression<\/p>\n<pre lang=\"C\">    ... L(P)\\1 ...<\/pre>\n<p>is permitted.<\/p>\n<h3>End Cycles<\/h3>\n<p>The conditions 2.1 &#8211; 2.3 are still to restrictive. For example the regexp `(a)\\1` violates condition (2.3) but shall be permitted. What we really want to exclude is that \\1 is adjacent to what I call a non empty <em>endcycle<\/em>.<\/p>\n<p>An endcycle of P has the following definition:<\/p>\n<pre lang=\"C\">END-CYCLE(P) = FOLLOW-SET( LAST-SET(P) )<\/pre>\n<p>Take for example the regexp`P = (a*|b|c)`. Here `LAST-SET(P) = {a, b, c}` and\u00a0 `FOLLOW-SET({a,b,c}) = {a}` which means that `a` is in the endcycle of `P`.<\/p>\n<p>With endcycles in mind we can weaken the conditions of (2) considerably:<\/p>\n<p>If P has no endcycle i.e.<\/p>\n<pre lang=\"C\">    END-CYCLE(P) = {}<\/pre>\n<p>we permit<\/p>\n<pre lang=\"C\">    ... L(P)\\1 ...<\/pre>\n<p>if the following holds:<\/p>\n<pre lang=\"C\">    END-CYCLE(L) \/\\ FIRST-SET(P) = {}<\/pre>\n<p>If on the other hand<\/p>\n<pre lang=\"C\">    END-CYCLE(P) != {}<\/pre>\n<p>we permit<\/p>\n<pre lang=\"C\">    ... L(P)R ... \\1 ...<\/pre>\n<p>if the following is valid:<\/p>\n<pre lang=\"C\">    END-CYCLE(L) \/\\ FIRST-SET(P) = {}\r\n    END-CYCLE(P) \/\\ FIRST-SET(R) = {}<\/pre>\n<h3>Final\u00a0 Remarks<\/h3>\n<p>No matter how the conditions are defined it has to be granted that matching (P) is terminated before backmatching. If this isn&#8217;t checked statically during regexp compilation one can still defer checks until runtime. Much like any other dynamic check it is less clear what will happen to an expression but there isn&#8217;t much mental overhead and the implementation is kept simpler.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In practice we often encounter situations when our preferred approach to problem solving breaks down. Just look at the recent Google implementation of\u00a0 a regexp engine RE2, created by Russ Cox who has written a revival paper for Thompson NFAs\u00a0 &hellip; <a href=\"http:\/\/fiber-space.de\/wordpress\/2010\/03\/13\/restricted-backmatching\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[22],"tags":[],"_links":{"self":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/1441"}],"collection":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/comments?post=1441"}],"version-history":[{"count":11,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/1441\/revisions"}],"predecessor-version":[{"id":1465,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/1441\/revisions\/1465"}],"wp:attachment":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/media?parent=1441"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/categories?post=1441"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/tags?post=1441"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}