In practice we often encounter situations when our preferred approach to problem solving breaks down. Just look at the recent Google implementation of a regexp engine RE2, created by Russ Cox who has written a revival paper for Thompson NFAs a while ago with a few follow-ups which build on those ideas. Once again backmatching is greyed from the feature matrix which means: no implementation. The project page intro states:
The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently.
So backmatching can’t be implemented efficiently, but why? What is it that prevents an efficient implementation and can a subset be defined which exist within the constraints of O(n) lexers or parsers? In this article we will find that backmatching for convenient use cases is not a problematic feature in linear space and time regexp engines. Some backreferences are obviously falling apart like exotic applications of regexps for solving NP-complete problems like 3-SAT – but could it be in the end that only esoteric applications of backmatching are excluded from trace based parsing (TBP)?
General backmatching
When we think about backmatching in regular expressions we might define expressions like this
a) "... (P) ... \1" |
where (P) defines a simple pattern and \1 refers to the string value matched by the pattern. We assume that there is a functional relationship between (P) and \1. First it is (P) that matches then \1 will match what (P) has matched before.
Actually this perspective is a little simplistic and in the general case backmatching can be more powerful. Consider the following simple regexp:
b) ([ab]*)b*\1 |
Here the match of `([ab]*)` depends on what \1 will match but it is also highly ambiguous. If we match the following string
s = “bb”
the first b can be matched with `([ab]*)` and the last “b” with `\1` but the whole string can also be matched with `b*`.
Here is another more complicated example
c) (([ab]*)b*\2)*[ab]*\1b* |
It stems from Thomas Lord with whom I discussed this topic on LtU and who corrected my initial naive assumptions about backmatching. Not only depends the match of `([ab]*)` on the match of \1 but also on the match of \2 which depends on the match of \1 as well. Of course \1 depends on both of the matches of `([ab]*)` and `(([ab]*)b*\2)`. It is all tangled.
General backmatching as in examples b) and c) can be used to solve NP-complete problems which exploits these tangles and finds resolutions. See this article for more examples. With NP complete backtracking algorithms built into regexp engines one gets such solutions for free. This is cool and clever but also unnecessary most of time.
Functional backmatching
If we restrict backmatching to simple functional relations between (P) and \1 as in case a) we can still express a broad range of practically relevant use cases. Here we give an approach to formalize those restrictions which can be checked by a regexp compiler.
In an expression
... (P) ... \1 |
the back-reference \1 can be separated from P when the following conditions are satisfied:
1. P doesn’t contain back-references which means it is self contained.
2. It is possible to write the expression in the way
... L(P)R ... \1 |
where L and R are left and right delimiters of P which means P has no characters in common with L and R. L can be empty when (P) is at the start of an expression.
The first condition can be checked syntactically. The second condition can be expressed using the following two equations on sets
2.1 LAST-SET(L) /\ FIRST-SET(P) = {} 2.2 LAST-SET(P) /\ FIRST-SET(R) = {} |
If additionally following condition is true
2.3 FIRST-SET(P) /\ LAST-SET(P) = {} |
R can be empty and an expression
... L(P)\1 ... |
is permitted.
End Cycles
The conditions 2.1 – 2.3 are still to restrictive. For example the regexp `(a)\1` violates condition (2.3) but shall be permitted. What we really want to exclude is that \1 is adjacent to what I call a non empty endcycle.
An endcycle of P has the following definition:
END-CYCLE(P) = FOLLOW-SET( LAST-SET(P) ) |
Take for example the regexp`P = (a*|b|c)`. Here `LAST-SET(P) = {a, b, c}` and `FOLLOW-SET({a,b,c}) = {a}` which means that `a` is in the endcycle of `P`.
With endcycles in mind we can weaken the conditions of (2) considerably:
If P has no endcycle i.e.
END-CYCLE(P) = {} |
we permit
... L(P)\1 ... |
if the following holds:
END-CYCLE(L) /\ FIRST-SET(P) = {} |
If on the other hand
END-CYCLE(P) != {} |
we permit
... L(P)R ... \1 ... |
if the following is valid:
END-CYCLE(L) /\ FIRST-SET(P) = {} END-CYCLE(P) /\ FIRST-SET(R) = {} |
Final Remarks
No matter how the conditions are defined it has to be granted that matching (P) is terminated before backmatching. If this isn’t checked statically during regexp compilation one can still defer checks until runtime. Much like any other dynamic check it is less clear what will happen to an expression but there isn’t much mental overhead and the implementation is kept simpler.