Linear bounds for backtracking parsers with memoization
On the comp.compilers mailing list I asked for a confirmation of the O(n) complexity claim for packrat parsers. A packrat parser is a particular top down recursive descendant parser which applies backtracking on failure of consuming a token by application of a parsing expression. Now a backtracking parser usually exhibits O(2n) complexity and the essential claim is that memoization brings it down to O(n) on the cost of expanding storage. The work of Bryan Ford about packrat parsers refers to an original article written by Birman & Ullmann that gave a proof. The cited article costs 19$ at IEEE which I felt I would spent when I failed to give a proof by myself. Fortunately the proof is elementary and not complicated. So I could save my money.
Here we go.
Each non-terminal N of a grammar G has a finite first set {T1, .., Tn} which are terminal symbols that can be recursively determined by examining the start of the rule N. So if N has the shape
N → R1 … | R2 …
then
FirstSet(N) = FirstSet(R1) ⋃ FirstSet(R2) …
with FirstSet(Ri ) = { Ri } if Ri is a terminal symbol.
Next we consider the preimages of terminals with regard to FirstSets. We call them ancestors and for a terminal T the set of ancestors is defined by:
Ancestors(T) = {T} ∪ {N ∈ NonTerminals(G)| T ∈ FirstSet(N)}
Each terminal T has at least one ancestor ( itself ) and at most K = #NT(G)+1 where #NT(G) is the number of non-terminals of the grammar G.
Suppose now we feed a token stream T1 T2 … Tn into a top down parser. For each Ti in the stream we can apply all of the rules in Ancestors(Ti) and store the results in a set
Trees(i) = {(i, Tree(R))| R ∈ Ancestors(Ti) }
Since for all i the inequality#Ancestors(Ti) ≤ K holds the set
Trees =∪ i∈{1, …, n}Trees(i)
has at most K*n elements. If a new tree needs to be computed the sub-trees may be looked up in Trees. If one is missing that one is computed as well etc. In any case no more than K*n computations need to be performed to determine all elements of Trees. Since K was a constant that depended only on the grammar we can conclude that the asymptotic complexity of computing Trees is O(n). Since Trees contains a solution to our problem it can be computed in O(n) as well.
Should FirstSet(N) = FirstSet(R1) ⋃ FirstSet(R1) … be FirstSet(N) = FirstSet(R1) ⋃ FirstSet(R2) … ?
Yes, sure. I have corrected this. Thanks!