On the comp.compilers mailing list I asked for a confirmation of the O(n) complexity claim for packrat parsers. A packrat parser is a particular top down recursive descendant parser which applies backtracking on failure of consuming a token by application of a parsing expression. Now a backtracking parser usually exhibits O(2^{n}) complexity and the essential claim is that memoization brings it down to O(n) on the cost of expanding storage. The work of Bryan Ford about packrat parsers refers to an original article written by Birman & Ullmann that gave a proof. The cited article costs 19$ at IEEE which I felt I would spent when I failed to give a proof by myself. Fortunately the proof is elementary and not complicated. So I could save my money.

Here we go.

Each non-terminal `N` of a grammar G has a finite first set `{T1, .., Tn}` which are terminal symbols that can be recursively determined by examining the start of the rule `N`. So if `N` has the shape

`N → R1 … | R2 …`

then

`FirstSet(N)` = FirstSet(R1) ⋃ FirstSet(R2) …

with `FirstSet(`R_{i} ) = { R_{i} } if R_{i }is a terminal symbol.

Next we consider the preimages of terminals with regard to FirstSets. We call them *ancestors* and for a terminal T the set of ancestors is defined by:

`Ancestors(T) = {T} ∪ {N ∈ NonTerminals(G)| T ∈ FirstSet(N)}`

Each terminal T has at least one ancestor ( itself ) and at most `K = #NT(G)+1` where `#NT(G)` is the number of non-terminals of the grammar G.

Suppose now we feed a token stream T_{1} T_{2} … T_{n} into a top down parser. For each T_{i} in the stream we can apply all of the rules in `Ancestors(`T_{i}`)` and store the results in a set

`Trees(i) = {(i, Tree(R))| R ∈ Ancestors(`T_{i}`) }`

Since for all i the inequality`#Ancestors(`T_{i}`)` ≤ K holds the set

`Trees =`∪ _{i∈{1, …, n}}`Trees(i)`

has at most `K*n` elements. If a new tree needs to be computed the sub-trees may be looked up in `Trees`. If one is missing that one is computed as well etc. In any case no more than `K*n` computations need to be performed to determine all elements of `Trees`. Since K was a constant that depended only on the grammar we can conclude that the asymptotic complexity of computing `Trees` is O(n). Since `Trees` contains a solution to our problem it can be computed in O(n) as well.

Should FirstSet(N) = FirstSet(R1) ⋃ FirstSet(R1) …

be FirstSet(N) = FirstSet(R1) ⋃ FirstSet(R2) … ?

Yes, sure. I have corrected this. Thanks!