Fuzzy string matching II – matching wordlists

Small misspellings

An anonymous programming reddit commenter wrote about my fuzzy string matching article:

A maximum edit distance of 2 or 3 is reasonable for most applications of edit distance. For example, cat and dog are only 3 edits away from each other and look nothing alike. Likewise, the original “damn cool algorithm” matches against sets of strings at the same time, whereas the algorithms in the article only compare two strings against each other.

This is a valid objection.

However, for the most common case, an edit distance of 1, you don’t need a Levenshtein automaton either. Here is the recipe:

Let a `wordlist` and an `alphabet` be given. An alphabet can be, for example, the attribute `string.letters` of the `string` module. For a string S, all variants of S with an edit distance <= 1 over the `alphabet` can be computed as follows:

def string_variants(S, alphabet):
    variants = set()
    for i in range(len(S)+1):
        variants.add(S[:i]+S[i+1:])       # delete char at i (i == len(S) re-adds S itself)
        for c in alphabet:
            variants.add(S[:i]+c+S[i:])   # insert char at i
            variants.add(S[:i]+c+S[i+1:]) # subst char at i
    return variants
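
A quick sanity check, assuming the lowercase ASCII alphabet via `string.ascii_lowercase` (the count of 180 already accounts for duplicates such as inserting 'c' before or after the existing 'c', which produce the same string):

import string

variants = string_variants("cat", string.ascii_lowercase)
print(len(variants))      # 180 distinct variants, including "cat" itself
print("cut" in variants)  # True -- one substitution
print("at" in variants)   # True -- one deletion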

The set of matching words is then given by:

set(load(wordlist)) & string_variants(S, alphabet)

The alphabet can be extracted directly from the wordlist as a preparation step, so we do not run into trouble when non-ASCII characters show up.
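
A minimal sketch of this preparation step; `load` and the file name are hypothetical stand-ins:

def load(wordlist):
    # one word per line, whitespace stripped
    with open(wordlist) as f:
        return set(line.strip() for line in f if line.strip())

words    = load("wordlist.txt")   # placeholder path
alphabet = set("".join(words))    # every character occurring in the list
matches  = words & string_variants("haus", alphabet)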

When you want to build string variants up to edit distance 2, just take the result of `string_variants` and apply `string_variants` to each element again, as sketched below.
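
A sketch of the iteration; because `string_variants` returns S itself among the variants, the union below covers all edit distances <= 2:

def string_variants2(S, alphabet):
    variants = set()
    for V in string_variants(S, alphabet):
        variants.update(string_variants(V, alphabet))
    return variants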

The complexity of this approach is

`O((n*len(alphabet))^k)`

where `n` is the string length and `k` is the edit distance. Concretely, for a string of length 10 over a 52-letter alphabet, a single application produces on the order of a thousand variants, and iterating for k = 2 pushes this towards a million.

Alternative Approaches

For k = 1 we are essentially done with the simple algorithm above. For k = 2 and small strings the results are still very good: an iterated application of `string_variants` determines, for a given S, all strings with edit distance <= 2 over an alphabet. So the simplest approaches will probably serve you well in practice!

For k > 2 and big alphabets we create variant sets which are as large as or larger than the wordlist we check against; the effort soon runs out of control. In the rest of the article we treat an approach which is fully general and makes no specific assumptions. Overall it is not as efficient as more specialized solutions can be, but it might be more interesting for sophisticated problems I can’t even imagine.

The basic idea is to organize our wordlist into an n-ary tree, the so-called `PrefixTree`, and to implement an algorithm, a variant of `fuzzy_compare`, which matches a string against this tree with a prescribed maximum edit distance of k for the words we extract from the tree during the match.

Prefix Trees

For a set of words we can factor common prefixes. For example {aa, ab, ca} can be rewritten as {a[a,b], c[a]}. Systematic factoring yields an n-ary tree – we call it a PrefixTree. Leaf nodes and words do not correspond in a 1-1 relationship though, because a word can be a proper prefix of another word. This means that we have to tag PrefixTree nodes with an additional boolean `is_word` field.

class PrefixTree(object):
    def __init__(self, char = '', parent = None):
        self.char     = char
        self.parent   = parent
        self.children = {}
        self.is_word  = False
 
    def _tolist(self):
        if self.is_word:
            yield self.trace()
        for p in self.children.values():
            for s in p._tolist():
                yield s
 
    def __iter__(self):
        return self._tolist()
 
    def insert(self, value):
        if value:
            c = value[0]
            tree = self.children.get(c)
            if tree is None:
                tree = PrefixTree(c, self)
                self.children[c] = tree
            tree.insert(value[1:])
        else:
            self.is_word = True
 
    def __contains__(self, value):
        if value:
            c = value[0]
            if c in self.children:
                return value[1:] in self.children[c]
            return False
        # only nodes tagged as words count; otherwise every prefix of an
        # inserted word would be reported as contained
        return self.is_word
 
    def __len__(self):
        # depth of the node == length of the prefix it represents
        if self.parent is not None:
            return len(self.parent)+1
        return 0

    def trace(self):
        # walk back to the root, collecting the characters of the prefix
        if self.parent is not None:
            return self.parent.trace()+self.char
        return self.char

Reading a wordlist into a `PrefixTree` can simply be done like this:

pt = PrefixTree()
for word in wordlist:
    pt.insert(word)
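
A quick check with a hypothetical three-word list shows the `is_word` tagging at work:

pt = PrefixTree()
for word in ["ca", "car", "cat"]:
    pt.insert(word)

print(sorted(pt))   # ['ca', 'car', 'cat'] -- 'ca' is both a word and a prefix
print("car" in pt)  # True
print("c" in pt)    # False -- a prefix of stored words, not a word itself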

Before we criticise and modify the `PrefixTree`, let us take a look at the matching algorithm.

Matching the PrefixTree

The algorithm is inspired by our `fuzzy_compare` and uses the same recursive structure and memoization.

def update_visited(ptree, visited):
    # marks a finished leaf and its chain of single-child ancestors as
    # fully explored; is_visited() then prunes them via the -1 key
    visited[ptree][-1] = 0
    T = ptree.parent
    while T is not None and T.char!='':
        if len(T.children) == 1:
            visited[T][-1] = 0
            T = T.parent
        else:
            return
 
def is_visited(i, T, k, visited):
    # True if node T was already tried at position i with a budget of at
    # least k edits (or is marked fully explored)
    d = visited.get(T, {})
    if -1 in d:
        return True
    m = d.get(i,-1)
    if k>m:
        d[i] = k
        visited[T] = d
        return False
    return True
 
def fuzzy_match(S, ptree, k, i=0, visited = None, N = 0):
    '''
    Computes all strings T contained in ptree with a distance dist(T, S) <= k.
    '''
    trees = set()
    # handles root node of a PrefixTree
    if ptree.char == '' and ptree.children:
        N = len(S)
        S+='\0'*(k+1)
        visited = {}
        for pt in ptree.children.values():
            trees.update(fuzzy_match(S, pt, k, i, visited, N))
        return trees
 
    # already tried
    if is_visited(i, ptree, k, visited):
        return []
 
    # can't match ...
    if k == -1 or (k == 0 and S[i] != ptree.char):
        return []
 
    if ptree.is_word and (N-i<=k or (N-i<=k+1 and ptree.char == S[i])):
        trees.add(ptree.trace())
        if not ptree.children:
            update_visited(ptree, visited)
            return trees
 
    if ptree.char!=S[i]:
        # mismatch: drop S[i] (deletion) and retry the same node
        trees.update(fuzzy_match(S, ptree, k-1, i+1, visited, N))

    for c in ptree.children:
        if ptree.char == S[i]:
            # exact match: descend and consume S[i] at no cost
            trees.update(fuzzy_match(S, ptree.children[c], k, i+1, visited, N))
        else:
            # substitution: descend and consume S[i] at cost 1
            trees.update(fuzzy_match(S, ptree.children[c], k-1, i+1, visited, N))
        # insertion: descend without consuming S[i], at cost 1
        trees.update(fuzzy_match(S, ptree.children[c], k-1, i, visited, N))
    return trees
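
For illustration, a hypothetical miniature wordlist; the expected result contains every tree word within one edit of "haus":

pt = PrefixTree()
for word in ["haus", "maus", "haut", "husten"]:
    pt.insert(word)

print(fuzzy_match("haus", pt, 1))  # expected: set(['haus', 'maus', 'haut'])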

Lazy PrefixTree construction

The major disadvantage of the construction is the upfront time it takes to create the PrefixTree. I measured it for a wordlist of 158,989 entries and it took about 10 sec. With Psyco activated it still takes 7.5 sec.

A bit of trivia for the curious: I reimplemented PrefixTree in VC++ using the STL `hash_map` and got a worse result: 14 sec of execution time, about twice as much as Python + Psyco. The language designed with uncompromising performance in mind doesn’t cease to surprise me. Of course I feel bad because I haven’t built specialized memory management for this case and so on. Java behaves better with 1.2 sec on average.

A possible solution for Python (and C++), but also for Java when wordlists grow ever bigger, is to create the PrefixTree only partially and let it grow when needed. The load time then gets balanced over several queries and the upfront performance hit is avoided.

Here is the modified code:

class PrefixTree(object):
    def __init__(self, char = '', parent = None):
        self.char      = char
        self.parent    = parent
        self.is_word   = False
        self._children = {}
        self._words    = set()
 
    def _get_children(self):
        if self._words:
            self._create_children()
        return self._children
 
    children = property(_get_children)
 
    def _create_children(self):
        for tree, word in self._words:
            tree.insert(word)
        self._words = set()
 
    def _tolist(self):
        if self.is_word:
            yield self.trace()
        for p in self.children.values():
            for s in p._tolist():
                yield s
 
    def __iter__(self):
        return self._tolist()
 
    def insert(self, value):
        if value:
            c = value[0]
            tree = self._children.get(c)
            if tree is None:
                tree = PrefixTree(c, self)
                self._children[c] = tree
            if len(value) == 1:
                tree.is_word = True
            tree._words.add((tree,value[1:]))
        else:
            self.is_word = True
 
    def __contains__(self, value):
        if value:
            # the complete remaining suffix may still be pending in _words
            if (self, value) in self._words:
                return True
            c = value[0]
            if c in self._children:
                return value[1:] in self._children[c]
            return False
        return self.is_word
 
    def __len__(self):
        if self.parent is not None:
            return len(self.parent)+1
        return 0
 
    def trace(self):
        if self.parent is not None:
            return self.parent.trace()+self.char
        return self.char
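
The lazy behaviour can be observed directly: only the first tree level exists until something forces the children (hypothetical example):

pt = PrefixTree()
for word in ["haus", "hausboot", "hast"]:
    pt.insert(word)

print(len(pt._children))  # 1 -- only the 'h' node has been created so far
print("haus" in pt)       # True -- answered from the pending _words set
print(sorted(pt))         # iteration forces construction: ['hast', 'haus', 'hausboot']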

Some numbers

The numbers presented here should be taken with a grain of salt and not confused with a benchmark, but they still provide a quantitative profile which allows drawing conclusions and making decisions.

Load time of a wordlist of size 158,989 into pt = PrefixTree(): 0.61 sec
Execution time of fuzzy_match("haus", pt, 1) - 1st run: 1.03 sec
Execution time of fuzzy_match("haus", pt, 1) - 2nd run: 0.03 sec
Execution time of fuzzy_match("haus", pt, 2) - 1st run: 1.95 sec
Execution time of fuzzy_match("haus", pt, 2) - 2nd run: 0.17 sec
Execution time of fuzzy_match("haus", pt, 3) - 1st run: 3.58 sec
Execution time of fuzzy_match("haus", pt, 3) - 2nd run: 0.87 sec

We see that the second run is always significantly faster because in the first run the PrefixTree gets partially built while in the second run the built nodes are just visited.
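
The original timing harness is not shown; a minimal sketch using the standard `time` module (assuming `words` is the already loaded wordlist) could reproduce such numbers:

import time

def timed(f, *args):
    start = time.time()
    result = f(*args)
    return time.time() - start, result

pt = PrefixTree()
t_load, _ = timed(lambda: [pt.insert(w) for w in words])
for k in (1, 2, 3):
    t1, found = timed(fuzzy_match, "haus", pt, k)  # 1st run builds nodes on demand
    t2, _     = timed(fuzzy_match, "haus", pt, k)  # 2nd run revisits built nodes
    print("%d %.2f %.2f %d" % (k, t1, t2, len(found)))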

Finally, here are the numbers using string variants:

Execution time of string_variants("haus", string.letters): 0.0 sec
Execution time of 2-iterated of string_variants("haus", string.letters): 0.28 sec
Execution time of 3-iterated of string_variants("haus", string.letters): 188.90 sec

The 0.0 sec result simply means that a single run lies below the timer’s resolution. The other results can possibly be improved by a factor of 2 using a less naive strategy for creating string variants that avoids duplicates. The bottom line is that for k = 1 and k = 2, PrefixTrees, Levenshtein automata and other sophisticated algorithms aren’t necessary, and for `k<=3` PrefixTree-based approaches don’t run amok.

Code

The code for `fuzzy_compare` and `fuzzy_match` can be downloaded here. It also contains tests, some timing measurements and a German sample wordlist.


3 Responses to Fuzzy string matching II – matching wordlists

  1. Bob says:

    This is interesting, but I find the Levenshtein distance quite limiting. It assumes that substituting any letter for another has equal error. As an example, in typed text the word “ecliose” should be closer to the word “eclipse” than to the word “eclixse”, because p and o are next to each other on the keyboard.

    On the other hand, in OCR text the capital I looks almost identical to the lowercase l. Or, to make things even more difficult, capital H looks a lot like two lowercase l’s. Levenshtein distance fails on these.

    Can the algorithm be modified so that these cases can also be matched quickly? Like, say, you have a function f(a, b) that gives the “substitution error” for the two characters (with value 1 for most letter pairs and something like 0.2 for the pair (I, l)).

  2. kay says:

    @Bob. I understand the ranking or equal-weight problem you articulate; the algorithms deliver only the dumb but comprehensive answer of returning the set of possible corrections. What I do think about the presented algos is that they are good starting points for excursions into deeper regions because they are fairly simple, in particular the approach with string variants. The application programmer can be confident that the easy problem has a simple, solid solution which can be used as one processing step.

    I can’t say anything about OCR but understanding and using it is high on my priority list.

  3. Steve Hanov says:

    You can speed up prefix tree construction to almost nothing by avoiding lookups when you insert. To do this, 1) insert words in alphabetical order. 2) When you insert a word, keep an array of the nodes that it used for each letter. 3) When you insert a word and there was a previous word, count the number of characters from the beginning that are in common with the previously inserted word (the common prefix length). Truncate the array of nodes at that point, and start building from the node at the end of that array. This avoids lookups during insert, avoids recursion, and is very efficient.
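
    A minimal sketch of this trick for the eager `PrefixTree` from the first listing; `build_prefix_tree` is a made-up helper, and the no-lookup guarantee only holds when the words arrive in sorted order:

    def build_prefix_tree(sorted_words):
        root = PrefixTree()
        prev = ""
        nodes = [root]                # nodes[d] is the node for prefix prev[:d]
        for word in sorted_words:
            # length of the common prefix with the previously inserted word
            cp = 0
            while cp < min(len(prev), len(word)) and prev[cp] == word[cp]:
                cp += 1
            del nodes[cp+1:]          # keep only the shared part of the path
            for c in word[cp:]:       # append fresh nodes, no membership lookups
                child = PrefixTree(c, nodes[-1])
                nodes[-1].children[c] = child
                nodes.append(child)
            nodes[-1].is_word = True
            prev = word
        return root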
