Archive for January, 2011

Fuzzy string matching II – matching wordlists

Posted in Algorithms, Python on January 7th, 2011 by kay – 3 Comments

Small misspellings

An anonymous programming reddit commenter wrote about my fuzzy string matching article:

A maximum edit distance of 2 or 3 is reasonable for most applications of edit distance. For example, cat and dog are only 3 edits away from each other and look nothing alike. Likewise, the original “damn cool algorithm” matches against sets of strings at the same time, where as the algorithms in the article all only compare two strings against each other.

This is a valid objection.

However, for the most common case, which is an edit distance of 1 you don’t need a Levenshtein automaton either. Here is the recipe:

Let a wordlist and an alphabet be given. An alphabet can be for example the attribute string.letters of the string module. For a string S all string variants of S with an edit distance <=1 over the alphabet can be computed as follows:

def string_variants(S, alphabet):
    variants = set()
    for i in range(len(S)+1):
        variants.add(S[:i]+S[i+1:])       # delete char at i
        for c in alphabet:
            variants.add(S[:i]+c+S[i:])   # insert char at i
            variants.add(S[:i]+c+S[i+1:]) # subst char at i
    return variants

The set of words that shall be matched is given by:

set(load(wordlist)) &amp; string_variants(S, alphabet)

The used alphabet can be directly extracted from the wordlist in preparation of the algorithm. So it is not that we are running into trouble when non ASCII characters come up.

When you want to build string variants of edit distance = 2, just take the result of string_variants and apply string_variants on it again.

The complexity of is

O((n*len(alphabet))^k)

where n is the string length and k the edit distance.

Alternative Approaches

For k = 1 we are essentially done with the simple algorithm above. For k=2 and small strings the results are still very good using an iterative application of string_variants to determine for a given S all strings with edit-distance <=2 over an alphabet. So the most simple approaches probably serve you well in practise!

For k>2 and big alphabets we create word lists which are as large or larger than the wordlist we check against.The effort runs soon out of control. In the rest of the article we want to treat an approach which is fully general and doesn’t make specific assertions. It is overall not as efficient as more specialized solutions can be but it might be more interesting for sophisticated problems I can’t even imagine.

The basic idea is to organize our wordlist into an n-ary tree, the so called PrefixTree, and implement an algorithm which is variant of fuzzy_compare to match a string against this tree with a prescribed maximum edit distance of k for the words we extract from the tree during the match.

Prefix Trees

For a set of words we can factor common prefixes. For example {aa, ab, ca} can be rewritten as {a[a,b], c[a]}. Systematic factoring yields an n-ary tree – we call it a PrefixTree. Leaf nodes and words do not correspond in a 1-1 relationship though, because a word can be a proper prefix of another word. This means that we have to tag PrefixTree nodes with an additional boolean is_word field.

class PrefixTree(object):
    def __init__(self, char = '', parent = None):
        self.char     = char
        self.parent   = parent
        self.children = {}
        self.is_word  = False
 
    def _tolist(self):
        if self.is_word:
            yield self.trace()
        for p in self.children.values():
            for s in p._tolist():
                yield s
 
    def __iter__(self):
        return self._tolist()
 
    def insert(self, value):
        if value:
            c = value[0]
            tree = self.children.get(c)
            if tree is None:
                tree = PrefixTree(c, self)
                self.children[c] = tree
            tree.insert(value[1:])
        else:
            self.is_word = True
 
    def __contains__(self, value):
        if value:
            c = value[0]
            if c in self.children:
                return value[1:] in self.children[c]
            return False
        return True
 
    def __len__(self):
        if self.parent is not None:
            return len(self.parent)+1
        return 0
 
    def trace(self):
        if self.parent is not None:
            return self.parent.trace()+self.char
        return self.char

Reading a wordlist into a PrefixTree can be simply done like this:

pt = PrefixTree()
for word in wordlist:
    pt.insert(word)

Before we criticise and modify the PrefixTree let us take a look at the matching algorithm.

Matching the PrefixTree

The algorithm is inspired by our fuzzy_compare algorithm. It uses the same recursive structure and memoization as fuzzy_compare.

def update_visited(ptree, visited):
    visited[ptree][-1] = 0
    T = ptree.parent
    while T is not None and T.char!='':
        if len(T.children) == 1:
            visited[T][-1] = 0
            T = T.parent
        else:
            return
 
def is_visited(i, T, k, visited):
    d = visited.get(T, {})
    if -1 in d:
        return True
    m = d.get(i,-1)
    if k&gt;m:
        d[i] = k
        visited[T] = d
        return False
    return True
 
def fuzzy_match(S, ptree, k, i=0, visited = None, N = 0):
    '''
    Computes all strings T contained in ptree with a distance dist(T, S)&lt;=k.
    '''
    trees = set()
    # handles root node of a PrefixTree
    if ptree.char == '' and ptree.children:
        N = len(S)
        S+='\0'*(k+1)
        visited = {}
        for pt in ptree.children.values():
            trees.update(fuzzy_match(S, pt, k, i, visited, N))
        return trees
 
    # already tried
    if is_visited(i, ptree, k, visited):
        return []
 
    # can't match ...
    if k == -1 or (k == 0 and S[i] != ptree.char):
        return []
 
    if ptree.is_word and (N-i&lt;=k or (N-i&lt;=k+1 and ptree.char == S[i])):
        trees.add(ptree.trace())
        if not ptree.children:
            update_visited(ptree, visited)
            return trees
 
    if ptree.char!=S[i]:
        trees.update(fuzzy_match(S, ptree, k-1, i+1, visited, N))
 
    for c in ptree.children:
        if ptree.char == S[i]:
            trees.update(fuzzy_match(S, ptree.children[c], k, i+1, visited, N))
        else:
            trees.update(fuzzy_match(S, ptree.children[c], k-1, i+1, visited, N))
        trees.update(fuzzy_match(S, ptree.children[c], k-1, i, visited, N))
    return trees

Lazy PrefixTree construction

The major disadvantage of the construction is the time it takes upfront to create the PrefixTree. I checked it out for a wordlist of 158.989 entries and it took about 10 sec. With psyco activated it still takes 7.5 sec.

A few trivia for the curious. I reimplemented PrefixTree in VC++ using STL hash_map and got a worse result: 14 sec of execution time – about twice as much as Python + Psyco. The language designed with uncompromised performance characteristics in mind doesn’t cease to surprise me. Of course I feel bad because I haven’t build a specialized memory management for this function and so on. Java behaves better with 1.2 sec on average.

A possible solution for Python ( and C++ 😉 ) but also for Java, when wordlists grow ever bigger, is to create the PrefixTree only partially and let it grow when needed. So the load time gets balanced over several queries and a performance can be avoided.

Here is the modified code:

class PrefixTree(object):
    def __init__(self, char = '', parent = None):
        self.char      = char
        self.parent    = parent
        self.is_word   = False
        self._children = {}
        self._words    = set()
 
    def _get_children(self):
        if self._words:
            self._create_children()
        return self._children
 
    children = property(_get_children)
 
    def _create_children(self):
        for tree, word in self._words:
            tree.insert(word)
        self._words = set()
 
    def _tolist(self):
        if self.is_word:
            yield self.trace()
        for p in self.children.values():
            for s in p._tolist():
                yield s
 
    def __iter__(self):
        return self._tolist()
 
    def insert(self, value):
        if value:
            c = value[0]
            tree = self._children.get(c)
            if tree is None:
                tree = PrefixTree(c, self)
                self._children[c] = tree
            if len(value) == 1:
                tree.is_word = True
            tree._words.add((tree,value[1:]))
        else:
            self.is_word = True
 
    def __contains__(self, value):
        if value:
            if value in self._words:
                return True
            c = value[0]
            if c in self._children:
                return value[1:] in self._children[c]
            return False
        return True
 
    def __len__(self):
        if self.parent is not None:
            return len(self.parent)+1
        return 0
 
    def trace(self):
        if self.parent is not None:
            return self.parent.trace()+self.char
        return self.char

Some numbers

The numbers presented here should be taken with a grain of salt and not confused with a benchmark but still provide a quantitative profile which allows drawing conclusions and making decisions.

Load time of wordlist of size 158.989 into pt = PrefixTree(): 0.61 sec
Execution time of fuzzy_match("haus", pt, 1) - 1st run: 1.03 sec
Execution time of fuzzy_match("haus", pt, 1) - 2nd run: 0.03 sec
Execution time of fuzzy_match("haus", pt, 2) - 1st run: 1.95 sec
Execution time of fuzzy_match("haus", pt, 2) - 2nd run: 0.17 sec
Execution time of fuzzy_match("haus", pt, 3) - 1st run: 3.58 sec
Execution time of fuzzy_match("haus", pt, 3) - 2nd run: 0.87 sec
We see that the second run is always significantly faster because in the first run the PrefixTree gets partially built while in the second run the built nodes are just visited.

Finally here are the numbers using string variants:

Execution time of string_variants("haus", string.letters): 0.0 sec
Execution time of 2-iterated of string_variants("haus", string.letters): 0.28 sec
Execution time of 3-iterated of string_variants("haus", string.letters): 188.90 sec
The 0.0 seconds result simply means that for a single run it is below a threshold. The other results can possibly be improved by a factor of 2 using a less naive strategy to create string variants avoiding duplicates. The bottom line is that or k = 1 and k = 2 using PrefixTrees, Levenshtein automata and other sophisticated algorithms aren’t necessary and for k >=3 PrefixTree based approaches doesn’t run amok.

Code

The code for fuzzy_compare and fuzzy_match can be downloaded here. It also contains tests, some timing measurements and a German sample wordlist.