## Archive for January, 2011

### Fuzzy string matching II – matching wordlists

Posted in Algorithms, Python on January 7th, 2011 by kay – 3 Comments

### Small misspellings

An anonymous programming reddit commenter wrote about my fuzzy string matching article:

A maximum edit distance of 2 or 3 is reasonable for most applications of edit distance. For example, cat and dog are only 3 edits away from each other and look nothing alike. Likewise, the original “damn cool algorithm” matches against sets of strings at the same time, where as the algorithms in the article all only compare two strings against each other.

This is a valid objection.

However, for the most common case, which is an edit distance of 1 you don’t need a Levenshtein automaton either. Here is the recipe:

Let a wordlist and an alphabet be given. An alphabet can be for example the attribute string.letters of the string module. For a string S all string variants of S with an edit distance <=1 over the alphabet can be computed as follows:

```def string_variants(S, alphabet): variants = set() for i in range(len(S)+1): variants.add(S[:i]+S[i+1:]) # delete char at i for c in alphabet: variants.add(S[:i]+c+S[i:]) # insert char at i variants.add(S[:i]+c+S[i+1:]) # subst char at i return variants```

The set of words that shall be matched is given by:

`set(load(wordlist)) &amp; string_variants(S, alphabet)`

The used alphabet can be directly extracted from the wordlist in preparation of the algorithm. So it is not that we are running into trouble when non ASCII characters come up.

When you want to build string variants of edit distance = 2, just take the result of string_variants and apply string_variants on it again.

The complexity of is

O((n*len(alphabet))^k)

where n is the string length and k the edit distance.

### Alternative Approaches

For k = 1 we are essentially done with the simple algorithm above. For k=2 and small strings the results are still very good using an iterative application of string_variants to determine for a given S all strings with edit-distance <=2 over an alphabet. So the most simple approaches probably serve you well in practise!

For k>2 and big alphabets we create word lists which are as large or larger than the wordlist we check against.The effort runs soon out of control. In the rest of the article we want to treat an approach which is fully general and doesn’t make specific assertions. It is overall not as efficient as more specialized solutions can be but it might be more interesting for sophisticated problems I can’t even imagine.

The basic idea is to organize our wordlist into an n-ary tree, the so called PrefixTree, and implement an algorithm which is variant of fuzzy_compare to match a string against this tree with a prescribed maximum edit distance of k for the words we extract from the tree during the match.

### Prefix Trees

For a set of words we can factor common prefixes. For example {aa, ab, ca} can be rewritten as {a[a,b], c[a]}. Systematic factoring yields an n-ary tree – we call it a PrefixTree. Leaf nodes and words do not correspond in a 1-1 relationship though, because a word can be a proper prefix of another word. This means that we have to tag PrefixTree nodes with an additional boolean is_word field.

```class PrefixTree(object): def __init__(self, char = '', parent = None): self.char = char self.parent = parent self.children = {} self.is_word = False   def _tolist(self): if self.is_word: yield self.trace() for p in self.children.values(): for s in p._tolist(): yield s   def __iter__(self): return self._tolist()   def insert(self, value): if value: c = value tree = self.children.get(c) if tree is None: tree = PrefixTree(c, self) self.children[c] = tree tree.insert(value[1:]) else: self.is_word = True   def __contains__(self, value): if value: c = value if c in self.children: return value[1:] in self.children[c] return False return True   def __len__(self): if self.parent is not None: return len(self.parent)+1 return 0   def trace(self): if self.parent is not None: return self.parent.trace()+self.char return self.char```

Reading a wordlist into a PrefixTree can be simply done like this:

```pt = PrefixTree() for word in wordlist: pt.insert(word)```

Before we criticise and modify the PrefixTree let us take a look at the matching algorithm.

### Matching the PrefixTree

The algorithm is inspired by our fuzzy_compare algorithm. It uses the same recursive structure and memoization as fuzzy_compare.

```def update_visited(ptree, visited): visited[ptree][-1] = 0 T = ptree.parent while T is not None and T.char!='': if len(T.children) == 1: visited[T][-1] = 0 T = T.parent else: return   def is_visited(i, T, k, visited): d = visited.get(T, {}) if -1 in d: return True m = d.get(i,-1) if k&gt;m: d[i] = k visited[T] = d return False return True   def fuzzy_match(S, ptree, k, i=0, visited = None, N = 0): ''' Computes all strings T contained in ptree with a distance dist(T, S)&lt;=k. ''' trees = set() # handles root node of a PrefixTree if ptree.char == '' and ptree.children: N = len(S) S+='\0'*(k+1) visited = {} for pt in ptree.children.values(): trees.update(fuzzy_match(S, pt, k, i, visited, N)) return trees   # already tried if is_visited(i, ptree, k, visited): return []   # can't match ... if k == -1 or (k == 0 and S[i] != ptree.char): return []   if ptree.is_word and (N-i&lt;=k or (N-i&lt;=k+1 and ptree.char == S[i])): trees.add(ptree.trace()) if not ptree.children: update_visited(ptree, visited) return trees   if ptree.char!=S[i]: trees.update(fuzzy_match(S, ptree, k-1, i+1, visited, N))   for c in ptree.children: if ptree.char == S[i]: trees.update(fuzzy_match(S, ptree.children[c], k, i+1, visited, N)) else: trees.update(fuzzy_match(S, ptree.children[c], k-1, i+1, visited, N)) trees.update(fuzzy_match(S, ptree.children[c], k-1, i, visited, N)) return trees```

### Lazy PrefixTree construction

The major disadvantage of the construction is the time it takes upfront to create the PrefixTree. I checked it out for a wordlist of 158.989 entries and it took about 10 sec. With psyco activated it still takes 7.5 sec.

A few trivia for the curious. I reimplemented PrefixTree in VC++ using STL hash_map and got a worse result: 14 sec of execution time – about twice as much as Python + Psyco. The language designed with uncompromised performance characteristics in mind doesn’t cease to surprise me. Of course I feel bad because I haven’t build a specialized memory management for this function and so on. Java behaves better with 1.2 sec on average.

A possible solution for Python ( and C++ 😉 ) but also for Java, when wordlists grow ever bigger, is to create the PrefixTree only partially and let it grow when needed. So the load time gets balanced over several queries and a performance can be avoided.

Here is the modified code:

```class PrefixTree(object): def __init__(self, char = '', parent = None): self.char = char self.parent = parent self.is_word = False self._children = {} self._words = set()   def _get_children(self): if self._words: self._create_children() return self._children   children = property(_get_children)   def _create_children(self): for tree, word in self._words: tree.insert(word) self._words = set()   def _tolist(self): if self.is_word: yield self.trace() for p in self.children.values(): for s in p._tolist(): yield s   def __iter__(self): return self._tolist()   def insert(self, value): if value: c = value tree = self._children.get(c) if tree is None: tree = PrefixTree(c, self) self._children[c] = tree if len(value) == 1: tree.is_word = True tree._words.add((tree,value[1:])) else: self.is_word = True   def __contains__(self, value): if value: if value in self._words: return True c = value if c in self._children: return value[1:] in self._children[c] return False return True   def __len__(self): if self.parent is not None: return len(self.parent)+1 return 0   def trace(self): if self.parent is not None: return self.parent.trace()+self.char return self.char```

### Some numbers

The numbers presented here should be taken with a grain of salt and not confused with a benchmark but still provide a quantitative profile which allows drawing conclusions and making decisions.

`Load time of wordlist of size 158.989 into pt = PrefixTree(): 0.61 sec`
`Execution time of fuzzy_match("haus", pt, 1) - 1st run: 1.03 sec`
`Execution time of fuzzy_match("haus", pt, 1) - 2nd run: 0.03 sec`
`Execution time of fuzzy_match("haus", pt, 2) - 1st run: 1.95 sec`
`Execution time of fuzzy_match("haus", pt, 2) - 2nd run: 0.17 sec`
`Execution time of fuzzy_match("haus", pt, 3) - 1st run: 3.58 sec`
`Execution time of fuzzy_match("haus", pt, 3) - 2nd run: 0.87 sec`
We see that the second run is always significantly faster because in the first run the PrefixTree gets partially built while in the second run the built nodes are just visited.

Finally here are the numbers using string variants:

`Execution time of string_variants("haus", string.letters): 0.0 sec`
`Execution time of 2-iterated of string_variants("haus", string.letters): 0.28 sec`
`Execution time of 3-iterated of string_variants("haus", string.letters): 188.90 sec`
The 0.0 seconds result simply means that for a single run it is below a threshold. The other results can possibly be improved by a factor of 2 using a less naive strategy to create string variants avoiding duplicates. The bottom line is that or k = 1 and k = 2 using PrefixTrees, Levenshtein automata and other sophisticated algorithms aren’t necessary and for k >=3 PrefixTree based approaches doesn’t run amok.

### Code

The code for fuzzy_compare and fuzzy_match can be downloaded here. It also contains tests, some timing measurements and a German sample wordlist.