{"id":1737,"date":"2011-01-07T09:20:28","date_gmt":"2011-01-07T08:20:28","guid":{"rendered":"http:\/\/fiber-space.de\/wordpress\/?p=1737"},"modified":"2022-10-12T12:50:09","modified_gmt":"2022-10-12T11:50:09","slug":"fuzzy-string-matching-ii-matching-wordlists","status":"publish","type":"post","link":"http:\/\/fiber-space.de\/wordpress\/2011\/01\/07\/fuzzy-string-matching-ii-matching-wordlists\/","title":{"rendered":"Fuzzy string matching II &#8211; matching wordlists"},"content":{"rendered":"<h3>Small misspellings<\/h3>\n<p>An anonymous programming reddit commenter wrote about my fuzzy string matching <a href=\"http:\/\/fiber-space.de\/wordpress\/?p=1579\">article<\/a>:<\/p>\n<blockquote><p>A maximum edit distance of 2 or 3 is reasonable for most applications of edit distance. For example, cat and dog are only 3 edits away from each other and look nothing alike. Likewise, the original &#8220;damn cool algorithm&#8221; matches against sets of strings at the same time, where as the algorithms in the article all only compare two strings against each other.<\/p><\/blockquote>\n<p>This is a valid objection.<\/p>\n<p>However, for the most common case, which is an edit distance of 1, you don&#8217;t need a Levenshtein automaton either. Here is the recipe:<\/p>\n<p>Let a `wordlist` and an `alphabet` be given. An alphabet can be, for example, the attribute `string.letters` of the `string` module. 
For a string S, all string variants of S with an edit distance &lt;=1 over the `alphabet` can be computed as follows:<\/p>\n<pre lang=\"python\">def string_variants(S, alphabet):\r\n    variants = set()\r\n    for i in range(len(S)+1):\r\n        variants.add(S[:i]+S[i+1:])       # delete char at i\r\n        for c in alphabet:\r\n            variants.add(S[:i]+c+S[i:])   # insert char at i\r\n            variants.add(S[:i]+c+S[i+1:]) # subst char at i\r\n    return variants<\/pre>\n<p>The set of words that shall be matched is given by:<\/p>\n<pre lang=\"python\">set(load(wordlist)) & string_variants(S, alphabet)<\/pre>\n<p>The alphabet can be extracted directly from the wordlist while preparing the algorithm, so we don&#8217;t run into trouble when non-ASCII characters come up.<\/p>\n<p>When you want to build string variants of edit distance = 2, just take the result of `string_variants` and apply `string_variants` to it again.<\/p>\n<p>The complexity is<\/p>\n<p>`O((n*len(alphabet))^k)`<\/p>\n<p>where `n` is the string length and `k` the edit distance.<\/p>\n<h3>Alternative Approaches<\/h3>\n<p>For k = 1 we are essentially done with the simple algorithm above. For k=2 and small strings the results are still very good using an iterative application of `string_variants` to determine for a given S all strings with edit distance &lt;=2 over an alphabet. So the simplest approaches probably serve you well in practice!<\/p>\n<p>For k&gt;2 and big alphabets we create word lists which are as large as or larger than the wordlist we check against. The effort soon runs out of control. In the rest of the article we want to treat an approach which is fully general and doesn&#8217;t make specific assumptions. 
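<\/p>\n<p>Before moving on, here is the k = 2 recipe spelled out end to end. This is only a sketch: the helper name `matches_within_2` and the tiny sample wordlist are made up for illustration, and `string.ascii_lowercase` stands in for `string.letters`, which no longer exists in Python 3:<\/p>\n<pre lang=\"python\">import string\r\n\r\ndef string_variants(S, alphabet):\r\n    # all strings with edit distance <=1 from S, as above\r\n    variants = set()\r\n    for i in range(len(S)+1):\r\n        variants.add(S[:i]+S[i+1:])       # delete char at i\r\n        for c in alphabet:\r\n            variants.add(S[:i]+c+S[i:])   # insert char at i\r\n            variants.add(S[:i]+c+S[i+1:]) # subst char at i\r\n    return variants\r\n\r\ndef matches_within_2(S, wordlist, alphabet):\r\n    # apply string_variants twice to cover edit distance <=2,\r\n    # then intersect with the wordlist\r\n    dist2 = set()\r\n    for v in string_variants(S, alphabet):\r\n        dist2.update(string_variants(v, alphabet))\r\n    return set(wordlist) & dist2\r\n\r\nwords = {\"house\", \"mouse\", \"haus\", \"hose\", \"horse\"}\r\nprint(matches_within_2(\"haus\", words, string.ascii_lowercase))\r\n# \"house\" and \"haus\" match; the other words are 3 edits away<\/pre>\n<p>But back to the fully general approach. 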
It is overall not as efficient as more specialized solutions can be, but it might be more interesting for sophisticated problems I can&#8217;t even imagine.<\/p>\n<p>The basic idea is to organize our wordlist into an n-ary tree, the so-called `PrefixTree`, and implement an algorithm which is a variant of `fuzzy_compare` to match a string against this tree with a prescribed maximum edit distance of k for the words we extract from the tree during the match.<\/p>\n<h3>Prefix Trees<\/h3>\n<p>For a set of words we can factor out common prefixes. For example {aa, ab, ca} can be rewritten as {a[a,b], c[a]}. Systematic factoring yields an n-ary tree &#8211; we call it a <em>PrefixTree<\/em>. Leaf nodes and words do not correspond in a 1-1 relationship though, because a word can be a proper prefix of another word. This means that we have to tag PrefixTree nodes with an additional boolean `is_word` field.<\/p>\n<pre lang=\"python\">class PrefixTree(object):\r\n    def __init__(self, char = '', parent = None):\r\n        self.char     = char\r\n        self.parent   = parent\r\n        self.children = {}\r\n        self.is_word  = False\r\n\r\n    def _tolist(self):\r\n        if self.is_word:\r\n            yield self.trace()\r\n        for p in self.children.values():\r\n            for s in p._tolist():\r\n                yield s\r\n\r\n    def __iter__(self):\r\n        return self._tolist()\r\n\r\n    def insert(self, value):\r\n        if value:\r\n            c = value[0]\r\n            tree = self.children.get(c)\r\n            if tree is None:\r\n                tree = PrefixTree(c, self)\r\n                self.children[c] = tree\r\n            tree.insert(value[1:])\r\n        else:\r\n            self.is_word = True\r\n\r\n    def __contains__(self, value):\r\n        if value:\r\n            c = value[0]\r\n            if c in self.children:\r\n                return value[1:] in self.children[c]\r\n            return False\r\n        return self.is_word # only complete words count, not mere prefixes\r\n\r\n    def 
__len__(self):\r\n        if self.parent is not None:\r\n            return len(self.parent)+1\r\n        return 0\r\n\r\n    def trace(self):\r\n        if self.parent is not None:\r\n            return self.parent.trace()+self.char\r\n        return self.char<\/pre>\n<p>Reading a wordlist into a `PrefixTree` can be done simply like this:<\/p>\n<pre lang=\"python\">pt = PrefixTree()\r\nfor word in wordlist:\r\n    pt.insert(word)<\/pre>\n<p>Before we criticise and modify the `PrefixTree`, let us take a look at the matching algorithm.<\/p>\n<h3>Matching the PrefixTree<\/h3>\n<p>The algorithm is inspired by our `fuzzy_compare` algorithm. It uses the same recursive structure and memoization as `fuzzy_compare`.<\/p>\n<pre lang=\"python\">def update_visited(ptree, visited):\r\n    visited[ptree][-1] = 0\r\n    T = ptree.parent\r\n    while T is not None and T.char!='':\r\n        if len(T.children) == 1:\r\n            visited[T][-1] = 0\r\n            T = T.parent\r\n        else:\r\n            return\r\n\r\ndef is_visited(i, T, k, visited):\r\n    d = visited.get(T, {})\r\n    if -1 in d:\r\n        return True\r\n    m = d.get(i,-1)\r\n    if k>m:\r\n        d[i] = k\r\n        visited[T] = d\r\n        return False\r\n    return True\r\n\r\ndef fuzzy_match(S, ptree, k, i=0, visited = None, N = 0):\r\n    '''\r\n    Computes all strings T contained in ptree with a distance dist(T, S)<=k.\r\n    '''\r\n    trees = set()\r\n    # handles root node of a PrefixTree\r\n    if ptree.char == '' and ptree.children:\r\n        N = len(S)\r\n        S+='\\0'*(k+1)\r\n        visited = {}\r\n        for pt in ptree.children.values():\r\n            trees.update(fuzzy_match(S, pt, k, i, visited, N))\r\n        return trees\r\n\r\n    # already tried\r\n    if is_visited(i, ptree, k, visited):\r\n        return []\r\n\r\n    # can't match ...\r\n    if k == -1 or (k == 0 and S[i] != ptree.char):\r\n        return []\r\n\r\n    if ptree.is_word and (N-i<=k or (N-i<=k+1 and 
ptree.char == S[i])):\r\n        trees.add(ptree.trace())\r\n        if not ptree.children:\r\n            update_visited(ptree, visited)\r\n            return trees\r\n\r\n    if ptree.char!=S[i]:\r\n        trees.update(fuzzy_match(S, ptree, k-1, i+1, visited, N))\r\n\r\n    for c in ptree.children:\r\n        if ptree.char == S[i]:\r\n            trees.update(fuzzy_match(S, ptree.children[c], k, i+1, visited, N))\r\n        else:\r\n            trees.update(fuzzy_match(S, ptree.children[c], k-1, i+1, visited, N))\r\n        trees.update(fuzzy_match(S, ptree.children[c], k-1, i, visited, N))\r\n    return trees<\/pre>\n<h3>Lazy PrefixTree construction<\/h3>\n<p>The major disadvantage of this construction is the time it takes upfront to create the PrefixTree. I checked it out for a wordlist of 158,989 entries and it took about 10 sec. With psyco activated it still takes 7.5 sec.<\/p>\n<p>A few trivia for the curious. I reimplemented PrefixTree in VC++ using STL `hash_map` and got a worse result: 14 sec of execution time - about twice as much as Python + Psyco. The language designed with uncompromising performance characteristics in mind doesn't cease to surprise me. Of course I feel bad because I haven't built specialized memory management for this function and so on. Java behaves better with 1.2 sec on average.<\/p>\n<p>A possible solution for Python (and C++) but also for Java, when wordlists grow ever bigger, is to create the PrefixTree only partially and let it grow when needed. 
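<\/p>\n<p>The deferral pattern can be sketched in isolation like this (the toy class `LazyNode` is made up for illustration and is much cruder than the real code below): each node merely records pending word suffixes and expands them into real child nodes on first access:<\/p>\n<pre lang=\"python\">class LazyNode(object):\r\n    def __init__(self):\r\n        self._children = {}    # materialized child nodes\r\n        self._pending  = set() # word suffixes not expanded yet\r\n\r\n    def add(self, word):\r\n        # defer all work: just remember the suffix\r\n        self._pending.add(word)\r\n\r\n    @property\r\n    def children(self):\r\n        # expand pending suffixes on first access only\r\n        while self._pending:\r\n            word = self._pending.pop()\r\n            child = self._children.setdefault(word[0], LazyNode())\r\n            if len(word) > 1:\r\n                child.add(word[1:])\r\n        return self._children\r\n\r\nnode = LazyNode()\r\nnode.add(\"haus\")\r\nnode.add(\"hose\")\r\nprint(len(node._children))   # 0 - nothing materialized yet\r\nprint(sorted(node.children)) # ['h'] - first access expands one level<\/pre>\n<p>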
So the load time gets spread over several queries and the upfront performance hit is avoided.<\/p>\n<p>Here is the modified code:<\/p>\n<pre lang=\"python\">class PrefixTree(object):\r\n    def __init__(self, char = '', parent = None):\r\n        self.char      = char\r\n        self.parent    = parent\r\n        self.is_word   = False\r\n        self._children = {}\r\n        self._words    = set()\r\n\r\n    def _get_children(self):\r\n        if self._words:\r\n            self._create_children()\r\n        return self._children\r\n\r\n    children = property(_get_children)\r\n\r\n    def _create_children(self):\r\n        for tree, word in self._words:\r\n            tree.insert(word)\r\n        self._words = set()\r\n\r\n    def _tolist(self):\r\n        if self.is_word:\r\n            yield self.trace()\r\n        for p in self.children.values():\r\n            for s in p._tolist():\r\n                yield s\r\n\r\n    def __iter__(self):\r\n        return self._tolist()\r\n\r\n    def insert(self, value):\r\n        if value:\r\n            c = value[0]\r\n            tree = self._children.get(c)\r\n            if tree is None:\r\n                tree = PrefixTree(c, self)\r\n                self._children[c] = tree\r\n            if len(value) == 1:\r\n                tree.is_word = True\r\n            tree._words.add((tree,value[1:]))\r\n        else:\r\n            self.is_word = True\r\n\r\n    def __contains__(self, value):\r\n        if value:\r\n            # pending words are stored as (node, suffix) pairs\r\n            if (self, value) in self._words:\r\n                return True\r\n            c = value[0]\r\n            if c in self._children:\r\n                return value[1:] in self._children[c]\r\n            return False\r\n        return self.is_word # only complete words count, not mere prefixes\r\n\r\n    def __len__(self):\r\n        if self.parent is not None:\r\n            return len(self.parent)+1\r\n        return 0\r\n\r\n    def trace(self):\r\n        if self.parent is not None:\r\n            return self.parent.trace()+self.char\r\n        return 
self.char<\/pre>\n<h3>Some numbers<\/h3>\n<p>The numbers presented here should be taken with a grain of salt and not confused with a benchmark, but they still provide a quantitative profile which allows drawing conclusions and making decisions.<\/p>\n<pre>Load time of wordlist of size 158,989 into pt = PrefixTree(): 0.61 sec<\/pre>\n<pre>Execution time of fuzzy_match(\"haus\", pt, 1) - 1st run: 1.03 sec<\/pre>\n<pre>Execution time of fuzzy_match(\"haus\", pt, 1) - 2nd run: 0.03 sec<\/pre>\n<pre>Execution time of fuzzy_match(\"haus\", pt, 2) - 1st run: 1.95 sec<\/pre>\n<pre>Execution time of fuzzy_match(\"haus\", pt, 2) - 2nd run: 0.17 sec<\/pre>\n<pre>Execution time of fuzzy_match(\"haus\", pt, 3) - 1st run: 3.58 sec<\/pre>\n<pre>Execution time of fuzzy_match(\"haus\", pt, 3) - 2nd run: 0.87 sec<\/pre>\n<p>We see that the second run is always significantly faster because in the first run the PrefixTree gets partially built, while in the second run the built nodes are just visited.<\/p>\n<p>Finally, here are the numbers using string variants:<\/p>\n<pre>Execution time of string_variants(\"haus\", string.letters): 0.0 sec<\/pre>\n<pre>Execution time of 2-iterated string_variants(\"haus\", string.letters): 0.28 sec<\/pre>\n<pre>Execution time of 3-iterated string_variants(\"haus\", string.letters): 188.90 sec<\/pre>\n<p>The 0.0 seconds result simply means that a single run is below the measurement threshold. The other results can possibly be improved by a factor of 2 using a less naive strategy for creating string variants that avoids duplicates. The bottom line is that for k = 1 and k = 2 PrefixTrees, Levenshtein automata and other sophisticated algorithms aren't necessary, and for `k&lt;=3` PrefixTree-based approaches don't run amok.<\/p>\n<h3>Code<\/h3>\n<p>The code for `fuzzy_compare` and `fuzzy_match` can be downloaded <a href=\"http:\/\/www.fiber-space.de\/fuzzystring\/fuzzystring.zip\">here<\/a>. 
It also contains tests, some timing measurements and a German sample wordlist.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Small misspellings An anonymous programming reddit commenter wrote about my fuzzy string matching article: A maximum edit distance of 2 or 3 is reasonable for most applications of edit distance. For example, cat and dog are only 3 edits away &hellip; <a href=\"http:\/\/fiber-space.de\/wordpress\/2011\/01\/07\/fuzzy-string-matching-ii-matching-wordlists\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[30,4],"tags":[],"_links":{"self":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/1737"}],"collection":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/comments?post=1737"}],"version-history":[{"count":20,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/1737\/revisions"}],"predecessor-version":[{"id":2327,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/1737\/revisions\/2327"}],"wp:attachment":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/media?parent=1737"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/categories?post=1737"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/tags?post=1737"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}