{"id":1579,"date":"2010-12-21T06:55:03","date_gmt":"2010-12-21T05:55:03","guid":{"rendered":"http:\/\/fiber-space.de\/wordpress\/?p=1579"},"modified":"2022-10-12T12:55:22","modified_gmt":"2022-10-12T11:55:22","slug":"fuzzy-string-matching-and-grammars","status":"publish","type":"post","link":"http:\/\/fiber-space.de\/wordpress\/2010\/12\/21\/fuzzy-string-matching-and-grammars\/","title":{"rendered":"Fuzzy string matching"},"content":{"rendered":"<h3>A damn hot algorithm<\/h3>\n<p>I found the following <a href=\"http:\/\/blog.notdot.net\/2010\/07\/Damn-Cool-Algorithms-Levenshtein-Automata\">article<\/a> written by Nick Johnson about the use of finite state machines for approximate string matching, i.e. string matches which are not exact but bounded by a given edit distance. The algorithm is based on so-called &#8220;Levenshtein automata&#8221;. Those automata are inspired by the construction of the Levenshtein matrix used for edit distance computations. For more information start with the Wikipedia article about the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\">Levenshtein algorithm<\/a>, which provides sufficient background for Nick&#8217;s article.<\/p>\n<p>I downloaded the code from GitHub and tried it out, but was stunned by the time the automaton construction takes once strings grow big. 
It took almost 6 minutes on my 1.5 GHz notebook to construct the following Levenshtein automaton:<\/p>\n<pre lang=\"python\">k = 6\r\nS = \"\".join(str(s) for s in range(10)) * k\r\nlev = levenshtein_automata(S, k).to_dfa()<\/pre>\n<p>The algorithm is advertised as a &#8220;damn cool algorithm&#8221; by the author, but since the major effect on my notebook was producing heat, I wonder if &#8220;cool&#8221; shouldn&#8217;t be replaced by &#8220;hot&#8221;?<\/p>\n<p>In the following article I&#8217;m constructing an approximate string matching algorithm from scratch.<\/p>\n<h3>Recursive rules for approximate string matching<\/h3>\n<p>Let `S` be a string with `len(S)=n` and `k` a positive number with `k`&lt;=`n`. By &#8220;?&#8221; we denote a wildcard symbol that matches any single character or no character at all (expressing a contraction). Since `S` has length `n` we can select arbitrary `k` indexes in the set `{0,&#8230;,n-1}` and substitute the characters of `S` at those indexes by the wildcard symbol. For example, with S = &#8220;food&#8221;, k = 1 and index = 2 we get &#8220;fo?d&#8221;.<\/p>\n<p>The rule that describes all possible single-character substitutions in &#8220;food&#8221; looks like this:<\/p>\n<pre>pattern(food, 1) = ?ood | f?od | fo?d | foo?<\/pre>\n<p>Applying successive left factorings yields:<\/p>\n<pre>pattern(food, 1) = ?ood | f ( ?od | o ( ?d | o? 
) )<\/pre>\n<p>This inspires a recursive notation which roughly looks like this:<\/p>\n<pre>pattern(food, 1) = ?ood | f pattern(ood, 1)<\/pre>\n<p>or more precisely:<\/p>\n<pre>pattern(c, 1) = ?\r\npattern(S, 1) = ?S[1:] | S[0] pattern(S[1:], 1)<\/pre>\n<p>where we have used a string variable `S` instead of the concrete string &#8220;food&#8221; and `c` denotes a single character.<\/p>\n<p>Using an arbitrary `k` instead of the fixed k = 1 we get the following recursive equations:<\/p>\n<pre>pattern(c, k) = ?\r\npattern(S, k) = ?pattern(S[1:], k-1) | S[0] pattern(S[1:], k)<\/pre>\n<h3>Consuming or not consuming?<\/h3>\n<p>When we look for an efficient implementation of the `pattern` function described above we need an interpretation of the `?` wildcard action. It can consume a character and feed the rest of the string into a new call of `pattern`, or it can skip a character and do the same with the rest. Since we cannot decide this choice in advance for every string we eventually need backtracking, but since we can memoize intermediate results we can also reduce the effort. But step by step &#8230;<\/p>\n<p>The basic idea when matching a string `S1` against a string `S2` is that we attempt to match `S1[0]` against `S2[0]` and, when we succeed, continue matching `S1[1:]` against `S2[1:]` using the same constant `k`. If we fail, we have several choices depending on the interpretation of the wildcard action: it can consume a character of S2 or leave S2 as it is. Finally S1 and S2 are exchangeable, so we are left with the following choices:<\/p>\n<pre lang=\"python\">fuzzy_compare(S1, S2[1:], k-1)\r\nfuzzy_compare(S2, S1[1:], k-1)\r\nfuzzy_compare(S1[1:], S2[1:], k-1)<\/pre>\n<p>All of those choices are valid and if one fails we need to check out another one. 
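As a side note, the recursive `pattern` equations can also be executed directly, for instance by translating them into a regular expression in which the wildcard `?` becomes the regex fragment `.?` (one arbitrary character or none). This is only an illustrative sketch of the rule above, not part of the algorithm developed below; it covers substitutions and contractions exactly as the rule does, but no insertions:

```python
import re

def pattern(S, k):
    # pattern(c, k) = ?
    # pattern(S, k) = ? pattern(S[1:], k-1) | S[0] pattern(S[1:], k)
    if k == 0:
        return re.escape(S)          # no wildcard budget left: match exactly
    if len(S) == 1:
        return ".?"                  # base case: a single char under the wildcard
    return "(?:.?{}|{}{})".format(
        pattern(S[1:], k - 1),       # '?' replaces (or drops) S[0]
        re.escape(S[0]),             # or S[0] matches literally ...
        pattern(S[1:], k))           # ... and the budget k is carried along

assert re.fullmatch(pattern("food", 1), "fxod")      # substitution
assert re.fullmatch(pattern("food", 1), "fod")       # contraction
assert not re.fullmatch(pattern("food", 1), "fxxd")  # two edits, k = 1
```

Such a regex grows quickly with `k`, so it is only useful for small patterns.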
This is sufficient for starting a first implementation.<\/p>\n<h3>A first implementation<\/h3>\n<p>The following implementation is a good point to start with:<\/p>\n<pre lang=\"python\">def fuzzy_compare(S1, S2, k, i=0, j=0):\r\n    N1 = len(S1)\r\n    N2 = len(S2)\r\n    while True:\r\n        # the remaining characters can be absorbed by the budget k\r\n        if N1-i<=k and N2-j<=k:\r\n            return True\r\n        try:\r\n            if S1[i] == S2[j]:\r\n                i+=1\r\n                j+=1\r\n                continue\r\n        except IndexError:\r\n            return False\r\n        if k == 0:\r\n            return False\r\n        else:\r\n            # skip a character of S1, of S2 or of both (substitution)\r\n            if fuzzy_compare(S1, S2, k-1, i+1, j):\r\n                return True\r\n            if fuzzy_compare(S1, S2, k-1, i, j+1):\r\n                return True\r\n            if fuzzy_compare(S1, S2, k-1, i+1, j+1):\r\n                return True\r\n            return False<\/pre>\n<p>The algorithm employs full backtracking and, because of the recursion, it is limited to medium-sized strings (in Python). But it shows the central ideas and is simple.<\/p>\n<h3>A second implementation using memoization<\/h3>\n<p>Our second implementation still uses recursion but introduces a dictionary which records, for each `(i,j)` index pair that has been encountered, the largest value of `k` seen there. If the algorithm finds a value `k'` at `(i,j)` with `k'`&gt;=`k` it gives up on this branch immediately, because this particular trace has already been checked out unsuccessfully with at least the same edit budget. Memoizing results in an `n x n` matrix reduces the complexity of the algorithm to `O(n^2)` where `n` is the length of the string. In fact it is even `O(k*n)` because only a stripe of width `2k+1` along the diagonal of the `(i,j)`-matrix is checked out. 
Of course the effort depends on the constant k.<\/p>\n<pre lang=\"python\">def is_visited(i, j, k, visited):\r\n    # True if (i,j) has already been explored with a budget of at least k\r\n    m = visited.get((i,j),-1)\r\n    if k>m:\r\n        visited[(i,j)] = k\r\n        return False\r\n    return True\r\n\r\ndef fuzzy_compare(S1, S2, k, i = 0, j = 0, visited = None):\r\n    N1 = len(S1)\r\n    N2 = len(S2)\r\n    if visited is None:\r\n        visited = {}\r\n    while True:\r\n        if N1-i<=k and N2-j<=k:\r\n            return True\r\n        try:\r\n            if S1[i] == S2[j]:\r\n                visited[(i,j)] = k\r\n                i+=1\r\n                j+=1\r\n                continue\r\n        except IndexError:\r\n            return False\r\n        if k == 0:\r\n            return False\r\n        else:\r\n            if not is_visited(i+1, j, k-1, visited):\r\n                if fuzzy_compare(S1, S2, k-1, i+1, j, visited):\r\n                    return True\r\n            if not is_visited(i, j+1, k-1, visited):\r\n                if fuzzy_compare(S1, S2, k-1, i, j+1, visited):\r\n                    return True\r\n            if not is_visited(i+1, j+1, k-1, visited):\r\n                if fuzzy_compare(S1, S2, k-1, i+1, j+1, visited):\r\n                    return True\r\n            return False<\/pre>\n<h3>A third implementation eliminating recursion<\/h3>\n<p>In our third and final implementation we eliminate the recursive call to `fuzzy_compare` and replace it by a stack of tuples `(i, j, k)` recording the current state.<\/p>\n<pre lang=\"python\">def is_visited(i, j, k, visited):\r\n    # True if (i,j) has already been explored with a budget of at least k\r\n    m = visited.get((i,j),-1)\r\n    if k>m:\r\n        visited[(i,j)] = k\r\n        return False\r\n    return True\r\n\r\ndef fuzzy_compare(S1, S2, k):\r\n    N1 = len(S1)\r\n    N2 = len(S2)\r\n    i = j = 0\r\n    visited = {}\r\n    stack = []\r\n    while True:\r\n        if N1-i<=k and N2-j<=k:\r\n            return True\r\n        try:\r\n            if S1[i] == S2[j]:\r\n                visited[(i,j)] = 
k\r\n                i+=1\r\n                j+=1\r\n                continue\r\n        except IndexError:\r\n            if stack:\r\n                i, j, k = stack.pop()\r\n            else:\r\n                return False\r\n        if k == 0:\r\n            if stack:\r\n                i, j, k = stack.pop()\r\n            else:\r\n                return False\r\n        else:\r\n            if not is_visited(i+1, j, k-1, visited):\r\n                stack.append((i,j,k))\r\n                i, j, k = i+1, j, k-1\r\n            elif not is_visited(i, j+1, k-1, visited):\r\n                stack.append((i,j,k))\r\n                i, j, k = i, j+1, k-1\r\n            elif not is_visited(i+1, j+1, k-1, visited):\r\n                i, j, k = i+1, j+1, k-1\r\n            elif stack:\r\n                i, j, k = stack.pop()\r\n            else:\r\n                return False<\/pre>\n<p>This is still a nice algorithm, and it should be easy to translate into C or into JavaScript to use it in the browser. Notice that the order of the `if` ... `elif` branches can impact the performance of the algorithm. Do you see a way to improve it?<\/p>\n<h3>Checking the algorithm<\/h3>\n<p>When `D` is the Levenshtein distance between two strings `S1` and `S2`, then `fuzzy_compare(S1, S2, k)` shall be `True` for `k`&gt;=`D` and `False` otherwise. 
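This boundary behaviour can be demonstrated in isolation. The following is a compact, self-contained restatement of the memoized variant from above (condensed for a standalone check, not new code), tried on the classic example "kitten" \/ "sitting", whose Levenshtein distance is 3:

```python
def is_visited(i, j, k, visited):
    # True if (i, j) has already been explored with a budget of at least k
    m = visited.get((i, j), -1)
    if k > m:
        visited[(i, j)] = k
        return False
    return True

def fuzzy_compare(S1, S2, k, i=0, j=0, visited=None):
    N1, N2 = len(S1), len(S2)
    if visited is None:
        visited = {}
    while True:
        if N1 - i <= k and N2 - j <= k:
            return True                  # the rest fits into the remaining budget
        try:
            if S1[i] == S2[j]:
                visited[(i, j)] = k
                i += 1
                j += 1
                continue
        except IndexError:
            return False
        if k == 0:
            return False
        # skip a character of S1, of S2 or of both (substitution)
        for di, dj in ((1, 0), (0, 1), (1, 1)):
            if not is_visited(i + di, j + dj, k - 1, visited):
                if fuzzy_compare(S1, S2, k - 1, i + di, j + dj, visited):
                    return True
        return False

assert fuzzy_compare("kitten", "sitting", 3)      # k = D
assert not fuzzy_compare("kitten", "sitting", 2)  # k = D-1
```

The two assertions are exactly the boundary values `k = D` and `k = D-1` used for testing below.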
So when you want to test `fuzzy_compare` compute the Levenshtein distance and check `fuzzy_compare` with the boundary values `k = D` and `k = D-1`.<\/p>\n<pre lang=\"python\">def levenshtein(s1, s2):\r\n    l1 = len(s1)\r\n    l2 = len(s2)\r\n    matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]\r\n    for zz in range(l2):\r\n        for sz in range(l1):\r\n            if s1[sz] == s2[zz]:\r\n                matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,\r\n                                         matrix[zz][sz+1] + 1,\r\n                                         matrix[zz][sz])\r\n            else:\r\n                matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,\r\n                                         matrix[zz][sz+1] + 1,\r\n                                         matrix[zz][sz] + 1)\r\n    return matrix[l2][l1]<\/pre>\n<p>For exhaustive testing we define a set of strings as follows:<\/p>\n<p>Given a prescribed `n` we define the set of all strings of length `n` which consist of \"a\" and \"b\" characters only. The number of those strings is 2^n. If we consider all pairs of strings in that set we get 2^(2n) such pairs. Of course we could exploit symmetries to remove redundant pairs, but in order to keep it simple we examine only small strings, e.g. 
n = 6, which leads to 4096 pairs altogether.<\/p>\n<pre lang=\"python\">def string_set(S = None, k = 0, strings = None, n = 6):\r\n    if S is None:\r\n        strings = []\r\n        S = [\"a\"]*n\r\n        strings.append(\"\".join(S))\r\n    for i in range(k, n):\r\n        S1 = S[:]\r\n        S1[i] = \"b\"\r\n        strings.append(\"\".join(S1))\r\n        string_set(S1, i+1, strings, n)\r\n    return strings\r\n\r\ndef string_pairs(n):\r\n    L1 = string_set(n=n)\r\n    pairs = []\r\n    for i in range(len(L1)):\r\n        for k in range(1, n+1):\r\n            L2 = string_set(n=k)\r\n            for j in range(len(L2)):\r\n                pairs.append((L1[i], L2[j], levenshtein(L1[i], L2[j])))\r\n                pairs.append((L2[j], L1[i], levenshtein(L2[j], L1[i])))\r\n    return pairs<\/pre>\n<p>Note that `string_pairs` additionally pairs the length-`n` strings with all shorter strings, so strings of different lengths are tested as well. Our test function is short:<\/p>\n<pre lang=\"python\">def test(n):\r\n    for S1, S2, D in string_pairs(n):\r\n        assert fuzzy_compare(S1, S2, D), (S1, S2, D)\r\n        assert not fuzzy_compare(S1, S2, D-1), (S1, S2, D-1)<\/pre>\n<p>Have much fun!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A damn hot algorithm I found the following article written by Nick Johnson about the use of finite state machines for approximate string matching, i.e. string matches which are not exact but bounded by a given edit distance. 
The algorithm &hellip; <a href=\"http:\/\/fiber-space.de\/wordpress\/2010\/12\/21\/fuzzy-string-matching-and-grammars\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[30,4],"tags":[],"_links":{"self":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/1579"}],"collection":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/comments?post=1579"}],"version-history":[{"count":16,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/1579\/revisions"}],"predecessor-version":[{"id":2329,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/1579\/revisions\/2329"}],"wp:attachment":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/media?parent=1579"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/categories?post=1579"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/tags?post=1579"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}