## Fuzzy string matching

### A damn hot algorithm

I found the following article written by Nick Johnson about the use of finite state machines for approximate string matches i.e. string matches which are not exact but bound by a given edit distance. The algorithm is based on so called “Levenshtein automatons”. Those automatons are inspired by the construction of the Levenshtein matrix used for edit distance computations. For more information start reading the WP-article about the Levenshtein algorithm which provides sufficient background for Nicks article.

I downloaded the code from github and checked it out but was very stunned about the time it took for the automaton construction once strings grow big. It took almost 6 minutes on my 1.5 GHz notebook to construct the following Levenshtein automaton:

k = 6 S = "".join(str(s) for s in range(10)*k) lev = levenshtein_automata(S, k).to_dfa() |

The algorithm is advertised as a “damn cool algorithm” by the author but since the major effect on my notebook was producing heat I wonder if “cool” shouldn’t be replaced by “hot”?

In the following article I’m constructing an approximate string matching algorithm from scratch.

### Recursive rules for approximate string matching

Let ‘S be a string with len(S)=n and k a positive number with k<=n. By “?” we denote a wildcard symbol that matches any character including no character ( expressing a contraction ). Since S has length n we can select arbitrary k indexes in the set {0,…,n-1} and substitute the characters of S at those indexes using a wildcard symbol. If for example (S = “food” , k = 1 and index = 2) we get “fo?d”.

We describe the rule which describes all possible character substitutions in “food” like this:

pattern(food, 1) = ?ood | f?od | fo?d | foo?Applying successive left factorings yields:

pattern(food, 1) = ?ood | f ( ?od | o (?d | o? ) )This inspires a recursive notation which roughly looks like this:

pattern(food, 1) = ?ood | f pattern(ood, 1)or more precisely:

pattern(c, 1) = ? pattern(S, 1) = ?S[1:] | S[0] pattern(S[1:], 1)where we have used a string variable S instead of the concrete string “food”.

When using an arbitrary k instead of a fixed k = 1 we get the following recursive equations:

pattern(c, k) = ? pattern(S, k) = ?pattern(S[1:], k-1) | S[0] pattern(S[1:], k)

### Consuming or not consuming?

When we try to find an efficient implementation for the pattern function described above we need an interpretation of the ? wildcard action. It can consume a character and feed the rest of the string into a new call of pattern or skip a character and do the same with the rest. Since we cannot decide the choice for every string by default we eventually need backtracking but since we can memoize intermediate results we can also lower efforts. But step by step …

The basic idea when matching a string S1 against a string S2 is that we attempt to match S1[0] against S2[0] and when we succeed, we continue matching S[1:] against S2[1:] using the same constant k. If we fail, we have several choices depending on the interpretation of the wildcard action: it can consume a character of S2 or leave S2 as it is. Finally S1 and S2 are exchangeable, so we are left with the following choices:

fuzzy_compare(S1, S2[1:], k-1) fuzzy_compare(S2, S1[1:], k-1) fuzzy_compare(S1[1:], S2[1:], k-1) |

All of those choices are valid and if one fails we need to check out another one. This is sufficient for starting a first implementation.

### A first implementation

The following implementation is a good point to start with:

def fuzzy_compare(S1, S2, k, i=0, j=0): N1 = len(S1) N2 = len(S2) while True: if N1-i<=k and N2-j<=k: return True try: if S1[i] == S2[j]: i+=1 j+=1 continue except IndexError: return False if k == 0: return False else: if fuzzy_compare(S1, S2, k-1, i+1, j): return True if fuzzy_compare(S1, S2, k-1, i, j+1): return True if fuzzy_compare(S1, S2, k-1, i+1, j+1): return True return False |

The algorithm employs full backtracking and it is also limited to medium sized strings ( in Python ) because of recursion. But it shows the central ideas and is simple.

### A second implementation using memoization

Our second implementation still uses recursion but introduces a dictionary which records all (i,j) index pairs that have been encountered and stores the current value of k. If the algorithm finds a value k’ at (i,j) with k’<=k it will immediately return False because this particular trace has been unsuccessfully checked out before. Using ann x n matrix to memoize results will reduce the complexity of the algorithm which becomes O(n^2) where n is the length of the string. In fact it will be even O(n) because only a stripe of width 2k along the diagonal of the (i,j)-matrix is checked out. Of course the effort depends on the constant k.

def is_visited(i, j, k, visited): m = visited.get((i,j),-1) if k>m: visited[(i,j)] = k return False return True def fuzzy_compare(S1, S2, k, i = 0, j = 0, visited = None): N1 = len(S1) N2 = len(S2) if visited is None: visited = {} while True: if N1-i<=k and N2-j<=k: return True try: if S1[i] == S2[j]: visited[(i,j)] = k i+=1 j+=1 continue except IndexError: return False if k == 0: return False else: if not is_visited(i+1, j, k-1, visited): if fuzzy_compare(S1, S2, k-1, i+1, j, visited): return True if not is_visited(i, j+1, k-1, visited): if fuzzy_compare(S1, S2, k-1, i, j+1, visited): return True if not is_visited(i+1, j+1, k-1, visited): if fuzzy_compare(S1, S2, k-1, i+1, j+1, visited): return True return False |

### A third implementation eliminating recursion

In our third and final implementation we eliminate the recursive call to fuzzy_compare and replace it using a stack containing tuples (i, j, k) recording the current state.

def is_visited(i, j, k, visited): m = visited.get((i,j),-1) if k>m: visited[(i,j)] = k return False return True def fuzzy_compare(S1, S2, k): N1 = len(S1) N2 = len(S2) i = j = 0 visited = {} stack = [] while True: if N1-i<=k and N2-j<=k: return True try: if S1[i] == S2[j]: visited[(i,j)] = k i+=1 j+=1 continue except IndexError: if stack: i, j, k = stack.pop() else: return False if k == 0: if stack: i, j, k = stack.pop() else: return False else: if not is_visited(i+1, j, k-1, visited): stack.append((i,j,k)) i, j, k = i+1, j, k-1 elif not is_visited(i, j+1, k-1, visited): stack.append((i,j,k)) i, j, k = i, j+1, k-1 elif not is_visited(i+1, j+1, k-1, visited): i, j, k = i+1, j+1, k-1 elif stack: i, j, k = stack.pop() else: return False |

This is still a nice algorithm and it should be easy to translate it into C or into JavaScript for using it in your browser. Notice that the sequence of if … elif branches can impact performance of the algorithm. Do you see a way to improve it?

### Checking the algorithm

When D is the Levenshtein distance between two strings S1 and S2 then fuzzy_compare(S1, S2, k) shall be True for k>=D and False otherwise. So when you want to test fuzzy_compare compute the Levenshtein distance and check fuzzy_compare with the boundary values k = D and k = D-1.

def levenshtein(s1, s2): l1 = len(s1) l2 = len(s2) matrix = [range(l1 + 1)] * (l2 + 1) for zz in range(l2 + 1): matrix[zz] = range(zz,zz + l1 + 1) for zz in range(0,l2): for sz in range(0,l1): if s1[sz] == s2[zz]: matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz]) else: matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1) return matrix[l2][l1] |

For exhaustive testing we define a set of strings as follows:

Given a prescribed n we define the set of strings of length = n which consists of “a” and “b” characters only. The number of those strings is 2^n. If we consider all pairs of strings in that set we get 2^(2n) of such pairs. Of course we could exploit symmetries to remove redundant pairs but in order to keep it simple we examine only small strings e.g. n = 6 which leads to 4096 pairs altogether.

def string_set(S = None, k = 0, strings = None, n = 6): if S is None: strings = [] S = ["a"]*n strings.append("".join(S)) for i in range(k, n): S1 = S[:] S1[i] = "b" strings.append("".join(S1)) string_set(S1, i+1, strings, n) return strings def string_pairs(n): L1 = string_set(n=n) pairs = [] for i in range(len(L1)): for k in range(1, n+1): L2 = string_set(n=k) for j in range(len(L2)): pairs.append((L1[i],L2[j],levenshtein(L1[i], L2[j]))) pairs.append((L2[j],L1[i],levenshtein(L2[j], L1[i]))) return pairs |

Our test function is short:

def test(n): for S1, S2, D in string_pairs(n): assert fuzzy_compare(S1, S2, D) == True, (S1, S2, D) assert fuzzy_compare(S1, S2, D-1) == False, (S1, S2, D-1) |

Have much fun!

Another nice read: “A fast bit-vector algorithm for approximate string matching based on dynamic programming” by Gene Myers and banded versions thereof.

If you have the time, check this out: lots of fuzzy string matching algorithms implemented there, and it’s Open Source: http://staffwww.dcs.shef.ac.uk/people/S.Chapman/simmetrics.html

Please checkout the site: http://www.zhihuita.org, it provides a very efficient fuzzy string matching algorithm. It is written in C and compute similiarity based on edit distance.