Fuzzy string matching

A damn hot algorithm

I found the following article written by Nick Johnson about the use of finite state machines for approximate  string matches i.e. string matches which are not exact but bound by a given edit distance. The algorithm is based on so called “Levenshtein automatons”. Those automatons are inspired by the construction of the Levenshtein matrix used for edit distance computations. For more information start reading the WP-article about the Levenshtein algorithm which provides sufficient background for Nicks article.

I downloaded the code from github and checked it out but was very stunned about the time it took for the automaton construction once strings grow big. It took almost 6 minutes on my 1.5 GHz notebook to construct the following Levenshtein automaton:

k = 6= "".join(str(s) for s in range(10)*k)
lev = levenshtein_automata(S, k).to_dfa()

The algorithm is advertised as a “damn cool algorithm” by the author but since the major effect on my notebook was producing heat I wonder if “cool” shouldn’t be replaced by “hot”?

In the following article I’m constructing an approximate string matching algorithm from scratch.

Recursive rules for approximate string matching

Let ‘`S` be a string with `len(S)=n` and `k` a positive number with `k`<=`n`. By “?” we denote a wildcard symbol that matches any character including no character ( expressing a contraction ). Since S has length `n` we can select arbitrary `k` indexes in the set `{0,…,n-1}` and substitute the characters of `S` at those indexes using a wildcard symbol. If for example (S = “food” , k = 1 and index = 2) we get “fo?d”.

We describe the rule which describes all possible character substitutions in “food” like this:

pattern(food, 1) = ?ood | f?od | fo?d  | foo?

Applying successive left factorings yields:

pattern(food, 1) = ?ood | f  ( ?od | o (?d  | o? ) )

This inspires a recursive notation which roughly looks like this:

pattern(food, 1) = ?ood | f pattern(ood, 1)

or more precisely:

pattern(c, 1) = ?
pattern(S, 1) = ?S[1:] | S[0] pattern(S[1:], 1)

where we have used a string variable S instead of the concrete string “food”.

When using an arbitrary `k` instead of a fixed k = 1 we get the following recursive equations:

pattern(c, k) = ?
pattern(S, k) = ?pattern(S[1:], k-1) | S[0] pattern(S[1:], k)

Consuming or not consuming?

When we try to find an efficient implementation for the `pattern` function described above we need an interpretation of the `?` wildcard action. It can consume a character and feed the rest of the string into a new call of `pattern` or skip a character and do the same with the rest. Since we cannot decide the choice for every string by default we eventually need backtracking but since we can memoize intermediate results we can also lower efforts. But step by step …

The basic idea when matching a string `S1` against a string `S2` is that we attempt to match `S1[0]` against `S2[0]` and when we succeed, we continue matching `S[1:]` against `S2[1:]` using the same constant `k`. If we fail, we have several choices depending on the interpretation of the wildcard action: it can consume a character of S2 or leave S2 as it is. Finally S1 and S2 are exchangeable, so we are left with the following choices:

fuzzy_compare(S1, S2[1:], k-1)
fuzzy_compare(S2, S1[1:], k-1)
fuzzy_compare(S1[1:], S2[1:], k-1)

All of those choices are valid and if one fails we need to check out another one. This is sufficient for starting a first implementation.

A first implementation

The following implementation is a good point to start with:

def fuzzy_compare(S1, S2, k, i=0, j=0):
    N1 = len(S1)
    N2 = len(S2)
    while True:
        if N1-i<=k and N2-j<=k:
            return True
        try:
            if S1[i] == S2[j]:
                i+=1
                j+=1
                continue
        except IndexError:
            return False
        if k == 0:
            return False
        else:
            if fuzzy_compare(S1, S2, k-1, i+1, j):
                return True
            if fuzzy_compare(S1, S2, k-1, i, j+1):
                return True
            if fuzzy_compare(S1, S2, k-1, i+1, j+1):
                return True
            return False

The algorithm employs full backtracking and it is also limited to medium sized strings ( in Python ) because of recursion. But it shows the central ideas and is simple.

A second implementation using memoization

Our second implementation still uses recursion but introduces a dictionary which records all `(i,j)` index pairs that have been encountered and stores the current value of `k`. If the algorithm finds a value `k’` at `(i,j)` with `k’`<=`k` it will immediately return `False` because this particular trace has been unsuccessfully checked out before. Using an` n x n` matrix to memoize results will reduce the complexity of the algorithm which becomes `O(n^2)` where n is the length of the string. In fact it will be even `O(n)` because only a stripe of width 2k along the diagonal of the (i,j)-matrix is checked out. Of course the effort depends on the constant k.

def is_visited(i, j, k, visited):
    m = visited.get((i,j),-1)
    if k<m:
        visited[(i,j)] = k
        return False
    return True
 
def fuzzy_compare(S1, S2, k, i = 0, j = 0, visited = None):
    N1 = len(S1)
    N2 = len(S2)
    if visited is None:
        visited = {}
    while True:
        if N1-i<=k and N2-j<=k:
            return True
        try:
            if S1[i] == S2[j]:
                visited[(i,j)] = k
                i+=1
                j+=1
                continue
        except IndexError:
            return False
        if k == 0:
            return False
        else:
            if not is_visited(i+1, j, k-1, visited):
                if fuzzy_compare(S1, S2, k-1, i+1, j, visited):
                    return True
            if not is_visited(i, j+1, k-1, visited):
                if fuzzy_compare(S1, S2, k-1, i, j+1, visited):
                    return True
            if not is_visited(i+1, j+1, k-1, visited):
                if fuzzy_compare(S1, S2, k-1, i+1, j+1, visited):
                    return True
            return False

A third implementation eliminating recursion

In our third and final implementation we eliminate the recursive call to `fuzzy_compare` and replace it using a stack containing tuples `(i, j, k)` recording the current state.

def is_visited(i, j, k, visited):
    m = visited.get((i,j),-1)
    if k<m:
        visited[(i,j)] = k
        return False
    return True
 
def fuzzy_compare(S1, S2, k):
    N1 = len(S1)
    N2 = len(S2)
    i = j = 0
    visited = {}
    stack = []
    while True:
        if N1-i<=k and N2-j<=k:
            return True
        try:
            if S1[i] == S2[j]:
                visited[(i,j)] = k
                i+=1
                j+=1
                continue
        except IndexError:
            if stack:
                i, j, k = stack.pop()
            else:
                return False
        if k == 0:
            if stack:
                i, j, k = stack.pop()
            else:
                return False
        else:
            if not is_visited(i+1, j, k-1, visited):
                stack.append((i,j,k))
                i, j, k = i+1, j, k-1
            elif not is_visited(i, j+1, k-1, visited):
                stack.append((i,j,k))
                i, j, k = i, j+1, k-1
            elif not is_visited(i+1, j+1, k-1, visited):
                i, j, k = i+1, j+1, k-1
            elif stack:                               
                i, j, k = stack.pop()
            else:
                return False

This is still a nice algorithm and it should be easy to translate it into C or into JavaScript for using it in your browser. Notice that the sequence of `if` … `elif` branches can impact performance of the algorithm. Do you see a way to improve it?

Checking the algorithm

When D is the Levenshtein distance between two strings S1 and S2 then `fuzzy_compare(S1, S2, k)` shall be `True` for `k`>`=D` and `False` otherwise. So when you want to test `fuzzy_compare` compute the Levenshtein distance and check `fuzzy_compare` with the boundary values `k = D` and `k = D-1`.

def levenshtein(s1, s2):
    l1 = len(s1)
    l2 = len(s2)
    matrix = [range(l1 + 1)] * (l2 + 1)
    for zz in range(l2 + 1):
      matrix[zz] = range(zz,zz + l1 + 1)
    for zz in range(0,l2):
      for sz in range(0,l1):
        if s1[sz] == s2[zz]:
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
                                   matrix[zz][sz+1] + 1,
                                   matrix[zz][sz])
        else:
          matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
                                   matrix[zz][sz+1] + 1,
                                   matrix[zz][sz] + 1)
    return matrix[l2][l1]

For exhaustive testing we define a set of strings as follows:

Given a prescribed n we define the set of strings of length = n which consists of “a” and “b” characters only. The number of those strings is 2^n. If we consider all pairs of strings in that set we get 2^(2n) of such pairs. Of course we could exploit symmetries to remove redundant pairs but in order to keep it simple we examine only small strings e.g. n = 6 which leads to 4096 pairs altogether.

def string_set(S = None, k = 0, strings = None, n = 6):
    if S is None:
        strings = []
        S = ["a"]*n
        strings.append("".join(S))
    for i in range(k, n):
        S1 = S[:]
        S1[i] = "b"
        strings.append("".join(S1))
        string_set(S1, i+1, strings, n)
    return strings
 
def string_pairs(n):
    L1 = string_set(n=n)
    pairs = []
    for i in range(len(L1)):
        for k in range(1, n+1):
            L2 = string_set(n=k)
            for j in range(len(L2)):
                pairs.append((L1[i],L2[j],levenshtein(L1[i], L2[j])))
                pairs.append((L2[j],L1[i],levenshtein(L2[j], L1[i])))
    return pairs

Our test function is short:

def test(n):
    for S1, S2, D in string_pairs(n):
        assert fuzzy_compare(S1, S2, D) == True, (S1, S2, D)
        assert fuzzy_compare(S1, S2, D-1) == False, (S1, S2, D-1)

Have much fun!

This entry was posted in Algorithms, Python. Bookmark the permalink.

4 Responses to Fuzzy string matching

  1. MH says:

    Another nice read: “A fast bit-vector algorithm for approximate string matching based on dynamic programming” by Gene Myers and banded versions thereof.

  2. oblio says:

    If you have the time, check this out: lots of fuzzy string matching algorithms implemented there, and it’s Open Source: http://staffwww.dcs.shef.ac.uk/people/S.Chapman/simmetrics.html

  3. Pingback: Fast and Easy Levenshtein distance using a Trie | Programming

  4. Pingback: Programming » Blog Archive » Fast and Easy Levenshtein distance using a Trie

Leave a Reply

Your email address will not be published. Required fields are marked *

*