Efficient Fuzzy Matching at Word Level

I’ve just solved a tricky problem with what I think is quite an elegant solution, and thought it would be interesting to share it.

I’m building a system in which I have to process fault data. Sometimes this comes with a standard fault code (hallelujah!), but quite often it comes with the manufacturer’s own fault code and a description which may (or may not) be quite close to the description against one of the standard faults. If I can match the description up, I can treat the fault as standard.

The problem is that the description matching is not exact. Variations in punctuation are common, but the wording can also change so that, for example, “Evaporative emission system incorrect purge flow” in one system is “Evaporative emission control system incorrect purge flow” in another. To a human reader this is fine, but eliminates simplistic exact matching.

I spent some time Googling fuzzy matching, but most of the available literature focuses on character or even bit-level matching and looks both complex and compute-intensive. However finally I found the Jaccard similarity coefficient. This is designed for establishing the “similarity” between two objects with similar lists of attributes, and I had a “lights on” moment and realised I could apply a similar algorithm, but to the set of words used in the pair of descriptions.

The algorithm to calculate the coefficient for a given pair is actually very simple:

Convert Text1 to a list of words/tokens, excluding spaces and punctuation. In VB.NET the string.split() function does this very neatly and you can specify exactly what counts as punctuation or white space. For simplicity it’s a good idea to convert both strings to uppercase to eliminate capitalisation variations.
Convert Text2 to a list of tokens on the same basis.
For each token from Text1, see if it appears in the list of tokens from Text2. If so, increment a counter M
For each token from Text2, see if it appears in the list of tokens from Text1. If so, increment M
Calculate the coefficient as M / (total number of tokens from both lists)

This produces a very intuitive result: 1 if the token sets are an exact match, 0 if they are completely disjoint, and a linearly varying value between. The process does, however, ignore transpositions, so that “Fuel rail pressure low” equates to “Fuel rail low pressure”. In my context this matches what a human assessor would do.

Now I simply have to repeat steps 2-5 above for each standard error description, and pick the one which produces the highest coefficient. If the value is below about 80% I treat the string as “matched”, and I can quote the coefficient to give a feel for “how good” the match is.

Hopefully that’s useful.