Efficient Fuzzy Matching at Word Level

I’ve just solved a tricky problem with what I think is quite an elegant solution, and thought it would be interesting to share it.

I’m building a system in which I have to process fault data. Sometimes this comes with a standard fault code (hallelujah!), but quite often it comes with the manufacturer’s own fault code and a description which may (or may not) be quite close to the description against one of the standard faults. If I can match the description up, I can treat the fault as standard.

The problem is that the description matching is not exact. Variations in punctuation are common, but the wording can also change so that, for example, “Evaporative emission system incorrect purge flow” in one system is “Evaporative emission control system incorrect purge flow” in another. To a human reader these are obviously the same fault, but the difference rules out simple exact matching.

I spent some time Googling fuzzy matching, but most of the available literature focuses on character or even bit-level matching and looks both complex and compute-intensive. Eventually, however, I found the Jaccard similarity coefficient. This is designed for establishing the “similarity” between two objects with similar lists of attributes, and I had a “lights on” moment: I could apply a similar algorithm to the sets of words used in the pair of descriptions.

The algorithm to calculate the coefficient for a given pair is actually very simple:

  1. Convert Text1 to a list of words/tokens, excluding spaces and punctuation. In VB.NET the String.Split() function does this very neatly, and you can specify exactly what counts as punctuation or white space. For simplicity it’s a good idea to convert both strings to uppercase first to eliminate capitalisation variations.
  2. Convert Text2 to a list of tokens on the same basis.
  3. For each token from Text1, see if it appears in the list of tokens from Text2. If so, increment a counter M.
  4. For each token from Text2, see if it appears in the list of tokens from Text1. If so, increment M again.
  5. Calculate the coefficient as M / (number of tokens from Text1 + number of tokens from Text2).
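The five steps above can be sketched in a few lines. This is a minimal illustration in Python rather than VB.NET, and the names `tokenize` and `similarity` are my own, not from any library. It assumes that anything which is not a letter or digit counts as a separator.

```python
import re


def tokenize(text):
    """Steps 1 and 2: upper-case the text and split it into word tokens.

    Assumption: any run of characters other than A-Z or 0-9 is a separator.
    """
    return [t for t in re.split(r"[^A-Z0-9]+", text.upper()) if t]


def similarity(text1, text2):
    """Return the word-level coefficient described in steps 3 to 5."""
    tokens1 = tokenize(text1)
    tokens2 = tokenize(text2)
    total = len(tokens1) + len(tokens2)
    if total == 0:
        return 0.0  # nothing to compare

    # Step 3: tokens from Text1 that also appear in Text2.
    m = sum(1 for t in tokens1 if t in tokens2)
    # Step 4: tokens from Text2 that also appear in Text1.
    m += sum(1 for t in tokens2 if t in tokens1)
    # Step 5: matches divided by the combined token count.
    return m / total


print(round(similarity(
    "Evaporative emission system incorrect purge flow",
    "Evaporative emission control system incorrect purge flow"), 3))  # 0.923
print(similarity("Fuel rail pressure low", "Fuel rail low pressure"))  # 1.0
```

For the evaporative emission example, all six tokens of the first description appear in the second and six of the second's seven tokens appear in the first, giving 12 / 13, or about 0.92.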

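This produces a very intuitive result: 1 if the token sets are an exact match, 0 if they are completely disjoint, and a proportionate value in between. (Strictly speaking, counting matches in both directions and dividing by the combined token count is closer to the Sørensen-Dice coefficient than to Jaccard’s intersection over union, but the end points are the same.) The process does, however, ignore transpositions of words, so that “Fuel rail pressure low” equates to “Fuel rail low pressure”. In my context this matches what a human assessor would do.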

Now I simply have to repeat steps 2-5 above for each standard error description, and pick the one which produces the highest coefficient. If the value is above about 80% I treat the string as “matched”, and I can quote the coefficient to give a feel for “how good” the match is.
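As a rough sketch of that lookup, again in Python: the `best_match` function, the `STANDARD_FAULTS` list and the exact 0.8 cut-off are illustrative assumptions, not the code or data from my system. The coefficient function is repeated in compact form so the example runs on its own.

```python
import re


def tokens(text):
    # Upper-case and split on anything that is not a letter or digit (assumed rule).
    return [t for t in re.split(r"[^A-Z0-9]+", text.upper()) if t]


def similarity(a, b):
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    # Count matches in both directions, then divide by the combined token count.
    m = sum(1 for t in ta if t in tb) + sum(1 for t in tb if t in ta)
    return m / (len(ta) + len(tb))


def best_match(description, standard_descriptions, threshold=0.8):
    """Return (best standard description or None, its coefficient)."""
    if not standard_descriptions:
        return None, 0.0
    best = max(standard_descriptions, key=lambda s: similarity(description, s))
    score = similarity(description, best)
    # Only accept the best candidate if it clears the threshold.
    return (best if score >= threshold else None), score


# Hypothetical standard list, for illustration only.
STANDARD_FAULTS = [
    "Evaporative emission system incorrect purge flow",
    "Evaporative emission system leak detected",
    "Fuel rail pressure low",
]

match, score = best_match(
    "Evaporative emission control system incorrect purge flow.", STANDARD_FAULTS)
print(match, round(score, 3))
```

A description that shares no words with any standard entry scores 0 and comes back as `None`, so it can be passed on for manual handling.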

Hopefully that’s useful.

Posted in Agile & Architecture, Code & Development

Positively On Fire…

Apologies, my first blog post of the New Year really should have wished you all the very best for 2015. Please accept this as a pseudo-first post, with said wishes. I also just wanted to post this shot from yesterday.

Monday, January 12, 2015 in Photography, Thoughts on the World

Monochrome, Sort Of…

I’m making use of my new Windows MacBook to catch up with photo processing, including a few shots from our trip to Barbados last year. One of the things I particularly love about the Caribbean is the splashes of colour

Wednesday, January 7, 2015 in Barbados, Photography, Travel

Google Bowls a Googly

Here’s a thing. Do a search for a restaurant, theatre or somewhere else you’d like to visit, using Google Chrome. Get a map using Google Maps, in Google Chrome. Print out a copy for reference – blank page! Copy the

Monday, December 29, 2014 in Thoughts on the World

More Panoramas!

The astute among you will have noticed that I place a random panorama in the masthead of all my web site pages. I’ve just refreshed my album with a number of new images, which I hope you’ll enjoy.

Wednesday, December 17, 2014 in Website & Blog

The Last Link in the Chain

Day 10 We start by driving back to the Linn Cove Viaduct, the last piece of the Parkway finally put in place in 1987. It’s a great feature in its own right, but there’s also some very colourful foliage on

Thursday, December 4, 2014 in Travel, USA 2014

The Experiment Continues

The MacBook Pro has arrived, and for a nearly four year old PC it’s in very good nick. There’s one unfortunate scratch on the top lid, but otherwise it’s very clean and works well. The 8GB RAM I switched out

Wednesday, December 3, 2014 in PCs/Laptops

Waterfall in the Rain

Day 9 We awake to something we haven’t seen so far this trip – rain. Fortunately I’m a great believer that bad weather makes good photographs, so hopefully we’ll still enjoy the day. We get back on the Parkway and

Monday, November 24, 2014 in Travel, USA 2014

In the Blue Ridged Mountains…

Day 8 North Carolina. Lattes! Sparkling mineral water!! Vegetables!!! We drive through the Great Smoky Mountains National Park and join the Blue Ridge Parkway. This was one of FDR’s great public works initiatives in the 1930s. Running along the ridge

An Unloved Park?

Day 7 A slightly frustrating day. Great Smoky Mountains National Park is the most visited of the American parks, but in some ways it feels like the least loved. We are shocked by fresh graffiti in what may well still

An Experiment

Readers of longer standing may remember the agonising when I had to replace my 2009 Toshiba laptop, as there had been yet another shift in screen aspect ratio standards, and in order to preserve a decent vertical screen size, I

Sunday, November 23, 2014 in PCs/Laptops, Thoughts on the World

A Tide In The Affairs Of Men

Observations on the Inaudible, Incomprehensible and Impossible “Interstellar”

Sunday, November 9, 2014 in Reviews