Category Archives: Code & Development

Efficient Fuzzy Matching at Word Level

I’ve just solved a tricky problem with what I think is quite an elegant solution, and thought it would be interesting to share it.

I’m building a system in which I have to process fault data. Sometimes this comes with a standard fault code (hallelujah!), but quite often it comes with the manufacturer’s own fault code and a description which may (or may not) be quite close to the description against one of the standard faults. If I can match the description up, I can treat the fault as standard.

The problem is that the description matching is not exact. Variations in punctuation are common, but the wording can also change so that, for example, “Evaporative emission system incorrect purge flow” in one system is “Evaporative emission control system incorrect purge flow” in another. To a human reader this is fine, but it rules out simple exact matching.

I spent some time Googling fuzzy matching, but most of the available literature focuses on character or even bit-level matching and looks both complex and compute-intensive. Finally, however, I found the Jaccard similarity coefficient. This is designed for establishing the “similarity” between two objects with similar lists of attributes, and I had a “lights on” moment and realised I could apply a similar algorithm to the set of words used in the pair of descriptions.

The algorithm to calculate the coefficient for a given pair is actually very simple:

  1. Convert Text1 to a list of words/tokens, excluding spaces and punctuation. In VB.NET the String.Split() function does this very neatly, and you can specify exactly which characters count as punctuation or white space (pass StringSplitOptions.RemoveEmptyEntries so that adjacent separators don’t produce empty tokens). For simplicity it’s a good idea to convert both strings to uppercase to eliminate capitalisation variations.
  2. Convert Text2 to a list of tokens on the same basis.
  3. For each token from Text1, see if it appears in the list of tokens from Text2. If so, increment a counter M.
  4. For each token from Text2, see if it appears in the list of tokens from Text1. If so, increment M.
  5. Calculate the coefficient as M / (total number of tokens from both lists)
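
The five steps above can be sketched in a few lines. This is a minimal illustration in Python rather than the VB.NET of the original system, and the names `tokenize` and `word_similarity` are my own, not taken from the real code. It also assumes that a "word" is any run of letters or digits. Strictly speaking, counting matches in both directions and dividing by the combined token count gives a value closer to the Sørensen–Dice coefficient than to the textbook Jaccard one, but it behaves as described: 1 for identical token sets, 0 for disjoint ones.

```python
import re


def tokenize(text):
    """Steps 1 and 2: uppercase the text and split it into word tokens.

    Assumption: a token is any run of letters or digits, so spaces and
    punctuation are discarded.
    """
    return re.findall(r"[A-Z0-9]+", text.upper())


def word_similarity(text1, text2):
    """Return a word-level similarity score between 0.0 and 1.0."""
    tokens1 = tokenize(text1)
    tokens2 = tokenize(text2)
    total = len(tokens1) + len(tokens2)
    if total == 0:
        return 0.0  # nothing to compare
    set1, set2 = set(tokens1), set(tokens2)
    # Step 3: tokens from Text1 that also appear in Text2.
    m = sum(1 for token in tokens1 if token in set2)
    # Step 4: tokens from Text2 that also appear in Text1.
    m += sum(1 for token in tokens2 if token in set1)
    # Step 5: matches divided by the total number of tokens in both lists.
    return m / total


score = word_similarity(
    "Evaporative emission system incorrect purge flow",
    "Evaporative emission control system incorrect purge flow",
)
print(round(score, 3))  # 12 matches out of 13 tokens
```

For the two example descriptions, all six words of the first appear in the second and six of the seven words of the second appear in the first, so the score is 12/13, or about 0.92.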

This produces a very intuitive result: 1 if the token sets are an exact match, 0 if they are completely disjoint, and a linearly varying value between. The process does, however, ignore transpositions, so that “Fuel rail pressure low” equates to “Fuel rail low pressure”. In my context this matches what a human assessor would do.

Now I simply have to repeat steps 2-5 above for each standard fault description, and pick the one which produces the highest coefficient. If the value is above about 80% I treat the string as “matched”, and I can quote the coefficient to give a feel for “how good” the match is.
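
The lookup itself might look something like the sketch below, again in Python. The function name `best_match`, the dictionary of standard faults and the fault codes in it are all hypothetical, made up for illustration. The 0.8 cut-off is the "about 80%" figure, applied on the reading that a higher coefficient means a closer match.

```python
import re


def word_similarity(text1, text2):
    """Word-level similarity: matches in both directions / total tokens."""
    tokens1 = re.findall(r"[A-Z0-9]+", text1.upper())
    tokens2 = re.findall(r"[A-Z0-9]+", text2.upper())
    total = len(tokens1) + len(tokens2)
    if total == 0:
        return 0.0
    m = sum(1 for token in tokens1 if token in tokens2)
    m += sum(1 for token in tokens2 if token in tokens1)
    return m / total


def best_match(description, standard_faults, threshold=0.8):
    """Return (code, score) for the closest standard fault.

    The code is None when even the best candidate scores below the
    threshold, so the caller can fall back to treating the fault as
    non-standard.
    """
    best_code, best_score = None, 0.0
    for code, standard_description in standard_faults.items():
        score = word_similarity(description, standard_description)
        if score > best_score:
            best_code, best_score = code, score
    if best_score < threshold:
        return None, best_score
    return best_code, best_score


# Hypothetical lookup table: these codes are invented for the example.
standard_faults = {
    "STD-001": "Evaporative emission system incorrect purge flow",
    "STD-002": "Fuel rail pressure low",
}

code, score = best_match(
    "Evaporative emission control system incorrect purge flow",
    standard_faults,
)
print(code, round(score, 3))
```

Returning the score alongside the code keeps the "how good is the match" figure available for reporting, even when the match is rejected.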

Hopefully that’s useful.

Posted in Agile & Architecture, Code & Development | 1 Comment

Caught by The Law!

Don’t get too excited. Those of you hoping to see me carted off in manacles and an orange jumpsuit will be sadly disappointed… No, the law to which I refer is Moore’s Law, which states effectively, if you need reminding, Continue reading

Friday, July 18, 2014 in Agile & Architecture, Code & Development, PCs/Laptops, Thoughts on the World

Webkit, KitKat and Deadlocks!

I don’t know what provision Dante Alighieri made, but I’m hoping there’s a special corner of Hell reserved for paedophiles, mass murderers and so-called engineers from big software companies who think there might ever be a justification for breaking backwards Continue reading

Tuesday, June 17, 2014 in Agile & Architecture, Android, Code & Development, Thoughts on the World

My First Android App: Stash-It!

After a couple of months of busy early morning and late night programming, my first Android app has finally been released. Please meet Stash-It! Stash-It! responds to an odd side-effect of the difference between the iOS and Android security models. Continue reading

Thursday, April 10, 2014 in Agile & Architecture, Android, Apps, Code & Development, My Publications, Thoughts on the World

Developing for Android

Regular readers will realise that I’ve been rather quiet recently. The reason is that over the last couple of weeks I’ve bitten the bullet and started seriously developing an “app” for Android. As always when I have a programming project Continue reading

Thursday, February 13, 2014 in Android, Code & Development, Galaxy Note, VMWare

The Micro Four Thirds Lens Correction Project

Although most Micro Four Thirds (MFT) lenses are tiny, the cameras produce great JPG files with apparently little or no geometric distortion. They do this by applying corrections in camera, and the correction parameter data is also stored with the Continue reading

Wednesday, August 29, 2012 in Code & Development, Micro Four Thirds, Photography

Macs Are Really Easy? Ha!

There is a myth. The myth goes “Windows is complicated. Macs are really easy – they just work.” Like most myths this may have started from an original truth, but is now a lie. I am its latest, but I Continue reading

Friday, April 13, 2012 in Code & Development, Thoughts on the World, VMWare

Mac OSX–A Third-Class OS?

Does Apple’s opposition to virtualisation create a technical ghetto? Continue reading

Wednesday, February 29, 2012 in Code & Development, PCs/Laptops, Thoughts on the World, VMWare

First Bibble Plugin Published

I’ve just published my first plugin for the popular image processing suite, Bibble. CAQuest manages chromatic aberration correction, so if you find yourself always having to apply correction for “purple fringes”, this is the tool you need. To find out Continue reading

Saturday, December 11, 2010 in Code & Development, My Publications, Photography

Integrating External Content with WordPress

I’ve been developing andrewj.com for about 15 years, and although I’m not that prolific I’ve built up quite a lot of content. I recently converted my blog from an old bespoke (= “custom”, for my American friends) solution to one Continue reading

Thursday, August 12, 2010 in Code & Development, My Publications, Website & Blog

In Damnation of PHP

<rant>Apologies if the title is a bit strong, but I think it’s the nearest I can get to the opposite of “In Praise of PHP”. I’ve just spent a weekend migrating my website to a new hosting server. As part Continue reading

Wednesday, June 16, 2010 in Code & Development, Thoughts on the World

Using Volume Shadowing with Ntbackup Under Vista

The brain-dead backup function of Windows Vista is enormously annoying. There are known ways to get good old ntbackup working, but they have their limitations. Read this article about my attempts to get round some of those limitations. Continue reading

Monday, July 9, 2007 in Code & Development, Thoughts on the World