{"id":1517,"date":"2015-01-16T15:46:45","date_gmt":"2015-01-16T14:46:45","guid":{"rendered":"http:\/\/www.andrewj.com\/blog\/?p=1517"},"modified":"2015-01-16T15:46:45","modified_gmt":"2015-01-16T14:46:45","slug":"efficient-fuzzy-matching-at-word-level","status":"publish","type":"post","link":"https:\/\/www.andrewj.com\/blog\/2015\/efficient-fuzzy-matching-at-word-level\/","title":{"rendered":"Efficient Fuzzy Matching at Word Level"},"content":{"rendered":"<p>I&#8217;ve just solved a tricky problem with what I think is quite an elegant solution, and thought it would be interesting to share it.<\/p>\n<p>I&#8217;m building a system in which I have to process fault data. Sometimes this comes with a standard fault code (hallelujah!), but quite often it comes with the manufacturer&#8217;s own fault code and a description which may (or may not) be quite close to the description against one of the standard faults. If I can match the description up, I can treat the fault as standard.<\/p>\n<p>The problem is that the description matching is not exact. Variations in punctuation are common, but the wording can also change so that, for example, &#8220;Evaporative emission system incorrect purge flow&#8221; in one system is &#8220;Evaporative emission <em>control <\/em>system incorrect purge flow&#8221; in another. To a human reader this is fine, but eliminates simplistic exact matching.<\/p>\n<p>I spent some time Googling fuzzy matching, but most of the available literature focuses on character or even bit-level matching and looks both complex and compute-intensive. However finally I found the <b>Jaccard similarity coefficient<\/b>. This is designed for establishing the &#8220;similarity&#8221; between two objects with similar lists of attributes, and I had a &#8220;lights on&#8221; moment and realised I could apply a similar algorithm, but to the set of words used in the pair of descriptions.<\/p>\n<p>The algorithm to calculate the coefficient for a given pair is actually very simple:<\/p>\n<ol>\n<li>Convert Text1 to a list of words\/tokens, excluding spaces and punctuation. In VB.NET the string.split() function does this very neatly and you can specify exactly what counts as punctuation or white space. For simplicity it&#8217;s a good idea to convert both strings to uppercase to eliminate capitalisation variations.<\/li>\n<li>Convert Text2 to a list of tokens on the same basis.<\/li>\n<li>For each token from Text1, see if it appears in the list of tokens from Text2. If so, increment a counter M<\/li>\n<li>For each token from Text2, see if it appears in the list of tokens from Text1. If so, increment M<\/li>\n<li>Calculate the coefficient as M \/ (total number of tokens from both lists)<\/li>\n<\/ol>\n<p>This produces a very intuitive result: 1 if the token sets are an exact match, 0 if they are completely disjoint, and a linearly varying value between. The process does, however, ignore transpositions, so that &#8220;Fuel rail pressure low&#8221; equates to &#8220;Fuel rail low pressure&#8221;. In my context this matches what a human assessor would do.<\/p>\n<p>Now I simply have to repeat steps 2-5 above for each standard error description, and pick the one which produces the highest coefficient. If the value is below about 80% I treat the string as &#8220;matched&#8221;, and I can quote the coefficient to give a feel for &#8220;how good&#8221; the match is.<\/p>\n<p>Hopefully that&#8217;s useful.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve just solved a tricky problem with what I think is quite an elegant solution, and thought it would be interesting to share it. I&#8217;m building a system in which I have to process fault data. Sometimes this comes with &hellip; <a href=\"https:\/\/www.andrewj.com\/blog\/2015\/efficient-fuzzy-matching-at-word-level\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7,9],"tags":[],"class_list":["post-1517","post","type-post","status-publish","format-standard","hentry","category-agile","category-code"],"_links":{"self":[{"href":"https:\/\/www.andrewj.com\/blog\/wp-json\/wp\/v2\/posts\/1517","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.andrewj.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.andrewj.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.andrewj.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.andrewj.com\/blog\/wp-json\/wp\/v2\/comments?post=1517"}],"version-history":[{"count":0,"href":"https:\/\/www.andrewj.com\/blog\/wp-json\/wp\/v2\/posts\/1517\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.andrewj.com\/blog\/wp-json\/wp\/v2\/media?parent=1517"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.andrewj.com\/blog\/wp-json\/wp\/v2\/categories?post=1517"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.andrewj.com\/blog\/wp-json\/wp\/v2\/tags?post=1517"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}