| |
| ||||||
The Fuzzy Lookup TransformationThis is a discussion on The Fuzzy Lookup Transformation within the BI Monkey forums, part of the CORTEX Blogs category; Fig 1: The Fuzzy Lookup Transformation In this slightly long post I will be covering the Fuzzy Lookup Transformation, one of my favourite components in SSIS. The sample package can ... |
![]() |
| | LinkBack | Thread Tools | Search this Thread | Display Modes |
| | #1 |
| Guru Join Date: Jun 2009
Posts: 122
![]() | Fig 1: The Fuzzy Lookup TransformationIn this slightly long post I will be covering the Fuzzy Lookup Transformation, one of my favourite components in SSIS. The sample package can be found*here for 2005 and guidelines on use are here. How can a Lookup be Fuzzy? A lookup becomes Fuzzy when it can match to records that are similar, but not identical to, the lookup key. For example, it can match “Jon Smith” to “John Smith” - this component is very useful for helping consolide client data such as names and addresses from multiple systems. For those of you who want to know more, I thoroughly recommend this article - Fuzzy Lookup and Fuzzy Grouping in SQL Server* Integration Services 2005 - which explains (unusually well for tech doco on BOL) how the Fuzzy Lookup process actually works. An important thing to get to grips with is that when the Fuzzy Lookup outputs its matches, it also outputs a Similarity and Confidence score for the match. The Similarity score is easy to understand - it’s a measure on a 0-1 scale of how similar the matched item is to the lookup key. So for example, “John Smith” and “Jon Smith” rate a 0.89, whereas “John Smith” and “James Smith” rate a lower 0.62. The Confidence score is on the same scale but not fixed for a given match - essentially the higher it is, the lower the number of alternative matches it has found. So from my example, “Johnathan Fruitbat” has just two matches, and the highest Confidence is 0.38. In contrast “John James Walliams” has 15 matches and the highest Confidence is just 0.08. In practice I have tended to pay most attention to the Similarity score. In the example i’ve prepared, I have created a list of names of varying similarity to help you get a feel for how the matching functions in practice, and what the scores look like. What are Fuzzy Lookup Indexes? Like a normal lookup, you need an input table and a lookup reference table. However in this case the lookup reference is a called an Index (this is a bit misleading as it is not an Index in the normal database sense of the word). These indexes store fragments of the lookup items for fuzzy searching. When you use a Fuzzy lookup you have 4 options* on working with these indexes. Fig 2: Configuring the Reference Table and Indexes for the Fuzzy Lookup
So in what scenarios would you use the different options? The answers below match the option numbers above:
Advanced settings for the Fuzzy Lookup I will skip over the columns tab, as is simply requires you to match the column in your source to the lookup key in your reference table, and specify any additional columns you would like returned. Under the Advanced Tab, you effectively have 3 options: Fig 3: The Advanced Tab of the Fuzzy LookupFirst is “Maximum number of matches to output per lookup”. This allows you to constrain the number of matches the fuzzy lookup will return per input key. So again referring to my example, if I set this to 10, even though “John James Walliams” has 15 possible matches, only 10 would be returned. “Johnathan Fruitbat” would still send back just 2 matches, as there are only two to return. An important takeaway here is that the Fuzzy Lookup does not return one row per input row - it can return many, or even zero. Second is “Similarity Threshold” - this is a sliding scale running from 0 to 1 which sets a cut-off for the quality of returns from* Fuzzy Lookup. In practice you rarely want to set this much below 0.75 - but you will need to do a few test runs to see which level is appropriate for your data. Third is “Token Delimiters” - this is one of the means the Fuzzy Lookup uses to break your lookup keys into smaller pieces for fuzzy matching. You would only add to or remove from these if your data has special characters delimiting (or not delimiting) items within the data. Finally, a couple of properties only available from the Properties pane are WarmCaches and Exhaustive. WarmCaches specifies whether the Indexes and Lookup Table are partly preloaded into memory - by default this is set to True and I can’t really see a scenario where this should be changed. Exhaustive increases the intensity of the matching process - making the checking more thorough - but at a performance cost. I would mark this as an option to play with depending on how well your matching is doing, and set it to True where possible. Where would you use the Fuzzy Lookup? The main uses for the Fuzzy Lookup that I have encountered has been matching supposedly common data from different data sources such as client lists and deduplicating reference data where items have been entered many times with slight variations. Essentially any time you need to tie up data that should be - but isn’t - the same, this is a very useful tool. MSDN Documentation for the Fuzzy Lookup Transformation can be found here for 2008 and here for 2005. More... |
| | |
![]() |
| Bookmarks |
| Thread Tools | Search this Thread |
| Display Modes | |
| |
| | |
| | |