|
||||||
| BI Monkey BI Monkey is the ‘nom de plume’ of James Beresford, a Certified Microsoft BI Professional and MBA living and consulting in Sydney |
![]() |
|
|
LinkBack | Thread Tools | Display Modes |
|
|
#1 (permalink) |
|
Senior Member
Join Date: Jun 2009
Posts: 62
![]() |
Fig 1: The Fuzzy Grouping TransformationIn this post I will be covering the Fuzzy Grouping Transformation. The sample package can be found here for 2005 and guidelines on use are here. What does the Fuzzy Grouping Transformation do? The Fuzzy Grouping Transformation allows you to identify similar items within a dataset. It doesn’t – as its name perhaps suggests – actually perform any group by operations. However it does provide you with the information you need to group rows by, as what it does is help identify similar rows within a dataset. It uses the same algorithms as the Fuzzy Lookup transformation, and operates a bit like a Fuzzy Lookup except that it effectively creates its own runtime lookup table, based on the incoming data, which it then references to identify similar rows. If this is all getting a bit confusing, let’s take a look at a example of some output: Fig 2: Fuzzy Grouping OutputThis shows three input rows with the field FullName in which the Fuzzy Grouping transformation has been told to look for similar values. These three rows have all been decided to be similar to the FullName “Gregory Alderson”. Note that – as with the Fuzzy Lookup – the match is assigned a similarity score (in the column _Similarity_FullName). Input rows are also assigned a unique key (in the column _key_in) and when a row is matched to another, the key of that row is also stored (in the column _key_out). Note the Fuzzy Lookup component is a synchronous transformation – i.e. you will get one output row for each input row. So what has actually happened here?
Configuring the Fuzzy Grouping Transformation The Fuzzy Grouping Transformation is configured over three tabs when you open the component. The first thing you need to configure is the database connection it will use to create the temporary tables it needs to perform the fuzzy matching on the Connection Manager tab. These will be built in TempDB, so ensure the user specified in your connection manager has appropriate permissions on TempDB. Fig 3: Configuring the Connection Manager for the Fuzzy Grouping TransformationNext, on the Columns tab, select the columns that are going to be used to identify similar rows, and those which will simply pass through the component in the ‘Available Input Columns’ box. If a column is to be grouped, check on the left hand side. If it is simply to pass through just check on the right hand side. Any column selected for grouping will be passed through automatically. Fig 4: Configuring the Columns used in the Fuzzy Grouping TransformationFor each column to be analysed for similarity there are a range of settings that influence how the matching occurs and how some output is named. These work as follows:
Note you are not limited to a single column for grouping – you can select several which will contribute to the overall similarity score for the row. In the sample package I demonstrate this in data flow 2. Fig 5: Configuring the Advanced options for the Fuzzy Grouping TransformationFinally, on the Advanced tab we set the global values that will affect the result you get. The key one is the Similarity Threshold – this is what will set the minimum threshold of match that will be accepted overall for a row to be regarded as similar to another row. The right setting for this value will depend on the data you have and what quality of match you will accept. The Similarity score this evaluates against is a function of the similarity scores of all the columns that are being evaluated for the match. Where should you use the Fuzzy Grouping Transformation? The scenario I can see this being most useful in is in an analytical context. From a data cleansing point of view it can be a quick way of assessing how effective a data deduplicating exercise may be, and what columns and parameters to use. From a purely analytical approach it would be useful for doing rough groupings on data where exact results aren’t important and a degree of error can be tolerated. Be aware that this component is a blocking transformation – i.e. it requires all input data to be consumed and processed before it will output anything, which can result in poor performance. MSDN Documentation for the Fuzzy Grouping Transformation can be found here for 2008 and here for 2005. If you are still struggling, try these additional resources: If you need specific help or advice, or have suggestions on the post, please leave a comment and I will do my best to help you. Get More from the original blog... |
|
|
|
![]() |
| Bookmarks |
| Thread Tools | |
| Display Modes | |
|
|
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| The Script Transformation part 1 ? a simple Transformation | James Beresford | BI Monkey | 0 | 1st September 2009 02:01 PM |
| The Row Count Transformation | James Beresford | BI Monkey | 0 | 13th August 2009 11:51 AM |
| The Merge Transformation | James Beresford | BI Monkey | 0 | 20th July 2009 12:11 PM |
| The Multicast Transformation | James Beresford | BI Monkey | 0 | 7th July 2009 04:48 AM |
| The Fuzzy Lookup Transformation | James Beresford | BI Monkey | 0 | 23rd June 2009 08:34 PM |
|
|
|
|
|
|