Go Back   CORTEX Forums > Local Happenings > CORTEX Blogs > BI Monkey
Register Blogs FAQ Members List Calendar Search Today's Posts Mark Forums Read

BI Monkey BI Monkey is the ‘nom de plume’ of James Beresford, a Certified Microsoft BI Professional and MBA living and consulting in Sydney

Reply
 
LinkBack Thread Tools Display Modes
Old 5th November 2009, 07:00 PM   #1 (permalink)
Senior Member
 
Join Date: Jun 2009
Posts: 62
James Beresford is on a distinguished road
Thumbs up The Fuzzy Grouping Transformation

Fig 1: The Fuzzy Grouping Transformation


In this post I will be covering the Fuzzy Grouping Transformation. The sample package can be found here for 2005 and guidelines on use are here.

What does the Fuzzy Grouping Transformation do?

The Fuzzy Grouping Transformation allows you to identify similar items within a dataset. It doesn’t – as its name perhaps suggests – actually perform any group by operations. However it does provide you with the information you need to group rows by, as what it does is help identify similar rows within a dataset. It uses the same algorithms as the Fuzzy Lookup transformation, and operates a bit like a Fuzzy Lookup except that it effectively creates its own runtime lookup table, based on the incoming data, which it then references to identify similar rows.

If this is all getting a bit confusing, let’s take a look at a example of some output:

Fig 2: Fuzzy Grouping Output


This shows three input rows with the field FullName in which the Fuzzy Grouping transformation has been told to look for similar values. These three rows have all been decided to be similar to the FullName “Gregory Alderson”. Note that – as with the Fuzzy Lookup – the match is assigned a similarity score (in the column _Similarity_FullName). Input rows are also assigned a unique key (in the column _key_in) and when a row is matched to another, the key of that row is also stored (in the column _key_out). Note the Fuzzy Lookup component is a synchronous transformation – i.e. you will get one output row for each input row.

So what has actually happened here?
  • First, the component has loaded all the data into temporary tables.
  • Second, it scanned through using fuzzy algorithms to look for similar items.
  • Third, for each input row it output either the best match above the threshold similarity, or otherwise decided it had no matches.
You can then use this output to perform an aggregate group by operation to sum up your data by similar items, using the Aggregate Transformation.

Configuring the Fuzzy Grouping Transformation

The Fuzzy Grouping Transformation is configured over three tabs when you open the component. The first thing you need to configure is the database connection it will use to create the temporary tables it needs to perform the fuzzy matching on the Connection Manager tab. These will be built in TempDB, so ensure the user specified in your connection manager has appropriate permissions on TempDB.

Fig 3: Configuring the Connection Manager for the Fuzzy Grouping Transformation


Next, on the Columns tab, select the columns that are going to be used to identify similar rows, and those which will simply pass through the component in the ‘Available Input Columns’ box. If a column is to be grouped, check on the left hand side. If it is simply to pass through just check on the right hand side. Any column selected for grouping will be passed through automatically.

Fig 4: Configuring the Columns used in the Fuzzy Grouping Transformation


For each column to be analysed for similarity there are a range of settings that influence how the matching occurs and how some output is named. These work as follows:
  • Input Column – the column selected with the checkbox for grouping
  • Output Alias – the name of the Input column in the Output data flow
  • Group Output Alias – the column which will hold the best match value from the grouping
  • Match Type – Exact or Fuzzy matching
  • Minimum Similarity – The minimum similarity for matches for that column
  • Similarity Output Alias – the name of the output column containing the similarity score for that column
  • Numerals – tunes how the matches handle numbers in the data
  • Comparison Flags – fine tuning the string handling
Full details for how these settings work – especially the last two – can be found here on MSDN. I would expect the most common ones you would play with would be Match Type – when grouping on multiple columns and are confident some of them have good quality data, setting to exact match will improve the overall result for matching a row of data. Similarly with Minimum Similarity, you can tune for the quality of match you will accept on individual columns.

Note you are not limited to a single column for grouping – you can select several which will contribute to the overall similarity score for the row. In the sample package I demonstrate this in data flow 2.

Fig 5: Configuring the Advanced options for the Fuzzy Grouping Transformation


Finally, on the Advanced tab we set the global values that will affect the result you get. The key one is the Similarity Threshold – this is what will set the minimum threshold of match that will be accepted overall for a row to be regarded as similar to another row. The right setting for this value will depend on the data you have and what quality of match you will accept. The Similarity score this evaluates against is a function of the similarity scores of all the columns that are being evaluated for the match.

Where should you use the Fuzzy Grouping Transformation?

The scenario I can see this being most useful in is in an analytical context. From a data cleansing point of view it can be a quick way of assessing how effective a data deduplicating exercise may be, and what columns and parameters to use. From a purely analytical approach it would be useful for doing rough groupings on data where exact results aren’t important and a degree of error can be tolerated.

Be aware that this component is a blocking transformation – i.e. it requires all input data to be consumed and processed before it will output anything, which can result in poor performance.

MSDN Documentation for the Fuzzy Grouping Transformation can be found here for 2008 and here for 2005.

If you are still struggling, try these additional resources:
If you need specific help or advice, or have suggestions on the post, please leave a comment and I will do my best to help you.



Get More from the original blog...
James Beresford is offline   Reply With Quote
Reply

Bookmarks

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is On
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
The Script Transformation part 1 ? a simple Transformation James Beresford BI Monkey 0 1st September 2009 02:01 PM
The Row Count Transformation James Beresford BI Monkey 0 13th August 2009 11:51 AM
The Merge Transformation James Beresford BI Monkey 0 20th July 2009 12:11 PM
The Multicast Transformation James Beresford BI Monkey 0 7th July 2009 04:48 AM
The Fuzzy Lookup Transformation James Beresford BI Monkey 0 23rd June 2009 08:34 PM


All times are GMT +11. The time now is 02:27 AM.

© The Business Intelligence Group

Search Engine Optimization by vBSEO 3.3.0