Go Back   CORTEX Forums > Local Happenings > CORTEX Blogs > BI Monkey
Register Blogs FAQ Members List Calendar Search Today's Posts Mark Forums Read

Simple Data Quality Scoring with SSDQS & SSIS

This is a discussion on Simple Data Quality Scoring with SSDQS & SSIS within the BI Monkey forums, part of the CORTEX Blogs category; A common requirement in Data Warehousing is to apply a Data Quality “score” to records as they come in. The score is then used to identify and filter or fix ...


Reply
 
LinkBack Thread Tools Search this Thread Display Modes
Old 5th October 2011, 12:38 PM   #1
Guru
 
Join Date: Jun 2009
Posts: 135
James Beresford is on a distinguished road
Thumbs up Simple Data Quality Scoring with SSDQS & SSIS

A common requirement in Data Warehousing is to apply a Data Quality “score” to records as they come in. The score is then used to identify and filter or fix bad data coming in depending on its assigned quality.

A practical example of this might be that in a Customer Address record, a missing Postcode might attract a high score as it’s a very important field. However a badly formatted work, home or mobile telephone number may attract a lower score as it may not be as important to the business. Though, cumulatively, if all three numbers are badly formatted that may be necessary to give a combined high score so the record gets examined.

An example of this is below. A failed Postcode gets a score of 3, and a failed telephone number gets a score of 1. Thus, anything with a score of 3 or above either has a failed Postcode, or 3 failed telephone numbers, and can thus be subject to special handling.

Fig 1: Data Quality Scoring Example


From the example above we can see this is a fairly arbitrary process in terms of how scores are calculated and used. SSDQS itself doesn’t natively support assigning a score or weight to a failed data item, but what it does do is provide us with a flexible engine to help us decide what is a failed data item. SSIS can then react to this pass / fail behaviour and apply a scoring.

Setting up an SSDQS Knowledge Base for Scoring

Given that the basis for scoring is pretty binary in nature, I set up a simple KB that had domains that would either pass or fail a piece of data. I first created a data set with three data fields:
  • Year – Values ranging from 1970 to 2025
  • Value – Values ranging from 0 to 100
  • Code – Values A,B,C,D,E
I then set up a KB to evaluate the fields as follows:
  • Year – Valid from 1975 to 2020
  • Value – Valid from 10 to 95
  • Code – Valid values A,C,E
Note that I did not set up any Domain Values or do any training – I just set up the KB, Domains and Domain Rules. All I want to use DQS for is to identify records that are invalid for SSIS to use in scoring.

Using SSIS to Score SSDQS output

Next I hookup up my SSIS Data Quality Cleansing Component to push the source data through the Knowledge Base, and get the status of each of the columns after they pass through. As there are no preloaded valid values in the Domains, the status comes back as either “Invalid” (it failed the Domain rule) or “Unknown” (in this configuration, this translates to a correct value).

DQS output Data Viewer


The Data Quality Cleansing Component doesn’t support scoring in itself. This has to be added using a Derived Column on an item by item basis. Using a simple IF / THEN / ELSE expression, I assign a score of 1 to each failed column based on the status of the record, as below:

Applying Score using a Derived Column


Because of the Pipeline nature of SSIS, I then need to add a second Derived Column transform downstream to weight and add the scores together to create a final, record level score:

Aggregating and Weighting Score using a Derived Column


This results in a final Data Quality “Score” assigned to each record:

Weighted Scoring output Data Viewer


What you then do with these scores is up to you. In my example package, I used a Conditional Split to send records with a score over a certain threshold to a different destination:

DQS Scoring example Data Flow


Improving the Scoring process

The example I’ve created is quite simplistic – it has hard coded weightings and redirection thresholds, and can only react to two (of a possible three) record statuses.* The process could be made more flexible using metadata driven weightings and thresholds (provided as package inputs).

Beyond that you have the option to handle the clean and dirty data more appropriately – by pushing dirty data into a cleanup process, halting ETL processes etc, etc.

The key takeaway here is that DQS enables you to create a scoring process that is independent of the actual Data Quality rules that pass or fail a piece of data. The DQS Knowledge Base is your flexible input of what qualifies as a good or bad record, instead of having to hard code using SQL or Derived Columns, which could get messy very quickly.



Get More from the original blog...
James Beresford is offline  
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiTweet this Post!
Reply With Quote
Reply

Bookmarks

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is On
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
SQL Server Data Quality Services & SSIS James Beresford BI Monkey 2 8th February 2012 03:26 PM
SQL Server Data Quality Services & SSIS ? Performance James Beresford BI Monkey 0 14th September 2011 04:05 PM
Simple data converter from Excel admin Analytic News Feeds 0 7th September 2010 07:18 AM
Starr: Simple Tiling ARRay analysis of Affymetrix ChIP-chip data. admin Analytic News Feeds 0 20th April 2010 09:07 PM
SSIS Derived Column forcing data type changes James Beresford BI Monkey 0 16th September 2009 06:40 PM


All times are GMT +11. The time now is 10:56 PM.

© The Business Intelligence Group

Search Engine Optimization by vBSEO