| |
| ||||||
Simple Data Quality Scoring with SSDQS & SSISThis is a discussion on Simple Data Quality Scoring with SSDQS & SSIS within the BI Monkey forums, part of the CORTEX Blogs category; A common requirement in Data Warehousing is to apply a Data Quality “score” to records as they come in. The score is then used to identify and filter or fix ... |
![]() |
| | LinkBack | Thread Tools | Search this Thread | Display Modes |
| | #1 |
| Guru Join Date: Jun 2009
Posts: 135
![]() | A common requirement in Data Warehousing is to apply a Data Quality “score” to records as they come in. The score is then used to identify and filter or fix bad data coming in depending on its assigned quality. A practical example of this might be that in a Customer Address record, a missing Postcode might attract a high score as it’s a very important field. However a badly formatted work, home or mobile telephone number may attract a lower score as it may not be as important to the business. Though, cumulatively, if all three numbers are badly formatted that may be necessary to give a combined high score so the record gets examined. An example of this is below. A failed Postcode gets a score of 3, and a failed telephone number gets a score of 1. Thus, anything with a score of 3 or above either has a failed Postcode, or 3 failed telephone numbers, and can thus be subject to special handling. Fig 1: Data Quality Scoring ExampleFrom the example above we can see this is a fairly arbitrary process in terms of how scores are calculated and used. SSDQS itself doesn’t natively support assigning a score or weight to a failed data item, but what it does do is provide us with a flexible engine to help us decide what is a failed data item. SSIS can then react to this pass / fail behaviour and apply a scoring. Setting up an SSDQS Knowledge Base for Scoring Given that the basis for scoring is pretty binary in nature, I set up a simple KB that had domains that would either pass or fail a piece of data. I first created a data set with three data fields:
Using SSIS to Score SSDQS output Next I hookup up my SSIS Data Quality Cleansing Component to push the source data through the Knowledge Base, and get the status of each of the columns after they pass through. As there are no preloaded valid values in the Domains, the status comes back as either “Invalid” (it failed the Domain rule) or “Unknown” (in this configuration, this translates to a correct value). DQS output Data ViewerThe Data Quality Cleansing Component doesn’t support scoring in itself. This has to be added using a Derived Column on an item by item basis. Using a simple IF / THEN / ELSE expression, I assign a score of 1 to each failed column based on the status of the record, as below: Applying Score using a Derived ColumnBecause of the Pipeline nature of SSIS, I then need to add a second Derived Column transform downstream to weight and add the scores together to create a final, record level score: Aggregating and Weighting Score using a Derived ColumnThis results in a final Data Quality “Score” assigned to each record: Weighted Scoring output Data ViewerWhat you then do with these scores is up to you. In my example package, I used a Conditional Split to send records with a score over a certain threshold to a different destination: DQS Scoring example Data FlowImproving the Scoring process The example I’ve created is quite simplistic – it has hard coded weightings and redirection thresholds, and can only react to two (of a possible three) record statuses.* The process could be made more flexible using metadata driven weightings and thresholds (provided as package inputs). Beyond that you have the option to handle the clean and dirty data more appropriately – by pushing dirty data into a cleanup process, halting ETL processes etc, etc. The key takeaway here is that DQS enables you to create a scoring process that is independent of the actual Data Quality rules that pass or fail a piece of data. The DQS Knowledge Base is your flexible input of what qualifies as a good or bad record, instead of having to hard code using SQL or Derived Columns, which could get messy very quickly. Get More from the original blog... |
| | |
![]() |
| Bookmarks |
| Thread Tools | Search this Thread |
| Display Modes | |
| |
Similar Threads | ||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| SQL Server Data Quality Services & SSIS | James Beresford | BI Monkey | 2 | 8th February 2012 03:26 PM |
| SQL Server Data Quality Services & SSIS ? Performance | James Beresford | BI Monkey | 0 | 14th September 2011 04:05 PM |
| Simple data converter from Excel | admin | Analytic News Feeds | 0 | 7th September 2010 07:18 AM |
| Starr: Simple Tiling ARRay analysis of Affymetrix ChIP-chip data. | admin | Analytic News Feeds | 0 | 20th April 2010 09:07 PM |
| SSIS Derived Column forcing data type changes | James Beresford | BI Monkey | 0 | 16th September 2009 06:40 PM |
| | |
| | |