| |
| ||||||
Parse nasty XLS with dynamic ETLThis is a discussion on Parse nasty XLS with dynamic ETL within the Data Integration News Feeds forums, part of the Data Integration Forum category; Dear Kettle friends, Last year, right after the summer in version 4.1 of Pentaho Data Integration, we introduced the notion of dynamically inserted ETL metadata (Youtube video here ).* Since ... |
![]() |
| | LinkBack | Thread Tools | Search this Thread | Display Modes |
| | #1 |
| News Bot Join Date: Nov 2007
Posts: 15,067
![]() | Dear Kettle friends, Last year, right after the summer in version 4.1 of Pentaho Data Integration, we introduced the notion of dynamically inserted ETL metadata (Youtube video here).* Since then we received a lot of positive feedback on this functionality which encouraged me to extend it to a few more steps. Already with support for “CSV Input” and “Select Values” we could do a lot of dynamic things.* However, we can clearly do a lot better by extending our initiative to a few more steps: “Microsoft Excel Input” (which can also read ODS by the way), “Row Normalizer” and “Row De-normalizer”. Below I’ll describe an actual (obfuscated) example that you will probably recognize as it is equally hideous as simple in it’s horrible complexity. Take a look at this file: ![]() Let’s assume that this spreadsheet describes the number of products sold on a given date. Spreadsheets like these are usually automatically generated by some kind of pivoting program.* Because of that they contains a varying number of columns with a dimension value in the column header.* In our sample we have one column for each date since the beginning of this year.* Not all dates are listed however, that would too simple, only days on which products were actually sold are listed.* In our sample that means: no weekend days are present in the spreadsheet.* In short: this spreadsheet will look different all year round. Surely, this can’t be too hard to read in by an ETL tool, right?* You just want to load this data into a database somewhere and be done with it. Now hang on.* Databases don’t support tables with varying number of columns.* So we actually want to only get 3 columns from this spreadsheet: The product code, the date of sales and the number of goods sold (the metric). In order to do that, we need to actually know the layout of the spreadsheet before we do the ETL.* We also need to un-pivot or normalize the data.* Again, to be able to do that, we need to know the exact layout of the file. Once at this point any serious ETL consultant will be doing one of the following:
If only we could somehow determine the layout of the file automatically, extract the date for each column and then inject that information into another transformation.* Mmm… This job describes the 2 main transformations used to solve our pesky little problem.* Both consist of a transformation that feeds ETL metadata into a transformation template.To determine the layout of the file we need to read the header of the spreadsheet file. Here’s how that should work: ![]() What this transformation (template) does is read the PROD_CODE column from the spreadsheet as well as a fixed number of columns from the first line of the spreadsheet.* Then the data is normalized.* Finally we only need to retain those columns where there is a non-null date The only problem with this transformation is that the number of columns is not dynamic so we need to either update it every day or you need to type in a a lot of columns up-front, say a thousand or so.*** Since we don’t like to type we want to generate that ETL metadata dynamically.* You could try to manually “generate” this data in a spreadsheet and then copy it over into the step grids or … this way: ![]() This transformation generates 1001 rows of field ETL metadata for 2 steps: “Read Spreadsheet” and “Row Normaliser”.* The output of the “Append streams” step looks exactly like the Fields section in the “Read Spreadsheet” step: ![]() This fields metadata is mapped and injected into our transformation template with the “ETL Metadata Injection” step: ![]() So our transformation to determine the layout of a spreadsheet actually executes another templated transformation after we injected the appropriate ETL metadata.* The resulting “layout.txt” file contains the following data: typefield;dateSo with this transformation we learned which column after the product code contains which date.* We can use this in the next transformation template to correspond our column number with the correct date: The layout of the spreadsheet and the row normaliser metadata can be injected into the template with the following transformation:![]() The file that comes out of this can obviously be stored into a database but for now it looks like this: PROD_CODE;date;metricThis is the format we wanted.* We did have to solve a few complex problems and as usual you can’t oversimplify solutions.* However, our solution will last forever, is low maintenance and is fairly generic in the sense that you could parameterize the whole job to read more than just this one spreadsheet. If you want to play with this example yourself, check out the attached zip test-case of PDI-5422.* Please note that you need a recent (as in the one from today) nightly build of Kettle/PDI to run. As always, let me know what you think and what other steps you need to support ETL Metadata Injection. Until next time, Matt More from Matt Casters on Data Integration (Pentaho) Blog... |
| | |
![]() |
| Bookmarks |
| Thread Tools | Search this Thread |
| Display Modes | |
| |
Similar Threads | ||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Physics of dynamic braking in slow motion | admin | Analytic News Feeds | 0 | 7th January 2011 06:57 PM |
| Dynamic sculpture brings weather into airport | admin | Analytic News Feeds | 0 | 28th December 2010 03:27 PM |
| BuzzNumbers featured in Dynamic Business Magazine | BuzzNumbers | BuzzNumbers | 0 | 11th July 2010 02:28 PM |
| Those Nasty “Hairball” Analytics Projects: Usually There’s A Data Warehouse At The Co | Latest News Headlines | Forrester | 0 | 10th July 2010 06:20 AM |
| Dynamic Network Analysis | admin | Prediction Markets News Feeds | 0 | 10th April 2010 06:30 AM |
| | |
| | |