
Streaming XML content parsing with StAX



13th August 2011, 06:09 AM   #1
News Bot
 

Today, one of our community members posted a deviously simple XML format on the forum that needed to be parsed. The format looks like this:

USD GBP 1 1 Fri, 01 Jun 2001 22:50:00 GMT 1.4181 1.4177 USD JPY 1 1 Fri, 01 Jun 2001 22:50:02 GMT 0.008387 0.008382 ...

Typically we parse XML content with the “Get Data From XML” step, which uses XPath expressions to extract values. However, since the meaning of the XML content here is determined by position instead of path, this becomes a problem. To be specific, for each CONVERSION block you need to pick up the last preceding EXPR and EXCH values. You could solve it like this:



Unfortunately, this method requires parsing your file in full 3 times, plus once more for each additional preceding element. Joining the resulting streams back together also slows things down considerably.

So this is another case where the new “XML Input Stream (StAX)” step comes to the rescue. The solution using this step is the following:



Here’s how it works:

1) The “positional element.xml” step flattens the content of the XML file so that you can see the output of each individual StAX event, such as “start of element”, “characters” and “end of element”. For every event you get the path, parent path, element value and so forth. As mentioned in the doc, this step is very fast and can handle files of just about any size with a minimal memory footprint. It will appear in PDI version 4.2.0GA.

2) With a bit of scripting we collect information from the various rows that we find interesting.

3) We keep only the result lines (the rows for the end of the CONVERSION element) and filter out everything else. What you get is the following desired output:



The use of JavaScript in this example is not ideal, but compared to the time spent reading the XML I’m sure it’s fine for most use-cases.
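For readers who want to try the idea outside PDI, here is a minimal sketch of the same three steps in Python. It is not the transformation from the post: it uses the standard library's `xml.etree.ElementTree.iterparse` as a stand-in for the StAX step. Only the EXPR, EXCH and CONVERSION names come from the post; the `RATES` root, the `DATE`, `ASK` and `BID` children, the `parse_conversions` helper and the simplified sample are assumptions made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: only EXPR, EXCH and CONVERSION are named in the post.
# The RATES root and the DATE/ASK/BID children are assumptions.
SAMPLE = """<RATES>
  <EXPR>USD</EXPR><EXCH>GBP</EXCH>
  <CONVERSION><DATE>Fri, 01 Jun 2001 22:50:00 GMT</DATE><ASK>1.4181</ASK><BID>1.4177</BID></CONVERSION>
  <EXPR>USD</EXPR><EXCH>JPY</EXCH>
  <CONVERSION><DATE>Fri, 01 Jun 2001 22:50:02 GMT</DATE><ASK>0.008387</ASK><BID>0.008382</BID></CONVERSION>
</RATES>"""


def parse_conversions(stream):
    """Yield one dict per CONVERSION, carrying the last preceding EXPR/EXCH."""
    # State remembered across events, like the variables in the script step.
    last = {"EXPR": None, "EXCH": None}
    # Step 1: read the document as a stream of events in a single pass.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag in last:
            # Step 2: remember the most recent positional values.
            last[elem.tag] = (elem.text or "").strip()
        elif elem.tag == "CONVERSION":
            # Step 3: emit a row only at the end of a CONVERSION element.
            row = dict(last)
            for child in elem:
                row[child.tag] = (child.text or "").strip()
            yield row
            # Drop the children of the finished element to limit memory use.
            elem.clear()


rows = list(parse_conversions(io.BytesIO(SAMPLE.encode("utf-8"))))
for row in rows:
    print(row)
```

The file is read once, and the only state kept between events is the last EXPR and EXCH seen, which is what makes the positional lookup cheap compared to the multi-pass approach.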

Both examples are up for download from the forum.

The “XML Input Stream (StAX)” step has also been shown to work well with huge hierarchical XML structures, including files of multiple GB in size. The step was written by my colleague Jens Bleuel, and he documented a more complex example on his blog.

Have fun with it!

Matt



More from Matt Casters on Data Integration (Pentaho) Blog...