
The Single Threader step



Posted 8th May 2011, 03:05 AM

Dear Kettle fans,

At the end of last year, while we were doing a lot of optimizations and testing with embedding Pentaho Data Integration in Hadoop, we came upon the brilliant idea to write a single threaded engine.

The idea back then was that, since Hadoop itself was already using parallelism, it might be more efficient for once to process rows of data in a single thread with minimal overhead. This is very much like the approach that Talend has: single threaded but with very little to no overhead for data passing. So an engine was written, materialized for the Java fans in the class SingleThreadedTransExecutor, to allow for that to happen. To make a long and tedious story short, the Pentaho Hadoop team tested the performance and found that the regular parallel (multi-threaded) engine worked faster.

At that point we had an engine without a use-case, which is always a bad place to end up. So the engine risked passage into oblivion.

However, there is actually a use-case for the step. Once every couple of months we get the question (from the sales team usually, not from actual users) whether it is possible to limit the number of threads or processors used in a transformation. Up until now the answer was “No, if you have 20 steps you’ll have 20 threads, end of story”.

The new “Single Threader” step that we’re introducing, which uses the single threaded engine, changes that. The most pressing problem that this step solves is the reduction of data passing and thread context switching overhead.

Let’s take an example: a transformation with 100 dummy steps. To make matters worse, the dummy steps don’t do anything, so all we’re measuring in this case is overhead:



Because this transformation uses over 100 threads on a 4-core system, a lot of thread context switching is taking place. We also have over 100 row buffers and locks between the steps that lower performance. Not by much, but as we’ll see, it all adds up.
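To make the thread-per-step model a bit more tangible, here is a minimal Python sketch of it. It is not Kettle code: the names `dummy_step` and `run_thread_per_step` and the bounded `queue.Queue` used as a row buffer are assumptions made purely for illustration. Every step gets its own thread and its own locked buffer, so the number of threads grows with the number of steps.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the row stream


def dummy_step(inbox, outbox):
    """Pass every row through unchanged, like a dummy step (hypothetical)."""
    while True:
        row = inbox.get()
        outbox.put(row)
        if row is SENTINEL:
            break


def run_thread_per_step(rows, n_steps, buffer_size=1000):
    """Run rows through n_steps dummy steps, one thread per step."""
    # n_steps + 1 bounded buffers connect source -> steps -> sink.
    buffers = [queue.Queue(maxsize=buffer_size) for _ in range(n_steps + 1)]
    threads = [
        threading.Thread(target=dummy_step, args=(buffers[i], buffers[i + 1]))
        for i in range(n_steps)
    ]
    for t in threads:
        t.start()

    # Feed from a separate thread so a full buffer cannot block the reader.
    def feed():
        for row in rows:
            buffers[0].put(row)
        buffers[0].put(SENTINEL)

    feeder = threading.Thread(target=feed)
    feeder.start()

    # Drain the last buffer in the calling thread.
    result = []
    while True:
        row = buffers[-1].get()
        if row is SENTINEL:
            break
        result.append(row)

    feeder.join()
    for t in threads:
        t.join()
    return result, len(threads), len(buffers)
```

Calling `run_thread_per_step(list(range(50)), 100)` returns the rows unchanged together with a thread count of 100 and a buffer count of 101. Every row has to cross each of those locked buffers, which is the overhead the example transformation measures.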

OK, now let’s put the 100 dummy steps in a sub-transformation:



For this we use one extra step, an Injector step that will accept the rows from the parent transformation:



Please note that we can execute the “Single Threader” step in multiple copies. On my test-computer I have 4 cores so I can run in 4 different threads. In the “Single Threader” step we can specify the sub-transformation we defined above as well as the number of rows we’ll pass through at once:
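The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual engine: `make_sub_transformation`, `run_batch` and `single_threader` are hypothetical names, the steps are modeled as plain functions, and a thread pool stands in for the step copies. What it shows is that the thread count is fixed by the number of copies, not by the number of steps, and that rows move between steps without any locked buffers.

```python
from concurrent.futures import ThreadPoolExecutor


def make_sub_transformation(n_steps):
    """Build a chain of n_steps dummy steps as plain functions (hypothetical)."""
    return [lambda row: row for _ in range(n_steps)]


def run_batch(steps, batch):
    """Push one batch of rows through every step in the calling thread."""
    for step in steps:
        # Each step handles the whole batch before the next step runs.
        batch = [step(row) for row in batch]
    return batch


def single_threader(rows, steps, copies=4, batch_size=100):
    """Split rows into batches and spread them over `copies` worker threads."""
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    with ThreadPoolExecutor(max_workers=copies) as pool:
        # map() yields the batch results in their original order.
        results = pool.map(lambda b: run_batch(steps, b), batches)
        return [row for batch in results for row in batch]
```

With `steps = make_sub_transformation(100)`, a call such as `single_threader(rows, steps, copies=4, batch_size=100)` runs all 100 steps using at most 4 worker threads. Here `batch_size` plays the role of the number of rows passed through at once.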



When we then look at the performance of both solutions, we find that the original transformation runs in 105 seconds on my system. The new solution completes the task in about 55 seconds, or almost half the time.
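For those who like to see the arithmetic, a quick calculation with the two timings quoted above (the variable names are just for illustration):

```python
original_seconds = 105        # thread-per-step transformation, as measured in the post
single_threader_seconds = 55  # Single Threader solution, as measured in the post

speedup = original_seconds / single_threader_seconds
time_saved = 1 - single_threader_seconds / original_seconds

print(f"speedup: {speedup:.2f}x")      # speedup: 1.91x
print(f"time saved: {time_saved:.0%}")  # time saved: 48%
```

These numbers come from a single test machine with 4 cores, so they should be read as an indication rather than a benchmark.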

Since this behaves very much like a Mapping or sub-transformation you can also use it as a way to execute re-usable logic. As an additional advantage it makes complex transformations perhaps a bit less cluttered.

Well, there you have it: another option to tune the performance of your transformations. You can find this feature in new downloads from our Jenkins CI build server or later in 4.2.0-RC1.

Until next time,

Matt



More from Matt Casters on Data Integration (Pentaho) Blog...