The Single Threader step
Dear Kettle fans,

At the end of last year, while we were doing a lot of optimizations and testing with embedding Pentaho Data Integration in Hadoop, we came upon the brilliant idea to write a single-threaded engine. The idea back then was that since Hadoop itself was already using parallelism, it might be more efficient for once to process rows of data in a single thread with minimal overhead. This is very much like the approach that Talend has: single-threaded, but with very little to no overhead for data passing.

So an engine was written (for the Java fans: materialized in class SingleThreadedTransExecutor) to allow for that to happen. To make a long and tedious story short, the Pentaho Hadoop team tested the performance and found out that the regular parallel (multi-threaded) engine worked faster. At that point we had an engine without a use-case, which is always a bad place to end up. So the engine risked passage into oblivion.

However, there is actually a use-case for the engine. Once every couple of months we get the question (from the sales team usually, not from actual users) whether it is possible to limit the number of threads or processors used in a transformation. Up until now the answer was “No, if you have 20 steps you’ll have 20 threads, end of story”. The new “Single Threader” step that we’re introducing, which uses the single-threaded engine, changes that.

The most pressing problem that this step solves is the reduction of data passing and thread context switching overhead. Let’s take an example: a transformation with 100 dummy steps. To make matters worse, these steps don’t do anything, so all we’re measuring with this case is overhead:

[Image: the example transformation]

Because this transformation uses over 100 threads on a 4-core system, a lot of thread context switching is taking place. We also have over 100 row buffers and locks between the steps that lower performance. Not by much, but as we’ll see it all adds up.
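To make the two execution models concrete, here is a minimal, self-contained Java sketch. It is not Kettle code and does not use SingleThreadedTransExecutor; the class `PipelineSketch`, its method names, and the buffer size of 1000 are all hypothetical choices made for illustration. The first method gives every step its own thread and a locked row buffer, the second pushes each row through all steps in one plain loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.UnaryOperator;

/** Hypothetical illustration only; names and sizes are assumptions, not Kettle API. */
class PipelineSketch {

    // Sentinel row signalling that no more rows will arrive.
    private static final Object[] END = new Object[0];

    /** One thread per step, with a bounded, locked buffer between neighbours. */
    static long runThreadPerStep(List<UnaryOperator<Object[]>> steps, int rowCount) {
        int n = steps.size();
        List<BlockingQueue<Object[]>> buffers = new ArrayList<>();
        for (int i = 0; i <= n; i++) {
            buffers.add(new ArrayBlockingQueue<>(1000)); // assumed buffer size
        }
        List<Thread> threads = new ArrayList<>();

        // Producer thread: generates the rows and feeds the first buffer.
        final BlockingQueue<Object[]> first = buffers.get(0);
        threads.add(new Thread(() -> {
            try {
                for (int r = 0; r < rowCount; r++) {
                    first.put(new Object[] { r });
                }
                first.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        // One worker thread per step: take a row, process it, hand it on.
        for (int i = 0; i < n; i++) {
            final BlockingQueue<Object[]> in = buffers.get(i);
            final BlockingQueue<Object[]> out = buffers.get(i + 1);
            final UnaryOperator<Object[]> step = steps.get(i);
            threads.add(new Thread(() -> {
                try {
                    for (Object[] row = in.take(); row != END; row = in.take()) {
                        out.put(step.apply(row));
                    }
                    out.put(END); // pass the end marker downstream
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
        for (Thread t : threads) {
            t.start();
        }

        // The calling thread drains the last buffer and counts the rows.
        long count = 0;
        try {
            BlockingQueue<Object[]> last = buffers.get(n);
            while (last.take() != END) {
                count++;
            }
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining rows", e);
        }
        return count;
    }

    /** All steps in the calling thread: no buffers, no locks, no hand-offs. */
    static long runSingleThreaded(List<UnaryOperator<Object[]>> steps, int rowCount) {
        long count = 0;
        for (int r = 0; r < rowCount; r++) {
            Object[] row = new Object[] { r };
            for (UnaryOperator<Object[]> step : steps) {
                row = step.apply(row);
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<UnaryOperator<Object[]>> steps = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            steps.add(row -> row); // a "dummy" step that does nothing
        }
        long t0 = System.nanoTime();
        long a = runThreadPerStep(steps, 100_000);
        long t1 = System.nanoTime();
        long b = runSingleThreaded(steps, 100_000);
        long t2 = System.nanoTime();
        System.out.println("thread per step: " + a + " rows, " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("single threaded: " + b + " rows, " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Both methods deliver the same rows; they differ only in how much hand-off work sits between the steps. Timings from a toy like this depend heavily on the machine and JVM and say nothing about how the real engines compare.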
OK, now let’s put the 100 dummy steps in a sub-transformation:

[Image: the sub-transformation]

For this we use 1 extra step, an Injector step that will accept the rows from this parent transformation:

[Image: the parent transformation]

Please note that we can execute the “Single Threader” step in multiple copies. My test computer has 4 cores, so I can run it in 4 different threads. In the “Single Threader” step we can specify the sub-transformation we defined above, as well as the number of rows we’ll pass through at once:

[Image: the “Single Threader” step settings]

When we then look at the performance of both solutions, we find that our original transformation runs in 105 seconds on my system. The new solution completes the task in about 55 seconds, or almost half the time.

Since this behaves very much like a Mapping or sub-transformation, you can also use it as a way to execute re-usable logic. As an additional advantage, it perhaps makes complex transformations a bit less cluttered.

Well, there you have it: another option to tune the performance of your transformations. You can find this feature in new downloads from our Jenkins CI build server, or later in 4.2.0-RC1.

Until next time,
Matt