Pentaho, Hadoop, and Data Lakes

15th October 2010, 03:06 AM   #1
News Bot

Earlier this week, at Hadoop World in New York, Pentaho announced availability of our first Hadoop release.

As part of the initial research into the Hadoop arena, I talked to many companies that use Hadoop. Several common attributes and themes emerged from these meetings:

  • 80-90% of companies are dealing with structured or semi-structured data (not unstructured).
  • The source of the data is typically a single application or system.
  • The data is typically sub-transactional or non-transactional.
  • There are some known questions to ask of the data.
  • There are many unknown questions that will arise in the future.
  • There are multiple user communities that have questions of the data.
  • The data is of a scale or daily volume such that it won’t fit technically and/or economically into an RDBMS.

In the past, the standard way to handle reporting and analysis of this data was to identify the most interesting attributes and aggregate them into a data mart. There are several problems with this approach:

  • Only a subset of the attributes is examined, so only pre-determined questions can be answered.
  • The data is aggregated, so visibility into the lowest levels is lost.

Based on the requirements above and the problems with the traditional solutions, we have created a concept called the Data Lake to describe an optimal solution.

If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
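
To make the contrast concrete, here is a minimal Python sketch. It is only an illustration, not Pentaho's or Hadoop's implementation: the event records, field names (day, user, page, browser, ms) and both helper functions are hypothetical. The "mart" keeps counts for two pre-selected attributes, while the "lake" keeps every raw record, so a question nobody planned for can still be answered.

```python
from collections import defaultdict

# Hypothetical sub-transactional events from a single source application.
RAW_EVENTS = [
    {"day": "2010-10-11", "user": "u1", "page": "/home", "browser": "firefox", "ms": 120},
    {"day": "2010-10-11", "user": "u2", "page": "/cart", "browser": "chrome", "ms": 340},
    {"day": "2010-10-12", "user": "u1", "page": "/cart", "browser": "firefox", "ms": 95},
    {"day": "2010-10-12", "user": "u3", "page": "/home", "browser": "chrome", "ms": 410},
]


def build_data_mart(events):
    """Aggregate only the pre-selected attributes (day, page) into hit counts.

    Everything else (user, browser, ms) is dropped, as in a traditional mart.
    """
    mart = defaultdict(int)
    for event in events:
        mart[(event["day"], event["page"])] += 1
    return dict(mart)


def query_lake(events, predicate, key):
    """Answer an ad hoc question by scanning the raw, unaggregated events."""
    counts = defaultdict(int)
    for event in events:
        if predicate(event):
            counts[event[key]] += 1
    return dict(counts)


# Known question, answerable from the mart: hits per day and page.
mart = build_data_mart(RAW_EVENTS)

# Question that came up later: which browsers see slow (> 300 ms) responses?
# The mart cannot answer this; the raw events in the lake can.
slow_by_browser = query_lake(RAW_EVENTS, lambda e: e["ms"] > 300, "browser")

print(mart)
print(slow_by_browser)
```

At Hadoop scale the scan in query_lake would be a distributed job rather than a Python loop, but the trade-off is the same: the mart is cheap to query and fixed in shape, while the lake answers new questions at the cost of reading the raw data.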

For more information on this concept you can watch a presentation on it here: Pentaho’s Big Data Architecture




More from James Dixon’s Blog ...