
Truly Distributed Analytics

Posted by Tony Bain, 22nd October 2010, 08:06 AM

The growth and success of Hadoop are very interesting. It is emerging as a highly significant technology for the data scientist. It is a platform that can scale and accommodate data exploration even across some of the largest datasets that exist today. Yahoo, I’m told, has a 43,000-node Hadoop cluster. The mind boggles at the volume of data being crunched with this cluster and ones like it. Hadoop is distributed. More specifically, it is a distributed system: a cluster of servers acting together to process a sequence of user-initiated jobs.

While the system may be considered distributed, the data being analyzed is, for all intents and purposes, centralized. The data at the centre of analysis jobs must be located within your cluster and directly accessible by your local applications. This means that as the volume of data under the microscope grows, the size of the analytics platform grows to accommodate the influx of information.

However, as data science expands, external data sources are becoming increasingly relevant for data analytics. External data is data that is related to your business but not produced within your organization. Examples of such data may be environmental data (weather), geographic data (maps, places, addresses, etc.), shipping and delivery data, and so on. External data can provide insight into irregularity and opportunity within your own datasets that, without it, could be overlooked or misunderstood.

While I spoke about this the other day somewhat in jest, some silly but simple examples may be the discovery that it is beneficial to increase advertising targeting people in their 30s to 50s when “The O.C.” is on TV, or that it is beneficial to boost the advertising of certain novels in regions where it is currently pouring down outside. These areas of opportunity cannot be discovered until your data is combined with externally sourced data (television scheduling, weather, etc.).
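As a rough illustration of the weather example, here is a minimal Python sketch that joins local sales rows to external weather records and compares average sales per weather condition. All of the data, field layouts, and the function name `average_sales_by_weather` are made up for illustration; nothing here comes from a real provider's API.

```python
from collections import defaultdict

# Hypothetical local data: (region, date, units_sold) for one novel.
local_sales = [
    ("north", "2010-10-01", 12),
    ("north", "2010-10-02", 30),
    ("south", "2010-10-01", 15),
    ("south", "2010-10-02", 14),
]

# Hypothetical external data: (region, date) -> weather condition.
external_weather = {
    ("north", "2010-10-01"): "dry",
    ("north", "2010-10-02"): "rain",
    ("south", "2010-10-01"): "dry",
    ("south", "2010-10-02"): "dry",
}


def average_sales_by_weather(sales, weather):
    """Join sales to weather on (region, date) and average units per condition."""
    totals = defaultdict(lambda: [0, 0])  # condition -> [sum of units, row count]
    for region, date, units in sales:
        condition = weather.get((region, date))
        if condition is None:
            continue  # no external record for this row, so skip it
        totals[condition][0] += units
        totals[condition][1] += 1
    return {cond: total / count for cond, (total, count) in totals.items()}


averages = average_sales_by_weather(local_sales, external_weather)
print(averages)
```

With only the local rows there is no way to tell the rainy day apart from the others; the difference shows up only after the join against the external source.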

External data at the moment tends to be quite small and discrete, so the current approach is to import external data into the local analytics environment. Organizations such as Infochimps are doing a great job of organizing these external data sets and providing APIs for importing data into whatever localized analytics platform you are running. However, as the importance and volume of external data grow, I believe the impact of “importing” this data will grow too, and the volume of external data may become significantly greater than the local data in certain cases. Identifying what external data is relevant will also become a role of analytics itself.

While it is early days, one project I am very excited about is focused on how analytics can be distributed between systems and even organizations. Rather than centralizing large sets of data, the analytics jobs themselves span organizations and data centers. Of course, this must be done while respecting the security and privacy expectations of all parties in the process.
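To make the idea of moving the job rather than the data a little more concrete, here is a minimal Python sketch. It is not the project mentioned above, whose design is not described here; it only assumes a hypothetical `Site` class that runs a job against its own records and returns an aggregate, and a hypothetical `distributed_mean` coordinator that merges those aggregates.

```python
from dataclasses import dataclass


@dataclass
class PartialStats:
    """Aggregate summary a site is willing to share (hypothetical)."""
    total: float
    count: int


class Site:
    """Hypothetical participant that keeps its raw records to itself."""

    def __init__(self, name, records):
        self.name = name
        self._records = records  # list of (key, value); stays at the site

    def run_job(self, predicate):
        """Run the job locally and return only a sum and a count."""
        matching = [value for key, value in self._records if predicate(key)]
        return PartialStats(total=sum(matching), count=len(matching))


def distributed_mean(sites, predicate):
    """Send the job to every site and merge the partial results."""
    partials = [site.run_job(predicate) for site in sites]
    total = sum(p.total for p in partials)
    count = sum(p.count for p in partials)
    return total / count if count else None


sites = [
    Site("retailer", [("rain", 30), ("dry", 12)]),
    Site("partner", [("rain", 50), ("dry", 10)]),
]
rain_mean = distributed_mean(sites, lambda key: key == "rain")
print(rain_mean)
```

Only sums and counts cross the boundary between sites in this sketch. That alone does not satisfy real security or privacy requirements, since aggregates over small groups can still reveal individual records, so a real system would need considerably more than this.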

All times are GMT +11.

© The Business Intelligence Group
