Go Back   CORTEX Forums > Best Practices > Subject Matter Expertise > Data Warehousing
Register Blogs FAQ Members List Calendar Search Today's Posts Mark Forums Read

The Flaws of the Classic Data Warehouse Architecture, Part 1

This is a discussion on The Flaws of the Classic Data Warehouse Architecture, Part 1 within the Data Warehousing forums, part of the Subject Matter Expertise category; A very interesting insight into data warehouse architecture http://www.b-eye-network.com/view/9752...


Reply
 
LinkBack Thread Tools Search this Thread Display Modes
Old 27th March 2009, 03:49 PM   #1
glove
Guest
 
Posts: n/a
Default The Flaws of the Classic Data Warehouse Architecture, Part 1

A very interesting insight into data warehouse architecture

http://www.b-eye-network.com/view/9752
 
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiTweet this Post!
Reply With Quote
Old 30th March 2009, 11:55 AM   #2
Member
 
Join Date: Oct 2007
Posts: 384
Blog Entries: 26
Steve Bennett will become famous soon enough
Post Interesting but ...

Thanks for the heads-up on this article Mark!

It is a good introduction but I'm not sure of all the facts listed. For example, what about Kimble? Her deals with a number of the identified issues.

Also, I'm not sure about the '8 out of 10' statement. I would like some evidence please.

I suspect that part 2 of the article will advocate some sort of appliance - but I could be wrong!

Check-out the external link from Mark to read other responses to this article.

Quote:
The Flaws of the Classic Data Warehouse Architecture, Part 1

by Rick van der Lans
Originally published March 4, 2009

Visit ten randomly picked customers that own a data warehouse architecture, and you will see that at least eight have developed a classic data warehouse architecture (CDWA). What do I mean by a CDWA? In a CDWA, data is copied periodically from production systems to the central data warehouse. And in some environments, before data is moved the central data warehouse, it is first stored in an operational data store (ODS). If an ODS is used, it is probably because of technical reasons.

Somewhere in this copy process, data is cleaned, filtered, and aggregated. Usually, data is copied from the central warehouse to the data marts. And, finally, depending on the business intelligence (BI) tools in use, data is copied from the data marts to cube technology. In short, this means that the classic data warehouse architecture consists of approximately four data storage layers (an ODS, a data warehouse, a few data marts, and several cubes) plus a chain of copy jobs to move data from one layer to another. Note that some organizations add more layers, such as a staging area.

It is important to understand that another characteristic of the CDWA is that it is based implicitly on the assumption that all BI reports will use data stored in the data warehouse. In other words, to be able to create a report, to perform some analytics, to show a key performance indicator, the required data has to be stored in the data warehouse first. Only then does it become available for reporting and analytics. This makes the CDW the heart of the BI system. In a nutshell, this is the architecture most organizations have created, and it is what you could call the classic data warehouse architecture.

This CDWA has served us well the last twenty years. In fact, up to five years ago we had good reasons to use this architecture. The state of database, ETL, and reporting technology did not really allow us to develop something else. All the tools were aimed at supporting the CDWA. But the question right now is: twenty years later, is this still the right architecture? Is this the best possible architecture we can come up with, especially if we consider the new demands and requirements, and if we look at new technologies available in the market? My answer would be no! To me, we are slowly reaching the end of an era. An era where the CDWA was king. It is time for change. This article is the first in a series on the flaws of the CDWA and on an alternative architecture, one that fits the needs and wishes of most organizations for (hopefully) the next twenty years. Let’s start by describing some of the CDWA flaws.

The first flaw is related to the concept of operational business intelligence. More and more, organizations show interest in supporting operational business intelligence. What this means is that the reports that the decision makers use have to include more up-to-date data. Refreshing the source data once a day is not enough for those users. Decision makers who are quite close to the business processes especially need 100% up-to-date data. But how do you do this? You don’t have to be a technological wizard to understand that, if data has to be copied four or five times from one data storage layer to another, to get from the production databases to the reports, doing this in just a few seconds will become close to impossible. We have to simplify the architecture to be able to support operational business intelligence. Bottom line, what it means is that we have to remove data storage layers and minimize the number of copy steps.

The second flaw is a very technical one, and deals with the enormous amounts of redundant data that we store in our CDW. Most data marts are 100% loaded with data coming from the CDW. The same applies to cubes; which means they are packed with redundant data. Even the ODS and the CDW contain masses of overlapping data. Inside each data storage layer we also store a lot of redundant data. Most of that redundant data is stored to improve the performance of queries, reports, and ETL scripts. This redundant data is hidden in indexes, materialized query tables, columns and tables with aggregated data, staging areas, and so on.

We know that our data warehouses take up huge amounts of storage, terabytes of storage. But how much original data is there really? An extensive study done by UK-based analyst Nigel Pendse shows that a BI application needs approximately five gigabytes of original data. This is the median value. This number sounds realistic, but how does it match the results of studies that tell us that the average data warehouse is ten terabytes large? If this would all be original data, according to the study of Pendse, you would need 2,000 BI applications with no overlapping data elements, to get to ten terabytes. And that is highly unlikely. As indicated, the amount of redundant data is enormous.

Obviously, there is a reason for all this. It is performance. To speed up queries, we need our indexes, materialized query tables and columns with aggregated data. You probably think who cares about storage? Storage is not that expensive anymore. But that is not the issue. The issue is flexibility. The more redundant the data, the less flexible the architecture is. Every change we make requires an extra change on the redundant data. We could simplify our warehouse architectures considerably by getting rid of most of the redundant data. Hopefully, the new database technology on the market, such as data warehouse appliances and column-based database technologies, will decrease the need to store so much redundant data.

Doing analytics and reporting on unstructured and external data brings us to the third flaw. Most warehouses are filled with structured data coming from production databases. But the need to do analytics on external data and on data stores with unstructured data is growing. Most vendors and analysts propose to handle these two forms of data as follows: if you want to analyze external or unstructured data, copy it into the CDW to make it available for analytics and reporting. In other words, the proposals are based on copying data. But why? Why not do the analytics straight on the unstructured data source itself, and straight on the external data? In a way, that would be the Internet-style solution. If I search something on the Internet, I don’t first copy all the data I need into one of my own databases. No, that data stays where it is. More and more of the document management systems do allow us to analyze their databases straight on. But can our current BI tools access them? With respect to external data, doing business intelligence over the Internet on external data can be done today in a very sophisticated way with some of the so-called mashup tools.

The fourth flaw of the CDW can be described with the term non-sharable specifications. Many BI tools allow us to include specifications to make the report definitions more data source independent. For example, in the Universe concept of Business Objects, we can define specific terms and relationships between tables. This is perfect for all the reports created with the BO tools. But what if we want to create other reports in, for example, Excel or with the Cognos tools? It probably means we have to develop comparable specifications in those tools. All those specifications are non-sharable. My experience is that most environments are heterogeneous; they do use different tools, so the need for sharable specifications does exist. This example of non-sharable specifications relates to BI tools, but I could also give you examples of other non-sharable specifications stored in ETL tools and database servers. Non-sharable specifications decrease flexibility and lead to inconsistent reports.

The last flaw I want to discuss in this article deals with flexibility. The world of software engineering has taught us that we have to separate the storage structure from the application structure. Why? Because if we separate them, changes on the storage structure do not always require changes to the application structure, which is good for maintenance and flexibility. One of the first authors who came up with this idea was David L. Parnas in his groundbreaking article “On the Criteria to be Used in Decomposing Systems into Modules,” published in 1972, where he introduced the notion of information hiding. This concept became the basis for more popular concepts such as object orientation and component based development.

No software engineer has problems with this concept, but we in the business intelligence business do. Our architectures are not at all based on information hiding. Almost all the reports we create are tight to the database server technology underneath. Take a simple example. If we use a reporting tool where we can write our own SQL statements to access a specific database server, we will use all the bells and whistles of that product to get optimal performance. But what if we want to replace our database server with another, one that supports a slightly different SQL dialect? Or worse, imagine we want to switch to an MDX-based database server, or maybe we want to access an external database that does not return a table but an XML document? In probably all those situations, we have to change our report definitions dramatically. It is time that we adopt the concept of information hiding in our data warehouse architectures to improve flexibility and to make it easier to adopt new emerging technologies.

To summarize, the CDWA has supported us well, and for many organizations, it will still be the right solution for a few years to come. But it is time for a new architecture because the demands are changing, the requirements are becoming more complex, and new technology has been introduced. But where do we go from here? That will be the topic of part 2 of this article. Stay tuned.

Rick van der Lans
Rick is an independent consultant, lecturer and author. He specializes in data warehousing, business intelligence (BI), application integration and database technology. He is managing director of R20/Consultancy B.V. He is an internationally acclaimed lecturer who has lectured professionally for the last twenty years in the U.S., many of the European and Middle East countries, South America and Australia. He has been invited by several major software vendors to do keynote speeches.

He is the founder and chairman of the Independent Analyst Platform (IAP). The IAP organizes events for the top 25 BI analysts of Europe and the U.S. He is the chairman of the European Meta Data Conference held in London annually and the annual BI event in The Netherlands.

He is the author of several books on computing, including Myths on Computing. Some of these books are available in different languages. Books such as the popular Introduction to SQL and the SQL Standard, are available in English, Chinese, Dutch, Italian and German and are sold worldwide. He is also the author of the successful books SQL for MySQL Developers and The SQL Guide to Oracle.
Steve Bennett is offline  
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiTweet this Post!
Reply With Quote
Old 12th May 2009, 12:45 PM   #3
Member
 
Join Date: Oct 2007
Posts: 384
Blog Entries: 26
Steve Bennett will become famous soon enough
Post What’s Next for DBMS?

By Guy Harrison

InfoManagement Direct, May 7, 2009

Once upon a time, when client/server ruled, applications all used database management systems in much the same way. Graphical user interface clients – typically running on Windows – would issue SQL directly against a shared back-end relational database. The emergence of three-tier Web applications – in which business logic moved to a middle tier and presentation moved to the browser – resulted in only minor changes in the role of RDBMS. A single monolithic RDMBS continued to support multiple clients, which just happened to be Web or application servers rather than Windows client programs.

Over the past 25 years, there have been no successful challenges to the dominance of the RDBMS. The most successful upstarts – object-oriented, XML and document-oriented databases – account for a minute fraction of the overall database market. Until relatively recently, the RDBMS seemed to have a complete and permanent hold on the data management tier.

However, the emergence of new computing models – most importantly, the convergence of grid and utility computing models in the guise of “cloud computing”– suggest that the times might be changing for DBMS.

2008: The Year of the Cloud

Arguably the most exciting development in software application architecture in 2008 was the emergence of viable cloud computing environments. The term “cloud computing” is seriously overhyped, but might be summed up for our purposes as the provision of virtualized application software, platforms or infrastructure across the network, in particular the Internet. The concepts of utility computing, where computer resources are allocated on demand, and grid computing, the ability to use large numbers of homogenous commodity computers to support larger applications or tasks, became a practical reality for the first time in 2008, and became so in environments made available across the Internet – e.g., in the cloud.

Amazon arguably pioneered the public cloud with its Amazon Web Services Infrastructure as a Service cloud. AWS users can buy virtual computing resources – storage, memory, I/O and CPU by the GB or CPU hour, and can rapidly scale their resource utilization up or down as demand varies.

Google provided an even more abstracted Platform as a Service cloud with its Google App Engine offering. With GAE, users do not even need to explicitly request increases in CPU or memory as their application demand increases. They merely need to deploy their application into GAE, and GAE will manage these resources.

Microsoft’s Azure cloud – by far the most ambitious of the cloud computing architectures - appears to be a PaaS cloud primarily oriented toward .NET applications, but with more extensive application support services.

Most of the cloud architectural innovation is occurring in these “public” clouds. However, the architectures pioneered in the public clouds are expected to be mirrored within tomorrow’s enterprise data centers, and many vendors are emerging with “private cloud” infrastructure products.

RDBMS Meets the Cloud

The traditional relational database can be a poor fit in these cloud environments. The promise of the cloud is that you can increase computing resources by allocating additional virtual machines. Unfortunately, most relational databases cannot smoothly scale up in this manner. Oracle’s RAC cluster database provides a database scale-out capability, but licensing and administrative overhead currently make it an unattractive fit for zero-maintenance, pay-as-you-go cloud computing. The common MySQL “sharding” model and the use of distributed object caches such as memcached can be used to establish a large virtual database from many small database servers, but this solution is insufficiently elastic and automatic when demand scales rapidly.

As a result, cloud providers offer simpler, nonrelational databases that can provide elastic on-demand scalability. The major offerings are:
  • SimpleDB within Amazon AWS
  • BigTable within Google App Engine
  • Microsoft’s Azure Table Services and SQL Data Services
These cloud databases differ in implementation detail, but share common core features:
  • A hierarchical structure resembling a B-tree index or hash table. Rapid lookup via a single key value is provided.
  • Flexible or un-types attributes. Each “row” often can contain different “columns,” and columns may have multiple values or include a more complex embedded structure.
  • Automatic geo-redundancy. Elements stored in the database are guaranteed to be replicated across multiple data centers.
  • Automatic partitioning across multiple hosts and automatic scale-out as the size of – or demand on – the data store exceeds the capability of a single host.
However, these databases are missing many of the features we’ve come to expect in a relational solution:
  • Joins and complex queries must be implemented in procedural code.
  • Transactions are not supported. You can’t create a set of changes that must succeed or fail as a single unit.
  • “Eventual” consistency. When you make a change to a data item, it will be visible eventually, but not immediately, in all locations where a copy of that data is maintained.
These limitations seem severe, and there is a wide range of application types that probably cannot tolerate the relaxation of transactional integrity. But, for Web 2.0 style applications, these limitations are no big deal – does anyone care if your latest Facebook update takes a few minutes to be visible to all your friends around the world?

Will cloud databases disrupt the RDBMS market? Probably not. But the cloud environments available from Amazon, Google and Microsoft offer significant advantages over traditional hosting or in-house hardware models. Applications that exploit the cloud will find the scalability and zero-maintenance models of these cloud databases appealing. As these cloud applications emerge into the mainstream, the cloud database probably will establish a permanent and successful segment of the database taxonomy.

Column-Oriented Databases

Today’s relational databases are generally built around a one-size-fits-all architecture. The architecture – indeed the actual vendor/version – of an RDBMS used for data warehousing is generally the same as that used for an online transaction processing application. The RDBMS model has certainly proven to be a flexible one, and its ability to support both transaction processing and decision support operations is one of the reasons the RDBMS has come to dominate the database world.

However, as the size and complexity of data warehousing databases has increased, it’s been increasingly apparent that databases designed explicitly to support data warehousing could have a significant cost/performance advantage.

In particular, the common RDBMS implementation in which all data for a given row is stored as a block might be optimal for OLTP operations – create an entry, query an entry, change an entry – but is not optimal for many data warehousing and BI query scenarios. In these scenarios, storing the data for a specific column in the same block is far more effective, since data warehousing queries usually aggregate within columns and across rows.

In 2005, Mike Stonebraker, pioneer of relational databases and one of the creators of Postgress, and his team proposed a new model for column-centric databases called C-Store. In a C-Store database, data is physically oriented around projections, which represent sorted contents of one or more columns. Queries that perform typical data warehousing aggregate queries will typically find the data they need stored in one of these projections.

A further advantage of column orientation is that very high compression rates can be achieved. Compression works best when repeated patterns are present. The repetition within columns is almost far greater than the repetition within rows. For instance, a SURNAME column can be expected to have many repeated entries, whereas there will be little or no repetition between a SURNAME and a DATE_OF_BIRTH column. Column orientation, therefore, allows for very high compression ratios with only modest CPU requirements.

Of course, not everything about C-Store is faster. C-Store is a read-optimized database design, and the architecture exacts a heavy penalty when data is added or modified. In a traditional RDBMS, a new row requires the addition of just a single entry. In C-Store, entries must be added to every relevant projection. Further, since projections are ordered, new rows may require some reorganization. Consequently, the C-Store proposal includes a writable store, a sort of staging area, for inserts and updates. Entries to the writable store are periodically merged in the read-optimized, but, in the meantime, queries read from both stores to get the up-to-date view.

Despite the significant changes to the underlying architecture, C-Store exposes a familiar SQL interface and transactional model to the application, making it immediately compatible with existing business intelligence tools.

The C-Store proposal has been commercialized in the Vertica product, and an open source version called MonetDB is also available. Some elements of column-oriented compression also are becoming available in commercial databases such as Oracle.

Memory Only Databases and H-Store

There’s also been innovation in the world of online transaction processing databases. Moore’s law, the doubling of memory and CPU capacity roughly every 18 months, has been driving exponential increases in the processing capacity of desktop and middleware computers. Unfortunately, Moore’s law does not apply to I/O operations. Disk drives have gotten gradually faster, but storage capacity has been growing at a faster rate. As a result, the I/O per GB rate actually has been dropping over the past few decades.

All database systems make heroic attempts to avoid this disk I/O bottleneck by employing increasingly sophisticated caching mechanisms. In some cases, there may be four separate caches – within the application server, a result set and block cache within the database, and separate caching within dedicated storage devices. Despite all this, databases remain the slowest component in the modern application stack.

The imminent end of the spinning disk device and its replacement by solid state disk technology will relieve the bottleneck somewhat, but not enough to prevent the database from remaining the ultimate limit on transaction processing.

The disk I/O conundrum has lead to a fairly healthy niche of in-memory databases. In addition, there are calls for a complete rewrite of the disk-based RDBMS in favor of a design that more directly meets OLTP requirements.

Most notably, Mike Stonebraker and colleagues have proposed a complete do-over of the RDBMS, called H-Store. H-Store is an in-memory database in which transactional integrity is guaranteed not by a write-to disk, but by replication to other databases in the cluster. The clustering is achieved by requiring a hierarchical data model. Elements in the hierarchy are distributed across the cluster using a mechanism similar to that of the cloud databases, such as SimpleDB. This allows H-Store to provide scalable “shared nothing” clustering across large numbers of machines.

The architecture is single threaded, removing much of the complexity that arises in RDBMS when multiple concurrent sessions want to update the same piece of data. To take advantage of multi-core machines, one would run multiple H-Store instances “clustered” on the one machine.

Most shockingly to some, H-Store rejects SQL as the transactional language, and, instead, will require that transactions be written as stored procedures using an object-oriented syntax similar to Ruby on Rails ActiveRecord.

Of course, main memory is not yet abundant enough to allow the world’s larger databases to be moved to an H-Store model. One would imagine that short-term transactional data would reside in a high performance H-Store and longer-term data for analysis purposes would be shipped to C-Store.

H-Store exists only in prototype form today, but expect to see a commercial database with at least some of these features within the next few years. I also wouldn’t be surprised to see cloud databases adopt some of these ideas in the future.

The relational database model has been incredibly successful; there’s no real doubt that it will continue to be the dominant model for data storage for the foreseeable future. However, the emergence of cloud databases, column-oriented data warehouses and high performance in-memory OLTP databases suggest it might not be the only model for data storage in the future.

Guy Harrison is a chief architect for database solutions at Quest Software and is a recognized expert with more than 15 years’ experience in application and database administration, performance tuning and software development. Guy is the author of Oracle SQL High Performance Tuning (Prentice Hall), Oracle Desk Reference (Prentice Hall) and MySQL Stored Procedure Programming (O’Reilly with Steven Feuerstein) and is a regular speaker at trade shows and events.
Steve Bennett is offline  
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiTweet this Post!
Reply With Quote
Reply

Bookmarks

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
Part-Time Experienced Business Analyst Available Steve Bennett CORTEXers Wanted 1 15th August 2008 12:18 PM
Data Mart V Data Warehousing - The geat Debate binboy Data Warehousing Tips and Techniques 1 16th July 2008 11:15 AM


All times are GMT +11. The time now is 02:48 AM.

© The Business Intelligence Group

Search Engine Optimization by vBSEO