A good understanding or a good description of a problem is a prerequisite to finding a solution. In this blog, I want to introduce just such a problem description for a data platform that is used for analytic purposes. Traditionally, this is called a data warehouse (DW), but labels, such as data mart, big data platform, and data hub are also used in this context. I’ve named this problem description the Data Warehousing Quadrant. (An initial version has been shown in this blog.)
Since then, I’ve used it in many meetings with customers, partners, analysts, colleagues and students. It has the nice effect that it makes people think about their own data platform (problem) as they try to locate where they are and where they want to go. This is extremely helpful as it triggers the right dialog. Only if you work on the right questions will you find the right answers. Or put another way? If you start with the wrong questions—a situation that occurs far more often than you’d expect—then you are unlikely to find the right answers.
The Data Warehousing Quadrant (Fig. 1) has two problem dimensions that are independent from each other:
- Data Volume: This is a technical dimension which comprises all sorts of challenges caused by data volume and/or significant performance requirements such as: query performance, ETL or ELT performance, throughput, high number of users, huge data volumes, load balancing etc. This dimension is reflected on the vertical axis in fig. 1.
- Model Complexity: This reflects the challenges triggered by the semantics, the data models, the transformation, and load processes in the system. The more data sources that are connected to the DW, the more data models, tables, processes exist. So, the number of tables, views, connected sources is probably a good proxy for the complexity of modeling inside the DW. Why is this complexity relevant? The lower it is, the less governance is required in the system. The more tables, models, processes there are, the more dependencies between all this objects exists and the more difficult it becomes to manage all those dependencies whenever something (like a column of a table) needs to be added, changed, removed. This is the day-to-day management of the “life” of a DW system. This dimension is reflected on the horizontal axis in fig. 1.
Figure 1: The DW quadrant.
Now, these two dimensions create a space that can be divided into four (sub) quadrants.
Bottom Left: Data Marts
Here, the typical scenarios are:
- A departmental data mart, (like a marketing department) sets up a small, maybe even an open source-based RDBMS system and creates a few tables that help to track a marketing campaign. Those tables hold data of customers that were approached, their reactions or answers to questionnaires, addresses etc. SQL or other views allow some basic evaluations. After a few weeks, the marketing campaign ends, hardly any or no data gets added and the data, the underlying tables and views slowly “die” as they are not used anymore. Probably, one or two colleagues are sufficient to handle the system, both setting it up and creating the tables and views. They know the data model intimately, data volume is manageable, and change management is hardly relevant as the data model is either simple (thus changes are simple) or has a limited lifespan (≈ the duration of the marketing campaign).
- An operational data mart. This can also be the data that is managed via a certain operational application as you find them, like in an ERP, CRM, or SRM system. Here, tables, data are given and data consistency is managed by the related application. There is no requirement to involve additional data from other sources as the nature of the analysis is limited to the data sitting in that system. Typically, data volumes and number of relevant tables are limited and do not constitute a real challenge.
Top Left: Very Large Data Warehouses (VLDWs)
Here, a typical situation is that there is a small number of business processes—each one supported via an operational RDBMS—with at least one of them producing huge amounts of data. Imagine the sales orders submitted via Amazon’s website—this article cites 426 items ordered per second on Cyber Monday in 2013. Now, the model complexity is considerably simple as only a few business processes, thus tables (that describe those processes), are involved. However, the major challenges originate in the sheer volume of data produced by at least one of those processes. Consequently, topics such as DB partitioning, indexing, other tuning, scale-out, parallel processing are dominant while managing the data models or their lifecycles is fairly straightforward.
Bottom Right: Enterprise Data Warehouses (EDWs)
When we talk about enterprises, we look at a whole bunch of underlying business processes: financial, HR, CRM, supply-chain, orders, deliveries, billing, and so on. Each of these processes is typically supported by some operational system which has a related DB in which it stores the data describing the ongoing activities within the respective process. Because there are natural dependencies and relationships between those processes (there has to be an order before something is delivered or billed) then it makes sense for business analysts to explore and analyse those business processes not only in an isolated way but also to look at those dependencies and overlaps. Everyone understands that orders might be hampered if the supply chain is not running well. In order to underline this with facts, the data from the supply chain and the order systems need to be related and combined to see the mutual impacts.
Data warehouses that cover a large set of business processes within an enterprise are therefore called enterprise data warehouses (EDWs). Their characteristic is the large set of data sources (reflecting the business processes) which, in turn, translates into a large number of (relational) tables. A lot of work is required to cleanse and harmonise data in those tables. In addition, the dependencies between the business processes and its underlying data are reflected in the semantic modeling on top of those tables. Overall, a lot of knowledge and IP goes into building up an EDW. This makes it sometimes expensive but, also, extremely valuable.
An EDW does not remain static. It gets changed, adjusted, new sources get added, some models get refined. Changes in the day-to-day business (like changes in a company’s org structure) translate into changes in the EDW. This, by the way, does apply to the other DWs mentioned above, too. However, the lifecycle is more prominent with EDWs than in the other cases. In other words: here, the challenges by the model complexity dimension dominate the life of an EDW.
Top Right: Big Data Warehouses (BDWs)
Finally, there is the top-right quadrant, which starts to become relevant with the advent of Big Data. Please beware that “Big Data” not only refers to data volumes but also incorporating types of data that have not been used that much so far. Examples are
- Videos + images
- Free text from email or social networks
- Complex log and sensor data
This requires additional technologies involved that currently surge in the wider environment of Hadoop, Spark, and the like. Those infrastructures are used to complement traditional DWs to form BDWs, aka modern data warehouses, aka big data hubs (BDHs). Basically, those BDWs see challenges from both dimensions—the data volume and the modeling complexity. The latter is being augmented by the fact that models might span various processing and data layers, like Hadoop + RDBMS.
SAP addresses those two dimensions or the forces that push along those dimensions via various products, namely SAP HANA and VORA for the data volume and performance challenges, while SAP BW/4HANA and tooling for BDH will help along the complexity. Obviously, the combination of those products is then well suited to address the cases of big data warehouses.
How To Use The DW Quadrant?
Now, how can the DW quadrant help? I have introduced it to various customers and analysts and it made them think. They always start mapping their respective problems or perspectives to the space outlined by the quadrant. It is useful to explain and express a situation and potential plans of how to evolve a system. Here are two examples:
SAP addresses those two dimensions or the forces that push along those dimensions via various products, namely SAP HANA and VORA for the data volume and performance challenges, while BW/4HANA and tooling for BDH will help along the complexity. Obviously, the combination of those products is then well suited to address the cases of big data warehouses.
An additional aspect is that no system is static but evolves over time. In terms of the DW quadrant, this means that you might start bottom-left as a data mart to then grow into one or the other or both dimensions. These dynamics can force you to change tooling and technologies. E.g. you might start as a data mart using an open source RDBMS (MySQL et al.) and Emacs (for editing SQL). Over time, data volumes grow – which might require to switch to a more scalable and advanced commercial RDBMS product – and/or sources and models are added which requires a development environment for models that has a repository, SQL generating graphical editors etc. Power Designer or BW/4HANA are examples for the latter.
Connect with me here and on Twitter via @tfxz.
VN:F [1.9.22_1171]The Data Warehousing Quadrant,