There are currently two kinds of statements circulating when it comes to Business Intelligence: 1) Thanks to HANA all BI queries can be executed on the operational system – no ETL needed – and 2) HANA is the perfect Data Warehouse database for BI.
Running all queries on the source system directly is very appealing as it sounds easier, cheaper and the data is current (=realtime) per definition. But it has significant problems, yet the classic Data Warehouse approach has its own issues. In this post I’d like to provide pros and cons and want to show the vision I am executing against – The Realtime Data Warehouse.
For me the key problem is the term “Business Intelligence” – it is not well-defined.
Does Business Intelligence mean nice and interactive reports?
Example: A customer called, he wants to order something and immediately a screen pops up with detailed statistics: This customer has bought a lot in the recent years, the tendency was to pay rather late and he never bought from the new product line before. Important information in the sales process.
It is the most natural thing to embed such operational report in the operational system and use its data directly. The reason such operational reports moved into the Data Warehouse 15 years ago was because of better response times there, at the expense of having a second database and data being a day old. Thanks to S/4HANA or generally speaking, the ability of HANA to run OLTP transactions and OLAP queries simultaneously, this is no longer needed. Or in other words, if the term “Business Intelligence” is used as synonym for “Operational Reports” there is no need for a Data Warehouse.
Business Intelligence for new insight
Example of a Heatmap
But the idea of a Data Warehouse was to provide methods to gain new insights by enabling the business user to analyze the data themselves.
Example: The revenue of the sports division is declining throughout the last six months. But why? A quick query putting the sales in relationship with the market development showed that the market as such is doing fine. So it has to be a home-grown problem. Looking at a heatmap with region versus product-line and it gets more obvious: All is green except Basketball which has a problem in all regions. Soccer is doing fine. Is it possible the shift towards soccer was underestimated and sales should focus there?
The nature of these queries is that they consider lots of data and combine the data in unforeseen ways to gain new insights. As such there are certain requirements to be solved. Either in the operational system, if that is preferred, or in the Data Warehouse.
Requirement 1: Historical correct data
ERP query for 2010 data executed yesterday
Example: In 2010 the region of Germany did particularly well, it generated a revenue of 19bn EUR. Running the same report today to see last year’s data shows a revenue of only about 8m EUR. What did happen?
The answer is simple, one customer’s headquarter moved from Germany to somewhere else. Therefore all revenue, including the past revenue, is now part of another customer region. The operational system is all about the current situation. Keeping the old data would make it more expensive to maintain and all queries very complex.
This happens even in unexpected situations. For example the sales order data might look like being historical correct because it has a sales order date, but it is not. A sales order was created in Jan, hence the full amount of 10’000USD was shown by the report in Jan. A month later the quantity was reduced in half, so the historical correct point of view would be +10’000USD in Jan and -5’000USD in Feb. Otherwise all the past reports would show different data today.
ERP query for 2010 data executed today
The operational system cares about the current sales order value, it would return for the same report an amount of 5’000USD for Jan suddenly.
When using a separate Data Warehouse database storing the historical data is no problem. It can be defined very flexible using Slow Changing Dimensions and other techniques without performance downsides. Actually, one of the main reasons for a Data Warehouse is historical data.
(Link: How to load a Slow Changing Dimension in HANA)
Keeping history in the OLTP system is expensive, has too many side effects, makes queries too complicated. The easiest and cheapest way to add historical data in the OLTP is by having a cold storage space. A (sub-)database holding all the data not mission critical. Maybe even on another server, one that is cheaper because it does not have the same high Service Level Agreements as the main ERP server? Call this database a Data Warehouse – it is semantics only.
Requirement 2: Standardized data
In the standard report of the initial example, the statistics of one customer were shown. One customer is easy to identify, it is a customer with a given customer number. But when combining many different rows, the master data should be identical or the wrong conclusions are drawn.
Example: What is wrong with the query “Show me the revenue of customers where city equals Beijing”? A careful look at all customer records would show that there are various spellings of “Beijing”: “BEIJING”, “北京”, “Peking”, “Běijīng”.
That’s natural, the only important part about the spelling in the ERP system is that the sales order can be shipped to the customer. But if the Beijing office is closed because there is so little revenue in “Beijing”, whose fault is it?
Of course it can be argued that the city should be entered in a standardized way. But at some point there are so many rules the OLTP system has to follow, it becomes unusable. When a customer record cannot be created because the city name is not in the reference table, when typing in the fields takes longer to be rule compliant while the customer is waiting on the phone,…. not good.
Requirement 3: Consistency of data
Data Consistency Example
Example: Only 15% of the customers are female. That is an interesting finding, isn’t it? It is difficult enough to draw the correct conclusions out of the fact, e.g. does that mean we are giving away potential sales and should focus on female customers? Does it mean our products should be presented with male customers in mind? But when the reality is there are another 15% male customers and for 70% of the customers we simply don’t know, the information might better not be used for analysis.
In the OLTP system there is no such quality measure. There are fields being used by the business users in various ways and when these users draw the wrong conclusions, the company is impacted immediately. So there should be some logic which constantly scans the source data and provides statistics to the team responsible for the Business Intelligence solution. In a Data Warehouse this monitoring is a natural part of the ETL logic. Plus data can be enriched – say the gender can be derived from the first name and country from the city name. And nothing wrong with playing back this information into the ERP system.
Requirement 4: Simplify the OLTP data model for easier reporting
One major argument in favor of a Data Warehouse is its simpler to understand Star Schema data model.
Example: There is a fact – say Sales Order Revenue – and linked to it is the master data. This is exactly how business users are navigating through the data: Show me the Sales Order Revenue per Customer City, Year and Material Group. Compare that with the same query but the previous year.
Doing the same on the operational tables directly would be way more complex. The revenue can be read from the sales order line item, it has a link to the material master which is linked to the material group text table. The order line is linked to the order where the customer number is stored and in the customer table the city can be found. For the year the function year() has to be applied on the order create data column. Oh, cancelled orders should be ignored in the revenue calculation and returns decrease the revenue amount.
Will a business user be able to do all that by himself? Unlikely. But by creating a database view on the tables this complexity can be hidden. This is the idea of HANA Live – one view with all the subject matter data of one area. Having one view with all the sales data including master data is not as flexible as picking the fact and the master data at will. Although theoretically the views can be designed as facts and master data as well, combining views often has significant performance issues and it is avoided therefore.
Requirement 5: Data Transformation for easier consumption
Example of a simple Sales Item view of HANA Life
A common CalcView
As shown above, it is mandatory to simplify the data model and to apply transformations in order to turn database fields into true measures. A sales order of type Return does decrease the revenue etc. These transformations tend to get very complex, making the view – in the OLTP case – very complex or the ETL process – in case of a Data Warehouse – complex. But the transformations have to be done, hence neither approach has the lead there. If five tables have to be joined to get the data and some if-then-else logic applied on the order amount, it makes no difference if that logic is done in a view, in the ETL job or in the query the user creates manually.
Except the fact that the view or the query has to execute the logic every single time somebody queries the data, whereas in the Data Warehouse case the transformation is executed only once, when the changed data is loaded. As long as the transformations are performed in sub-seconds both approaches are okay. In reality the mount and complexity done in the view-approach is very limited.
The occasionally brought argument “ETL is so complex” is true, but in the view based approach ETL is replaced by the even more difficult view creation.
Requirement 6: Performance, Performance, Performance
While the marketing statement – HANA is so fast, it can do everything in sub seconds – is surprisingly close to reality, even HANA has to honor the laws of physics. If a transformation is so complex it takes 1ms per row, reading a 1 million row table either takes 1000 seconds or a server with 1000 CPUs. With the complexity of the business world it is no problem to require even more CPU time than that. And there goes the sub-second response time.
This SCN post shows some examples of transformations and their performance impact to get a feel.
Requirement 7: And what about Realtime?
In summary, the advantage of using views directly on the OLTP system is current data (=realtime), the advantage of the Data Warehouse is better reporting. Hmm, the combination of both sounds like a good idea. Indeed, it does and that had been a hot topic in 2004 but for a very short period only. One article which shows why this never took off is this one. In short, the technical complexity makes it almost impossible to implement.
One suggested solution is to run the delta load of a Data Warehouse frequently. Every hour, every 5 minutes, every second(?) and call that near realtime. How long does a typical delta load of a Data Warehouse take? Say 2 hours? And this task should be executed every 5 minutes? That won’t work. The main issue is that finding the changes in the source is often not possible and then the entire source table has to be read and compared with the target – and that takes its time. One way to mitigate that would be to load a subset of the data frequently. But then there is a consistency problem right away. The sales order data should be loaded often. An order is linked to a customer and the queries all use the customer information. Hence customer has to be loaded as well. Now the customer has a country key and if the table with the country name does not contain it, some queries won’t work. So although the country seldom does change, it has to be loaded the instant a customer record is using a new value. Since the main advantage of a Data Warehouse is the thought of self-service BI, at the end all data has to be loaded.
Further more there is a consistency problem. Right this second the load of the sales order started and a new customer record and sales order was created in the source. As the customer load was finished for this tact already, the sales order won’t have a matching customer record for now. That is weird side effects when querying the data.
HANA realtime dataflows
All of the above is solved when using the HANA Smart Data Integration adapters and their realtime capabilities. It is a transactional load, hence data is loaded in the same order as the changes happened in the source. And the ETL logic does not have to constantly ask “Did something change?” in every single source table, instead the source system does push the changed data the instant the data was committed.
Even building dataflows that perform an initial load and then apply the changes in realtime is dramatically simplified compared to all other products in the market. The dataflow is designed once and when the “Realtime” flag is set, the activation logic creates the initial load flow and the realtime flow including the adapter subscriptions. As result, a dataflow for handling the delta is no longer a complex manual task, it is a checkbox. (see: Realtime Transformations in HANA)
With a classic Data Warehouse database there would be many problems still. When are the aggregate tables updated? How is a “rollback segment too small” error prevented when there are many changes while a query is running. When are the indexes maintained so they do not grow that fast? When are the reports refreshed? None of these problems apply for HANA – there are no aggregate tables needed, there are no indexes on all fields needed, there is no rollback segment as read consistency is handled in a different way.
(Link: Comparing HANA with other databases)
You can’t report on what is not there. The operational systems do not have historical data for a good (cost, performance) reason but Business Intelligence is all about comparing data, often with previous data. As a summary having a second database is a requirement in most cases anyhow. And instead of copying the operational data model, optimizing the data model for query performance does not cost a thing and has significant advantages – this secondary database is then called Data Warehouse. Loading the Data Warehouse in realtime was an impossible to do task but thanks to HANA and its Smart Data Integration option it is quite easy in fact.
VN:F [1.9.22_1171]Realtime Business Intelligence and HANA,