A user wants to report on ERP data. The most obvious approach would be to, well, query the ERP system directly. For a long time this has not been done, for purely technical reasons: it is too complex to find the data, query response times are too slow, such queries impact the performance of the source system, and so on. But since these are technical reasons and technology evolves, it is worth questioning whether those reasons are still valid. Especially with the query speed Hana provides, a major obstacle might be gone.
The data model of an ERP system is complex
In today’s environment the end user expects to query the data himself, via a self-service Business Intelligence system. Providing the user with just a small subset of the data in order to simplify the task is not an option. In the classic Data Warehouse approach the ERP data model is simplified into measures, say Revenue, that can be analyzed by various properties, e.g. per Material, per Customer, per Region, per ….. This is the natural way a user asks for information, and it allows data to be combined as well, e.g. what is the order backlog? It is the sum of all ordered products minus the sum of all delivered products.
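The backlog calculation above is simple arithmetic over two measures. A minimal sketch in Python with sqlite3 (table and column names are invented for illustration, not the actual ERP schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders(material TEXT, ordered_qty INTEGER);
CREATE TABLE deliveries(material TEXT, delivered_qty INTEGER);
INSERT INTO orders VALUES ('M1', 100), ('M1', 50), ('M2', 30);
INSERT INTO deliveries VALUES ('M1', 120), ('M2', 10);
""")

# Backlog per material = sum(ordered) - sum(delivered)
rows = conn.execute("""
SELECT o.material,
       o.ordered - COALESCE(d.delivered, 0) AS backlog
FROM (SELECT material, SUM(ordered_qty) AS ordered
        FROM orders GROUP BY material) o
LEFT JOIN (SELECT material, SUM(delivered_qty) AS delivered
             FROM deliveries GROUP BY material) d
  ON o.material = d.material
ORDER BY o.material
""").fetchall()
print(rows)  # [('M1', 30), ('M2', 20)]
```

The point is that the end user wants to ask for "backlog per material" directly, not to write this statement himself.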
In the Data Warehouse this simplified data model is created in another database, mainly to speed up query performance, among other advantages. In a virtual data model the very same simplified data structures exist, but as database views only. Two requirements have to be fulfilled to support this:
Requirement 1: Database views need to support more than SQL, as the transition from the ERP data model into measures and attributes consists of more than just joins and SQL expressions. Even simple things like looking up the last record, pivoting the data, splitting data to process it differently, data quality checks, ….. cannot be expressed in a single SQL query.
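One of the transformations named above, looking up the last record per key, sketched procedurally in Python. This is only a stand-in for what such a transformation step would express in a richer transformation language; the price records are invented sample data:

```python
# For each material, keep only the most recent price record.
records = [
    {"material": "M1", "valid_from": "2015-01-01", "price": 10.0},
    {"material": "M1", "valid_from": "2016-03-01", "price": 12.0},
    {"material": "M2", "valid_from": "2015-06-01", "price": 7.5},
]

latest = {}
for rec in sorted(records, key=lambda r: r["valid_from"]):
    latest[rec["material"]] = rec  # later records overwrite earlier ones

print(sorted((m, r["price"]) for m, r in latest.items()))
# [('M1', 12.0), ('M2', 7.5)]
```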
Requirement 2: The above views need to return the data in sub-seconds, else the system will not be used.
For traditional databases the answer to both questions is a straight “no way”. Hana supports CalcViews reading from procedures, and it provides the Task Framework with its rich data transformation options, so it easily meets the first requirement. The point about performance is not that simple. In many cases yes, but the transformations can easily grow more and more complex and will break the requirement at some point.
Performance of views
It is not sufficient to return the data quickly most of the time; a good solution needs options to guarantee it for every view. One way would be to cache the view results. The first time the view is accessed it takes the full time; the second time the data is readily available and can be returned from the cache in no time. And the third time the query returns wrong data, because a record was updated and the cache now contains outdated data. Not good. A solution could be to invalidate the cache whenever a base table changes. That does not help either: for a table containing sales orders, the cache would be invalidated every second due to the high frequency at which new orders are entered, which is essentially the same as having no cache at all.
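The broken-cache sequence described above can be replayed with a naive result cache. This is a deliberately flawed sketch to show the staleness problem; the table and function names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales(year INTEGER, amount REAL)")
conn.execute("INSERT INTO sales VALUES (2015, 100.0)")

_cache = {}

def revenue(year):
    if year not in _cache:                       # first call: hit the database
        _cache[year] = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE year = ?",
            (year,)).fetchone()[0]
    return _cache[year]                          # later calls: served from cache

first = revenue(2015)                            # 100.0, computed from the table
conn.execute("INSERT INTO sales VALUES (2015, 50.0)")  # base table changes...
second = revenue(2015)                           # still 100.0 -- cache is now wrong
```

The true total is now 150.0, but every further query keeps returning the cached 100.0.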
Another approach would be to cache all views with data from the same point in time. Then the data is outdated, yes, but at least consistent. A situation where a sales order (cached) does not exist for a delivery (not cached) cannot happen then. If the time the caches are built can be controlled, say once a day at midnight, this solution would essentially be a Data Warehouse: loaded once a day, all data transformed so that end user queries return quickly. Not good either. Certainly not realtime.
The best option would be a cache that is updated whenever any of the view’s base tables changes, with only the relevant portion of the cache being recomputed, not the entire cache of course. From a technical point of view this is a nightmare. A simple view joining only three tables, e.g. order, line item and customer, already has many cases. A new line item got added? Okay, that means a new row is added to the cache with data from the order and the customer. That is simple. A customer record got updated? Now all sales orders of that customer, with all their line items, have to be updated. Definitely not an easy task, and it gets more and more complex with the amount and types of transformations. But for complex views there is no alternative to a cache which is never out of date.
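The two cases discussed above, a new line item versus a customer update, can be made concrete with a toy delta-merge over plain Python dictionaries. This is an illustration of the problem only, not how Hana implements it, and all names and values are invented:

```python
# Cached join of line item -> order -> customer, keyed by line item id.
orders    = {1: {"customer": "C1"}}
customers = {"C1": {"region": "EMEA"}}
items     = {10: {"order": 1, "qty": 5}}

# One cache row per line item: (qty, region)
cache = {10: (5, "EMEA")}

def on_new_item(item_id, order_id, qty):
    """A new line item only adds a single cache row."""
    region = customers[orders[order_id]["customer"]]["region"]
    items[item_id] = {"order": order_id, "qty": qty}
    cache[item_id] = (qty, region)

def on_customer_update(cust_id, region):
    """A customer update must touch every cached row of that customer."""
    customers[cust_id]["region"] = region
    for item_id, item in items.items():
        if orders[item["order"]]["customer"] == cust_id:
            cache[item_id] = (item["qty"], region)

on_new_item(11, 1, 3)            # cheap: one new cache row
on_customer_update("C1", "APJ")  # expensive: rewrites all rows of C1
print(cache)  # {10: (5, 'APJ'), 11: (3, 'APJ')}
```

Even in this trivial three-table case, every source table needs its own merge logic; with real transformations the case distinction explodes.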
Requirement 3: The option to merge changes in the base tables of a view into a view cache, in order to have an always-current cache for those views whose execution time would otherwise be too long.
Only a few solutions in the market allow result set caching at all; none support merging table changes into the cache, certainly not in realtime. Including Hana. But Hana has the Task Framework. Can this be used?
Access to other systems’ data
While the scope of this topic is ERP data only, a reasonable solution should support adding external data as well. Every database supports Data Federation one way or the other, so that is a given. In the case of Hana the feature is called Smart Data Access. There are various quality metrics for Data Federation, e.g. which kinds of sources are supported, and how much can be executed in the source versus having to read all the data because the transformation is too complex for the source.
One point is obvious right away. Data coming from remote will be returned at the remote system’s query speed, at best. So things like caching, identifying changed data, and realtime updates of the cache with the changed data would be even more important.
Requirement 4: Transparent access to data of remote systems, plus Requirement 3.
As Requirement 3 cannot be fulfilled as it stands, this one seems even less possible. Hmmm, Hana’s Smart Data Integration option allows connecting to any remote system and supports capturing changes. So maybe something can be built?
Historically correct data
A user executed the query yesterday, sum(order_amount) where year = 2015, and the result was 1 million USD. Today the very same query is executed and the result is just 0.9 million. For a past year?? How is that possible? Quite simple: orders can be changed in the ERP system. Last year an order for 100,000 USD had been entered, but it was canceled yesterday. Technically, everything is fine. Is this what the user wants to see? Very likely not; the expectation would rather be to see two bookings, a booking of +100,000 USD in 2015 and one of -100,000 USD today. The order did exist last year, it is part of the 2015 financial statement and cannot be undone in a previous year.
These kinds of historical queries require data to be persisted. That is actually one of the main differences between an Operational Report (show the data as of today) and a Data Warehouse, where data is available in a historically correct way. This goes well beyond caching a view; it would require changing all tables in the ERP system to maintain the full history. This is possible in Hana, see the History Table feature, but it would increase the amount of storage space significantly. And with more data, query performance goes down a bit. Worse, joining a sales order table with a line item table is simple. But joining those two tables when both have a valid-since date as well? Very complex and hence time consuming. The option here is to have some kind of ETL logic, but in realtime.
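The two-bookings expectation from the example can be modeled by storing every change as a signed booking row instead of updating the order in place. A Python sketch using the amounts from the example (the order number and posting dates are invented):

```python
# Each change to an order is stored as a signed booking, never as an update.
bookings = [
    {"order": 4711, "posted": 2015, "amount": +100_000},  # order entered in 2015
    {"order": 4711, "posted": 2016, "amount": -100_000},  # canceled today
]

def total(year):
    """Sum of all bookings posted in the given year."""
    return sum(b["amount"] for b in bookings if b["posted"] == year)

print(total(2015))  # 100000  -- the 2015 statement never changes
print(total(2016))  # -100000 -- the cancellation shows up in 2016
```

With this representation, yesterday's and today's query for 2015 return the same 1 million USD, and the cancellation is visible as a negative booking in the current year.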
Requirement 5: Support historically correct queries
Hana has all the building blocks
In the text above, the Virtual Data Model was silently assumed to be implemented via database views. Hence the requirements to cache views, to update the cache, to find a solution for adding historical data in a view when the base tables have none, which obviously would not work. Looking at it from one level higher: does anybody care whether the virtual data model is based on views or on anything else, as long as it has the characteristics of a view?
From a user perspective, a select statement is executed against a database object. What that object actually is does not matter to the user.
From a developer point of view it is the same story. The transformations and the desired characteristics are what is defined; what kind of objects get created is secondary. It could be a view if the performance is good enough, it could be a table if the history has to be maintained, it could even be a mixture of both. From the UI point of view, it is all the same.
The source, transformations and the target are defined in the flowgraph editor, e.g. like this.
And then the semantics are set, e.g. the target shall be a view, or an existing table loaded in realtime, or…..
The next obvious step is to allow adding materialization points by dragging in a table at any point. For example the data prior to the aggregation should contain the history, hence a table is added there.
As of today the FlowGraph UI does not allow intermediate tables; it has to be designed as two objects. The complex part, the backend logic of getting realtime changes from the sources and deriving what they mean for the transformations, is implemented already.
With the term Virtual Data Model, a database-view-based solution is often associated. Database views have advantages but disadvantages as well. Some are of a technical nature, like query performance; others are of a logical nature, like not being able to show historically correct data if the source does not save all intermediate changes. All other approaches have their own limitations and issues. The main problem as of today is that ETL, EII, EAI, Replication, Federation, ….. are all different tools with nothing in common. In Hana all these technologies are in one place, in the database kernel. Therefore the user does not have to decide on one technology and stick with it; he can combine all of them and pick what makes the most sense.
Even better, the user can start with everything being views and, in case this is not the best option for whatever reason, set a checkbox for those and switch to something else. This allows the virtual data model approach to be the best of all worlds, as it combines the advantages of all integration styles, compensating each other’s disadvantages. The agility of views, with the speed of cached data, allowing the “cache” to maintain history. Running on the same database as the ERP system, or on a dedicated system to free the ERP system from ad-hoc queries, whatever makes the most sense at any time.
A short story about time
Since a couple of terms like EII have been mentioned, it is maybe a good idea to list the various technologies with their pros and cons in order of their chronological appearance.
- SQL Views: It all started with relational databases. These have highly normalized data models, making it hard for a non-expert to combine the data. Using classic SQL database views, the task can be simplified a bit. These can be called the most basic of all virtual data models. Major downsides were:
- The lack of flexibility: A view reads all data it is supposed to read, even if it is not used by the user. And if a field is missing, the user again has to join the view to something else.
- Lack of transformation capabilities: Limited to SQL only.
- Lack of performance: There is no difference between selecting from a view and executing the same SQL select directly. In the best case. In the worst case the view reads too much data, if a filter cannot be merged into it by the optimizer.
- Lack of semantics: How to combine views correctly, how to aggregate measures and similar definitions are not part of views.
- ETL Tools: With the dawn of Data Warehouses, Extract-Transform-Load tools came into demand: tools that extract the source data, transform it into a data model more suitable for queries, and load that data.
- Way better query response times.
- Supports complex transformations.
- Supports historical data.
- Flexibility was so-so. New data can be added relatively easily but requires IT; not as quick as adding a view.
- No realtime; data is usually loaded just once a day.
- The ETL process is easier than implementing the same in views, but the delta loads are hard to implement.
- To solve at least the performance aspect of views, the next step was Materialized Views. These required the source data to be local, and this “cache” could be updated on request only. In essence such views are pre-calculated aggregation tables.
- Data Federation capabilities built into the databases: Allow databases to query remote databases via database links.
- Around 2005 the term Enterprise Information Integration (EII) became popular and was sold as the replacement for every Data Warehouse. As the databases were too limited, the idea was to provide the Virtual Data Model one level higher. These data models’ tables can be queried, and they reach out into the source databases to get the data. Some of these tools allowed result set caching, but no updating of it. In the end their technology mostly ended up in BI tools, to support combining data from multiple sources.
So what are the elements in Hana supporting Virtual Data Models?
- SQL Views: Hana Live is an example of a Virtual Data Model built mostly on SQL views. Thanks to Hana, performance is not such a large topic anymore, and for the other downsides listed above there are additional options.
- Attribute-, Analytic- and CalcViews: These are views with a richer transformation language; SQL Script can be used here. They also have additional metadata to define what the attributes, hierarchies and measures are, and how to combine multiple views. So pretty much every limitation of SQL views is resolved with those, hence a very good foundation for the Virtual Data Model. The only real downside is the lack of stored historical data, as a view by definition does not store anything.
- FlowGraph as the superset of all: It allows for batch ETL, but also realtime transformations and CalculationViews. In case the data is loaded into a target table, the user can choose to preserve the history while loading. Thanks to the realtime adapters the delta problem is solved as well: changes are actively sent into Hana by the adapters.
- Data Federation: Best of all, each of the above options can be used on local data as well as on remote data.
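The history-preserving realtime load mentioned for FlowGraphs can be sketched as change events applied to a target table that appends versions instead of updating rows in place. A toy Python model only; the column layout and all values are invented:

```python
# Target table rows: (order_id, amount, valid_from, current)
target = []

def apply_change(order_id, amount, ts):
    """Apply one captured change event with history preservation."""
    # Close the currently valid row for this order, if any...
    for i, (oid, amt, vf, current) in enumerate(target):
        if oid == order_id and current:
            target[i] = (oid, amt, vf, False)
    # ...and append the new version as the current one.
    target.append((order_id, amount, ts, True))

apply_change(4711, 100_000, "2015-03-01")  # order entered
apply_change(4711,       0, "2016-05-20")  # order canceled
current_amount = sum(amt for oid, amt, vf, cur in target if cur)
print(target)
```

A query on the current rows sees the order as canceled, while the 2015 version remains available for historically correct queries.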
Perfect, isn’t it?
Virtual Data Models in 2016