Do you remember all the times you stored the results of a database query in addition to the original data for performance reasons? Then you probably also recall the significant drawbacks that go along with these so-called materialized views: they introduce data redundancy that makes your data model more complex, requires additional database operations to ensure consistency, and increases the database footprint. In this blog post, we demonstrate that, thanks to SAP HANA’s unique in-memory technology, you can simplify your data model by getting rid of materialized views.
This is the second part of our deep dive series on SAP Simple Finance, SAP’s next-generation Financials solution. In the first part, we show how SAP Simple Finance uses the capabilities of SAP HANA to simplify Financials and deliver non-disruptive innovation by removing redundancy. This brings significant benefits: the core data model is as simple as possible with two tables for accounting documents and line items, the database footprint shrinks by orders of magnitude, and the transactional throughput more than doubles.
The benefits are convincing and SAP Simple Finance demonstrates that it can be done. You may ask yourself how this is technically possible and whether you can take the same approach for your applications by running on SAP HANA. The following paragraphs summarize our answers and the longer article below gives more details. Furthermore, the next blog post in the deep dive series will explore the specific case of materialized aggregates, which refer to redundantly stored aggregation results.
The following example shows the motivation for materialized views in traditional database systems: You have an SQL query that selects database rows based on several parameters, for example, all open items for a particular customer. Executing this query against a large base table requires scanning through the whole table of all accounting document line items in order to find the rows that match the selection criterion. In a traditional, disk-based database system, this may be too slow for practical purposes. The alternative is building up a materialized view that explicitly stores the smaller subset of open items and is constantly updated. When querying open items for a particular customer, the database then only needs to scan through the smaller materialized view, resulting in a sufficiently fast response time also on disk-based database systems.
In view of the significant drawbacks of materialized views, the goal is to replace materialized views with on-the-fly calculation. The numerous benefits of getting rid of materialization include an entirely new level of flexibility, increased throughput, and simplicity (for more details, see the long article). The costs of doing so are actually minor, as we outline below: in fact, in-memory response times of on-the-fly calculated queries are typically faster than queries against materialized views on a disk-based database. As illustrated in Figure 1, this tips the seesaw in favor of removing materialized views.
Figure 1: Replacing Materialized Views with on-the-fly calculation
Looking at in-memory database systems only, materialized views are almost never necessary nor beneficial thanks to the superior performance. We show below that in-memory technology shifts the break-even point in a way that materialization is only beneficial in rare circumstances of highly selective queries. A single core of a CPU is able to scan 1 billion line items in less than half a second. In the same time, a disk-based system could only access 50 random disk locations (based on a latency of 10 ms).
In line with this reasoning, SAP Simple Finance took the opportunity offered by SAP HANA and removed materialized views from the data model: tables such as all open Accounts Receivable line items (BSID) have been replaced non-disruptively by compatibility views calculated on-the-fly (for details on all changes, see the first part of this series). The same applies to materialized aggregates such as total balance amounts for each customer per fiscal period (KNC1). Hence, in the next part of the series, we continue our deep dive by looking at queries that include aggregation functions and how they can be tackled similarly.
This blog post continues after the break with an in-detail look at the points that we have summarized so far. We first look at the concept and maintenance of materialized views. Afterwards, we investigate the implications of materializing views and provide decision support to get rid of materialization.
In the following, we first always consider in-memory database systems only and the new opportunities they enable when considering whether to materialize or not. The comparison of in-memory to disk-based database systems is then considered separately. Simply accessing a pre-computed value will always be faster than computing it by running over multiple tuples, even in an in-memory database. The difference is that with the speed of in-memory technology it has now become feasible to dispense of the materialization, because computation on-the-fly is fast enough in most cases, especially compared to traditional disk-based database systems and typical disk latencies of 10 ms. We investigate the situations where systems can dispense of materializing views or aggregates thanks to the speed of SAP HANA and show that materialized views or aggregates are unnecessary – and, thus, harmful – in almost all scenarios.
The Concept of Materialized Views and Their Maintenance
A view represents the result of a stored query on the database. Essentially, it is a named SQL query that can be queried like any table of the database. In the following, we focus on the case of a single base table with arbitrary selection conditions. The following assumes a query with projection and selection, but does not consider joins of tables as the base relation for reasons of simplicity. We neglect aggregation in this section, so that a view always references a subset of tuples from a base relation according to a query.
A materialized view explicitly stores copies of the corresponding tuples in the database (Gupta and Mumick: “Maintenance of Materialized Views: Problems, Techniques, and Applications”; IEEE Data Eng. Bull., 18(2); 1995). In the absence of aggregates, materialized views have the same granularity level as the base tables. If the query also aggregates the items, we speak of materialized aggregates – they will be covered in detail in the next blog post. In contrast to a simple index on a table column, a materialized view describes semantic information, as the selection criteria can be more complex than a simple indexation by one value.
If an item matches the condition of the materialized view, those properties of the item that are part of the view’s projection are redundantly stored. Whenever the base tables are modified, it may be necessary to modify the materialized view as well, depending on the modified tuples and the view’s selection criteria. There are several cases to consider:
- Inserting a new tuple into a base table that matches the criteria of the materialized view requires inserting it into the materialized view.
- As part of an update of a base table, a change of a tuple’s properties does not only have to be propagated to copies of the tuple (update operation), but may also result in the whole tuple now being newly included or excluded (insert / delete) if the new values of some properties changes the value of the materialized view’s selection criterion.
- When deleting a tuple from the base table, all copies in materialized views have to be deleted as well.
In summary, each materialized view leads to additional database operations whenever data in the base table is being modified. Instead of just modifying the base table, additional operations are required to maintain consistency in the view of redundant data. In a system with several materialized views, each transaction may require several times more modifying database operations than would be necessary to just record the change itself in the base table. This lowers the transactional throughput of the whole system as the number of costly modifying operations increases and locking frequently leads to contention.
This also applies in case of lazy materialization. A lazy maintenance strategy only modifies materialized views when they are accessed (Zhou, Larson, and Elmongui: “Lazy maintenance of materialized views”; in: Proceedings of VLDB 2007). However, in the typical OLTP workload of an enterprise system, both modifying transactions and reading queries happen so frequently and intermingled that the number of additional operations due to materialization remains the same: almost all transactional modifications will be followed by queries accessing the materialized views that require propagating the modifications.
Hence, materialized view maintenance adds to the operational complexity of a database system, requires additional modifying operations and lowers the overall system transactional throughput. Furthermore, there is a cost associated to the additional storage that is required for the redundant data which can be substantial in size. These drawbacks have to be balanced against the main benefit of a materialized view: the increased performance of queries against the view as the query underlying the view does not have to be evaluated on each access.
Implications of Materialized Views on Database Size and Query Performance
A materialized view on one base table (which acts as a sophisticated index) will always be smaller than the base table in terms of number of tuples and overall size (or equally large in case of an exact duplicate). However, in case of multiple materialized views on the same base table that are not mutually exclusive the overall size of materialized views in a database schema can be larger than the base tables. The more and more materialized views have been added for performance reasons in the past, the more storage space is taken up by the redundant data.
The drawbacks of a materialized view can thus be summarized as follows:
- Reduced throughput due to the overhead on each update, insert, or delete.
- Increased storage space for the materialized redundant data.
This has to be weighed against the potential impact on performance. The following calculations will show that the shift to in-memory technology diminishes the difference in performance between materialization and on-the-fly calculation, making the former much less worthwhile.
Let us assume that the base table contains n tuples, of which a given view selects m through its selection condition. These m tuples would be stored redundantly in a materialized view. The ratio describes the selectivity of a query. The higher this factor, the more selective the query is. Any query that accesses the view will usually apply further selections on top.
Two (inter-related) factors influence the performance impact of a materialized view for such queries on an in-memory column store. The impact will be even larger when compared to a traditional, disk-based row store.
- Already materialized result: The materialized view has already applied the selection criteria against the base table and thus queries accessing the materialized view do not perform the column scans that identify the m tuples of the view out of all n tuples of the base table again.
- Smaller base for selections: The additional selection of queries directly operates on the smaller set of records as the result of the view has been physically stored in the database. That is, the necessary column scans operate on attribute vectors that contain entries for m instead of n tuples. The smaller input relation influences performance proportional to the selectivity factor .
In both cases, the extent of the performance impact of a materialized view depends on the ratio of n to m. On an abstract level, the operations necessary for a query with and without materialized view can be compared as follows – again, both times looking at an in-memory database:
- Without a materialized view, the response time will be proportional to n, as all full column scans will operate on attribute vectors with n entries.
- With a materialized view in place, the response time of a query will be proportional to m, the smaller number of entries contained in the materialized view.
Influence of Selection Criteria
|In addition to the number of tuples in base table (n) and view (m), let’s furthermore assume that the selection of the view depends on c different columns. Any query that accesses the view may apply further selections on top, taking into account d columns for the selection; e columns thereof have not already been part of the initial view selection.With regard to the two factors outlined in the main part, the selection criteria then have the following effect:
- Already materialized: Assuming that independent, possibly parallel column scans are the fastest access path due to the data characteristics, the materialization already covers the scans over the c columns, each with n entries, that are part of the view selection.
- Smaller base: With a materialized view, the d additional selections of queries require d column scans on columns with m entries, instead of n entries without materialized views.
When now comparing the situation with and without materialization, it has to be kept in mind that in the absence of a materialized view, some of the additional selections overlap with the view selection criteria and can be combined into a single column scan. Hence, only e additional scans besides the c attributes are necessary (but, of course, on a larger set of data).
Query against materialized view
Query without materialized view
d column scans, each with m entries (assuming independent, possibly parallel access)
(c+e) column scans, each with n entries
Proportional to d × m
Proportional to (c+e) × n (c+e ≥ d, n ≥ m)
For deciding whether to materialize a certain view, the difference in the number of columns to consider for the selection (d vs c+e) is significantly smaller and has thus less influence on the performance compared to the difference in the number of entries to scan (n vs m). In turn, the selectivity factor remains most important.
In summary, the performance will only improve by the selectivity factor . The more detailed calculations in the side bar also take into account the selection criteria and show that the selectivity still is the most important factor.
In addition to restricting the number of tuples to consider, a materialized view may also include only a subset of the base columns. However, for the performance of queries in a columnar store, it does not matter how many columns from the base relation are projected for the materialized view: in contrast to a row store, each column is stored entirely separate. Adding more columns to the materialized view does not impact the performance of any query on the materialized view if the query explicitly lists the columns in its projection (which should be the case for all queries, as SELECT * queries are detrimental to performance in row and column stores alike, besides other disadvantages such as missing transparency of source code). Duplicating more columns does, of course, increase the storage size. In general, a materialized view should encompass all columns that are relevant to the use case in order to increase its usefulness, because the materialized value can only be accessed by a query if all required columns have been materialized. In turn, keeping redundant data thus gets more costly in terms of required storage and complexity of modifications.
Decision Support – To Materialize or Not To Materialize?
The linear cost model described above has long been used in materialization discussions and has been confirmed experimentally (see Harinarayan, Rajaraman, and Ullman: “Implementing Data Cubes Efficiently”; in: Proceedings of SIGMOD 1996). It is especially suitable for columnar in-memory database system, because these store the entries of each column sequentially.
The first step when deciding whether to materialize or not in a columnar in-memory database thus consists of analyzing the selectivity of the query underlying the view. Based on the above, a materialized view may be reasonable performance-wise only if the following two criteria were fulfilled:
- Absolute measure: Does the performance without materialization not meet expected requirements?In an in-memory database system such as SAP HANA, queries run much faster than in a traditional database system (see Real HANA Performance Test Benchmarks). This means that many queries with a previously bad performance perform sufficiently fast enough in an in-memory database and therefore require no further optimizations (such as materialization). For example, imagine a view on a table with 1 billion line items. Each entry in the column of the selection criterion takes up 2 byte (after dictionary compression). Scanning the whole column of 1907 MB takes less than half a second using a single core, assuming a memory processing speed of 4 MB per ms per core (1907 MB divided by 4MB/ms per core = 477 ms per core). Even with only four cores, which is nowadays commodity hardware, 8 different attributes could be scanned in parallel in still under a second without any materialization.
- Relative measure: Is the performance with materialization significantly better than without?Even if according to the absolute considerations a speed-up would be beneficial, the performance would still have to be compared and the potential performance advantage traded off with the disadvantages of materialization (mostly lowered throughput and increased database size).The performance savings will be proportional to the selectivity factor . If m is not orders of magnitude smaller, but for example only 10% of the base size, materializing will thus not yield significant savings. Instead, other means to increase the performance would be necessary.The additionally required storage is proportional to m/n-th of the base table. A large share of columns will typically be replicated in this scenario in order to not restrict the usefulness of the materialized view. For example, the materialized view of open customer items in SAP ERP Financials (BSID) replicated half of the columns of the accounting document line items table BSEG.
In summary, the need for materialized views as described above vanishes with in-memory columnar databases. Materialization is simply not needed to provide fast access similar to an index. Figure 2 (repeated from above) highlights why eliminating materialized views is preferable now: the impact on response times compared to accessing a materialized view is less significant as in-memory technology reduces the overall performance. This is done in a non-disruptive way by instead providing a non-materialized compatibility view that represents the same query as the former materialized view, but is calculated on-the-fly. Applications seamlessly access this virtual view without requiring any modifications. We already explained the topic in a corresponding chapter of our last blog post and will dive deeper in a future blog post.
Figure 2: Replacing Materialized Views with on-the-fly calculation
The break-even point at which a materialized view becomes beneficial for performance reasons is reached much later in terms of the selectivity of the underlying query. For view queries with low selectivity, a materialized view constitutes almost pure overhead because the performance without materialization is nearly the same and, moreover, acceptable in absolute terms. The benefit of materialization gradually increases with the selectivity. However, the benefit in terms of performance – depicted in Figure 3 below as the distance between the lines of the in-memory scenario – has to be balanced against the cost.
Figure 3: Query performance depending on selectivity
Not relying on a materialized view improves flexibility, increases the transactional throughput, lowers complexity, and reduces storage costs. Additionally, the performance impact of materialization as experienced by users diminishes with in-memory technology. The effect on the break-even point beyond which the benefit of a materialized view outweighs its costs is depicted in the following Figure 4. With the move from traditional, disk-based database systems to in-memory systems, even the most selective queries do not sufficiently benefit from materialization to outweigh the costs.
Figure 4: Shift of break-even point of materialization thanks to in-memory database
The above reasoning also holds true when looking at the complexity of queries instead of (or in addition to) selectivity: even in case of the most complex queries does the performance impact of materialization no longer outweigh the costs.
The case of SAP Simple Finance demonstrates these points in more detail, as SAP Simple Finance removes materialized views in an entirely non-disruptive manner. It demonstrates that above calculations on the feasibility of removing materialized views indeed apply in practice. In an example SAP system, BSID (open Accounts Receivable line items) contains roughly every 300th item from the base table BSEG. Even for this already moderate selectivity, removing the materialized view has been feasible. Each query on BSID now transparently accesses the corresponding compatibility view so that the entire result is calculated on-the-fly.
The second building block of the removal of materialized redundancy in SAP Simple Finance is the replacement of materialized aggregates, which we will discuss in the next blog post.
VN:F [1.9.22_1171]How Simple Finance Removes Redundancy: The Case of Materialized Views,