How Simple Finance Removes Redundancy: The Case of Materialized Aggregates

Imagine you calculate the sum of a long series of numbers in your head, for example, the sales of the current month. For you as a human being, it probably feels natural to write down the final result as soon as you have finished the calculation. You do this because you are afraid of forgetting the result and having to start all over again.

For a computer, the situation is different. A computer is blazingly fast at such simple calculations. There is no need to physically store the aggregate – the result of such an aggregation – in a database. The computer can always quickly calculate the result again and is not daunted by work that would feel tedious to a human being. In fact, it would be much more tedious for the computer (and any program running on it) to update a stored aggregate figure every time the base data changes. For example, if a new sales order comes in, it would not only need to record the new item in the database, but also update the stored aggregate, requiring additional expensive modifying operations.

If this scenario sounds familiar, that may be from the first two parts of this series on SAP Simple Finance, where we discussed the removal of redundancy in SAP Simple Finance, in particular the case of materialized views – physical database tables that store query results. Materialized aggregates lead to similar issues, yet they have likewise been used in the past for performance reasons. As materialized aggregates have some distinct characteristics, let’s take a closer look in this article, which again consists of a summary followed by a longer version that goes into more detail.

Compared to materialized views in general, materialized aggregates are a special case due to key differences in several areas:

  1. Underlying query: Their query involves aggregation (group-by clause in the query) and calculation, not only a selection and projection of the base tuples of some database relation.
  2. Structure: The tuples stored in the materialized aggregate are not identical replicas of some other data in the database, but have a different structure. Typically, some identifying attributes, the dimensions, form the key and other attributes represent the aggregated values. In the example from the beginning, the month dimension would be part of the key and the calculated sum of sales the aggregate value, or fact.
  3. Data distribution: Typically, the table of the materialized aggregate contains fewer tuples than the base table because of the aggregation.

For a base table such as the table of accounting document line items (BSEG), many attributes are possible dimensions, which may be used for aggregation in any combination. For example, for different analysis questions the line items may be grouped by profit center, period, and account, or only by period. Figure 1 displays the resulting hypercube of different materialized aggregates originating from BSEG and these three dimensions.

Figure 1: Example hypercube over accounting document line items

SAP HANA makes it feasible to get rid of materialized aggregates without compromising performance, as we show in the detailed calculations below. Comparing different materialization strategies with on-the-fly calculation, it becomes apparent that even with one billion line items we can dispense with materialization altogether. To stay with the picture of the opening statements: writing down the result of each calculation is not necessary. In the example detailed below, an in-memory database is able to calculate aggregate values by profit center and period on the fly in less than half a second. As you increase the number of attributes considered as possible dimensions beyond the simple example with three attributes, the trade-off between performance and materialization shifts even further in favor of on-the-fly calculation. More possible combinations due to more dimensions mean that the materialized aggregate tables contain more tuples, diminishing the gain of having fewer tuples to scan for queries.

An interesting situation highlighting the benefit of on-the-fly calculation occurs if you include more and more dimensions in aggregations to remain flexible with regard to analysis queries over any reasonable combination of dimensions. If you or your traditional database management system decides to materialize those aggregates (which would not be necessary in-memory, as outlined above), you may quickly reach a point where the number of tuples stored overall in the different materialized aggregates of the same hypercube exceeds the size of the base table. This scenario wipes out any performance benefit of the materialization, because scanning the base table will be faster than looking for the particular materialized aggregate value among the larger number of tuples across the different vertices of the hypercube.

In addition to performance, you should also consider the significant gain in flexibility of on-the-fly calculation. Instead of being limited to pre-defined aggregates, you can freely explore the data according to your analysis needs, thanks to the performance of SAP HANA.

In that spirit, the series of blog posts on SAP Simple Finance will continue with an overview and subsequent exploration of the business value of SAP Simple Finance.

This blog post continues after the break with a detailed look at the points that we have summarized so far. First, we introduce the concept of materialized aggregates and hypercubes. We then look at general strategies for materializing aggregates and the role of aggregation in financial accounting.


The Concept of Materialized Aggregates and Hypercubes

A materialized aggregate is a materialized view with aggregation (Müller, Butzmann, Klauck, and Plattner: “An Adaptive Aggregate Maintenance Approach for Mixed Workloads in Columnar In-Memory Databases”; in: Proceedings of ACSC 2014). It is the materialization of a query that aggregates the tuples of its base relations, grouping them by certain attributes and using an aggregation function such as COUNT or SUM. Thus, it provides access to calculated and continuously synchronized values that aggregate individual items according to specific criteria.

The following SQL template describes the general structure of an underlying simple query, where AGG1, AGG2, … represent arbitrary aggregation functions.

SELECT A_1, A_2, …, A_k, AGG1(V_1), AGG2(V_2), …
FROM baseTable
WHERE C_1 = '…' AND …
GROUP BY A_1, A_2, …, A_k
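For illustration, the opening example – the sum of sales per month, here additionally split by profit center – could be written as follows. The table and column names are placeholders for this sketch, not actual SAP table definitions:

SELECT FISCAL_PERIOD, PROFIT_CENTER, SUM(AMOUNT) AS TOTAL_SALES
FROM SALES_LINE_ITEMS
WHERE FISCAL_YEAR = '2014'
GROUP BY FISCAL_PERIOD, PROFIT_CENTER

A materialized aggregate stores the result of such a query as a physical table that has to be kept in sync; on-the-fly calculation simply re-runs the query against the line items whenever the values are needed.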

The set of attributes used for grouping will vary depending on the analysis question. In general, only a subset of all columns of the base table is relevant for grouping at all; these columns form the analysis dimensions. However, in order not to restrict the analysis possibilities when considering materialization, the dimensions should be an intentionally broad set. Out of these dimensions, a specific analysis question may require grouping by any subset. In case of a separate data warehouse system or a separate materialization mechanism, this system needs to consider all dimensions and therefore store the data grouped at least on the lowest (that is, most detailed) level of granularity.

If we ignore hierarchies within a dimension (for example, different levels of time or customer groups), the possible analysis queries over a specific base relation with k relevant attributes form a hypercube. Each possible view is a vertex of this k-dimensional cube. The view grouping by all relevant attributes A1, …, Ak is the most detailed vertex (lowest granularity of grouping). Each next-lower level of detail consists of all views that leave out one more attribute than the higher level.
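For instance, with k = 3 relevant attributes A1, A2, A3, the hypercube has 2^3 = 8 vertices: <A1,A2,A3>, <A1,A2>, <A1,A3>, <A2,A3>, <A1>, <A2>, <A3>, and the empty grouping <> that aggregates over all tuples.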

An example three-dimensional hypercube over the accounting document line items is displayed in Figure 1 above. To keep the explanation simple, we initially use only period, G/L account, and profit center as dimensions; we will later extend the example to include more attributes as dimensions. For our discussion, we assume the following cardinalities in the system:

  • 1 billion line items,
  • 1000 accounts,
  • 100 profit centers, and
  • 100 periods.

To put the number of periods into perspective: 5 years with 12 months per year amount to 60 periods (months). If data were aggregated by week, the same time span of 5 years would already lead to more than 260 periods (weeks). An even more fine-granular aggregation on a daily basis would lead to at least 1825 periods (days).

In this example, the highest-detail vertex is the view grouping by all dimensions <Period,G/L account,Profit center>. Assuming no correlation between dimensions, it could have values for up to 10 million cells (100 periods × 1000 accounts × 100 profit centers). Let us assume that half of the combinations do not occur in the base relation, so that the top view contains 5 million tuples (the numbers for the other vertices are similarly based on assumptions and displayed on the side of each box in Figure 1). Each tuple could have one or several aggregate values, for example, the number of items and the sum of the debit amounts. On the next lower level of detail, each of the vertices represents a view whose grouping leaves out one attribute. For example, the view <Period,Profit center> on the second level leaves out the account (G/L account) from the grouping. Each lower level of detail consists of all permutations with one attribute less than the level above. The lowest level of detail consists of the single vertex that groups by no attribute and thus sums over all entries.

Materialization Strategies and Performance

It would be possible to materialize all or some of the views represented by vertices in the hypercube. If some aggregates are to be materialized, the highest-detail vertex would be an obvious candidate. Since each vertex can be calculated from the vertices of higher levels that group by a superset of its attributes (for example, <Profit center> can be calculated from <G/L account,Profit center> by grouping on Profit center), all other vertices could be calculated based on this materialized aggregate. Further vertices could be selected for materialization based on the benefit of having them materialized. This benefit is proportional to the difference in the number of tuples compared to the higher-level vertices, as this difference represents the aggregation effort: a materialized view that contains 2.5 million tuples when the materialized view one level above contains 5 million brings less benefit than a materialized view in which only 5000 entries remain. A query on the first materialized view would still have to scan through 2.5 million tuples (that is, half of the higher-level view) to find the answer for a given query on the aggregate, so the materialization brings almost no performance gain. In contrast, having only 5000 entries to scan for a given question considerably improves performance and may be worthwhile, depending on the requirements.
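As a sketch of this derivation – assuming a hypothetical materialized table AGG_BY_ACCOUNT_AND_PROFIT_CENTER that stores the sum of amounts and the number of items per G/L account and profit center – the vertex <Profit center> could be computed by re-aggregating:

SELECT PROFIT_CENTER,
       SUM(TOTAL_AMOUNT) AS TOTAL_AMOUNT, -- a sum of sums is still the correct total
       SUM(NUM_ITEMS) AS NUM_ITEMS        -- counts are combined by adding them up
FROM AGG_BY_ACCOUNT_AND_PROFIT_CENTER
GROUP BY PROFIT_CENTER

This kind of re-aggregation works for distributive functions such as SUM and COUNT; as discussed further below, other functions require additional stored information.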

We can ignore the effort required for the actual calculation in all of these estimates, because the I/O activities while scanning the column actually dominate the CPU time required for the (simple) calculations. The CPU remains mostly idle while data is continuously loaded from memory into the cache (at a processing speed of 4 MB/ms/core).

Let us consider the query for the sum of sales for profit center 123 in January 2014 under three different scenarios of materialization. The attribute vector for profit center has a width of 10 bits to be able to represent all 1000 profit centers (period: 7 bits for 100 periods).

Scenario | Number of entries to scan | Scanning time (independent scans)
No materialization, direct calculation on BSEG | 1 billion | Profit center: 1192 MB / 4 MB/ms/core = 298 ms/core; Period: 834 MB / 4 MB/ms/core = 209 ms/core
Materialization of highest-detail vertex <Period,G/L account,Profit center>, group by Period & Profit center | 5 million | Profit center: 6.0 MB, 1.5 ms; Period: 4.2 MB, 1.0 ms
Materialization of <Period,Profit center>, direct access | 5,000 | Profit center: 6.1 KB, 1.5 μs; Period: 4.3 KB, 1 μs
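The scan volumes and times in the table follow directly from the number of entries, the attribute-vector widths, and the scan speed mentioned above, for example: 1,000,000,000 entries × 10 bit ≈ 1.25 × 10^9 bytes, the roughly 1192 MB stated in the first row, and 1192 MB / (4 MB/ms/core) ≈ 298 ms per core. For the 5 million entries of the highest-detail vertex, the same widths yield only about 6.0 MB (profit center) and 4.2 MB (period), which is why the scan times drop into the low millisecond range.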

As these back-of-the-envelope calculations show, materializing the highest-detail vertex is not necessary for performance reasons, although the on-the-fly calculation would, at least in this example, be noticeable for users – they probably wouldn’t mind an extra wait of less than half a second, though. The further reduction achieved by also materializing the vertex of the lower level of detail is negligible under most workload patterns. Hence, at most materializing the highest-detail vertex may be reasonable in this simple example, and only if performance expectations require it.

In the example above, the highest-detail vertex stood for a significant reduction in the number of entries. However, this came at the cost of considering only three attributes (out of close to 400 in the table BSEG in SAP ERP Financials) as interesting for grouping. As outlined above, many more attributes would be plausible dimensions in reality. The question is: at what point does it make sense to materialize at least the highest-detail vertex? As this trades flexibility and database size on the one hand against query performance on the other, the answer depends on the preferences at hand. Furthermore, the disadvantage of materialization with regard to throughput has to be taken into account as well. In fact, the impact of materialization on the complexity of data modifications is even more pronounced for materialized aggregates, because identifying all the materialized aggregate values that have to be updated after modifications of the base tables requires more consideration. Additionally, the same materialized aggregate value will need to be updated for many different underlying changes, because it aggregates more than one line item. For example, a total balance for a certain account has to be synchronized whenever a line item for that account is posted. Thus, contention through locking will happen more frequently, further reducing the throughput of a system with materialization in place. In effect, materializing aggregates trades query performance for modification performance and transactional throughput. Taking into account that the gain in query performance is diminishing thanks to in-memory speed, only the disadvantages of materialized aggregates, especially in case of data modifications, remain.
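To see why modifications become more expensive, here is a minimal sketch of what posting a line item involves once an aggregate has to be kept in sync. The table names, column names, and values are illustrative only, not actual SAP definitions:

-- without materialization, a posting is a plain insert of the new line item
INSERT INTO LINE_ITEMS (DOC_NO, GL_ACCOUNT, PROFIT_CENTER, FISCAL_PERIOD, AMOUNT)
VALUES ('4711', '400000', 'PC123', '2014-01', 250.00);

-- with a materialized total per account and period, the same posting additionally
-- has to update the aggregate row shared by all postings to this account and period
UPDATE ACCOUNT_TOTALS
SET TOTAL_AMOUNT = TOTAL_AMOUNT + 250.00,
    NUM_ITEMS = NUM_ITEMS + 1
WHERE GL_ACCOUNT = '400000' AND FISCAL_PERIOD = '2014-01';

Because many line items map to the same aggregate row, concurrent postings against a frequently used account serialize on that UPDATE – the lock contention described above.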

In addition, there are certain cut-off points beyond which it does not make sense to materialize an aggregate, irrespective of individual preferences. To be of any benefit, each materialized aggregate should contain fewer tuples than the base relation. Otherwise, scanning the materialized view for the answer is not significantly faster than accessing the same number of entries directly in the base table. As many attributes may be of potential interest, it is not a given that the highest-detail vertex contains significantly fewer entries than the base relation, because the grouping will be rather fine-granular. The same condition needs to hold for all lower-level aggregates: their size in terms of tuples needs to be smaller than that of any other materialized view from which they could be derived.

In addition to these hard constraints on each individual materialized view, the complete set of materialized aggregates needs to be significantly smaller, in terms of the number of tuples, than the base table. Otherwise, the same performance situation as described above applies when the database system has to access the corresponding materialized aggregate data: if the hypercube over all vertices contains more tuples than the base table, the system in effect has to consider more tuples in order to access a specific aggregate value than a full scan of the base table would require.

In a scenario of k dimensions, there are 2^k vertices in the hypercube. If the materialization strategy is too liberal and there are many dimensions, it is easily possible to end up with more tuples in the set of all materialized views than there are in the base table. Imagine again a base table with 1 billion tuples. Already with only 30 dimensions, there will be more than 1 billion vertices in the hypercube (2^30 ≈ 1.07 billion). Even if only every thousandth vertex is materialized, each of the resulting vertices must not contain more than 1,000 entries on average – otherwise, the total number of tuples in materialized views will already be greater than the number of entries in the base table. If, for example, the average number of tuples per vertex were 1,500, the system would have to consider more than 1.6 billion aggregate tuples. It is easy to imagine, based on the performance calculations above, that this may take longer than just scanning the base table and calculating the aggregate on the fly – not to speak of the associated storage costs. This also holds in view of the fact that each query will only access a specific vertex: the system still has to look for the tuples belonging to that specific combination of dimensions.
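Spelled out, the arithmetic behind this cut-off is: 2^30 / 1,000 ≈ 1.07 million materialized vertices; at 1,000 tuples each, that is already about 1.07 billion tuples, roughly the size of the base table; at 1,500 tuples each, it grows to about 1.6 billion tuples.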

Direct access to the tuples of that specific vertex would be comparable to an index and conceptually has no significant impact in an in-memory system: thanks to the sequential scanning speed in memory, an index has comparatively little impact on performance.

Relying only on a possibly fixed set of materialized aggregates for analysis purposes is dangerous due to the ensuing loss of information. If the performance of the underlying (disk-based) database system – in contrast to SAP HANA – does not allow free exploration of aggregates calculated on the fly and drill-down to the original line items themselves, relevant information will be hidden from the user. Only with flexible instead of fixed aggregates based on in-memory technology is it possible to spot situations such as the following (the first of which is sketched in SQL after the list):

  • The revenue of a company seems to have reassuringly increased over the last quarter. However, when looking deeper at the individual sales themselves, one single close “saved” the total numbers, while, overall, business has been declining.
  • In another company, profit has been stable over the last years on group level as well as for individual products. In fact, however, all profit essentially comes from sales to a single customer. The pre-defined materialized aggregates do not allow fast access to cover all possible scenarios.
  • Trends that are visible on the lowest level get hidden as managers only have access to certain high-level aggregations.
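A sketch of that first situation, again with placeholder table and column names: the aggregate view looks reassuring, while the drill-down into the individual items reveals the single dominating deal.

-- the aggregate view: revenue per quarter appears to be growing
SELECT FISCAL_QUARTER, SUM(AMOUNT) AS REVENUE
FROM SALES_LINE_ITEMS
GROUP BY FISCAL_QUARTER;

-- the drill-down into the latest quarter, largest deals first,
-- shows that one single close accounts for most of the total
SELECT CUSTOMER, DOC_NO, AMOUNT
FROM SALES_LINE_ITEMS
WHERE FISCAL_QUARTER = '2014-Q2'
ORDER BY AMOUNT DESC;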

Aggregation in Financial Accounting

So far, we have looked at a simple example to explain the basic concepts. Although the materialized aggregates in this example significantly reduce the number of tuples to scan for queries, materialization is at best borderline-beneficial. Its benefits appear even more doubtful if you take into account that the example dramatically reduces the flexibility by restricting the aggregation to only three dimensions. Based on this, reports could only answer those queries that aggregate by one of the options in the hypercube. In reality, there will be a lot more dimensions of interest to analyses. Hence, we now look at a more complex example.

The following table lists a set of fields from BSEG that are the absolute minimum for any proper analysis source on accounting documents. The last column gives an estimate of the number of unique dimension values per field (the numbers are a rough approximation and meant to give an impression of the order of magnitude – they will of course differ between systems). The product of these cardinalities would be the maximum number of cells in the highest-detail vertex of the hypercube. Even with this still rather small selection of fields, this amounts to 2 octillion cells (that’s a two with 27 zeroes)! Obviously, only a small subset will actually be filled, as for many combinations of dimension values there will be no entries in the base table. If there are 1 billion line items, only every two-quintillion-th cell of the highest-detail view can be filled at all. To arrive at a highest-detail vertex that is substantially smaller than the base relation, this factor has to be even larger, which requires a substantial amount of correlation between the dimensions so that many line items share the same combination of dimension values.

Dimension | Column in BSEG | Typical column cardinality (estimates based on customer data)
Company (code) | BUKRS | 50
Fiscal Year | GJAHR | 5
Fiscal Period (e.g., month) | H_MONAT | 16
G/L account | HKONT | 1000
Posting key (determines account type, debit/credit, …) | BSCHL | 50
Status (open / cleared) | AUGBL = '' | 2
Business partner (customer/vendor) | KUNNR / LIFNR | 500000
Currency | H_WAERS | 100
Profit center | PRCTR | 1000
Cost center | KOSTL | 10000
Functional area | FKBER | 10
Business area | GSBER | 1000
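Multiplying the estimated cardinalities confirms the order of magnitude stated above: 50 × 5 × 16 × 1,000 × 50 × 2 × 500,000 × 100 × 1,000 × 10,000 × 10 × 1,000 = 2 × 10^27 potential cells. With 1 billion line items, at most one in 2 × 10^18 (two quintillion) of these cells can be non-empty.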

The numbers highlight: the more fields there are that may be interesting to include in an analysis, the more likely it is that the highest-detail vertex of the hypercube will not contain significantly fewer tuples than the base table. The number of entries that can actually be grouped together at this level will be small, because there will be few entries with matching values in all dimensions. This means that materializing the highest-detail vertex would not significantly reduce the number of tuples to scan.

When creating a materialized aggregate, it is not possible to foresee all analysis questions of interest in the future. In order not to restrict the flexibility to certain pre-defined exploration paths, it is advisable to opt for a broad set of dimensions to include. As outlined above, the most relevant factor when deciding whether to materialize aggregates is the reduction in the number of tuples. This should be estimated based on sample data to get a feeling for how sparse the highest-detail vertex is. The savings in the number of entries will only linearly influence the performance of aggregate queries. Whether a factor of 10 is worth the effort and costs of a materialized aggregate depends on the performance without materialization compared to the expected response times – which in-memory speed helps to achieve more easily even without materialization.
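A simple way to make such an estimate on sample data is to compare the number of distinct dimension-value combinations with the number of line items. The choice of dimensions below is illustrative, taken from the table above:

-- number of tuples the highest-detail vertex would contain for these dimensions
SELECT COUNT(*) AS DISTINCT_GROUPS
FROM ( SELECT BUKRS, GJAHR, H_MONAT, HKONT, BSCHL, PRCTR, KOSTL
       FROM BSEG
       GROUP BY BUKRS, GJAHR, H_MONAT, HKONT, BSCHL, PRCTR, KOSTL ) AS V;

-- number of line items in the base table, for comparison
SELECT COUNT(*) AS LINE_ITEMS
FROM BSEG;

If the first number is close to the second, the highest-detail vertex is almost as large as the base table, and materializing it brings little benefit.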

On a side note: there may be extreme cases in which it is nevertheless reasonable to materialize certain vertices of the hypercube, namely if the reduction in the number of tuples to scan achieved by such a materialization is significant. In any case, this should be handled dynamically by the database itself, based on an analysis of queries and response times. It should not rely on a fixed materialization strategy set beforehand by the application and, more importantly, keeping the redundant data in sync must not be the task of the application itself. In order not to increase application complexity and to integrate better with the database, the database needs to handle the identification and update of materialized aggregates by itself. To this end, the concept of aggregate caching has been proposed for in-memory databases utilizing the main-delta architecture (Müller and Plattner: “Aggregates Caching in Columnar In-Memory Databases”; Proceedings of the 1st IMDM at VLDB 2013).

Deep knowledge of the statistical distribution of the data is necessary for deciding on a materialization strategy. When trying to optimize queries by materializing selected aggregates in this way, it has to be kept in mind that slight changes in the statistical distribution of the underlying data may require re-optimization. Contrast this with a column store, whose query performance is stable and less affected by the data characteristics.

Materialized aggregates also restrict the flexibility with regard to applying filters or performing calculations before aggregation. For example, an analysis that only wants to look at sales of more than one million euros cannot operate on already aggregated data, because the aggregate values have lost the necessary information. Similarly, the loss of information through aggregation means that certain (so-called non-distributive) functions cannot be part of analysis questions unless the materialized aggregate table stores additional information beforehand. For example, calculating the average sales amount (assuming the average itself is not part of the materialized aggregate table) requires both the sum of sales and the number of sales to be stored in the materialized aggregate table. Other functions such as the minimum (in the presence of deletions) or the median need even more fine-granular additional information (these are called non-algebraic or holistic). These different classes of aggregation functions also affect whether it is easily possible to update materialized aggregates in the event of inserts, updates, or deletes of line items.
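For illustration, both limitations are straightforward to handle when aggregating on the fly, but cannot be served from a pre-aggregated total alone (placeholder table and column names again):

-- a filter applied before aggregation: only sales above one million count
SELECT PROFIT_CENTER, SUM(AMOUNT) AS LARGE_SALES
FROM SALES_LINE_ITEMS
WHERE AMOUNT > 1000000
GROUP BY PROFIT_CENTER;

-- a non-distributive function: to derive the average from a materialized
-- aggregate, both the sum and the count have to be stored in it
SELECT PROFIT_CENTER, TOTAL_AMOUNT / NUM_ITEMS AS AVERAGE_SALE
FROM AGG_SALES_BY_PROFIT_CENTER;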

Conclusion

As explained and demonstrated in this and the preceding blog posts, not having to rely on materialization has substantial benefits. On-the-fly calculation offers more flexibility and lowers the total cost of ownership compared to materialized views and materialized aggregates (thanks to higher throughput and lower database footprint).

The redundancy of materialized views and materialized aggregates was necessary in traditional database systems to achieve adequate performance. With the in-memory database SAP HANA, it is indeed feasible to eliminate this redundancy and still have a highly performant system. We have demonstrated both theoretically and using the example of SAP Simple Finance how getting rid of redundancy is possible and beneficial. SAP Simple Finance means Zero Cost for Redundant Data and unprecedented levels of analysis flexibility, all in one system.

Throughout this series, we have so far explained the benefits that arise from the new technical possibilities. Most importantly, these are

  • increased flexibility,
  • higher throughput, and
  • lower database footprint.

The blog posts of this Deep Dive series into SAP Simple Finance that have appeared so far have focused on the technical antecedents and implications. The technical benefits and the cost savings they enable are significant. In our next blog posts, we will take a closer look at the business value and new opportunities.
