Performance Optimization - 8.1 [Multi-dimensional analysis] Partial pre-aggregation
Performance Optimization - 7.4 [Merge and join] Attached table
The operation in the backend of multidimensional analysis is essentially a grouping and aggregation operation, and the grouping methods mentioned earlier can all be used. However, when the data amount is very large, it is not easy to achieve instant response.
To solve this problem, pre-aggregation is an easy-to-think-of method, that is, calculate various summary results in advance so that they can be returned directly when requested by the frontend. In other words, this method is to trade space for time, which is equivalent to converting the traversal problem to a search problem, making it possible to achieve instant response in theory.
However, full pre-aggregation is basically not feasible, which can be confirmed by a simple calculation.
To do a full pre-aggregation on 50 dimensions, it would require 250 intermediate result sets, and even with a conservative estimate, a capacity of millions of terabytes would be required, so it is not operable. Even if only 10 dimensions are pre-aggregated (10 dimensions is not a very large number because it will involve the observation and slice dimensions), it still requires at least hundreds of terabytes of storage space, which is not very practical.
All we can do is partial pre-aggregation, that is, pre-aggregate only the combination of some dimensions. When there is a request from the front end, we can search for the pre-aggregation data based on a certain condition first and then aggregate. This can roughly improve performance by dozens of times on average and often meet the requirements.
As for which dimension combinations to pre-aggregate, it generally depends on practical experience. Alternatively, we can either record historical query requests first and then perform statistical analysis, or dynamically generate new and delete infrequently used combinations. As for the algorithm, there is not much to discuss.
If there are multiple pre-aggregated data, which one should be selected to respond to frontend request?
Assuming that the frontend request is for the statistical values of dimensions A and B, then searching the pre-aggregated data that includes dimensions A and B would be OK. If there are multiple pre-aggregated data that meet the criterion, just select the one with the smallest amount of data and then calculate.
These logics are relatively simple.
SPL provides partial pre-aggregation functions for composite table:
A | |
---|---|
1 | =file(“T.ctx”).open() |
2 | =A1.cuboid(file(“1.cube”),D1,…;sum(M1),…) |
3 | =A1.cuboid(file(“2.cube”),D1,…;sum(M1),…) |
… |
Using the cuboid() function can create pre-aggregated data. What we need to do is to specify a file name for the pre-aggregated data, and the remaining parameters are the same as those of grouping and aggregating algorithm.
It is also very simple to use:
A | |
---|---|
1 | =file(“T.ctx”).open() |
2 | =A1.cgroups(D1,…;sum(M1),…;file(“1.cube”), file(“2.cube”)) |
The cgroups() function will automatically search for the most appropriate pre-aggregated data according to the above logic before calculation.
Although the pre-aggregation scheme is simple, its application is limited due to capacity constraints and can only handle the most common scenarios.
For example, it is difficult to fully apply the pre-aggregation scheme in the following situations:
1) Unconventional aggregation: In addition to common aggregation operations such as sum and count, some unconventional aggregations such as COUNT DISTINCT, median and variance, are likely to be omitted and cannot be calculated from other aggregation values. In theory, there are countless kinds of aggregation operations, and pre-aggregation would be impossible.
2) Combination of aggregations: Aggregation operations may be combined. For example, we may care about the monthly average sales, which is obtained by adding up the daily sales by month and then calculating the average. This operation is not a simply sum operation and average operation, but a combination of two aggregation operations at different dimension levels. These operations are also unlikely to be pre-aggregated in advance.
3) Conditions on metrics: Metrics may also be conditional in statistics. For example, we want to know the total sales of orders with transaction amount greater than $100. This information cannot be processed during pre-aggregation, because 100 is a temporary input parameter.
4)Count by time period: Time is a special dimension, which can be enumerated, or sliced in continuous intervals. The start and end points of the query interval may be fine-grained (to a certain day, for example), in which case the fine-grained data must be used for statistics instead of using the higher-level pre-aggregation data directly.
Performance Optimization - 8.2 [Multi-dimensional analysis] Time period pre-aggregation
Performance Optimization - Preface
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL