Discussion on pre-aggregation scheme in multi-dimensional analysis
As we have already calculated at Storage volume of pre-aggregation in multi-dimensional analysis that in order to achieve O(1) computational complexity, we need to at least consider the various dimension combinations used by the interface. Nevertheless, it does not work when the total number of dimensions is slightly larger.
Therefore, we have to give up the expected O(1) complexity, that is, we don’t pre-aggregate every possible dimension combination but part of them only. Specifically, when querying, for the data that have been pre-aggregated, they can be returned directly; while for the dimension combination that has not been pre-aggregated, we can still traverse and aggregate it from the original CUBE, in this case, the computational complexity is either O(1) or O(n).
We can do it in a smarter way: aggregate it from a certain existing intermediate CUBE. For example, if the pre-aggregated data for the dimension combination [A,B,C] is stored, then the querying on dimension combination [A,B] or [B,C] can be reaggregated from this intermediate CUBE instead of aggregating from the original CUBE, as a result, the amount of calculation will be greatly reduced. Sometimes the target query can be aggregated from multiple intermediate CUBEs. For example, the combination [B, C] can be reaggregated from either combination [A, B, C] or [B, C, D], in this case, the intermediate CUBE with the smaller amount of data is preferred.
So, how do we know which combinations should be pre-aggregated in the initial state?
We can generate the combinations dynamically. For the combination that cannot be aggregated from the existing intermediate CUBE when querying, we can only aggregate it from the original CUBE. In this case, we can save the result as a new intermediate CUBE after aggregation. When a new combination emerges, there will be a sense of delay in the first access, but subsequent queries based on this combination or queries that can be aggregated from this combination will be returned faster.
Actually, it does not mean that as long as the combination can be aggregated from an existing intermediate CUBE, it is always the temporary aggregation. The goal of performance optimization in multi-dimensional analysis is the front-end response speed. If the intermediate CUBE is still large, the re-aggregation will also be slower. In this case, these re-aggregated results can be saved acting as new intermediate CUBEs.
In addition, in this process, we can also record the usage frequency of each intermediate CUBE, and delete those intermediate CUBEs with lower usage rate under the limitation of the total space, so as to use the limited space more efficiently.
After adopting the methods mentioned above, although we cannot fully achieve O(1) complexity, we can often improve the computing performance by dozens or even hundreds of times in comparison with full hard traversal, which is enough for most multi-dimensional analysis scenarios.
We also introduced several situations at Functional blind area of pre-aggregation in multi-dimensional analysis, which cannot improve the performance through pre-aggregation. Among such situations, the unconventional aggregation and combined aggregation are still the data amount problem in essence; while the temporarily generated conditional metric and time period counting, it has nothing to do with the data amount. Since we cannot predict the parameters that are entered by users only when they use them, it is impossible to pre-aggregate the data corresponding to all parameters in advance.
Theoretically, the above-mentioned methods can also be used: when encountering a new parameter, calculate and save it. However, unlike dimension combination, the metric parameters are often continuous quantity, and their values and combinations cannot be enumerated, and thus there is little possibility of repeated use.
Pre-aggregation is indeed not very effective for conditional metrics, but for the time period counting, it is somewhat effective. For the latter case, we can pre-aggregate the data according to higher level of time dimension, and consequently, the amount of traversal computation can be reduced during query.
Suppose the data in the original CUBE are stored day by day, then we can pre-aggregate the data by month to make an intermediate CUBE. When there is a need to count a certain time period, we can traverse the data of the whole months spanned by this time period from the intermediate CUBE, and then read the data on the dates at both ends of the time period that do not constitute a whole month, after that, we can obtain the query target. In this way, the amount of calculation for counting long time period can be reduced by ten times or even more.
For instance, we want to query a certain statistical value for the time period from January 22 to September 8, and the pre-aggregation by month has been made in advance. In this case, as long as we first calculate the aggregation value from February through August based on the pre-aggregated data, and then use the original CUBE to calculate the aggregation values from January 22 to 31 and September 1 to 8, the amount of calculation involved will be 7 (Feb. – Aug.) +10 (Jan. 22 - 31) +8 (Sep. 1 – 8) =25; however, if the aggregation is performed based on the original data, the amount of calculation will be 223 (the number of days from Jan. 22 to Sep. 8). As a result, the calculation amount is reduced almost by 10 times.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version