Functional blind area of pre-aggregation in multi-dimensional analysis
Before further discussing how to implement the pre-aggregation of multi-dimensional analysis in a limited space, it is necessary to understand the functional deficiencies of pre-aggregation scheme, that is, we need to make clear what query requirements are unlikely to be achieved through the pre-aggregation of data.
1. Unconventional aggregation
The pre-aggregation scheme is to calculate and store the aggregation value of the metric in advance, and then it is obvious that the aggregation value of a metric that is not thought of in the pre-aggregation stage cannot be directly queried from the pre-aggregated data. For example, if we only stored the total sales and did not store the maximum value, we cannot query the maximum value directly.
SQL provides five most common aggregation operations: count, sum, average, max and min. The aggregation value of these operations is generally taken into consideration in the pre-aggregation stage. However, there are also some unconventional aggregation values like the count unique (count distinct), median, variance, etc., these values are, of course, meaningful in business and are likely to be queried, but they are often omitted. Moreover, these values cannot be calculated simply from other aggregation values. If they are not pre-stored, a temporary traversal calculation has to be performed.
In theory, there are countless kinds of aggregation operations, and it is impossible to think of them all. However, involving too many kinds of aggregation operations makes the volume problem we discussed in the last article even worse.
2. Combining aggregations
Aggregation operations may also be combined. For example, the data grain of original CUBE is stored day by day (i.e., one record per day); however, we may want to know the average monthly sales, which is obtained by summing the daily sales of each month and then average the aggregated values by month. This operation is not a simply summing and averaging operation, but a combination of two aggregation operations at different dimension levels. There are many other similar operations such as the average value obtained by the maximum monthly sales and median regional personnel income. There may even be combinations involving three or more aggregation operations.
Obviously, the aggregated value of such combinations is meaningful in business, but the conventional pre-aggregation scheme works only for a single aggregation operation, and generally does not consider such combination operation, and hence the pre-aggregation will not be performed. Even if such combination is deliberately considered in the pre-aggregation phase, the problem of excessive volume will still occur. If a CUBE with 30 dimensions is combined according to five basic aggregation operations, and only pairwise combination is considered, there will be C(30,2)*5*4=8700 combinations in theory. If these values are all pre-stored, the storage space will be increased by several orders of magnitude.
3. Conditional metric
Metric may also be conditional during the counting. For instance, we want to know the total sales of orders with transaction amount greater than 100 dollars.
Generally, this query cannot be processed in the pre-aggregation stage. We can only pre-aggregate the sales, and cannot pre-aggregate the sales of orders exceeding 100 dollars separately, unless such possible query requirement is thought of and performed in advance. If this query requirement is performed in advance, it is equivalent to creating a new derived metric: if (sales > 100, sales, 0). However, the 100 here is often a parameter and can only be input on an ad hoc basis in the interactive query stage. As a result, it is impossible to pre-aggregate in advance.
Also, there may be multiple associations in the conditional metric, such as counting the sales of orders with selling quantity exceeding 10 pieces. However, the problem and solution caused by multi-association are also the same.
Similarly, there may also be dimension temporarily generated by the metric. For example, to count the average income by age range, the division rules of age range are passed in by parameters (if the rule is predetermined, it can indeed define a dimension in advance).
Obviously, we have to say it again: the query on such conditional metric (or ad hoc dimension) is meaningful in business, and is arguably very common, but the pre-aggregation schemes can't do anything for it.
The conditional metric and the combined aggregation value may arise in a mixed manner, and you can imagine some business-meaningful examples yourself.
4. Counting by time period
Time is a particularly important dimension in multidimensional analysis. General dimensions can only be sliced in an enumerative way (for some specific dimension values when counting), while the time dimension is very special, it can be sliced either in an enumerative way or in a continuous interval way.
We may want to know the statistical value between a certain start date and a certain end date, such as the total sales between May 8th and June 12th. The two time points are also passed in as the parameter during the query, which is of highly arbitrary.
We have to say “obviously” for the third time: such counting is extremely meaningful in business; and we want to say it one more time: the pre-aggregation schemes can't do anything for it.
Pre-aggregating by dimension is either for each day or each month, and such dimension (or its layers) is already determined, while the intervals like May 8th to June 12th cannot be described by any dimension and layer.
Think about it carefully and you will find that the counting by time period can also be understood as a conditional metric. But because it is too common and the implementation scheme in practice is different from the conditional metric, we discuss it here separately.
The counting by time period may also involve multiple association combinations, such as querying the total sales of goods that are sold between May 8th and June 12th and produced between January 9th and February 17th. If only one time dimension interval is considered, the pre-aggregated data can be used to a certain extent to traverse the aggregation based on the intermediate CUBE. However, if the query involves a combination of multiple time dimension intervals, the problem will become very cumbersome. Even if there is pre-aggregated data, it will still face the problem of multi-dimension slicing. To respond quickly, not only does it require redundant storage of the data pre-aggregated by dimension, but it may also need multiple redundant sorting schemes. For the processing methods, we will study them in detail later.
The functional blind area of pre-aggregation scheme in multi-dimensional analysis lies not only in the detailed data we often talk about, but also in a lot of aggregated information that cannot be solved. Using the pre-aggregation to trade space for time can indeed improve performance to a certain extent, but it can only solve a few of the simplest needs, and it also faces the problem of huge storage. It is very unreliable to place hope on the pre-aggregation scheme when pursuing the effect of multidimensional analysis. To do the multidimensional analysis well, hard traversal is the basic. Even if the pre-aggregated data is available, it can play a greater role only with the aid of excellent hard traversal capability.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version