Storage volume of pre-aggregation in multi-dimensional analysis
In general, multi-dimensional analysis works in an interactive way; therefore, it requires an extremely high response speed. Moreover, the amount of data involved in multi-dimensional analysis is often so large (tens or hundreds of millions of data rows, or even larger) that the instant statistics may not keep up with the actions of the interface. In order to ensure performance, some multi-dimensional analysis products adopt the pre-aggregation scheme, that is, the needed statistical results are calculated in advance. In this way, the computational complexity can be decreased from O(n) to O(1), and the result can be returned within constant time (seconds or even milliseconds), which meets the needs of interactive analysis.
We all know that the basic logic of pre-aggregation is to trade space for time, this method will occupy a lot of additional storage space, but many people have no perceptual knowledge on how much space it occupies, let’s calculate it now.
Assuming that one original CUBE has 50 independent dimensions (the so-called independent dimensions refer to dimensions that are not dependent on each other, while for dimensions that are dependent on each other like year/month/day, it can be regarded as one independent dimension). If we pre-aggregate all possible dimension combinations, then how many intermediate CUBEs (more precisely, it should be CUBOID) will there be? It's easy to calculate, that's 250! See Note 1.
Note that this is the number of intermediate CUBEs, not the number of rows of data. Each CUBE will also have a lot of data, even if each CUBE only occupies 1K bytes (obviously it is impossible to be so small), the storage space occupied by 250 CUBEs will be an astronomical figure exceeding 1MT, which means that over one million of hard disks with a capacity of 1T each are needed to hold such amount of data.
Well, we don't actually need to pre-aggregate all dimension combinations, because most people won’t see so many aggregated dimensions at the same time. Let’s calculate by 20 dimensions and only aggregate all combinations of no more than 20 dimensions. In this case, how many intermediate CUBEs will there be? It will be C(50,1)+C(50,2)+…+C(50,20). We only look at the last item C(50,20), which is about 4.7E13. If calculated based on 10,000 rows of data in each intermediate CUBE (in fact, this row number is too conservative. Even so, it’s easy to figure out millions of or even hundreds of millions of rows in the case of 20 dimensions. See Note 2), there will be more than 4.7E17 rows of data. For many repeated values in the dimension information, we suppose that there is already an efficient compression method, allowing us to completely ignore the repeated value. We only consider the statistical value of one metric, a row of data will still occupy 1-4 bytes, even if it only occupies 1 byte (there is no way to compress it anymore), it still requires a capacity of over 470,000T, equivalent to hundreds of thousands of hard disks.
Does the number of 50 independent dimensions or 20-dimension combinations exceed the normal range? No, it doesn’t, those who are familiar with industries such as finance and communication know that these numbers are relatively common and not excessive.
We continue to decrease the task space.
Generally, we use a crosstab to query the results of multi-dimensional analysis. Obviously, it is inconvenient to query if there are too many layers in the crosstab. Therefore, we assume that there are 3 layers on the left and 3 layers on the top, a total of up to 6 layers. If we take 6 dimensions from 50 dimensions, the number of combinations will be C(50,6), about 15.89 million. If we still calculate based on 10,000 rows of data in each intermediate CUBE, the number of rows will be less than 160G. When each row of data has a dozen metric statistical values, its size will not exceed 1K in general, and the total volume is around 100T accordingly. This volume seems to be acceptable, and only 100 hard disks are needed.
Wait a minute! What we discussed above is the case where we only consider the crosstab, but in multi-dimensional analysis, some slicing conditions are often proposed. For example, we may want to look at the crosstab of other dimensions according to a certain month or region or product. Therefore, the pre-aggregated data should not only consider the dimensions of the crosstab itself, but also add the dimensions used for slicing. Six-dimension combinations are only enough for the crosstab, we should add another 4 dimensions to calculate the volume, which means that 4 dimensions are reserved for slicing.
We still use the above estimation method, there will be about C(50,10), i.e., 10 billion intermediate CUBEs, and 100T rows of data, which is calculated based on 10,000 rows of data per intermediate CUBE. Although it seems to be passable, 100T only represents the number of rows of data. If we pre-aggregated over 10 metrics, it would occupy a space of thousands of terabytes, and the number of hard disks would be over a thousand.
The space occupied by pre-aggregation is so huge that it seems to be impractical. Moreover, our calculations are already very conservative, for example, the number of CUBEs only counts the maximum item; the volume of the intermediate CUBE only counts 10,000 rows; the space used by dimensions is not counted; and the aggregation value of one metric only counts one byte. The actual situation is far from ideal like this way, and it is normal that the actual space occupied is several times to dozens of times larger than the estimated value.
If so, is it meaningful to do the pre-aggregation?
Of course. It works when the number of dimensions is relatively small. For example, when there is only a dozen or so dimensions, the number of intermediate CUBEs will be in the order of thousands to tens of thousands, and the space occupied will be in the range within a hundred hard disks. Generally, when the number of independent dimensions exceeds 30, you can no longer expect to be able to pre-aggregate all possible queried dimension combinations, and therefore it is impossible to decrease the computational complexity to O(1). Of course, the specific value should be determined according to the statistical requirements on the dimensions and metrics. You can use the above method to estimate by yourself.
Can it be concluded that it’s meaningless to do the pre-aggregation when there are many dimensions? No, it can’t. Although there is no way to decrease the computational complexity to O(1) in a limited space, it’s still meaningful to decrease the complexity by dozens or hundreds of times. We will discuss this topic later.
Note 1: If the layer-based dimension like year/month/day is taken into account, the number of intermediate CUBEs is not 2n but larger than 2n, which should be (L1+1)*(L2+1)*…*(Ln+1), where n is the number of dimensions; Li is the number of layers of the ith dimension. For example, the number of layers of dimension year/month/day is 3, and the dimension without layers can be regarded as one layer dimension.
Note 2: The number of CUBE rows is theoretically the product of the possible value number of each dimension. Even if each dimension has only two values, the number of CUBE rows of 20 dimensions will be 220=1 million, which is far more than 10,000. When there are five possible values for each dimension, the CUBE with 6 dimensions will have tens of thousands of rows.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version