Performance Optimization - 8.3 [Multi-dimensional analysis] Redundant sorting

 

Performance Optimization - 8.2 [Multi-dimensional analysis] Time period pre-aggregation

Aggregation operation without slicing condition always involves full data. If the data is not pre-aggregated, there is no way to reduce the amount of calculation. However, when there are slicing conditions, it may not be necessary to traverse all the data if the data is organized properly.

Simply creating an index on dimensions will help, but the effect is not very significant. Although using an index can quickly locate records that meet the conditions, if the physical storage location of these data is not continuous, there will still be a lot of waste when reading the data. When the target data is too scattered, using index may not be much better than full traversal. The reason is that in multidimensional analysis operations, the amount of data to be read is often very large even after slicing. The main application scenario of index is often to select a small amount of data.

If the data is stored orderly by a certain dimension, the slicing condition of this dimension can be used to limit the target data to a continuous storage area. In this way, there is no need to traverse all the data, and the amount of data read can be effectively reduced. However, each dimension may have slicing conditions in theory. If the data is sorted by each dimension, it would be equivalent to copying the data several times, which would increase the storage cost.

A compromise method is to store two data sets, that is, store one data set after sorting the dimensions in the order of D1,…,Dn, and store another data set after sorting the dimensions in the order of Dn,…,D1. Although this method will double the amount of data, it is acceptable. In this way, for any dimension D, there will always be a data set that makes D in the first half of its sorted dimension list. If D is not the first dimension, the data after slicing will generally not form a single area, but will also be composed of some relatively large continuous areas. For a dimension, the closer to the front of the sorted dimension list, the higher the physical orderliness of data after slicing.

When calculating, just using the slice condition of one dimension to filter is enough, and the conditions of other dimensions are still calculated in traversal. In multidimensional analysis, using the slice condition on a certain dimension can often reduce the amount of data involved by several times or even tens of times. It will be of little significance to use the slice condition on other dimensions, but very difficult to implement. When there are slice conditions on multiple dimensions, we usually select the dimension of which the proportion of the sliced value range in the total value range is smaller, which often means that the filtered amount of data is smaller.

The cgroups()function implements this selection. If multiple pre-aggregated data are found sorted by different dimensions and there are multiple slicing conditions, the most appropriate pre-aggregated data will be selected. When cuboid() creates pre-aggregated data, the order of grouped dimensions is meaningful, because different pre-aggregated data will be created for different dimension orders.

We can also manually select the appropriate sorted data set using code, and store more sorted data sets.

The redundant sorting method is not only suitable for multidimensional analysis, but can also be used for conventional traversal with filtering conditions. The reason for taking multidimensional analysis as an example is that the relevant features of this method will be more obvious when slicing the dimensions in multi-dimensional analysis, which is more suitable to explain.


Performance Optimization - 8.4 [Multi-dimensional analysis] Boolean dimension sequence
Performance Optimization - Preface