Performance Optimization - 8.2 [Multi-dimensional analysis] Time period pre-aggregation

 


Pre-aggregation can also serve statistical queries over a time period, provided a few techniques are applied.

Suppose the original data table stores data by day. We can then pre-aggregate the data by month. To compute a statistic over a time period, we first read from the pre-aggregated data the whole months that fall inside the period and aggregate them. Next, we read from the original data table the days at either end of the period that do not make up a whole month. Finally, we combine the two results to obtain the query target. For a long time period, this can reduce the amount of calculation by a factor of ten or more.

For example, suppose we want to query a certain statistical value for the period from January 22 to September 8, and the data has already been pre-aggregated by month. We can first calculate the aggregate value for February through August from the pre-aggregated data, and then use the original data table to calculate the aggregate values for January 22 to January 31 and for September 1 to September 8. The amount of calculation involved is 7 (February to August) + 10 (January 22 - 31) + 8 (September 1 - 8) = 25. If the aggregation were performed entirely on the original data table, the amount of calculation would be 230 (the number of days from Jan. 22 to Sep. 8 inclusive, or 231 in a leap year such as 2020). The calculation amount is therefore reduced to roughly one ninth.
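The decomposition in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not SPL code and not a description of how SPL implements it; the function name `split_period` is hypothetical. It splits an inclusive date range into the whole months it covers and the leftover edge days, then counts the records each approach would touch, using 2020 (the year in the SPL example) for the calendar.

```python
import calendar
from datetime import date, timedelta

def split_period(start, end):
    """Split the inclusive range [start, end] into edge days and whole months.

    Returns (days, months): `days` are dates to read from the daily data,
    `months` are (year, month) pairs to read from the monthly pre-aggregation.
    """
    days, months = [], []
    cur = start
    while cur <= end:
        last_dom = calendar.monthrange(cur.year, cur.month)[1]
        month_end = date(cur.year, cur.month, last_dom)
        if cur.day == 1 and month_end <= end:
            # The whole month lies inside the period: one pre-aggregated record.
            months.append((cur.year, cur.month))
            cur = month_end + timedelta(days=1)
        else:
            # Partial month: read daily records up to the month end or period end.
            stop = min(month_end, end)
            while cur <= stop:
                days.append(cur)
                cur += timedelta(days=1)
    return days, months

start, end = date(2020, 1, 22), date(2020, 9, 8)
days, months = split_period(start, end)
total_days = (end - start).days + 1  # records touched without pre-aggregation

print(len(months), len(days), len(months) + len(days), total_days)
```

With the monthly data available, the query touches 7 monthly records plus 18 daily records; without it, every day in the range is read.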

The original data table mentioned here can itself be a finer-grained pre-aggregated data set.

SPL implements this method through a condition parameter added to the cgroups() function:

	A
1	=file("orders.ctx").open()
2	=A1.cuboid(file("day.cube"),dt,area;sum(amount))
3	=A1.cuboid(file("month.cube"),month@y(dt),area;sum(amount))
4	=A1.cgroups(area;sum(amount);dt>=date(2020,1,22)&&dt<=date(2020,9,8);file("day.cube"),file("month.cube"))

When SPL finds a time period condition together with higher-level pre-aggregated data, it uses this method to reduce the amount of calculation. In this example, SPL reads the corresponding data from the pre-aggregated files month.cube and day.cube respectively, and then aggregates what it has read.
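To make the equivalence concrete, here is a minimal Python simulation of answering such a query from a day-level and a month-level pre-aggregation. It is a sketch under stated assumptions, not SPL's actual implementation: the in-memory dictionaries `day_cube` and `month_cube` merely stand in for day.cube and month.cube, and the area names and amounts are made-up synthetic data.

```python
import calendar
from collections import defaultdict
from datetime import date, timedelta

AREAS = ["east", "west"]  # hypothetical area values

def daterange(first, last):
    """Yield every date from first to last, inclusive."""
    d = first
    while d <= last:
        yield d
        d += timedelta(days=1)

# Synthetic stand-in for the orders table: one row per day and area.
orders = [(d, a, d.day + i)
          for d in daterange(date(2020, 1, 1), date(2020, 12, 31))
          for i, a in enumerate(AREAS)]

# Pre-aggregate once, at day level and at month level.
day_cube = defaultdict(int)
month_cube = defaultdict(int)
for d, area, amount in orders:
    day_cube[(d, area)] += amount
    month_cube[(d.year, d.month, area)] += amount

def query(start, end):
    """Sum amount by area over [start, end], preferring month-level records."""
    result, reads = defaultdict(int), 0
    d = start
    while d <= end:
        month_end = date(d.year, d.month, calendar.monthrange(d.year, d.month)[1])
        if d.day == 1 and month_end <= end:
            # Whole month inside the period: use the month-level record.
            for a in AREAS:
                result[a] += month_cube[(d.year, d.month, a)]
                reads += 1
            d = month_end + timedelta(days=1)
        else:
            # Edge day: use the day-level record.
            for a in AREAS:
                result[a] += day_cube[(d, a)]
                reads += 1
            d += timedelta(days=1)
    return dict(result), reads

def direct(start, end):
    """Reference answer: scan every order row in the period."""
    result = defaultdict(int)
    for d, area, amount in orders:
        if start <= d <= end:
            result[area] += amount
    return dict(result)

fast, reads = query(date(2020, 1, 22), date(2020, 9, 8))
slow = direct(date(2020, 1, 22), date(2020, 9, 8))
print(fast == slow, reads)
```

Both paths return the same totals per area, while the cube-based path reads 25 pre-aggregated records per area instead of scanning every row in the period. The approach relies on the aggregate being combinable from partial results, as sum is.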

Time period pre-aggregation is essentially a technique for solving the slicing (dicing) problem.

