Performance Optimization - 4.7 [Traversal technology] Understandings about aggregation

 

Performance Optimization - 4.6 [Traversal technology] Grouping and aggregating

Consider this question: find the top 10 largest amounts from 100 million orders.

A simple idea is to sort the 100 million records by amount in descending order, and then take the amount field of the first 10 records, and it is done.

Writing SQL in the database to solve this problem is exactly this idea.

However, sorting itself is a very slow action. Sorting big data also involves data buffering, which will result in significant performance decrease and is not easy to perform parallel computing.

In fact, we can easily think of a simpler algorithm:

First, prepare a result set with 10 members, and fill them all with 0 (In fact, any small number will be OK), and then traverse the order table. If the current order amount is bigger than the smallest value in the result set, replace the smallest member with this amount. After traversal, the result set is the values we want.

This algorithm only requires traversing the original data table once, without the need to sort (the number of traversals for sorting will be log2N), let alone buffering files. Moreover, parallel processing in segments is also very easy, it is just a matter of calculating the top 10 of each segment, and then calculating the top 10 of the union of the calculated every segment’s top 10, which still doesn’t involve writing data to hard disk concurrently.

To take the top M from N members, the complexity of sort is N*logN, while that of the above algorithm is N*logM. For this example, even for an all-in-memory operation, the computation load of CPU can be decreased by around 8 times.

This idea is not novel. If the question is changed to calculating the maximum value, then almost everyone will think of this method. But when taking the top M members, we will be more accustomed to thinking of sort method first.

This requires us to expand our understanding on aggregation operation.

Usually, aggregation operation we understand is to calculate a set into a single value, such as summing, counting, computing max/min. However, if we expand our thinking and consider the case where the return value is a small set as an aggregation operation, then we can use the idea of aggregation operation to solve related problems.

Let’s look at this example again, we can just consider “taking the top N members” as an aggregation operation, and regard it as the same operation as summation and counting, except that it returns a set rather than a single value.

A
1 =file(“orders.btx”).cursor@b()
2 =A1.total(sum(amount), top(-10,amount), top(-10;amount) )

Like sum(), the top() function in SPL is also considered as an aggregation function and only has a direction from small to large. The -10 in the parameter means to take the last 10, that is, the 10 largest amounts. The top(-10,amount) will return the 10 largest amount values, while top(-10;amount) will return the 10 records that maximize the amount.

Another advantage of regarding top()as an aggregation function is that it can be used in grouping and aggregating where it is still the same as functions such as sum(), count(), max() and min():

A
1 =file(“orders.btx”).cursor@bm(4)
2 =A1.groups(area;top(-10,amount),top(-10;amount))

This code can calculate the top 10 order amounts and corresponding orders in each region, and the parallel computing of multi-cursor can also be used. It should be noted that the values of the last two fields in the calculation result of A2 are sequence (set).

If top() is not regarded as an aggregate function, it will be very difficult to do this kind of operation in grouping and aggregating. In SQL, this kind of operation can barely be described by using window functions, and the operation performance is very poor.

Essentially, aggregation operation is an operation performed on a set, but in practice, it is often not necessary to get all the members of a set ready. Many aggregation operations can be done by using the cumulative method to gradually traverse the members of a set, which makes it suitable for big data operation.

Operations like summing, counting, computing max/min value as well as “taking the top M members” just mentioned, all meet this characteristic.

We refer to this type of aggregate function as iteration function, and its operation process can be abstracted with following characteristics:

1) Provide an initial value as the calculation result;

2) For each new set member encountered, perform a calculation on this member and the last calculation result to obtain a new calculation result;

3) After traversal, the calculation result can be returned.

To calculate iteration function, it only needs to keep a current result value, and the traversed set member can be discarded. Even for the aggregation of grouped subsets in grouping operation, it only needs to keep one current result value for each group, which won’t take up too much memory. The calculation of iteration function only requires traversing the original data table once, without the need to buffer files, and parallel computing can also be performed.

SPL designs a general form for iteration function:

iterate(x,a)

Where, a represents the initial value; x represents the expression, in which ~~ represents the last calculation result, and ~ represents the current set member. The calculated x is used as the new calculation result, i.e. the ~~ of the next calculation. After traversal, this function will return current ~~.

We can use it to define the common aggregation operations such as sum()and count():

sum iterate(~~+~,0)
max iterate(if(~~<~,~,~~),-inf())
min iterate(if(~~>~,~,~~),inf())
count iterate(if(~,~~+1,~~),0)

Although it is bit more troublesome to define top(), it can still be defined. You can try it yourself.

We can now expand aggregation operation to more general cases. Any case that can be described by iterate()and whose calculation result does not occupy much memory can use the cumulative method to achieve higher computing performance. This method can be used for grouping, and it just needs to traverse the original data table once, without the need to buffer files (buffering is still needed in big grouping, but it is caused by big grouping itself and has nothing to do with aggregation calculation). However, the parallel computing of iteration function is a bit different. Iteration function itself can perform parallel computing for segmented data tables to obtain multiple calculation results (one for each segment), but it needs to manually code to do the second round of aggregation (aggregate the calculation results of each segment into one result), and it cannot perform parallel computing directly based on multi-cursor (although the fork syntax of multi-cursor can be used, it still needs to do the second round of aggregating operation manually).

We can also use iterate() to perform some aggregation operations that have not been defined in advance, such as successive multiplication, union calculation.

A
1 =file(“orders.btx”).cursor@b()
2 =A1.total(iterate(~~&area,[]))
3 =A1.groups(product;iterate(~~&area,[]))

Here, the union calculation is an undefined aggregation operation.

SPL stipulates that in the iteration function for calculating records, the field name can be directly used to represent the field of current record, and there is no need to write it as ~.area. A2 will calculate the region to which all orders are sold, and A3 will calculate the region to which each product is sold.

In structured data operations, the common simple aggregation operations are the above-mentioned several operations that have been defined, or operations that can be derived from such operations. For example, continued product can be implemented by taking logarithm, summing and then using exponent. For the union calculation in the above example, it can also be replaced by DISTINCT operation (id() function in SPL). Custom aggregate operations that need to be written with iteration function are not uncommon, but such operations may involve more complex business background, which are not easily illustrated with simple example.

Assuming the order table is ordered by time, now we want to calculate the total tax amount for each product. The initial tax rate is 5%, and when the cumulative tax amount exceeds 10,000, the tax rate for subsequent orders is reduced to 3%, This type of aggregation can hardly be described with conventional functions, but it can be described by iteration function and can also be calculated at a higher performance.

A
1 =file(“orders.btx”).cursor@b()
2 =A1.groups(product;iterate(~~+amount*if(~~>=10000,0.03,0.05),0):tax)

The iteration function below will calculate the maximum number of consecutive orders with an amount exceeding 50 for each region.

A
1 =file(“orders.btx”).cursor@b()
2 =A1.groups(area;iterate([max(~~),if(amount>=50,~~(2)+1,0)],[0,0]):C)
3 =A2.run(C=max(C))

In the next chapter, we will discuss the form of applying iteration function to ordered detailed data, and give more examples with business significance.


Performance Optimization - 4.8 [Traversal technology] Redundant grouping key
Performance Optimization - Preface