Performance Optimization - 5.3 [Ordered traversal] Ordered grouped subsets

 

Performance Optimization - 5.2 [Ordered traversal] DISTINCT and COUNT(DISTINCT)

When a data table is ordered by the grouping key, the grouped subsets can be read out in turn in the form of cursor. Taking advantage of this, we can perform some complex operations.

For example, there is an account transaction table that contains one-year data. Now we want to calculate the number of accounts with the consumption times over m within n consecutive days, where n and m are the parameters entered by users at program interface, and we hope to find the result immediately.

This is a relatively complex operation that is unlikely to be written in a simple aggregation function (nor with an iterative function). In general, the calculation will become easy only when these transaction records are read into memory. That is, we need to take out the transaction records under one account at a time to calculate. Since the transaction data under one account is generally very small, the memory has sufficient space to hold them.

One grouped subset after grouping by account contains the transaction records under one account. However, the number of accounts in such scenarios is likely to be very large, which is a typical big grouping, and it is impossible to store all grouped subsets into memory. When the transaction records of an account are fetched every time, it needs to traverse the whole table if there is no index, which is completely unacceptable. Even if the index is available, too-many fetching times may result in slow computing speed because the original data table is usually sorted by transaction time (refer to the explanation in previous chapters).

If we sort the data table by account in advance (sort the transactions in an account by date), and then use the ordered grouping technology, we can easily take out these grouped subsets to calculate:

A B C
1 =file(“trades.ctx”).open().cursor(id,dt) >m-=1,n-=1
2 for A1;id if (k=(t=A2.(dt)).len()-m)<=0 next
3 =@+if(k.pselect(A2t(#+m)-A2t(#)<=n),1,0)
4 return B3

For the ordered cursor, the for statement will fetch a grouped subset at a time, then judge whether there are m transactions within n days.

In practice, the code can be further optimized to read more accounts each time.

These lines of code can also be concatenated into the group function:

A B
1 =file(“trades.ctx”).open().cursor(id,dt) >m-=1,n-=1
2 =A1.group(id).((k=(t=~.(dt)).len()-m)>0 && k.pselect(t(#+m)-t(#)<=n))
3 return A2.total(count(~))

For each grouped subset taken out by group(), a logical expression will be calculated. If the result is true, it indicates that there are m transactions in n days. A2 will also return a cursor, we just need to traverse the cursor and count the number of true.

If the data table is built with the composite table segmentation method described in the previous section, this operation can also work based on multi-cursor, simply by adding options to the function that generates cursor:

A B
1 =file(“trades.ctx”).open().cursor@m(id,dt;;4) >m-=1,n-=1
2 =A1.group(id).((k=(t=~.(dt)).len()-m)>0 && k.pselect(t(#+m)-t(#)<=n))
3 return A2.total(count(~))

The method to maintain and append the ordered data has been discussed in the previous chapters. As long as we pre-sort data and then use the technique of converting dates to integers as mentioned earlier, it is possible to obtain high performance with immediate response even if the amount of data is very large. This calculation method can be one or two orders of magnitude faster than using cursor on a traditional database (a single SQL statement can hardly implement this logic).

The ordered grouped subset technology is very useful for improving the performance of complex analysis on massive accounts.

With grouped subsets, it is very easy to explain DISTINCT, as we just need to take one record from each group. For example, calculate in which months each account in the transaction table made transactions.

A
1 =file(“trades.ctx”).open().cursor(id,dt)
2 =A1.group(id,month(dt)).(~(1))

The cursor calculated in A2 is the account and months where the transaction occurred. In fact, what it takes out is the first record of each grouped subset. Note that it is a cursor, we also need to use the fetch() to actually calculate and fetch the data. There will be many examples like this below, the only thing we need to do is to get the cursor because the result set may be too large to be taken out completely. After getting the cursor, we can do further calculations or save the result set as a file.

The operation for taking the first record of a grouped subset is relatively common, SPL provides the options:

A
1 =file(“trades.ctx”).open().cursor(id,dt)
2 =A1.group@1(id,month(dt))

The group@1()will do the same thing as above, but it will not generate the grouped subset first, which will consume less memory and can also adapt to the situations where the grouped subset is very large (although in this case, it is usually a small grouping).

In fact, group@1()can be understood as DISTINCT, and its function is basically the same as the id() function. When the data is in order, DISTINCT can be performed efficiently, while when the data is out of order, DISTINCT is as complex as grouping.

For the account transaction table described above, now we want to add a monthly cumulative amount information for each record, i.e., the cumulative transaction amount of the account in a month after this transaction is done, and then filter out the transaction (including date) when the cumulative transaction amount of each account exceeds 100 for the first time every month.

This information can be calculated after reading the grouped subset:

A
1 =file(“trades.ctx”).open().cursor(id,dt,amonut)
2 =A1.group(id,month(dt)).(~.derive(itertate(~~+amount,0):mca))
3 =A2.(~.select@1(mca>=100))

Using an iterative function for detailed data can accomplish such cumulative calculation. The iterative function here still has the aforementioned characteristics: there is an initial calculation result, and the traversed member is used to calculate new result each time. Unlike aggregate functions that only return the final result, the iterative function will return the current calculation result every time when dealing with detailed data. Therefore, the monthly cumulative amount after each transaction can be calculated in the field added by the derive function in A2.

It should be noted that each data item fetched by the cursor calculated in A2 is a table sequence, and filtering should be done on the table sequence in A3 (to take out the first record that meets the cumulative amount criteria) instead of filtering the cursor directly.

However, this algorithm needs to take out the grouped subset. If we change to another scenario where there is an order table sorted by product and date, and we want to calculate the date when the cumulative order amount of each product exceeds 100 per month. Because there may be many transaction records for the same product, the grouped subset may be very large, and the method of taking out the grouped subset in advance and then calculating will not work.

For cumulative calculations on ordered cursor, we can also use the grouping parameter of iterative function to implement:

A
1 =file(“orders2.ctx”).open().cursor(product,dt,amount)
2 =A1.derive(iterate( ~~+amount,0; prudoct,month(dt) ): mca)
3 =A2.select(mca>=100).group@1(product,moth(dt))

The parameter after the semicolon in the iterate()parameter is used to represent the grouping fields. When these fields (or expressions) change, SPL will restart the calculation of iterative function (set the calculation result as the initial value again and continue to iterate). During iterating, it only needs to compare the previous record, without the need to take the whole grouped subset, which consumes less memory space and, it supports large grouped subset. Moreover, since the cumulative amount field is added to the original record, the select() can be directly performed on the cursor in A3.

Finally, we need to use group@1() to perform DISTINCT. At this time, the first record of the grouped subset is retrieved. Although the grouping keys are product and month, the retrieved record is the one before grouping, i.e., the record containing the dt field.


Performance Optimization - 5.4 [Ordered traversal] Program cursor
Performance Optimization - Preface