Performance Optimization - 5.4 [Ordered traversal] Program cursor

 

Performance Optimization - 5.3 [Ordered traversal] Ordered grouped subsets

Let’s continue using the account transaction table as an example. Now we want to find the transaction records where transaction has occurred for n consecutive days in each month, and then count the transaction amount by the day of the week of the occurrence date.

The second task is very simple, which is a common grouping and aggregating operation. However, the first task is a little troublesome. Even if the data table has been sorted by account and date, it still needs to take out the grouped subsets first, and then write a few lines of code to filter out the result. After that, these records will be stored in memory, then how can we proceed to the next step to perform the grouping and aggregating operation?

An easy method to think of is to write the calculated data into a buffer file gradually, and then group and aggregate this file:

A B
1 =file(“trades.ctx”).open().cursor(id,dt,amount)
2 for A1;id =A2.align@a(31,day(dt)).group@o(~==[])
3 =B2.select(~.len()>=n ).conj().conj()
4 =file(“temp.btx”).export@ab(B3,dt,amount)
5 =file(“temp.btx”).cursor@b().groups(day@w(dt);sum(amount))

A2 takes out each grouped subset; B2 aligns the subset to 31 days according to the transaction date, and then splits them into continuously empty and non-empty subsets using the ordered grouping operation; In B3, find the subsets whose time span exceeds n, that is, the records that transactions have occurred or have not occurred for n consecutive days, and then union these records, we can finally obtain the records that transactions have occurred for n consecutive days (transactions not occurred in n consecutive days are empty sets, which will not change the union result). Note that we need to conj twice here, because the result of align@a is a sequence of the sequences.

After computing, write the calculation result into a temporary file. We only need to write two fields, and finally perform the grouping and aggregating operation.

This calculation process will obviously be slow because it involves writing intermediate data to a buffer file, adding an extra write and read operation. In fact, these data can be directly used for grouping and aggregating operation, and there is no need to write them into external storage. However, since the grouping functions can only be based on table sequence or cursor, if we hard-code each batch of data to implement grouping and aggregating, it will be too troublesome.

SPL provides the program cursor, which allows us to implement this mechanism, that is, simulate the data generated in the loop as a cursor.

A B C
1 func =file(“trades.ctx”).open().cursor(id,dt,amount)
2 for A1;id =A2.align@a(31,day(dt)).group@o(~==[])
3 return B2.select(~.len()>=n ).conj().conj()
4 =cursor@c(A1).groups(day@w(dt);sum(amount))

By defining a subprogram, the required records can be calculated and returned in the loop of this subprogram. The cursor@c()will collect the returned values and concatenate them into a cursor. When we fetch data from the cursor (such as groups() here), the cursor() function will execute the subprogram and collect the returned values. Once the collected values are sufficient to meet the number requested in this fetch, the execution of the subprogram will be suspended, and the result of this fetch will be returned, but the subprogram will not be closed. When we need to fetch data next time, the subprogram will continue running until the whole loop is over, and the cursor() function will also return the signal indicating the cursor ends.

This process can concatenate the data continuously calculated in the loop as a cursor, without the need to write the intermediate data to a file, thereby making the computation of such complex processes achieve higher performance. This kind of cursor is called program cursor.

As we discussed earlier, SPL provides the hash big grouping algorithm, but it does not provide a similar sort algorithm. We can use the mechanism of program cursor to implement a rough hash big sort algorithm. For example, sort the order table by order amount:

A B C
1 func =file(“orders.btx”).cursor@b() =100.(file(~))
2 =B1.groupn(int(amount/100)+1;C1) >B1.skip()
3 for C1 return B3.import@b().sort(amount)
4 return cursor@c(A1)

In B2, the orders are split into 100 parts by amount (we assume the order amounts are basically evenly distributed in the range of 0-10,000, and you can adjust the split method according to actual situation. It is necessary to ensure that the split expression and the field values to be sorted are monotonically nondecreasing or nonincreasing, and that the number of records corresponding to each split value is small so that it can be loaded into memory). Then, we just need to return the sorting results of each part in order, the cursor@c() function will collect these returned values and concatenate them into a cursor.


Performance Optimization - 5.5 [Ordered traversal] First-half ordered grouping
Performance Optimization - Preface