Performance Optimization - 5.1 [Ordered traversal] Ordered grouping and aggregating
Performance Optimization - 4.9 [Traversal technology] Column-wise computing
If a data table is ordered by the grouping key, we can use the ordered grouping algorithm.
The process of ordered grouping is very simple. When traversing, we only need to compare the key value between the current record and the last grouped subset. If they are the same, continue to put the record into the grouped subset. If they are different, it means the calculation of this grouped subset has been completed. In this case, create a new grouped subset, and put the record into the new subset. Repeat the process until the end of the traversal, and we will get all grouped subsets.
Similarly, for big data scenarios, it is more common to obtain the aggregate value of each grouped subset, so only one grouped result set needs to be kept. When traversing to a certain record, judge its grouping key value to determine whether to accumulate the current record value to the last grouped record or generate a new grouped record. This algorithm can be applied to all accumulable aggregation operations (describable by iteration function).
Ordered grouping only requires comparing adjacent records, so its complexity is very low and there is no luck issue of hash grouping. Moreover, the records in grouped subset or grouped result set are originally calculated one by one, thus it will not change the calculated grouped subset and grouped record when the new data is traversed. Furthermore, such computing process does not need a large memory space to store calculation results, and is well suited for returning the results in the form of a cursor, so it is naturally suitable for big grouping.
In fact, we introduced the sort grouping of big grouping earlier, its second round uses exactly the ordered grouping algorithm.
For example, for the orders sorted by time, you can use this algorithm to calculate the total order amount of each day:
A | |
---|---|
1 | =file(“orders.btx”).cursor@b() |
2 | =A1.groups@o(dt;sum(amount)) |
In SPL, the groups@o()represents the ordered grouping, but it still returns a table sequence, and using groups() is considered as small grouping.
The groups@o()for small grouping can work on multi-cursor. The data table ordered by the grouping key value can be seen as composed of continuous grouped subsets. Segmentation does not necessarily happen between two grouped subsets (possibly within a certain grouped subset), but groups@o() will finally adjust the calculation result.
SPL does not provide a corresponding groupx@o(), and its ordered grouping algorithm to return a big result uses another function:
A | |
---|---|
1 | =file(“orders.btx”).cursor@b() |
2 | =A1.group(dt).new(dt,~.sum(amount)) |
3 | =A1.group(dt;sum(amount)) |
The group() function on the cursor will consider by default that the cursor is ordered by grouping key values, and it will return a cursor that can take out each grouped subset in turn, and then further operations can be performed. If the aggregation expression is written in the parameters, the cumulative method will be used directly to perform the aggregation, and it will still return a cursor, except that the cursor members will be the grouped records consisting of grouped aggregation values.
Here, A2 and A3 use different methods to perform the same operation and obtain the same result. For the method of A2, since it needs to take out each grouped subset, it requires that the grouped subset cannot be too large, while the method of A3 is to use the calculation process of iterative function to perform aggregation calculation, which also works for big grouped subset. If you only want to get the grouped aggregation value, the code of A3 will be more advantageous.
Both A2 and A3 are cursors, it still needs to use fetch()to take out the calculated data. Moreover, the group() function is a delayed cursor, and the actual calculation is done only when fetching.
For big grouping, the group()cannot directly support multi-cursor. If the segmentation position falls within a certain grouped subset, it will result in incorrect calculation result. Theoretically, we can adopt the strategy of discarding the head and complementing the tail discussed in the text file segmentation section, that is, find the next record with a different grouping key value starting from the segmentation position (except the first segment), and then start traversing this segment. After this segment is traversed, continue to the next segment and fetch a record and continue to traverse until this grouped subset is completely traversed. In this way, a correct calculation result can be ensured when using multi-cursor for segmentation.
However, unlike text files, a certain grouped subset may be very large, which may cause a very large “head” to be discarded in the above-mentioned strategy, resulting in performance decrease due to repeated read. Moreover, when executing the group()function, since it is not clear how the cursor is calculated out (the cursor may come from various sources), the said strategy should already be processed when the data table is segmented, which requires to know by which field the data table is ordered when segmenting.
SPL’s composite table provides another segmentation mechanism. Because SPL needs to specify the data structure in advance when creating a composite table, some information like ordered field is known in advance. Therefore, when segmenting, it can ensure that the segmentation point will definitely fall between the grouped subsets. In contrast, since the text files and bin files do not need to specify the data table structure in advance, this mechanism is not provided for them.
A | |
---|---|
1 | =file(“orders.ctx”).create@p(#dt,…) |
2 | =file(“orders.ctx”).open().cursor@m(dt,amount;;4) |
3 | =A2.group(dt;sum(amount)) |
When creating a composite table, using the @p option will notify SPL not to put the records with the same value in the first field (usually the ordered field) into two segments during segmentation. Currently, the segmentation processing is provided only for the first field, as past practices have shown that it is enough.
Then, you can use multi-cursor to calculate. Note that the multi-cursor syntax for composite table is slightly different from that for text and bin files, and should be written at the parameter after the second semicolon.
Performance Optimization - 5.2 [Ordered traversal] DISTINCT and COUNT(DISTINCT)
Performance Optimization - Preface
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL