Performance Optimization - 5.6 [Ordered traversal] Second-half ordered grouping

 

Performance Optimization - 5.5 [Ordered traversal] First-half ordered grouping

We have handled the “first-half ordered” situation, so is there something we can take advantage of in the “second-half ordered” situation?

Still, we take the account transaction table as example. The data under each account is sorted by date. Now we want to calculate the total transaction amount for all dates. Since the whole table is not ordered by date, the ordered grouping algorithm cannot be used. Yet, because the number of dates in a few years is not large, it is still a small grouping operation, which can be easily written with groups():

A
1 =file(“trades.btx”).cursor@b(dt,amount)
2 =A1.groups(dt;sum(amount))

However, this code does not take advantage of the characteristic that the data under an account is ordered by date, and still uses the conventional hash grouping mechanism.

We use this example to explain how to utilize the “second-half ordered” characteristic.

Assume that we use the ordered grouping mechanism to implement the “second-half ordered” situation. When traversing a record, we only need to compare its grouping key with that of the current last grouped record to determine whether to generate a new group or accumulate this record. As long as the dt within the same account is in order, no errors will occur. Moreover, it can decrease the calculation and comparison of hash values. When traversing to the next account, the dt will be resorted, and the dt of new record may be smaller than that of the current last grouped record. Once this phenomenon occurs, we need to compare this dt with that of the first one of the current grouped records to find an appropriate position, where we can either accumulate this record or insert a record, and then traverse the next transaction record. There is no need to compare the record that has been compared in grouped record table again, because it is also in order. Only when traversing the next account will the comparison with the grouped record table start from the beginning again.

As a result, when the accounts are traversed one by one, this grouped record table will be compared from beginning to end over and over again. For the ordered grouping algorithm, each record is only compared once during traversal, while for this algorithm, it may compare many times before finding the appropriate position. However, as long as the grouped record is already compared, it will not be compared again until the next account is traversed.

Of course, the overall comparison times of this algorithm is higher than that of ordered grouping, but if most of the data is ordered, the occurrence proportion of comparison starting from the beginning (when traversing to the next account) will be relatively lower (within the same account), and the increased comparison times may not be too many, which may be more advantageous than hash grouping.

A key point to note here is that it may involve an insert operation on the grouped record table, and the conventional sequential insertion is a very complex action. To maintain order, it needs to move all the subsequent members back to make room, and the complexity of this operation will offset all the advantages of taking advantages of order. Therefore, here we need to use the linked list to maintain the grouped record table. In this way, only the attributes of the members before and after the insertion point are changed when inserting, and a large number of members are not moved. Once a linked list is used, binary search cannot be performed. However, since the grouped record table is originally compared in order, it has little impact here.

SPL also makes this mechanism an option, which can be used directly:

A
1 =file(“trades.btx”).cursor@b(dt,amount)
2 =A1.groups@h(dt;sum(amount))

When using groups@h(), conventional hash method will not be used, but the linked list mechanism mentioned above will be used instead.

Such second-half ordered grouping also works for big grouping, and the algorithm is similar, except that when the grouped record table in memory becomes large enough, it needs to be written to the buffer file, and then repeat the process, and finally merge these buffer files. Generally, the process is similar to that of big sort grouping, and it is also self-adaptively to big grouping and small grouping.

A
1 =file(“trades0.btx”).cursor@b()
2 =A1.groupx@h(id;sum(amount))

Assuming trade0 is sorted by date, and the transactions on the same date are sorted by account, this code can efficiently calculate the total transaction amount of each account.

However, the main cost of big grouping is the writing and reading of buffer files. Using this algorithm still has advantages, but not as significant as small grouping.

In theory, this algorithm works for any data, and does not necessarily require the data to be “second-half ordered”, because any data can actually be seen as half-ordered data. However, if the ordered degree is too low, leading to the need to frequently compare with the grouped record table from the beginning, the performance will not catch up with hash grouping. Therefore, attention should be given to the ordered degree of data when using this algorithm.

The higher the second-half ordered degree, the more obvious the advantages of this algorithm will be. Once the ordered degree reaches its extreme, it becomes an ordered grouping.

Unlike the method for first-half ordered situation that can be used for sorting, this algorithm has little significance on sorting. The quick sort algorithm commonly used for small sorting has already taken advantage of the ordered degree of original data, while the main cost of big sorting is the writing and reading of buffer files, this algorithm does not reduce the number of buffer files, so it has no obvious advantages.

Since the grouping operation may involve aggregation, the actual buffer file is usually smaller than the original data, and sometimes much smaller. In this case, the time cost of comparison in memory will become higher, and the advantages of this algorithm will become apparent. However, if each grouping key is almost unique in the whole table, and actual aggregation rarely occurs, and there is little difference between buffer file and original table, this algorithm will not have much advantage over hash grouping or sort grouping.


Performance Optimization - 5.7 [Ordered traversal] Serial number grouping and controllable segmenting
Performance Optimization - Preface