Optimization Skills for SPL Sorting
This article introduces various optimization techniques for SPL sorting. Besides the ordinary sorting algorithm, it also illustrates alternative sorting algorithms suited to different data characteristics and scenarios, which reduce the number of comparisons and the amount of IO, thus improving performance.
1. Sorting in memory
We can use an in-memory sorting function such as A.sort() in SPL when the data fit easily in memory. The default sorting function in SPL is a multi-threaded algorithm based on merge sort; in other words, its optimization is basically achieved by adding threads. The actual number of threads in use is specified by the [maximum parallel task] option in the esProc configuration. The code sample is:
| | A | B |
|---|---|---|
| 1 | =5000*1000 | /number of elements |
| 2 | =A1\1000 | /maximum random number |
| 3 | =to(A1).(rand(A2)) | /generate a random sequence |
| 4 | =now() | /current time |
| 5 | =A3.sort() | /sort in ascending order |
| 6 | =interval@ms(A4,now()) | /time spent in sorting |
The test computer’s Core i7 CPU has 4 cores and 8 threads in total. The test results under different [maximum parallel task] configurations are as follows:

| Maximum parallel task | Average time (ms) |
|---|---|
| 1 (single thread) | 1800 |
| 4 | 800 |
| 8 | 660 |
As shown in the table, a multi-core CPU, or a computer with multiple CPUs, can fully exploit the parallel computing capability of each core, improving the sorting performance significantly.
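The structure of this multi-threaded merge sorting can be sketched outside SPL as well. Below is a minimal Python illustration, assuming plain comparable values; note that CPython's GIL limits real parallel speedup for pure-Python comparisons, whereas SPL's threads sort natively in parallel:

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor

def parallel_merge_sort(data, workers=4):
    """Split the data into chunks, sort each chunk in its own thread,
    then k-way merge the sorted chunks into one ordered list."""
    step = (len(data) + workers - 1) // workers
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    return list(heapq.merge(*sorted_chunks))

data = [random.randrange(5000) for _ in range(100_000)]
assert parallel_merge_sort(data) == sorted(data)
```

The number of workers plays the same role as [maximum parallel task]: more workers mean smaller chunks to sort, at the cost of a wider final merge.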
In the above example, each value has about 1,000 duplicates on average, which actually has no impact on the performance of the A.sort() function. When there are many duplicates, we can further improve performance with the grouping method, using the A.group@s() function to perform the sorting. This method first groups the elements by hash calculation and sorts the group keys, then concatenates the sorted groups to get the sorted result. The code sample is:
| | A | B |
|---|---|---|
| 1 | =5000*1000 | /number of elements |
| 2 | =A1\1000 | /maximum random number |
| 3 | =to(A1).(rand(A2)) | /generate a random sequence |
| 4 | =now() | /current time |
| 5 | =A3.group@s() | /sort the data, with an average of 1,000 duplicates per value, in ascending order with the grouping method |
| 6 | =interval@ms(A4,now()) | /time spent in sorting |
With the grouping method, the average time of the operation drops to 360 ms, which shows that this method is quite suitable for data with a large number of duplicates; the more duplicates there are, the better the performance.
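The group-then-sort idea behind A.group@s() can be sketched in Python as a simplified grouping/counting sort over plain values (SPL's actual implementation details may differ):

```python
import random

def group_sort(data):
    """Hash-group equal values with their counts, sort only the distinct
    values, then expand each group in order. With d distinct values among
    n elements, comparison cost drops from O(n log n) to O(d log d)."""
    counts = {}
    for v in data:
        counts[v] = counts.get(v, 0) + 1
    result = []
    for v in sorted(counts):          # sort the group keys only
        result.extend([v] * counts[v])  # each group needs no further sorting
    return result

data = [random.randrange(5000) for _ in range(500_000)]  # ~100 duplicates each
assert group_sort(data) == sorted(data)
```

This is why performance improves as the duplicate count grows: the expensive comparison-based sort runs only on the distinct values.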
2. Sorting on external storage
We need to perform the sorting on external storage when the data are too large to fit in memory. External storage sorting first loads the data in batches and sorts each batch into a temporary file, then merges all the temporary files to get the final result.
The number of temporary files affects the sorting efficiency: too many temporary files means too many merge ways, which consumes more resources and degrades the merging performance. Therefore, loading a larger amount of data at a time can improve the performance of the sortx function.
The function SPL provides for external storage sorting is cs.sortx(). The number of temporary files generated during the sorting is determined by the amount of original data and the amount of data loaded each time; the latter can be specified by a parameter of the sortx function. Programmers can specify a reasonable loading amount according to the record size and the available memory for the best performance. If the amount is not specified beforehand, SPL will estimate an approximate value based on the available memory of the virtual machine.
The temporary files are released when the result cursor finishes fetching data or its close method is invoked.
The test for external storage sorting requires a large amount of data, but the efficiency of fetching data from database through JDBC is very poor, thus this article will use the more efficient bin file for testing.
The test data simulate an orders table with the structure {order_id, order_datetime, customer_id, employee_id, product_id, quantity, price}, ordered by order_id and order_datetime, with 20 million randomly generated records.
The test data are also generated in SPL, and the code is:
| | A | B |
|---|---|---|
| 1 | 2018-01-01 00:00:00 | =file("orders_2018.btx") |
| 2 | =to(1000*1000) | 0 |
| 3 | for 20 | =A2.new(B2+~:order_id, elapse@s(A1,order_id):order_datetime, rand(1000000)+1:customer_id, rand(1000)+1:employee_id, rand(10000)+1:product_id, rand(1000)+1:quantity, rand(100000000)/100:price) |
| 4 | | =B1.export@ab(B3) |
| 5 | | >B2=B2+A2.len() |
The code for the sorting operation is:
| | A | B |
|---|---|---|
| 1 | =now() | |
| 2 | =file("orders_2018.btx").cursor@b() | /create a cursor on the orders bin file to fetch data |
| 3 | =A2.sortx(customer_id;2000000) | /sort the cursor by customer_id; the second parameter is the amount of data loaded each time |
| 4 | =file("orders_2018_customer.btx").export@b(A3) | /export the sorting result to the bin file |
| 5 | =interval@s(A1,now()) | /the operation time, unit: seconds |
The test results are as follows:
| Loading data amount each time | Time (s) |
|---|---|
| 2 million | 73 |
| 200,000 | 216 |
3. Many-way merging
When computing on big data in SPL, we usually externalize the data into bin files or composite tables for better performance, and sort them by the commonly used filtering dimensions for higher filtering performance. If we then need to append new data to the original file and keep all the data sorted, the orderliness of the bin file or composite table is very helpful: since the original data are already ordered, we can sort only the new data by the dimensions and then merge them with the original data to get the final ordered result. The data volume involved in this kind of sorting is far smaller than that of a full sort. The detailed test is as follows:
In the previous code for generating data on external storage, rewrite A1 as 2017-01-01 00:00:00 and B1 as =file("orders_2017.btx") to generate the simulated orders data file of 2017, orders_2017.btx. Then use the code for external storage sorting to sort the data by the customer_id field, generating the result file orders_2017_customer.btx.
Then we sort all the data of 2017 and 2018 together, that is, merge orders_2017_customer.btx and orders_2018_customer.btx to generate a new ordered file. The code sample is:
| | A | B |
|---|---|---|
| 1 | =now() | |
| 2 | =file("orders_2017_customer.btx").cursor@b() | |
| 3 | =file("orders_2018_customer.btx").cursor@b() | |
| 4 | =[A2,A3].mergex(customer_id) | /merge the two cursors, both ordered by customer_id, into a new cursor ordered by customer_id |
| 5 | =file("orders1.btx").export@b(A4) | /export the sorted data to the bin file |
| 6 | =now() | =interval@s(A1,A6) |
| 7 | =file("orders_2017_customer.btx").cursor@b() | |
| 8 | =file("orders_2018_customer.btx").cursor@b() | |
| 9 | =[A7,A8].conjx().sortx(customer_id;2000000) | /concatenate the two cursors vertically and sort them on external storage |
| 10 | =file("orders2.btx").export@b(A9) | |
| 11 | =now() | =interval@s(A6,A11) |
The first six lines of the code adopt the merging method, which takes 30 seconds, while the last five lines perform the simple full sort, which takes 133 seconds.
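The gap between the two timings follows from the algorithms involved: merging two already-ordered inputs is a single linear pass, while the full sort repeats the whole chunk-sort-and-merge work. A Python sketch of the comparison, with plain customer ids standing in for records:

```python
import heapq
import random

# two yearly datasets, each already sorted by customer id (playing the role
# of the orders_2017_customer.btx and orders_2018_customer.btx files)
ids_2017 = sorted(random.randrange(1_000_000) for _ in range(100_000))
ids_2018 = sorted(random.randrange(1_000_000) for _ in range(100_000))

# mergex-style: one linear pass over the two ordered inputs, O(n)
merged = list(heapq.merge(ids_2017, ids_2018))

# conjx().sortx()-style: concatenate and re-sort everything, O(n log n)
resorted = sorted(ids_2017 + ids_2018)

assert merged == resorted  # same result, very different cost
```

The merge also streams: it never needs both inputs in memory at once, which is why it suits externalized yearly files.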
4. First-half ordered sorting
If the data have to be sorted by multiple fields and are already ordered by some leading sorting fields, we can load the data group by group by the ordered fields, sort each group in memory, and then export it. This method spares one round of loading and writing compared with the cs.sortx() function, so its performance is substantially better.
For example, the orders table data are generated in order of datetime. If we want to sort the orders table by the date and customer fields, this method can be adopted. The code sample is:
| | A | B |
|---|---|---|
| 1 | =now() | |
| 2 | =file("orders_2017.btx").cursor@b() | |
| 3 | =A2.run(order_datetime=date(order_datetime)) | /convert order_datetime from datetime to date |
| 4 | =A2.group@qs(order_datetime;customer_id) | /the data are already ordered by order_datetime; sort them by order_datetime and customer_id |
| 5 | =file("orders_2017_date_customer.btx").export@b(A4) | |
| 6 | =interval@s(A1,now()) | |
The time used is 38 seconds (the value in A6); it becomes 95 seconds if the expression in A4 is replaced with the following code:
| | A | B |
|---|---|---|
| 4 | =A2.sortx(order_datetime,customer_id;2000000) | /the data are already ordered by order_datetime; sort them by order_datetime and customer_id |
5. Index sorting
The composite table of SPL supports creating an index on some columns, and commonly used columns can also be saved in the index; then we don't need to access the original file at all if all the to-be-accessed columns are in the index, sparing a lot of IO operations. What's more, if the sorting fields are the index fields and all the to-be-accessed fields are in the index, we can take advantage of the orderliness of the index and return an ordered cursor with the T.icursor@s() function.
The code for creating the index is as follows:
| | A | B |
|---|---|---|
| 1 | 2018-01-01 00:00:00 | =file("orders.ctx") |
| 2 | =to(1000*1000) | 0 |
| 3 | =B1.create@y(#order_id,#order_datetime,customer_id,employee_id,product_id,quantity,price) | |
| 4 | /the following loop creates the composite table data | |
| 5 | for 20 | =A2.new(B2+~:order_id, elapse@s(A1,order_id):order_datetime, rand(1000000)+1:customer_id, rand(1000)+1:employee_id, rand(10000)+1:product_id, rand(1000)+1:quantity, rand(100000000)/100:price) |
| 6 | | =A3.append(B5.cursor()) |
| 7 | | >B2=B2+A2.len() |
| 8 | =A3.index(PriceIndex;price;product_id,order_datetime) | /create an index on the price field and keep the values of the product_id and order_datetime fields |
The code for creating an ordered cursor using the index is as follows:
| | A | B |
|---|---|---|
| 1 | =now() | =file("orders.ctx").create() |
| 2 | =B1.icursor@s(order_datetime,product_id,price;true,PriceIndex) | /return the ordered cursor using the index |
| 3 | for A2,10000 | /traverse the cursor data |
| 4 | =now() | =interval@ms(A1,A4) |
| 5 | =B1.cursor(order_datetime,product_id,price) | |
| 6 | =A5.sortx(price) | /sort on external storage |
| 7 | for A6,10000 | /traverse the cursor data |
| 8 | =now() | =interval@ms(A4,A8) |
As the test shows, traversing the ordered cursor takes 4 seconds (the value in B4), while sorting and traversing on external storage takes 54 seconds (the value in B8).
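The reason the index cursor wins can be sketched in Python: a covering index is kept sorted by the index field and carries the extra columns, so an ordered scan of the requested columns reads only the index and performs no sort at all. Below is a toy in-memory stand-in for the composite table and its PriceIndex:

```python
import random

# base "composite table": rows of (order_datetime, product_id, price)
rows = [(random.randrange(10**6), random.randrange(10_000),
         random.randrange(10**8) / 100) for _ in range(5_000)]

# a covering index on price that also stores product_id and order_datetime;
# it is maintained sorted by the index field
price_index = sorted((price, product_id, dt) for dt, product_id, price in rows)

# icursor@s-style scan: iterate the index directly; every requested column
# is present in the index, so the base rows are never touched and no
# sorting happens at query time
ordered = [(dt, product_id, price) for price, product_id, dt in price_index]

assert [r[2] for r in ordered] == sorted(r[2] for r in rows)
```

The cost of sorting is paid once, when the index is built, instead of on every query; that is the trade-off behind keeping extra columns in the index.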
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL