Performance Optimization Skills: Position
Abstract
One of the features of SPL is that the data are ordered, which can improve the performance greatly if the positions are appropriately used. Now let’s start with a typical scenario in order to master the skills of using position gradually.
Performing binary search on sorted data can achieve better performance, but some algorithms may require the original order of data, making the sorting of data seem unnecessary. For example:
There are three fields in PerformanceRanking.txt: empID (id number of the employees), dep (the name of departments), and amount (the sales amount) respectively. This file includes the sales ranking of each employee in each department for this season, which has been sorted in reverse order by sales amount. Based on the ID number of a specified employee, we want to calculate how much more sales he should increase in order to raise his ranking. And there is no need to increase the sales amount if he already ranks first.
This algorithm requires subtracting the sales amount of the specified employee from the sales amount of the employee ranked one place higher, which is equivalent to doing a relative position calculation on the original data. Since the original order of the data is bound to be used, it seems that we shouldn’t sort them again, otherwise it will be difficult for one order of data to convert to the other. And other algorithms may also demand the original order of data. In short, the above logic may lead to the following script:
A |
B |
|
1 |
=file("PerformanceRanking.txt").import@t() |
/load the data |
2 |
=A1.pselect(empID:10) |
/query the sequence number of records by empID without using position |
3 |
=A1.calc(A2,if(#>1,amount[-1]-amount,0)) |
/do relative position calculation on the original data |
The above script does not sort the data, so binary search cannot be used, which leads to relatively bad performance.
In fact, on the basis of the reserved original order, we can use the position to sort the data, thus improving the performance. The script is:
A |
B |
|
5 |
=oPos=A1.psort(empID) |
/the position of record in the original data after sorting |
6 |
=index=A1(oPos) |
/use the position to generate sorted data |
7 |
=oPos(index.pselect@b(empID: 10)) |
/perform binary search and obtain the sequence number |
8 |
=A1.calc(A7,if(#>1,amount[-1]-amount,0)) |
/perform relative position calculation on the original data |
A5: The psort function only gets the position of the record in the original data after sorting rather than sorting the original data actually.
A6: Use oPos to generate the sorted data. Please note that the original data are not changed at all, and oPos can be used as an intermedia between the sorted data “index” and the original data for interchange.
A7: Perform binary search on the sorted data, and convert them back to the corresponding sequence number of the record in the original data.
To compare the performance difference between the two algorithms with or without position used, we can randomly take the id number of employees as parameter, simulate massive access with loops, and execute both algorithms respectively. The script is:
A |
B |
|
10 |
=100000.(A1(rand(A1.len())+1).empID) |
/generate 10,000 empID |
11 |
=now() |
|
12 |
for A10 |
=A1.pselect(empID:A12) |
13 |
=A1.calc(B12,if(#>1,amount[-1]-amount,0)) |
|
14 |
=interval@ms(A11,now()) |
/without position used, time: 13,552 milliseconds |
15 |
||
16 |
=now() |
|
17 |
for A10 |
=oPos(index.pselect@b(empID: A17)) |
18 |
=A1.calc(B17,if(#>1,amount[-1]-amount,0)) |
|
19 |
=interval@ms(A16,now()) |
/with position used, time: 165 milliseconds |
As shown above, using position can improve the performance tens of times. What’ more, the data volume in the example is relatively small; the performance difference will be even greater as the volume of data increases because the time complexity of traversal search is linear while that of binary search is logarithmic.
Quick alignment
The align function can align the data according to a sequence, for example, input the condition: =pOrderList= [10250,10247,10248,10249,10251], and align the order details according to the list. To get the subtotal amount of each order, the code is as follows:
A |
|
1 |
=connect("demo").query@x("select orderID,productID,price,quantity from orderDetail") |
2 |
=A1.align@a(pOrderList,orderID).new(orderID,~.sum(price*quantity):subtotal) |
But the above script does not use the position, thus the performance is not very prominent. To improve it, we can first sort the sequence (creating index table manually), and align them using binary search, then restore them back to the original order. The code is:
A |
|
1 |
=connect("demo").query@x("select orderID,productID,price,quantity from orderDetail") |
2 |
=oPos= pOrderList.psort() |
3 |
=index= pOrderList (oPos) |
4 |
=A1.align@ab(index,orderID).new(orderID,~.sum(price*quantity):total) |
5 |
=A4.inv(oPos) |
A2-A3: Create the index table manually.
A4: Align the orders details table with the orders list to get the subtotal amount. Binary search, i.e., @b option, can be used to perform the alignment because the index table is ordered.
A5: Adjust A4 by the original position, keeping the same order with pOrderList. The inv function is able to adjust members by the specified position, and here it adjusts members according to the original order, which equals to restoring them to the original position.
We can simulate massive access to test the two algorithms with or without position used, and the significant performance improvement can be clearly seen after the position is used:
A |
B |
|
8 |
=now() |
|
9 |
for A9 |
=A1.align@a(A12,orderID).new(orderID,~.sum(price*quantity):total) |
10 |
=interval@ms(A11,now()) |
/without position used, time: 43,456 milliseconds |
11 |
||
12 |
=now() |
|
13 |
for A9 |
=oPos=A16.psort() |
14 |
=index=A16(oPos) |
|
15 |
=A1.align@ab(index,orderID).new(orderID,~.sum(price*quantity):total) |
|
16 |
=B18.inv(oPos) |
|
17 |
=interval@ms(A15,now()) |
/with position used, time: 7,313 milliseconds |
18 |
=now() |
|
19 |
for A9 |
=A1.align@a(A12,orderID).new(orderID,~.sum(price*quantity):total) |
20 |
=interval@ms(A11,now()) |
/without position used, time: 43,456 milliseconds |
Batch search on ordered data
Sometimes we need to perform batch search on ordered data, for example: pOrderList=[10877,10588,10611,11037,10685]. To calculate the total shipping charges of the orders that satisfy the above list, the code is as follows:
A |
B |
|
1 |
=connect("demo").query@x("select orderID,customerID,orderDate,shippingCharge from order order by orderID") |
|
2 |
=pOrderList=[10877,10588,10611,11037,10685] |
/the parameter of the list |
3 |
=A1.select(pOrderList.pos(OrderID)).sum(shippingCharge) |
/without position used, id number of the order |
Descriptions: The pos and select functions are executed together to perform batch search, in which the pos function returns the position of a certain value in the sequence, or returns “null” if the value does not exist in the sequence; the select function is used to query and returns the current record when the condition is neither “null” nor “false”.
But the above code does not use the position, leading to less effective performance.
It should be noted that the records of the order are ordered, so we can perform binary search to get the positions of the eligible orders, and then use the positions to retrieve the records and calculate them. The specific code is as follows:
A |
B |
|
5 |
=A1.(orderID).pos@b(pOrderList) |
|
6 |
=A1(A5).sum(shippingCharge) |
/with position used, id number of the order |
A1.(orderID) is able to get the “orderID” column, and pos@b can quickly get the positions of members with binary search for ordered data.
We can simulate massive access to test the two algorithms with or without position used, and the significant performance improvement can be clearly seen after the position is used:
A |
B |
|
8 |
=A1.(OrderID) |
/prepare for the performance test |
9 |
=100000.(A1.(OrderID).sort(rand()).to(rand(100))) |
/generate 100,000 lists randomly |
10 |
||
11 |
=now() |
|
12 |
for A9 |
=A1.select(A12.pos(OrderID)).sum(shippingCharge) |
13 |
=interval@ms(A11,now()) |
/without position used, 85,166 milliseconds |
14 |
||
15 |
=now() |
|
16 |
for A9 |
=A1.(OrderID).pos@b(A16) |
17 |
=A1(B16).sum(shippingCharge) |
|
18 |
=interval@ms(A15,now()) |
/with position used, 3,484 milliseconds |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version