Performance optimization skill: Ordered Locate Association to speed up the filtering on joined primary-sub tables
I. Problem introduction & Solution
The ordered MERGE algorithm has proved a big success in improving the association performance in Performance Optimization Skill: Ordered MERGE. Yet we always want faster performance, so this time we try to load as few records as possible.
Usually, the association of a primary table and its sub table is followed by other more calculations such as filtering. The ordered MERGE algorithm first retrieves the primary table and the sub table respectively and then performs association on them. When many records in one of the two associated tables are filtered out, the other associated table are left with many records that cannot be associated, which is not known before doing the MERGE, so they will still be loaded, thus wasting a lot of time.
Instead, we can locate the records that can be associated according to the primary key of the filtered table and then perform association to skip a lot of read actions. And the SPL composite table stores data in order by the key, which allows us to locate the target records with key values effectively.
We’ll do tests to compare this ordered locate association technique with the ordinary ordered MERGE algorithm to see whether this approach can improve the performance, and the optimization logic will be further explained during the processes.
II. Test environment
The serve for testing has two Intel2670 CPUs, 2.6G memory, 16 core in total, 128G memory and an SSD hard disk.
A total of 200G data is generated according to TPCH standard. The primary table is orders and the sub table is lineitem. The records of two tables are respectively sorted by O_ORDERKEY and L_ORDERKEY in ascending order. Because of the relatively large data volume and the long execution time in single thread, we will do the teats in 4-thread parallel.
III. Filtering on the primary table
The filtering condition is O_TOTALPRICE>price. Parameter price is used to test the performance with different data sizes after the primary table is filtered.
1. Ordered MERGE
SPL script:
A |
|
1 |
=now() |
2 |
=file("/home/ctx/orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;O_TOTALPRICE>price;4) |
3 |
=file("/home/ctx/lineitem.ctx").open().cursor(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT;;A2) |
4 |
=joinx(A3:detail,L_ORDERKEY;A2:orders,O_ORDERKEY) |
5 |
=A4.groups(year(orders.O_ORDERDATE):year;sum(detail.L_EXTENDEDPRICE*(1-detail.L_DISCOUNT)):revenue) |
6 |
=interval@s(A1,now()) |
2. Ordered locate association
SPL script:
A |
|
1 |
=now() |
2 |
=file("/home/ctx/orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;O_TOTALPRICE>price;4) |
3 |
=file("/home/ctx/lineitem.ctx").open().news(A2,L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE) |
4 |
=A3.groups(year(O_ORDERDATE):year;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) |
5 |
=interval@s(A1,now()) |
Instead of using the cursor function to generate a sub table cursor and then performing association with the joinx function, A3 uses the news function to directly generate a cursor of the joined primary-sub tables based on the primary table cursor using the ordered locate association algorithm, so joinx is not needed any more.
3. Test results & explanations
With the primary table (orders) containing 3 billion records, the test results are as follows:
Number of records of filtered primary table |
2 billion |
1.4 billion |
0.9 billion |
0.17 billion |
Ordered MERGE (sec) |
131 |
115 |
94 |
57 |
Ordered target MERGE (sec) |
105 |
87 |
69 |
35 |
Both SPL scripts create a primary table cursor in the same way. With the ordered MERGE algorithm, a sub table cursor is created by retrieving all the three required fields for each record and then performing ordered MERGE association on the two cursors using joinx function. However, the ordered locate association algorithm locates records in the sub table by matching its join key values to the key values in the filtered primary table. The composite table stores data by block and its index headers record the maximum and the minimum key values of each block. If a join key value in the primary table is greater than the maximum value of the current block in the sub table, SPL will skip this block and move onto the next block to match until a block containing this key value is found. The matching process only compares the join key values, and the other two fields will be read only when the key values are associated.
The test results show that the ordered locate association technique can considerably improve the performance no matter what data size the filtered primary table has.
IV Filtering on the sub table
The filtering condition is L_QUANTITY >quantity. Parameter quantity is used to test the performance with different data sizes after the sub table is filtered.
1. Ordered MERGE
SPL script:
|
A |
1 |
=now() |
2 |
=file("/home/ctx/orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;;4) |
3 |
=file("/home/ctx/lineitem.ctx").open().cursor(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_QUANTITY>quantity;A2) |
4 |
=joinx(A3:detail,L_ORDERKEY;A2:orders,O_ORDERKEY) |
5 |
=A4.groups(year(orders.O_ORDERDATE):year; sum(detail.L_EXTENDEDPRICE*(1-detail.L_DISCOUNT)):revenue) |
6 |
=interval@s(A1,now()) |
2. Ordered locate association
SPL script:
|
A |
1 |
=now() |
2 |
=file("/home/ctx/lineitem.ctx").open().cursor@m(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_QUANTITY>quantity;4) |
3 |
=file("/home/ctx/orders.ctx").open().new(A2,O_ORDERKEY,O_ORDERDATE,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):v) |
4 |
=A3.groups(year(O_ORDERDATE):year; sum(v):revenue) |
5 |
=interval@s(A1,now()) |
The new function is used in A3 to create a cursor on the primary table and the sum function to generate a field value for the primary table by grouping the sub table according to the key values of primary table and then calculating the sum for each group.
3. Test results & explanations
With the sub table (lineitem) containing 12 billion records, the test results are as follows:
Number of records of filtered sub table |
8.4 billion |
7.2 billion |
3.6 billion |
1.2 billion |
0.5 billion |
Ordered MERGE (sec) |
127 |
115 |
84 |
57 |
36 |
Ordered target MERGE (sec) |
92 |
87 |
72 |
47 |
31 |
These two SPL scripts are basically the same as those of filtering on the primary table, but the difference is that a filtered sub table is joined with the primary table using the new function. The optimization method is also the same as the previous one: the filtered sub table only associates with the matched records in the primary table rather than all of them by traversing the whole table, which decreases the number of records to be read.
According to the test results, as the remaining records in the filtered sub table become fewer and fewer, the advantage of the algorithm in decreasing data reading amount grows smaller as well, resulting in less effective performance.
V Summary
In a nutshell, the ordered locate association algorithm can make the performance of primary-sub table join query with a filtering condition even better.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL