Performance optimization skill: Attached Table
I. Problem introduction & Solution
Though we already have ordered MERGE scheme to boost the performance of join between a primary table and its sub table in Performance Optimization Skill: Ordered MERGE, we never intend to stop our work in finding a faster one. This time we will use esProc attached table to speed up the primary-sub tables association. In esProc, a composite table can save multiple tables, such as a primary table and its sub table, in one file. We create a composite table file from the primary table and then attach the sub table to the primary table. The sub table is thus an attached table that associates with the primary table through a field which is also the dimension field of the primary table.
When the composite table stores the association field in the attached table, it stores the association key in the primary table once and doesn’t store in the sub table physically. Such storage structure can reduce the disk reading actions when retrieving the attached table. The sub table is attached to the primary table through the association field, which is equivalent to pre-association, sparing the comparisons for association compared to ordered MERGE association algorithm and thus making the operation faster.
In the following part we’ll test how faster an association using the attached table is than that using the ordered MERGE scheme.
II. Test environment
The server for testing has two Intel2670 CPUs, 2.6G frequency, 16 core in total, 64G memory and an SSD hard disk.
A total of 200G data is generated according to TPCH standard. The primary table is orders and the sub table is lineitem. The records of two tables are sorted respectively by O_ORDERKEY and L_ORDERKEY in ascending order.
III. Generating an attached table
First we generate an attached table file from the previously-mentioned composite table files orders.ctx and lineitem.ctx. The SPL script is:
A |
|
1 |
=file(path+"orders.ctx").open().cursor() |
2 |
=file(path+"lineitem.ctx").open().cursor(L_ORDERKEY:O_ORDERKEY, L_LINENUMBER, L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE) |
3 |
=file(path+"orders_lineitem.ctx").create(#O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_SHIPPRIORITY) |
4 |
=A3.attach(lineitem, #L_LINENUMBER, L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE) |
5 |
=A3.append@i(A1) |
6 |
=A4.append@i(A2) |
The expression “primary table.attach(sub table,…) is used to attach the sub table to the primary table.
IV. Tests
The following is the SPL test script using the ordered MERGE algorithm, in which A1 defines the number of parallel tasks:
A |
|
1 |
1 |
2 |
=now() |
3 |
=file(path+"orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;;A1) |
4 |
=file(path+"lineitem.ctx").open().cursor(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT;;A3) |
5 |
=joinx(A4:detail,L_ORDERKEY;A3:orders,O_ORDERKEY) |
6 |
=A5.groups(year(orders.O_ORDERDATE):l_year; sum(detail.L_EXTENDEDPRICE*(1-detail.L_DISCOUNT)):revenue) |
7 |
=interval@s(A2,now()) |
The SPL test script using the attached table is:
A |
|
1 |
1 |
2 |
=now() |
3 |
=file(path+"orders_lineitem.ctx").open() |
4 |
=A3.attach(lineitem) |
5 |
=A4.cursor@m(L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE;;A1) |
6 |
=A5.groups(year(O_ORDERDATE):l_year; sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) |
7 |
=interval@s(A2,now()) |
Test results (Unit: sec):
Number of parallel task |
1 |
2 |
4 |
8 |
16 |
Attached table |
427 |
218 |
116 |
62 |
46 |
Ordered MERGE |
675 |
361 |
171 |
92 |
64 |
V. Summary
According to the test results, the execution time of attached table test is just over 60% of that of ordered MERGE in performing primary-sub table association in various parallel, which proves that attached table approach can indeed improve the performance significantly.
We also made more tests and found that, since the ratio of the record number in the primary table to that in its sub table is 1:N, the bigger the value of N is or when the primary key of the primary table consists of more than one field, the less it takes in reading disk and comparing keys, thus the more the performance is boosted.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL