"I. Problem introduction & Solution Though we already have ordered MERGE scheme to boost th .."

pjf RaqForum 22 No.
1 Reply • 894 View • 4 Years ago

Performance optimization skill: Attached Table

I. Problem introduction & Solution

Though we already have ordered MERGE scheme to boost the performance of join between a primary table and its sub table in Performance Optimization Skill: Ordered MERGE, we never intend to stop our work in finding a faster one. This time we will use esProc attached table to speed up the primary-sub tables association. In esProc, a composite table can save multiple tables, such as a primary table and its sub table, in one file. We create a composite table file from the primary table and then attach the sub table to the primary table. The sub table is thus an attached table that associates with the primary table through a field which is also the dimension field of the primary table.

When the composite table stores the association field in the attached table, it stores the association key in the primary table once and doesn’t store in the sub table physically. Such storage structure can reduce the disk reading actions when retrieving the attached table. The sub table is attached to the primary table through the association field, which is equivalent to pre-association, sparing the comparisons for association compared to ordered MERGE association algorithm and thus making the operation faster.

In the following part we’ll test how faster an association using the attached table is than that using the ordered MERGE scheme.

II. Test environment

The server for testing has two Intel2670 CPUs, 2.6G frequency, 16 core in total, 64G memory and an SSD hard disk.

A total of 200G data is generated according to TPCH standard. The primary table is orders and the sub table is lineitem. The records of two tables are sorted respectively by O_ORDERKEY and L_ORDERKEY in ascending order.

III. Generating an attached table

First we generate an attached table file from the previously-mentioned composite table files orders.ctx and lineitem.ctx. The SPL script is:

	A
1	=file(path+"orders.ctx").open().cursor()
2	=file(path+"lineitem.ctx").open().cursor(L_ORDERKEY:O_ORDERKEY, L_LINENUMBER, L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE)
3	=file(path+"orders_lineitem.ctx").create(#O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_SHIPPRIORITY)
4	=A3.attach(lineitem, #L_LINENUMBER, L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE)
5	=A3.append@i(A1)
6	=A4.append@i(A2)

The expression “primary table.attach(sub table,…) is used to attach the sub table to the primary table.

IV. Tests

The following is the SPL test script using the ordered MERGE algorithm, in which A1 defines the number of parallel tasks:

	A
1	1
2	=now()
3	=file(path+"orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;;A1)
4	=file(path+"lineitem.ctx").open().cursor(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT;;A3)
5	=joinx(A4:detail,L_ORDERKEY;A3:orders,O_ORDERKEY)
6	=A5.groups(year(orders.O_ORDERDATE):l_year; sum(detail.L_EXTENDEDPRICE*(1-detail.L_DISCOUNT)):revenue)
7	=interval@s(A2,now())

The SPL test script using the attached table is:

	A
1	1
2	=now()
3	=file(path+"orders_lineitem.ctx").open()
4	=A3.attach(lineitem)
5	=A4.cursor@m(L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE;;A1)
6	=A5.groups(year(O_ORDERDATE):l_year; sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)
7	=interval@s(A2,now())

Test results (Unit: sec):

Number of parallel task	1	2	4	8	16
Attached table	427	218	116	62	46
Ordered MERGE	675	361	171	92	64

V. Summary

According to the test results, the execution time of attached table test is just over 60% of that of ordered MERGE in performing primary-sub table association in various parallel, which proves that attached table approach can indeed improve the performance significantly.

We also made more tests and found that, since the ratio of the record number in the primary table to that in its sub table is 1:N, the bigger the value of N is or when the primary key of the primary table consists of more than one field, the less it takes in reading disk and comparing keys, thus the more the performance is boosted.

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/cFTcUNs7

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

pjf • 894 View • 4 Years ago