Performance optimization case course: TPCH-Q21
select * from (
select
s_name,
count(*) as numwait
from
supplier,lineitem l1,orders,nation
where
s_suppkey = l1.l_suppkey
and o_orderkey = l1.l_orderkey
and o_orderstatus = 'F'
and l1.l_receiptdate > l1.l_commitdate
and exists (
select
*
from
lineitem l2
where
l2.l_orderkey = l1.l_orderkey
and l2.l_suppkey <> l1.l_suppkey
)
and not exists (
select
*
from
lineitem l3
where
l3.l_orderkey = l1.l_orderkey
and l3.l_suppkey <> l1.l_suppkey
and l3.l_receiptdate > l3.l_commitdate
)
and s_nationkey = n_nationkey
and n_name = 'CHINA'
group by
s_name
order by
numwait desc,
s_name
) where rownum<=100;
The main query is a grouping & aggregation operation on the result set of joining between the primary-sub table and the foreign key table. The filtering condition is relatively complex and contains two exists subqueries.
1. Data storage
Both exists sub-queries perform computation on lineitem records under the same L_ORDERKEY.
Store the orders table and lineitem table in order by the primary key ORDERKEY. When associating the two tables, perform order-based merge by ORDERKEY, and perform order-based grouping but without aggregation on the result set of association and keep the grouped subsets. The grouped subsets are small sets composed of lineitem records with the same L_ORDERKEY value. It will be simpler to calculate the two exists conditions mentioned above based on these small sets.
Continue to use orders.ctx and lineitem.ctx from Q3 query and supplier.ctx and nation.btx from Q2 query. Copy these tables to the main directory of this query.
2. General method
Calculation code:
A |
|
1 |
=now() |
2 |
>name="CHINA" |
3 |
=file("nation.btx").import@b().select@1(N_NAME==name).N_NATIONKEY |
4 |
=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NAME;S_NATIONKEY==A3).fetch().keys@im(S_SUPPKEY) |
5 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY;O_ORDERSTATUS=="F") |
6 |
=(lineitemFile=file("lineitem.ctx").open()).news@r(A5,L_ORDERKEY:ORDERKEY;A4.find(L_SUPPKEY)) |
7 |
=lineitemFile.news(A6:ORDERKEY,L_ORDERKEY,L_SUPPKEY,L_RECEIPTDATE,L_COMMITDATE) |
8 |
=A7.group(L_ORDERKEY) |
9 |
=A8.conj(if((t=~.select(L_RECEIPTDATE>L_COMMITDATE)) && (s=t.L_SUPPKEY) && !t.pselect(L_SUPPKEY!=s) && ~.pselect(L_SUPPKEY!=s),t,null) ) |
10 |
=A9.switch@i(L_SUPPKEY,A4) |
11 |
=A10.groups@u(L_SUPPKEY.S_NAME:s_name;count(1):numwait) |
12 |
=A11.top(100;-numwait,s_name) |
13 |
=interval@ms(A1,now()) |
A6 associates the sub-table lineitem with the primary table orders, filters the association result and selects the ORDERKEY values satisfying the L_SUPPKEY condition.
When multiple records of lineitem correspond to one record of orders, news@r will not copy the record of orders.
Since opening a file takes time, the opened file here is assigned to the variable lineitemFile to avoid reopening it in A7.
A7 associates lineitem with A6 to select the desired fields. When multiple records of lineitem correspond to one record of orders, news will copy the record of orders.
A8 performs order-based grouping on A7. The grouped subsets are small sets composed of lineitem records with the same L_ORDERKEY value.
A9 performs calculations on these small sets, which is equivalent to implementing two exists condition judgments in SQL.
The exists subquery in SQL is equivalent to requiring that all L_SUPPKEY values in a small set cannot be the same, while the not exists subquery requires that all L_SUPPKEY values in a subset after filtering the small set by the condition L_RECEIPTDATE > L_COMMITDATE must be the same.
To determine if all L_SUPPKEY values in a set are the same, just determine whether the L_SUPPKEY value of set member is the same as that of a specified member. The s in A9 is the specified member.
The rest of the code associates with other foreign key tables and performs regular grouping operation.
Test result:
Test items |
Execution time (seconds) |
General method |
21 |
3. Data conversion
For this query, we need to use two optimization methods mentioned in previous articles: dimension table primary key sequence-numberization and date-integer conversion. Copy supplier_2.ctx from Q2 and lineitem_3.ctx from Q12 to the main directory of this query.
Based on orders.ctx in Q3, convert the enumeration string field O_ORDERSTATUS to numbers.
Code for data conversion:
A |
|
1 |
=file("orders.ctx").open() |
2 |
=A1.cursor(O_ORDERSTATUS) |
3 |
=A2.id(O_ORDERSTATUS).sort() |
4 |
=file("o_orderstatus.btx").export@b(A3) |
5 |
=A1.cursor().run(O_ORDERSTATUS=A3.pos@b(O_ORDERSTATUS)) |
6 |
=file("orders_21_3.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT) |
7 |
>A6.append(A5) |
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
>name="CHINA" |
|
3 |
=file("o_orderstatus.btx").import@b().(_1).(~=="F") |
|
4 |
=file("nation_2.btx").import@b().(if(N_NAME==name,N_NATIONKEY,null)) |
|
5 |
=file("supplier_2.ctx").open() |
=A5.cursor@m().skip().(null) |
6 |
=A5.cursor@m(S_SUPPKEY,S_NAME;A4(S_NATIONKEY)).fetch().(B5(S_SUPPKEY)=S_NAME) |
|
7 |
=file("orders_21_3.ctx").open().cursor@m(O_ORDERKEY;A3(O_ORDERSTATUS)) |
|
8 |
=(lineitemFile=file("lineitem_3.ctx").open()).news@r(A7,L_ORDERKEY:ORDERKEY;B5(L_SUPPKEY)) |
|
9 |
=lineitemFile.news(A8:ORDERKEY,L_ORDERKEY,L_SUPPKEY,L_COMMITDATE,L_RECEIPTDATE) |
|
10 |
=A9.group(L_ORDERKEY) |
|
11 |
=A10.conj(if((t=~.select(L_RECEIPTDATE>L_COMMITDATE)) && (s=t.#2) && !t.pselect(#2!=s) && ~.pselect(#2!=s),t.select(B5(L_SUPPKEY)),null) ) |
|
12 |
=A11.groups@u(L_SUPPKEY:s_name;count(1):numwait).run@m(s_name=B5(s_name)) |
|
13 |
=A12.top(100;-numwait,s_name) |
|
14 |
=interval@ms(A1,now()) |
A3, A4 and B5 are aligned sequences.
Test result:
Test items |
Execution time (seconds) |
General method |
21 |
Data conversion |
16 |
4. Column-wise computing
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
>name="CHINA" |
|
3 |
=file("o_orderstatus.btx").import@b().(_1).(~=="F") |
|
4 |
=file("nation_2.btx").import@b().(if(N_NAME==name,N_NATIONKEY,null)) |
|
5 |
=file("supplier_2.ctx").open() |
=A5.cursor@m().skip().(null) |
6 |
=A5.cursor@m(S_SUPPKEY,S_NAME;A4(S_NATIONKEY)).fetch().(B5(S_SUPPKEY)=S_NAME) |
|
7 |
=file("orders_21_3.ctx").open().cursor@mv(O_ORDERKEY;A3(O_ORDERSTATUS)) |
|
8 |
=(lineitemFile=file("lineitem_3.ctx").open()).news@r(A7,L_ORDERKEY:ORDERKEY,min(L_SUPPKEY):SK,count(1):cnt; B5(L_SUPPKEY) && L_RECEIPTDATE>L_COMMITDATE) |
|
9 |
=lineitemFile.news@r(A8:ORDERKEY,L_ORDERKEY,SK,cnt,count(L_RECEIPTDATE>L_COMMITDATE && SK!=L_SUPPKEY):c1,count(SK!=L_SUPPKEY):c2).select@v(c1==0 && c2!=0) |
|
10 |
=A9.groups@u(SK;sum(cnt):numwait).new(B5(SK):s_name,numwait) |
|
11 |
=A10.top(100;-numwait,s_name) |
|
12 |
=interval@ms(A1,now()) |
Since column-wise computing cannot use temporary variables, A8 defines the
computed columns SK and cnt.
A9 compares the L_SUPPKEY values of the records in the grouped subset with SK and directly summarize the computed columns c1 and c2. When multiple records of lineitem correspond to one record of orders, news@r will directly aggregate the lineitem records without copying the record of orders. Then, utilize the select function to filter the conditions of c1 and c2.
Test result:
Test items |
Execution time (seconds) |
General method |
21 |
Data conversion |
16 |
Column-wise computing |
8 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version