Performance optimization case course: TPCH-Q4
select
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= date '1995-10-01'
and o_orderdate < date '1995-10-01' + interval '3' month
and exists (
select * from lineitem
where l_orderkey = o_orderkey and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority;
This SQL statement has an exists-subquery, which is used to find out records in the sub table lineitem that satisfy the condition l_commitdate < l_receiptdate. The exists clause can be rewritten as join. The rewritten code is:
select
o_orderpriority,
count(*) as order_count
from (
select
o_orderkey,o_orderpriority
from orders join (
select l_orderkey
from lineitem
where l_commitdate < l_receiptdate
group by l_orderkey
) on l_orderkey = o_orderkey
where
o_orderdate >= date '1995-10-01'
and o_orderdate < date '1995-10-01' + interval '3' month
)
group by
o_orderpriority
order by
o_orderpriority;
It can be seen that the rewritten code is to filter, group, and aggregate the sub-table first, and then perform a one-to-one join with primary table.
1. Data storage
Store the big tables orders and lineitem in order by the primary key orderkey. Since the sub-table lineitem remains ordered to orderkey after being grouped by orderkey and aggregate, we can merge it with the primary table orders in order.
In this way, we can directly use orders.ctx and lineitem.ctx from Q3.
Copy these tables to the main directory of this query.
2. Group and aggregate before joining
Calculation idea: group the sub-table lineitem in order first, and then merge it with the primary table orders in order.
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-10-01 |
3 |
3 |
=elapse@m(A2,B2) |
|
4 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2 && O_ORDERDATE<A3) |
|
5 |
=file("lineitem.ctx").open().cursor(L_ORDERKEY;L_COMMITDATE<L_RECEIPTDATE;A4) |
|
6 |
=A5.group(L_ORDERKEY) |
|
7 |
=A4.joinx@im(O_ORDERKEY,A6:L_ORDERKEY) |
|
8 |
=A7.groups(O_ORDERPRIORITY;count(1):order_count) |
|
9 |
=interval@ms(A1,now()) |
A4, A5: Use the pre-cursor filtering mechanism.
A6: Group lineitem in order by the primary key orderkey. The result remains ordered to orderkey.
A7: Merge the grouped lineitem with the primary table orders in order by primary key.
Test result:
Test items |
Execution time (seconds) |
Aggregate before merging |
13 |
3. Direct aggregation when joining
Use the news function mentioned in the previous article. When joining the primary table and the sub table, the sub table can be directly aggregated, which eliminates the process of copying primary table fields and other calculations.
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-10-01 |
3 |
3 |
=elapse@m(A2,B2) |
|
4 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=A2 && O_ORDERDATE<A3) |
|
5 |
=file("lineitem.ctx").open().news@r(A4,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE) |
|
6 |
=A5.groups(O_ORDERPRIORITY;count(1):order_count) |
|
7 |
=interval@ms(A1,now()) |
In A5, adding an open @r to the news function means that the join is based on the primary table orders. If @r is not added, the join is based on latter table by default. When the join is based on primary table, the sub table needs to be aggregated before joining.
Test result:
Test items |
Execution time (seconds) |
Aggregate before merging |
13 |
Direct aggregation when joining |
7 |
4. Data conversion
Utilize the conversion methods mentioned in previous articles: convert enumeration string field to numbers, convert date to integers.
Convert orders:
A |
|
1 |
=file("orders.ctx").open().cursor(O_ORDERPRIORITY) |
2 |
=A1.id(O_ORDERPRIORITY).sort() |
3 |
=file("o_orderpriority.btx").export@b(A2) |
4 |
=file("orders.ctx").open().cursor() |
5 |
=A4.run(O_ORDERPRIORITY=A2.pos@b(O_ORDERPRIORITY),O_ORDERDATE=days@o(O_ORDERDATE)) |
6 |
=file("orders_4.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT) |
7 |
>A6.append(A5) |
Convert lineitem:
A |
|
1 |
=file("lineitem.ctx").open().cursor() |
2 |
=A1.run(L_COMMITDATE=days@o(L_COMMITDATE),L_RECEIPTDATE=days@o(L_RECEIPTDATE)) |
3 |
=file("lineitem_4.ctx").create@py(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_SHIPDATE,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE, L_COMMENT) |
4 |
>A3.append(A2) |
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-10-01 |
3 |
3 |
>o_orderpriority=file("o_orderpriority.btx").import@b().(_1) |
|
4 |
=days@o(elapse@m(A2,B2)) |
=days@o(A2) |
5 |
=file("orders_4.ctx").open().cursor@m(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=B4 && O_ORDERDATE<A4) |
|
6 |
=file("lineitem_4.ctx").open().news@r(A5,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE) |
|
7 |
=A6.groups(O_ORDERPRIORITY;count(1):order_count) |
|
8 |
=A7.run(O_ORDERPRIORITY=o_orderpriority(O_ORDERPRIORITY)) |
|
9 |
=interval@ms(A1,now()) |
Test result:
Test items |
Execution time (seconds) |
Aggregate before merging |
13 |
Direct aggregation when joining |
7 |
Data conversion |
6 |
5. Column-wise computing
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-10-01 |
3 |
3 |
>o_orderpriority=file("o_orderpriority.btx").import@b().(_1) |
|
4 |
=days@o(elapse@m(A2,B2)) |
=days@o(A2) |
5 |
=file("orders_4.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERPRIORITY;O_ORDERDATE>=B4 && O_ORDERDATE<A4) |
|
6 |
=file("lineitem_4.ctx").open().news@r(A5,O_ORDERPRIORITY;L_COMMITDATE<L_RECEIPTDATE) |
|
7 |
=A6.groups(O_ORDERPRIORITY;count(1):order_count) |
|
8 |
=A7.new(o_orderpriority(O_ORDERPRIORITY):O_ORDERPRIORITY,order_count) |
|
9 |
=interval@ms(A1,now()) |
A10: To convert the integerized O_ORDERPRIORITY back to string, the common method is to use the run function to reset the value of O_ORDERPRIORITY. However, since the data type of column cannot be changed in column-wise computing, the new function is used here.
Test result:
Test items |
Execution time (seconds) |
Aggregate before merging |
13 |
Direct aggregation when joining |
7 |
Data conversion |
6 |
Column-wise computing |
2 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version