Performance optimization case course: TPCH-Q12
select
l_shipmode,
sum(case
when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' then 1
else 0 end) as high_line_count,
sum(case
when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' then 1
else 0 end) as low_line_count
from
orders,
lineitem
where
o_orderkey = l_orderkey
and l_shipmode in ('TRUCK', 'MAIL')
and l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_receiptdate >= date '1995-01-01'
and l_receiptdate < date '1995-01-01' + interval '1' year
group by
l_shipmode
order by
l_shipmode;
This is a grouping & aggregation operation on the filtered result set of association between primary table orders and sub table lineitem.
1. Data storage
Store the big tables orders and lineitem in order by the primary key orderkey, which makes it possible to merge them in order when joining.
Continue to use orders.ctx and lineitem.ctx from Q3.
Copy these tables to the main directory of this query.
2. General method
A |
|
1 |
=now() |
2 |
1995-1-1 |
3 |
=elapse@y(A2, 1) |
4 |
=["MAIL", "TRUCK"] |
5 |
=file("lineitem.ctx").open().cursor@m(L_ORDERKEY,L_SHIPMODE;L_RECEIPTDATE>=A2 && L_RECEIPTDATE<A3 && A4.contain(L_SHIPMODE) && L_COMMITDATE<L_RECEIPTDATE && L_SHIPDATE<L_COMMITDATE) |
6 |
=file("orders.ctx").open().new@r(A5,L_SHIPMODE,O_ORDERPRIORITY) |
7 |
=A6.run(O_ORDERPRIORITY=if(O_ORDERPRIORITY=="1-URGENT" || O_ORDERPRIORITY=="2-HIGH",1,0)) |
8 |
=A7.groups(L_SHIPMODE;sum(O_ORDERPRIORITY):high_line_count, sum(1-O_ORDERPRIORITY):low_line_count) |
9 |
=interval@ms(A1,now()) |
Here we use the optimization methods mentioned in previous articles, such as pre-cursor filtering, multi-thread parallel computing.
The conditions for high_line_count and low_line_count are opposite.If summation operation is not involved, the latter is 0 when the former is 1, and the latter is 1 when the former is 0. Therefore,A7 and A8 use O_ORDERPRIORITY and 1-O_ORDERPRIORITY to calculate these two values beforesummation.
Test result:
Test items |
Execution time (seconds) |
General method |
6 |
3. Data conversion
Utilize the conversion methods mentioned in previous articles: convert date to integer, convert enumeration string field to number.
We can directly use orders_4.ctx converted in Q4.
Copy these tables to the main directory of this query.
For lineitem table, we can convert the enumeration string field l_shipmod to number based on Q3.
Code for data conversion:
A |
|
1 |
=file("lineitem.ctx").open() |
2 |
=A1.cursor(L_SHIPMODE) |
3 |
=A2.id(L_SHIPMODE).sort() |
4 |
=file("l_shipmode.btx").export@b(A3) |
5 |
=A1.cursor() |
6 |
=A5.run(L_SHIPMODE=A3.pos@b(L_SHIPMODE),L_RECEIPTDATE=days@o(L_RECEIPTDATE),L_COMMITDATE=days@o(L_COMMITDATE),L_SHIPDATE=days@o(L_SHIPDATE)) |
7 |
=file("lineitem_3.ctx").create@py(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT) |
8 |
>A7.append(A6) |
Calculation code:
A |
|
1 |
=now() |
2 |
>l_shipmode=file("l_shipmode.btx").import@b().(_1) |
3 |
>o_orderpriority=file("o_orderpriority.btx").import@b().(_1) |
4 |
1995-1-1 |
5 |
=days@o(elapse@y(A4, 1)) |
6 |
=days@o(A4) |
7 |
=l_shipmode.(["MAIL", "TRUCK"].contain(~)) |
8 |
=o_orderpriority.(["1-URGENT","2-HIGH"].contain(~)) |
9 |
=file("lineitem_3.ctx").open().cursor@m(L_ORDERKEY,L_SHIPMODE;L_RECEIPTDATE>=A6 && L_RECEIPTDATE<A5 && A7(L_SHIPMODE) && L_COMMITDATE<L_RECEIPTDATE && L_SHIPDATE<L_COMMITDATE) |
10 |
=file("orders_4.ctx").open().new@r(A9,L_SHIPMODE,O_ORDERPRIORITY) |
11 |
=A10.run(O_ORDERPRIORITY=A8(O_ORDERPRIORITY)) |
12 |
=A11.groups(L_SHIPMODE;count(O_ORDERPRIORITY):high_line_count, count(!O_ORDERPRIORITY):low_line_count) |
13 |
=A12.run(L_SHIPMODE=l_shipmode(L_SHIPMODE)) |
14 |
=interval@ms(A1,now()) |
A7 and A8 are aligned sequence.
In A11, A8 (O_ORDERPRIORITY) returns true or false, corresponding to the original 1 and 0. The result of count in A12 is the same as that of the original sum, which eliminates if judgement and improves performance.
Test result:
Test items |
Execution time (seconds) |
General method |
6 |
Data conversion |
4 |
4. Column-wise computing
A |
|
1 |
=now() |
2 |
>l_shipmode=file("l_shipmode.btx").import@b().(_1) |
3 |
>o_orderpriority=file("o_orderpriority.btx").import@b().(_1) |
4 |
1995-1-1 |
5 |
=days@o(elapse@y(A4, 1)) |
6 |
=days@o(A4) |
7 |
=l_shipmode.(["MAIL", "TRUCK"].contain(~)) |
8 |
=o_orderpriority.(["1-URGENT","2-HIGH"].contain(~)) |
9 |
=file("lineitem_3.ctx").open().cursor@mv(L_ORDERKEY,L_SHIPMODE;L_RECEIPTDATE>=A6 && L_RECEIPTDATE<A5 && A7(L_SHIPMODE) && L_COMMITDATE<L_RECEIPTDATE && L_SHIPDATE<L_COMMITDATE) |
10 |
=file("orders_4.ctx").open().new@r(A9,L_SHIPMODE,O_ORDERPRIORITY) |
11 |
=A10.derive@o(A8(O_ORDERPRIORITY):flag) |
12 |
=A11.groups(L_SHIPMODE;count(flag):high_line_count,count(!flag):low_line_count) |
13 |
=A12.new(l_shipmode(L_SHIPMODE):L_SHIPMODE,high_line_count,low_line_count) |
14 |
=interval@ms(A1,now()) |
Since column-wise computing cannot assign value to a field, a new field flag is defined in A11.
Test result:
Test items |
Execution time (seconds) |
General method |
6 |
Data conversion |
4 |
Column-wise computing |
2 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version