Performance optimization case course: TPCH-Q19
select
sum(l_extendedprice * (1 - l_discount)) as revenue
from
lineitem,part
where
(
p_partkey = l_partkey
and p_brand = 'Brand#32'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 7 and l_quantity <= 7 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 18 and l_quantity <= 18 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#45'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 22 and l_quantity <= 22 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
);
This is a filtering and aggregate query based on two-table association. It has simple structure but complicated filtering condition.
1. Data storage
There is no special requirement for tables lineitem and part, store them in order by primary key.
Continue to use lineitem.ctx from Q3 and part.ctx from Q2. Copy them to the main directory of this query.
2. General method
In the conditional expression, many items are only related to the smaller dimension table part, and the complexity to calculate these items is relatively high (as it involves IN operation). If we first compute these items on part table, there will be much less compute-intensive than performing computations on result set of the two-table association. This is because the association result set, which is as big as the lineitem table, is much larger.
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
>brand1="Brand#32" |
>quantity1=7 |
3 |
>brand2="Brand#23" |
>quantity2=18 |
4 |
>brand3="Brand#45" |
>quantity3=22 |
5 |
=["SM CASE", "SM BOX", "SM PACK", "SM PKG"] |
|
6 |
=["MED BAG", "MED BOX", "MED PKG", "MED PACK"] |
|
7 |
=["LG CASE", "LG BOX", "LG PACK", "LG PKG"] |
|
8 |
=[A5,A6,A7].conj().sort() |
=["AIR","AIR REG"] |
9 |
=file("part.ctx").open().cursor@m(P_PARTKEY,P_BRAND,P_SIZE,P_CONTAINER,0:FLAG;P_SIZE>=1 && P_SIZE<=15 && [brand1,brand2,brand3].contain(P_BRAND) && A8.contain(P_CONTAINER)) |
|
10 |
=A9.run(FLAG=if(P_BRAND==brand1 && A5.contain(P_CONTAINER) && P_SIZE<=5:1, P_BRAND==brand2 && A6.contain(P_CONTAINER) && P_SIZE<=10:2, P_BRAND==brand3 && A7.contain(P_CONTAINER) && P_SIZE<=15:3; 0)).select(FLAG>0).fetch().derive@o().keys@im(P_PARTKEY) |
|
11 |
=file("lineitem.ctx").open().cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;L_PARTKEY:A10,L_SHIPINSTRUCT=="DELIVER IN PERSON" && B8.contain(L_SHIPMODE) && L_QUANTITY>= quantity1 && L_QUANTITY<=quantity3+10) |
|
12 |
=A11.groups(;sum(if(case(L_PARTKEY.FLAG,1:L_QUANTITY<=quantity1+10,2:L_QUANTITY>=quantity2 && L_QUANTITY<=quantity2+10;L_QUANTITY>=quantity3),L_EXTENDEDPRICE*(1-L_DISCOUNT),0))) |
|
13 |
=interval@ms(A1,now()) |
A9 combines the subsequent conditions and write them into the pre-cursor filtering condition of part table to reduce the amount of data read.
A10 adds a FLAG field to part table for computing the complex conditional items related to part table. When a record meets item 1, item 2 and item 3 in the SQL OR condition, assign corresponding values 1, 2, 3 to FLAG; and assign 0 to the field when neither item is met. Then we filter records according to whether or not FLAG value is greater than 0.
A12 selects the quantity value according to FLAG value to preform filtering on L_QUANTITY. The number of calculationsforthe FLAG-related complex formula is the same as the number of rows in part table.
Test result:
Test items |
Execution time (seconds) |
General method |
5 |
3. Data conversion
For this query, we need to use the dimension table primary key sequence-numberization method mentioned in previous articles. The lineitem_4.ctx in Q8 can be used directly. Copy it to the main directory of this query.
In addition, we also need to use the method of converting enumeration string to numbers. Since the P_BRAND and P_CONTAINER in part table have been converted in Q16 and Q17 respectively, copy them to the main directory of this query.
And, we also need to convert the enumeration string fields L_SHIPMODE and L_SHIPINSTRUCT based on lineitem_4.
Code for data conversion:
A |
|
1 |
=file("lineitem_4.ctx").open().cursor(L_SHIPINSTRUCT,L_SHIPMODE) |
2 |
=A1.id(L_SHIPINSTRUCT,L_SHIPMODE) |
3 |
=A2.(~.sort()) |
4 |
=file("l_shipinstruct.btx").export@b(A3(1)) |
5 |
=file("lineitem_4.ctx").open().cursor() |
6 |
=A5.run(L_SHIPINSTRUCT=A3(1).pos@b(L_SHIPINSTRUCT),L_SHIPMODE=A3(2).pos@b(L_SHIPMODE)) |
7 |
=file("lineitem_19_3.ctx").create@py(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT) |
8 |
>A7.append(A6) |
Calculation code after data conversion:
A |
B |
|
1 |
=now() |
|
2 |
=file("p_brand.btx").import@b().(_1) |
|
3 |
=file("p_container.btx").import@b().(_1) |
|
4 |
=file("l_shipinstruct.btx").import@b().(_1) |
|
5 |
=file("l_shipmode.btx").import@b().(_1) |
|
6 |
>brand1=A2.pos@b("Brand#32") |
>quantity1=7 |
7 |
>brand2=A2.pos@b("Brand#23") |
>quantity2=18 |
8 |
>brand3=A2.pos@b("Brand#45") |
>quantity3=22 |
9 |
=A3.(["SM CASE", "SM BOX", "SM PACK", "SM PKG"].contain(~)) |
|
10 |
=A3.(["MED BAG", "MED BOX", "MED PKG", "MED PACK"].contain(~)) |
|
11 |
=A3.(["LG CASE", "LG BOX", "LG PACK", "LG PKG"].contain(~)) |
|
12 |
=A9.(~||A10(#)||A11(#)) |
|
13 |
=A5.(["AIR","AIR REG"].contain(~)) |
=A4.pos@b("DELIVER IN PERSON") |
14 |
=file("part_17_3.ctx").open() |
=A14.cursor@m().skip().(null) |
15 |
=A14.cursor@m(P_PARTKEY,P_SIZE,P_BRAND,P_CONTAINER;P_SIZE>=1 && P_SIZE<=15 && [brand1,brand2,brand3].contain(P_BRAND) && A12(P_CONTAINER)) |
|
16 |
=A15.fetch().(B14(P_PARTKEY)=if(P_BRAND==brand1 && A9(P_CONTAINER) && P_SIZE>=1 && P_SIZE<=5:1, P_BRAND==brand2 && A10(P_CONTAINER) && P_SIZE<=10:2, P_BRAND==brand3 && A11(P_CONTAINER) && P_SIZE<=15:3; null)) |
|
17 |
=file("lineitem_19_3.ctx").open().cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;L_PARTKEY:B14:#,L_SHIPINSTRUCT==B13 && A13(L_SHIPMODE) && L_QUANTITY>= quantity1 && L_QUANTITY<=quantity3+10) |
|
18 |
=A17.groups(;sum(if(case(L_PARTKEY,1:L_QUANTITY<=quantity1+10,2:L_QUANTITY>=quantity2 && L_QUANTITY<=quantity2+10,3:L_QUANTITY>=quantity3),L_EXTENDEDPRICE*(1-L_DISCOUNT),0))) |
|
19 |
=interval@ms(A1,now()) |
A9, A10, A11, A13, and B14 are all alignment sequence.
Test result:
Test items |
Execution time (seconds) |
General method |
5 |
Data conversion |
4 |
4. Column-wise computing
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
=file("p_brand.btx").import@b().(_1) |
|
3 |
=file("p_container.btx").import@b().(_1) |
|
4 |
=file("l_shipinstruct.btx").import@b().(_1) |
|
5 |
=file("l_shipmode.btx").import@b().(_1) |
|
6 |
>brand1=A2.pos@b("Brand#32") |
>quantity1=7 |
7 |
>brand2=A2.pos@b("Brand#23") |
>quantity2=18 |
8 |
>brand3=A2.pos@b("Brand#45") |
>quantity3=22 |
9 |
=A3.(["SM CASE", "SM BOX", "SM PACK", "SM PKG"].contain(~)) |
|
10 |
=A3.(["MED BAG", "MED BOX", "MED PKG", "MED PACK"].contain(~)) |
|
11 |
=A3.(["LG CASE", "LG BOX", "LG PACK", "LG PKG"].contain(~)) |
|
12 |
=A9.(~||A10(#)||A11(#)) |
|
13 |
=A5.(["AIR","AIR REG"].contain(~)) |
=A4.pos@b("DELIVER IN PERSON") |
14 |
=file("part_17_3.ctx").open() |
=A14.cursor@m().skip().(null) |
15 |
=A14.cursor@m(P_PARTKEY,P_SIZE,P_BRAND,P_CONTAINER;P_SIZE>=1 && P_SIZE<=15 && [brand1,brand2,brand3].contain(P_BRAND) && A12(P_CONTAINER)) |
|
16 |
=A15.fetch().(B14(P_PARTKEY)=if(P_BRAND==brand1 && A9(P_CONTAINER) && P_SIZE<=5:1, P_BRAND==brand2 && A10(P_CONTAINER) && P_SIZE<=10:2, P_BRAND==brand3 && A11(P_CONTAINER):3; null)) |
|
17 |
=file("lineitem_19_3.ctx").open().cursor@mv(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;B14(L_PARTKEY),L_SHIPINSTRUCT==B13 && A13(L_SHIPMODE) && L_QUANTITY>= quantity1 && L_QUANTITY<=quantity3+10) |
|
18 |
=A17.groups(;sum(if(case(B14(L_PARTKEY),1:L_QUANTITY<=quantity1+10,2:L_QUANTITY>=quantity2 && L_QUANTITY<=quantity2+10,3:L_QUANTITY>=quantity3),L_EXTENDEDPRICE*(1-L_DISCOUNT),0))) |
|
19 |
=interval@ms(A1,now()) |
In column-wise computing, since we cannot use switch to assign values to fields, A17 doesn’t use L_PARTKEY:B14:# but B14(L_PARTKEY) instead, which can only implement filtering but not assignment. A18needstouseB14 (L_PARTKEY) to calculate again.
Test result:
Test items |
Execution time (seconds) |
General method |
5 |
Data conversion |
4 |
Column-wise computing |
2 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version