Performance optimization case course: TPCH-Q8
select
o_year,
sum(case
when nation = 'CHINA' then volume
else 0
end) / sum(volume) as mkt_share
from
(
Select
extract(year from o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) as volume,
n2.n_name as nation
from
part,
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2,
region
where
p_partkey = l_partkey
and s_suppkey = l_suppkey
and l_orderkey = o_orderkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = 'ASIA'
and s_nationkey = n2.n_nationkey
and o_orderdate between date '1995-01-01' and date '1996-12-31'
and p_type = 'STANDARD POLISHED BRASS'
) all_nations
group by
o_year
order by
o_year;
This is a grouping and aggregation operation on the filtered result set of multi-table association, which involves a primary-sub table association between orders table and lineitem table.
1. Data storage
Store the big tables orders and lineitem in order by the primary key orderkey, which makes it possible to merge them in order when joining. For other tables, there is no special sorting requirement, they are all stored in order by primary key.
So, we can continue to use orders.ctx, lineitem.ctx and customer.ctx from Q3.
For tables part and supplier, we can directly use part.ctx and supplier.ctx from Q2.
For tables region and nation, we can directly use region.btx and nation.btx from Q2.
Copy these tables to the main directory of this query.
2. General method
A |
B |
|
1 |
=now() |
|
2 |
1995-01-01 |
1996-12-31 |
3 |
>nation="CHINA" |
>name="ASIA" |
4 |
>type="STANDARD POLISHED BRASS" |
|
5 |
=file("region.btx").import@b().select@1(R_NAME==name).R_REGIONKEY |
|
6 |
=file("nation.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY).keys@i(N_NATIONKEY) |
|
7 |
=A6.select@1(N_NAME==nation).N_NATIONKEY |
|
8 |
=A6.select(N_REGIONKEY==A5).derive@o().keys(N_NATIONKEY) |
|
9 |
=file("supplier.ctx").open().cursor(S_SUPPKEY,S_NATIONKEY;S_NATIONKEY:A6).fetch().keys@im(S_SUPPKEY) |
|
10 |
=file("part.ctx").open().cursor@m(P_PARTKEY;P_TYPE==type).fetch().keys@im(P_PARTKEY) |
|
11 |
=file("customer.ctx").open().cursor@m(C_CUSTKEY;A8.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY) |
|
12 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE <=B2 && A11.find(O_CUSTKEY)) |
|
13 |
=file("lineitem.ctx").open().news(A12,L_ORDERKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE;A10.find(L_PARTKEY),L_SUPPKEY:A9) |
|
14 |
=A13.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT)) |
|
15 |
=A14.groups(year(O_ORDERDATE):o_year;sum(if(L_SUPPKEY.S_NATIONKEY.N_NAME==nation,L_EXTENDEDPRICE,0)):s1,sum(L_EXTENDEDPRICE):s2) |
|
16 |
=A15.new(o_year,s1/s2:mkt_share) |
|
17 |
=interval@ms(A1,now()) |
Here we use the techniques mentioned in the previous articles: filter at cursor creation, order-based primary-sub table association, and convert association field to foreign key table pointers.
However, L_SUPPKEY.S_NATIONKEY.N_NAME in A15 is not the filtering of dimension table that appears in JOIN or WHERE clause as mentioned in the previous article, but appears in the conditional sum expression. The implementation method here is that the supplier table is joined with the nation table in A9, the lineitem table is joined with the supplier table in A13, and the foreign key field, as a condition, appears in sum(if) in A15.
Test items |
Execution time (seconds) |
General method |
18 |
3. The filtering of dimension table in calculation formula
In the previous section, L_SUPPKEY.S_NATIONKEY.N_NAME in A15 is associated with many tables, and the essence of sum(case) is to performconditionalsummationon volume based on whether l_suppkey meets the criteria.
For this calculation requirement, if we filter the supplier first and then use the obtained s_suppkey set as the condition for l_suppkey to do a conditional summation, it can avoid supplier from participating in subsequent association calculation.
Code in SQL:
select
o_year,
sum(case
when l_suppkey in(
select s_suppkey
from supplier,nation
where
s_nationkey=n_nationkey
and n_name='CHINA'
)
then volume
else 0
end) / sum(volume) as mkt_share
from
(
Select
extract(year from o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) as volume,
l_suppkey
from
part,
lineitem,
orders,
customer,
nation n1,
region
where
p_partkey = l_partkey
and s_suppkey = l_suppkey
and l_orderkey = o_orderkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = 'ASIA'
and o_orderdate between date '1995-01-01' and date '1996-12-31'
and p_type = 'STANDARD POLISHED BRASS'
) all_nations
group by
o_year
order by
o_year;
After rewriting in SQL, the condition for sum(case) becomes an IN clause, where the subquery is a statement with JOIN and WHERE clauses. In SPL, we can further optimize by filtering dimension table. So, the supplier and its dimension table nation(n2) will not appear in the subsequent association calculation.
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-01-01 |
1996-12-31 |
3 |
>nation="CHINA" |
>name="ASIA" |
4 |
>type="STANDARD POLISHED BRASS" |
|
5 |
=file("region.btx").import@b().select@1(R_NAME==name).R_REGIONKEY |
|
6 |
=file("nation.btx").import@b() |
|
7 |
=A6.select@1(N_NAME==nation).N_NATIONKEY |
|
8 |
=A6.select(N_REGIONKEY==A5).derive@o().keys(N_NATIONKEY) |
|
9 |
=file("supplier.ctx").open().cursor@m(S_SUPPKEY;S_NATIONKEY==A7).fetch().keys@im(S_SUPPKEY) |
|
10 |
=file("part.ctx").open().cursor@m(P_PARTKEY;P_TYPE==type).fetch().keys@im(P_PARTKEY) |
|
11 |
=file("customer.ctx").open().cursor@m(C_CUSTKEY;A8.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY) |
|
12 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE <=B2 && A11.find(O_CUSTKEY)) |
|
13 |
=file("lineitem.ctx").open().news(A12,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE;A10.find(L_PARTKEY)) |
|
14 |
=A13.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT)) |
|
15 |
=A14.groups(year(O_ORDERDATE):o_year;sum(if(A9.find(L_SUPPKEY),L_EXTENDEDPRICE,0)):s1,sum(L_EXTENDEDPRICE):s2) |
|
16 |
=A15.new(o_year,s1/s2:mkt_share) |
|
17 |
=interval@ms(A1,now()) |
A15 adopts the calculation method mentioned above and uses L_SUPPKEY to search for the primary key S_SUPPKEY of table sequence in A9 and take it as the condition for summation. This way, the supplier in A9 doesn’t need to be joined with nation, and the lineitem in A13 doesn’t need to be associated with supplier.
Test result:
Test items |
Execution time (seconds) |
General method |
18 |
The filtering of dimension table in calculation formula |
16 |
4. Data conversion
Utilize the conversion methods mentioned in previous articles: convert primary key of dimension table to sequence number, convert date to integer.
For tables region, nation, supplier and part, we can directly use region_2.btx, nation_2.btx, supplier_2.ctx and part_2.ctx converted in Q2.
To convert the enumeration string field type in the part table to number, we can use p_type.btx from Q2.
For the customer table, we can use customer_3.ctx from Q5.
For the orders table, we can use orders_5.ctx from Q3.
Copy these tables to the main directory of this query.
Code for data conversion:
A |
|
1 |
=file("part.ctx").open().import(P_PARTKEY).keys@im(P_PARTKEY) |
2 |
=file("lineitem.ctx").open().cursor() |
3 |
=A2.run(L_PARTKEY=A1.pfind(L_PARTKEY)) |
4 |
=file("lineitem_4.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT) |
5 |
>A4.append(A3) |
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
=days@o(date("1995-01-01")) |
=days@o(date("1996-12-31")) |
3 |
>nationName="CHINA" |
>name="ASIA" |
4 |
=p_type=file("p_type.btx").import@b().(_1) |
|
5 |
=nation= file("nation_2.btx").import@b() |
|
6 |
>type=p_type.pos@b("STANDARD POLISHED BRASS") |
|
7 |
=file("region_2.btx").import@b().pselect(R_NAME==name) |
|
8 |
=nation.pselect(N_NAME==nationName) |
=nation.(N_REGIONKEY==A7) |
9 |
=file("supplier_2.ctx").open() |
=A9.cursor@m().skip().(false) |
10 |
=A9.cursor@m(S_SUPPKEY;S_NATIONKEY==A8).fetch().(B9(S_SUPPKEY)=true) |
|
11 |
=file("part_2.ctx").open() |
=A11.cursor@m().skip().(false) |
12 |
=A11.cursor@m(P_PARTKEY;P_TYPE==type).fetch().(B11(P_PARTKEY)=true) |
|
13 |
=file("customer_3.ctx").open() |
=A13.cursor@m().skip().(false) |
14 |
=A13.cursor@m(C_CUSTKEY;B8(C_NATIONKEY)).fetch().(B13(C_CUSTKEY)=true) |
|
15 |
=file("orders_5.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE <=B2 && B13(O_CUSTKEY)) |
|
16 |
=file("lineitem_4.ctx").open().news(A15,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE;B11(L_PARTKEY)) |
|
17 |
=A16.run(L_EXTENDEDPRICE*=(1-L_DISCOUNT)) |
|
18 |
=A17.groups(year(O_ORDERDATE):o_year;sum(if(B9(L_SUPPKEY),L_EXTENDEDPRICE,0)):s1,sum(L_EXTENDEDPRICE):s2) |
|
19 |
=A18.new(o_year,s1/s2:mkt_share) |
|
20 |
=interval@ms(A1,now()) |
B8, B9, B11 and B13 are aligned sequence.
A18 uses the aligned sequence B9 to determine if L_SUPPKEY meets the sum condition, which achieves better performance than the primary key search mentioned previously.
Test result:
Test items |
Execution time (seconds) |
General method |
18 |
The filtering of dimension table in calculation formula |
16 |
Data conversion |
10 |
5. Column-wise computing
A |
B |
|
1 |
=now() |
|
2 |
=days@o(date("1995-01-01")) |
=days@o(date("1996-12-31")) |
3 |
>nationName="CHINA" |
>name="ASIA" |
4 |
=p_type=file("p_type.btx").import@b().(_1) |
|
5 |
=nation= file("nation_2.btx").import@b() |
|
6 |
>type=p_type.pos@b("STANDARD POLISHED BRASS") |
|
7 |
=file("region_2.btx").import@b().pselect(R_NAME==name) |
|
8 |
=nation.pselect(N_NAME==nationName) |
=nation.(N_REGIONKEY==A7) |
9 |
=file("supplier_2.ctx").open() |
=A9.cursor@m().skip().(false) |
10 |
=A9.cursor@mv(S_SUPPKEY;S_NATIONKEY==A8).fetch().(B9(S_SUPPKEY)=true) |
|
11 |
=file("part_2.ctx").open() |
=A11.cursor@m().skip().(false) |
12 |
=A11.cursor@mv(P_PARTKEY;P_TYPE==type).fetch().(B11(P_PARTKEY)=true) |
|
13 |
=file("customer_3.ctx").open() |
=A13.cursor@m().skip().(false) |
14 |
=A13.cursor@mv(C_CUSTKEY;B8(C_NATIONKEY)).fetch().(B13(C_CUSTKEY)=true) |
|
15 |
=file("orders_5.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERDATE;O_ORDERDATE>=A2 && O_ORDERDATE<=B2 && B13(O_CUSTKEY)) |
|
=file("lineitem_4.ctx").open().cursor@v(L_ORDERKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT;B11(L_PARTKEY);A15) |
||
16 |
=A16.joinx@im(L_ORDERKEY,A15:O_ORDERKEY,O_ORDERDATE) |
|
17 |
=A17.derive@o(L_EXTENDEDPRICE*(1-L_DISCOUNT):dp) |
|
18 |
=A18.groups(year(O_ORDERDATE):o_year;sum(if(B9(L_SUPPKEY),dp,0)):s1,sum(dp):s2) |
|
19 |
=A19.new(o_year,s1/s2:mkt_share) |
|
20 |
=interval@ms(A1,now()) |
|
21 |
=now() |
We do not use news() function to perform the primary-sub table association between orders and lineitem. Instead, we use joinx@im to perform an order-based merge join after cursors are synchronously segmented. Of course, the news() function is applicable, but joinx() function is slightly faster in our testing.
A16 uses derive@o to add a new column dp, so there is no need to compute dp again in A18. For column-wise computing, try to avoid assignment operation. Therefore, instead of using run to assign value to the original column, A16 uses derive to define a new column.
Test result:
Test items |
Execution time (seconds) |
General method |
18 |
The filtering of dimension table in calculation formula |
16 |
Data conversion |
10 |
Column-wise computing |
4 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version