Performance Optimization Exercises Using TPC-H – Q20
Ⅰ SQL code and analysis
Below is the SQL query statement:
select
s_name,s_address
from
supplier,nation
where
s_suppkey in (
select
ps_suppkey
from
partsupp
where
ps_partkey in (
select
p_partkey
from
part
where
p_name like 'bisque%'
)
and ps_availqty > (
select
0.5 * sum(l_quantity)
from
lineitem
where
l_partkey = ps_partkey
and l_suppkey = ps_suppkey
and l_shipdate >= date '1995-01-01'
and l_shipdate < date '1995-01-01' + interval '1' year
)
)
and s_nationkey = n_nationkey
and n_name = 'CHINA'
order by
s_name;
The main query is simple, but the nested filtering condition is complex.
Ⅱ SPL solution
The query appears complex, but it becomes clear when divided into multiple steps:
1. Filter nation table according to the condition;
2. Filter part table according to the condition;
3. Take result of step 1 as the foreign key table to match and filter supplier table;
4. Take result of steps 2 and 3 as foreign key tables to match and filter partsupp table;
5. Take result of step 4 as the foreign key table and associate it with lineitem table, select PS_AVAILQTY field, and perform grouping and aggregation by L_PARTKEY field and L_SUPPKEY field. As mentioned previously, a subquery containing an association with the primary table according to the equivalence condition can be always rewritten as a statement of grouping operation and association with the primary table. Based on the grouping & aggregation result set, we then select the eligible L_SUPPKEY values.
6. Take result of step 5 as the foreign key table to match and filter supplier table.
The whole process is generating intermediate foreign key tables and use them to match and filter other tables.
A |
|
1 |
=now() |
2 |
1995-1-1 |
3 |
=elapse@y(A2,1) |
4 |
>partname="bisque" |
5 |
>nationname="CHINA" |
6 |
=file("nation.btx").import@b().select@1(N_NAME==nationname).N_NATIONKEY |
7 |
=file("part.ctx").open().cursor@m(P_PARTKEY;pos@h(P_NAME,partname)).fetch().keys@im(P_PARTKEY) |
8 |
=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NAME,S_ADDRESS;S_NATIONKEY==A6).fetch().keys@im(S_SUPPKEY) |
9 |
=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;A7.find(PS_PARTKEY),A8.find(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY) |
10 |
=file("lineitem.ctx").open().cursor@m(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A2 && L_SHIPDATE<A3) |
11 |
=A10.join@i(L_PARTKEY:L_SUPPKEY,A9:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY) |
12 |
=A11.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity) |
13 |
=A12.select(PS_AVAILQTY*2>quantity).id(L_SUPPKEY) |
14 |
=A8.join@im(S_SUPPKEY,A13:~) |
15 |
=A14.new(S_NAME,S_ADDRESS).sort@0(S_NAME) |
16 |
=interval@ms(A1,now()) |
A14 uses join@im, where @i option deletes non-matching records and @m option enables using order-based merge to speed up computation when both S_SUPPKEY and A13 are ordered.
Ⅲ Further optimization
1. Optimization method
In this example, we will use the date-integer conversion optimization method explained in Q1, where linetime table’s L_SHIPDATE has been converted, and dimension table primary key numberization optimization method explained in Q2 – part table’s P_PARTKEY field, supplier table’s S_SUPPKEY field, partsupp table’s PS_PARTKEY field and PS_SUPPKEY field, and lineitem table’s L_PARTKEY field and L_SUPPKEY field have all been converted in the previous examples.
2. Code for data conversion
Copy nation_11.btx, part_19.ctx, supplier_16.ctx, partsupp_16.ctx and lineitem_19.ctx, and rename them nation_20.btx, part_20.ctx, supplier_20.ctx, partsupp_20.ctx and lineitem_20.ctx respectively.
3. Code after data conversion
First, we need to preload the dimension tables. Below is preloading code:
A |
|
1 |
>env(nation, file("nation_20.btx").import@b()) |
2 |
>env(part, file("part_20.ctx").open().import()) |
3 |
>env(supplier, file("supplier_20.ctx").open().import()) |
Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
1995-1-1 |
3 |
=days@o(elapse@y(A2,1)) |
4 |
=days@o(A2) |
5 |
>partname="bisque" |
6 |
>nationname="CHINA" |
7 |
=nation.select@1(N_NAME==nationname).N_NATIONKEY |
8 |
=part.@m(pos@h(P_NAME,partname)) |
9 |
=supplier.@m(S_NATIONKEY==A7) |
10 |
=file("partsupp_20.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;A8(PS_PARTKEY) && A9(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY) |
11 |
=file("lineitem_20.ctx").open().cursor@m(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A4 && L_SHIPDATE<A3) |
12 |
=A11.join@i(L_PARTKEY:L_SUPPKEY,A10:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY) |
13 |
=A12.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity) |
14 |
=A13.select(PS_AVAILQTY*2>quantity).run(A9(L_SUPPKEY)=null) |
15 |
=supplier(A9.pselect@a(~==null)) |
16 |
=A15.new(S_NAME,S_ADDRESS).sort@0(S_NAME) |
17 |
=interval@ms(A1,now()) |
A14 first select eligible records and then sets values of A9’s members corresponding to each record’s L_SUPPKEY as null. A15 first gets row numbers of all A9’s members having null values and then selects corresponding records from supplier table according to these row numbers.
Ⅳ Using enterprise edition’s column-wise computation
1. Original data
A |
|
1 |
=now() |
2 |
1995-1-1 |
3 |
=elapse@y(A2,1) |
4 |
>partname="bisque" |
5 |
>nationname="CHINA" |
6 |
=file("nation.btx").import@b().select@1(N_NAME==nationname).N_NATIONKEY |
7 |
=file("part.ctx").open().cursor@mv(P_PARTKEY;pos@h(P_NAME,partname)).fetch().keys@im(P_PARTKEY) |
8 |
=file("supplier.ctx").open().cursor@mv(S_SUPPKEY,S_NAME,S_ADDRESS;S_NATIONKEY==A6).fetch().keys@im(S_SUPPKEY) |
9 |
=file("partsupp.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;A7.find(PS_PARTKEY),A8.find(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY) |
10 |
=file("lineitem.ctx").open().cursor@mv(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A2 && L_SHIPDATE<A3) |
11 |
=A10.join@i(L_PARTKEY:L_SUPPKEY,A9:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY) |
12 |
=A11.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity) |
13 |
=A12.select@mv(PS_AVAILQTY*2>quantity).id(L_SUPPKEY) |
14 |
=A8.join@im(S_SUPPKEY,A13:~) |
15 |
=A14.new@m(S_NAME,S_ADDRESS).sort@o(S_NAME) |
16 |
=interval@ms(A1,now()) |
2. Optimized data
First, we need to preload the dimension tables. Below is preloading code:
A |
|
1 |
>env(nation, file("nation_20.btx").import@bv()) |
2 |
>env(part, file("part_20.ctx").open().import@v()) |
3 |
>env(supplier, file("supplier_20.ctx").open().import@v()) |
Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
1995-1-1 |
3 |
=days@o(elapse@y(A2,1)) |
4 |
=days@o(A2) |
5 |
>partname="bisque" |
6 |
>nationname="CHINA" |
7 |
=nation.select@1(N_NAME==nationname).N_NATIONKEY |
8 |
=part.(pos@h(P_NAME,partname)) |
9 |
=supplier.(S_NATIONKEY==A7) |
10 |
=file("partsupp_20.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;A8(PS_PARTKEY) && A9(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY) |
11 |
=file("lineitem_20.ctx").open().cursor@mv(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A4 && L_SHIPDATE<A3) |
12 |
=A11.join@i(L_PARTKEY:L_SUPPKEY,A10:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY) |
13 |
=A12.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity) |
14 |
=A13.select@mv(PS_AVAILQTY*2>quantity).run@m(A9(L_SUPPKEY)=null) |
15 |
=supplier(A9.pselect@a(~==null)) |
16 |
=A15.new(S_NAME,S_ADDRESS).sort@0(S_NAME) |
17 |
=interval@ms(A1,now()) |
Ⅴ Test result
Unit: Second
Regular |
Column-wise |
|
Before optimization |
10.6 |
5.9 |
After optimization |
8.8 |
3.9 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL