Performance Optimization Exercises Using TPC-H – Q17
Ⅰ SQL code and analysis
Below is the SQL query statement:
select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,part
where
p_partkey = l_partkey
and p_brand = 'Brand#33'
and p_container = 'LG DRUM'
and l_quantity < (
select
0.2 * avg(l_quantity)
from
lineitem
where
l_partkey = p_partkey
);
Such an association with the primary table according to the equivalence condition in the subquery can be always converted to a JOIN operation, and the SQL statement can be rewritten as:
select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,part,
(select
l_partkey lp,
0.2*avg(l_quantity) lq
from
lineitem
group by
l_partkey
) lpq
where
p_partkey = l_partkey
and p_brand = 'Brand#33'
and p_container = 'LG DRUM'
and l_partkey = lp
and l_quantity < lq;
The subquery is a grouping & aggregation operation and the main query is an aggregate operation on the result set of multi-table association.
Ⅱ SPL solution
A |
|
1 |
=now() |
2 |
>brand="Brand#33" |
3 |
>container="LG DRUM" |
4 |
=file("part.ctx").open().cursor@m(P_PARTKEY;P_BRAND==brand && P_CONTAINER==container).fetch().keys@im(P_PARTKEY) |
5 |
=file("lineitem.ctx").open() |
6 |
=A5.cursor@m(L_PARTKEY,L_QUANTITY;A4.find(L_PARTKEY)) |
7 |
=A6.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(avg/=5).keys@im(L_PARTKEY) |
8 |
=A5.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:A7) |
9 |
=A8.total(sum(if(L_QUANTITY<L_PARTKEY.avg,L_EXTENDEDPRICE,0)))/7.0 |
10 |
=interval@ms(A1,now()) |
Ⅲ Further optimization
1. Optimization method
In this example, we will use the string-integer conversion method explained in Q1 – part table’s P_BRAND field is already converted in Q16 and its P_CONTAINER field is waiting to be converted, and the dimension table primary key numberization method explained in Q2 – part table’s P_PARTKEY field is already converted in Q16 and lineitem table’s L_PARTKEY field has been converted in the previous example.
2. Code for data conversion
2.1 Conversion on lineitem table
Copy lineitem_15.ctx and rename it lineitem_17.ctx.
2.2 Conversion on part table
A |
|
1 |
=file("part_16.ctx").open().cursor().fetch() |
2 |
=A1.id(P_CONTAINER).sort() |
3 |
=file("p_container.txt").export(A2) |
4 |
=A1.run(P_CONTAINER=A2.pos@b(P_ CONTAINER)) |
5 |
=file("part_17.ctx").create(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT) |
6 |
>A5.append(A4.cursor()) |
3. Code after data conversion
The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:
A |
|
1 |
>env(p_brand,file("p_brand.txt").import@si()) |
2 |
>env(p_container,file("p_container.txt").import@si()) |
3 |
>env(part, file("part_17.ctx").open().import()) |
Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
>brand=p_brand.pos@b("Brand#33") |
3 |
>container=p_container.pos@b("LG DRUM") |
4 |
=part.@m(P_BRAND==brand && P_CONTAINER==container) |
5 |
=file("lineitem_17.ctx").open() |
6 |
=A5.cursor@m(L_PARTKEY,L_QUANTITY;A4(L_PARTKEY)) |
7 |
=A6.groups@u(L_PARTKEY;avg(L_QUANTITY):avg).run(A4(L_PARTKEY)=avg*0.2) |
8 |
=A5.cursor@m(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:A4:#) |
9 |
=A8.total(sum(if(L_QUANTITY<L_PARTKEY,L_EXTENDEDPRICE,0)))/7.0 |
10 |
=interval@ms(A1,now()) |
Ⅳ Using enterprise edition’s column-wise computation
1. Original data
A |
|
1 |
=now() |
2 |
>brand="Brand#33" |
3 |
>container="LG DRUM" |
4 |
=file("part.ctx").open().cursor@mv(P_PARTKEY;P_BRAND==brand && P_CONTAINER==container).fetch().keys@im(P_PARTKEY) |
5 |
=file("lineitem.ctx").open() |
6 |
=A5.cursor@mv(L_PARTKEY,L_QUANTITY;A4.find(L_PARTKEY)) |
7 |
=A6.groups@u(L_PARTKEY;avg(L_QUANTITY)*0.2:avg).keys@im(L_PARTKEY) |
8 |
=A5.cursor@mv(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:A7) |
9 |
=A8.total(sum(if(L_QUANTITY<L_PARTKEY.avg,L_EXTENDEDPRICE,0)))/7.0 |
10 |
=interval@ms(A1,now()) |
2. Optimized data
The dimension tables and the value list of string fields need to be preloaded. Below is preloading code:
A |
|
1 |
>env(p_brand,file("p_brand.txt").import@si()) |
2 |
>env(p_container,file("p_container.txt").import@si()) |
3 |
>env(part, file("part_17.ctx").open().import@v()) |
Before performing the query, we need to first run the preloading code to load the small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
>brand=p_brand.pos@b("Brand#33") |
3 |
>container=p_container.pos@b("LG DRUM") |
4 |
=part.@m(P_BRAND==brand && P_CONTAINER==container) |
5 |
=file("lineitem_17.ctx").open() |
6 |
=A5.cursor@mv(L_PARTKEY,L_QUANTITY;A4(L_PARTKEY)) |
7 |
=A6.groups@u(L_PARTKEY;avg(L_QUANTITY):avg) |
8 |
clear A4 |
9 |
=part.(null) |
10 |
=A7.run@m(A9(L_PARTKEY)=avg*0.2) |
11 |
=A5.cursor@mv(L_PARTKEY,L_QUANTITY,L_EXTENDEDPRICE;L_PARTKEY:A9:#) |
12 |
=A11.total(sum(if(L_QUANTITY<L_PARTKEY,L_EXTENDEDPRICE,0)))/7.0 |
13 |
=interval@ms(A1,now()) |
Ⅴ Test result
Unit: Second
Regular |
Column-wise |
|
Before optimization |
15.4 |
7.7 |
After optimization |
14.8 |
6.8 |
In this example, there are only over 600,000 rows of data in the filtered lineitem table. The data size is too small to give full play to the numberization method, and the optimization shows little effect. If we modify the filtering condition in A4 by changing && to ||, there are over 38 million rows left in the filtered lineitem table. In this case, the test shows a significant optimization effect.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL