Performance Optimization Exercises Using TPC-H – Q9
Ⅰ SQL code and analysis
Below is the SQL query statement:
select
nation,
o_year,
sum(amount) as sum_profit
from
(
select
n_name as nation,
extract(year from o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
from
part,
supplier,
lineitem,
partsupp,
orders,
nation
where
s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and s_nationkey = n_nationkey
and p_name like '%chocolate%'
) profit
group by
nation,
o_year
order by
nation,
o_year desc;
This is a grouping & aggregation operation on the filtered result set of multi-table association, which involves a primary-sub table association between orders table and lineitem table and an association with the foreign key table having a two-field primary key.
Ⅱ SPL solution
A |
|
1 |
=now() |
2 |
>name="chocolate" |
3 |
=file("nation.btx").import@b().keys@i(N_NATIONKEY) |
4 |
=file("part.ctx").open().cursor@m(P_PARTKEY;pos(P_NAME, name)).fetch().keys@im(P_PARTKEY) |
5 |
=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NATIONKEY).fetch().keys@im(S_SUPPKEY) |
6 |
=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;A4.find(PS_PARTKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY) |
7 |
=file("lineitem.ctx").open().cursor@m(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;A4.find(L_PARTKEY)) |
8 |
=file("orders.ctx").open().new@r(A7,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE) |
9 |
=A8.join@i(L_PARTKEY:L_SUPPKEY,A6,PS_SUPPLYCOST).switch(L_SUPPKEY,A5) |
10 |
=A9.groups(L_SUPPKEY.S_NATIONKEY:nationName,year(O_ORDERDATE):o_year;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)-PS_SUPPLYCOST*L_QUANTITY):profit) |
11 |
=A10.run(nationName=A3.find(nationName).N_NAME).sort(nationName,-o_year) |
12 |
=interval@ms(A1,now()) |
There is no filtering on orders table. The filtering action only happens on lineitem table. The number of data rows after filtering is much less than that in orders table, so we handle the primary-sub table association differently from the previous example. First, A7 creates a lineitem table cursor. Then A8 filters the primary table orders according to the cursor. This way, the number of data rows retrieved is considerably decreased and efficiency is increased.
A9 uses join@i method to stitch the foreign key table’s PS_SUPPLYCOSTA field to A8’s cursor. As the foreign key table has a two-field primary key, the method of matching and conversion at cursor creation becomes inapplicable.
Ⅲ Further optimization
1. Optimization method
In this example, we will use the date-integer conversion optimization method explained in Q1 – orders table’s O_ORDERDATE field is already converted in Q3, and the dimension table primary key numberization method explained in Q2 – nation table’s N_NATIONKEY field, part table’s P_PARTKEY, supplier table’s S_SUPPKEY and S_NATIONKEY fields, partsupp table’s PS_PARTKEY field and PS_SUPPKEY field, and lineitem table’s l_PARTKEY field and L_SUPPKEY field are already converted in the previous examples.
2. Code for data conversion
Copy nation_8.btx, part_8.ctx, supplier_8.ctx, partsupp_2.ctx, orders_8.ctx and lineitem_8.ctx, and rename them nation_9.btx, part_9.ctx, supplier_9.ctx, partsupp_9.ctx, orders_9.ctx and lineitem_9.ctx respectively.
3. Code after data conversion
First, we need to preload dimension tables. Below is preloading code:
A |
|
1 |
>env(nation, file("nation_9.btx").import@b()) |
2 |
>env(part, file("part_9.ctx").open().import()) |
3 |
>env(supplier, file("supplier_9.ctx").open().import()) |
Before performing the query, we need to first run the preloading code to load small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
>name="chocolate" |
3 |
=part.@m(pos(P_NAME, name)) |
4 |
=supplier.@m(S_NATIONKEY) |
5 |
=file("partsupp_9.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;A3(PS_PARTKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY) |
6 |
=file("lineitem_9.ctx").open().cursor@m(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;A3(L_PARTKEY)) |
7 |
=file("orders_9.ctx").open().new@r(A6,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE) |
8 |
=A7.select(L_PARTKEY=A5.find(L_PARTKEY,L_SUPPKEY).PS_SUPPLYCOST).switch(L_SUPPKEY,A4:#) |
9 |
=nation.len()+1 |
10 |
=A8.groups(year(O_ORDERDATE)*A9+L_SUPPKEY:gk;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)-L_PARTKEY*L_QUANTITY):profit) |
11 |
=A10.new(nation(gk%A9).N_NAME:nationName,gk\A9:o_year,profit).sort(nationName,-o_year) |
12 |
=interval@ms(A1,now()) |
A8 searches A5 for target records according to L_PARTKEY and L_SUPPKEY, and assigns L_PARTKEY as PS_SUPPLYCOST values – during which the current PS_SUPPLYCOST values is null if no corresponding A5 record is found and select() function will filter away the current record. This method has equal performance with the previously explained join@i method.
A10 uses the grouping key technique explained in Q1 to convert the grouping operation by two fields into one by a single field gk.
Ⅳ Using enterprise edition’s column-wise computation
1. Original data
A |
|
1 |
=now() |
2 |
>name="chocolate" |
3 |
=file("nation.btx").import@b().keys@i(N_NATIONKEY) |
4 |
=file("part.ctx").open().cursor@mv(P_PARTKEY;pos(P_NAME, name)).fetch().keys@im(P_PARTKEY) |
5 |
=file("supplier.ctx").open().cursor@mv(S_SUPPKEY,S_NATIONKEY).fetch().keys@im(S_SUPPKEY) |
6 |
=file("partsupp.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;A4.find(PS_PARTKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY) |
7 |
=file("lineitem.ctx").open().cursor@mv(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;A4.find(L_PARTKEY)) |
8 |
=file("orders.ctx").open().new@r(A7,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE) |
9 |
=A8.join@i(L_PARTKEY:L_SUPPKEY,A6,PS_SUPPLYCOST;L_SUPPKEY,A5,S_NATIONKEY) |
10 |
=A9.groups(S_NATIONKEY:nationName,year(O_ORDERDATE):o_year;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)-PS_SUPPLYCOST*L_QUANTITY):profit) |
11 |
=A10.new(A3.find(nationName).N_NAME:nationName,o_year,profit).sort(nationName,-o_year) |
12 |
=interval@ms(A1,now()) |
2. Optimized data
First, we need to preload dimension table. Below is preloading code:
A |
|
1 |
>env(nation, file("nation_9.btx").import@bv()) |
2 |
>env(part, file("part_9.ctx").open().import@v()) |
3 |
>env(supplier, file("supplier_9.ctx").open().import@v()) |
Before performing the query, we need to first run the preloading code to load small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
>name="chocolate" |
3 |
=part.@m(pos(P_NAME, name)) |
4 |
=file("partsupp_9.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;A3(PS_PARTKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY) |
5 |
=file("lineitem_9.ctx").open().cursor@mv(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT;A3(L_PARTKEY)) |
6 |
=file("orders_9.ctx").open().new@r(A5,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,O_ORDERDATE) |
7 |
=A6.join@i(L_PARTKEY:L_SUPPKEY,A4,PS_SUPPLYCOST;L_SUPPKEY,supplier:#,S_NATIONKEY) |
8 |
=nation.len()+1 |
9 |
=A7.groups(year(O_ORDERDATE)*A8+S_NATIONKEY:gk;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)-PS_SUPPLYCOST*L_QUANTITY):profit) |
10 |
=A9.new(nation(gk%A8).N_NAME:nationName,gk\A8:o_year,profit).sort(nationName,-o_year) |
11 |
=interval@ms(A1,now()) |
Ⅴ Test result
Unit: Second
Regular |
Column-wise |
|
Before optimization |
33.5 |
11.3 |
After optimization |
23.8 |
8.5 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL