Performance Optimization Exercises Using TPC-H – Q10
Ⅰ SQL code and analysis
Below is the SQL query statement:
select * from (
select
c_custkey,c_name,
sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal,n_name,c_address,c_phone,c_comment
from
customer,orders,lineitem,nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date '1993-05-01'
and o_orderdate < date '1993-05-01' + interval '3' month
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
order by
revenue desc
) where rownum <=20;
This is a grouping & aggregation operation on the filtered result set of multi-table association (which involves a primary-sub table association between orders table and lineitem table). The target is to sort records by grouped aggregates and get the first 20 ones.
Ⅱ SPL solution
The working principle is similar to that in Q3. In the SQL statement, there are 7 fields in group by clause. In SPL, we just need to perform grouping by the first field.
A |
|
1 |
=now() |
2 |
1993-5-1 |
3 |
=elapse@m(A2,3) |
4 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE<A3) |
5 |
=file("lineitem.ctx").open().news@r(A4,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG=="R") |
6 |
=A5.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue) |
7 |
=A6.top(-20;revenue) |
8 |
=file("nation.btx").import@b().keys@i(N_NATIONKEY) |
9 |
=file("customer.ctx").open() |
10 |
=A7.joinx@q(c_custkey,A9:C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT) |
11 |
=A10.switch(C_NATIONKEY,A8:N_NATIONKEY) |
12 |
=A11.new(c_custkey:C_CUSTKEY,C_NAME,revenue,C_ACCTBAL,C_NATIONKEY.N_NAME,C_ADDRESS,C_PHONE,C_COMMENT) |
13 |
=A12.sort@z(revenue) |
14 |
=interval@ms(A1,now()) |
First, we perform grouping & aggregation on the result set of associating orders table and lineitem table; then based on the grouping & aggregation result set we perform the externally correlated foreign key table association. Since the target is to get 20 records, we can only perform the second association on the 20 records. There is no need to handle it before the grouping operation because computation will become more intensive that way.
After A7’s computation and as customer table is ordered by c_custkey field, we use A7.joinx@q to match records of customer table in order and get eligible ones quickly. Then we handle the other joins without the need to traverse customer table and with less data retrieved.
Ⅲ Further optimization
1. Optimization method
In this example, we will use the date-integer conversion optimization method explained in Q1 – orders table’s O_ORDERDATE field is already converted in Q3, the string-integer conversion method explained in Q1, where lineitem table’s L_RETRUNFLAG is already converted, and the dimension table primary key numberization method explained in Q2 – nation table’s N_NATIONKEY field, customer table’s C_CUSTKEY field and C_NATIONKEY field, orders table’s O_CUSTKEY field are already converted in the previous examples.
2. Code for data conversion
Copy nation_9.btx, customer_8.ctx, orders_9.ctx and lineitem_9.ctx, and rename them respectively as nation_10.btx, customer_10.ctx, orders_10.ctx andlineitem_10.ctx.
3. Code after data conversion
First, we need to preload dimension tables. Below is preloading code:
A |
|
1 |
>env(nation, file("nation_10.btx").import@b()) |
2 |
>env(customer, file("customer_10.ctx").open().import()) |
3 |
>env(l_returnflag,file("l_returnflag.txt").import@si()) |
Before performing the query, we need to first run the preloading code to load small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
1993-5-1 |
3 |
=days@o(elapse@m(A2,3)) |
4 |
=days@o(A2) |
5 |
=l_returnflag.pos@b("R") |
6 |
=file("orders_10.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3) |
7 |
=file("lineitem_10.ctx").open().news@r(A6,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG==A5) |
8 |
=A7.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue) |
9 |
=A8.top(-20;revenue) |
10 |
=A9.new((c=customer(c_custkey)).C_CUSTKEY,c.C_NAME,revenue,c.C_ACCTBAL,nation(c.C_NATIONKEY).N_NAME,c.C_ADDRESS,c.C_PHONE,c.C_COMMENT) |
11 |
=A10.sort@z(revenue) |
12 |
=interval@ms(A1,now()) |
Ⅳ Using enterprise edition’s column-wise computation
1. Original data
A |
|
1 |
=now() |
2 |
1993-5-1 |
3 |
=elapse@m(A2,3) |
4 |
=file("orders.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE<A3) |
5 |
=file("lineitem.ctx").open().news@r(A4,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG=="R") |
6 |
=A5.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue) |
7 |
=A6.top(-20;revenue) |
8 |
=file("nation.btx").import@bv().keys@i(N_NATIONKEY) |
9 |
=file("customer.ctx").open() |
10 |
=A7.joinx@q(c_custkey,A9:C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT) |
11 |
=A10.join(C_NATIONKEY,A8,N_NAME) |
12 |
=A11.new(c_custkey:C_CUSTKEY,C_NAME,revenue,C_ACCTBAL,N_NAME,C_ADDRESS,C_PHONE,C_COMMENT) |
13 |
=A12.sort@z(revenue) |
14 |
=interval@ms(A1,now()) |
2. Optimized data
First, we need to preload dimension tables. Below is preloading code:
A |
|
1 |
>env(nation, file("nation_10.btx").import@bv()) |
2 |
>env(customer, file("customer_10.ctx").open().import@v()) |
3 |
>env(l_returnflag,file("l_returnflag.txt").import@si()) |
Before performing the query, we need to first run the preloading code to load small dimension tables into memory.
Computing code:
A |
|
1 |
=now() |
2 |
1993-5-1 |
3 |
=days@o(elapse@m(A2,3)) |
4 |
=days@o(A2) |
5 |
=l_returnflag.pos@b("R") |
6 |
=file("orders_10.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3) |
7 |
=file("lineitem_10.ctx").open().news@r(A6,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG==A5) |
8 |
=A7.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue) |
9 |
=A8.top(-20;revenue) |
10 |
=A9.new(customer(c_custkey):c,revenue).new(c.C_CUSTKEY,c.C_NAME,revenue,c.C_ACCTBAL,nation(c.C_NATIONKEY).N_NAME,c.C_ADDRESS,c.C_PHONE,c.C_COMMENT) |
11 |
=A10.sort@z(revenue) |
12 |
=interval@ms(A1,now()) |
Ⅴ Test result
Unit: Second
Regular |
Column-wise |
|
Before optimization |
13.3 |
7.4 |
After optimization |
12.2 |
6.2 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL