Performance optimization case course: TPCH-Q10
select * from (
select
c_custkey,c_name,
sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal,n_name,c_address,c_phone,c_comment
from
customer,orders,lineitem,nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date '1993-05-01'
and o_orderdate < date '1993-05-01' + interval '3' month
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
order by
revenue desc
) where rownum <=20;
This is a grouping & aggregation operation on the filtered result set of multi-table association which involves a primary-sub table association between orders table and lineitem table. The target is to sort records by grouped aggregates and get the first 20 ones.
1. Data storage
Store the big tables orders and lineitem in order by the primary key orderkey, which makes it possible to merge them in order when joining. For other tables, there is no special sorting requirement, they are all stored in order by primary key.
So, we can continue to use orders.ctx, lineitem.ctx and customer.ctx from Q3.
For the nation table, we can directly use nation.btx from Q2.
Copy these tables to the main directory of this query.
2. General method
A |
|
1 |
=now() |
2 |
1993-5-1 |
3 |
=elapse@m(A2,3) |
4 |
=file("nation.btx").import@b().keys@i(N_NATIONKEY) |
5 |
=file("customer.ctx").open().cursor@m(C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT;C_NATIONKEY:A4).fetch().keys@i(C_CUSTKEY) |
6 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE<A3,O_CUSTKEY:A5) |
7 |
=file("lineitem.ctx").open().news@r(A6,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG=="R") |
8 |
=A7.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue) |
9 |
=A8.top(-20;revenue) |
10 |
=A9.new(c_custkey.C_CUSTKEY,c_custkey.C_NAME,revenue,c_custkey.C_ACCTBAL,c_custkey.C_NATIONKEY.N_NAME,c_custkey.C_ADDRESS,c_custkey.C_PHONE,c_custkey.C_COMMENT) |
11 |
=A10.sort@z(revenue) |
12 |
=interval@ms(A1,now()) |
Here we use the optimization methods mentioned in the previous articles such as pre-cursor filtering, order-based primary-sub table merge, and foreign key fieldattributization.
Test result:
Test items |
Execution time (seconds) |
General method |
17 |
3. Big dimension table search
We found that this operation can be changed into two steps. The first step is to associate orders with lineitem and perform grouping and aggregation operation on custkey. The second step is to associate the result set with the dimension table customer. In SQL, it can be roughly rewritten as:
select c_custkey,c_name,revenue,c_acctbal,n_name,c_address,c_phone,c_comment
from (
select o_custkey, revenue
from (
select o_custkey,
sum(l_extendedprice * (1 - l_discount)) as revenue
from
orders,lineitem
where
l_orderkey = o_orderkey
and o_orderdate >= date '1993-05-01'
and o_orderdate < date '1993-05-01' + interval '3' month
and l_returnflag = 'R'
group by o_custkey
order by revenue desc
) orders_lineitem
where rownum <=20
) orders_lineitem20,customer,nation
where
c_custkey = o_custkey
and c_nationkey = n_nationkey
order by revenue desc;
Because the target is to get 20 records, and the subquery orders_lineitem20 is also to return 20 records, we just need to perform subsequent association between the 20 records and the customer table. In contrast, the original method needs to associate every record of the subquery orders_lineitem with customer table.
This is a scenario where the small fact table orders_lineitem20 joins with the big dimension table customer. We can use the big dimension table search method, that is, convert the join operation into a batch search on customer’s primary key. Since the customer table is ordered by the primary key c_custkey, the search calculation can be completed quickly.
Calculation code:
A |
|
1 |
=now() |
2 |
1993-5-1 |
3 |
=elapse@m(A2,3) |
4 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE<A3) |
5 |
=file("lineitem.ctx").open().news@r(A4,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;L_RETURNFLAG=="R") |
6 |
=A5.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue) |
7 |
=A6.top(-20;revenue) |
8 |
=file("nation.btx").import@b().keys@i(N_NATIONKEY) |
9 |
=file("customer.ctx").open() |
10 |
=A7.joinx@q(c_custkey,A9:C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT) |
11 |
=A10.switch(C_NATIONKEY,A8:N_NATIONKEY) |
12 |
=A11.new(c_custkey:C_CUSTKEY,C_NAME,revenue,C_ACCTBAL,C_NATIONKEY.N_NAME,C_ADDRESS,C_PHONE,C_COMMENT) |
13 |
=A12.sort@z(revenue) |
14 |
=interval@ms(A1,now()) |
After A7’s computation, we use joinx@q to match records of customer table in order and get eligible ones quickly. This eliminates theneedto load customer into memory or join the orders table containing a large amount of data with the customer table, thus reducing the amount of data retrieval and calculation.
Test result:
Test items |
Execution time (seconds) |
General method |
17 |
Big dimension table search |
7 |
4. Data conversion
Utilize the conversion methods mentioned in previous articles: convert enumeration string field to number, convert primary key of dimension table to sequence number, convert date to integer.
For the nation table, we can directly use nation_2.btx converted in Q2.
For the customer table, we can use customer_3.ctx from Q5.
For the orders table, we can use orders_5.ctx from Q3.
Copy these tables to the main directory of this query.
Although Q1 has converted L_RETURNFLAG in lineitem to number, lineitem.ctx does not have primary key, resulting in a failure to meet order-based merge requirement. Therefore, it needs to convert enumeration field to number based on the lineitem.ctx in Q3.
Code for data conversion:
A |
|
1 |
=file("lineitem.ctx").open() |
2 |
=A1.cursor(L_RETURNFLAG).id(L_RETURNFLAG).sort() |
3 |
=file("l_returnflag.btx").export@b(A2) |
4 |
=A1.cursor().run(L_RETURNFLAG =A2.pos@b(L_RETURNFLAG)) |
5 |
=file("lineitem4.ctx").create(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY,L_SUPPKEY,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_SHIPDATE,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT) |
6 |
>A5.append(A4) |
Calculation code:
A |
|
1 |
=now() |
2 |
1993-5-1 |
3 |
=days@o(elapse@m(A2,3)) |
4 |
=days@o(A2) |
5 |
=file("nation_2.btx").import@b() |
6 |
=file("l_returnflag.btx").import@b().(_1=="R") |
7 |
=file("orders_5.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3) |
8 |
=file("lineitem4.ctx").open().news@r(A7,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;A6(L_RETURNFLAG)) |
9 |
=A8.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue) |
10 |
=A9.top(-20;revenue) |
11 |
=file("customer.ctx").open() |
12 |
=A10.joinx@q(c_custkey,A11:C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT) |
13 |
=A12.new(c_custkey:C_CUSTKEY,C_NAME,revenue,C_ACCTBAL,A5(C_NATIONKEY).N_NAME,C_ADDRESS,C_PHONE,C_COMMENT) |
14 |
=A13.sort@z(revenue) |
15 |
=interval@ms(A1,now()) |
A3 and A4 are to convert date to number, and A6 is aligned sequence.
The nation table in A13 is to sequence-numberize the foreign key field.
Test result:
Test items |
Execution time (seconds) |
General method |
17 |
Big dimension table search |
7 |
Data conversion |
5 |
5. Column-wise computing
A |
|
1 |
=now() |
2 |
1993-5-1 |
3 |
=days@o(elapse@m(A2,3)) |
4 |
=days@o(A2) |
5 |
=file("nation_2.btx").import@b() |
6 |
=file("l_returnflag.btx").import@b().(_1=="R") |
7 |
=file("orders_5.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A4 && O_ORDERDATE<A3) |
8 |
=file("lineitem4.ctx").open().news@r(A7,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):dp,O_CUSTKEY;A6(L_RETURNFLAG)) |
9 |
=A8.groups@u(O_CUSTKEY:c_custkey;sum(dp):revenue) |
10 |
=A9.top(-20;revenue) |
11 |
=file("customer.ctx").open() |
12 |
=A10.joinx@q(c_custkey,A11:C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_COMMENT) |
13 |
=A12.new(c_custkey:C_CUSTKEY,C_NAME,revenue,C_ACCTBAL,A5(C_NATIONKEY).N_NAME,C_ADDRESS,C_PHONE,C_COMMENT) |
14 |
=A13.sort@z(revenue) |
15 |
=interval@ms(A1,now()) |
Test result:
Test items |
Execution time (seconds) |
General method |
17 |
Big dimension table search |
7 |
Data conversion |
5 |
Column-wise computing |
3 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version