Performance optimization case course: TPCH-Q7
select
supp_nation,
cust_nation,
l_year,
sum(volume) as revenue
from
(
select
n1.n_name as supp_nation,
n2.n_name as cust_nation,
extract(year from l_shipdate) as l_year,
l_extendedprice * (1 - l_discount) as volume
from
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2
where
s_suppkey = l_suppkey
and o_orderkey = l_orderkey
and c_custkey = o_custkey
and s_nationkey = n1.n_nationkey
and c_nationkey = n2.n_nationkey
and (
(n1.n_name = 'CHINA' and n2.n_name = 'RUSSIA')
or (n1.n_name = 'RUSSIA' and n2.n_name = 'CHINA')
)
and l_shipdate between date '1995-01-01' and date '1996-12-31'
) shipping
group by
supp_nation,
cust_nation,
l_year
order by
supp_nation,
cust_nation,
l_year;
This is the grouping and aggregation operation after multiple tables are associated and filtered, which includes the association between the primary table orders and the sub table lineitem.
1. Data storage
Store the big tables orders and lineitem in order by the primary key orderkey, which makes it possible to merge them in order when joining. For other tables, there is no special sorting requirement, they are all stored in order by primary key.
So, we can continue to use orders.ctx, lineitem.ctx and customer.ctx from Q3.
For nation and supplier, we can directly use nation.btx and supplier.ctx from Q2.
Copy these tables to the main directory of this query.
2. General method
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-1-1 |
1996-12-31 |
3 |
>name1="CHINA" |
>name2="RUSSIA" |
4 |
=file("nation.btx").import@b().keys@i(N_NATIONKEY) |
|
5 |
>n1=A4.select@1(N_NAME==name1).N_NATIONKEY |
|
6 |
>n2=A4.select@1(N_NAME==name2).N_NATIONKEY |
|
7 |
=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NATIONKEY;S_NATIONKEY==n1 || S_NATIONKEY==n2).fetch().keys@im(S_SUPPKEY) |
|
8 |
=file("customer.ctx").open().cursor@m(C_CUSTKEY,C_NATIONKEY;C_NATIONKEY==n1 || C_NATIONKEY==n2).fetch().keys@im(C_CUSTKEY) |
|
9 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_CUSTKEY:A8) |
|
10 |
=file("lineitem.ctx").open().news(A9,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE,O_CUSTKEY;L_SHIPDATE>=A2 && L_SHIPDATE <=B2,L_SUPPKEY:A7) |
|
11 |
=A10.select(O_CUSTKEY.C_NATIONKEY!=L_SUPPKEY.S_NATIONKEY) |
|
12 |
=A11.groups(L_SUPPKEY.S_NATIONKEY:supp_nation,O_CUSTKEY.C_NATIONKEY:cust_nation,year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)): volume) |
|
13 |
=A12.run(supp_nation=A4.find(supp_nation).N_NAME,cust_nation=A4.find(cust_nation).N_NAME) |
|
14 |
=interval@ms(A1,now()) |
Here we use the techniques mentioned in the previous articles: filter at cursor creation, order-based primary-sub table association, and convert join field to foreign key table pointers. Similar to Q5, A12 here also performs grouping on integer key values instead of strings, and then uses strings again in A13.
Test result:
Test items |
Execution time (seconds) |
General method |
13 |
3. Data conversion
Utilize the conversion methods mentioned in previous articles: convert primary keys of dimension tables to sequence numbers, convert date to integers.
For region, nation and supplier, we can directly use region_2.btx, nation_2.btx and supplier_2.ctx converted in Q2.
For customer, we can use customer_3.ctx from Q5.
For orders and lineitem, we can use orders_5.ctx and lineitem_5.ctx from Q3.
Copy these tables to the main directory of this query.
Calculation code:
A |
B |
||
1 |
=now() |
||
2 |
=days@o(date("1995-01-01")) |
=days@o(date("1996-12-31")) |
|
3 |
[CHINA,RUSSIA] |
||
4 |
=file("nation_2.btx").import@b().(A3.pos(N_NAME)) |
||
5 |
=file("supplier_2.ctx").open() |
=A5.cursor@m().skip().(null) |
|
6 |
=A5.cursor@m(S_SUPPKEY,S_NATIONKEY;A4(S_NATIONKEY)).fetch().(B5(S_SUPPKEY)=S_NATIONKEY) |
||
7 |
=file("customer_3.ctx").open() |
=A7.cursor@m().skip().(null) |
|
8 |
=A7.cursor@m(C_CUSTKEY,C_NATIONKEY;A4(C_NATIONKEY)).fetch().(B7(C_CUSTKEY)=C_NATIONKEY) |
||
9 |
=file("orders_5.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_CUSTKEY:B7:#) |
||
10 |
=file("lineitem_5.ctx").open().news(A9,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE,O_CUSTKEY;L_SHIPDATE>=A2 && L_SHIPDATE <=B2,L_SUPPKEY:B5:#) |
||
11 |
=A10.select(O_CUSTKEY!=L_SUPPKEY).run(O_CUSTKEY=A4(O_CUSTKEY),L_SUPPKEY=A4(L_SUPPKEY)) |
||
12 |
=A11.groups@u(year(L_SHIPDATE)*9+L_SUPPKEY*3+O_CUSTKEY:gk; sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)): volume) |
||
13 |
=A12.new(A3(gk%9\3):supp_nation,A3(gk%3):cust_nation,gk\9:l_year,volume).sort(supp_nation,cust_nation,l_year) |
||
14 |
=interval@ms(A1,now()) |
||
A4, B5 and B7 are aligned sequence.
A11 replaces values of O_CUSTKEY and L_SUPPKEY with corresponding values of A5’s aligned sequence, i.e., 1 or 2, corresponding to CHINA and RUSSIA respectively. A12 uses the grouping key technique explained in Q1 to convert the grouping operation by three fields into one by a single filed.
After integerizing the date in A12, the year function can still be used to calculate the year to which the date belongs.
Test result:
Test items |
Execution time (seconds) |
General method |
13 |
Data conversion |
9 |
4. Column-wise computing
A |
B |
|
1 |
=now() |
|
2 |
=days@o(date("1995-01-01")) |
=days@o(date("1996-12-31")) |
3 |
[CHINA,RUSSIA] |
|
4 |
=file("nation_2.btx").import@b().(A3.pos(N_NAME)) |
|
5 |
=file("supplier_2.ctx").open() |
=A5.cursor@m().skip().(null) |
6 |
=A5.cursor@m(S_SUPPKEY,S_NATIONKEY;A4(S_NATIONKEY)).fetch().(B5(S_SUPPKEY)=S_NATIONKEY) |
|
7 |
=file("customer_3.ctx").open() |
=A7.cursor@m().skip().(null) |
8 |
=A7.cursor@m(C_CUSTKEY,C_NATIONKEY;A4(C_NATIONKEY)).fetch().(B7(C_CUSTKEY)=C_NATIONKEY) |
|
9 |
=file("orders_5.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;B7(O_CUSTKEY)) |
|
10 |
=file("lineitem_5.ctx").open().news(A9,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE,O_CUSTKEY;L_SHIPDATE>=A2 && L_SHIPDATE<=B2 && B5(L_SUPPKEY)) |
|
11 |
=A10.derive@o(A4(B5(L_SUPPKEY)):s,A4(B7(O_CUSTKEY)):c).select@v(s!=c) |
|
12 |
=A11.groups@u(year(L_SHIPDATE)*9+c*3+s:gk; sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):volume) |
|
13 |
=A12.new(A3(gk%3):supp_nation,A3(gk%9\3):cust_nation,gk\9:l_year,volume).sort(supp_nation,cust_nation,l_year) |
|
14 |
=interval@ms(A1,now()) |
For column-wise computing, try to avoid using switch() function. Therefore, instead of using O_CUSTKEY:B7:# that performs association and filtering simultaneously, A9 uses B7(O_CUSTKEY) to filter data only.
A11 first uses O_CUSTKEY to find the corresponding NATIONKEY from customer in B7, and then takes the value of A4. To achieve this, the value of aligned sequence calculated in B7 will be NATIONKEY when the condition is satisfied, or null when the condition is not satisfied.
The association between lineitem and supplier is handled similarly to that of orders and customer.
Test result:
Test items |
Execution time (seconds) |
General method |
13 |
Data conversion |
9 |
Column-wise computing |
4 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version