Performance optimization case course: TPCH-Q5
select
n_name,
sum(l_extendedprice * (1 - l_discount)) as revenue
from
customer,
orders,
lineitem,
supplier,
nation,
region
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and l_suppkey = s_suppkey
and c_nationkey = s_nationkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'ASIA'
and o_orderdate >= date '1995-01-01'
and o_orderdate < date '1995-01-01' + interval '1' year
group by
n_name
order by
revenue desc;
This is a grouping and aggregation operation after multiple tables are associated and filtered, which includes the association between the primary table orders and the sub table lineitem.
1. Data storage
Store the big tables orders and lineitem in order by the primary key orderkey, which makes it possible to merge them in order when joining. For other tables, there is no special sorting requirement, they are all stored in order by primary key.
So, we can continue to use orders.ctx, lineitem.ctx and customer.ctx in Q3.
For tables region, nation and supplier, we can directly use regin.btx, nation.btx and supplier.ctx in Q2.
Copy these tables to the main directory of this query.
2. General method
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-1-1 |
>name="ASIA" |
3 |
=elapse@y(A2,1) |
|
4 |
=file("region.btx").import@b().select(R_NAME==name) |
|
5 |
=file("nation.btx").import@b().switch@i(N_REGIONKEY,A4:R_REGIONKEY).keys@i(N_NATIONKEY) |
|
6 |
=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NATIONKEY;A5.find(S_NATIONKEY)).fetch().keys@im(S_SUPPKEY) |
|
7 |
=file("customer.ctx").open().cursor@m(C_CUSTKEY,C_NATIONKEY;A5.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY) |
|
8 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE < A3,O_CUSTKEY:A7) |
|
9 |
=file("lineitem.ctx").open().news(A8,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;L_SUPPKEY:A6) |
|
10 |
=A9.select(O_CUSTKEY.C_NATIONKEY==L_SUPPKEY.S_NATIONKEY) |
|
11 |
=A10.groups@u(L_SUPPKEY.S_NATIONKEY:N_NAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) |
|
12 |
=A11.sort@z(revenue).run(N_NAME=A5.find(N_NAME).N_NAME) |
Here we use the techniques mentioned in the previous articles: filter at cursor creation, order-based primary-sub table association, and convert join field to foreign key table pointers.
Unlike Q3, the grouping field isn’t the ordered L_ORDERKEY and group@s() becomes unsuitable. Here we use groups@u, the result is no longer ordered by the grouping field, and the performance is better compared to not adding @u.
A11 performs grouping on S_NATIONKEY instead of N_NAME because integer comparison is faster than string comparison, and replace the field with N_NAME after the grouping operation is done.
Test result:
Test items |
Execution time (seconds) |
General method |
18 |
3. Data conversion
Utilize the conversion methods mentioned in previous articles: convert primary keys of dimension tables to sequence numbers, convert enumeration string field to numbers, convert date to integers.
For region, nation and supplier, we can directly use region_2.btx, nation_2.btx and supplier_2.ctx converted in Q2.
For orders, we can use orders_5.ctx in Q3.
Copy these tables to the main directory of this query.
Convert customer:
A |
|
1 |
=file("customer.ctx").open().import() |
2 |
=file("nation.btx").import@b() |
3 |
=A1.run(C_CUSTKEY=#,C_NATIONKEY=A2.pselect(N_NATIONKEY==C_NATIONKEY)) |
4 |
=file("customer_3.ctx").create(#C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT) |
5 |
>A4.append(A3.cursor()) |
Here we use the customer.ctx in Q3 and the nation.btx in Q2.
Convert lineitem:
A |
|
1 |
=file("supplier.ctx").open().import().keys@i(S_SUPPKEY) |
2 |
=file("lineitem.ctx").open().cursor() |
3 |
=A2.run(L_SUPPKEY=A1.pfind(L_SUPPKEY)) |
4 |
=file("lineitem_3.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT) |
5 |
>A4.append(A3) |
Here we use the supplier.ctx in Q2 and the lineitem.ctx in Q3.
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-1-1 |
>name="ASIA" |
3 |
=days@o(elapse@y(A2,1)) |
=days@o(A2) |
4 |
=file("region_2.btx").import@b(R_NAME).(R_NAME==name) |
|
5 |
=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY) |
|
6 |
=A5.(A4(N_REGIONKEY)) |
|
7 |
=file("supplier_2.ctx").open() |
=A7.cursor@m().skip().(null) |
8 |
=A7.cursor@m(S_SUPPKEY,S_NATIONKEY;A6(S_NATIONKEY)).fetch().(B7(S_SUPPKEY)=S_NATIONKEY) |
|
9 |
=file("customer_3.ctx").open() |
=A9.cursor@m().skip().(null) |
10 |
=A9.cursor@m(C_CUSTKEY,C_NATIONKEY;A6(C_NATIONKEY)).fetch().(B9(C_CUSTKEY)=C_NATIONKEY) |
|
11 |
=file("orders_5.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=B3 && O_ORDERDATE<A3,O_CUSTKEY:B9:#) |
|
12 |
=file("lineitem_3.ctx").open().news(A11,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;L_SUPPKEY:B7:#) |
|
13 |
=A12.select(O_CUSTKEY ==L_SUPPKEY) |
|
14 |
=A13.groups@u(L_SUPPKEY:N_NAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) |
|
15 |
=A14.sort@z(revenue).run(N_NAME=A5(N_NAME).N_NAME) |
|
16 |
=interval@ms(A1,now()) |
B7 and B9 are aligned sequence and their members are true or false.
A11 and A12 assign the O_CUSTKEY of orders to C_NATIONKEY and assign the L_SUPPKEY of lineitem to S_NATIONKEY. The O_CUSTKEY==L_SUPPKEY in A13 is actually the comparison between C_NATIONKEY and S_NATIONKEY.
Test result:
Test items |
Execution time (seconds) |
General method |
18 |
Data conversion |
10 |
4. Column-wise computing
A |
B |
|
1 |
=now() |
|
2 |
1995-1-1 |
>name="ASIA" |
3 |
=days@o(elapse@y(A2,1)) |
=days@o(A2) |
4 |
=file("region_2.btx").import@b(R_NAME).(R_NAME==name) |
|
5 |
=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY) |
|
6 |
=A5.(A4(N_REGIONKEY)) |
|
7 |
=file("supplier_2.ctx").open() |
=A7.cursor@m().skip().(null) |
8 |
=A7.cursor@mv(S_SUPPKEY,S_NATIONKEY;A6(S_NATIONKEY)).fetch().(B7(S_SUPPKEY)=~) |
|
9 |
=file("customer_3.ctx").open() |
=A9.cursor@m().skip().(null) |
10 |
=A9.cursor@mv(C_CUSTKEY,C_NATIONKEY;A6(C_NATIONKEY)).fetch().(B9(C_CUSTKEY)=~) |
|
11 |
=file("orders_5.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=B3 && O_ORDERDATE<A3 && B9(O_CUSTKEY)) |
|
12 |
=file("lineitem_3.ctx").open().news(A11,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;B7(L_SUPPKEY)) |
|
13 |
=A12.derive@o(B7(L_SUPPKEY).S_NATIONKEY,B9(O_CUSTKEY).C_NATIONKEY).select@v(C_NATIONKEY==S_NATIONKEY) |
|
14 |
=A13.groups@u(S_NATIONKEY;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) |
|
15 |
=A14.sort@z(revenue).new(A5(S_NATIONKEY).N_NAME,revenue) |
|
16 |
=interval@ms(A1,now()) |
For column-wise computing, try to avoid using switch() function. Therefore, instead of using O_CUSTKEY:B9:# that performs association and filtering simultaneously, A11 uses B9(O_CUSTKEY) to filter data only. Because O_CUSTKEY is already sequence-numberized here, the customer record can be directly retrieved by row number in B9 without the need for association, which better manifests the advantages of column-wise computing.
A13 first uses O_CUSTKEY to find the corresponding customer record in B9, and then obtains the value of S_NATIONKEY. To achieve this, the aligned sequence calculated in B9 is no longer true or false, but the customer record when the condition is satisfied, or null when the condition is not satisfied.
The association between lineitem and supplier is handled similarly to that of customer.
In A13, when performing select filtering on the columnar cursor, adding @v option ensures the filtered cursor is still a columnar cursor.
Test result:
Test items |
Execution time (seconds) |
General method |
18 |
Data conversion |
10 |
Column-wise computing |
4 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version