Performance optimization case course: TPCH-Q3
select * from (
select
l_orderkey,
sum(l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date '1995-03-15'
and l_shipdate > date '1995-03-15'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate
) where rownum<=10;
This is a typical operation of associating tables that are in primary-sub relationship, and then grouping and aggregation.
1. Data storage
This query is a traversal calculation. The big tables orders and lineitem are stored as columnar composite table. The relatively big table customer is also stored as composite table.
The orders and lineitem are in primary-sub relationship and need to be sorted by the association primary key orderkey. Subsequent computations involve grouping larger result sets, and the grouping key happens to be orderkey.
Conversion code:
A |
|
1 |
=file("customer.tbl").cursor(; , "|").new(_1:C_CUSTKEY, _2:C_NAME, _3:C_ADDRESS, _4:C_NATIONKEY, _5:C_PHONE, _6:C_ACCTBAL, _7:C_MKTSEGMENT, _8:C_COMMENT).sortx(C_CUSTKEY) |
2 |
=file("customer.ctx").create@y(#C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT) |
3 |
>A2.append(A1) |
4 |
=file("lineitem.tbl").cursor(;,"|").new(_1:L_ORDERKEY,_4:L_LINENUMBER,_2:L_PARTKEY,_3:L_SUPPKEY,_5:L_QUANTITY,_6:L_EXTENDEDPRICE,_7:L_DISCOUNT,_8:L_TAX,_9:L_RETURNFLAG,_10:L_LINESTATUS,_11:L_SHIPDATE,_12:L_COMMITDATE,_13:L_RECEIPTDATE,_14:L_SHIPINSTRUCT,_15:L_SHIPMODE,_16:L_COMMENT).sortx(L_ORDERKEY,L_LINENUMBER;5000000) |
5 |
=file("lineitem.ctx").create@py(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT) |
6 |
>A5.append(A4) |
=file("orders.tbl").cursor(; ,"|").new(_1:O_ORDERKEY,_2:O_CUSTKEY,_3:O_ORDERSTATUS,_4:O_TOTALPRICE,_5:O_ORDERDATE,_6:O_ORDERPRIORITY,_7:O_CLERK, _8:O_SHIPPRIORITY,_9:O_COMMENT).sortx(O_ORDERKEY;5000000) |
|
=file("orders.ctx").create@y(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE, O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT) |
|
>A8.append(A7) |
The sub table lineitem must be stored in segments according to the primary table orders. Therefore, A5 uses @p to create the composite table in order to ensure that parallel segments are not misaligned during association.
2. Ordinary grouping
Use the optimization algorithms mentioned in previous articles: pre-cursor filtering, ordered merge of tables with a primary-sub relationship, association and location, topN, multi-thread parallel computing.
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-03-15 |
BUILDING |
3 |
=file("customer.ctx").open().cursor@m(C_CUSTKEY;C_MKTSEGMENT==B2).fetch().keys@im(C_CUSTKEY) |
|
4 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY;O_ORDERDATE<A2 && A3.find(O_CUSTKEY)) |
|
5 |
=file("lineitem.ctx").open().news(A4,O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>A2) |
|
6 |
=A5.group(O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY;~.sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) |
|
7 |
=A6.total(top(10;-revenue,O_ORDERDATE)) |
|
8 |
=A7.run(O_ORDERDATE=date@o(O_ORDERDATE)) |
|
9 |
=interval@ms(A1,now()) |
A3, A4 and A5 use the filtering technique when creating a cursor.
A5 uses association and location.
Since the join result in A5 is ordered by the first grouping field ORDERKEY, and the other two grouping fields are uniquely determined by ORDERKEY, A6 can do an ordered grouping.
Test result:
Test items |
Execution time (seconds) |
Ordinary grouping |
9 |
3. Redundant grouping key
Among the three grouping fields, the last two fields O_ORDERDATE and O_SHIPPRIORITY can be uniquely determined by the first field ORDERKEY, so they do not need to participate in the grouping calculation and can be calculated using the redundant grouping key method.
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-03-15 |
BUILDING |
3 |
=file("customer.ctx").open().cursor@m(C_CUSTKEY;C_MKTSEGMENT==B2).fetch().keys@im(C_CUSTKEY) |
|
4 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY;O_ORDERDATE<A2 && A3.find(O_CUSTKEY)) |
|
5 |
=file("lineitem.ctx").open().news(A4,O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY,L_EXTENDEDPRICE,L_DISCOUNT;L_SHIPDATE>A2) |
|
6 |
=A5.group(O_ORDERKEY;O_ORDERDATE,O_SHIPPRIORITY,~.sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue) |
|
7 |
=A6.total(top(10;-revenue,O_ORDERDATE)) |
|
8 |
=A7.run(O_ORDERDATE=date@o(O_ORDERDATE)) |
|
9 |
=interval@ms(A1,now()) |
The two grouping fields written originally in A6 are now written in aggregation section. When calculating, directly take the field value of the first record in each group, which avoids having them participate in the grouping calculation.
Test result:
Test items |
Execution time (seconds) |
Ordinary grouping |
9 |
Redundant grouping key |
8 |
4. Direct aggregation when joining
When joining the primary table and the sub table, the sub table can be directly aggregated, which eliminates the process of copying primary table fields and other calculations.
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
1995-03-15 |
BUILDING |
3 |
=file("customer.ctx").open().cursor@m(C_CUSTKEY;C_MKTSEGMENT==B2).fetch().keys@im(C_CUSTKEY) |
|
4 |
=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY;O_ORDERDATE<A2 && A3.find(O_CUSTKEY)) |
|
5 |
=file("lineitem.ctx").open().news@r(A4,O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue;L_SHIPDATE>A2) |
|
6 |
=A5.total(top(10;-revenue,O_ORDERDATE)) |
|
7 |
=A6.run(O_ORDERDATE=date@o(O_ORDERDATE)) |
|
8 |
=interval@ms(A1,now()) |
In A5, adding a suffix @r to the news function means that the join is based on the primary table orders. If @r is not added, the join is based on latter table by default. When the join is based on primary table, the sub table needs to be aggregated before joining.
Test result:
Test items |
Execution time (seconds) |
Ordinary grouping |
9 |
Redundant grouping key |
8 |
Direct aggregation when joining |
7 |
5. Data conversion
Utilize various data conversion methods mentioned in previous articles: convert enumeration string field to numbers, convert date to integers, convert primary keys of dimension tables to sequence numbers.
Conversion code:
Convert customer:
A |
|
1 |
=file("customer.ctx").open().import() |
2 |
=A1.id(C_MKTSEGMENT).sort() |
3 |
=file("c_mktsegment.btx").export@b(A2) |
4 |
=A1.run(C_CUSTKEY=#, C_MKTSEGMENT=A2.pos@b(C_MKTSEGMENT)) |
5 |
=file("customer_5.ctx").create(#C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT) |
6 |
>A5.append(A4.cursor()) |
A2-A4: Convert enumeration string field C_MKTSEGMENT to numbers.
A4: Set the primary key C_CUSTKEY as row number, set C_MKTSEGMENT as its sequence number in value list.
Convert orders:
A |
|
1 |
=file("customer.ctx").open().import(C_CUSTKEY).keys@im(C_CUSTKEY) |
2 |
=file("orders.ctx").open().cursor() |
3 |
=A2.run(O_CUSTKEY=A1.pfind(O_CUSTKEY),O_ORDERDATE=days@o(O_ORDERDATE)) |
4 |
=file("orders_5.ctx").create(#O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE, O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT) |
5 |
>A4.append(A3) |
A1: Read the C_CUSTKEY column in customer as the primary key and create an index.
A3: Search for row number of the records whose primary key value is O_CUSTKEY in A1, and assigns it to O_CUSTKEY. Convert O_ORDERDATE to small integers.
Convert lineitem:
A |
|
1 |
=file("lineitem.ctx").open().cursor() |
2 |
=A1.run(L_SHIPDATE=days@o(L_SHIPDATE)) |
3 |
=file("lineitem_5.ctx").create@py(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT) |
4 |
>A3.append(A2) |
The sub table lineitem must be stored in segments according to the primary table orders. Therefore, A3 uses @p to create the composite table in order to ensure that parallel segments are not misaligned during association.
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
=days@o(date("1995-03-15")) |
|
3 |
=file("c_mktsegment.btx").import@b().(_1=="BUILDING") |
|
4 |
=file("customer_5.ctx").open() |
=A4.cursor@m().skip().(false) |
5 |
=A4.cursor@m(C_CUSTKEY;A3(C_MKTSEGMENT)).fetch() |
|
6 |
=A5.(B4(C_CUSTKEY)=true) |
|
7 |
=file("orders_5.ctx").open().cursor@m(O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY;B4(O_CUSTKEY) && O_ORDERDATE<A2) |
|
8 |
=file("lineitem_5.ctx").open().news@r(A7,O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue;L_SHIPDATE>A2) |
|
9 |
=A8.total(top(10;-revenue,O_ORDERDATE)) |
|
10 |
=A9.run(O_ORDERDATE=date@o(O_ORDERDATE)) |
|
11 |
=interval@ms(A1,now()) |
Having converted the enumeration string field to numbers and the primary key of dimension table to sequence numbers, aligned sequence can be used in A3 and B4.
Test result:
Test items |
Execution time (seconds) |
Ordinary grouping |
9 |
Redundant grouping key |
8 |
Direct aggregation when joining |
7 |
Data conversion |
6 |
6. Column-wise computing
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
=days@o(date("1995-03-15")) |
|
3 |
=file("c_mktsegment.btx").import@b().(_1=="BUILDING") |
|
4 |
=file("customer_5.ctx").open() |
=A4.cursor@m().skip().(false) |
5 |
=A4.cursor@mv(C_CUSTKEY;A3(C_MKTSEGMENT)).fetch() |
|
6 |
=A5.(B4(C_CUSTKEY)=true) |
|
7 |
=file("orders_5.ctx").open().cursor@mv(O_ORDERKEY,O_ORDERDATE,O_SHIPPRIORITY;B4(O_CUSTKEY) && O_ORDERDATE<A2) |
|
8 |
=file("lineitem_5.ctx").open().news@r(A7,O_ORDERKEY,sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue,O_ORDERDATE,O_SHIPPRIORITY;L_SHIPDATE>A2) |
|
9 |
=A8.total(top(10;-revenue,O_ORDERDATE)) |
|
10 |
=A9.new(O_ORDERKEY,date@o(O_ORDERDATE):O_ORDERDATE,O_SHIPPRIORITY,revenue) |
|
11 |
=interval@ms(A1,now()) |
A10: To convert the integer date back to date type data, the common method is to use the run function to reset the value of O_ORDERDATE. However, since the data type cannot be changed in column-wise computing, the new function is used here.
Test result:
Test items |
Execution time (seconds) |
Ordinary grouping |
9 |
Redundant grouping key |
8 |
Direct aggregation when joining |
7 |
Data conversion |
6 |
Column-wise computing |
3 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version