Performance optimization case course: TPCH-Q13
select
c_count,
count(*) as custdist
from (
select
c_custkey,
count(o_orderkey) c_count
from
customer left outer join orders on
c_custkey = o_custkey
and o_comment not like '%special%accounts%'
group by
c_custkey
) c_orders
group by
c_count
order by
custdist desc,
c_count desc;
The query can be simply regarded as two rounds of regular grouping operations on the orders table. The first round groups by custkey and count the orders each customer has placed, and the second round groups by the number of orders and counts customers corresponding to each number.
1. Data storage
There is no special requirement for tables orders and customer, they are stored in order by primary key.
We can continue to use orders.ctx and customer.ctx from Q3.
Copy these tables to the main directory of this query.
2. General method
We noticed that there is a left join (i.e., a left outer join)in the original SQL statement to specifically include customers who have not placed any orders (the number of orders is 0). However, when grouping and aggregating by O_CUSTKEY based on orders, these data will be omitted and need to be included after grouping, that is to say, there is a need to add a record to count customers with an order count of 0.
A |
|
1 |
=now() |
2 |
>filter="*special*accounts*" |
3 |
=file("orders.ctx").open().cursor@m(O_CUSTKEY;!like(O_COMMENT,filter)) |
4 |
=A3.groups@u(O_CUSTKEY;count(1):c_count) |
5 |
=A4.len() |
6 |
=A4.groups@um(c_count;count(1):custdist) |
7 |
=file("customer.ctx").open().cursor@m().skip() |
8 |
=A6.insert(0,0,A7-A5) |
9 |
=A6.sort@z(custdist,c_count) |
10 |
=interval@ms(A1,now()) |
The code utilizes optimization methods mentioned in previous articles, such as pre-cursor filtering, multi-thread parallel processing.
A4 performs the first round of grouping operation, and A6 performs the second round. A7 calculates the number of all customers. A8 gets the number of customers who do not place any orders by subtracting A5’s number of customers who have placed orders from A7’s total number, and inserts the corresponding records to A6 for sorting in the next step.
A4 uses groups@u to perform the grouping operation. @u option enables not to sort by the grouping field O_CUSTKEY after grouping.
Test result:
Test items |
Execution time (seconds) |
General method |
22 |
3. Sequence-number-based grouping
When using the groups function for grouping in the previous section, hash computation is required. If the grouping field values are sequence number, we can use @n option to locate records directly to avoid hash computation.
Here we use the dimension table primary key sequence-numberization method introduced in previous articles - the C_CUSTKEY in customer and the O_CUSTKEY in orders have been converted in the previous examples.
We can directly use orders_5.ctx converted in Q3 and customer_3.ctx converted in Q5.
Copy these tables to the main directory of this query.
Calculation code:
A |
|
1 |
=now() |
2 |
>filter="*special*accounts*" |
3 |
=file("customer_3.ctx").open().cursor().skip() |
4 |
=file("orders_5.ctx").open().cursor@m(O_CUSTKEY;!like(O_COMMENT,filter)) |
5 |
=A4.groups@n0(O_CUSTKEY;count(1):c_count;A3) |
6 |
=A5.len() |
7 |
=A5.groups@um(c_count;count(1):custdist) |
8 |
=A7.insert(0,0,A3-A6) |
9 |
=A7.sort@z(custdist,c_count) |
10 |
=interval@ms(A1,now()) |
The groups@n0 in A5 means locating the corresponding group directly by O_CUSTKEY value; @0 means removing empty groups.
Test result:
Test items |
Execution time (seconds) |
General method |
22 |
Sequence-number-based grouping |
17 |
4. Sequence-based computing after sequence-numberization
The result set of sequence-number-based grouping in the previous section is a table sequence. However, when performing the grouping operation, thedata need tobeaggregated intothe table sequence, which uses a relatively large memory space. To solve this, we can perform grouping directly on a sequence after sequence-numberization, which can reduce memory usage and increase performance.
Calculation code:
A |
B |
|
1 |
=now() |
|
2 |
>filter="*special*accounts*" |
|
3 |
=file("customer_3.ctx").open().cursor().skip() |
|
4 |
=file("orders_5.ctx").open().cursor@m(O_CUSTKEY;!like(O_COMMENT,filter)) |
|
5 |
fork A4 |
=A3.(0) |
6 |
=A5.run(B5(O_CUSTKEY)+=1).skip() |
|
7 |
return B5 |
|
8 |
=transpose(A5).(~.sum()).groups@m(~:c_count;count(1):custdist) |
|
9 |
=A8.sort@z(custdist,c_count) |
|
10 |
=interval@ms(A1,now()) |
A3 gets the number of records in customer table (denote the number as n for the convenience of description).
A5 computes A4’s cursor through segment-based multithreaded processing. The number of threads m depends on A4’s cursor@m. Here m is the default number set by SPL for multi-cursor.
B5 defines a sequence having n members for each thread and assigns 0 to each member. B6 loops over the segment of orders cursor corresponding to the current thread and adds 1 to value of the (O_CUSTKEY)th member in the sequence. After the multithreaded processing finishes execution, we will get m sequences of length n in A5. Each sequence represents the number of orders placed by O_CUSTKEY in a segment.
A8 performs row-to-column transposition on A5. The result is n sequences of length m, each of which corresponds to a customer. Add up the m members of every row to get the total number of orders placed by each customer, and use groups() function to perform the second round of grouping to get customers corresponding to each number of orders placed.
In this process, calculations are all performed onsequence, which uses less memory space and improves performance comparedto calculations ontable sequence.
Test result:
Test items |
Execution time (seconds) |
General method |
22 |
Sequence-number-based grouping |
17 |
Sequence-based computing after sequence-numberization |
11 |
5. Column-wise computing
A |
|
1 |
=now() |
2 |
>filter="*special*accounts*" |
3 |
=file("orders_5.ctx").open().cursor@mv(O_CUSTKEY;!like(O_COMMENT,filter)) |
4 |
=A3.groups@uz(O_CUSTKEY;count(1):c_count) |
5 |
=A4.len() |
6 |
=A4.groups@um(c_count;count(1):custdist).o() |
7 |
=file("customer_3.ctx").open().cursor().skip() |
8 |
=A6.insert(0,0,A7-A5) |
9 |
=A6.sort@z(custdist,c_count) |
10 |
=interval@ms(A1,now()) |
In A4, groups() function works with @z option to let all threads share the result set during the multithreaded grouping. @z option is used when the grouping result set is rather large (usually the number of groups reaches over 1 million).
Because data cannot be inserted into a column-wise table sequence, o() function in A6 converts a column-wise table sequence to a regular one so that A8 can call the insert method.
As the column-wise computation hasn’t offered groups@n method yet, the sequence-number-based grouping method cannot be used.
Test result:
Test items |
Execution time (seconds) |
General method |
22 |
Sequence-number-based grouping |
17 |
Sequence-based computing after sequence-numberization |
11 |
Column-wise computing |
5 |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version