Performance optimization skill: Use Multi-purpose traverse to speed up multiple grouping
As we all know, the bottleneck of big data computing performance often lies in external storage (i.e., hard disk) IO because the access performance of external storage is one or two orders of magnitude slower than that of memory. Therefore, it is sometimes more important to reduce the amount of hard disk access than to reduce the amount of CPU calculation when optimizing performance. If the algorithm with less hard disk access can be used, better performance will be obtained for the same task even with the same or slightly more CPU computation.
Grouping and aggregation require traversing the dataset. The same dataset may be grouped by different dimensions, so it has to be traversed many times in principle and multiple hard disk accesses will be involved for big data. However, if we can calculate the grouping results of multiple dimensions in one traversing, it will reduce a lot of hard disk access.
Unfortunately, SQL does not support such an operation (returning multiple grouping results in one traversing), it can only traverse many times or depends on whether the database engine can be optimized. While SPL supports the syntax of multi-purpose traverse, which can calculate multiple grouping results at one traversal, thus improving performance.
Let's do some test to see if Oracle can optimize the calculation of multiple traversals and the effect of multi-purpose traverse on performance in SPL.
1. Data preparation and environment
The SPL script generates a data file with two columns. The first column ID is a random integer of less than 2 billion, and the second column amount is a random real number of no more than 10 million. There are 8 billion rows of data records, and the size of the generated original text file is 169 G. With the data importing tool provided by the database, the file data is imported into Oracle's data table topn, and the SPL composite table file topn.ctx is also generated with the file data.
The test is performed on an Intel server, 2 Intel 3014 CPUs, 1.7G frequency, 12 cores in total, 64G memory. The database table data and SPL composite table file are both stored on the same SSD hard disk.
The data volume is deliberately made larger than the memory to ensure that the operating system can not cache all the data into memory and will definitely read the hard disk in actual calculation.
2. Oracle Testing
The test is divided into three situations: single grouping with single calculation amount, single grouping with double calculation amount, and double grouping with double calculation amount.
a) Single grouping with single calculation amount
select /*+ parallel(12) */ mod(id,100) Aid,max(amount) Amax from topn group by mod(id,100)
b) Single grouping with double calculation amount
select /*+ parallel(12) */
mod(id,100)+floor(id/20000000) Aid, max(amount) Amax, min(amount) Amin from topn group by mod(id,100)+floor(id/20000000);
The calculation expressions are doubled, which equals to twice the amount of calculation.
c) Double grouping with double calculation amount
select /*+ parallel(12) */ * from
(select mod(id,100) Aid,max(amount) Amax from topn group by mod(id,100) ) A
join
(select floor(id/20000000) Bid,min(amount) Bmin from topn group by floor(id/20000000) ) B
on A.Aid=B.Bid;
The calculation amount here is basically the same as b), but there are two groups. So we will observe if the database will perform traversing twice. The last JOIN operation involves only 100 rows of data with negligible time.
3. SPL Testing
Let's do the test in Oracle again in SPL.
a) Single grouping with single calculation amount
Write SPL script to execute the test:
A |
|
1 |
=now() |
2 |
=file("/home/topn/topn.ctx").open().cursor@m(id,amount) |
3 |
=A2.groups@u(id%100:Aid;max(amount):Amax) |
4 |
=interval@s(A1,now()) |
A2 defines a multi-cursor to execute in parallel, and the number of parallel is determined by the cursorParallelNum in the configuration file raqsoftConfig.xml.
b) Single grouping with double calculation amount
Write SPL script to execute the test:
A |
|
1 |
=now() |
2 |
=file("/home/topn/topn.ctx").open().cursor@m(id,amount) |
3 |
=A2.groups@u(id%100+id\20000000:Aid;max(amount):Amax,min(amount):Amin) |
4 |
=interval@s(A1,now()) |
c) Double grouping with double calculation amount
Write SPL script to execute the test:
A |
B |
|
1 |
=now() |
|
2 |
=file("/home/topn/topn.ctx").open().cursor@m(id,amount) |
|
3 |
cursor A2 |
=A3.groups@u(id%100:Aid;max(amount):Amax) |
4 |
cursor |
=A4.groups@u(id\20000000:Bid;max(amount):Bmax) |
5 |
=A3.join@i(Aid,A4:Bid,Bid,Bmax) |
|
6 |
=interval@s(A1,now()) |
The exclusive multi-purpose traversal syntax of SPL is used here. A cursor is defined in A2, and two sets of calculations for this cursor are defined in A3/B3 and A4/B4, which means that the two results will be calculated simultaneously in one cursor traversing.
4. Analysis and conclusions
The time for the three situations is as follows:
Test results (time unit: seconds)
single grouping with single calculation amount |
single grouping with double calculation amount |
double grouping with double calculation amount |
|
Oracle |
458 |
692 |
878 |
SPL |
336 |
350 |
376 |
As seen in the test result of Oracle, double grouping with double calculation amount is nearly 200 seconds slower than single grouping with double calculation amount, which cannot be ignored anymore. Since the calculation amount is almost the same for both, the extra time is estimated to be spent in one more traversing. This means that the database will not automatically optimize traversal and the data table will be traversed twice for double grouping. As a result, it will take almost twice as long to do one more grouping.
However, SPL adopts the mechanism of multi-purpose traversing, and the calculation time of the three tests is very little in difference. It will not traverse once more to do one more grouping and does not slow down much with only some added control logic for multi-purpose traversing.
Please note that the “amount” field type of Oracle is set as “decimal” when preparing data, so the calculation speed is slower; whereas “double” type is used in SPL composite table, which makes the performance much faster. But this test is not a comparison of the computation performance of Oracle and SPL, and these differences do not affect the above conclusions.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL