Performance Optimization - 4.2 [Traversal technology] Multipurpose traversal
Performance Optimization - 4.1 [Traversal technology] Cursor filtering
We know that the data reading time accounts for a large proportion in the traversal operation of external storage data table. However, reading is unavoidable, so we hope to get as many things done as possible during one time reading, that is, try to enable the data read out in the traversal process to be used for as many purposes as possible.
For example, we want to group and aggregate an order table to calculate the sales amount by product, then find out the maximum order amount of each region. Since the two tasks use different grouping keys, they cannot be written in one grouping operation. A simple method is:
A | |
---|---|
1 | =file(“orders.ctx”).open() |
2 | =A1.cursor(product,amount).groups(product;sum(amount)) |
3 | =A1.cursor(area,amount).groups(area;max(amount)) |
This method will traverse the data table twice, and the amount field is read repeatedly (we assume the data table adopts columnar storage. If it adopts row-based storage, more content will be will read repeatedly).
Actually, we can use some techniques to calculate both grouped results in one traversal.
A | |
---|---|
1 | =file(“orders.ctx”).open() |
2 | =A1.cursor(area,product,amount).groups(area,product;sum(amount):samount,max(amount):mamount) |
3 | =A2.groups(product;sum(samount)) |
4 | =A2.groups(area;max(mamount)) |
In this way, CPU’s computation load will increase, as will the memory usage since it needs to calculate and keep a more detailed grouped result set. Nevertheless, a better computing performance can usually be obtained because the amount of traversal is much less. But, if we want to do different grouping and aggregating on more fields, the code will be much more cumbersome; If we want to perform complex grouping operations to the cursor, such as filtering first and then grouping, it is almost impossible to code with such techniques.
SPL provides multipurpose traversal technology to solve this kind of problem. The above operation can be written as:
A | |
---|---|
1 | =file(“orders.ctx”).open() |
2 | =A1.cursor(product,area,amount) |
3 | =channel(A2).groups(area;max(amount)) |
4 | =A2.groups(product;sum(amount)) |
5 | =A3.result() |
A3 uses the channel() function to define a channel that synchronizes with the cursor A2, on which an operation (also a grouping) is attached. When A4 traverses the cursor to perform grouping operation, the read data will be simultaneously sent to the channel A3 to perform the attached operation (another grouping). After traversal, the corresponding calculation result will be kept in the channel and can be taken out in A5.
In comparison with the previous code that traverses twice, CPU’s computation load in this code is the same, both codes only do two small grouping operations. However, the amount of reading the hard disk in this code is much less, because the amount field is only read once.
A cursor can define multiple synchronous channels and attach multiple sets of operations at the same time. Moreover, this kind of operation can be freely written, not limiting to grouping operation, and can also be written in multiple steps. For example, A3 can be written as:
=channel(A2).select(amount>=50).groups(area;max(amount))
This code can count the orders with an amount over 50.
This mechanism works for all cursors, not limiting to composite table.
SPL also provides a statement-pattern channel syntax, which makes code look neater:
A | B | |
---|---|---|
1 | =file(“orders.ctx”).open() | |
2 | =A1.cursor(product,area,amount) | |
3 | cursor A2 | =A3.groups(area;max(amount)) |
4 | cursor | =A4.groups(product;sum(amount)) |
5 | cursor | =A5.select(amount>=50).total(count(1)) |
6 |
After the cursor is created, use the cursor statement to create a channel for it, and attach operations on the channel. Multiple channels can be created. If the cursor parameters are not written in the subsequent statements, it indicates the same cursor will be used. Once all cursor statements (code blocks) are written, SPL will consider all channels have been defined completely, and will start traversing the cursor and calculate the operation result of each channel and store them in the cell where the cursor statement is located.
Here, B3 and B4 respectively define corresponding operation target, and B5 adds the number of orders with a calculated amount of more than 50. These calculation results will be put into A3, A4 and A5 respectively (note that they are not B3, B4, B5).
The idea of multi-purpose traversal can also be applied to data split. For example, there is a large text file, we want to pick out the compliant data (that meet the given condition) from this file for further analysis. To solve this problem, using the select()function of cursor is OK. Furthermore, we may want to know what data is non-compliant data (that doesn’t meet the given condition) in order to prevent this situation from happening again. However, the one-time filter principle cannot separate the records that meet the conditions from those that don’t meet the condition at the same time. In this case, the multi-purpose traversal technique can be used.
A | |
---|---|
1 | =file(“data.txt”).cursor@t() |
2 | … |
3 | =channel(A1).select(!(${A2})).fetch() |
4 | =A1.select(${A2}) |
5 | >file(“result.btx”).export@b(A4) |
6 | =A3.result() |
Fill in the filter condition in A2; the channel in A3 will filter out the records that do not meet the condition and fetch them; the cursor in A4 will filter out the records that meet the condition; A5 will traverse the cursor to write the records that meet the condition to a new file; A6 will take the channel results. Here we assume there are only a few records that do not meet the condition, which can be hold in memory.
However, this method still needs to calculate the condition twice (the records that meet the condition and the records that do not meet the condition need to be calculated separately). SPL provides a method directly in select() function, which can take out the records that do not meet the condition at the same time. However, such records can only be written to another file in the format of bin file.
A | |
---|---|
1 | =file(“data.txt”).cursor@t() |
2 | … |
3 | =A1.select(${A2};file(“error.btx”)) |
4 | >file(“result.btx”).export@b(A3) |
Similarly, there is also a possibility to split a big data table into multiple groups, such as splitting the order records into multiple files by region for distribution purpose. This kind of problem can also be solved with channels. However, the number of channels is predetermined in code, so we should know in advance how many parts to divide into, rather than generating new channel temporarily during the grouping process.
To solve this problem, SPL attaches a function that can split and write-to-files when performing sequence number grouping on the cursor. For example, the following code divides the orders into 12 months (taking months as example is because that they are easy to be sequence-numberized. For other cases, you can do sequence-numberization yourself).
A | |
---|---|
1 | =file(“orders.txt”).cursor@t() |
2 | =12.(file(“order”/~/“.btx”)) |
3 | =A1.groupn(month(dt);A2) |
4 | =A1.skip() |
The groupn()function in A3 is a delayed cursor, which merely records the action, and will be actually calculated during cursor traversal. We should prepare a corresponding number of file objects (A2) in advance. Similar to select(), groupn() can only write data to bin files.
Performance Optimization - 4.3 [Traversal technology] Parallel traversal
Performance Optimization - Preface
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL