Iterative functions and custom aggregation operations
The conventional aggregation operations we have discussed, such as SUM/COUNT and unconventional aggregation operations, such as maxp/top, are pre designed aggregation functions. But what if we want to implement an operation that has not been defined before? For example, if we want to perform a multiplication operation, which is obviously an aggregation, can we only write it ourselves using a loop? Can we combine it with existing syntax and functions? (Off topic: Multiplication can be done using exp(SUM(ln(x))), but it’s a bit tricky, and it can’t handle situations where members are negative.).
To design such a syntax scheme, we first need to observe how these common aggregation functions are calculated.
SUM: First set an initial value of 0, then traverse each member of the set, adding member values to the initial value each time until the members are completely traversed.
COUNT: Set the initial value to 0, traverse the set members, and add 1 to the initial value every time a non null member is encountered until the traversal is complete.
AVERAGE: This cannot be calculated while traversing, but AVERAGE=SUM/COUNT is considered an derived function and does not need to be considered.
MAX: Set the initial value to infinitesimal, traverse the set members, and replace the initial value with each member value larger than the initial value until the traversal is complete.
MIN: Just like MAX, the initial value and comparison direction are opposite.
We found that the implementation schemes for these basic aggregation operations all have the same process: first setting an initial value, then traversing the set members to calculate a new initial value for the current member and initial value, and then entering the next cycle until traversing the entire set, the initial value becomes the aggregation value we need.
So, SPL designed an aggregation function to implement iterative computation:
A.iterate( x, r )
Traverse set A with r as the initial value, calculate the expression x with the current initial value and current member each time, and replace the current initial value with the result obtained until the traversal is completed, returning the final value of x.
There is a problem here, where an identifier or symbol is needed in x to represent the current initial value and current member. The current member can be represented by the ~ symbol, and the current initial value needs to be represented by another symbol, which is ~~ symbol in SPL.
The aggregation operations mentioned earlier can be represented by this iterative function:
A.sum() = A.iterate( ~~+~, 0 )
A.count() = A.iterate( ~+if(~~,1,0),0 )
A.max() = A.iterate( if(~>~~,~,~~), -inf )
A.min() = A.iterate( if( ~<~~,~,~~), inf )
Multiplication is also very simple:
A.iterate( ~~*~, 1 )
The unconventional aggregation we mentioned earlier can also be represented, for example, the maxp that returns a single value can be written as
A.maxp(F) = A.iterate( if (~==null || ~.F>>~~.F,~,~~), null ) The F here is the field name
But the situation of returning a set is a little more complicated:
A.maxp@a(F) = A.iterate( if(~==null || ~.F>~~.F,~,if(~.F==~~.F,~~|~,~~)), null ) | represents the union of sets
A.top( n ) = A.iterate( merge(~~,~).to(n), [inf]*n ) the merge function indicates the orderly merge operation, to(n) selects the first n members of the set
A.distinct() = A.iterate( if(~~.contain(~),~,~~|~), [] ) the contain function judges whether a member is included in the set
However, not all aggregation operations are easy to describe using iterate, as the definition of aggregation operations is too broad, such as the median, which is not suitable for describing with iterate. In addition, aggregation operations like getting the first/last do not require traversal and there is no need to describe them using iterate.
Using iterate function can help us write some new aggregation operations, and it is itself a traversal method. Aggregation operations that can be written using iterate function can all be traversed and calculated simultaneously. Once traversed, the aggregation value is obtained, and the target set only needs to be traversed once.
We know that computers cannot directly calculate on the external storage. When the amount of data is large and cannot be fully loaded into memory, the calculation method of iterate function can implement the aggregation of large amounts of data with only a small amount of memory (which can accommodate aggregation values). This is very meaningful for implementing aggregation operations after grouping.
The overall data size of grouped subsets is the same as the original set. Performing aggregation operations based on the split subset means traversing twice, which is not a big problem for pure in-memory operations. However, if the data volume is too large to fit in memory, external buffer swapping may occur, which is very inefficient. As an aggregation operation, iterate can of course be used after grouping. When performing aggregation operations that can be described by iterate after grouping, there is no need to traverse the original data twice. It can aggregate while grouping, and it only requires a small amount of memory (which can save the result set) to complete.
This is probably one of the reasons why SQL did not provide grouping subsets. In the era when SQL was invented, memory size was very small, and the vast majority of raw data could not be loaded into memory. The computational efficiency of generating grouping subsets first and then aggregating was intolerable. However, grouping and then aggregating directly, and the aggregation operations can be described using iterate, in most cases, it can be computed in one traversal without all data read into memory, unless the result set is also too large to fit in memory. Later on, the machine memory size increased, but this rule of SQL has been consistently used as a standard.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version