Iterative Aggregation Syntax
There are ready-to-use functions for performing the usual aggregate operations like SUM and COUNT and the unusual aggregate operations like MAXP and TOP. But how can we handle a provisional query, say, calculating the product of multiplying numbers? Can we compose it using the existing syntax and functions? We can use exp(SUM(ln(x))) to the aggregate calculation, but it’s a makeshift and can’t handle the occasions where the numbers are negative.
Before we try to design syntax for this calculation, let’s look at how do the aggregate queries are executed?
SUM:Set an initial value, which begins from 0, and traverse members of a set while adding the current member value to the initial value until all members are added.
COUNT:Set an initial value, which begins from 0, and traverse members of a set while adding 1 to the initial value if the current member is non-null until all members are scanned.
AVERAGE:This calculation can’t be done directly with a traversal. It is a derived function because AVERAGE=SUM/COUNT.
MAX:Set an initial value, which begins from infinitesimal, and traverse members of a set while replacing the initial value with the current member value, if the latter is greater than the former, until the whole set is scanned.
MIN:Same process as that of executing MAX, except the setting of initial value, which is infinity, and that the current member value will replace the initial value if it is less than the latter.
We find that these basic aggregate operations are executed in the same manner – set an initial value and perform a calculation over each member value and the current initial value to get a new initial value. Repeat the calculation until all members are traversed. The last initial value is the aggregate value we want.
Thus we can design an aggregate function for performing iterative calculations:
A.iterate(x,a)
The function traverses set A with parameter a as the initial value while calculating expression x with the current initial value and the current member value and replacing the current initial value with the result, and returns the last result of expression x when the traversal is done.
We need identifiers or signs to represent the current initial value and the current member. The latter, as we explained in previous articles, can be represented by the sign “~”. For the former, we use the sign “~~”to represent it.
Thus we can express the above aggregate operations using the iterative function:
A.SUM() = A.iterate(~~+~
, 0 )
A.COUNT()= A.iterate(~+if(~~,1,0),0 )
A.MAX()= A.iterate( if(~>~~,~,~~), -inf )
A.MIN()= A.iterate( if(~<~~,~,~~), inf )
To get a product of multiplying numbers is also easy:
A.iterate( ~~*~
, 1 )
It can be used to express the unconventional aggregate operations. For example, the MAXP that returns a single can be written as this:
A.maxp(F) = A.iterate(if (~==null || ~.F>>~~.F,~,~~), null ) Here parameter F is the field name.
It’s a little complicated if the MAXP returns a set:
A.maxp(F) = A.iterate( if(~==null || ~.F>~~.F,~,if(~.F==~~.F,~~|~,~~
)), null) The sign “|” represents the set union
A.top(n) = A.iterate( merge(~~,~
).to(n), [inf]*n)The merge function performs a merge sort operation; to(n) function selects the first n members in the set
A.distinct() = A.iterate( if(~~.contain(~),~,~~|~
), [] ) The contain function checks whether a specified value is included in a given set or not
As there are a large variety of aggregate operations, not all of them are easy to be described with the iterate function. One example is the query of getting the median. Besides, there’s no need to perform a traversal for the FIRST and LAST calculations and thus it’s unnecessary to use the iterate function.
Not only the aggregation syntax enables different descriptions of the aggregate operations, but it itself is a traversal algorithm. An aggregate query that can be expressed in aggregation syntax can be executed while members are traversed. The desired aggregate value is got once the traversal is over. Only one traversal over the target set is needed.
The computer can’t directly process data stored in an external storage device. But the iterative aggregation algorithm makes it possible to deal with a massive amount of data that can’t fit into the memory with a relative small memory space (which must be able to accommodate the aggregate value). It means a lot to the performing of aggregate calculations after a group operation.
The size of a grouped set with a number of grouping subsets is the same as that of the original set. This means two traversals are needed if the aggregate calculation is performed over the grouping subsets. It’s not an issue for a pure in-memory computation. If there is a huge amount of data that can’t fit into the memory, data needs to be retrieved from an external storage device, leading to low efficiency. The iterative operation, one of the aggregate operations, enables a simultaneous execution of the grouping action and the aggregation when used after a group operation to perform a calculation that can be expressed by it, making it unnecessary to traverse the original data set twice and requiring relatively small memory space (into which the result set can fit).
Possibly this helps explain why SQL didn’t define the grouping subsets. Back the age when SQL was invented, computer memory was small. On most occasions, it couldn’t accommodate the raw data. The execution would be unbearably slow if the grouping subsets are first generated before the aggregation is performed. An instant aggregation, which was always one that can be phrased with iteration syntax, following a group operation improved the situation, though retrieval from the external store could still happen (when the result set can’t fit into the memory).
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL