Unconventional Aggregate Operations
Standard SQL has a set of five aggregate functions – SUM/COUNT/AVG/MIN/MAX, which take a set as the parameter and return a single value. According to this feature, an aggregate operation can be defined as one that turns a set into a single value, which is a process of aggregation that makes multiple values into one. That is how the aggregate operations get their name.
Obviously we can use an aggregate function whenever there is a set. So the above functions can be applied to a data table (a set of records).
The result of a group operation is a set of grouping subsets, to each of which an aggregate function can be applied. A SQL group operation is always followed by an aggregate operation that is performed over each subset. Actually an aggregate operation performed over a whole set is one following a group operation that can be considered to divide the set into one group (a complete partition of a set). With this understanding, we believe that an aggregate operation always follows a group operation (but not vice versa, which we’ve explained in one of the previous articles). Or we can say it is certain that an aggregate operation (which is one conforming to the above definition) can be used after a group operation. The ideas will enable much wider application scenarios of the grouping and aggregation duo.
In addition to the previous five aggregate operations, some database products also provide aggregate functions for calculating aggregate values including variance and standard deviation. All of them are the usual type of aggregate calculations. Now we’ll look at the unusual type of aggregate operations that have significance in business practices.
1.Aggregate operations that return a record
The above unusual aggregate operations are performed over numeric values. With structured data, the operations are applied to a certain field (or an expression) while the numeric field values or the numeric results of the expression are calculated and return a numeric value. At times what we care about isn’t the numeric return value but the related information.
Here’s an example. Suppose you want to find the IP address of a certain user’s first login. Not the login time. Following is the code written in standard SQL:
SELECT ip_address FROM LogTable WHERE user=? AND logintime=
(SELECT MIN(logintime) FROM LogTable WHERE user=?)
Get the time of the user’s first login with the subquery, and then the IP address at the time. The query will need to traverse the data set twice.
ORACLE provides a KEEP function to eliminate the need of a subquery:
SELECT MIN(ip_address) KEEP(DENSE_RANK FIRST ORDER BY logintime) FROM LogTable WHERE user=?
Besides the IP address, we may take interest in another field, like the browser the user uses and whether or not the user uses a mobile device. To put it simply, it is the record containing the minimum value that we want. As SQL lacks discreteness, it’s still very complicated to express the query by applying the KEEP function to every field or by using a subquery and traversing the data set twice.
Now we can use an aggregate function that directly returns the record containing the maximum or minimum value to express the query. The query is simple and needs only one traversal:
=LogTable.select(user=?).minp(logintime)
As mentioned previously, we can use this type of aggregation functions with a group function, like finding the log records of the first logins for all users:
=LogTable.group(user).(~.minp(logintime))
Similarly, we can use the maxp function to get a record containing the maximum value.
Usually the log records are ordered by time in ascending order. This allows us to directly get the first record without the need of calculating the minimum value through comparisons.
=LogTable.select(user=?).first() // The first function returns the first member of a set
Use the aggregate function with the group function:
=LogTable.group(user).(~.first())
Of course in real-life we can get a member of a set without using a special function. The first function is used only to stress that the action of getting a member can be regarded as an aggregate operation.
It’s a common operation to get a certain member from a set. We can create an option to work with the group function to do it:
=LogTable.group@1(user)
As an unordered-set-based language, SQL can’t ensure a certain order for the returned records. To express the query, we need to create sequence numbers for records and then perform a filtering operation.
2.Aggregate calculations that return a set
Now we make a little change to the previous computing task: Find the names of the youngest ones among a group of people.
A user’s multiple logins won’t be at the same time, but among a group of people, some may have same ages, and there may be more than one person who is the youngest. The minp function should return a set.
Now let’s rethink the definition of aggregate operations we give in the beginning and find that it’s unnecessary to stipulate that a single value must be returned. An operation that takes a set as the input parameter is an aggregate operation, no matter what type of value it returns. According to this concept, the minp function and maxp function that return a set can be considered aggregate operations.
A more typical aggregate calculation that returns a set is finding the top N records.
SQL doesn’t treat the action of finding the top N records as an aggregate operation, but makes it a modifier applied to the returned result set. In practice, we need to get a result set containing all records and then retrieve the first N to return. A sorting action is always precedes the action of finding top N, which is time-consuming if the original data set is massive. Practical optimization of the database engine is the only choice, but not always the way out. Treating the action as a modifier to a result set makes it impossible to be applied to a grouping subset. Moreover, optimization becomes harder as queries get more and more complicated. Actually it’s not necessary to sort the whole set to get the top N records.
The query is a piece of cake if we think finding the top N as an aggregate action:
=a=LogTable.select(user=?).top(logingtime,-2), a(2)-a(1) // The difference between a given user’s last two login times
=LogTable.groups(user;(a=~.top(logintime,-2),a(2)-a(1))) // The difference between each user’s last two login times
Database optimization isn’t needed and the computation is highly performant even applying the topN function to the grouping subset.
The topN function may return a record, that is, to get a record where the value of a given field (or an expression) is at the Nth position counting from the beginning. Like minp or maxp, a function needs to be invented to do this.
A data set may be already ordered, like the log file in the previous section. Now that the log table is ordered by time in ascending order, there is no need to make comparisons using topN function.
=a=LogTable.select(user=?).last(2),a(2)-a(1) // The last(n) function returns the last n records
=LogTable.groups(user;(a=~.last(2),a(2)-a(1)))
The goal of the aggregate function can be phrased by directly getting members from a set.
These are two typical scenarios of the unconventional aggregate operations. Both are difficult when handled in SQL. According to our definition of the aggregate operations, there are more scenarios of unconventional aggregations, and even the two scenarios have many variants like finding a member whose rank is in the middle and getting a set consisting of distinct values of a certain field. A deeper understanding about the aggregate operations and their relation with group operation can help widen the range of application scenarios, which has big significance in expressing and implementing queries.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL