Order-based Group Operations (I)
Founded on the mathematical concept of unordered sets, SQL doesn’t emphasize the order of members when grouping a data set. Examples include equi-grouping and non-equi grouping, where the grouping rule is directly related to the member values. But when SQL application scenarios are extended to cover a variety of operations on ordered sets, the order signifies. In the real-world business practices, there are a lot of scenarios requiring order-based group operations.
- Sequence-number-based grouping
One scenario is to evenly divide the students in a class into three groups (Suppose the number of students is dividable by 3). According to the definition of group operations, this is one of the instances. It is difficult to do this in SQL because the grouping rule has nothing to do with the member values.
Yet it is easy to handle it using the hash sign“#” (which represents a sequence number) introduced in dealing with the order-based traversal syntax.
A.group((#-1)*3\A.len() ) // Group the data set into three parts – the first 1/3, the second 1/3, and the last 1/3 – by sequence numbers
Or A.group((#-1)%3 ) // Get one in every three members according to sequence numbers to form a set of grouping subsets
SQL expresses the computation in a roundabout way. First it needs to create sequence numbers with a subquery and then groups the data set with similar ways.
The instance illustrates the role sequence numbers play, but doesn’t show the deciding effect the order can have. A change of the order of members will still get a workable result.
Here’re two more examples.
When comes to the processing of textual log files, it is possible that the information unit is three lines instead one, which means each event covers three lines of text. This isn’t rare in practical scenarios. With this type of log files, we need to split a file into subsets by grouping data every three lines before performing specific analysis and processing over every group. In this case, the order of members in the to-be-grouped set (lines in a log file) is crucial to performing the group operation.
In an effort to ensure the same average rank, a school wants to divide students into two classes according to their results of the entrance examination. The way is this: for every four students, put the first and the last students into one class and the second and third students into the other class, and so on. The result would be 1,4,5,8,… and 2,3,6,7,…. We can handle this with sequence numbers:
A.sort@z(score).group(#%4<2)
Here the grouping value isn’t a number but a boolean value. Students are placed into two groups according to “true” and “false”. A student is put into class X if their rank makes the value of the expression true, and class Y if the value is false. In essence, this is an equi-grouping where the grouping value can be of any generic data type.
Here the order of members in the to-be-grouped set determines the grouping result.
By the way, this is an instance that cares about the grouping subsets rather than the aggregate values.
The sequence-number-based grouping, in most cases, is to get the grouping rule through the sequence numbers and then perform an equi-grouping. But for the following two scenarios, it’s not easy to perform an equi-grouping.
2. Grouping by the value change
There are a set of birth records ordered by birth date. Now we want to find out how many groups where the number of baby girls or baby boys who were born continuously is greater than 5.
The algorithm we naturally think of is to group the records, count the numbers of records in every group and find the groups over which the COUNT result is greater than 5. The last two steps are simple. The real issue lies in the group operation.
It’s incorrect to group the records directly by gender. The order needs to be taken into consideration. Scan the records in order while starting a new group as the gender changes. Obviously this type of grouping can’t be handled directly with the equi-grouping.
An order-based grouping method can be used to handle this – form a new group if the value of the grouping field is changed.
A.group@o(gender).count(~.len()>5) // @o option enables the creation of a new group whenever the value of the grouping field is changed
It’s complicated to do this in SQL. We need to create tags and a variable to generate sequence numbers for the groups. The following is the SQL way:
SELECT COUNT(*) FROM
(SELECT ChangeNumber FROM
(SELECT SUM(ChangeFlag) OVER (ORDER BY birthday) ChangeNumber FROM
(SELECT CASE WHEN gender=LAG(gender) OVER (ORDER BY birthday) THEN 0 ELSE 1 END ChangeFlag FROM A))
GROUP ChangeNumber HAVING COUNT(*)>5)
The code is difficult to understand, and uses the birthday field to generate a certain order. This is unnecessary for the ordered original data if we use the order-based grouping syntax.
Also, we may encounter the issue in the text analysis. It is likely that the event log of every user covers a number of lines and the number is indefinite. Usually each line is headed by a user ID, according to which we can group the lines. A change of the ID means a new group starts.
3. Grouping by the change of the condition
Here’s a familiar computation: find the largest number of the consecutive rising days for a given stock.
One solution is to perform a traversal. But an order-based grouping is the only choice in a SQL system (Strictly speaking, this is so far the simplest and most feasible option).
Sort the closing prices of the specified stock by date, put the continuously rising dates into the same group, and find the group with the most members and count them. To say more specifically, we put a date with a rising price into the current group and one with a declining price into a new group.
To phrase the algorithm in SQL, we still need to create tags and a variable to generate sequence numbers for the groups.
SELECT MAX(ContinuousDays) FROM
(SELECT COUNT(*) ContinuousDays FROM
(SELECT SUM(RisingFlag) OVER (ORDER BY TradingDate) NoRisingDays FROM
(SELECT TradingDate,
CASE WHEN ClosingPrice>LAG(ClosingPrice) OVER (ORDER BY TradingDate THEN 0 ELSE 1 END) RisingFlag
FROM A))
GROUP BY NoRisingDays)
Yet with a special order-based grouping method and the order-based traversal syntax, the computation becomes easy:
A.sort(TradingDate).group@i(ClosingPrice<ClosingPrice[-1]).max(~.len()) //@i option enables the creation of a new group if the specified condition is true
This line of code, unlike the SQL code that is a multi-nested statement, is written in a step-wise fashion to express the same algorithm and thus is easy to write and understand.
Similar computations also take place in text analysis. In a log file with each event covering an indefinite number of lines, sometimes there is a tag at the beginning of each event. In the process of scanning, a new group will be created for each different tag by matching the current line with the specified string, making sure all lines of one event are placed in the same group.
Theoretically, the computations in both the above two scenarios can be transformed into an equi-grouping operation (It is the SQL way of handling them, which is a proof of SQL’s complete computing abilities). The process is rather complicated, and generally, we won’t adopt the way of performing an equi-grouping.
So far all the proposed methods of handling group operations work on the assumptions that the to-be-grouped set is already in place and that its members can be directly accessed anytime. For computations over a huge amount of data that can’t be entirely read into the memory, such assumptions will lead to frequent data retrieval from an external storage device, causing extremely poor performance. To address the issue, we need a different computing system able to perform grouping and aggregation while reading data in in the stream style. In real-world analysis, big log files are common and the above methods still apply in the analysis of them, which we’ll discuss in later articles.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL