Comparison of SQL & SPL: Order-based Grouping
Sometimes the order of members in an ordered data set is important, such as the scenario where each record is compared with its next neighbor and put them into the same group if they have same value for a specific field or if they satisfy the same condition(s). Examples include finding the maximum frequency of consecutive winning of the champion title and the largest number of days when temperature rises continuously in Beijing. They involve the order-based grouping. This essay explains the solutions and basic principles of SQL and SPL, two commonly used programming languages, for handling the order-based grouping scenarios, and offers the simpler and more efficient way through SQL and SPL sample programs. Looking Comparison of SQL & SPL: Order-based Grouping for details.
【Example 1】Count the largest number of days when SSE Composite Index closing prices rise consecutively in the year 2020 (when the index in the first transaction date rises). Below is part of the source table:
DATE |
CLOSE |
OPEN |
VOLUME |
AMOUNT |
2020/01/02 |
3085.1976 |
3066.3357 |
292470208 |
3.27197122606E11 |
2020/01/03 |
3083.7858 |
3089.022 |
261496667 |
2.89991708382E11 |
2020/01/06 |
3083.4083 |
3070.9088 |
312575842 |
3.31182549906E11 |
2020/01/07 |
3104.8015 |
3085.4882 |
276583111 |
2.88159227657E11 |
2020/01/08 |
3066.8925 |
3094.2389 |
297872553 |
3.06517394459E11 |
… |
… |
… |
… |
… |
SQL solution:
It is not a breeze to accomplish. The basic logic is this – divide records ordered by date into multiple groups, during which those having continuously rising indexes are put into one group. That is, if the closing price in the current date rises, put the corresponding record to the group that contains the previous record; and if it decreases, put the record into a new group. Then get the largest number of members among the groups, which is our target.
The grouping task is special because order of records plays a role here. SQL, which supports the equi-grouping only, needs to convert the order-based grouping into an ordinary equi-grouping for further handling. The process is this:
(1) Sort records by date and get the previous closing price for each record using a window function;
(2) Compare the current closing price with the previous one, make the result as 0 if the price rises and as 1 if the price falls;
(3) Cumulatively sum the marks before the current record, the cumulative sum is like 0,0,1,1,1,1,2,2,3,3,3…, which are the expected group numbers;
(4) Now we can perform an ordinary equi-grouping operation in SQL.
SQL queries are as follows:
SELECT
MAX(ContinuousDays) MAX_DAYS
FROM (
SELECT COUNT(*) ContinuousDays
FROM (
SELECT SUM(RisingFlag) OVER (ORDER BY "DATE") NoRisingDays
FROM (
SELECT
"DATE", CASE WHEN CLOSE>
LAG(CLOSE) OVER (ORDER BY "DATE")
THEN 0 ELSE 1 END RisingFlag
FROM SSEC
WHERE EXTRACT (YEAR FROM "DATE")=2020
)
)
GROUP BY NoRisingDays
)
SPL solution:
SPL A.group() function uses @i option to create a new group whenever the condition changes.
A |
|
1 |
=T("SSEC.csv") |
2 |
=A1.select(year(DATE)==2020).sort(DATE) |
3 |
=A2.group@i(CLOSE<CLOSE[-1]) |
4 |
=A3.max(~.len()) |
A1: Import SSEC table.
A2: Get records of the year 2020 and sort them by date in ascending order.
A3: Create a new group when the current closing price is lower than the previous one.
A4: Get the largest number of consecutive rising dates.
SPL is outstandingly concise while SQL is rather roundabout. There are two reasons for this. SQL does not support the order-based grouping while SPL does; SQL cannot retain the post-grouping subsets and thus perform further operations on them while SPL can.
【Example 2】According to the following Olympic medal table, get the country that holds the champion title for the longest time in terms of the total number of medals and the related medal information. Below is part of the source table:
Game |
Nation |
Gold |
Silver |
Copper |
30 |
USA |
46 |
29 |
29 |
30 |
China |
38 |
27 |
23 |
30 |
UK |
29 |
17 |
19 |
30 |
Russia |
24 |
26 |
32 |
30 |
Korea |
13 |
8 |
7 |
… |
… |
… |
… |
… |
SQL solution:
It is the post-groping subsets and the order-based grouping that need to be retained and achieved in this case. But SQL does not support both, and it is complicated to accomplish them. Marks are needed to index the group before equi-grouping can be used. We get the champion for each Olympic game according to the medal counts, group records by nation in their original order, get the nation having the most number of records, select the first game the champion title was obtained and the how many games the title was held, and finally, get the corresponding records according to the selected information from the source table. SQL queries are as follows:
with cte1 as
(select GAME, max(NATION) keep (dense_rank first order by 1000000*GOLD+1000*SILVER+COPPER desc) NATION
from OLYMPIC
group by GAME),
cte2 as
(select min(NATION) NATION, min(GAME) GAMESTART, count(*) GAMECOUNT
from
(select GAME, NATION, row_number() over(ORDER by GAME) rowno, ROW_NUMBER()OVER(PARTITION BY NATION ORDER BY GAME) groupno
from cte1)
group by NATION, groupno-rowno),
cte3 as
(select min(NATION) keep (dense_rank first order by GAMECOUNT desc) NATION, min(GAMESTART) keep (dense_rank first order by GAMECOUNT desc) GAMESTART, max(GAMECOUNT) GAMECOUNT
from cte2
select t1.GAME,t1.NATION,t1.GOLD,t1.SILVER,t1.COPPER
from OLYMPIC t1
right join cte3
on t1.nation=cte3.nation and t1.game>=cte3.GAMESTART and t1.game<(cte3.GAMESTART+cte3.GAMECOUNT)
The SQL code is hard to understand. As the code is written in Oracle, the analytic function KEEP is used to get the maximum or minimum value from a subset. If any of the other databases is used, an additional query is needed to get the record containing the maximum or minimum value.
SPL solution:
SPL A.group() function works with @o function to create a new group whenever the field value is changed.
A |
|
1 |
=T("Olympic.txt") |
2 |
=A1.sort@z(GAME,1000000*GOLD+1000*SILVER+COPPER) |
3 |
=A2.group@o1(GAME) |
4 |
=A3.group@o(NATION) |
5 |
=A4.maxp(~.len()) |
A1: Import the Olympic table.
A2: Sort A1’s records by game and total number of medals in descending order.
A3: Get the champion in each game; since the records are already ordered, the first record is what we need.
A4: Create a new group when a new nation appears.
A5: Get the group containing the most members, which is the nation that holds the champion for the longest time.
The SPL script is much simpler than the SQL script because SPL’s grouping operation can keep the post-grouping subsets and perform the order-based grouping operation on the subsets. The logic for solving the problem is clear and SPL just expresses the logic naturally.
In the cases where the sequence numbers (a member should be given to which group) of subsets can be obtained directly or indirectly, we can group the records according to the sequence numbers.
【Example 3】Divide the following employee table evenly into three groups (put those with a remainder into a new group) and calculate the average salary in each group. Below is part of the source table:
ID |
NAME |
BIRTHDAY |
ENTRYDATE |
DEPT |
SALARY |
1 |
Rebecca |
1974/11/20 |
2005/03/11 |
R&D |
7000 |
2 |
Ashley |
1980/07/19 |
2008/03/16 |
Finance |
11000 |
3 |
Rachel |
1970/12/17 |
2010/12/01 |
Sales |
9000 |
4 |
Emily |
1985/03/07 |
2006/08/15 |
HR |
7000 |
5 |
Ashley |
1975/05/13 |
2004/07/30 |
R&D |
16000 |
… |
… |
… |
… |
… |
… |
SQL solution:
SQL’s solution is this. Group employee records by the order of hire dates. The first one-third of the records is put into a group, the second one-third is put into the second group, and the last one-third is put into the third group. This is one example of data grouping by sequence numbers, which puts members having same sequence numbers into same group. As SQL does not support the grouping by sequence number, we can first obtain the sequence numbers, perform equi-grouping by them, and then sort groups by them. Below are SQL queries:
with cte1 as(
select count(*) COUNT
from EMPLOYEE
)
select
GROUP_NO, avg(SALARY) AVG_SALARY
from (
select
TRUNC((rn-1)*3/(select COUNT from cte1))+1 GROUP_NO, SALARY
from (
select
SALARY, ROW_NUMBER()OVER(ORDER BY ENTRYDATE) rn
from EMPLOYEE
)
)
group by GROUP_NO
order by GROUP_NO
SPL solution:
SPL A.group()function uses @n option to group records by sequence numbers, putting those having same sequence numbers into same group (records whose sequence numbers are N are put into the Nth group; N starts from 1).
A |
|
1 |
=T("Employee.csv").sort(ENTRYDATE) |
2 |
=A1.group@n((#-1)*3\A1.len()+ 1) |
3 |
=A2.new(#:GROUP_NO,~.avg(SALARY):AVG_SALARY) |
A1: Import Employee table and sort it by entry date.
A2: Get the group number according to the row numbers of sorted records and then perform grouping by group numbers.
A3: Calculate the average salary in each group.
Summary
SQL is not fit to handle order-based grouping operations. The cause is that SQL’s theoretical foundation, the relational algebra, is based on unordered sets. Window functions, however, are patches rather than the solution to address the SQL problem. SPL, however, is based on ordered sets, and is thus naturally good at handling order-based computations. Besides, it offers functions to specifically deal with the order-based grouping operations.
When the query is complicated, the complexity of SQL query increases by multiple times. It involves the use of temporary table and nested query, etc, which makes it harder to write and maintain the SQL query. SPL, however, can compose succinct code step by step according to the natural way of thinking.
The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of grouping functions, which makes the marriage of both Java and SQL advantages. A grouping operation in SPL will thus become simple and easy.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL