SPL: grouping and sorting aligned by sequence number
Sometimes we group and sort the data in order of sequence number, grouping members with the same sequence number into the same group. For example, calculate the total sales of each month last year in order from January to December, calculate the numbers of accessors to a website according to the order from Monday to Sunday, and so on.
This kind of grouping operation, which is aligned to a specified criterion, is collectively referred to as alignment grouping. Grouping by sequence alignment is a special case of alignment grouping where the base set is a sequence of integers starting from 1. Alignment groups may have empty groups or members that are not assigned to any of the groups.
1. Sorting by sequence number
Sort the data in the specified sequence order, and retain up to one matching member per group. This is suitable for situations where we want to query or use data in a specified order.
[e.g. 1] Query the sales records in order from Monday to Sunday according to the daily sales table. Some of the data are as follows:
Week |
Day |
Amount |
5 |
Sunday |
1101.2 |
5 |
Saturday |
538.6 |
5 |
Friday |
2142.4 |
5 |
Thursday |
1456.0 |
5 |
Wednesday |
48.0 |
5 |
Tuesday |
1376.0 |
5 |
Monday |
676.0 |
4 |
Sunday |
448.0 |
4 |
Saturday |
4031.0 |
4 |
Friday |
364.8 |
… |
… |
… |
The A.align(n,y) function in SPL is used to align the groups which are directly divided into n groups (from 1 to n), and calculate straight the group number of each member by grouping expression y. By default, each group retains up to one matching member.
The SPL script looks like this:
A |
|
1 |
=T("DailySales.csv") |
2 |
["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"] |
3 |
=A1.group(Week; ~.align(7,A2.pos(Day)):WeekSales) |
4 |
=A3.conj(WeekSales) |
5 |
=A4.select(~) |
A1: query the daily sales table.
A2: define the sequence from Monday to Sunday.
A3: when grouping the sales data by week, use the A.align(n,y) function to align sales records of each week in the order defined by A2. It is important to note here that aligned groups may have empty groups. For example, when no sales were recorded on Friday and Saturday of the second week, there are still seven groups per week:
Week |
[2,Monday,1194.0] |
[2,Tuesday,1622.4] |
[2,Wednesday,319.2] |
[2,Thursday,802.0] |
(null) |
(null) |
[2,Sunday,2123.2] |
A4: concatenate the sorted sales records of each week.
A5: select non-empty records from the result set.
The execution results of A5 are as follows:
Week |
Day |
Amount |
1 |
Monday |
3063.0 |
1 |
Tuesday |
3868.6 |
1 |
Wednesday |
2713.5 |
1 |
Thursday |
1005.9 |
1 |
Friday |
1675.0 |
1 |
Saturday |
400.0 |
1 |
Sunday |
2018.2 |
2 |
Monday |
1194.0 |
2 |
Tuesday |
1622.4 |
2 |
Wednesday |
319.2 |
… |
… |
… |
2. Retaining all matching members of each group
Group the data in a specified sequence order, each retain all matching members. This applies to situations where we care about the information of members in each group, or where we need to continue calculating with these member records.
[e.g. 2] According to the sales table, list the total sales of each month in 2014 in order. Some data of the sales table are as follows:
ID |
CustomerID |
OrderDate |
Amount |
10248 |
VINET |
2013/07/04 |
2440 |
10249 |
TOMSP |
2013/07/05 |
1863.4 |
10250 |
HANAR |
2013/07/08 |
1813.0 |
10251 |
VICTE |
2013/07/08 |
670.8 |
10252 |
SUPRD |
2013/07/09 |
3730.0 |
… |
… |
… |
… |
option @a of the A.align(n,y) function in SPL is used to keep all matching members of each group while aligning the groups.
The SPL script looks like this:
A |
|
1 |
=T("Sales.csv") |
2 |
=A1.select(year(ORDERDATE)==2014) |
3 |
=A2.align@a(12,month(ORDERDATE)) |
4 |
=A3.new(#:Month,~.sum(AMOUNT):AMOUNT) |
A1: query the sales table.
A2: select the records of 2014 from the sales table.
A3: use the A.align@a(n,y) function to sort the months of the order table into 12 groups in order from 1 to 12, with the option @a retaining all matching members of each group.
A4: calculate the total sales of each month.
The execution results of A4 are as follows:
Month |
Amount |
1 |
66692.8 |
2 |
52207.2 |
3 |
39979.9 |
4 |
60699.39 |
… |
… |
3. Overlapped grouping by sequence number
Sometimes the sequence number of groups calculated by each record is more than one, and we want the record to be repeatedly assigned to multiple groups according to the sequence numbers.
[e.g. 3] According to the posting record table, group the posts by label, and count the frequency of each label. Some of the data in the posting record table are as follows:
ID |
Title |
Author |
Label |
1 |
Easy analysis of Excel |
2 |
Excel,ETL,Import,Export |
2 |
Early commute: Easy to pivot excel |
3 |
Excel,Pivot,Python |
3 |
Initial experience of SPL |
1 |
Basics,Introduction |
4 |
Talking about set and reference |
4 |
Set,Reference,Dispersed,SQL |
5 |
Early commute: Better weapon than Python |
4 |
Python,Contrast,Install |
… |
… |
… |
… |
option @r of the A.align(n,y) function in SPL is used to create overlapped groups by the sequence numbers when aligning groups.
The SPL script looks like this:
A |
|
1 |
=T("PostRecord.txt") |
2 |
=A1.conj(Label.split(",")).id() |
3 |
=A1.align@ar(A2.len(),A2.pos(Label.split(","))) |
4 |
=A3.new(A2(#):Label,~.count():Count).sort@z(Count) |
A1: query the posting record table.
A2: separate the labels by comma and concatenate them into a sequence to get all the labels with no overlap values.
A3: use the A.align@r(n,y) function to group the posts by their labels’ position in all labels with the option @r.
A4: calculate the number of posts per label,and sort them in descending order.
The execution results of A4 are as follows:
Label |
Count |
SPL |
7 |
Excel |
6 |
Basics |
5 |
… |
… |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL