"Grouping often requires the result set to appear in the order of a criterion set, and this kind .."

pjf RaqForum 22 No.
1 Reply • 478 View • 3 Years ago

SPL: grouping and sorting aligned by a specified criterion

Grouping often requires the result set to appear in the order of a criterion set, and this kind of alignment groups are quite common in everyday statistics. For example, according to the order of Beijing, Shanghai, Guangzhou, and Shenzhen, calculate the total sales of a company in these cities; according to the specified department order, query the average salary of each department, and so on.

This kind of grouping is called alignment grouping. Aligned groups may have empty groups or members that are not assigned to any of the groups.

1. Sorting by the specified criterion set

The data is sorted in the order of the fields specified in the criterion table, and every group will retain up to one matching member per group. It is applicable to situations where we want to query or use the data in a specified order.

[e.g. 1] According to the GDP table of Chinese cities in a given year, query the GDP and population of these first-tier cities in the order of Beijing, Shanghai, Guangzhou and Shenzhen. Some of the data are as follows:

ID	CITY	GDP	POPULATION
1	Shanghai	32679	2418
2	Beijing	30320	2171
3	Shenzhen	24691	1253
4	Guangzhou	23000	1450
5	Chongqing	20363	3372
…	…	…	…

The A.align() function in SPL is used to align the groups and retain up to one matching member per group by default.

The SPL script looks like this:

	A
1	=T("CityGDP.txt")
2	["Beijing","Shanghai","Guangzhou","Shenzhen"]
3	=A1.align(A2,CITY)
4	=A3.new(CITY,GDP,POPULATION)

A1: query the city GDP table.

A2: define the sequence of cities.

A3: use the A.align() function to sort the city GDP table by a specified sequence of cities, and retain up to one matching member per group.

A4: create a result table with cities, GDP, and population as fields.

The execution results of A4 are as follows:

CITY	GDP	POPULATION
Beijing	30320	2171
Shanghai	32679	2418
Guangzhou	23000	1450
Shenzhen	24691	1253

2. Retaining all matching members in each group

The data is grouped in the order of the fields specified in the criterion table, and each group retain all matching members. This is applicable to situations where we care about the information of members in each group, or where we need to continue calculating with these member records.

[e.g. 2] According to the interrelated employee table and department table, the number of each department is calculated in the order of the departments in the department table. The relationship between the employee table and the department table is as follows:

The option @a of A.align() function in SPL is used to keep all matching members in each group while aligning the groups.

The SPL script looks like this:

	A
1	=connect("db")
2	=A1.query("select * from EMPLOYEE")
3	=A1.query("select * from DEPARTMENT")
4	=A2.align@a(A3:DEPT, DEPT)
5	=A3.new(DEPT, A4(#).count():COUNT)

A1: connect to the database.

A2: query the employee table.

A3: query the department table.

A4: use the A.align@a() function to align employees by department, with the option @a returning all matching members of each group.

A5: create a result sequence based on the department table and calculate the number of each group, which is the number of people in each department, based on the results of the A4 alignment grouping.

The execution results of A5 are as follows:

DEPT	COUNT
Administration	4
Finance	24
HR	19
…	…

The alignment grouping may have empty groups, which means that none of the member is assigned to the subgroup.

3. Placing the mismatching records in a new group

The data is grouped in the order of the fields specified in the criterion table, and create a new group for mismatching records. This is applicable to situations where we care not only about the information of matching member, but also about other mismatching records.

[e.g. 4] According to the employee table, calculate the average wages of [California, Texas, New York, Florida], with other states as a “Other” group. Some data of the employee table are as follows:

ID	NAME	STATE	DEPT	SALARY
1	Rebecca	California	R&D	7000
2	Ashley	New York	Finance	11000
3	Rachel	New Mexico	Sales	9000
4	Emily	Texas	HR	7000
5	Ashley	Texas	R&D	16000
…	…	…	…	…

The option @n of the A.align() function in SPL is used to put the mismatching records into a new group while aligning the group.

The SPL script looks like this:

	A
1	=T("Employee.csv")
2	[California,Texas,New York,Florida]
3	=A1.align@an(A2,STATE)
4	=A3.new(if (#>A2.len(),"Other",STATE):STATE,~.avg(SALARY):AvgSalary)

A1: query the employee table.

A2: create a sequence of region.

A3: use the A.align@an() function to group employee tables by region, with the option @a returning all matching members of each group, and the option @n saving the mismatching members in the new group.

A4: calculate the average salary of each group, and name the mismatching group (the last group) as “Other”.

The execution results of A4 are as follows: