3.1 Equivalence grouping
2.5 Loop to generate and execute
1. Grouping & single aggregation on single column - Calculate the number of employees in each department
2. Group & COUNT DISTINCT - Calculate how many states the employees in each department are from
3. Grouping & multiple aggregations on single column - Calculate the number of employees, the average salary and the maximum age of each department
4. Grouping & single aggregation on multiple columns - Calculate the average salary of employees in each department and each state
5. Grouping & multiple aggregations on multiple columns - Calculate the average salary, maximum salary and average age of employees in each department and each state
6. Grouping & no aggregation - Find employees with the same birthday (excluding the year)
7. Grouping & complex aggregation - Calculate the salary difference between the oldest and youngest employees in each department
SPL
A | B | |
---|---|---|
1 | =file(“EMPLOYEE.csv”).import@tc() | |
2 | =A1.groups(DEPT;count(1):emp_no) | /Grouping & single aggregation on single column |
3 | =A1.groups(DEPT;icount(STATE)) | /Group & COUNT DISTINCT |
4 | =A1.groups(DEPT;count(1):emp_no,avg(SALARY):avg_salary,max(age(BIRTHDAY)):max_age) | /Grouping & multiple aggregations on single column |
5 | =A1.groups(DEPT,STATE;avg(SALARY):avg_salary) | /Grouping & single aggregation on multiple columns |
6 | =A1.groups(DEPT,STATE;avg(SALARY):avg_salary,max(SALARY):max_salary,avg(age(BIRTHDAY)):avg_age) | /Grouping & multiple aggregations on multiple columns |
7 | =A1.group(month(BIRTHDAY),day(BIRTHDAY)).select(~.len()>1).conj() | /Grouping & no aggregation |
8 | =A1.group(DEPT;(ma=~.minp(BIRTHDAY),mi=~.maxp(BIRTHDAY), ma.SALARY-mi.SALARY):SALARY_DIF) | /Grouping & complex aggregation |
SQL
1. Grouping & single aggregation on single column
SELECT DEPT, COUNT(*) AS NUM_EMPLOYEES
FROM EMPLOYEE
GROUP BY DEPT;
2. Group & COUNT DISTINCT
SELECT DEPT, COUNT(DISTINCT STATE) AS NUM_STATES
FROM EMPLOYEE
GROUP BY DEPT;
3. Grouping & multiple aggregations on single column
SELECT DEPT, COUNT(*) AS NUM_EMPLOYEES, AVG(SALARY) AS AVG_SALARY,
MAX(TRUNC(MONTHS_BETWEEN(SYSDATE, BIRTHDAY)/12)) AS MAX_AGE
FROM EMPLOYEE
GROUP BY DEPT;
4. Grouping & single aggregation on multiple columns
SELECT DEPT, STATE, AVG(SALARY) AS AVG_SALARY
FROM EMPLOYEE
GROUP BY DEPT, STATE;
5. Grouping & multiple aggregations on multiple columns
SELECT DEPT, STATE, AVG(SALARY) AS AVG_SALARY, MAX(SALARY) AS MAX_SALARY,
AVG(TRUNC(MONTHS_BETWEEN(SYSDATE, BIRTHDAY)/12)) AS AVG_AGE
FROM EMPLOYEE
GROUP BY DEPT, STATE;
6. Grouping & no aggregation
SQL doesnt have ‘Grouping & no aggregation’, so it has to adopt an indirect method.
7. Grouping & complex aggregation
WITH MAX_AGE_EMPLOYEE AS (
SELECT DEPT, BIRTHDAY, SALARY
FROM (
SELECT DEPT,BIRTHDAY,SALARY,
ROW_NUMBER() OVER (PARTITION BY DEPT ORDER BY BIRTHDAY ASC) AS RN
FROM EMPLOYEE)
WHERE RN = 1),
MIN_AGE_EMPLOYEE AS (
SELECT DEPT, BIRTHDAY, SALARY
FROM (
SELECT DEPT,BIRTHDAY,SALARY,
ROW_NUMBER() OVER (PARTITION BY DEPT ORDER BY BIRTHDAY DESC) AS RN
FROM EMPLOYEE)
WHERE RN = 1)
SELECT MAX_AGE_EMPLOYEE.DEPT,
MAX_AGE_EMPLOYEE. SALARY - MIN_AGE_EMPLOYEE.SALARY AS SALARY_DIFF
FROM MAX_AGE_EMPLOYEE
JOIN MIN_AGE_EMPLOYEE ON MAX_AGE_EMPLOYEE.DEPT = MIN_AGE_EMPLOYEE.DEPT;
Python
emp = pd.read_csv('../EMPLOYEE.csv')
# Grouping & single aggregation on single column
dept_counts = emp['DEPT'].value_counts()
#Group & COUNT DISTINCT
dept_state_counts = emp.groupby('DEPT')['STATE'].nunique()
#Grouping & multiple aggregations on single column
dept_stats = emp.groupby('DEPT').agg(
count=('DEPT', 'count'),
avg_salary=('SALARY', 'mean'),
max_age=('BIRTHDAY', lambda x:
(pd.Timestamp('now') - pd.to_datetime(x)).astype('<m8[Y]').max()))
#Grouping & single aggregation on multiple columns
dept_state_salary = emp.groupby(['DEPT', 'STATE'])['SALARY'].mean()
#Grouping & multiple aggregations on multiple columns
dept_state_stats = emp.groupby(['DEPT', 'STATE']).agg(
avg_salary=('SALARY', 'mean'),
max_salary=('SALARY', 'max'),
avg_age=('BIRTHDAY', lambda x:
(pd.Timestamp('now') - pd.to_datetime(x)).astype('<m8[Y]').mean()))
birthday_md=pd.to_datetime(emp['BIRTHDAY']).dt.strftime('%m-%d').sort_values()
#Grouping&noaggregation
emp_same_birthday=emp.loc[birthday_md.index]
[emp.groupby(birthday_md).BIRTHDAY.transform('count')>1]
#Grouping&complexaggregation
def salary_diff(g):
max_age = g['BIRTHDAY'].idxmin()
min_age = g['BIRTHDAY'].idxmax()
diff = g.loc[max_age]['SALARY']-g.loc[min_age]['SALARY']
return diff
emp['BIRTHDAY']=pd.to_datetime(emp['BIRTHDAY'])
salary_diff=emp.groupby('DEPT').apply(salary_diff)
3.2 Non-equivalence grouping
Example codes for comparing SPL, SQL, and Python
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL