SPL: selecting data by segment

 

Selecting data by segment refers to finding the data in a group according to the specified interval. For example, return to assessments (excellent, good, medium, poor) according to the test results; return to the levels of assets (poor, well-off, middle-class, rich, etc.) according to the annual income of the family.

 

1. The segment number of a member in sequence

We first define an interval sequence for the segment. For example, we define the age sequence of [0,18,45,60], which represents minors, youth, middle-ages, and old-ages respectively. Next by calculating the sequence number of the value of the age field in the sequence of the age range, we can get what age range the person is in.

 

The A.pseg() function is provided in SPL, which returns the segment number of a member in sequenceand the default interval is left-open and right-closed. For example, the 17-year-old belongs to the age range of [0,18), so the sequence number returned by the A.pseg()function is 1; whereas the 18-year-old belongs to the age range of [18,45), then the sequence number returned by the A.pseg()function is 2. The A.pseg() function provides the @r option to use left-open and right-closed intervals when calculating the segment numbers. Take the same case for an example, 18-year-old is assigned to the interval of (0,18] by A.pseg() function,so the sequence number returned is 1.

 

In the case of employees, some of the data are as follows:

ID

NAME

ENTRYDATE

STATE

DEPT

SALARY

1

Rebecca

2005/03/11

California

R&D

7000

2

Ashley

2008/03/16

New York

Finance

11000

3

Rachel

2010/12/01

New Mexico

Sales

9000

4

Emily

2006/08/15

Texas

HR

7000

5

Ashley

2004/07/30

Texas

R&D

16000

 

[e.g. 1] Divide the employees into groups of less than 8,000, form 8,000 to 12,000 and more than 12,000 according to their salaries, and count the number of employees in each group.

The SPL script looks like this:

A

1

=T("Employee.csv")

2

[0,8000,12000]

3

=A1.align@a(A2.len(),A2.pseg(SALARY))    

4

=A3.new(A2(#):SALARY,~.count():COUNT)

A1: import employee table.

A2: define the intervals of salaries.

A3: use the A.pseg() function to get the segment numbers of the salaries, and then group them according to the numbers

A4: calculate the number of people in each group.

 

[e.g. 2] Divide the employees into three segments in terms of their working years, 1 to 10 years, 10 to 20 years and 20 years respectively, and count the average salary of each group.

The SPL script looks like this:

A

1

=T("Employee.csv")

2

[0,10,20]

3

=A1.align@a(A2.len(),A2.pseg@r(interval@y(ENTRYDATE,now())))

4

=A3.new(A2(#):EntryYears,~.avg(SALARY):AvgSalary)  

A1: import employee table.

A2: define the intervals of working years.

A3: use the A.pseg() function to get the segment numbers of the working years, where the @r option represents the left-open and right-closed interval. And then group them by the numbers.

A4: calculate the average salary of each group.

 

2. Returning the corresponding member of a sequence based on the segment number

Sometimes we need to get the corresponding member in the set by calculating the sequence number of the field value (or expression) in the segment.

The A.segp() function is provided in SPL, which returns the corresponding member in the sequence based on the segment number.

 

[e.g. 3] According to the score table, count the number of excellent, pass and fail students in the English subject. Some of the data are as follows:

Class

StudentID

Subject

Score

1

1

English

95

1

1

Math

90

1

1

PE

80

1

2

English

75

1

2

Math

84

 

The SPL script looks like this:

A

1

=T("Scores.csv").select(SUBJECT:"English")

2

=create(Assessment,Score).record(["Fail",0,"Pass",60,"Excellent",90])

3

=A1.derive(A2.segp(Score,SCORE).Assessment:Assessment)

4

=A3.groups(Assessment;count(1):Count)

A1: import the score table and select the data of the English subject.

A2: create a check table of score and assessment:

Assessment

Score

Fail

0

Pass

60

Excellent

90

A3: use the A.segp() function to obtain the corresponding assessment according to the segment numbers of the score in the check table.

A4: group and calculate the number by assessment.

Employee.csv

Scores.csv