SPL: TopN and TopN in group

 

TopN query is to look up the first Ns / last Ns from the data. TopN query can not only directly get a value, but also inquire the detailed information of the record where TopN is located, and sometimes it also looks for the row number of TopN. Additionally, TopN query may also be used in groups to query the first Ns / last Ns within the group.

 

1. TopN

We can divide taking the first Ns / last Ns into three types of requirements to describe them in detail, respectively taking their values, taking their row numbers and taking their records.

In the case of the NASDAQ, some of the data are as follows:

Date

Open

Close

Amount

2019/01/02

6506.910156

6665.939941

2261800000

2019/01/03

6584.77002

6463.5

2607290000

2019/01/04

6567.140137

6738.859863

2579550000

2019/01/07

6757.529785

6823.470215

2507550000

2019/01/08

6893.439941

6897.0

2380290000

 

1.1 Taking the first Ns / last Ns values

Check out the top three volume values of the NASDAQ Index in 2019.

The SPL script looks like this:

A

1

=T("IXIC.txt")

2

=A1.select(year(Date)==2019)

3

=A2.top(-3,Amount)

A1: import NASDAQ index data

A2: select the data of 2019

A3: employ the A.top(n,x) function to get the three highest volume values. If n is a positive number, take the first Ns; if N is a negative number, take the last Ns. In particular cases, if n is ±1, return to the single value,which is similar to taking a maximum/minimum value.

 

We can also look up the four lowest volume values of the NASDAQ in 2019:

A

3

=A2.top(4,Amount)

A3: employ the A.top(n,x) function to get the four lowest volume values

 

1.2 Taking the row numbers of first Ns / last Ns

In an order-related set, we can operate inter-row calculations by taking the row numbers of the first N / last N members.

 

[e.g. 2] Inquire the increasing rate in trading volumes relative to the previous days of the top three highest closing prices of NASDAQ in 2019.

The SPL script looks like this:

A

1

=T("IXIC.txt")

2

=A1.select(year(Date)==2019).sort(Date)

3

=A2.ptop(-3,Close)

4

=A3.(A2(~).Amount/A2(~-1).Amount-1)

A1: import NASDAQ data.

A2: select the data of 2019 and sort them by date.

A3: employ the A.top(n,x) function to take the row numbers of the top three closing prices.

A4: based on the selected row number, calculate the the increasing rate by comparing the trading volumes of current day with those of the previous day.

 

1.3 Taking the records of first Ns / last Ns

Sometimes, we don’t care about what the specific values of first Ns / last Ns are, we care more about the records where the values are. For example, search the names of the top three students in a math final exam, the top five sales customers’ names in 2020, and so on.

 

[e.g. 3] Check the trading volumes of the lowest closing prices of NASDAQ index in 2019.

The SPL script looks like this:

A

1

=T("IXIC.txt")

2

=A1.select(year(Date)==2019)

3

=A2.top(5;Close)

4

=A3.new(Date,Amount)

A1: import NASDAQ data.

A2: select data of 2019

A3: employ the A.top(n,x) function to take records of the five trading days with the lowest closing prices.

A4: take the dates and volumes from the records of the five days.

 

2. TopN in group

The query of first Ns / last Ns in group is a very common requirement. For example, find out what the top two math scores are in each class, what the top three single sales customers are for each month, and so on. In this section, we’ll break down how to solve the problems of using TopN in group.

2.1 performing TopN after grouping

We can regard the TopN query as a kind of aggregation operation. First, the data is grouped according to certain conditions, and then the TopN query is performed on the result sets of each group. Let’s talk about it in terms of values and records respectively.

Take the student score table as an example. Some of the data are as follows:

Class

StudentID

Subject

Score

1

1

English

95

1

1

Math

90

1

1

PE

80

1

2

English

75

1

2

Math

84

 

[e.g. 4] Check the scores of the top two students in each class.

The SPL script looks like this:

A

1

=T("Score.txt")

2

=A1.select(Subject:"Math")

3

=A2.group(Class;~.top(-2,Score):TOP2)

4

=A3.new(Class,TOP2(1):First,TOP2(2):Second)

A1: import score table.

A2: select the math grades.

A3: group by class, and use the A.top() function to count the top two math scores in each class.

A4: create the result table. The first column is the class, the second column is the first student, and the third column is the second student.

 

[e.g. 5] Check the information about the top three students of each subject in every class.

The SPL script looks like this:

A

1

=T("Score.txt")

2

=A1.group(Class,Subject;~.top(-3;Score):TOP3)

3

=A2.conj(TOP3)

A1: import score table.

A2: divide them into groups by class and subject and take the records of the top three scores in each group.

A3: concatenate the records of the top two students of each subject in all classes.

 

2.2 TopN operation by accumulation

To perform TopN operation using accumulation does not produce the result set of a grouping, which is often used when the amount of data is big. Let’s nevertheless talk about it in terms of values and records.

Take the sales table for example. Some of the data are as follows:

OrderID

Customer

OrderDate

SellerId

Amount

81182311

VINET

2013/07/04

5

2440.0

98807954

TOMSP

2013/07/05

6

1863.4

65550721

HANAR

2013/07/08

4

1813.0

37311312

VICTE

2013/07/08

3

670.8

80138612

SUPRD

2013/07/09

4

3730.0

 

[e.g. 6] Check the top two sales amounts of each month in 2014.

The SPL script looks like this:

A

1

=file("Sales.txt").cursor@t().select(year(OrderDate)==2014)

2

=A1.groups(month(OrderDate):Month;top(-2,Amount):TOP2)

3

=A2.news(TOP2;Month,~:Amount)

A1: generate cursors for the sales table and select the data of 2014.

A2: group them by month and get the top two sales amounts of each month.

A3: create the results table, the first column is the month and the second one is the amount.

 

[e.g. 7] calculate the sales records of the top three sales in each month of 2014.

The SPL script looks like this:

A

1

=file("Sales.txt").cursor@t().select(year(OrderDate)==2014)

2

=A1.groups(month(OrderDate);top(-3;Amount):TOP3)

3

=A2.conj(TOP3)

A1: generate cursors for the sales table and select the data of 2014.

A2: group them by month and get the records of the top three sales amounts per month.

A3: concatenate the records of the top three sales amounts of each month.

IXIC.txt

Score.txt

Sales.txt