Comparison of SQL & SPL: Set-oriented Operations

 

【Abstract】

Set-oriented operations include a set of basic operations, including concatenation, intersection, union, difference, XOR, and inclusion. As structured data often exists in the form of sets, both SQL and SPL, the two commonly used programming languages, provide a wealth of set-oriented methods and functions to handle the data. This essay explores solutions and basic principles of the two languages for handling set-oriented operations through examples and corresponding sample programs and tries to introduce the more convenient and more efficient way for you.

 

1. Concatenation

Concatenation of set A and set B is a set including all members of set A and all members of set B. It contains duplicate members.

Example 1A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to calculate the total sales amount of each product in the year 2014. Below are parts of the two source tables:

SALES_ONLINE

ID

CUSTOMERID

ORDERDATE

PRODUCTID

AMOUNT

1

HANAR

2014/01/01

17

3868.6

2

ERNSH

2014/01/03

42

1005.9

3

LINOD

2014/01/06

64

400.0

4

OTTIK

2014/01/07

10

1194.0

5

OCEAN

2014/01/09

41

319.2

SALES_STORE

ID

CUSTOMERID

ORDERDATE

SELLERID

PRODUCTID

AMOUNT

1

EASTC

2014/01/01

1

27

3063.0

2

ERNSH

2014/01/02

8

70

2713.5

3

MAGAA

2014/01/03

2

74

1675.0

4

SAVEA

2014/01/07

7

45

2018.2

5

FOLIG

2014/01/08

8

30

1622.4

The two tables have different structures. SALES_STORE has SELLERID field for storing IDs of salespeople. But both have ID, CUSTOMERID, ORDERDATE, PRODUCTID and AMOUNT fields.

 

SQL solution:

   select

      PRODUCTID, sum(AMOUNT) AMOUNT

   from

      (select PRODUCTID, AMOUNT

      from SALES_ONLINE

      where extract (year from ORDERDATE)=2014

      union all

      select PRODUCTID, AMOUNT

      from SALES_STORE

      where extract (year from ORDERDATE)=2014)

   group by PRODUCTID

   order by PRODUCTID

 

   SQL UNION operation is used to combine two or more sub result sets of same structure. They need to have same number of columns and each column should have the same data type. UNION ALL combines all records, including the duplicates.

Here we use ORCALE to do the SQL calculations. Since ORACLE does not have YEAR function, we use extract (year from date) to get the year.

 

SPL solution:

   SPL uses the vertical line “|” to calculate concatenation of sets. A|B represents the concatenation of set A and set B.


A

1

=T("SalesOnline.csv").select(year(ORDERDATE)==2014)

2

=T("SalesStore.csv").select(year(ORDERDATE)==2014)

3

=A1|A2

4

=A3.groups(PRODUCTID; sum(AMOUNT):AMOUNT)

A1: Import SalesOnline table from the source file and select sales records of 2014.

A2: Import SalesStore table from the source file and select sales records of 2014.

A3: Use the sign “|” to calculate concatenation of the two sets.

A4: Group A3’s result set by product and calculate the total sales amount.

   

The SPL sequence supports members of different structures, so it does not require that the involved sets have same structure when concatenate them. It can directly access their common fields, like PRODUCTID and AMOUNT in this case, as it accesses an ordinary data table. It’s superbly convenient.

SPL supports retrieving a data table from the database, we can change A1 in the above script as follows:


A

1

=connect("db").query("select * from SALES_STORE where extract (year from ORDERDATE)=2014")

 

Example 2Based on the following scores table, find the top 4 math scores, top 3 English scores and top 2 PE scores. Below is part of the source table:

CLASS

STUDENTID

SUBJECT

SCORE

1

1

English

84

1

1

Math

77

1

1

PE

69

1

2

English

81

1

2

Math

80

 

SQL solution:

   SQL lacks a special method of calculating concatenation of two or more sets. It uses UNION ALL operator to do this. Oracle, for instance, defines an open table, uses ROW_NUMBER() OVER to calculate the rankings of all subjects ordered by scores in descending order, and then get top N for each subject:

   with cte1 as

      (select

         CLASS, STUDENTID, SUBJECT, SCORE,

         ROW_NUMBER() OVER(PARTITION BY SUBJECT

         ORDER BY SCORE DESC) grouprank

      from SCORES)

   select CLASS, STUDENTID, SUBJECT, SCORE 

   from cte1

   where grouprank <= 4 and SUBJECT='Math'

   UNION ALL

   select CLASS, STUDENTID, SUBJECT, SCORE 

   from cte1

   where grouprank <= 3 and SUBJECT='English'

   UNION ALL

   select CLASS, STUDENTID, SUBJECT, SCORE 

   from cte1

   where grouprank <= 2 and SUBJECT='PE'

 

SPL solution:

   SPL uses A.conj() function to calculate the concatenation of member sets when A is a set of sets.


A

1

=T("Scores.csv")

2

[Math,English,PE]

3

[4,3,2]

4

=A3.conj(A1.select(SUBJECT==A2(A3.#)).top(-~;SCORE))

A1: Import student scores table.

A2: Define a set of subjects.

A3: Define a set of ranks corresponding to the set of subjects.

A4: Get top N scores for each specified subject, and use A.conj() function to concatenate multiple sets.

 

As there isn’t a special method of calculating concatenation of sets, we use UNION ALL to combine two sets each time. The code becomes more and more complicated when the number of sets involved increases. SPL has the special function A.conj() to do the job. It can calculate concatenation of any number of sets.

 

2. Intersection

The intersection of set A and set B is a set including all members belonging to both A and B.

 

Example 3A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to find the customers who purchase through both sales channels in the year 2014. Below is part of the source table:

SALES_ONLINE

ID

CUSTOMERID

ORDERDATE

PRODUCTID

AMOUNT

1

HANAR

2014/01/01

17

3868.6

2

ERNSH

2014/01/03

42

1005.9

3

LINOD

2014/01/06

64

400.0

4

OTTIK

2014/01/07

10

1194.0

5

OCEAN

2014/01/09

41

319.2

SALES_STORE

ID

CUSTOMERID

ORDERDATE

SELLERID

PRODUCTID

AMOUNT

1

EASTC

2014/01/01

1

27

3063.0

2

ERNSH

2014/01/02

8

70

2713.5

3

MAGAA

2014/01/03

2

74

1675.0

4

SAVEA

2014/01/07

7

45

2018.2

5

FOLIG

2014/01/08

8

30

1622.4

 

 

SQL solution:

   SQL INTERSECT operator is used to calculate the intersection of two or more sub result sets. The sub result sets must have same data structure and same number of columns, and their columns should contain data of same types. Below is SQL query:

   select

      DISTINCT CUSTOMERID

   from

      SALES_ONLINE

   where

      extract (year from ORDERDATE)=2014

   INTERSECT

   select

      DISTINCT CUSTOMERID

   from

      SALES_STORE

   where

      extract (year from ORDERDATE)=2014

 

Early databases probably do not support INTERSECT operator. So we can first calculate concatenation (UNION ALL) and then perform a filtering to get the target through grouping operation and count operation:

   select

      CUSTOMERID, COUNT(*)

   from

      (select DISTINCT CUSTOMERID

      from SALES_ONLINE

      where extract (year from ORDERDATE)=2014

   UNION ALL

   select DISTINCT CUSTOMERID

   from SALES_STORE

   where extract (year from ORDERDATE)=2014)

   group by

      CUSTOMERID

   HAVING

      COUNT(*)>1

   order by

      CUSTOMERID

 

The additional layer of grouping and aggregation makes the SQL query extremely complicated.

 SPL solution:

   SPL uses the sign “^” to calculate intersection of sets. A^B represents the intersection of set A and set B.


A

1

=T("SalesOnline.csv").select(year(ORDERDATE)==2014)

2

=T("SalesStore.csv").select(year(ORDERDATE)==2014)

3

=A1.id(CUSTOMERID)^A2.id(CUSTOMERID)

A1: Import SalesOnline table from the source file and select sales records of 2014.

A2: Import SalesStore table from the source file and select sales records of 2014.

A3: A.id() function gets the set of unique customer IDs. The sign “^” is used to calculate intersection of the two sets, which are customers who purchase products in both in online and at stores.

 

Example 4Based on the following sales data table, find top 10 customers in terms of monthly sales amount in the year 2014. Below is part of the source table:

ID

CUSTOMERID

ORDERDATE

SELLERID

PRODUCTID

AMOUNT

10400

EASTC

2014/01/01

1

27

3063.0

10401

HANAR

2014/01/01

1

17

3868.6

10402

ERNSH

2014/01/02

8

70

2713.5

10403

ERNSH

2014/01/03

4

42

1005.9

10404

MAGAA

2014/01/03

2

74

1675.0

 

SQL solution:

   with cte1 as

      (select

         extract (month from ORDERDATE) ORDERMONTH,

         CUSTOMERID, SUM(AMOUNT) AMOUNT

      from SALES

      where extract (year from ORDERDATE)=2014

      group by extract (month from ORDERDATE),CUSTOMERID

      order by ORDERMONTH ASC, AMOUNT DESC),

   cte2 as

      (select

         ORDERMONTH,CUSTOMERID,AMOUNT,

         ROW_NUMBER()OVER(PARTITION BY ORDERMONTH ORDER BY AMOUNT DESC) grouprank

      from cte1)

   select

      CUSTOMERID,count(*)

   from cte2

   where grouprank<=10

   group by CUSTOMERID

   having count(*)=12

 

SQL has not a method of calculating intersection of sets. The language uses INTERSECT operator to do the job. It is unsuitable to get top 10 customers in each month and then use INTERSECT operator to get the combination. Here we group records and perform a count operation. If a customer ranking in monthly top 10 for 12 times, it enters in the top 10 in each month.

 

SPL solution:

   SPL uses A.isect() function to calculate intersection of all member sets if A is a set of sets.


A

1

=T("Sales.csv").select(year(ORDERDATE)==2014)

2

=A1.group(month(ORDERDATE))

3

=A2.(~.groups(CUSTOMERID;sum(AMOUNT):AMOUNT))

4

=A3.(~.top(-10;AMOUNT).(CUSTOMERID))

5

=A4.isect()

A1: Import Sales table from the source file and select records of the year 2014.

A2: Group A1’s table by month.

A3: Group records in each month by customer.

A4: Get customers whose monthly sales amounts rank in top 10 in each month.

A5: A.isect() function calculates intersection of all selected customers in all months.

 

   Though we employ a trick to calculate the intersection, the SQL query is still complex. One reason is that SQL does not have a special method to calculate intersection of sets, the other is that SQL cannot retain the post-grouping subsets for a further grouping operation or other computations. The grouping operation will be explained separately later.

 

3. Union

The union of set A and set B is a set including all members of A and members of B, except for the duplicates.

 

Example 5A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to find the products whose total online sales amounts in the year 2014 is above 10,000 or that are purchased over 5 times at stores. Below are parts of the source tables:

SALES_ONLINE

ID

CUSTOMERID

ORDERDATE

PRODUCTID

AMOUNT

1

HANAR

2014/01/01

17

3868.6

2

ERNSH

2014/01/03

42

1005.9

3

LINOD

2014/01/06

64

400.0

4

OTTIK

2014/01/07

10

1194.0

5

OCEAN

2014/01/09

41

319.2

SALES_STORE

ID

CUSTOMERID

ORDERDATE

SELLERID

PRODUCTID

AMOUNT

1

EASTC

2014/01/01

1

27

3063.0

2

ERNSH

2014/01/02

8

70

2713.5

3

MAGAA

2014/01/03

2

74

1675.0

4

SAVEA

2014/01/07

7

45

2018.2

5

FOLIG

2014/01/08

8

30

1622.4

 

SQL solution:

    select

      PRODUCTID

   from

      (select PRODUCTID, sum(AMOUNT)

      from SALES_ONLINE

      where extract (year from ORDERDATE)=2014

      group by PRODUCTID

      having sum(AMOUNT)>10000)

   union

   select

      PRODUCTID

   from

      (select PRODUCTID, count(*)

      from SALES_STORE

      where  extract (year from ORDERDATE)=2014

      group by PRODUCTID

      having count(*)>5)

 

   As we have mentioned, SQL UNION operator is used to combine results sets of two or more SELECT statements. When UNION is used without ALL, the operator removes duplicate records while combining records.

 

SPL solution:

   SPL offers the and sign “&” to calculate the union of sets. A&B represents the union of set A and set B.


A

1

=T("SalesOnline.csv").select(year(ORDERDATE)==2014)

2

=T("SalesStore.csv").select(year(ORDERDATE)==2014)

3

=A1.groups(PRODUCTID; sum(AMOUNT):AMOUNT).select(AMOUNT>10000)

4

=A2.groups(PRODUCTID; count(~):COUNT).select(COUNT>5)

5

=A3.(PRODUCTID)&A4.(PRODUCTID)

A1: Import SalesOnline table from the source file and select records of the year 2014.

A2: Import SalesStore table from the source file and select records of the year 2014.

A3: Group A1’s records by product ID, calculate the total sales amount for each product, and select records where the totals are above 10,000.

A4: Group A2’s records by product ID, count purchase frequencies for each product, and select records where the frequency count is greater than 5.

A5: Use “&” to calculate the union of records purchased both through online and at stores.

 

Example 6Based on the following sales table, find the products whose yearly purchase frequency ranks in top 10. Below is part of the source table:

ID

CUSTOMERID

ORDERDATE

SELLERID

PRODUCTID

AMOUNT

10400

EASTC

2014/01/01

1

27

3063.0

10401

HANAR

2014/01/01

1

17

3868.6

10402

ERNSH

2014/01/02

8

70

2713.5

10403

ERNSH

2014/01/03

4

42

1005.9

10404

MAGAA

2014/01/03

2

74

1675.0

 

SQL solution:

   with cte1 as

      (select

         extract (year from ORDERDATE) ORDERYEAR,

         PRODUCTID, COUNT(*) ORDERCOUNT

      from SALES

      group by extract (year from ORDERDATE),PRODUCTID

      order by ORDERYEAR ASC, ORDERCOUNT DESC),

   cte2 as

      (select

         ORDERYEAR,PRODUCTID,ORDERCOUNT,

         ROW_NUMBER()OVER(PARTITION BY ORDERYEAR ORDER BY ORDERCOUNT DESC) grouprank

      from cte1)

   select

      DISTINCT PRODUCTID

   from cte2

   where grouprank<=10

 

SQL does not offer a special method of calculating union of sets, so we group data by year, calculate the total purchase frequency of each product per year, sort the result records, get rankings through row numbers after sorting, then select IDs of products that rank in top 10 in each year in terms of purchase frequency, and finally, use DISTINCT operator to remove the duplicates to get the union.

 

SPL solution:

   SPL uses A.union() function to calculate union of all member sets when A is a set of sets.


A

1

=T("Sales.csv")

2

=A1.group(year(ORDERDATE))

3

=A2.(~.groups(PRODUCTID;count(~):COUNT))

4

=A3.(~.top(-5;COUNT).(PRODUCTID))

5

=A4.union()

A1: Import Sales table from the source file.

A2: Group A1’s table by year.

A3: Group records of each year by product and calculate their purchase frequency.

A4: Get IDs of products whose yearly purchase frequencies rank in top 10.

A5: A.union() function calculates union of desired records of products in all years.

 

4. Difference

The difference of set A and set B is a set including all members of A that do not belong to set B.

 

Example 7A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to find customers whose total purchase amount at stores is above 1000 but who have not any online purchase records. Below is part of the source table:

SALES_ONLINE

ID

CUSTOMERID

ORDERDATE

PRODUCTID

AMOUNT

1

HANAR

2014/01/01

17

3868.6

2

ERNSH

2014/01/03

42

1005.9

3

LINOD

2014/01/06

64

400.0

4

OTTIK

2014/01/07

10

1194.0

5

OCEAN

2014/01/09

41

319.2

SALES_STORE

ID

CUSTOMERID

ORDERDATE

SELLERID

PRODUCTID

AMOUNT

1

EASTC

2014/01/01

1

27

3063.0

2

ERNSH

2014/01/02

8

70

2713.5

3

MAGAA

2014/01/03

2

74

1675.0

4

SAVEA

2014/01/07

7

45

2018.2

5

FOLIG

2014/01/08

8

30

1622.4

 

 

SQL solution:

    select

      CUSTOMERID

   from

      (select

         CUSTOMERID,count(*)

      from

         SALES_STORE

      where

         extract (year from ORDERDATE)=2014

      group by

         CUSTOMERID

      having

         count(*)>3)

   MINUS

   select

      DISTINCT CUSTOMERID

   from

      SALES_ONLINE

   where

      extract (year from ORDERDATE)=2014

 

   Take Oracle SQL as an example. The MINUS operator (some databases use EXCEPT) is used to calculate difference. MINUS requires that the involved sub result sets must have same data structures and same number of columns, and that corresponding columns should have same or similar data types.

 

SPL solution:

   SPL uses the backslash “\” to calculate difference of sets. A\B represents the difference of set A and set B.


A

1

=T("SalesOnline.csv").select(year(ORDERDATE)==2014)

2

=T("SalesStore.csv").select(year(ORDERDATE)==2014)

3

=A2.groups(CUSTOMERID; count(~):COUNT).select(COUNT>3)

4

=A3.id(CUSTOMERID)\A1.id(CUSTOMERID)

A1: Import SalesOnline table from the source file and select records of the year 2014.

A2: Import SalesStore table from the source file and select records of the year 2014.

A3: Group A2’s records by customer IDs, calculate the total purchase frequency for each customer, and select records where the total frequency is greater than 3.

A4: Use “\” to calculate difference of customers who purchase products online and those who buy things at stores.

 

Example 8Based on the following sales table, find customers whose total sales amounts rank in top 10 only in January in the year 2014. Below is part of the source table:

ID

CUSTOMERID

ORDERDATE

SELLERID

PRODUCTID

AMOUNT

10400

EASTC

2014/01/01

1

27

3063.0

10401

HANAR

2014/01/01

1

17

3868.6

10402

ERNSH

2014/01/02

8

70

2713.5

10403

ERNSH

2014/01/03

4

42

1005.9

10404

MAGAA

2014/01/03

2

74

1675.0

 

SQL solution:

    with cte1 as

      (select

         extract (month from ORDERDATE) ORDERMONTH,

         CUSTOMERID, SUM(AMOUNT) AMOUNT

      from SALES

      where extract (year from ORDERDATE)=2014

      group by extract (month from ORDERDATE),CUSTOMERID

      order by ORDERMONTH ASC, AMOUNT DESC),

   cte2 as

      (select

         ORDERMONTH,CUSTOMERID,AMOUNT,

         ROW_NUMBER()OVER(PARTITION BY ORDERMONTH ORDER BY AMOUNT DESC) grouprank

      from cte1)

   select CUSTOMERID

   from cte2

   where grouprank<=10 and ORDERMONTH=1

   MINUS

   select CUSTOMERID

   from cte2

   where grouprank<=10 and ORDERMONTH<>1

 

SQL does not have a particular method for calculating difference of sets, so we group data by month, calculate the total sales amount of each customer per month, sort the result records, get rankings through row numbers after sorting, select customers of January, and use MINUS operator to calculate difference January’s customers and the set of customers in other months.

 

SPL solution:

   SPL offers A.diff() function to calculate difference of the first member set and all the other member sets when A is a set of sets.


A

1

=T("Sales.csv").select(year(ORDERDATE)==2014)

2

=A1.group(month(ORDERDATE))

3

=A2.(~.groups(CUSTOMERID;sum(AMOUNT):AMOUNT))

4

=A3.(~.top(-10;AMOUNT).(CUSTOMERID))

5

=A4.diff()

A1: Import Sales table from the source file and select records of the year 2014.

A2: Group A1’s records by month.

A3: Group records of each month by customer IDs and calculate total sales of each customer.

A4: Get customers in each month whose total sales amounts rank in top 10.

A5: Use A.diff() function to calculate difference between customers of January and those in the other months.

 

5. XOR

XOR of set A and set B is a set made up members that belong to A or B but not belong to both.

 

Example 9Students’ scores in two semesters are stored in two tables respectively. The task is to find students in class 1 whose total scores rank in top 10 only once in both semesters. Below is part of the source table:

SCORES1_SEMESTER1:

CLASS

STUDENTID

SUBJECT

SCORE

1

1

English

84

           1

1

Math

77

1

1

PE

69

1

2

English

81

1

2

Math

80

SCORES2_SEMESTER2:

CLASS

STUDENTID

SUBJECT

SCORE

1

1

English

97

           1

1

Math

64

1

1

PE

97

1

2

English

56

1

2

Math

82

 

SQL solution:

SQL does not define an operator for performing XOR. There are two methods if we choose to use set operators to do this:

  1. (A UNION B) EXCEPT (A INTERSECT B);

  2. (A EXCEPT B) UNION (B EXCEPT A);

Both methods are not convenient because they use multiple views and thus drag performance down. In the following query, we use FULL OUTER JOIN to counter the disadvantage:

    with cte1 as

      (select STUDENTID,TOTALSCORE

   from

      (select STUDENTID, sum(SCORE) TOTALSCORE

      from SCORES1

      group by STUDENTID

      order by TOTALSCORE DESC)

      where rownum <= 10),

   cte2 as

      (select STUDENTID,TOTALSCORE

      from

         (select STUDENTID, sum(SCORE) TOTALSCORE

         from SCORES2

         group by STUDENTID

         order by TOTALSCORE DESC)

      where rownum <= 10)

   select

      COALESCE(cte1.STUDENTID, cte2.STUDENTID) STUDENTID,

      cte1.TOTALSCORE, cte2.TOTALSCORE

   from cte1

   FULL OUTER JOIN cte2

   ON cte1.STUDENTID=cte2.STUDENTID

   where cte1.TOTALSCORE IS NULL

      OR cte2.TOTALSCORE IS NULL

 

With Oracle SQL, we can use LEFT JOIN UNION RIGHT JOIND to implement FULL JOIN if the involved database is one that does not support FULL JOIN, such as MySQL (Detail query is skipped here).

 

SPL solution:

   SPL uses the percent sign “%” to calculate XOR. A%B represents XOR of set A and set B.


A

1

=T("Scores1.csv")

2

=T("Scores2.csv")

3

=A1.groups(STUDENTID; sum(SCORE):SCORE)

4

=A2.groups(STUDENTID; sum(SCORE):SCORE)

5

=A3.top(-10;SCORE).(STUDENTID)

6

=A4.top(-10;SCORE).(STUDENTID)

7

=A5%A6

A1: Import Scores1 table of from source file.

A2: Import Scores2 table of from source file.

A3: Group A1’s table by student ID and sum scores of each student.

A4: Group A2’s table by student ID and sum scores of each student.

A5: Get IDs of students whose total scores rank in top 10 in semester 1.

A6: Get IDs of students whose total scores rank in top 10 in semester 2.

A7: Use “%” to calculate XOR of student total scores in semester 1 and those in semester 2.

   The SQL query is enormously complicated because there isn’t a particular operator in SQL to calculate XOR. SPL, however, is convenient by offering the percent sign “%” to do it.

 

Example 10Based on the following table, find whether customer RATTC ranked in top 3 in the year 2014 in terms of sales amount in a single month. Below is part of the source data:

ID

CUSTOMERID

ORDERDATE

SELLERID

PRODUCTID

AMOUNT

10400

EASTC

2014/01/01

1

27

3063.0

10401

HANAR

2014/01/01

1

17

3868.6

10402

ERNSH

2014/01/02

8

70

2713.5

10403

ERNSH

2014/01/03

4

42

1005.9

10404

MAGAA

2014/01/03

2

74

1675.0

 

SQL solution:

    with cte1 as

      (select

         extract (month from ORDERDATE) ORDERMONTH,

         CUSTOMERID, SUM(AMOUNT) AMOUNT

      from SALES

      where extract (year from ORDERDATE)=2014

      group by extract (month from ORDERDATE),CUSTOMERID

      order by ORDERMONTH ASC, AMOUNT DESC),

   cte2 as

      (select

         ORDERMONTH,CUSTOMERID,AMOUNT,

         ROW_NUMBER()OVER(PARTITION BY ORDERMONTH ORDER BY AMOUNT DESC) grouprank

      from cte1)

   select count(*) CUSTOMER_EXISTS

   from cte2

   where grouprank<=3 and CUSTOMERID='RATTC'

 

SQL does not have a specific method for getting XOR, so we group data by month, calculate the total sales amount of each customer per month, sort the result records, get rankings through row numbers after sorting, select customers entering top 3 in each month, and count records of customer "RATTC". If the count result is 1, it means true; and if it is 0, it means false.

 

SPL solution:

   SPL offers A.cor() function to calculate XOR of member sets when A is a set of sets.


A

1

=T("Sales.csv").select(year(ORDERDATE)==2014)

2

=A1.group(month(ORDERDATE))

3

=A2.(~.groups(CUSTOMERID;sum(AMOUNT):AMOUNT))

4

=A3.new(~.top(-3; AMOUNT):TOP3)

5

=A4.(TOP3.(CUSTOMERID).pos("RATTC")>0)

6

=A5.cor()

A1: Import Sales table from the source file and select records of the year 2014.

A2: Group Sales table by month.

A3: Group records of each month by customer and calculate total sales amount of each customer in each group.

A4: Get customers whose sales amounts rank in top 3 in each month.

A5: Find if there is customer "RATTC" in each group.

A6: Use A.cor() function to find XOR, that is, whether customer "RATTC" is included in each month. The customer exists if the result is true, and it does not exist if the result is false.

 

6. Belong to & include

Suppose there are two sets A and B. If all members of set A are members of set B, we call that B includes A. The “belong to” relationship is between a member and a set. When there is a member x in set A, we call that x belongs to A.

Example 11Based on the following employee table, calculate the average salary in each department in the states of California, New York, Texas and Washington. Below is part of the source table:

ID

NAME

SURNAME

STATE

DEPT

SALARY

1

Rebecca

Moore

California

R&D

7000

2

Ashley

Wilson

New York

Finance

11000

3

Rachel

Johnson

New Mexico

Sales

9000

4

Emily

Smith

Texas

HR

7000

5

Ashley

Smith

Texas

R&D

16000

 

SQL solution:

   SQL IN operator is used to define a “belong to” relationship in WHERE sub statement. The SQL query is as follows:

    select

      DEPT, avg(SALARY) AVGSALARY

   from

      EMPLOYEE

   where

      STATE in ('California','New York','Texas','Washington')

   group by

      DEPT

 

SPL solution:

   SPL supplies A.contain(x) function to check whether member x belongs to set A.


A

1

=T("Employee.csv")

2

[California,New York,Texas,Washington]

3

=A1.select(A2.contain(STATE))

4

=A3.groups(DEPT; avg(SALARY):SALARY)

A1: Import Employee table from the source file.

A2: Define a constant set of states.

A3: Select records from A1’s table where the states belong to A2’s set.

A4: Group the selected records in A3 by department and calculate the average salary in each department.

 

Example 12Based on COURSE table and SELECT_COURSE table, find students who select both Matlab and Modern wireless communication system. Below is part of the s source table:

COURSE: 

ID

NAME

TEACHERID

1

Environmental   protection and sustainable development

5

2

Mental   health of College Students

1

3

Matlab

8

4

Electromechanical   basic practice

7

5

Introduction to modern life science

3

SELECT_COURSE: 

ID

STUDENT_NAME

COURSE

1

Rebecca   Moore

2,7

2

Ashley   Wilson

1,8

3

Rachel   Johnson

2,7,10

4

Emily   Smith

1,10

5

Ashley   Smith

5,6

The task can be described in another way. It checks whether the COURSE field value in SELECT_COURSE table contains the set [3,6], which are IDs of Matlab and Modern wireless communication system.

 

SQL solution:

   The SQL field does not support set data type, so we cannot use the set include relationship to get this task done. With Oracle database here, we use REGEXP_SUBSTR function to split each COURSE value string according to a specific regular expression, left join SELECT_COURSE table and COURSE table to get records selecting both courses, group these records by IDs, get groups containing at least two records, that is, those selecting both courses, and then locate corresponding records from SELECT_COURSE table according to the selected IDs. Below is SQL query:

 

    with cte1 as

      (SELECT ID,REGEXP_SUBSTR(t1.COURSE ,'[^,]+',1,l) COURSE

       FROM SELECT_COURSE t1,

         (SELECT LEVEL l

         FROM DUAL

         CONNECT BY LEVEL<=10) t2

         WHERE l<=LENGTH(t1.COURSE) - LENGTH(REPLACE(COURSE,','))+1)

   select *

   from SELECT_COURSE t5

   inner join (

      select ID, count(*)

      from (

         select t3.ID, COURSE

         from cte1 t3

         inner join (

            select ID

            from COURSE

            where NAME='Matlab' or

            NAME='Modern wireless communication system'

            ) t4

         on t3.COURSE=t4.ID

      )

      group by ID

      having count(*)>=2

   ) t6

   on t5.ID=t6.ID

 

SPL solution:

   SPL uses A.pos(B) function locate the position of a member of set B in set A and returns null if the member does not exist in set A.


A

1

=T("Course.txt")

2

=T("SelectCourse.txt")

3

=A1.select(NAME=="Matlab" || NAME=="Modern wireless communication system").(ID)

4

=A2.run(COURSE=COURSE.split@cp())

5

=A4.select(COURSE.pos(A3)!=null)

A1: Import Course table from the source file.

A2: Import SelectCourse table from the source file.

A3: Get the set of IDs of the target two courses.

A4: Split each Course value by comma and parse the numbers into a set.

A5: Use A.pos() function to locate IDs of the target courses in COURSE value of  SELECT_COURSE table, and a record that does not make it return null is a desired one.

 

   As SQL lacks support of set type field values, it is a little complicated to handle this case. The language does snot provide a method of checking a set include relationship, so it handles the job using the filtering join. SPL, however, supports set type field values and offers a rich library of functions to locate members of a set, which facilitates the handling of set include relationship judgment cases.

 

Example 13Based on the following weather data in a certain area, find the dates when west wind occurs and when north wind visits in the previous dates. Below is part of the source table:

WEATHER_DATE

RAIN_FALL

WIND_GUST_DIR

WIND_GUST_SPEED

RAIN_TODAY

RAIN_TOMORROW

2008/12/01

0.6

W

44

No

No

2008/12/02

0.0

WNW

44

No

No

2008/12/03

0.0

WSW

46

No

No

2008/12/04

0.0

NE

24

No

No

2008/12/05

1.0

W

41

No

No

 

SQL solution:

   The task is simple. It aims to find an ordered subset [N,N,W] in WIND_GUST_DIR set. SQL has a weak support for order-based calculations due to its unordered-set-based theoretic foundation (which is explained in Comparison of SQL & SPL: Order-based calculation). When the SQL you are using does not support window functions, you can only do this through table joins. Below is SQL query:

 

   select

      curr.WEATHER_DATE, RAIN_FALL,curr.WIND_GUST_DIR,

      WIND_GUST_SPEED,RAIN_TODAY,RAIN_TOMORROW

   from

      weather curr

   inner join

      (select

         pre1.WEATHER_DATE,pre1.WIND_GUST_DIR

      from weather pre1

      left join

      (select

         WEATHER_DATE,WIND_GUST_DIR

      from weather) pre2

      on

         pre1.WEATHER_DATE=pre2.WEATHER_DATE+1

      where

         pre1.WIND_GUST_DIR='N' and pre2.WIND_GUST_DIR='N') yest

   on

      curr.WEATHER_DATE=yest.WEATHER_DATE+1

   where

      curr.WIND_GUST_DIR='W'

   order by WEATHER_DATE

 

The SQL query is roundabout. Each inter-row access requires a self-join. It is inefficient. SQL introduced window functions in the year 2003 and brought in the concept of order. That has made the order-based calculations slightly easier:

 

   select

      WEATHER_DATE,RAIN_FALL,WIND_GUST_DIR,

      WIND_GUST_SPEED,RAIN_TODAY,RAIN_TOMORROW

   from

      (select

         WEATHER_DATE,RAIN_FALL,WIND_GUST_DIR,

         WIND_GUST_SPEED,RAIN_TODAY,RAIN_TOMORROW,

         LAG(WIND_GUST_DIR,1)

         OVER (

            ORDER BY WEATHER_DATE ASC

         ) YESTERDAY_DIR,

         LAG(WIND_GUST_DIR,2)

         OVER (

            ORDER BY WEATHER_DATE ASC

         ) BEFORE_YESTERDAY_DIR

      from WEATHER)

   where

      WIND_GUST_DIR='W' and YESTERDAY_DIR='N'

      and BEFORE_YESTERDAY_DIR='N'

   order by WEATHER_DATE

 

 

SPL solution:

  SPL supports accessing a record previous to the current one or after it for order-based calculations.


A

1

=T("weather.csv")

2

=A1.select(WIND_GUST_DIR[-2]=="N" &&WIND_GUST_DIR[-1]=="N" && WIND_GUST_DIR=="W")

A1: Import weather data from the source file.

A2: Select records where WIND_GUST_DIR values are west and the previous two values are north.

   

Summary

SQL gives a good support for concatenation, intersection, union, and difference of two sets but a not good one for XOR. The language does not support theses operations on more sets, and it resorts to roundabout ways to get do jobs. SPL provides special functions for each type of set-oriented operations. This makes SPL code concise, efficient, and easy to understand.

SQL does not support set type field values. When a field value is separated by a certain identifier, SQL cannot perform set include operation on it. The language is awkward and produces complicated code in handling the “belong to” relationship on an ordered subset even with the window function. SPL designs a lot of overall location functions to deal with the set “belong to” relationship. It also supplies a complete set of supports for order-based calculations, which makes it easy to handle inter-row operations.

   When the query is complicated, the complexity of SQL query increases by multiple times. It involves the use of temporary table and nested query, etc, which makes it harder to write and maintain the SQL query. SPL, however, can compose succinct code step by step according to the natural way of thinking.

The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of set-oriented functions, which combines the advantages of both Java and SQL. With SPL, a set-oriented operation becomes simple and easy.

 


SalesOnline.csv

SalesStore.csv

Sales.csv

Scores.csv

Scores1.csv

Scores2.csv

Employee.csv

Course.txt

SelectCourse.txt

weather.csv