Comparison of SQL & SPL: Join Operations (Ⅱ)

 

The join operation is used to combine records of two or more tables. This essay explains the handling of join operations using SQL and SPL. By exploring solutions and basic principles of the two commonly used programming languages and providing sample programs written in them, we try to find a faster and more efficient way for you. Looking Comparison of SQL & SPL: Join Operations (Ⅱ) for details.

 

In this essay, we focus on join operations with many-to-one relationship.

The many-to-one relationship exists between tables where any number of records in the first table correspond to one record in the other table. The two tables usually have a foreign key association. A certain field in table A are associated with the primary key of table B. The field in table A that links to table B’s primary key is called the foreign key pointing to table B. Table B is called table A’s foreign key table.

Suppose there are employee table and department table. The department ID field in employee table points to department table’s ID field. One employee corresponds to one department, but one department could have multiple employees. The relationship of employee table and department table is many-to-one.

The one-to-many relationship and the many-to-one relationship are different, but SQL does not distinguish them. It provides no methods to handle them separately.

The method of handling a join operation with many-to-one relationship is completely different from that of dealing with one with one-to-many relationship. We convert foreign key values into corresponding records in the other table, or attach the target field value into the table at the “many” end, to generate a combined table on which queries can be performed. SQL does not offer a particular method to handle joins with the many-to-one relationship, it uses the ordinary JOIN to do the work. SPL provides A.switch()function and other functions to specifically optimize and speed up foreign-key-type joins (many-to-one association).

 

Example 1Get names of all employees and the department they work in based on EMPLOYEE table and DEPARTMENT table. Below is part of the source data:

EMPLOYEE:

ID

NAME

BIRTHDAY

DEPARTMENTID

SALARY

1

Rebecca

1974/11/20

6

7000

2

Ashley

1980/07/19

2

11000

3

Rachel

1970/12/17

7

9000

4

Emily

1985/03/07

3

7000

5

Ashley

1975/05/13

6

16000

DEPARTMENT:

ID

NAME

MANAGER

1

Administration

18

2

Finance

2

3

HR

4

4

Marketing

6

5

Production

7

 

SQL solution:

SQL uses JOIN statement to join two tables with the many-to-one relationship according to the condition that a foreign key field value is equivalent to the corresponding primary key value in the foreign key table. Here we use the left join to search for all employees. Below are SQL statements:

   

   SELECT

      E.NAME, D.NAME DEPT_NAME

   FROM EMPLOYEE E

   LEFT JOIN

   DEPARTMENT D

   ON E.DEPARTMENTID=D.ID

 

SPL solution:

SPL A.switch() function converts values of DEPARTMENTID values into corresponding records in DEPARTMENT table.


A

1

=T("Employee.txt")

2

=T("Department.txt")

3

=A1.switch(DEPARTMENTID, A2:ID)

4

=A3.new(NAME, DEPARTMENTID.NAME:DEPT_NAME)

A1: Import Employee table.

A2: Import Department table.

A3: A.switch() function objectifies foreign key values, by replacing DEPARTMENTID values with corresponding DEPARTMENT records.

A4: Return employee names and department names. The latter is obtained from NAME field of DEPARTMENT records.

 

The case is simple. Both SQL and SPL can handle it well. Someone may ask why we objectify the foreign key instead of referencing the foreign key field during the join? Hope you can understand the reason through the following more complicated example.

 

Example 2Find the American employees whose managers are Chinese based on EMPLOYEE table and DEPARTMENT table. Below is part of the source data:

EMPLOYEE:

ID

NAME

BIRTHDAY

DEPARTMENTID

SALARY

1

Rebecca

1974/11/20

6

7000

2

Ashley

1980/07/19

2

11000

3

Rachel

1970/12/17

7

9000

4

Emily

1985/03/07

3

7000

5

Ashley

1975/05/13

6

16000

DEPARTMENT:

ID

NAME

MANAGER

1

Administration

18

2

Finance

2

3

HR

4

4

Marketing

6

5

Production

7

 

SQL solution:

Just adding the desired foreign key field to EMPLOYEE table is not right. We need two joins instead. The first is the join between DEPARTMENT table and EMPLOYEE table through ID field for getting nationalities of managers. The second join is between EMPLOYEE table and DEPARTMENT table through DEPARTMENTID field for getting the employees’ departments. Below are SQL statements:

 

   SELECT *

   FROM EMPLOYEE E2

   LEFT JOIN

      (SELECT D1.ID,D1.MANAGER,E1.NATION MANAGER_NATION

      FROM DEPARTMENT D1

      LEFT JOIN EMPLOYEE E1

      ON D1.MANAGER=E1.ID

   ) D2

   ON E2.DEPARTMENTID=D2.ID

   WHERE D2.MANAGER_NATION='Chinese' AND E2.NATION='American'

   

This block of SQL code is rather complicated. Each join requires a layer of nested query and you need to know the target fields in the foreign key table in advance.

 

SPL solution:

SPL handles this task also through A.switch() function by objectifying foreign key values.


A

1

=T("Employee.txt").keys(ID)

2

=T("Department.txt").keys(ID)

3

=A2.switch(MANAGER, A1)

4

=A1.switch(DEPARTMENTID, A2)

5

=A4.select(NATION=="American" && DEPARTMENTID.MANAGER.NATION=="Chinese")

A1: Import Employee table and set ID as the primary key.

A2: Import Department table and set ID as the primary key.

A3: Use A.switch() function to objectify foreign key field by converting Department table’s MANAGER field into corresponding Employee records.

A4: Use A.switch() function to objectify foreign key field by converting Employee table’s DEPARTMETNID field into corresponding Department records.

A5: Get records of American employees whose managers are Chinese.

 

The SPL script is not so complicated as SQL statements. SPL just performs one more foreign key objectification according to the natural logic. You do not need to know which fields you want in advance for a foreign key field join because SPL allows you to get them directly from the record object. To get managers who are Chinese , SPL uses the expression DEPARTMENTID.MANAGER.NATION, which is natural.

 

Example 3Find the number of students who select “Matlab” course based on COURSE table and SELECT_COURSE table. Below is part of the source data:

COURSE:

ID

NAME

TEACHERID

1

Environmental protection and sustainable development

5

2

Mental health of College Students

1

3

Matlab

8

4

Electromechanical basic practice

7

5

Introduction to modern life science

3

SELECT_COURSE:

ID

COURSEID

STUDENTID

1

6

59

2

6

43

3

5

52

4

5

44

5

5

37

 

SQL solution:

We can use inner join to delete non-matching records (where the selected courses are not Matlab). Below are SQL statements:

 

   SELECT

      COUNT(*) COUNT

   FROM SELECT_COURSE SC

   INNER JOIN

      COURSE C

   ON SC.COURSEID=C.ID

   WHERE NAME='Matlab'

 

SPL solution:

A.switch() function works with @i option to delete non-matching records during the join operation.


A

1

=T("Course.csv")

2

=T("SelectCourse.csv")

3

=A1.select(NAME:"Matlab")

4

=A2.switch@i(COURSEID, A3:ID).count()

A1: Import Course table.

A2: Import SelectCourse table.

A3: Select records containing Matlab from Course table.

A4: A.switch() works with @i option to remove non-matching record at the join and then count students who select Matlab course.

 

Example 4Get sales information of new customers in the year 2014, that is, records of SALES table where CUSTOMERIDs are not included in CUSTOMER table, based on SALES table and CUSTOMER table. Below is part of the source data:

SALES:

ID

CUSTOMERID

ORDERDATE

SELLERID

PRODUCTID

AMOUNT

10248

VINET

2013/7/4

5

59

2440

10249

TOMSP

2013/7/5

6

38

1863.4

10250

HANAR

2013/7/8

4

65

1813

10251

VICTE

2013/7/8

3

66

670.8

10252

SUPRD

2013/7/9

4

46

3730

CUSTOMER:

ID

NAME

CITY

POSTCODE

TEL

ALFKI

Sanchuan Industrial Co., Ltd

Tianjin

343567

(030) 30074321

ANATR

Southeast industries

Tianjin

234575

(030) 35554729

ANTON

Tanson trade

Shijiazhuang

985060

(0321) 5553932

AROUT

Guoding Co., Ltd

Shenzhen

890879

(0571) 45557788

BERGS

Tongheng machinery

Nanjing

798089

(0921) 9123465

 

SQL solution:

SQL can use NOT IN or NOT EXISTS to retain only the non-matching records, which are SALES records whose CUSTOMERIDs are not included in CUSTOMER table at the join. Below are SQL statements:

 

   SELECT *

   FROM SALES S

   WHERE

      EXTRACT (YEAR FROM ORDERDATE)=2014

      AND

      CUSTOMERID NOT IN

         (SELECT DISTINCT ID

         FROM CUSTOMER)

 

    Or:

 

   SELECT *

   FROM SALES S

   WHERE

      EXTRACT (YEAR FROM ORDERDATE)=2014

      AND

      NOT EXISTS

         (SELECT *

         FROM CUSTOMER C

         WHERE S.CUSTOMERID=C.ID)

 

SPL solution:

A.switch() function works with @d option to keep only the non-matching records.


A

1

=T("Sales.csv")

2

=T("Customer.txt")

3

=A1.select(year(ORDERDATE)==2014)

4

=A3.switch@d(CUSTOMERID, A2:ID)

A1: Import Sales table.

A2: Import Customer table.

A3: Select records of 2014 from Sales table.

A4: A.switch() works with @d option to keep non-matching records only, which are the sales records of new customers in the year 2014.

 

Example 5Find the superior organization for each section based on organization structure table. Below is part of the source data:

ID

ORG_NAME

PARENT_ID

1

Head Office

0

2

Beijing Branch Office

1

3

Shanghai Branch Office

1

4

Chengdu Branch Office

1

5

Beijing R&D Center

2

SQL solution:

In the table, PARENT_ID points to a value in the same table. This is a case of self-join. We query the table twice, which is equivalent to treating the source data as two tables, and then perform the join. Below are SQL statements:

 

   SELECT

      ORG1.ID,ORG1.ORG_NAME,ORG2.ORG_NAME PARENT_NAME

   FROM

      ORGANIZATION ORG1

   LEFT JOIN

   ORGANIZATION ORG2

   ON ORG1.PARENT_ID=ORG2.ID

   ORDER BY ID

 

SPL solution:

SPL uses A.switch() function to objectify foreign key values, where the foreign key table is the table itself.


A

1

=T("Organization.txt")

2

=A1.switch(PARENT_ID, A1:ID)

3

=A2.new(ID, ORG_NAME, PARENT_ID.ORG_NAME:PARENT_NAME)

A1: Import Organization table.

A2: A.switch() function objectify foreign key values by replacing PARENT_ID values with corresponding parent organization records.

A3: Return names of all organizations and their parent organizations.

 

Example 6Get customers and their order amounts in the year 2015. Below are part of data in ORDERS table, ORDER_DETAIL table and CUSTOMER table:

ORDERS:

ID

CUSTOMERID

EMPLOYEEID

ORDER_DATE

ARRIVAL_DATE

10248

VINET

5

2012/07/04

2012/08/01

10249

TOMSP

6

2012/07/05

2012/08/16

10250

HANAR

4

2012/07/08

2012/08/05

10251

VICTE

3

2012/07/08

2012/08/05

10252

SUPRD

4

2012/07/09

2012/08/06

ORDER_DETAIL:

ID

ORDER_NUMBER

PRODUCTID

PRICE

COUNT

DISCOUNT

10814

1

48

102.0

8

0.15

10814

2

48

102.0

8

0.15

10814

3

48

306.0

24

0.15

10814

4

48

102.0

8

0.15

10814

5

48

204.0

16

0.15

CUSTOMER:

ID

NAME

CITY

POSTCODE

TEL

ALFKI

Sanchuan Industrial Co., Ltd

Tianjin

343567

(030) 30074321

ANATR

Southeast industries

Tianjin

234575

(030) 35554729

ANTON

Tanson trade

Shijiazhuang

985060

(0321) 5553932

AROUT

Guoding Co., Ltd

Shenzhen

890879

(0571) 45557788

BERGS

Tongheng machinery

Nanjing

798089

(0921) 9123465

 

SQL solution:

This task involves both one-to-many relationship (between ORDERS table and ORDER_DETAIL table) and many-to-one relationship (between ORDERS table and CUSTOEMR table). SQL does not define the two relationship separately, but it is necessary to differentiate them. First, we handle foreign key table (the many-to-one relationship) by attaching the foreign key values or desired field values to the table at the “many” end, and then we deal with the primary-and-sub tables (the one-to-many relationship). Below are SQL statements:

 

   SELECT

      CUSTOMER_NAME, SUM(AMOUNT) AMOUNT

   FROM (

      SELECT

         Orders1.ID, CUSTOMER_NAME, Detail.PRICE*Detail.COUNT AMOUNT

      FROM (

         SELECT

            Orders.ID,Customer.NAME CUSTOMER_NAME

         FROM ORDERS Orders

         LEFT JOIN

         CUSTOMER Customer

         ON Orders.CUSTOMERID=Customer.ID

         WHERE EXTRACT (YEAR FROM ORDER_DATE)=2015

      ) Orders1

      INNER JOIN

      ORDER_DETAIL Detail

      ON Orders1.ID=Detail.ID

   )

   GROUP BY CUSTOMER_NAME

   ORDER BY CUSTOMER_NAME

 

SPL solution:

SPL performs foreign key objectification to replace ORDERS table’s CUSTOMERID field values with corresponding CUSTOMER records and then we have only the primary table and its sub table, which are joined using join() function.


A

1

=T("Orders.txt")

2

=T("Customer.txt")

3

=A1.select(year(ORDER_DATE)==2015).switch(CUSTOMERID, A2:ID)

4

=T("OrderDetail.txt").group(ID)

5

=join(A3:Orders, ID;A4:Detail, ID)

6

=A5.groups(Orders.CUSTOMERID.NAME; Detail.sum(PRICE*COUNT):AMOUNT)

A1: Import Orders table.

A2: Import Customers table.

A3: Select Orders records of the year 2015, and use A.switch() function to objectify the foreign key CUSOTMERID by converting its values to corresponding CUSTOMER records.

A4: Import OrderDetail table and group it by ID.

A5: The join() function joins Orders table and OrderDetail table by ID fields.

A6: Group A5’s result set and sum sales amounts for each customer.

 

Example 7Get order information of the year 2014 (order IDs, product names and total amounts) where the product name contains “water” and order amount is greater than 200, and that do not pay in installment and get 5-star evaluation. Below is part of the source data and the relationships between tables:

ORDERS:

ID

CUSTOMERID

EMPLOYEEID

ORDER_DATE

ARRIVAL_DATE

10248

VINET

5

2012/07/04

2012/08/01

10249

TOMSP

6

2012/07/05

2012/08/16

10250

HANAR

4

2012/07/08

2012/08/05

10251

VICTE

3

2012/07/08

2012/08/05

10252

SUPRD

4

2012/07/09

2012/08/06

ORDER_DETAIL:

ID

ORDER_NUMBER

PRODUCTID

PRICE

COUNT

DISCOUNT

10814

1

48

102.0

8

0.15

10814

2

48

102.0

8

0.15

10814

3

48

306.0

24

0.15

10814

4

48

102.0

8

0.15

10814

5

48

204.0

16

0.15

ORDER_PAYMENT:

ID

PAY_DATE

AMOUNT

CHANNEL

INSTALMENTS

10814

2014/01/05

816.0

3

0

10848

2014/01/23

800.25

2

1

10848

2014/01/23

800.25

0

0

10848

2014/01/23

800.25

3

1

10966

2014/03/20

572.0

2

1

EVALUATION:

ID

SCORE

DATE

COMMENT

10248

4

2012/07/12


10249

1

2012/07/06


10250

4

2012/07/10


10251

2

2012/07/11


10252

3

2012/07/16


PRODUCT:

ID

NAME

SUPPLIERID

CATEGORY

1

Apple Juice

2

1

2

Milk

1

1

3

Tomato sauce

1

2

4

Salt

2

2

5

Sesame oil

2

2

 

SQL solution:

This task involves one-to-many relationship, many-to-one relationship and one-to-one relationship. It is wrong to join them all with the JOIN operation because that will result in many-to-many relationship. The right way is to handle the many-to-one relationship (foreign key table) first by attaching foreign key values or desired field values to the table at the “many” end, and we have one-to-one relationship and one-to-many relationship only. Then we group the sub table by the primary table’s primary key (order ID), which makes the key the sub table’s actual primary key. Finally, we join the four tables through order ID. Below are SQL statements:

 

   SELECT

      Orders.ID,Detail1.NAME, Detail1.AMOUNT

   FROM (

      SELECT ID

      FROM ORDERS

      WHERE

         EXTRACT (YEAR FROM Orders.ORDER_DATE)=2014

   ) Orders

   INNER JOIN (

      SELECT ID,NAME, SUM(AMOUNT) AMOUNT

      FROM (

         SELECT

            Detail.ID,Product.NAME,Detail.PRICE*Detail.COUNT AMOUNT

         FROM ORDER_DETAIL Detail

         INNER JOIN

         PRODUCT Product

         ON Detail.PRODUCTID=Product.ID

         WHERE NAME LIKE '%water%'

      )

      GROUP BY ID,NAME

   ) Detail1

   ON Orders.ID=Detail1.ID

   INNER JOIN(

      SELECT

         DISTINCT ID

      FROM ORDER_PAYMENT

      WHERE INSTALMENTS=0

   ) Payment

   ON Orders.ID = Payment.ID

   INNER JOIN(

      SELECT ID

      FROM EVALUATION

      WHERE SCORE=5

   ) Evaluation

   ON Orders.ID = Evaluation.ID

 

The SQL statements are difficult to write, hard to understand and maintain. More importantly, it is inconvenient to check whether the statements are correctly written since there are too many joins and nested queries.

 

SPL solution:


A

1

=T("Orders.txt").select(year(ORDER_DATE)==2014)

2

=T("Product.txt").select(like(NAME, "*water*"))

3

=T("OrderDetail.txt").switch@i(PRODUCTID, A2:ID)

4

=A3.group(ID).select(sum(PRICE*COUNT)>200)

5

=T("OrderPayment.txt").select(INSTALMENTS==0).group(ID)

6

=T("Evaluation.txt").select(SCORE==5)

7

=join(A1:Orders,ID;A4:Detail,ID;A5:Payment,ID;A6:Evaluation,ID)

8

=A7.new(Orders.ID,Detail.PRODUCTID.NAME,Detail.sum(PRICE*COUNT):AMOUNT)

A1: Import Orders table and select records of the year 2014.

A2: Import Product table and select records that contain water.

A3: Import OrderDetail table and objectify the foreign key PRODUCTID by replacing its values with corresponding records in Product table.

A4: Group OrderDetail table by ID field and select records where the amount is above 200.

A5: Import OrderPayment table and select records that do not have installment information.

A6: Import Evaluation table and select records containing 5-star evaluations.

A7: The join() function joins Orders table, OrderDetail table, OrderPayment table, and Evaluation table according to ID fields.

A8: Return the eligible order IDs, product names and order amounts.

 

The SPL script has two more lines of code. The import, select, and group operations on each table are separately performed, and only one line (A7) is for the join. The logic is natural and clear.

 

As seen from the above example, many-to-one relationship and one-to-many relationship is rather different. It is important for you to distinguish them even though SQL treat them in the same way. SPL, however, treats them in separate ways by supplying different functions for handling them. Now let’s take a look at the SPL solution to multi-table joins:

(1)   With the many-to-one relationship (foreign key table), attach foreign key values or desired field values to the table at the “many” end.

(2)   With the one -to-many relationship (primary and sub tables), group the sub table by the primary table’s primary key and the key becomes the sub table’s actual primary key.

(3)   Join multiple tables through the primary keys (or the actual primary keys).

Employee.txt

Course.txt

SelectCourse.csv

Sales.csv

Customer.txt

Organization.txt

Orders.txt

Customer.txt

OrderDetail.txt

Product.txt

OrderPayment.txt

Evaluation.txt