"The join operation is used to combine records of two or more tables. This essay explains the han .."

pjf RaqForum 22 No.
1 Reply • 571 View • 3 Years ago

Comparison of SQL & SPL: Join Operations (Ⅱ)

The join operation is used to combine records of two or more tables. This essay explains the handling of join operations using SQL and SPL. By exploring solutions and basic principles of the two commonly used programming languages and providing sample programs written in them, we try to find a faster and more efficient way for you. Looking Comparison of SQL & SPL: Join Operations (Ⅱ) for details.

In this essay, we focus on join operations with many-to-one relationship.

The many-to-one relationship exists between tables where any number of records in the first table correspond to one record in the other table. The two tables usually have a foreign key association. A certain field in table A are associated with the primary key of table B. The field in table A that links to table B’s primary key is called the foreign key pointing to table B. Table B is called table A’s foreign key table.

Suppose there are employee table and department table. The department ID field in employee table points to department table’s ID field. One employee corresponds to one department, but one department could have multiple employees. The relationship of employee table and department table is many-to-one.

The one-to-many relationship and the many-to-one relationship are different, but SQL does not distinguish them. It provides no methods to handle them separately.

The method of handling a join operation with many-to-one relationship is completely different from that of dealing with one with one-to-many relationship. We convert foreign key values into corresponding records in the other table, or attach the target field value into the table at the “many” end, to generate a combined table on which queries can be performed. SQL does not offer a particular method to handle joins with the many-to-one relationship, it uses the ordinary JOIN to do the work. SPL provides A.switch()function and other functions to specifically optimize and speed up foreign-key-type joins (many-to-one association).

【Example 1】Get names of all employees and the department they work in based on EMPLOYEE table and DEPARTMENT table. Below is part of the source data:

EMPLOYEE:

ID	NAME	BIRTHDAY	DEPARTMENTID	SALARY
1	Rebecca	1974/11/20	6	7000
2	Ashley	1980/07/19	2	11000
3	Rachel	1970/12/17	7	9000
4	Emily	1985/03/07	3	7000
5	Ashley	1975/05/13	6	16000
…	…	…	…	…

DEPARTMENT:

ID	NAME	MANAGER
1	Administration	18
2	Finance	2
3	HR	4
4	Marketing	6
5	Production	7
…	…	…

SQL solution:

SQL uses JOIN statement to join two tables with the many-to-one relationship according to the condition that a foreign key field value is equivalent to the corresponding primary key value in the foreign key table. Here we use the left join to search for all employees. Below are SQL statements:

SELECT

E.NAME, D.NAME DEPT_NAME

FROM EMPLOYEE E

LEFT JOIN

DEPARTMENT D

ON E.DEPARTMENTID=D.ID

SPL solution:

SPL A.switch() function converts values of DEPARTMENTID values into corresponding records in DEPARTMENT table.

	A
1	=T("Employee.txt")
2	=T("Department.txt")
3	=A1.switch(DEPARTMENTID, A2:ID)
4	=A3.new(NAME, DEPARTMENTID.NAME:DEPT_NAME)

A1: Import Employee table.

A2: Import Department table.

A3: A.switch() function objectifies foreign key values, by replacing DEPARTMENTID values with corresponding DEPARTMENT records.

A4: Return employee names and department names. The latter is obtained from NAME field of DEPARTMENT records.

The case is simple. Both SQL and SPL can handle it well. Someone may ask why we objectify the foreign key instead of referencing the foreign key field during the join? Hope you can understand the reason through the following more complicated example.

【Example 2】Find the American employees whose managers are Chinese based on EMPLOYEE table and DEPARTMENT table. Below is part of the source data:

EMPLOYEE:

ID	NAME	BIRTHDAY	DEPARTMENTID	SALARY
1	Rebecca	1974/11/20	6	7000
2	Ashley	1980/07/19	2	11000
3	Rachel	1970/12/17	7	9000
4	Emily	1985/03/07	3	7000
5	Ashley	1975/05/13	6	16000
…	…	…	…	…

DEPARTMENT:

ID	NAME	MANAGER
1	Administration	18
2	Finance	2
3	HR	4
4	Marketing	6
5	Production	7
…	…	…

SQL solution:

Just adding the desired foreign key field to EMPLOYEE table is not right. We need two joins instead. The first is the join between DEPARTMENT table and EMPLOYEE table through ID field for getting nationalities of managers. The second join is between EMPLOYEE table and DEPARTMENT table through DEPARTMENTID field for getting the employees’ departments. Below are SQL statements:

SELECT *

FROM EMPLOYEE E2

LEFT JOIN

(SELECT D1.ID,D1.MANAGER,E1.NATION MANAGER_NATION

FROM DEPARTMENT D1

LEFT JOIN EMPLOYEE E1

ON D1.MANAGER=E1.ID

) D2

ON E2.DEPARTMENTID=D2.ID

WHERE D2.MANAGER_NATION='Chinese' AND E2.NATION='American'

This block of SQL code is rather complicated. Each join requires a layer of nested query and you need to know the target fields in the foreign key table in advance.

SPL solution:

SPL handles this task also through A.switch() function by objectifying foreign key values.

	A
1	=T("Employee.txt").keys(ID)
2	=T("Department.txt").keys(ID)
3	=A2.switch(MANAGER, A1)
4	=A1.switch(DEPARTMENTID, A2)
5	=A4.select(NATION=="American" && DEPARTMENTID.MANAGER.NATION=="Chinese")

A1: Import Employee table and set ID as the primary key.

A2: Import Department table and set ID as the primary key.

A3: Use A.switch() function to objectify foreign key field by converting Department table’s MANAGER field into corresponding Employee records.

A4: Use A.switch() function to objectify foreign key field by converting Employee table’s DEPARTMETNID field into corresponding Department records.

A5: Get records of American employees whose managers are Chinese.

The SPL script is not so complicated as SQL statements. SPL just performs one more foreign key objectification according to the natural logic. You do not need to know which fields you want in advance for a foreign key field join because SPL allows you to get them directly from the record object. To get managers who are Chinese , SPL uses the expression DEPARTMENTID.MANAGER.NATION, which is natural.

【Example 3】Find the number of students who select “Matlab” course based on COURSE table and SELECT_COURSE table. Below is part of the source data:

COURSE:

ID	NAME	TEACHERID
1	Environmental protection and sustainable development	5
2	Mental health of College Students	1
3	Matlab	8
4	Electromechanical basic practice	7
5	Introduction to modern life science	3
…	…	…

SELECT_COURSE:

ID	COURSEID	STUDENTID
1	6	59
2	6	43
3	5	52
4	5	44
5	5	37
…	…	…

SQL solution:

We can use inner join to delete non-matching records (where the selected courses are not Matlab). Below are SQL statements:

SELECT

COUNT(*) COUNT

FROM SELECT_COURSE SC

INNER JOIN

COURSE C

ON SC.COURSEID=C.ID

WHERE NAME='Matlab'

SPL solution:

A.switch() function works with @i option to delete non-matching records during the join operation.

	A
1	=T("Course.csv")
2	=T("SelectCourse.csv")
3	=A1.select(NAME:"Matlab")
4	=A2.switch@i(COURSEID, A3:ID).count()

A1: Import Course table.

A2: Import SelectCourse table.

A3: Select records containing Matlab from Course table.

A4: A.switch() works with @i option to remove non-matching record at the join and then count students who select Matlab course.

【Example 4】Get sales information of new customers in the year 2014, that is, records of SALES table where CUSTOMERIDs are not included in CUSTOMER table, based on SALES table and CUSTOMER table. Below is part of the source data:

SALES:

ID	CUSTOMERID	ORDERDATE	SELLERID	PRODUCTID	AMOUNT
10248	VINET	2013/7/4	5	59	2440
10249	TOMSP	2013/7/5	6	38	1863.4
10250	HANAR	2013/7/8	4	65	1813
10251	VICTE	2013/7/8	3	66	670.8
10252	SUPRD	2013/7/9	4	46	3730
…	…	…	…	…	…

CUSTOMER:

ID	NAME	CITY	POSTCODE	TEL
ALFKI	Sanchuan Industrial Co., Ltd	Tianjin	343567	(030) 30074321
ANATR	Southeast industries	Tianjin	234575	(030) 35554729
ANTON	Tanson trade	Shijiazhuang	985060	(0321) 5553932
AROUT	Guoding Co., Ltd	Shenzhen	890879	(0571) 45557788
BERGS	Tongheng machinery	Nanjing	798089	(0921) 9123465
…	…	…	…	…

SQL solution:

SQL can use NOT IN or NOT EXISTS to retain only the non-matching records, which are SALES records whose CUSTOMERIDs are not included in CUSTOMER table at the join. Below are SQL statements:

SELECT *

FROM SALES S

WHERE

EXTRACT (YEAR FROM ORDERDATE)=2014

AND

CUSTOMERID NOT IN

(SELECT DISTINCT ID

FROM CUSTOMER)

Or:

SELECT *

FROM SALES S

WHERE

EXTRACT (YEAR FROM ORDERDATE)=2014

AND

NOT EXISTS

(SELECT *

FROM CUSTOMER C

WHERE S.CUSTOMERID=C.ID)

SPL solution:

A.switch() function works with @d option to keep only the non-matching records.

	A
1	=T("Sales.csv")
2	=T("Customer.txt")
3	=A1.select(year(ORDERDATE)==2014)
4	=A3.switch@d(CUSTOMERID, A2:ID)

A1: Import Sales table.

A2: Import Customer table.

A3: Select records of 2014 from Sales table.

A4: A.switch() works with @d option to keep non-matching records only, which are the sales records of new customers in the year 2014.

【Example 5】Find the superior organization for each section based on organization structure table. Below is part of the source data:

ID	ORG_NAME	PARENT_ID
1	Head Office	0
2	Beijing Branch Office	1
3	Shanghai Branch Office	1
4	Chengdu Branch Office	1
5	Beijing R&D Center	2
…	…	…

SQL solution:

In the table, PARENT_ID points to a value in the same table. This is a case of self-join. We query the table twice, which is equivalent to treating the source data as two tables, and then perform the join. Below are SQL statements:

SELECT

ORG1.ID,ORG1.ORG_NAME,ORG2.ORG_NAME PARENT_NAME

FROM

ORGANIZATION ORG1

LEFT JOIN

ORGANIZATION ORG2

ON ORG1.PARENT_ID=ORG2.ID

ORDER BY ID

SPL solution:

SPL uses A.switch() function to objectify foreign key values, where the foreign key table is the table itself.

	A
1	=T("Organization.txt")
2	=A1.switch(PARENT_ID, A1:ID)
3	=A2.new(ID, ORG_NAME, PARENT_ID.ORG_NAME:PARENT_NAME)

A1: Import Organization table.

A2: A.switch() function objectify foreign key values by replacing PARENT_ID values with corresponding parent organization records.

A3: Return names of all organizations and their parent organizations.

【Example 6】Get customers and their order amounts in the year 2015. Below are part of data in ORDERS table, ORDER_DETAIL table and CUSTOMER table:

ORDERS:

ID	CUSTOMERID	EMPLOYEEID	ORDER_DATE	ARRIVAL_DATE
10248	VINET	5	2012/07/04	2012/08/01
10249	TOMSP	6	2012/07/05	2012/08/16
10250	HANAR	4	2012/07/08	2012/08/05
10251	VICTE	3	2012/07/08	2012/08/05
10252	SUPRD	4	2012/07/09	2012/08/06
…	…	…	…	…

ORDER_DETAIL:

ID	ORDER_NUMBER	PRODUCTID	PRICE	COUNT	DISCOUNT
10814	1	48	102.0	8	0.15
10814	2	48	102.0	8	0.15
10814	3	48	306.0	24	0.15
10814	4	48	102.0	8	0.15
10814	5	48	204.0	16	0.15
…	…	…	…	…	…

CUSTOMER:

ID	NAME	CITY	POSTCODE	TEL
ALFKI	Sanchuan Industrial Co., Ltd	Tianjin	343567	(030) 30074321
ANATR	Southeast industries	Tianjin	234575	(030) 35554729
ANTON	Tanson trade	Shijiazhuang	985060	(0321) 5553932
AROUT	Guoding Co., Ltd	Shenzhen	890879	(0571) 45557788
BERGS	Tongheng machinery	Nanjing	798089	(0921) 9123465
…	…	…	…	…

SQL solution:

This task involves both one-to-many relationship (between ORDERS table and ORDER_DETAIL table) and many-to-one relationship (between ORDERS table and CUSTOEMR table). SQL does not define the two relationship separately, but it is necessary to differentiate them. First, we handle foreign key table (the many-to-one relationship) by attaching the foreign key values or desired field values to the table at the “many” end, and then we deal with the primary-and-sub tables (the one-to-many relationship). Below are SQL statements:

SELECT

CUSTOMER_NAME, SUM(AMOUNT) AMOUNT

FROM (

SELECT

Orders1.ID, CUSTOMER_NAME, Detail.PRICE*Detail.COUNT AMOUNT

FROM (

SELECT

Orders.ID,Customer.NAME CUSTOMER_NAME

FROM ORDERS Orders

LEFT JOIN

CUSTOMER Customer

ON Orders.CUSTOMERID=Customer.ID

WHERE EXTRACT (YEAR FROM ORDER_DATE)=2015

) Orders1

INNER JOIN

ORDER_DETAIL Detail

ON Orders1.ID=Detail.ID

)

GROUP BY CUSTOMER_NAME

ORDER BY CUSTOMER_NAME

SPL solution:

SPL performs foreign key objectification to replace ORDERS table’s CUSTOMERID field values with corresponding CUSTOMER records and then we have only the primary table and its sub table, which are joined using join() function.

	A
1	=T("Orders.txt")
2	=T("Customer.txt")
3	=A1.select(year(ORDER_DATE)==2015).switch(CUSTOMERID, A2:ID)
4	=T("OrderDetail.txt").group(ID)
5	=join(A3:Orders, ID;A4:Detail, ID)
6	=A5.groups(Orders.CUSTOMERID.NAME; Detail.sum(PRICE*COUNT):AMOUNT)

A1: Import Orders table.

A2: Import Customers table.

A3: Select Orders records of the year 2015, and use A.switch() function to objectify the foreign key CUSOTMERID by converting its values to corresponding CUSTOMER records.

A4: Import OrderDetail table and group it by ID.

A5: The join() function joins Orders table and OrderDetail table by ID fields.

A6: Group A5’s result set and sum sales amounts for each customer.

【Example 7】Get order information of the year 2014 (order IDs, product names and total amounts) where the product name contains “water” and order amount is greater than 200, and that do not pay in installment and get 5-star evaluation. Below is part of the source data and the relationships between tables:

ORDERS:

ID	CUSTOMERID	EMPLOYEEID	ORDER_DATE	ARRIVAL_DATE
10248	VINET	5	2012/07/04	2012/08/01
10249	TOMSP	6	2012/07/05	2012/08/16
10250	HANAR	4	2012/07/08	2012/08/05
10251	VICTE	3	2012/07/08	2012/08/05
10252	SUPRD	4	2012/07/09	2012/08/06
…	…	…	…	…

ORDER_DETAIL:

ID	ORDER_NUMBER	PRODUCTID	PRICE	COUNT	DISCOUNT
10814	1	48	102.0	8	0.15
10814	2	48	102.0	8	0.15
10814	3	48	306.0	24	0.15
10814	4	48	102.0	8	0.15
10814	5	48	204.0	16	0.15
…	…	…	…	…	…

ORDER_PAYMENT:

ID	PAY_DATE	AMOUNT	CHANNEL	INSTALMENTS
10814	2014/01/05	816.0	3	0
10848	2014/01/23	800.25	2	1
10848	2014/01/23	800.25	0	0
10848	2014/01/23	800.25	3	1
10966	2014/03/20	572.0	2	1
…	…	…	…	…

EVALUATION:

ID	SCORE	DATE	COMMENT
10248	4	2012/07/12
10249	1	2012/07/06
10250	4	2012/07/10
10251	2	2012/07/11
10252	3	2012/07/16
…	…	…	…

PRODUCT:

ID	NAME	SUPPLIERID	CATEGORY
1	Apple Juice	2	1
2	Milk	1	1
3	Tomato sauce	1	2
4	Salt	2	2
5	Sesame oil	2	2
…	…	…	…

SQL solution:

This task involves one-to-many relationship, many-to-one relationship and one-to-one relationship. It is wrong to join them all with the JOIN operation because that will result in many-to-many relationship. The right way is to handle the many-to-one relationship (foreign key table) first by attaching foreign key values or desired field values to the table at the “many” end, and we have one-to-one relationship and one-to-many relationship only. Then we group the sub table by the primary table’s primary key (order ID), which makes the key the sub table’s actual primary key. Finally, we join the four tables through order ID. Below are SQL statements:

SELECT

Orders.ID,Detail1.NAME, Detail1.AMOUNT

FROM (

SELECT ID

FROM ORDERS

WHERE

EXTRACT (YEAR FROM Orders.ORDER_DATE)=2014

) Orders

INNER JOIN (

SELECT ID,NAME, SUM(AMOUNT) AMOUNT

FROM (

SELECT

Detail.ID,Product.NAME,Detail.PRICE*Detail.COUNT AMOUNT

FROM ORDER_DETAIL Detail

INNER JOIN

PRODUCT Product

ON Detail.PRODUCTID=Product.ID

WHERE NAME LIKE '%water%'

)

GROUP BY ID,NAME

) Detail1

ON Orders.ID=Detail1.ID

INNER JOIN(

SELECT

DISTINCT ID

FROM ORDER_PAYMENT

WHERE INSTALMENTS=0

) Payment

ON Orders.ID = Payment.ID

INNER JOIN(

SELECT ID

FROM EVALUATION

WHERE SCORE=5

) Evaluation

ON Orders.ID = Evaluation.ID

The SQL statements are difficult to write, hard to understand and maintain. More importantly, it is inconvenient to check whether the statements are correctly written since there are too many joins and nested queries.

SPL solution:

	A
1	=T("Orders.txt").select(year(ORDER_DATE)==2014)
2	=T("Product.txt").select(like(NAME, "water"))
3	=T("OrderDetail.txt").switch@i(PRODUCTID, A2:ID)
4	=A3.group(ID).select(sum(PRICE*COUNT)>200)
5	=T("OrderPayment.txt").select(INSTALMENTS==0).group(ID)
6	=T("Evaluation.txt").select(SCORE==5)
7	=join(A1:Orders,ID;A4:Detail,ID;A5:Payment,ID;A6:Evaluation,ID)
8	=A7.new(Orders.ID,Detail.PRODUCTID.NAME,Detail.sum(PRICE*COUNT):AMOUNT)

A1: Import Orders table and select records of the year 2014.

A2: Import Product table and select records that contain water.

A3: Import OrderDetail table and objectify the foreign key PRODUCTID by replacing its values with corresponding records in Product table.

A4: Group OrderDetail table by ID field and select records where the amount is above 200.

A5: Import OrderPayment table and select records that do not have installment information.

A6: Import Evaluation table and select records containing 5-star evaluations.

A7: The join() function joins Orders table, OrderDetail table, OrderPayment table, and Evaluation table according to ID fields.

A8: Return the eligible order IDs, product names and order amounts.

The SPL script has two more lines of code. The import, select, and group operations on each table are separately performed, and only one line (A7) is for the join. The logic is natural and clear.

As seen from the above example, many-to-one relationship and one-to-many relationship is rather different. It is important for you to distinguish them even though SQL treat them in the same way. SPL, however, treats them in separate ways by supplying different functions for handling them. Now let’s take a look at the SPL solution to multi-table joins:

(1) With the many-to-one relationship (foreign key table), attach foreign key values or desired field values to the table at the “many” end.

(2) With the one -to-many relationship (primary and sub tables), group the sub table by the primary table’s primary key and the key becomes the sub table’s actual primary key.

(3) Join multiple tables through the primary keys (or the actual primary keys).