Comparison of SQL & SPL: Join Operations (Ⅲ)

 

Abstract

The join operation is used to combine records of two or more tables. This essay explains the handling of join operations using SQL and SPL. By exploring solutions and basic principles of the two commonly used programming languages and providing sample programs written in them, we try to find the faster and more efficient way for you.

 

In this essay, we focus on the handling of table joins with many-to-many relationship.

The many-to-many relationship exists between tables where one record in table A corresponds to any number of records in table B and vice versa.

We use cross join to handle this type of relationship. One record of table A will join up with each record of table B to generate new records, so the number of rows in the returned result set is the product of numbers of rows in the two tables. SQL CROSS JOIN is used to do the cross join. SPL has xjoin() function to do the job. In real-world situations, the many-to-many relationship is rare. Now let’s look at how to handle such a scenario through matrix multiplication.

 

Example 1Calculate the product of two matrices. The two matrix tables are stored respectively in MATRIXA and MATRIXB:

MATRIXA:

ROW

COL

VALUE

1

1

1

1

2

2

1

3

3

2

1

4

2

2

5

2

3

6

MATRIXB:

ROW

COL

VALUE

1

1

1

1

2

4

2

1

2

2

2

5

3

1

3

3

2

6

The mathematical formula for solving this task is as follows:

                                             

 

SQL solution:

   The matrix multiplication involves many-to-many relationship, which is handled in SQL through the cross join. First, we cross product the two matrix tables to select records where COL value in table A is equivalent to ROW value in table B. Second, sum values with same COL and ROW values. Below are SQL statements:

 

   SELECT

      "ROW",COL,SUM(VALUE1*VALUE2) VALUE

   FROM (

      SELECT

         A."ROW",A.VALUE VALUE1,B.COL,B.VALUE VALUE2

      FROM MATRIXA A

      CROSS JOIN

      MATRIXB B

      WHERE A.COL=B."ROW"

   )

   GROUP BY "ROW",COL

   ORDER BY "ROW",COL

 

SPL solution:

   SPL offers xjoin() function to perform cross join.


A

1

=T("MatrixA.csv")

2

=T("MatrixB.csv")

3

=xjoin(A1:A; A2:B, A.COL==A2.ROW)

4

=A3.groups(A.ROW, B.COL; sum(A.VALUE * B.VALUE):VALUE)

A1: Import MatrixA table.

A2: Import MatrixB table.

A3: The xjoin() function performs cross join, as well as the conditional filtering at the same time.

A4: Group rows by the condition that COL value is equivalent to ROW value and sum products of values in each group.

 

   A non-equi-join is one where the joining condition is not equivalence comparison. Examples of such joins include finding whether an age is included in a specific age group and whether the income belongs to a certain income level.

 

Example 2Find which age group a person falls in based on COMMUNITY table and AGE_GROUP. Below is part of the source data:

COMMUNITY:

ID

NAME

AGE

1

David

28

2

Daniel

15

3

Andrew

65

4

Rudy


AGE_GROUP:

GROUP_NAME

START

END

Children

0

15

Youth

16

40

Middle

41

60

Old

61

100

 

 

SQL solution:

   There are no age record for some community residents, so we use LEFT JOIN to handle this task:

 

    SELECT

      ID,NAME,AGE,GROUP_NAME

   FROM COMMUNITY C

   LEFT JOIN

      AGE_GROUP A

   ON A."START"<=C.AGE AND A."END">=C.AGE

   ORDER BY ID

 

SPL solution:

   SPL offers xjoin() function to perform cross join. The function works with @1 option to enable a left join.


A

1

=T("Community.txt")

2

=T("AgeGroup.txt")

3

=xjoin@1(A1:C; A2:A,A2.START<=C.AGE &&A2.END>=C.AGE)

4

=A3.new(C.ID, C.NAME, C.AGE,A.GROUP_NAME)

A1: Import Community table.

A2: Import AgeGroup table.

A3: The xjoin@1() function left joins the two tables according to the first table and get records corresponding to each group. When a record does not match any age group, just keep it and set its corresponding age group as null.

A4: Return ages groups for all community residents.

 

Summary

SQL defines the join operation simply as performing Cartesian product on two sets (tables) and then filtering by a specific condition. The definition covers not only the commonly seen equi-joins, but the equi-joins with many-to-many relationship and even non-equi-joins. An experienced programmer knows that most joins in real-world situations are equi-joins. The over-simplified SQL definition does not reflect the features of equi-joins, and thus the language cannot make the most use of those features to facilitate code writing and task implementation. When a query is too complicated (such as when the number of involved tables are many and when there are nested queries), it is hard to write and optimize code with SQL.

SPL makes use of those features to distinguish equi-joins and non-equi-joins and to offer separate solutions for equi-joins with three different types of associative relationships. The language has join()function to handle one-to-one relationship (homo-dimension tables) and one-to-many relationship (primary and sub tables) and provides A.switch()function to deal with many-to-one relationship (foreign key table).

 Besides, SQL’s use of temporary table and nested query makes it hard to write and maintain a SQL query. SPL, however, can compose concise code step by step according to the natural way of thinking.

The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of grouping functions, which combines advantages of both Java and SQL. A join operation will thus become simple and easy with SPL.

MatrixA.csv

MatrixB.csv

AgeGroup.txt

Community.txt