The Anatomy of JOIN Operations
JOIN operations are used in SQL to associate tables. The JOINs, both according to the programmers and from the angle of database implementation, are thought of the most difficult operation.
Yet the definition of JOINs in SQL is simple. A JOIN is the Cartesian product between two sets (tables) according to a certain filtering condition. The syntax is A JOIN B ON …. In principle, the result set of performing a Cartesian product should be composed of 2-tuples containing members from both sets. As members of a SQL set are always the records having fields, and as SQL doesn’t support using a generic data type to describe a 2-tuple whose members are records, the language simply joins the fields of records from both tables to form a new set of records to return, and thus the name JOIN. The name has nothing to do with multiplication (the Cartesian product). But whether a result set consists of 2-tuples or records generated by combining fields from both tables won’t affect our analysis of JOIN operations.
The definition of SQL JOINs doesn’t stipulate the format of the filtering condition. Theoretically any operation can be considered a JOIN if the target result set is the subset of the Cartesian product of the two original sets. Suppose there are set A={1,2,3} and set B={2,3,4}. Calculate A JOIN B ON A<B and the result set is {(1,2),(1,3),{1,4),(2,3),{2,4),(3,4)}.
But according to experience, most of the JOINs in real-world practices are equi-joins, where the filtering condition is one or more equivalence relations (multiple equivalence relations are joined by AND). The syntax is A JOIN B ON A.ai=B.bi AND …, in which ai and bi are respectively fields of table A and table B. The above example, where the filtering condition is ON A<B, is a non-equi-join. The non-equi-joins are relatively rare. Actually on many occasions, a non-equi-join can be handled by being converted into an equi-join. So the analysis focuses on the equi-joins.
Depending on different handling of null values, we also have LEFT JOIN and FULL JOIN. JOINs generally have one-to-one relationship, one-to-many relationship and many-to-many relationship. These are common terms and included by almost all SQL textbooks, so we won’t go into them.
We divide equi-joins into three types:
1. Foreign-key-oriented joins
Two tables are associated by matching a certain field or certain fields in table A with the primary key of table B (By association, we mean the join condition is that values of the corresponding fields are equivalent). Table A is the fact table, and table B is the dimension table. The field(s) in table A matching table B’s primary key is the foreign key of table A pointing to table B; table B is table A’s foreign key table. A table and its foreign key table are a many-to-one relationship, supporting JOIN and LEFT JOIN. The FULL JOIN is extremely rare.
A typical example is the transaction records of accounts and the basic information of the accounts.
2. Joins between homo-dimension tables
Two tables are associated by matching table A’s primary key and table B’s primary key. The matching is a one-to-one relationship, and each table is the other’s homo-dimension table. The joins between homo-dimension tables support JOIN, LEFT JOIN and FULL JOIN. But for most of the data structure designs, FULL JOIN is rare.
A typical example is the employee table and the sellers table.
3. Joins between the parent and child tables
Two tables are associated by matching table A’s primary key and one or several of table B’s primary key fields. Table A is the parent table, and table B is the child table. It is a one-to-many relationship from table A to table B. The joins between them only involve JOIN and LEFT JOIN.
A typical example is the orders table and the orders detail table.
In this context, the primary key is a logical one, including one or more fields whose values are unique. In a table, it is likely that more than one field or set of fields has unique values (but this is rare). All of them can be used as the primary key. The primary key isn’t necessarily the one set on the physical table.
SQL doesn’t differentiate the concept of foreign key table and the parent and child tables. In the context of SQL, a many-to-one relationship and a one-to-many relationship are essentially the same thing except for the direction in which tables are associated. Indeed, an orders table can be considered the foreign key table of the orders detail table. The purpose of differentiating them is to use different methods to simplify their syntax and optimize their performance.
The three types of joins cover most of the equi-join scenarios. Almost all equi-joins that have business significance fall into the three types. To define equi-joins with the three types won’t decrease the range of its application scenarios.
All the three join types involve the primary key. And there isn’t a many-to-many relationship. Doesn’t this relationship have any real-world significance?
No. The relationship almost hasn’t any meaning in real-world business practice.
A many-to-many correspondence occurs when the related fields for JOINing two tables don’t contain any primary key. In that case, there exists a larger table that correlates the two tables with each other by using them as dimension tables. When a student table and a subject table are JOINed, there will be a score table that uses the two tables as dimension tables. It makes no sense in business practice to purely JOIN such two tables.
So it is almost certain that a SQL statement is wrong or the data is bad if a many-to-many relationship happens. The relationship is useful for checking mistakes in JOIN operations.
But we don’t deny the existence of an exception by using “almost”. On very rare occasions, the many-to-many relationship has business significance. One example is performing matrix multiplication in SQL, where an equi-join with many-to-many relationship occurs. Try to code the computation by yourselves.
It is indeed a simple way to define JOIN operations as the Cartesian product plus a filtering. A simple definition covers an extensive variety of joins, including the equi-joins having many-to-many relationship as well as the non-equi-joins. But it can’t fully reflect the computing characteristics of the most common equi-joins, depriving the programming language of opportunities to better code a query for achieving a computing goal according to the characteristics, and having a lot of difficulties in expressing and optimizing a complex query (a join involving several or more tables or with a nested query). By making good use of the characteristics, we can design simple syntax to make computations more performant. We’ll discuss it in detail in subsequent articles.
In summary, it’s more sensible to define the rare JOIN scenarios as a special type of operations than to include them in a universal definition.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL