Improving JOIN Performance: Order-based Merge Algorithm
In this article, we look at how to optimize and speed up JOINs between homo-dimension tables and between parent and child tables. The two cases share the same optimization method.
Suppose the sizes (that is, the numbers of records) of the two to-be-joined tables are N and M respectively. With the hash join algorithm, the computational complexity (the number of comparisons between associated fields) is about SUM(Ni*Mi), where Ni and Mi are the numbers of records in the two tables whose hash values are i, and they satisfy N=SUM(Ni) and M=SUM(Mi). In most cases this is significantly smaller than the complexity of a full traversal, which is N*M; when the hash values are evenly distributed, it is about K times smaller, where K is the number of possible hash values.
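To make the comparison count concrete, here is a minimal in-memory hash join sketch in Python; the table layout, field names, and sample records are made up for illustration:

```python
from collections import defaultdict

def hash_join(left, right, key):
    # Build phase: bucket the left table's records by the join key
    # (Python's dict does the hashing for us here).
    buckets = defaultdict(list)
    for rec in left:
        buckets[rec[key]].append(rec)
    # Probe phase: each right record is compared only against left records
    # in the same bucket, so comparisons stay near SUM(Ni*Mi), not N*M.
    result = []
    for rec in right:
        for match in buckets.get(rec[key], []):
            result.append({**match, **rec})
    return result

orders  = [{"id": 1, "cust": "A"}, {"id": 2, "cust": "B"}]
details = [{"id": 1, "product": "pen"}, {"id": 1, "product": "ink"}]
print(hash_join(orders, details, "id"))
```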
If both tables are ordered by the associated fields, we can use the merge-join algorithm to handle a JOIN. The complexity is N+M. When N and M are large (usually they are much larger than K), the complexity is far smaller than SUM(Ni*Mi). There are already a lot of discussions about the merge-join algorithm, so we won’t go into details.
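For reference, here is a minimal merge-join sketch in Python; it assumes both inputs are already sorted by the join key and that the key is unique in the first (parent/dimension) table:

```python
def merge_join(parent, child, key):
    # Both inputs are sorted by `key`; each is traversed exactly once,
    # so the total cost is about N+M comparisons.
    result = []
    i, j = 0, 0
    while i < len(parent) and j < len(child):
        pk, ck = parent[i][key], child[j][key]
        if pk < ck:
            i += 1
        elif pk > ck:
            j += 1
        else:
            # key is unique in `parent`: collect all child rows matching it
            while j < len(child) and child[j][key] == pk:
                result.append({**parent[i], **child[j]})
                j += 1
            i += 1
    return result
```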
This isn’t the right algorithm for foreign-key-oriented joins, because a fact table may have multiple foreign keys participating in joins, and it can’t be ordered by all of them at the same time.
But it is a suitable algorithm for handling joins between homo-dimension tables and between parent and child tables.
For a join between homo-dimension tables, the joining field is the primary key of both tables; for a join between parent and child tables, it is the primary key of the parent table and a part of the child table’s primary key. So we can pre-sort the to-be-associated tables by their primary keys. Though the sorting is time-consuming, it is a one-time cost. With the ordered tables, we can handle all later JOINs using the merge-join algorithm, which considerably increases efficiency.
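Continuing the sketch above, the one-time pre-sort pays off on every subsequent join (the sample records are hypothetical):

```python
# One-time cost: sort both tables by the primary key.
orders  = sorted([{"id": 2, "cust": "B"}, {"id": 1, "cust": "A"}],
                 key=lambda r: r["id"])
details = sorted([{"id": 1, "product": "ink"}, {"id": 2, "product": "pen"}],
                 key=lambda r: r["id"])
# From now on every join is a single N+M merge pass.
print(merge_join(orders, details, "id"))
```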
The importance of the order-based merge algorithm lies in handling large amounts of data. A pair of parent and child tables, say an orders table and an order detail table, are ever-growing fact tables that often become huge over time.
To JOIN two huge tables that can’t fit into memory, relational databases resort to the hash partitioning technique. According to the hash values of the associated field, each table is divided into a number of partitions, each small enough to be wholly loaded into memory, so that the in-memory hash join algorithm can be used on each pair of partitions. This involves transferring data between disk and memory: data is written out partition by partition and then loaded back in. Disk reads are slow, and disk writes are even slower, so the extra write and read passes drag down performance a lot. If a pair of partitions is still too large to fit into memory, a second round of hash partitioning is needed, which exacerbates the performance issue. Moreover, since the hash partitioning algorithm processes each partition by reading it entirely into memory, partitions are made as big as the memory allows in order to reduce their number. As a result, the available memory space shrinks almost to zero, which, under concurrent computing, seriously affects the performance of other concurrent tasks.
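The following Python sketch shows the shape of this external algorithm, reusing the hash_join function from the first sketch; the file handling is simplified for illustration, and the partition count is a made-up parameter:

```python
import os
import pickle
import tempfile

def partition_to_disk(records, key, nparts, tag):
    # The extra disk pass: every record is written out once, grouped into
    # partition files by the hash of its join key, and read back later.
    paths = [os.path.join(tempfile.gettempdir(), f"{tag}_{p}.pkl")
             for p in range(nparts)]
    files = [open(p, "wb") for p in paths]
    for rec in records:
        pickle.dump(rec, files[hash(rec[key]) % nparts])
    for f in files:
        f.close()
    return paths

def read_partition(path):
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

def partitioned_hash_join(left, right, key, nparts=8):
    lpaths = partition_to_disk(left, key, nparts, "L")
    rpaths = partition_to_disk(right, key, nparts, "R")
    result = []
    # Matching keys hash alike, so partition p of `left` can only join
    # with partition p of `right`; each pair must fit in memory.
    for lp, rp in zip(lpaths, rpaths):
        result.extend(hash_join(list(read_partition(lp)),
                                list(read_partition(rp)), key))
        os.remove(lp)
        os.remove(rp)
    return result
```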
There is no such performance issue with the order-based merge algorithm. Each of the two tables being joined needs only one traversal, so both the CPU’s computational workload and the disk IO decrease significantly. The order-based merge algorithm also uses very little memory, keeping only a small number of buffered records for each table, so it barely affects the memory available to other concurrent tasks.
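A streaming variant of the earlier merge-join sketch makes the memory behavior visible: it consumes plain iterators (file cursors in a real system) and holds only the current record from each side:

```python
def merge_join_stream(parent_iter, child_iter, key):
    # Buffers exactly one record per input at a time; memory use is
    # constant no matter how large the tables are.
    p = next(parent_iter, None)
    c = next(child_iter, None)
    while p is not None and c is not None:
        if p[key] < c[key]:
            p = next(parent_iter, None)
        elif p[key] > c[key]:
            c = next(child_iter, None)
        else:
            while c is not None and c[key] == p[key]:
                yield {**p, **c}
                c = next(child_iter, None)
            p = next(parent_iter, None)

# e.g. list(merge_join_stream(iter(orders), iter(details), "id"))
```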
SQL JOINs are defined on the Cartesian product and have no key-oriented join types. Without a definition of primary-key-oriented joins, we can only resort to engineering optimizations and can’t devise an order-based algorithm at the theoretical level. Some database products do check whether the to-be-joined tables are physically ordered by their primary keys, and use the merge-join algorithm if they are. The problem is that relational databases, based on unordered sets, won’t proactively ensure that data is physically ordered; on the contrary, many of their operations destroy the conditions for performing a merge join. An index makes the data logically ordered, but traversing physically unordered data is still slow.
The precondition for using the order-based merge algorithm is that the tables being joined are already ordered by their primary keys. Tables whose joining fields are primary keys often receive appended data, and in principle a table needs to be re-sorted after each append. Sorting a huge table is time- and resource-consuming, but keeping the table ordered isn’t necessarily difficult, because combining the appended data with the existing data is itself an order-based merge. Unlike a regular big-data sort, which has to write intermediate data to a buffer and then read it back, sorting only the newly-appended data and then merging it into the historical data is equivalent to rewriting all the data once, so the complexity is linear. Some practical optimization plans even make it unnecessary to rewrite all the data every time, further improving maintenance efficiency.
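A minimal Python sketch of this maintenance step, assuming the historical data arrives as an already-sorted iterable (a file stream in practice):

```python
import heapq

def append_by_merge(history, new_batch, key):
    # Sort only the small new batch in memory, then merge it with the
    # already-sorted history in one linear pass -- equivalent to rewriting
    # all the data once, not to re-sorting all of it.
    new_sorted = sorted(new_batch, key=lambda r: r[key])
    return list(heapq.merge(history, new_sorted, key=lambda r: r[key]))

history = [{"id": 1}, {"id": 3}]
batch   = [{"id": 4}, {"id": 2}]
print(append_by_merge(history, batch, "id"))   # ordered ids 1, 2, 3, 4
```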
Another merit of the order-based merge algorithm is that it is easy to parallelize.
Contemporary computers are equipped with multiple CPUs as well as SSDs that support concurrency well, offering a solid foundation for multi-process (or multi-threaded) parallel processing to boost performance. The conventional hash partitioning algorithm, however, is difficult to parallelize. With multi-process hash partitioning, multiple processes may write data into the same partition at the same time, causing resource conflicts; and processing a single partition uses up nearly all available memory, hindering other concurrent tasks.
To parallelize the order-based merge, we divide the data into a number of partitions. Partitioning a single table is relatively simple, but when the two to-be-associated tables are partitioned, the partitions must always be aligned; otherwise records in the two tables will be mismatched and the final result will be wrong. Pre-ordering the data is what makes high-performance, real-time aligned partitioning possible.
To achieve real-time aligned partitioning of the two to-be-joined tables, we first partition the parent table, or the bigger of the two homo-dimension tables (how to partition a table evenly and how to handle appended data will be discussed in subsequent articles). Then we take the primary key value of the first record in each partition and use binary search to locate the corresponding partitioning points in the child table or the other homo-dimension table (whether binary search can be used depends on the storage format, which will also be discussed later).
Since the primary key values are ordered, the key values in each partition of the first table fall in a continuous interval, which excludes every record whose key value is outside the interval and includes every record whose key value is within it; the same holds for the corresponding partition of the second table, so no data will be mismatched. It is also because the key values are ordered that we can quickly locate the partitioning points with an efficient binary search. Orderliness is thus the guarantee of both correct and efficient partitioning, which makes parallel processing easy.
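A minimal sketch of aligned partitioning in Python, assuming both tables are sorted in-memory lists and that every child key also appears in the parent (the parent-child case); a real system would do this over segmented storage:

```python
from bisect import bisect_left

def aligned_partitions(parent, child, key, nparts):
    # Cut the parent into nparts roughly equal segments, then binary-search
    # the key at each cut in the child so both tables split at the same
    # key values -- no record can end up in mismatched partitions.
    child_keys = [rec[key] for rec in child]           # already sorted
    step = (len(parent) + nparts - 1) // nparts
    cuts_p = list(range(0, len(parent), step)) + [len(parent)]
    cuts_c = [bisect_left(child_keys, parent[i][key]) if i < len(parent)
              else len(child)
              for i in cuts_p]
    return [(parent[cuts_p[t]:cuts_p[t + 1]], child[cuts_c[t]:cuts_c[t + 1]])
            for t in range(len(cuts_p) - 1)]
```

Each (parent segment, child segment) pair can then be merge-joined independently, for example in a multiprocessing pool, since the key intervals of the pairs never overlap.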
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL