Order-based Merge Join (JOIN Simplification and Acceleration Series 8)
Now, we look at how to optimize and speed up homo-dimension table JOINs and primary-and-sub table JOINs. They share the same optimization method.
As we mentioned, the computational complexity of the hash join algorithm (the number of comparisons between associated fields) is about SUM(ni*mi), which is much smaller than n*m, the complexity of a full traversal. Yet the outcome depends partly on luck: you need a hash function that distributes the key values evenly.
If both tables are ordered by the associated fields, we can use the merge-join algorithm to handle the JOIN, whose complexity is n+m. When n and m are both large (usually much larger than the value range of a hash function), this is far smaller than the complexity of the hash join algorithm. The merge-join algorithm has been widely discussed, so we won't go into details beyond a minimal sketch.
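To make the idea concrete, here is a minimal merge-join sketch in Python (a generic illustration, not SPL code), assuming both inputs are sequences of (key, payload) pairs already sorted by key, with unique keys on one side as in a primary-and-sub table join:

```python
# Merge join over two key-ordered inputs: each side is traversed exactly
# once, so the cost is O(n + m) comparisons.
def merge_join(primary, sub):
    primary, sub = iter(primary), iter(sub)
    p = next(primary, None)
    s = next(sub, None)
    while p is not None and s is not None:
        if p[0] < s[0]:          # primary key too small: advance primary
            p = next(primary, None)
        elif p[0] > s[0]:        # sub key too small: advance sub
            s = next(sub, None)
        else:                    # keys match: emit the joined row(s)
            key = p[0]
            while s is not None and s[0] == key:
                yield (key, p[1], s[1])
                s = next(sub, None)
            p = next(primary, None)

orders  = [(1, "o1"), (2, "o2"), (3, "o3")]        # sorted by order id
details = [(1, "d1a"), (1, "d1b"), (3, "d3a")]     # sorted by order id
print(list(merge_join(orders, details)))
# [(1, 'o1', 'd1a'), (1, 'o1', 'd1b'), (3, 'o3', 'd3a')]
```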
Yet this isn't the right algorithm for foreign-key-based joins, because a fact table may have multiple foreign keys participating in joins, and it cannot be ordered by all of them at the same time.
But it is a suitable algorithm for handling joins between homo-dimension tables and between primary and sub tables.
For a homo-dimension table join, the join field is the primary key of both tables; for a primary-and-sub table join, it is the primary key of the primary table and a part of the primary key of the sub table. So we can pre-sort the to-be-joined tables by their primary keys. Though the sorting is time-consuming, it is a one-off cost: with the ordered tables, all later JOINs can use the merge-join algorithm, which considerably increases efficiency.
Once again, a faster algorithm can be designed because we make use of an important feature of the computation: the join keys are primary keys, or a primary key and a part of one.
The importance of the order-based merge algorithm lies in handling large amounts of data. A pair of primary and sub tables, say an orders table and an order detail table, are ever-growing fact tables, which in the course of time often become huge and exceed the memory capacity.
The hash partition method for processing external-storage data generates a large amount of buffered data and carries a heavy IO load resulting from two rounds of reads and one round of writes on disk. The order-based merge algorithm has no such performance issue: each of the two tables being joined needs only one traversal, so both the CPU's computational workload and the disk IO activity decrease significantly. Its execution also uses very little memory, keeping only a small number of buffered records for each table, which barely affects the memory demands of other concurrent tasks. The hash partition method, by contrast, requires relatively large memory to hold more of the retrieved data and so reduce the number of heaps.
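To make that IO pattern concrete, here is a hypothetical Python sketch of external hash partitioning (file naming and record format invented for illustration): each table is written once into partition files and every partition is read back, i.e., one write plus a second read on top of the initial read, whereas the order-based merge reads each table once with no intermediate writes.

```python
import os, pickle, tempfile

def partition(rows, k, prefix):
    """Scatter (key, payload) rows into k partition files by hash of key."""
    paths = [os.path.join(tempfile.gettempdir(), f"{prefix}_part{i}.bin")
             for i in range(k)]
    files = [open(p, "wb") for p in paths]
    for key, payload in rows:                         # first read of input
        pickle.dump((key, payload), files[hash(key) % k])  # one full write
    for f in files:
        f.close()
    return paths

def read_rows(path):
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

def hash_partition_join(a_rows, b_rows, k=4):
    a_parts = partition(a_rows, k, "a")
    b_parts = partition(b_rows, k, "b")
    for pa, pb in zip(a_parts, b_parts):  # second read round, per heap pair
        lookup = {}
        for key, payload in read_rows(pa):            # build side in memory
            lookup.setdefault(key, []).append(payload)
        for key, payload in read_rows(pb):            # probe side streamed
            for a_payload in lookup.get(key, ()):
                yield (key, a_payload, payload)

print(list(hash_partition_join([(1, "a1"), (2, "a2")],
                               [(1, "b1"), (2, "b2")])))
# [(1, 'a1', 'b1'), (2, 'a2', 'b2')]
```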
The Cartesian-product-based SQL JOIN doesn't distinguish join types. Without a definition of primary-key-based joins, we can only resort to engineering optimizations and can't devise a more efficient order-based algorithm. Some database products check whether the to-be-joined tables are physically ordered by their primary keys and use the merge-join algorithm if they are. The problem is that relational databases, built on unordered sets, won't proactively ensure data is physically ordered; on the contrary, many of their operations destroy the conditions for performing a merge join. An index makes the data logically ordered, but traversing physically unordered data is still slow.
The condition for using the order-based merge algorithm is that the tables being joined are already ordered by their primary keys. Usually, new data is continually appended to these ordered tables, and in principle a sort is needed after each append. Sorting a huge table is time-consuming, so does this make appending prohibitively difficult? In fact, it doesn't: combining the appended data with the existing data is itself an order-based merge. Unlike a regular big-data sort, which must write intermediate data to a buffer and then read it back, sorting the newly-appended data and merging it with the ordered historical data is equivalent to rewriting all data once, whose complexity is linear. Some engineering optimizations even make it unnecessary to rewrite all data every time, further enhancing maintenance efficiency; the basic merge step is sketched below. Find related documentation at https://c.scudata.com.
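A hedged sketch of the append-by-merge idea in Python, assuming records are (key, payload) pairs and the history is already sorted by key (a real system would operate on external files): only the small new batch is sorted, and the combined stream is produced in one linear pass.

```python
import heapq

def append_sorted(history, new_batch):
    """history: rows already sorted by key; new_batch: unsorted new rows.
    Returns one key-ordered stream, produced in O(n + m) time after the
    O(m log m) sort of the (small) new batch."""
    return heapq.merge(history, sorted(new_batch), key=lambda r: r[0])

history   = [(1, "a"), (3, "c"), (5, "e")]
new_batch = [(4, "d"), (2, "b")]
print(list(append_sorted(history, new_batch)))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
```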
Another merit of the order-based merge algorithm is that it facilitates data segmentation for parallel processing.
Contemporary computers all have multicore CPUs, and SSDs support concurrency well, providing a solid foundation for multithreaded parallel processing to boost performance. The conventional hash partition algorithm, however, is difficult to parallelize: multiple threads would write into the same data heap at the same time, causing resource conflicts, and the subsequent join between heap pairs of the two tables occupies a large amount of memory, making it hard to raise the degree of parallelism.
The order-based merge divides data into segments for parallel processing. Dividing one table is relatively simple, but when dividing two tables that are to be joined, the segments must stay aligned; otherwise records of the two tables will be mismatched and the result will be wrong. Pre-ordering the data makes high-performance, real-time aligned division possible: first split the primary table (or the larger of the two homo-dimension tables) evenly, then take the primary key value of the first record of each segment and use binary search to locate the corresponding segmentation point in the sub table (or the other homo-dimension table), which is also ordered. This way the two tables are segmented in alignment.
Since the primary key values are ordered, the key values in each segment of the primary table fall within one continuous interval: records with key values outside the interval are excluded and all records with key values inside it are included, and the same holds for the sub table, so no data is mismatched. The ordered keys are also what allow us to locate the segmentation points in the sub table quickly with binary search. Orderliness is thus the guarantee of both correct and efficient segmentation, making parallel processing easy; a minimal sketch follows.
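Here is an illustrative Python sketch of aligned segmentation, assuming both tables are in-memory lists of (key, payload) rows sorted by key (in practice they would be segments of external files); it reuses merge_join from the earlier sketch.

```python
import bisect

def aligned_segments(primary, sub, n_segments):
    """Split two key-ordered tables into aligned segment pairs."""
    step = max(1, len(primary) // n_segments)
    # Roughly even split points over the primary table.
    p_points = list(range(0, len(primary), step)) + [len(primary)]
    sub_keys = [row[0] for row in sub]
    # Binary-search each primary split key in the (ordered) sub table.
    s_points = [bisect.bisect_left(sub_keys, primary[i][0])
                if i < len(primary) else len(sub)
                for i in p_points]
    return [((p_points[i], p_points[i + 1]), (s_points[i], s_points[i + 1]))
            for i in range(len(p_points) - 1)]

orders  = [(1, "o1"), (2, "o2"), (3, "o3"), (4, "o4")]
details = [(1, "d1a"), (1, "d1b"), (3, "d3"), (4, "d4")]
for (pa, pb), (sa, sb) in aligned_segments(orders, details, 2):
    # Each segment pair is independent and could run on its own thread.
    print(list(merge_join(orders[pa:pb], details[sa:sb])))
# [(1, 'o1', 'd1a'), (1, 'o1', 'd1b'), (2, 'o2', ...skipped: no match)]
# segment 1 -> (1, 'o1', 'd1a'), (1, 'o1', 'd1b')
# segment 2 -> (3, 'o3', 'd3'), (4, 'o4', 'd4')
```

Because matching keys always land in the same segment pair, each thread can join its pair independently and the concatenated results equal the full join.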
Another feature of the primary-key-based join between primary and sub tables is that a sub table has exactly one primary table with which it is joined according to the primary key (the same holds for the join between homo-dimension tables, but the mechanism is easier to explain with primary and sub tables). A table cannot have multiple mutually unrelated primary tables, though a primary table may have its own primary table. In this case, we can employ an integrated storage mechanism that stores each sub table record as a field value inside the corresponding primary table record. This yields better performance on big data: less data is stored, since each join key is stored only once, and the join is effectively pre-computed, with no key-matching needed at query time.
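A toy Python illustration of the idea (the data layout is invented for the example): with the detail records nested inside their order record, the "join" degenerates into a flatten with no key matching, and each order id is stored only once.

```python
# Integrated storage: sub records live inside their primary record.
orders = [
    {"id": 1, "customer": "A", "details": [{"sku": "x", "qty": 2},
                                           {"sku": "y", "qty": 1}]},
    {"id": 2, "customer": "B", "details": [{"sku": "x", "qty": 5}]},
]

# The "join" is just a flatten: no key comparison happens at all.
joined = [(o["id"], o["customer"], d["sku"], d["qty"])
          for o in orders for d in o["details"]]
print(joined)
# [(1, 'A', 'x', 2), (1, 'A', 'y', 1), (2, 'B', 'x', 5)]
```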
All the join query optimization strategies discussed above have been implemented in esProc SPL and applied effectively in real-world scenarios. SPL is now open source; you can download it and find more documentation at https://c.scudata.com.
JOIN queries are arguably the most complex structured data computations. In this series we have tried our best to explore and dissect them and to propose solutions, but there are still aspects we haven't covered.
More related documentation can be found at https://c.scudata.com.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL