SPL Order-based Merge Join
It is no rare thing that we are handling an association between very large tables where performance becomes an issue. SPL specifically offers an order-based merge algorithm for primary-key-based associations and associations based on the part of a composite primary key.
The order-based merge algorithm requires that every table to be associated be sorted by their primary key in advance. The sorting is costly but once-for-all, always letting us use the algorithm to perform the same type of associations later and thus increasing performance considerably.
Learn how to achieve order-based storage in SPL Ordered Storage.
Order-based merge join
Suppose we have two tables a and b. Both are stored according to the order of id. We are trying to perform an inner join on them by their primary keys. The procedure of getting this done using the order-based merge algorithm is as follows:
fig1 Inner join using the order-based merge join
According to the figure above, step1 reads a record from each of the table into memory and checks whether their ids are equal. The first retrieval shows that table a’s id (1049) is less than table b’s id (1050). Step2 discards the first record of table a and reads the next record – this time the two records, with the same id 1050, match – and makes the two records the fields of a new record in the memory result set. Step3 retrieves records from both tables respectively and checks whether they are equal – again they are equal with the same id 1052 – and uses them to generate a new record for the memory result table.
Continue to read records from table a and table b respectively, compare their ids, discards the record if its id is smaller, and read the next record from the same table for another comparison. The process repeats until the record with the same id is retrieved and use the records to generate a new record in the memory result table.
In this way, records are repeatedly retrieved, compared, and used for generating new records. The loop goes on and on until all records in one of the tables are retrieved. The loop runs like this:
fig2 Inner join loop using order-based merge
SPL encapsulates the loop as a delayed cursor. It controls the loop’s execution and termination to return the association result to the invoker.
Similar algorithms can be used to achieve full join and left join by tuning the comparison method. The application of the algorithm can be extended to the n-table association. We can also handle associations where the joining fields are table a’s primary key and part of table b’s primary key using algorithms of similar principle. For them, only fetch b.r is needed after the loop is finished and returns the result set.
The order-based merge algorithm traverses the two tables to be joined only once. It does not need buffer tables even for large data tables. When sizes of the two tables are N and M respectively, the association’s degree of complexity is O(N+M). For associations between unordered tables, we need to compare pairs of records in sequence if no optimization method is used. In those cases, the degree of complexity is O(N*M), much higher than O(N+M). Databases usually use hash partitioning algorithm to speed up the computation, but it is still much more complex than the order-based merge algorithm. What’s more, for external computations of large-scale data sets, the hash partitioning algorithm involves reading and writing buffer files and there could be an unlucky hash function. The order-based merge algorithm can avoid these issues and boost performance greatly.
SPL uses joinx function to achieve an association between tables using the order-based merge algorithm:
A |
|
1 |
=file("a.btx").cursor@b() |
2 |
=file("b.btx").cursor@b() |
3 |
=joinx(A1,id;A2,id) |
4 |
=joinx@1(A1,id;A2,id) |
5 |
=joinx@f(A1,id;A2,id) |
The joinx function, by default, performs an inner join, and works with @1 option and @f option respectively to enable a left join and a full join.
The joinx function assumes the parameter cursors are ordered. Executing the function on unordered cursors will get wrong result.
Parallel processing by synced segmentation
We cannot directly achieve segment-based parallel processing with the order-based merge algorithm that reads records one by one from two or more cursors for comparison because the division, which is performed as evenly as possible according to the number of records, cannot ensure synced distribution of joining field values in all corresponding segments of the two tables. For example, we are trying to divide table a and table b respectively into four segments. Below is the division result:
fig3 Direct division
We can see that the record with id 87235 in the second segment of table a corresponds to the third segment of table b. That’s out of sync.
SPL handles the problem by making use of the ordered primary key. Below is the SPL method of dividing the two tables synchronously:
fig4 Synchronous segmentation-based parallel processing
In fig4, step1 divides table a into four parts directly because the table is ordered by id and it is easy and fast to get the id value of the first record in each part. Step2 generates four interval conditions using these id values for dividing table b. As table b is also ordered by its primary key, records whose primary key values are within each of the interval are stored continuously, and can be quickly located and used to generate a cursor. Step3 performs the parallel processing. In each thread table a is divided directly and table b is divided according to the interval conditions, ensuring that the primary key values of tables on both sides match synchronously, and cursors of corresponding segments are joined to generate new cursor for subsequent computations.
Take the record whose id is 87235 as an example. It falls within part2 of table a. That is to say, it is located in the part defined by id2 and id3. For table b, all its records whose id values are between id2 and id3 settle in part2, so its record where id is 87235 is definitely within part2. This makes correspondence synchronous.
SPL offers the function that uses this method to generate synchronous multicursors that are convenient for association:
A |
B |
|
1 |
=file("a.ctx").open() |
=file("a.ctx").open() |
2 |
=A1.cursor@m(;;4) |
=B1.cursor(;;A3) |
3 |
=joinx(A2,id;B2,id) |
… |
A2 generates a multicursor for table a. B2 generates a multicursor for table b based on A2’s way of segmentation. Since a composite table defines the primary key (the field name preceded by #) at creation, there’s no need to write key field names explicitly and the search is performed by automatically matching the primary keys (names of key fields in the two tables can be different). A3’s joinx function returns a multicursor, too.
SPL also allows using the synchronous segmentation mechanism on ordered bin files and text files. Just don’t forget to write the joining fields explicitly. As the principle is the same, no examples are offered here.
Make note that primary key values are not duplicate. It is impossible that records with same primary key values are put into two segments. But there are cases where table a’s primary key relates to part of table b’s composite key – the joining field in table b is not the whole key, and duplicate values could exist. A direct division thus may put records with same joining field values in two segments. To avoid the error, the parallel segmentation should take table a as the base table and divides table b according to a’s multicusor, not vice versa.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version