More foreign-key joins (JOIN Simplification and Acceleration Series 7)
Let’s continue our discussion of foreign-key-based JOINs using the example in the previous article.
If a fact table is too big to fit in memory, the foreign key address-alization strategy fails: the pre-calculated addresses are in-memory pointers, and they cannot be saved to or reused from external storage.
Generally, a dimension table referenced by a foreign key is small, while an ever-growing fact table is much larger. By loading the smaller dimension table into memory, we can still create temporary foreign key pointers.
| | A |
|---|---|
| 1 | =file("customer.btx").import@b() |
| 2 | >A1.keys@i(id) |
| 3 | =file("orders.btx").cursor@b() |
| 4 | >A3.switch(cid,A1) |
| 5 | =A3.groups(cid.city;sum(amount)) |
The first two steps are the same as those of the in-memory computation. In step 4, the switch from foreign key values to record addresses is performed while the data streams in, but the switch result cannot be retained for reuse: hash values must be calculated and compared all over again for the next JOIN, so performance is poorer than with the in-memory strategy. Compared with the hash join algorithm in terms of computational workload, this one does not need to calculate hash values on the dimension table's primary key; the advantage grows when the hash index on the dimension table is reused repeatedly, though the overall gain is limited because a dimension table is generally small. More importantly, like the in-memory algorithm it handles all foreign keys in one pass and is easy to parallelize, which produces much better real-world performance than the hash join algorithm.
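To make the index reuse concrete, here is a minimal sketch, assuming a second fact table file payments.btx (hypothetical, with the same cid foreign key and an amount field) that references the same dimension table. The hash index built once in step 2 serves both joins:

| | A |
|---|---|
| 1 | =file("customer.btx").import@b() |
| 2 | >A1.keys@i(id) |
| 3 | =file("orders.btx").cursor@b().switch(cid,A1) |
| 4 | =A3.groups(cid.city;sum(amount)) |
| 5 | =file("payments.btx").cursor@b().switch(cid,A1) |
| 6 | =A5.groups(cid.city;sum(amount)) |

Only the per-record hash lookups on the fact side are repeated in step 5; the index-building cost of step 2 is paid once.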
This algorithm has a useful variation: foreign key numberization.
By converting the dimension table's primary key values into natural numbers starting from 1, we can locate a dimension record directly by its sequence number, with no hash calculation or comparison at all, and obtain the same high performance that the in-memory foreign key address-alization strategy gives.
| | A |
|---|---|
| 1 | =file("customer.btx").import@b() |
| 2 | =file("orders.btx").cursor@b() |
| 3 | >A2.switch(cid,A1:#) |
| 4 | =A2.groups(cid.city;sum(amount)) |
With a dimension table whose primary key values are numberized, creating the hash index (step 2 of the previous script) becomes unnecessary.
Foreign key numberization is, in essence, a way of achieving address-alization on external storage. The strategy converts the foreign key values in the fact table into numbers, which is similar to the in-memory address-alization process, and the converted result can be reused. Note that the foreign key values in the fact table need to be re-matched whenever a major change happens to the dimension table, otherwise the correspondence will be wrong. In real-life practice, however, such re-matching is rare, because a dimension table is relatively stable and most changes to it are appends and updates rather than deletions. More detailed documentation about this kind of engineering optimization can be found at https://c.scudata.com.
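The one-time conversion itself can be sketched as follows, with assumptions flagged: pfind returns the position of the dimension record whose primary key equals the given value, and orders_num.btx is a hypothetical output file for the numberized fact table:

| | A |
|---|---|
| 1 | =file("customer.btx").import@b() |
| 2 | >A1.keys@i(id) |
| 3 | =file("orders.btx").cursor@b() |
| 4 | =A3.run(cid=A1.pfind(cid)) |
| 5 | =file("orders_num.btx").export@b(A4) |

Step 4 replaces each foreign key value with the row number of its dimension record as the cursor streams; step 5 writes the converted fact table back to external storage. The hash index of step 2 is used for the last time here; all later JOINs read orders_num.btx and use switch(cid,A1:#) as in the script above.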
SQL is based on the concept of unordered sets, so SQL databases cannot exploit foreign key numberization as a shortcut for locating records by position in an unordered set. Even if the foreign key has been deliberately numberized, the database cannot recognize it; it still falls back on calculating and comparing hash values, that is, on hash-based search.
But what if a dimension table is too large to be held by the memory?
In the above algorithm, accesses to the fact table are sequential, but accesses to the dimension table are random. When discussing hard disk characteristics we mentioned that hard disks respond slowly to random accesses, so the algorithm is not suitable for a dimension table that itself resides on external storage.
If, from the very beginning, we store the dimension table on external storage sorted by its primary key, we can exploit the fact that the association key is the primary key to optimize performance.
If the fact table is small enough to fit in memory, matching its records against the dimension table by foreign key becomes a batch search on external storage. If an index has already been created on the dimension table's primary key, the search is fast and performance improves, because the large dimension table is never traversed in full. The algorithm can also resolve multiple foreign keys at a time. SQL, however, does not distinguish between dimension table and fact table. When a large table and a small table are being joined, the optimized hash join skips hash partitioning and buffering: the database generally reads the smaller table into memory and traverses the larger one. Traversing the whole large dimension table performs much worse than the external storage search algorithm.
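One way to realize the batched search is sketched below, under these assumptions: customer.btx is stored sorted by id, and SPL's joinx() merge-joins cursors that are already ordered by the join keys (if the available signature differs, any merge join over the two ordered streams serves the same purpose). Sorting the in-memory fact table by the foreign key first means the dimension table is read sequentially, in one pass, and only as far as the largest key requires:

| | A |
|---|---|
| 1 | =file("orders.btx").import@b().sort(cid) |
| 2 | =file("customer.btx").cursor@b() |
| 3 | =joinx(A1.cursor():o,cid;A2:c,id) |
| 4 | =A3.groups(c.city;sum(o.amount)) |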
If the fact table is also large, we can adopt a one-sided partition algorithm that divides the fact table only. As the dimension table is already ordered by the association key (its primary key), we can conveniently divide it into segments, take the boundary values of each segment (the minimum and maximum primary key values in it), and split the fact table into heaps according to those boundaries; a sketch of the split follows below. Each heap is then joined with the corresponding segment of the dimension table. During the process only the fact table is physically partitioned and buffered; nothing needs to be done to the dimension table. And because the fact table is split by key ranges rather than by a hash function, an unsuitable hash function, and with it a second round of partitioning, is avoided, keeping performance manageable. The database's hash partition algorithm, by contrast, partitions both large tables physically and buffers the heaps; an unsuitable hash function may force a second round of partitioning, and performance becomes much poorer and hard to control.
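The splitting step can be compressed into a sketch like the one below, under several loudly stated assumptions: customer.btx is sorted by id, its id column alone fits in memory, every cid in the fact table matches some customer, we choose 4 heaps, and the heap file names are hypothetical. The per-heap joins that follow the split are omitted:

| | A | B |
|---|---|---|
| 1 | =file("customer.btx").import@b(id) | /read only the primary key column |
| 2 | =4 | /number of heaps |
| 3 | =to(A2).(A1(A1.len()*~\A2).id) | /maximum key value of each dimension segment |
| 4 | =to(A2).(file("heap"/~/".btx")) | /one buffer file per heap |
| 5 | for file("orders.btx").cursor@b(),100000 | =A5.group@n(A3.pselect(~>=cid)) |
| 6 | | >A4.run(if(B5(#),~.export@ab(B5(#)))) |

Row 5 fetches the fact table in batches of 100,000 rows and numbers each record by the first segment whose maximum key is not below its cid; row 6 appends each group to the corresponding heap file. Each heap can then be joined with its dimension segment, for example with the merge scan shown above.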
We can also turn to cluster computing to handle a large dimension table.
Since the dimension table cannot fit into the memory of a single machine, we can add machines to hold it. Segment the dimension table by primary key values and load the segments into the memories of multiple machines to form a cluster dimension table. On it we can apply the above in-memory dimension table algorithm, resolving multiple foreign keys at a time and parallelizing easily; the foreign key numberization technique applies to a cluster dimension table as well. With this algorithm the fact table is never transmitted, only a small amount of network traffic is generated, and no node needs to buffer data locally. Under the SQL system, which cannot know which table is the dimension table, the hash-partition method performs a Shuffle on both tables, producing far more network transmission.
Many details are involved in these algorithms, and limited space does not permit covering them here. Visit https://c.scudata.com for more related essays if you are interested.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL