Performance Optimization - 6.9 [Foreign key association] One side partitioning

 

Performance Optimization - 6.8 [Foreign key association] Big dimension table search

Let’s finally deal with the case where both the dimension table and fact table are large, and usually the fact table is larger. In this case, it is very difficult to calculate at high speed anyway, but we still try to find a way to make the calculation as fast as possible.

Does it work to read the fact table with a cursor first and then execute the dimension table search algorithm in batches as discussed in the previous section?

No, because the fact table is very large, too many batches of searches will always find all dimension table records, or even more than once, which will lead to serious frequent reading of small data. Just like what we talked about earlier that the total time consumed in index sorting is not necessarily shorter than that of big sorting algorithm, the performance of this method is unlikely to surpass the algorithm that sorts both the fact table and dimension table before merging and joining.

For the scenario where both tables are large, the database generally adopts the hashing & partitioning algorithm, that is, calculate the hash value of the association keys in two tables respectively, and then put the data whose hash value is within a certain range into a partition to form the buffer data on external storage, ensuring each partition is small enough to be loaded into memory, and finally execute the in-memory join algorithm for each pair of partitions (two tables) one by one. This algorithm will split and buffer both large tables, which can also be called two-side partitioning algorithm. When we are unlucky with the hash function, a second round of hashing may be required since a certain partition may be too large.

If a dimension table is stored orderly by primary key and can be read in segments after adopting an appropriate storage scheme, then we can assume that the dimension table has been logically partitioned (we can simply regard each segment as a partition, and it is easy to calculate an appropriate number of segments when we know the total size of data table). At this time, we only need to partition and buffer the fact table according to the segment partitioned by the primary key value of the dimension table where its foreign key value falls into, that is to say, we just need to partition the fact table. After that, we can gradually associate each partition (of fact table) with the corresponding segment of dimension table, because the partitioning action has ensured the fact table records of one partition will only be related to the dimension table records in the corresponding segment.

The logical process of the algorithm is as follows:

A B C D
1 =file(“customer.btx”)
2 =10.(A1.cursor@b(id;~:10).fetch(1).cid)
3 =file(“orders.btx”).cursor@b(c_id,amount)
4 =10.(file(“temp”/~))
5 =A3.groupn(pseg(A2,c_id);A4)
6 func for 10 =A1.curor@b(cid,area,discount;B6:10).fetch()
7 =A4(B6).cursor@b().join(c_id,C6:cid,area,discount)
8 for C7,1000 return C8
9 =cursor@c(A6).groups(area;amount*discount)

The overall process is similar to the two-side partitioning algorithm, except that it does not need to partition the dimension table, and it only needs to partition the fact table, so this algorithm can be called one-side partitioning. Moreover, this algorithm can divide the dimension table into equal segments, and there won’t be situation where we have to do a second round of hashing and partitioning due to bad luck with hash function. Although there may be a situation where a partition of fact table is too large, we can use the algorithm for large fact table and small dimension table, avoiding a second round of partitioning. One-side partitioning algorithm will generate much less buffer data than two-side partitioning algorithm, so the performance will be better.

The process is still quite complex, and SPL also encapsulates the algorithm:

A
1 =file(“customer.btx”)
2 =file(“orders.btx”).cursor@b(c_id,amount)
3 =A2.joinx@u(c_id,A1:cid,area,discount)
4 =A3.groups(area;amount*discount)

For the joinx() function without @q option, SPL will consider that the fact table is large and needs to be partitioned. Similar to the algorithm in the previous section, the dimension table needs to be accessed in segments, and should appear as a data file rather than a cursor.

The order of data fetched from the cursor returned by joinx@u() looks unordered. The reason is that we know from the above algorithm that the data is fetched one by one by the segment of dimension table, joined and returned, and the overall order should be the same as the primary key of dimension table but, the data in each segment are fetched by the order of the partitions of fact table, yet the data in each segment is not necessarily ordered by the primary key of dimension table.

If we want to return the data by the original order of fact table, just remove the @u option.

At this time, SPL will generate a sequence number recorded in the original fact table, and write the data after they are sorted by this sequence number during partitioning and buffering, and then write it once again to the buffer file after the joining at each segment is done, and finally merge and sort all the buffered data generated in the latter round by the original sequence number. In this way, it is equivalent to doing one more big sorting than joinx@u(), and the performance will be worse. However, the fact table is originally ordered sometimes, and this order needs to be used for the next round of calculation, so it must continue to maintain this order.


Performance Optimization - 7.1 [Merge and join] Ordered merge
Performance Optimization - Preface