Performance Optimization - 9.4 [Cluster] Segmented dimension table

 

Performance Optimization - 9.3 [Cluster] Duplicate dimension table

Dimension tables need to be accessed randomly, but external storage does not have this capability, so we should try to load the dimension table into memory. If the dimension table is so large that the memory of one node cannot hold, we should try to load it into the memory of multiple nodes.

For larger dimension table, multi-zone composite table is also used to store it in external storage. When it is loaded on the nodes, it will be loaded in zones, with each node holding one zone. The zones in the memory of multiple nodes collectively constitute a complete dimension table, called a segmented dimension table.

A
1 [“192.168.0.101:8281”,“192.168.0.102:8281”,…, “192.168.0.104:8281”]
2 =file(“product.ctx”:to(4),A1).open().memory()
3 =file(“orders.ctx”:to(4),A1).open().cursor(p_id,quantity)
4 =A3.join(p_id,A2,vendor,price)
5 =A4.groups(vendor;sum(price*quantity))

After creating the cluster table in A2, it will be loaded as a segmented dimension table, which can be used to be associated with the cluster table in A4.

Unlike duplicated dimension table, the data in segmented dimension table is stored on multi nodes, and no any node has a full-data dimension table. In this case, the addressization mechanism that converts the foreign key of fact table to the record of dimension table fails, because the dimension table record associated with a certain fact table record on a certain node may be on another node, and a cross-node address does not work. Therefore, here we can only use the join() function to take out the fields of dimension table record to be referenced and then perform subsequent calculations.

In order to quickly locate the node where the associated dimension table record is located, the zone expression of multi-zone composite table will also be used. When associating the dimension table, the master computer will use the foreign key of the fact table to calculate the zone expression to obtain the node where the dimension table record associated with this foreign key is located. The zone expression of the multi-zone composite table that serves as a dimension table must be calculated based on the primary key (the primary key of dimension table corresponds to the foreign key of fact table).

Network transmission and hard disk reading are similar in that both require complex preparation actions and both are not suitable for frequent small-data-amount access. When executing the join() function for a cursor, SPL will fetch a large number of foreign keys at a time and transfer them to appropriate nodes for query, instead of processing only one foreign key at a time. SPL does not provide the method for obtaining the record of segmented dimension table for a single foreign key.

It should be emphasized again that unlike the homo-dimension association and primary-sub association, the segmented dimension table and its associated fact table may use the same zone number, but it just represents a split method, and does not mean that the data with the same zone number will be associated.

When using a segmented dimension tables for association, network transmission will occur during the calculation process. However, the content transmitted is not very large, involving only the foreign keys of the fact table and the fields of the associated records in the dimension table, and there is no need to transmit other fields of fact table, and hence the calculation can be completed directly, and no buffer data is generated during the process. In contrast, when using a distributed database for association operation, it needs to transmit all fields of two tables involved in the associated result set, and it needs to buffer these transmitted data in order to perform subsequent single-machine join operation on each node, and it may encounter “unlucky hash function” that cannot be avoided in hash algorithm, resulting in a very unbalanced amount of computation on different nodes, seriously affecting the overall computing efficiency. In general, the segmented dimension table is more advantageous than the hash join algorithm of database.

For larger dimension table that cannot be held by multiple nodes, the external storage scheme has to be used. Specifically, distribute the fact table to a cluster, and still duplicate the dimension table to every node, and execute the one-side partitioning algorithm for each fact table zone on the node. In this way, it returns to the situation where there is no dependency between nodes, which can be solved using the previous framework.


Performance Optimization - 9.5 [Cluster] Redundancy-pattern fault tolerance
Performance Optimization - Preface