Performance Optimization - 9.4 [Cluster] Segmented dimension table

 

Performance Optimization - 9.3 [Cluster] Duplicate dimension table

The dimension table needs to be accessed randomly, but the external storage does not have such ability, therefore, we should load the dimension table into memory whenever possible. If the dimension table is so large that one node cannot hold, we should try to load it into multiple nodes’ memory.

For the relatively large dimension table, storing it in the external storage will also use the multi-zone composite table, and when it is loaded on the node, multiple zones will be used as well. Each node holds a zone, and the zones in the memory of multiple nodes collectively constitute a complete dimension table, called segmented dimension table.

A
1 [“192.168.0.101:8281”,“192.168.0.102:8281”,…, “192.168.0.104:8281”]
2 =file(“product.ctx”:to(4),A1).open().memory()
3 =file(“orders.ctx”:to(4),A1).open().cursor(p_id,quantity)
4 =A3.join(p_id,A2,vendor,price)
5 =A4.groups(vendor;sum(price*quantity))

In A2, after the cluster table is created, it will be loaded as a segmented dimension table that can be used to be associated with the cluster table in A4.

Unlike the duplicate dimension table, the data of the segmented dimension table is stored on all nodes, and hence no any node has a dimension table with all data. In this case, the addressization mechanism which converts the foreign key of fact table to the dimension table record fails, because the dimension table record associated with a certain fact table record on a certain node may be on another node, and a cross-node address does not work. Therefore, we can only use the join() function to take out the fields of dimension table record to be referenced and then perform subsequent calculations.

In order to quickly locate the node where the associated dimension table record is on, the zone expression of multi-zone composite table will also be used. When associating the dimension table, the master computer will calculate the zone expression with the foreign key of fact table so as to obtain the node where the dimension table record associated with this foreign key will locate. The multi-zone composite table, as a dimension table, its zone expression must be calculated based on the primary key (the primary key of dimension table corresponds to foreign key of fact table).

Network transmission and hard disk reading have some similarities, both have a more complex preparation action, and both are not suitable for frequent and small-amount-data access. When executing the join() function for cursors, SPL will fetch a large number of foreign keys and transfer them to appropriate nodes for query at a time, instead of processing only one foreign key each time. SPL does not provide the method for obtaining the record of segmented dimension table for a single foreign key.

We emphasize again that unlike the homo-dimension tables association and the primary-sub table association, although the segmented dimension table and associated fact table may use the same zone number, it just represents a split method, and does not mean that data with the same zone number will be associated.

While using the segmented dimension table to associate, network transmission will occur during the operation. The transmission amount, however, is not very large, only involving the fields of the associated record between the fact table foreign key and the dimension table, and there is no need to transmit other fields of fact table. As a result, the calculation can be achieved directly, and no buffer data is generated in this process. On the contrary, when the distributed database performs the join operation, it has the following disadvantages: it needs to transmit all fields of two tables involved in the associated result set; it also needs to buffer these transmitted data, only in this way can subsequent single-machine join operations be performed on each node; there may be “unlucky hash function” that cannot be avoided in hash algorithm, resulting in a very unbalanced amount of computation on different nodes, which seriously affects the overall computing efficiency. On the whole, the segmented dimension table has more advantages than the hash join algorithm of database.

For larger dimension table that cannot be held by multiple nodes, the external storage scheme should be used. That is, distribute the fact table in a cluster, and still duplicate the dimension table on every node, and execute the one-side partitioning algorithm to the zone of each fact table on the node. In this way, it returns to the situation where there is no dependency between nodes, and hence it can be solved by using the previous framework.


Performance Optimization - 9.5 [Cluster] Redundancy-pattern fault tolerance
Performance Optimization - Preface