Performance Optimization - 4.8 [Traversal technology] Redundant grouping key

 

Performance Optimization - 4.7 [Traversal technology] Understandings about aggregation

Sometimes, for the convenience of data processing, there will be some redundant fields in a data table. For example, there may be a customer number field as well as a customer name field in an order table, yet the latter can be determined by the former. In this case, the customer name field is redundant. This kind of data structure is not uncommon in wide table on the multi-dimension analysis server.

If we want to group and aggregate by customers, the customer number is usually used as grouping key, because there may be duplicate customer names. If we want to add customer name in the result set, we need to include customer name as part of grouping keys. Indeed, this writing way is often used in SQL.

However, this writing way will increase the computation load. In the process of grouping, the calculation and comparison of grouping key values account for a significant ratio. Adding one more grouping key will make the calculation and comparison of hash values much more complex, resulting in a performance decrease.

SPL allows such redundant grouping keys to be written in the aggregation parameter as aggregation value, which avoids these unnecessary calculations.

A
1 =file(“orders.btx”).cursor@b()
2 =A1.groups(cust_id;cust_name,sum(amount))

If the aggregation parameter does not have an aggregate function, it will be considered as a redundant grouping key, which will take any one from the current grouped subset as aggregation value. Here, cust_name will not participate in the calculation of grouping keys, but it can still be calculated correctly in the result set.

The main purpose of designing redundant fields in data warehouses is to avoid the difficult-to-optimize join operation, so space is sacrificed for time. In the following chapters, we will discuss how to efficiently implement this type of join in SPL without having to generate a wide table with redundant fields (fewer fields in the traversed table will increase computing performance). The handling of redundant grouping keys is rarely involved in SPL in practice, and is generally only used when migrating and replicating traditional data warehouse operations.


Performance Optimization - 4.9 [Traversal technology] Column-wise computing
Performance Optimization - Preface