Performance Optimization - 7.4 [Merge and join] Attached table
Performance Optimization - 7.3 [Merge and join] Association location
Foreign key association may be established freely between a fact table and multiple dimension tables, even multiple foreign key associations can be established between the same pair of fact table and dimension table. However, the homo-dimension association and primary-sub association established based on the primary keys of two tables can’t be established freely.
Homo-dimension association is an equivalent relationship (If A is homo-dimension with B, then B is homo-dimension with A; If A is homo-dimension with B, and B is homo-dimension with C, then A is homo-dimension with C), and hence we can take advantage of the homo-dimension relationship to divide all the data tables into several groups. As a result, the data tables in the same group associate with each other in a homo-dimension relationship, and won’t associate with tables outside the group. In other words, if we want to perform a homo-dimension association, it must occur between the tables within a group. Among the homo-dimension tables in the same group, there is usually a largest table whose primary key values are complete, and the primary key values of other tables belong to the subset of the complete values. For example, the customer table has the primary key values of all customers, while the VIP customer table has only the primary key values of a portion of customers but has more attributes for VIP customers.
The primary-sub association is slightly more complicated, but there is certain regularity. In a reasonable data structure design, the sub table will only establish primary-sub association with the only primary table. For example, the primary table of order detail table is the order table, andtherewill not be another table as its primary table. Therefore, from the perspective of the sub table, the primary table is unique.
Based on this understanding, we can store the homo-dimension tables and primary-sub tables of the same group together.
For the homo-dimension tables in the same group, first find the table with complete primary key values, which is called the base table, and the remaining homo-dimension tables are called the attached table. Once the base table is determined, the fields of the attached tables will be stored as the additional fields of the base table records, or these fields can be considered as fields of base table, except they may not have values for many records.
For the tables in primary-sub relationship, the primary table serves as the base table, and the sub table serves as the attached table. The fields of the sub table can also be considered as the additional field of primary table records. The difference is that the value of these additional fields is a set, and the set of values of the additional fields from the same sub table have the same length (as the additional field of the same primary table record). Similarly, these additional fields may also have no value.
Due to the relative stability of the homo-dimension association and primary-sub association, binding storage will not affect their association relationship, nor will it affect the foreign key association with other tables.
Binding storage will bring the following benefits:
1)The base table and the attached table have common primary keys. When the fields of attached table are stored as the additional fields of base table records, only one set of primary keys need to be stored, and there is no need to store primary keys (associated with the base table) of attached table again. As a result, the storage amount will be smaller. When the columnar storage is adopted, the amount of data to be read during association will also become less.
2)The fields of attached table, as the additional field of base table records, can be directly referenced (the fields from the sub table is a set, using different reference method), and there is no need to associate and compare, thereby reducing the amount of calculation. In particular, if the base table is filtered, the attached table will be filtered automatically, therefore, there is no need to use the association location method introduced in the previous section, and the reverse is also true.
3) As the field of base table records, the attached table field are bound together with the base table records, and they are naturally synchronizedduring segmentation, so there is no need to do any special follow-up segmentation.
However, this storage method has disadvantages. Because the storage scheme becomes more complex, a lot of extra judgments are needed when referencing additional fields.
Usually, when the primary key or the association is relatively complex, for example, there are multiple primary key fields, or the N in the 1:N of primary-sub association is larger (it means that there will be more comparisons in conventional association), the use of the attached table scheme will be more advantageous. For the homo-dimension association with a single simple primary key, the advantages of the attached table scheme is not obvious and may even have disadvantages.
In theory, the attached table in a primary-sub association relationship can have its own attached table, but it is not common.
Let’s take the primary-sub table as an example.
SPL implements the attached table function on the composite table, and it needs to specify the additional field when creating the composite table.
A | |
---|---|
1 | =file(“orders.ctx”).open().cursor() |
2 | =file(“details.ctx”).open().cursor() |
3 | =file(“order_detail.ctx”).create(#ID,…) |
4 | =A3.attach(detail,#seq,…) |
5 | =A3.append@i(A1) |
6 | =A4.append@i(A2) |
A3 creates a conventional composite table, and an attached table is added to A3 in A4 where there will be a name and fields of the attached table. For the sub table, it needs to design its own primary key (if it has a common primary key with the primary table, there is no need to specify), and then append data like a normal data table. SPL will attach the records to correct primary table records based on the primary key of sub table. It should be noted that except for the primary key, the base table and attached table cannot have fields with the same name, otherwise confusion will occur.
The reason why it is called the composite table is because it is a combination of a base table and an attached table. We call the base table and the attached table in composite table the real table.
Once a composite table with attached table is created, the field of the attached table can be referenced in calculations.
A | |
---|---|
1 | =file(“order_detail.ctx”).open() |
2 | =A1.cursor(dt,detail.sum(quantity):quantity) |
3 | =A2.groups(dt;sum(quantity)) |
The quantity is the field of the attached table detail. Because it is the sub table, the data is a set. When referencing from the primary table, aggregation operation is required.
The records of the sub table can also be generated:
A | |
---|---|
1 | =file(“order_detail.ctx”).open() |
2 | =A1.cursor(dt,detail{price,quantity}:amount) |
3 | =A2.run(amount=amount.sum(price*quantity)) |
4 | =A3.groups(dt;sum(amount)) |
Since there are multiple records in the sub table, which will be used as a field of the primary table cursor after generating, their values will be a table sequence. Due to the complexity of actions such as generating a table sequence table, the action will result in performance loss, which may offset the advantages of reducing associations.
The field of base table can also be referenced from the attached table:
A | |
---|---|
1 | =file(“order_detail.ctx”).open().attch(detail) |
2 | =A1.cursor(dt,price,quantity) |
3 | =A2.groups(dt;sum(price*quantity)) |
To reference the base table field, simply write it directly, which will achieve better performance compared to the method of generating table sequence field as mentioned above.
All these operations can support multi-cursor.
It should be noted that when we talk about these association algorithms, we often say that they may not always achieve better performance. Association is a complex operation, and the implementation code of its optimization algorithm is also very complex. Although most of these algorithms can reduce computational complexity from the theoretical analysis point of view, the impact on engineering practice cannot be ignored when the actual code is very complex.
All these algorithms have been verified through actual tests and can indeed optimize performance in some scenarios, but not in all scenarios. As for which algorithm to use, you should become familiar with these algorithms first, and then select one according to actual conditions.
Performance Optimization - 8.1 [Multi-dimensional analysis] Partial pre-aggregation
Performance Optimization - Preface
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL