Performance Optimization - 6.1 [Foreign key association] Foreign key addressization
Performance Optimization - 5.8 [Ordered traversal] Index sorting
Let’s briefly review the concept of foreign key association: when the field F in the data table T is associated with the primary key K of the data table D, D is called the foreign key table of T, also known as the dimension table, T is called the fact table, and field F is called the foreign key field. If the primary key of D has multiple fields, the foreign key association can also be defined in a similar way. The goal of foreign key association is to reference the field of the record in D corresponding to the field F when accessing the records of T.
The characteristic of foreign key association is that the foreign key of fact table must be associated with the primary key of dimension table. Since the primary key is unique, the records of fact table will only be associated with the unique record of dimension table, which is a typical many-to-one association. This characteristic should be taken advantage of when optimizing the performance.
If the amount of data is not very large, and both the fact table and dimension table can be loaded into memory, we can establish association in advance. After that, we can refer to the field of dimension table quickly.
Assume that the foreign key p_id of the order table orders is associated with the primary key pid of the product table product (In actual table structure design, foreign key field is often given the same name as its corresponding primary key field. Here they are intentionally given slightly different names for distinction), now we want to take the product unit price from the product and multiply it by the sales quantity in the orders to obtain the order amount, and then calculate the total order amount of different manufacturers (also a field in the product).
A | |
---|---|
1 | =file(“product.btx”).import@b() |
2 | =file(“orders.btx”).import@b() |
3 | >A1.keys@i(pid) |
4 | >A2.switch(p_id,A1) |
5 | =A2.groups(p_id.vendor;sum(p_id.price*quantity)) |
A1 reads the product; A2 reads the orders; A3 sets the primary key and creates an index for the product; A4 converts the foreign key field p_id of A2 to the corresponding record of A1, which means that once A4 is executed, the field p_id of A2 will become a record of A1, it is equivalent to executing:
A2.run(p_id=A1.find(p_id))
Creating an index in A3 can make the switch() run faster. Since the fact table is usually much larger than the dimension table, the index can be reused many times.
When A5 traverses the fact table, since the value of the p_id field is already a record, we can directly use the dot operator to reference its field, and both p_id.vendor and p_id.price can be executed normally.
After the switch() operation is executed, what is stored in the p_id field of fact table A2 in memory is already the address of a record of dimension table A1. This action is called foreign key addressization. As a result, when using the dot operator to reference the field of dimension table, we can take out the record directly without the need to use the foreign key value to search in A2, which is equivalent to being able to take out the fields of dimension table within a constant time. On the contrary, even if the fast hash index is used for searching, it still takes some time, and is related to the size of the dimension table.
This is equivalent to reusing the result of the third step switch(). The first three steps, that is, reading tables, creating an index and using switch() to establish association, can all be done in one go in advance, which allows us to quickly reference the field of dimension table in subsequent operations on the fact table.
The reason for being able to do so exactly takes advantage of the uniqueness of foreign key association on the dimension table mentioned earlier, that is, one foreign key field value will correspond uniquely to one dimension table record, so each p_id can be converted to the record in A2 that it uniquely corresponds to.
In fact, the code is simpler to write and easier to understand after foreign key addressization.
This operation is actually the join in the database. Since the relational algebra does not stipulate that any participant of association must be a primary key, the uniqueness of any participant of association cannot be guaranteed, so the foreign key field cannot be converted to a dimension table record in advance. Databases usually uses the hash algorithm to do in-memory join, that is, calculate the hash value of the associated fields of both tables, and then put the records with the same hash value into same group to traverse. The complexity of the join operation is equivalent to creating an index on the association key of one table (there is no clear definition on dimension table and fact table when no primary key is assumed), and then traversing the other table and using the association key of its each record to search in the first table, and the amount of computation is same as that done by A3 and A4 in this example.
The preparation work for foreign key addressization will do the same thing (A3 and A4). Once the preparation is done, the result can be reused without the need to create an index and search based on the index each time the join operation is performed, and the performance of subsequent calculations will be much better. If only one operation is needed, then it will not be faster than the hash algorithm of database after the time spent on preparation work.
We can read the data table and then perform the foreign key addressization at system startup. This action is called pre-association. After that, we can quickly handle association query and calculate. This method is very meaningful for building a high-performance in-memory database, yet this method cannot be implemented in an in-memory database based on relational algebra theory system.
If the fact table has multiple foreign keys to associate with different dimension tables, and a dimension table may also have dimension tables, all the pre-association can be done in advance:
A | |
---|---|
1 | =file(“area.btx”).import@b().keys(aid) |
2 | =file(“product.btx”).import@b().keys@i(pid) |
3 | =file(“orders.btx”).import@b() |
4 | >A2.switch(a_id,A1) |
5 | >A3.switch(p_id,A2;a_id,A1) |
6 | =A3.select(p_id.a_id.state==a_id.state) |
7 | =A6.groups(p_id.vendor;sum(p_id.price*quantity)) |
A1, A2 and A3 read the data, A4 and A5 perform the pre-association, and A6 and A7 perform calculation.
This code will select the orders with the same transaction place and production place, and then calculate the total order amount by manufacturer. It involves associating three tables, among which the order table has two foreign keys and dimension table, and the region table, as a dimension table, is associated with other tables twice.
In principle, the hash join algorithm can only parse one association relationship at a time. If there are multiple association relationships, it needs to parse multiple times and the related tables need to be traversed multiple times. After distinguishing the fact table from dimension table, we can perform foreign key addressization for multiple foreign keys of the same fact table in one traversal. From the perspective of theoretical analysis, although the computation amount and comparison times of hash algorithm that needs to traverse multiple times and separate parsing are not greater than one traversal that needs to perform multiple foreign key addressizations, each traversal of hash algorithm will be accompanied by the generation and copy of a batch of data records, so the actual computation amount of hash algorithm is much greater than that of one traversal.
Moreover, when the number of association layers and the number of the involved data tables are large, the database optimizer often fails to find the most appropriate parsing order. Different orders will result in different data generation and copy operation amount. If each association involves large table, such additional calculation amount will be significant. We observe that when the amount of data does not increase much but only the number of associated tables and layers increases, the computing performance of the database may drop sharply.
Once the fact table and dimension table are clearly distinguished, the multi-layer association of multiple tables will still be very clear. For the association between small tables, addressization can be performed separately without involving other large tables, and the degree of performance decrease is only linearly related to the amount of data and the number of associated layers.
In practice, the association relationship is often much more complex than that in the example here, so the advantage of foreign key addressization will be more obvious.
Although using switch() to implement foreign key addressization can improve performance, it still has a few problems:
1) Addressization will lose the value of the foreign key field itself, which must be obtained by taking the primary key from dimension table, and the speed will be slowed down. In the example shown above, if we want to get pid after addressization, we need to use p_id.pid.
2) If no associated record is found in dimension table, the foreign key field will be converted to null, and its original value will be completely lost.
3)A foreign key may consist of multiple fields, in which case the switch() function cannot be used for conversion.
To solve these problems, SPL provides the join() function:
A | |
---|---|
1 | =file(“area.btx”).import@b().keys(aid) |
2 | =file(“product.btx”).import@b() |
3 | =A2.join(a_id,A1,~:area).keys@i(pid) |
4 | =file(“orders.btx”).import@b() |
5 | =A4.join(p_id,A3,~:product;a_id,A1,~:area) |
6 | =A5.select(product.area.state==area.state) |
7 | =A6.groups(product.vendor;sum(product.price*quantity)) |
The join()will add a field to the original table sequence to store the address of dimension table records without changing the original field values. In this way, all the above problems can be solved. The difference is that join() will return a new table sequence, thus attention should be given to the operation steps when performing the addressization. For a data table that serves as both a fact table and a dimension table, it should first associate with its own dimension table to obtain a new data table, and then be associated with the fact table.
Performance Optimization - 6.2 [Foreign key association] Instant addressization
Performance Optimization - Preface
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL