Performance Optimization - 6.2 [Foreign key association] Instant addressization
Performance Optimization - 6.1 [Foreign key association] Foreign key addressization
Address is an in-memory concept. Foreign key addressization can only be implemented in all-in-memory operation. For big data, however, it often needs to calculate from external storage.
Let’s first see the situation where the fact table is large and the dimension table is small, which is very common in practice. The fact table is used to store growing events and easily becomes very large, while the dimension table is used to store code related information, and changes little in size.
If only the dimension table is in memory, the foreign key of fact table cannot be addressized in advance, which means pre-association cannot be achieved. In this case, the association can only be made instantly.
A | |
---|---|
1 | =file(“product.btx”).import@b().keys@i(id) |
2 | =file(“orders.btx”).cursor@b(p_id,quantity) |
3 | =A2.switch(p_id,A1) |
4 | =A3.groups(p_id.vendor;sum(p_id.price*quantity)) |
We can use the switch()function on the cursor to implement foreign key addressization. It will return a delayed cursor, and the substantial association calculation will occur only when fetching the data. Unlike the all-in-memory operation that can be performed based on the original data table, the operation in A4 can only be done based on the result returned by A3 because the switch() in all-in-memory operation will change the foreign key field of original data table, and the current association is done during the cursor fetching.
This method is called instant addressization, which can only reuse the index created on the dimension table, and the action to search for the record of dimension table has to be done every time the operation is performed. As a result, the performance will be much worse than all-in-memory operation.
Theoretically, the instant addressization algorithm is still more advantageous than the hash join algorithm. To be specific, for large table that cannot be stored in memory, the hash join algorithm will split the records in the table into several parts that can be stored in memory according to the hash values (this process is commonly known as partitioning), and then perform an in-memory hash join for the parts corresponding to the hash values. Since partitioning will cause additional write and read actions on external storage, the performance will become worse. By contrast, instant addressization will not generate buffer data in external storage, so there are no unnecessary write and read actions.
However, when the database encounters a join operation between one large table and one small table, it will not strictly follow the partitioning process of hash join algorithm. Instead, it will first read the small table into memory, and then read the large table data in batches to do in-memory join. As a result, the actual complexity and performance are not much different from the above code.
Theoretically, relational algebra system cannot distinguish between dimension table and fact table, but the size of the tables. If there are two tables, it is easy for the optimizer of database to plan a correct execution path according to the size of tables. However, if there are multiple tables and the association relationship is complex, the optimizer may be confused and cannot plan an appropriate execution path, and instead, it will execute the original partitioning method of hash join algorithm. Therefore, we will observe that even if there is only one large table, a sharp performance decrease will occur when there are many associated tables.
After conceptually distinguishing between dimension table and fact table, it is clear that we should first read the dimension tables into memory, create index and establish pre-association between dimension tables, and then traverse the fact table to perform calculation. Compared to all-in-memory operation, this algorithm has no difference in overall structure, except that the fact table needs to be read with a cursor and the association with dimension tables can only be established instantly, and buffer files will not be generated. Moreover, the index on the dimension tables and the pre-association between dimension tables can still be reused.
A | |
---|---|
1 | =file(“area.btx”).import@b().keys(aid) |
2 | =file(“product.btx”).import@b() |
3 | =A2.join(a_id,A1,~:area).keys@i(pid) |
4 | =file(“orders.btx”).cursor@b(p_id,a_id,quantity) |
5 | =A4.join(p_id,A3,~:product;a_id,A1,~:area) |
6 | =A5.select(product.area.state==area.state) |
7 | =A6.groups(product.vendor;sum(product.price*quantity)) |
The pre-processing actions in A1, A2, and A3 can be executed at system startup. When associating dimension tables in A5, the indexes in A1 and A3 can be reused.
Both join()and switch() functions can perform parallel computing based on multi-cursor. In this example, A4 can be written as a multi-cursor with the cursor@m() option, and there is no need to change subsequent code. The in-memory join operation in the previous section can also be converted to the in-memory multi-cursor to perform parallel computing.
Performance Optimization - 6.3 [Foreign key association] Foreign key sequence-numberization
Performance Optimization - Preface
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL