Performance Optimization - 6.4 [Foreign key association] Time key
Performance Optimization - 6.3 [Foreign key association] Foreign key sequence-numberization
The data in the dimension table is relatively stable, but sometimes it may change. If the change is not handled, statistical errors may occur. For example, if we want to group and summarize the sales amount of orders over a period of time by the origin of the products, but the origin of the products may have changed during this period, then we need to determine which period’s origin of the products to use based on the order time when associating the order table (orders) with the product table (product) to obtain the origin of the products.
To do so, it needs to add a time field in the dimension table to store the time when the dimension table record changes, and this time field is called the time key. After that, we can compare with the time in the fact table to find the appropriate dimension table record. However, this is not a conventional equivalence association. What needs to be found is the record corresponding to the earliest one with the same association key and a time field value no earlier than the specified time.
Here is the logic written in SPL syntax:
D.select( id==id0 && t>=t0 ).minp( t )
where, D is the dimension table, id is the original primary key field, t is the added time field, and id0 and t0 are the primary key and time value for searching.
In fact, because the changes in dimension table record will cause multiple records in the dimension table to have the same primary key, the original primary key cannot be used as a primary key since it is technically no longer unique, so creating index to implement fast search won’t work. If traversal search is done according to the above definition, the performance will be very low.
An easy way to think of is to group the dimension table by the original primary key, with each group member being a small subset, and the whole grouped results can still be regarded as a set with the original primary key as the primary key (it is no longer a table sequence). In theory, we can still speed up the search by creating an index, and then find the appropriate record in the corresponding grouped subset according to the time parameter, thereby avoid traversal to achieve better performance.
SPL expands the primary key of the table sequence to implement this process:
A | |
---|---|
1 | =file(“product.btx”).import@b() |
2 | >A1.keys@it(pid,dt) |
3 | =A1.find(1,date(“2020-5-1”)) |
A2 uses the @t option when creating the index, and SPL will use the last field dt as time key, after which you can use the find() function for searching.
Time key can also be used directly in association.
A | |
---|---|
1 | =file(“product.btx”).import@b() |
2 | =file(“orders.btx”).import@b() |
3 | >A1.keys@it(pid,dt) |
4 | >A2.switch(p_id:dt,A1) |
5 | =A2.groups(p_id.area;sum(p_id.price*quantity)) |
The switch() function in A4 will use the value of the dt field in A2 as the time parameter to search for dimension table records.
Performance Optimization - 6.5 [Foreign key association] Inner join syntax
Performance Optimization - Preface
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL