Performance Optimization - 6.8 [Foreign key association] Big dimension table search
Performance Optimization - 6.7 [Foreign key association] Aligned sequence
When traversing a fact table, using the foreign key to search for the records of dimension table will only take one record at a time. However, since the fact table is usually not ordered by the foreign key field (the fact table may have multiple foreign keys. Being ordered by one foreign key will not be ordered by another. In most cases, the fact table is not ordered by any foreign key), the found dimension table records are random. If the number of records in the fact table is slightly large (not to the extent that the memory cannot hold), the reading of the dimension table is the typical frequent, random and small data access. This access method means extremely low performance for the hard disk.
Fortunately, dimension tables are relatively stable and usually not very large and can be loaded into memory. This is the case discussed in previous sections.
However, there is still a possibility that the dimension table is too large to be loaded into memory. When a dimension table is stored in external storage, the above algorithms can no longer be used, otherwise it would lead to frequent, random and small data access.
Let’s first see the case where the fact table is relatively small and can be loaded into memory.
Using the foreign key to associate a dimension table is essentially a search action. When the dimension table is large, it is the search in external storage. Reviewing the algorithms discussed earlier, we know that to obtain high performance, the data table should be stored orderly by the to-be-searched key (i.e., the primary key of dimension table), or create an index.
A fact table has multiple records, so it has multiple foreign key values, and the problem becomes the batch searches in external storage. In this case, the foreign key values of fact table should be gathered for searching to avoid frequent access caused by separate search.
A | |
---|---|
1 | =file(“customer.btx”) |
2 | =file(“orders.btx”).import@b(c_id,amount) |
3 | =A1.iselect(A2.id(c_id),cid;area,discount).fetch() |
4 | =A2.switch(p_id,A3:pid).groups(p_id.area;sum(amount*discount)) |
The dimension table stored in the bin file is ordered by the primary key. Since the dimension table would be searched randomly, it cannot appear as a cursor. After the foreign key values of the fact table are concatenated into a sequence, using the iselect()function (binary search) can relatively quickly fetch the required dimension table records, hereby avoiding the traversal of whole big dimension table. The foreign key of fact table may have duplicate values, so we need to use the id() to do DISTINCT operation and sort at the same time, and then search for them in the dimension table file, and finally associate them with the fact table.
SPL encapsulates this process into a joinx@q() function:
A | |
---|---|
1 | =file(“customer.btx”) |
2 | =file(“orders.btx”).import@b(c_id,amount) |
3 | =A2.joinx@q(c_id,A1:cid,area,discount) |
4 | =A3.groups(area;amount*discount) |
Unlike the addressization method used earlier, here we only take part of records and fields from the dimension table. Since such records and fields are only used temporarily in most cases, it doesn’t make much sense to addressize after constructing a table sequence, so the joinx@q() function directly joins the dimension table fields to the fact table records.
If the dimension table is stored in the composite table, it can also be searched using index, which can achieve better performance.
A | |
---|---|
1 | =file(“customer.ctx”).open() |
2 | =file(“orders.btx”).import@b(c_id,amount) |
3 | =A2.joinx@q(c_id,A1:cid,area,discount) |
4 | =A3.groups(area;amount*discount) |
The composite table is usually sorted by primary key, so even if there is no index, the records can be found quickly using binary search.
Similar to in-memory dimension table, this method can also resolve the association of multiple big dimension tables at the same time. After resolving the dimension tables in external storage, we can continue to resolve the in-memory dimension table on the obtained table sequence. Understanding the relationship between dimension table and fact table is the key foundation for solving foreign key relationship and achieving high performance.
The relational algebra theory does not distinguish between dimension table and fact table. For the join between two tables, many databases still read the small table into memory first and then traverse the large table. For the current situation, the database will first read the fact table into memory and then traverse the dimension table. For the two situations, i.e., large fact table and small dimension table, or big dimension table and small fact table, the database’s processing strategy is the same.
The dimension table search method we are discussing here also needs to read the fact table, but there is no need to traverse the big dimension table. The processing method for large fact table and small dimension table is not symmetrical to that for large dimension table and small fact table. Every record of the fact table needs to participate in calculation, and traversal of the fact table is unavoidable. However, dimension table is different. Not every dimension table record will participate in the calculation (the maximum number of records used for calculation does not exceed the number of records in the fact table). Moreover, it does not take much time to read small dimension table, and it will not have much impact even if there is waste due to reading the whole dimension table (these dimension tables can also be reused as discussed earlier, hence there is no waste). However, it is not worth traversing the big dimension table. When the fact table is relatively small, only a small part of records in dimension table will be used, so it is not necessary to traverse the whole big dimension table.
Databases, or relational algebra systems, have a relatively simple understanding of data association and do not deeply reflect the more essential characteristics of data association. Therefore, it is impossible to design higher-performance algorithms in theory, and we can only place hope on the engineering optimization of the database. When it comes to this issue, some well-designed database optimizers can find that one of the association keys is the primary key of the big table, and will “intelligently” choose a search technology instead of traversing. However, they will still get “confused” when involving many tables.
Performance Optimization - 6.9 [Foreign key association] One side partitioning
Performance Optimization - Preface
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL