Performance Optimization - 3.8 [Search in external storage] Merging multi indexes
Performance Optimization - 3.7 [Search in external storage] Search that returns a set
In the previous sections, we explained that multiple indexes cannot be used at the same time. After establishing indexes for two fields respectively, only one index can be used in the joint condition query for the two fields, and the conditions of the other field still need to be traversed.
In fact, this statement is not always true. Some optimization of index query algorithm can still make multiple indexes work together to a certain extent.
For the convenience of narration, we record the record set satisfying x as S(x). Assuming that there are indexes for fields A and B respectively in the table, we need to find S(A==a && B==b). Using the index of A can quickly locate S(A==a), but if we want to find the records belonging to S(B==b) from them, we can’t use the index of B any more. And vice versa, we can use the index of B to quickly find S(B==b), but we can’t continue to use the index of A for them.
Generally, the physical positions of records stored in the index are also ordered, that is, in the index of A, the physical positions of records stored in S(A==a) are sorted from small to large; B’s index is similar. This is easy to meet when building an index, or even naturally, because the original order of members with the same value is generally not changed during sorting.
Using this, we can use the indexes of A and B at the same time when searching S(A==a && B==b).
The process is not complicated. Use the index of A to obtain the cursor based on S(A==a) and use the index of B to obtain the cursor based on S(B==b). Both cursors are orderly for the physical position of records. Then, we can use the merge algorithm to calculate their intersection to get S(A==a && B==b). This is equivalent to traversing both S(A==a) and S(B==b) once, and both indexes can be used at the same time.
However, this algorithm does not necessarily have more advantages in complexity than traversing and calculating whether B==b is true in S(A==a), because the latter only needs to traverse one set of S(A==a), while the previous algorithm needs to traverse two sets of S(A==a) and S(B==b).
The advantage of this method is in practice. To get the intersection using merging, there is no need to read out the records of the original data table to calculate whether B==b is true, just compare the index. The index blocks may already be stored in memory, or even in external storage, they are stored continuously, only have two columns (TBS key and physical position) and in row-based storage format. The reading performance is much better than the original dataset with multiple fields, discontinuous storage and columnar storage. In this way, the performance of merging and comparison will be better than that of reading records and then doing judgment.
If we choose the smaller one of S(A==a) and S(B==b) as the criterion to merge the larger one, we can stop merging as long as the smaller one has been traversed, and the members in the larger one that have not been traversed can be abandoned. In this way, the total number of traversing will not be very large. When building an index, we can usually know the number of records of S(A==a).
Of course, this method is also applicable to the case where there are conditions on more fields.
This mechanism has been implemented in SPL. For multi field conditions written on the cursor, such as A==a && B==b &&, SPL will automatically find the available index in the given index file list and merge without the intervention of programmer.
Performance Optimization - 3.9 [Search in external storage] Full-text searching
Performance Optimization - Preface
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL