Performance Optimization - 8.4 [Multi-dimensional analysis] Boolean dimension sequence
Performance Optimization - 8.3 [Multi-dimensional analysis] Redundant sorting
We used the aligned sequences earlier to improve the association after dimension table filtering. This technology can also be used to improve the slicing performance of enumeration dimension.
The so-called enumeration dimension refers to the dimension whose values are a few predetermined and limited values, such as gender, place of origin. In multidimensional analysis, almost all dimensions except the time dimension are enumeration dimension.
The slicing condition (or more precisely, dicing) of this type of dimension is generally to give a set of dimension values, and then filter out all records whose dimension values are within this set. Written in code is equivalent to:
T.select( V.contain( D ) )
where, T is the data table, D is the slice dimension, that is, a field, and V is the list of dimension values used as condition.
The calculation of contain() will not be very fast. When V is not large, sequential search will be used; when V is large, binary search will be used. However, it still requires several comparisons to determine whether the data is in the slice. If the data type is complex, the performance will be even worse.
To avoid this, we first convert the enumeration dimension to integers (generally, the number of dimension values is not too large, and most of them can be converted to small integers less than 65,535), and then convert the slice condition to an aligned sequence composed of boolean values during query. In this way, the slice judgment result can be directly taken from the specified position of the sequence when comparing, thus avoiding the complex contain() operation.
Convert dimension to integers:
A | |
---|---|
1 | =file(“T.ctx”).open() |
2 | =file("T_new.ctx).create(…) |
3 | =DV=A1.cursor(D).id(D) |
4 | =A1.cursor().run(D=DV.pos@b(D)) |
5 | =A2.append@i(A4) |
In A3, all possible values DV of dimension D will be calculated. Usually, such values are known in advance (such as gender, city), and there is no need to traverse the table. Why it is written this way here is to illustrate its logical meaning. In A4, the DV will be employed to convert the dimension D to integers. The DV will be saved separately for use during query.
Code for slice aggregation:
A | |
---|---|
1 | =file(“T_new.ctx”).open() |
2 | =DV.(V.pos(~)) |
3 | =A1.cursor(…;A2(D)) |
4 | =A3.groups(…) |
A2 converts the parameter V into a boolean sequence of the same length as DV. When a member of DV is in V, the member at the corresponding position in A2 will be non-null (that plays the role of true when judging), otherwise it will be filled with null (i.e., false). Then, when traversing to perform slicing, just use the dimension D that has been converted to integer as the sequence number to take the member of this boolean sequence. If it is non-null, it means that the original dimension D belongs to the slice condition list V. The operation complexity of taking the value by sequence number is much smaller than that of contain(), which greatly improves the slicing performance.
The same process can be done for all enumeration dimensions.
Performance Optimization - 8.5 [Multi-dimensional analysis] Flag bit dimension
Performance Optimization - Preface
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL