Performance Optimization - 2.6 [Dataset in external storage] Multi-zone composite table
The historical data used by an OLAP service generally is not updated frequently, but batch deletion of old data is sometimes necessary. Data from many years ago has lost its significance for query analysis; keeping it in the data table occupies a lot of space and degrades query performance. However, we require data to be stored contiguously, so even if the data itself is in chronological order, deleting the earliest period forces all of the remaining data to be rewritten, which is very time-consuming.
SPL provides a multi-zone composite table, which allows multiple files to form a composite table. These files are called the zones of the multi-zone composite table.
| | A |
|---|---|
| 1 | =file("data.ctx":to(12)).create(#ID,dt,…;month(dt)) |
| 2 | =file("data.ctx":[5,6,7,8,9,10,11,12]).open() |
| 3 | =A2.cursor(…) |
When creating a composite table, we can specify that it is composed of multiple physical files, each identified by a number called the zone number. That is what the to(12) parameter in A1 does: it declares that the composite table will be divided into 12 files with zone numbers 1, 2, …, 12. At the same time, an expression for calculating the zone number from the data is given, here month(dt). When data is appended with append@x(), SPL calculates month(dt) for each appended record and writes the record to the file of the corresponding zone number.
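The routing logic of append@x() can be sketched in Python. This is only a conceptual model, not SPL's implementation: the zones are represented by in-memory lists, and zone_of plays the role of the month(dt) expression.

```python
from datetime import date

def zone_of(record):
    # Zone-number expression, analogous to month(dt) in A1: gives 1..12
    return record["dt"].month

def append_to_zones(records, zones):
    # Route each record to its zone, as append@x does for the composite table
    for r in records:
        zones.setdefault(zone_of(r), []).append(r)
    return zones

zones = append_to_zones(
    [{"ID": 1, "dt": date(2023, 5, 3)},
     {"ID": 2, "dt": date(2023, 11, 20)},
     {"ID": 3, "dt": date(2023, 5, 9)}],
    {},
)
# the May records land in zone 5, the November record in zone 11
```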
When using the composite table, we can select only some of the zones to form a logical composite table. These zone numbers can be discontinuous, but must be in ascending order. When early data needs to be deleted, just delete the files of the corresponding zones, and then open the composite table without including those zones in the list; there is no need to rewrite the data in other zones.
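The deletion pattern amounts to removing files and shrinking the zone list. A minimal Python sketch, with hypothetical per-zone file names standing in for the physical zone files:

```python
# Hypothetical per-zone file names for a 12-zone composite table
all_zones = {z: "data_%d.ctx" % z for z in range(1, 13)}

# Drop the earliest data by deleting the files of zones 1..4;
# the files of zones 5..12 are never rewritten
for z in range(1, 5):
    del all_zones[z]

# Reopen the logical composite table with the remaining, still-ascending
# zone list, matching file("data.ctx":[5,6,7,8,9,10,11,12]).open() in A2
open_list = sorted(all_zones)
```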
In particular, when reading data, SPL merges the zones on the fly, sorted by the dimension fields specified when creating the multi-zone composite table, so that reading a multi-zone composite table performs the same as reading a single composite table. In other words, the data comes back ordered by the dimension fields without an explicit call to the merge function. Therefore, the cursor returned in A3 in the code above is guaranteed to be ordered by the ID field.
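The on-the-fly merge works because each zone is individually ordered by the dimension field, so a k-way ordered merge yields a globally ordered stream. A Python sketch of the idea using heapq.merge (not SPL's actual mechanism):

```python
import heapq

# Each zone file is individually ordered by the dimension field ID
zone5 = [{"ID": 2}, {"ID": 7}]
zone6 = [{"ID": 3}, {"ID": 5}]
zone11 = [{"ID": 1}, {"ID": 9}]

# Reading the multi-zone table merges the zones on the fly, so the cursor
# yields records already ordered by ID, with no explicit merge in user code
cursor = heapq.merge(zone5, zone6, zone11, key=lambda r: r["ID"])
ids = [r["ID"] for r in cursor]  # globally ordered: [1, 2, 3, 5, 7, 9]
```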
Similarly, the historical data file and the recent data file mentioned in the previous section can be stored as two zones of a multi-zone composite table (if the amount of historical data is large, it can be split into multiple zones). In this way, multiple files can be accessed as a single logical composite table, making the code more concise.
The method introduced in the previous section does not support real-time appending of data. Neither preparing new data in order nor merging new data into historical data can be done in real time, and the composite table is unusable while data is being appended. Therefore, that method is applicable only to scenarios where data is appended offline and at a low frequency.
Using the multi-zone composite table can make data appending more real-time.
Divide the data into several layers by time granularity, for example five layers: month, day, hour, minute, and second. Ideally, a complete multi-zone composite table is composed of 'zone tables' (hereinafter referred to as 'tables' unless otherwise specified) from each layer: month-layer tables for last month and earlier months; day-layer tables for yesterday and earlier days within the current month; hour-layer tables for the last hour and earlier hours within the current day; minute-layer tables for the last minute and earlier minutes within the current hour; and second-layer tables within the current minute.
The most recent real-time data is sorted directly in memory and stored in a second-layer table. After one minute, these second-layer tables will be merged into a minute-layer table. After one hour, these minute-layer tables will be merged into an hour-layer table. After one day, these hour-layer tables will be merged into a day-layer table. After one month, these day-layer tables will be merged into a month-layer table.
A new table is always created when merging; the existing tables are still used during the merging process. When the merge is complete and the new table is enabled, the smaller tables involved in the merge can be deleted. To be specific, while the second-layer tables are being merged into a minute-layer table, those second-layer tables remain available; once the merge is complete and the new minute-layer table is enabled, the second-layer tables can be deleted. The same applies to the other layers.
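The merge-then-swap step can be sketched in Python. This is a toy model of the idea, not SPL's routine: each table is an ordered list, the new larger table is built off to the side, and the small tables stay readable until the swap.

```python
import heapq

def merge_up(small_tables):
    # Build the new, larger table while the small ones stay readable;
    # the caller swaps it in and only then deletes the small tables
    return list(heapq.merge(*small_tables))

# Second-layer tables accumulated within one minute, each individually ordered
second_tables = [[1, 4], [2, 6], [3, 5]]

minute_table = merge_up(second_tables)
# during merge_up, queries still read second_tables;
# after the swap, second_tables may be deleted and minute_table is used
```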
Merging and appending can be done in parallel. Because the smaller tables remain usable during a merge, even a merge that cannot be completed quickly (merging day-layer tables with large amounts of data may take several hours) will not affect the availability of the data. The only cost is a slight performance penalty when reading recent data, since many small tables exist before they are merged.
This mechanism allows new data to be appended at any time. The new data and the historical data together form a multi-zone composite table that is usable at any moment, so data appended in real time takes effect immediately. Moreover, the multi-zone composite table guarantees ordered reading by the dimension fields, which effectively implements real-time sorting during the append process.
The entire process can be written in SPL code, but there are many complex details. SPL officially provides a complete set of routines, which can be referenced and downloaded on the SPL forum.