Record-based, Index-driven Double Increment Segmentation Technique
The block-based segmentation strategy can achieve the 4 targets. But, besides the troubles inhandling block marks, the strategy isn’t suitable for segmenting data stored in a columnar format.
Segmentation of data stored column-wise should be synchronous on all columns. This means the corresponding segmentation points on all columns need to match the fields of one same record, otherwise data will be mismatched. Columns have different widths, and blocks of the same size holddifferent number of field values withdifferent columns stored in them. In this case, the block-based segmentation cannot guarantee synchronization among columns.
To realize the synchronization, we need to segment columns by records. By doing so, we should give up using fixed-size blocks and create an index on blocks. If data is static, thelength of the index is fixed. If data is keep increasing, the index needs to be increasing dynamically. For this we have two choices. One is to rewrite all data each time new data is appended to make sure that the index is continuous; the other is using some complicated mechanism to handle the discontinuous index. Both are resource-consuming.
The commonly used column segmentation strategy is a kind of block-based segmentation. It divides data into multiple blocks, and in each block, data is stored column-wise. Segmentation of theoriginal data is based on these blocks.There needs to be enough number of blocks to segment data as evenly as possible, but each block needs to be big enough to bring out the columnar storage’s potential. Here rises adilemma. The strategy is suitable when data amount issufficiently large. In addition, this segmentation strategy dividesdata by records, so blocks have different sizes and an index on blocks is needed. When new data is continuously appended, the index also becomes ever-increasing and loses its continuity.
In order to address this issue, we design a record-based, index-driven, double increment segmentation strategy.
Let’s set a fixed-length index spacethat can store N record addresses. N is a preset number, say, 1024. Position numberiin the index space stores the ith address.
Suppose therearen’t any records in the beginning. Add the first record (Record 1) and write its address, which is the one when it is written to the storage space (say, a file), at position number 1 in the index space. Then add Record 2 and write its address at position number 2 in the index space, and so on. Finally, add Record N and write its address at position number N in the index space.
The position numberi in the index space canberegarded asa block, which corresponds to all records between those pointed by theaddress at position numberiand the address at position number i+1 (this is a right-open interval). During the first round of data appending, each segment contains only one record.
By the next round of data appending, the index space is already full. Now let’s perform these actions: keep the address at position number 1, write address at position number 3 to position number 2 and address at position number 5 to position number 3, and address at position number 2i-1 to position number i until the address at position number2*N/2-1 (i.e. N-1) is written to position number N/2. Then delete all addresses from position number N/2+1 to position number N.
The effect is equivalent to merging block 1 and block 2 into a new block 1, merging block 3 and block 4 into a new block 2… merging block 2i-1 and block 2i into a new block i, which now holding two records. Since data records are written continuously, each of the merged blocks consists of continuous records.
Now the bottom half of the index space is empty, which means empty blocks are available.In the next round of data appending, write the addresses of two continuous records to each index position at a time– that is write the addresses of record N+1 and record N+2 to the position number N/2+1, and write the addresses of record N+3 and record N+4 to the position number N/2+2, and so on.
If data is continuously appended and all the positions in the index space are full, repeat the above merging actions to double the number of record addresses to 4 at each position to empty the positions in the bottom half of the index. In a new round of data appending, perform same actions to add 4 addresses of continuous records to each index position. Such actions can be repeated over and over again as data endlessly appended.
This mechanism ensures that there are always a fixed number of (N/2 to N) available blocks. As long as N is big enough (Generally 1024 is sufficient), we can segment data as evenly as possible. Each segment consists of contiguous blocks and each block contains continuous and integral records, which ensures therealization of target 3. And target 4 is already achieved with the above algorithm. This mechanismdoesn’t require the handling of marks as the fixed-size block-basedsegmentation strategy does. A segment can be retrieved from the starting point to the ending point uninterruptedly. It’s extremely convenient.
This double increment segmentation strategy is performed by records, so it’s suitable for segmenting data stored column-wise. By indexing newly-appended data in each column using this mechanism, mismatching won’t happen and the corresponding segmentation points in all columns will always form a correct record. Apart from that,without broken by any marks data in each column is continuous and smooth, while with the block-based segmentation strategy, data is continuous only within each block. Finally, we don’t have to face the conflict between the size and the number of blocks any more.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL