What You Possibly Don’t Know About Columnar Storage
The columnar storage is a commonly used storage technique. Often it implies high performance, and has basically become a standard configuration for today’s analytical databases.
The basic principle of columnar storage is reducing the amount of data retrieved from the hard disk. A data table can have a lot of columns, but the computation may use only a very small number of them. With the columnar storage, useless columns do not need to be retrieved; while with the row-wise storage, all columns need to be scanned. When the retrieved columns only take up a very small part of the total, the columnar storage has a big advantage in terms of IO time and computation seems to get much faster.
But the columnar storage also has another side – it isn’t the fastest for any scenarios.
Implementing columnar storage is much more complex than implementing row-wise storage because for a data table the number of columns can be determined in advance but the number of rows will not stop growing. With the row-wise storage, we write and append data to the table according to the order of records. It is easy to store the data table as a single file. But this does not work for data stored in columnar format. As there will be data appending, we cannot know the number of rows beforehand and it is thus impossible to finish writing a column and then the next. Generally, we divide the storage space into a number of blocks, write a fixed number of rows (represented by N) to one block, and then move to the next when finish the writing. Later data will be retrieved block by block. In each block data is stored in the column-wise format, while between blocks data can be regarded as stored row-wise. A special management module, where a table of contents is used to record information of the continuously growing data blocks and every column they store, is needed, causing a lot of inconveniences. So, it is difficult to implement columnar storage in a single data file. The storage schema is usually adopted by special data warehouse products.
But the block storage mechanism is unfriendly to the implementation of parallel processing when data amount is not large. Parallel processing requires that data be divided into multiple segments. To be able to do this, there are two conditions: almost equal amount of data in each segment (equal processing load for each thread) and the ability for flexible segmentation (the number of parallel tasks cannot be determined beforehand). Row-wise data can be segmented according to the number of rows, and parallel processing becomes feasible even for a very small amount of data. Column-wise data can only be divided into blocks, where data cannot be further divided. The number of records (the above-mentioned N) should not be too small; otherwise, too many resources will be wasted due to the existence of the smallest disk retrieval unit. In the extreme case of N=1, the storage schema is equal to row-wise storage. And when N is too small and the total amount of data involved is huge, the table of contents becomes very large and overburdens the content management. So, N is usually specified as one million or above. In order to segment data flexibly, there need to be at least hundreds of data blocks. That is to say, the parallel computation on column-wise data becomes smooth only when the total amount reaches at least hundreds of millions of data rows.
esProc SPL offers the double increment segmentation strategy to make N grows as data amount increases while maintaining the same number of data blocks. This way the size of the table of contents can also be fixed, the columnar storage can be conveniently implemented in a single file, and flexible segmentation can be implemented for performing parallel computation on a small amount of data.
According to principle of the columnar storage, the storage schema brings an obvious advantage only when the computation involves a relatively small number of columns. Many performance test cases (such as TPCH used as the international standard) choose such computing scenarios so they are convenient for bringing out the advantages of columnar databases. Those are only a part of the real-life business scenarios. In finance industry, it is not rare that a computation involves most of the columns in a table having over one hundred columns. In that case columnar storage only gives half play to its advantage. Even if the columnar storage has higher compression ratio and smaller amount of data retrieved than the row-wise storage, its advantage is not that noticeable when many columns participate in the computation. After all, the process of retrieving data stored column-wise is much more complex than that of retrieving row-wised stored data.
Therefore, when a real-world computation does not have as good performance as the test case gets, it is normal and this does not mean that the test result is fake.
Columnar storage also leads to random disk accesses. Data in each column is stored continuously, but data in different columns isn’t. The more columns that are retrieved, the more serious the degree of randomness resulted from the retrieval, even with a single-thread task. For SSDs, it isn’t a very serious problem because, when data in each column is continuous and the above-mentioned N is big enough, the retrieval cost takes up a very small proportion and the SSD does not have the seek time.
But for HDDs that have the seek time, the problem becomes disastrous. When a lot of columns are accessed, it is probably that the performance is even not as good as that of the row-wise storage. Both concurrency and parallel processing will worsen the problem. On the other hand, increasing the size of cache to alleviate the problem will occupy too much memory space.
Be cautious when you try to use the columnar storage on HDDs.
Another big problem with the columnar storage is that it has much lower indexing performance than the row-wise storage. As we said, the index table stores ordered key values and positions of their corresponding records in the original table. For the row-wise storage, the position of a record can be represented by one number; but for the columnar storage, each column in a record has a different position and, in principle, these positions should all be recorded. This creates an index table almost as big as the original table, which leads to heavy storage utilization and high retrieval cost. There isn’t much difference between this and the method of copying the original table and sorting it.
Of course, no one will do that in real-world practices. The general approach is still the previously mentioned block storage mechanism. The index only stores ordinal numbers of the records. The search reads an ordinal number from the index table, locates the corresponding block, “counts” from the first record to the one with the corresponding ordinal number in the block, and retrieve the column value. The “count” action is performed on each column. In the best-case scenario, a number of disk units equal to the number of columns will be read; if you are not lucky, the whole block will be scanned. By contrast, an index for the row-wise storage generally only needs to read one or two disk units (determined by the space the records occupy). The amount of data retrieved under columnar storage is dozens of, even one hundred, times more than that under row-wise storage. With HDDs, there is also the unbearable seek time. Therefore, the columnar storage basically cannot handle the high-concurrency query requirements.
Use the columnar storage for traversal and the row-wise storage for search. For data on which both traversal and search requires high performance, it is even necessary to store two copies of data redundantly. The data platform should permit programmers to adopt the most suitable storage schema for each computing scenario, rather than making the same decision for them for all scenarios.
Well, esProc SPL allows users to choose a more suitable one between the row-wise storage and the columnar storage. It also offers the value-attached indexing strategy to store a row-wise data copy of the traversal-oriented columnar data for the search.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version