File Performance Issues
In a previous article, we discussed the performance characteristics of hard disks, mainly at the hardware and OS levels. Here we look at performance at the application software level.
In theory, software can bypass the operating system and access the disk directly. In practice this route is rarely feasible, because it is inconvenient and sacrifices compatibility. So let’s set it aside and move on to the storage forms the OS provides, among which files are by far the most important.
Text is the most common file format. It is widely used because it is universally applicable and highly readable. But text has extremely poor performance!
Text characters cannot participate in computations directly. They first need to be converted into in-memory data types, such as integer, real number, date and string, before further processing. This conversion, known as text parsing, is very complicated.
Suppose we want to convert the text “12345” into the in-memory integer 12345. The process is as follows:
1) First, set the result’s initial value to 0;
2) Take the character “1” from the text and parse it into the numeric value 1, then multiply the current result 0 by 10 and add 1 to get 1;
3) Take the character “2” and parse it into the numeric value 2, then multiply the current result 1 by 10 and add 2 to get 12;
…
All C programmers know that they can use the atoi() function to convert a string into an integer with a single line of code. In fact, there are many steps behind this seemingly simple operation, and the CPU needs many instructions and a comparatively long time to complete it. In real-world practice, we also need to detect illegal characters (such as nonnumeric characters) that may appear, so the actual process is even more complicated.
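The steps above, together with the illegal-character check, can be sketched in Python as follows. This is only an illustration of the work involved, not how atoi() is actually implemented; the function name parse_int and its handling of a leading sign are assumptions made for the example.

```python
def parse_int(text: str) -> int:
    """Convert a decimal string to an int one character at a time."""
    if not text:
        raise ValueError("empty input")

    # Optional leading sign (an assumption; the steps in the article skip it).
    sign = 1
    start = 0
    if text[0] in "+-":
        sign = -1 if text[0] == "-" else 1
        start = 1
    if start == len(text):
        raise ValueError("no digits found")

    result = 0                            # step 1: initial value is 0
    for ch in text[start:]:
        if not ("0" <= ch <= "9"):        # reject illegal characters
            raise ValueError(f"illegal character {ch!r}")
        digit = ord(ch) - ord("0")        # parse one character into its value
        result = result * 10 + digit      # multiply by 10, then add the digit
    return sign * result


print(parse_int("12345"))
```

Every character costs a comparison, a subtraction, a multiplication and an addition, which is why even this simplest case is far from free.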
Integers are the simplest data type. For real numbers, we need to take care of the decimal point; string parsing needs to handle escape characters and quotation mark matching; and date parsing has even more to take into account because there are various formats, such as 2018/1/10 and 10-1-2018, both of which are common and legal, and even Jan-10 2018, a less common one. We have to try multiple formats to parse a date correctly, which leads to very long CPU time.
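A minimal sketch of such format matching, using Python's standard datetime.strptime, might look like this. The function name parse_date and the list of candidate formats are assumptions for illustration, and 10-1-2018 is assumed to be day-first. Each failed attempt is wasted CPU work, and a real parser would have many more formats to try.

```python
from datetime import date, datetime

# Hypothetical list of accepted formats, tried in order.
CANDIDATE_FORMATS = (
    "%Y/%m/%d",   # 2018/1/10
    "%d-%m-%Y",   # 10-1-2018 (assumed day-first)
    "%b-%d %Y",   # Jan-10 2018 (month names depend on the locale)
)


def parse_date(text: str) -> date:
    """Try each known format until one matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"unrecognized date: {text!r}")


for sample in ("2018/1/10", "10-1-2018", "Jan-10 2018"):
    print(sample, "->", parse_date(sample))
```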
In most cases, disk retrieval accounts for the lion’s share of the time spent accessing external data. With text, however, the performance bottleneck often lies in the CPU processing phase. Because parsing is so complicated, the CPU time is likely to be longer than the disk retrieval time (particularly when high-performance SSDs are used). Text parsing is extremely slow, so do not use text files for big data processing if you expect high performance!
However, some original data (such as logs) exists only in text format, so text parsing is inevitable. In that case we can use parallel processing: since modern machines have multiple CPU cores, we can parse the text file with multiple threads and obtain higher processing performance even though the disk is accessed serially.
If we need to use the text data repeatedly, it is better to convert it to a binary format for storage so that there is no need to parse it again for the next computation.
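The division of labor described here, one serial reader feeding several parsing workers, can be sketched as follows. The names parse_chunk, read_chunks and parse_parallel, as well as the comma-separated integer layout, are assumptions for the example. Note that in CPython the global interpreter lock prevents threads from speeding up pure-Python parsing, so this sketch shows only the structure; in practice one would swap in a process pool or use a language with truly parallel threads.

```python
import io
from concurrent.futures import ThreadPoolExecutor


def parse_chunk(lines):
    """Parse one batch of lines; each line holds comma-separated integers."""
    return [[int(field) for field in line.split(",")] for line in lines]


def read_chunks(stream, chunk_size):
    """Serial read: pull lines from the file and group them into batches."""
    chunk = []
    for line in stream:
        line = line.strip()
        if line:
            chunk.append(line)
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def parse_parallel(stream, workers=4, chunk_size=1000):
    """Read serially, parse batches on a worker pool, keep the row order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order the chunks were submitted.
        parsed = pool.map(parse_chunk, read_chunks(stream, chunk_size))
        return [row for chunk in parsed for row in chunk]


text = io.StringIO("1,2\n3,4\n5,6\n")
print(parse_parallel(text, workers=2, chunk_size=2))
```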
A binary file lets us write the in-memory bytes of a value directly to the file. When the data is read later, those bytes are loaded straight back into memory, with no complicated parsing and no checks for illegal characters, which brings much better performance.
We need to first decide on a compression method when storing data in a binary format; otherwise the data may take more storage space than it does as text, and disk retrieval time becomes longer even though parsing time is shortened.
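As a rough illustration, Python's standard struct module can write fixed-layout records as raw bytes and load them back without any character-level parsing. The record layout used here (a 32-bit integer followed by a 64-bit float, little-endian) and the function names are assumptions for the example, not a format described in this article.

```python
import io
import struct

# Hypothetical record layout: 32-bit int + 64-bit float, little-endian, no padding.
RECORD = struct.Struct("<id")


def write_records(stream, records):
    """Write each (int, float) pair as 12 raw bytes."""
    for key, value in records:
        stream.write(RECORD.pack(key, value))


def read_records(stream):
    """Load the bytes back into (int, float) tuples; no text parsing involved."""
    data = stream.read()
    return [RECORD.unpack_from(data, offset)
            for offset in range(0, len(data), RECORD.size)]


buffer = io.BytesIO()          # stands in for a file opened in binary mode
write_records(buffer, [(1, 2.5), (12345, -0.125)])
buffer.seek(0)
print(read_records(buffer))
```

The same functions work with a real file opened via open(path, "wb") or open(path, "rb").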
Take the integer 1 as an example. Stored as text, it occupies only one byte, or two bytes when followed by a separator. But if we store every integer as 32 bits (the length of most integer types on today’s computers), it occupies four bytes, twice the space used by the text form, and sometimes more when data type information is also stored.
A reasonable approach is to determine the byte length according to the size of the integer. A small integer is stored in only one or two bytes, while a big integer takes more bytes. As small integers are more common, this helps reduce the total storage space used and brings performance benefits.
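One common way to do this is a variable-length encoding in the style of LEB128, where each byte carries 7 bits of the value and the high bit marks whether more bytes follow. The article does not say which scheme it has in mind, so the sketch below is just one possibility; encode_uint and decode_uint are hypothetical names, and only non-negative integers are handled.

```python
def encode_uint(n: int) -> bytes:
    """Encode a non-negative integer using 7 data bits per byte."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)          # high bit clear: last byte
            return bytes(out)


def decode_uint(data: bytes, pos: int = 0):
    """Decode one integer starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):         # the extra check paid on every byte read
            return result, pos
        shift += 7


for value in (1, 300, 2**31 - 1):
    print(value, "->", len(encode_uint(value)), "byte(s)")
```

Values below 128 take a single byte, while the largest 32-bit signed value takes five, so the scheme pays off only when small values dominate.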
Of course, a higher compression ratio does not necessarily mean better performance, because decompressing data takes CPU time. As we said above, storing integers according to their sizes reduces storage space, but it adds an extra check when the data is read and thus costs some performance. The compression strategy we choose should strike a balance between disk space usage and CPU consumption. Pursuing an extreme compression ratio (for example, by using the Zip compression algorithm) does lower space usage further, but the CPU time can then exceed the disk retrieval time, resulting in lower overall performance.
Unfortunately, there is no standard for binary file formats, and vendors offer their own. Binary files can be faster, but whether a given one is truly fast depends on the specific format and how it is implemented.
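The trade-off can be observed with a small experiment using Python's standard zlib module, which implements DEFLATE, the method most commonly used in Zip archives. The sample data and the function name measure are assumptions for illustration, and the timings depend entirely on the machine and the data, so no particular numbers should be expected.

```python
import time
import zlib


def measure(data: bytes, level: int):
    """Return (compressed size, compress seconds, decompress seconds)."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    unpacked = zlib.decompress(packed)
    decompress_s = time.perf_counter() - start

    if unpacked != data:
        raise RuntimeError("round trip failed")
    return len(packed), compress_s, decompress_s


# Hypothetical sample: 100,000 small integers stored as comma-separated text.
sample = ",".join(str(i % 1000) for i in range(100_000)).encode("ascii")

for level in (1, 6, 9):   # low, default-like, and maximum effort
    size, c_s, d_s = measure(sample, level)
    print(f"level {level}: {size} bytes, "
          f"compress {c_s:.4f}s, decompress {d_s:.4f}s")
```

Comparing these CPU times with the time the disk needs to deliver the saved bytes shows whether a given compression level helps or hurts overall.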
For example, Excel files are to some extent a binary format and do store data types. But they also store a great deal of appearance information and inter-cell association information. The format is quite complex, and reading and writing Excel files is much slower than reading and writing text. So, do not use the Excel format to import or export large amounts of data if you expect high performance.
Databases can also be seen as a type of binary format, and they usually have much better read/write performance than text files. But the actual performance depends on what they were designed for. TP databases, intended for transaction processing, generally cannot compress data because the data is read and written frequently, so their storage efficiency is low. AP databases, designed for data computation, can use columnar storage to increase the compression ratio, which helps achieve much higher read/write performance.
esProc SPL provides the btx and ctx file formats, and both are fast. The btx format uses simple row-wise storage, is 3-5 times faster than text formats, and achieves a compression ratio of about 30%-50%. The ctx format implements columnar storage in a single file, which further increases the compression ratio and achieves higher performance in the many scenarios where the computation does not need to retrieve every column.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL