The Performance Characteristics of Hard Disks
Compared with expensive, fast memory, hard disks are much cheaper but one to two orders of magnitude slower. The problems with hard disks, however, go beyond slow speed.
The basic characteristic of hard disks is that they are bad at high-frequency accesses in small amounts, that is, access patterns where data is requested very frequently and only a small amount is retrieved each time. For memory, retrieving one hundred million bytes takes almost the same time whether it is done in one million accesses of 100 bytes each or in one hundred accesses of one million bytes each. This is because the byte is the smallest unit of memory access: however the data is retrieved, the number of bytes involved is the same.
Hard disk accesses use a different mechanism. A hard disk stores data in blocks, and the block is the smallest unit of reading, usually 4KB at the operating system level. Reading one byte from the disk therefore takes the same time as reading 4KB. This in turn means that reading the same amount of data can take hugely different lengths of time, for example in one million accesses of one hundred bytes each versus one hundred accesses of one million bytes each. Besides, unlike a memory access, for which one CPU instruction is enough, a hard disk access involves a series of complex actions; in principle, one million accesses perform those actions one million times. Of course, reading one million bytes at a time also involves multiple actions (the bytes span multiple data blocks), but the total number of actions is far smaller. Low-frequency access in large chunks therefore generally performs much better than high-frequency access in small amounts.
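A minimal sketch in Python can make the contrast concrete. The file name data.bin and the sizes are assumptions for illustration only; on a warm operating system cache both runs may look fast and Python's own call overhead also contributes, so the absolute numbers will vary, but the per-request cost of the million-read case still shows up.

```python
import os, time

PATH = "data.bin"                      # hypothetical test file
TOTAL = 100 * 1024 * 1024              # the same 100MB is read in both scenarios

# Create the test file once.
if not os.path.exists(PATH):
    with open(PATH, "wb") as f:
        f.write(os.urandom(TOTAL))

def timed_read(chunk_size):
    # Read the whole file in chunks of chunk_size bytes; buffering=0 avoids
    # Python-level buffering so each read() is a real request to the OS.
    start = time.perf_counter()
    with open(PATH, "rb", buffering=0) as f:
        while f.read(chunk_size):
            pass
    return time.perf_counter() - start

print("100-byte reads:", timed_read(100), "s")         # about one million requests
print("1MB reads:     ", timed_read(1024 * 1024), "s") # about one hundred requests
```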
However, the final performance also depends on whether the data is stored contiguously. If it is, even one million accesses of one hundred bytes each won't take very long, because the data to be read next is often already inside a previously retrieved block and, thanks to the disk's and the operating system's caching, won't be read from the disk again. The actual number of hard disk accesses isn't huge and the performance drop is small. So we need to add “random” before “high-frequency accesses of small amounts” to indicate that the content to be read is discontinuous. In that case, once the target bytes are extracted from a retrieved block, the rest of the block is useless and wasted, and the same block may have to be re-read later if it is needed again. Performance then drops sharply.
HDDs also have the seek-time problem of locating the target data. A seek is an extremely slow mechanical movement, much slower than reading the data itself. Even if every retrieved block were fully used, for random accesses the seek time would probably exceed the retrieval time. When using HDDs, pay special attention to avoiding random high-frequency accesses. SSDs fare much better, as they have no seek actions, but their performance on random, high-frequency accesses in small amounts is still poor, because disk blocks are simply too large.
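To see the effect of continuity by itself, the hedged sketch below reads the same number of 100-byte pieces from the hypothetical data.bin used above, once at sequential offsets and once at random ones. On a cold HDD the gap is dominated by seek time; on an SSD it is smaller but still present; and if the whole file fits in the operating system cache the gap may disappear, so treat this as an illustration rather than a benchmark.

```python
import os, random, time

PATH = "data.bin"                      # hypothetical test file from the sketch above
SIZE = os.path.getsize(PATH)
N, PIECE = 100_000, 100                # 100,000 reads of 100 bytes each

def timed_reads(offsets):
    start = time.perf_counter()
    with open(PATH, "rb", buffering=0) as f:
        for off in offsets:
            f.seek(off)
            f.read(PIECE)
    return time.perf_counter() - start

sequential = [i * PIECE for i in range(N)]                         # contiguous reads
scattered = [random.randrange(0, SIZE - PIECE) for _ in range(N)]  # random reads

print("sequential:", timed_reads(sequential), "s")
print("random:    ", timed_reads(scattered), "s")
```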
Then, for computations that involve only continuous batch access (such as traversal-based aggregation), is hard disk performance determined solely by the disk's speed?
The answer is yes if it is a single, single-threaded computing task. But parallel processing has become indispensable to high-performance computation, and many computing services need to support concurrency. Both of them turn continuous hard disk accesses, to some extent, into random accesses. The reason is simple: multiple threads share the same hard disk, and their access requests target different, discontinuous locations. As the disk serves these requests simultaneously, it keeps jumping between them, which leads to random accesses.
Usually this is disastrous for an HDD. Frequent switching between threads can even produce the strange result that multithreaded processing runs slower than single-threaded processing. In some scenarios the single-task performance is satisfactory, but performance falls dramatically under concurrency. A countermeasure is to enlarge the buffer that holds retrieved data: if each thread retrieves a large enough amount of contiguous data, the proportion of seek time shrinks until it is no longer noticeable. The price is higher memory consumption, and the more threads there are, the more memory is needed.
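The countermeasure can be sketched like this, again against the hypothetical data.bin: each thread is assigned its own contiguous slice of the file and scans it in large chunks, so the per-request cost (and on an HDD the seek time) is amortized over many bytes. The thread count and the 8MB chunk size are illustrative assumptions; memory consumption grows with both.

```python
import os, threading

PATH = "data.bin"                      # hypothetical test file from the sketches above
THREADS = 4                            # illustrative thread count
CHUNK = 8 * 1024 * 1024                # large per-thread buffer: fewer, bigger requests

def read_slice(start, length):
    # Each thread opens its own handle and scans one contiguous slice in big
    # chunks, so requests from different threads interleave far less often.
    with open(PATH, "rb") as f:
        f.seek(start)
        remaining = length
        while remaining > 0:
            data = f.read(min(CHUNK, remaining))
            if not data:
                break
            remaining -= len(data)

size = os.path.getsize(PATH)
slice_len = size // THREADS
workers = [threading.Thread(target=read_slice, args=(i * slice_len, slice_len))
           for i in range(THREADS)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```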
A similar scenario is columnar storage, which stores data column by column. When a computation involves multiple columns, random hard disk accesses occur even in a single thread.
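A rough sketch of why this happens: with one file per column, a computation over two columns keeps alternating between two regions of the disk even in a single-threaded scan. The file names are hypothetical, and a real column store reads compressed blocks rather than one value at a time; operating system read-ahead softens, but does not remove, the effect.

```python
import os, struct

COLS = ["amount.col", "price.col"]     # hypothetical: one file per column
N = 1000

# Create tiny demo column files (a real column store holds millions of rows).
for path in COLS:
    if not os.path.exists(path):
        with open(path, "wb") as f:
            for i in range(N):
                f.write(struct.pack("d", float(i)))

# Computing amount * price touches both files for every row; on an HDD the head
# keeps jumping between two distant regions even though this is a single thread.
with open(COLS[0], "rb") as a, open(COLS[1], "rb") as b:
    total = 0.0
    for _ in range(N):
        x = struct.unpack("d", a.read(8))[0]
        y = struct.unpack("d", b.read(8))[0]
        total += x * y
print(total)
```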
Because of this characteristic of hard disks, in-memory computations and external-storage computations need different algorithms, and the two kinds of computation should even be defined differently.
Relational algebra does not distinguish in-memory computations from external-storage computations; it defines all computations in a uniform way. As a result, relational databases run into far more trouble handling computations outside memory than inside it. A typical example is the JOIN operation. The HASH JOIN algorithm commonly used by databases needs to traverse the data twice. That is no burden in memory, but for data on external storage the cost of a traversal is usually higher than the cost of the computation itself, and with an unlucky hash distribution there will be multiple rounds of hash partitioning and repeated traversals, so performance drops sharply. However, if we redefine JOINs with the characteristics of external storage (i.e., hard disks) fully taken into account, on the condition that real business requirements can still be met, we can design a low-complexity algorithm that traverses the data only once, or even avoids a full traversal, and thus achieves high performance.
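For illustration only (this is not esProc SPL's actual algorithm), here is a sketch of the one-pass idea: if the storage guarantees that both sides are ordered by the join key, a merge join streams through each side exactly once, whereas a hash join must traverse the data twice.

```python
# Illustrative one-pass merge join; NOT esProc SPL's actual algorithm.
def merge_join(left, right):
    """left and right yield (key, value) pairs ordered by key; unique keys assumed."""
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            yield l[0], l[1], r[1]      # match: emit and advance both sides
            l, r = next(left, None), next(right, None)
        elif l[0] < r[0]:
            l = next(left, None)        # advance the smaller side only
        else:
            r = next(right, None)

# Usage: each iterator could stream rows from a key-ordered file on disk,
# so each side is read exactly once, with no second traversal.
a = iter([(1, "a"), (3, "b"), (5, "c")])
b = iter([(3, "x"), (4, "y"), (5, "z")])
print(list(merge_join(a, b)))           # [(3, 'b', 'x'), (5, 'c', 'z')]
```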
Well, that is what distinguishes esProc SPL from relational databases.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL