Why Are Databases Faster Than Text Files?
Why text files are slower
The main reason is that reading a text file involves a great deal of data parsing.
Let’s look at an example. Suppose we need to convert the text "12345" to a binary integer in memory. The process goes like this:
1. Set the initial value of the result to 0.
2. Get character “1” and parse it into the number 1; multiply the current result 0 by 10 and add 1 to get 1.
3. Get character “2” and parse it into the number 2; multiply the current result 1 by 10 and add 2 to get 12.
4. Get character “3” and parse it into the number 3; multiply the current result 12 by 10 and add 3 to get 123.
5. And so on.
Some C programmers know that the atoi() function converts a string to an integer; a single line of code is enough. But behind this seemingly simple call lies a large number of computational steps, which take many CPU operations and a relatively long time to execute. The actual process is even more complicated, since it also has to check for illegal characters (such as non-numeric ones).
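To make the cost concrete, here is a minimal Java sketch of such a digit-by-digit conversion (a simplified, hand-rolled counterpart of atoi(); sign handling and overflow checks are omitted):

```java
// A simplified sketch of text-to-integer parsing, assuming non-negative ASCII digits only.
public static int parseInt(String s) {
    int result = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c < '0' || c > '9') {             // reject illegal (non-numeric) characters
            throw new NumberFormatException(s);
        }
        result = result * 10 + (c - '0');     // multiply by 10 and add the new digit
    }
    return result;
}
```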
Integers are the simplest data type to handle. Parsing real numbers also requires handling the decimal point, and parsing strings involves matching escape characters and quotation marks. Dates are even harder to parse because they come in so many formats: both 2018/1/10 and 10-1-2018 are common, legal date formats, and Jan-10 2018 uses a less common one. To parse a date correctly, the parser has to try many candidate formats until one matches, and the CPU spends a lot of time on this matching.
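The following Java sketch gives a sense of that matching cost: it simply tries a list of candidate patterns until one succeeds (the pattern list here is illustrative, not exhaustive):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

public class DateParseSketch {
    // Candidate formats covering 2018/1/10, 10-1-2018 and Jan-10 2018.
    static final DateTimeFormatter[] FORMATS = {
        DateTimeFormatter.ofPattern("yyyy/M/d"),
        DateTimeFormatter.ofPattern("d-M-yyyy"),
        DateTimeFormatter.ofPattern("MMM-d yyyy", Locale.ENGLISH)
    };

    static LocalDate parseDate(String s) {
        for (DateTimeFormatter f : FORMATS) {
            try {
                return LocalDate.parse(s, f);   // every failed attempt still costs CPU time
            } catch (DateTimeParseException ignored) {
                // fall through and try the next format
            }
        }
        throw new IllegalArgumentException("Unrecognized date: " + s);
    }
}
```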
This parsing issue that makes text files slow does not exist in databases. Databases do not store data as text; they use a binary format instead. The storage format encapsulates data type information, so no parsing is needed at retrieval time, which brings higher performance.
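The contrast is easy to see even with plain Java I/O. In this hedged sketch (the file name is made up), integers written with DataOutputStream come back as fixed 4-byte values, so reading involves no character-by-character parsing at all:

```java
import java.io.*;

public class BinaryVsText {
    public static void main(String[] args) throws IOException {
        // Write integers in binary form: each value occupies a fixed 4 bytes.
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("nums.bin")))) {
            for (int i = 0; i < 1000; i++) out.writeInt(i);
        }
        // Read them back: the bytes are the integer, so no parsing step is needed.
        long sum = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream("nums.bin")))) {
            for (int i = 0; i < 1000; i++) sum += in.readInt();
        }
        System.out.println(sum);
    }
}
```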
Faster binary files
Since complicated data parsing is what makes text files slow, can we improve performance by storing data in a binary format?
Of course. Binary files can even offer higher performance than databases!
Databases usually have to support data updates, and this requirement introduces several factors that hurt performance.
1. Compactness & Compression
Generally, databases use block storage with a segment-page scheme to support data updates. The block storage scheme sets aside free space in each block for later updates, so it occupies more disk space than a file that carries no update burden.
In addition, the design for data updates makes it difficult to compress data. If a small integer is stored in fewer bytes, for instance, the original space becomes insufficient when a larger value later replaces it, and all the data after it has to be moved backwards to make room. That pushes up the cost of updates, which in turn keeps database vendors away from data compression techniques; they allocate space according to data type instead. This alternative wastes space and, in extreme cases, can use even more space than a text file.
Binary files carry none of these burdens. They are free to store data in a compact format and to compress it so that reading is faster (less hard disk access time). However, a higher compression ratio is not always better: it takes CPU time to decompress data, and over-compressing only reduces performance. An appropriate compression strategy has to balance disk usage reduction against CPU consumption.
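As a rough illustration of that balance (a sketch only, using java.util.zip rather than any particular product's codec), a writer can pick a cheaper compression level to keep CPU cost low:

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class CompressSketch {
    public static void main(String[] args) throws IOException {
        // BEST_SPEED trades some compression ratio for much lower CPU cost;
        // BEST_COMPRESSION would shrink the file further but spend far more CPU time writing.
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try (DataOutputStream out = new DataOutputStream(
                new DeflaterOutputStream(new FileOutputStream("data.zlib"), deflater))) {
            for (int i = 0; i < 1_000_000; i++) {
                out.writeInt(i);                // compact binary values go through the compressor
            }
        }
    }
}
```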
2. Complexities of transaction consistency
Many commercial databases also support OLTP, which requires read consistency. This greatly complicates data access: a record may have multiple copies because of the write operations of other transactions, and the database has to locate the right one according to the start time of the current transaction. This is a very complex action that affects performance considerably. Binary files do not have this consistency problem.
3. Inconvenience for segmentation
The block storage structure databases use is inconvenient for flexible segmentation. Without agile segmentation it is harder to implement parallel processing, so databases run slower than binary files here as well.
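By contrast, a flat binary file with fixed-width records can be segmented at will: each worker simply seeks to its own byte range. Here is a hedged Java sketch (the file name and record layout are assumptions):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class SegmentSketch {
    static final int RECORD_BYTES = 4;   // fixed-width records: one int each

    // Sum the k-th of n equal segments; each worker thread can call this independently.
    static long sumSegment(String path, int k, int n) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long records = raf.length() / RECORD_BYTES;
            long begin = records * k / n, end = records * (k + 1) / n;
            raf.seek(begin * RECORD_BYTES);          // jump straight to the segment start
            long sum = 0;
            for (long i = begin; i < end; i++) sum += raf.readInt();
            return sum;
        }
    }
}
```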
4. Low I/O performance
Slow I/O is a common database problem. In-database processing performs reasonably well, but when a complex computation is hard to code in SQL, data has to be fetched out through the database interface and computed outside the database, and that interface is usually very slow. Tests show that retrieving data from a database is often slower than reading a text file, and much slower than reading a binary file.
But why are binary files rarely used in practice when they have so many advantages over databases?
Because files, including binary files, have a fatal weakness: they have no computing ability of their own.
Merely storing data does not generate value. Data becomes useful only when it can be used, which in essence means being computable. Databases can not only store data but also compute it conveniently. They use SQL, the specialized database query language, and can handle many computations, such as grouping & aggregation and table association, in one line of code. Files have no computing ability of their own; they have to be computed through a programming language, and the process is complicated and difficult. Java is the language applications most commonly use to compute files, but this high-level language lacks specialized class libraries for structured data computation, so it takes a huge pile of code to achieve a SQL-style set-oriented operation – dozens of lines for a simple grouping & aggregation calculation, let alone more complex ones.
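To see the gap, compare the single SQL line SELECT area, SUM(amount) FROM orders GROUP BY area with a plain Java version. Even this stripped-down sketch (the CSV layout and column order are assumptions, with no type handling or error checking) already takes far more code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class GroupSumSketch {
    public static void main(String[] args) throws IOException {
        Map<String, Double> totals = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get("orders.csv"))) {
            lines.skip(1).forEach(line -> {               // skip the header line
                String[] f = line.split(",");
                String area = f[1];                        // assumed column order: id, area, amount
                double amount = Double.parseDouble(f[2]);
                totals.merge(area, amount, Double::sum);   // accumulate the per-area sum
            });
        }
        totals.forEach((area, sum) -> System.out.println(area + "\t" + sum));
    }
}
```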
Besides the lack of computing ability, binary files have no relatively standard and uniform storage scheme – no agreed compression algorithm or efficient segmentation method – whereas text files at least follow a set of standardized conventions. Building such a scheme from scratch each time is not as convenient as using a ready-made database.
Open-source SPL offers uniform, easily-computable binary files
Open-source SPL makes it convenient to exploit high-performance binary files.
It provides a uniform binary file storage scheme and binary file processing capability, enabling us to make good use of the advantages of binary files and thus to compute data efficiently.
SPL binary files
Bin file
Bin files are the basic binary data format SPL uses. They are compressed (less space usage and faster retrieval), store data type information (no parsing phase, so retrieval is faster), and can be conveniently segmented through the specially designed double increment segmentation technique, which supports data appending and makes parallel processing easy, further increasing performance.
It is simple for SPL to convert a text file to a bin file, as shown below:
|   | A |
|---|---|
| 1 | =file("data.btx").export@b(file("data.txt").cursor@t()) |
The code reads the text file content through a cursor in a streaming style and exports it to the binary bin file data.btx. The @b option enables export as a binary file. If a bin file already exists and you want to append data to it, just add the @a option to the export() function, as shown below:
=file("data.btx").export@ab(file("data.txt").cursor@t())
A newly generated binary file possesses all the qualities of an SPL bin file described above. One of them, the double increment segmentation technique, divides data into multiple segments and stores it by segment. Parallel processing over segmented data is convenient, but data appending has always been an issue for segmented storage; the double increment method ensures that segmentation stays even while new data is appended continuously. It is also transparent to programmers and data analysts and ready to use. As the following shows, it is simple to get a certain segment of data:
=file("data.btx").cursor@b(;23:100) // Get the 23rd of the 100 segments
Composite table
Composite tables are a file format SPL uses to offer column-wise storage and an indexing mechanism. Column-wise storage has great advantages when only a small number of columns (fields) are involved in a computation. Besides the benefits of column-wise storage and the min-max index, composite tables also support the double increment segmentation technique, which makes it easier to implement parallel processing for better performance.
To generate and use a composite table:
|   | A | B |
|---|---|---|
| 1 | =file("data.ctx") | =file("data.btx") |
| 2 | =A1.create(…) | =A2.append(B1.cursor@b()) |
| 3 | =A1.open() | |
| 4 | =A3.cursor(…;;4:10) | |
| 5 | =A4.fetch(1) | |
Unlike with a bin file, we need to create a composite table before appending data to it (in the above code, A2 creates the composite table and B2 appends data to it). The creation must specify the data structure (the … part in A2), much like a CREATE TABLE statement in a database. A4 is similar to the earlier cursor function that gets one of a number of data segments, but with one more semicolon. A composite table uses the column-wise storage format by default, and the cursor function specifies the desired column names in its parameters (the … part in A4), exploiting columnar storage to reduce the data retrieval volume.
In addition to the high-efficiency binary file formats, SPL boasts many other characteristics that facilitate data processing.
Direct computation of binary files
SPL offers all-around computational capabilities, agile syntax and high-performance algorithms, enabling it to process binary files directly in an efficient and simple way.
For example, SPL handles the commonly seen TOPN calculation and intra-group TOPN calculation in the following way:
|   | A | |
|---|---|---|
| 1 | =file("data.ctx").create().cursor() | |
| 2 | =A1.groups(;top(10,amount)) | Get orders records where amount ranks in top 10 |
| 3 | =A1.groups(area;top(10,amount)) | Get orders records where amount ranks in top 10 in each area |
SPL has rich class libraries, making it easy to perform both regular and more complex computations:
|   | A | B |
|---|---|---|
| 1 | =T("/data/scores.btx") | |
| 2 | =A1.select(CLASS==10) | Filter |
| 3 | =A1.groups(CLASS;min(English),max(Chinese),sum(Math)) | Group & aggregate |
| 4 | =A1.sort(CLASS:-1) | Sort |
| 5 | =T("/data/students.btx").keys(SID) | |
| 6 | =A1.join(STUID,A5,SNAME) | Join |
| 7 | =A6.derive(English+Chinese+Math:TOTLE) | Append a column |
SPL supports the SQL92 standard for file handling. Users who are accustomed to SQL can query both text files and binary files directly with SQL:
$select * from d:/Orders.btx where Client in ('TAS','KBRO','PNS')
It supports more complex queries, such as those containing subqueries, joins and WITH clauses:
$select t.Client, t.s, ct.Name, ct.address from
(select Client ,sum(amount) s from d:/Orders.btx group by Client) t
left join ClientTable ct on t.Client=ct.Client
SPL's agile syntax and support for procedural programming make it well suited to complex computations. To count the largest number of days a stock rose consecutively, for instance, based on the stock record table, SPL needs only the following code:
|   | A |
|---|---|
| 1 | =T("/data/stock.btx") |
| 2 | =A1.group@i(price<price[-1]).max(~.len())-1 |
And to list the time interval between the latest two logins of each user according to the user login table:
|   | A | |
|---|---|---|
| 1 | =T("/data/ulogin.btx") | |
| 2 | =A1.groups(uid;top(2,-logtime)) | Get records of the last two logins |
| 3 | =A2.new(uid,#2(1).logtime-#2(2).logtime:interval) | Calculate the time interval between them |
In addition, SPL supports diverse data sources, covering sources from the commonly used to the rarely used.
Moreover, SPL can perform mixed computations across diverse data sources, which makes it particularly suitable for T+0 queries. Binary files are fit for storing huge volumes of static historical data, while a database stores the small amount of hot data that may be modified in real time. To query the whole data set across file and database, we can use SPL's cross-data-source query ability and data routing functionality, choosing the right data source and the mixed computation as needed. For example, SPL can achieve T+0 queries through mixed queries over the separated cold and hot data while remaining transparent to the upper-level application.
Integration-friendly
In terms of integration, SPL offers standard JDBC and ODBC drivers that can be invoked by other applications. For a Java application, SPL can act as an embedded engine, equipping the application with the ability to compute files directly.
Below is an example of invoking SPL code through JDBC:
…
// load the esProc JDBC driver and connect to the local SPL engine
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
// call an SPL script with two parameters, like calling a stored procedure
CallableStatement st = conn.prepareCall("{call splscript(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result = st.executeQuery();
…
SPL is interpreted, so it naturally supports hot swapping. This benefits Java-based applications significantly: writing, modifying and maintaining SPL data computing logic takes effect in real time without restarting the application, making development and maintenance more convenient.
In summary, with its high-efficiency binary formats and outstanding file processing ability, SPL is an ideal replacement for databases in analytic scenarios that demand high performance.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL