How to use SPL storage for beginners


To implement high-performance computing in SPL, data usually needs to be converted and stored in SPL's own formats. Programmers who are new to SPL are often unfamiliar with this process. This article helps beginners get started quickly and perform common data conversion and storage tasks.

This article offers some code samples suitable for general scenarios. If you want to learn about special cases or more details, visit http://doc.raqsoft.com/esproc/func/, http://doc.raqsoft.com/esproc/, etc.

File storage of SPL

SPL stores data in files, and there are no database-related concepts such as data tables. For beginners, you can simply regard a file as a data table. As your understanding of SPL deepens, you will be able to design more flexible storage schemes.

SPL adopts different storage methods for large tables and small tables, so we first need to distinguish between the two. A small table stores a small amount of data that can be fully held in memory, such as a department table, area table or employee table. A large table, on the contrary, stores a large amount of data that cannot be held in memory, such as an order table, transaction detail table or log table; such tables generally store a lot of historical data and continue to produce new data over time.

Bin file and composite table

SPL has two file formats, the bin file and the composite table, with the suffixes btx and ctx respectively.

The bin file adopts a simple binary format. You cannot set a primary key on a bin file; you can append data to it, but you cannot insert, modify or delete data.

In contrast, the composite table has better performance and is SPL's way of storing big data. A composite table may or may not have a primary key. You can always append data to it; when a primary key has been set, you can also insert, modify and delete data in small amounts, but you cannot update a large amount of data.

These two file formats are mainly used for storing historical cold data that hardly changes anymore; they are not suitable for storing hot data that still changes frequently. SPL has not yet released OLTP functionality, so HTAP requirements can be implemented in an architecture that is able to compute over hot and cold data in a mixed manner. If you are interested, refer to the information at https://c.scudata.com/.

In practice, the bin file is usually used for storing small tables, while the composite table is used for storing large tables. The main reason for this arrangement is that large tables have a significant impact on performance, and storing them as composite tables helps improve the system's overall performance. Moreover, a composite table has index blocks at its header: at least one block is required even if there is only one row of data, and more blocks are required if columnar storage is adopted. This overhead can be ignored for big data, but it is proportionally significant for small data. The bin file is simpler to create and use, so it is convenient for storing small tables, and the lower storage efficiency caused by index blocks does not occur.

The code for reading a small table from the database and writing it to a bin file is roughly as follows:

A
1 =connect("demo")
2 =A1.query("SELECT AREAID,AREANAME FROM AREA ORDER BY AREAID")
3 =file("area.btx").export@b(A2)
4 =T("area.btx")
5 >A1.close()

A1: connect to the database; A2: fetch the data of the area table; A3: write the data to the bin file; A4: read the data back from the bin file for inspection.

As can be seen from the code, there is no need to specify the data structure in advance when creating a bin file; SPL automatically uses the data structure of the dataset being written.

There are more considerations to take into account when dumping large database tables to composite tables, which will be explained in the following sections.

Row-based storage and columnar storage

An SPL composite table supports two data storage methods: row-based storage and columnar storage. Which method to use must be decided when the composite table is created, and the decision depends primarily on how the data will be calculated: by traversal or by search. Traversal means reading a large amount of data (even all data) from a large table to calculate, such as grouping and aggregating many years of order data. Search means finding a few records in a large table, such as looking up one order in the order table or one transaction in the transaction table. Columnar storage is generally used for traversal calculations, and its advantage is especially obvious when a table has many fields but only a few are involved in the calculation. Conversely, row-based storage is more advantageous for search calculations.

For scenarios requiring high performance in both traversal and search, the data can be stored in two copies: a columnar copy used for traversal and a row-based copy used for search (a sketch of this scheme is given at the end of this section). However, this coexistence scheme stores the data redundantly twice, so it occupies relatively more hard disk space.

The code for storing a database table as a columnar composite table is roughly like this:

A
1 =connect("demo")
2 =A1.cursor("select ORDERID,CUSTOMERID,EMPLOYEEID,AREAID,AMOUNT,ORDERDATE from ORDERS")
3 =file("orders.ctx").create@y(ORDERID,CUSTOMERID,EMPLOYEEID,AREAID,AMOUNT,ORDERDATE)
4 =A3.append(A2)
5 >A1.close(),A3.close()
6 =file("orders.ctx").open().cursor().fetch(100)

A2: use a database cursor to fetch data in batches. The reason for fetching in batches is that a large table contains so much data that reading it all at once would cause an out-of-memory error.

A3: create the composite table. The option @y means that if there is a file with the same name, it will be overwritten directly. Because no primary key is set, data can only be appended to this composite table, and cannot be inserted, modified or deleted.

A4: append the data of the A2 cursor to the composite table; note that the field names and their order must match exactly.

A5: close database connection and composite table.

A6: open the composite table, create a cursor, and fetch 100 records to check the result.

It can be seen from the code that the data structure must be declared before data is written to a composite table; the composite table is thus more complex to use than the bin file.

If you want to store a table as a row-based composite table, the only change needed is to add the option @r to the create function in A3, roughly as follows:

=file("orders.ctx").create@ry(ORDERID,CLIENT,SELLERID,AMOUNT,ORDERDATE)

Ordered storage

Ordered storage is very useful for improving the performance of both traversal and search calculations. There are usually two ordered storage methods: ordered by time and ordered by account.

For the former, binary search can be used to improve performance when filtering on time, and ordering by time is also effective for speeding up grouping and aggregating by date. The code for storing the order data into a composite table ordered by order date is roughly as follows:

A
1 =connect("demo")
2 =A1.cursor("select ORDERDATE,ORDERID,CUSTOMERID,EMPLOYEEID,AREAID,AMOUNT from ORDERS order by ORDERDATE")
3 =file("orders.ctx").create@y(#ORDERDATE,ORDERID,CUSTOMERID, EMPLOYEEID,AREAID,AMOUNT)
4 =A3.append(A2)
5 >A1.close(),A3.close()

A2: when the data are fetched, they are sorted by ORDERDATE.

A3: create the composite table. The symbol # indicates that the composite table is ordered by ORDERDATE; the ordered field must be placed first.

A4: append data to the composite table by the order of order dates.
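
As a rough usage sketch of such a traversal, grouping and aggregating the date-ordered composite table might look like this (the sum aggregation and the TOTAL alias are just for illustration):

A
1 =file("orders.ctx").open()
2 =A1.cursor(ORDERDATE,AMOUNT)
3 =A2.groups(ORDERDATE;sum(AMOUNT):TOTAL)
4 >A1.close()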

The bin file can also be stored in order:

A
3 =file("orders.btx").export@b(A2)

In this code, the order data is written to the bin file ordered by date. When filtering on date, binary search can then be used.

For data ordered by time, new data added on a regular basis usually has later time values than the existing data, so we can first sort the new data by time and then append it directly to the end of the existing data. The code for appending new data is shown below:

A B
1 =file("orders.ctx").open() =connect("demo")
2 =B1.cursor("select ORDERDATE,ORDERID,CUSTOMERID,EMPLOYEEID,AREAID,AMOUNT from ORDERS where ORDERDATE=?",date(now()))
3 =A1.append(A2) >A1.close(),B1.close()

A2: fetch the new data of the day from database.

A3: append new data directly to the end of composite table.

This approach applies to the bin file as well; the only change needed is to replace export@b with export@ab, where @a means appending data to the end of the bin file.
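
For the bin file version of this appending scenario, the write line would then look roughly like this:

A
3 =file("orders.btx").export@ab(A2)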

Ordering by account is also very common. The account is generally the primary key or part of the primary key, so ordering by account is essentially ordering by the primary key or part of it. Ordering by account greatly speeds up searching for all transaction data of a specified account, grouping and aggregating (or de-duplicating) by account, and complex per-account calculations (such as user behavior analysis).

Since the bin file is rarely used in this scenario, we only take the composite table as an example here. The following code is obtained by modifying the earlier code so that the new composite table is ordered by CUSTOMERID.

A
1 =connect("demo")
2 =A1.cursor("select EMPLOYEEID,ORDERDATE,ORDERID,CUSTOMERID, AREAID,AMOUNT from ORDERS order by EMPLOYEEID")
3 =file("orders.ctx").create@y(#EMPLOYEEID,ORDERDATE,ORDERID,CUSTOMERID, AREAID,AMOUNT)
4 =A3.append(A2)
5 >A1.close(),A3.close()

A2: change the sorting to the customer field and put it first.

A3: when creating the composite table, the customer field must also be placed first and prefixed with the symbol #, indicating that the data is ordered by it.

When the data is ordered by account, new data cannot be appended directly to the end of the existing data, because the new data contains the same account values as the existing data and direct appending would destroy the order by account. On the other hand, re-sorting all the data and regenerating the composite table every time new data arrives would take too long.

The composite table of SPL provides a way to avoid re-sorting all data each time new data is appended: the full re-sort only needs to be done once after new data has been accumulated over multiple updates, which reduces the time of each update. Moreover, the ordered merge algorithm can be used for the sorting, which is much more efficient than a regular full sort. Visit: Ordered storage of SPL for details.

For example, for a composite table ordered by CUSTOMERID, the code for new data appended every day is roughly like this:

A B
1 =file("orders.ctx") =connect("demo")
2 =B1.cursor("select EMPLOYEEID,ORDERDATE,ORDERID,CUSTOMERID, AREAID,AMOUNT from ORDERS where ORDERDATE=? order by EMPLOYEEID",date(now()))
3 if (day(now())==1 >A1.reset(;file("new_orders.ctx").open().cursor())
4 =file("new_orders.ctx").create@y(#EMPLOYEEID,ORDERDATE,ORDERID,CUSTOMERID, AREAID,AMOUNT)
5 =file("new_orders.ctx").reset(;A2)

A2: fetch new data of the day from database.

A3: judge whether it is the first day of the month. If it is not, only the daily routine in A5 runs: the new data of the day is merged, in order, into a separate small file, new_orders.ctx.

If it is the first day, the monthly reset in B3 is performed first: the new data accumulated in the separate small file is merged, in order, into the existing composite table to generate a new composite table, and then the small file is emptied in B4.

Before reading the data, the historical composite table needs to be merged with the incremental composite table, for example:

=[file("orders.ctx").open().cursor(),file("new_orders.ctx").open().cursor()].merge(#1)

It should be noted that the cursor sequence must be in the same order as the data files, that is, [historical data, incremental data].
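
A sketch of using the merged result, assuming it is then consumed like an ordinary cursor (the aggregation and the TOTAL alias are illustrative):

A
1 =[file("orders.ctx").open().cursor(),file("new_orders.ctx").open().cursor()].merge(#1)
2 =A1.groups(CUSTOMERID;sum(AMOUNT):TOTAL)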

Storage segmentation

When performing multi-threaded parallel computing on ordered data, segmentation must be considered (each thread processes one segment). For example, the transaction detail data mentioned above is stored ordered by account, which helps perform complex per-account calculations. However, if such calculations are to run in parallel, records of the same account must not be divided into different segments, otherwise the result will be wrong.

In this case, when creating the composite table, you need to declare that records with the same account value must be kept within one segment when the data is segmented. For the transaction detail table, the code for creating the composite table should be as follows.

A
=file("detail.ctx").create@py(#ACCOUNTID,…)

In this code, create@p indicates that records with the same value of the first field, ACCOUNTID, will not be divided into different segments when the table is segmented for subsequent parallel computing. Currently SPL only provides this segmentation mechanism for the first field, but practice shows that this fully meets the needs. Since the bin file does not require the data structure to be declared in advance, it does not provide this mechanism.
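
As a hedged sketch of how such a table might then be traversed in parallel: the multicursor option @m and the AMOUNT field below are illustrative assumptions and are not part of the create@p declaration itself.

A
1 //sketch only: @m and the AMOUNT field are illustrative
2 =file("detail.ctx").open()
3 =A2.cursor@m(ACCOUNTID,AMOUNT)
4 =A3.groups(ACCOUNTID;sum(AMOUNT):TOTAL)
5 >A2.close()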

Another typical segmentation scenario is a primary table and sub-table with a one-to-many relationship. After both are stored ordered by primary key, parallel ordered merge calculations are often performed on them. For the primary table, the primary key is unique, so two records with the same primary key can never be divided into two segments. However, the association fields of the sub-table are not its whole primary key and will almost certainly contain duplicate values, so records with the same association value may fall into two natural segments, which would lead to association errors. To solve this problem, the sub composite table must be created with create@p to ensure this error never occurs.

Data type conversion

When converting and storing the data, it may be necessary to change the type of the original data so as to obtain a smaller storage size and better computing performance.

1. Numerical value

Values in numerical fields should be converted to integers whenever possible, preferably small integers less than 65536. For example, there are only a few thousand EMPLOYEEID values in the order table, but they are artificially set to large numbers like 100001, 100002. For better performance, you can convert them to their sequence numbers in the employee table: suppose 100001 is the first record in the employee table, convert it to 1, and so on. As another example, when reading data from Oracle via JDBC, all numerical values (including integers) become big decimals by default, which seriously affects computing performance; they need to be converted back to appropriate data types.

2. Date

A date can be converted to the number of days since a certain day, which does not affect comparisons. A small integer can represent a range of more than 60,000 days, that is, a period of over 100 years, which satisfies most application scenarios. SPL provides another method: convert the year and month of a date to the number of months since 1970, and represent the day with 5 binary bits (a month has at most 31 days, and a 5-bit binary number can represent any value from 0 to 31), which is equivalent to ((yyyy-1970)*12+(mm-1))*32+dd. In this way a small integer can represent any date between 1970 and 2140, which basically meets the requirements.
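For example, under this scheme May 15, 2024 is encoded as ((2024-1970)*12+(5-1))*32+15 = 652*32+15 = 20879, which still fits comfortably in a small integer.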

3. Enumeration string

Some string fields, such as gender, education background, and country or area abbreviations, are actually codes with a very small range of values. Such strings are called enumeration strings. An enumeration string should also be converted to its sequence number in the corresponding code table. For example, if the AREAID value in the area field of the order table is an area abbreviation, you can convert it to the corresponding sequence number in the area table. If no code table exists for an enumeration string, you can collect all its distinct codes, create a new code table, and then perform the conversion.

The code for type conversion of order data is roughly as follows:

A
1 =connect("demo")
2 =T("area.btx").keys@i(AREAID)
3 =T("employee.btx").keys@i(EMPLOYEEID)
4 =A1.cursor("select ORDERID,CUSTOMERID,EMPLOYEEID,AREAID,AMOUNT,ORDERDATE from ORDERS order by ORDERID")
5 =A4.new(ORDERID,int(CUSTOMERID):CUSTOMERID,A3.pfind(EMPLOYEEID):EMPLOYEENO,A2.pfind(AREAID):AREANO,AMOUNT,days@o(ORDERDATE):ORDERDATE)
6 =file("orders.ctx").create@y(#ORDERID,CUSTOMERNO,EMPLOYEENO,AREAID,AMOUNT,ORDERDATE)
7 =A6.append(A5)
8 >A1.close(),A6.close()

A2, A3: read the area and employee tables pre-stored as bin files and create their primary keys.

A4: if it is Oracle database, you can write cursor@d() here to convert big decimal to double precision type.

A5: forcibly convert the customer number to an integer; convert the employee number and area abbreviation to their sequence numbers in the employee table and area table; the function days@o uses the method described above to convert the date to a small integer. If the year, month and day are needed in later date calculations, use the expressions year(ORDERDATE), month(ORDERDATE) and day(ORDERDATE) to get them.

Index

For search calculations, in addition to using a row-based composite table, an index should be created on the to-be-searched (TBS) field. For example, if you create an index on the order number of the row-based order table, performance will improve significantly when searching for an order number. The code is roughly as follows:

A
1 =file("orders.ctx").open()
2 =A1.index(index_orderid;ORDERID)

A2: create an index on the order number field; the index file is named orders.ctx_index_orderid.

However, when a conditional search retrieves many records, searching via the index is not necessarily effective. For example, using a customer number index to search the order table for all transaction details of a certain customer does not perform well. The reason is that the order table may be ordered by order number and is therefore disordered with respect to customer number, so the data of one customer is scattered all over the large table. Searching for it via the index causes a lot of discontinuous hard disk reads, and the performance is very unsatisfactory.

To address this problem, store the data ordered by the TBS field and then use the index. For example, first sort and store the order data by customer number, then create an index on the customer number. When querying, the index is used to quickly locate the specified customer number, and then all order data of that customer is read continuously from the hard disk. In this way performance improves significantly, and it is sufficient to cope with high-concurrency queries on massive data.
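
A rough sketch of this combination, reusing the order fields from the earlier examples (the file name orders_cust.ctx and the index name index_customerid are hypothetical): the data is stored row-based and ordered by customer number, and an index is then built on that field.

A
1 //sketch only: orders_cust.ctx and index_customerid are hypothetical names
2 =connect("demo")
3 =A2.cursor("select CUSTOMERID,ORDERID,EMPLOYEEID,AREAID,AMOUNT,ORDERDATE from ORDERS order by CUSTOMERID")
4 =file("orders_cust.ctx").create@ry(#CUSTOMERID,ORDERID,EMPLOYEEID,AREAID,AMOUNT,ORDERDATE)
5 =A4.append(A3)
6 =A4.index(index_customerid;CUSTOMERID)
7 >A2.close(),A4.close()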

For the aforementioned scenarios that require high performance in both traversal and search, SPL provides an "index with values" mechanism, which also copies the values of other fields when creating the index. The original composite table remains columnar and is used for traversal, while the index itself stores the field values in row-based form, so the original table generally no longer needs to be accessed during a search, which yields better performance. The code for creating an index-with-values for the order table is roughly as follows:

A
1 =file("orders.ctx").open()
2 =A1.index(INDEX_ORDERID_1;ORDERID;CUSTOMERID,AMOUNT)

A2: create an index-with-values for order numbers of order table. The index contains the customer number and amount.

Both the index-with-values and the coexistence scheme mentioned above take traversal and search performance into account. Moreover, since the index-with-values mechanism combines row-based storage with the index, it occupies less space than the coexistence scheme.

Multi-file storage

Usually, data with the same business logic and structure is stored in one data table, which corresponds to one file in SPL storage. However, this becomes inconvenient when the file holds a particularly large amount of data, because both query calculations and data maintenance may then have to touch data that is known to be irrelevant, resulting in a severe performance decrease.

In fact, when using a database, experienced programmers often split databases and tables: they split a huge data table into several smaller tables (or even across multiple databases) to facilitate processing. SPL, which uses file storage, can adopt the same scheme, that is, one logical data table corresponds to multiple physical files.

A relatively common way is to split by time. For example, when the account transaction detail table is very large, it is inconvenient to maintain as a single file: new data is continuously appended to the detail table over time, and expired old data needs to be deleted regularly, yet SPL does not support deleting large amounts of old data, so the composite table would have to be rewritten. The new data must also be kept in order by account; although the composite table can avoid a full re-sort at every update, it still requires merging with all the old data during the regular reorganization of accumulated updates. These operations are time-consuming and make maintenance take very long.

Moreover, account analysis is often required on the account transaction detail table, and storing it in one file hurts performance here as well. Account analysis generally means filtering out the detail data for a period of time and then performing complex calculations on the filtered data of each account. If a single large file is ordered by time, the filtering is fast but the complex per-account calculation is difficult; conversely, if it is ordered by account, fast filtering by time is difficult.

To resolve this dilemma, you can split the data into multiple files by time so that each file stores only a small period of data (for example, one month). Maintenance then becomes relatively simple: new data only needs to be merged with the last file, and handling expired data is just deleting the earliest file. Computing performance for account analysis also improves, because the split files are ordered by date as a whole, while within each file the data can be stored ordered by account. This is equivalent to the data being ordered by both date and account overall, so you can quickly filter by time and still perform complex per-account calculations easily.
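
Using the order table from the earlier examples as a stand-in for the detail table, a minimal sketch of such splitting might look like this; the file names, date boundaries and field list are illustrative, and a real script would loop over the months rather than writing each one out.

A
1 //sketch only: file names, date boundaries and field list are illustrative
2 =connect("demo")
3 =A2.cursor("select CUSTOMERID,ORDERID,AMOUNT,ORDERDATE from ORDERS where ORDERDATE>=? and ORDERDATE<? order by CUSTOMERID",date("2024-01-01"),date("2024-02-01"))
4 =file("orders_202401.ctx").create@py(#CUSTOMERID,ORDERID,AMOUNT,ORDERDATE)
5 =A4.append(A3)
6 =A2.cursor("select CUSTOMERID,ORDERID,AMOUNT,ORDERDATE from ORDERS where ORDERDATE>=? and ORDERDATE<? order by CUSTOMERID",date("2024-02-01"),date("2024-03-01"))
7 =file("orders_202402.ctx").create@py(#CUSTOMERID,ORDERID,AMOUNT,ORDERDATE)
8 =A7.append(A6)
9 >A2.close(),A4.close(),A7.close()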

SPL provides a mechanism that logically unifies these split files into one table. As long as the programmer declares the names of the split files and the time and account fields, the small files can be used as one large table, and SPL automatically performs the operations described above, such as filtering by time and ordered merging. For a more detailed introduction to the principle and usage, visit: SPL Pseudo Table Bi-dimension Ordering Mechanism.

Besides splitting by time, there are other ways, such as splitting a large table by area. Splitting by area also makes each data unit smaller and easier to maintain, and it can improve computing performance, for example when certain calculations always focus on one area.

SPL stores data in files, so a large table can be split using the multi-layer tree structure of the file system. When necessary, it is acceptable to split big data into hundreds or even thousands of files, as long as they are organized into folders. In contrast, the linear way a database manages its data tables makes it inconvenient to split a large table into that many smaller tables.

Of course, it is not always better to split a data table into as many small files as possible; splitting into too many files may reduce performance. Although SPL can logically unify the split files into one table, they are still multiple files physically. Mapping the small files onto one table usually requires merging, which consumes extra resources, and the more files there are, the more time the merging takes. Therefore, the number of split files should be moderate and determined according to actual needs.

In addition to the storage mechanisms described above, if you use SPL for multi-machine cluster computing, you also need to consider issues such as distributed data storage, which are not discussed here. For more information, visit: Performance Optimization - Preface.