SPL Programming - 11.1 [Big data] Big data and cursor


We’ll talk about how to deal with big data in this chapter.

The so-called big data here refers to data that cannot fit in memory. It is generally stored on the hard disk, either as files or in a database (which is also on the hard disk). This book does not involve databases, so we only deal with big data in files.

Excel files are generally not too large (the format has a maximum number of rows). Large files are usually in the txt or csv format mentioned before. Given the memory of modern computers, data has to reach tens of millions or even hundreds of millions of lines (that is, records) to count as big, and such a file will be several GB, dozens of GB, or more.

In fact, we rarely encounter such a large amount of data in daily desktop work, so the methods introduced before are sufficient in most scenarios. However, to make this book complete, we still introduce the methods for processing big data.

Is there any difference between big data and the data we processed earlier? Can’t the methods we learned earlier be used?

The processing methods described in the previous chapters are all for data in memory. The file data used is also processed after being read into memory.

This is because the CPU can only directly process data in memory; it cannot directly process data on the hard disk or any other non-memory storage.

Could we do this: give every piece of data on the hard disk an address, temporarily read it into memory when it is needed, throw it away when it has been used, and write it back to the hard disk if it has been changed? In this way it would look as if we were directly operating on a very large piece of memory.

In theory, it is possible, and there are indeed programs in some scenarios that do this (Windows, for example, maintains a system cache that works this way). But it is hard to do this for structured data operations.

Is that because there is some essential difference between memory and hard disk?

Yes. In IT terms, memory allows random access in small pieces; that is, given the address of any piece of memory, the data at that address can be retrieved directly. For example, when using a sequence, we can access any member by its sequence number.

The hard disk can’t do that.

The hard disk is better suited to continuous, batch access. It can only read and write whole blocks, at least one minimum unit at a time (the exact size depends on the operating system; on most computers today it is 4096 bytes, where bytes are the unit used to measure the space occupied by data; it does not matter if you do not fully understand this, and it does not affect the discussion). So even to read a single integer (only 4 or 8 bytes, or less if compressed), a whole block has to be read. If a large number of small pieces of data scattered at various addresses are read (many places are accessed, each with only a small piece of data, which is actually a very common situation), the amount of data actually read becomes very large and the performance is poor. Old mechanical hard disks cannot support high-frequency random access at all (even when reading whole blocks), and their performance degrades very seriously.

Because of these troubles, no one simulates the hard disk as memory for structured data operations. What is more troublesome, the operating system does no special work for structured data: the only way to access data on the hard disk is to read and write files, and operating on files is itself a complex matter, which further reduces performance.

What can we do?

We cannot directly access the structured data on the hard disk with the previous methods. Instead, we usually use the cursor technique to carry out the operations.

When we open a data file, the cursor works like a pointer. At first it points to the first record of the file; then a batch of records can be read into memory for processing, and as they are read the cursor moves forward. After the batch in memory is processed and discarded, the next batch is read from the current position of the cursor and processed, and so on, until all the data has been read into memory and processed.

Because the cursor reads a batch of records each time, the read usually covers multiple blocks of the hard disk. This is exactly batch continuous access, so the weakness of the hard disk is avoided.

However, due to the structure of most data files such as txt and csv, the cursor can usually only move forward, not backward (the cost of moving backward is very high). Even moving forward, it cannot jump and can only traverse row by row. In other words, accessing data with a cursor loses randomness.

Therefore, many of the operation methods mentioned before cannot be used, and we need a set of functions designed specifically for the cursor.

Let’s talk about cursor related functions.

In order to do the experiment, we first generate a large file, a simplified order table.

1 [East,West,North,South,Center]
2 for 0,999 =10000.new(A2*10000+~:id,A1(rand(5)+1):area,date(now())-1000+A2:dt,1+rand(10000):amount,1+rand(100):quantity)
3 =B2.select(rand()>=0.01)
4 >file("data.txt").export@at(B3)

The export() function writes a table sequence to the file; the @a option means appending when writing. Thus, the file will be appended to 1000 times, with slightly fewer than 10000 records written each time (a part is discarded randomly in B3), for a total of slightly fewer than 10 million records. @t means that the field names are written into the first line of the file as the title, the same as @t in xlsimport().

The id field can act as the primary key, though a primary key cannot really be established in file data. dt is the date. The records generated in each round of the loop all have the same day, and the day increases with the loop variable A2, which guarantees that the dates are ordered (this will be used later); the time part of now() is discarded by date(). amount and quantity are generated randomly.

Note that any existing data.txt should be deleted before execution; otherwise the new batch of data will be appended after the current content of the file. In addition, it is better to use the absolute path of the file in actual use; otherwise you may not find where the file is written, because the current path is uncertain.
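For example, B4 could be written with an absolute path (the directory used here is only an illustration; replace it with one that exists on your machine):

4 >file("D:/data/data.txt").export@at(B3)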

Let’s first count the total number of records in this file.

1 =0 =file("data.txt").cursor@t()
2 for =B1.fetch(10000)
3 if B2==null break
4 >A1+=B2.len()

The cursor() function opens the file and creates a cursor (also a complex object); @t indicates that the first line of the file is the title. The fetch() function of the cursor reads out a table sequence composed of the specified number of records (with the title as the field names) and moves the cursor forward, so the next fetch() returns the next batch of records. If the end of the file is reached while reading, all the remaining records are read out, so the number of records actually read may be smaller than requested. When there is nothing left to read (the end of the file has been passed), null is returned.
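As a small sketch of how fetch() moves the cursor (using the data.txt generated above; the batch size 100 is only illustrative):

1 =file("data.txt").cursor@t()
2 =A1.fetch(100)
3 =A1.fetch(100)

A2 gets a table sequence of the first 100 records; A3 gets the next 100, because the cursor moved forward after the first fetch(). If fetch() keeps being called like this until the end of the file, the last call may return fewer than 100 records, and every call after that returns null.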

With the above knowledge, this code is easy to understand. It is an endless loop that reads 10000 records each time and jumps out after the reading is finished. In the process, the numbers of records read each time are accumulated into A1 to get the result.

On this basis, let’s calculate the average order amount:

1 =0 =file("data.txt").cursor@t() =0
2 for =B1.fetch(10000)
3 if B2==null break
4 >A1+=B2.len()
5 >C1+=B2.sum(amount)
6 =C1/A1

The sum can also be calculated in the same loop.

To calculate the sum, we only need to read the amount field; and to count the number of records, reading any single field is enough. It is not necessary to read all the fields, and reading less makes it faster (as can be seen from the previous discussion, the amount of data read from the hard disk is almost the same, but reading fewer fields makes parsing the file contents easier):

1 =0 =file("data.txt").cursor@t(amount) =0
2 for =B1.fetch(10000)
3 if B2==null break
4 >A1+=B2.len()
5 >C1+=B2.sum(amount)
6 =C1/A1

We can write the fields to be read as parameters of the cursor() function, so that the fetched table sequence has only those fields and the unnecessary fields do not have to be read in. This is an important difference between external storage computing and in-memory computing; we never had to pay attention to this problem in in-memory computing.
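Multiple fields can be listed in the parameters, separated by commas. For example (a sketch based on the same data.txt):

1 =file("data.txt").cursor@t(id,amount)
2 =A1.fetch(100)

The table sequence in A2 then has only the id and amount fields; the other fields in the file are not read in.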

The structure of this endless loop is a fixed pattern, which is a little troublesome to write every time. Worse, the break condition in B3 may be forgotten or written wrong, which turns the code into a real endless loop. SPL actually supports looping directly over a cursor:

1 =0 =file("data.txt").cursor@t(amount) =0
2 for B1,10000 >A1+=A2.len()
3 >C1+=A2.sum(amount)
4 =C1/A1

The for statement can loop over a cursor; the integer after it is the number of records read out in each round, and the loop variable is the table sequence that has been read in. When no more data can be read, the loop stops naturally, so there is no need to write fetch() or the end condition, and the code is simpler.

Using the loop, the data in the cursor (actually the data in the file used to create the cursor; we will often use this saying later) is read through in turn. This process is called the traversal of the cursor. Recalling the rules above: a cursor can only move forward and can only be traversed once. After all the data in the cursor has been read, the traversal ends and the cursor becomes useless; it can no longer read data. To read the data from the beginning again, a new cursor has to be created.
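A small sketch of this rule (fetch() without a parameter reads all the remaining records):

1 =file("data.txt").cursor@t(amount)
2 =A1.fetch()
3 =A1.fetch(100)
4 =file("data.txt").cursor@t(amount)

A2 reads out all the records, ending the traversal; after that, A3 can only return null. To read the data again, A4 has to create a brand-new cursor on the same file.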
