Routine for regular maintenance of multi-zone composite table

 

Background and method

This routine is applicable to the following scenarios: data maintenance has no real-time requirements and can be performed regularly in a specific period (usually in hours or day); the total data is very large and needs to be split and stored in multiple zone tables (hereinafter referred to as ‘table’ unless otherwise specified); two modes, append and update, are supported, and the data amount for maintenance each time may be large, and may be passed in as a cursor.

Methods:

Append mode: the incoming data is required to be ordered by the time field. Maintenance steps: i)split the received data by the time interval of the current table and store them to multiple temporary tables; ii) merge the temporary tables with the corresponding current tables and write to new tables; iii)after merging, let the current tables point to the new tables to provide external query services; iv) put the original current tables into the discard list and delete when the next update starts. This routine supports making up for the missing data, which means that it can correctly insert the early data missed before the passing-in of data into an appropriate table.

When querying, select the table number list that covers the query interval based on the query time parameter and return.

Update mode: the incoming data is required to be ordered by the primary key field of composite table. Maintenance steps: i)write the received data to a temporary table identified by the current time; ii) merge the temporary table with the current table of the corresponding time period and write to a new table. The rest steps are the same as those in append mode.

In update mode, a special processing is needed: use a main table to store very early historical data, and provide a manual call module to merge the table that contains the data before a certain time into the main table.

When querying, return the main table number and the list of all current table numbers.

Definitions and concepts

1. Write the new data to the temporary table as soon as it is received.

2. Write out the data in multi-zone composite table format. The multi-zone composite table consists of multiple tables, with each table containing the data of a time interval.

3. A new table is used every other time interval, with one table corresponding to one time interval. Now, the time interval of table only supports two options: 1 month and 1 day, represented by letters ‘m’ and ‘d’ respectively.

4. Date write-out thread: the external program will call this thread actively, and pass in the data as a cursor. This thread works in a way that writes data to temporary tables first, and then merge the temporary tables with the corresponding current tables. In append mode, it requires the incoming data to be ordered by time field; In update mode, the incoming data is required to be ordered by primary key field.

5. Manual merge thread: in update mode, a manual merge program will be provided to merge the table generated prior to a specified time into the main table.

6. Query thread: in append mode, select the table number that meets the time period requirement based on the time period parameter and return, so as to generate a multi-zone composite table for query. In the case of update mode, return the main table and all table numbers.

7. Rule for defining table number: if the time interval is month, the format of table number is yyMM00k; if the time interval is day, the format is yyMMddk, where k is an alternate bit, whose value is 0 or 1.

8. Alternate bit: if the target table exists, merge the table and the merged data together and write to a new table. The last binary bit of the new and old table numbers alternates between 0 and 1, that is, if the last bit of the original target table number is 0, that of the new table number is 1, and vice versa. Moreover, all the previous bits of new table number have the same value as those of the original target table number.

Configuration file description

Configuration file: ubc.json

By default, this file is located in the main path of esProc (if you want to save it to another directory, modify the code to use an absolute path). The configuration file includes the following contents:

[{ "addOnly":true,
"sortKey":"account,tdate",
"timeField":"tdate",
"deleteField":null,
"otherFields":"ColName1,ColName2",
"dataDir":"data/",
"dataFilename":"data.ctx",
"interval":"d",
"currentZone":0,
"backupZone":1,
"discardZone":[]
}]

“addOnly” refers to whether or not to append data only, and its value is true/false.

“sortKey” refers to the sorting field name if it is in append mode, and generally includes the time field; it refers to the primary key field in update mode, and the data is required to be sorted by primary key field. When there are multiple such fields, they are separated by commas.

“timeField” refers to the name of time field. In append mode, this field must be configurated, while in update mode, this field should be set to null.

“deleteField” refers to the deletion flag field. In append mode, add this field if there is deletion operation, and set this field to null if there is no deletion operation.

“otherFields” refers to the name of other fields. If there are multiple such fields, separate them by commas.

“dataDir” refers to the storage path of composite table relative to the main directory.

“dataFilename” refers to the name of composite table file, such as “data.ctx”.

“interval” refers to the time interval of the table; m represents 1 month, and d represents 1 day.

“currentZone” refers to the number of the main table currently in use, such as 0. In append mode, there is no need to set this number.

“backupZone” refers to the number of backup main table, such as 1. In append mode, there is no need to set this number.

“discardZone” refers to the list of the discarded table numbers, the content of which is automatically generated by code, and is initially set as [].

Storage structure

Files and sub-paths in the main path:

data: refers to the storage path of composite table, and its name is set in ubc.json (refer to the description above).

ubc.json: refers to configuration file.

The files under ‘data’ include:

The form of file name is ‘table number.composite table file name.ctx’, where the composite table file name can be set in ubc.json (refer to the description above), and the table number is automatically calculated by the system based on the time interval level (see section 2 for the rule for calculating table number).

Configuration and storage examples

The data of a certain e-commerce system is as follows:

Number of users: 500,000-1 million; number of records per day: 500,000-1 million rows.

Since the amount of data is small, we set the time interval to 1 month.

The configuration file is as follows;

[{ "sortKey":"uid,etime",
"timeField":"etime",
"otherFields":"eventID,eventType",
"dataDir":"data/",
"dataFilename":"data.ctx",
"interval":"m",
"currentZone":0,
"backupZone":1,
"addOnly":true,
"discardZone":[],
}]

Global variable

zones: refers to the list of table numbers.

current: refers to the current main table number, only available in update mode.

Code analysis

init.splx

This script is executed only once when the server starts and will no longer be executed during calculating. If the server is started for the first time, it needs to initialize parameters.

A B C
1 >config=json(file("ubc.json").read())
2 if(!config.addOnly)
3 =file(config.dataDir/config.dataFilename:config.currentZone)
4 if(!B3.exists())
5 =config.sortKey.split@c().("#"+trim(~)).concat@c()
6 =if(config.deleteField, B3.create@d(${C5},${config.deleteField},${config.otherFields}), B3.create(${C5},${config.otherFields}) )
7 =C6.close()
8 >env(current,config.currentZone)
9 =file("zone.btx")
10 if(A9.exists()) =A9.import@bi()
11 else >B10=[]
12 =env(zones,B10)

A1: read the table number storage file;
A2: if it is in update mode, then:
B4: if the current main table does not exist, then:
C5-C7: create an empty main table;
B8: set the current main table number as global variable;
A9-B11: if the table number file exists, read this file. Otherwise, create an empty sequence;
A12: set the table number sequence as global variable ‘zones’.

write.splx

This script is called and executed by the external program in a way that passes in the cursor data first, and then write the data to temporary tables, and finally merge with the formal tables. If it is in append mode, the incoming cursor data is required to be sorted by time field, while in update mode, the incoming cursor data is required to be sorted by primary key field.

A B C D
1 >config=json(file("ubc.json").read())
2 =config.discardZone.run(movefile(config.dataDir/~/"."/config.dataFilename))
3 >config.discardZone=[]
4 =[]
5 =config.sortKey.split@c().("#"+trim(~)).concat@c()
6 if(!config.addOnly)
7 =(year(now())%100)*10000+month(now())*100+day(now())
8 =file(config.dataDir/config.dataFilename:B7)
9 =if(config.deleteField, B8.create@d(${A5},${config.deleteField},${config.otherFields}), B8.create(${A5},${config.otherFields}) )
10 =B9.append(cs)
11 =B9.close()
12 >A4.insert(0,B7)
13 else for
14 =cs.fetch@0(1).${config.timeField}
15 if(!C14) break B13
16 =func(A40,C14)
17 =(year(C14)%100)*10000+month(C14)*100+day(C14)
18 >data=cs.fetch(;${config.timeField}>=C16)
19 if(!data) break B13
20 =file(config.dataDir/config.dataFilename:[C17])
21 >data=data.sort(${config.sortKey})
22 =C20.create(${A5},${config.otherFields})
23 =C22.append@i(data.cursor())
24 =C22.close()
25 >A4.insert(0,C17)
26 =A4.group(if(config.interval=="m",(~\100)*1000,~*10):zu;~:zd)
27 =[] =[]
28 for A26
29 =zones.select@1(xor(~,A28.zu)<=1)
30 =if(B29,xor(B29,1),A28.zu)
31 =file(config.dataDir/config.dataFilename:(B29|A28.zd))
32 =file(config.dataDir/config.dataFilename:B30)
33 =B31.reset${if(config.addOnly,"","@w")}(B32)
34 >A28.zd.run(movefile(config.dataDir/~/"."/config.dataFilename))
35 >config.discardZone.insert(0,B29)
36 >A27=A27|B29,B27=B27|B30
37 =file("ubc.json").write(json(config))
38 >zones=((zones\A27)|B27).sort()
39 =file("zone.btx").export@b(zones)
40 func
41 =[year(A40),month(A40)-1,day(A40)-1,hour(A40),minute(A40),second(A40)]
42 =if(config.interval=="m",2,3)
43 =B41(B42)=B41(B42)+1
44 >B41.run(if(#>B42,~=0) )
45 =datetime(B41(1),B41(2)+1,B41(3)+1,B41(4),B41(5),B41(6))
46 return B45

A1-A3: read the configuration file and delete the discarded tables;
A4: store temporary table number;
A6: if it is in update mode, then:
B7: calculate the temporary table number based on the current time;
B8-B11: write the cursor data to temporary table;
B12: record the temporary table;
A13: if it is in append mode, then:
B13: loop;
C14: fetch the time of the first record;
C15: end the loop if no data can be read;
C16: calculate the end time of the table to which the first record belongs based on the time of the first record;
C17: calculate the temporary table number based on the time of the first record;
C18: select the records smaller than C16 from the cursor and store them in the data variable;
C20-C24: sort the data by sortKey and write them to the temporary table with C17 as table number;
C25: temporarily store C17 into A4 sequence;
A26: group the temporary table numbers by the formal table number to which they belong;
A28: loop by group;
B29: read the table number to which the current group belongs from the formal table number list;
B30: if B29 exists, use the alternate tail number as new table number;
B31-B33: merge the temporary tables with the original target table (if it exists) and write to the new alternate table number;
B34: delete the temporary tables;
B35: write the original table number B29 (if it exists) to the discard table number list;
B36: record the original table number and new table number;
A38: delete the original table number from the table number list and add a new table number;
A40: calculate the end time of the table interval to which the pass-in time belongs;

manualMerge

This program is manually called to merge the tables into the main table. This script is only applicable to update mode.

Input parameter:

t: merge the tables generated before the current time into main table.

A B
1 >config=json(file("ubc.json").read())
2 if t!=null
3 =(year(t)%100)*100000+month(t)*1000+day(t)*10
4 =zones.select(~<B3)
5 else >B4=zones
6 =file(config.dataDir/config.dataFilename:(config.currentZone|B4))
7 =file(config.dataDir/config.dataFilename:config.backupZone)
8 =A6.reset@w(A7)
9 >config=json(file("ubc.json").read())
10 >config.discardZone.insert(0,(config.currentZone|B4))
11 >config.currentZone=config.backupZone,config.backupZone=current
12 >current=config.currentZone
13 =file("ubc.json").write(json(config))
14 >zones=zones\B4
15 =file("zone.btx").export@b(zones)

A2-B5: if ‘t’ is not null, select the numbers of the tables maintained before the time written in ‘t’ parameter. Otherwise, select all table numbers;
A6-A8: merge the current main table with the selected tables, and then write to the backup main table;
A10: write the current main table number and the selected table numbers to the discard list;
A11-A12: exchange the current main table number and the backup table number;
A14-A15: update the list of table numbers and back it up to a file;

query.splx

This script is used for data query and will return the number of the tables of multi-zone composite table;

Input parameter for append mode:
start: start time
end: end time

No input parameters for update mode.

A B
1 if ifv(current) return current|zones
2 >sz=((year(start)%100)*10000+month(start)*100+day(start))*10+1
3 >ez=((year(end)%100)*10000+month(end)*100+day(end))*10+1
4 >sp = ifn(zones.pselect@z( ~<=sz), 1 )
5 >ep = ifn(zones.pselect( ~>ez), zones.len()+1)
6 =if (sp >= ep,null,zones.to( sp, ep-1))
7 return A6

A1: if ‘current’ exists, it means update mode, and will return all table numbers;
A2-A3: calculate the table number to which the input parameter start\end belongs. Considering the alternate bit, setting the last bit to 1 makes it easy to write query code;
A4: find out the position of the first table number that is earlier than or equal to the start time from back to front;
A5: find out the position of the first table number that is later than the end time from front to back;
A6: select the table number between the start time and the end time (including the start time, excluding the end time); return null if no table number is found.

Notes:

1. The multi-zone composite table must be closed after use. Otherwise, its tables cannot be deleted after the merge operation;

2. In update mode, when generating a multi-zone composite table cursor, the @w option must be added, i.e., cursor@w(), indicating that the update mechanism is adopted to merge tables.

Download the routine code: code.zip