"Using SPL for performance optimization, such as converting data types like string to integer dur .."

mars RaqForum 25 No.
1 Reply • 398 View • 1 Years ago

SPL Practice: integerization during data dump

Using SPL for performance optimization, such as converting data types like string to integer during data dump, can reduce storage space and improve computing performance. This article will present how to implement integerization through a practical example.

Problem description

The following table is the data structure of a certain space-time collision problem:

Field name	Field type	Field meaning	Remarks	Sample data
no	String	Object flag	Unique flag of object	100000000009
ct	Int	Timestamp	Unix timestamp (seconds)	1690819200
lac	String	Space flag 1		40000
ci	String	Space flag 2		66000000

After understanding the business, we know that all the values of ‘no’ are composed of numbers and can be directly converted to Long integer; ‘lac’ and ‘ci’ always appear in pairs and can be merged as one field called ‘loc’, and the count range of ‘loc’ after de-duplication is approximately 270,000, making it possible to use ‘Int’ integer.

Since the field ‘no’ only involves comparison in subsequent calculations and does not participate in positioning calculation, it does not need to be sequence-numberized, and we just need to convert its value to integer. On the contrary, the field ‘loc’ will involve positioning calculation in subsequent calculations, so it needs to be sequence-numberized, that is, converting its values to natural numbers starting from 1.

Data integerization and dump

The data structure after dumping is as follows:

Field name	Data type	Field meaning	Remarks	Sample data
no	Long	Object flag	Unique flag of object	100000000009
ct	Int	Timestamp	Unix timestamp (seconds)	1690819200
loc	Int	Space flag	Sequence number of space flag	10282

Compared with the original data structure, the following two changes are made during data dump:

1. merge the ‘lac’ and ‘ci’ fields as one ‘loc’ field and convert their values to Int sequence number. The original ‘lac’ and ‘ci’ fields, as dimension table, are stored separately;

2. convert the data type (number string) of the ‘no’ field to Long integer.

SPL code:

	A
1	=connect@l("oracle12c")
2	=loc_list=A1.query("SELECT DISTINCT lac,ci FROM testdata").keys@i(lac,ci)
3	=file("loc_list.btx").export@b(A2)
4	=A1.cursor@x("SELECT no,ct,lac,ci FROM testdata").new(long(no):no,int(ct):ct,loc_list.pfind(lac,ci):loc)
5	=file("tmp.btx").export@b(A4)
6	=file("tmp.btx").cursor@b().sortx(no,ct)
7	=file("1.day.ctx").create@y(#no,#ct,loc)
8	>A7.append@i(A6)
9	=movefile(file("tmp.btx"))

A1: connect to the database; the @l option represents that the returned field name and table name are in lowercase;

A2: fetch the de-duplicated space flags from the database and create an index;

A3: save the dimension table information with space flags to a bin file;

A4: while fetching data from the database, convert the values of ‘no’ to long integer, and utilize the index to find out the sequence number corresponding to ‘loc’ and ‘ci’ from loc_list;

A5 - A9: save the ordered results of ‘no’ and ‘ct’ to a composite table.

Restore through sequence number

Since the dumped fact table loses the information of original ‘lac’ and ‘ci’ fields, we need to restore such information by utilizing the above-mentioned bin file saved when dumping. For example, the following code is to restore the ‘lac’ and ‘ci’ values of a certain calculation result - table sequence T(no, ct, loc):

	A
1	=file("loc_list.btx").import@b()
2	=T.new(string(no):no,ct,A1(loc).lac,A1(loc).ci)

A1: read the dimension table;

A2: since the value of loc in T correspond to the sequence number of record of dimension table in A1, we can directly take the field corresponding to A1(loc).

New data processing

The data will continue to increase, and the space flag on new data may also be new, so we need to append the data with new space flag to the original dimension table. SPL code:

	A
1	=file("loc_list.btx").import@b().keys@i(lac,ci)
2	=connect@l("oracle12c")
3	=A2.query@x("SELECT DISTINCT lac,ci FROM testdata where ct >="/(long(date(now()))\1000))
4	=A3.select(!A1.find(lac,ci))
5	=file("loc_list.btx").export@ab(A4)

In this way, when dumping new data for a day, we just need to store new data into the dimension table bin file ‘loc_list.btx’ where the data has been appended, which will not affect the corresponding relationship of original space flag sequence number. SPL code:

	A
1	=connect@l("oracle12c")
2	=loc_list=file("loc_list.btx").import@b().keys@i(lac,ci)
3	=A1.cursor@x("SELECT no,ct,lac,ci FROM testdata where ct >="/(long(date(now()))\1000)).new(long(no):no,int(ct):ct,loc_list.pfind(lac,ci):loc)
4	=file("tmp.btx").export@b(A3)
5	=file("tmp.btx").cursor@b().sortx(no,ct)
6	=file("2.day.ctx").create@y(#no,#ct,loc)
7	>A6.append@i(A5)
8	=movefile(file("tmp.btx"))

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

spl practice(8)

esProc

mars • 398 View • 1 Years ago

SPL Practice: integerization during data dump

Problem description

Data integerization and dump

Restore through sequence number

New data processing

ToC