Performance Optimization - 7.2 [Merge and join] Merge in segments

 

Performance Optimization - 7.1 [Merge and join] Ordered merge

When performing ordered merge algorithm for big data, there is an inconvenience, that is, the data needs to be read one by one from two (or more) cursors before comparison, resulting in a failure to directly achieve parallel processing in segments. Since this logic cannot ensure that the associated key values of two tables are distributed synchronously in two corresponding segments, the key value in the second segment of table A may correspond to the third segment of table B.

To solve this, we can still use the ordered nature of the primary key.

When tables A and B are associated by the primary key, take A as the reference table to divide it into segments, and then quickly take the start and end values of the primary key of each segment, and finally search table B by taking the primary key interval of each segment as a condition. Since table B is also ordered by primary key, the records in the interval into which the primary key value falls are stored continuously, which make it possible to quickly locate these records to generate a cursor. In this way, synchronous segmentation and parallel computing in segments can be implemented:

A B
1 =file("A.ctx").open() =file("B.ctx").open()
2 =[-inf()] | 9.(A1.cursor(;;~+1:10).fetch(1).id)| [inf()]
3 =10.("id>="/A2(#)/"&& id<"A2(#+1))
4 fork to(10) =A1.cursor(;;A4:10)
5 =B1.cursor(;${A3(A4)})
6 =joinx(B4,id;B5,id)
7 =…
8 =A4.conj().…

A2 first reads the id value of the first record of each segment in table A; A3 uses these segmentation points to form the interval condition. Then in each thread, table A is still segmented normally, while table B is segmented based on the corresponding conditions. In this way, it can be ensured that the key values in two tables are synchronously corresponding, and the associated cursor of the segmented cursor is obtained to continue the calculation.

SPL provides corresponding function with which associable multi-cursors can be generated:

A B
1 =file("A.ctx").open() =file("B.ctx").open()
2 =A1.cursor@m(;;4) =B1.cursor(;;A2)
3 =joinx(A2,id;B2,id)

A2 normally generates a multi-cursor for table A, and B2 can generate a multi-cursor for table B based on A2. Since the primary key (the field name with # when executing the create()) can be defined when creating a composite table, there is no need to explicate the field name here. Instead, search will be directly performed according to corresponding primary key of the two tables (different names are allowed for the primary key field of two tables). The joinx() in A3 will also return a multi-cursor.

There is another problem.

Since the primary key is unique, it is impossible to put two records with the same primary key into two segments when segmenting. However, in the primary-sub association, the associated keys of the sub table are not whole primary keys and may have duplicate values, so it is possible that records with the same key may appear in two natural segments. Therefore, when performing parallel segmentation in primary-sub association, the primary table should be used as the reference table, and the sub table should follow the primary table for segmentation instead of reversing the order at will.


Performance Optimization - 7.3 [Merge and join] Association location
Performance Optimization - Preface