Performance Optimization - 7.1 [Merge and join] Ordered merge

 

Performance Optimization - 6.9 [Foreign key association] One side partitioning

We have mentioned the ordered merge algorithm many times. For example, this algorithm was used when we talked about appending data to the ordered composite table in Chapter 2. With the algorithm, we can implement the intersection, union and difference operations on large sets. Let’s take the union operation as an example to illustrate its calculation logic:

A B C D
1 func =file(“A.btx”).cursor@b() =file(“B.btx”).cursor@b()
2 =C1.fetch(1) =D1.fetch(1)
3 for if C2==null && D2==null break
4 else if C2==null >r=D2,D2=D1.fetch(1)
5 else if D2==null >r=C2,C2=C1.fetch(1)
6 else if C2.id< D2.id >r=C2,C2=C1.fetch(1)
7 else if C2.id>D2.id >r=D2,D2=D1.fetch(1)
8 else >r=C2,C2=C1.fetch(1),D2=D1.fetch(1)
9 return r
10 return cursor@c(A1)

Read the data tables A and B (both are ordered by the id field); fetch one record from cursors C1 and D1 respectively and compare which record has a smaller id; keep the record with a smaller id and prepare to return it; continue to fetch the cursor to which this record belongs; repeat this process until all the data of two cursors are taken out. At this time, the returned cursor is the union of two data sets based on the id field.

To compute the intersection, the code can be modified to return data only when two records are equal, and end the loop as soon as one cursor is completely fetched. The computation of the difference can be implemented in a similar way. This algorithm can be extended to handle multiple tables, but the code will be more complex.

The ordered merge algorithm can also be used to implement join operations. Let’s take the inner join as an example to illustrate its calculation logic:

A B C D
1 func =file(“A.btx”).cursor@b() =file(“B.btx”).cursor@b()
2 =C1.fetch(1) =D1.fetch(1)
3 for if C2==null || D2==null break
4 else if C2.id< D2.id >C2=C1.fetch(1)
5 else if C2.id>D2.id >D2=D1.fetch(1)
6 Else =new(C2,D2)
7 >r,C2=C1.fetch(1),D2=D1.fetch(1)
8 return D6
9 return cursor@c(A1)

This logic is very similar to that of intersection operation. Full join or left join can be implemented similarly by just adjusting the judgment code in the middle. Likewise, this algorithm can be extended to handle multiple tables.

The ordered merge algorithm will complete after traversing both tables only once. Even for large data tables, there is no need to buffer data. If the size of two tables is N and M respectively, the complexity will be O(N+M). To perform set or join operation on unordered data, we need to compare every piece of data to each other if no optimization is taken, and the complexity will be O(N*M), which is much larger than O(N+M). Databases usually adopt the hashing & partitioning algorithm discussed in the previous chapter. Although the hashing & partitioning algorithm can decrease the complexity by K times (K is the average number of repetitions of the hash value), the complexity is generally still much greater than the ordered merge algorithm. Moreover, for the operation of big data in external storage, hashing & partitioning algorithm will also lead to the action of reading and writing buffer files, and may encounter the unlucky situation we have mentioned many times. Therefore, using the ordered merge algorithm has huge performance advantages.

SPL implements the ordered merge algorithm for both set operation and join operation:

A
1 =file(“A.btx”).cursor@b()
2 =file(“B.btx”).cursor@b()
3 =[A1,A2].merge@u(id)
3 =[A1,A2].merge@i(id)
3 =[A1,A2].merge@d(id)

The options @u, @i and @d of the merge() function represent the union, intersection and difference respectively.

A
1 =file(“A.btx”).cursor@b()
2 =file(“B.btx”).cursor@b()
3 =joinx(A1,id;A2,id)
3 =joinx@1(A1,id;A2,id)
3 =joinx@f(A1,id;A2,id)

The joinx() function without option, with option @1 and with option @f represent the inner join, left join and full join respectively.

Both merge()and joinx() assume that the cursor as parameter is already ordered. Executing these two functions on an unordered cursor will result in incorrect calculation results.

Both functions are delayed cursors, and further calculations need to be defined.

In addition to foreign key association, common join operations also include two types: homo-dimension association and primary-sub association. Homo-dimension association is a one-to-one relationship, in which two tables are associated by the primary key, while primary-sub association is a one-to-many relationship, in which the association occurs between the primary key of the primary table with the first part of primary keys of the sub table. Since both tables in the two types of associations are related to the primary key, if the data is ordered by the primary key, the low-complexity ordered merge algorithm can be used. If the large dimension table is ordered by the primary key, using the foreign key association algorithm discussed in the previous chapter can also achieve high performance.

For a data table, being sorted by primary key is crucial for achieving high-performance join operations, and it only needs to be ordered by the primary key, and the cost of meeting this premise is not very high. In SPL, we will assume that any data table in external storage is ordered by the primary key by default, and only few data tables used for special search targets will be sorted differently.

The definition of join operation in relational algebra is too simple (Cartesian product followed by filtering) and does not involve primary key. If the definition is strictly implemented, the order of association keys cannot be assumed in theory, and hence these optimization algorithms cannot be used. Moreover, according to the unordered data theory of relational algebra, it is difficult to ensure that the data table is ordered by any field, and it is also difficult to make the data in order even if the database wants to optimize in practice.


Performance Optimization - 7.2 [Merge and join] Merge in segments
Performance Optimization - Preface