Performance Optimization - 7.3 [Merge and join] Association location

 

Performance Optimization - 7.2 [Merge and join] Merge in segments

What we discussed in the previous two sections are all about full-table traversal. In practice, however, the associated tables are often filtered with conditions. Of course, we can filter after association, but this will traverse all associated tables once, which will be time consuming if one table is very large. Sometimes the filter can be performed quickly, and the remaining result set after filtering may be very small. Since the two tables are associated by primary key, using the primary key of the filtered table to search for the records in the associated table can avoid full-table traversal and achieve better performance.

The logic of this algorithm is somewhat similar to the foreign key association between a small fact table and a large dimension table. However, we can take advantage of the characteristic that the primary keys of both tables are ordered to deal with larger filtered result sets.

A B
1 =file(“A.ctx”).open() =file(“B.ctx”).open()
2 =A1.cursor(;…)
3 for A2,1000 =B1.find(A3.(id))
4 =join(A3,id;B3,id)
5

B3 uses the primary key from A3 to take data. Since the primary key of the records taken out from A3 is ordered, there will be no duplicate record taken out from B1. Moreover, table B is always scanned from front to back, and in the worst case, it is only traversed once, avoiding the problem of repeated traversals that may occur if we adopt this method when discussing the large fact table and large dimension table.

However, this code is just an example to illustrate the algorithm process. In fact, the find()function cannot be used in such a simply way because it will search from the scratch every time, and will not inherit the previous search result (If used appropriately, it only needs to continue searching forward from the last location point). The algorithm that uses association for positioning is not easy to implement, the code is complex, and the cost of the algorithm itself is not small, which may offset the advantage of reducing the amount of reading. Therefore, testing is necessary for specific application.

SPL encapsulates these operations. Let’s use the tables in primary-sub relationship as an example.

A
1 =file(“orders.ctx”).open().cursor(dt;area==“CA”)
2 =file(“details.ctx”).open().news(A1,dt,price,quantity)
3 =A2.groups(dt;sum(price*quantity))

Filter the primary table first, and then use the filtered primary table to take the records of associated sub table. Because the primary table and sub table are in one-to-many relationship, it needs to use the news()here to indicate that each record taken from the primary table may correspond to multiple records of sub table. At this time, the field of primary table will be duplicated according to the number of the records of associated sub table, which is equivalent to executing joinx().

Alternately, first filter the sub table, and then take the records of the associated primary table.

A
1 =file(“details.ctx”).open().cursor(price,quantity;quanity>10)
2 =file(“orders.ctx”).open().new(A1,dt,sum(price*quantity):amount)
3 =A2.groups(dt;sum(amount))

By default, the returned cursor is based on the latter associated table (if there is no any filter condition, the number of records in the returned cursor is the same as that in the associated table). Here the latter table is the primary table, and the records in the sub table need to be aggregated first.

These functions can all be applied to multi-cursor.

For column-wise computing, SPL provides the pjoin() function to implement ordered merge and association location. Since the optimization principle is similar, it will not be explained in detailI here. Interested readers can refer to the information on the SPL forum.


Performance Optimization - 7.4 [Merge and join] Attached table
Performance Optimization - Preface