Performance Optimization - 4.1 [Traversal technology] Cursor filtering

 

Performance Optimization - 3.9 [Search in external storage] Full-text searching

Using index or taking advantage of order can help us find records quickly. However, the cost of creating and maintaining an index and keeping the data in order is not low. Since we cannot pre-create index for all query conditions, it still needs to use sequential search, when necessary, i.e., traversal.

When traversing a data table on external storage to search for data, an easy method to think of is to read the data one by one and generate record, and then calculate whether the search conditions are met according to the generated records to decide whether to retain or discard the record. Regrettably, this method will cause some unnecessary actions. For example, when we want to search for the name, gender, age, address and other information of people over 18 years old, if we first read all these fields and generate personnel records, and then determine whether a person is over 18 and discard the record that fails to meet the age condition, then the actions of reading fields and generating records for people under 18 are unnecessary. In contrast, if we first read the age filed and then immediately determine whether the condition is met, if it is met, read other fields and generate records; if not, skip directly and give up reading other fields and generating records, then it can omit a lot of actions, especially when there are many records that do not meet conditions.

SPL implements this mechanism for composite table. When creating a cursor on a composite table, a filter condition can be attached. In this way, SPL will first read the field value used to calculate the condition. If the condition is not met, the next step will be canceled. Only when the condition is met will it continue to read other required fields and create a record.

Using the select()function on a created cursor cannot achieve this effect. Since SPL doesn’t know the data source of this cursor, and cannot apply the filter condition before the cursor creates records, it has to take the records using fetch() function first and then do judgement.

A
1 =file(“persons.ctx”).open()
2 =A1.cursor(sex,age;age>=18).groups(sex;avg(age))
3 =A1.cursor(sex,age).select(age>=18).groups(sex;avg(age))

The calculation results of A2 and A3 are the same. However, since A2 uses the above-mentioned pre-cursor filtering technology, its performance is much better.

SPL has not yet implemented this optimization for text files and bin files.

Sometimes there may be multiple filter conditions. For example, to find female people over 18 years old, the filter conditions are:

age>=18 && sex=="Female"

Since the && operation is commutative, the order of writing these two conditions doesn’t affect the operation result, but it will affect the computing performance. Therefore, we should consider which condition is more likely to be false and put it first. The reason for doing so is that if the former condition is calculated as false, there is no need to calculate the latter one, and the entire logical expression must be false. In this case, this record can be skipped directly, and the field used to calculate the latter condition doesn’t need to be read.

For example, if we roughly know that less than half of the people in this data are female and most people are over 18 years old, then the condition for female is more likely to be calculated as false, in which case it should be written as:

sex=="Female" && age>=18

This writing method will achieve better performance than the previous one.

The writing method for || related conditions is exactly the opposite. We should put the condition that is more likely to be true in the first place. If the former condition has already been calculated as true, there is no need to calculate the latter condition.

During traversal, the data amount is often large, and the filter condition may be calculated many times. In this case, the computing speed becomes very important, so we should try to choose a syntax that makes the computing speed faster when coding.

For example, for the example given above, according to the knowledge we talked about earlier, we know that sex==“female” is a string operation, and its operation speed is not as fast as that of age>=18. If the probabilities of calculating these two conditions as false are almost the same or cannot be determined, then we should put the condition with faster speed in the first place so as to calculate a definite result as soon as possible according to the condition.

Of course, a better way is to use the data type conversion method mentioned earlier, that is, convert the sex field into small integer and do judgement using integer operations like sex==1.

In addition, try not to put some unnecessary calculation expressions in the filter condition, calculate them in advance instead.

A
1 =file(“persons.ctx”).open()
2 =A1.cursor(;year(now())-year(birthday)>=18)

The filter condition in A2 is not good. Since now()returns an instant information, which cannot be calculated in advance, year(now()) will be calculated provisionally during every operation, resulting in a relatively poor performance. If written as:

A
1 =file(“persons.ctx”).open()
2 =year(now())
3 =A1.cursor(;A2-year(birthday)>=18)

The performance will be much better, because it only needs to calculate year(now()) once. A better writing method is:

A
1 =file(“persons.ctx”).open()
2 =year(now())-18
3 =A1.cursor(;year(birthday)<=A2)

Here, subtraction can also be omitted. If the conversion data type talked about in the previous chapter is used to convert birthday into small integer, the performance will be further improved.

Sometimes, we need to determine whether a field value is in a certain set. For example, search for the people with specified names in a given name table:

A
1 =file(“persons.ctx”).open()
2 [John,Alice,Mary,Steven,…]
3 =A1.cursor(;A2.contain(name))

The contain()usually uses the sequential search, which cannot achieve better performance. If the list is a little long (10 or more), we can use the binary search to speed up.

A
1 =file(“persons.ctx”).open()
2 [John,Alice,Mary,Steven,…]
3 =A2.sort()
4 =A1.cursor(;A3.contain@b(name))

In this way, the traversal speed is much faster.

In Chapter 8, we will talk about another efficient method to perform such set membership judgment in traversal.


Performance Optimization - 4.2 [Traversal technology] Multipurpose traversal
Performance Optimization - Preface