Performance Optimization - 5.2 [Ordered traversal] DISTINCT and COUNT(DISTINCT)

 

Performance Optimization - 5.1 [Ordered traversal] Ordered grouping and aggregating

DISTINCT and COUNT (DISTINCT) in SQL has always been a common problem that affects performance. DISTINCT is essentially a grouping and aggregation operation without aggregations. Indeed, it is not easy to do DISTINCT when the data is out of order, because it requires keeping the traversed grouping key value for subsequent comparison. If the result set is too large, it needs to buffer as described above in big grouping, resulting in poor performance.

When the data is ordered, we can use the ordered grouping algorithm described in the previous section to calculate DISTINCT.

If we first sort the orders.ctx table by id, and then calculate the distinct result of id:

A
1 =file(“orders.ctx”).open().cursor(id;amount>1000)
2 =A1.group(id)

group without aggregations is essentially the DISTINCT. Here it will return a cursor to meet the scenario of large data volume.

COUNT(DISTINCT) is slightly more complicated. If we simply want to do COUNT(DISTINCT) on the whole set, then all we need to do is count the DISTINCT result.

A
3 =A2.skip()

COUNT(DISTINCT) is more commonly used in grouping by other key values. For example, calculate the number of accounts of each month.

A
1 =file(“orders.ctx”).open().cursor(id,dt;amount>1000)
2 =A1.groups(month@y(dt);icount@o(id))

When the data is ordered by id, icount@o can be used to calculate COUNT(DISTINCT). In this case, SPL will no longer keep all the traversed id values, but will use the ordered grouping algorithm, only keeping the last id value for comparison. This way, the memory usage is much smaller and the computing performance is much better.


Performance Optimization - 5.3 [Ordered traversal] Ordered grouped subsets
Performance Optimization - Preface