To Index Data is To Sort Data
Indexing is commonly used among programmers. Without fully grasping the idea behind the technique, a programmer is always eager to take advantage of it whenever they encounter a query performance problem, only to get disappointed by the result on many occasions when resources are consumed. By analyzing the principles of indexing, the article shows you when is the appropriate time to use an index and how to use it.
The purpose of creating an index is to quickly find the record where a certain field value is equal to a given value (find the right person according to an ID number, for example). For a data table having N rows, a full traversal-based search involves N comparisons. If the data table is already ordered by values of the target field (called key values in the index), we can use the binary search. This way the number of comparisons is only logN (base 2) – for example, only 30 comparisons are needed for a data table having one billion rows (230 comparisons if a billion records are traversed), and obviously this increases performance greatly. Sometimes key values could be duplicated (find people according to birthdays, for example); or there are queries according to a certain range of key values (find people who were born in a certain period of time). In both scenarios, the number of comparisons is bigger than logN but remains of the same order of magnitude.
To index data is to sort data.
Generally, we won’t really sort the original data table. Instead, we create an index, a smaller data table, which stores key values of all records and the records’ positions in the storage. If there are other fields specified for the key value-based search, we can also create an index on each of these fields. For one original data table, there can be multiple indexes. But if the original data table is sorted and replicated each time when an index is created, the space consumption will be huge.
There is the derived index, named hash index, which calculates a certain type of hash values of the key values, even making the binary search unnecessary and further speeding up the search. The method can only find key values according to a specified exact value, not an interval, because the hash function generally isn’t monotonic and doesn’t retain information of the key values’ original order. Yet the method is sufficient in handling many scenarios (like finding a person according to ID number). A hash index is in essence about sorting according to hash values of the key values. The following discussions still revolve around the ordinary index that sorts by key values, but we can understand the HASH index in the similar way.
According to how the index works, it obviously cannot increase the performance of full traversal on a huge amount of data. The technique has been overused because some programmers do not understand its underlying principle and even try to use it to improve the performance of grouping and aggregation.
With a deep understanding about the idea behind indexing, we know which are the right scenarios to use it.
Indexes are effective for scenarios where the condition is specified on key values, such as finding the person whose ID number is equal to the given value, or finding people who were born in a certain period of time.
Most of the time, the method is ineffective if the condition is specified on key values’ function; other times the effect depends on database optimization.
For example, we have a scenario where birth dates are specified as days of a week but the index key is Birthday. In this case, the index cannot work because it isn’t ordered by the days of a week.
Another example is a scenario where a certain age group is specified but the index key is Birthday. Here the index can’t be directly used, either. But, as the relationship between the age and the birthday can be expressed as a monotonic function, a well optimized database may be able to use the index; otherwise, they cannot.
In view of this, it is recommended that you write a query condition directly on the original values of the index key, rather than writing it as a function or an expression.
Suppose we create indexes on all the fields specified in the search condition, will that lead to a stronger performance boost for multi-condition queries on those fields?
The answer is a disappointing “No”, according to the idea previously explained. Often only one index is useful.
For example, there are two indexes created on field A and field B respectively and the search condition is A=1 AND B=2. The records found according to condition A=1 using index A aren’t ordered by field B, so a traversal is still needed to find the records meeting the condition B=2; and vice versa.
We can also find records meeting A=1 and B=2 respectively and then perform the intersection. It seems that in this way both indexes can be used. But, calculating intersection also involves comparisons, and that is a type of traversal and isn’t necessarily faster than finding records meeting the condition B=2. The actual performance may be different for different scenarios.
If we build a double-field index on both field A and field B, of course the data table ordered by A and B is also ordered by A. In this case, the index can be used to find both records meeting A=1 and those meeting A=1 AND B=2, and gets broader application range.
Yet the double-field index cannot work when there is only the condition B=2 because an index ordered by both field A and field B isn’t ordered by field B. It is easy for many programmers to make the mistake because they thought a multi-field index is useful for all conditions specified on these fields.
According to the analysis, a multi-field index has broader application range than a single-field one. Since an index created on fields A, B and C can be used for handling conditions A, A/B and A/B/C, should we create an index on as many fields as possible? It seems that the answer is yes according to the principle of index. But this results in bigger index table and the I/O cost increases. So, the decision should come from a good judgement about the scenario.
Since the essence of index is about sorting, if data is already ordered by a certain field when it is stored physically, will the search be still quick without creating an index on this field?
Yes, it will.
However, relational databases do not have the concept of data order. Whether we can make use of the physical order of the data for the index depends on luck. esProc SPL offers such a concept and enables automatically using the data’s physical order as the index. The significance that data is physically ordered is much more than one less index is needed. The use of order plays a very important role in increasing performance of grouping operation, distinct operation and join operations, which we’ll discuss later.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version