To Index Data Is to Sort Data
Indexing is commonly used among programmers.
Programmers who have not fully grasped the idea behind the technique tend to reach for it whenever they encounter a query performance problem, only to be disappointed by the result on many occasions. By analyzing the principle of indexing, this article tries to show programmers when it is appropriate to use an index and how to use it.
The purpose of indexing is to quickly find the records where the value of a certain field equals a given value (finding the right person by ID card number, for instance). For a data set of N rows, finding an eligible record with a full traversal needs up to N comparisons. If the data set is sorted by the target field (called the key of the index), we can use binary search instead, or build a binary search tree over the key values. The number of comparisons is then about log N (base 2). For example, only about 30 comparisons are needed for a data set with a billion rows, because a billion is roughly 2^30, whereas a full traversal needs about a billion comparisons. That is a great performance increase. The key values may be duplicated (finding people by birthday, for instance), or the query may ask for a range of key values (finding people who were born in a certain period of time). In both scenarios, the number of comparisons is bigger than log N but remains of the same order of magnitude.
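The gap between N and log N can be checked with a small counting experiment. The following is a minimal sketch, not a database implementation: the function names and the one-million-row data set are assumptions made for illustration, and each loop iteration is counted as one comparison.

```python
import math

def linear_search(rows, target):
    """Full traversal: compare row by row until the target is found."""
    comparisons = 0
    for position, value in enumerate(rows):
        comparisons += 1
        if value == target:
            return position, comparisons
    return None, comparisons

def binary_search(sorted_rows, target):
    """Binary search over data sorted by the key; counts one probe per iteration."""
    low, high = 0, len(sorted_rows) - 1
    comparisons = 0
    while low <= high:
        comparisons += 1
        mid = (low + high) // 2
        if sorted_rows[mid] == target:
            return mid, comparisons
        if sorted_rows[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return None, comparisons

rows = list(range(1_000_000))   # hypothetical key values, already sorted
target = 999_999                # worst case for the traversal: the last row

_, linear_count = linear_search(rows, target)
_, binary_count = binary_search(rows, target)
print(linear_count, binary_count, math.ceil(math.log2(len(rows))))
```

For one million rows the traversal needs a million comparisons, while the binary search stays within about 20 probes, in line with log N (base 2).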
To index data is to sort data.
Of course we won’t really sort the original data set. Instead, we create an index, a smaller data set that stores the key values of all records together with the records’ positions in the original data set, in a table sorted by the key values. If searches are needed on several fields, we can create an index over each of them; multiple different indexes are allowed for one original data set. If the original data set itself were sorted and replicated each time an index was created, the space consumption could be huge.
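As a rough illustration of this structure, the sketch below builds an index as a sorted list of (key value, position) pairs and uses Python’s standard bisect module for the binary search. The sample records and the find_range helper are hypothetical; a real database stores the index on disk in a more elaborate form.

```python
from bisect import bisect_left, bisect_right

# Hypothetical original data set, kept in insertion order (not sorted).
people = [
    ("Ann", "1990-05-17"),
    ("Bob", "1985-01-02"),
    ("Cid", "1990-05-17"),
    ("Dee", "1978-11-30"),
    ("Eve", "1988-07-09"),
]

# The index: (key value, position in the original data set), sorted by key.
index = sorted((birthday, pos) for pos, (_, birthday) in enumerate(people))
keys = [key for key, _ in index]

def find_range(low, high):
    """Return the records whose key lies in [low, high], located by binary search."""
    start = bisect_left(keys, low)
    end = bisect_right(keys, high)
    return [people[pos] for _, pos in index[start:end]]

print(find_range("1990-05-17", "1990-05-17"))  # duplicated key values
print(find_range("1985-01-01", "1989-12-31"))  # a range of key values
```

The original list is never reordered; only the small index is sorted, and it handles both duplicated keys and range conditions with two binary searches.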
A database has to take data insertion and deletion into consideration when creating an index. An index that is simply a sorted table makes insertion and deletion costly. For this reason, a B-tree is used to facilitate data updates. A B-tree is the result of extending a binary tree into an n-ary tree, with data still sorted by key values. (We’ll discuss how to create an index, which covers a lot of aspects, in another article; here we only focus on how to use one.)
There is another kind of index, the hash index, which calculates the hash values of the key values in all records. These hash values are natural numbers falling into the range from 1 to k, and they can be used to locate records directly without a binary search. The hash method can only find records by an exact key value, not by an interval, because the hash function isn’t monotonic and no longer preserves the original order of the key values. Yet the method is sufficient for many scenarios (like finding a person by ID card number). At its heart, a hash index is also about sorting, this time by the hash values of the key values. The following discussion takes the ordinary index as the example, but the conclusions also apply to the hash index.
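A hash index can be sketched in a few lines. This is a simplified model under stated assumptions: the bucket count K, the sample records and the helper names are made up for illustration, Python’s built-in hash stands in for the database’s hash function, and buckets are numbered from 0 to K-1 rather than 1 to k.

```python
K = 8  # number of buckets; an arbitrary choice for this sketch

# Hypothetical records: (name, ID card number).
people = [("Ann", "ID-1003"), ("Bob", "ID-1001"), ("Cid", "ID-1002")]

def bucket_of(key):
    """Map a key to a bucket number; the mapping does not preserve key order."""
    return hash(key) % K

# Each bucket holds (key value, position in the original data set) pairs.
buckets = [[] for _ in range(K)]
for pos, (_, id_number) in enumerate(people):
    buckets[bucket_of(id_number)].append((id_number, pos))

def find_exact(id_number):
    """Exact match: look inside a single bucket, with no binary search."""
    bucket = buckets[bucket_of(id_number)]
    return [people[pos] for key, pos in bucket if key == id_number]

def find_range(low, high):
    """Range match: no bucket can be singled out, so every bucket is visited."""
    return sorted(
        people[pos]
        for bucket in buckets
        for key, pos in bucket
        if low <= key <= high
    )

print(find_exact("ID-1002"))
print(find_range("ID-1001", "ID-1002"))
```

The exact lookup touches one bucket, while the range lookup degenerates into a traversal of the whole index, which is why a hash index suits equality conditions only.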
In view of how the index works, it can’t increase the full traversal performance. So, it’s inappropriate to use an index to improve the performance of grouping and aggregation.
With a solid understanding of the idea behind indexing, we know the right occasions to use an index and how to write the code properly.
1. It is very effective with a condition on key values, like finding the person whose ID card number equals a given value, or finding people who were born in a certain period of time;
2. Most of the time, it is ineffective if the condition is on a function of the key values. In the remaining cases, whether an index can be used depends on the degree of database optimization. Take a scenario where the index key is Birthday but the condition is people who were born on a certain day of the week. An ordinary index won’t work because it isn’t sorted by the day of the week. We need to create an index over the function of the key values, a type of index that most databases support. Another example is finding people in a certain age group with Birthday as the index key. We can’t use the index directly here, either. As there is a monotonic relationship between age and birthday, a properly optimized database could in principle use the index, but this is rarely seen in practice. It is therefore best to phrase a query condition directly over the values of the index key, rather than over a function or expression of them.
3. It is effective if a compound search condition joined by AND puts the condition over key values at the outermost layer, like finding people who were born on a certain date and whose names contain a certain character. The database will first use the index to select those who were born on this date, and then find among them those whose names contain the given character. Modern commercial databases are generally capable of analyzing a conditional expression intelligently to find the part for which an index can be used to speed up the search. But for a condition that finds people who were born on a certain date or whose names contain a certain character, the index becomes useless. Since only a traversal can handle the second part of the condition, the database will skip the index and perform a traversal directly.
So it is better to write the condition over the key values at the outermost layer of a compound search condition joined by AND. Performance can only be enhanced when the index key can be used to narrow down the search range.
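The three cases above can be imitated with a toy model. The sketch below is only an illustration of the principle, not of how any particular database plans a query: the sample records and all function names are assumptions, and each query function returns the matching names together with the number of records it had to examine.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Hypothetical records: (name, birthday), in insertion order.
people = [
    ("Anna", date(1990, 5, 17)),
    ("Boris", date(1985, 1, 2)),
    ("Cathy", date(1990, 5, 17)),
    ("Dmitri", date(1978, 11, 30)),
    ("Elena", date(1988, 7, 9)),
    ("Frank", date(1990, 5, 17)),
]

def build_index(key_func):
    """Sorted (key, position) pairs; key_func is the field or a function of it."""
    return sorted((key_func(b), pos) for pos, (_, b) in enumerate(people))

def lookup(index, key):
    """Positions whose indexed key equals `key`, found by binary search."""
    keys = [k for k, _ in index]
    return [pos for _, pos in index[bisect_left(keys, key):bisect_right(keys, key)]]

birthday_index = build_index(lambda b: b)
# An index over a function of the key: sorted by day of the week instead.
weekday_index = build_index(lambda b: b.weekday())

def born_on_and_name_contains(day, char):
    """AND: the index narrows the candidates; the name test runs on those only."""
    candidates = lookup(birthday_index, day)
    matches = [people[p][0] for p in candidates if char in people[p][0]]
    return matches, len(candidates)

def born_on_or_name_contains(day, char):
    """OR: the name test applies to every record, so a traversal is needed."""
    matches = [name for name, b in people if b == day or char in name]
    return matches, len(people)

print(born_on_and_name_contains(date(1990, 5, 17), "n"))
print(born_on_or_name_contains(date(1990, 5, 17), "n"))
print(lookup(weekday_index, date(1990, 5, 17).weekday()))
```

With AND, only the three records selected through the index are examined; with OR, all six are. The weekday lookup works only because a separate index sorted by the function value was built for it.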
Suppose we create indexes over all the fields involved in a search condition. Will that lead to a stronger performance boost?
The answer is a disappointing “No”, according to the idea previously introduced. In most cases, only one index over a single field is useful.
Suppose there is a search condition A=1 AND B=2, and two indexes created over field A and field B separately. Index A can be used first to find the records where the value of field A is 1; but as these records aren’t ordered by field B, a traversal over them is still needed to find the ones meeting the condition B=2. The same situation arises if index B is used first, since the records selected by B=2 are not ordered by field A. A commercial database will assess the situation and choose the index that yields the smaller filtered data set.
Both indexes might be useful if the search condition is A=1 OR B=2. A properly optimized database can use the two indexes separately to filter the original data set according to A=1 and B=2, and then perform a union of the two results.
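The two strategies can be sketched as follows. This is a simplified model with hypothetical data and function names; in particular, a real database would estimate the candidate counts from statistics rather than run both lookups as this sketch does.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows with two fields (A, B), in insertion order.
rows = [(1, 2), (1, 3), (2, 2), (3, 1), (1, 2), (2, 5)]

def build_index(field):
    """Single-field index: sorted (value of the field, position) pairs."""
    return sorted((row[field], pos) for pos, row in enumerate(rows))

def lookup(index, value):
    """Positions whose indexed field equals `value`, found by binary search."""
    keys = [k for k, _ in index]
    return [pos for _, pos in index[bisect_left(keys, value):bisect_right(keys, value)]]

index_a, index_b = build_index(0), build_index(1)

def a_and_b(a, b):
    """AND: use the index with fewer candidates, then test the other field row by row."""
    by_a, by_b = lookup(index_a, a), lookup(index_b, b)
    if len(by_a) <= len(by_b):
        return sorted(p for p in by_a if rows[p][1] == b)
    return sorted(p for p in by_b if rows[p][0] == a)

def a_or_b(a, b):
    """OR: use both indexes separately and take the union of the positions."""
    return sorted(set(lookup(index_a, a)) | set(lookup(index_b, b)))

print(a_and_b(1, 2))  # positions of rows with A=1 AND B=2
print(a_or_b(1, 2))   # positions of rows with A=1 OR B=2
```

For AND, the second index contributes nothing beyond helping to choose the cheaper starting point; for OR, both indexes do real work.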
We can also build a multi-field index over both fields. The result set after filtering by A=1 is then also ordered by field B, and thus it can be further filtered using the index according to B=2. A well-optimized database identifies A=1 AND B=2 and B=2 AND A=1 as the same condition, so the parts of a compound search condition may appear in a different order from the index keys. The third case in the previous section still applies, though: it is preferable that a condition on the index key is written at the outermost layer.
Yet the two-field index over field A and field B doesn’t work for the condition B=2 alone, because an index ordered by field A and then field B isn’t necessarily ordered by field B. To find records according to B=2, the only choice is to perform a traversal. This is where a multi-field index tends to mislead programmers. In fact, a multi-field index over fields A, B and C is valid for the conditions A=?, A=? AND B=?, and A=? AND B=? AND C=?, but it is invalid for the conditions B=?, C=?, and B=? AND C=?. An additional index over field B or field C is needed for those.
Since a multi-field index over fields A, B and C is valid for conditions on A, on A and B, and on A, B and C, should we put as many fields as possible into the index key? According to the principle, it seems we should. As the index table gets bigger, however, the I/O cost increases accordingly. So the decision should come from a careful assessment of the actual situation.
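The prefix behavior of a multi-field index follows directly from the sort order, as the sketch below tries to show. The rows and helper names are hypothetical, and the index is modeled as a plain sorted list of tuples.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows with three fields (A, B, C), in insertion order.
rows = [(2, 1, 9), (1, 2, 5), (1, 1, 7), (2, 2, 3), (1, 2, 4)]

# Multi-field index: (A, B, C, position), sorted by A, then B, then C.
index = sorted((a, b, c, pos) for pos, (a, b, c) in enumerate(rows))

def prefix_lookup(*prefix):
    """Binary search for entries whose leading fields equal `prefix`."""
    n = len(prefix)
    heads = [entry[:n] for entry in index]
    start, end = bisect_left(heads, prefix), bisect_right(heads, prefix)
    return [entry[3] for entry in index[start:end]]

def scan_for_b(value):
    """B alone is not a prefix of the sort order, so every entry is checked."""
    return [pos for (_, b, _, pos) in index if b == value]

print(prefix_lookup(1))             # A=1
print(prefix_lookup(1, 2))          # A=1 AND B=2
print([b for _, b, _, _ in index])  # the B column of the index is not sorted
print(scan_for_b(2))                # B=2 requires a traversal
```

Entries sharing the same leading fields sit next to each other, so any prefix of (A, B, C) can be located by binary search, whereas the B values taken alone appear in no particular order.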
Index and traversal
Does a correctly created and properly used index guarantee a performance boost?
The purpose of using an index is to get records according to key values. In most cases, a relatively small number of records are expected to be retrieved from a huge data set. On those occasions, a suitable index, if used properly, will surely give a big boost to the retrieval performance. On other occasions, where the condition matches a large number of records, performance might or might not improve, and could even get worse.
Here is the reason:
As mentioned earlier, we create a separate index instead of sorting the original data set. The records located through the index are therefore not stored contiguously in the original data set, and may even be completely scattered if the database isn’t properly optimized. Retrieving a large amount of non-contiguous data from the hard disk drags in a lot of irrelevant data along the way, so the retrieval time can’t be calculated simply from the amount of data wanted. Fetching data in index order may also cause the same part of the disk to be read more than once. As a result, the performance gain brought by the index is limited and often smaller than expected. On an HDD, when the result set is very large, a traversal will most probably outperform index-based retrieval because of the large number of time-consuming head jumps. Usually a commercial database will assess the cost before deciding how to retrieve the data, and will abandon index-based retrieval if the conditions are unfavorable. Generally, then, the performance is no worse than that of a traversal. But if the cost assessment is not precise enough, a wrong choice of retrieval method may result in worse performance.
Generally, data in a database is stored in the order of insertion. If that order is in line with the order of the index key values, the data retrieved will be physically stored in a relatively contiguous way. In this situation, index-based filtering can bring a considerable performance increase, even if the amount of data retrieved is large.
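The effect of physical layout can be estimated with a simple page-counting model. All numbers below are assumptions chosen for illustration (100 rows per page, 100,000 rows, a query matching 10% of them), and the model ignores caching and the cost difference between sequential and random reads.

```python
import random

ROWS_PER_PAGE = 100     # assumed page capacity, for illustration only
TOTAL_ROWS = 100_000    # assumed size of the original data set
total_pages = TOTAL_ROWS // ROWS_PER_PAGE

def pages_touched(positions):
    """Count the distinct pages that must be read to fetch the given rows."""
    return len({pos // ROWS_PER_PAGE for pos in positions})

wanted = TOTAL_ROWS // 10  # the query matches 10% of the rows

# Case 1: storage order follows the key order, so the matches are contiguous.
clustered = range(20_000, 20_000 + wanted)

# Case 2: the matches are scattered across the whole data set.
rng = random.Random(0)  # fixed seed so the run is repeatable
scattered = rng.sample(range(TOTAL_ROWS), wanted)

print("pages in the data set:", total_pages)
print("pages read, contiguous matches:", pages_touched(clustered))
print("pages read, scattered matches:", pages_touched(scattered))
```

Under these assumptions the contiguous case reads about a tenth of the pages, while the scattered case ends up reading nearly every page, so it moves as much data as a traversal and adds random access on top.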
Some opinions expressed in this article may be those of a guest author and not necessarily Analytikus. Staff authors are listed in http://www.bigdatanews.datasciencecentral.com/profiles/blogs/to-index-data-is-to-sort-data