When it comes to database optimization, creating indexes is often seen as a panacea that can magically speed up queries.
However, the truth is that not all fields require an index, and in some cases, creating an index can actually harm
performance. This is particularly true for boolean fields, which have only two possible values (true
or false
). In
this article, we will explore the scenarios where
creating an index for a boolean field in MySQL is necessary and when it is not.
When to avoid creating an index?
Potential issue with creating an index for a boolean field is that it can lead to performance problems if the diversity of data is small. This is because iterating over the index to find the matching records and then iterating over the file system to retrieve the data for all the matching records can be less efficient than simply iterating over the file system for all the data.
For instance, if we have a boolean field called in_stock
, and the
diversity of data is around 50/50, creating an index on the in_stock
field may not be efficient.
In this case, MySQL may end up iterating over the index (step 1) to find the matching records and then iterating over the file system (step 2) to retrieve the data, even though most of the data is not needed. This can lead to increased disk I/O, CPU usage, and memory usage.
Therefore, it is important to consider the diversity of data before creating an index for a boolean field.
Why diversity of data is important?
The case here is how databases fetch information from the disk. Index provide pointers to the rows. Rows are stored in pages. Database reads pages and using pointers fetches the rows.
Let’s take a practical example, it might be easier to understand the concept.
|
|
If we have a table with 100 rows (for example 10 pages with 10 rows in each), and we want to fetch products that are
in_stock
the database will read all 10 pages into the memory (the whole table)
Using index we might get the following result:
Pointer | Row |
---|---|
r2 | 2 |
r3 | 3 |
r6 | 6 |
r9 | 9 |
And here is the first page with the rows:
Pointer | Row ID | Field: is_active | action |
---|---|---|---|
r1 | 1 | 0 | skip |
r2 | 2 | 1 | take |
r3 | 3 | 1 | take |
r4 | 4 | 0 | skip |
r5 | 5 | 0 | sip |
r6 | 6 | 1 | take |
r7 | 7 | 0 | skip |
r8 | 8 | 0 | skip |
r9 | 9 | 1 | take |
r10 | 10 | 0 | skip |
As you can see page 1 consist of our target rows and rows that we have to skip. MySql will read the whole page into the memory and there is no difference if we have row pointers or not, we still have to read the whole page. This is the main reason why indexes are not efficient in this particular situation.
When to create an index?
Firstly, let’s consider the situations where creating an index for a boolean field is beneficial. One such scenario is when we have a small variety of data. For instance, if a boolean field is used to represent whether a product is in stock or not, and the majority of the products are always in stock, then indexing the field can improve query performance. Since the field has only two possible values, the index can effectively split the dataset into two groups, making it faster to filter and retrieve the records that meet the criteria.
The second scenario where creating an index for a boolean field can be beneficial is when running count queries against the field. Count queries can be resource-intensive, especially on large datasets, and an index can help to speed up the execution time. For instance, if we want to count the number of products that are currently in stock, we can use the following query:
If we have an index on the in_stock field, MySQL can use the index to quickly identify the matching records, instead of having to scan the entire table. This can significantly reduce the execution time of the query.
|
|
I made some test using table with 1M rows. Here is comparison of indexed and unindexed tables.
Indexed query performance:
|
|
Unindexed query performance:
|
|
Winner is obvious ;)
Conclusion
Creating an index for a boolean field in database is not always necessary. While indexing can improve query performance in certain scenarios, it can also harm performance if the field has a low cardinality or if the number of distinct values is small. As with all optimizations, it is important to carefully consider the characteristics of the data and the queries being executed before deciding whether to create an index or not.
There are a numerous number of databases and storage engines available for use. Keep this in mind when making a decision about creating indexes. For example PostgreSQL has a bitmap indexes and in some scenarios you should create an index to speedup performance even if the diversity of data is small.
- Know your database
- Know your data
- Know your queries