Indexes for boolean fields. When to create/skip an index?

When it comes to database optimization, creating indexes is often seen as a panacea that can magically speed up queries. However, the truth is that not all fields require an index, and in some cases, creating an index can actually harm performance. This is particularly true for boolean fields, which have only two possible values (true or false). In this article, we will explore the scenarios where creating an index for a boolean field in MySQL is necessary and when it is not.

Library. Indexes in real life

When to avoid creating an index?

Potential issue with creating an index for a boolean field is that it can lead to performance problems if the diversity of data is small. This is because iterating over the index to find the matching records and then iterating over the file system to retrieve the data for all the matching records can be less efficient than simply iterating over the file system for all the data.

For instance, if we have a boolean field called in_stock, and the diversity of data is around 50/50, creating an index on the in_stock field may not be efficient.

In this case, MySQL may end up iterating over the index (step 1) to find the matching records and then iterating over the file system (step 2) to retrieve the data, even though most of the data is not needed. This can lead to increased disk I/O, CPU usage, and memory usage.

Therefore, it is important to consider the diversity of data before creating an index for a boolean field.

Why diversity of data is important?

The case here is how databases fetch information from the disk. Index provide pointers to the rows. Rows are stored in pages. Database reads pages and using pointers fetches the rows.

Let’s take a practical example, it might be easier to understand the concept.

1
2
3
select *
from products
where in_stock = 1;

If we have a table with 100 rows (for example 10 pages with 10 rows in each), and we want to fetch products that are in_stock the database will read all 10 pages into the memory (the whole table)

Using index we might get the following result:

Pointer	Row
r2	2
r3	3
r6	6
r9	9

And here is the first page with the rows:

Pointer	Row ID	Field: is_active	action
r1	1	0	skip
r2	2	1	take
r3	3	1	take
r4	4	0	skip
r5	5	0	sip
r6	6	1	take
r7	7	0	skip
r8	8	0	skip
r9	9	1	take
r10	10	0	skip

As you can see page 1 consist of our target rows and rows that we have to skip. MySql will read the whole page into the memory and there is no difference if we have row pointers or not, we still have to read the whole page. This is the main reason why indexes are not efficient in this particular situation.

When to create an index?

Firstly, let’s consider the situations where creating an index for a boolean field is beneficial. One such scenario is when we have a small variety of data. For instance, if a boolean field is used to represent whether a product is in stock or not, and the majority of the products are always in stock, then indexing the field can improve query performance. Since the field has only two possible values, the index can effectively split the dataset into two groups, making it faster to filter and retrieve the records that meet the criteria.

The second scenario where creating an index for a boolean field can be beneficial is when running count queries against the field. Count queries can be resource-intensive, especially on large datasets, and an index can help to speed up the execution time. For instance, if we want to count the number of products that are currently in stock, we can use the following query:

If we have an index on the in_stock field, MySQL can use the index to quickly identify the matching records, instead of having to scan the entire table. This can significantly reduce the execution time of the query.

1
2
3
SELECT COUNT(*)
FROM products
WHERE in_stock = 1;

I made some test using table with 1M rows. Here is comparison of indexed and unindexed tables.

Indexed query performance:

1
2
3
explain analyze SELECT count(*) FROM products_indexed where in_stock = 1;
-> Aggregate: count(0)  (cost=99864.35 rows=1) (actual time=69.717..69.717 rows=1 loops=1)
    -> Covering index lookup on products_indexed using in_stock_index (in_stock=1)  (cost=49970.35 rows=498940) (actual time=0.161..53.604 rows=500223 loops=1)

Unindexed query performance:

1
2
3
4
explain analyze SELECT count(*) FROM products_unindexed where in_stock = 1;
-> Aggregate: count(0)  (cost=105581.43 rows=1) (actual time=185.244..185.245 rows=1 loops=1)
 -> Filter: (products_unindexed.in_stock = 1)  (cost=96036.30 rows=95451) (actual time=0.092..169.252 rows=500223 loops=1)
 -> Table scan on products_unindexed  (cost=96036.30 rows=954513) (actual time=0.090..120.032 rows=1000000 loops=1)

Winner is obvious ;)

Conclusion

Creating an index for a boolean field in database is not always necessary. While indexing can improve query performance in certain scenarios, it can also harm performance if the field has a low cardinality or if the number of distinct values is small. As with all optimizations, it is important to carefully consider the characteristics of the data and the queries being executed before deciding whether to create an index or not.

There are a numerous number of databases and storage engines available for use. Keep this in mind when making a decision about creating indexes. For example PostgreSQL has a bitmap indexes and in some scenarios you should create an index to speedup performance even if the diversity of data is small.

Know your database
Know your data
Know your queries