The Magic of Bloom Filters

We’re always trying to improve application performance. Caching is a cool thing, but it can use a lot of memory.

What is the minimum measurement of memory? Correct - BIT. In this article, I’ll explain how to use bits (actually use less memory) and skip old fashion caching approach.

Bloom filers bits sequence

What are Bloom Filters?

Bloom filters are probabilistic data structures that allow us to check if an element is in a set. They were invented by Burton Bloom in 1970.

Practical part

Let’s say you’re building a new social media platform and you want to check if a username is already in your user database before allowing someone to register with it. Instead of caching a massive list of every username ever used in Redis, you can use a Bloom filter.

To do this, you would first create a Bloom filter with the appropriate number of bits and hash functions. For example, we can create a bloom filter with the size of 64bits. By default, all bits are off - 000000……..0 and so on.

When a user tries to register with a username, you would apply a hash function to the username and detects if a bit is on or off in our Bloom filter.

1
2
3
user_name = 'funivan'
Hash(user_name) % 64 // bit index 3
// check if bit #3 is ON or OFF

Index	0	1	2	3	5	6	…	63
Bit	0	0	0	0	0	0	0	0

In the current step bit 3 is off, so we jump to the database and register user with the name funivan

The next step is to set the bit under index 3 to the value 1

Then, when someone else tries to register with the same username, the platform applies the same hash functions and checks if a specific bit is 1.

When the bit is off (value 0) - we are 100% sure that the user does not exist.
When the bit is on (value 1) - we are 99.99% sure that the users exist, but it is not 100% accuracy. In some cases we can get collisions.

Collision example:

1
2
user_name = 'daniel'
Hash(user_name) % 64 // bit index 3 

The bit index is the same for both names funivan and daniel. In this case we still need to jump to the database and check if daniel exists. This is the cost of the bloom filter, it is not 100% accurate.

Advantages

Bloom filters have several advantages over traditional methods of checking if an element is in a set:

They require very little memory compared to storing all elements in a set.
They have a constant time complexity for insertion and lookup.
They can be easily distributed across multiple machines.

Disadvantages

They have a non-zero probability of false positives. This means that they may incorrectly report that an element is in a set when it is not. The probability of false positives can be controlled by adjusting the size of the bit array and the number of hash functions used.
They cannot be used to delete elements from a set. Once an element is added to a Bloom filter, it cannot be removed. This can make them unsuitable for applications where deletions are common. We have to rebuild the Bloom filter from scratch using persistent data.

Conclusion

Bloom filters are a powerful tool for quickly checking if an element is in a set. They have several advantages over traditional methods and can be used in a variety of applications. So the next time try not a basic Cache it all approach, but take a look into Bloom Filters.