Explaining Bloom Filters

Introduction

I’ve been meaning to write this post for a while, so welcome to my corner of the internet! A while back, I came across a tweet by Dillon, a former engineer on the domains team at Vercel.

In the tweet, Dillon mentioned using four different Bloom filters in a layered cache to make domain search faster on Vercel Domains. That got me curious: what’s a Bloom filter, and why use it this way?

So I decided to build one from scratch. After some tinkering (and a few “aha!” moments), it finally worked! I showed it to a few CS friends, and now I’d love to share what I learned with you.

Hey friends 👋, let’s dive in!

What is a Bloom filter?

A Bloom filter is a probabilistic data structure used to quickly check whether an item might be in a set. Notice that word—might. Let’s make that concrete with a simple analogy.

Imagine you’re shopping with your mum. You pack a bag with apples, eggs, onions, and vegetables. Later, she asks:

Do we have eggs in the bag?

One way to check is to go through everything in the bag one by one. That’s what a traditional search does: it scans the whole list until it finds a match (or runs out of items). On a computer, this is what that would look like:

[Interactive demo: How traditional search would work. Searching for: apples, eggs, vegetable.]
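To make the one-by-one scan concrete, here is a minimal sketch of a traditional (linear) search over the shopping bag. The function name `linear_search` and the `bag` list are my own illustration, not code from the original demo.

```python
def linear_search(items, target):
    """Scan items one by one; return the index of target, or -1 if absent."""
    for index, item in enumerate(items):
        if item == target:
            return index  # found it, stop scanning
    return -1  # reached the end of the list without a match


bag = ["apples", "eggs", "onions", "vegetables"]

print(linear_search(bag, "eggs"))  # 1
print(linear_search(bag, "milk"))  # -1
```

In the worst case the loop touches every item, so the time grows with the size of the bag.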

But what if your bag could hold millions of items? Searching one by one would take forever! That’s where a Bloom filter comes in: it lets you quickly check if something might be there, without searching the entire list.

The tradeoff? You might get false positives (it says something is there when it’s not), but you’ll never get false negatives (it will never say something isn’t there when it actually is).

[Interactive demo: Probabilistic Search]

What’s the trick up a Bloom filter’s sleeve?

Surprisingly, Bloom filters do not actually store data directly. Instead, the filter uses a bit array (an array of zeros and ones).

An Empty Bloom Filter
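As a rough sketch, an empty filter is nothing more than a run of zeros. The size of 16 is an arbitrary choice for readability; real filters are usually much larger and pack the bits more tightly than a Python list does.

```python
SIZE = 16  # hypothetical size, kept small so the whole array fits on one line

# Every position starts at 0, meaning "nothing has been added here yet".
bits = [0] * SIZE

print(bits)
```

Notice that there is no place to put the word "eggs" itself, only positions that can be switched on.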

So, how are items added to the filter? This is the interesting part: specific indexes are flipped from 0 to 1, meaning that position is filled. So, how are words turned into indexes? You may know the answer to that question already: hashing.

Hashing is a one-way function for turning strings into random-looking numbers.

[Interactive demo: A simple hash implementation. Example output: 1572909631]
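If you want something to play with, here is one possible simple hash: 32-bit FNV-1a. This is my assumption for illustration, not necessarily the hash behind the demo, so it will not reproduce the exact numbers shown in this post.

```python
def fnv1a_32(text: str) -> int:
    """32-bit FNV-1a: mix each byte into a running hash value."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in text.encode("utf-8"):
        h ^= byte                          # fold the byte into the hash
        h = (h * 0x01000193) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h


SIZE = 16  # hypothetical bit array size

number = fnv1a_32("eggs")
index = number % SIZE  # squeeze the big number into a valid array position

print(number, index)
```

The modulo step is what turns a large hash value into an index that fits inside the bit array.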

Because a hash function is deterministic, it always gives the same output for the same string: victor always returns 226095160. A Bloom filter uses several hash functions to map an item to different positions in the bit array and flips those bits from 0 to 1. There is actually a formula for how many hash functions to use.
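For the curious, the standard textbook formulas look like this. The symbols are my own labels, not something from the tweet: m is the number of bits in the array, n is the number of items you expect to add, k is the number of hash functions, and p is the false positive rate you are willing to accept.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad
k = \frac{m}{n}\,\ln 2, \qquad
m = -\frac{n \ln p}{(\ln 2)^{2}}
```

As a worked example, for n = 1000 items and p = 0.01 (1%), the last formula gives m ≈ 9586 bits, and the middle one gives k ≈ 6.64, which you would round to 7 hash functions.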

This is the trick used by Bloom filters.
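Here is a small sketch of that trick in action. Instead of several truly different hash functions, it reuses one hash (32-bit FNV-1a, my assumption) with a different prefix each time, which is a common shortcut. The names `indexes` and `add` and the sizes are hypothetical.

```python
def fnv1a_32(text: str) -> int:
    """32-bit FNV-1a hash (an illustrative choice, not the post's hash)."""
    h = 0x811C9DC5
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def indexes(item: str, num_hashes: int, size: int) -> list:
    """Simulate num_hashes hash functions by prefixing the item with a seed."""
    return [fnv1a_32(f"{seed}:{item}") % size for seed in range(num_hashes)]


def add(bits: list, item: str, num_hashes: int = 3) -> None:
    """Flip the bit at each of the item's positions from 0 to 1."""
    for index in indexes(item, num_hashes, len(bits)):
        bits[index] = 1


bits = [0] * 32
add(bits, "eggs")

print(indexes("eggs", 3, 32))
print(bits)
```

After the call, only the few positions chosen by the hashes are set; the word "eggs" itself is stored nowhere.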

Adding Everything Together

Later, if you check for “eggs”, the same hash function(s) are used. If the corresponding bits are all 1, then the filter says “eggs” was probably added before.

Why “probably”? Because different items can sometimes hash to the same index — this is called a collision. For example, hashing “Victor” and “Kalu” might return the same result, making the Bloom filter think both items exist when only one was added.
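Collisions are not bad luck, they are unavoidable: once you hash more distinct items than there are slots, at least two must share a slot. The sketch below shows this with 9 made-up words and 8 slots, again using 32-bit FNV-1a as an assumed hash.

```python
def fnv1a_32(text: str) -> int:
    """32-bit FNV-1a hash (an illustrative choice, not the post's hash)."""
    h = 0x811C9DC5
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


SIZE = 8  # hypothetical number of slots
words = [f"item-{n}" for n in range(SIZE + 1)]  # 9 distinct words, 8 slots

# Group the words by the slot they hash to.
slots = {}
for word in words:
    slots.setdefault(fnv1a_32(word) % SIZE, []).append(word)

# Any slot holding more than one word is a collision.
collisions = {slot: ws for slot, ws in slots.items() if len(ws) > 1}
print(collisions)
```

With several hash functions, a false positive needs all of an item's positions to have been set by other items, which is rarer than a single collision but still possible.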

Even if you use better hash functions, collisions can still occur. That’s why Bloom filters give you probabilistic answers: they can guarantee that something is not in the set, but they can only say that something is probably in the set.
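Putting the pieces together, a minimal Bloom filter might look like the sketch below. The class name, the method names, the default sizes, and the choice of 32-bit FNV-1a with seed prefixes are all my assumptions for illustration, not the implementation Vercel uses.

```python
class BloomFilter:
    """A tiny Bloom filter: a bit array plus a few seeded hashes."""

    def __init__(self, size: int = 64, num_hashes: int = 3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size  # starts empty: all zeros

    @staticmethod
    def _hash(text: str) -> int:
        # 32-bit FNV-1a (an illustrative choice of hash)
        h = 0x811C9DC5
        for byte in text.encode("utf-8"):
            h ^= byte
            h = (h * 0x01000193) & 0xFFFFFFFF
        return h

    def _indexes(self, item: str) -> list:
        # One position per simulated hash function
        return [self._hash(f"{seed}:{item}") % self.size
                for seed in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for index in self._indexes(item):
            self.bits[index] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely not added"; True means "probably added"
        return all(self.bits[index] for index in self._indexes(item))


bag = BloomFilter()
for item in ["apples", "eggs", "onions", "vegetables"]:
    bag.add(item)

print(bag.might_contain("eggs"))  # True: added items are never missed
print(bag.might_contain("milk"))  # usually False, but a false positive is possible
```

An item that was added always comes back as True, because its bits were set and bits are never cleared. An item that was never added can still come back as True if other items happened to set all of its positions.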
