Created 2025/01/04 at 01:30AM

Last Modified 2025/01/25 at 06:39PM

What Is A Bloom Filter

A Bloom filter is a space-efficient probabilistic data structure for answering set membership queries: does an element x belong to a given set S? Its probabilistic nature means it can produce false positives; that is, it may declare that x belongs to S even when it does not. It never produces false negatives, though, and by sizing the filter appropriately, the false positive probability can be made small enough for many applications.

A standard Bloom filter consists of a bit vector of m bits, all initially 0, and k independent hash functions, each mapping an element to one of the m positions.

Adding an element

The element is passed through each of the k hash functions, producing k positions in the bit vector, each of which is set to 1.

Checking for membership

Compute the k hash values for the element and check the corresponding k bits in the vector. If any of them is 0, the element is definitely not in the set; if all of them are 1, the element is probably in the set (subject to the false positive probability).
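
To make these two operations concrete, here's a minimal sketch of a standard Bloom filter in Python -- illustrative only. The double-hashing trick (deriving k indexes from an MD5 and a SHA-1 hash) is just one possible choice, mirroring what the Counting Bloom Filter example later in this post uses.

import hashlib


class BloomFilter:
    def __init__(self, size: int, num_hashes: int):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size  # a real implementation would use a packed bit array

    def _hashes(self, item: str):
        item_bytes = item.encode("utf-8")
        h1 = int(hashlib.md5(item_bytes).hexdigest(), 16)
        h2 = int(hashlib.sha1(item_bytes).hexdigest(), 16)
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str):
        for pos in self._hashes(item):
            self.bits[pos] = 1

    def contains(self, item: str) -> bool:
        # any bit still 0 -> definitely absent; all bits 1 -> probably present
        return all(self.bits[pos] for pos in self._hashes(item))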

Math Behind Bloom Filter

After inserting n elements into a Bloom filter of size m bits using k hash functions, the probability that a specific bit remains 0 is

$$P(bit\ is\ 0) = {(1 - \frac{1}{m})}^{k \times n} \approx {e}^{-\frac{k \times n}{m}}$$

(Each hash evaluation sets any particular bit with probability 1/m, assuming the hash functions are uniform and independent.)

$$P(bit\ is\ 1) = 1 - P(bit\ is\ 0) = 1 - {e}^{-\frac{k \times n}{m}}$$

$$P(false\ positive) = {(1 - {e}^{-\frac{k \times n}{m}})}^{k}$$ since each of the k bits has to be 1 for a false positive.
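
As a quick sanity check, plugging in 10 bits per element (m/n = 10) and k = 7 hash functions gives

$$P(false\ positive) = {(1 - {e}^{-\frac{7}{10}})}^{7} \approx {(0.503)}^{7} \approx 0.008$$

i.e. roughly a 0.8% false positive rate, which lines up with the ~9.6 bits per element and k = 7 that the example script below derives for a 1% target.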

Minimizing false positive rate

Let $$r = -\frac{n}{m}$$, so that

$$p ={(1 - {e}^{k \times r})}^{k}$$

Taking the logarithm, $$ln(p) = k \times ln(1 - {e}^{k \times r})$$

Since ln is monotonic, minimizing p is the same as minimizing ln(p), and the logarithm makes the derivative easier to compute. Setting the derivative with respect to k to zero:

$$\frac{d}{dk} [k \times ln(1 - {e}^{k \times r})] = 0$$

$$ln(1 - {e}^{k \times r}) + k \times \frac{d}{dk} ln(1 - {e}^{k \times r}) = 0$$

$$ln(1 - {e}^{k \times r}) + k \times \frac{(e^{k \times r}) \times (-r)}{1 - e^{k \times r}} = 0$$

The above equation doesn't have a direct algebraic solution, but the optimum turns out to be $$k = \frac{m}{n} \times ln(2)$$ which gives an optimal false positive rate of $$p = {e}^{-\frac{m}{n} \times {ln(2)}^2}$$

In practice we usually work the other way around when using a Bloom filter in an application: we pick a desired false positive rate and, given an estimate of the number of elements, derive the size of the bit vector and the number of hash functions. The lower we set the desired false positive rate, the more memory the Bloom filter takes and the more hash computations each insert and lookup requires.
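
Rearranging the relations above gives the sizing formulas that the optimal_parameters helper in the example below uses:

$$m = -\frac{n \times ln(p)}{{ln(2)}^2}, \qquad k = \frac{m}{n} \times ln(2) = -log_2(p)$$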


Now we know the basic workings of a Bloom filter. But there's one obvious limitation here -- we cannot delete elements from a Bloom filter. To support deletion, there is a related data structure called a Counting Bloom Filter, which extends the standard Bloom filter by replacing the m bits with m counters that are incremented on insertion and decremented on deletion.

Example

Almost every website that offers user registration/login has a password reset feature: it takes an email address from the client and sends a reset link for the account associated with that address. To avoid SMTP calls for nonexistent accounts, you would usually first check whether the email is registered, which means a database call. But if someone spams this endpoint, it leads to a lot of database calls, which is undesirable. This can be mitigated in various ways, but let's try using a Counting Bloom Filter to solve it.

# main.py
import hashlib
import math
import time
import typing

from contextlib import contextmanager


@contextmanager
def timer():
    s_time = time.perf_counter()
    try:
        yield
    finally:
        e_time = time.perf_counter()
        print(f"{'-' * 33}\nExecution took {e_time - s_time:.6f} seconds\n{'-' * 33}")


class CountingBloomFilter:
    def __init__(self, size: int, num_hashes: int):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _hashes(self, item: str) -> typing.Generator[int, None, None]:
        item_bytes = item.encode("utf-8")
        hash1 = int(hashlib.md5(item_bytes).hexdigest(), 16)
        hash2 = int(hashlib.sha1(item_bytes).hexdigest(), 16)
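        # Double hashing: derive all k indexes from the two base hashes above.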
        for i in range(self.num_hashes):
            combined_hash = hash1 + i * hash2
            yield combined_hash % self.size

    def add(self, item: str):
        for pos in self._hashes(item):
            # Cap at 65535 to mimic fixed-width 16-bit counters (Python ints themselves don't overflow).
            if self.counters[pos] < 65535:
                self.counters[pos] += 1

    def remove(self, item: str):
        for pos in self._hashes(item):
            # Counters already at zero are left untouched.
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def contains(self, item: str) -> bool:
        for pos in self._hashes(item):
            if self.counters[pos] == 0:
                return False  # Definitely not in the set
        return True  # Might be in the set


def optimal_parameters(n: int, fp_prob: float) -> typing.Tuple[int, int]:
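    # From the derivation above: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.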
    m = -(n * math.log(fp_prob)) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return math.ceil(m), math.ceil(k)


def main():
    num_elements = 1000000
    false_positive_rates = [
        0.01,  #  1%
        0.05,  #  5%
        0.1,  # 10%
        0.25,  # 25%
        0.5,  # 50%
        0.99,  # 99%
    ]

    for false_positive_rate in false_positive_rates:
        m, k = optimal_parameters(num_elements, false_positive_rate)

        print("Counting Bloom Filter Parameters:")
        print(f"Number of elements (n): {num_elements}")
        print(f"False positive rate (p): {false_positive_rate * 100:.2f}%")
        print(
            f"Size of counter array (m): {m} counters ({m / 1024 / 1024:.2f} MB assuming 1 byte per counter)"
        )
        print(f"Number of hash functions (k): {k}")
        print()

        emails = [f"user{i}@example.com" for i in range(1, 101)]

        test_emails_1 = [
            ("user101@example.com", False),
            ("user3@example.com", True),
            ("user69@example.com", True),
            ("user42@example.com", True),
            ("user91@example.com", True),
            ("user0@example.com", False),
        ]

        delete_emails = [
            "user3@example.com",
            "user42@example.com",
        ]

        test_emails_2 = [
            ("user101@example.com", False),
            ("user3@example.com", False),
            ("user69@example.com", True),
            ("user42@example.com", False),
            ("user91@example.com", True),
            ("user0@example.com", False),
        ]

        not_existing = [f"user{i}@example.com" for i in range(101, 1000000)]

        cbf = CountingBloomFilter(size=m, num_hashes=k)

        for email in emails:
            cbf.add(email)

        with timer():
            print("Checking membership...")
            for email, exist in test_emails_1:
                flag = cbf.contains(email)
                if flag:
                    if not exist:
                        print(f"found a false positive for {email}")
                else:
                    if exist:
                        raise Exception(f"oh no, {email} is a false negative!")

            print("Deleting some members...")
            for email in delete_emails:
                cbf.remove(email)

            print("Checking membership again...")
            for email, exist in test_emails_2:
                flag = cbf.contains(email)
                if flag:
                    if not exist:
                        print(f"found a false positive for {email}")
                else:
                    if exist:
                        raise Exception(f"oh no, {email} is a false negative!")

            real_fps = 0
            for email in not_existing:
                if cbf.contains(email):
                    real_fps += 1
            print(
                f"Realtime false positive rate: {real_fps} out of {len(not_existing)}"
            )


if __name__ == "__main__":
    main()

We set a high value of n (an estimate of the number of users in our database, allowing for growth) and test the Counting Bloom Filter at different false positive rates to see how the execution time and memory usage differ. We also run our filter against a large number of extra inputs to show how it produces false positives.

When cbf.contains(...) returns true, we should always confirm against our persistent storage (in this case, our database) that the email really exists. But when it returns false (for example, a spammer probing random addresses), we skip the expensive database call entirely.
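
As a rough sketch of how that gating could look in the reset-password handler -- here db_email_exists and send_reset_email are hypothetical helpers standing in for your database layer and mailer:

def handle_password_reset(cbf: CountingBloomFilter, email: str) -> bool:
    # Definitely not a registered email: reject without touching the database.
    if not cbf.contains(email):
        return False
    # Possibly registered (could be a false positive): confirm with the database.
    if not db_email_exists(email):  # hypothetical database lookup
        return False
    send_reset_email(email)  # hypothetical SMTP call
    return True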

Let's run this and see the output

$ python main.py

Counting Bloom Filter Parameters:
Number of elements (n): 1000000
False positive rate (p): 1.00%
Size of counter array (m): 9585059 counters (9.14 MB assuming 1 byte per counter)
Number of hash functions (k): 7

Checking membership...
Deleting some members...
Checking membership again...
Realtime false positive rate: 0 out of 999899
---------------------------------
Execution took 1.425833 seconds
---------------------------------
Counting Bloom Filter Parameters:
Number of elements (n): 1000000
False positive rate (p): 5.00%
Size of counter array (m): 6235225 counters (5.95 MB assuming 1 byte per counter)
Number of hash functions (k): 5

Checking membership...
Deleting some members...
Checking membership again...
Realtime false positive rate: 0 out of 999899
---------------------------------
Execution took 1.431639 seconds
---------------------------------
Counting Bloom Filter Parameters:
Number of elements (n): 1000000
False positive rate (p): 10.00%
Size of counter array (m): 4792530 counters (4.57 MB assuming 1 byte per counter)
Number of hash functions (k): 4

Checking membership...
Deleting some members...
Checking membership again...
Realtime false positive rate: 0 out of 999899
---------------------------------
Execution took 1.429877 seconds
---------------------------------
Counting Bloom Filter Parameters:
Number of elements (n): 1000000
False positive rate (p): 25.00%
Size of counter array (m): 2885391 counters (2.75 MB assuming 1 byte per counter)
Number of hash functions (k): 2

Checking membership...
Deleting some members...
Checking membership again...
Realtime false positive rate: 0 out of 999899
---------------------------------
Execution took 1.407195 seconds
---------------------------------
Counting Bloom Filter Parameters:
Number of elements (n): 1000000
False positive rate (p): 50.00%
Size of counter array (m): 1442696 counters (1.38 MB assuming 1 byte per counter)
Number of hash functions (k): 1

Checking membership...
Deleting some members...
Checking membership again...
Realtime false positive rate: 76 out of 999899
---------------------------------
Execution took 1.403171 seconds
---------------------------------
Counting Bloom Filter Parameters:
Number of elements (n): 1000000
False positive rate (p): 99.00%
Size of counter array (m): 20919 counters (0.02 MB assuming 1 byte per counter)
Number of hash functions (k): 1

Checking membership...
Deleting some members...
Checking membership again...
Realtime false positive rate: 4689 out of 999899
---------------------------------
Execution took 1.378936 seconds
---------------------------------

As we can see, as the desired false positive probability increases, the memory needed drops sharply and the execution time decreases slightly, but we also start seeing actual false positives at runtime.

Limitations

Above, we set a very large n up front. This is a limitation of Bloom filters: if the number of inserted elements surpasses the number the filter was constructed for, the false positive rate increases rapidly, and the only way to deal with it is to re-create the Bloom filter with larger parameters.
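
The earlier formula makes this concrete: a filter sized for a 1% false positive rate (about m/n = 9.6 bits per element, k = 7) that ends up holding twice the planned number of elements has an expected rate of

$$p = {(1 - {e}^{-\frac{7 \times 2 \times n}{9.6 \times n}})}^{7} = {(1 - {e}^{-1.46})}^{7} \approx 0.16$$

about 16% instead of 1%, from only a 2x overshoot.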

Let's see what happens to our script if we change n from 1000000 to 1000.

$ python main.py

Counting Bloom Filter Parameters:
Number of elements (n): 1000
False positive rate (p): 1.00%
Size of counter array (m): 9586 counters (0.01 MB assuming 1 byte per counter)
Number of hash functions (k): 7

Checking membership...
Deleting some members...
Checking membership again...
Realtime false positive rate: 9 out of 999899
---------------------------------
Execution took 1.387928 seconds
---------------------------------
Counting Bloom Filter Parameters:
Number of elements (n): 1000
False positive rate (p): 5.00%
Size of counter array (m): 6236 counters (0.01 MB assuming 1 byte per counter)
Number of hash functions (k): 5

Checking membership...
Deleting some members...
Checking membership again...
Realtime false positive rate: 24 out of 999899
---------------------------------
Execution took 1.398940 seconds
---------------------------------
...

As we can see, we now get real-time false positives even for the smallest desired false positive rate of 1% (0.01).

The above is a very naive implementation of a Bloom filter. If you want to integrate Bloom filters into your application, I suggest using a well-maintained open-source library in the programming language of your choice.