Bloom Filters

Published: 5/14/2026 | Author: Alex Merced

probabilistic data structuresfile skippingquery speedindexing

Introduction to Bloom Filters

In the quest for ultimate query performance in massive Data Lakehouses, the primary goal is always to read as little data from disk as possible. We use techniques like Predicate Pushdown to check the min/max statistics of a Parquet or ORC file. If we are looking for customer_id = 50, and the file’s metadata says its minimum ID is 100 and its maximum ID is 200, we instantly skip the file.

However, min/max statistics have a severe limitation: they are completely useless for data that is unsorted or highly randomized.

Imagine looking for a specific UUID (e.g., a1b2-c3d4...) or a specific email address (john.doe@email.com) in a 1GB Parquet file. Because strings are often written randomly, the minimum value in the file might be “A”, and the maximum value might be “Z”. The min/max stats provide no help. The query engine is forced to read the entire 1GB file into memory just to see if John Doe exists inside it.

This is where the Bloom Filter steps in. A Bloom Filter is a highly space-efficient, probabilistic data structure designed to answer one specific question with blazing speed: “Does this exact value exist in this dataset?”

How a Bloom Filter Works

Invented by Burton Howard Bloom in 1970, the data structure abandons the idea of storing the actual data (like a traditional B-Tree index) and instead relies entirely on hashing and bit arrays.

The Construction Phase

The Bit Array: The system creates a blank array of bits, all set to 0 (e.g., [0, 0, 0, 0, 0, 0, 0, 0]).
Multiple Hash Functions: The system utilizes several different, mathematically distinct hash functions (e.g., Hash A, Hash B, Hash C).
Inserting Data: When the string “apple” is written into the Parquet file, the system passes “apple” through the three hash functions.
- Hash A outputs 2.
- Hash B outputs 5.
- Hash C outputs 7. The system changes the bits at positions 2, 5, and 7 to 1. The string “apple” is now “stored” in the filter.

The Query Phase

When a user runs SELECT * FROM table WHERE word = 'apple', the engine does not read the data file. It consults the Bloom Filter. It hashes “apple” using the exact same three functions, resulting in 2, 5, and 7. It checks the bit array at those positions. Because all three positions are 1, the Bloom Filter replies: “Yes, ‘apple’ is probably in this file.” The engine proceeds to read the file.

Now, imagine the user searches for “banana”. The hashes output 1, 3, and 7. The engine checks the array. Position 7 is a 1, but positions 1 and 3 are 0. The Bloom Filter replies: “No, ‘banana’ is definitely NOT in this file.” The engine instantly skips the massive file.

The Magic of False Positives

Notice the language in the previous section. A Bloom Filter can say “definitely no,” but it can only say “probably yes.”

Because multiple different words might accidentally hash to the exact same bit positions (a hash collision), the filter might occasionally tell the query engine that a word exists in the file when it actually doesn’t. This is a False Positive.

In big data analytics, false positives are completely acceptable. If the filter is wrong 2% of the time, the engine wastes a few milliseconds reading a file unnecessarily. But if the filter correctly skips the file the other 98% of the time, the aggregate performance gain across millions of files is astronomical.

Bloom Filters in the Lakehouse

Modern analytical file formats like Apache Parquet and Apache ORC support native Bloom Filters embedded directly in their file footers.

When configuring a Data Lakehouse, data engineers actively choose which columns should have Bloom Filters built for them during the ingestion process.

Good Candidates: High-cardinality, unique identifiers like UUIDs, email addresses, phone numbers, or session IDs.
Bad Candidates: Low-cardinality columns (like Gender or State), because every bit in the array would quickly turn to 1, rendering the filter useless.

Conclusion

Bloom Filters are a masterclass in computer science trade-offs. By accepting a tiny, mathematically controlled margin of error, they condense the searchability of gigabytes of data into a microscopic bit array taking up mere kilobytes of RAM. In the architecture of modern open formats, embedding Bloom Filters alongside min/max statistics ensures that query engines can rapidly eliminate massive swaths of irrelevant data, regardless of how chaotically that data is sorted.

Deepen Your Knowledge

Ready to take the next step in mastering the Data Lakehouse? Dive deeper with my authoritative guides and practical resources.

Explore Alex's Books