Bloom filters are an application of hashing. It is a data structure that allows us to check if an object has already been seen (without storing the object).
The idea is that we have an array of \(n\) bits, and \(k\) hash functions, each of which maps to \(\{0,1,\dots,n-1\}\). \(k\) need not be large.
Operations
Insertion
To note that an object \(x\) has been looked up, evaluate \(h_{i}(x)\) and set the \(k\) corresponding bits to 1 (regardless of their current value).
Look Up
To look up an object \(x\), evaluate \(h_{i}(x)\) and examine the \(k\) corresponding bits. If they are all 1, then we declare the object has been seen.
Properties
- There is no deletion.
- If a bit is 1, it is never set to 0.
- False positives can arise.
- False negatives cannot arise.
- Compared to hash tables, this is very space efficient.
- \(k\) is what enables this space efficiency. A naive approach would have been a bit per object, which is quite expensive. This way we can store \(\dbinom{n,k}\) objects.
- Taking unions and intersections are straightforward (AND and OR).
Applications
- Early spell checkers used this. A bloom filter was created and all dictionary words were inserted. Then when a text document was scanned, each word was checked with the filter. Of course, this let some incorrectly spelled words through. But the benefit in the old days is that one did not have to store the whole dictionary.
- Check for poor passwords: A “dictionary” of poor passwords was stored. If it matched, the user was forced to pick another password. False positives are harmless here (albeit annoying).