## Some notes on Bloom filters

#### « previous entry | next entry »

** 27th Jan 2008 | 17:32**

Bloom filters are a simple and clever data structure that every programmer should know about. A Bloom filter represents a set where the universe of possible members is very large and where it doesn't matter if the filter occasionally falsely claims that a non-member is in the set. It is useful for filtering lookups into more expensive (larger) data structures, so that most of the time you can avoid performing lookups that will fail. For example, a hyphenation program on a small computer might have some rules that cover most words, and a dictionary of exceptions; it can represent the entries in the exceptions dictionary as a Bloom filter that fits in RAM, so that it doesn't have to go to disk to find out that its standard rules apply.

The way it works is to represent the set as an array of *size* bits, which is typically a small multiple of the expected population of the set. To add an item to the set, you hash it with *nh* independent hash functions, and set the *nh* bits in the array indexed by the hashes. To check if an item is in the set, you hash it in the same way, and check that all *nh* bits are set. You get a false positive when hash collisions cause all the bits to be set by accident.

There are a few neat tricks you can perform with Bloom filters. You can take the union of two sets by ORing the two Bloom filters together. You can halve the size of a bloom filter by ORing the top half with the bottom half. You can't delete elements from a Bloom filter, though if false negatives are OK as well as false positives, you can represent the set of deletions as a second Bloom filter. Alternatively, you can use a counting Bloom filter, in which each bit is replaced with a small counter that is incremented for insertions and decremented for deletions. Four bit counters overflow with a probability of 1.37 ∗ 10^{-15}.

The following analysis is mostly based on Andrei Broder and Michael Mitzenmacher's survey of network applications of Bloom filters.

After *pop* elements have been added to the set, the probability that a particular bit is zero is

which is approximatelypz′= (1 − 1/size)^{nh ∗ pop}

The probability of a false positive ispz= exp(−nh∗pop/size)

Given a fixed array size and population, we can work out the number of hash functions that minimizes the false positive rate. Increasing

fpr= (1 − pz)^{nh}= exp( nh∗ ln( 1 −pz))

*nh*makes the exponent larger which reduces

*fpr*, but beyond a certain point the smaller

*pz*starts making

*fpr*larger again. First note that

which gives us

ln( pz)= − nh∗pop/sizenh= −ln( pz) ∗size/pop

Symmetry reveals that the minimum occurs when

ln( fpr)= nh∗ ln(1 −pz)= −( size/pop) ∗ ln(pz) ∗ ln(1 −pz)

*pz*= 1/2, i.e. half the bits are set, so when the array is full (i.e. at maximum information density),

The

nh= ln(2) ∗ size/popsize= pop∗nh/ ln(2)= pop∗nh∗ log_{2}(e)fpr= exp(− nh∗ ln(2))= 2 ^{−nh}

*size*is a factor of only log

_{2}(

*e*) ≈ 1.44 larger than the information-theoretic lower bound (which I'm not going to explain here).

Let's compare a Bloom filter with a simple one-bit-per-element hashed set, with a target false positive rate of 2^{-k}. If you set only one bit per element, then (assuming no hash collisions) *fpr* = *pop* / *size*, so *size* = *pop* ∗ 2^{k}, which is exponentially worse than *k*-hash Bloom filter.

It seems reasonable (for the application I have in mind) to set *size* = 16 ∗ *limit* and *nh* = 8, which gives us headroom for the population to overflow *limit* by 2 ∗ ln 2 ≈ 1.39 before *fpr* rises to 1/256.

Adam Kirsch and Michael Mitzenmacher show that, rather than requiring *nh* independent hash functions, we can use a set of functions of the form *h*_{1}(*x*) + *i* ∗ *h*_{2}(*x*) where 0 ≤ *i* < *nh*, without increasing the false positive rate. A hash function might be implemented by taking log_{2}(*size*) bits from the result of MD5(*x*), so without this trick, *nh* = 8 limits *size* to 2^{16}, whereas with the trick we only need two functions so the limit is large enough that we don't need to worry.

from:dwmalonedate:28th Jan 2008 21:49 (UTC)Permalink

I wonder if it would be worth adding a Bloom Filter to dirhash? It could speed up file creates, though maybe they are already usually fast enough.

## Reply | Thread

from:fanfdate:29th Jan 2008 12:44 (UTC)Permalink

For dirhash you might want to think about cache efficiency. Bloom filters are very random-access so if they are too big they'll increase the number of cache misses to look up a directory entry. (Random observation: dirhash uses a two-level array; is this in order to be friendly to the allocator? It isn't cache-friendly!) But this might all be completely irrelevant compared to the locking and other stuff in the lookup routine. Dunno.

## Reply | Parent | Thread

from:dwmalonedate:29th Jan 2008 20:19 (UTC)Permalink

## Reply | Parent | Thread