Log in

No account? Create an account


Some notes on Bloom filters

« previous entry | next entry »
27th Jan 2008 | 17:32

Bloom filters are a simple and clever data structure that every programmer should know about. A Bloom filter represents a set where the universe of possible members is very large and where it doesn't matter if the filter occasionally falsely claims that a non-member is in the set. It is useful for filtering lookups into more expensive (larger) data structures, so that most of the time you can avoid performing lookups that will fail. For example, a hyphenation program on a small computer might have some rules that cover most words, and a dictionary of exceptions; it can represent the entries in the exceptions dictionary as a Bloom filter that fits in RAM, so that it doesn't have to go to disk to find out that its standard rules apply.

The way it works is to represent the set as an array of size bits, which is typically a small multiple of the expected population of the set. To add an item to the set, you hash it with nh independent hash functions, and set the nh bits in the array indexed by the hashes. To check if an item is in the set, you hash it in the same way, and check that all nh bits are set. You get a false positive when hash collisions cause all the bits to be set by accident.

There are a few neat tricks you can perform with Bloom filters. You can take the union of two sets by ORing the two Bloom filters together. You can halve the size of a bloom filter by ORing the top half with the bottom half. You can't delete elements from a Bloom filter, though if false negatives are OK as well as false positives, you can represent the set of deletions as a second Bloom filter. Alternatively, you can use a counting Bloom filter, in which each bit is replaced with a small counter that is incremented for insertions and decremented for deletions. Four bit counters overflow with a probability of 1.37 ∗ 10-15.

The following analysis is mostly based on Andrei Broder and Michael Mitzenmacher's survey of network applications of Bloom filters.

After pop elements have been added to the set, the probability that a particular bit is zero is

pz′ = (1 − 1/size) nhpop
which is approximately
pz = exp(−nhpop / size)
The probability of a false positive is
fpr= (1 − pz) nh
= exp(nh ∗ ln( 1 − pz))
Given a fixed array size and population, we can work out the number of hash functions that minimizes the false positive rate. Increasing nh makes the exponent larger which reduces fpr, but beyond a certain point the smaller pz starts making fpr larger again. First note that
ln(pz)= −nhpop / size
nh= −ln(pz) ∗ size / pop
which gives us
ln(fpr)= nh ∗ ln(1 − pz)
= −(size / pop) ∗ ln(pz) ∗ ln(1 − pz)
Symmetry reveals that the minimum occurs when pz = 1/2, i.e. half the bits are set, so when the array is full (i.e. at maximum information density),
nh= ln(2) ∗ size / pop
size= popnh / ln(2)
= popnh ∗ log2(e)
fpr= exp(−nh ∗ ln(2))
= 2nh
The size is a factor of only log2(e) ≈ 1.44 larger than the information-theoretic lower bound (which I'm not going to explain here).

Let's compare a Bloom filter with a simple one-bit-per-element hashed set, with a target false positive rate of 2-k. If you set only one bit per element, then (assuming no hash collisions) fpr = pop / size, so size = pop ∗ 2k, which is exponentially worse than k-hash Bloom filter.

It seems reasonable (for the application I have in mind) to set size = 16 ∗ limit and nh = 8, which gives us headroom for the population to overflow limit by 2 ∗ ln 2 ≈ 1.39 before fpr rises to 1/256.

Adam Kirsch and Michael Mitzenmacher show that, rather than requiring nh independent hash functions, we can use a set of functions of the form h1(x) + ih2(x) where 0 ≤ i < nh, without increasing the false positive rate. A hash function might be implemented by taking log2(size) bits from the result of MD5(x), so without this trick, nh = 8 limits size to 216, whereas with the trick we only need two functions so the limit is large enough that we don't need to worry.

| Leave a comment |

Comments {3}

from: dwmalone
date: 28th Jan 2008 21:49 (UTC)

I guess this is a situation where MD5 is no longer a good example 'cos of the collision attacks, at least in some applications? For example, I think innd used to use a Bloom filter to do a quick check on article IDs, if the filter said no, then you didn't have to search the on-disk database, you could add the article ID to the database and the Bloom filter. You could make innd work a lot harder by feeding it many articles that all hashed to the same bit.

I wonder if it would be worth adding a Bloom Filter to dirhash? It could speed up file creates, though maybe they are already usually fast enough.

Reply | Thread

Tony Finch

from: fanf
date: 29th Jan 2008 12:44 (UTC)

Yes, I suppose that is a concern for something like message-IDs. I'm planning to use the filter for counting unique email addresses in a specified time period, so a collision attack would only be useful if you wanted to spam your own cleverly-prepared accounts, or (I guess) if all your victims use suffix addressing.

For dirhash you might want to think about cache efficiency. Bloom filters are very random-access so if they are too big they'll increase the number of cache misses to look up a directory entry. (Random observation: dirhash uses a two-level array; is this in order to be friendly to the allocator? It isn't cache-friendly!) But this might all be completely irrelevant compared to the locking and other stuff in the lookup routine. Dunno.

Reply | Parent | Thread

from: dwmalone
date: 29th Jan 2008 20:19 (UTC)

The two level allocations in dirhash were to be kind to the allocator. I must check if things have got smarter 'cos it would be handy to avoid them. I suspect dirhash already causes quite a lot of cache trashing, though I suppose the fact that it uses linear probing probably helps it be a little more cache friendly (though that isn't why it it was chosen).

Reply | Parent | Thread