## Searching a sorted array faster than O(log(N))


**28th Jul 2009 | 18:39**

The usual way to find an element in a sorted array is a binary search, which takes log(N) time, where logs are understood to be in base 2 and N is the size of the array.

Linus Torvalds made the clever observation that you can do better than log(N) if you know that the contents of the array are uniformly distributed. For instance, git's pack files store multiple objects identified by the SHA-1 hashes of their contents. Each pack file has an index containing a list of the pack's SHA-1 IDs and their objects' locations in the pack. The index is sorted by SHA-1 ID. Since SHA-1 is a cryptographic hash, we can assume its output is uniformly distributed.

Linus described his technique as "Newton-Raphson" which is a bit of a misnomer, since N-R works on smooth differentiable curves whereas what we have is a straight line with some stochastic variations. What we're actually doing is an iterated linear interpolation. If the SHA-1 IDs were perfectly evenly distributed then a single linear interpolation would land us right on the target item, but the random variation means we will be off by some amount, so we need to continue searching.

How far off will we be? It turns out (based on Monte Carlo simulation) that the expected error is about 0.31 * sqrt(N) with a standard deviation of about 0.26 * sqrt(N). This is a really promising result since it implies that each iteration reduces the search space to about sqrt(N), whereas an iteration of binary search reduces it to N/2. So we should expect a complete search to take O(log(log(N))) iterations.

I wrote a simulation to try this out, and it matches this prediction: in fact the number of iterations was about 1 + log(log(N)). However what is the variation around this expected result? In my tests it turned out that the maximum number of probes was log(N) though for small N it bottomed out at about 16. When testing lots of different randomly filled arrays, the standard deviation was about 1.2 for all values of N, but when I tested fewer arrays this number ramped up.

Junio Hamano's implementation of Linus's idea is included in git but disabled by default. He added a tweak that biases the linear interpolation towards the centre of the search range, so it's kind of a balance between binary search and linear interpolation search. In my simulator this tweaked version required (log(N)+3)/2 iterations on average with a standard deviation of 0.8. The maximum number of iterations was again log(N) but it bottomed out at about 12. Overall it's a bit slower but better behaved.

In git, where a large repository might contain two million objects, and where pack index lookups are not particularly performance-critical, this improved lookup code doesn't provide a noticeable advantage. Still, I think it's interesting and the idea might be useful in other situations. Note that unlike a binary search, which can just use comparisons returning greater / equal / less, the linear interpolation search needs to know the absolute values of the elements. Git's code actually uses a lexicographic variant that ignores any common prefix shared by the elements in the search range, and uses only the next two bytes for the interpolation.

To finish, here's a bit of code. In this example, 0.0 <= `array[k]` < 1.0, and I use k for keys and v for values of array elements. We are searching for `vtarg`.

```c
/* all bounds are exclusive */
double vlo = -DBL_MIN, vhi = +1.0;
int klo = -1, khi = N;
while (khi - klo > 1) {
    int kmid = klo + (int)((khi - klo) * (vtarg - vlo) / (vhi - vlo));
    /* ensure rounding does not put us out of bounds */
    if (kmid <= klo) kmid = klo + 1;
    if (kmid >= khi) kmid = khi - 1;
    double vmid = array[kmid];
    if (vmid == vtarg) return kmid;
    if (vmid < vtarg) klo = kmid, vlo = vmid;
    else              khi = kmid, vhi = vmid;
}
return -1;
```

**Addendum:** there are a few corrections in a follow-up post.

**gareth_rees**, 28th Jul 2009 23:17 (UTC):

> Linus Torvalds made the clever observation

Donald Knuth writes (The Art of Computer Programming, §6.2.1), "Interpolation searching was suggested by W. W. Peterson [IBM J. Res. & Devel. 1 (1957), 131–132]", and in exercise 22 he asks, "Show that an appropriate formulation of interpolation searching requires asymptotically lg lg N comparisons, on the average, when applied to N independent uniform random keys that have been sorted.", with a reference to A. C. Yao & F. F. Yao, "The complexity of searching an ordered random table", Proceedings of FOCS 1976, 173–177.


**gareth_rees**, 29th Jul 2009 17:36 (UTC):

> **fanf**: that it was a clever observation

Inventing algorithms is what we programmers do every week, and I don't imagine that very many of the algorithms we come up with are original to us. It's for scholars and historians to assign priority to ideas.


**simont**, 29th Jul 2009 08:36 (UTC):

Newton-Raphson *is* iterated linear interpolation, used to compensate for small deviations from linearity in a basically straightish target function. Even the rate of convergence is exactly what you expect from N-R (squaring the error term at every stage means you expect to terminate in O(log(log(error tolerance))) time).
