
fanf

Searching a sorted array faster than O(log(N))

28th Jul 2009 | 18:39

The usual way to find an element in a sorted array is a binary search, which takes log(N) time, where logs are understood to be in base 2 and N is the size of the array.
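
For comparison, here's a minimal sketch of that standard binary search over a sorted array of doubles (the function name and signature are my own):

	/* classic binary search: return the index of vtarg in the
	   sorted array, or -1 if it is not present */
	int binary_search(const double *array, int N, double vtarg) {
		int lo = 0, hi = N - 1;
		while(lo <= hi) {
			int mid = lo + (hi - lo) / 2; /* avoids overflow of lo + hi */
			if(array[mid] == vtarg) return(mid);
			if(array[mid] < vtarg) lo = mid + 1;
			else hi = mid - 1;
		}
		return(-1);
	}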

Linus Torvalds made the clever observation that you can do better than log(N) if you know that the contents of the array are uniformly distributed. For instance, git's pack files store multiple objects identified by the SHA-1 hashes of their contents. Each pack file has an index containing a list of the pack's SHA-1 IDs and their objects' locations in the pack. The index is sorted by SHA-1 ID. Since SHA-1 is a cryptographic hash, we can assume its output is uniformly distributed.

Linus described his technique as "Newton-Raphson" which is a bit of a misnomer, since N-R works on smooth differentiable curves whereas what we have is a straight line with some stochastic variations. What we're actually doing is an iterated linear interpolation. If the SHA-1 IDs were perfectly evenly distributed then a single linear interpolation would land us right on the target item, but the random variation means we will be off by some amount, so we need to continue searching.

How far off will we be? It turns out (based on Monte Carlo simulation) that the expected error is about 0.31 * sqrt(N) with a standard deviation of about 0.26 * sqrt(N). This is a really promising result since it implies that each iteration reduces the search space to sqrt(N), whereas an iteration of binary search reduces it to N/2. Repeatedly taking square roots brings N down to a constant in about log(log(N)) steps, so we should expect a complete search to take O(log(log(N))) iterations.
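
Here's a sketch of the kind of Monte Carlo experiment that produces those numbers: generate sorted arrays of uniform random values, make a single linear interpolation for a randomly chosen element, and measure how far the guess lands from the element's true position. (This is my reconstruction using POSIX drand48(), not the original simulation; the array size and trial count are arbitrary.)

	#include <math.h>
	#include <stdio.h>
	#include <stdlib.h>

	static int cmp(const void *a, const void *b) {
		double x = *(const double *)a, y = *(const double *)b;
		return (x > y) - (x < y);
	}

	int main(void) {
		enum { N = 1 << 20, TRIALS = 100 };
		static double array[N];
		double sum = 0.0;
		for(int t = 0; t < TRIALS; t++) {
			for(int k = 0; k < N; k++)
				array[k] = drand48(); /* uniform in [0.0, 1.0) */
			qsort(array, N, sizeof(double), cmp);
			int ktarg = lrand48() % N;
			/* one linear interpolation over the whole array */
			int guess = (int)(N * array[ktarg]);
			sum += abs(guess - ktarg);
		}
		printf("mean error = %.3f * sqrt(N)\n", sum / TRIALS / sqrt((double)N));
		return 0;
	}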

I wrote a simulation to try this out, and it matches this prediction: in fact the number of iterations was about 1 + log(log(N)). However, what is the variation around this expected result? In my tests the maximum number of probes was log(N), though for small N it bottomed out at about 16. When testing lots of different randomly filled arrays, the standard deviation was about 1.2 for all values of N, but when I tested fewer arrays this number ramped up.

Junio Hamano's implementation of Linus's idea is included in git but disabled by default. He added a tweak that biases the linear interpolation towards the centre of the search range, so it's kind of a balance between binary search and linear interpolation search. In my simulator this tweaked version required (log(N)+3)/2 iterations on average with a standard deviation of 0.8. The maximum number of iterations was again log(N) but it bottomed out at about 12. Overall it's a bit slower but better behaved.
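
I won't reproduce Junio's exact arithmetic here, but the flavour of the tweak can be sketched by nudging the interpolated guess towards the plain binary-search midpoint. (The 50/50 weighting below is my invention for illustration, not git's actual constant.)

	/* hypothetical sketch of a centre-biased probe, not git's code:
	   average the interpolated guess with the binary-search midpoint */
	static int biased_probe(int klo, int khi, double vlo, double vhi, double vtarg) {
		int interp = klo + (int)((khi - klo) * (vtarg - vlo) / (vhi - vlo));
		int middle = klo + (khi - klo) / 2;
		int kmid = (interp + middle) / 2;
		/* keep the probe strictly inside the exclusive bounds */
		if(kmid <= klo) kmid = klo + 1;
		if(kmid >= khi) kmid = khi - 1;
		return(kmid);
	}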

In git, where a large repository might contain two million objects, and where pack index lookups are not particularly performance-critical, this improved lookup code doesn't provide a noticeable advantage. Still, I think it's interesting and the idea might be useful in other situations. Note that unlike a binary search, which can just use comparisons returning greater / equal / less, the linear interpolation search needs to know the absolute values of the elements. Git's code actually uses a lexicographic variant that ignores any common prefix shared by the elements in the search range, and uses only the next two bytes for the interpolation.
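
To illustrate the two-byte trick (my sketch of the idea, not git's actual code): given the keys at the two ends of the search range, skip their common prefix, then read the next two bytes of each key as a 16-bit integer, which plays the role of the floating-point value in the interpolation.

	#include <stddef.h>
	#include <stdint.h>

	/* number of leading bytes shared by the keys at the range ends,
	   capped so that two bytes always remain for the interpolation */
	static size_t shared_prefix(const uint8_t lo[20], const uint8_t hi[20]) {
		size_t i = 0;
		while(i < 18 && lo[i] == hi[i]) i++;
		return i;
	}

	/* the next two bytes of a 20-byte key, as an interpolation value */
	static unsigned key16(const uint8_t key[20], size_t skip) {
		return (unsigned)key[skip] << 8 | key[skip + 1];
	}

The probe index is then computed as in the floating-point version, with these 16-bit values standing in for vlo, vhi and vtarg.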

To finish, here's a bit of code, wrapped up as a function so that it compiles. In this example 0.0 <= array[k] < 1.0; I use k for keys (array indexes) and v for values of array elements. We are searching for vtarg.

	#include <float.h> /* for DBL_MIN */

	int interpolation_search(const double *array, int N, double vtarg) {
		/* all bounds are exclusive */
		double vlo = -DBL_MIN, vhi = +1.0;
		int klo = -1, khi = N;
		while(khi - klo > 1) {
			int kmid = klo + (int)((khi-klo) * (vtarg-vlo) / (vhi-vlo));
			/* ensure rounding does not put us out of bounds */
			if(kmid <= klo) kmid = klo + 1;
			if(kmid >= khi) kmid = khi - 1;
			double vmid = array[kmid];
			if(vmid == vtarg) return(kmid);
			if(vmid < vtarg) klo = kmid, vlo = vmid;
			if(vmid > vtarg) khi = kmid, vhi = vmid;
		}
		return(-1);
	}
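
A quick way to exercise it is to build a sorted array of uniform random values and look up an element that is definitely present. (This driver is mine, not part of the original code; it assumes the function above is in the same file and uses POSIX drand48().)

	#include <stdio.h>
	#include <stdlib.h>

	static int cmp(const void *a, const void *b) {
		double x = *(const double *)a, y = *(const double *)b;
		return (x > y) - (x < y);
	}

	int main(void) {
		enum { N = 1000000 };
		static double array[N];
		for(int k = 0; k < N; k++)
			array[k] = drand48(); /* uniform in [0.0, 1.0) */
		qsort(array, N, sizeof(double), cmp);
		int k = interpolation_search(array, N, array[1234]);
		printf("found at %d\n", k); /* 1234, barring duplicate values */
		return 0;
	}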

Addendum: there are a few corrections in a follow-up post.


Comments (6)


from: gareth_rees
date: 28th Jul 2009 23:17 (UTC)

"Linus Torvalds made the clever observation"

Donald Knuth writes [The Art of Computer Programming §6.2.1], "Interpolation searching was suggested by W. W. Peterson [IBM J. Res. & Devel. 1 (1957), 131–132]" and in exercise 22 he asks, "Show that an appropriate formulation of interpolation searching requires asymptotically lg lg N comparisons, on the average, when applied to N independent uniform random keys that have been sorted." with a reference to A. C. Yao & F. F. Yao, "The complexity of searching an ordered random table", Proceedings of FOCS 1976, 173–177.


from: ext_44906
date: 29th Jul 2009 13:13 (UTC)

It does seem that Linus is the sort of guy that would rather hack out some code than sit down and read some literature about what people have done before.



from: gareth_rees
date: 29th Jul 2009 17:36 (UTC)

That's not such a bad thing, and I didn't mean to offer any criticism of Torvalds or others; in fact, I agree with fanf that it was a clever observation. Inventing algorithms is what we programmers do every week, and I don't imagine that very many of the algorithms we come up with are original to us. It's for scholars and historians to assign priority to ideas.


Tony Finch

from: fanf
date: 29th Jul 2009 17:35 (UTC)

Thanks for those references. Dunno why I didn't look at Knuth - more fun to play with the code, I guess :-) I also have a horrible suspicion it might have been part of the CST syllabus...


Simon Tatham

from: simont
date: 29th Jul 2009 08:36 (UTC)

I don't think calling it Newton-Raphson is that much of a misnomer. The context is rather different, yes, but the technique is almost exactly analogous: N-R is iterated linear interpolation used to compensate for small deviations from linearity in a basically straightish target function. Even the rate of convergence is exactly what you expect from N-R (squaring the error term at every stage means you expect to terminate in O(log(log(error tolerance))) time).


from: dwmalone
date: 29th Jul 2009 19:51 (UTC)

Isn't it more like the Secant Method?
