?

Log in

No account? Create an account

fanf

qp tries: smaller and faster than crit-bit tries

« previous entry | next entry »
4th Oct 2015 | 21:03

tl;dr: I have developed a data structure called a "qp trie", based on the crit-bit trie. Some simple benchmarks say qp tries have about 1/3 less memory overhead and are about 10% 30% faster than crit-bit tries.

"qp trie" is short for "quadbit popcount patricia trie". (Nothing to do with cutie cupid dolls or Japanese mayonnaise!)

Get the code from http://dotat.at/prog/qp/.

background

Crit-bit tries are an elegant space-optimised variant of PATRICIA tries. Dan Bernstein has a well-known description of crit-bit tries, and Adam Langley has nicely annotated DJB's crit-bit implementation.

What struck me was crit-bit tries require quite a lot of indirections to perform a lookup. I wondered if it would be possible to test multiple bits at a branch point to reduce the depth of the trie, and make the size of the branch adapt to the trie's density to keep memory usage low. My initial attempt (over two years ago) was vaguely promising but too complicated, and I gave up on it.

A few weeks ago I read about Phil Bagwell's hash array mapped trie (HAMT) which he described in two papers, "fast and space efficient trie searches", and "ideal hash trees". The part that struck me was the popcount trick he uses to eliminate unused pointers in branch nodes. (This is also described in "Hacker's Delight" by Hank Warren, in the "applications" subsection of chapter 5-1 "Counting 1 bits", which evidently did not strike me in the same way when I read it!)

You can use popcount() to implement a sparse array of length N containing M < N members using bitmap of length N and a packed vector of M elements. A member i is present in the array if bit i is set, so M == popcount(bitmap). The index of member i in the packed vector is the popcount of the bits preceding i.

    mask = 1 << i;
    if(bitmap & mask)
        member = vector[popcount(bitmap & mask-1)]

qp tries

If we are increasing the fanout of crit-bit tries, how much should we increase it by, that is, how many bits should we test at once? In a HAMT the bitmap is a word, 32 or 64 bits, using 5 or 6 bits from the key at a time. But it's a bit fiddly to extract bit-fields from a string when they span bytes.

So I decided to use a quadbit at a time (i.e. a nibble or half-byte) which implies a 16 bit popcount bitmap. We can use the other 48 bits of a 64 bit word to identify the index of the nibble that this branch is testing. A branch needs a second word to contain the pointer to the packed array of "twigs" (my silly term for sub-tries).

It is convenient for a branch to be two words, because that is the same as the space required for the key+value pair that you want to store at each leaf. So each slot in the array of twigs can contain either another branch or a leaf, and we can use a flag bit in the bottom of a pointer to tell them apart.

Here's the qp trie containing the keys "foo", "bar", "baz". (Note there is only one possible trie for a given set of keys.)

[ 0044 | 1 | twigs ] -> [ 0404 | 5 | twigs ] -> [ value | "bar" ]
                        [    value | "foo" ]    [ value | "baz" ]

The root node is a branch. It is testing nibble 1 (the least significant half of byte 0), and it has twigs for nibbles containing 2 ('b' == 0x62) or 6 ('f' == 0x66). (Note 1 << 2 == 0x0004 and 1 << 6 == 0x0040.)

The first twig is also a branch, testing nibble 5 (the least significant half of byte 2), and it has twigs for nibbles containing 2 ('r' == 0x72) or 10 ('z' == 0x7a). Its twigs are both leaves, for "bar" and "baz". (Pointers to the string keys are stored in the leaves - we don't copy the keys inline.)

The other twig of the root branch is the leaf for "foo".

If we add a key "qux" the trie will grow another twig on the root branch.

[ 0244 | 1 | twigs ] -> [ 0404 | 5 | twigs ] -> [ value | "bar" ]
                        [    value | "foo" ]    [ value | "baz" ]
                        [    value | "qux" ]

This layout is very compact. In the worst case, where each branch has only two twigs, a qp trie has the same overhead as a crit-bit trie, two words (16 bytes) per leaf. In the best case, where each branch is full with 16 twigs, the overhead is one byte per leaf.

When storing 236,000 keys from /usr/share/dict/words the overhead is 1.44 words per leaf, and when storing a vocabulary of 54,000 keys extracted from the BIND9 source, the overhead is 1.12 words per leaf.

For comparison, if you have a parsimonious hash table which stores just a hash code, key, and value pointer in each slot, and which has 90% occupancy, its overhead is 1.33 words per item.

In the best case, a qp trie can be a quarter of the depth of a crit-bit trie. In practice it is about half the depth. For our example data sets, the average depth of a crit-bit trie is 26.5 branches, and a qp trie is 12.5 for dict/words or 11.1 for the BIND9 words.

My benchmarks show qp tries are about 10% faster than crit-bit tries. However I do not have a machine with both a popcount instruction and a compiler that supports it; also, LLVM fails to optimise popcount for a 16 bit word size, and GCC compiles it as a subroutine call. So there's scope for improvement.

crit-bit tries revisited

DJB's published crit-bit trie code only stores a set of keys, without any associated values. It's possible to add support for associated values without increasing the overhead.

In DJB's code, branch nodes have three words: a bit index, and two pointers to child nodes. Each child pointer has a flag in its least significant bit indicating whether it points to another branch, or points to a key string.

[ branch ] -> [ 3      ]
              [ branch ] -> [ 5      ]
              [ "qux"  ]    [ branch ] -> [ 20    ]
                            [ "foo"  ]    [ "bar" ]
                                          [ "baz" ]

It is hard to add associated values to this structure without increasing its overhead. If you simply replace each string pointer with a pointer to a key+value pair, the overhead is 50% greater: three words per entry in addition to the key+value pointers.

When I wanted to benchmark my qp trie implementation against crit-bit tries, I trimmed the qp trie code to make a crit-bit trie implementation. So my crit-bit implementation stores keys with associated values, but still has an overhead of only two words per item.

[ 3 twigs ] -> [ 5   twigs ] -> [ 20  twigs ] -> [ val "bar" ]
               [ val "qux" ]    [ val "foo" ]    [ val "baz" ]

Instead of viewing this as a trimmed-down qp trie, you can look at it as evolving from DJB's crit-bit tries. First, add two words to each node for the value pointers, which I have drawn by making the nodes wider:

[ branch ] ->      [ 3    ]
              [ x  branch ] ->      [ 5    ]
              [ val "qux" ]    [ x  branch ] ->       [ 20  ]
                               [ val "foo" ]    [ val "bar" ]
                                                [ val "baz" ]

The value pointers are empty (marked x) in branch nodes, which provides space to move the bit indexes up a level. One bit index from each child occupies each empty word. Moving the bit indexes takes away a word from every node, except for the root which becomes a word bigger.

conclusion

This code was pretty fun to write, and I'm reasonably pleased with the results. The debugging was easier than I feared: most of my mistakes were simple (e.g. using the wrong variable, failing to handle a trivial case, muddling up getline()s two length results) and clang -fsanitize=address was a mighty debugging tool.

My only big logic error was in Tnext(); I thought it was easy to find the key lexicographically following some arbitrary string not in the trie, but it is not. (This isn't a binary search tree!) You can easily find the keys with a given prefix, if you know in advance the length of the prefix. But, with my broken code, if you searched for an arbitrary string you could end up in a subtrie which was not the subtrie with the longest matching prefix. So now, if you want to delete a key while iterating, you have to find the next key before deleting the previous one.

finally...

I have this nice code, but I have no idea what practical use I might put it to!

postscript...

I have done some simple tuning of the inner loops and qp tries are now about 30% faster than crit-bit tries.

| Leave a comment | Share

Comments {7}

Simon Tatham

from: simont
date: 5th Oct 2015 14:44 (UTC)

I hadn't previously known about crit-bit tries, but was pleased to see that it didn't take me long to learn them as preliminary reading for this post. (Mostly because DJB didn't bother to actually explain how to do all the operations – he just assumed, rightly as far as I'm concerned, that they weren't hard to figure out given the data structure description. And the latter is very short indeed :-)

The popcount trick is very cute – I'm sure there must be other interesting uses for the popcount(bits & ((1 << index)-1)) idiom.

I'm also faintly intrigued by the idea that this sort of thing is even a thing you can write up and give a name to in the first place. From a purely theoretical asymptotics point of view, it seems to me, crit-bit tries and qp tries are not essentially different from any other kind of trie with degree-1 nodes elided. (And that in turn is a very natural idea given the notion of tries to start with, which I for one didn't need anyone to suggest to me.) You could easily imagine an old-school algorithms researcher considering the differences between them to be pure implementation details (since, in particular, they don't affect the big-O-level performance of any operation). So this sort of discussion, nailing down some details of the exact format of the trie but without going all the way to an actual code implementation, seems to me to be an interesting kind of in-between layer, which looks theory-ish from the point of view of an implementor, but looks implementation-ish from the point of view of a theorist.

Reply | Thread

Tony Finch

from: fanf
date: 5th Oct 2015 15:02 (UTC)

What do you mean "without going all the way to an actual code implementation"? :-)

DJB argues firmly that the constant factors (especially memory overhead) are important. I should try benchmarking against other people's code, e.g. https://github.com/armon/libart which has much bigger per-leaf overhead but supports wider branch fanout.

There's space in qp tries for jumbo branch nodes. I currently use flag values of 0,1,2, so 3 is available for byte-at-a-time branches. With some care that could reduce the trie depth further.

Reply | Parent | Thread

Simon Tatham

from: simont
date: 5th Oct 2015 15:12 (UTC)

Sorry, yes, I didn't mean to deny the existence of your code implementation! I just meant that this kind of writeup has value in its own right, and would do even if it didn't come with an actual implementation.

In particular, it gives enough information that someone else could reimplement substantially the same thing, which could easily be useful to people who were unable to use your actual code. For whatever reason – hardware and compiler assumptions not satisfied, minor tweak required for some reason you haven't thought of, corporate paranoia preferring employees to reimplement at great length rather than importing free software of unverified (by that particular legal department) licensing and provenance...

Reply | Parent | Thread

Tony Finch

from: fanf
date: 5th Oct 2015 15:29 (UTC)

Indeed, there are lots of reasons you might not want to use my code. I am sure my API could be improved - the error handling in particular is ugly. And there are several XXX notes in my code about optimization opportunities and portability gotchas.

For instance, it's possible to port to a 32 bit machine and still have less than 2 words of overhead per leaf if your keys are less than 16 Kbytes long; but you can support longer keys with negligible extra overhead if you make the index values relative instead of absolute, and allow for single-twig stepping-stone branches if a relative index is larger than 16K. But this would complicate the code quite a bit and I struggle to care about that situation!

Reply | Parent | Thread

Sion

from: sion_a
date: 8th Oct 2015 15:10 (UTC)

I'm almost certainly failing to understand the details, but in the "adding qux to the trie" example, shouldn't it be creating a new root node, not a new twig on the existing root node, as "qux" differs from the other three values in the first nibble? ie
[ 0140 | 0 | twigs ] -> [ 0044 | 1 | twigs ] -> ...
                        [    value | "qux" ]

(Trying to work out if this can be converted into an efficient on-disk structure as an alternative to Btrees.)

Edited at 2015-10-08 03:10 pm (UTC)

Reply | Thread

Tony Finch

from: fanf
date: 9th Oct 2015 09:55 (UTC)

Hmm yes, thanks for spotting that error :-) I knew when I was making those diagrams that I was going to screw something up. I shall have to choose another word that correctly illustrates a branch with fanout greater than 2!

There's some discussion of external storage in Phil Bagwell's "ideal hash tries" paper, with citations of hashed data structures that allow you to locate a key in about one seek on average. Hard to do better than that :-)

Reply | Parent | Thread

Sion

from: sion_a
date: 9th Oct 2015 11:30 (UTC)

My application wants to be able to do an ordered traversal (I'm still thinking about whether it really needs it, or it's just an optimisation because access will frequently be to clusters of adjacent keys). I think I'm going to have to go back and carefully reread "ideal hash tries" because I could not get my head around the section about using ordered keys rather than hashes in external storage.

Reply | Parent | Thread