tl;dr: I have developed a data structure called a "qp trie", based on
the crit-bit trie. Some simple benchmarks say qp tries have about 1/3
less memory overhead and are about ~~10%~~ 30% faster than crit-bit tries.

"qp trie" is short for "quadbit popcount patricia trie". (Nothing to do with cutie cupid dolls or Japanese mayonnaise!)

Get the code from http://dotat.at/prog/qp/.

## background

Crit-bit tries are an elegant space-optimised variant of PATRICIA tries. Dan Bernstein has a well-known description of crit-bit tries, and Adam Langley has nicely annotated DJB's crit-bit implementation.

What struck me was crit-bit tries require quite a lot of indirections to perform a lookup. I wondered if it would be possible to test multiple bits at a branch point to reduce the depth of the trie, and make the size of the branch adapt to the trie's density to keep memory usage low. My initial attempt (over two years ago) was vaguely promising but too complicated, and I gave up on it.

A few weeks ago I read about Phil Bagwell's hash array mapped trie (HAMT) which he described in two papers, "fast and space efficient trie searches", and "ideal hash trees". The part that struck me was the popcount trick he uses to eliminate unused pointers in branch nodes. (This is also described in "Hacker's Delight" by Hank Warren, in the "applications" subsection of chapter 5-1 "Counting 1 bits", which evidently did not strike me in the same way when I read it!)

You can use popcount() to implement a sparse array of length *N*
containing *M < N* members using bitmap of length *N* and a packed
vector of *M* elements. A member *i* is present in the array if bit
*i* is set, so *M == popcount(bitmap)*. The index of member

*i*in the packed vector is the popcount of the bits preceding

*i*.

```
mask = 1 << i;
if(bitmap & mask)
member = vector[popcount(bitmap & mask-1)]
```

## qp tries

If we are increasing the fanout of crit-bit tries, how much should we increase it by, that is, how many bits should we test at once? In a HAMT the bitmap is a word, 32 or 64 bits, using 5 or 6 bits from the key at a time. But it's a bit fiddly to extract bit-fields from a string when they span bytes.

So I decided to use a quadbit at a time (i.e. a nibble or half-byte) which implies a 16 bit popcount bitmap. We can use the other 48 bits of a 64 bit word to identify the index of the nibble that this branch is testing. A branch needs a second word to contain the pointer to the packed array of "twigs" (my silly term for sub-tries).

It is convenient for a branch to be two words, because that is the same as the space required for the key+value pair that you want to store at each leaf. So each slot in the array of twigs can contain either another branch or a leaf, and we can use a flag bit in the bottom of a pointer to tell them apart.

Here's the qp trie containing the keys "foo", "bar", "baz". (Note there is only one possible trie for a given set of keys.)

```
[ 0044 | 1 | twigs ] -> [ 0404 | 5 | twigs ] -> [ value | "bar" ]
[ value | "foo" ] [ value | "baz" ]
```

The root node is a branch. It is testing nibble 1 (the least
significant half of byte 0), and it has twigs for nibbles containing 2
('b' == 0x6**2**) or 6 ('f' == 0x6**6**). (Note 1 << 2 == 0x0004 and 1
<< 6 == 0x0040.)

The first twig is also a branch, testing nibble 5 (the least
significant half of byte 2), and it has twigs for nibbles containing 2
('r' == 0x7**2**) or 10 ('z' == 0x7**a**). Its twigs are both leaves,
for "bar" and "baz". (Pointers to the string keys are stored in the
leaves - we don't copy the keys inline.)

The other twig of the root branch is the leaf for "foo".

If we add a key "qux" the trie will grow another twig on the root branch.

```
[ 0244 | 1 | twigs ] -> [ 0404 | 5 | twigs ] -> [ value | "bar" ]
[ value | "foo" ] [ value | "baz" ]
[ value | "qux" ]
```

This layout is very compact. In the worst case, where each branch has only two twigs, a qp trie has the same overhead as a crit-bit trie, two words (16 bytes) per leaf. In the best case, where each branch is full with 16 twigs, the overhead is one byte per leaf.

When storing 236,000 keys from `/usr/share/dict/words`

the overhead is
1.44 words per leaf, and when storing a vocabulary of 54,000 keys
extracted from the BIND9 source, the overhead is 1.12 words per leaf.

For comparison, if you have a parsimonious hash table which stores just a hash code, key, and value pointer in each slot, and which has 90% occupancy, its overhead is 1.33 words per item.

In the best case, a qp trie can be a quarter of the depth of a
crit-bit trie. In practice it is about half the depth. For our example
data sets, the average depth of a crit-bit trie is 26.5 branches, and
a qp trie is 12.5 for `dict/words`

or 11.1 for the BIND9 words.

My benchmarks show qp tries are about 10% faster than crit-bit tries. However I do not have a machine with both a popcount instruction and a compiler that supports it; also, LLVM fails to optimise popcount for a 16 bit word size, and GCC compiles it as a subroutine call. So there's scope for improvement.

## crit-bit tries revisited

DJB's published crit-bit trie code only stores a set of keys, without any associated values. It's possible to add support for associated values without increasing the overhead.

In DJB's code, branch nodes have three words: a bit index, and two pointers to child nodes. Each child pointer has a flag in its least significant bit indicating whether it points to another branch, or points to a key string.

```
[ branch ] -> [ 3 ]
[ branch ] -> [ 5 ]
[ "qux" ] [ branch ] -> [ 20 ]
[ "foo" ] [ "bar" ]
[ "baz" ]
```

It is hard to add associated values to this structure without increasing its overhead. If you simply replace each string pointer with a pointer to a key+value pair, the overhead is 50% greater: three words per entry in addition to the key+value pointers.

When I wanted to benchmark my qp trie implementation against crit-bit tries, I trimmed the qp trie code to make a crit-bit trie implementation. So my crit-bit implementation stores keys with associated values, but still has an overhead of only two words per item.

```
[ 3 twigs ] -> [ 5 twigs ] -> [ 20 twigs ] -> [ val "bar" ]
[ val "qux" ] [ val "foo" ] [ val "baz" ]
```

Instead of viewing this as a trimmed-down qp trie, you can look at it as evolving from DJB's crit-bit tries. First, add two words to each node for the value pointers, which I have drawn by making the nodes wider:

```
[ branch ] -> [ 3 ]
[ x branch ] -> [ 5 ]
[ val "qux" ] [ x branch ] -> [ 20 ]
[ val "foo" ] [ val "bar" ]
[ val "baz" ]
```

The value pointers are empty (marked x) in branch nodes, which provides space to move the bit indexes up a level. One bit index from each child occupies each empty word. Moving the bit indexes takes away a word from every node, except for the root which becomes a word bigger.

## conclusion

This code was pretty fun to write, and I'm reasonably pleased with the
results. The debugging was easier than I feared: most of my mistakes
were simple (e.g. using the wrong variable, failing to handle a
trivial case, muddling up getline()s two length results) and ```
clang
-fsanitize=address
```

was a mighty debugging tool.

My only big logic error was in Tnext(); I thought it was easy to find the key lexicographically following some arbitrary string not in the trie, but it is not. (This isn't a binary search tree!) You can easily find the keys with a given prefix, if you know in advance the length of the prefix. But, with my broken code, if you searched for an arbitrary string you could end up in a subtrie which was not the subtrie with the longest matching prefix. So now, if you want to delete a key while iterating, you have to find the next key before deleting the previous one.

*finally...*

I have this nice code, but I have no idea what practical use I might put it to!

*postscript...*

I have done some simple tuning of the inner loops and qp tries are now about 30% faster than crit-bit tries.