fanf

prefetching tries

11th Oct 2015 | 20:33

The inner loop in qp trie lookups is roughly

    while(t->isbranch) {
        __builtin_prefetch(t->twigs);
        b = 1 << key[t->index]; // simplified
        if((t->bitmap & b) == 0) return(NULL);
        t = t->twigs + popcount(t->bitmap & (b - 1));
    }

The efficiency of this loop depends on how quickly we can get from one indirection down the trie to the next. There is quite a lot of work in the loop, enough to slow it down significantly compared to the crit-bit search loop. Although qp tries are half the depth of crit-bit tries on average, they don't run twice as fast. The prefetch compensates in a big way: without it, qp tries are about 10% faster than crit-bit tries; with it, they are about 30% faster.

I adjusted the code above to emphasize that in one iteration of the loop it accesses two locations: the key, which it is traversing linearly with small skips, so access is fast; and the tree node t, whose location jumps around all over the place, so access is slow. The body of the loop calculates the next location of t, but we know at the start that it is going to be some smallish offset from t->twigs, so the prefetch is very effective at overlapping calculation and memory latency.

It was entirely accidental that prefetching works well for qp tries. I was trying not to waste space, so the thought process was roughly this: a leaf has to be two words,

    struct Tleaf { const char *key; void *value; };

Leaves should be embedded in the twig array, to avoid a wasteful indirection, so branches have to be the same size as leaves.

    union Tnode { struct Tleaf leaf; struct Tbranch branch; };

A branch has to have a pointer to its twigs, so there is space in the other word for the metadata: bitmap, index, flags. (The limited space in one word is partly why qp tries test a nibble at a time.) Putting the metadata about the twigs next to the pointer to the twigs is the key thing that makes prefetching work.
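The post never shows struct Tbranch itself. Under the constraints just described (one pointer word plus one word of metadata, so a branch matches the size of a leaf), one plausible packing uses bitfields; the field widths here are guesses for illustration, not the actual qp-trie layout:

```c
#include <stdint.h>

struct Tleaf { const char *key; void *value; };

/* Hypothetical layout: the lookup loop uses t->isbranch, t->index,
   t->bitmap, and t->twigs, and all the metadata must fit in the one
   word that sits next to the twigs pointer. */
struct Tbranch {
    union Tnode *twigs;     /* the array of child nodes */
    uint64_t isbranch : 1,  /* distinguishes a branch from a leaf */
             index : 46,    /* position of the tested nibble in the key */
             bitmap : 17;   /* one bit per possible nibble value */
};

union Tnode { struct Tleaf leaf; struct Tbranch branch; };
```

The payoff of this layout is exactly the point of the post: the bitmap and index arrive in the same cache line as the twigs pointer, so the prefetch can be issued before any of the metadata has been examined.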

One of the inspirations of qp tries was Phil Bagwell's hash array mapped tries. HAMTs use the same popcount trick, but instead of using the PATRICIA method of skipping redundant branch nodes, they hash the key and use the hash as the trie index. The hashes should very rarely collide, so redundant branches should also be rare. Like qp tries, HAMTs put the twig metadata (just a bitmap in their case) next to the twig pointer, so they are friendly to prefetching.
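The shared popcount trick is compact enough to show in full. A node stores a bitmap with one bit per possible child value, plus a dense array holding only the children that actually exist; a child's array slot is the number of set bits below its bit. This is a hedged sketch of the idea, not code from either qp tries or any HAMT implementation:

```c
#include <stdint.h>

/* Array slot of the child selected by `nibble`, assuming the caller
   has already checked that (bitmap & (1ULL << nibble)) is nonzero. */
static unsigned twig_offset(uint64_t bitmap, unsigned nibble) {
    uint64_t below = bitmap & ((1ULL << nibble) - 1);
    return (unsigned)__builtin_popcountll(below);
}
```

For example, with bitmap 0b10110 (children 1, 2, and 4 present), child 4 lives at slot 2, because two bits are set below bit 4.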

So, if you are designing a tree structure, put the metadata for choosing which child is next adjacent to the node's pointer in its parent, not inside the node itself. That allows you to overlap the computation of choosing which child is next with the memory latency for fetching the child pointers.


Comments (6)

Jos Dingjan

from: happydisciple
date: 12th Oct 2015 08:06 (UTC)

Was the naming of Tleaf intentional?

I'm now waiting for boiling water, milk, and possibly sugar to appear.


Tony Finch

from: fanf
date: 12th Oct 2015 09:20 (UTC)

Code is an excellent place for random puns :-)


Simon Tatham

from: simont
date: 12th Oct 2015 10:07 (UTC)

And union Tnode, being a wrapper around the other two T-related things, should have been called union Tcosy.


Tony Finch

from: fanf
date: 12th Oct 2015 10:10 (UTC)

That deserves a slap with a wet Troot.


Simon Tatham

from: simont
date: 12th Oct 2015 10:24 (UTC)

So, if you are designing a tree structure, put the metadata for choosing which child is next adjacent to the node's pointer in its parent, not inside the node itself.

Hmm, that's faintly annoying – in one particular kind of tree structure I use a lot, I used to do that, but moved away from it!

The structure in question is a 'counted' 2-3-4 tree, by which I mean that each node of the tree is annotated with a count of elements in the subtree it heads. This permits log-time indexed lookup ("get me the nth thing in this tree"), which means that if the tree is maintained according to a sorting order as usual then you can efficiently retrieve order statistics (median, percentiles etc). Alternatively, you can do away with the sorting order and use the same node counts at insertion time ("insert this item at position n"); then you have a structure for storing a list of arbitrary unsorted items which provides a log-time compromise between an array's fast retrieval and very slow insertion and a linked list's vice versa.

When I originally invented this (that is, 'invented' in the sense that you eventually find out it was hiding in Knuth all along), each tree node contained between 2 and 4 (count, pointer) pairs to subnodes. Eventually I decided this was wasteful of space at the bottom level of the tree, and it was cheaper to have each tree node contain its own count, because now you only need one count slot per node rather than four. But that means that each step of an indexed lookup now needs to go and read the counts out of up to 4 subnodes in order to decide which link to follow, whereas the original organisation allowed me to do that whole job using only information stored in the node itself – and, as you now point out, it might also have been possible to do something clever with prefetching.
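The original layout described above can be sketched in a few lines: with (count, pointer) pairs held in the parent, choosing a child needs no memory access outside the node itself. The names here are illustrative, not taken from tree234, and the sketch ignores the elements a real 2-3-4 tree stores between its subtrees:

```c
#include <stddef.h>

/* Hypothetical node with per-child counts stored alongside the
   child pointers, rather than one count inside each child. */
struct node234 {
    size_t counts[4];        /* elements in each subtree */
    struct node234 *kids[4];
    /* element storage omitted for brevity */
};

/* Which child subtree contains index n? Adjusts *n to be relative
   to the chosen subtree; returns -1 if n is out of range. */
static int pick_child(const struct node234 *t, size_t *n) {
    for (int i = 0; i < 4; i++) {
        if (*n < t->counts[i]) return i;
        *n -= t->counts[i];
    }
    return -1;
}
```

Because the loop touches only the parent's own cache lines, all four child pointers could in principle be prefetched while the counts are being compared.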

(Though I'm not quite sure whether the details work out sensibly. I think your idiom here might actually require me to have transformed in the other direction, so that each subnode pointer record had the form (pointer, child_counts[4]).)

And I think you could solve the wasted-space problem in a much simpler way: since 2-3-4 trees have a known uniform depth at all times, you just specify a different data layout for the bottommost layer, which doesn't waste space on subtree pointers and their associated counts at all.

Perhaps I should go back and rewrite my standard tree234 code, and see if I can make it faster!


HairyEars

Take your PIQ

from: hairyears
date: 12th Oct 2015 21:51 (UTC)

Prefetching *should* work better with this than just about any other data structure more complex than a linked list - if it compiles the way I think it does, you have a near-perfect match to the typical branch-prediction logic of a reasonably smart prefetch.

On the other hand... you do get interesting inconsistencies, and I can name at least one app making heavy use of associative arrays that runs faster on a VM.
