Log in


LIAR: Life in a register

« previous entry | next entry »
22nd Jan 2008 | 22:49

There's a category of bit-twiddling hacks called "SWAR": SIMD within a register. The essence is to treat a register not as a single number but as a packed array of bit fields. The x86 vector processing extensions (MMX / 3Dnow / SSE) have a SWARish flavour, but you can use the idea with just basic bitwise operations. The ultimate reference for this is the book Hacker's Delight, and there are several useful sites on the web (grep for "hack" in my URL log).

A few years ago I wrote a fairly neat implementation of Conway's Game of Life that did a lot of SWAR. (I described the evolution of the code soon after I finished it.) The key SWAR-related idea is to use four 64-bit registers to represent 64 four-bit numbers: one register has bit 0 times 64, one has bit 1 times 64, one has bit 2 times 64, etc. Then you can use bitwise operations to implement parallel addition in exactly the way that is taught in basic digital electronics. The following macros take 2 or 3 bit vectors of equal value and produce sum and carry result vectors.

The half adder:
	#define ADD2(s0,s1,a0,a1) do {		\
			s1 = (a0) & (a1);	\
			s0 = (a0) ^ (a1);	\
		} while(0)
The full adder:
	#define ADD3(s0,s1,a0,a1,a2) do {	\
			uint64_t c0, c1;	\
			ADD2(s0,c0,a0,a1);	\
			ADD2(s0,c1,s0,a2);	\
			s1 = c0 | c1;		\
		} while(0)

More recently I have been thinking about Bill Gosper's Hashlife algorithm. I'm not going to describe its mind-bending brilliance here, because I want to concentrate on the SWAR tangent that spun off from it. Hashlife's core algorithm is strategically very powerful, but it is not very efficient at low-level tactical bit banging. For example, Golly uses table lookup for calculating small parts of the universe instead of relying on Hashlife all the way down.

My previous SWAR Life code worked on 66x3 cell bitmaps (i.e. a 64 bit register plus its surrounding cells) which arose from the way it raster-scanned its way across the universe. However Hashlife works on square subdivisions of the universe that are a power of two cells on a side. An 8x8 square fits perfectly into a 64 bit register, which is ideal for SWAR. The real beauty (and what makes this Life in a register and not around a register) is that Hashlife deals with the splicing of adjoining squares at its strategic level, so we don't have to look beyond the register file to produce a useful result.

LIAR represents each row of the 8x8 square as an octet in the register. We don't have to worry if the big byte is the top row or the bottom row, or if the big bit of a byte is at the left or right end of the row, because that's handled at the strategic level. By the time the square gets to us symmetry has eliminated worries of endianness. I'll use the big endian convention in the code since it produces more consistent nomenclature.

In order to SWARishly add the eight 1-bit numbers corresponding to a cell's neighbours, they must be shifted to equivalent places in their bit vectors. (In fact we also add the cell itself into the count, since this gives us a neat short-cut.) Start off by getting the left and right cells into the proper place. We need to mask so that rows don't bleed into each other.

	uint64_t liar(uint64_t bmp) {
		uint64_t left  = bmp << 1 & 0xFEFEFEFEFEFEFEFE;
		uint64_t right = bmp >> 1 & 0x7F7F7F7F7F7F7F7F;

Now we can add each cell to its left and right neighbours. (The numeric variable suffix is the value of the bits in the vector.) This does a 3-to-2 horizontal compression of the whole bitmap.

		uint64_t mid1, mid2;
		ADD3(mid1, mid2, left, bmp, right);

By shifting again we can line up the counts of the upper and lower neighbours with each cell. This is where it is helpful to count the cell as well as its neighbours, so all rows are treated the same.

		uint64_t up1   = mid1 << 8;
		uint64_t up2   = mid2 << 8;
		uint64_t down1 = mid1 >> 8;
		uint64_t down2 = mid2 >> 8;

Now we can add these together to compress vertically. We're lazy about propagating the carries in the value-2 bits since it doesn't buy us anything. We end up with bits valued 1, 2, 2, and 4, which makes a maximum total value of 9.

		uint64_t sum1, sum2a, sum2b, sum4;
		ADD3(sum1, sum2a, up1, mid1, down1);
		ADD3(sum2b, sum4, up2, mid2, down2);

The usual way of expressing the Life rule is: if the cell is dead and it has 3 neighbours, it is born, otherwise it stays dead; if a cell is live and it has 2 or 3 neighbours, it survives, otherwise it dies. We need to adjust this because we count the cell too, so if a cell is live and its count is 3 or 4, it survives, otherwise it dies. Notice that when the count is 4 the cell's state is the same in the previous and next generations. Translated into boolean arithmetic, this logic becomes:

		return(bmp & ~sum1 & (~sum2a & ~sum2b  &  sum4 |
				       sum2a &  sum2b  & ~sum4)
			    | sum1 & ( sum2a ^  sum2b) & ~sum4);

This code uses just 39 ALU operations, most of which can run concurrently on a superscalar chip. It needs (I think) a maximum of 8 live values (which is slightly obscured by the coding style) so it shouldn't even need to hit the stack. By contrast, Golly's equivalent code (leafres() near the start of its lifealgo.cpp file) uses 52 ALU ops and 9 random reads of a 64KB table (plus overhead owing to a sub-optimal 8x8 bitmap layout). Hashlife likes to calculate two generations of an 8x8 square at a time, and the second one is easier because you don't need to worry about the edges of the square, so it takes Golly 30 ALU ops and 4 random reads. The two generation total is more than 82 ALU ops and 13 reads for Golly, and just 78 ALU ops for LIAR.

A success, I think :-) With a bit more cunning it might be possible to squeeze LIAR even more...

(After thinking some more...) Actually, propagating carries does simplify it. We don't need to propagate them all the way, because we can treat counts of 8 and 9 identically to 0 and 1. This means we can replace the return statement above with the following. This saves 4 ALU ops, reducing the cost from 39 to 35. I've also worked out that the maximum number of registers needed is six, not eight, and it fits into an in-order triple-issue instruction pipeline so it can run in 15 clock cycles.

		uint64_t sum2, sum4b;
		ADD2(sum2, sum4b, sum2a, sum2b);
		sum4 = sum4 ^ sum4b;
		return(bmp & ~sum1 & ~sum2 &  sum4 |
			      sum1 &  sum2 & ~sum4);

The final code is available on my web site

| Leave a comment | Share

Comments {7}

from: kaet
date: 23rd Jan 2008 01:25 (UTC)

The hodge-podge cellular automaton code I wrote for the "real" mode 386 used many SWAR-like tricks (though I didn't know the name) to keep everything in registers, my machine had a craptastic FSB. On the 386 there was the added fun of being able to treat 8 bits interchangeably as 16 bit ones, and each has its own characteristics. I was desperate to get it up to 1 frame-per-second, 1 cell-per-pixel.

Reply | Thread

Tony Finch

from: fanf
date: 23rd Jan 2008 01:31 (UTC)

The main challenge nowadays is getting the data onto the screen. That is by far the slowest part of my old listlife code. Bloating the 1-bit cells to 32-bit pixels then copying the result to the video card really chews memory bandwidth.

Reply | Parent | Thread


from: nonameyet
date: 23rd Jan 2008 07:38 (UTC)

Bloating the 1-bit cells to 32-bit pixels then copying the result to the video card really chews memory bandwidth.

Can't you use your 1-bit map as a mask and get the GPU to do the bloating and colouring so the memory bandwidth isn't wasted. Or have moden graphics cards dropped that feature ?

You don't by any chance still have a Matrox Millennium with WRAM do you ? That was the thing that made WRAM special: the memory chips did the bloating, rather than wasting the GPU's time ! XFree86 v3 had a speed test option and some of the blit primitives reported (IIRC) 2giga-pixels/second across a standard PCI bus.

Reply | Parent | Thread

Tony Finch

from: fanf
date: 23rd Jan 2008 08:53 (UTC)

Dunno how to get a GPU to do the bloating for me. My understanding of X is rather rudimentary and its old basic bitmap handling code (the parts I understand) do lots of copying.

Reply | Parent | Thread


from: nonameyet
date: 23rd Jan 2008 16:57 (UTC)

Last time I looked, the next big thing was going to be downloadable snippets of code that the GPU would run on each pixel (or each texture pixel - which might mean that a wall or someones clothes could show a running life sim :-). This was too high a level for OpenGL, so it looked like the only option was bare-metal coding or Microsoft's direct drawing language (Direct-X?). But certainly coding life in SIMD on a GPU should theoretically be possible.

Reply | Parent | Thread

Tony Finch

from: fanf
date: 23rd Jan 2008 18:18 (UTC)

It has been done in a number of ways. I believe you can do it with straight OpenGL, though it can be made rather faster with a pixel shader. See for example http://www.geocities.com/simesgreen/gllife/ or http://wiki.vrmedia.it/index.php?title=FBO_Shader_Simulation

The problem with doing the whole thing on the graphics card is that it drastically limits the size of the universe that you can compute, and it's not as fast as doing it on the main CPU.

Edited at 2008-01-23 06:18 pm (UTC)

Reply | Parent | Thread

Possible Speedup Hashlife: Tony Finch SWARs it!

from: anonymous
date: 31st Aug 2008 07:48 (UTC)

Hello Tony, I just dropped Tom Rokicki (and you) an email about maybe coding this bit hack into golly hashlife... if you're interested.

It might break the 386 lowest common target to rely on 64b mode or SSE 128b hacks, but I doubt that matters much in this case -- who's running golly in hash mode on a 286?


P.S. thanks for your list code! I've admired it for several years. I'm currently splicing it into perl via Inline::C to make a search framework which will also take advantage of 5.10's new regex backtracking controls. I hope to use FinchLife to explore constrained regions of pattern/problem/catalyst space by calculating the entire future tree of possibilities without gen/node repetitions -- in a single walk; dump the complete gen-by-gen state space to an annotated flatfile; and then let the perl 5.10 regex engine scan it repeatedly for interesting stuff. This pipelined approach doesn't need speed, but it'll be nice to get my 100GB file in days instead of weeks :) and there may be uses for re-gen during the filter phase. Sorry to babble!

Reply | Thread