Tony Finch - Comprehending endianness

dotatfanf wrote
on 15th September 2007 at 00:49
Previous Entry Share Next Entry

Comprehending endianness

I've recently been reading about Erlang's bit syntax, which is a pretty nice way of serializing and deserializing binary data. The Erlang documentation includes some programming examples as well as the reference description. Although it is quite simple, it gets a lot of bang per buck by using essentially the same syntax to parse binaries using pattern matching as it does to construct them. It can also be implemented quite efficiently.

More recently the Erlang implementers have been working on a syntax for binary comprehensions which makes it even easier to implement really fiddly binary data manipulation code incredibly concisely and efficiently. It's one of the greatest inventions in programming language syntax in the last ten years.

However all is not sweetness and light, and the problem is that Erlang has incomplete (or broken) support for differing endianness.

Endianness affects the behaviour of two basic operations: splitting a large number into a sequence of smaller numbers; and joining a sequence of small numbers to form a larger number. (I'm being deliberately vague talking about "numbers" because they can be any size, smaller than bytes as well as larger.) These split and join operations are the fundamental building blocks of higher-level serialization and deserializaton operations.

The simplest view of endianness is that it is a choice of byte order, where the larger numbers are words comprising a whole number of bytes, which are the smaller numbers. Serialization is splitting and deserialization is joining, because the programmer wants to deal with words but they must be represented as bytes. Erlang allows you to control the byte order of numbers when they are serialized into or deserialized from a binary.

C's bitfield support illustrates how endianness is more complicated than just byte order. C allows you to specify the width in bits of integer fields in a structure. The compiler packs sequences of bitfields into (typically) word-sized storage units (for efficient manipulation in the processor's registers). In this situation, serialization is joining and deserialization is splitting - the opposite of the previous paragraph! - because the programmer wants to deal with a sequence of small numbers but these must be represented as words. However, it gets more complicated because byte order affects how words are represented in memory, so there are two layers of endianness in effect: how the compiler joins bitfields into words and how the processor splits words into bytes. The compiler agrees with the processor on endianness, so that a sequence of 8-bit-wide bitfields has the same representation as a sequence of char fields.

Erlang's bit syntax deals directly with byte strings, without restricting the alignment of bitfield boundaries to allow simple word-by-word manipulation as C does. However you still get both splitting and joining when serializing (and when deserializing), since large numbers must be split into bytes and small numbers must be joined to form bytes. Small numbers that overlap byte boundaries must be both split and joined!

The bug is that Erlang allows you (when serializing) to control the byte order of splitting, but not the endianness it uses to join small bit fields to form bytes - the latter is always big endian. This means that Erlang bit syntax cannot elegantly handle bitfields in a C struct on a little-endian machine. For example, this is the structure of the program counter on a 26-bit little-endian ARM processor:

	struct {
		unsigned mode : 2;
		unsigned pc : 24;
		unsigned imask : 2;
		unsigned flags : 4;
	};

The layout in memory is as follows. I have numbered the bytes on the left-hand side and the bits along the top, according to their power-of-two values. I have also indicated how the program counter bits end up split between the four bytes.

	    7   6   5   4   3   2   1   0
	  ------------------------+-------+
	0   5         pc        0 |  mode |
	  ------------------------+-------+
	1  13            pc             6
	  ---------------------------------
	2  21               pc          14
	  +---------------+-------+--------
	3 | N   Z   C   V | I   F | 23  22
	  +---------------+-------+--------

It would be nice if you could write in Erlang:

	<< Mode:2/little, PC:24/little, IMask:2/little, Flags:4/little >> 

However, what you get is completely crazy:

	    7   6   5   4   3   2   1   0
	  +-------+------------------------
	0 | mode  | 7         pc        2
	  +-------+------------------------
	1   1   0   15        pc       10
	  ---------------------------------
	2   9   8   23        pc       18
	  --------+-------+---------------+
	3  17  16 | I   F | N   Z   C   V |
	  --------+-------+---------------+

That is, the endianness flags have had no effect on the small fields: they have been joined to form bytes in big endian order. The program counter has been split into bytes in little-endian order, and then this three-byte string has been shifted in a big-endian fashion to fit into its unaligned space. As a result its bits are shuffled into an utterly baffling order.

To represent little-endian structures like this one in Erlang, you must deal with layout details manually, defeating the expressiveness of Erlang's bit syntax. In the example, the logically-contiguous PC field gets split up, and its value must be recovered with an auxiliary expression.

	<< PCa:6, Mode:2, PCb:8, PCc:8, Flags:4, IMask:2, PCd:2 >> 

So, how to fix this bug?

One possibility that I considered was to have a little-endian version of the binary append operation, which causes small bitfields to be joined into bytes in little-endian order, and larger fields to be shifted and split in little-endian fashion when they are appended on non-byte boundaries. This is nice and straight-forward, but it can still lead to craziness if you mix big- and little-endian operations. Appending binaries is expressed in Erlang by the comma inside << >> brackets; if, instead of putting the endianness flag on each element, you put it on the binary as a whole, e.g. << mode:2, pc:24, imask:2, flags:4 >>/little, then it can simultaneously affect splitting and joining which helps to ensure the results are consistent not crazy, and it also reduces verbosity.

However I'm not sure this is entirely sufficient. Erlang used to require that binaries were a whole number of bytes, so sub-byte fragments could only appear as part of a larger << >> expression. However the new generalized bit syntax allows binaries to end on arbitrary bit boundaries, which opens the question of whether the sub-byte tail of a binary is represented in big- or little-endian fashion. The same problem can occur at the start of a binary if it has been created by splitting a larger binary on a non-byte boundary, and the new binary is represented by reference to the original instead of as a new copy.

Therefore I think binaries should know their endianness, which would determine: the layout of non-byte-aligned boundaries at the start and/or end of the binary; how the binary is re-aligned (shifted) so that it can be appended to a binary with different alignment; and the byte order of large fields. (Shifting left moves big-endians towards the start and little-endians towards the end.) The append operation can also check that binaries have compatible endianness so that you get a sensible error instead of crazy bit shuffling.

Binaries actually need three kinds of endianness: big, little, and indeterminate. The latter can only be used for whole-byte binaries and would (for example) apply to binaries freshly read from the outside world. It seems reasonable to allow whole-byte binaries to have their endianness coerced. Appending binaries would cause an error if they are not a whole number of bytes and they are not the same endianness; if they are a whole number of bytes and of differing endianness then the result has indeterminate endianness; otherwise the result has the same endianness as the arguments.

So, to conclude: Erlang's bit syntax is fantastic but not quite finished. The run-time system needs to annotate binary objects with their endianness, and endianness flags should be attached to whole bit syntax expressions instead of each element.


(Leave a comment)
From:gerald_duck
Date:2007-09-15 10:54 (UTC)
(Link)
For one reason or another I've been cursed by endedness and addressing-granularity issues over the years. Nothing I've learned pertains directly to this, but I thought I'd mention two things anyway.

Firstly, my view is that the succinct way to express endedness is as a value that is XORed with addresses. This can express the convolutions that occur, regardless of the word sizes involved, even when atrocities such as vax-endedness occur.


My second was borne out of considering inefficiencies in CPU bus protocols. You want to be able to address (aligned) 32-bit, 16-bit and 8-bit operations using an address bus and some flag bits. How many flags do you need? The ARM answer was two, but I say one, regardless of how many addressing sizes are needed:

WLOG, consider 8-bit addressing, and concatenate an extra bit below the least-significant. Bytes are then addressed as abcdefgh1, 16-bit values as abcdefg10 and 32-bit values as abcdef100. Larger numbers of trailing zeroes after the 1 marker denote still larger accesses. Bit addressing can be implemented compatibly by adding a further three zeroes to the right of the existing bits. The one remaining bit pattern, 000000000 could well be used to denote a null address that is always faulted.

It's struck me more than once that such addresses could beneficially be used in CPU instruction-set architectures as well as on the bus. It associates the size of the datum with the address not with the load or store instruction; although that's counterintuitive to people used to the current CPU ISA monoculture (and that includes, for example, the designers of the C standard) it feels like an advantage to me. If the ALU refused to do implicit arithmetic between addresses of different granularities, that would immediately catch a variety of run-time logic errors in programs which currently cause deferred explosion — especially attempts to perform address arithmetic on NULL.
(Reply) (Thread)
From:fanf
Date:2007-09-15 12:20 (UTC)
(Link)
Your XOR idea only deals with byte order, and not with sub-byte endianness.

I like your addressing idea. However I don't think its extension to bit addressing is entirely straight-forward, because people who want to be able to address bits also want arbitrary width datums. (For example, the IP header has fields with widths of 3, 4, 8, 13, 16, and 32 bits.) Binary subdivision and fixed alignments of 8/16/32/64 bit values is too rigid.

Endianness problems can get nasty with bit addressing, but I think your XOR technique can be a big help when dealing with data that has the right bit order but the wrong byte order. Bit addressing won't solve byte order problems because the rest of the world is byte addressed - all data storage and transmission is done in bytes, from the point of view of the programmer. So the fact that ethernet transmits bytes in little-endian order is a trivial detail, and it doesn't matter that this is inconsistent with the big-endianness of higher-level network protocols. Bit addressing would expose this kind of inconsistency and turn the trivial detail into a persistent annoyance.
(Reply) (Parent) (Thread)

(Leave a comment)

Powered by LiveJournal.com