Stringing along

17th Sep 2007 | 01:10

I've come to the conclusion that most programming languages' idea of a string - an array or list of characters - is wrong, or at least too simplistic.

Most languages designed before Unicode have the idea that characters are bytes, which is hopelessly inadequate. Languages designed before Unicode was mature (e.g. Java) have the idea that characters are 16 bits, which is still inadequate. The next reasonable step is to use a 32 bit type for characters, but that wastes at least a third of the memory you use for strings, since Unicode needs at most 21 bits.

If your language's character type is too narrow then you have to use an encoding scheme (such as a Unicode Transformation Format or ISO-2022) to fit a larger repertoire into the available space. However, when you have done that your char is no longer a character. This causes a number of problems:

  • strlen() no longer counts characters.
  • random access into strings can lead to nonsense, e.g. if you miss a code shift sequence or you don't land on the start of a UTF-8 sequence or if you accidentally strip off a BOM.

In fact, even in the absence of encodings you have similar problems, e.g.

  • the length of a string is not the same as the number of glyphs, because of combining characters.
  • even in a fixed width font, the number of glyphs is not the same as the width of the string on the display, because some characters are double-width.

Even ASCII is not simple enough to be immune from problems:

  • random access into strings can produce garbage if the string contains escape sequences.
  • some logical characters are not represented as a single byte, e.g. the network newline is the two-byte sequence CR LF.
  • in some contexts you can use typewriter-style backspace/overstrike to implement something like combining characters.

On the other hand, if your language's string type is not based on bytes then you have the problem that strings need to be transcoded from their external form into something that the libraries prefer, since UTF-8 is winning the encoding wars.

I think that the solution to this problem is to embrace it, instead of trying to hide behind an inadequate abstraction. The essence of the bug is that, while it is meaningful to talk about a Unicode codepoint in isolation, a codepoint is not a character and a string is not an array of codepoints: there is a layer of encoding between the conceptual sequence of codepoints and the representation of strings. (This implies that, although D is refreshingly agnostic about the size of its characters, it still falls into the array trap and is chauvinistic towards a subset of Unicode transformation formats.)

What this means is that the language should not have a character or string type, but instead a type for binary blobs like Erlang's. The programmer manipulates binary values using pattern matching (it should probably support Erlang-style matching and some flavour of regular expressions) and some reasonably efficient way of construction by concatenation (as well as its syntax for binary comprehensions, Erlang has the concept of IO lists which support scatter/gather IO). The idea is to shift from thinking of sequences of discrete characters to thinking about strings as structured mass data.

Note that (so far) this design has no built-in preference for any particular string encoding, which should help to make it flexible enough to live gracefully in environments that favour all sorts of textual data, including ones not designed yet. However, you need to support string constants reasonably gracefully, which either means that your source encoding is the same as the encoding you deal with at run time (so that the blobs in the source become identical blobs in the object) or you need a good way to transcode them. Ideally transcoding should only happen once, preferably at compile time, but it's probably OK to do it during static initialization. (Note that similar considerations apply to compiling regular expressions.)

To a large extent, what I'm doing is lifting the string representation chauvinism problem from the language level to the library level, and libraries are just as good as languages at growing ugly calluses around poorly-designed features - though you usually have to do less reimplementation when switching libraries. I'm also aiming to encourage a style of string manipulation more like perl's than C's. To succeed properly, though, it'll have to go further with encouraging good use of library support for international text - but at that point I'm stepping on to thin ice.


Comments {19}

Matthew Garrett

from: mjg59
date: 17th Sep 2007 02:05 (UTC)

random access into strings can lead to nonsense, e.g. if you miss a code shift sequence or you don't land on the start of a UTF-8 sequence or if you accidentally strip off a BOM.

I'm not sure that's true of the UTF-8 case. It's always possible to resynchronise UTF-8, and any competent parser ought to do so rather than just generating garbage. The worst case is that you end up with a single invalid character (though, of course, if you were expecting a useful string, the effect of that might be somewhat worse)

Reply | Thread

Tony Finch

from: fanf
date: 17th Sep 2007 07:03 (UTC)

Exactly: competent software doesn't try to use random access on a UTF-8 string, because that leads to the kind of awkward fiddling you describe.

Also, it's pragmatically useful to treat UTF-8 parse errors as ISO-8859-1, since that makes the software less brittle. Unix makes your locale setting essentially global - in particular, it does not support charset and language metadata on text files - so if you have a UTF-8 locale but sometimes need to process non-UTF-8 data, you want your traditional text processing tools to have relaxed parsers so that you don't have to keep frobbing your environment to make them work correctly.

However if you do that you sacrifice UTF-8's ability to resynchronize.

Reply | Parent | Thread

Gerald the cuddly duck

from: gerald_duck
date: 17th Sep 2007 10:29 (UTC)

Not necessarily.

Provided you retain knowledge of the extent of the entire string it's possible to determine whether you've plonked your pointer at a byte in the middle of a UTF-8 sequence or a byte that's an 8859 fallback. In my experience (which isn't extensive, but is at least recent and vivid) you have to retain that knowledge anyway for a whole host of other reasons.

In our code base, we model strings as forward-iterable sequences of characters. This lets us keep stuff as UTF-8 internally and, perhaps surprisingly, doesn't get in the way much even though we spend quite a bit of time picking strings apart. We've not even bothered with a function to step backwards yet.

Yes, we'd have been happier if the standard library had given us a cheap way of manipulating UTF-8, but we're in an embedded environment and don't want to bring in however many megabytes of locale guff.

Reply | Parent | Thread

Tony Finch

from: fanf
date: 17th Sep 2007 10:07 (UTC)

p.s. Even if you can resynchronize to codepoint boundaries, careless random access to Unicode strings will break things like combining characters, context-sensitive scripts like Arabic, and l2r/r2l mode switches.

Reply | Parent | Thread

from: vatine
date: 17th Sep 2007 06:07 (UTC)

SBCL (as one instance of "common lisp compiler") has strings as "arrays of characters" (as the language specifies), but the characters it uses may be (so far) either 8 bits or 29 bits wide (well, any singular character will occupy 32 bits of storage, due to tagging, but in an array they'll use either 1 or 4 octets), depending on whether the string is a BASE-STRING or a STRING (the former is a subtype of string composed entirely of ASCII or Latin-foo).

This meant that all stream-opening constructs grew the concept of external coding, so you can transcode on input or output (or, indeed, have a binary rather than string-based external source/sink).

Reply | Thread

Tony Finch

from: fanf
date: 17th Sep 2007 07:23 (UTC)

That's essentially the same model as D, and I argue it is broken.

If the idea of external encoding is similar to Unix locales then it is inadequate. You need to be able to transcode different parts of the input data in different ways depending on the data itself, for example when dealing with MIME objects transferred via HTTP or email.

A lot of software will want to work entirely in one encoding, and it makes sense for them to do transcoding and I/O together. But this should be understood as a convenient wrapper around two separate layers, where the I/O layer just deals with binary streams without any textual semantics.

Reply | Parent | Thread

from: vatine
date: 17th Sep 2007 08:04 (UTC)

It is possibly broken, though in one respect I think that "the right way" of providing a stream for (say) MIME-encoded multi-part mail is to provide a method to open each part as a stream in itself; that way each and every part gets read with the character-conversion or binariness that is required.

The reason SBCL does what it does is that the underlying language standard specifies a string as "An ARRAY of CHARACTER", and using variable-size characters depending on what the programmer asks for is the least broken model you can fit into that straitjacket.

Reply | Parent | Thread

Tony Finch

from: fanf
date: 17th Sep 2007 09:34 (UTC)

The point of this post is not to argue what's possible within old constraints, but that the constraints themselves are broken, and we need a better model.

Yes, you are right that stackable I/O layers can give you a nice API to some of MIME, but it's too heavyweight for RFC 2047 encoded words, for example. Text transcoding must be separate from I/O but pluggable into it, like compression, encryption, base64, etc.

Reply | Parent | Thread


from: hairyears
date: 17th Sep 2007 09:02 (UTC)

Have I got this right? What you're suggesting is that strings should be an array or list of pointers.

There are logical reasons why this is preferable to the current mess - library dependence is better than language dependence - but I can't help thinking that this particular solution is worse than the original problem: a lot of common string operations will become rather difficult. Strlen() being the obvious one - how much do we have to do before we know the size of the blobs? - but there are more subtle ones like Ucase() and casting to numbers which, while they are arguably bad programming practice, are very widely used.

Plus the whole idea of hoping that all participants have the same version of a particular string library replaces the problem of different parties using different string encoding schemes with a new problem, with the only saving grace being that all parties involved can at least compile... And maybe run if the library standard provides enough information for a flexible memory allocation scheme to manage the varying sizes of the blobs.

Reply | Thread

Tony Finch

from: fanf
date: 17th Sep 2007 10:50 (UTC)

I think that languages should not have character or string types, but instead a type for binary blobs without any textual semantics.

strlen has unclear semantics. If it's the size of the blob then it has a trivial O(1) implementation, since blobs need to know their sizes. Alternatively it might be the number of codepoints, number of glyphs, displayed width, etc. all of which depend on contextual information (encoding, charset, font). The library needs to make it convenient to deal with this context, but I don't have any suggestions how. Given a convenient library, operations like case-coercion and parsing out numbers are straightforward. (I dislike string/number type ambiguity and the idea that it makes sense to hide parsing and formatting behind language-level cast operations - they make more sense as library calls.)

The exact representation of blobs and how their memory is managed would have to depend on how high-level the language is, i.e. how much complexity it's willing to hide. A lower-level language probably wants to expose the representation of its scatter/gather data structure, and when copying and sharing happen. A higher-level language probably wants to hide these details under the blob abstraction. For example, Erlang's implementation of binaries is fairly complicated and varies depending on the binary's size, according to memory management and copying trade-offs, but the language still has a programmer-visible scatter/gather structure for fast appends.

Reply | Parent | Thread


The more I look at it the simpler it gets...

from: hairyears
date: 17th Sep 2007 13:54 (UTC)

I expect you've considered this already: a standard format for the library, in which calls 0 to 127 return the corresponding ASCII characters.

Yes, this will be abused horribly, but a lot of the common objections to this scheme will probably go away. However, I doubt that it is possible (and I am certain that it is far, far less desirable) to achieve any further backward-compatibility with the existing UTF-8 'standard'.

...Or is it? UTF-8 would become just another library and it may as well be the system default when right-to-left late-period Ptolemaic Coptic can't be found.

Speaking of which, how (or where, as in 'which layer') should rtl and ltr be implemented? If you're using linked lists rather than arrays, there is scope to tidy up a truly biblical mess. And to please the deep-sea-geeks by encoding a suitably exotic branched-list schema for Classical Klingon.

Back in the real world, I am entirely in agreement with your point here:

I dislike string/number type ambiguity and the idea that it makes sense to hide parsing and formatting behind language-level cast operations - they make more sense as library calls.

Nothing you can do will ever eliminate bad coding but, in higher-level languages, there is a clear need for a better structure for strings. This may sound counterintuitive - after all, the complexity will be hidden - but a day spent inserting unicode API calls to C++ libraries in an environment designed for Excel's VBA 'everything's-a-string' macro generator is a masterclass in the limitations of the current state of play.

Reply | Parent | Thread

Tony Finch

Re: The more I look at it the simpler it gets...

from: fanf
date: 17th Sep 2007 14:28 (UTC)

If you force compatibility with 7 bit ASCII then you break the alternate ISO 646 character sets, and EBCDIC, and ISO 2022, and UTF-16, and UTF-32. My aim is to make the lower levels (built in to the language) completely independent of text encoding, and to discourage character-at-a-time string manipulation in favour of higher-level tools such as regexes. Hence, the language should have no character type (and since it doesn't exist it doesn't have a standard encoding) and although you can subdivide binaries down to a byte (or even less, as I discussed in my previous post) there's no implication that bytes have anything to do with characters, or code points, or glyphs.

rtl and ltr is (I think) mostly the job of the text rendering library - Pango, or something like that. You're probably right that it's useful to be able to segment binaries into logical sections (e.g. different lines or paragraphs, as well as segments of different directions) whilst still chaining them together in a meaningful way (e.g. DOM). So you probably don't want to entirely hide scatter/gather behind the blob abstraction.

Reply | Parent | Thread

Strings and blobs need different APIs

from: andrep
date: 17th Sep 2007 10:30 (UTC)

The Cocoa frameworks on Mac OS X have some of the best string and blob (binary data) support that I've seen. You can create an NSString object (Cocoa's standard string class) from all common character encodings (ASCII, UTF-8, UTF-16, ISO-2022-JP, etc), and once you do, the entirety of the string API is available to you. For libraries that require strings to be encoded a particular way, it's one line of code to obtain the string as whatever encoding you like -- the string class will perform the conversion for you, with or without NULL-termination as appropriate. Internally, I believe NSString usually stores things in UTF-16, which enables O(1) for indexing into particular characters, falling back to a linear search if the string has been marked as having surrogate pairs. The Objective-C compiler on Mac OS X prepends a @ character to string literals to make them NSString*s rather than standard char*s, so you just write @"Foo" rather than "Foo".

Cocoa's NSData class (for storing raw binary data) is much simpler than its string class: it only stores a raw buffer of bytes and a buffer size.

You can, of course, create strings from raw binary data, but when you do, an encoding must be specified. Ditto for transforming strings to binary data: you have to give the output encoding format.

For C++ mavens, the ICU C++ UnicodeString class (http://icu-project.org/apiref/icu4c/classUnicodeString.html) has an excellent API, similar to NSString.

Reply | Thread

Tony Finch

Re: Strings and blobs need different APIs

from: fanf
date: 17th Sep 2007 12:14 (UTC)

That sounds pretty swish to me. I agree that the APIs need to be different, but blobs should be good enough that strings can be layered on top.

Reply | Parent | Thread

Re: Strings and blobs need different APIs

from: andrep
date: 17th Sep 2007 23:54 (UTC)

Well, a string has to have a backing store (i.e. a blob) of some sort, so I don't really see how blobs can't be "good enough". Blob APIs really aren't that hard; there's not much more you want to do with binary data other than get its content and find out its length (and perhaps insert and remove stuff if the blob is mutable).

I think the biggest problem is that it takes a long time to educate programmers that strings are a lot more complicated than just raw bytes of memory ending in NULL, and Unicode is a lot more than "16-bit strings". C strings are horrible to use not only because their API is horrible, but because their innate representation (NULL termination) is error-prone and slow.

Also, in most real-world usage, accessing individual characters in a string is very rare. I work on a website building application where we have extremely heavy text manipulation, and we have a total of eighteen lines of code that index characters into strings. Five of those lines are for checking the program's registration code, six of those lines are for syntax highlighting code, leaving a measly seven other lines of code for every single other aspect of the program. I think having a character type as well as a string type is fine, although like a string, it's important to assign that character some sort of code point (i.e. use Unicode or whatever).

I think it's also OK (and healthy) to have multiple string classes, since strings can be used in so many different ways. Ideally you'd want all the different string classes to have the same API, but it's not completely necessary.

Reply | Parent | Thread


Re: Strings and blobs need different APIs

from: charles
date: 19th Sep 2007 07:33 (UTC)


Manipulating strings is a fundamental part of writing software, and any language that doesn't provide a suitable abstraction for doing so is quite simply going to fail. By getting rid of built-in string types, all you're doing is shifting around the cognitive load.

The problem here, I think, is that C strings are a terrible starting-point for any discussion of how to do things better. If all you have to start with is a dumb byte-sequence that just happens to be called "string" for historical reasons, then choosing to go back to treating it like a dumb byte sequence seems like a good idea. :)

Reply | Parent | Thread

Tony Finch

Re: Strings and blobs need different APIs

from: fanf
date: 20th Sep 2007 08:26 (UTC)

It seems that I need to explain more clearly. What andrep described is close to what I have in mind. Objective C uses the legacy C string and character types not as strings and characters but as blobs and bytes. It defines all high-level functionality in its libraries, especially choice of encoding of strings. The only specific support for strings that the base language needs is syntax for string constants (about which I have more to write, but not now). Eliminating the built-in types does not imply the lack of a suitable abstraction.

Reply | Parent | Thread


from: ewx
date: 17th Sep 2007 11:56 (UTC)

aiming to encourage a style of string manipulation more like perl's than C's

This seems like a good way to sell the idea, given that people are often happy (rightly or not!) with their quick-fix solutions to their character encoding problems.

Reply | Thread


from: gareth_rees
date: 17th Sep 2007 22:07 (UTC)

Here's the Unicode interface from a project I worked on recently:
/* Return the first Unicode character in the UTF-8 encoded string
   pointed to by *p, and advance *p to the start of the next
   character. */
u32 unicodeDecodeUTF8(const u8 **p);

/* Encode up to 'n' Unicode characters from the array 'chars' in UTF-8,
   writing the result into the buffer pointed to by *p, which is 'size'
   bytes long. Return the number of characters encoded. The buffer is
   terminated with a zero byte unless 'size' is 0. If successful,
   advance *p to the end of the string. */
size_t unicodeEncodeUTF8(u8 **p, size_t size, const u32 *chars, size_t n);
The standard C library functions strcmp, strncpy and snprintf were also useful. For video game localization, there's not much in the way of sophisticated string processing to do.

Reply | Thread