dotat

Tony Finch

Tuesday 9th June 2009

Tempting Fate, and getting her unwanted attention

Yesterday I dealt with a fairly routine question from a computer officer about sending email from a web server. It's always nice when someone asks for advice, even when they expect the answer won't contain any surprises, just in case it does. I gave our usual advice about the safest ways to set up web forms that send email. The bit that tempted fate was saying that web forms have caused spam problems for us "in the past" - and I had been reflecting recently that we haven't had a big web spam problem for quite a long time.

Our recent problems have all been account compromises of one kind or another. One that will be familiar to other postmasters is caused by users sending their login credentials in reply to a phishing message. We seem to have a relatively clued-up userbase, based on tales from other universities; very few reply to most phishing spams and most of those replies do not contain passwords - they are more or less skeptical and sometimes taunt the spammer. However there's still the occasional idiot, and even a particularly special one who had to be re-educated twice! If any passwords do escape and accounts get used to spam, our strict rate limits for remote email users shut the unwanted flows down pretty quick.

More recently we have had problems with compromised accounts on email systems other than our own. These seem to have been due to infected PCs on networks with a Microsoft Windows domain and Exchange server - I don't think remote access (e.g. via Outlook Web Access) was to blame in these cases. We have been reasonably successful applying rate limits to mail servers that relay outgoing mail through us. We set the limit for each server to be about 10% above their normal peak usage, so it should not trigger unless something suspicious is happening, and we have a monitoring script to warn us if they stray into (or past!) the 10% headroom. The first time this mechanism caught a spam run it stopped an 80,000 message flood about 2% of the way through.

So Fate observed my complacency about web form compromises and handed us a big plate full of crap today. Shortly before I got into work, one of our departments started spewing email and fairly swiftly whacked into their rate limits. I was foolishly trusting of their competence and aware of their occasional need to send large mailshots, so I tried to let the mail through - but it rapidly became clear that tens of thousands of messages was way beyond normal and their arrival rate of hundreds per second was going to cause us problems even if it was legit. Silly me for not looking at a sample message. I got in touch with the department's techies and it rapidly became apparent there was a compromise, and we went into lock-down and clean-up mode. Between us we had to delete about 100,000 queued messages, with the added joy that we had to avoid deleting the couple of dozen legitimate messages that had the same sender address as the spam. Sadly my error meant that 20,000 messages escaped (plus 6000 with invalid recipient addresses) whereas it should have been only two or three thousand :-(

As I said yesterday, there are two ways to ensure that a web form sends email safely. The first is to fix the recipient address, for example in "send feedback to the webmaster" forms, or mail to registered users of your site. The second is to fix the contents of the message, for example in signup or purchase confirmation messages. Either of these is enough to make the form useless to spammers.

Today's excitement was caused by a "mail a copy of this story" form, which should have been implemented safely in the second style. However the form allowed remote users to add their own comments above the story, and gave them plenty of space to do so - enough for a typically lengthy 419 scam.

It's fair to say this has been a learning experience, more painful for some than for others :-) I should remember that technical cockups are one thing, but even if you are technically competent political requirements can still screw the pooch.

(3 comments | Leave a comment)

Tuesday 2nd June 2009

CRSIDs and email addresses

I've recently written a three page memo about the advantages of our username scheme compared to "friendly name" email addresses in large domains. I have put a copy of the PDF on the web which you can read if you like.

(9 comments | Leave a comment)

Friday 15th May 2009

Use your bonce

Vesta includes a purely functional programming language for specifying build rules. It has an interesting execution model which avoids unnecessary rebuilds. Unlike make, it automatically works out dependencies in a way that is independent of your programming language or tools - no manually maintained dependencies or parsing source for #include etc. Also unlike make, it doesn't use timestamps to decide if dependencies are still valid, but instead uses a hash of their contents; it can do this efficiently because of its underlying version control repository. Vesta assumes that build tools are essentially purely functional, i.e. that their output files depend only on their input files, and that any differences (e.g. embedded timestamps) don't affect the functioning of the output.

I've been wondering if Vesta's various parts can be unpicked. It occurred to me this morning that its build-once functionality could make a quite nice stand-alone tool. So here's an outline of a program called bonce that I don't have time to write.

bonce is an adverbial command, i.e. you use it like bonce gcc -c foo.c. It checks if the command has already been run, and if so it gets the results from its build results cache. It uses Vesta's dependency cache logic to decide if a command has been run. In the terminology of the paper, the primary key for the cache is a hash of the command line, and the secondary keys are all the command's dependencies as recorded in the cache. If there is a cache miss, the command is run in dependency-recording mode. (Vesta does this using its magic NFS server, which is the main interface to its repository.) This can be done using an LD_PRELOAD hack that intercepts system calls, e.g. open(O_RDONLY) is a dependency and open(O_WRONLY) is probably an output file, and exec() is modified to invoke bonce recursively. When the command completes, its dependencies and outputs are recorded in the cache.

bonce is likely to need some heuristic cleverness. For example, Vesta has some logic that simplifies the dependencies of higher-level build functions so that the dependency checking work for a top-level build invocation scales less than linearly with the size of the project. It could also be useful to look into git repositories to get SHA-1 hashes and avoid computing them.

It should then be reasonable to write very naive build scripts or makefiles, with simplified over-broad dependencies that would normally cause excessive rebuilds - e.g. every object file in a module depends on every source file - which bonce can reduce to the exact dependencies and thereby eliminate redundant work. No need for a special build language and no need to rewrite build scripts.

(15 comments | Leave a comment)

Thursday 14th May 2009

Define SCM

Here's an abbreviation to avoid, because its various meanings overlap so heavily.

  • Source Code Manager: synonym for version control system, e.g. as used by the Linux crowd in the early days of git.
  • Software Configuration Manager: usually incorporates version control, build system, archival of build products, and CASE methodology baggage, e.g. ClearCASE and Vesta.
  • System Configuration Manager: for system administrators rather than developers, e.g. cfengine and Puppet.
(1 comment | Leave a comment)

Wednesday 13th May 2009

Never delete anything

How long will it be before it becomes normal to archive everything? It's already normal in some situations, and I think that's increasing. It's been the norm in software development for a long time. There's an increase in append-mostly storage systems (i.e. append-only with garbage collection) which become never-delete systems if you replace the GC with an archiver. Maybe the last hold-outs for proper deletion will be high data volume servers...

Anyway, I feel like listing some interesting append-only and append-mostly systems. A tangent that I'm not going to follow is the rise of functional programming and immutability outside the field of storage. Many of these systems rely on cryptographic hashes to identify stuff they have already stored and avoid storing it again, making append-only much more practical.

  • All version control systems, and software configuration management systems even more so. The former archive source code whereas the latter archive build tools and build products as well. DEC's Vesta SCM is particularly interesting, being based on a purely functional build language designed to maximize memoization - i.e. minimize unnecessary rebuilds. It's sort of ccache on steroids since it caches the results of entire module builds, not just individual source file compiles.
  • Nix is a purely functional package manager. Unlike most packaging systems like dpkg or rpm, Nix packages do not conflict with each other: you upgrade by installing new packages alongside your existing ones, then you stop running the old ones and start running the new ones.
  • Archival / backup systems, like Venti which is Plan 9's append-only filesystem. Apple's Time Machine isn't nearly as clever.
  • Most filesystems don't use hash-based uniquification. Append-mostly filesystems often provide cool undelete features like snapshots, e.g. NetApp's WAFL or Sun's ZFS. Early filesystems of this kind, e.g. BSD LFS tried to avoid wasting space, so didn't make old data available as snapshots, and sacrificed performance to eager garbage collection. More recently, DragonFly BSD's Hammer filesystem doesn't even have an in-kernel garbage collector, and running it is entirely optional.
  • Email archives: gmail's ever-increasing quotas, cyrus delayed expunge.
(6 comments | Leave a comment)

Thursday 23rd April 2009

Some thoughts about git

I was originally planning to witter about distributed version control vs. centralized version control, especially the oft-neglected problem of breaking up a large cvs / svn / p4 repository. This was partly triggered by Linus's talk about git at Google in which he didn't really address a couple of questions about how to migrate a corporate source repository to distributed version control. But in the end I don't think I have any point other than the fairly well-known one that distributed version control systems work best when your systems are split into reasonably modestly-sized and self-contained modules, one per repository. Most systems are modular, even if all the modules are in one huge central repository, but the build and system integration parts can often get tightly coupled to the repository layout making it much harder to decentralize.

Instead I'm going to wave my hands a bit about the ways in which git has unusual approaches to distributed version control, and how bzr in particular seems to take diametrically opposing attitudes. I'm not saying one is objectively better than the other, because most of these issues are fairly philosophical and for practical purposes they are dominated by things like quality of implementation and documentation and support.

Bottom-up

Git's design is very bottom-up. Linus started by designing a repository structure that he thought would support his goals of performance, semantics, and features, and worked upwards from there. The upper levels, especially the user interface, were thought to be of secondary importance and something that could be worked on and improved further down the line. As a result it has a reputation for being very unfriendly to use, but that problem is pretty much gone now.

Other VCSs take a similar approach, for example hg is based on its revlog data structure, and darcs has its patch algebra. However bzr seems to be designed from the top down, starting with a user interface and a set of supported workflows, and viewing its repository format and performance characteristics as of secondary importance and something that can be improved further down the line. As a result it has a reputation for being very slow.

Amortization

Most VCSs have a fairly intricate repository format, and every operation that writes to the repository eagerly keeps it in the canonical efficient form. Git is unusual because its write operations add data to the repository in an unpacked form which makes writing cheaper but makes reading from the repository gradually less and less efficient - until you repack the repo in a separate heavy-weight operation to make reads faster again. (Git will do this automatically for you every so often.) The advantage of this is that the packed repository format isn't constrained by any need for incremental updates, so it can optimise for read performance at the expense of greater pack write complexity because this won't slow down common write operations. Bzr being the opposite of git seems to do a lot more up-front work when writing to its repository than other VCSs, e.g. to make annotation faster.

Thus git has two parallel repository formats, loose and packed. Other VCSs may have multiple repository formats, but only one at a time, and new formats are introduced to satisfy feature or performance requirements. Repository format changes are a pain and happily git's stabilized very early on - unlike bzr's.

Laziness

As well as being slack about how it writes to its repository, git is also slack about what it writes. There has been an inclination in recent VCSs towards richer kinds of changeset, with support for file copies and renames or even things like token renames in darcs. The bzr developers think this is vital. Git, on the other hand, doesn't bother storing that kind of information at all, and instead lazily calculates it when necessary. There are some good reasons for this, in particular that developers will often not bother to be explicit about rich change information, or the information might be lost when transmitting a patch, or the change might have come from a different VCS that doesn't encode the information. This implies that even VCSs that can represent renames still need to be able to infer them in some situations.

Git's data structure helps to make this efficient: it identifies files and directories by a hash of their contents, so if the hash is the same it doesn't need to look any closer to find differences because there aren't any - and this implies a copy or rename. This means that you should not rename or copy a file and modify it in the same commit, because that makes git's rename inference harder. Similarly if you rename a directory, don't modify any of its contents (including renames and permissions changes) in the same commit.

Mercurial also uses hashes to identify things, but they aren't pure content hashes: they include historical information, so they can't be used to identify files with the same contents but different histories. Thus efficiency forces hg to represent copies explicitly.

Any more?

I should say that I know very little about bzr, and nothing about tla, mtn, or bk, so if any of the above is off the mark or over-states git's weirdness, then please correct me in a comment!

(6 comments | Leave a comment)

Wednesday 8th April 2009

LISTSERV crapness

LISTSERV uses a null return path (RFC821 MAIL FROM:<>) on its administrative mail, and some mail hosts reject this. I discard message with a null return path that do not match a few simple heuristics, so I lose things like subscription confirmations from services like JISCfail. This makes me cross.

L-Soft claim that LISTSERV is following the specifications, and they cite a couple of paragraphs from RFC 821 (published in 1982) and RFC 1123 (1989). However they fail to cite text from RFC 2821 (2001) which explicitly forbids what they are doing.

There are several types of notification messages which are required by existing and proposed standards to be sent with a null reverse path, namely non-delivery notifications, other kinds of Delivery Status Notifications (DSNs), and also Message Disposition Notifications (MDNs). [...]

All other types of messages (i.e., any message which is not required by a standards-track RFC to have a null reverse-path) SHOULD be sent with with a valid, non-null reverse-path.

The only other permitted use of null return paths that I know of is vacation notifications, described in RFC 3834 (published in 2004).

L-Soft needs to get a grip, read some RFCS, and fix their software.

(6 comments | Leave a comment)

Thursday 2nd April 2009

Configuration management

We've been discussing configuration management in the office this week. None of us are happy with the way we're doing things...

On Hermes, we do most configuration management using rdist. On our management server we have a number of file trees containing files that differ from the default - a smattering of stuff in /etc, the contents of /home and /opt, and a few other bits scattered around. These trees are cloned and hacked for each flavour of machine (ppswitch, cyrus, lists, webmail, etc.) and each version of the underlying OS. These trees are mostly kept under revision control.

This setup has the advantage of simplicity, and it's easy to push out small changes. One key feature is rdist's testing mode, which makes it easy for us to ensure that the servers are in sync with the configuration tree without changing anything. We often run a test across a cluster of 10 or 16 machines in parallel. It's easy to selectively push a change to an idle machine for testing before rolling it out to the rest of the cluster. For more tricky changes I do a phased rollout so I can check for unwanted changes in behaviour without breaking the entire service at once.

Of course it has some serious disadvantages. We have to be root on the management host to be able to push changes out to the servers. We can't easily keep file ownership and permissions under revision control. This mechanism misses significant parts of the configuration, such as installed base OS packages and which rc scripts are enabled. There's also a lot of scope for improving our initial OS install scripts to reduce the amount that they get out of sync with the rdist configuration.

So we'd like something better. Our colleagues have different system management setups with different problems, and they are also looking for something better.

A couple of my colleagues have looked at Puppet but weren't happy with it. I dislike its basic design. Managed servers pull their configurations from the master, which means you must never screw up a change on the master, and it's harder to test changes - you have to explicitly set up test server profiles. Yes, you can run Puppet in push mode but it's often a bad idea to work against a program's basic architecture. Puppet also has its own security mechanisms, whereas I'd prefer to avoid multiplying channels of trust. Finally, I really hate writing configuration files for programs that write configuration files. It's a waste of brain cells to understand this superfluous abstraction layer: I just want to write the underlying configuration file directly.

Aside: this is why I really really hate autoyast, which uses an XML configuration file to control a program that writes configuration files that control rc scripts that manipulate the underlying configuration files for the programs you actually care about. It takes hours to work out how to make it produce the correct results.

I spent some time working on a program called pstx which was to be a suped-up replacement for rdist. It was going to be an sftp client (so it would require nothing special on the target servers) that had a simple configuration file to specify which directory trees to copy where, including ownerships and permissions (bringing them under revision control and removing the need to be root on the master), and possibly with added bells and whistles like remote diff, and a reverse (pull) mode. I also intended to make it easer to combine collections of files and thereby share common files. Sadly it's only about a third written and likely won't get much further.

One of the conversations this week was about how to reduce the chance of mistakes when rolling out changes, in which we discussed the use of revision control to help with testing and rollback. At the moment Hermes mostly uses CVS and other unixy bits of the CS share a Subversion repository, so the conversation very much assumed a central repository model - which fits in well with a master configuration server. Recently I have also been investigating git more seriously, so I thought it might be part of a solution.

The key things that git provides include a flexible and efficient network protocol for moving files around (flexible in that it can use ssh and http as well as git's native protocol, and efficient in that it's incremental and compressed), it can tell us what differs between a directory tree and the state of the repository, changes can be pushed and pulled, etc. The distributed version control functionality is way better at all this than pstx was ever going to be.

The big missing part is the ability to track ownerships and permissions: git only supports normal development checkouts which should be done using the developer's uid and umask. There are also some low-level problems with the way git performs checkouts: it removes the destination file before writing the updated version in its place. I would like a special deployment command which checks all the changed files out under temporary names, fixes their ownerships and permissions, then renames them into place. The first stage almost exists in the form of git-checkout-index --temp d, though it does weird things with symlinks. It doesn't look like much work to add the other stages.

This could provide some really nice workflows. A configuration change is created on a small topic branch, then pushed to the configuration repositories on all the machines - which changes none of the live configuration. You can then switch a test machine to the new branch with a single git deploy branch, or roll back by re-deploying the master branch. You can find out the current configuration of a machine in summary using git status or git diff to make sure the summary isn't a lie.

One interesting possibility might be to tame the clone-and-hack problem by using a shared repository for all the different classes of machines. This should make it easier to see the differences between each kind of machine and spot spurious ones.

One thing that is very manual in our current setup is hupping daemons to make them load an updated configuration. It might be feasible to use a post-deploy hook to do this, provided the hook script is given enough information that it can avoid unnecessary restarts.

Now I just need to find some time to turn the idea into code...

(11 comments | Leave a comment)

Wednesday 25th March 2009

Ada Lovelace Day

This is a bit late, but I've been inspired to post after reading about the women other people admire. Here are a few of mine.

  • Margo Seltzer: developer of the BSD log-structured filesystem and Berkeley DB. An open source entrepreneur - founder and CTO of Sleepycat Software - who also has a successful academic career.
  • Dina Katabi invented XCP, the eXplicit Congestion control Protocol, which is a brilliant way for routers to make each individual flow adjust to the level of congestion without the need for any per-flow state. I think it was one of the first papers I read about advanced transport protocol research, and I have continued to follow the subject since then.
  • Lisa Dusseault is a WebDAV expert and IETFer, currently one of the Applications Area Directors and therefore a member of the IESG. She was one of the funner people I met at the Paris IETF meeting a few years ago. I like the new ideas she's brought to the IESG, such as monthly updates on her work as Apps AD.
  • And of course [info]rmc28, who got the University Card working properly and now slaves away on the student information system.
(4 comments | Leave a comment)

Sunday 15th March 2009

John Taylor talks about the Corpus clock

I posted a couple of articles (one, two) about the Corpus clock soon after it was unveiled, and I optained a copy of the Chronophage book soon after it became available. The book is a bit lacking in technical detail, so I was pleased to find out that John Taylor would be talking about it for the Cambridge Science Festival and I attended his talk yesterday. The lecture theatre was packed.

The talk was in three parts: a bit of autobiography, then a quick overview of the clock and its development, then questions. The talk revealed some interesting facts which do not appear in the book.

Taylor's story of how he came to make a fortune in kettle thermostats is relevant to the clock not just because that's how he made the money: his horological hero, John Harrison, invented the grasshopper escapement that inspired the Chronophage, and also invented the bimetal on which Taylor's thermostats rely. He explained his snap-action bimetallic disc, which has a C cut out to form a tongue in the middle; this tongue moves further than the centre of a disc without the cut-out does, which loosens the tolerances required to manufacture a thermostat. The calligraphic flourish engraved below his name on the clock's pendulum bob forms the shape of his bimetallic disc. (Sorry, I can't find a picture on line though it's visible in the book.)

ETA: I have found a large picture of the pendulum bob in which you can see the shape of Taylor's bimetallic disc under the date. It's overdrawn by some of the flourish so it's a bit obscure unless you know what you are looking for.

There were a couple of hints at the clock's workings. It is usual for clocks with chiming mechanisms to have a second set of drive springs or weights for the chimes. The Corpus clock has only one spring, so its hourly death-rattle is driven in an unusual way. This is why the clock ticks back and forth while rattling the chain in the coffin. There are two main mechanisms inside the Chronophage itself. It contains a mechanical pseudo-random mechanism to control the blinking and biting. The sting, however, is regular: it slowly rises then jabs down on each quarter hour.

He also mentioned that the clock "has fun" four times a year: on 24th March, which is the date of Harrison's birth and also his death; on 25th November, Taylor's birthday; on new year's eve and new year's day; and on Corpus Christi day, the Church's festival that occurs on the Thurdsday after Trinity Sunday, the 8th Sunday after Easter. The clock is away for maintenance this week, so I hope it'll be back in time to muck around on Tuesday week. The fun seems to involve even more erratic ticking than usual, with more colourful lights.

The clock's spring has to be particularly strong in order to overcome the momentum of the large escape wheel that goes around the outside of the clock. When prototyping the clock this caused a serious problem: the force of the spring is transmitted through the escapement to the pendulum, making it swing higher and higher and eventually breaking the clock. Taylor overcame this problem by adding a regulator, which also serves two other functions: it produces the clock's erratic behaviour that plays tricks with observers, and it listens to the MSF time signal to synchronize the clock every five minutes. (Taylor confirmed to me after the talk that these are all functions of the same mechanism. He also said that the hollow pendulum bob is not in fact "massive and weighty" as the book says, which answered my question about conservation of momentum.) I'm not sure how this is consistent with his assertions that the clock is purely mechanical - would it work if the computer regulator broke? News reports about the clock's teething problems suggest that it's pretty vital. I wonder if the maintenance periods this week and back in January are for software patches rather than mechanical adjustment?

I asked Taylor if he would publish more technical details about the clock, but he thinks that the mystery makes it fun. I disagree :-/

(5 comments | Leave a comment)

Thursday 5th March 2009

Job Ad

My colleagues over in MISD (the University's administrative computing unit, as opposed to the CS which is the University's ISP and IT training department) are looking for another DBA to help look after the Oracle running under exciting services like the financial system and student information system.

http://www.admin.cam.ac.uk/offices/hr/jobs/vacancies.cgi?job=4781
(Leave a comment)

Saturday 21st February 2009

Microblogging

Since April 2002 I have been keeping a link log: a web page where I record links to other pages that are notable in some way, with a brief comment about each one. It's gatewayed to Delicious, LiveJournal, and Twitter. The Twitter feed is new.

I've been on Twitter since March 2007, when it made a big splash at the SXSW festival. I joined to see what the fuss was about but soon stopped using it. None of my social circle were using it much, and there wasn't a sensible way of finding people in the more distant reaches of my social network. For 18 months my last tweet said " Grumbling about the difficulty of finding out people's twitter usernames".

Recently Twitter has taken off in the UK, since a fairly large contingent of journalists and celebrities have started using it and talking about it. (Reminds me in a small way of the rise of the web in 1993/4.) Coincidentally at about the same time I decided to get back into using Facebook and Twitter a bit more, to keep in better touch with my family's activities. (They're on Facebook but I linked my Twitter and Facebook statuses together for convenience.)

The celebrity thing helped me to get more out of Twitter in a significant way. I had thought of it as more like Facebook or LiveJournal: a way of keeping up with the activities of people I already know. But in fact it's more like the wider blogging culture where it's normal to follow someone's writing just because they are interesting. (Yes, LJ blurs into this kind of blogging too.) In fact the even the Twitter guys didn't realise this at first: until June 2007 Twitter sent me email saying "You are foo's newest friend!" but the following month it started sending me email saying "foo is now following you on Twitter!". I now feel free to follow well-known people just because they are interesting, and I found more interesting people by looking through the lists of people they follow.

Part of the Twitter culture is heavy use of short URL services when posting interesting links. I decided that my link log ought to be gatewayed to Twitter to join in the fun, so I embarked on a bit of hacking. Previously it had been just a static HTML page, with a command-line script that I used to insert a new link. There was another script to turn the HTML into an RSS feed for LJ, and some code for uploading the results to chiark. I had a rudimentary click tracker so that I could get some idea of what links other people followed, but it did not shorten URLs. (That meant it was an open redirector until I found out what evil that made it vulnerable to.)

I replaced all the old static HTML and RSS gubbins with a CGI script which can format my link log in HTML or Atom, as well as providing short URLs for all the links. I decidided to host my own short URLs partly to continue tracking what people find interesting, and partly because I have a vanity domain that's just right for this job :-) The command line script for adding new links remained basically the same, apart from new code to create short tags and to post the link to Twitter.

After a couple of weeks of this new regime I decided I needed to be able to save links found when reading Twitter or LiveJournal on my iPod Touch. Its UI makes it easy to email links but not much else, so I set up a gateway from email to my link log. I send links to a specially tagged email address which is filtered to a special mailbox. A script polls this mailbox over IMAP every few minutes, looking for messages that match a number of paranoid checks. From these extracts a URL and a comment which it passes on to the old command-line script. (I still use the latter when I'm on a computer with a keyboard.)

The end result is pretty neat: I can easily save interesting links that I see in places where that was not previously practical. It's probably a good thing I don't have an iPhone because then I'd be able to spod everywhere.

PS. a brief hate: web sites that redirect deep links to the mobile version of their front page. Utter cWAP. They'll get no Google Juice from me. (In fact the iPhone browser is good enough that mobile sites are usually NOT an improvement.)

(1 comment | Leave a comment)

The joy of lpeg

I've recently started playing with lpeg, a parsing library for Lua. It is based on "Parsing Expression Grammars", which were recently popularized by the prolific Bryan Ford. PEGs have some nice properties: they're suitable for unified parsers that handle both the low-level lexical syntax as well as higher-level hierarchial syntax; they have much simpler operational semantics than traditional extended regexes or context-free grammars; and as well as familiar regex-like and CFG-like operators they have nice features for controlling lookahead and backtracking. PEGs were originally developed alongside a cute algorithm for linear-time parsing which unfortunately also requires space linear in the input size with a fairly large multiplier. Lpeg instead uses a simple parsing machine, implemented somewhat like a bytecode interpreter. Its performance is quite competitive: the long paper says it has similar performance to pcre and glibc's POSIX regex implementation, and the short paper says it has similar performance to lex and yacc.

Lpeg actually consists of two modules. The core lpeg module is written in C and allows you to compose parser objects using operator overloading, building them up from primitives returned from tersely named constructor functions. The resulting syntax is rather eccentric. On top of that is the re module which provides a more normal PEG syntax for parsers, which despite the name of the module are rather different from regular expressions. This module is written in Lua, using an lpeg parser to parse PEGs and construct lpeg parsers from them. The PEG syntax is extended so that you can define "captures". Captures are the really nice thing about lpeg. You can use them like captures in Perl regexes to just extract substrings of the subject, but you can often do better. Lpeg captures are more like the semantic actions that you can attach to rules in parser generators like yacc. So, where in Perl you would do the match then fiddle around with $1, $2, etc, with lpeg the match can incorporate the fiddling in a nice tidy way. (In fact, probably the closest comparison is with Perl 6 rules, but they're not yet practically usable.)

The program I was writing with lpeg was to process some logs. I needed to convert the timestamps from ISO 8601 format into POSIX time_t which implied converting 8 fields from strings to numbers. Rather than having to convert each capture individually, or loop over the captures, I could write a single grammar rule to match a pair of digits and convert it to a number, then refer to that rule elsewhere in the grammar. (In fact Lua will coerce strings to numbers implicitly in most - but not all - circumstances. I happened to be tripped up trying to compare a number with a string, which doesn't coerce.) In the end it's nicest to let the parser drive all the program's activity through its semantic actions.

code )
(Leave a comment)

Thursday 5th February 2009

Impressive display of security clue from the Student Loans Company

The Student Loan Company executive management board minutes from a meeting just over a year ago says the following in section 6, "update on data security processes":

RSJ provided an update on Data Security and advised that information which was being received from external sources confirmed that the transfer of data on removable media devices was now unacceptable. He stated that there was a need to consult with HEI’s as to the method of transferring Attendance Confirmation Reports as SLC now had PGP encryption software available which could replace the previous method of transferring the data via CD’s. He also stated that the PGP software which SLC were using should be checked to ensure that it was on the US Government list of standard encryption as HEI’s are only permitted to use PGP software from this list.

Not shipping media is good. Using end-to-end encryption is good. (Unlike banks which seem to like SMTP over TLS, which provides no additional security for inter-domain communication.) I wonder why the choice of PGP instead of S/MIME - I believe that PGP usually requires an add-on whereas S/MIME is often built in to MUAs. Perhaps they've been nobbled by a vendor.

(6 comments | Leave a comment)

Thursday 29th January 2009

Meanies

In my previous post I said I was using the standard formula for the exponentially weighted moving average. This is not entirely true because the standard formula uses a complementary smoothing constant (α = 1 − a) which allows you to write the formula as an incremental adjustment to the previous value.

It's also possible to express the standard unweighted mean in a similar manner.

I was quite pleased when I found out about this analogy :-)

(Thanks to wikipedia for rendering the maths.)

(3 comments | Leave a comment)

Tuesday 27th January 2009

Luck or judgment?

In 2005 I developed a mathematical model for measuring average event rates which became the core of a new rate limiting feature for Exim. The model has a particularly useful property which I did not expect it to have and which I did not (until recently) fully understand.

The model

The central formula is the standard exponentially weighted moving average:

rnew = (1 − a) ∗ rinst + a ∗ rold

We use this to calculate the new average rate from the instantaneous rate and the result of the previous average calculation. These rates are all measured in events per some configurable time period. The smoothing factor a is a value between 0 and 1 which determines how slowly the model forgets about past behaviour.

We calculate rinst and a from the raw inputs to the formula, which are p, a time period configured by the postmaster, and i, the time interval between the previous event and this event. By dividing them we get

rinst = p / i

In this formula, p determines the per time unit used for measuring rates, e.g. events per hour or events per day.

The exponentially weighted moving average is usually used to smooth samples that are measured at fixed intervals, in which case the smoothing factor, a, is also fixed. In our situation events occur at varying intervals, so the smoothing factor needs to be varied accordingly.

a = e−i / p

In this formula, p determines the smoothing period, i.e. the length of time it takes to forget 63% of past behaviour.

A useful property

When developing the model, I needed to understand how it reacts to changes in the rate of events. It's fairly simple to see that if the rate drops, the average decays exponentially towards the new rate. It's less clear what happens when the rate increases. A particular practical question is what happens if there's a sudden burst of messages? How much of the burst gets through before the average rate passes some configured maximum?

I did some trial computations and I found that when the interval is very small (i.e. the rate is high) the average rate increases by nearly one for each event (the smaller the interval, the closer to one). This means that the maximum rate is also the maximum burst size. How wonderfully simple! The postmaster can configure the model with two numbers, a maximum number of events per a time period, which also directly specifies the units of measurement, the smoothing period, and the maximum burst size.

(The maximum burst size is larger for slower senders, increasing to infinity for those sending below the maximum rate. However you don't have to be much above the maximum for your average to hit the limit within one period.)

This property produces a simple user interface, but I did not understand how it works. It's obvious that when i is small, a ≈ 1, but it is not so clear why (1 − a) ∗ rinst ≈ 1. It seems I landed directly on a mathematical sweet spot without the fumbling around that might have led me to understand the situation better.

Arbitrary choices

I made two choices when developing the model which at the time seemed arbitrary, but which both must be right to get the max rate = burst size property.

Firstly, I decided to re-use the configured period for two purposes: as the smoothing period and as the per time unit of rates. I could instead have made them independently configurable, but this seemed to give no benefits that compensated for the extra complexity. Alternatively, I could have used a fixed value instead of p in the formula for rinst, but that seemed liable to be confusing or awkward, and to require more mental arithmetic to configure.

Secondly, I decided to use e as the base of the exponent. I could have used 2, in which case the smoothing period would have been a half life, or 10, so that 90% of past behaviour would be forgotten after one period. There seemed to be no clear way to choose, so I split the difference and went with e on the basis of mathematical superstition and because exp() has fewer arguments than pow().

How it works

Looking back over my old notes this week, I had a revelation that the max rate = burst size property comes from the fact that the gradient of ex is 1 when x is 0. To show why this is so, I first need to define a bit of notation:

y ≡ f(x) ≡ ex

δy ≡ f(x + δx) − f(x)

δx ≡ −i / p

This allows us to say, when x is zero and δx is small,

(1 − a) ∗ rinst
= (1 − e−i/p) ∗ p/i
= (1 − eδx) / −δx
= (eδx − 1) / δx
= (e0 + δx − e0) / δx
= ( f(0 + δx) − f(0) ) / δx
= δy / δx
dy / dx
= ex
= 1

Sweet.

(Leave a comment)

Wednesday 21st January 2009

The MacBook saga

Three months ago, Apple released their new "unibody" aluminium MacBook laptop, and just over a month before that they released the second-generation iPod touch. I was in the market for a new Mac laptop, and at some point I planned to upgrade my iPod when a 64GB Touch or iPhone became available. However when I found out that Apple were offering £95 "back to school" rebates to those who bought an iPod Touch and a MacBook, I couldn't wait any longer - this was the 30th October and the offer ceased at the end of the month.

Unfortunately I made the order with a rather crusty old version of Firefox which handled Apple's customization JavaScript incorrectly - it discarded all the options I chose, and because I wasn't familiar with the order process I didn't know that the final description screen should have included more details than it did. This wasn't a problem for the bits and pieces that I could also buy at the local greengrocer Apple Store, but one important option I chose was the US keyboard layout. I wanted it partly because I normally use a Happy Hacking keyboard which has a similar layout, and partly because the ISO layout has thin return key that is awkward to press as well as being ugly.

When I discovered this upon unboxing the new Shiny! my heart sank. You can't change the keyboard on the current MacBooks without replacing the entire machine. But I phoned Apple to see what I could do about it, and I was pleasantly surprised to find out that they would replace the machine with the correct model for free, no questions asked.

I discovered Apple's FAIL when the second MacBook arrived. I had been sent the "International English" model, which is identical to the "British" layout apart from a couple of currency symbols. They also sent me a continental power plug instead of a BS 1363 plug. When I phoned Apple they were suitably contrite, even offering me £70 compensation for the error!

Until this point (two weeks after the original order) I was very happy with Apple's support - no dumb scripts, reasonably competent staff. Sadly this didn't last. While the phone staff continued to be helpful and informative about what was going on, I had to keep phoning to get an update because the supervisor who was supposed to be dealing with the problem was too busy doing other stuff. The cause seemed to be that the department which handled replacements didn't believe that US keyboards were available in the UK (despite what the online store said) and the support people were incapable of fixing or working around this failure.

What was worse was that it took another three weeks to come to this conclusion. (I was away in France for one of them, during which I didn't chase Apple so nothing happened.) I couldn't use my new iPod because I didn't want to have the pain of re-pairing and re-populating it with music etc. This all made me extremely cross.

In the end I had to return the second MacBook for a refund and buy a third one with the correct configuration, which finally arrived six weeks after the original order. Fortunately this didn't affect my eligibility for the rebate - in fact they had already sent me the £95, though I hadn't sent in the proof-of-purchase paperwork they said they needed. (Um, I didn't notice the extra money until this month!) Unfortunately they cancelled the compensation they promised me! They fixed that after I made an irate phone call, but sheesh, I still get angry when remembering it.

But I suppose it's all OK in the end. The MacBook and iPod Touch are things of astounding beauty and utility. The iPod utterly blows away the Nokia 770 I previously used for spodding, and the MacBook replaces both my elderly PC laptop and my Mac Mini (though the latter will probably become a backup server). It's really nice to be able to hack away on the comfy sofa, and I'm playing choonz via the Airport Express a lot more than I did.

(Leave a comment)

Friday 16th January 2009

RJ45 is too fat

One notable feature of the ports on the new MacBook is that the RJ45 Ethernet socket only just fits - it wouldn't if the machine were much thinner. In fact the MacBook Air notoriously doesn't have any wired ethernet connection at all, relying on IEEE 802.11n WiFi for connectivity.

Of course other manufacturers are making expensive thin laptops. The Dell Adamo neatly solves the RJ45 thickness problem by putting the port behind the display hinge, where the machine is as thick as the body plus the display.

The Voodoo Envy's solution is to put the ethernet socket in the power supply, which acts as a WiFi hot spot which the laptop uses for connectivity. This is cute but it means you only get 54Mb/s because the base station doesn't have 802.11n support.

A cheaper and faster alternative would be to pass a gigabit Ethernet connection through from an RJ45 on the power supply to a combined power network port on the laptop. The connector could be much thinner than RJ45 without sacrificing compatibility.

(10 comments | Leave a comment)

Monday 1st December 2008

Weird network bug

I just spent a couple of hours debugging a strange network problem, which I haven't seen before.

The initial description was that email relayed from one departmental email server to another via my email server sometimes failed with a timeout error. This only occurred with messages with multiple recipients.

My initial suspicion was that it might be related to an old (now fixed) Exim callout verification bug, which had caused us some problems in the past. If a callout's target server was slow, and there were many recipients needing callout verification, and the client used pipelining, the total time for the callout delays could add up to more than the client's timeout, so it wouldn't see Exim's pipelined replies. Exim now flushes its replies before doing a callout to avoid this problem.

However, some testing showed that the relevant servers were quite swift, which cast some doubt on this diagnosis.

The problem report included some good details which were unfortunately too old for me to be able to correlate with our logs. I had a look to see if there was anything more recent that I could examine. It turned out that there were a couple of messages on our queues that had been delayed by this problem. I ran a test delivery of one of them and it did in fact time out between us sending the message envelope and them sending the reply. Consistent with my guess, but not consistent with other testing.

I tried a delivery with full debugging from Exim, but there was no enlightenment. I tried to send a copy of the problem message's envelope manually. This worked. Um, WTF? What was the difference between what Exim did and what I did?

I ran tcpdump to analyze the two connections. It turned out that Exim was sending the whole envelope (MAIL, RCPT, RCPT, ..., DATA) in one packet, whereas my cut-and-paste into telnet split the envelope into a packet for the MAIL command and one for the rest. So I created a file containing the envelope and used a little shell script to drive telnet. Bingo. I had reproduced the problem.

At this point I thought it might be an MTU-related problem, e.g. blocking ICMP "Fragmentation Needed" messages. I tried pinning down the packet size at which the problem started occurring, with a guess of something related to 512 bytes. After a couple of goes I noticed that the problem message had an envelope of 513 bytes, and an envelope of 512 bytes worked OK.

Then I tried a larger envelope, and - boggle - it worked. 512 or less worked, 513 failed, 514 or greater worked. This also explained why the problem affected message envelopes but not message data. The usual symptom of MTU firewall problems is timeouts at the message data stage, not during the envelope, and in this case message data was getting through fine. (The overhead of message headers makes it very unlikely that a message will have as little as 513 bytes of data.)

A couple of friends suggested testing 1025 as well, and it demonstrated similar but weirder problems. Again 1024 worked, 1025 failed, and 1026 worked. However in the failure case, the timeout occurred after I received the destination's replies to the first half of the envelope - enough commands to fit in 512 bytes. A closer look at both the 513 and 1025 cases revealed that I was getting TCP ACKs for all the data I sent, but something was going wrong after that.

I guess the problem is a firewall that's doing some kind of TCP interception and re-segmentation, and getting it wrong. The ACKs would therefore have been generated by the firewall and not by the actual destination. Someone needs to be given a good whipping with a copy of end-to-end arguments in system design before having their firewall programming licence revoked.
(10 comments | Leave a comment)

Tuesday 18th November 2008

Licence revoked

I'm happy to say that the problem I ranted about in my previous entry has been fixed. The general guidelines for bulk email have been restored, and the new guidelines for intraspam large internal mailing lists have been published on their own page.

(5 comments | Leave a comment)

Thursday 6th November 2008

Licence to spam

Over a year ago we had some discussions inside the Computing Service about better provision for large mailing lists within the University. At the moment we rely too much on paper spam, and electronic spam is distributed to staff by departmental administrators who manually forward irrelevant utterances from the Old Schools. My hope was that officially sanctioned mailing lists for internal communications would improve this situation because, as well as being more efficient, there would be better moderation and recipients would be able to unsubscribe. We approved a document which I thought was quite good, and which I expected would live alongside our existing, very general, bulk email guidelines. This then went to the IT Syndicate for approval.

Unfortunately the IT Syndicate was disbanded at about this time to be replaced by an Information Services and Strategy Syndicate which has a broader remit. The large list policy got dropped for several months as the committee rebooted. When the work item was picked up again it somehow got corrupted. It was rewritten with absolutely no input from us, and it replaced the bulk email policy instead of being adopted alongside it. As a result there is now almost no policy against spam intentionally sent by University members to recipients inside and outside the University. I am extremely annoyed.

The new policy is on the ISSS web site and I'll put a copy of the old policy behind a cut to preserve it after Google's cache expires.

Old bulk email policy )
(6 comments | Leave a comment)

Wednesday 5th November 2008

REST FAIL

This afternoon my colleague Andy gave a talk about Microsoft Exchange 2007, which I attended since I need to know what the competition has to offer and I have to provide support for its SMTP interfaces. One thing he mentioned which perked up my interest was "autodiscovery" for Outlook 2007. The unnecessary difficulty of configuring email software has irritated me for a long time, so after the talk I immediately went to find out more. It turns out to be a mixture of good stuff and utter FAIL.

The documentation describes three ways that Outlook 2007 can configure user accounts automatically - server names, security requirements, etc. If you are logged into a Windows Domain then it first tries querying the Active Directory. If that succeeds then it can find everything out by itself. Nice. If not, it falls back to an open protocol, which any standards-based mail server can implement, that configures the server settings automatically given an email address. More about this below.

If server-supported autodiscovery doesn't work, Outlook tries to guess the settings by attempting various combinations of host name, port number, TLS or not, POP or IMAP, etc. and stopping when it finds something that works. I think this is a great idea, so it's a damn shame that it prefers to use the lame POP not the studly IMAP if both are available. By itself that is enough to make it worth our while to implement the autodiscover protocol; but because our primary mail domain is cam.ac.uk but our service name is hermes.cam.ac.uk, Outlook's guesses will fail. This has the advantage of preventing users walking blindly into bad configurations, but the disadvantage that they're less likely to be able to configure it at all.

And so to the autodiscovery protocol itself. How can something so simple be fucked up in so many ways? It looks to me like it was bodged together by a clueless intern over two or three summer breaks, each time getting worse as the requirements evolved. The result is a case study in what RESTful protocol design is not.

The essence is simple: the client sends a request containing the user's email address, and the server sends back an XML document containing the configuration settings. The request is sent using HTTP-over-SSL to a URL derived from the domain part of the user's email address. (In fact it tries https://domain/autodiscover/autodiscover.xml first, then tries https://autodiscover.domain/autodiscover/autodiscover.xml.) So far, so good.

The first mistake is that the request is a POST with the email address contained in another XML document in the request body. It would have been more RESTful to use a GET with the email address encoded in the URL. This mistake turns into FAIL when you realise that most email services have the same settings for all users, in which case autodiscover.xml can be a static file, but many web servers do not allow POST requests to static files. If they had done it right, using a GET request with the email address in the query string would have Just Worked for both CGI scripts and static files.

It gets worse when they bodge around this mistake. A POST to a file commonly results in a "405 Method Not Allowed" error. Microsoft specify that if this is the case for your web server, you should configure it to send the autodiscover.xml file in the error reply, as if the request was successful. A foul perversion of the protocol.

It gets worse when they add support for virtual domains. The requirement seems to be that the service provider wants to host the autodiscover.xml script centrally, and doesn't want to fork out for an SSL certificate for every virtual domain. So if both of the https URLs fail, Outlook tries a GET request to the autodiscover.domain URL, which can redirect to the central autodiscover.xml. However that's not secure enough so there must also be an _autodiscover._tcp.domain SRV record in the DNS pointing to the same central web server. However that's still not secure enough so Outlook also admonishes the user to click OK in a popup dialogue box.

It gets worse when you look at the autodiscover.xml "schema". It isn't a schema, it's a commented skeleton fill-in-the-blanks autodiscover.xml file. That's just everyday lameness; the real FAIL is to be found in the various elements described therein. The redirect I mentioned above isn't an HTTP redirect, it's an autodiscover.xml document containing various redirection settings. There are also settings for controlling the expiry time of the document, because of course configuring these in a .htaccess file is too hard.

Ugh. I think that just about exhausts the rantage this stuff has triggered. I am most of the way to implementing it; I just need to get the necessary SSL certificates. It turns out to be easier than I expected because newish Apache does not get upset by POST requests aimed at static files: it just throws away the request body and returns the file with a "200 OK". No need to fart around with ErrorDocument 405 autodiscover.xml. With any luck it'll make Outlook trivial to get to work with Hermes.

(9 comments | Leave a comment)

Wednesday 8th October 2008

LOLauditors

A colleague tells me that our auditors are quizzing our system administrators about our backup schedules. A wag asked them if they were interested in restores too. The auditors replied, "No, just backups."

PS. I bought a copy of the book about the Chronophage from Corpus Christi plodge for £8. It is a glossy A4 booklet of 44 pages with large type and lots of colourful pictures.

(2 comments | Leave a comment)

Tuesday 30th September 2008

More on the Corpus Christi Chronophage clock

  • I said the clock has three circles of 60 slits. In fact it has two circles of 60 and one of 48 - the hour circle has a slit for every quarter hour.
  • A website about the clock is going live soon.
  • A book about the clock is being published, and it will be available from the Corpus Christi porters' lodge amongst other places.
(Leave a comment)

Monday 22nd September 2008

The Corpus Christi Chronophage Clock

I have been admiring the new clock on the corner of Trumpington St and Bene't St. Instead of having hands, the clock has three concentric rings hidden behind its face. The outer ring marks seconds, the middle ring marks minutes, and the inner ring marks hours. Each ring is driven by the next outer ring via a gearing mechanism that makes the rings tick from one mark to the next. It is fairly normal for the second hand's motion to be quantized in this manner, but more unusual for the minute and hour hands to tick like this too.

Each ring sits in front of a circle of 60 blue LEDs, and the face has three circles of 60 slots which would allow the LEDs to shine through if the rings weren't in the way. The position of each disk is indicated by a slot which allows one LED to shine through the disk and the face to be seen by the viewers.

In fact, each ring has 61 slots, one of which is usually aligned with a slot in the face to indicate the time, and 60 of which are normally out of alignment. When a ring ticks from one position to the next, its 60 normally-unaligned slots line up with the face in turn before the normally-aligned slot lines up with the next slot in the face. This has the effect of making it look like the ring has flown around 366° whereas it only moved by 6°

I have made a little animation which shows how this works. It ticks once every 4 seconds and it only has 12 slots on the face and 13 on the disk, so that you can clearly see the mechanism behind the effect. I have drawn the face's slots in the outer circle and the ring's slots in the inner circle. It uses the new-fangled <canvas> HTML element, so you may need to upgrade your browser to view it.

There is a video of the clock where some of its features are described by the designer, John Taylor, who made his fortune from his invention of the kettle thermostat.

(13 comments | Leave a comment)

Monday 15th September 2008

The date of the count

Turning a linear day number into a date is about twice as difficult as the reverse calculation, both from the computer's point of view and the programmer's. The overview of the function is to (1a) calculate the year (1b) subtract the day count for previous years (2a) calculate the month (2b) subtract the day count for previous months. These pairs of operations are essentially calculating quotients and remainders to produce successive components of the dates.

The trickiest part is dealing with the Gregorian correction. You can work out the Julian year in which a day number occurs using d*4/1461 which is a straight-forward inverse of the corresponding part of the reverse calculation (though you need to fiddle with the epoch to make the leap years fall in the right place). The equivalent for the Gregorian calendar, dividing by the average length of a year using integer arithmetic, is d*400/146097. However this produces leap year intervals of 4 or 5 years, not 4 or 8 years. Fortunately the book describes a neat trick which allows us to use this formula to estimate the day's year, then adjust it to fix any error.

Errors are detected using the forwards conversion from years to day numbers, y*1461/4 - y/100 + y/400. This formula produces the number of days from the start of 1 A.D. up to the end of year y inclusive, according to the conventional numbering. My code internally adjusts the year to start with March instead of January; the formula works for adjustments like this so long as the numbering of February doesn't change. (Note that the formula in my earlier post adjusts the year number so that it refers to the end of the previous February, to which is added a day count corresponding to the fraction of the current year that has passed.)

So what we do is calculate d*400/146097 and add one to estimate the year according to the conventional numbering. The result is always an underestimate (i.e. the formula guesses that years start up to two days late) so we may in fact get the previous year's number. We calculate the count of days up to the end of the estimated year and compare with the original day number. The original day number should normally be before the calculated end of the year. If the estimate was wrong, we need to increment the year number again.

We then need to obtain the remainder of days within this year by subtracting the day count up to the end of the previous year. To get this count we subtract one from the year and do the forward calculation again. If we combine the decrement with the estimate correction, the resulting code looks like this, where dn is day number and dy is day of year.

	y = dn*400/146097 + 1;
	y += (dn >= y*1461/4 - y/100 + y/400) - 1;
	dy = dn - y*1461/4 + y/100 - y/400;

The month calculation is more awkward than I would like because of a mismatch between the forward m*153/5 and reverse dy*5/153 rounding patterns, as I've illustrated below. In each column I've italicised the five-month cycle that corresponds to March - July.

month
number
forward
length
reverse
length
03031
13131
23030
33131
43130
53031
63131
73030
83131

The result is that the code needs a series of different administrative fiddles to work out the month and the day of the month. First add one month of days to line up with the italic part of the reverse pattern, then calculate the month giving March = 1 through to February = 12. Then add three months to line up with the italic part of the forward pattern, March = 4 to February = 15, and calculate the number of days before the month. Subtract this from the day of the year, then add four months of days to compensate for the fiddles, to finally get the day of the month.

	m = (dy + 31)*5/153 + 3;
	d = dy - m*153/5 + 123;

The last step is to restore the year to its usual January - December alignment. This, at least, is a simple inverse of the corresponding code in the forwards direction. There's also a preparatory adjustment of 10 months at the start of the code to set up the March - February alignment in the first place. After a few tweaks, the final code is:

void F(int d, int *py, int *pm, int *pd) {
	int y, m;
	d += 305;
	y = d*400/146097 + 1;
	y -= y*1461/4 - y/100 + y/400 > d;
	d -= y*1461/4 - y/100 + y/400 - 31;
	m = d*5/153 + 3;
	d -= m*153/5 - 92;
	if (m < 14) m -= 1;
	else m -= 13, y += 1;
	*py = y, *pm = m, *pd = d;
}
  • 305 is 10 months (March - December) minus the epoch (1st January 1 A.D. is day 1)
  • 400 is the number of years in a Gregorian cycle and 146097 is the number of days
  • 1461 is the number of days in 4 years, usually
  • 153 is the number of days in 5 consecutive months, except February
  • 31 is one month of days to align the reverse month calculation, from 31*5/153 == 1
  • 3 aligns the forward month calculation
  • 92 is three months of days to compensate for the three months added in the previous line, from 3*153/5 plus an extra one because we count days within a month from 1 not 0
  • actually 92 = 4*153/5 - 31 + 1 where the extra 31 is the same 31 we added previously

Edited to add...

It's possible to make the forward and reverse patterns line up by adjusting the 153/5 factors, which saves a couple of additions. The smallest denominator for which this works is 17, which produces the following code. Before the final fix-up, months are numbered Mar=1 - Feb=12.

void G(int d, int *py, int *pm, int *pd) {
	int y, m;
	d += 305;
	y = d*400/146097 + 1;
	y -= y*1461/4 - y/100 + y/400 > d;
	d -= y*1461/4 - y/100 + y/400 - 31;
	m = d*17/520;
	d -= m*520/17;
	if (m < 11) m += 2;
	else m -= 10, y += 1;
	*py = y, *pm = m, *pd = d;
}
(Leave a comment)

Symbolic links

What happens if you try to create a symlink to a zero-length target, as in ln -s "" foo?

Linux: ENOENT.

Mac OS X 10.4: EINVAL.

FreeBSD: works, and any attempt to resolve the symlink returns ENOENT.

Solaris: works, and the resulting symlink behaves the same as ln -s . foo

(10 comments | Leave a comment)

Wednesday 10th September 2008

Counting the days

Nearly 10 years ago I developed a neat function for converting a Gregorian date into a count of days since some epoch. There was nothing particularly new about it; I just tweaked the numbers until the formula was as simple as possible. I've re-used this code a fair number of times since then without re-examining it. This evening I worked out that it can be distilled even more.

The function adds together some expressions that count the number of days indicated by the three components of the date, plus a constant to set the epoch.

The number of normal days before the year y (full four digit year) is just y*365.

The number of leap days up to and including the year y is y/4-y/100+y/400. This is the usual "add a leap day in every fourth year, but not in every hundredth year, but do in every 400th year" rule. I'm using integer division that discards any remainder, i.e. rounds down.

The previous two paragraphs are inconsistent about including or excluding the current year. This is because leap days occur inside a leap year, not at the end, so their contribution to the day count does not depend solely on the year. We can fix this by re-assigning January and February to the previous year using a small administrative fiddle, as follows (where m counts months starting from 1). Then a leap day falls just before the start of years that are multiples of four (etc.) and the expression in the previous paragraph magically does the right thing. The fiddle also requires an adjustment to the epoch constant.

        if (m > 2) m -= 2; else m += 10, y -= 1;

Months are the part that I spent most time fiddling with. Section 1.12 of Calendrical Calculations develops various equations for calendars (such as the Julian calendar) that spread their leap years evenly, like Bresenham's line drawing algorithm spreads out raster steps. The equations also work for month lengths in the Julian and Gregorian calendars, if you treat 30-day months as normal years and 31 day months as leap years, and ignore February. We want to calculate the number of days in a year before the start of a given month, which is calculated by equation 1.73:

        (L*y - L + D*L % C) / C + (y - 1) * N

where L is the number of leap years in a cycle, C is the total number of years in a cycle, D is the offset of the start of the cycle, N is the number of days in a normal year, and years are actually months. Yuck. It turns out that if we perform the right administrative fiddling, we can arrange that D=0, and we can substitute m=y-1. This gives us the greatly simplified:

        m*L/C + m*N = m * (N*C+L) / C

For Julian and Gregorian months, N=30, C=12, and L=7, giving m*367/12. This pretends that February has 30 days, but the pretence doesn't matter because we have moved February to the end of the year so we never count days after February 28th or 29th . The complete function is then as follows (with the epoch set so that 1st January, 1 A.D. is day 1).

    int A(int y, int m, int d) {
        if (m > 2) m -= 2; else m += 10, y -= 1;
        return y*365 + y/4 - y/100 + y/400 + m*367/12 + d - 336;
    }

In fact you don't have to treat a whole year as a cycle for the purpose of the month calculation. The most basic pattern is a five month cycle containing three long months, as in March-July and August-December. (January and February are the start of a third truncated cycle.) However when I tried to make m*153/5 work in 1999 I could not eliminate the D offset to keep the equations simple. My problem was that I only allowed myself to count months starting from zero or one, and I was flailing around without the sage guidance of messrs. Dershowitz and Reingold. (I didn't use their equation 1.73 when I was developing this code, but it's a good example of what I was trying to avoid.)

It turns out that we can play around with the administrative fiddle to adjust D so that the cycle falls in the right place, so long as we compensate by adjusting the epoch constant. For the Julian and Gregorian five-month cycle D=4, i.e. we need to shift March (the start of the cycle) to month number 4, which gives us the following code.

    int B(int y, int m, int d) {
        if (m > 2) m += 1; else m += 13, y -= 1;
        return y*365 + y/4 - y/100 + y/400 + m*153/5 + d - 428;
    }

I mentioned above that equation 1.73 works for Julian leap years, so we can use it for years as well as months. This combines first two terms of the return expression leaving the Gregorian correction separate. No additional fiddling is necessary. We can't merge in the Gregorian correction as well, because evenly spreading the leap years gives intervals of 4 or 5, not 4 or 8.

    int C(int y, int m, int d) {
        if (m > 2) m += 1; else m += 13, y -= 1;
        return y*1461/4 - y/100 + y/400 + m*153/5 + d - 428;
    }

There's a related function for calculating the number of days in a month. It uses the remainders from the divisions to find the length of the current month, instead of using the results to count the preceding days. My old code re-used the same administrative fiddle:

    int D(int y, int m) {
        if (m > 2) m -= 2; else m += 10;
        return m*367%12 > 4 ? 31
             : m != 12     ? 30
             : y % 4      ? 28
             : y % 100   ? 29
             : y % 400  ? 28
                       : 29;
    }

In fact you don't need a fiddle in this case, because there's no need to move February to the end of the year. You can just pick values of C and L which make the pattern fall in the right place.

    int E(int y, int m) {
        return m*275%9 > 3 ? 31
             : m != 2     ? 30
             : y % 4     ? 28
             : y % 100  ? 29
             : y % 400 ? 28
                      : 29;
    }
(9 comments | Leave a comment)

Thursday 4th September 2008

Faster LIAR (Life in a register)

This week I have been discussing the Game of Life with the amazingly enthusiastic Brice Due, after he commented on my original post about LIAR. He put me in touch with Tom Rokicki who, as well as writing probably the best implementation of Life, recently featured in New Scientist for his work on God's solution to Rubik's Cube.

Tom pointed me at a better bit-twiddling implementation of Life which comes in at 26 operations compared to my code's 34 - a significant improvement. However his code is ultra-compressed and really hard to understand. So I wrote a comprehensible version. This code has some rather clever features.

	left = bmp << 1;
	right = bmp >> 1;
	side1 = left ^ right;
	side2 = left & right;
	mid1 = side1 ^ bmp;
	mid2 = side1 & bmp;
	mid2 = side2 | mid2;

In the first stage he uses a normal full adder to do the horizontal 3-to-2 compression which counts the number of cells in each row (bits mid1 and mid2). The first two ops of the full adder form a half adder which he uses to count the cells to either side of the middle cell (bits side1 and side2). Thus he gets four useful outputs from just five ops!

	upper1 = mid1 << 8;
	lower1 = mid1 >> 8;
	upper2 = mid2 << 8;
	lower2 = mid2 >> 8;

The side bits are kept in the middle row, whereas the mid bits are shifted to become the sums for the upper and lower rows.

#define REDUCE(g,f,a,b,c) do {	\
		uint64_t d, e;	\
		d = a ^ b;	\
		e = b ^ c;	\
		f = c ^ d;	\
		g = d | e;	\
	} while(0)

	REDUCE(sum12, sum13, upper1, side1, lower1);
	REDUCE(sum24, sum26, upper2, side2, lower2);

The second stage is to compress the three rows into separate sums of the low bits and the high bits. Tom does not use a normal adder to do this. It turns out that there are 24 adder-like boolean functions of three arguments, which produce two values that together encode the number of input bits. Only one of them requires four operations - the others require five or more. The encoding of the result is a bit strange:

countresult
0f=0 g=0
1f=1 g=1
2f=0 g=1
3f=1 g=0

As well as saving two ops compared to using a full adder, this function makes the final twiddling particularly brief.

	tmp = sum12 ^ sum13;
	bmp = (bmp | sum13) & (tmp ^ sum24) & (tmp ^ sum26);
	return(bmp & 0x007E7E7E7E7E7E00);

I wondered if it's possible to use the 4-op adder-alike in the first stage too. The problem with adding together all nine cells is overflowing the three bit result. This is OK when using a full adder because the overflow happens to confuse a very large count (therefore a dead result) with a very small count (again a dead result) so it makes no difference. The modified adder confuses a middling small count (live result) with a middling big count (dead result) so it fails. The dual-purpose half+full-adder trick neatly avoids overflow problems.

Despite successfully unpicking this code I still find it rather unintuitive, especially the final twiddle. I am in awe of people who can invent such cunning tricks.

(Leave a comment)

Thursday 28th August 2008

More postcodes

Thanks for all the interesting comments on my previous post.

The reason I'm investigating this is to work around false positives caused by SpamAssassin's obfuscation rules. These are intended to match deliberate misspellings of commonly spammed goods such as Viagra. The specific instance that caused the bug report was a Reading postcode being treated as an obfuscated Rolex.

Therefore I'm not particularly worried about missing out obscure special cases like GIR 0AA and the overseas territories AAAA 1ZZ. However it might be worth tightening up the outcode regex, based on the list of UK postcode areas, to reduce the chance of matching a bogus postcode.

Also, the Post Office's postcode FAQ mentions that only London uses the ANA and AANA outcode formats. (In fact it's only the E, EC, SW, W, WC areas.) I managed to find a list of postcode districts which includes these outcodes (Wikipedia omits them) and it shows that the third position rule is wrong: it says M does not appear there but there is a poscode district London W1M. Rule Three also allows A and E which are not in fact used.

qr{\b
  ([BGLMNS][1-9][0-9]?
  |[A-PR-UWYZ][A-HK-Y][1-9]?[0-9]
  |([EW]C?|NW?|S[EW])[1-9][0-9A-HJKMNPR-Y]
  )[ ]{0,2}
  ([0-9][ABD-HJLNP-UW-Z]{2})
\b}x
(3 comments | Leave a comment)

Wednesday 27th August 2008

UK postcode regular expression

A little contribution for anyone else who searches the web for this in the future.

The UK postcode consists of two parts. The first part is the Outward Postcode, or Outcode. This is separated by a single space from the second part which is the Inward Postcode, or Incode. The Outcode directs mail to the correct local area for delivery. The Incode is used to sort the mail at the local area delivery office.

The Outcode has 6 possible formats (as follows) and the Incode is consistently numeric, alpha, alpha format.

  • AN NAA
  • ANN NAA
  • ANA NAA
  • AAN NAA
  • AANN NAA
  • AANA NAA

There are some restrictions on the letters:

  1. The letters [QVX] are not used in the first position.
  2. The letters [IJZ] are not used in the second position.
  3. The only letters to appear in the third position are [ABCDEFGHJKSTUW].
  4. The only letters to appear in the fourth position are [ABEHMNPRVWXY].
  5. The letters [CIKMOV] are not used in the second part.

This translates into a perl extended regex as follows (with slightly relaxed whitespace):

qr{\b
    ([A-PR-UWYZ]\d[\dA-HJKSTUW]? # rules 1,3
    |[A-PR-UWYZ][A-HK-Y]\d[\dABEHMNPRVWXY]? # rules 1,2,4
    )[\t ]{1,2}
    (\d[ABD-HJLNP-UW-Z]{2}) # rule 5
\b}x

Update: more here.

(15 comments | Leave a comment)

Birthday present

I arrived in the office today to find that some fans had given me a birthday present, which rather amused me :-)
(8 comments | Leave a comment)

Wednesday 20th August 2008

Job ad

Pat Stewart is retiring in December. She has been our user accounts manager for many years, as well as being infamous as the Keeper of the Ton of Bricks and Explainer of Rules to wayward undergraduates. We've just advertised her job and we're hoping to find a replacement with enough time for a couple of months overlap. The job description below is actually a fair description of what Pat does - for once it isn't just the usual platitudes. She is a hard act to follow.

User Policy and Account Manager

The University Computing Service has a vacancy for a User Policy and Account Manager. The successful applicant will create, develop and maintain the policies relating to the management of accounts for the use of University Computing Service computing resources, ensuring an orderly and well managed environment for University members to carry out their studies and research.

The successful applicant will have a good degree or equivalent and a sound broad-based background in IT with considerable wide-ranging experience including support for a variety of operating systems. Experience of working in Higher Education and a working knowledge of legislation relating to IT would be expected. The applicant will have experience of managing staff, and must be able to communicate effectively with users at all levels in a variety of situations.

The applicant must have good interpersonal skills and able to act with discretion, empathy and tact. The applicant will need to be self-confident, motivated, well organised, meticulous, able to prioritise tasks when working under pressure, and have the ability to anticipate problems before they arise, and develop solutions to cope with them.

This is a permanent appointment at Grade 9, subject to a nine month satisfactory probation period, on a salary scale of £34,793 to £44,074 with initial salary depending on experience. The closing date for applications is Friday 5 September 2008. It is intended that interviews will be held during the week beginning 15 September 2008. Further details are available on the Computing Service web site.

(14 comments | Leave a comment)

Thursday 7th August 2008

Kaminsky

I've now read Dan Kaminsky's slides, which are mostly ranting "the sky is falling" and pointing out what assumes secure DNS. The actual attack is not described in any more detail than previous public sources. I still don't understand why resolvers accept the poison: Kaminsky seems to be suggesting that data in the additional section of a reply is overwriting cached answers, which RFC 2181 says must not happen. Anyway,

$ md5 <kaminsky; echo; cat kaminsky; echo
ef96f2d9e973a36e825793ddeff48ae5

Kaminsky says his attack allows him to overwrite data in a DNS cache.
The reason Matasano's glue / additional RR explanation is wrong can be
found in RFC 2181 section 5.4.1, which says that additional RRs must
not override authoritative data in the DNS cache. However, if you look
at RFC 1034 section 3.6.2 you'll see that the data for a CNAME target
goes in the answer section, not the additional section. Presumably
this means the resolver treats it as trustworthy and therefore
overwrites its "old" cached data with the CNAME target data. (However
RFC 2181 also says that a resolver shouldn't trust CNAME target data,
which implies this attack shouldn't work, though I can see why it does
if the CNAME target is in the same zone as the CNAME owner.)

So the attack is: (1) cause the victim to look up lots of random
invalid domain names adjacent to the goal name whose data you want to
overwrite; (2) spoof authoritative answers that contain two records: a
CNAME whose owner is your random invalid name and whose target is your
goal name, plus the data for the goal name that you want to insert
into the victim's cache; (3) when you win the spoofing race your data
for the goal name overwrites whatever the victim previously had
cached.

$ md5 <kaminsky2; echo; cat kaminsky2
d4b70e6abfa3e7d49e159d75b5fc277b

So there's another form of attack which is closer to the Matasano
description but still different in significant ways. This relies on
the fact that (again according to RFC 2181 section 5.4.1) data in the
authority section of a reply is given a lot of trust. The attack is:
(1) cause the victim to look up lots of random invalid domain names
adjacent to the goal name whose data you want to overwrite; (2) spoof
authoritative answers that have an authority section containing
replacement NS records for the target's zone which point to
nameservers you control; (3) when you win the spoofing race the victim
will start using your name servers for the target zone instead of the
proper nameservers; (4) your nameserver can reply in whatever way you
want for the target name, but you'll have to wait for it to expire
from the victim's cache. Therefore this attack is slower than the
CNAME attack.
(2 comments | Leave a comment)

Thursday 31st July 2008

To do

Some things to keep me busy for the next few months.

  • PPswitch:
    • Modify activity graphs for the new scanner setup. Add SpamAssassin load tracking.
    • Do a public weekly spam filtering statistics news item. Break down stats per department, especially so that the Old Fools know how much junk their anti-spam appliance isn't seeing.
    • Push per-user filtering options out to SMTP time. Add per-domain filtering options for other University mail servers.
  • Exim:
    • Finish the TLS logging improvements and unit tests for the new ratelimit code, then roll a 4.70 release.
    • Implement greylisting, probably for 4.71.
    • Revamp the hints database layer so it can support concurrent updates (e.g. Berkeley DB row-level locking) and clustered databases (e.g. memcached), probably for 4.72. This will improve the retry, ratelimit, and greylisting databases.
  • FreeBSD:
    • Update factor(6) with bug fixes from NetBSD. (bug found when factorizing my phone number!)
    • Fix bugs and feature omissions in unifdef(1) reported by Ben Hutchings and Anders Kaseorg.
(2 comments | Leave a comment)

Wednesday 30th July 2008

selog-8.7.30 - selective logging library

I have released a new version of selog. The only significant change is C++ support, requested by [info]filecoreinuse.

One thing I would like to be able to do in C++ is call a selector, so that sel(fmt, ...) works like selog(sel, fmt, ...). (The Lua API supports something like this.) However as far as I can tell this would require variadic templates to implement efficiently (i.e. to inline the selog_on() check) and this is probably reaching too far into voodoo territory to be sensible.

(Leave a comment)

Thursday 24th July 2008

ClamAV aargh

We're being hammered by loads of vicious email trojans, which mutate fast. I've resorted to adding manual blocks in Exim because ClamAV isn't keeping up.

Just now I was very puzzled that freshclam wasn't downloading the latest version of the virus database. It turns out that although I have told it to poll 100 times a day (about every 15 minutes), freshclam uses the DNS to check what is the latest version, and the TTL on the relevant DNS record is 30 minutes.
(3 comments | Leave a comment)

More Kaminsky

So there's another form of attack which is closer to the Matasano description but still different in significant ways.

$ md5 <~/doc/kaminsky2
d4b70e6abfa3e7d49e159d75b5fc277b
(2 comments | Leave a comment)

Wednesday 23rd July 2008

Kaminsky's DNS hack

[info]beezari posted a copy of the leaked Matasano explanation of Kaminsky's new DNS attack. I believe the explanation isn't quite right. In his interview in the WIRED Threat Level blog Kaminsky mentions that the attack relies on CNAMEs. This means that it does not depend on glue nor on additional section processing, which is what Matasano described. I believe the real explanation is...

$ md5 <~/doc/kaminsky
ef96f2d9e973a36e825793ddeff48ae5
(15 comments | Leave a comment)

A ratelimit idea

[info]senji suggested ratelimiting email based on the MD5 checksum of any attachments, with the goal of slowing down an email virus attack. I think this might be feasible so I'm noting it here as a sort of public to-do list entry...
(3 comments | Leave a comment)

Wednesday 18th June 2008

Murray Edwards College

Today brings us the cheerful news that New Hall has been refounded with an endowment of £30 million and finally been given the proper name that it has been waiting for since 1954.

This led to a discussion on IRC about people who have colleges at both Cambridge and Oxford named after them. The list is:

God
  • Trinity College, Cambridge
  • Trinity Hall, Cambridge
  • God's House Christ's College, Cambridge
  • Trinity College, Oxford
Jesus
  • Christ's College, Cambridge
  • Corpus Christi College, Cambridge
  • Emmanuel College, Cambridge
  • Jesus College, Cambridge
  • Christ Church, Oxford
  • Corpus Christi College, Oxford
  • Jesus College, Oxford
the Virgin Mary modestly hides in the full names of several colleges
  • The College of the Blessed Virgin Mary, Saint John the Evangelist and the glorious Virgin Saint Radegund, near Cambridge (aka Jesus College)
  • The King's College of Our Lady and Saint Nicholas in Cambridge
  • The College of Corpus Christi and the Blessed Virgin Mary in Cambridge
  • Hall of the Annunciation of the Blessed Virgin Mary Gonville and Caius College, Cambridge
  • New College of St Mary, Oxford
  • The House of Blessed Mary the Virgin in Oxford (aka Oriel College)
Mary Magdalene
  • Magdalene College, Cambridge
  • Magdalen College, Oxford
St Peter
  • Peterhouse, Cambridge
  • St Peter's College, Oxford
St Catherine
  • St Catharine's College, Cambridge
  • St Catherine's College, Oxford
Edmund Rich of Abingdon
  • St Edmund's College, Cambridge
  • St Edmund Hall, Oxford
Sir Isaac Wolfson
  • Wolfson College, Cambridge
  • Wolfson College, Oxford

Cambridge's Pembroke College is named after Marie de St Pol, Countess of Pembroke (widow of the 1st Earl of Pembroke); Oxford's Pembroke College is named after William Herbert, 3rd Earl of Pembroke.

Cambridge has two colleges named after St John the Evangelist, and Oxford has one named after St John the Baptist.

Cambridge has two colleges endowed by Lady Margaret Beaufort (Christ's and John's) whereas Oxford has a college named after her.

Cambridge's University College lasted 7 years before it was renamed Wolfson College. Oxford's University College is over 700 years old.

Both universities have benefited from money obtained from manufacturing cars: Cambridge via the Cripps foundation, and Oxford via the Nuffield foundation.

(18 comments | Leave a comment)

Wednesday 11th June 2008

Venti, Foundation, Bloom filters, and Erlang

I've just read a paper about "Foundation", which is an updated version of Plan 9's Venti archiving system. Foundation is designed to back up a computer nightly to a USB hard disk and an FTP server. Like Venti, it uses content-addressed storage to eliminate duplicate blocks; the main difference is that Foundation has been optimized to reduce seeks on a single disk, where Venti assumed a fast RAID could hide latency. Foundation uses a Bloom filter to speed up index lookups when eliminating duplicate blocks, and it writes data blocks in an append-only log, so it hits two things that I have been thinking about recently.

Towards the end of the paper are a few wonderful paragraphs that address the problem of reclaiming backup space wasted by bulky temporary data - the paper uses the example of on-disk copies of DVDs to save physical space when travelling. This motivates the desire to support deleting a nightly snapshot. They say:

Conceptually, deleting a snapshot resembles garbage collection in programming languages or log cleaning in LFS. First, the CAS layer writes a new version of the system catalog that no longer points to the snapshot. Then, the system reclaims the space used by blocks that are no longer reachable from any other catalog entry. A more recent snapshot, for example, may still point to some block in the deleted snapshot.

Interestingly, the structure of Foundation's log makes identifying unreferenced blocks particularly efficient: as a natural result of the log being append-only, all pointers within the log point "backwards". Garbage collection can thus proceed in a single, sequential pass through the log using an algorithm developed by Armstrong and Virding for garbage collecting immutable data structures in the Erlang programming language.

The algorithm works as follows. Starting at the most recent log entry and scanning backwards, it maintains a list of "live" blocks initialized from the pointers in the system catalog. Each time it encounters a live block, it deletes that block from its list. If the block is a metadata block that contains pointers to other blocks, it adds these pointers to its list. If the algorithm encounters a block that is not in its list, then there are no live pointers to that block later in the log, and since all pointers point backwards, the algorithm can reclaim the block's space immediately. The system can also use this algorithm incrementally: starting from the end of the log, it can scan backward until "enough" space has been reclaimed, and then stop.

The expensive part of the Erlang algorithm is maintaining the list of live blocks. If references to many blocks occur much later in the log than the blocks themselves, this list could grow too large to fit in memory. We note, however, that a conservative version of the collector could use a Bloom filter to store the list of live blocks. Although false positives in the filter would prevent the algorithm from reclaiming some legitimate garbage, its memory usage would be fixed at the size of the Bloom filter.

This reminds me of a recent post on Phil Wadler's blog where he highlights some comments on the proposal to add functional programming to the ACM curriculum. The paragraphs I quoted above are an excellent example of FP knowledge being useful in all parts of computer science.

(4 comments | Leave a comment)

Wednesday 4th June 2008

Bike

I need a bike shop that will sell me something like the following. Preferably somewhere with helpful and knowledgable staff who can help me choose which model and which size. (I am deeply unimpressed by Station Cycles whose staff are so useless they couldn't sell me a bike when I was ready to drop 600 quid on one there and then.)
  • child seat
  • step-through frame
  • panniers
  • hub gears
  • hub brakes
  • hub dynamo
  • chain case
ETA: Yes I know this descrbes a Dutch bike. Yes I can find out which shops stock them. What I want to know is which shops have staff that are helpful and knowledgable about this kind of bike - i.e. not fen mountain bikes and other BSOs.
(15 comments | Leave a comment)

Tuesday 3rd June 2008

Tech talk

I gave a tech talk at Google's London offfices yesterday afternoon, based on my old blog posts about "how not to design an MTA" though with a more positive spin. You can look at the slides and notes now, and I believe there will be video available RSN.
(3 comments | Leave a comment)

Lotus Notes

Oh god. I nearly puked when I got a reply from a Notes user just now. Between the top-posted reply and the text from my original message were my message's headers, formatted like this:

aaah! my eyes! )
(10 comments | Leave a comment)

Friday 16th May 2008

Need a better name than post-postmodern

A few days ago, Ben Goldacre linked to an article by and an interview with a guy called Jonathan Gottschall who likes to apply the scientific method to literary theory. I was gobsmacked! I thought that no-one who takes literary theory seriously has any intellectual engagement with the real world.

I had a brief discussion about Librarything with [info]addedentry at his and [info]j4's birthday party on Saturday, in which he talked about its new approaches to cataloguing. Its use of tagging, as opposed to ontological classification, is its most obvious feature, but its concept of a "work" as an umbrella for the multiple editions of a book is also useful.

In the pub this evening we decided that both of these things are post-postmodern. Traditional cataloguing is very modernist in its approach: top-down, paternalistic, relying on the academic expert to benevolently provide for everyone's needs. Literary criticism is the ultimate expression of postmodernism: observing that experts are not always right and that new theories come from outside of the consensus, they deny the existence of objective truth and assert that all opinion is equally valid. Scientists and engineers instinctively reject postmodernism, but often fail to do so without relying on discredited modernist thinking.

How can we get beyond postmodernism? I think it has to be the acknowledgment that a fuzzy consensus is a valid approximation to the truth, and that we have experimental and statistical tools which can refine that approximation. But the crucial thing is to realize that these tools don't just work for physics or chemistry or biology or medicine, but they also work for cataloguing books, or establishing that the author does have a degree of control over the reader's thoughts, or showing that beauty does have a degree of commonality across cultures, or that we can automatically translate between natural languages.

The preceding ill-informed rant was brought to you by Summer Lightning.
(10 comments | Leave a comment)

Friday 9th May 2008

Password scanning

AJ's idea of scanning email for passwords has provoked a lot of strange suggestions, online and in person. Elaborate password cracking clusters, phishing my own users, hardware acceleration, etc... (When AJ suggested the Idea, my first thought was to use a Bloom filter to find passwords, since you can't recover the contents of a Bloom filter without brute force - but it's very hard to delete items from a Bloom filter, which in this scenario must be done whenever any user changes their password. No, that too is not going to work.)

The whole idea is very borderline: is it worth spending significant effort when the phishing success rate is much less than 1% per incident, and the current fashion for phishing universities is probably short-lived? (This week we got about 2000 messages from the phishers and 5 users replied.) On the other hand it would be very interesting to find out what the detection rate would be. Would there be any false positives? i.e. unintentional password appearances? What is the legitimate positive rate? e.g. senior staff sending their passwords to their PAs? (The latter is against our AUP but it is common practice.) How much password sharing is there outside the anticipated scenarios?

It seems that it's worth making the point that it isn't hard for me to get my users' plaintext passwords: I could just instrument our various SASL implementations. But we (me and my colleagues) don't do that because sysadmins are safer not knowing their users' secrets. This is why we don't log message subjects, and why our accidental-deletion recovery tools don't require seeing any message contents. We don't look at the contents of a user's account in any detail without permission from that user - and even then we'd very much prefer not to know that we're recovering email that was deleted by their jilted lover who obtained their password from a shared browser history.

From my point of view, the interesting thing is that it is feasible to detect when a user is sending their own password in a message, using just a standard Unix encrypted password file and some simple code: crypt every word the user sends and compare with that user's crypted password. This is just a few hundred lines of C, including hooks into our authentication database and email content scanner, and choosing the right words. My prototype code can check 2000 MD5 crypted words per second per CPU, and should be able to skip most words in a message since they are outside our password rules.

There has been a lot of traffic on various mailing lists about these phishing attacks, especially notifications of new reply addresses. But we don't want to be in the business of maintaining blacklists of addresses our users mustn't send to. Password scanning seems like a simple way of avoiding that tar pit, which is I think the main attraction. So why do I think it's absurd?
(27 comments | Leave a comment)

Wednesday 7th May 2008

Dealing with phishers

There's been a raft of phishing attacks against Universities over the last few months. We received a couple of thousand of these last night:

Subject:  CONFIRM YOUR EMAIL ADDRESS
Date:     Tue, 6 May 2008 16:08:53 -0400
From:     CAM SUPPORT TEAM 
Reply-To: 

Dear Cam Subscriber,

To complete your (CAM) account, you must reply to this email
immediately and enter your password here (*********)

Failure to do this will immediately render your email address
deactivated from our database.

You can also confirm your email address by logging into your
CAM account at www.webmail.cam.ac.uk/

Thank you for using CAM.AC.UK!
FROM THE CAM SUPPORT TEAM

We did the usual announcement dance, including a notice on the webmail login page, but this did not prevent some users (including webmail users!) from replying to the phish.

[info]captain_aj suggests scanning email to reject it if it contains the user's password. I wonder how long it would take to crypt() every word of every message... :-)

(19 comments | Leave a comment)

Tuesday 6th May 2008

floaty fiddling

So I needed some Lua functions to extract bit fields from an integral value, specifically struct stat.st_mode. Lua only has floating point numbers, and its standard library doesn't extend beyond ANSI C. So I wrote the following. (Note that a % b == a - floor(a/b)*b.)

   function mask(lobit, hibit, num)
      local toolo = num % 2^lobit
      return (num - toolo) % 2^hibit
   end

   function bit(bit, num)
      return num / 2^bit % 2 >= 1
   end
(8 comments | Leave a comment)

Tuesday 29th April 2008

TCP narg/rant

During a discussion on a work mailing list about TCP performance across the Atlantic, one of my colleagues made some comments that were ridiculous even by his standards. (He's notorious for declaring ex cathedra that things are extremely difficult and no-one understands them any more except a few giants (like himself) from the age when dinosaurs ruled the machine room.) I felt compelled to correct his most egregiously wrong statements, and I'm posting my comments here since they seem to be of wider interest than just University IT staff. In the quoted sections below, when he talks about "products" he means software that games or breaks TCP's fairness and congestion control algorithms.

I went to a very interesting talk by the only TCP/IP experts I have ever met who explained that the only reason the Internet works is that such products were NOT in widespread use.

The importance of congestion control has been common knowledge since the Internet's congestion collapse in the 1980s and Van Jacobson's subsequent TCP algorithm fixes. (It's taught to all CS undergrads, e.g. look at the Digital Communications 1 syllabus in CST part 1B.) However software that games TCP's fairness model *is* in widespread use: practically all P2P software uses multiple TCP connections to get better throughput.

In the mid 1990s the Internet was going to melt down because of all the web browsers making lots of connections that were too short for TCP's congestion control algorithms to be effective. So we got HTTP/1.1 with persistent connections and a limit on the concurrency that browsers use. However these TCP and HTTP measures only work because practically all software on end-hosts co-operates in congestion control: there's nothing to enforce this co-operation.

The rise of P2P means that ISPs are increasingly interested in enforcing limits on bandwidth hogs, hence the arguments about "network neutrality" as the ISPs point fingers at YouTube and the BBC and the file sharers. They are also deploying traffic shapers in the network which manipulate TCP streams with varying degrees of cleverness. (The most sophisticated manipulate the stream of ACKs going back to the sender which causes it to send at the network's desired rate; the stupid ones like Comcast's just kill connections they don't like.)

However, a better approach is to use a more realistic idea of fairness as the basis for congestion control instead of equal bandwidth per TCP connection. e.g. for ISPs, balance per subscriber. Bob Briscoe's recent position paper makes this argument in readable fashion. RFC 2140 groped in the same direction 10 years ago, but was never adopted. Either way, enforcement at the network's edge (instead of trusting the hosts) is likely to be the future whatever model of fairness is used.

Also interesting is the research on adapting TCP to high-bandwidth networks. There are many examples of what Briscoe argues against: great efforts of design and analysis to preserve TCP-style fairness in the new protocols. The GÉANT2 high-speed TCP page has lots of relevant links.

Luckily, there are very few people who know enough about TCP/IP and related protocols to write an effective product because, as I implied, there are probably no more than half-a-dozen TCP/IP experts in the world still working (and may even be none).

Tosh. There's a substantial market out there, both for managing congestion and for exploiting weaknesses in congestion control.

(10 comments | Leave a comment)
Previous 50
Powered by LiveJournal.com