

Tony Finch

Friday 21st August 2015

Fare thee well

At some point, 10 or 15 years ago, I got into the habit of saying goodbye to people by saying "stay well!"

I like it because it is cheerful, it closes a conversation without introducing a new topic, and it feels meaningful without being stilted (like "farewell") or rote (like "goodbye").


"Stay well" works nicely in a group of healthy people, but it is problematic with people who are ill.

Years ago, before "stay well!" was completely a habit, a colleague got prostate cancer. The treatment was long and brutal. I had to be careful when saying goodbye, but I didn't break the habit.

It is perhaps even worse with people who are chronically ill, because "stay well" (especially when I say it) has a casually privileged assumption that I am saying it to people who are already cheerfully well.

In the last week this phrase has got a new force for me. I really do mean "stay well" more than ever, but I wish I could express it without implying that you are already well or that it is your duty to be well.


Monday 17th August 2015


We have not been able to visit Rachel this weekend owing to an outbreak of vomiting virus. Nico had it on Friday evening, I had it late last night and Charles started a few hours after me.

Seems to have a 50h ish incubation time, vomiting for not very long, fever. Nico seems to have constipation now, possibly due to not drinking enough?

It's quite infectious and unpleasant so we are staying in. We have had lovely offers of help from lots of people but in this state I don't feel like we can organise much for a couple of days.

Since we can't visit Rachel we tried Skype briefly yesterday evening, though it was pretty crappy as usual for Skype.

Rachel was putting on a brave face on Friday and asked me to post these pictures:


Saturday 15th August 2015

Rachel's leukaemia

Rachel has been in hospital since Monday, and on Thursday they told her she has leukaemia, fortunately a form that is usually curable.

She started chemotherapy on Thursday evening, and it is very rough, so she is not able to have visitors right now. We'll let you know when that changes, but we expect that the side-effects will get worse for a couple of weeks.

To keep her spirits up, she would greatly appreciate small photos of cute children/animals or beautiful landscapes. Send them by email to rmcf@cb4.eu or on Twitter to @rmc28, or you can send small postcards to Ward D6 at Addenbrookes.

Flowers are not allowed on the ward, and no food gifts please because nausea is a big problem. If you want to send a gift, something small and pretty and/or interestingly tactile is suitable.

Rachel is benefiting from services that rely on donations, so you might also be able to help by giving blood - for instance you can donate at Addenbrookes. Or you might donate to Macmillan Cancer Support.

And, if you have any niggling health problems, or something weird is happening or getting worse, do get it checked out. Rachel's leukaemia came on over a period of about three weeks and would have been fatal in a couple of months if untreated.


Tuesday 11th August 2015

What I am working on

I feel like https://www.youtube.com/watch?v=Zhoos1oY404

Most of the items below involve chunks of software development that are or will be open source. Too much of this, perhaps, is me saying "that's not right" and going in and fixing things or re-doing it properly...

Exchange 2013 recipient verification and LDAP, autoreplies to bounce messages, Exim upgrade, and logging

We have a couple of colleges deploying Exchange 2013, which means I need to implement a new way to do recipient verification for their domains. Until now, we have relied on internal mail servers rejecting invalid recipients when they get a RCPT TO command from our border mail relay. Our border relay rejects invalid recipients using Exim's verify=recipient/callout feature, so when we get a RCPT TO we try it on the target internal mail server and immediately return any error to the sender.

However Exchange 2013 does not reject invalid recipients until it has received the whole message; it rejects the entire message if any single recipient is invalid. This means if you have several users on a mailing list and one of them leaves, all of your users will stop receiving messages from the list and will eventually be unsubscribed.

The way to work around this problem is (instead of SMTP callout verification) to do LDAP queries against the Exchange server's Active Directory. To support this we need to update our Exim build to include LDAP support so we can add the necessary configuration.
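The shape of the change is roughly this - a hypothetical Exim router sketch, not our real configuration; the server name, bind DN, domain list, and transport are all invented. The idea is to accept a recipient only if it has a matching proxyAddresses entry in the Exchange Active Directory:

    # hypothetical router: verify Exchange recipients against AD
    exchange_2013:
      driver = accept
      domains = +exchange_2013_domains
      condition = ${lookup ldap \
        {user="cn=verifier,dc=example,dc=local" pass=NOTREALLY \
         ldap://ad.example.local/dc=example,dc=local?mail?sub?\
         (proxyAddresses=smtp:${quote_ldap:$local_part@$domain})} \
        {yes}{no}}
      transport = remote_smtp

With a router like this the RCPT TO answer comes straight from a directory query, so it does not matter that Exchange itself refuses to reject recipients until end of DATA.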

Another Exchange-related annoyance is that we (postmaster) get a LOT of auto-replies to bounce messages, because Exchange does not implement RFC 3834 properly and instead has a non-standard Microsoft proprietary header for suppressing automatic replies. I have a patch to add Microsoft X-Auto-Response-Suppress headers to Exim's bounce messages which I hope will reduce this annoyance.

When preparing an Exim upgrade that included these changes, I found that exim-4.86 included a logging change that makes the +incoming_interface option also log the outgoing interface. I thought I might be able to make this change more consistent with the existing logging options, but sadly I missed the release deadline. In the course of fixing this I found Exim was running out of space for logging options so I have done a massive refactoring to make them easily expandable.

gitweb improvements

For about a year we have been carrying some patches to gitweb which improve its handling of breadcrumbs, subdirectory restrictions, and categories. These patches are in use on our federated gitolite service, git.csx. They have not been incorporated upstream owing to lack of code review. But at last I have some useful feedback so with a bit more work I should be able to get the patches committed so I can run stock gitweb again.

Delegation updates, and superglue

We now subscribe to the ISC SNS service. As well as running f.root-servers.net, the ISC run a global anycast secondary name service used by a number of ccTLDs and other deserving organizations. Fully deploying the delegation change for our 125 public zones has been delayed an embarrassingly long time.

When faced with the tedious job of updating over 100 delegations, I think to myself, I know, I'll automate it! We have domain registrations with JANET (for ac.uk and some reverse zones), Nominet (other .uk), Gandi (non-uk), and RIPE (other reverse zones), and they all have radically different APIs: EPP (Nominet), XML-RPC (Gandi), REST (RIPE), and, um, CasperJS (JANET).

But the automation is not just for this job: I want to be able to automate DNSSEC KSK rollovers. My plan is to have some code that takes a zone's apex records and uses the relevant registrar API to ensure the delegation matches. In practice KSK rollovers may use a different DNSKEY RRset than the zone's published one, but the principle is to make the registrar interface uniformly dead simple by encapsulating the APIs and non-APIs.

I have some software called "superglue" which nearly does what I want, but it isn't quite finished and at least needs its user interface and internal interfaces made consistent and coherent before I feel happy suggesting that others might want to use it.

But I probably have enough working to actually make the delegation changes so I seriously need to go ahead and do that and tell the hostmasters of our delegated subdomains that they can use ISC SNS too.

Configuration management of secrets

Another DNSSEC-related difficulty is private key management - and not just DNSSEC, but also ssh (host keys), API credentials (see above), and other secrets.

What I want is something for storing encrypted secrets in git. I'm not entirely happy with existing solutions. Often they try to conceal whether my secrets are in the clear or not, whereas I want it to be blatantly obvious whether I am in the safe or risky state. Often they use home-brew crypto whereas I would be much happier with something widely-reviewed like gpg.

My current working solution is a git repo containing a half-arsed bit of shell and a makefile that manage a gpg-encrypted tarball containing a git repo full of secrets. As a background project I have about 1/3rd of a more refined "git privacy guard" based on the same basic principle, but I have not found time to work on it seriously since March. Which is slightly awkward because when finished it should make some of my other projects significantly easier.
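The shape of it is roughly this (names invented; the real scripts also handle locking and sanity checks):

    # unlock: decrypt the tarball and unpack the inner secrets repo
    gpg --decrypt secrets.tar.gpg | tar xf -
    # ... edit and commit in secrets.git as usual ...
    # lock: tar up the repo, encrypt it, and remove the cleartext
    tar cf - secrets.git | gpg --encrypt --recipient hostmaster > secrets.tar.gpg
    rm -rf secrets.git

The virtue of this crude scheme is that the safe/risky distinction is blatant: either the cleartext secrets.git directory exists or it does not.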

DNS RPZ, and metazones

My newest project is to deploy a "DNS firewall", that is, blocking access to malicious domains. The aim is to provide some extra coverage for nasties that get through our spam filters or AV software. It will use BIND's DNS response policy zones feature, with a locally-maintained blacklist and whitelist, plus subscriptions to commercial blacklists.

The blocks will only apply to people who use our default recursive servers. We will also provide alternative unfiltered servers for those who need them. Both filtered and unfiltered servers will be provided by the same instances of BIND, using "views" to select the policy.
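In named.conf terms the plan looks roughly like this - a sketch, not our real config; the ACL, addresses, and zone names are invented. Note the whitelist zone is listed first so it takes precedence:

    // one BIND instance, two policies
    acl unfiltered-clients { 192.0.2.0/24; };

    view "unfiltered" {
        match-clients { unfiltered-clients; };
        recursion yes;
    };

    view "filtered" {
        match-clients { any; };
        recursion yes;
        response-policy {
            zone "allow.rpz.local" policy passthru; // local whitelist
            zone "block.rpz.local";                 // local blacklist
            zone "vendor.example.rpz";              // commercial subscription
        };
    };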

This requires a relatively simple change to our BIND dynamic configuration update machinery, which ensures that all our DNS servers have copies of the right zones. At the moment we're using ssh to push updates, but I plan to eliminate this trust relationship leaving only DNS TSIG (which is used for zone transfers). The new setup will use nsnotifyd's simplified metazone support.

I am amused that nsnotifyd started off as a quick hack for a bit of fun but rapidly turned out to have many uses, and several users other than me!

Other activities

I frequently send little patches to the BIND developers. My most important patch (as in, running in production) which has not yet been committed upstream is automatic size limits for zone journal files.

Mail and DNS user support. Say no more.

IETF activities. I am listed as an author of the DANE SRV draft which is now in the RFC editor queue. (Though I have not had the tuits to work on DANE stuff for a long time now.)

Pending projects

Things that need doing but haven't reached the top of the list yet include:

  • DHCP server refresh: upgrade OS to the same version as the DNS servers and combine the DHCP Ansible setup into the main DNS Ansible setup.
  • Federated DHCP log access: so other University institutions that are using our DHCP service have some insight into what is happening on their networks.
  • Ansible management of the ipreg stuff on Jackdaw.
  • DNS records for mail authentication: SPF/DKIM/DANE. (We have getting on for 400 mail domains with numerous special cases so this is not entirely trivial.)
  • Divorce ipreg database from Jackdaw.
  • Overhaul ipreg user interface or replace it entirely. (Multi-year project.)

Thursday 23rd July 2015

nsdiff-1.70 now with added nsvi

I have released nsdiff-1.70 which is now available from the nsdiff home page. nsdiff creates an nsupdate script from the differences between two versions of a zone. We use it at work for pushing changes from our IP Register database into our DNSSEC signing server.

This release incorporates a couple of suggestions from Jordan Rieger of webnames.ca. The first relaxes domain name syntax in places where domain names do not have to be host names, e.g. in the SOA RNAME field, which is the email address of the people responsible for the zone. The second allows you to optionally choose case-insensitive comparison of records.

The other new feature is an nsvi command which makes it nice and easy to edit a dynamic zone. Why didn't I write this years ago? It was inspired by a suggestion from @jpmens and @Habbie on Twitter and fuelled by a few pints of Citra.


Thursday 2nd July 2015

nsnotifyd-1.1: prompt DNS zone transfers for stealth secondaries

nsnotifyd is my tiny DNS server that only handles DNS NOTIFY messages by running a command. (See my announcement of nsnotifyd and the nsnotifyd home page.)

At Cambridge we have a lot of stealth secondary name servers. We encourage admins who run resolvers to configure them in this way in order to resolve names in our private zones; it also reduces load on our central resolvers which used to be important. This is documented in our sample configuration for stealth nameservers on the CUDN.

The problem with this is that a stealth secondary can be slow to update its copy of a zone. It doesn't receive NOTIFY messages (because it is stealth) so it has to rely on the zone's SOA refresh and retry timing parameters. I have mitigated this somewhat by reducing our refresh timer from 4 hours to 30 minutes, but it might be nice to do better.
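For reference, those timers live in the zone's SOA record; a stealth secondary polls for changes at the refresh interval, and falls back to the retry interval after a failure. The names and values here are illustrative, not our exact ones:

    @  IN SOA  authdns0.csx.cam.ac.uk. hostmaster.example.ac.uk. (
                    1435849200 ; serial
                    30m        ; refresh: how often secondaries poll the master
                    15m        ; retry: poll interval after a failed refresh
                    4w         ; expire: give up if the master is down this long
                    1h         ; negative-caching TTL
                    )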

A similar problem came up in another scenario recently. I had a brief exchange with someone at JANET about DNS block lists and response policy zones in particular. RPZ block lists are distributed by standard zone transfers. If the RPZ users are stealth secondaries then they are not going to get updates in a very timely manner. (They might not be entirely stealth: RPZ vendors maintain ACLs listing their customers which they might also use for sending notifies.) JANET were concerned that if they provided an RPZ mirror it might exacerbate the staleness problem.

So I thought it might be reasonable to:

  • Analyze a BIND log to extract lists of zone transfer clients, which are presumably mostly stealth secondaries. (A little script called nsnotify-liststealth)
  • Write a tool called nsnotify-fanout to send notify messages to a list of targets.
  • And hook them up to nsnotifyd with a script called nsnotify2stealth.

The result is that you can just configure your authoritative name server to send NOTIFYs to nsnotifyd, and it will automatically NOTIFY all of your stealth secondaries as soon as the zone changes.
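The first step, the log analysis, might look something like this sketch (the real nsnotify-liststealth may well differ; the log lines are samples in the style of BIND's outgoing transfer logging):

```shell
# sample BIND log lines recording zone transfers to clients
cat > named.log <<'EOF'
11-Jun-2015 12:00:00.000 client 192.0.2.1#5678 (example.org): transfer of 'example.org/IN': AXFR started
11-Jun-2015 12:05:00.000 client 192.0.2.2#5679 (example.org): transfer of 'example.org/IN': IXFR started
11-Jun-2015 13:00:00.000 client 192.0.2.1#5680 (example.org): transfer of 'example.org/IN': AXFR started
EOF
# extract the unique client addresses: these are the (mostly stealth)
# secondaries that nsnotify-fanout should notify
stealth=$(sed -n 's/.* client \([^#]*\)#.*transfer of.*/\1/p' named.log | sort -u)
echo "$stealth"
```

Feeding that list to a notify-sending tool is then a one-liner in a wrapper script.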

This seems to work pretty well, but there is a caveat!

You will now get a massive thundering herd of zone transfers as soon as a zone changes. Previously your stealth secondaries would have tended to spread their load over the SOA refresh period. Not any more!

The ISC has a helpful page on tuning BIND for high zone transfer volume which you should read if you want to use nsnotify2stealth.


Monday 15th June 2015

nsnotifyd: handle DNS NOTIFY messages by running a command

About ten days ago, I was wondering how I could automatically and promptly record changes to a zone in git. The situation I had in mind was a special zone which was modified by nsupdate but hosted on a server whose DR plan is "rebuild from Git using Ansible". So I thought it would be a good idea to record updates in git so they would not be lost.

My initial idea was to use BIND's update-policy external feature, which allows you to hook into its dynamic update handling. However that would have problems with race conditions, since the update handler doesn't know how much time to allow for BIND to record the update.

So I thought it might be better to write a DNS NOTIFY handler, which gets told about updates just after they have been recorded. And I thought that writing a DNS server daemon would not be much harder than writing an update-policy external daemon.

@jpmens responded with enthusiasm, and some hours later I had something vaguely working.

What I have written is a special-purpose DNS server called nsnotifyd. It is actually a lot more general than just recording a zone's history in git. You can run any command as soon as a zone is changed - the script for recording changes in git is just one example.

Another example is running nsdiff to propagate changes from a hidden master to a DNSSEC signer. You can do the same job with BIND's inline-signing mode, but maybe you need to insert some evil zone mangling into the process, say...

Basically, anywhere you currently have a cron job which is monitoring updates to DNS zones, you might want to run it under nsnotifyd instead, so your script runs as soon as the zone changes instead of running at fixed intervals.

Since a few people expressed an interest in this program, I have written documentation and packaged it up, so you can download it from the nsnotifyd home page, or from one of several git repository servers.

(Epilogue: I realised halfway through this little project that I had a better way of managing my special zone than updating it directly with nsupdate. Oh well, I can still use nsnotifyd to drive @diffroot!)

Monday 27th April 2015

FizzBuzz with higher-order cpp macros and ELF linker sets

Here are a couple of fun ways to reduce repetitive code in C. To illustrate them, I'll implement FizzBuzz with the constraint that I must mention Fizz and Buzz only once in each implementation.

The generic context is that I want to declare some functions, and each function has an object containing some metadata. In this case the function is a predicate like "divisible by three" and the metadata is "Fizz". In a more realistic situation the function might be a driver entry point and the metadata might be a description of the hardware it supports.

Both implementations of FizzBuzz fit into the following generic FizzBuzz framework, which knows the general form of the game but not the specific rules about when to utter silly words instead of numbers.

    #include <stdbool.h>
    #include <stdio.h>
    #include <err.h>
    // predicate metadata
    typedef struct pred {
        bool (*pred)(int);
        const char *name;
    } pred;
    // some other file declares a table of predicates
    #include PTBL
    static bool putsafe(const char *s) {
        return s != NULL && fputs(s, stdout) != EOF;
    }
    int main(void) {
        for (int i = 0; true; i++) {
            bool done = false;
            // if a predicate is true, print its name
            for (pred *p = ptbl; p < ptbl_end; p++)
                done |= putsafe(p->pred(i) ? p->name : NULL);
            // if no predicate was true, print the number
            if (printf(done ? "\n" : "%d\n", i) < 0)
                err(1, "printf");
        }
    }

To compile this code you need to define the PTBL macro to the name of a file that implements a FizzBuzz predicate table.

Higher-order cpp macros

A higher-order macro is a macro which takes a macro as an argument. This can be useful if you want the macro to do different things each time it is invoked.

For FizzBuzz we use a higher-order macro to tersely define all the predicates in one place. What this macro actually does depends on the macro name p that we pass to it.

    #define predicates(p) \
        p(Fizz, i % 3 == 0) \
        p(Buzz, i % 5 == 0)

We can then define a function-defining macro, and pass it to our higher-order macro to define all the predicate functions.

    #define pred_fun(name, test) \
        static bool name(int i) { return test; }
    predicates(pred_fun)

And we can define a macro to declare a metadata table entry (using the C preprocessor stringification operator), and pass it to our higher-order macro to fill in the whole metadata table.

    #define pred_ent(name, test) { name, #name },
    pred ptbl[] = { predicates(pred_ent) };

For the purposes of the main program we also need to declare the end of the table.

    #define ptbl_end (ptbl + sizeof(ptbl)/sizeof(*ptbl))

Higher-order macros can get unwieldy, especially if each item in the list is large. An alternative is to use a higher-order include file, where instead of passing a macro to another macro, you #define a macro with a particular name, #include a file of macro invocations, then #undef the special macro. This saves you from having to end dozens of lines with backslash continuations.

ELF linker sets

The linker takes collections of definitions, separates them into different sections (e.g. code and data), and concatenates each section into a contiguous block of memory. The effect is that although you can interleave code and data in your C source file, the linker disentangles the little pieces into one code section and one data section.

You can also define your own sections, if you like, by using gcc declaration attributes, so the linker will gather the declarations together in your binary regardless of how spread out they were in the source. The FreeBSD kernel calls these "linker sets" and uses them extensively to construct lists of drivers, initialization actions, etc.

This allows us to declare the metadata for a FizzBuzz predicate alongside its function definition, and use the linker to gather all the metadata into the array expected by the main program. The key part of the macro below is the __attribute__((section("pred"))).

    #define predicate(name, test) \
        static bool name(int i) { return test; } \
        pred pred_##name __attribute__((section("pred"))) \
            = { name, #name }

With that convenience macro we can define our predicates in whatever order or in whatever files we want.

    predicate(Fizz, i % 3 == 0);
    predicate(Buzz, i % 5 == 0);

To access the metadata, the linker helpfully defines some symbols identifying the start and end of the section, which we can pass to our main program.

    extern pred __start_pred[], __stop_pred[];
    #define ptbl    __start_pred
    #define ptbl_end __stop_pred

Source code

git clone https://github.com/fanf2/dry-fizz-buzz


Tuesday 24th February 2015

DNSQPS: an alarming shell script

I haven't got round to setting up proper performance monitoring for our DNS servers yet, so I have been making fairly ad-hoc queries against the BIND statistics channel and pulling out numbers with jq.

Last week I changed our new DNS setup for more frequent DNS updates. As part of this change I reduced the TTL on all our records from one day to one hour. The obvious question was, how would this affect the query rate on our servers?

So I wrote a simple monitoring script. The first version did,

    while sleep 1

But the fetch-and-print-stats part took a significant fraction of a second, so the queries-per-second numbers were rather bogus.

A better way to do this is to run `sleep` in the background, while you fetch-and-print-stats in the foreground. Then you can wait for the sleep to finish and loop back to the start. The loop should take almost exactly a second to run (provided fetch-and-print-stats takes less than a second). This is pretty similar to an alarm()/wait() sequence in C. (Actually no, that's bollocks.)
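Here is the pattern in miniature - a toy sketch with three iterations of trivial work, each padded out to one second by the background sleep:

```shell
start=$(date +%s)
count=0
while [ $count -lt 3 ]
do
    sleep 1 &             # set the alarm
    count=$((count + 1))  # fetch-and-print-stats would go here
    wait                  # wait for the alarm
done
elapsed=$(( $(date +%s) - start ))
echo "did $count iterations in about $elapsed seconds"
```

However long or short the loop body is (up to a second), each pass takes almost exactly one second.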

My dnsqps script also abuses `eval` a lot to get a shonky Bourne shell version of associative arrays for the per-server counters. Yummy.

So now I was able to get queries-per-second numbers from my servers, what was the effect of dropping the TTLs? Well, as far as I can tell from eyeballing, nothing. Zilch. No visible change in query rate. I expected at least some kind of clear increase, but no.

The current version of my dnsqps script is:

    while :
    do
        sleep 1 & # set an alarm
        for s in "$@"
        do
            total=$(curl --silent http://$s:853/json/v1/server |
                    jq -r '.opcodes.QUERY')
            eval inc='$((' $total - tot$s '))'
            eval tot$s=$total
            printf ' %5d %s' $inc $s
        done
        printf '\n'
        wait # for the alarm
    done

Monday 16th February 2015

DNS server rollout report

Last week I rolled out my new DNS servers. It was reasonably successful - a few snags but no showstoppers.

Authoritative DNS rollout playbook

I have already written about scripting the recursive DNS rollout. I also used Ansible for the authoritative DNS rollout. I set up the authdns VMs with different IP addresses and hostnames (which I will continue to use for staging/testing purposes); the rollout process was:

  • Stop the Solaris Zone on the old servers using my zoneadm Ansible module;
  • Log into the staging server and add the live IP addresses;
  • Log into the live server and delete the staging IP addresses;
  • Update the hostname.

There are a couple of tricks with this process.

You need to send a gratuitous ARP to get the switches to update their forwarding tables quickly when you move an IP address. Solaris does this automatically but Linux does not, so I used an explicit arping -U command. On Debian/Ubuntu you need the iputils-arping package to get a version of arping which can send gratuitous ARPs. (The arping package is not the one you want; thanks to Peter Maydell for helping me find the right one!)

If you remove a "primary" IPv4 address from an interface on Linux, it also deletes all the other IPv4 addresses on the same subnet. This is not helpful when you are renumbering a machine. To avoid this problem you need to set sysctl net.ipv4.conf.eth0.promote_secondaries=1.

Pre-rollout configuration checking

The BIND configuration on my new DNS servers is rather different to the old ones, so I needed to be careful that I had not made any mistakes in my rewrite. Apart from re-reading configurations several times, I used a couple of tools to help me check.


I used bzl, the BIND zone list tool by JP Mens to get the list of configured zones from each of my servers. This helped to verify that all the differences were intentional.

The new authdns servers both host the same set of zones, which is the union of the zones hosted on the old authdns servers. The new servers have identical configs; the old ones did not.

The new recdns servers differ from the old ones mainly because I have been a bit paranoid about avoiding queries for martian IP address space, so I have lots of empty reverse zones.


I used my tool nsdiff to verify that the new DNS build scripts produce the same zone files as the old ones. (Except for the HINFO records, which the new scripts omit.)

(This is not quite an independent check, because nsdiff is part of the new DNS build scripts.)


On Monday I sent out the DNS server upgrade announcement, with some wording improvements suggested by my colleagues Bob Dowling and Helen Sargan.

It was rather arrogant of me to give the expected outage times without any allowance for failure. In the end I managed to hit 50% of the targets.

The order of rollout had to be recursive servers first, since I did not want to swap the old authoritative servers out from under the old recursive servers. The new recursive servers get their zones from the new hidden master, whereas the old recursive servers get them from the authoritative servers.

The last server to be switched was authdns0, because that was the old master server, and I didn't want to take it down without being fairly sure I would not have to roll back.

ARP again

The difference in running time between my recdns and authdns scripts bothered me, so I investigated and discovered that IPv4 was partially broken. Rob Bricheno helped by getting the router's view of what was going on. One of my new Linux boxes was ARPing for a testdns IP address, even after I had deconfigured it!

I fixed it by rebooting, after which it continued to behave correctly through a few rollout / backout test runs. My guess is that the problem was caused when I was getting gratuitous ARPs working - maybe I erroneously added a static ARP entry.

After that all switchovers took about 5 - 15 seconds. Nice.

Status checks

I wrote a couple of scripts for checking rollout status and progress. wheredns tells me where each of our service addresses is running (old or new); pingdns repeatedly polls a server. I used pingdns to monitor when service was lost and when it returned during the rollout process.

Step 1: recdns1

On Tuesday shortly after 18:00, I switched over recdns1. This is our busier recursive server, running at about 1500 - 2000 queries per second during the day.

This rollout went without a hitch, yay!

Afterwards I needed to reduce the logging because it was rather too noisy. The logging on the old servers was rather too minimal for my tastes, but I turned up the verbosity a bit too far in my new configuration.

Step 2a: recdns0

On Wednesday morning shortly after 08:00, I switched over recdns0. It is a bit less busy, running about 1000 - 1500 qps.

This did not go so well. For some reason Ansible appeared to hang when connecting to the new recdns cluster to push the updated keepalived configuration.

Unfortunately my back-out scripts were not designed to cope with a partial rollout, so I had to restart the old Solaris Zone manually, and recdns0 was unavailable for a minute or two.

Mysteriously, Ansible connected quickly outside the context of my rollout scripts, so I tried the rollout again and it failed in the same way.

As a last try, I ran the rollout steps manually, which worked OK although I don't type as fast as Ansible runs a playbook.

So in all there was about 5 minutes downtime.

I'm not sure what went wrong; perhaps I just needed to be a bit more patient...

Step 2b: authdns1

After doing recdns0 I switched over authdns1. This was a bit less stressy since it isn't directly user-facing. However it was also a bit messy.

The problem this time was that I forgot to uncomment authdns1 in the Ansible inventory (its list of hosts). Actually, I should not have needed to uncomment it manually - I should have scripted it. The silly thing is that I had the testdns servers in the inventory for testing the authdns rollout scripts, and they had been causing me some benign irritation (connection failures) when running Ansible over the previous week or so. I should not have ignored that irritation: I should have automated it away, like I did with the recdns rollout script.

Anyway, after a partial rollout and manual rollback, it took me a few ansible-playbook --check runs to work out why Ansible was saying "host not found". The problem was due to the Jinja expansion in the following remote command, where the "to" variable was set to "authdns1.csx.cam.ac.uk" which was not in the inventory.

    ip addr add {{hostvars[to].ipv6}}/64 dev eth0

You can reproduce this with a command like,

    ansible -m debug -a 'msg={{hostvars["funted"]}}' all

After fixing that, by uncommenting the right line in the inventory, the rollout worked OK.

The other post-rollout fix was to ensure all the secondary zones had transferred OK. I had not managed to get all of our masters to add my staging servers to their ACLs, but this was not too hard to sort out using the BIND 9.10 JSON statistics server and the lovely jq command:

    curl http://authdns1.csx.cam.ac.uk:853/json |
    jq -r '.views[].zones[] | select(.serial == 4294967295) | .name' |
    xargs -n1 rndc -s authdns1.csx.cam.ac.uk refresh

After that, I needed to reduce the logging again, because the authdns servers get a whole different kind of noise in the logs!

Lurking bug: rp_filter

One mistake sneaked out of the woodwork on Wednesday, with fortunately small impact.

My colleague Rob Bricheno reported that client machines on the same subnet as recdns1 were not able to talk to recdns0. I could see the queries arriving with tcpdump, but they were being dropped somewhere in the kernel.

Malcolm Scott helpfully suggested that this was due to Linux reverse path filtering on the new recdns servers, which are multihomed on both subnets. Peter Benie advised me of the correct setting,

    sysctl net.ipv4.conf.em1.rp_filter=2

Step 3: authdns0

On Thursday evening shortly after 18:00, I did the final switch-over of authdns0, the old master.

This went fine, yay! (Actually, more like 40s than the expected 15s, but I was patient, and it was OK.)

There was a minor problem that I forgot to turn off the old DNS update cron job, so it bitched at us a few times overnight when it failed to send updates to its master server. Poor lonely cron job.

One more thing

Over the weekend my email servers complained that some of their zones had not been refreshed recently. This was because four of our RFC 1918 private reverse DNS zones had not been updated since before the switch-over.

There is a slight difference in the cron job timings on the old and new setups: previously updates happened at 59 minutes past the hour, now they happen at 53 minutes past (same as the DNS port number, for fun and mnemonics). Both setups use Unix time serial numbers, so they were roughly in sync, but due to the cron schedule the old servers had serial numbers about 300 higher.

BIND on my mail servers was refusing to refresh the zone because it had copies of the zones from the old servers with a higher serial number than the new servers.

I did a sneaky nsupdate add and delete on the relevant zones to update their serial numbers and everything is happy again.
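
The trick can be sketched like this (the server and zone names below are hypothetical stand-ins; the real zones were our RFC 1918 reverse zones): each dynamic update transaction bumps the SOA serial, so adding and then deleting a throwaway record leaves the zone contents unchanged but the serial higher. The output is meant to be piped into nsupdate.

```python
# Sketch: build an nsupdate(1) script that bumps a zone's SOA serial
# without changing the zone's contents. The server, zone, and record
# names here are hypothetical.

def serial_bump_script(server, zone):
    name = "serial-bump." + zone
    lines = [
        "server " + server,
        "zone " + zone,
        # adding then deleting a throwaway record leaves the zone
        # unchanged, but each update transaction bumps the serial
        'update add {} 0 TXT "bump"'.format(name),
        "send",
        "update delete {} TXT".format(name),
        "send",
    ]
    return "\n".join(lines) + "\n"

print(serial_bump_script("authdns0.csx.cam.ac.uk", "10.in-addr.arpa"))
```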

To conclude

They say a clever person can get themselves out of situations a wise person would not have got into in the first place. I think the main wisdom to take away from this is not to ignore minor niggles, and to write rollout/rollback scripts that can work forwards or backwards after being interrupted at any point. I won against the niggles on the ARP problem, but lost against them on the authdns inventory SNAFU.

But in the end it pretty much worked, with only a few minutes downtime and only one person affected by a bug. So on the whole I feel a bit like Mat Ricardo.

(Leave a comment)

Friday 30th January 2015

Recursive DNS rollout plan - and backout plan!

The last couple of weeks have been a bit slow, being busy with email and DNS support, an unwell child, and surprise 0day. But on Wednesday I managed to clear the decks so that on Thursday I could get down to some serious rollout planning.

My aim is to do a forklift upgrade of our DNS servers - a tier 1 service - with negligible downtime, and with a backout plan in case of fuckups.

Solaris Zones

Our old existing DNS service is based on Solaris Zones. The nice thing about this is that I can quickly and safely halt a zone - which stops the software and unconfigures the network interface - and if the replacement does not work I can restart the zone - which brings up the interfaces and the software.

Even better, the old servers have a couple of test zones which I can bounce up and down without a care. These give me enormous freedom to test my migration scripts without worrying about breaking things and with a high degree of confidence that my tests are very similar to the real thing.

Testability gives you confidence, and confidence gives you productivity.

Before I started setting up our new recursive DNS servers, I ran zoneadm -z testdns* halt on the old servers so that I could use the testdns addresses for developing and testing our keepalived setup. So I had the testdns zones in reserve for developing and testing the rollout/backout scripts.

Rollout plans

The authoritative and recursive parts of the new setup are quite different, so they require different rollout plans.

On the authoritative side we will have a virtual machine for each service address. I have not designed the new authoritative servers for any server-level or network-level high availability, since the DNS protocol should be able to cope well enough. This is similar in principle to our existing Solaris Zones setup. The vague rollout plan is to set up new authdns servers on standby addresses, then renumber them to take over from the old servers. This article is not about the authdns rollout plan.

On the recursive side, there are four physical servers any of which can host any of the recdns or testdns addresses, managed by keepalived. The vague rollout plan is to disable a zone on the old servers then enable its service address on the keepalived cluster.

Ansible - configuration vs orchestration

So far I have been using Ansible in a simple way as a configuration management system, treating it as a fairly declarative language for stating what the configuration of my servers should be, and then being able to run the playbooks to find out and/or fix where reality differs from intention.

But Ansible can also do orchestration: scripting a co-ordinated sequence of actions across disparate sets of servers. Just what I need for my rollout plans!

When to write an Ansible module

The first thing I needed was a good way to drive zoneadm from Ansible. I have found that using Ansible as a glorified shell script driver is pretty unsatisfactory, because its shell and command modules are too general to provide proper support for its idempotence and check-mode features. Rather than messing around with shell commands, it is much more satisfactory (in terms of reward/effort) to write a custom module.

My zoneadm module does the bare minimum: it runs zoneadm list -pi to get the current state of the machine's zones, checks if the target state matches the current state, and if not it runs zoneadm boot or zoneadm halt as required. It can only handle zone states that are "installed" or "running". 60 lines of uncomplicated Python, nice.
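
The module itself isn't reproduced here, but its core decision logic might be sketched like this (not the real code; the parsing follows the colon-separated zoneid:zonename:state:... fields of zoneadm list -p output):

```python
# Sketch of a minimal zoneadm module's decision logic: compare the
# current state from "zoneadm list -pi" with the target state and pick
# a subcommand. Not the real module.

def zone_states(listing):
    """Parse zoneid:zonename:state:... lines into {name: state}."""
    states = {}
    for line in listing.splitlines():
        fields = line.split(":")
        states[fields[1]] = fields[2]
    return states

def plan(listing, name, target):
    """Return the zoneadm subcommand needed, or None if nothing to do."""
    current = zone_states(listing)[name]
    if current == target:
        return None          # already in the desired state: idempotent
    if target == "running":
        return "boot"        # installed -> running
    if target == "installed":
        return "halt"        # running -> installed
    raise ValueError("only installed/running are handled")

listing = "0:global:running:/::solaris:shared\n" \
          "1:testdns1:installed:/zones/testdns1::native:shared"
print(plan(listing, "testdns1", "running"))
```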

Start stupid and expect to fail

After I had a good way to wrangle zoned it was time to do a quick hack to see if a trial rollout would work. I wrote the following playbook which does three things: move the testdns1 zone from running to installed, change the Ansible configuration to enable testdns1 on the keepalived cluster, then push the new keepalived configuration to the cluster.

- hosts: helen2.csi.cam.ac.uk
  tasks:
    - zoneadm: name=testdns1 state=installed

- hosts: localhost
  tasks:
    - command: bin/vrrp_toggle rollout testdns1

- hosts: rec
  roles:
    - keepalived

This is quick and dirty, hardcoded all the way, except for the vrrp_toggle command which is the main reality check.

The vrrp_toggle script just changes the value of an Ansible variable called vrrp_enable which lists which VRRP instances should be included in the keepalived configuration. The keepalived configuration is generated from a Jinja2 template, and each vrrp_instance (testdns1 etc.) is emitted if the instance name is not commented out of the vrrp_enable list.
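
In plain Python, the template's selection logic might look something like this (a sketch, not the real Jinja2 template; the "#" commenting convention is as described above, and the emitted block body is a placeholder):

```python
# Sketch of the template's selection logic: emit a vrrp_instance block
# only when its name appears uncommented in the vrrp_enable list. A
# leading "#" comments an instance out.

vrrp_enable = ["recdns0", "recdns1", "#testdns0", "testdns1"]

def enabled(instance):
    # "#testdns0" in the list does not match "testdns0", so it is skipped
    return instance in vrrp_enable

config = "".join(
    "vrrp_instance {} {{ ... }}\n".format(name)
    for name in ["recdns0", "recdns1", "testdns0", "testdns1"]
    if enabled(name)
)
print(config)
```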


Ansible does not re-read variables if you change them in the middle of a playbook like this, so the final play pushed a configuration generated from the old value of vrrp_enable. Fair enough: that is arguably the right thing for Ansible to do.

The other way in which this playbook is stupid is that there are actually 8 of them: 2 recdns plus 2 testdns instances, each needing rollout and backout versions. Writing them individually is begging for typos; repeated code that is similar but systematically different is one of the most common ways to introduce bugs.

Learn from failure

So the right thing to do is tweak the variable then run the playbook. And note that the vrrp_toggle command arguments describe almost everything you need to know to generate the playbook! (The only thing missing is the mapping from instance name (like testdns1) to parent host (like helen2).)

So I changed the vrrp_toggle script into a rec-rollout / rec-backout script, which tweaks the vrrp_enable variable and generates the appropriate playbook. The playbook consists of just two tasks, whose order depends on whether we are doing rollout or backout, and which have a few straightforward place-holder substitutions.
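
My generator script isn't shown here, but the idea can be sketched like this (the parent host for testdns0 is a hypothetical guess; helen2 for testdns1 is from the playbook above). Rollout halts the old zone then pushes keepalived configuration; backout swaps the order and boots the zone instead.

```python
# Sketch of generating rollout/backout playbooks from a template
# (hypothetical; the real script also tweaks the vrrp_enable variable).

PARENT = {
    "testdns0": "helen1.csi.cam.ac.uk",  # hypothetical parent host
    "testdns1": "helen2.csi.cam.ac.uk",
}

def playbook(action, instance):
    zone_play = (
        "- hosts: {}\n"
        "  tasks:\n"
        "    - zoneadm: name={} state={}\n"
    ).format(PARENT[instance], instance,
             "installed" if action == "rollout" else "running")
    keepalived_play = (
        "- hosts: rec\n"
        "  roles:\n"
        "    - keepalived\n"
    )
    # rollout: halt the old zone, then bring up keepalived;
    # backout: push config (instance disabled), then boot the zone
    if action == "rollout":
        return zone_play + keepalived_play
    return keepalived_play + zone_play

print(playbook("rollout", "testdns1"))
```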

The nice thing about this kind of templating is that if you screw it up (like I did at first), usually a large proportion of the cases fail, probably including your test cases; whereas with clone-and-hack there will be a nasty surprise in a case you didn't test.

Consistent and quick rollouts

In the playbook I quoted above I am using my keepalived role, so I can be absolutely sure that my rollout/backout plan remains consistent with my configuration management setup. Nice!

However the keepalived role does several configuration tasks, most of which are not necessary in this situation. In fact all I need to do is copy across the templated configuration file and tell keepalived to reload it if the file has changed.

Ansible tags are for just this kind of optimization. I added a line to my keepalived.conf task:

    tags: quick

Only one task needed tagging because the keepalived.conf task has a handler to tell keepalived to reload its configuration when that changes, which is the other important action. So now I can run my rollout/backout playbooks with a --tags quick argument, so only the quick tasks (and if necessary their handlers) are run.


Once I had got all that working, I was able to easily flip testdns0 and testdns1 back and forth between the old and new setups. Each switchover takes about ten seconds, which is not bad - it is less than a typical DNS lookup timeout.

There are a couple more improvements to make before I do the rollout for real. I should improve the molly guard to make better use of ansible-playbook --check. And I should pre-populate the new servers' caches with the Alexa Top 1,000,000 list to reduce post-rollout latency. (If you have a similar UK-centric popular domains list, please tell me so I can feed that to the servers as well!)

(Leave a comment)

Saturday 24th January 2015

New release of nsdiff and nspatch version 1.55

I have released version 1.55 of nsdiff, which creates an nsupdate script from differences between DNS zone files.

There are not many changes to nsdiff itself: the only notable change is support for non-standard port numbers.

The important new thing is nspatch, which is an error-checking wrapper around `nsdiff | nsupdate`. To be friendly when running from cron, nspatch only produces output when it fails. It can also retry an update if it happens to lose a race against concurrent updates e.g. due to DNSSEC signing activity.
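
The behaviour described above might be sketched like this (not nspatch's real code; the run argument stands in for executing `nsdiff | nsupdate`):

```python
# Rough sketch of nspatch's cron-friendly behaviour: stay quiet on
# success, retry if the update loses a race, and only produce output
# when it finally fails.

def nspatch(run, retries=1):
    """Return True quietly on success; print captured output on failure."""
    output = ""
    for attempt in range(retries + 1):
        ok, output = run()
        if ok:
            return True      # cron-friendly: no output at all
    print(output)            # only failures make noise
    return False

# usage: simulate losing a race once, then succeeding on the retry
attempts = iter([(False, "zone serial changed under us"), (True, "")])
print(nspatch(lambda: next(attempts)))
```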

You can read the documentation and download the source from the nsdiff home page.

My Mac insists that I should call it nstiff...

(1 comment | Leave a comment)

Saturday 17th January 2015

BIND patches as a byproduct of setting up new DNS servers

On Friday evening I reached a BIG milestone in my project to replace Cambridge University's DNS servers. I finished porting and rewriting the dynamic name server configuration and zone data update scripts, and I was - at last! - able to get the new servers up to pretty much full functionality, pulling lists of zones and their contents from the IP Register database and the managed zone service, and with DNSSEC signing on the new hidden master.

There is still some final cleanup and robustifying to do, and checks to make sure I haven't missed anything. And I have to work out the exact process I will follow to put the new system into live service with minimum risk and disruption. But the end is tantalizingly within reach!

In the last couple of weeks I have also got several small patches into BIND.

  • Jan 7: documentation for named -L

    This was a follow-up to a patch I submitted in April last year. The named -L option specifies a log file to use at startup for recording the BIND version banners and other startup information. Previously this information would always go to syslog regardless of your logging configuration.

    This feature will be in BIND 9.11.

  • Jan 8: typo in comment

    Trivial :-)

  • Jan 12: teach nsdiff to AXFR from non-standard ports

    Not a BIND patch, but one of my own companion utilities. Our managed zone service runs a name server on a non-standard port, and our new setup will use nsdiff | nsupdate to implement bump-in-the-wire signing for the MZS.

  • Jan 13: document default DNSKEY TTL

    Took me a while to work out where that value came from. Submitted on Jan 4. Included in 9.10 ARM.

  • Jan 13: automatically tune max-journal-size

    Our old DNS build scripts have a couple of mechanisms for tuning BIND's max-journal-size setting. By default a zone's incremental update journal will grow without bound, which is not helpful. Having to set the parameter by hand is annoying, especially since it should be simple to automatically tune the limit based on the size of the zone.

    Rather than re-implementing some annoying plumbing for yet another setting, I thought I would try to automate it away. I have submitted this patch as RT#38324. In response I was told there is also RT#36279 which sounds like a request for this feature, and RT#25274 which sounds like another implementation of my patch. Based on the ticket number it dates from 2011.

    I hope this gets into 9.11, or something like it. I suppose that rather than maintaining this patch I could do something equivalent in my build scripts...

  • Jan 14: doc: ignore and clean up isc-notes-html.xsl

    I found some cruft in a supposedly-clean source tree.

    This one actually got committed under my name, which I think is a first for me and BIND :-) (RT#38330)

  • Jan 14: close new zone file before renaming, for win32 compatibility
  • Jan 14: use a safe temporary new zone file name

    These two arose from a problem report on the bind-users list. The conversation moved to private mail which I find a bit annoying - I tend to think it is more helpful for other users if problems are fixed in public.

    But it turned out that BIND's error logging in this area is basically negligible, even when you turn on debug logging :-( But the Windows Process Explorer is able to monitor filesystem events, and it reported a 'SHARING VIOLATION' and 'NAME NOT FOUND'. This gave me the clue that it was a POSIX vs Windows portability bug.

    So in the end this problem was more interesting than I expected.

  • Jan 16: critical: ratelimiter.c:151: REQUIRE(ev->ev_sender == ((void *)0)) failed

    My build scripts are designed so that Ansible sets up the name servers with a static configuration which contains everything except for the zone {} clauses. The zone configuration is provisioned by the dynamic reconfiguration scripts. Ansible runs are triggered manually; dynamic reconfiguration runs from cron.

    I discovered a number of problems with bootstrapping from a bare server with no zones to a fully-populated server with all the zones and their contents on the new hidden master.

    The process is basically,

    • if there are any missing master files, initialise them as minimal zone files
    • write zone configuration file and run rndc reconfig
    • run nsdiff | nsupdate for every zone to fill them with the correct contents

    When bootstrapping, the master server would load 123 new zones, then shortly after the nsdiff | nsupdate process started, named crashed with the assertion failure quoted above.

    Mark Andrews replied overnight with the linked patch (he lives in Australia) which fixed the problem. Yay!

    The other bootstrapping problem was to do with BIND's zone integrity checks. nsdiff is not very clever about the order in which it emits changes; in particular it does not ensure that hostnames exist before any NS or MX or SRV records are created to point to them. You can turn off most of the integrity checks, but not the NS record checks.

    This causes trouble for us when bootstrapping the cam.ac.uk zone, which is the only zone we have with in-zone NS records. It also has lots of delegations which can also trip the checks.

    My solution is to create a special bootstrap version of the zone, which contains the apex and delegation records (which are built from configuration stored in git) but not the bulk of the zone contents from the IP Register database. The zone can then be successfully loaded in two stages, first `nsdiff cam.ac.uk DB.bootstrap | nsupdate -l` then `nsdiff cam.ac.uk zones/cam.ac.uk | nsupdate -l`.

    Bootstrapping isn't something I expect to do very often, but I want to be sure it is easy to rebuild all the servers from scratch, including the hidden master, in case of major OS upgrades, VM vs hardware changes, disasters, etc.

    No more special snowflake servers!

(Leave a comment)

Friday 9th January 2015

Recursive DNS server failover with keepalived --vrrp

I have got keepalived working on my recursive DNS servers, handling failover for testdns0.csi.cam.ac.uk and testdns1.csi.cam.ac.uk. I am quite pleased with the way it works.

It was difficult to get started because keepalived's documentation is TERRIBLE. More effort has been spent explaining how it is put together than explaining how to get it to work. The keepalived.conf man page is a barely-commented example configuration file which does not describe all the options. Some of the options are only mentioned in the examples in /usr/share/doc/keepalived/samples. Bah!

Edit: See the comments to find the real documentation!

The vital clue came from Graeme Fowler who told me about keepalived's vrrp_script feature which is "documented" in keepalived.conf.vrrp.localcheck which I never would have found without Graeme's help.


Keepalived is designed to run on a pair of load-balancing routers in front of a cluster of servers. It has two main parts. Its Linux Virtual Server daemon runs health checks on the back-end servers and configures the kernel's load balancing router as appropriate. The LVS stuff handles failover of the back-end servers. The other part of keepalived is its VRRP daemon which handles failover of the load-balancing routers themselves.

My DNS servers do not need the LVS load-balancing stuff, but they do need some kind of health check for named. I am running keepalived in VRRP-only mode and using its vrrp_script feature for health checks.

There is an SMTP client in keepalived which can notify you of state changes. It is too noisy for me, because I get messages from every server when anything changes. You can also tell keepalived to run scripts on state changes, so I am using that for notifications.

VRRP configuration

All my servers are configured as VRRP BACKUPs, and there is no MASTER. According to the VRRP RFC, the master is supposed to be the machine which owns the IP addresses. In my setup, no particular machine owns the service addresses.

I am using authentication mainly for additional protection against screwups (e.g. VRID collisions). VRRP password authentication doesn't provide any security: any attacker has to be on the local link so they can just sniff the password off the wire.

I am slightly surprised that it works when I set both IPv4 and IPv6 addresses on the same VRRP instance. The VRRP spec says you have to have separate vrouters for IPv4 and IPv6. Perhaps it works because keepalived doesn't implement real VRRP by default: it does not use a virtual MAC address but instead it just moves the virtual IP addresses and sends gratuitous ARPs to update the switches' forwarding tables. Keepalived has a use_vmac option but it seems rather fiddly to get working, so I am sticking with the default.

vrrp_instance testdns0 {
        virtual_router_id 210
        interface em1
        state BACKUP
        priority 50
        notify /etc/keepalived/notify
        authentication {
                auth_type PASS
                auth_pass XXXXXXXX
        }
        virtual_ipaddress {
                # (service addresses omitted)
        }
        track_script {
                named_check_testdns0_1
                named_check_testdns0_2
                named_check_testdns0_3
                named_check_testdns0_4
        }
}

State change notifications

My notification script sends email when a server enters the MASTER state and takes over the IP addresses. It also sends email if the server dropped into the BACKUP state because named crashed.

    # this is /etc/keepalived/notify
    # keepalived invokes it with: $1 = type, $2 = instance, $3 = new state
    instance=$2
    state=$3

    case $state in
    BACKUP)
        # do not notify if this server is working
        if /etc/keepalived/named_ok
        then exit 0
        else state=DEAD
        fi
        ;;
    esac

    exim -t <<EOF
    To: hostmaster@cam.ac.uk
    Subject: $instance $state on $(hostname)
    EOF

DNS server health checks and dynamic VRRP priorities

In the vrrp_instance snippet above, you can see that it specifies four vrrp_scripts to track. There is one vrrp_script for each possible priority, so that the four servers can have four different priorities for each vrrp_instance.

Each vrrp_script is specified using the Jinja macro below. (Four different vrrp_scripts for each of four different vrrp_instances is a lot of repetition!) The type argument is "recdns" or "testdns", the num is 0 or 1, and the prio is a number from 1 to 4.

Each script is run every "interval" seconds, and is allowed to run for up to "timeout" seconds. (My checking script should take at most 1 second.)

A positive "weight" setting is added to the vrrp_instance's priority to increase it when the script succeeds. (If the weight is negative it is added to the priority to decrease it when the script fails.)

    {%- macro named_check(type,num,prio) -%}
    vrrp_script named_check_{{type}}{{num}}_{{prio}} {
        script "/etc/keepalived/named_check {{type}} {{num}} {{prio}}"
        interval 1
        timeout 2
        weight {{ prio * 50 }}
    }
    {%- endmacro -%}

When keepalived runs the four tracking scripts for a vrrp_instance on one of my servers, at most one of the scripts will succeed. The priority is therefore adjusted to 250 for the server that should be live, 200 for its main backup, 150 and 100 on the other servers, and 50 on any server which is broken or out of service.
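
Spelling out the arithmetic from the configuration above: the base priority of 50, plus the weight (prio × 50) of the single succeeding check, gives the five possible effective priorities.

```python
# The arithmetic behind the priorities: base priority 50 from the
# vrrp_instance, plus the weight (prio * 50) of whichever one of the
# four tracked scripts succeeds.

BASE = 50

def effective_priority(prio):
    """prio is 1..4 from the priority file, or None if the host is
    commented out (no check succeeds, so no weight is added)."""
    if prio is None:
        return BASE
    return BASE + prio * 50

print([effective_priority(p) for p in (4, 3, 2, 1, None)])
# -> [250, 200, 150, 100, 50]
```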

The checking script finds the position of the host on which it is running in a configuration file which lists the servers in priority order. A server can be commented out to remove it from service. The priority order for testdns1 is the opposite of the order for testdns0. So the following contents of /etc/keepalived/priority.testdns specify that testdns1 is running on recdns-cnh, testdns0 is on recdns-wcdc, recdns-rnb is disabled, and recdns-sby is a backup:

    recdns-cnh
    #recdns-rnb
    recdns-sby
    recdns-wcdc

I can update this priority configuration file to change which machines are in service, without having to restart or reconfigure keepalived.

The health check script is:


    #!/bin/sh
    set -e

    type=$1 num=$2 check=$3

    # Look for the position of our hostname in the priority listing

    name=$(hostname --short)

    # -F = fixed string not regex
    # -x = match whole line
    # -n = print line number

    # A commented-out line will not match, so grep will fail
    # and set -e will make the whole script fail.

    grepout=$(grep -Fxn $name /etc/keepalived/priority.$type)

    # Strip off everything but the line number. Do this separately
    # so that grep's exit status is not lost in the pipeline.

    prio=$(echo $grepout | sed 's/:.*//')

    # for num=0 later is higher priority
    # for num=1 later is lower priority

    if [ $num = 1 ]
    then prio=$((5 - $prio))
    fi

    # If our priority matches what keepalived is asking about, then our
    # exit status depends on whether named is running, otherwise tell
    # keepalived we are not running at the priority it is checking.

    [ $check = $prio ] && /etc/keepalived/named_ok

The named_ok script just uses dig to verify that the server seems to be working OK. I originally queried for version.bind, but there are very strict rate limits on the server info view so it did not work very well! So now the script checks that this command produces the expected output:

    dig @localhost +time=1 +tries=1 +short cam.ac.uk in txt
(2 comments | Leave a comment)

Wednesday 7th January 2015

Network setup for Cambridge's new DNS servers

The SCCS-to-git project that I wrote about previously was the prelude to setting up new DNS servers with an entirely overhauled infrastructure.

The current setup which I am replacing uses Solaris Zones (like FreeBSD Jails or Linux Containers) to host the various name server instances on three physical boxes. The new setup will use Ubuntu virtual machines on our shared VM service (should I call it a "private cloud"?) for the authoritative servers. I am making a couple of changes to the authoritative setup: changing to a hidden master, and eliminating differences in which zones are served by each server.

I have obtained dedicated hardware for the recursive servers. Our main concern is that they should be able to boot and work with no dependencies on other services beyond power and networking, because basically all the other services rely on the recursive DNS servers. The machines are Dell R320s, each with one Xeon E5-2420 (6 hyperthreaded cores, 2.2GHz), 32 GB RAM, and a Dell-branded Intel 160GB SSD.

Failover for recursive DNS servers

The most important change to the recursive DNS service will be automatic failover. Whenever I need to loosen my bowels I just contemplate dealing with a failure of one of the current elderly machines, which involves a lengthy and delicate manual playbook described on our wiki...

Often when I mention DNS and failover, the immediate response is "Anycast?". We will not be doing anycast on the new servers, though that may change in the future. My current plan is to do failover with VRRP using keepalived. (Several people have told me they are successfully using keepalived, though its documentation is shockingly bad. I would like to know of any better alternatives.) There are a number of reasons for using VRRP rather than anycast:

  • The recursive DNS server addresses are (aka recdns0) and (aka recdns1). (They have IPv6 addresses too.) They are on different subnets which are actually VLANs on the same physical network. It is not feasible to change these addresses.
  • The 8 and 12 subnets are our general server subnets, used for a large proportion of our services, most of which use the recdns servers. So anycasting recdns[01] requires punching holes in the server network routing.
  • The server network routers do not provide proxy ARP and my colleagues in network systems do not want to change this. But our Cisco routers can't punch a /32 anycast hole in the server subnets without proxy ARP. So if we did do anycast we would also have to do VRRP to support failover for recdns clients on the server subnets.
  • The server network spans four sites, connected via our own city-wide fibre network. The sites are linked at layer 2: the same Ethernet VLANs are present at all four sites. So VRRP failover gives us pretty good resilience in the face of server, rack, or site failures.

VRRP will be a massive improvement over our current setup, and it should provide us a lot of the robustness that other places would normally need anycast for, but with significantly less complexity. And less complexity means less time before I can take the old machines out of service.

After the new setup is in place, it might make sense for us to revisit anycast. For instance, we could put recursive servers at other points of presence where our server network does not reach (e.g. the Addenbrooke's medical research site). But in practice there are not many situations when our server network is unreachable but the rest of the University data network is functioning, so it might not be worth it.

Configuration management

The old machines are special snowflake servers. The new setup is being managed by Ansible.

I first used Ansible in 2013 to set up the DHCP servers that were a crucial part of the network renumbering we did when moving our main office from the city centre to the West Cambridge site. I liked how easy it was to get started with Ansible. The way its --check mode prints a diff of remote config file changes is a killer feature for me. And it uses ssh rather than rolling its own crypto and host authentication like some other config management software.

I spent a lot of December working through the configuration of the new servers, starting with the hidden master and an authoritative server (a staging server which is a clone of the future live servers). It felt like quite a lot of elapsed time without much visible progress, though I was steadily knocking items off the list of things to get working.

The best bit was the last day before the xmas break. The new recdns hardware arrived on Monday 22nd, so I spent Tuesday racking them up and getting them running.

My Ansible setup already included most of the special cases required for the recdns servers, so I just uncommented their hostnames in the inventory file and told Ansible to run the playbook. It pretty much Just Worked, which was extremely pleasing :-) All that steady work paid off big time.

Multi-VLAN network setup

The main part of the recdns config which did not work was the network interface configuration, which was OK because I didn't expect it to work without fiddling.

The recdns servers are plugged into switch ports which present subnet 8 untagged (mainly to support initial bootstrap without requiring special setup of the machine's BIOS), and subnet 12 with VLAN tags (VLAN number 812). Each server has its own IPv4 and IPv6 addresses on subnet 8 and subnet 12.

The service addresses recdns0 (subnet 8) and recdns1 (subnet 12) will be additional (virtual) addresses which can be brought up on any of the four servers. They will usually be configured something like:

  • recdns-wcdc: VRRP master for recdns0
  • recdns-rnb: VRRP backup for recdns0
  • recdns-sby: VRRP backup for recdns1
  • recdns-cnh: VRRP master for recdns1

And in case of multi-site failures, the recdns1 servers will act as additional backups for the recdns0 servers and vice versa.

There were two problems with my initial untested configuration.

The known problem was that I was likely to need policy routing, to ensure that packets with a subnet 12 source address were sent out with VLAN 812 tags. This turned out to be true for IPv4, whereas IPv6 does the Right Thing by default.

The unknown problem was that the VLAN 812 interface came up only half-configured: it was using SLAAC for IPv6 instead of the static address that I specified. This took a while to debug. The clue to the solution came from running ifup with the -v flag to get it to print out what it was doing:

    # ip link delete em1.812
    # ifup -v em1.812

This showed that interface configuration was failing when it tried to set up the default route on that interface. Because there can be only one default route, and there was already one on the main subnet 8 interface. D'oh!

Having got ifup to run to completion I was able to verify that the subnet 12 routing worked for IPv6 but not for IPv4, pretty much as expected. With advice from my colleagues David McBride and Anton Altaparmakov I added the necessary runes to the configuration.

My final /etc/network/interfaces files on the recdns servers are generated from the following Jinja template:

# This file describes the network interfaces available on the system
# and how to activate them. For more information, see interfaces(5).

# NOTE: There must be only one "gateway" line because there can be
# only one default route. Interface configuration will fail part-way
# through when you bring up a second interface with a gateway
# specification.

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface, on subnet 8
auto em1

iface em1 inet static
      address 131.111.8.{{ ifnum }}
      netmask 23

iface em1 inet6 static
      address 2001:630:212:8::d:{{ ifnum }}
      netmask 64

# VLAN tagged interface on subnet 12
auto em1.812

iface em1.812 inet static
      address 131.111.12.{{ ifnum }}
      netmask 24

      # send packets with subnet 12 source address
      # through routing table 12 to subnet 12 router

      up   ip -4 rule  add from table 12
      down ip -4 rule  del from table 12
      up   ip -4 route add default table 12 via
      down ip -4 route del default table 12 via

iface em1.812 inet6 static
      address 2001:630:212:12::d:{{ ifnum }}
      netmask 64

      # auto-configured routing works OK for IPv6

# eof
(3 comments | Leave a comment)

Thursday 27th November 2014

Uplift from SCCS to git

My current project is to replace Cambridge University's DNS servers. The first stage of this project is to transfer the code from SCCS to Git so that it is easier to work with.

Ironically, to do this I have ended up spending lots of time working with SCCS and RCS, rather than Git. This was mainly developing analysis and conversion tools to get things into a fit state for Git.

If you find yourself in a similar situation, you might find these tools helpful.


Cambridge was allocated three Class B networks in the 1980s: first the Computer Lab got in 1987; then the Department of Engineering got in 1988; and eventually the Computing Service got in 1989 for the University (and related institutions) as a whole.

The oldest records I have found date from September 1990, which list about 300 registrations. The next two departments to get connected were the Statistical Laboratory and Molecular Biology (I can't say in which order). The Statslab was allocated, which it has kept for 24 years! Things pick up in 1991, when the JANET IP Service was started and rapidly took over to replace X.25. (Last month I blogged about connectivity for Astronomy in Cambridge in 1991.)

I have found these historical nuggets in our ip-register directory tree. This contains the infrastructure and history of IP address and DNS registration in Cambridge going back a quarter century. But it isn't just an archive: it is a working system which has been in production that long. Because of this, converting the directory tree to Git presents certain challenges.


The ip-register directory tree contains a mixture of:

  • Source code, mostly with SCCS history
  • Production scripts, mostly with SCCS history
  • Configuration files, mostly with SCCS history
  • The occasional executable
  • A few upstream perl libraries
  • Output files and other working files used by the production scripts
  • Secrets, such as private keys and passwords
  • Mail archives
  • Historical artifacts, such as old preserved copies of parts of the directory tree
  • Miscellaneous files without SCCS history
  • Editor backup files with ~ suffixes

My aim was to preserve this all as faithfully as I could, while converting it to Git in a way that represents the history in a useful manner.


The rough strategy was:

  1. Take a copy of the ip-register directory tree, preserving modification times. (There is no need to preserve owners because any useful ownership information was lost when the directory tree moved off the Central Unix Service before that shut down in 2008.)
  2. Convert from SCCS to RCS file-by-file. Converting between these formats is a simple one-to-one mapping.
  3. Files without SCCS history will have very short artificial RCS histories created from their modification times and editor backup files.
  4. Convert the RCS tree to CVS. This is basically just moving files around, because a CVS repository is little more than a directory tree of RCS files.
  5. Convert the CVS repository to Git using git cvsimport. This is the only phase that needs to do cross-file history analysis, and other people have already produced a satisfactory solution.
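As a flavour of step 4: since a CVS repository is essentially RCS history files named <name>,v in a matching directory tree, the move really is mechanical. A minimal sketch with a hypothetical helper name - this is an illustration, not one of the actual conversion scripts:

```python
import os
import shutil

def rcs_tree_to_cvs(rcs_root, cvs_root):
    """Copy a tree of RCS history files into a CVS repository layout.

    CVS stores the history of each file as <name>,v in the
    corresponding repository directory, so this is just a recursive
    copy that normalises the ,v suffix.
    """
    for dirpath, _dirnames, filenames in os.walk(rcs_root):
        rel = os.path.relpath(dirpath, rcs_root)
        target = os.path.join(cvs_root, rel)
        os.makedirs(target, exist_ok=True)
        for name in filenames:
            dest = name if name.endswith(',v') else name + ',v'
            shutil.copy2(os.path.join(dirpath, name),
                         os.path.join(target, dest))
```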

Simples! ... Not.

sccs2rcs proves inadequate

I first tried ESR's sccs2rcs Python script. Unfortunately I rapidly ran into a number of showstoppers.

  • It didn't work with Solaris SCCS, which is what was available on the ip-register server.
  • It destructively updates the SCCS tree, losing information about the relationship between the working files and the SCCS files.
  • It works on a whole directory tree, so it doesn't give you file-by-file control.

I fixed a bug or two but very soon concluded the program was entirely the wrong shape.

(In the end, the Solaris incompatibility became moot when I installed GNU CSSC on my FreeBSD workstation to do the conversion. But the other problems with sccs2rcs remained.)


So I wrote a small script called sccs2rcs1 which just converts one SCCS file to one RCS file, and gives you control over where the RCS and temporary files are placed. This meant that I would not have to shuffle RCS files around: I could just create them directly in the target CVS repository. Also, sccs2rcs1 uses RCS options to avoid the need to fiddle with checkout locks, which is a significant simplification.

The main regression compared to sccs2rcs is that sccs2rcs1 does not support branches, because I didn't have any files with branches.


At this point I needed to work out how I was going to co-ordinate the invocations of sccs2rcs1 to convert the whole tree. What was in there?!

I wrote a fairly quick-and-dirty script called sccscheck which analyses a directory tree and prints out notes on various features and anomalies. A significant proportion of the code exists to work out the relationship between working files, backup files, and SCCS files.

I could then start work on determining what fix-ups were necessary before the SCCS-to-CVS conversion.


One notable part of the ip-register directory tree was the archive subdirectory, which contained lots of gzipped SCCS files with date stamps. What relationship did they have to each other? My first guess was that they might be successive snapshots of a growing history, and that the corresponding SCCS files in the working part of the tree would contain the whole history.

I wrote sccsprefix to verify if one SCCS file is a prefix of another, i.e. that it records the same history up to a certain point.

This proved that the files were NOT snapshots! In fact, the working SCCS files had been periodically moved to the archive, and new working SCCS files started from scratch. I guess this was to cope with the files getting uncomfortably large and slow for 1990s hardware.


So to represent the history properly in Git, I needed to combine a series of SCCS files into a linear history. It turns out to be easier to construct commits with artificial metadata (usernames, dates) with RCS than with SCCS, so I wrote rcsappend to add the commits from a newer RCS file as successors of commits in an older file.

Converting the archived SCCS files was then a combination of sccs2rcs1 and rcsappend. Unfortunately this was VERY slow, because RCS takes a long time to check out old revisions. This is because an RCS file contains a verbatim copy of the latest revision and a series of diffs going back one revision at a time. The SCCS format is more clever and so takes about the same time to check out any revision.
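A toy cost model (my own illustration, not code from any of the tools) makes the difference concrete: checking revisions out of RCS oldest-first does O(n²) delta applications in total, whereas SCCS's interleaved "weave" format makes every checkout a single pass over the file:

```python
def rcs_checkout_cost(rev, newest):
    # RCS stores the newest revision verbatim plus reverse deltas,
    # so checking out revision `rev` applies (newest - rev) deltas.
    return newest - rev

def sccs_checkout_cost(rev, newest):
    # SCCS interleaves all deltas, so any checkout is one pass over
    # the whole history file, whichever revision is wanted.
    return 1

def oldest_first_total(n, cost):
    # Total work for a converter checking out revisions 1..n in order.
    return sum(cost(rev, n) for rev in range(1, n + 1))
```

For n = 20,000 revisions (as in the hosts.131.111 history) the oldest-first RCS total is roughly 200 million delta applications, against 20,000 single passes for SCCS.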

So I changed sccs2rcs1 to incorporate an append mode, and used that to convert and combine the archived SCCS files, as you can see in the ipreg-archive-uplift script. This still takes ages to convert and linearize nearly 20,000 revisions in the history of the hosts.131.111 file - an RCS checkin rewrites the entire RCS file so they get slower as the number of revisions grows. Fortunately I don't need to run it many times.


There are a lot of files in the ip-register tree without SCCS histories, which I wanted to preserve. Many of them have old editor backup ~ files, which could be used to construct a wee bit of history (in the absence of anything better). So I wrote files2rcs to build an RCS file from this kind of miscellanea.

An aside on file name restrictions

At this point I need to moan a bit.

Why does RCS object to file names that start with a comma? Why?

I tried running these scripts on my Mac at home. It mostly worked, except for the directories which contained files like DB.cam (source file) and db.cam (generated file). I added a bit of support in the scripts to cope with case-insensitive filesystems, so I can use my Macs for testing. But the bulk conversion runs very slowly, I think because it generates too much churn in the Spotlight indexes.


One significant problem is dealing with SCCS files whose working files have been deleted. In some SCCS workflows this is a normal state of affairs - see for instance the SCCS support in the POSIX Make XSI extensions. However, in the ip-register directory tree this corresponds to files that are no longer needed. Unfortunately the SCCS history generally does not record when the file was deleted. It might be possible to make a plausible guess from manual analysis, but perhaps it is more truthful to record an artificial revision saying the file was not present at the time of conversion.

Like SCCS, RCS does not have a way to represent a deleted file. CVS uses a convention on top of RCS: when a file is deleted it puts the RCS file in an "Attic" subdirectory and adds a revision with a "dead" status. The rcsdeadify script applies this convention to an RCS file.
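The layout half of that convention is trivial - a sketch with a hypothetical helper name (the real rcsdeadify also commits the final revision with state "dead"):

```python
import os

def move_to_attic(rcsfile):
    """Park an RCS history file in CVS's Attic/ subdirectory, which is
    how CVS marks the corresponding working file as deleted.
    (rcsdeadify additionally adds a revision with a "dead" status.)"""
    attic = os.path.join(os.path.dirname(rcsfile), 'Attic')
    os.makedirs(attic, exist_ok=True)
    dest = os.path.join(attic, os.path.basename(rcsfile))
    os.rename(rcsfile, dest)
    return dest
```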


There are situations where it is possible to identify a meaningful committer and deletion time. Where a .tar.gz archive exists, it records the original file owners. The tar2usermap script records the file owners from the tar files. The contents can then be unpacked and converted as if they were part of the main directory, using the usermap file to provide the correct committer IDs. After that the files can be marked as deleted at the time the tarfile was created.


The main conversion script is sccs2cvs, which evacuates an SCCS working tree into a CVS repository, leaving behind a tree of (mostly) empty directories. It is based on a simplified version of the analysis done by sccscheck, with more careful error checking of the commands it invokes. It uses sccs2rcs1, files2rcs, and rcsappend to handle each file.

The rcsappend case occurs when there is an editor backup ~ file which is older than the oldest SCCS revision, in which case sccs2cvs uses rcsappend to combine the output of sccs2rcs1 and files2rcs. This could be done more efficiently with sccs2rcs1's append mode, but for the ip-register tree it doesn't cause a big slowdown.

To cope with the varying semantics of missing working files, sccs2cvs leaves behind a tombstone where it expected to find a working file. This takes the form of a symlink pointing to 'Attic'. Another script can then deal with these tombstones as appropriate.

pre-uplift, mid-uplift, post-uplift

Before sccs2cvs can run, the SCCS working tree should be reasonably clean. So the overall uplift process goes through several phases:

  1. Fetch and unpack copy of SCCS working tree;
  2. pre-uplift fixups;
    (These should be the minimum changes that are required before conversion to CVS, such as moving secrets out of the working tree.)
  3. sccs2cvs;
  4. mid-uplift fixups;
    (This should include any adjustments to the earlier history such as marking when files were deleted in the past.)
  5. git cvsimport or cvs-fast-export | git fast-import;
  6. post-uplift fixups;
    (This is when to delete cruft which is now preserved in the git history.)

For the ip-register directory tree, the pre-uplift phase also includes ipreg-archive-uplift which I described earlier. Then in the mid-uplift phase the combined histories are moved into the proper place in the CVS repository so that their history is recorded in the right place.

Similarly, for the tarballs, the pre-uplift phase unpacks them in place, and moves the tar files aside. Then the mid-uplift phase rcsdeadifies the tree that was inside the tarball.

I have not stuck to my guidelines very strictly: my scripts delete quite a lot of cruft in the pre-uplift phase. In particular, they delete duplicated SCCS history files from the archives, and working files which are generated by scripts.


SCCS/RCS/CVS all record committers by simple user IDs, whereas git uses names and email addresses. So git-cvsimport and cvs-fast-export can be given an authors file containing the translation. The sccscommitters script produces a list of user IDs as a starting point for an authors file.

Uplifting cvs to git

At first I tried git cvsimport, since I have successfully used it before. In this case it turned out not to be the path to swift enlightenment - it was taking about 3s per commit. This is mainly because it checks out files from oldest to newest, so it falls foul of the same performance problem that my rcsappend program did, as I described above.

So I compiled cvs-fast-export and fairly soon I had a populated repository: nearly 30,000 commits at 35 commits per second, so about 100 times faster. The fast-import/export format allows you to provide file contents in any order, independent of the order they appear in commits. The fastest way to get the contents of each revision out of an RCS file is from newest to oldest, so that is what cvs-fast-export does.

There are a couple of niggles with cvs-fast-export, so I have a patch which fixes them in a fairly dumb manner (without adding command-line switches to control the behaviour):

  • In RCS and CVS style, cvs-fast-export replaces empty commit messages with "*** empty log message ***", whereas I want it to leave them empty.
  • cvs-fast-export makes a special effort to translate CVS's ignored file behaviour into git by synthesizing a .gitignore file into every commit. This is wrong for the ip-register tree.
  • Exporting the hosts.131.111 file takes a long time, during which cvs-fast-export appears to stall. I added a really bad progress meter to indicate that work was being performed.

Wrapping up

Overall this has taken more programming than I expected, and more time, very much following the pattern that the last 10% takes the same time as the first 90%. And I think the initial investigations - before I got stuck in to the conversion work - probably took the same time again.

There is one area where the conversion could perhaps be improved: the archived dumps of various subdirectories have been converted in the location that the tar files were stored. I have not tried to incorporate them as part of the history of the directories from which the tar files were made. On the whole I think combining them, coping with renames and so on, would take too much time for too little benefit. The multiple copies of various ancient scripts are a bit weird, but it is fairly clear from the git history what was going on.

So, let us declare the job DONE, and move on to building new DNS servers!

(13 comments | Leave a comment)

Saturday 22nd November 2014

Nerdy trivia about Unix time_t

When I was running git-cvsimport yesterday (about which more another time), I wondered what were the nine-digit numbers starting with 7 that it was printing in its progress output. After a moment I realised they were the time_t values corresponding to the commit dates from the early 1990s.

Bert commented that he started using Unix when time_t values started with an 8, which made me wonder if there was perhaps a 26 year ambiguity - early 1970s or mid 1990s? (For the sake of pedantry - I don't really think Bert is that old!)

So I checked and time_t = 80,000,000 corresponds to 1972-07-14 22:13:20 and 90,000,000 corresponds to 1972-11-07 16:00:00. But I thought this was before modern time_t started.

Page 183 of this very large PDF of the 3rd Edition Unix manual says:

TIME (II)                  3/15/72                  TIME (II)

NAME         time -- get time of year

SYNOPSIS     sys time / time = 13
             (time r0-r1)

DESCRIPTION  time returns the time since 00:00:00, Jan. 1,
             1972, measured in sixtieths of a second.  The
             high order word is in the r0 register and the
             low order is in the r1.

SEE ALSO     date(I), mdate(II)


BUGS         The time is stored in 32 bits.  This guarantees
             a crisis every 2.26 years.

So back then the 800,000,000 - 900,000,000 period was about three weeks in June 1972.
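This arithmetic is easy to double-check with Python (a quick verification of my own; datetime happily handles the proleptic dates):

```python
from datetime import datetime, timedelta, timezone

epoch_1970 = datetime(1970, 1, 1, tzinfo=timezone.utc)  # modern time_t epoch
epoch_1972 = datetime(1972, 1, 1, tzinfo=timezone.utc)  # 3rd Edition epoch

# Modern time_t, counted in seconds:
print(epoch_1970 + timedelta(seconds=80_000_000))  # → 1972-07-14 22:13:20+00:00
print(epoch_1970 + timedelta(seconds=90_000_000))  # → 1972-11-07 16:00:00+00:00

# 3rd Edition time, counted in sixtieths of a second:
print(epoch_1972 + timedelta(seconds=800_000_000 / 60))  # early June 1972
print(epoch_1972 + timedelta(seconds=900_000_000 / 60))  # late June 1972

# And the promised crisis: 2**32 sixtieths is about 2.27 years.
print(2 ** 32 / 60 / 86400 / 365.25)
```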

The 4th Edition Unix manual (link to tar file of nroff source) says:

TIME (II)                   8/5/73                   TIME (II)

NAME         time -- get date and time

SYNOPSIS     (time = 13)
             sys  time
             int tvec[2];

DESCRIPTION  Time returns the time since 00:00:00 GMT, Jan. 1,
             1970, measured in seconds. From asm, the high
             order word is in the r0 register and the low
             order is in r1. From C, the user-supplied vector
             is filled in.

SEE ALSO     date(I), stime(II), ctime(III)


I think the date on that page is a reasonably accurate indicator of when the time_t format changed. In the Unix manual, each page has its own date, separate from the date on the published editions of the manual. So, for example, the 3rd Edition is dated February 1973, but its TIME(II) page is dated March 1972. However all the 4th Edition system call man pages have the same date, which suggests that part of the documentation was all revised together, and the actual changes to the code happened some time earlier.

Now, time_t = 100,000,000 corresponds to 1973-03-03 09:46:40, so it is pretty safe to say that the count of seconds since the epoch has always had nine or ten digits.

(1 comment | Leave a comment)


I recently saw FixedFixer on Hacker News. This is a bookmarklet which turns off CSS position:fixed, which makes a lot of websites less annoying. Particular offenders include Wired, Huffington Post, Medium, et cetera ad nauseam. A lot of them are unreadable on my old small phone because of all the crap they clutter up the screen with, but even on my new bigger phone the clutter is annoying. Medium's bottom bar is particularly vexing because it looks just like mobile Safari's bottom bar. Bah, thrice bah, and humbug! But, just run FixedFixer and the crap usually disappears.

The code I am using is very slightly adapted from the HN post. In readable form:

  (function(elements, elem, style, i) {
    elements = document.getElementsByTagName('*');
    for (i = 0; elem = elements[i]; i++) {
      style = getComputedStyle(elem);
      if (style && style.position == 'fixed')
        elem.style.position = 'static';
    }
  })();

Or, for the bookmarklet itself, the same code squashed onto one line behind a javascript: prefix.

Adding bookmarklets in iOS is a bit annoying because you can't edit the URL when adding a bookmark. You have to add a rubbish bookmark then as a separate step, edit it to replace the URL with the bookmarklet. Sigh.

I have since added a second bookmarklet, because when I was trying to read about AWS a large part of the text was off the right edge of the screen and they had disabled scrolling and zooming so it could not be read. How on earth can they publish a responsive website which does not actually work on a large number of phones?!

Anyway, OverflowFixer is the same as FixedFixer, but instead of changing position='fixed' to position='static', it changes overflow='hidden' to overflow='visible'.

When I mentioned these on Twitter, Tom said "Please share!" so this is a slightly belated reply. Do you have any bookmarklets that you particularly like? Write a comment!

(2 comments | Leave a comment)

Thursday 30th October 2014

The early days of the Internet in Cambridge

I'm currently in the process of uplifting our DNS development / operations repository from SCCS (really!) to git. This is not entirely trivial because I want to ensure that all the archival material is retained in a sensible way.

I found an interesting document from one of the oldest parts of the archive, which provides a good snapshot of academic computer networking in the UK in 1991. It was written by Tony Stonely, aka <ajms@cam.ac.uk>. AJMS is mentioned in RFC 1117 as the contact for Cambridge's IP address allocation. He was my manager when I started work at Cambridge in 2002, though he retired later that year.

The document is an email discussing IP connectivity for Cambridge's Institute of Astronomy. There are a number of abbreviations which might not be familiar...

  • Coloured Book: the JANET protocol suite
  • CS: the University Computing Service
  • CUDN: the Cambridge University Data Network
  • GBN: the Granta Backbone Network, Cambridge's duct and fibre infrastructure
  • grey: short for Grey Book, the JANET email protocol
  • IoA: the Institute of Astronomy
  • JANET: the UK national academic network
  • JIPS: the JANET IP service, which started as a pilot service early in 1991; IP traffic rapidly overtook JANET's native X.25 traffic, and JIPS became an official service in November 1991, about when this message was written
  • PSH: a member of IoA staff
  • RA: the Mullard Radio Astronomy Observatory, an outpost at Lords Bridge near Barton, where some of the dishes sit on the old Cambridge-Oxford railway line. (I originally misunderstood the reference, expanding it as the Rutherford Appleton Laboratory, a national research institute in Oxfordshire.)
  • RGO: The Royal Greenwich Observatory, which moved from Herstmonceux to the IoA site in Cambridge in 1990
  • Starlink: a UK national DECnet network linking astronomical research institutions

Edited to correct the expansion of RA and to add Starlink

    Connection of IoA/RGO to IP world

This note is a statement of where I believe we have got to and an initial
review of the options now open.

What we have achieved so far

All the Suns are properly connected at the lower levels to the
Cambridge IP network, to the national IP network (JIPS) and to the
international IP network (the Internet). This includes all the basic
infrastructure such as routing and name service, and allows the Suns
to use all the usual native Unix communications facilities (telnet,
ftp, rlogin etc) except mail, which is discussed below. Possibly the
most valuable end-user function thus delivered is the ability to fetch
files directly from the USA.

This also provides the basic infrastructure for other machines such as
the VMS hosts when they need it.

VMS nodes

Nothing has yet been done about the VMS nodes. CAMV0 needs its address
changing, and both IOA0 and CAMV0 need routing set for extra-site
communication. The immediate intention is to route through cast0. This
will be transparent to all parties and impose negligible load on
cast0, but requires the "doit" bit to be set in cast0's kernel. We
understand that PSH is going to do all this [check], but we remain
available to assist as required.

Further action on the VMS front is stalled pending the arrival of the
new release (6.6) of the CMU TCP/IP package. This is so imminent that
it seems foolish not to await it, and we believe IoA/RGO agree [check].

Access from Suns to Coloured Book world

There are basically two options for connecting the Suns to the JANET
Coloured Book world. We can either set up one or more of the Suns as
full-blown independent JANET hosts or we can set them up to use CS
gateway facilities. The former provides the full range of facilities
expected of any JANET host, but is cumbersome, takes significant local
resources, is complicated and long-winded to arrange, incurs a small
licence fee, is platform-specific, and adds significant complexity to
the system managers' maintenance and planning load. The latter in
contrast is light-weight, free, easy to install, and can be provided
for any reasonable Unix host, but limits functionality to outbound pad
and file transfer either way initiated from the local (IoA/RGO) end.
The two options are not exclusive.

We suspect that the latter option ("spad/cpf") will provide adequate
functionality and is preferable, but would welcome IoA/RGO opinion.

Direct login to the Suns from a (possibly) remote JANET/CUDN terminal
would currently require the full Coloured Book package, but the CS
will shortly be providing X.29-telnet gateway facilities as part of
the general infrastructure, and can in any case provide this
functionality indirectly through login accounts on Central Unix
facilities. For that matter, AST-STAR or WEST.AST could be used in
this fashion.


Mail

Mail is a complicated and difficult subject, and I believe that a
small group of experts from IoA/RGO and the CS should meet to discuss
the requirements and options. The rest of this section is merely a
fleeting summary of some of the issues.

Firstly, a political point must be clarified. At the time of writing
it is absolutely forbidden to emit smtp (ie Unix/Internet style) mail
into JIPS. This prohibition is national, and none of Cambridge's
doing. We expect that the embargo will shortly be lifted somewhat, but
there are certain to remain very strict rules about how smtp is to be
used. Within Cambridge we are making best guesses as to the likely
future rules and adopting those as current working practice. It must
be understood however that the situation is highly volatile and that
today's decisions may turn out to be wrong.

The current rulings are (inter alia)

        Mail to/from outside Cambridge may only be grey (Ie. JANET

        Mail within Cambridge may be grey or smtp BUT the reply
        address MUST be valid in BOTH the Internet AND Janet (modulo
        reversal). Thus a workstation emitting smtp mail must ensure
        that the reply address contained is that of a current JANET
        mail host. Except that -

        Consenting machines in a closed workgroup in Cambridge are
        permitted to use smtp between themselves, though there is no
        support from the CS and the practice is discouraged. They
        must remember not to contravene the previous two rulings, on
        pain of disconnection.

The good news is that a central mail hub/distributer will become
available as a network service for the whole University within a few
months, and will provide sufficient gateway function that ordinary
smtp Unix workstations, with some careful configuration, can have full
mail connectivity. In essence the workstation and the distributer will
form one of those "closed workgroups", the workstation will send all
its outbound mail to the distributer and receive all its inbound mail
from the distributer, and the distributer will handle the forwarding
to and from the rest of Cambridge, UK and the world.

There is no prospect of DECnet mail being supported generally either
nationally or within Cambridge, but I imagine Starlink/IoA/RGO will
continue to use it for the time being, and whatever gateway function
there is now will need preserving. This will have to be largely
IoA/RGO's own responsibility, but the planning exercise may have to
take account of any further constraints thus imposed. Input from
IoA/RGO as to the requirements is needed.

In the longer term there will probably be a general UK and worldwide
shift to X.400 mail, but that horizon is probably too hazy to rate more
than a nod at present. The central mail switch should in any case hide
the initial impact from most users.

The times are therefore a'changing rather rapidly, and some pragmatism
is needed in deciding what to do. If mail to/from the IP machines is
not an urgent requirement, and since they will be able to log in to
the VMS nodes it may not be, then the best thing may well be to await
the mail distributer service. If more direct mail is needed more
urgently then we probably need to set up a private mail distributer
service within IoA/RGO. This would entail setting up (probably) a Sun
as a full JANET host and using it as the one and only (mail) route in
or out of IoA/RGO. Something rather similar has been done in Molecular
Biology and is thus known to work, but setting it up is no mean task.
A further fall-back option might be to arrange to use Central Unix
facilities as a mail gateway in similar vein. The less effort spent on
interim facilities the better, however.

Broken mail

We discovered late in the day that smtp mail was in fact being used
between IoA and RA, and the name changing broke this. We regret having
thus trodden on existing facilities, and are willing to help try to
recover any required functionality, but we believe that IoA/RGO/RA in
fact have this in hand. We consider the activity to fall under the
third rule above. If help is needed, please let us know.

We should also report a sideline problem we encountered and which will
probably be a continuing cause of grief. CAVAD, and indeed any similar
VMS system, emits mail with reply addresses of the form
"CAVAD::user"@....  This is quite legal, but the quotes are
syntactically significant, and must be returned in any reply.
Unfortunately the great majority of Unix systems strip such quotes
during emission of mail, so the reply address fails. Such stripping
can occur at several levels, notably the sendmail (ie system)
processing and one of the most popular user-level mailers. The CS
is fixing its own systems, but the problem is replicated in something
like half a million independent Internet hosts, and little can be done
about it.

Other requirements

There may well be other requirements that have not been noticed or,
perish the thought, we have inadvertently broken. Please let us know
of these.

Bandwidth improvements

At present all IP communications between IoA/RGO and the rest of the
world go down a rather slow (64Kb/sec) link. This should improve
substantially when it is replaced with a GBN link, and to most of
Cambridge the bandwidth will probably become 1-2Mb/sec. For comparison,
the basic ethernet bandwidth is 10Mb/sec. The timescale is unclear, but
sometime in 1992 is expected. The bandwidth of the national backbone
facilities is of the order of 1Mb/sec, but of course this is shared with
many institutions in a manner hard to predict or assess.

For Computing Service,
Tony Stoneley, ajms@cam.cus
(6 comments | Leave a comment)

Wednesday 15th October 2014

POP, IMAP, SMTP, and the POODLE SSLv3.0 vulnerability.

A lot of my day has been spent on the POODLE vulnerability. For details see the original paper, commentary by Daniel Franke, Adam Langley, Robert Graham, and the POODLE.io web page of stats and recommendations.

One thing I have been investigating is to what extent mail software uses SSLv3. The best stats we have come from our message submission server, smtp.hermes, which logs TLS versions and cipher suites and (when possible) User-Agent and X-Mailer headers. (The logging from our POP and IMAP servers is not so good, partly because we don't request or log user agent declarations, and even if we did most clients wouldn't provide them.)

Nearly 100 of our users are using SSLv3, which is about 0.5% of them. The main culprits seem to be Airmail, Evolution, and most of all Android. Airmail is a modern Mac MUA, so in that case I guess it is a bug or misuse of the TLS API. For Evolution my guess is that it has a terrible setup user interface (all MUAs have terrible setup user interfaces) and users are choosing "SSL 3.0" rather than "TLS 1.0" because the number is bigger. In the case of Android I don't have details of version numbers because Android mail software doesn't include user-agent headers (unlike practically everything else), but I suspect old unsupported smart-ish phones running bad Java are to blame.

I haven't decided exactly what we will do to these users yet. However we have the advantage that POODLE seems to be a lot less bad for non-web TLS clients.

The POODLE padding oracle attack requires a certain amount of control over the plaintext which the attacker is trying to decrypt. Specifically:

  1. The plaintext plus MAC has to be an exact multiple of the cipher block size;
  2. It must be possible to move the secret (cookie or password) embedded in the plaintext by a byte at a time to scan it past a block boundary.
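Condition 1 is just modular arithmetic. A toy sketch of it - illustrative only, assuming a 20-byte SHA-1 MAC and 16-byte AES-CBC blocks, and certainly not an attack implementation:

```python
def padding_filler(request_len, mac_len=20, block=16):
    """How many extra plaintext bytes make request + MAC an exact
    multiple of the cipher block size, so that the final CBC block
    consists entirely of padding.

    Toy arithmetic for POODLE's condition 1; mac_len assumes a
    SHA-1 MAC and block assumes AES-CBC.
    """
    return -(request_len + mac_len) % block
```

The attacker's JavaScript pads the POST body by that many bytes, then grows the request path one byte at a time to satisfy condition 2.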

In the web situation, the attacker can use JavaScript served from anywhere to make repeated POST requests to an arbitrary target host. The JS can manipulate the body of the POST to control the overall length of the request, and can manipulate the request path to control the position of the cookie in the headers.

In the mail situation (POP, IMAP, SMTP), the attacker can make the client retry requests repeatedly by breaking the connection, but they cannot control the size or framing of the client's authentication command.

So I think we have the option of not worrying too much if forced upgrades turn out to be too painful, though I would prefer not to go that route - it makes me feel uncomfortably complacent.

(5 comments | Leave a comment)

Monday 14th July 2014

Data structures and algorithms

"Bad programmers worry about the code. Good programmers worry about data structures and their relationships." - Linus Torvalds

"If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming." - Rob Pike

"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious." - Fred Brooks
(7 comments | Leave a comment)

Wednesday 14th May 2014

Dilbert feeds

I just noticed that Dilbert had its 25th anniversary last month. I have created an atom feed of old strips to mark this event. To avoid leap year pain the feed contains strips from the same date 24 years ago, rather than 25 years ago. See dilbert_zoom for current strips and dilbert_24 for the old ones.
(2 comments | Leave a comment)

Tuesday 25th March 2014

Update to SSHFP tutorial

I have updated yesterday's article on how to get SSHFP records, DNSSEC, and VerifyHostKeyDNS=yes to work.

I have re-ordered the sections to avoid interrupting the flow of the instructions with chunks of background discussion.

I have also added a section discussing the improved usability vs weakened security of the RRSET_FORCE_EDNS0 patch in Debian and Ubuntu.

(Leave a comment)

Monday 24th March 2014

SSHFP tutorial: how to get SSHFP records, DNSSEC, and VerifyHostKeyDNS=yes to work.

One of the great promises of DNSSEC is to provide a new public key infrastructure for authenticating Internet services. If you are a relatively technical person you can try out this brave new future now with ssh.

There are a couple of advantages to getting SSHFP host authentication working. Firstly you get easier-to-use security, since you no longer need to rely on manual host authentication - or better security, for the brave or foolhardy who trust leap-of-faith authentication. Secondly, it becomes feasible to do host key rollovers, since you only need to update the DNS - the host's key is no longer wired into thousands of known_hosts files. (You can probably also get the latter benefit using ssh certificate authentication, but why set up another PKI if you already have one?)

In principle it should be easy to get this working but there are a surprising number of traps and pitfalls. So in this article I am going to try to explain all the whys and wherefores, which unfortunately means it is going to be long, but I will try to make it easy to navigate. In the initial version of this article I am just going to describe what the software does by default, but I am happy to add specific details and changes made by particular operating systems.

The outline of what you need to do on the server is:

  • Sign your DNS zone. I will not cover that in this article.
  • Publish SSHFP records in the DNS

The client side is more involved. There are two versions, depending on whether ssh has been compiled to use ldns or not. Run ldd $(which ssh) to see if it is linked with libldns.

  • Without ldns:
    • Install a validating resolver (BIND or Unbound)
    • Configure the stub resolver /etc/resolv.conf
    • Configure ssh
  • With ldns:
    • Install unbound-anchor
    • Configure the stub resolver /etc/resolv.conf
    • Configure ssh

Publish SSHFP records in the DNS

Generating SSHFP records is quite straightforward:

    demo:~# cd /etc/ssh
    demo:/etc/ssh# ssh-keygen -r $(hostname)
    demo IN SSHFP 1 1 21da0404294d07b940a1df0e2d7c07116f1494f9
    demo IN SSHFP 1 2 3293d4c839bfbea1f2d79ab1b22f0c9e0adbdaeec80fa1c0879dcf084b72e206
    demo IN SSHFP 2 1 af673b7beddd724d68ce6b2bb8be733a4d073cc0
    demo IN SSHFP 2 2 953f24d775f64ff21f52f9cbcbad9e981303c7987a1474df59cbbc4a9af83f6b
    demo IN SSHFP 3 1 f8539cfa09247eb6821c645970b2aee2c5506a61
    demo IN SSHFP 3 2 9cf9ace240c8f8052f0a6a5df1dea4ed003c0f5ecb441fa2c863034fddd37dc9

Put these records in your zone file, or you can convert them into an nsupdate script with a bit of seddery:

    ssh-keygen -r $(hostname -f) |
        sed 's/^/update add /;s/ IN / 3600 IN /;/ SSHFP . 1 /d;'
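To feed the result straight into a dynamic update, the script only needs a trailing send line. A sketch (my own wrapper around the same seddery, assuming the zone permits local updates as with BIND's update-policy local):

```shell
# Sketch: turn ssh-keygen -r output into a complete nsupdate script.
# Assumes the zone allows local dynamic updates (update-policy local).
sshfp_update() {
    sed 's/^/update add /;s/ IN / 3600 IN /;/ SSHFP . 1 /d;'
    echo send
}
# Usage (not run here):
#   ssh-keygen -r "$(hostname -f)" | sshfp_update | nsupdate -l
```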

The output of ssh-keygen -r includes hashes in both SHA1 and SHA256 format (the shorter and longer hashes). You can discard the SHA1 hashes.

It includes hashes for the different host key authentication algorithms:

  • 1: ssh-rsa
  • 2: ssh-dss
  • 3: ecdsa

I believe ecdsa covers all three key sizes for which OpenSSH uses separate algorithm names: ecdsa-sha2-nistp256, ecdsa-sha2-nistp384, and ecdsa-sha2-nistp521.

OpenSSH supports other host key authentication algorithms, but unfortunately they cannot be authenticated using SSHFP records because they do not have algorithm numbers allocated.

The problem is actually worse than that, because most of the extra algorithms are the same as the three listed above, but with added support for certificate authentication. The ssh client is able to fall back from certificate to plain key authentication, but a bug in this fallback logic breaks SSHFP authentication.

So I recommend that if you want to use SSHFP authentication your server should only have host keys of the three basic algorithms listed above.

NOTE (added 26-Nov-2014): if you are running OpenSSH-5, it does not support ECDSA SSHFP records. So if your servers need to support older clients, you might want to stick to just RSA and DSA host keys.

You are likely to have an SSHFP algorithm compatibility problem if you get the message: "Error calculating host key fingerprint."

NOTE (added 12-Mar-2015): if you are running OpenSSH-6.7 or later, it has support for Ed25519 SSHFP records which use algorithm number 4.

Install a validating resolver

To be safe against active network interception attacks you need to do DNSSEC validation on the same machine as your ssh client. If you don't do this, you can still use SSHFP records to provide a marginal safety improvement for leap-of-faith users. In this case I recommend using VerifyHostKeyDNS=ask to reinforce to the user that they ought to be doing proper manual host authentication.

If ssh is not compiled to use ldns then you need to run a local validating resolver, either BIND or Unbound.

If ssh is compiled to use ldns, it can do its own validation, and you do not need to install BIND or Unbound.

Run ldd $(which ssh) to see if it is linked with libldns.

Install a validating resolver - BIND

The following configuration will make named run as a local validating recursive server. It just takes the defaults for everything, apart from turning on validation. It automatically uses BIND's built-in copy of the root trust anchor.


    options {
        dnssec-validation auto;
        dnssec-lookaside auto;
    };

Install a validating resolver - Unbound

Unbound comes with a utility unbound-anchor which sets up the root trust anchor for use by the unbound daemon. You can then configure unbound as follows, which takes the defaults for everything apart from turning on validation using the trust anchor managed by unbound-anchor.


    server:
        auto-trust-anchor-file: "/var/lib/unbound/root.key"

Install a validating resolver - dnssec-trigger

If your machine moves around a lot to dodgy WiFi hot spots and hotel Internet connections, you may find that the nasty middleboxes break your ability to validate DNSSEC. In that case you can use dnssec-trigger, which is a wrapper around Unbound which knows how to update its configuration when you connect to different networks, and which can work around braindamaged DNS proxies.

Configure the stub resolver - without ldns

If ssh is compiled without ldns, you need to add the following line to /etc/resolv.conf; beware your system's automatic resolver configuration software, which might be difficult to persuade to leave resolv.conf alone.

    options edns0

For testing purposes you can add RES_OPTIONS=edns0 to ssh's environment.

On some systems (including Debian and Ubuntu), ssh is patched to force EDNS0 on, so that you do not need to set this option. See the section on RRSET_FORCE_EDNS0 below for further discussion.

Configure the stub resolver - with ldns

If ssh is compiled with ldns, you need to run unbound-anchor to maintain a root trust anchor, and add something like the following line to /etc/resolv.conf

    anchor /var/lib/unbound/root.key

Run ldd $(which ssh) to see if it is linked with libldns.

Configure ssh

After you have done all of the above, you can add the following to your ssh configuration, either /etc/ssh/ssh_config or ~/.ssh/config

    VerifyHostKeyDNS yes

Then when you connect to a host for the first time, it should go straight to the Password: prompt, without asking for manual host authentication.

If you are not using certificate authentication, you might also want to disable the certificate host key algorithms. This is because ssh prefers them, and if you connect to a host that offers a more preferred algorithm, ssh will try that and ignore the DNS. This is not very satisfactory; hopefully it will improve when the bug is fixed.

    HostKeyAlgorithms ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-rsa,ssh-dss


Check that your resolver is validating and getting a secure result for your host. Run the following command and check for "ad" in the flags. If it is not there then either your resolver is not validating, or /etc/resolv.conf is not pointing at the validating resolver.

    $ dig +dnssec <hostname> sshfp | grep flags
    ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
    ; EDNS: version: 0, flags: do; udp: 4096

See if ssh is seeing the AD bit. Use ssh -v and look for messages about secure or insecure fingerprints in the DNS. If you are getting secure answers via dig but ssh is not, perhaps you are missing "options edns0" from /etc/resolv.conf.

    debug1: found 6 secure fingerprints in DNS
    debug1: matching host key fingerprint found in DNS

Try using specific host key algorithms, to see if ssh is trying to authenticate a key which does not have an SSHFP record of the corresponding algorithm.

    $ ssh -o HostKeyAlgorithms=ssh-rsa <hostname>


What I use is essentially:


    options {
        dnssec-validation auto;
        dnssec-lookaside auto;
    };


    options edns0


    VerifyHostKeyDNS yes

Background on DNSSEC and non-validating stub resolvers

When ssh is not compiled to use ldns, it has to trust the recursive DNS server to validate SSHFP records, and it trusts that the connection to the recursive server is secure. To find out if an SSHFP record is securely validated, ssh looks at the AD bit in the DNS response header - AD stands for "authenticated data".

A resolver will not set the AD bit based on the security status of the answer unless the client asks for it. There are two ways to do that. The simple way (from the perspective of the DNS protocol) is to set the AD bit in the query, which gets you the AD bit in the reply without other side-effects. Unfortunately the standard resolver API makes it very hard to do this, so it is only simple in theory.

The other way is to add an EDNS0 OPT record to the query with the DO bit set - DO stands for "DNSSEC OK". This has a number of side-effects: EDNS0 allows large UDP packets which provide the extra space needed by DNSSEC, and DO makes the server send back the extra records required by DNSSEC such as RRSIGs.

Adding "options edns0" to /etc/resolv.conf only tells it to add the EDNS0 OPT record - it does not enable DNSSEC. However ssh itself observes whether EDNS0 is turned on, and if so also turns on the DO bit.
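Both routes can be tried with dig (demo.example.com is a placeholder); the live queries are shown as comments since they need a working resolver, and the sample flags line is the one from earlier in this article:

```shell
# Two ways to ask the resolver for the AD bit (demo.example.com is a
# placeholder; these need a live validating resolver, so shown commented):
#   dig +adflag +noedns demo.example.com sshfp   # AD bit in the query header
#   dig +dnssec demo.example.com sshfp           # EDNS0 OPT record with DO set
# Either way, a validated answer carries "ad" in the flags line:
flags=';; flags: qr rd ra ad; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1'
case $flags in
    *' ad;'*|*' ad '*) echo validated ;;
    *)                 echo insecure  ;;
esac
```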

Regarding "options edns0" vs RRSET_FORCE_EDNS0

At first it might seem annoying that ssh makes you add "options edns0" to /etc/resolv.conf before it will ask for DNSSEC results. In fact on some systems, ssh is patched to add a DNS API flag called RRSET_FORCE_EDNS0 which forces EDNS0 and DO on, so that you do not need to explicitly configure the stub resolver. However although this seems more convenient, it is less safe.

If you are using the standard portable OpenSSH then you can safely set VerifyHostKeyDNS=yes, provided your stub resolver is configured correctly. The rule you must follow is to only add "options edns0" if /etc/resolv.conf is pointing at a local validating resolver. SSH is effectively treating "options edns0" as a signal that it can trust the resolver. If you keep this rule you can change your resolver configuration without having to reconfigure ssh too; it will automatically fall back to VerifyHostKeyDNS=ask when appropriate.
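That rule can even be checked mechanically. A sketch (my own illustration, not part of OpenSSH): treat only loopback nameserver entries as trustworthy before enabling the option.

```shell
# Sketch: only allow "options edns0" when every nameserver line in
# resolv.conf points at a loopback address, i.e. a local validating resolver.
check_local_resolver() {
    awk '$1 == "nameserver" && $2 !~ /^(127\.|::1$)/ {
             print "warning: non-local resolver " $2; bad = 1
         }
         END { exit bad }'
}
# Usage (not run here):
#   check_local_resolver < /etc/resolv.conf && echo "options edns0 is safe here"
```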

If you are using a version of ssh with the RRSET_FORCE_EDNS0 patch (such as Debian and Ubuntu) then it is sometimes NOT SAFE to set VerifyHostKeyDNS=yes. With this patch ssh has no way to tell if the resolver is trustworthy or if it should fall back to VerifyHostKeyDNS=ask; it will blindly trust a remote validating resolver, which leaves you vulnerable to MitM attacks. On these systems, if you reconfigure your resolver, you may also have to reconfigure ssh in order to remain safe.

Towards the end of February there was a discussion on an IETF list about stub resolvers and DNSSEC which revolved around exactly this question of how an app can tell if it is safe to trust the AD bit from the recursive DNS server.

One proposal was for the stub resolver to strip the AD bit in replies from untrusted servers, which (if it were implemented) would allow ssh to use the RRSET_FORCE_EDNS0 patch safely. However this proposal means you have to tell the resolver if the server is trusted, which might undo the patch's improved convenience. There are ways to avoid that, such as automatically trusting resolvers running on the local host, and perhaps having a separate configuration file listing trusted resolvers, e.g. those reachable over IPSEC.

(1 comment | Leave a comment)

Wednesday 19th February 2014

Relative frequency of initial letters of TLDs

Compare Wikipedia's table of the relative frequencies of initial letters in English.

$ dig axfr . @f.root-servers.net |
  perl -ne '
	next unless /^(([a-z])[a-z0-9-]+)[.][ \t]/;
	$label{$1} = 1; $letter{$2}++; $total++;
	END {
		for my $x (sort keys %letter) {
			my $p = 100.0*$letter{$x}/$total;
			printf "<tr><td>$x</td>
				<td align=right>%5.2f</td>
				<td><span style=\"
					display: inline-block;
					background-color: gray;
					width: %d;\">
				</span></td></tr>\n",
			    $p, 32*$p;
		}
	}'
a 5.24 
b 5.96 
d 2.45 
e 3.28 
f 2.43 
g 5.03 
h 1.78 
i 3.38 
j 1.14 
k 2.71 
l 3.74 
m 6.69 
n 4.26 
o 0.72 
p 5.68 
q 0.65 
r 2.81 
s 6.15 
t 6.02 
u 1.91 
v 2.66 
w 1.94 
y 0.41 
z 0.75 
(6 comments | Leave a comment)

Wednesday 29th January 2014

Diffing dynamic raw zone files in git with BIND 9.10

On my toy nameserver my master zones are configured with a directory for each zone. In this directory is a "conf" file which is included by the nameserver's main configuration file; a "master" file containing the zone data in the raw binary format; a "journal" file recording changes to the zone from dynamic UPDATEs and re-signing; and DNSSEC and TSIG keys. The "conf" file looks something like this:

    zone dotat.at {
        type master;
        file "/zd/dotat.at/master";
        journal "/zd/dotat.at/journal";
        key-directory "/zd/dotat.at";
        masterfile-format raw;
        auto-dnssec maintain;
        update-policy local;
    };

I must have been having a fit of excessive tidiness when I was setting this up, because although it looks quite neat, the unusual file names cause some irritation - particularly for the journal. Some of the other BIND tools assume that journal filenames are the same as the master file with a .jnl extension.

I keep the name server configuration in git. This is a bit awkward because the configuration contains precious secrets (DNSSEC private keys), and the zone files are constantly-changing binary data. But it is useful for recording manual changes, since the zone files don't have comments explaining their contents. I don't make any effort to record the re-signing churn, though I commit it when making other changes.

To reduce the awkwardness I configured git to convert zone files to plain text when diffing them, so I had a more useful view of the repository. There are three parts to setting this up.

  • Tell git that the zone files require a special diff driver, which I gave the name "bind-raw".

    All the zone files in the repository are called "master", so in the .gitattributes file at the top of the repository I have the line

        master diff=bind-raw
  • The diff driver is part of the repository configuration. (Because the implementation of the driver is a command, it isn't safe to set it up automatically in a repository clone the way .gitattributes settings are, so it has to be configured separately.) So add the following lines to .git/config
        [diff "bind-raw"]
            textconv = etc/raw-to-text
  • The final part is the raw-to-text script, which lives in the repository.

This is where journal file names get irritating. You can convert a raw master file into the standard text format with named-compilezone, which has a -j option to read the zone's journal, but this assumes that the journal has the default file name with the .jnl extension. So it doesn't quite work in my setup.

(It also doesn't quite work on the University's name servers which have a directory for master files and a directory for journal files.)

So in September 2012 I patched BIND to add a -J option to named-compilezone for specifying the journal file name. I have a number of other small patches to my installation of BIND, and this one was very simple, so in it went. (It would have been much more sensible to change my nameserver configuration to go with the flow...)

The patch allowed me to write the raw-to-text script as follows. The script runs named-compilezone twice: the first time with a bogus zone name, which causes named-compilezone to choke with an error message. This helpfully contains the real zone name, which the script extracts then uses to invoke named-compilezone correctly.

    #!/bin/sh
    # git passes the file to be converted as the first argument
    file="$1"
    command="named-compilezone -f raw -F text -J journal -o /dev/stdout"
    zone="$($command . "$file" 2>&1)"
    zone="${zone#*: }"
    $command "$zone" "$file" 2>/dev/null

I submitted the -J patch to the ISC and got a favourable response from Evan Hunt. At that time (16 months ago) BIND was at version 9.9.2; since this option was a new feature (and unimportant) it was added to the 9.10 branch. Wind forward 14 months to November 2013 and the first alpha release of 9.10 came out, with the -J option, so I was able to retire my patch. Woo! There will be a beta release of 9.10 in a few weeks.

In truth, you don't need BIND 9.10 to use this git trick, if you use the default journal file names. The key thing to make it simple is to give all your master files the same name, so that you don't have to list them all in your .gitattributes.

(Leave a comment)

Tuesday 3rd December 2013

A weird BIND DNSSEC resolution bug, with a fix.

The central recursive DNS servers in Cambridge act as stealth slaves for most of our local zones, and we recommend this configuration for other local DNS resolvers. This has the slightly odd effect that the status bits in answers have AD (authenticated data) set for most DNSSEC signed zones, except for our local ones which have AA (authoritative answer) set. This is not a very big deal since client hosts should do their own DNSSEC validation and ignore any AD bits they get over the wire.

It is a bit more of a problem for the toy nameserver I run on my workstation. As well as being my validating resolver, it is also the master for my personal zones, and it slaves some of the Cambridge zones. This mixed recursive / authoritative setup is not really following modern best practices, but it's OK when I am the only user, and it makes experimental playing around easier. Still, I wanted it to validate answers from its authoritative zones, especially because there's no security on the slave zone transfers.

I had been procrastinating this change because I thought the result would be complicated and ugly. But last week one of the BIND developers, Mark Andrews, posted a description of how to validate slaved zones to the dns-operations list, and it turned out to be reasonably OK - no need to mess around with special TSIG keys to get queries from one view to another.

The basic idea is to have one view that handles recursive queries and which validates all its answers, and another view that holds the authoritative zones and which only answers non-recursive queries. The recursive view has "static-stub" zone configurations mirroring all of the zones in the authoritative view, to redirect queries to the local copies.

Here's a simplified version of the configuration I tried out. To make it less annoying to maintain, I wrote a script to automatically generate the static-stub configurations from the authoritative zones.

  view rec {
    match-recursive-only yes;
    zone cam.ac.uk         { type static-stub; server-addresses { ::1; }; };
    zone private.cam.ac.uk { type static-stub; server-addresses { ::1; }; };
  };

  view auth {
    recursion no;
    allow-recursion { none; };
    zone cam.ac.uk         { type slave; file "cam";  masters { ucam; }; };
    zone private.cam.ac.uk { type slave; file "priv"; masters { ucam; }; };
  };
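The generation script itself isn't shown above; something along these lines would do it (a sketch of my own, not the actual script, assuming each zone statement sits on one line):

```shell
# Sketch: emit a static-stub stanza for each zone in the auth view,
# pointing the recursive view at the local server on ::1.
# Assumes the auth view's config keeps each zone statement on one line.
gen_stubs() {
    awk '$1 == "zone" && /type (slave|master)/ {
        printf "    zone %s { type static-stub; server-addresses { ::1; }; };\n", $2
    }'
}
# Usage (not run here): gen_stubs < auth-view.conf
```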

This seemed to work fine, until I tried to resolve names in private.cam.ac.uk - then I got a server failure. In my logs was the following (which I have slightly abbreviated):

  client ::1#55687 view rec: query: private.cam.ac.uk IN A +E (::1)
  client ::1#60344 view auth: query: private.cam.ac.uk IN A -ED (::1)
  client ::1#54319 view auth: query: private.cam.ac.uk IN DS -ED (::1)
  resolver: DNS format error from ::1#53 resolving private.cam.ac.uk/DS:
    Name cam.ac.uk (SOA) not subdomain of zone private.cam.ac.uk -- invalid response
  lame-servers: error (FORMERR) resolving 'private.cam.ac.uk/DS/IN': ::1#53
  lame-servers: error (no valid DS) resolving 'private.cam.ac.uk/A/IN': ::1#53
  query-errors: client ::1#55687 view rec:
    query failed (SERVFAIL) for private.cam.ac.uk/IN/A at query.c:7435

You can see the original recursive query that I made, then the resolver querying the authoritative view to get the answer and validate it. The situation here is that private.cam.ac.uk is an unsigned zone, so a DNSSEC validator has to check its delegation in the parent zone cam.ac.uk and get a proof that there is no DS record, to confirm that it is OK for private.cam.ac.uk to be unsigned. Something is going wrong with BIND's attempt to get this proof of nonexistence.

When BIND gets a non-answer it has to classify it as a referral to another zone or an authoritative negative answer, as described in RFC 2308 section 2.2. It is quite strict in its sanity checks, in particular it checks that the SOA record refers to the expected zone. This check often discovers problems with misconfigured DNS load balancers which are given a delegation for www.example.com but which think their zone is example.com, leading them to hand out malformed negative responses to AAAA queries.

This negative answer SOA sanity check is what failed in the above log extract. Very strange - the resolver seems to be looking for the private.cam.ac.uk DS record in the private.cam.ac.uk zone, not the cam.ac.uk zone, so when it gets an answer from the cam.ac.uk zone it all goes wrong. Why is it looking in the wrong place?

In fact the same problem occurs for the cam.ac.uk zone itself, but in this case the bug turns out to be benign:

  client ::1#16276 view rec: query: cam.ac.uk IN A +E (::1)
  client ::1#65502 view auth: query: cam.ac.uk IN A -ED (::1)
  client ::1#61409 view auth: query: cam.ac.uk IN DNSKEY -ED (::1)
  client ::1#51380 view auth: query: cam.ac.uk IN DS -ED (::1)
  security: client ::1#51380 view auth: query (cache) 'cam.ac.uk/DS/IN' denied
  lame-servers: error (chase DS servers) resolving 'cam.ac.uk/DS/IN': ::1#53

You can see my original recursive query, and the resolver querying the authoritative view to get the answer and validate it. But it sends the DS query to itself, not to the name servers for the ac.uk zone. When this query fails, BIND re-tries by working down the delegation chain from the root, and this succeeds so the overall query and validation works despite tripping up.

This bug is not specific to the weird two-view setup. If I revert to my old configuration, without views, and just slaving cam.ac.uk and private.cam.ac.uk, I can trigger the benign version of the bug by directly querying for the cam.ac.uk DS record:

  client ::1#30447 (cam.ac.uk): query: cam.ac.uk IN DS +E (::1)
  lame-servers: error (chase DS servers) resolving 'cam.ac.uk/DS/IN':

In this case the resolver sent the upstream DS query to one of the authoritative servers for cam.ac.uk, and got a negative response from the cam.ac.uk zone apex, as described in RFC 4035. This did not fail the SOA sanity check, but it did trigger the fall-back walk down the delegation chain.

In the simple slave setup, queries for private.cam.ac.uk do not fail because they are answered from authoritative data without going through the resolver. If you change the zone configurations from slave to stub or static-stub then the resolver is used to answer queries for names in those zones, and so queries for private.cam.ac.uk explode messily as BIND tries really hard (128 times!) to get a DS record from all the available name servers but keeps checking the wrong zone.

I spent some time debugging this on Friday evening, which mainly involved adding lots of logging statements to BIND's resolver to work out what it thought it was doing. Much confusion and headscratching and eventually understanding.

BIND has some functions called findzonecut() which take an option to determine whether it wants the child zone or the parent zone. This works OK for dns_db_findzonecut() which looks in the cache, but dns_view_findzonecut() gets it wrong. This function works out whether to look for the name in a locally-configured zone, and if so which one, or otherwise in the cache, or otherwise work down from the root hints. In the case of a locally-configured zone it ignores the option and always returns the child side of the zone cut. This causes the resolver to look for DS records in the wrong place, hence all the breakage described above.

I worked out a patch to fix this DS record resolution problem, and I have sent details of the bug and my fix to bind9-bugs@isc.org. And I now have a name server that correctly validates its authoritative zones :-)

(1 comment | Leave a comment)

Wednesday 13th November 2013

Temporum: secure time: a paranoid fantasy

Imagine that...

Secure NTP is an easy-to-use and universally deployed protocol extension...

The NTP pool is dedicated to providing accurate time from anyone to everyone, securely...

NIST, creators of some of the world's best clocks and keepers of official time for the USA, decide that the NTP pool is an excellent project which they would like to help. They donate machines, install them around the world, and dedicate them to providing time as part of the NTP pool. Their generous funding allows them to become a large and particularly well-connected proportion of the pool.

In fact NIST is a sock-puppet of the NSA. Their time servers are modified so that they are as truthful and as accurate as possible to everyone, except those the US government decides it does not like.

The NSA has set up a system dedicated to replay attacks. They cause occasional minor outages and screwups in various cryptographic systems - certificate authorities, DNS registries - which seem to be brief and benign when they happen, but no-one notices that the bogus invalid certificates and DNS records all have validity periods covering a particular point in time.

Now the NSA can perform a targeted attack, in which they persuade the victim to reboot, perhaps out of desperation because nothing works and they don't understand denial-of-service attacks. The victim's machine reboots, and it tries to get the time from the NTP pool. The NIST sock-puppet servers all lie to it. The victim's machine believes the time is in the NSA replay attack window. It trustingly fetches some crucial "software update" from a specially-provisioned malware server, which both its DNSSEC and X.509 PKIs say is absolutely kosher. It becomes comprehensively pwned by the NSA.

How can we provide the time in a way that is secure against this attack?

(Previously, previously)

(15 comments | Leave a comment)

Monday 11th November 2013

Security considerations for temporum: quorate secure time

The security of temporum is based on the idea that you can convince yourself that several different sources agree on what the time is, with the emphasis on different. Where are the weaknesses in the way it determines if sources are different?

The starting point for temporum is a list of host names to try. It is OK if lots of them fail (e.g. because your device has been switched off on a shelf for years) provided you have a good chance of eventually getting a quorum.

The list of host names is very large, and temporum selects candidates from the list at random. This makes it hard for an attacker to target the particular infrastructure that temporum might use. I hope your device is able to produce decent random numbers immediately after booting!

The list of host names is statically configured. This is important to thwart Sybil attacks: you don't want an attacker to convince you to try a list of apparently-different host names which are all under the attacker's control. Question: can the host list be made dynamic without making it vulnerable?

Hostnames are turned into IP addresses using the DNS. Temporum uses the TLS X.509 PKI to give some assurance that the DNS returned the correct result, about which more below. The DNS isn't security-critical, but if it worries you perhaps temporum could be configured with a list of IP addresses instead - but maybe that will make the device-on-shelf less likely to boot successfully.

Temporum does not compare the IP addresses of "different" host names. This might become a problem once TLS SNI makes large-scale virtual hosting easier. More subtly, there is a risk that temporum happens to query lots of servers that are hosted on the same infrastructure. This can be mitigated by being careful about selecting which host names to include in the list - no more than a few each of Blogspot, Tumblr, Livejournal, GoDaddy vhosts, etc. More than one of each is OK since it helps with on-shelf robustness.

The TLS security model hopes that X.509 certification authorities will only hand out certificates for host names to the organizations that run the hosts. This is a forlorn hope: CAs have had their infrastructure completely compromised; they have handed out intermediate signing certificates to uncontrolled third parties; they are controlled by nation states that treat our information security with contempt.

In the context of temporum, we can reduce this problem by checking that the quorum hosts are authenticated by diverse CAs. Then an attacker would have to compromise multiple CAs to convince us of an incorrect time. Question: are there enough different CAs used by popular sites that temporum can quickly find a usable set?

(Leave a comment)

Saturday 9th November 2013

nsdiff 1.47

I have done a little bit of work on nsdiff recently.

You can now explicitly manage your DNSKEY RRset, instead of leaving it to named. This is helpful when you are transferring a zone from one operator to another: you need to include the other operator's zone signing key in your DNSKEY RRset to ensure that validation works across the transfer.

There is now support for bump-in-the-wire signing, where nsdiff transfers the new version of the zone from a back-end hidden master server and pushes the updates to a signing server which feeds the public authoritative servers.

Get nsdiff from http://www-uxsup.csx.cam.ac.uk/~fanf2/hermes/conf/bind/bin/nsdiff

(Edit: I decided to simplify the -u option so updated from version 1.46 to 1.47.)

(Previously, previously, previously, previously, previously.)

(Leave a comment)

Tuesday 29th October 2013

Temporum: Quorate secure time

There are essentially two ways to find out what the time is: ask an authoritative source and trust the answer, or ask several more or less unreliable sources and see what they agree on. NTP is based on the latter principle, but since the protocol isn't secured, a client also has to trust the network not to turn NTP responses into lies.

NTP's lack of security causes a bootstrapping problem. Many security protocols rely on accurate time to avoid replay attacks. So nearly the first thing a networked device needs to do on startup is get the time, so that it can then properly verify what it gets from the network - DNSSEC signatures, TLS certificates, software updates, etc. This is particularly challenging for cost-constrained devices that do not have a battery-backed real time clock and so start up in 1970.

When I say NTP isn't secured, I mean that the protocol has security features but they have not been deployed. I have tried to understand NTP security, but I have not found a description of how to configure it for the bootstrap case. What I want is for a minimally configured client to be able to communicate with some time servers and get responses with reasonable authenticity and integrity. Extra bonus points for a clear description of which of NTP's half dozen identity verification schemes is useful for what, and which ones are incompatible with NATs and rely on the client knowing its external IP address.

In the absence of usable security from NTP, Jacob Appelbaum of the Tor project has written a program called tlsdate. In TLS, the ClientHello and ServerHello messages include a random nonce which includes a Unix time_t value as a prefix. So you can use any TLS server as a secure replacement for the old port 37 time service.
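For illustration, here is a minimal sketch (my own, not tlsdate's code) of pulling that timestamp out of a captured 32-byte hello random:

```python
import struct

def tls_random_time(hello_random: bytes) -> int:
    # The 32-byte Random field of a ClientHello/ServerHello begins
    # with gmt_unix_time, a big-endian 32-bit Unix timestamp; the
    # remaining 28 bytes are random.
    if len(hello_random) != 32:
        raise ValueError("expected a 32-byte Random field")
    (gmt_unix_time,) = struct.unpack(">I", hello_random[:4])
    return gmt_unix_time
```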

Unlike NTP, tlsdate gets time from a single trusted source. It would be much better if it were able to consult multiple servers for their opinions of the time: it would be more robust if a server is down or has the wrong time, and it would be more secure in case a server is compromised. There is also the possibility of using multiple samples spread over a second or two to obtain a more accurate time than the one second resolution of TLS's gmt_unix_time field.

The essential idea is to find a quorum of servers that agree on the time. An adversary or a technical failure would have to break at least that many servers for you to get the wrong time.

In statistical terms, you take a number of samples and find the mode, the most common time value, and keep taking samples until the frequency at the mode is greater than the quorum.

But even though time values are discrete, the high-school approach to finding the mode isn't going to work, because in many cases we won't be able to take all the necessary samples close enough together in time. So it is better to measure the time offset between a server and the client at each sample, and treat these as a continuous distribution.

The key technique is kernel density estimation. The mode is the point of peak density in the distribution estimated from the samples. The kernel is a function that is used to spread out each sample; the estimated distribution comes from summing the spread-out samples.

NTP's clock select algorithm is basically kernel density estimation with a uniform kernel.

NTP's other algorithms are based on lengthy observations of the network conditions between the client and its servers, whereas we are more concerned with getting a quick result from many servers. So perhaps we can use a simpler, better-known algorithm to find the mode. It looks like the mean shift algorithm is a good candidate.

For the mean shift algorithm to work well, I think it makes sense to use a smooth kernel such as the Gaussian. (I like exponentials.) The bandwidth of the kernel should probably be one second (the precision of the timestamp) plus the round trip time.
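As a sketch of the idea (the sample data and parameters below are invented for illustration), mean shift with a Gaussian kernel over a set of measured offsets looks something like this:

```python
import math

def mean_shift_mode(offsets, bandwidth=1.0, tol=1e-6, max_iter=100):
    # Estimate the mode of a set of clock-offset samples by mean
    # shift with a Gaussian kernel.  Start from the median and
    # repeatedly move to the kernel-weighted mean of the samples
    # until the estimate stops moving: that is a density peak.
    x = sorted(offsets)[len(offsets) // 2]
    for _ in range(max_iter):
        weights = [math.exp(-0.5 * ((s - x) / bandwidth) ** 2)
                   for s in offsets]
        shifted = sum(w * s for w, s in zip(weights, offsets)) / sum(weights)
        if abs(shifted - x) < tol:
            return shifted
        x = shifted
    return x
```

An outlying sample (a server that is down or lying) gets a tiny kernel weight, so it barely pulls the estimate away from the cluster where most servers agree.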

Now it's time to break out the editor and write some code... I think I'll call it "temporum" because that rhymes with "quorum" and it means "of times" (plural). Get temporum from my git server.

(2 comments | Leave a comment)

Wednesday 23rd October 2013

My circular commute

We moved to new offices a month ago and I have settled in to the new route to work. Nico's nursery was a bit out of the way when I was working in the city centre, but now it is about a mile in the wrong direction.

But this is not so bad, since I have decided on a -Ofun route [1] which is almost entirely on cycle paths and very quiet roads, and the main on-road section has good cycle lanes (at least by UK standards). There is lots of park land, a bit of open country, and it goes through the world heritage site in the city centre :-) And it's fairly easy to stop off on the way if I need to get supplies.

[1] optimize for fun!

My route to work on gmap-pedometer

I don't have to wrangle children on the way home, so I take the direct route past the Institute of Astronomy and Greenwich House (where Rachel previously worked and where the Royal Greenwich Observatory was wound down).

My route home on gmap-pedometer

So far it has been pleasantly free of the adrenaline spikes I get from seeing murderous and/or suicidally stupid behaviour. Much better than going along Chesterton Lane and Madingley Road!

(7 comments | Leave a comment)

Tuesday 8th October 2013

Maintaining a local patch set with git

We often need to patch the software that we run in order to fix bugs quickly rather than wait for an official release, or to add functionality that we need. In many cases we have to maintain a locally-developed patch for a significant length of time, across multiple upstream releases, either because it is not yet ready for incorporation into a stable upstream version, or because it is too specific to our setup so will not be suitable for passing upstream without significant extra work.

I have been experimenting with a git workflow in which I have a feature branch per patch. (Usually there is only one patch for each change we make.) To move them on to a new feature release, I tag the feature branch heads (to preserve history), rebase them onto the new release version, and octopus merge them to create a new deployment version. This is rather unsatisfactory, because there is a lot of tedious per-branch work, and I would prefer to have branches recording the development of our patches rather than a series of tags.

Here is a git workflow suggested by Ian Jackson which I am trying out instead. I don't yet have much experience with it; I am writing it down now as a form of documentation.

There are three branches:

  • upstream, which is where public releases live
  • working, which is where development happens
  • deployment, which is what we run

Which branch corresponds to upstream may change over time, for instance when we move from one stable version to the next one.

The working branch exists on the developer's workstation and is not normally published. There might be multiple working branches for work-in-progress. They get rebased a lot.

Starting from an upstream version, a working branch will have a number of mature patches. The developer works on top of these in commit-early-commit-often mode, without worrying about order of changes or cleanliness. Every so often we use git rebase --interactive to tidy up the patch set. Often we'll use the "squash" command to combine new commits with the mature patches that they amend. Sometimes it will be rebased onto a new upstream version.

When the working branch is ready, we use the commands below to update the deployment branch. The aim is to make it look like updates from the working branch are repeatedly merged into the deployment branch. This is so that we can push updated versions of the patch set to a server without having to use --force, and pulling updates into a checked out version is just a fast-forward. However this isn't a normal merge since the tree at the head of deployment always matches the most recent good version of working. (This is similar to what stg publish does.) Diagrammatically,

     | \
     |  `A---B-- 1.1-patched
     |    \       |
     |     \      |
     |      `C-- 1.1-revised
     |            |
    2.0           |
     | \          |
     |  `-C--D-- 2.0-patched
     |            |
    3.1           |
     | \          |
     |  `-C--E-- 3.1-patched
     |            |
  upstream        |

The horizontal-ish lines are different rebased versions of the patch set. Letters represent patches and numbers represent version tags. The tags on the deployment branch are for the install scripts so I probably won't need one on every update.

Ideally we would be able to do this with the following commands:

    $ git checkout deployment
    $ git merge -s theirs working

However, there is an "ours" merge strategy but not a "theirs" merge strategy. Johannes Sixt described how to simulate git merge -s theirs in a post to the git mailing list in 2010. So the commands are:

    $ git checkout deployment
    $ git merge --no-commit -s ours working
    $ git read-tree -m -u working
    $ git commit -m "Update to $(git describe working)"

Mark Wooding suggested the following more plumbing-based version, which unlike the above does not involve switching to the deployment branch.

    $ d="$(git rev-parse deployment)"
    $ w="$(git rev-parse working)"
    $ m="Update deployment to $(git describe working)"
    $ c="$(echo "$m" | git commit-tree -p $d -p $w working^{tree})"
    $ git update-ref -m "$m" deployment $c $d
    $ unset c d w

Now to go and turn this into a script...

(2 comments | Leave a comment)

Sunday 6th October 2013

Bacon and cabbage

Bacon and brassicas are best friends. Here's a very simple recipe which is popular in the Finch household.


  • A Savoy cabbage
  • Two large or three small onions
  • A few cloves of garlic
  • A 200g pack of bacon
  • Oil or butter for frying
  • Soured cream


Chop the onion

Slice the cabbage to make strips about 1cm wide

Cut up the bacon to make lardons


Get a large pan that is big enough to hold all the cabbage. Heat the fat, press in the garlic, then bung in the onion and bacon. Fry over a moderate heat until the bacon is cooked.

Add the cabbage. Stir to mix everything together and keep stirring so it cooks evenly. As the cabbage cooks down and becomes more manageable you can put the heat right up to brown it slightly. Keep stir-frying until the thick ribby parts of the cabbage are as soft as you like, usually several minutes. (I haven't timed it since I taste to decide when it is done...)

I serve this as a main dish with just a few dollops of sour cream on top and plenty of black pepper.

(Leave a comment)

Wednesday 14th August 2013

Subverting BIND's SRTT algorithm: derandomizing NS selection

This morning I saw a link on Twitter to a paper that was presented at the USENIX Workshop on Offensive Technologies this week. Although it sounds superficially interesting, the vulnerability is (a) not new and (b) not much of a vulnerability.

The starting point of the paper is query randomization to make cache poisoning harder. They cite RFC 5452, saying:

The most common values which the resolver randomizes are DNS transaction ID (TXID), UDP source port and the IP address of the queried name server.
In this work we present a newly discovered vulnerability in BIND which allows an attacker to determine (derandomize) the IP address of the name server a BIND resolver queries. The attack reduces the amount of information a blind attacker must guess to successfully poison BIND's cache.

The problem is that this exact vulnerability is described in RFC 5452 section 4.4:

Many zones have two or three authoritative nameservers, which make matching the source address of the authentic response very likely with even a naive choice having a double digit success rate. Most recursing nameservers store relative performance indications of authoritative nameservers, which may make it easier to predict which nameserver would originally be queried -- the one most likely to respond the quickest.

This vulnerability reduces the amount of randomness in the query by about one bit, and so it is fairly trivial. If you care that much about query randomness you should implement the dns0x20 hack to randomize the case of the query name.
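The dns0x20 trick is simply to randomize the 0x20 (case) bit of each letter in the query name, and check that the response echoes the case back exactly; a toy sketch of the randomization step:

```python
import random

def randomize_0x20(qname: str, rng: random.Random) -> str:
    # Flip each letter to upper or lower case at random; a genuine
    # responder echoes the question name byte for byte, so each
    # letter adds roughly one bit an off-path attacker must guess.
    return "".join(
        (c.upper() if rng.random() < 0.5 else c.lower())
        if c.isalpha() else c
        for c in qname
    )
```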

Of course the real defence against cache poisoning attacks is DNSSEC.

The actual technique they use to exploit the vulnerability is new and quite clever. The real impact is that BIND's SRTT algorithm sometimes calculates bogus values, when an NS RRset has a mixture of lame and working nameservers and overlaps with other NS RRsets. Bogus SRTT values can slightly increase average query latency.

It seems to me to be a stretch to claim this is a security bug rather than a performance bug.

(Leave a comment)

Monday 17th June 2013

Dominoes and dice patterns

Nico has a box of dominoes, and playing with it often consists of me trying to arrange them into nice patterns, and him trying to shuffle them. The box contains the usual 28 dominoes, but has coloured Paddington Bear pictures instead of spots. The dominoes can fit into the box in four stacks of seven; in this arrangement there are eight squares visible but the dominoes only have seven different pictures. There isn't a particularly satisfying choice of which four dominoes get to show their faces.

Traditional dominoes use the same six arrangements of spots as dice, plus blank. They are based on a 3x3 grid, in which the middle spot is present in odd numbers and absent in even numbers, and opposing pairs of spots are added starting with the diagonals. This extends nicely from zero to nine:

I could solve my four-pile problem with a set of 36 dominoes with faces numbered 1 to 8 (which I think is prettier than 0 to 7), or I could make five piles showing squares numbered 0 to 9 if I had a "double-nine" set of 55 dominoes.

Another way to arrange the spots is hexagonally, which also allows you to use a translation from binary to unary. The middle dot represents bit 2^0; two opposing dots represent bit 2^1; and the other four dots in the hexagon represent bit 2^2:
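In code, that binary-to-unary reading is trivial (a little sketch of my own):

```python
def hex_spots(n: int) -> list:
    # Decompose 0 <= n <= 7 into the hexagonal layout's spot groups:
    # the centre dot carries bit 2^0, one opposing pair carries
    # bit 2^1, and the remaining four hexagon dots carry bit 2^2.
    assert 0 <= n <= 7
    return [bit for bit in (1, 2, 4) if n & bit]
```

The total number of spots shown always equals n, which is what makes the binary grouping read as unary.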

I think this is even more pretty :-) It can also make nice octahedral dice, and the hexagon patterns will fit in the faces particularly well if the corners are rounded off.

ETA: Following the discussion in the comments, I have come up with an extended layout that works up to 31 spots. It fits fairly well in a square, but loses some of the hexagonal symmetry. It is based on the observation that three overlapping hexagonal rings contain 16 spots (3 * 6 - 2 overlapping spots). No great insight that shows how to extend it further, I am afraid. See dice-spot.html which includes the code to draw the diagrams.

(11 comments | Leave a comment)

Sunday 2nd June 2013

Guilty pleasures

Corned beef / bully beef

Not the American thing we call salt beef (which is excellent and one of my non-guilty pleasures) but the stuff that comes in cans from Fray Bentos in South America. Cheap meat and many food miles, but yummy salty fatty.

I tend to have enthusiasms for a particular food, and will eat a lot of it for a few weeks until I tire of it and move on to something else. Recently I have been making sort of ersatz Reuben sandwiches, with bully beef, emmental, sauerkraut, and mustard, often toasted gently in a frying pan to melt the filling.

Heinz Sandwich Spread

Finely diced vegetables in salad cream. Sweet and sour and crunchy. A good alternative to a pickle like Branston, especially before they did a sandwich version.

Salad cream is another of those dubious cost-reduced foods, like mayo but with less egg and oil, more vinegar, and added water and sugar. A stronger sweeter and more vinegary flavour.

(8 comments | Leave a comment)

Thursday 16th May 2013

Mixfix parsing / chain-associative operators

Earlier this evening I was reading about parsing user-defined mixfix operators in Agda.

Mixfix operators are actually quite common though they are usually called something else. The Danielsson and Norell paper describes operators using a notation in which underscores are placeholders for the operands, so (using examples from C) we have

  • postfix operators
    • _ ++
    • _ .member
  • prefix operators
    • ! _
    • & _
  • infix operators
    • _ + _
    • _ == _

There are also circumfix operators, of which the most common example is (_) which boringly does nothing. Mathematical notation has some more fun examples, such as ceiling ⎡_⎤ and floor ⎣_⎦.

Mixfix operators have a combination of fixities. C has a few examples:

  • _ [ _ ]
  • _ ? _ : _
You can also regard function call syntax as a variable-arity mixfix operator :-)

The clever part of the paper is how it handles precedence in a way that is reasonably easy for a programmer to understand when defining operators, and which allows for composing libraries which might have overlapping sets of operator definitions.

One thing that their mixfix parser does not get funky about is associativity: it supports the usual left-, right-, and non-associative operators. One of my favourite syntactic features is chained relational operators, as found in BCPL and Python. (For fun I added the feature to Lua - see this and the following two patches.) You can write an expression like

    a OP b OP c OP d
which is equivalent to
    a OP b && b OP c && c OP d
except that each operand is evaluated at most once. (Though unfortunately BCPL may evaluate inner operands twice.) This is not just short-cutting variable-arity comparison because the operators can differ.
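Python's chained comparisons behave exactly like this, and the operators may differ within one chain; a small demonstration (the tracing helper is mine):

```python
def evaluations():
    calls = []
    def f(v):
        calls.append(v)  # record each operand evaluation
        return v
    # Equivalent to f(1) < x and x <= y and y != z, except that
    # each operand is evaluated at most once, left to right, and
    # the chain short-circuits at the first false comparison.
    result = f(1) < f(2) <= f(3) != f(5)
    return result, calls
```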

So I wonder, are there other examples of chain-associative operators? They might have a different underlying reduction operator instead of &&, perhaps, which would imply different short-cut behaviour.

Perhaps an answer would come to mind if I understood more of the category-theory algebraic functional programming stuff like bananas and lenses and what have you...

(12 comments | Leave a comment)

Friday 3rd May 2013

Two compelling applications for universal surveillance

Face recognition is pretty good now: you can upload a picture to Facebook and it will automatically identify who is in the frame. Combine this with Google Glass and the idea of lifelogging: what can these technologies do as prosthetic memories?

I am at a party and I meet someone I am sure I have met before, but I can't remember. I could try talking to them until I get enough context to remind myself of their name, or I could just wink to take a photo, which is uploaded and annotated, and then I know their name, employer, marital status, and I have a picture of our last photographed encounter. Oh yes, now I remember! And we have saved a few minutes of embarrassing protocol negotiation and impedance matching.

But why wink? If I am lifelogging, everything I do gets uploaded. So when I see someone, I can be automatically reminded to say happy birthday, or that I need to give them the thing I offered to lend them. Contextual social cues!

How much embarrassment we could avoid! How much sooner we could get to the fun part of the conversation!

(This post was inspired by a pub conversation with Ian Jackson and Mark Wooding; Simon Tatham suggested the auto-reminder feature.)
(8 comments | Leave a comment)

Thursday 11th April 2013

DNS reflection / amplification attacks: security economics, nudge theory, and perverse incentives.

In recent months DNS amplification attacks have grown to become a serious problem, the most recent peak being the 300 Gbit/s Spamhaus DDoS attack which received widespread publicity partly due to effective PR by CloudFlare and a series of articles in the New York Times (one two three).

Amplification attacks are not specific to the DNS: any service that responds to a single datagram with a greater number and/or size of reply datagrams can be used to magnify the size of an attack. Other examples are ICMP (as in the smurf attacks that caused problems about 15 years ago) and SNMP (which has not yet been abused on a large scale).

The other key ingredient is the attacker's ability to spoof the source address on packets, in order to direct the amplified response towards the ultimate victim instead of back to the attacker. The attacks can be defeated either by preventing amplification or by preventing spoofed source addresses.

Smurf attacks were stopped by disabling ICMP to broadcast addresses, both in routers (dropping ICMP to local broadcast addresses) and hosts (only responding to directly-addressed ICMP). This fix was possible since there is very little legitimate use for broadcast ICMP. The fix was successfully deployed mainly because vendor defaults changed and equipment was upgraded. Nudge theory in action.

If you can't simply turn off an amplifier, you may be able to restrict it to authorized users, either by IP address (as in recursive DNS servers) and/or by application level credentials (such as SNMP communities). It is easier to police any abuse if the potential abusers are local. Note that if you are restricting UDP services by IP address, you also need to deploy spoofed source address filtering to prevent remote attacks which have both amplifier and victim on your network. There are still a lot of vendors shipping recursive DNS servers that are open by default; this is improving slowly.

But some amplifiers, such as authoritative DNS servers, are hard to close because they exist to serve anonymous clients across the Internet. It may be possible to quash the abuse by suitably clever rate-limiting which, if you are lucky, can be as easy to use as DNS RRL; without sufficiently advanced technology you have to rely on diligent expert operators.

There is a large variety of potential amplifiers which each require specific mitigation; but all these attacks depend on spoofed source addresses, so they can all be stopped by preventing spoofing. This has been recommended for over ten years (see BCP 38 and BCP 84) but it still has not been deployed widely enough to stop the attacks. The problem is that there are not many direct incentives to do so: there is the reputation for having a well-run network, and perhaps a reduced risk of being sued or prosecuted - though the risk is nearly negligible even if you don't filter. Malicious traffic does not usually cause operational problems for the originating network in the way it does for the victims and often also the amplifiers. There is a lack of indirect incentives too: routers do not filter spoofed packets by default.

There are a number of disincentives. There is a risk of accidental disruption due to more complicated configuration. Some networks view transparency to be more desirable than policing the security of their customers. And many networks do not want to risk losing money if the filters cause problems for their customers.

As well as unhelpful externalities there are perverse incentives: source address spoofing has to be policed by an edge network provider that is acting as an agent of the attacker - perhaps unwittingly, but they are still being paid to provide insecure connectivity. There is a positive incentive to corrupt network service providers that allow criminals to evade spoofing filters. The networks that feel the pain are unable to fix the problem.

More speculatively, if we can't realistically guarantee security near the edge, it might be possible to police spoofing throughout the network. In order for this to be possible, we need a comprehensive registry of which addresses are allocated to which networks (a secure whois), and a solid idea of which paths can be used to reach each network (secure BGP). This might be enough to configure useful packet filters, though it will have similar scalability problems as address-based packet forwarding.

So we will probably never be able to eliminate amplification attacks, though we ought to be able to reduce them to a tolerable level. To do so we will have to reduce both datagram amplifiers and source address spoofing as much as possible, but neither of these tasks will ever be complete.

(7 comments | Leave a comment)

Thursday 14th March 2013

It is hard to test a DNSSEC root trust anchor rollover

ICANN are running a consultation on their plans for replacing the DNSSEC root key. I wondered if there was any way to be more confident that the new key is properly trusted before the old key is retired. My vague idea was to have a test TLD (along the lines of the internationalized test domains) whose delegation is only signed by the new root key, not the normal zone-signing keys used for the production TLDs. However this won't provide a meaningful test: the prospective root key becomes trusted because it is signed by the old root key, just like the zone-signing keys that sign the rest of the root zone. So my test TLD would ultimately be validated by the old root key; you can't use a trick like this to find out what will happen when the old key is removed.

So it looks like people who run validating resolvers will have to use some out-of-band diagnostics to verify that their software is tracking the key rollover correctly. In the case of BIND, I think the only way to do this currently is to cat the managed-keys.bind pseudo-zone, and compare this with the root DNSKEY RRset. Not user-friendly.

(Leave a comment)

Wednesday 6th March 2013

DoS-resistant password hashing

In the office this morning we had a discussion about password hashing, in particular what we could do to improve the security of web servers around the university. (You might remember the NullCrew attacks on Cambridge University a few months ago.)

One of the points of the standard password storage advice (which boils down to "use scrypt, or bcrypt, or PBKDF2, in decreasing order of preference") is that a decent iteration count should make a password hash take tens of milliseconds on a fast computer. This means your server is incapable of servicing more than a few dozen login attempts per second, and so your login form becomes quite a tempting denial of service target.

One way to mitigate this is to make the client perform a proof-of-work test that is more expensive than the password hash which brings you a little closer to fairness in the arms race. On Twitter, Dan Sheppard tersely suggested some refinements:

Improve client-side puzzles: bypassable with uid-tied unexpiring cookie. If supplied & not recently & not blacklisted, relax puzzles. Under DOS require this cookie to go into 1st queue, otherwise into 2nd with DOS. 5s at first login on new machine w/ spinner is ok.

But for real DoS resistance you need the client to be doing a lot more work than the server (something like 3 to 10 binary orders of magnitude), not roughly the same amount. So I wondered, can you move some of the work of verifying a password from the server to the client? And require that an attacker with a copy of the password file should have to do both the client's and server's work for each trial decrypt.

I thought you could use two hash functions, "hard" and "easy". ("Easy" should still be a slow-ish password hash.) The server sends some parameters (salt and work factor) to the client, which executes the "hard" function to crypt the password, and then sends password and crypt to the server. The server uses these as the input for the "easy" function, and checks this matches the stored crypt. The server stores only the first salt and the final output of the process, along with the work factors.
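A sketch of what I mean, with invented (not recommended) cost parameters, using scrypt as the "hard" client-side function and PBKDF2 as the "easy" server-side one:

```python
import hashlib, hmac, os

def client_hard(password: bytes, salt: bytes) -> bytes:
    # The expensive KDF, run by the client (cost parameters invented).
    return hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1)

def server_easy(intermediate: bytes, salt: bytes) -> bytes:
    # The cheaper (but still slow-ish) hash, run by the server.
    return hashlib.pbkdf2_hmac("sha256", intermediate, salt, 10000)

def enroll(password: bytes):
    # The server stores only the salt and the final output.
    salt = os.urandom(16)
    return salt, server_easy(client_hard(password, salt), salt)

def verify(client_result: bytes, salt: bytes, stored: bytes) -> bool:
    # Per login attempt the server only does the easy step; an
    # attacker with the stored hashes must redo both steps per guess.
    return hmac.compare_digest(server_easy(client_result, salt), stored)
```

Note that client_result is password-equivalent, as discussed below, so it must be treated as an ephemeral secret in transit.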

What might be the problems with this idea? (I will not be surprised or upset if it turns out to be completely laughable to cryptographers!)

The output of the "hard" KDF is password-equivalent, so it needs to be well protected. I'm assuming it is treated like an ephemeral key. Don't be tempted to put it in a cookie, for example.

This protocol is not a proof of work like what is described in the Stack Exchange question I linked to above: in my idea the challenge from the server to the client is always the same. (Which is why the result of the work is password-equivalent.) But this is OK for a threat model that is satisfied by basic password authentication.

The "easy" function needs to be resistant to attacks that aim to recover the salt. As I understand it, this is not a normal requirement for password hashes (the salt is usually considered public) so an unusual construction might be needed.

Any other problems? Is this a completely foolish idea?

Of course this is a mostly useless idea, since there are much better authentication protocols. In particular, SRP (Secure Remote Password authentication and session key establishment) does all the work of hashing the password and salt on the client - observe in the summary where Hash(salt, password) happens.

What other password authentication protocols should be mentioned in this context?

(17 comments | Leave a comment)

Wednesday 27th February 2013

ccTLD registry web server survey

There has been some amusement on Twitter about a blog post from Microsoft offering their security auditing services to ccTLDs. This is laughable from the DNS point of view, but fairly plausible if you consider a registry's web front end. So I thought I would do a little survey. I got a list of ccTLDs from my local copy of the root zone, and looked up their web sites in Wikipedia, then asked each one in turn what web server software it uses. Frederic Cambus suggested on Twitter that I should get the registry URLs from the IANA root zone database instead, and this turned out to be much better because of the more consistent formatting.

No Server            17
Apache verbose       39
Apache (Unix)        21
Apache (CentOS)      20
Apache (Debian)      13
Apache (Ubuntu)      12
Apache (FreeBSD)      7
Apache (Fedora)       4
Apache (Win32)        2
Nginx / Varnish       1
GlassFish v3          1
VNNIC IDN Proxy 2     1
Zope / Plone          1
    dig axfr . |
    sed '/^\([A-Za-z][A-Za-z]\)[.].*/!d;
         s||\1 http://www.iana.org/domains/root/db/\1.html|' |
    sort -u |
    while read t w
    do
        r=$(curl -s $w |
            sed '/.*<b>URL for registration services:<.b> <a href="\([^"]*\)".*/!d;
                 s//\1/')
        case $r in
        '') echo "<tr><th>$t</th><td></td><td></td></tr>"
            ;;
        *)  s=$(curl -s -I $r | sed '/^Server: \([^^M]*\).*/!d;s//\1/')
            echo "<tr><th>$t</th><td>$r</td><td>$s</td></tr>"
            ;;
        esac
    done
(1 comment | Leave a comment)

Thursday 21st February 2013


Dan Bernstein recently published slides for a talk criticizing DNSSEC. The talk is framed as a description of the fictitious protocol HTTPSEC, which is sort of what you would get if you applied the DNSSEC architecture to HTTP. This seems to be a rhetorical device to make DNSSEC look stupid, which is rather annoying because it sometimes makes his points harder to understand, and if his arguments are strong they shouldn't need help from extra sarcasm.

The analogy serves to exaggerate the apparent foolishness because there are big differences between HTTP and the DNS. HTTP deals with objects thousands of times larger than the DNS, and (in simple cases) a web site occupies a whole directory tree whereas a DNS zone is just a file. Signing a file isn't a big deal; signing a filesystem is troublesome.

There are also differences in the way the protocols rely on third parties. Intermediate caches are ubiquitous in the DNS, and relatively rare in HTTP. The DNS has always relied a lot on third party secondary authoritative servers; by analogy HTTP has third party content delivery networks, but these are a relatively recent innovation and the https security model was not designed with them in mind.

The DNS's reliance on third parties (users relying on their ISP's recursive caches; master servers relying on off-site slaves) is a key part of the DNSSEC threat model. It is designed to preserve the integrity and authenticity of the data even if these intermediaries are not reliable. That is, DNSSEC is based on data security rather than channel security.

I like to use email to explain this distinction. When I connect to my IMAP server over TLS, I am using a secure channel: I know I am talking to my server and that no-one can falsify the data I receive. But the email I download over this channel could have reached the server from anywhere, and it can contain all sorts of fraudulent messages. But if I get a message signed with PGP or S/MIME I can be sure that data is securely authentic.

DJB uses the bad analogy with HTTP to mock DNSSEC, describing the rational consequences of its threat model as "not a good approach" and "stupid". I would prefer to see an argument that tackles the reasons for DNSSEC's apparently distasteful design. For example, DJB prefers an architecture where authoritative servers have private keys used for online crypto. So if you want outsourced secondary service you have to make a rather difficult trust trade-off. It becomes even harder when you consider the distributed anycast servers used by the root and TLDs: a lot of the current installations cannot be upgraded to the level of physical security that would be required for such highly trusted private keys. And there is the very delicate political relationship between ICANN and the root server operators.

So the design of DNSSEC is based on an assessment that the current DNS has a lot of outsourcing to third parties that we would prefer not to have to trust, but at the same time we do not want to fix this trust problem by changing the commercial and political framework around the protocol. You might legitimately argue that this assessment is wrong, but DJB does not do so.

What follows is a summary of DJB's arguments, translated back to DNSSEC as best I can, with my commentary. The PDF has 180 pages because of the way the content is gradually revealed, but there are fewer than 40 substantive slides.

Paragraphs starting with a number are my summary of the key points from that page of the PDF. The bulleted paragraphs are my commentary: expansions and clarifications of DJB's points, as well as corrections and counter-arguments.

  1. HTTPSEC motivation
  2. Standard security goals
  3. HTTPSEC "HTTP Security"
    responses signed with PGP
    • DNSSEC uses its own signature format: RRSIG records.

  4. Signature verification
    chain of trust
    • Internet Central Headquarters -> ICANN.

  5. Root key management
    • The description on this slide is enormously simplified, which is fair because DNSSEC root key management involves a lot of paranoid bureaucracy. But it gets some of these tedious details slightly wrong; fortunately this has no bearing on the argument.

    • Access to the DNSSEC root private key HSM requires three out of seven Crypto Officers. There are also seven Recovery Key Share officers, five of whom can reconstruct the root private key. And there are separate people who control physical access to the HSM, and people who are there to watch everything going according to plan.

    • Root zone signatures last a week, but the root zone is modified several times a day and each modification requires signing, using a private key (ZSK) that is more easily accessed than the root key (KSK) which only comes out every three months when the public keys are signed.
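The Crypto Officer and Recovery Key Share arrangements above are classic k-of-n threshold schemes. As an illustration of the idea only (the root HSM's actual mechanism is more involved), here is a toy Shamir secret sharing sketch over a prime field, splitting a secret into 7 shares where any 5 recover it:

```python
# Toy k-of-n threshold scheme (Shamir secret sharing), like the 5-of-7
# recovery key shares guarding the DNSSEC root key. Illustrative only,
# not the HSM's actual mechanism.
import random

P = 2**127 - 1  # a Mersenne prime, big enough for a toy secret

def split(secret, k, n):
    """Split secret into n shares; any k of them can reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def recover(shares):
    """Lagrange interpolation at x=0 recovers the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * -xj % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = split(123456789, 5, 7)          # 7 shares, threshold 5
assert recover(shares[:5]) == 123456789  # any 5 shares suffice
assert recover(shares[2:7]) == 123456789
```

With fewer than 5 shares the interpolation yields an essentially random field element, which is the point of the scheme.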

  6. Verifying the chain of trust
  7. HTTPSEC performance
    Precomputed signatures and no per-query crypto.
    • This design decision is not just about performance. It is also driven by the threat model.

  8. Clients don't share the work of verifying signatures
    Crypto primitives chosen for performance
    Many options
    • Another consideration is the size of the signatures: smaller is better when it needs to fit into a UDP packet, and when the signatures are so much bigger than the data they cover.

    • Elliptic curve signatures are now an excellent choice for their small size and good performance, but they are relatively new and have been under a patent cloud until fairly recently. So DNSSEC mostly uses good old RSA which has been free since the patent expired in 2000. If crypto algorithm agility works, DNSSEC will be able to move to something better than RSA, though it will probably take a long time.

    • Compare TLS crypto algorithm upgrades.
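The size point can be made concrete with a back-of-envelope comparison. The signature sizes below are the standard ones for each algorithm; the sample RRset size is an illustrative assumption for a small A-record answer:

```python
# Rough comparison of DNSSEC signature sizes vs. the data they cover.
# Signature sizes are standard for the algorithms; RRSET is an assumed
# size for a small A-record RRset, for illustration.
SIG_BYTES = {"RSA-1024": 128, "RSA-2048": 256, "ECDSA-P256": 64, "Ed25519": 64}
RRSET = 45  # assumed bytes for a small A-record RRset

for alg, sig in SIG_BYTES.items():
    print(f"{alg:10s}: {sig:3d}-byte signature, {sig / RRSET:.1f}x the data it signs")
```

A 2048-bit RSA signature is four times the size of an elliptic curve one, which matters a lot when several signatures have to fit in one UDP response.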

  9. Breakable choices such as 640-bit RSA
    Staggering complexity
    • Any fixed choice of crypto primitive is going to be broken at some point, so there must be some way to upgrade the crypto over time. DNSSEC relies on signatures which in turn rely on hashes, and hash algorithms have generally had fairly short lifetimes.

    • Compare SSL/TLS's history of weak crypto. Both protocols date back to the mid 1990s.

    • The complexity of DNSSEC is more to do with the awkward parts of the DNS, such as wildcards and aliases, and not so much the fact of crypto algorithm agility.

  10. HTTPSEC does not provide confidentiality
    • Yes this is a bit of a sore point with DNSSEC. But observe that in pretty much all cases, immediately after you make a DNS query you use the result to connect to a server, and this reveals the contents of the DNS reply to a snooper. On the other hand there are going to be uses of the DNS which are not so straight-forward, and this will become more common as more people use the security properties of DNSSEC to put more interesting data in the DNS which isn't just related to IP connectivity.

  11. HTTPSEC data model
    signatures alongside data
    • This slide makes DNSSEC signing sound a lot more fiddly than it actually is. HTTP deals with sprawling directory trees whereas most DNS master files are quite small, e.g. 33Mbytes for a large-ish signed zone with 51k records. DNSSEC tools deal with zones as a whole and don't make the admin worry about individual signatures.

  12. HTTPSEC purists say "answers should always be static"
    • In practice DNSSEC tools fully support dynamic modification of zones, e.g. the .com zone is updated several times a minute. In many cases it is not particularly hard to add DNSSEC signing as a publication stage between an existing DNS management system and the public authoritative servers, and it often doesn't require any big changes to that system.

    • Static data is supported so that it is possible to have offline keys like the root KSK, but that does not prevent dynamic data. (For a funny example, see the DNSSEC reverse polish calculator.) In a system that requires dynamic signatures, static data and offline keys are not possible.

  13. Expiry times stop attackers replaying old signatures
    Frequent resigning is an administrative disaster
    • This is true for some of the early unforgiving DNSSEC tools, but there has been a lot of improvement in usability and reliability in the last couple of years.

  14. HTTPSEC suicide examples
    • These look like they are based on real DNSSEC cockups.

    • Many problems have come from the US government DNSSEC deployment requirement of 2008, so many .gov sites set it up using the early tools with insufficient expertise. It has not been a very good advertisement for the technology.

  15. Nonexistent data - how not to do it
  16. NSEC records for authenticated denial of existence
  17. NSEC allows zone enumeration
  18. DNS data is public
    an extreme notion and a public-relations problem
    • The other problem with NSEC is that it imposes a large overhead during the early years of DNSSEC deployment where TLDs mostly consist of insecure delegations. Every update requires fiddling with NSEC records and signatures even though this provides no security benefits. NSEC3 opt-out greatly reduces this problem.
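The enumeration problem is easy to see in miniature. Each NSEC record names the next name in the zone's canonical order, so a querier who asks for names that don't exist gets handed successive links in the chain. This simulation uses made-up zone contents and plain lexicographic sorting (real canonical DNS ordering is label-wise, a detail that doesn't change the point):

```python
# Simulation of NSEC zone enumeration: each NSEC record links a name to
# its successor in sorted order, so following "next name" pointers walks
# the whole zone. Zone contents are invented for illustration.
zone = sorted(["example.", "alpha.example.", "mail.example.",
               "secret-host.example.", "www.example."])

# The circular NSEC chain: each name points to the next one.
nsec = {zone[i]: zone[(i + 1) % len(zone)] for i in range(len(zone))}

# An attacker starts at the apex and follows the chain until it wraps.
found, name = [], "example."
while True:
    found.append(name)
    name = nsec[name]
    if name == "example.":  # wrapped around: enumeration complete
        break

print(found)  # every name in the zone, including secret-host.example.
```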

  19. NSEC3 hashed denial of existence
  20. NSEC3 does not completely prevent attackers from enumerating a zone
    • This is true but in practice most sites that want to keep DNS data private use hidden zones that are only accessible on their internal networks.

    • Alternatively, if your name server does on-demand signing rather than using pre-generated signatures, you can use dynamic minimally covering NSEC records or empty NSEC3 records.

    • So there are ways to deal with the zone privacy problem if static NSEC3 isn't strong enough for you.
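For reference, the NSEC3 hashing that makes enumeration harder is just iterated, salted SHA-1 over the owner name in wire format (RFC 5155). A sketch, with arbitrary example salt and iteration count, and standard base32 rather than the base32hex alphabet the RFC specifies:

```python
# Sketch of NSEC3 owner-name hashing per RFC 5155: iterated salted SHA-1
# over the lowercased wire-format name. Salt and iteration count are
# arbitrary example values; RFC 5155 encodes with the base32hex alphabet,
# standard base32 is used here for brevity.
import base64, hashlib

def wire_name(name):
    """DNS wire format: length-prefixed labels, lowercased."""
    out = b""
    for label in name.lower().rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode()
    return out + b"\x00"

def nsec3_hash(name, salt, iterations):
    h = hashlib.sha1(wire_name(name) + salt).digest()
    for _ in range(iterations):
        h = hashlib.sha1(h + salt).digest()
    return base64.b32encode(h).decode()

print(nsec3_hash("www.example.com", bytes.fromhex("aabb"), 10))
```

An attacker can still brute-force guesses against the hashed chain offline, which is why NSEC3 only raises the cost of enumeration rather than preventing it.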

  21. DNS uses UDP
  22. DNS can be used for amplification attacks
    • A lot of other protocols have this problem too. Yes, DNSSEC makes it particularly bad. DNS software vendors are implementing response rate limiting which eliminates the amplification effect of most attacks. Dealing with spam and criminality is all rather ugly.

    • A better fix would be for network providers to implement ingress filtering (RFC 2827, BCP 38), but sadly this seems to be an impossible task, so higher-level protocols have to mitigate vulnerabilities in the network.
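The arithmetic behind the amplification worry is simple. The sizes below are assumed typical values, not measurements:

```python
# Back-of-envelope amplification factor for a spoofed-source DNS query:
# a small query elicits a large signed response aimed at the victim.
# Sizes are assumed typical values, not measurements.
query = 64        # bytes: a small "ANY example.com" style query
response = 4096   # bytes: a big DNSSEC response at the common EDNS limit
print(f"amplification factor: {response // query}x")
```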

  23. DNSSEC provides no protection against denial of service attacks
    • This quote comes from RFC 4033 section 4. I think (though it isn't entirely clear) that what the authors had in mind was the fact that attackers can corrupt network traffic or stop it, and DNSSEC can do nothing to prevent this. See for example section 4.7 of RFC 4035 which discusses how resolvers might mitigate this kind of DoS attack. So the quote isn't really about reflection attacks.

    • (Other protocols have similar problems; for instance, TLS kills the whole connection when it encounters corruption, so it relies on the difficulty of breaking TCP connections which are not normally hardened with crypto - though see the TCP MD5 signature option in RFC 2385.)

  24. The worst part of HTTPSEC

    The data signed by HTTPSEC doesn’t actually include the web pages that the browser shows to the user.

    HTTPSEC signs only routing information: specifically, 30x HTTP redirects.
    • I can't easily understand this slide because the analogy between DNS and HTTP breaks down. According to the analogy, HTTP redirects are DNS referrals, and web pages are leaf RRsets. But DNSSEC does sign leaf RRsets, so the slide can't be talking about that.

    • Perhaps it is being more literal and it is talking about actual web pages. The DNSSEC answer is that you use records such as SSHFP or TLSA to link the DNSSEC authentication chain to your application protocol.

    • I asked DJB about this on Twitter, and he confirmed the latter interpretation. But he complained that the transition from signed DNSSEC data to an encrypted application channel destroys the advantages of signed DNSSEC data, because of the different security models behind signed data and encrypted channels.

    • But in this situation we are using DNSSEC as a PKI. The X.509 PKI is also based on statically signed data (the server certificate) which is used to authenticate a secure channel.

  25. redirect example
  26. redirect example
  27. redirect example
  28. If final web page is signed, what is the security benefit of signing the redirects? Attacker can’t forge the page.
    • The answer to this question is how you trust the signature on the web page. At the moment we rely on X.509 which is inadequate in a lot of ways. DNSSEC is a new PKI which avoids some of the structural problems in the X.509 PKI. This is the reason I think it is so important.

    • The X.509 PKI was designed to follow the structure of the X.500 directory. When it got re-used for SSL and S/MIME it became decoupled from its original name space. Because of this, any CA can authenticate any name, so every name is only as strong as the weakest CA.

    • DNSSEC follows the structure of the Internet's name space. Its signing authorities are the same as DNS zone authorities, and they can only affect their subdomains. A British .uk name cannot be harmed by the actions of the Libyan .ly authorities.

    • What other global PKIs are there? PGP? Bueller?

  29. Deployment is hard
• If you look at countries like Sweden, the Czech Republic, the Netherlands, and Brazil, there is a lot more DNSSEC than elsewhere. They have used financial incentives (domain registration discounts for signed domains) to make it more popular. Is DNSSEC worth this effort? See above.

    • It's amusing to consider the relative popularity of DNSSEC and PGP and compare their usage models.

  30. HTTPS is good
    • But it relies on the X.509 PKIX which is a disaster. Peter Gutmann wrote a series of articles with many examples of why: I, II, III; and the subsequent mailing list discussion is also worth reading.

  31. The following quotes are straw-man arguments for why an https-style security model isn't appropriate for DNSSEC.
  32. “HTTPS requires keys to be constantly online.”
    • So does DNSSEC in most setups. You can fairly easily make DNSSEC keys less exposed than a TLS private key, using a hidden master. This is a fairly normal non-sec DNS setup so it's nice that DNSSEC can continue to use this structure to get better security.

  33. “HTTPS requires servers to use per-query crypto.”
    • So does NSEC3.

  34. “HTTPS protects only the channel, not the data. It doesn’t provide end-to-end security.”
    Huh? What does this mean?
    • See the discussion in the introduction about the DNSSEC threat model and the next few notes.

  35. Why is the site owner putting his data on an untrusted server?
    • Redundancy, availability, diversity, scale. The DNS has always had third-party secondary authoritative name servers. HTTP also does so: content delivery networks. The difference is that with DNSSEC your outsourced authoritative servers can only harm you by ceasing to provide service: they cannot provide false service; HTTP content delivery networks can mess up your data as much as they like, before serving it "securely" to your users with a certificate bearing your name.

  36. “HTTPS destroys the caching layer. This Matters.”
    Yeah, sure it does. Film at 11: Internet Destroyed By HTTPS.
    • Isn't it nicer to get an answer in 3ms instead of 103ms?

    • Many networks do not provide direct access to DNS authoritative servers: you have to use their caches, and their caches do not provide anything like the web proxy HTTP CONNECT method - or at least they are not designed to provide anything like that. A similar facility in the DNS would have to be an underhanded crypto-blob-over-DNS tunnel hack: a sort of anarchist squatter's approach to protocol architecture.

    • To be fair, a lot of DNS middleboxes have crappy DNSSEC-oblivious implementations, and DNSSEC does not cope with them at all well. Any security upgrade to the DNS probably can't be done without upgrading everything.

  37. The DNS security mess
  38. referral example
  39. HTTPSEC was all a horrible dream analogy

So to conclude, what DJB calls the worst part of DNSSEC - that it secures the DNS and has flexible cryptographic coupling to other protocols - is actually its best part. It is a new global PKI, and with sufficient popularity it will be better than the old one.

I think it is sad that someone so prominent and widely respected is discouraging people from deploying and improving DNSSEC. It would be more constructive to (say) add his rather nice Curve25519 algorithm to DNSSEC.

If you enjoyed reading this article, you might also like to read Dan Kaminsky's review of DJB's talk at 27C3 just over two years ago.

(6 comments | Leave a comment)

Wednesday 30th January 2013

More Oxbridge college name comparisons

So I said you'd have to do this comparison yourselves, but I can't resist making another list. To go with Janet's list of Oxford college domain names, my list of Cambridge college domain names, and my list of colleges with similar names in the two universities, here is a list of colleges with similar domain names:

Corpus Christi College
Jesus College
Magdalen(e) College
Pembroke College
(The) Queen's's's College
St Cath(a/e)rine's College
St John's College
Trinity College
Wolfson College

A few other comparisons occur to me.

Cambridge seems to have more colleges whose names are commonly abbreviated relative to the college's usual name (ignoring their sometimes lengthy full formal names): Caius, Catz, Corpus, Emma, Fitz, Lucy Cav, Sidney / Catz, Corpus, the House, LMH, Univ. (Does "New" count when divorced from "College"?)

Cambridge is more godly than Oxford: Christ's, Corpus, Emmanuel, Jesus, Trinity College and Hall / Christ Church, Corpus, Jesus, Trinity. (Does St Cross count?)

Oxford is more saintly: Magdalene, Peterhouse, St Catharine's, St Edmund's, St John's, (plus Corpus, Jesus, King's, Queens', according to their full names) / Magdalen, St Anne's, St Antony's, St Benet's, St Catherine's, St Cross, St Edmund, St Hilda's, St Hugh's, St John's, St Peter's, St Stephen's, (plus Lincoln, New, according to their full names).

(7 comments | Leave a comment)

Tuesday 29th January 2013

Cambridge college domain names

My friend Janet recently posted an article about the peculiarities of Oxford college domain names. Cambridge has a similar amount of semi-random hysterical raisin difference between its college domain names, so I thought I would enumerate them in a similar way to Janet.

Like Oxford, we have two kinds of colleges, but our kinds are different. We have colleges of the University and colleges in the Cambridge theological federation. The latter are not formally part of the University in the way that the other colleges are, but they are closely affiliated with the University Faculty of Divinity, and the University Computing Service is their ISP. There are also a number of non-collegiate institutions in the theological federation. I've included MBIT in the stats because it's interesting even though it's very borderline - a semi-college, let's say.

The Computing Service's governing syndicate allows colleges (and University departments) to have long domain names if their abbreviated version is considered by them to be too ugly. It's surprising how few of them have taken this up.

The following summary numbers count a few colleges more than once, if they have more than one domain name, or if they are a hall, etc. University colleges: 31; Theological colleges: 4.5; Straightforward (by Janet's criteria): 15.5; Initials: 0.5; First three letters: 7; First four letters: 5; Middle three letters: 2; Some other abbreviation: 5; Long + short: 3; Halls with hall: 3; Halls without hall: 2.

It is perhaps amusing to observe where the Oxbridge colleges with similar names differ in their domain labels in either place - see my next article. I have previously written about colleges with more-or-less similar names in Oxford and Cambridge, in which I list many of the less obvious formal expansions of the colleges' common names.

So, the Colleges of the University of Cambridge (and their domain names):

Christ's College
Churchill College
Clare College
Clare Hall
Corpus Christi College
Darwin College
Downing College
Emmanuel College
Fitzwilliam College
Girton College
Gonville & Caius College
Homerton College
Hughes Hall
Jesus College
King's College
Lucy Cavendish College
Magdalene College
Murray Edwards College (formerly known as New Hall)
Newnham College
Pembroke College
Queens' College
Robinson College
Selwyn College
Sidney Sussex College
St Catharine's College
St Edmund's College
St John's College
Trinity College
Trinity Hall
Wolfson College

And the theological colleges:

Margaret Beaufort Institute for Theology
Ridley Hall
Wesley House
Westcott House
Westminster College
(7 comments | Leave a comment)

Tuesday 4th December 2012

Distributed (micro-) blogging / how many markets does your protocol support?

I have been idly thinking about a distributed Twitter. The blogging technology we already have does a lot of what we want: your stream of updates ought to be just an Atom feed, and you can subscribe to and aggregate the atom feeds of the people you follow. What does this lack compared to Twitter?

  • A nice user interface. Surely just a small matter of programming.
  • Quick updates. For this use pubsubhubbub.
  • Protected feeds. I'm going to ignore this problem and hope the crypto fairy comes and waves her magic wand to make it go away.
  • Notifications that someone you don't know has mentioned you or replied to you.

The last one is crucial, because open communication between strangers is a key feature of Twitter. But if you allow strangers to contact you then you are vulnerable to spam. A centralized system has the advantage of concentrating both information about how spammers behave and the expertise to analyse and counteract them. In a distributed system spam becomes everyone's problem, and gives everyone an awkward dilemma between preserving privacy and collecting data for spam fighting.

An alternative approach, since feeds are generally public, is to view this as a search problem. That is, you rely on a third party to collect together all the feeds it can, cull the spammers, and inform you of items of interest to you - mentions, replies, tags, etc. This is a slightly centralized system, but a search provider is not involved in communication between people who know each other, and search is open to competition in a way that most social networking services are not.

The system as a whole then has (roughly) three layers: clients that can update, collect, and display Atom feeds; servers that host Atom feeds; and search services that index the servers. All this tied together with pubsubhubbub and HTTP. In a successful system each of these layers should be a competitive market with multiple implementations and service providers.
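The glue between the feed-hosting layer and the fast-update layer is just a rel="hub" link in the Atom feed, which is how pubsubhubbub discovery works. A minimal sketch of what the update layer might publish; all the URLs and names here are hypothetical:

```python
# Minimal Atom feed advertising a PubSubHubbub hub via a rel="hub" link.
# All URLs and names are hypothetical examples.
from xml.sax.saxutils import escape

def atom_feed(title, hub_url, self_url, entries):
    """Render a minimal Atom feed with PubSubHubbub discovery links."""
    items = "".join(
        f"<entry><title>{escape(t)}</title>"
        f"<id>{escape(i)}</id><updated>{u}</updated></entry>"
        for (t, i, u) in entries)
    return (
        '<?xml version="1.0"?>'
        '<feed xmlns="http://www.w3.org/2005/Atom">'
        f"<title>{escape(title)}</title>"
        f'<link rel="hub" href="{hub_url}"/>'    # where subscribers register
        f'<link rel="self" href="{self_url}"/>'  # the topic URL they register for
        f"{items}</feed>")

feed = atom_feed("status updates", "https://hub.example/",
                 "https://blog.example/feed.atom",
                 [("hello world", "tag:blog.example,2012:1",
                   "2012-12-04T12:00:00Z")])
print(feed)
```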

This three tier structure is reminiscent of the web. But a lot of Internet applications have only a two tier structure. This led me to think about what kinds of systems have different numbers of markets.

Zero markets

These are proprietary systems entirely controlled by a single vendor. For instance, Skype, AOL (in the mid-1990s), many end-user applications. A lot of web applications fall into this category, where the client software is downloaded on demand to run in the browser - treating the web (servers and browsers) as a substrate for the application rather than a part of it.

One market

A proprietary system with a published API. Operating systems typically fall into this category, and programmable application software. A characteristic of web 2.0 is to provide APIs for web services.

Social networking typically falls somewhere between zero and one on this scale, depending on how complete and well supported their API is. Twitter was a one-market system for years, but is now restricting its developers to less than that. Google's services are usually somewhere between zero and one, often closer to zero.

Two markets

An open system, with multiple implementations of the provider side of the interface as well as the consumer side. Many Internet applications are in this category: telnet, mail, usenet, IRC, the web, etc. etc.

Unix / POSIX is another good example. Perhaps some operating systems are slightly more open than a pure one-market system: NeXTSTEP has a clone, GNUstep, but it was never popular enough to turn NeXTSTEP into an open system, and then Mac OS X raced away leaving it behind. Wine is playing permanent catch-up with Microsoft Windows.

Many programming languages are two-market systems: Ada, C, Fortran, Haskell, JavaScript, Pascal. Lua, Python, Ruby, and Tcl are two-market to some extent - they have reference implementations and clones rather than an independent open specification. Java has multiple implementations, even though the spec is proprietary. Some successful languages are still only one-market systems, such as Perl and Visual Basic.

Three markets

The key feature here seems to be that a two market system doesn't provide enough connectivity; the third tier links the second tier providers together. This is not layering in the sense of the OSI model: for example, many two-tier Internet applications rely on the DNS for server-to-server connections, but this is a sublayer, not implemented as part of the application itself. In the web and in my distributed (micro-) blogging examples, the search tier operates as both client and server of the application protocol.

Linux can be viewed as a three-tier POSIX implementation. Linux's second tier is much more fragmented than traditional Unix's, so Linux distributions provide a third tier that ties it together into a coherent system.

Perhaps the Internet itself is in this category. The IP stack in user PCs and application servers are consumers of Internet connectivity; Internet access providers and server hosting providers are the second tier; and the third tier is the backbone network providers. This is obviously a massive oversimplification - many ISPs can't be easily classified in this manner. (But the same is also true of two-market systems: for example, webmail providers act in both the client and server tiers of Internet mail.)


The DNS has an elaborate multi-tiered structure: stub resolvers, recursive resolvers, authoritative servers, registrars, registries, and the root. This is partly due to its hierarchical structure, but the registrar / registry split is somewhat artificial (though it is similar to manufacturer / dealer and franchising arrangements). Though perhaps it can also be viewed as a three-tier system: The user tier includes resolvers, DNS update clients, zone master file editors, and whois clients for querying registries. (Grumpy aside: Unfortunately editing the DNS is usually done with proprietary web user interfaces and non-standard web APIs.) The middle tier comprises authoritative servers and registrars, not just because these services are often bundled, but also because you can't publish a zone without getting it registered. The third tier comprises the registries and the root zone, which provide an index of the authoritative servers. (The interface between the second and third tiers is EPP rather than DNS.)


I have found this classification an interesting exercise because a lot of the discussion about protocols that I have seen has been about client / server versus peer-to-peer, so I was interested to spot a three-tier pattern that seems to be quite successful. (I haven't actually mentioned peer-to-peer systems much in this article; they seem to have a similar classification except with one less tier.) This is almost in the opposite direction to the flattened anarchism of cypherpunk designs; even so, it seems three-tier systems often give users a lot of control over how much they have to trust powerful third parties. Unfortunately the top tier is often a tough market to crack…

(3 comments | Leave a comment)

Monday 3rd December 2012


Last week I was inspired by a tweet about a chilli quesadilla, and since then I have cooked the following about three times for Rachel and me.

Filling (enough for four quesadillas):

  • An onion, cut in half and sliced to make strips;
  • Two sweet peppers, cut into strips;
  • A couple of cloves of garlic;
  • Two fresh chillis, deseeded and sliced;
  • Diced chorizo (Aldi sell little packs, two of which is about right);
  • Olive oil.

Heat the oil in a pan, then bung the lot in and fry until the onion and peppers are cooked - stop before they get too soft. Since Rachel isn't such a fan of spicy food, I keep the chillis on one side and add them when I assemble my quesadillas.

To make a quesadilla, take a plain wheat flour tortilla wrap, fold it in half and unfold it to make a crease. Cover each half with a single layer of cheese slices (I like Emmental). Put the tortilla in a dry frying pan on a moderate heat. Put some filling on one half, and when the cheese has started to melt, fold over the other half and press so the cheese glues in the filling. Cook for a while until the underside is gently browned (less than a minute, I think) then flip over to brown the other side.


(3 comments | Leave a comment)

Friday 30th November 2012

Can't send mail from an Apple Mac via a BT Internet connection.

This morning we got a problem report from a user who couldn't send mail, because our message submission servers were rejecting their MUA's EHLO command because the host name was syntactically invalid. The symptoms were a lot like this complaint on the BT forums except our user was using Thunderbird. In particular the hostname had the same form, unknown-11:22:33:44:55:66.home, where the part in the middle looked like an embedded ethernet address including the colons that triggered the syntax error.

I wondered where this bogus hostname was coming from so I asked the user to look at their Mac's various hostname settings. I had a look in our logs to see if other people were having the same problem. Yes, a dozen or two, and all of them were using BT Internet, and all of the ethernet addresses in their hostnames were allocated to Apple.

A lot of our BT users have hostnames like unknown-11-22-33-44-55-66.home where the colons in the ethernet address have been replaced by hyphens. But it seems that some versions of the BT hub have broken hostname allocation that fails to use the correct syntax. You can change the settings on the hub to allocate a hostname with valid syntax but you have to do this separately for each computer that connects to your LAN. I believe (but I haven't checked) that the hub implements a fake .home TLD so that DNS "works" for computers on the LAN.

When a Mac OS computer connects to a network it likes to automatically adjust its hostname to match the network's idea of its name, so on a LAN managed by a broken BT hub it ends up with a bogus hostname containing colons. You can force the Mac to use a sensible hostname either globally or (as in those instructions) just in a particular location.

Most MUAs construct their EHLO commands using the computer's hostname. Most mail servers reject EHLO commands with invalid syntax (though underscores are often given a pass). So, if you try to send mail from a Mac on a BT connection then it is likely to fail in this way.
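The syntax check that trips these clients up can be sketched as a regular expression over the RFC 5321 Domain grammar: each label is letters and digits with hyphens only in the interior, so colons fail. This is a simplification - real servers also accept address literals, and Exim's actual check is more nuanced:

```python
# Sketch of an RFC 5321 Domain syntax check as applied to the EHLO
# argument. Simplified: address literals and other edge cases omitted.
import re

# One label: letters/digits, with hyphens only in the interior.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
DOMAIN = re.compile(rf"{LABEL}(?:\.{LABEL})*")

def helo_ok(hostname):
    """Accept only a syntactically valid Domain as the EHLO argument."""
    return DOMAIN.fullmatch(hostname) is not None

print(helo_ok("unknown-11:22:33:44:55:66.home"))  # False: colons are not allowed
print(helo_ok("unknown-11-22-33-44-55-66.home"))  # True: hyphens are fine
```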

This is a stupid and depressing bug, and the workarounds available to the user are all rather horrible. It would be a waste of time to ask the affected users to apply their own workarounds and complain to BT - and it seems most of them are oblivious to the failure.

So I decided to tweak the configuration on our message submission servers to ignore the syntax of the client hostname. In Exim you can do this using helo_accept_junk_hosts = * preceded by a comment explaining why BT Hubs suck.

(6 comments | Leave a comment)

Tuesday 25th September 2012

Large-scale IP-based virtual hosting

Yesterday there was a thread on the NANOG list about IPv6 addressing for web sites. There is a great opportunity here to switch back to having an IP address per site, like we did with IPv4 in the mid-1990s :-) Of course the web is a bit bigger now, and there is more large-scale hosting going on, but the idea of configuring a large CIDR block of addresses on a server is not unprecedented - except back then we were dealing with /18s rather than /64s.

Demon's Homepages service predated widespread browser support for name-based virtual hosting. In fact it provided one of the download sites for IE3, which was the first version that sent Host: headers. So Homepages was based on IPv4 virtual hosting and had IIRC 98,304 IP addresses allocated to it in three CIDR blocks. It was a single web server, in front of which were a few reverse proxy caches that took most of the load, and that also had all the IP addresses. Every cache would accept connections on all the IP addresses, and the load was spread between them by configuring which address ranges were routed to which cache.

The original version of Homepages ran on Irix, and used a cunning firewall configuration to accept connections to all the addresses without stupid numbers of virtual interfaces. Back then there were not many firewall packages that could do this, so when it moved to BSD (first NetBSD then FreeBSD) we used a "NETALIAS" kernel hack which allowed us to use ifconfig to bring up a CIDR block in one go.

Sadly I have never updated the NETALIAS code to support IPv6. But I wondered if any firewalls had caught up with Irix in the last 15 years. It turns out the answer is yes, and the key feature to look for is support for transparent proxying. On FreeBSD you want the ipfw fwd rule. On Linux you want the TPROXY feature. You can do a similar thing with OpenBSD pf, though it requires the server to use a special API to query the firewall, rather than just using getsockname().

On Demon's web servers we stuffed the IP address into the filesystem path name to find the document root, or used various evil hacks to map the IP address to a canonical virtual server host name before stuffing the latter in the path. mod_vhost_alias is very oriented around IPv4 addresses and host names, so it is probably not great for IPv6; mod_rewrite is a better choice if you want to break addresses up into a hierarchical layout. But perhaps it is not that ugly to run a name server which is authoritative for the reverse ip6.arpa range used by the web server, and map the address to the hostname with UseCanonicalName DNS.
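The address-into-path idea is easy to sketch. Here the /64 split, the base directory, and the two-level layout are all made-up examples of how one might keep per-site directories manageable:

```python
# Sketch of mapping a per-site IPv6 address to a document root, in the
# spirit of stuffing the IP address into the filesystem path. The /64
# split, base directory, and layout are made-up examples.
import ipaddress

def docroot(addr, base="/www/vhosts"):
    # Canonical full form: eight 4-digit hex groups, e.g.
    # 2001:0db8:0000:0001:0000:0000:cafe:0001
    groups = ipaddress.IPv6Address(addr).exploded.split(":")
    # Use the low 64 bits (the per-site part under an assumed /64) as a
    # two-level directory hierarchy to keep any one directory small.
    site = groups[4:]
    return f"{base}/{'-'.join(site[:2])}/{'-'.join(site)}"

print(docroot("2001:db8:0:1::cafe:1"))
# /www/vhosts/0000-0000/0000-0000-cafe-0001
```

A transparent-proxy web server would obtain the local address with getsockname() and feed it through a function like this to find the site's files.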

(Leave a comment)
Previous 50
Powered by LiveJournal.com