Tony Finch (fanf) wrote,

A weird BIND DNSSEC resolution bug, with a fix.

The central recursive DNS servers in Cambridge act as stealth slaves for most of our local zones, and we recommend this configuration for other local DNS resolvers. This has the slightly odd effect that the status bits in answers have AD (authenticated data) set for most DNSSEC signed zones, except for our local ones which have AA (authoritative answer) set. This is not a very big deal since client hosts should do their own DNSSEC validation and ignore any AD bits they get over the wire.

It is a bit more of a problem for the toy nameserver I run on my workstation. As well as being my validating resolver, it is also the master for my personal zones, and it slaves some of the Cambridge zones. This mixed recursive / authoritative setup is not really following modern best practices, but it's OK when I am the only user, and it makes experimental playing around easier. Still, I wanted it to validate answers from its authoritative zones, especially because there's no security on the slave zone transfers.

I had been procrastinating this change because I thought the result would be complicated and ugly. But last week one of the BIND developers, Mark Andrews, posted a description of how to validate slaved zones to the dns-operations list, and it turned out to be reasonably OK - no need to mess around with special TSIG keys to get queries from one view to another.

The basic idea is to have one view that handles recursive queries and which validates all its answers, and another view that holds the authoritative zones and which only answers non-recursive queries. The recursive view has "static-stub" zone configurations mirroring all of the zones in the authoritative view, to redirect queries to the local copies.

Here's a simplified version of the configuration I tried out. To make it less annoying to maintain, I wrote a script to automatically generate the static-stub configurations from the authoritative zones.

  view rec {
    match-recursive-only yes;
    zone         { type static-stub; server-addresses { ::1; }; };
    zone { type static-stub; server-addresses { ::1; }; };
  };

  view auth {
    recursion no;
    allow-recursion { none; };
    zone         { type slave; file "cam";  masters { ucam; }; };
    zone { type slave; file "priv"; masters { ucam; }; };
  };
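
The generator script itself is not shown in this post. As a rough illustration of the idea only, here is a minimal Python sketch that turns a list of authoritative zone names into the static-stub stanzas for the recursive view. The function names and the example.org zone names are hypothetical, and it assumes the zone names are already available as a list rather than parsed out of named.conf.

```python
def static_stub_zones(zone_names, server="::1"):
    """Return one static-stub zone stanza per zone, pointing at `server`."""
    # Pad the names so the stanzas line up, as in a hand-written config.
    width = max((len(name) for name in zone_names), default=0)
    stanzas = []
    for name in sorted(zone_names):
        stanzas.append(
            f"zone {name.ljust(width)} "
            f"{{ type static-stub; server-addresses {{ {server}; }}; }};"
        )
    return "\n".join(stanzas)


def recursive_view(zone_names, view="rec", server="::1"):
    """Wrap the static-stub stanzas in a recursive-only view block."""
    stanzas = static_stub_zones(zone_names, server).splitlines()
    body = "\n".join("  " + line for line in stanzas)
    return f"view {view} {{\n  match-recursive-only yes;\n{body}\n}};"


if __name__ == "__main__":
    # Hypothetical zone names, standing in for the real authoritative zones.
    print(recursive_view(["example.org", "private.example.org"]))
```

The output could be written to a file and pulled into named.conf with an include statement, so that the recursive view never drifts out of sync with the authoritative one.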

This seemed to work fine, until I tried to resolve names in - then I got a server failure. In my logs was the following (which I have slightly abbreviated):

  client ::1#55687 view rec: query: IN A +E (::1)
  client ::1#60344 view auth: query: IN A -ED (::1)
  client ::1#54319 view auth: query: IN DS -ED (::1)
  resolver: DNS format error from ::1#53 resolving
    Name (SOA) not subdomain of zone -- invalid response
  lame-servers: error (FORMERR) resolving '': ::1#53
  lame-servers: error (no valid DS) resolving '': ::1#53
  query-errors: client ::1#55687 view rec:
    query failed (SERVFAIL) for at query.c:7435

You can see the original recursive query that I made, then the resolver querying the authoritative view to get the answer and validate it. The situation here is that is an unsigned zone, so a DNSSEC validator has to check its delegation in the parent zone and get a proof that there is no DS record, to confirm that it is OK for to be unsigned. Something is going wrong with BIND's attempt to get this proof of nonexistence.

When BIND gets a non-answer it has to classify it as a referral to another zone or an authoritative negative answer, as described in RFC 2308 section 2.2. It is quite strict in its sanity checks, in particular it checks that the SOA record refers to the expected zone. This check often discovers problems with misconfigured DNS load balancers which are given a delegation for but which think their zone is, leading them to hand out malformed negative responses to AAAA queries.

This negative answer SOA sanity check is what failed in the above log extract. Very strange - the resolver seems to be looking for the DS record in the zone, not the zone, so when it gets an answer from the zone it all goes wrong. Why is it looking in the wrong place?
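
To make the failure concrete, here is a toy Python model of the sanity check, assuming it boils down to "the owner name of the SOA in a negative answer must be at or below the zone the resolver believes it is querying". This is a sketch of the rule as the log message describes it, not BIND's actual code, and the example.org names are hypothetical.

```python
def labels(name):
    """Split a domain name into lower-case labels, ignoring a trailing dot."""
    return [label for label in name.lower().rstrip(".").split(".") if label]


def is_subdomain(name, zone):
    """True if `name` is equal to `zone` or lies underneath it."""
    n, z = labels(name), labels(zone)
    return len(n) >= len(z) and n[len(n) - len(z):] == z


def negative_answer_is_sane(soa_owner, expected_zone):
    """Model of the check: the SOA must not belong to a zone above the
    one the resolver expects to be talking to."""
    return is_subdomain(soa_owner, expected_zone)


if __name__ == "__main__":
    # SOA matches the zone being queried: accepted.
    print(negative_answer_is_sane("example.org", "example.org"))      # True
    # Resolver expects the child zone but the SOA is the parent's: rejected.
    print(negative_answer_is_sane("example.org", "www.example.org"))  # False
```

The second call is the shape of both problems described above: the answer carries an SOA from a zone higher up the tree than the one the resolver thinks it asked, so the response is treated as malformed.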

In fact the same problem occurs for the zone itself, but in this case the bug turns out to be benign:

  client ::1#16276 view rec: query: IN A +E (::1)
  client ::1#65502 view auth: query: IN A -ED (::1)
  client ::1#61409 view auth: query: IN DNSKEY -ED (::1)
  client ::1#51380 view auth: query: IN DS -ED (::1)
  security: client ::1#51380 view auth: query (cache) '' denied
  lame-servers: error (chase DS servers) resolving '': ::1#53

You can see my original recursive query, and the resolver querying the authoritative view to get the answer and validate it. But it sends the DS query to itself, not to the name servers for the zone. When this query fails, BIND re-tries by working down the delegation chain from the root, and this succeeds so the overall query and validation works despite tripping up.

This bug is not specific to the weird two-view setup. If I revert to my old configuration, without views, and just slaving and, I can trigger the benign version of the bug by directly querying for the DS record:

  client ::1#30447 ( query: IN DS +E (::1)
  lame-servers: error (chase DS servers) resolving '':

In this case the resolver sent the upstream DS query to one of the authoritative servers for, and got a negative response from the zone apex per RFC 4035. This did not fail the SOA sanity check, but it did trigger the fall-back walk down the delegation chain.

In the simple slave setup, queries for do not fail because they are answered from authoritative data without going through the resolver. If you change the zone configurations from slave to stub or static-stub then the resolver is used to answer queries for names in those zones, and so queries for explode messily as BIND tries really hard (128 times!) to get a DS record from all the available name servers but keeps checking the wrong zone.

I spent some time debugging this on Friday evening, which mainly involved adding lots of logging statements to BIND's resolver to work out what it thought it was doing. Much confusion and headscratching and eventually understanding.

BIND has some functions called findzonecut() which take an option to determine whether it wants the child zone or the parent zone. This works OK for dns_db_findzonecut() which looks in the cache, but dns_view_findzonecut() gets it wrong. This function works out whether to look for the name in a locally-configured zone, and if so which one, or otherwise in the cache, or otherwise work down from the root hints. In the case of a locally-configured zone it ignores the option and always returns the child side of the zone cut. This causes the resolver to look for DS records in the wrong place, hence all the breakage described above.
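
The difference between the two behaviours can be sketched with a toy model in Python. This is not BIND's implementation: the `want_parent` and `honour_option` parameters are hypothetical stand-ins for the real option and for the bug, and the zone names are made up. It only models the locally-configured-zone case, returning None where the real function would go on to the cache or the root hints.

```python
def labels(name):
    """Split a domain name into lower-case labels, ignoring a trailing dot."""
    return [label for label in name.lower().rstrip(".").split(".") if label]


def find_zone_cut(name, local_zones, want_parent=False, honour_option=True):
    """Return the deepest locally configured zone that encloses `name`.

    With want_parent set (as for a DS query), an exact match on a zone
    apex is skipped so that the parent side of the cut is returned.
    Setting honour_option=False models the bug: the option is ignored
    and the child side always wins.
    """
    n = labels(name)
    best = None
    for zone in local_zones:
        z = labels(zone)
        if len(n) < len(z) or n[len(n) - len(z):] != z:
            continue  # this zone does not enclose the name
        if want_parent and honour_option and n == z:
            continue  # DS records live on the parent side of the cut
        if best is None or len(z) > len(labels(best)):
            best = zone
    return best  # None means "not in a local zone"


if __name__ == "__main__":
    zones = ["example.org", "private.example.org"]
    # Correct behaviour: a DS lookup is directed at the parent zone.
    print(find_zone_cut("private.example.org", zones, want_parent=True))
    # Buggy behaviour: the option is ignored and the child zone is returned.
    print(find_zone_cut("private.example.org", zones, want_parent=True,
                        honour_option=False))
```

In the buggy case the resolver ends up asking the child zone for a record that only the parent can answer for, which is what sets off the failures described earlier.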

I worked out a patch to fix this DS record resolution problem, and I have sent details of the bug and my fix upstream. And I now have a name server that correctly validates its authoritative zones :-)
