Tony Finch (fanf) wrote,

Obscure problem caused by bad DNS load balancer

Our colleagues have recently been having problems talking to the Universities and Colleges Admissions Service (UCAS) ODBC server. The cause was DNS resolution failures and lengthy time-outs when trying to look up the IPv6 address for odbc.ucas.com. BIND complains in its log:

22-Jul-2010 19:29:17.827 resolver: notice:
  DNS format error from 62.189.0.250#53 resolving odbc.ucas.com/AAAA
  for client 127.0.0.1#52970: invalid response
22-Jul-2010 19:29:17.827 lame-servers: info:
  error (FORMERR) resolving 'odbc.ucas.com/AAAA/IN': 62.189.0.250#53

This is puzzling, because dig shows that the server is sending back what looks like a perfectly well-formatted NODATA response:

; <<>> DiG 9.7.1-P2 <<>> +norec +multi aaaa odbc.ucas.com @62.189.0.250
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59781
;; flags: qr aa; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;odbc.ucas.com.         IN AAAA

;; AUTHORITY SECTION:
ucas.com.               86400 IN SOA ucas.com. administrator.ucas.com. (
                                998545544  ; serial
                                28800      ; refresh (8 hours)
                                7200       ; retry (2 hours)
                                604800     ; expire (1 week)
                                86400      ; minimum (1 day)
                                )

;; Query time: 8 msec
;; SERVER: 62.189.0.250#53(62.189.0.250)
;; WHEN: Thu Jul 22 19:36:26 2010
;; MSG SIZE  rcvd: 105

However, if you trace the resolution of odbc.ucas.com down from the root, you will see that it is delegated from its parent domain to a set of load-balancing DNS servers:

; <<>> DiG 9.7.1-P2 <<>> +norec aaaa odbc.ucas.com @ns0.netcentral.co.uk.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56307
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 4

;; QUESTION SECTION:
;odbc.ucas.com.                 IN      AAAA

;; AUTHORITY SECTION:
odbc.ucas.com.          3600    IN      NS      ns-lp.ucas.com.

;; ADDITIONAL SECTION:
ns-lp.ucas.com.         3600    IN      A       62.189.0.250
ns-lp.ucas.com.         3600    IN      A       81.171.139.250
ns-lp.ucas.com.         3600    IN      A       194.80.160.250
ns-lp.ucas.com.         3600    IN      A       195.188.99.250

;; Query time: 13 msec
;; SERVER: 212.57.232.5#53(212.57.232.5)
;; WHEN: Thu Jul 22 19:39:56 2010
;; MSG SIZE  rcvd: 115

This means that the zone cut (and hence the start of authority) for odbc.ucas.com is at odbc.ucas.com itself, but the SOA record in the NODATA response claimed it was ucas.com. BIND keeps track of where the zone cuts are, and requires that the owner names of resource records in the authority section are at or below the zone cut. When this is not the case, BIND's NODATA parsing logic ends up at these lines of code:

	/*
	 * The responder is insane.
	 */
	log_formerr(fctx, "invalid response");
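
The check that fails here can be sketched in a few lines of Python. This is only an illustration of the idea, not BIND's actual code: the helper names `labels` and `is_at_or_below` are made up for this example, and the two names compared are taken from the dig output above.

```python
def labels(name: str) -> tuple[str, ...]:
    """Split a domain name into lowercase labels, ignoring a trailing dot."""
    return tuple(label for label in name.lower().rstrip(".").split(".") if label)


def is_at_or_below(name: str, zone_cut: str) -> bool:
    """True if `name` equals `zone_cut` or is a subdomain of it.

    The comparison is label by label, so "xodbc.ucas.com" does not count
    as being below "odbc.ucas.com".
    """
    n, z = labels(name), labels(zone_cut)
    return len(n) >= len(z) and n[len(n) - len(z):] == z


# Zone cut learned from the referral (owner of the NS record).
zone_cut = "odbc.ucas.com."
# Owner name of the SOA record in the load balancer's NODATA response.
soa_owner = "ucas.com."

if not is_at_or_below(soa_owner, zone_cut):
    print(f"{soa_owner} is above the zone cut {zone_cut}: invalid response")
```

An SOA owned by odbc.ucas.com would pass this test; the one the load balancer sends, owned by ucas.com, sits above the cut and is rejected.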
 

These load balancers are severely broken in other ways too. If you ask them for any RR type other than A or AAAA, they do not send any reply at all, leaving you to hang. Exceedingly rude! They also do not listen on TCP, as they should.
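
If you want to test a server for this sort of misbehaviour yourself, something like the following rough sketch would do, using only Python's standard library. The function names (`build_query`, `udp_replies`, `tcp_listens`), the fixed query ID and the three-second timeout are all choices made for this example. It only reports whether anything came back; it makes no attempt to parse the reply.

```python
import socket
import struct

# Numeric codes for a few common RR types.
QTYPES = {"A": 1, "NS": 2, "SOA": 6, "MX": 15, "TXT": 16, "AAAA": 28}


def build_query(name: str, qtype: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query: one question, class IN, recursion not desired."""
    # Header: ID, flags (all zero), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
        if label
    ) + b"\x00"
    # QTYPE, then QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", QTYPES[qtype], 1)


def udp_replies(server: str, name: str, qtype: str, timeout: float = 3.0) -> bool:
    """Send one UDP query; True if any datagram comes back before the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name, qtype), (server, 53))
            sock.recvfrom(4096)
            return True
        except OSError:  # includes socket.timeout
            return False


def tcp_listens(server: str, timeout: float = 3.0) -> bool:
    """True if the server accepts a TCP connection on port 53."""
    try:
        with socket.create_connection((server, 53), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `udp_replies("62.189.0.250", "odbc.ucas.com", "MX")` coming back False while the same call with `"A"` comes back True would match the behaviour described above, and `tcp_listens("62.189.0.250")` checks the TCP complaint.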
