fanf

Obscure problem caused by bad DNS load balancer

22nd Jul 2010 | 18:53

Our colleagues have recently been having problems talking to the Universities and Colleges Admissions Service (UCAS) ODBC server. The cause was DNS resolution failures and lengthy time-outs when trying to look up the IPv6 address for odbc.ucas.com. BIND complains in its log:

22-Jul-2010 19:29:17.827 resolver: notice:
  DNS format error from 62.189.0.250#53 resolving odbc.ucas.com/AAAA
  for client 127.0.0.1#52970: invalid response
22-Jul-2010 19:29:17.827 lame-servers: info:
  error (FORMERR) resolving 'odbc.ucas.com/AAAA/IN': 62.189.0.250#53

This is puzzling, because dig shows that the server is sending back what looks like a perfectly well-formatted NODATA response:

; <<>> DiG 9.7.1-P2 <<>> +norec +multi aaaa odbc.ucas.com @62.189.0.250
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59781
;; flags: qr aa; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;odbc.ucas.com.         IN AAAA

;; AUTHORITY SECTION:
ucas.com.               86400 IN SOA ucas.com. administrator.ucas.com. (
                                998545544  ; serial
                                28800      ; refresh (8 hours)
                                7200       ; retry (2 hours)
                                604800     ; expire (1 week)
                                86400      ; minimum (1 day)
                                )

;; Query time: 8 msec
;; SERVER: 62.189.0.250#53(62.189.0.250)
;; WHEN: Thu Jul 22 19:36:26 2010
;; MSG SIZE  rcvd: 105

However, if you trace the resolution of odbc.ucas.com down from the root, you will see that it is delegated from its parent domain to a set of load-balancing DNS servers:

; <<>> DiG 9.7.1-P2 <<>> +norec aaaa odbc.ucas.com @ns0.netcentral.co.uk.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56307
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 4

;; QUESTION SECTION:
;odbc.ucas.com.                 IN      AAAA

;; AUTHORITY SECTION:
odbc.ucas.com.          3600    IN      NS      ns-lp.ucas.com.

;; ADDITIONAL SECTION:
ns-lp.ucas.com.         3600    IN      A       62.189.0.250
ns-lp.ucas.com.         3600    IN      A       81.171.139.250
ns-lp.ucas.com.         3600    IN      A       194.80.160.250
ns-lp.ucas.com.         3600    IN      A       195.188.99.250

;; Query time: 13 msec
;; SERVER: 212.57.232.5#53(212.57.232.5)
;; WHEN: Thu Jul 22 19:39:56 2010
;; MSG SIZE  rcvd: 115

This means that the start of authority (the zone cut) for odbc.ucas.com is odbc.ucas.com, but the NODATA response claimed it was ucas.com. BIND keeps track of where the zone cuts are, and requires that the owner names of resource records in the authority section are at or below the zone cut. When this is not the case, BIND's NODATA parsing logic ends up at these lines of code:

	/*
	 * The responder is insane.
	 */
	log_formerr(fctx, "invalid response");
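
The idea behind that check can be sketched in a few lines of Python. This is only an illustration, not BIND's actual code: the function names are hypothetical, and names are compared as plain dotted strings rather than as wire-format labels.

```python
def labels(name):
    """Split a domain name into lower-case labels, ignoring the trailing dot."""
    return [label.lower() for label in name.rstrip(".").split(".") if label]


def at_or_below(name, zone_cut):
    """True if name equals zone_cut or is a subdomain of it."""
    n, z = labels(name), labels(zone_cut)
    return len(n) >= len(z) and n[len(n) - len(z):] == z


def soa_owner_acceptable(soa_owner, zone_cut):
    """Hypothetical stand-in for the resolver's sanity check on a NODATA reply."""
    return at_or_below(soa_owner, zone_cut)


# The delegation puts the zone cut at odbc.ucas.com,
# but the SOA in the NODATA reply is owned by ucas.com.
print(soa_owner_acceptable("ucas.com.", "odbc.ucas.com."))       # False
print(soa_owner_acceptable("odbc.ucas.com.", "odbc.ucas.com."))  # True
```

A reply whose SOA owner fails this test is the kind that gets logged as "invalid response" above.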
 

These load balancers are severely broken in other ways. If you ask them for any RR type other than A or AAAA they do not send any reply at all, leaving you to hang. Exceedingly rude! They also do not listen on TCP as they should.
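
If you want to check a server for this kind of behaviour yourself, something like the following rough Python sketch would do. It uses only the standard library; the function names, the list of query types, and the 3-second timeout are my own choices for illustration, and it counts any datagram that comes back as a reply without checking the source address or the query ID.

```python
import socket
import struct

# A handful of common RR types to try (an arbitrary selection).
QTYPES = {"A": 1, "NS": 2, "SOA": 6, "MX": 15, "TXT": 16, "AAAA": 28}


def build_query(name, qtype, query_id=0x1234):
    """Build a minimal non-recursive DNS query for class IN."""
    # Header: ID, flags (all zero, so RD is off), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def udp_replies(server, name, qtype, timeout=3.0):
    """Return True if any UDP datagram comes back before the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name, qtype), (server, 53))
            sock.recvfrom(4096)
            return True
        except OSError:  # includes socket.timeout
            return False


def tcp_listens(server, timeout=3.0):
    """Return True if the server accepts a TCP connection on port 53."""
    try:
        with socket.create_connection((server, 53), timeout=timeout):
            return True
    except OSError:
        return False


def probe(server, name):
    """Print, per query type, whether the server answered over UDP."""
    for label, code in QTYPES.items():
        print(label, "reply" if udp_replies(server, name, code) else "no reply")
    print("TCP", "open" if tcp_listens(server) else "closed or filtered")
```

Calling `probe("62.189.0.250", "odbc.ucas.com")` would run the checks against one of the servers above; what it prints depends entirely on the state of the network and the server at the time.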

Comments {8}

from: dwmalone
date: 23rd Jul 2010 05:39 (UTC)

The people who write DNS load balancers seem to have an amazing knack of getting things wrong... I'd written a little tool (at http://www.cnri.dit.ie/cgi-bin/check_aaaa.pl ) to try and spot this sort of AAAA problem, but it won't spot this one.

Yeah, tell the UCAS guys....

from: anonymous
date: 26th Jul 2010 18:03 (UTC)

The worst part of all this is the arrogant so and so at UCAS who is simply replying with "We're following the standard, there's nothing wrong with _our_ set up". I suggested the person dealing with this should forward that on to the UCS hostmasters in a "light blue touch paper and retire quickly" kind of way. I don't mind arrogant so much as long as the person in question is technically competent, which is not so in this case.
What's more, I've actually worked out the product they're using. I mentioned the issue to a friend who simply said "Yep, they'll be using product-foo. We used to have some of those. They were crap and we threw them away, taking the financial hit. It's better to have something that works. The only thing this product is good for is impressing clueless project managers. Once techies get their hands on them, their shortcomings become more than apparent."
Oh well, if it wasn't for things like this, life would be quiet, boring and I could get some real work done....

Tony Finch

Re: Yeah, tell the UCAS guys....

from: fanf
date: 26th Jul 2010 18:16 (UTC)

Hey, leaving out the juicy details is not fair! We name and shame the culprits on this blog :-)

Re: Yeah, tell the UCAS guys....

from: anonymous
date: 27th Jul 2010 12:25 (UTC)

The product in question? That'll be a Radware Linkproof system. Judicious use of your favourite search engine should give you the marketing guff for them.

Roy

from: owdbetts
date: 26th Jul 2010 18:34 (UTC)

Looks to be very similar to the problem I see with the load balancer at www.kitco.com. Though for some reason with that domain BIND comes back immediately with a SERVFAIL, rather than hanging.

SERVFAIL is still bad if you're using forwarders, because if BIND gets SERVFAIL from a forwarder, it will go on to try the remaining forwarders in the list. And of course, if one of them is down, you get a long delay (which is how I discovered the problem).

-roy

Tony Finch

from: fanf
date: 26th Jul 2010 18:49 (UTC)

The difference is that all three nameservers for www.kitco.com reply, whereas at least one of the four name servers for odbc.ucas.com is too broken to make any reply to AAAA queries. This breakage leads to a mixture of SERVFAIL and timeout from the local recursive/caching BIND.

In fact 81.171.139.250 doesn't reply to A queries either at the moment.

62.189.0.250 and 195.188.99.250 seem to be very confused about their identities. Sometimes they reply as expected, sometimes they do this:

$ dig aaaa odbc.ucas.com @195.188.99.250
;; reply from unexpected source: 62.189.0.250#53, expected 195.188.99.250#53
;; reply from unexpected source: 62.189.0.250#53, expected 195.188.99.250#53
;; reply from unexpected source: 62.189.0.250#53, expected 195.188.99.250#53

; <<>> DiG 9.7.1-P2 <<>> aaaa odbc.ucas.com @195.188.99.250
;; global options: +cmd
;; connection timed out; no servers could be reached

$ # later...

$ dig aaaa odbc.ucas.com @62.189.0.250
;; reply from unexpected source: 195.188.99.250#53, expected 62.189.0.250#53
;; reply from unexpected source: 195.188.99.250#53, expected 62.189.0.250#53
;; reply from unexpected source: 195.188.99.250#53, expected 62.189.0.250#53

; <<>> DiG 9.7.1-P2 <<>> aaaa odbc.ucas.com @62.189.0.250
;; global options: +cmd
;; connection timed out; no servers could be reached

Unbound accepts it

from: bortzmeyer.org
date: 18th Jul 2012 07:51 (UTC)

I just tested this problem ("dig CNAME ns1.webhosting24.com" triggers a SERVFAIL in BIND because ns1.webhosting24.com is delegated but the name servers reply with a zone cut to webhosting24.com) and I noticed that Unbound accepts this reply and yields a result. Here on OARC's ODVR service:

# BIND
% dig @2001:4f8:3:2bc:1::64:20 CNAME ns1.webhosting24.com

; <<>> DiG 9.8.1-P1 <<>> @2001:4f8:3:2bc:1::64:20 CNAME ns1.webhosting24.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 35315
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
;; QUESTION SECTION:
;ns1.webhosting24.com. IN CNAME

;; Query time: 656 msec
;; SERVER: 2001:4f8:3:2bc:1:0:64:20#53(2001:4f8:3:2bc:1:0:64:20)
;; WHEN: Wed Jul 18 09:21:27 2012
;; MSG SIZE rcvd: 49

# Unbound
% dig @2001:4f8:3:2bc:1::64:21 CNAME ns1.webhosting24.com

; <<>> DiG 9.8.1-P1 <<>> @2001:4f8:3:2bc:1::64:21 CNAME ns1.webhosting24.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43630
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
;; QUESTION SECTION:
;ns1.webhosting24.com. IN CNAME

;; Query time: 492 msec
;; SERVER: 2001:4f8:3:2bc:1:0:64:21#53(2001:4f8:3:2bc:1:0:64:21)
;; WHEN: Wed Jul 18 09:21:31 2012
;; MSG SIZE rcvd: 49

Does it mean Unbound is wrong and I should report a bug?

Tony Finch

Re: Unbound accepts it

from: fanf
date: 18th Jul 2012 09:21 (UTC)

I think it can be argued either way! A popular strict implementation ought to encourage other vendors to get their implementations right - but that clearly hasn't worked for these crappy load balancers which are far too common. A more lenient resolver has the obvious (inter)operational advantages. If I remember discussing this with Mark Andrews correctly, this particular check corresponds to text in RFC 2308 (I think section 3) but that is also silent on the subject of incorrect SOA records.

By the way, thanks for reminding me about this post - I wish I had remembered it when discussing Joe Abley's proposed AS112 improvements...
