?

Log in

No account? Create an account

fanf

Weird network bug

« previous entry | next entry »
1st Dec 2008 | 20:13

I just spent a couple of hours debugging a strange network problem, which I haven't seen before.

The initial description was that email relayed from one departmental email server to another via my email server sometimes failed with a timeout error. This only occurred with messages with multiple recipients.

My initial suspicion was that it might be related to an old (now fixed) Exim callout verification bug, which had caused us some problems in the past. If a callout's target server was slow, and there were many recipients needing callout verification, and the client used pipelining, the total time for the callout delays could add up to more than the client's timeout, so it wouldn't see Exim's pipelined replies. Exim now flushes its replies before doing a callout to avoid this problem.

However, some testing showed that the relevant servers were quite swift, which cast some doubt on this diagnosis.

The problem report included some good details which were unfortunately too old for me to be able to correlate with our logs. I had a look to see if there was anything more recent that I could examine. It turned out that there were a couple of messages on our queues that had been delayed by this problem. I ran a test delivery of one of them and it did in fact time out between us sending the message envelope and them sending the reply. Consistent with my guess, but not consistent with other testing.

I tried a delivery with full debugging from Exim, but there was no enlightenment. I tried to send a copy of the problem message's envelope manually. This worked. Um, WTF? What was the difference between what Exim did and what I did?

I ran tcpdump to analyze the two connections. It turned out that Exim was sending the whole envelope (MAIL, RCPT, RCPT, ..., DATA) in one packet, whereas my cut-and-paste into telnet split the envelope into a packet for the MAIL command and one for the rest. So I created a file containing the envelope and used a little shell script to drive telnet. Bingo. I had reproduced the problem.

At this point I thought it might be an MTU-related problem, e.g. blocking ICMP "Fragmentation Needed" messages. I tried pinning down the packet size at which the problem started occurring, with a guess of something related to 512 bytes. After a couple of goes I noticed that the problem message had an envelope of 513 bytes, and an envelope of 512 bytes worked OK.

Then I tried a larger envelope, and - boggle - it worked. 512 or less worked, 513 failed, 514 or greater worked. This also explained why the problem affected message envelopes but not message data. The usual symptom of MTU firewall problems is timeouts at the message data stage, not during the envelope, and in this case message data was getting through fine. (The overhead of message headers makes it very unlikely that a message will have as little as 513 bytes of data.)

A couple of friends suggested testing 1025 as well, and it demonstrated similar but weirder problems. Again 1024 worked, 1025 failed, and 1026 worked. However in the failure case, the timeout occurred after I received the destination's replies to the first half of the envelope - enough commands to fit in 512 bytes. A closer look at both the 513 and 1025 cases revealed that I was getting TCP ACKs for all the data I sent, but something was going wrong after that.

I guess the problem is a firewall that's doing some kind of TCP interception and re-segmentation, and getting it wrong. The ACKs would therefore have been generated by the firewall and not by the actual destination. Someone needs to be given a good whipping with a copy of end-to-end arguments in system design before having their firewall programming licence revoked.

| Leave a comment | Share

Comments {10}

Gerg

from: zkzkz
date: 1st Dec 2008 21:10 (UTC)

Someone needs to be given a good whipping with a copy of end-to-end arguments in system design before having their firewall programming licence revoked.


Ow! The mental discordance is hurting my brain. Surely *everyone* with a firewall programming licence needs a whipping with a copy of the end-to-end arguments.

Reply | Thread

Tony Finch

from: fanf
date: 1st Dec 2008 21:20 (UTC)

Not everyone needs repeated whippings :-)

Reply | Parent | Thread

Gerg

from: zkzkz
date: 2nd Dec 2008 09:37 (UTC)

actually the person i most want to whip with a copy of the end-to-end arguments is whoever is handing out these firewall licenses!

Reply | Parent | Thread

gareth_rees

from: gareth_rees
date: 1st Dec 2008 21:38 (UTC)

So what are you going to do? I suppose you could notice when an envelope has a bad size and if so, pad it in some harmless way. But maybe that way lies madness...

Reply | Thread

Tony Finch

from: fanf
date: 1st Dec 2008 21:47 (UTC)

I'm going to wait for the firewall admin to fix it.

Reply | Parent | Thread

Gerg

from: zkzkz
date: 2nd Dec 2008 09:40 (UTC)

one option would be to bounce all mail to the problematicdomain with a message explaining the problem and suggesting the mailer contact the admin of that domain

Reply | Parent | Thread

Tony Finch

from: fanf
date: 2nd Dec 2008 11:32 (UTC)

It's all internal email, and the two departments that are involved were talking to each other about it before getting us to help. I hope it won't take them too long to work out what's causing such peculiar breakage. At least it isn't affecting very much email.

Edited at 2008-12-02 11:35 am (UTC)

Reply | Parent | Thread

What firewall software is this?

from: apostmaster
date: 2nd Dec 2008 12:03 (UTC)

So, what's the firewall software? Is it something developed in house, or something that other sites might need to worry about, too?

Reply | Parent | Thread

Tony Finch

Re: What firewall software is this?

from: fanf
date: 2nd Dec 2008 12:15 (UTC)

That's what I'd like to know.

Reply | Parent | Thread

Malc

from: mas90
date: 6th Dec 2008 17:53 (UTC)

I had a similarly obtuse bug a while back. Mine affected SYN packets sent by most Linux apps, but not Windows as the SYNs happen to be a different length there... :-)

Reply | Thread