I've spent most of this afternoon staring at tcpdump traces trying to work out what is causing connections from an Exchange server to my Exim servers to be lost.
The original symptom was that our sending rate measurements for the server were going crazy - increasing massively seemingly without a corresponding increase in traffic. This happens because (in our configuration) if a connection is going OK (valid sender and recipient addresses, etc.) Exim doesn't log anything until a message is accepted. Our rate measurement calculations are performed when processing RCPT commands, before the connections were lost, and also before they got logged. (This is, obviously, less than ideal.) So the Exchange server's sending rate seemed to go up for no reason.
I temporarily increased our logging which verified that this was the correct explanation. But why were the connections being lost? I ran tcpdump for a bit and looked at the contents of the connections from the problem host. This showed that a couple of messages with lots of recipients were being repeatedly retried, only to fail because the connection was lost mid-transaction. Most messages were getting through OK.
I suspected a broken firewall, so I sent a message to the relevant staff giving them some details of the problem and asking if they have a firewall that might be causing it. I turned off tcpdump to stop it filling the disk :-) and kept half an eye on the logs. Soon a new and useful bit of information turned up: (I've edited this log extract a bit.)
2009-09-01 18:00:20 +0100 SMTP protocol synchronization error (next input sent too soon: pipelining was advertised): rejected "XXXX TO:<email@example.com>" H=mailserver.dept.cam.ac.uk [131.111.999.999]:50555 I=[18.104.22.168]:25 next input="RCPT TO:<firstname.lastname@example.org>\r\nRCPT TO:<email@example.com>\r\n"
Exim is complaining about the XXXX command because only certain specific commands may be pipelined and XXXX isn't one of them. So why is the Exchange server sending gibberish XXXX (sic!) instead of RCPT? In fact it isn't. I took another tcpdump and found that the problem command fell across a packet boundary: one packet ended with @cam.ac.uk>\r\nXX and the next one started with XX TO:<.
This is an absolutely classic signature of the utterly braindamaged Cisco PIX firewall's SMTP fuxup mode. Oh it aggravates me so much. It replaces SMTP commands that it doesn't know with XXXXes, and it's too stupid to handle packet boundaries correctly.
I sent another message to the department's IT staff telling them to fix their firewall ASAP. I shall probably send similar instructions to another department with a misconfigured PIX, since they brushed off my previous request to fix it. Both departments are also having problems with email from us to them, about which our servers can log more details, so I hope I have enough evidence to persuade them to heed my advice.