

More log-structured MTA queues.

1st Nov 2006 | 18:34

At the end of http://fanf.livejournal.com/65911.html I mentioned that it might be beneficial to have multiple queues, in order to reduce the density of garbage in the older parts of the queue. There are at least a couple of other reasons why one might want multiple queues.

Even faster

In the absence of any other bottlenecks, the MTA is going to be limited by the rate that it can fsync the main queue. You can raise this limit if you have two parallel queues on different disks, and spread the load of incoming messages between them. I got this idea from the Sendmail X design document which describes a similar (but not quite so neat) queue structure to the one I have been describing.

TURN

SMTP's TURN feature allows a client to ask the server to deliver any email queued on the server for a particular domain. This might be used by business dial-up customers who call in to their ISP every so often to collect email.

There are two variants of TURN in the current specifications, ETRN and ATRN, because the original form was insecure. With ETRN the server delivers the queued messages as if from a normal queue run; the only security concern is that the server must have some throttling to prevent clients from starting unlimited numbers of queue runners. With ATRN the client and server switch roles after the ATRN command, and the existing connection is used to deliver the messages from the server to the client. The client must be authenticated so that the server knows it is permitted to receive the email it is asking for. ATRN is used on the "on-demand mail relay" port 366 instead of the usual SMTP port.

The basic implementation of ETRN (common to sendmail, Exim, and older Postfix) is for the server to fire off sendmail -qR, which scans the entire queue for messages with recipient addresses containing the domain given by the client. This is horribly inefficient if you have lots of messages on the queue, and the more clients you have using ETRN the more your queues get clogged with undeliverable messages.

The solution to this problem is to get messages for your ETRN domains off the queue; with Exim this is typically done by delivering them to a batch-SMTP file per domain, which can then be re-injected for delivery fairly efficiently when the client says ETRN. This kind of setup is a must for ATRN: whereas ETRN uses normal SMTP routing and delivery (which works if the clients have static IP addresses), ATRN does not, so there is generally no way to deliver the messages except by ATRN. ETRN is an optimisation to allow clients to tell the server not to wait for a retry timeout, whereas ATRN is purely on-demand so it is actually wrong to leave the messages on the retry queue.
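The per-domain batch-SMTP setup described above can be sketched like this. The spool layout and the `reinject` callback are assumptions for the example, not Exim's actual mechanism; the point is that handling an ETRN costs work proportional to that one domain's backlog rather than a scan of the whole queue.

```python
import os

SPOOL = "bsmtp-spool"   # hypothetical spool directory, one file per domain

def queue_for_domain(domain: str, bsmtp_text: str):
    """Append a message, in batch-SMTP form, to its domain's file."""
    os.makedirs(SPOOL, exist_ok=True)
    with open(os.path.join(SPOOL, domain), "a") as f:
        f.write(bsmtp_text)

def handle_etrn(domain: str, reinject):
    """On ETRN, re-inject just this domain's batch; no queue scan."""
    path = os.path.join(SPOOL, domain)
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        batch = f.read()
    os.remove(path)
    reinject(batch)
    return len(batch)
```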

With my log-structured queues we can use this idea but do it more efficiently. When a message is addressed to an ATRN domain, or cannot be delivered to an ETRN domain, its envelope is written to that domain's queue instead of being appended to the main queue. One thing that makes this slightly more interesting is that the envelope may have to be split if it has recipients at multiple domains, which introduces a requirement for some kind of reference counting of spool files. The per-domain queue files are not routinely scanned, and when a client requests a delivery the server can simply and efficiently work through its queue.
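The envelope split and the reference counting it forces can be sketched as follows. All names here are hypothetical: one message body (spool file) ends up shared by several per-domain envelopes, so it carries a reference count and may only be deleted when the last envelope has been disposed of.

```python
from collections import defaultdict

class SpoolFile:
    """A message body shared by one or more envelopes."""
    def __init__(self, message_id):
        self.message_id = message_id
        self.refs = 0

    def ref(self):
        self.refs += 1

    def unref(self):
        self.refs -= 1
        return self.refs == 0   # True => safe to delete the data

def split_envelope(spool: SpoolFile, recipients):
    """Split an envelope into one per recipient domain.

    Each per-domain envelope takes a reference on the shared spool file.
    """
    by_domain = defaultdict(list)
    for rcpt in recipients:
        by_domain[rcpt.rsplit("@", 1)[1]].append(rcpt)
    envelopes = []
    for domain, rcpts in by_domain.items():
        spool.ref()
        envelopes.append((domain, rcpts, spool))
    return envelopes
```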

Postfix leaves ETRN messages on its "deferred" queue, but optimises ETRN by keeping per-domain indexes of messages. This has the advantage of avoiding the need to split messages per domain, and means that ETRN domains are still retried even if the client doesn't ask. However, it means that the deferred queue can get large and normal queue runs can get expensive. We should also periodically retry ETRN domains, but this won't happen unless they have messages on the main queue. To deal with that we should periodically probe these domains, which does not need any disk activity if the domain remains unreachable.
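The indexed alternative can be sketched like this. The flat in-memory index is an assumption for illustration (Postfix keeps its state on disk); the point is that an ETRN costs a lookup proportional to that domain's messages, while the deferred queue itself stays whole.

```python
from collections import defaultdict

class DeferredQueue:
    """One deferred queue, plus a per-domain index for ETRN lookups."""
    def __init__(self):
        self.entries = []                  # the single deferred queue
        self.index = defaultdict(list)     # domain -> entry positions

    def defer(self, message_id, recipients):
        pos = len(self.entries)
        self.entries.append((message_id, recipients))
        for rcpt in recipients:
            self.index[rcpt.rsplit("@", 1)[1]].append(pos)

    def etrn(self, domain):
        # O(messages for this domain), not O(size of whole queue).
        return [self.entries[pos] for pos in self.index.get(domain, [])]
```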


Comments {7}


from: pjc50
date: 1st Nov 2006 18:51 (UTC)

"In the absence of any other bottlenecks, the MTA is going to be limited by the rate that it can fsync the main queue."

Not for the first time I wonder if having some other interface for "commit transaction" to special hardware might be a better way of dealing with the demand for persistent transactions that drives so much software design.



from: zkzkz
date: 2nd Nov 2006 08:35 (UTC)

Wow, there are actually people doing log-structured MTAs now? I started to look at this two jobs ago, when large-volume mailings were a big part of the job, but it didn't seem like anyone else was interested. My plan was to write out a BSMTP file of received mails and log to a separate file which mails were successfully delivered. Recovery would mean reprocessing the entire BSMTP file and ignoring mails that were logged as delivered.
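The recovery scheme described in this comment can be sketched in a few lines. The file formats are invented for the example (one message id per line in both the BSMTP journal and the delivery log); real BSMTP records would of course carry full envelopes and bodies.

```python
def recover(journal_lines, delivered_lines):
    """Replay the journal, skipping anything already logged as delivered."""
    delivered = set(line.strip() for line in delivered_lines)
    return [msg.strip() for msg in journal_lines
            if msg.strip() and msg.strip() not in delivered]
```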

And yes, in Postgres we've speculated about being able to pre-allocate a log file large enough to span an entire track and then write transactions to whatever unused block will come under the head soonest. The problem is a) there's no way to find out where the head is currently and b) the mapping of logical blocks to physical locations is nontrivial these days.

I think to do that you would need an ATA command that lets you give the drive a list of blocks to pick from at its whim and then get back which block it picked.



from: pjc50
date: 2nd Nov 2006 10:14 (UTC)

What I was thinking is "is a spinning disk the most efficient way of recording these transactions?" to which the answer is definitely "no". Flash is faster so long as you're only clearing bits not setting them, which is fine for the application of recording transactions serially, and the small size doesn't matter too much because you only need one bit per transaction with all the real data in the log-structured file.

Higher speeds would be achievable by storing the bits in RAM with flash behind it and a battery to commit the data to flash in the event of main power loss.


from: chrislightfoot
date: 2nd Nov 2006 17:24 (UTC)

In principle you could have a go at running a PLL which locked to the rotation of the disk, and use this to plan which blocks to write to. With a bit of work (and a large enough contiguously-allocated journal) it should be possible to make a pretty good estimate of which is the closest free extent to the disk head. On the other hand when I played with this I didn't get very far (but I didn't try for very long either). And of course it only helps significantly in installations where you can dedicate a whole disk to the log (realistically, two disks, since nobody likes to lose their database when a disk fails -- and note that that means you need two PLLs and no simple relation between the locations of each extent written on the two devices, since there's no way to synchronise the disks to one another).

Practically speaking I think NVRAM on the disk controller is the right solution, though it'd be interesting properly to research the PLL approach.

My solution to the MTA performance question, in my day job, was to argue that, well, machine failures are pretty rare and the occasional duplicated email isn't the end of the world. But this was for a specific application (confirmation link emails for a high-traffic website) and wouldn't be appropriate in general.



from: drpryor
date: 1st Nov 2006 19:48 (UTC)

"Business dial-up customers" - that is a joke, right?
Or is this a document from 1995, maybe?


Tony Finch

from: fanf
date: 1st Nov 2006 23:44 (UTC)

In May I went to Nairobi to teach African techies from ISPs and Universities about email, specifically the basics of Exim. Broadband is still rare there. Even in the first world, ATRN would be appropriate for a small office / home office broadband service with dynamic IP addresses.


from: Dave Holland [org.uk]
date: 2nd Nov 2006 12:34 (UTC)

s/would be/is/

I use Exim+odmrd on a co-lo machine and Exim+fetchmail on my PC at home on dynamic-IP broadband. It all just works.
