
Scalable log processing

21st Sep 2006 | 18:32

Here's an idea I came up with in 1999, when I was working at Demon. It's a design for a scalable log processing infrastructure that lets you plug in simple processing scripts with minimal requirements, while still attaining good scalability and reliability.

It's all based on dividing logs into relatively small batches: too large and the log processing latency gets too big; too small and the batch handling overhead gets too big. I decided that one-minute batches were a happy medium. Batches are always aligned to minute boundaries NN:NN:00, so that (assuming your computers' clocks are synchronized) batches generated on different computers start and end at the same time.
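The minute alignment is easy to sketch. Here's a minimal illustration in Python (my own sketch, not from the original design — the function name and timestamps are just for the example):

```python
from datetime import datetime, timezone

def batch_start(ts: float) -> datetime:
    """Truncate a Unix timestamp to its one-minute batch boundary (NN:NN:00)."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return dt.replace(second=0, microsecond=0)

# Two events 10 seconds apart within the same minute land in the same
# batch, on any machine whose clock agrees.
assert batch_start(1158863520.0) == batch_start(1158863530.0)
```

Because the boundary is derived purely from the timestamp, every machine computes the same batch name for the same event without any coordination.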

Log processing scripts can be simple sequential code. The requirements are that they are idempotent, so that they can be re-run if anything goes wrong; they must depend only on the time stamps in the logs, not the wall clock time; and they must be able to process batches independently.
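For instance, a per-batch script can get idempotence almost for free by naming its output after the batch and writing it atomically — re-running just produces the same file again. A hypothetical sketch (the line-counting job and file layout are my invention, purely illustrative):

```python
import os
import tempfile

def process_batch(batch_id: str, lines: list, outdir: str) -> str:
    """Hypothetical per-batch job: count the lines in a batch and record the
    result.  Output is named by batch id and written via atomic rename, so
    re-running after a crash overwrites the same file with the same content
    -- the script is idempotent."""
    out = os.path.join(outdir, batch_id + ".count")
    fd, tmp = tempfile.mkstemp(dir=outdir)
    with os.fdopen(fd, "w") as f:
        f.write("%d\n" % len(lines))
    os.replace(tmp, out)  # atomic: a partial result is never visible
    return out
```

Note the script never consults the wall clock: everything it needs is in the batch id and the log lines themselves, so batches can be processed in any order, any number of times.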

The log transport protocol must also satisfy these requirements. I decided to use HTTP, because it's already ubiquitous. The PUT method is the one we want, because it's guaranteed to be idempotent (unlike POST), and it pushes data, which is more efficient for the log server than pulling (which usually implies polling). HTTP already has lots of useful features (such as security) that we don't have to re-invent. If you use the "chunked" Transfer-Encoding, you can start sending a batch as soon as it starts, which minimizes latency.
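To make the transport concrete, here is a toy client and server using Python's standard library — my own sketch of the idea, not the original implementation; the URL layout and 204 response are assumptions for the demo:

```python
import http.client
import http.server
import threading

class BatchReceiver(http.server.BaseHTTPRequestHandler):
    """Toy log server: accepts PUT and stores each batch under its URL path.
    Because PUT is idempotent, a client can safely re-send a batch after a
    network failure."""
    batches = {}

    def do_PUT(self):
        if self.headers.get("Transfer-Encoding") == "chunked":
            body = b""
            while True:
                size = int(self.rfile.readline().split(b";")[0], 16)
                if size == 0:
                    self.rfile.readline()   # CRLF after the last chunk
                    break
                body += self.rfile.read(size)
                self.rfile.readline()       # CRLF after each chunk
        else:
            body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        BatchReceiver.batches[self.path] = body
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):           # keep the demo quiet
        pass

def put_batch(host, port, path, chunks):
    """Stream a batch with chunked Transfer-Encoding, so sending can begin
    as soon as the minute starts rather than once it ends."""
    conn = http.client.HTTPConnection(host, port)
    conn.request("PUT", path, body=iter(chunks), encode_chunked=True)
    status = conn.getresponse().status
    conn.close()
    return status
```

The client passes an iterator as the body, so each log line can be handed to `put_batch` as it is generated; the server only needs the complete batch before it kicks off processing.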

This scheme gets its scalability by exploiting parallelism.

A batch of logs may take longer to process than it does to generate if you have lots of log generating machines, or if a naive log processing script is slow - the real-world example from Demon was doing reverse DNS lookups on web logs. In this situation you can start processing the next batch as soon as it becomes available, even if the previous batch is still being processed. This means you can make use of idle CPU (waiting for DNS lookups) or multiple CPUs.
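Since batches are independent, overlapping them is just a matter of handing each one to a worker. A minimal sketch with a thread pool (the `summarize` job is a stand-in of my own; real work would be something slow like the reverse DNS lookups mentioned above):

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(batch):
    """Stand-in for slow per-batch work such as reverse DNS lookups.
    Each batch is independent, so batches can run concurrently."""
    return len(batch)

# Three one-minute batches processed in parallel; pool.map keeps results
# in batch order even if a later batch happens to finish first.
batches = [["a", "b"], ["c"], ["d", "e", "f"]]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(summarize, batches))
# results == [2, 1, 3]
```

While one worker is blocked waiting on a lookup, the others keep running, which is exactly the idle CPU the scheme exploits.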

If one machine isn't enough to handle your log load, you can send alternate batches to different machines: with two processors, for example, one handles all the even minutes and the other all the odd minutes. You can also have a hierarchy of log processors: perhaps a log collector for each cluster, which in turn passes its collected logs to the central processors.
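The even/odd split generalizes to any number of processors by taking the batch's minute number modulo the processor count. A sketch (my own illustration; host names are made up):

```python
def shard_for(batch_start_epoch: int, processors: list) -> str:
    """Route a batch to a processor by its minute number modulo the number
    of processors; with two processors this is the even/odd split."""
    return processors[(batch_start_epoch // 60) % len(processors)]

shard_for(0,   ["even-host", "odd-host"])   # minute 0 -> "even-host"
shard_for(60,  ["even-host", "odd-host"])   # minute 1 -> "odd-host"
```

Because every client computes the route from the batch timestamp alone, all chunks of a given minute arrive at the same processor without any central dispatcher.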

The reliability comes from a couple of simple things.

Logs are transmitted using TCP, which isn't lossy like traditional UDP-based syslog. Logs are written to disk before being sent to the server, so that if there is a network problem they can be transmitted later. Similarly, the log server can delay processing a batch until all the clients have sent their chunks. The log processor can catch up after an outage by processing the backlog of batches in parallel.

This implies that the log server has to have a list of clients so that it knows how many batches to expect each minute. Alternatively, if a machine comes back after an outage and sends a load of old batches to the log processor, idempotence means that it can re-process those old batches safely. I'm not really sure what is the best approach here - it requires some operational experience.
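With the client-list approach, the completeness check the server needs is tiny (a sketch under that assumption; the set-based bookkeeping is mine):

```python
def batch_complete(received_from: set, expected_clients: set) -> bool:
    """True once every client the server expects has delivered its chunk of
    this one-minute batch; processing is deferred until then."""
    return expected_clients <= received_from
```

In practice you'd also want a timeout so one dead client can't stall processing forever — which is where the operational-experience question above really bites.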


Comments

Brad Fitzpatrick

from: brad
date: 21st Sep 2006 19:09 (UTC)

LiveJournal does a bunch of stuff like this.


from: anonymous
date: 21st Sep 2006 23:41 (UTC)

Cool. I found http://danga.com/words/2005_oscon/oscon-2005.pdf which says you log to a database, as I thought. You seem to prefer using MySQL except when forced to do app-specific stuff; for example, I understand that you do email notification deliveries and retries straight from the database, rather than using a traditional MTA. Is there a more recent description of your architecture, especially the weird bits?


Brad Fitzpatrick

from: brad
date: 22nd Sep 2006 01:40 (UTC)

We also use syslog-ng to spray logs about (also TCP). But yeah, we can spray them into lots of bite-sized pieces and do analysis/summaries on them in parallel/etc.

And yeah, we have a job queue system and an outgoing email is just a job to be done. The job queue API supports enough search primitives to let us do outgoing SMTP nicely, coalescing mails to the same host/etc. But we pretty much do all the SMTP ourselves ... much less pain than other MTAs we've used. The job queue system does all the hard work, then SMTP was like a page of code.

No recent slides, sorry.


