Spam bot signatures

5th Nov 2009 | 15:04

Recently I have been investigating spam bot signatures, specifically the characteristic domain names they choose to put in their SMTP HELO commands. A lot of spam bots use the same HELO domains from lots of different compromised PCs, which makes them quite easy to spot and block without any risk of blocking legitimate email. This kind of block can take care of about 15%-20% of spam without relying on 3rd party services like DNSBLs. Of course, this is one of the techniques used to populate the Spamhaus XBL, so the only advantage of doing it yourself is that you can catch spam bots that the Spamhaus guys have not yet spotted.

Steve Champeon's Enemieslist service is the highly developed commercial implementation of this idea. His patterns are much more comprehensive, covering spam bot signatures, domestic IP connectivity (like the Spamhaus PBL), and spam-infested netblocks.

Yesterday evening I was thinking about how to automatically identify spam bot signatures, when I realised that I had already written the code to do the job! I wanted to count how many different IP addresses were using the same HELO domain, and block connections that used excessively popular domains. All I needed was a few lines of Exim configuration:

    message   = Probable spam bot HELO seen from $sender_rate networks
    condition = ${if !eqi{localhost.localdomain}{$sender_helo_name} }
  ! verify    = helo
    ratelimit = 4 / 1w / per_conn / strict \
      / unique=${mask:$sender_host_address/24} / ${lc:$sender_helo_name}

Let's unpack this in reverse order.

We're measuring the rate of use of HELO domains, so the ratelimit key is ${lc:$sender_helo_name}. It's forced to lower case so that SERVER and server are treated as the same thing.

But we don't care about the total usage rate, only the rate of uses from different unique IP addresses. The unique= option invokes the Bloom filter code to avoid counting each spam bot more than once. In fact we only count different unique /24 network blocks, in order to avoid false positives from mail clusters in which all servers use the same name. For example, Facebook's MTAs all say HELO mx-out.facebook.com though they are spread across about 100 IP addresses on a couple of /24 networks.
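As a rough illustration of the counting described above, here is a minimal Python sketch. It swaps the Bloom filter for an exact in-memory set, ignores the time-based smoothing, and assumes IPv4 clients; the class and method names are made up for this example and are not part of Exim.

```python
import ipaddress
from collections import defaultdict


class HeloTracker:
    """Count the distinct /24 networks that have used each HELO domain.

    Illustration only: an exact set stands in for the Bloom filter,
    and nothing is ever forgotten.
    """

    def __init__(self, prefix_len=24):
        self.prefix_len = prefix_len
        # lower-cased HELO name -> set of networks seen using it
        self.networks = defaultdict(set)

    def record(self, helo_name, client_ip):
        """Record one connection and return the number of distinct networks."""
        key = helo_name.lower()  # SERVER and server share one key
        # strict=False clears the host bits, like ${mask:.../24}
        net = ipaddress.ip_network(f"{client_ip}/{self.prefix_len}", strict=False)
        self.networks[key].add(net)
        return len(self.networks[key])
```

Two clients in the same /24 that give the same HELO name count once, so a mail cluster on one or two networks stays well under the limit.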

The strict option means keep counting even when the measured rate has passed the limit. The per_conn option means only count once for each connection (which mainly helps with efficiency).

The smoothing period is set to one week, which should mean that Exim doesn't easily forget which HELO domains have been abused.
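To give a feel for what a one-week smoothing period does, here is a sketch of a generic exponentially weighted moving average. It illustrates the idea only; the function name and the exact formula are assumptions for this example and should not be read as Exim's implementation.

```python
import math

ONE_WEEK = 7 * 24 * 3600  # smoothing period in seconds


def update_rate(old_rate, interval, period=ONE_WEEK, count=1.0):
    """Return the smoothed rate after `count` new events arrive
    `interval` seconds after the previous update.

    The rate is expressed as events per smoothing period.
    """
    if interval <= 0:
        interval = 1e-9  # avoid dividing by zero for simultaneous events
    i_over_p = interval / period
    a = math.exp(-i_over_p)  # weight given to the old rate
    return (1 - a) * count / i_over_p + a * old_rate
```

With this kind of smoothing, a HELO domain that reached a rate of 4 and then shows up only once over the following week is still remembered at a reduced rate rather than reset to zero.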

I've currently got the limit set to 4 blocks of /24. It might even be reasonable to reduce this to 3. I'll need to run it a bit longer to see if any more odd false positives sneak out of the woodwork.

We do not apply this check if the DNS agrees with the HELO domain. There are some legitimate host names which are being heavily abused by spam bots, such as mail.aol.com and mx54.mail.com, so we want to block them if the connection comes from anywhere other than the host itself. (Sadly Facebook's MTAs are misconfigured so they don't pass this check.)

The only exception to this (so far) is localhost.localdomain which is the result of a popular misconfiguration (or lack of configuration) on legitimate Unix MTAs. If I find any other false positives they'll get checked in a similar way.

ETA: rediffmail.com also needs whitelisting. It's the mail service of rediff.com, which is a portal for Indian expats. So does easyjet.com.
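Putting the exemptions together, the decision can be sketched in Python as follows. The function and parameter names are hypothetical, the caller is assumed to supply the result of the HELO verification and the count of distinct /24 networks, and treating "at or above the limit" as a match is an assumption of this sketch.

```python
# HELO names exempted from the check, as listed in the post
EXEMPT_HELO_NAMES = frozenset({
    "localhost.localdomain",
    "rediffmail.com",
    "easyjet.com",
})


def is_probable_spam_bot(helo_name, helo_verified, network_count, limit=4):
    """Decide whether a connection looks like a spam bot.

    helo_verified: True if the DNS agrees with the HELO domain.
    network_count: distinct /24 networks seen using this HELO name.
    """
    if helo_name.lower() in EXEMPT_HELO_NAMES:
        return False  # known false positives
    if helo_verified:
        return False  # the host really is who it claims to be
    return network_count >= limit  # assumption: match at or above the limit
```

For example, a connection claiming to be mail.aol.com is left alone when the DNS check passes, but is flagged when the check fails and the name has been seen from many networks.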

This heuristic seems to catch about three different kinds of spam bot behaviour.

  • The HELO domain is the same as the domain in the MAIL FROM address. Spammers like to forge email "from" surprisingly few popular sites.
  • The HELO domain is one of a few bare hostnames, such as pc or computer - I guess they use the name of the compromised host.
  • The HELO domain is a parent domain of an ISP's edge network, e.g. telesp.net.br.
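The three patterns above could be told apart in log analysis with something like the following Python sketch. Everything here is hypothetical: the function name, the labels, and especially `edge_parents`, which stands for a hand-maintained set of ISP parent domains that the caller would have to supply.

```python
def classify_helo(helo_name, mail_from, edge_parents=frozenset()):
    """Label a HELO name with one of the three observed patterns.

    mail_from: the envelope sender address, e.g. "user@example.com".
    edge_parents: caller-supplied set of ISP edge network parent domains.
    """
    name = helo_name.lower().rstrip(".")
    _local, sep, domain = mail_from.rpartition("@")
    sender_domain = domain.lower() if sep else ""

    if sender_domain and name == sender_domain:
        return "matches MAIL FROM domain"
    if "." not in name:
        return "bare hostname"
    if name in edge_parents:
        return "ISP edge network parent domain"
    return "other"
```

The third category cannot be recognised from the name alone, which is why the sketch relies on a supplied list rather than guessing.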

I'm really pleased by how easy and effective this has turned out to be. The only annoyance is that it took me 20 months to realise that my Bloom filter ratelimit code could do this! Also I hope there aren't too many lurking gotchas that I haven't spotted yet.

This check really shows up a long-standing weakness in Exim's hints database implementation. It just uses a local DBM file to store ratelimit data, so each individual server in my SMTP cluster has to accumulate data on spam bot HELO domains without being able to benefit from the experience of the rest of the cluster. I suppose I should spend some quality time with Tokyo Tyrant...


Comments {6}


from: deborah_c
date: 5th Nov 2009 17:29 (UTC)

As it happens, I was listening to a BBC radio play about spam as I read this. I think you should be glad you're not providing support to the protagonist...



from: nonameyet
date: 6th Nov 2009 01:50 (UTC)

My logs won't be typical (since I'm not looking at the machine with the MX record for my domain) but many of the connections which fail the helo verify are machines which give "HELO [192.168.1.x]" and then go on to make a secure authentication. I wonder whether these users would clash enough to get caught by the 4 blocks of /24.

In the past I have known road warriors to use three or four ISPs in a weekend. If their laptop gives a bare hostname or other fixed HELO they could clock up several /24s in a week.


Tony Finch

from: fanf
date: 6th Nov 2009 11:51 (UTC)

This doesn't affect message submission because that happens on a different server IP address to the MX service.

For a number of years we have rejected any messages to our MX service from machines that use an IP address literal as their HELO domain.



from: mas90
date: 9th Nov 2009 10:35 (UTC)

Looking at my logs (for a considerably smaller server than yours!) it seems that whilst there are a few spams that would get caught by this technique (typically dsldevice.lan or similar — compromised DSL modems perhaps? — and also common domains such as gmail.com and mail.aol.com) a lot of the benefit would be lost as I already reject non-FQDN HELOs (thus already blocking botnet nodes using HELO names of such things as computer, home, localhost etc.).

However, I've noticed a slightly different pattern which accounts for quite a large quantity of attempted spam, and which leaves me somewhat baffled: "HELO do.not.use.this.dns.server.anymore.XXX.in-addr.arpa" where XXX is the first byte of the client's IP address, which (yesterday) came from no fewer than 27 different /8s spread across the world. CBL lists the clients as spambot-infected, but claims they're not all running the same spambot. Bizarre.


Tony Finch

from: fanf
date: 9th Nov 2009 11:50 (UTC)

I have a few HELO blocks that predate my deployment of the dynamic technique, though they're probably redundant now. The .arpa one you mention is indeed very strange.

There's a small amount of legit mail sent from hosts that don't know their FQDN, so (like underscores in HELO domains) it isn't a safe reason to block.




from: nonameyet
date: 12th Nov 2009 09:13 (UTC)

I imagined that some ISPs are redirecting requests aimed at spammer-controlled DNS servers and replying with do.not.use.this.dns.server.anymore.XXX.in-addr.arpa
