fanf

more HELO statistics

2nd Dec 2004 | 14:29

Counting all offered messages (rejected or not), we saw 1 447 252 different HELO names in the last month. If I count the number of dots in each name, the resulting histogram is as follows. The small end (0-2 dots) is inflated by incompetence and forgery. The big end (>10 dots) is 99.99% abuse.

 25765
450511 .
218188 ..
432343 ...
197647 ....
 33647 ..... 5
 28485 ......
 19790 .......
  4582 ........
  2040 .........
  3069 .......... 10
  7005 ...........
  9483 ............
  7722 .............
  4390 ..............
  1840 ............... 15
   568 ................
   150 .................
    23 ..................
     3 ...................
     1 .................... 20
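
For anyone who wants to repeat this on their own logs, the counting step can be sketched roughly as below. This is only an illustration: the function names and the sample HELO names are made up, and it assumes the names have already been extracted from the logs into a list.

```python
from collections import Counter

def dot_histogram(helo_names):
    """Map number of dots -> count of distinct HELO names with that many dots."""
    # set() collapses repeats, so each different name is counted once
    return Counter(name.count(".") for name in set(helo_names))

def format_histogram(hist):
    """Render one line per dot count: the count, then that many dots."""
    if not hist:
        return ""
    width = len(str(max(hist.values())))
    lines = []
    for dots in range(max(hist) + 1):
        lines.append(f"{hist.get(dots, 0):>{width}} {'.' * dots}".rstrip())
    return "\n".join(lines)

# Hypothetical sample input, not taken from the real logs
names = ["localhost", "mail.example.com", "example.com",
         "mx1.example.com", "example.com"]
hist = dot_histogram(names)
print(format_histogram(hist))
```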


Of the messages we accept, 274 902 different HELO names were used (19% of the total). If I count the number of dots in each name, the resulting histogram looks like this:

 5723
69182 .
84906 ..
75131 ...
26182 ....
 4723 ..... 5
 4436 ......
 2686 .......
  279 ........
  123 .........
  123 .......... 10
  317 ...........
  447 ............
  320 .............
  211 ..............
   87 ............... 15
   21 ................
    4 .................
    1 ..................


A lot of these are clearly bogus, for example 80 characters of random
words concatenated with an IP address, like

Antigone.meter.ernet.ne.jpsouthparkmail.comnetlane.comlouiskoo.comjpopmail.comtw60.186.213.104

or a random collection of concatenated domain names, like

cave.ngs.ouse.hello.nlsammail.compcmail.com.twsouthparkmail.com

(These should obviously be added to my HELO heuristics!) After removing them, there are 272 890 HELO names. If I count the number of dots in each name, the resulting histogram looks like this:

 5723
69182 .
84905 ..
75130 ...
26176 ....
 4688 ..... 5
 4334 ......
 2521 .......
  179 ........
   47 .........
    0 .......... 10
    2 ...........
    3 ............


This still includes various stupidities. 26631 of the 37272 single dot names ending in com|net|org have no name servers so are invalid. Of the unfiltered list, 208323 of the 288884 com|net|org names are invalid.

Edit: Actually, if you use less-strict DNS validity checking those numbers are 22015 (instead of 26631) and 206556 (instead of 208323).


Comments {2}

from: kaet
date: 2nd Dec 2004 19:20 (UTC)

Looking at these in gnuplot, my initial hunch would be that there are four naming policies at work here (this is my hypothesis):

policy  mean-len  sd-len  prop-of-nms
  A        1       0.5      0.346
  B        3       0.7      0.468
  C        4       2.4      0.162
  D       12       1.4      0.023


I'd guess that policy A is incompetent naming; B is "showy" names (mail.foo.com, etc.); C is "topographical" names (mail.border.london.router.dodgy.net, etc.); and D is spammy long naming.

Looking at the accepted names, my best fit for these classes is (prob of name being in class)

A 0.33
B 0.53
C 0.14
D 0


Applying Bayes, that means the following probability of goodness (acceptance) given a message is in each class

A 0.181
B 0.215
C 0.164
D 0


Dividing by the overall probability of goodness (0.19) gives the factor by which this model predicts you should multiply your estimate of goodness once a name has been allocated to a particular class.

A 0.95
B 1.13
C 0.864
D 0
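
The two steps above (Bayes, then dividing by 0.19) can be checked with a short sketch. The figures are the ones quoted in this comment; the variable names are only for illustration.

```python
# Proportion of all names in each class (prop-of-nms above)
p_class = {"A": 0.346, "B": 0.468, "C": 0.162, "D": 0.023}
# Best-fit proportion of accepted names in each class
p_class_given_good = {"A": 0.33, "B": 0.53, "C": 0.14, "D": 0.0}
# Overall probability of goodness (acceptance)
p_good = 0.19

# Bayes: P(good | class) = P(class | good) * P(good) / P(class)
p_good_given_class = {
    c: p_class_given_good[c] * p_good / p_class[c] for c in p_class
}

# Factor by which class membership scales the estimate of goodness
multiplier = {c: p_good_given_class[c] / p_good for c in p_class}

for c in p_class:
    print(c, round(p_good_given_class[c], 3), round(multiplier[c], 3))
```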


If my intuition about the nature of the classes is correct then it should be possible to detect each class (approximately). The most valuable, I think, would be to distinguish between B and C mail. C might be, for example, "A-record contains numbers". But it's not something I can really hypothesise on, not having the full data, :).

As spam-assassin has linear scoring, presumably intended to be a sum-of-likelihood-logs model, I'd give the following penalties (modulo arbitrary scaling multiplier).

A +0.05
B -0.12
C +0.15
D +infinity
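
As a sketch of where those penalties come from: taking the negative natural log of each multiplier reproduces the figures above. The choice of natural log is an assumption here; any other base only changes the arbitrary scaling.

```python
import math

# Multipliers from the previous step
multiplier = {"A": 0.95, "B": 1.13, "C": 0.864, "D": 0.0}

def penalty(m):
    """Negative log of the goodness multiplier; a zero multiplier is an infinite penalty."""
    return math.inf if m == 0 else -math.log(m)

penalties = {c: penalty(m) for c, m in multiplier.items()}

for c, p in penalties.items():
    print(c, f"{p:+.2f}")
```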


Just a quick hour or so modelling: nothing too serious.


Tony Finch

from: fanf
date: 3rd Dec 2004 03:00 (UTC)

Cool analysis, thanks!

As a result of my recent data mining :-) I've added three new HELO blocking rules:

block if the HELO name contains a double dot

block if the HELO name is 55 characters or more
(this deals with a large proportion of class D)

The final one is based on DNS lookups. Take the final two components of the name. If the top level component has name servers but the second level component does not, it's a bogus name. If the top level component has no name servers it's incompetence rather than malice, so we give them the benefit of the doubt. This should deal with a large proportion of the abusive names.
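
The three rules can be sketched roughly as follows. This is an illustration in Python, not the actual mail server configuration: check_helo and has_name_servers are hypothetical names, and the lookup is passed in as a function so that a real NS query can be swapped for the fake one used here.

```python
def check_helo(helo, has_name_servers):
    """Return a reason to block the HELO name, or None to let it through.

    has_name_servers is a hypothetical callable: domain -> bool.
    In real use it would do a DNS NS lookup.
    """
    if ".." in helo:
        return "double dot"
    if len(helo) >= 55:
        return "too long"
    labels = helo.rstrip(".").split(".")
    if len(labels) >= 2:
        top = labels[-1]
        second = ".".join(labels[-2:])
        # A top level without name servers gets the benefit of the doubt
        if has_name_servers(top) and not has_name_servers(second):
            return "no name servers for " + second
    return None

# Fake delegation data standing in for DNS (assumption for the example)
delegated = {"com", "example.com"}

def lookup(name):
    return name.lower() in delegated

print(check_helo("mail.example.com", lookup))
print(check_helo("mail.bogus-zone.com", lookup))
```

Passing the lookup in keeps the rule logic testable without touching the network.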

The other large component of class D is consumer Internet access addresses, which I prefer to leave to blacklist maintainers to deal with.
