[PLUG] Training corpus for bayesian spamassassin

Paul Heinlein heinlein at madboa.com
Fri Apr 23 10:42:19 UTC 2004


On Fri, 23 Apr 2004, Keith Lofstrom wrote:

> I'm looking for opinions about "selected training" for bayesian
> spamassassin.

I think it's sufficient to start with a base of 200 or so spam
messages and a similar number of ham. The spam messages ought to be
recent, because spam content has changed considerably even since last
summer.

It's always good to feed the false negatives and, especially, the
false positives to the bayesian db -- but I'd also make sure it got a
reasonable diet of standard spam and ham as well.
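
For what it's worth, here is a minimal sketch of how that feeding could
be scripted around sa-learn and its --spam and --ham options. The
helper name sa_learn_args and the directory names are hypothetical, and
I'm assuming sa-learn is on the PATH and accepts a directory of
messages; check the man page for your version.

```python
import subprocess


def sa_learn_args(kind, path):
    """Build the argument list for one sa-learn run.

    kind is "spam" or "ham"; path is a file or directory of messages.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, path]


if __name__ == "__main__":
    # Hypothetical corpus layout: one directory per class.
    for kind, path in (("spam", "corpus/spam"), ("ham", "corpus/ham")):
        subprocess.run(sa_learn_args(kind, path), check=True)
```

Building the argument list separately from running it makes it easy to
print the commands first and see what would be learned.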

A threshold setting of 3.0 seems a bit low to me, but I haven't done
much testing with a setting that low.

After that, I'd feed any spam that didn't get a BAYES_99 tag back to
the bayesian database. It doesn't really matter whether or not
SpamAssassin caught it correctly the first time.
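
One rough way to pick those messages out, assuming each one has already
been confirmed as spam by a human and still carries SpamAssassin's
X-Spam-Status header with its list of rule hits: look for BAYES_99 in
that header and re-learn anything without it. The function name
needs_retraining is made up for illustration.

```python
import re
from email import message_from_string

# Match the rule name exactly, so a longer name that merely starts
# with BAYES_99 does not count as a hit.
BAYES_99 = re.compile(r"\bBAYES_99\b")


def needs_retraining(raw_message):
    """Return True if a known-spam message did not hit BAYES_99."""
    msg = message_from_string(raw_message)
    # A missing header means the message was never scored, so it is
    # treated as a candidate for learning too.
    status = str(msg.get("X-Spam-Status", ""))
    return BAYES_99.search(status) is None
```

Anything for which this returns True would then go back through
sa-learn --spam.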

Why?

My goal is that all spam should get BAYES_99 so that only content --
not mail headers, sender address, or whatever -- is used to identify
spam. That's not to say the other indicators aren't helpful, but to my
mind the goal is to classify messages based on content as much as
possible.

-- Paul Heinlein <heinlein at madboa.com>



