[PLUG] Training corpus for bayesian spamassassin

Keith Lofstrom keithl at kl-ic.com
Fri Apr 23 10:07:02 UTC 2004


I'm looking for opinions about "selected training" for bayesian
spamassassin.

I am transitioning from bogofilter to spamassassin;  poor old bogo
is having trouble with the spammers' addition of random filler words,
and about 50% of spam was making it through (with admittedly high
thresholds to minimize false positives).

I have a corpus of about 22K saved spams (since 2003 Aug), and about
110K saved hams (good GAWD I've read a lot of email since 1975).  
Estimated accuracy after some careful cleaning is 0.02% misclassification.
Training the filter on all of those creates an enormous database,
which makes mail processing and retraining rather slow.
A recent subset, or a selected subset, makes a lot more sense.
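For reference, bulk training from saved corpora is normally done with
sa-learn; something like the following (the mailbox paths are
hypothetical placeholders, not my actual layout):

```shell
# Bulk-train the Bayes database from saved mbox corpora.
# Paths are examples only; substitute your own spam/ham folders.
sa-learn --spam --mbox ~/mail/saved-spam
sa-learn --ham  --mbox ~/mail/saved-ham

# Inspect the resulting database: token counts, message counts, etc.
sa-learn --dump magic
```

With 132K messages that bayes_toks file gets huge, which is exactly
the problem.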

It is superficially plausible to train the bayesian filter on
spamassassin with just the misclassified false positives and
false negatives.  For the last week or so, I have been running 
spamassassin with the bayesian filter and training turned off,
to see what kinds of mistakes it makes with the heuristic rules.
It seems to pass about 30% of the spam (false negative) and trap
about 5% (false positive!!) of the ham, with the threshold set
to 3.0 and a whole bunch of addresses whitelisted.   I will have
quite a few false-negative spams before long (sigh), and I can
build a modest ham training list by including some of the recent
"squeakers".  The result should be a tolerably sized bayes_toks file.
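The selected-subset idea above would amount to periodic, error-driven
retraining: sort the filter's mistakes into folders by hand and feed
back only those. A sketch (folder names again hypothetical):

```shell
# Error-driven retraining: teach the filter only from its mistakes.
# Folder names are examples; sort misclassified mail into them manually.

# False negatives: spam that slipped past the 3.0 threshold.
sa-learn --spam --mbox ~/mail/missed-spam

# False positives and near-miss ham ("squeakers" just under threshold).
sa-learn --ham --mbox ~/mail/trapped-ham
```

sa-learn remembers which messages it has already learned, so re-running
the same commands after topping up the folders is safe.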

Does this "selected subset" training approach make sense, or have I
got my head so thoroughly wedged up my behind that oxygen is not
getting through?

Keith

-- 
Keith Lofstrom           keithl at ieee.org         Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs
