[PLUG] Training corpus for bayesian spamassassin
Keith Lofstrom
keithl at kl-ic.com
Fri Apr 23 10:07:02 UTC 2004
I'm looking for opinions about "selected training" for the Bayesian
filter in spamassassin.
I am transitioning from bogofilter to spamassassin; poor old bogo
is having trouble with spammers' addition of random filler words,
and about 50% of spam was making it through (with admittedly high
thresholds to minimize false positives).
I have a corpus of about 22K saved spams (since 2003 Aug), and about
110K saved hams (good GAWD I've read a lot of email since 1975).
Estimated accuracy after some careful cleaning is 0.02% misclassification.
Training a filter on all of those creates an enormous database,
which makes mail processing and retraining rather slow. A recent
subset, or a selected subset, makes a lot more sense.
It is superficially plausible to train spamassassin's Bayesian
filter on just the misclassified messages, i.e. the false positives
and false negatives. For the last week or so, I have been running
spamassassin with the bayesian filter and training turned off,
to see what kinds of mistakes it makes with the heuristic rules.
It seems to pass about 30% of the spam (false negative) and trap
about 5% (false positive!!) of the ham, with the threshold set
to 3.0 and a whole bunch of addresses whitelisted. I will have
quite a few false-negative spams before long (sigh), and I can
build a modest ham training list by including some of the recent
"squeakers". The result should be a tolerably sized bayes_toks file.
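For what it's worth, the "selected subset" idea above is essentially
train-on-error: score each message with the current model and only feed
the misclassified ones back in. A minimal sketch of the idea (a toy
multinomial naive Bayes, not spamassassin's actual Bayes implementation;
all names and the threshold of 0.0 are made up for illustration):

```python
import math
from collections import Counter

class NaiveBayes:
    """Toy multinomial naive Bayes over word tokens."""
    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, words, is_spam):
        if is_spam:
            self.spam_tokens.update(words)
            self.spam_msgs += 1
        else:
            self.ham_tokens.update(words)
            self.ham_msgs += 1

    def spam_score(self, words):
        # Log-odds of spam vs ham with Laplace smoothing;
        # positive means "looks like spam".
        s_total = sum(self.spam_tokens.values()) or 1
        h_total = sum(self.ham_tokens.values()) or 1
        score = math.log((self.spam_msgs + 1) / (self.ham_msgs + 1))
        for w in words:
            p_s = (self.spam_tokens[w] + 1) / (s_total + 2)
            p_h = (self.ham_tokens[w] + 1) / (h_total + 2)
            score += math.log(p_s / p_h)
        return score

def train_on_error(model, corpus, threshold=0.0):
    """'Selected subset' training: only messages the current model
    misclassifies are added to the database. Returns how many of
    the corpus messages were actually used."""
    used = 0
    for words, is_spam in corpus:
        predicted_spam = model.spam_score(words) > threshold
        if predicted_spam != is_spam:
            model.train(words, is_spam)
            used += 1
    return used
```

Since correctly-classified messages never touch the token database, the
resulting database only grows with the mistakes, which is exactly why
the bayes_toks file should stay small.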
Does this "selected subset" training approach make sense, or have I
got my head so thoroughly wedged up my behind that oxygen is not
getting through?
Keith
--
Keith Lofstrom keithl at ieee.org Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs