[PLUG] spam filtering: server vs. client

Carlos Konstanski ckonstanski at pippiandcarlos.com
Mon Jun 18 13:47:50 UTC 2007


My mail server has spam filtering in place, but it has proven
inadequate.  Many messages get through the filter without receiving
any spam score whatsoever.  It is time to add Bayesian filtering to
the mix.

One can implement Bayesian filtering on the server, or on each client.
Server-side filtering is what the users want: a magical filter that
just works.  Client-side filtering puts the burden of filter training
on the users, but each user's filter is more closely tailored to that
user's own mail.

I envision a server-side solution with a small client-side twist.
Each user would have a folder called "spam" where they would put any
spam that makes it into their inbox.  The sa-learn job running on the
server would then look in each user's spam folder and learn those
messages as spam, and delete them when done.  In time, users would
find decreasing amounts of spam in their inboxes, and their
involvement in the filter training would decrease as well.
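
As a rough sketch of that job, something like the following could run
from cron.  The directory layout (<home_root>/<user>/Maildir/.spam
with cur/ and new/ subfolders) and the function name are assumptions
for illustration, not a description of my actual setup.  It also
assumes sa-learn is on the PATH and writes to a single site-wide Bayes
database.

```python
import subprocess
from pathlib import Path


def learn_spam_folders(home_root, runner=subprocess.run):
    """Feed each user's spam folder to sa-learn, then delete the messages.

    Assumed (hypothetical) layout: <home_root>/<user>/Maildir/.spam/{cur,new}
    Returns {username: number_of_messages_learned}.
    """
    learned = {}
    for user_dir in sorted(Path(home_root).iterdir()):
        if not user_dir.is_dir():
            continue
        spam_dir = user_dir / "Maildir" / ".spam"

        # Collect every message file in the folder's cur/ and new/ dirs.
        messages = []
        for sub in ("cur", "new"):
            sub_dir = spam_dir / sub
            if sub_dir.is_dir():
                messages.extend(p for p in sorted(sub_dir.iterdir()) if p.is_file())
        if not messages:
            continue

        # One sa-learn call per user.  check=True raises on a non-zero
        # exit, so nothing is deleted unless learning succeeded.
        runner(["sa-learn", "--spam", *[str(m) for m in messages]], check=True)

        for msg in messages:
            msg.unlink()
        learned[user_dir.name] = len(messages)
    return learned


if __name__ == "__main__":
    # /home is an assumption; point this at wherever the mail lives.
    print(learn_spam_folders("/home"))
```

The runner argument exists only so the function can be exercised
without SpamAssassin installed.  A matching ham pass would use
sa-learn --ham on a folder of known-good mail, minus the delete step.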

The biggest potential problem with this method (or any server-side
method, for that matter) is that the filter is being trained on all
the users' spam.  It is not tailored for each individual.  There is
plenty of discussion on the subject of "one man's spam is another
man's ham" on the web.  My question is: is all this discussion merely
theoretical, or does it carry weight in the practical world?  Will my
users miss emails they really wanted to see if I train the filter on
all their spam as a whole?

There's also the question of ham.  Without ham, one cannot train a
spam filter.  Will I confuse the filter by using the ham from several
different users?  Or will the relevant tokens be discovered
regardless?
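
To make the pooling question concrete, here is a toy per-token
calculation in the style of a naive Bayes filter.  It is not
SpamAssassin's actual algorithm, and the users, messages and function
name are invented for illustration.  It only shows how a token's spam
probability shifts when one user's ham is pooled with another's.

```python
from collections import Counter


def token_spam_probability(spam_msgs, ham_msgs):
    """Estimate P(spam | token) per token, assuming equal spam/ham priors.

    Counts each token once per message and applies add-one smoothing.
    """
    spam_counts = Counter(t for m in spam_msgs for t in set(m.lower().split()))
    ham_counts = Counter(t for m in ham_msgs for t in set(m.lower().split()))
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        # Smoothed fraction of spam (ham) messages containing the token.
        s = (spam_counts[tok] + 1) / (len(spam_msgs) + 2)
        h = (ham_counts[tok] + 1) / (len(ham_msgs) + 2)
        probs[tok] = s / (s + h)
    return probs


# Invented sample data: Alice legitimately gets mail about pills,
# Bob does not.
alice_ham = ["refill pills order for the pharmacy", "meeting notes attached"]
bob_ham = ["lunch on friday", "meeting notes attached"]
spam = ["cheap pills winner click now", "winner click now claim prize"]

bob_only = token_spam_probability(spam, bob_ham)
pooled = token_spam_probability(spam, alice_ham + bob_ham)

if __name__ == "__main__":
    for tok in ("pills", "winner", "meeting"):
        print(tok, round(bob_only[tok], 3), round(pooled[tok], 3))
```

In this toy data, "pills" scores as less spammy once Alice's ham is
pooled in, while "winner", which appears in nobody's ham, still scores
clearly as spam.  Whether real mail behaves this way is exactly what I
am asking.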

In short, I am trying to find a balance between theoretically correct
spam filtering and practical filtering that does what it is supposed
to do: keep users from seeing spam in their inboxes.  I'm sure many of
you have already found that balance.  Please share your thoughts.

Carlos Konstanski


