Sunday, July 25, 2004

Bayesian e-mail Filters

A couple of entries ago, I mentioned Bayesian e-mail filters. I guess I ought to explain that.

I get spam (who doesn't?). I use every trick I can to minimize it. I first send my e-mail through who run SpamAssassin. From there, I forward it to who have their own spam filter. Unfortunately, even two spam filters don't eliminate all of it.

I wrote a bunch of rules in Outlook Express but I was always having to tweak them and watch carefully for false positives (good e-mail labeled as spam by my rules).

One day as I was surfing, I came across this concept of Bayesian filtering. There are a couple of good articles here and here.

Bayesian spam filters calculate the probability of a message being spam based on its contents. Unlike simple content-based filters, Bayesian spam filtering learns from both spam and good mail, resulting in a very robust, adaptive, and efficient anti-spam approach that, best of all, returns hardly any false positives.

The net of this is that Bayesian filtering programs parse each e-mail and score each string (word) according to how often it has appeared in spam, and in good mail, before. "Before" is the key word: you have to "train" your filter initially by manually identifying e-mail as spam or not.
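To make the word-scoring idea concrete, here is a minimal sketch in Python. It is not K9's actual code; the class and function names (`BayesFilter`, `tokenize`, and so on) are made up for illustration, and it assumes spam and good mail are equally likely up front and uses simple add-one smoothing so unseen words don't break the math.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word-like strings (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class BayesFilter:
    """Toy naive Bayes spam scorer; an illustration, not K9's implementation."""

    def __init__(self):
        self.spam_counts = Counter()  # word -> number of spam messages containing it
        self.good_counts = Counter()  # word -> number of good messages containing it
        self.spam_msgs = 0
        self.good_msgs = 0

    def train(self, text, is_spam):
        """Record the words of one manually classified message."""
        words = set(tokenize(text))
        if is_spam:
            self.spam_counts.update(words)
            self.spam_msgs += 1
        else:
            self.good_counts.update(words)
            self.good_msgs += 1

    def word_probability(self, word):
        """Estimate P(spam | word), assuming equal priors and add-one smoothing."""
        p_word_given_spam = (self.spam_counts[word] + 1) / (self.spam_msgs + 2)
        p_word_given_good = (self.good_counts[word] + 1) / (self.good_msgs + 2)
        return p_word_given_spam / (p_word_given_spam + p_word_given_good)

    def spam_probability(self, text):
        """Combine the per-word scores into one probability for the message."""
        log_odds = 0.0
        for word in set(tokenize(text)):
            p = self.word_probability(word)
            log_odds += math.log(p) - math.log(1.0 - p)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    f = BayesFilter()
    f.train("cheap viagra buy now", is_spam=True)
    f.train("buy cheap pills now", is_spam=True)
    f.train("meeting agenda for monday", is_spam=False)
    f.train("lunch on monday?", is_spam=False)
    print(f.spam_probability("buy cheap viagra"))
    print(f.spam_probability("monday meeting"))
```

With only four training messages the numbers are crude, but the pattern is the point: words seen mostly in spam push the score up, and words seen mostly in good mail pull it down. A real filter needs far more training mail than this.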

I'm running K9 as an e-mail proxy. It is small, only 77K to download. To install it, you have to change your e-mail account properties to point your incoming mail (POP3) server to and make a slight adjustment of your account name. The result looks like this:

I won't cover all the steps to implement K9, but it isn't tricky. Obviously, it only works with local e-mail programs. It won't work with web-based mail such as Yahoo, Hotmail, or Juno, for example.

When you're done, each time you check your e-mail, it will be pre-processed by K9 and flagged as spam or not. A simple rule will throw spam into a folder to be spot checked periodically and deleted.

Here's what the results are:

Even after two spam filters, 66% of what gets to me is still spam! The good news is that, of that, almost 97% is caught by K9. Since these statistics have been running, 0.54% have been misidentified as spam. Don't rely on that low number, though. Always make a quick look-see through the spam folder before you delete it.
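To put those percentages in per-message terms, here is a small back-of-the-envelope sketch in Python. The function name `filter_outcome` and the batch size of 1,000 messages are invented for illustration; only the 66% and 97% figures come from my statistics. I left the 0.54% figure out of the calculation because the statistic doesn't say what it is a percentage of.

```python
def filter_outcome(total, spam_share, catch_rate):
    """Break a batch of incoming messages into good, caught spam, and missed spam."""
    spam = total * spam_share
    good = total - spam
    caught = spam * catch_rate
    missed = spam - caught
    return {
        "spam": spam,
        "good": good,
        "caught": caught,
        "missed": missed,
        # Share of the inbox that is still spam after filtering,
        # ignoring any good mail wrongly moved to the spam folder.
        "inbox_spam_share": missed / (good + missed),
    }


if __name__ == "__main__":
    for key, value in filter_outcome(1000, 0.66, 0.97).items():
        print(f"{key}: {value:.2f}")
```

For every 1,000 messages that reach me, that works out to roughly 660 spam and 340 good, with about 640 spam caught and about 20 slipping through. So the inbox goes from about two-thirds spam to somewhere around 5 to 6% spam.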
