In
this article we discuss the inner
workings of a Bayesian spam filter.
Which spam filter is best for you?
There are a number of methods out
there for identifying spam. Keyword
search, message hashing, real-time
blackhole lists (RBLs), forged header
detection and Bayesian filters are
just a few. Most anti-spam products
today employ a number of these
techniques with varying degrees
of effectiveness. Today
we will be discussing “Bayesian
filters”, how they
work, why they work, and when they
don’t work.
A Bayesian filter is a filter that
learns from experience.
After an email is classified as
spam or as legitimate (solicited)
mail, the filter adds the statistics
about that email to its recognition
table for future reference.
A grossly oversimplified example:
you have 2 messages in your inbox,
one spam and one not. Message 1
(spam) contains one sentence: Hi cat.
Message 2 (not spam) contains one
sentence: Hi dog.
The filter looks at how much of each
message a word makes up, compares
the words found in spam with the
words found in legitimate mail, and
comes up with: Hi (0), Cat (.5) spam,
Dog (.5) not spam. This is because
the word “Hi” occurs equally
in both, the word Cat, being 50%
of its message, occurs only in spam,
and the word Dog, being 50% of its
message, occurs only in legitimate
email.
Now let's take a third message for
the filter to learn from. Message 3:
Hi dog how are you? We have also
identified this message as legitimate.
Here each word makes up one fifth
of the message, so dog scores .20,
as do how, are, and you. Averaging
dog's two scores (.5 and .20) gives
.35, so our filter database is now:
Hi (0), Cat (.5) spam, Dog (.35)
not spam, how/are/you (.20) not spam.
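The toy table above can be reproduced with a few lines of Python. This is only a sketch of the article's simplified arithmetic, not real Bayesian math. The function names (`tokenize`, `build_table`) and the rule that a word seen in both spam and legitimate mail is scored as neutral (0) are assumptions, chosen so that the output matches the numbers in the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def build_table(messages):
    """Build the toy word table from (text, label) pairs.

    Each word's score in a message is its share of that message's words.
    Assumed rule: a word seen in both classes is neutral (0); otherwise
    its score is the average share over the messages that contain it.
    """
    shares = defaultdict(lambda: {"spam": [], "ham": []})
    for text, label in messages:
        words = tokenize(text)
        for word in set(words):
            shares[word][label].append(words.count(word) / len(words))

    table = {}
    for word, by_label in shares.items():
        if by_label["spam"] and by_label["ham"]:
            table[word] = (0.0, "neutral")
        else:
            label = "spam" if by_label["spam"] else "ham"
            values = by_label[label]
            table[word] = (round(sum(values) / len(values), 2), label)
    return table


messages = [
    ("Hi cat", "spam"),
    ("Hi dog", "ham"),
    ("Hi dog how are you?", "ham"),
]
table = build_table(messages)
for word, (score, label) in sorted(table.items()):
    print(word, score, label)
```

Here "ham" is simply shorthand for legitimate mail. Dropping the third message from `messages` gives the two-message table from the first half of the example.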
This adaptive nature is exactly
why Bayesian filters are so attractive
to some. Even after the spammers
have figured out how to get past
the keyword searches and most other
methods, the new “words”
they create by misspelling and scrambling
never appear in legitimate emails
and are therefore treated as spam
indicators as soon as the filter
has learned them.
One of the aspects of a Bayesian
filter that separates it from other
anti-spam methods is that it is
adaptive and based on the user.
The filter learns from the mail
YOU get. This is extremely
helpful in situations where typical
spam words are used in your workplace
every day. Take a financial institution
for example: The words “mortgage”
and “cash” will probably
show up in excess, so much so that
regular spam blocking techniques
would prevent legitimate emails
from being delivered. With a Bayesian
filter in place, however, the rules
are totally different. Because the
emails you “taught it with”
included those terms, it does not
see them as spam indicators. What
is more, it will be able to distinguish
between spam email coming in with
the word “mortgage”
in it and internal messages with
the word “mortgage”
in them. The differences will be
in word location, word count and
supporting words, all things the
Bayesian filter takes into account
before it makes its final judgment
about the fate of an email.
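To make the "supporting words" idea concrete, here is a minimal sketch of a textbook naive Bayes classifier over word counts. It is an assumption about how such a filter could work, not the math of any particular product, and it ignores word location. The class name `NaiveBayesFilter`, its methods, and the training sentences are all made up for illustration.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Textbook naive Bayes over word counts (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one message, already labeled 'spam' or 'ham', to the table."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | words) using add-one (Laplace) smoothing."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            # Smoothed prior so an untrained class never produces log(0).
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            for word in text.lower().split():
                count = self.word_counts[label][word]
                # The extra +1 reserves probability mass for unseen words.
                score += math.log((count + 1) / (total_words + len(vocabulary) + 1))
            log_score[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))


# Hypothetical training mail for a financial institution.
finance_filter = NaiveBayesFilter()
finance_filter.train("mortgage rates for the branch meeting", "ham")
finance_filter.train("please review the mortgage application cash report", "ham")
finance_filter.train("cheap mortgage click now free cash", "spam")
finance_filter.train("free cash click here now", "spam")

print(finance_filter.spam_probability("mortgage application review"))
print(finance_filter.spam_probability("free cash click now"))
```

With this training set the first message scores well below 0.5 and the second well above it, even though "mortgage" and "cash" appear in both kinds of mail: the surrounding words tip the balance.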
To truly get good results out of
a Bayesian filter, you need to “feed”
it lots of good input. In my experience,
several hundred spam and non-spam
messages make it extremely effective.
The actual math behind the filter
is unique to the product because,
although the basic concept is well
known, the application requires
a liberal dose of creativity. Some
of the more advanced conditions
that the filter can include in its
“dictionary” of values
are: word location, the ratio of
a word's occurrences to the size
of the email, and whether the subject
line appears in the message body.
There are of course many more, and
the number will probably increase
as long as email spam continues
to increase.
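One way to picture these extra conditions is as additional "words" in the dictionary. The sketch below is a hypothetical scheme, not taken from any product: the function name `extract_features` and the `subject:`, `body:` and `meta:` prefixes are invented for illustration. It tags each word with where it occurred and adds a marker when the subject line is repeated in the body.

```python
def extract_features(subject, body):
    """Return location-tagged tokens for one email (hypothetical scheme)."""
    features = ["subject:" + word for word in subject.lower().split()]
    features += ["body:" + word for word in body.lower().split()]
    # Extra signal: does the subject line appear verbatim in the body?
    if subject and subject.lower() in body.lower():
        features.append("meta:subject_in_body")
    return features


print(extract_features("Cheap mortgage", "Get a cheap mortgage today"))
```

A filter trained on these tokens would keep separate statistics for "mortgage" in a subject line and "mortgage" in a body, which is one simple way to take word location into account.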
With this kind of processing applied
to each email, it is easy to
see that spammers have a very difficult
time tricking Bayesian filters.
As a matter of fact, the only thing
they can do is try to keep their
mail as short and neutral as possible.
This, to the spammer’s dismay,
soon becomes another signature of
spam emails because of the adaptive
nature of these filters. In brief,
Bayesian filters hold a lot of promise
when “taught right”
(much like our kids) and they have
the potential to virtually eliminate
spam from your inbox.
I hope this has been instructive
for you. Remember to take care and
have fun.