Bayesian Technical Information

MailWasher uses Bayesian statistics to determine the probability that an email is spam or good, based on the training you provide.

The difficulty with any Bayesian filter is that it needs a healthy volume of email to be classified as good or bad before it becomes effective. To help with this, we've added some tweaks to the current UI under Settings >> Spam Tools >> Learning.

Firstly, the Bayesian engine is designed to return a probability score for an email, ranging between 0 and 1.

0 is very good, and 1 is very spammy. A result of 0.5 is neutral, often referred to as the mid-point.

An email is broken up into a series of tokens, which are words or other recognisable pieces of text in the email. As you train emails, the tokens within each email are added to a corpus: they are stored in a hash file along with the number of times each token appeared. There is one hash file for email you classify as good and one for email you classify as spam.
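As a rough illustration of the training step, the following Python sketch keeps one token counter per corpus. The splitting rule, the lower-casing, and the names `tokenize`, `train`, `good_corpus` and `spam_corpus` are assumptions made for the example; MailWasher's actual tokeniser and file format are not described here.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lower-cased word-like tokens (assumed splitting rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train(corpus, email_text):
    """Add every token of a trained email to the given corpus counter."""
    corpus.update(tokenize(email_text))

# One counter per corpus, standing in for the two hash files.
good_corpus = Counter()
spam_corpus = Counter()

train(good_corpus, "Meeting notes attached, see the meeting agenda")
train(spam_corpus, "Cheap viagra, buy viagra now")

print(good_corpus["meeting"], spam_corpus["viagra"])
```

Each counter maps a token to the number of times it has been seen in email of that class, which is the information the probability mapping needs.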

In a third hash file, each token has a probability score mapped to it, determined by how often the token came from good or bad email. If a token is found significantly more often in spam, its probability will be very near the spammy end.

e.g.

viagra 0.821818

If a token is found in both the spam and good corpora in roughly equal amounts, its probability will be back near the mid-point.

e.g.

html 0.573604
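The exact formula MailWasher uses for these scores is not given here. One common approach, shown below as a sketch only, compares how frequently a token appears in each corpus. The function name `token_probability`, the clamping to 0.01 and 0.99, and the use of email totals are assumptions borrowed from typical Bayesian spam filters.

```python
def token_probability(good_count, spam_count, good_emails, spam_emails):
    """Estimate how spammy a token is, from 0.01 (good) to 0.99 (spam)."""
    if good_count + spam_count == 0:
        return 0.5  # never seen: neutral
    # Relative frequency of the token in each corpus, capped at 1.0.
    good_freq = min(1.0, good_count / max(1, good_emails))
    spam_freq = min(1.0, spam_count / max(1, spam_emails))
    p = spam_freq / (good_freq + spam_freq)
    # Clamp so that no single token is ever treated as absolute proof.
    return max(0.01, min(0.99, p))

print(token_probability(50, 50, 100, 100))  # seen equally often in both
print(token_probability(0, 40, 100, 100))   # seen only in spam
```

A token seen equally often in both corpora lands on the mid-point, while a token seen only in spam is pushed to the spammy end.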

The 'most interesting' tokens are those whose probability is furthest from the neutral 0.5; a token whose probability is close to the mid-point is not considered very interesting.

When an email is checked against the Bayesian, it is broken up into these tokens. The 20 'most interesting' tokens are then taken and used to calculate the probability that the email is either spam or good.
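The selection and combination steps can be sketched as follows. The combining rule shown is the usual naive Bayes product used by many spam filters; it is an assumption that MailWasher combines token scores this way, and the names `most_interesting` and `combine` are made up for the example.

```python
from math import prod

def most_interesting(probabilities, count=20):
    """Keep the `count` token probabilities furthest from the neutral 0.5."""
    return sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)[:count]

def combine(probabilities):
    """Combine token probabilities into one spam probability for the email."""
    spam = prod(probabilities)
    good = prod(1.0 - p for p in probabilities)
    if spam + good == 0:
        return 0.5  # degenerate case: no usable evidence either way
    return spam / (spam + good)

token_scores = [0.5, 0.99, 0.6, 0.01, 0.95, 0.9]
print(combine(most_interesting(token_scores, count=3)))
```

Tokens near 0.5 contribute almost nothing to the result, which is why only the most interesting ones are kept.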

At present MailWasher performs a fairly standard Bayesian analysis and converts the result into a spam weighting score, which can be combined with other Spam Tools to reach a final analysis of the email.

On a low Bayesian sensitivity setting:

MailWasher will limit the authority the Bayesian can have, so the highest 'weighting' it will apply is +/-99, or 'Infinite' for a message that has been trained.

If the Bayesian evaluates the message within the range -50 to +50, and other Spam Tools bring the Total Rating outside that range, then MailWasher will automatically train that email as either good or bad.

On a high Bayesian sensitivity setting:

MailWasher will not limit the authority the Bayesian can have, so the highest 'weighting' it will apply is +/-149, or 'Infinite' for a message that has been trained.

If the Bayesian evaluates the message within the range -75 to +75, and other Spam Tools bring the Total Rating outside that range, then MailWasher will automatically train that email as either good or bad.
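The two sensitivity settings differ only in their limits, which the sketch below captures in a small table. How the Bayesian probability is converted into a weighting is not described here, so the functions take the weighting as an input. The names `SENSITIVITY`, `clamp_weighting` and `should_auto_train` are hypothetical, and the 'Infinite' weighting for trained messages is not modelled.

```python
# Limits taken from the two sensitivity settings described above.
SENSITIVITY = {
    "low": {"max_weight": 99, "auto_train_range": 50},
    "high": {"max_weight": 149, "auto_train_range": 75},
}

def clamp_weighting(raw_weighting, sensitivity):
    """Restrict the Bayesian weighting to the limit for this sensitivity."""
    limit = SENSITIVITY[sensitivity]["max_weight"]
    return max(-limit, min(limit, raw_weighting))

def should_auto_train(bayes_weighting, total_rating, sensitivity):
    """True when the Bayesian was unsure but the Total Rating was decisive."""
    bound = SENSITIVITY[sensitivity]["auto_train_range"]
    return abs(bayes_weighting) <= bound and abs(total_rating) > bound

print(clamp_weighting(200, "low"))
print(should_auto_train(30, 120, "low"))
```

Under these assumptions, an email the Bayesian rates at 60 with a Total Rating of 120 would be auto-trained on the high setting but not on the low one.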

Minimum word length - Sets the minimum number of characters a word (token) must have to be evaluated by the Bayesian, so short words like 'of' are completely ignored.

Default is 4

Maximum word length - Sets the maximum number of characters a word may have to be evaluated by the Bayesian.

Default is 30

Use lower case - By default MailWasher ignores capitalisation when evaluating words, so for example 'monkey' is considered the same as 'Monkey'. Unchecking this option will treat words with different capitalisation separately.

Default is checked.
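Taken together, the three settings above act as a filter on the token list. A minimal sketch, assuming the length limits are inclusive and using the hypothetical name `filter_tokens`:

```python
def filter_tokens(tokens, min_length=4, max_length=30, use_lower_case=True):
    """Apply the word length limits and the optional lower-casing to tokens."""
    kept = []
    for token in tokens:
        if use_lower_case:
            token = token.lower()
        # Assumed inclusive limits: a 4-character word passes a minimum of 4.
        if min_length <= len(token) <= max_length:
            kept.append(token)
    return kept

print(filter_tokens(["Monkey", "of", "monkey"]))
print(filter_tokens(["Monkey", "of", "monkey"], use_lower_case=False))
```

With the defaults, 'Monkey' and 'monkey' become the same token and 'of' is dropped; with lower-casing switched off, the two capitalisations stay separate.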

Good token weight - This setting gives more authority to words considered to be good, so an email containing both good and spam words will generally come out as good. The current setting of 2.0 doubles their weight.

Default is 2.0

Minimum count for inclusion - A word must occur this number of times before it will have its probability mapped.

Default is 5
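To show how these two settings could influence a token's score, here is a sketch in which the good count is multiplied by the weight and rarely seen tokens are skipped. The formula is an assumption modelled on common Bayesian spam filters, not MailWasher's documented method, and `weighted_token_probability` is a made-up name.

```python
def weighted_token_probability(good_count, spam_count, good_emails, spam_emails,
                               good_token_weight=2.0, min_count=5):
    """Return a token's spam probability, or None if it is seen too rarely."""
    total = good_count + spam_count
    if total == 0 or total < min_count:
        return None  # not enough occurrences to map a probability
    # Good occurrences count for more, pulling mixed tokens towards good.
    good_freq = min(1.0, good_token_weight * good_count / max(1, good_emails))
    spam_freq = min(1.0, spam_count / max(1, spam_emails))
    p = spam_freq / (good_freq + spam_freq)
    return max(0.01, min(0.99, p))

print(weighted_token_probability(10, 10, 100, 100))
print(weighted_token_probability(10, 10, 100, 100, good_token_weight=1.0))
print(weighted_token_probability(2, 2, 100, 100))
```

A token seen equally often in both corpora scores about 0.33 with a weight of 2.0 and exactly 0.5 with a weight of 1.0, which is the bias towards good that the setting describes.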

Certain spam count - If an email has 0 good tokens and more than the specified number of bad tokens, MailWasher will class it as definitely spam and return the Certain Spam Score regardless of the Bayesian evaluation. If this is a negative number, the feature is disabled. A typical value would be around 10, though if very few emails have been trained as good it may cause false positives and mark good email as spam.

Default is -1
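The rule can be sketched in a few lines. It is an assumption here that a 'good' token is one with a probability below 0.5 and a 'bad' token one above 0.5; the function name `is_certain_spam` is hypothetical.

```python
def is_certain_spam(token_probabilities, certain_spam_count=-1):
    """Apply the certain-spam shortcut to an email's token probabilities."""
    if certain_spam_count < 0:
        return False  # a negative setting disables the feature
    good_tokens = sum(1 for p in token_probabilities if p < 0.5)
    bad_tokens = sum(1 for p in token_probabilities if p > 0.5)
    return good_tokens == 0 and bad_tokens > certain_spam_count

print(is_certain_spam([0.9] * 11, certain_spam_count=10))
print(is_certain_spam([0.9] * 11 + [0.1], certain_spam_count=10))
```

A single good token is enough to hand the decision back to the normal Bayesian evaluation.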

Interesting word count - The number of tokens, both good and bad, that will be considered when performing the final evaluation of the email. A larger number early on can make the Bayesian evaluation too authoritative, so it is currently set to 10, though 15-25 is a better choice once more emails have been trained.

Default is 10

Whole words can currently also be excluded by manually editing the 'mwp_exw.dat' file in the Application Data\Firetrust\MailWasher\Cache\ directory. Open the file in a standard text editor and list one word per line.

Whole words can likewise be included by manually editing the 'mwp_inw.dat' file in the same directory, again with one word per line. For example, as described above, a word must have at least 4 characters to be considered, but adding the word 'sex' to this file makes MailWasher consider it.
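The effect of the two word lists can be sketched as follows. The one-word-per-line format comes from the description above; the order of precedence (exclusions first, then inclusions, then the length limits) is an assumption, as are the names `parse_word_list` and `should_evaluate`.

```python
def parse_word_list(text):
    """Read a one-word-per-line list, ignoring blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}

def should_evaluate(word, excluded, included, min_length=4, max_length=30):
    """Decide whether the Bayesian should look at this word at all."""
    if word in excluded:
        return False  # listed in the exclusion file
    if word in included:
        return True   # listed in the inclusion file, length limits ignored
    return min_length <= len(word) <= max_length

included_words = parse_word_list("sex\n")
excluded_words = parse_word_list("hello\n\nworld\n")

print(should_evaluate("sex", excluded_words, included_words))
print(should_evaluate("hello", excluded_words, included_words))
```

Without the inclusion entry, 'sex' would fall below the default minimum word length and be ignored.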

Words can also be converted using the 'mwp_conv.dat' file in the same directory. For example, the following entries convert all the different variations to a plain 'viagra':

v1agra viagra

\/iagra viagra

/iagra viagra

vi@gra viagra

/i@gra viagra
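Reading entries like these is straightforward. The sketch below assumes each line holds a variant and its replacement separated by whitespace, as in the entries above; any other details of the real file format are unknown, and `parse_conversions` and `convert` are hypothetical names.

```python
def parse_conversions(text):
    """Build a variant -> replacement map from 'variant replacement' lines."""
    conversions = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            variant, replacement = parts
            conversions[variant] = replacement
    return conversions

def convert(token, conversions):
    """Replace a token with its plain form if a conversion is listed."""
    return conversions.get(token, token)

# Raw string so the backslash in '\/iagra' is kept literally.
sample = r"""
v1agra viagra
\/iagra viagra
vi@gra viagra
"""
conversions = parse_conversions(sample)
print(convert("v1agra", conversions), convert("hello", conversions))
```

Converting variants before counting means every disguised spelling adds to the count for the plain word.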

Other files are:

mwp_nswl.dat - This is the non-spam corpus; it is rebuilt every time mail is washed.

mwp_swl.dat - This is the spam corpus; it is rebuilt every time mail is washed.

mwp_pmap.dat - This file stores the probability mappings for tokens.

MWP.db3 - This is the main MailWasher database file. It stores the friends list, blacklist, deleted email, the corpus files, etc.