Closed Bug 263397 Opened 20 years ago Closed 19 years ago

Use number of misspelled words as a criterion in Bayesian filtering.

Categories

(Thunderbird :: General, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 294077

People

(Reporter: dsimcha, Assigned: mscott)

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20041007 Firefox/0.10.1
Build Identifier: 

I have devised an idea to make Bayesian filtering of junk mail work better:  in
addition to the existing criteria used to determine the probability of a message
being spam, every incoming message should be spell-checked using the existing
spell-checker.  As messages are marked as junk, the percentage of misspelled
words should be recorded and this should be used as part of the criteria for
determining the probability of a message being junk.  This would complement the
current Bayesian filtering method very well, as one of the most common methods
of getting around this filtering is to purposely misspell words, for example
prof1t instead of profit or remmm oveeee instead of remove.
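
As a rough illustration of the proposal above (not anything Thunderbird implements), the misspelled-word percentage could be computed per message and bucketed into a pseudo-token that the existing Bayesian trainer treats like any other token. The word list, the function names, and the bucket count below are all hypothetical; a real version would call the actual spell-checker instead of a fixed set.

```python
import re

# Hypothetical stand-in for the spell-checker's dictionary (assumption:
# a real implementation would query the existing spell-checker instead).
KNOWN_WORDS = {
    "click", "here", "to", "remove", "your", "profit",
    "is", "the", "meeting", "at", "noon",
}


def misspelled_fraction(text, dictionary=KNOWN_WORDS):
    """Return the fraction of word-like tokens not found in the dictionary."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)


def misspelling_token(text, buckets=4, dictionary=KNOWN_WORDS):
    """Map the fraction to a coarse pseudo-token that can be trained
    as junk or not junk alongside the ordinary word tokens."""
    frac = misspelled_fraction(text, dictionary)
    bucket = min(int(frac * buckets), buckets - 1)
    return f"*misspelled-bucket-{bucket}*"
```

Bucketing keeps the feature in the same form as the word tokens, so no change to the probability calculation itself would be needed under this sketch.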

Reproducible: Always
Steps to Reproduce:

Whilst this is arguably a dup of one or more of the spam filtering bugs
that are open, it is more relevant to note that the Bayes in Bayesian filtering
already does this.

In your example, 'prof1t' identifies spam far more damningly than even 'profit',
which I doubt occurs in many of your valid e-mails.
However, in the remove vs. ree mooove example, there are so many ways to
misspell the same word that the fact that spammers use multiple misspellings
will throw off the Bayesian filters until they have seen almost every feasible
combination.
I agree that it looks as though what you propose should work
and should be effective, but in practice, a totally statistical
approach is provably superior.

In your example, the tokens 'ree' and 'mooove', which I agree are
likely to be strong markers of spam, are only going to be evaluated
if they are within the 15 most interesting tokens in a message, in
other words after spammers have started to use just these mis-spellings
in preference to others.
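
To make the "15 most interesting tokens" point concrete, here is a minimal sketch in the style of "A Plan for Spam". It is an illustration under stated assumptions, not Thunderbird's actual code: the function names are hypothetical, "interesting" is taken to mean furthest from the neutral probability 0.5, and per-token probabilities are clamped to 0.01..0.99.

```python
from math import prod


def most_interesting(token_probs, n=15):
    """Pick the n per-token spam probabilities furthest from neutral (0.5)."""
    return sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def combined_spam_probability(token_probs, n=15):
    """Combine the n most interesting token probabilities into one score."""
    # Clamp so that no single token can force the result to exactly 0 or 1.
    clamped = [min(max(p, 0.01), 0.99) for p in token_probs]
    top = most_interesting(clamped, n)
    if not top:
        return 0.5  # no evidence either way
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)
```

Under these assumptions, a misspelled token only influences the score once training has pushed its probability far enough from 0.5 to displace one of the other 15 tokens.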

The Bayes (training) approach is like using a snow plough to keep
a path clear in inclement weather; anticipating mis-spellings
is like trying to catch the snow on the way down.

There are links to the Plan for Spam, and much other information,
from the Mozilla Bayesian spam page:
http://www.mozilla.org/mailnews/spam.html

This is an automated message, with ID "auto-resolve01".

This bug has had no comments for a long time. Statistically, we have found that
bug reports that have not been confirmed by a second user after three months are
highly unlikely to be the source of a fix to the code.

While your input is very important to us, our resources are limited and so we
are asking for your help in focussing our efforts. If you can still reproduce
this problem in the latest version of the product (see below for how to obtain a
copy) or, for feature requests, if it's not present in the latest version and
you still believe we should implement it, please visit the URL of this bug
(given at the top of this mail) and add a comment to that effect, giving more
reproduction information if you have it.

If it is not a problem any longer, you need take no action. If this bug is not
changed in any way in the next two weeks, it will be automatically resolved.
Thank you for your help in this matter.

The latest beta releases can be obtained from:
Firefox:     http://www.mozilla.org/projects/firefox/
Thunderbird: http://www.mozilla.org/products/thunderbird/releases/1.5beta1.html
Seamonkey:   http://www.mozilla.org/projects/seamonkey/

This bug has been automatically resolved after a period of inactivity (see above
comment). If anyone thinks this is incorrect, they should feel free to reopen it.
Status: UNCONFIRMED → RESOLVED
Closed: 19 years ago
Resolution: --- → EXPIRED
Resolution: EXPIRED → DUPLICATE