Closed Bug 210215 Opened 21 years ago Closed 19 years ago

Junk mail probability algorithm ignores over-abundant tokens

Categories

(MailNews Core :: Filters, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: sparr0, Assigned: sspitzer)


Attachments

(2 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030529
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030529

The algorithm for calculating token probabilities, as taken from Paul Graham, is
flawed in that it severely distorts the values for tokens that appear more than
once per email on average.  For example, say I have 500 emails, 300 junk and 200
nonjunk.  In my 300 junk emails you find the token nbsp (that's an HTML
non-breaking space) a total of 1100 times.  In my 200 nonjunk emails you find it
250 times.  The correct, imho, spam probability of nbsp is
( 1100 / 300 ) / ( ( 250 / 200 ) + ( 1100 / 300 ) ) = 74.58%.  However, the
algorithm we use now throws in a couple of extra dmin() calls that cut
the 1100/300 factor down to 300/300 and the 250/200 down to 200/200, making
Mozilla see the probability as a flat 50%.  If the 1100 and 250 were instead 800
and 100, the proper probability of about 85% would get cut down to about 67%.
(These numbers are fudged a bit, since Mozilla actually gives good emails double
weight, but the basic problem still stands.)

The fix for this would be to simply replace these three lines in
nsBayesianFilter.cpp:
(dmin(1.0, (b / nbad)) /
     (dmin(1.0, (g / ngood)) +
      dmin(1.0, (b / nbad))))));
with these:
((b / nbad) /
 ((g / ngood) +
  (b / nbad)))));

The first dmin() call is completely redundant, since it is directly surrounded
by a 'dmin(.99, ... )'.  The other two are what cause the problem described in
this bug.
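
For concreteness, here is a minimal standalone C++ sketch (not the actual
nsBayesianFilter.cpp code; dmin/dmax are reimplemented locally, and the
good-count doubling is ignored, as in the example above) that computes the nbsp
example both ways:

#include <algorithm>
#include <cstdio>

// Local stand-ins for the dmin()/dmax() helpers in nsBayesianFilter.cpp.
static double dmin(double a, double b) { return std::min(a, b); }
static double dmax(double a, double b) { return std::max(a, b); }

int main() {
    // The nbsp example: 1100 hits in 300 junk mails, 250 hits in 200 good mails.
    double b = 1100.0, nbad = 300.0;
    double g = 250.0,  ngood = 200.0;

    // Current algorithm: the inner dmin() calls clamp both per-message
    // frequencies to 1, so any token averaging more than one hit per
    // message in both corpora collapses to a flat 50%.
    double oldP = dmax(.01, dmin(.99,
                      (dmin(1.0, b / nbad) /
                       (dmin(1.0, g / ngood) +
                        dmin(1.0, b / nbad)))));

    // Proposed algorithm: keep the raw frequency ratios.
    double newP = dmax(.01, dmin(.99,
                      ((b / nbad) /
                       ((g / ngood) +
                        (b / nbad)))));

    printf("old: %.4f  new: %.4f\n", oldP, newP);  // old: 0.5000  new: 0.7458
    return 0;
}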

Reproducible: Always

Steps to Reproduce:

Confirming.  dmose, who should own this?
Status: UNCONFIRMED → NEW
Ever confirmed: true
bz: sspitzer is a good owner for this.  I'd be interested in beard's take on
this proposed patch.
*** Bug 209715 has been marked as a duplicate of this bug. ***
Actual numbers from my newly-trained (and bugged, thanks to Mozilla ignoring
half my training, but that's the subject of another bug report) training.dat.
The token 'p' (usually representing a <p> in an HTML email) occurs 330 times in
my 129 good emails and 463 times in my 24 junk emails.  Unfortunately, this bug
results in a calculated spam probability of 50%, instead of the more correct
probability of (463/24)/((463/24)+(330/129)*2) = 79% (88% if you don't double
the good count).  The opposite happens with 'received' (indicative of forwarded
email, I think), where I see 708 occurrences in 129 good mails and only 49 in 24
junk mails.  Again, the bug makes it 50%, but it should be about 16%.
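
For anyone who wants to double-check these figures, a minimal standalone sketch
of the calculation (mirroring the formula above, with the good counts
double-weighted; this is not the shipping nsBayesianFilter.cpp code):

#include <cstdio>

// Spam probability with the good counts double-weighted, as in the formula
// above: g/b are occurrences in good/junk mail, ngood/nbad are the number
// of good/junk mails trained on.
static double prob(double g, double ngood, double b, double nbad) {
    return (b / nbad) / ((b / nbad) + 2.0 * (g / ngood));
}

int main() {
    printf("p:        %.1f%%\n", 100.0 * prob(330.0, 129.0, 463.0, 24.0));  // 79.0%
    printf("received: %.1f%%\n", 100.0 * prob(708.0, 129.0,  49.0, 24.0));  // 15.7%
    return 0;
}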
The algorithm we use is based on http://www.paulgraham.com/spam.html, and was
implemented by beard.

Improvements are possible; Paul Graham has some listed here:
http://www.paulgraham.com/better.html

I'd be nervous about tinkering with the algorithm before we do some serious
testing to see what the effects would be.

Made my proposed change into an attachment; hope someone else can try it out
too.  It should improve accuracy quite a bit, especially in early training.
Now I'm off to fix the training-is-broken bug  :)
Comment on attachment 130892 [details] [diff] [review]
fixes dmin() skewing ratios for overabundant tokens

Index: nsBayesianFilter.cpp
===================================================================
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp,v
retrieving revision 1.32
diff -u -w -r1.32 nsBayesianFilter.cpp
--- nsBayesianFilter.cpp	5 Aug 2003 20:09:03 -0000	1.32
+++ nsBayesianFilter.cpp	4 Sep 2003 07:49:00 -0000
@@ -645,11 +645,12 @@
	     //      (min .99 (float (/ (min 1 (/ b nbad))
	     // 			(+ (min 1 (/ g ngood))
	     // 			   (min 1 (/ b nbad)))))))
+	      // UPDATE: removed two min calls to fix a bug, one because its redundant
	     token.mProbability = dmax(.01,
				      dmin(.99,
-					  (dmin(1.0, (b / nbad)) /
-					       (dmin(1.0, (g / ngood)) +
-						dmin(1.0, (b / nbad))))));
+					  ((b / nbad) /
+					       ((g / ngood) +
+						(b / nbad)))));
	     PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("token.mProbability (%s) is %f", word, token.mProbability));
	 } else {
	     token.mProbability = 0.4;
Comment on attachment 130892 [details] [diff] [review]
fixes dmin() skewing ratios for overabundant tokens

Index: nsBayesianFilter.cpp
===================================================================
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp,v
retrieving revision 1.32
diff -u -w -r1.32 nsBayesianFilter.cpp
--- nsBayesianFilter.cpp	5 Aug 2003 20:09:03 -0000	1.32
+++ nsBayesianFilter.cpp	4 Sep 2003 07:49:00 -0000
@@ -645,11 +645,12 @@
	     //      (min .99 (float (/ (min 1 (/ b nbad))
	     // 			(+ (min 1 (/ g ngood))
	     // 			   (min 1 (/ b nbad)))))))
+	      // removed two min calls to fix a bug ( http://bugzilla.mozilla.org/show_bug.cgi?id=210215 ) and one because its redundant
	     token.mProbability = dmax(.01,
				      dmin(.99,
-					  (dmin(1.0, (b / nbad)) /
-					       (dmin(1.0, (g / ngood)) +
-						dmin(1.0, (b / nbad))))));
+					  ((b / nbad) /
+					       ((g / ngood) +
+						(b / nbad)))));
	     PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("token.mProbability (%s) is %f", word, token.mProbability));
	 } else {
	     token.mProbability = 0.4;
Attachment #131044 - Flags: superreview?(sspitzer)
Attachment #131044 - Flags: review?(sspitzer)
I've been running my patch for 3 weeks now with no ill effects.  Monitoring my
token list with Mnenhy (and my own training.csv patch) shows much more useful
percentages for many tokens.
Okay, I'm going to test with attachment 131044 [details] [diff] [review] for a while, to see if it fares
any better than the default algorithm in the face of w32.swen (which decimates
the junk filter's general effectiveness, it seems).
Here are some reformatted results for the top 40-ish tokens in my training.dat,
from a modified version of Mnenhy.  Token is the word in question; Good is how
many times I have seen it in nonspam emails; Evil is how many times in spam.
Patch% is the "correct" spam probability according to the algorithm as I have
patched it.  Old% is the "bad" probability as calculated by the current
straight-from-Paul-Graham algorithm.

The junk mail filter has been trained with 500 messages, whereof 354 (71%) have
been rated as solicited and 146 (29%) as junk.
This has led to a total of 19964 tokens, with 11279 (56%) rated as good and
8685 (44%) as evil.

In the following table, the 19926 tokens below the threshold of 1000 appearances
have been ignored (and a few of my personal very-low-% tokens have been censored).

Token      Good    Evil   Patch%   Old%
span          0    1484    99.00  99.00
width         0    1533    99.00  99.00
img           0    1001    99.00  99.00
html          0    1041    99.00  99.00
color         0    1225    99.00  99.00
xxxxxxxx      0    1470    99.00  99.00
td            0    2631    99.00  99.00
tr            0    1548    99.00  99.00
www        1062    2124    70.80  50.00
http       1765    3090    67.97  50.00
font       1876    2958    65.65  50.00
href       1274    1627    60.76  50.00
p          1172    1105    53.34  50.00
and        2211    1542    45.81  50.00
a          5703    3973    45.79  50.00
com        5400    3716    45.48  50.00
of         1655    1037    43.17  50.00
net        2248    1357    42.26  50.00
in         1234     693    40.51  50.00
you        1511     655    34.45  50.00
with       1614     625    31.95  50.00
to         4754    1637    29.45  50.00
for        2214     702    27.77  50.00
from       2333     731    27.53  50.00
the        4328    1346    27.38  50.00
b          4756    1472    27.28  50.00
br         9412    2705    25.84  50.00
is         1111     317    25.70  50.00
this       1118     315    25.46  50.00
by         1629     442    24.75  50.00
on         1285     316    22.97  50.00
be         1098     210    18.82  50.00
received   1421     262    18.27  50.00
aug        1169     205    17.53  50.00
i          1405     235    16.86  50.00
nbsp       4834     357     8.22  50.00
yahoo      1579      61     4.47  29.47
xxxxxxxx   1196      40     3.90  21.51

It is plainly obvious that many tokens with useful percentages are being mashed
into the 50% bracket, and many more are having their percentages skewed.
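
To make the table reproducible, here is a small sketch (assuming the same
double-weighted good counts, with ngood = 354 and nbad = 146 from the training
run above) that recomputes both columns for any row:

#include <algorithm>
#include <cstdio>

// Corpus sizes from the training run described above.
static const double NGOOD = 354.0, NBAD = 146.0;

// Old%: the formula as currently shipped, with the inner min() clamps.
static double oldPct(double good, double evil) {
    double g = std::min(1.0, 2.0 * good / NGOOD);
    double b = std::min(1.0, evil / NBAD);
    return 100.0 * std::max(.01, std::min(.99, b / (g + b)));
}

// Patch%: the same formula with the inner clamps removed.
static double patchPct(double good, double evil) {
    double g = 2.0 * good / NGOOD;
    double b = evil / NBAD;
    return 100.0 * std::max(.01, std::min(.99, b / (g + b)));
}

int main() {
    // The 'www' row: Good = 1062, Evil = 2124.
    printf("www: Patch%% = %.2f, Old%% = %.2f\n",
           patchPct(1062.0, 2124.0), oldPct(1062.0, 2124.0));
    // Prints: www: Patch% = 70.80, Old% = 50.00
    return 0;
}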
Attachment #131044 - Flags: superreview?(sspitzer)
Attachment #131044 - Flags: superreview?(dmose)
Attachment #131044 - Flags: review?(sspitzer)
Attachment #131044 - Flags: review?(dmose)
The changes here are becoming outdated by the progress in bug 181534.  You might
want to have a look there, Clarence (I didn't see your address in the CC list).
Product: MailNews → Core
Comment on attachment 131044 [details] [diff] [review]
like attachment 130892 [details] [diff] [review], but remove dmin/dmax, replace values w/ constants, use an if so that veryjunky isn't logically penalized

As far as I can see, because bug 181534 has landed, this patch is no longer needed.
Attachment #131044 - Flags: superreview?(dmose)
Attachment #131044 - Flags: review?(dmose)
Comment on attachment 131044 [details] [diff] [review]
like attachment 130892 [details] [diff] [review], but remove dmin/dmax, replace values w/ constants, use an if so that veryjunky isn't logically penalized

As far as I can see, because bug 181534 has landed, this patch is no longer
necessary.
Resolving; feel free to re-open if I'm wrong and this algorithm issue still
applies in the new chi-separating world.
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → WONTFIX
Product: Core → MailNews Core
