Closed Bug 245176 Opened 20 years ago Closed 20 years ago

Sign error in Bayesian spam filter causes spam to get through

Categories

(MailNews Core :: Filters, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.8alpha2

People

(Reporter: lorenzo, Assigned: lorenzo)

References

Details

(Whiteboard: fixed-aviary1.0)

Attachments

(3 files)

http://lxr.mozilla.org/seamonkey/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#992

Hexp and Sexp are declared as PRUint32. However, they must be negative. Consider
the following code (line 1004):

        if ( S < 1e-200 ) 
        {
          S = frexp(S, &e);
          Sexp += e;
        }

It is obvious that e is negative, since S < 1e-200. Since we are adding e to
Sexp, and Sexp starts out as zero, obviously either Sexp = 0 or Sexp < 0. So
when we do (line 1017):

S = log(S) + Sexp * M_LN2;

the compiler thinks Sexp is unsigned and completely messes up the conversion
from to floating point (required to multiply by M_LN2). This causes S to assume
silly values like 2e-39 with the result that when the chi-squared function is
called we get S = 0. Thus emails with many bad tokens get through the spam filter.

Changing Sexp from PRUint32 to PRInt32 fixes the problem.
Patch to see what is going wrong
Assignee: mscott → lorenzo
Status: NEW → ASSIGNED
Attached file candidate spam email
Spam email I used for testing
To see this bug, do the following:

1. Apply the patch in attachment 149697 [details] [diff] [review]
2. Drop the .eml file in a mailbox somewhere
3. Enable junk filter NSPR logging (set NSPR_LOG_MODULES=BayesianFilter:5, etc.)
4. Start mail and run the junk filter

This is what I get in the log:

0[3f40a0]: analyze the tokenized message
[...]
0[3f40a0]: token.mProbability (undervalued) is 0.992846
0[3f40a0]: token.mProbability (quick) is 0.998941
0[3f40a0]: S = 1.47474e-164, Sexp = -670
0[3f40a0]: S = 2.97704e+009
0[3f40a0]: imap-message://REMOVED@REMOVED/INBOX#14614 is junk probability =
(0.500000)  HAM SCORE:0.000000 SPAM SCORE:0.000000

If S = 1.47474e-164 and Sexp = -670, doing S = log(S) + Sexp * M_LN2 should give
a result of -841.6, not 2.9e9!

After changing the PRUint32 to PRInt32, this is what I get:

0[3f40a0]: analyze the tokenized message
[...]
0[3f40a0]: token.mProbability (undervalued) is 0.992846
0[3f40a0]: token.mProbability (quick) is 0.998941
0[3f40a0]: S = 1.47474e-164, Sexp = -670
0[3f40a0]: S = -841.644
0[3f40a0]: imap-message://REMOVED@REMOVED/INBOX#14615 is junk probability =
(1.000000)  HAM SCORE:0.000000 SPAM SCORE:1.000000

which is correct and the message is correctly tagged as spam.
Attached patch fix the problemSplinter Review
Patch to fix the problem
Comment on attachment 149699 [details] [diff] [review]
fix the problem

Scott, could you take a look at this?

I'm hoping this is the reason why TB >= 0.5 is so much worse at spotting junk
than 0.4 (at least for me).
Attachment #149699 - Flags: review?(mscott)
hmm but the new junk mail algorithm is in 0.6 not in 0.5. 0.4 and 0.5 use the
same code. 

Also See Bug #244357
(In reply to comment #6)
> hmm but the new junk mail algorithm is in 0.6 not in 0.5. 0.4 and 0.5 use the
> same code.

Yes, sorry. This did not affect 0.5. It started when I started using the
"experimental new junk-mail algorithm" build I saw on the mozillazine forums.
Then I reverted to 0.4 and now I'm on 0.6+

> Also See Bug #244357

Hmm, I didn't notice that. I just assumed the chi-square function was correct.
So maybe bug 244357 was the reason why it was returning exactly zero and not
some other value.

But that's not so important: even if there is a bug in there, this one is far
more serious, since this is the code that determines the input of that function,
and if the input is nonsensical (like in this case) there is no way that that
that function can return a useful value. So this patch needs to be applied.

It's ironic that people are discussing the nuances of implementation of
chi-squared in bug 244357 but missed this problem, which is much more serious... ;-)
I re-ran my benchmark tests with this patch and it checks out ok.

false positives remained the same before and after (5 and 5).

false negatives were reduced from 52 to 41

Of course this was from over 3,000 messages. So percentage wise it's just small
improvement.
checked in. Thanks for the patch!
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Whiteboard: fixed-aviary1.0
Target Milestone: --- → mozilla1.8alpha2
Attachment #149699 - Flags: review?(mscott) → review+
The effect on my inbox was much worse than that. It probably depends on the
training set: the more tokens matched, the more likely that S becomes less than
1e-200 and the message slips through.

In any case, thanks for the quick review and checkin! Thanks to this and bug
244722, only two or three spams have found their way into my inbox since
yesterday. I am a happy camper again. :)
I don't suppose anyone can tell me which release(s) of Mozilla were affected?
1.8 alpha 1
*** Bug 248337 has been marked as a duplicate of this bug. ***
do we have a build that we can test?
(In reply to comment #14)
> do we have a build that we can test?

This should be fixed in Thunderbird 0.7 and not in 0.6.
Product: MailNews → Core
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: