Closed Bug 245176 Opened 20 years ago Closed 20 years ago

Sign error in Bayesian spam filter causes spam to get through

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

mozilla1.8alpha2

People

(Reporter: lorenzo, Assigned: lorenzo)

References

Details

(Whiteboard: fixed-aviary1.0)

Attachments

(3 files)

log statements for testing bug 20 years ago Lorenzo Colitti 802 bytes, patch		Details \| Diff \| Splinter Review
candidate spam email 20 years ago Lorenzo Colitti 4.33 KB, text/plain		Details
fix the problem 20 years ago Lorenzo Colitti 721 bytes, patch	mscott : review+	Details \| Diff \| Splinter Review

Lorenzo Colitti

Assignee

Description

•

20 years ago

http://lxr.mozilla.org/seamonkey/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#992

Hexp and Sexp are declared as PRUint32. However, they must be negative. Consider
the following code (line 1004):

        if ( S < 1e-200 ) 
        {
          S = frexp(S, &e);
          Sexp += e;
        }

It is obvious that e is negative, since S < 1e-200. Since we are adding e to
Sexp, and Sexp starts out as zero, obviously either Sexp = 0 or Sexp < 0. So
when we do (line 1017):

S = log(S) + Sexp * M_LN2;

the compiler thinks Sexp is unsigned and completely messes up the conversion
from to floating point (required to multiply by M_LN2). This causes S to assume
silly values like 2e-39 with the result that when the chi-squared function is
called we get S = 0. Thus emails with many bad tokens get through the spam filter.

Changing Sexp from PRUint32 to PRInt32 fixes the problem.

Lorenzo Colitti

Assignee

Comment 1

•

20 years ago

Attached patch log statements for testing bug — Details — Splinter Review

Patch to see what is going wrong

Assignee: mscott → lorenzo

Status: NEW → ASSIGNED

Lorenzo Colitti

Assignee

Comment 2

•

20 years ago

Attached file candidate spam email — Details

Spam email I used for testing

Lorenzo Colitti

Assignee

Comment 3

•

20 years ago

To see this bug, do the following:

1. Apply the patch in attachment 149697 [details] [diff] [review]
2. Drop the .eml file in a mailbox somewhere
3. Enable junk filter NSPR logging (set NSPR_LOG_MODULES=BayesianFilter:5, etc.)
4. Start mail and run the junk filter

This is what I get in the log:

0[3f40a0]: analyze the tokenized message
[...]
0[3f40a0]: token.mProbability (undervalued) is 0.992846
0[3f40a0]: token.mProbability (quick) is 0.998941
0[3f40a0]: S = 1.47474e-164, Sexp = -670
0[3f40a0]: S = 2.97704e+009
0[3f40a0]: imap-message://REMOVED@REMOVED/INBOX#14614 is junk probability =
(0.500000)  HAM SCORE:0.000000 SPAM SCORE:0.000000

If S = 1.47474e-164 and Sexp = -670, doing S = log(S) + Sexp * M_LN2 should give
a result of -841.6, not 2.9e9!

After changing the PRUint32 to PRInt32, this is what I get:

0[3f40a0]: analyze the tokenized message
[...]
0[3f40a0]: token.mProbability (undervalued) is 0.992846
0[3f40a0]: token.mProbability (quick) is 0.998941
0[3f40a0]: S = 1.47474e-164, Sexp = -670
0[3f40a0]: S = -841.644
0[3f40a0]: imap-message://REMOVED@REMOVED/INBOX#14615 is junk probability =
(1.000000)  HAM SCORE:0.000000 SPAM SCORE:1.000000

which is correct and the message is correctly tagged as spam.

Lorenzo Colitti

Assignee

Comment 4

•

20 years ago

Attached patch fix the problem — Details — Splinter Review

Patch to fix the problem

Lorenzo Colitti

Assignee

Comment 5

•

20 years ago

Comment on attachment 149699 [details] [diff] [review]
fix the problem

Scott, could you take a look at this?

I'm hoping this is the reason why TB >= 0.5 is so much worse at spotting junk
than 0.4 (at least for me).

Attachment #149699 - Flags: review?(mscott)

Scott MacGregor

Comment 6

•

20 years ago

hmm but the new junk mail algorithm is in 0.6 not in 0.5. 0.4 and 0.5 use the
same code. 

Also See Bug #244357

Lorenzo Colitti

Assignee

Comment 7

•

20 years ago

(In reply to comment #6)
> hmm but the new junk mail algorithm is in 0.6 not in 0.5. 0.4 and 0.5 use the
> same code.

Yes, sorry. This did not affect 0.5. It started when I started using the
"experimental new junk-mail algorithm" build I saw on the mozillazine forums.
Then I reverted to 0.4 and now I'm on 0.6+

> Also See Bug #244357

Hmm, I didn't notice that. I just assumed the chi-square function was correct.
So maybe bug 244357 was the reason why it was returning exactly zero and not
some other value.

But that's not so important: even if there is a bug in there, this one is far
more serious, since this is the code that determines the input of that function,
and if the input is nonsensical (like in this case) there is no way that that
that function can return a useful value. So this patch needs to be applied.

It's ironic that people are discussing the nuances of implementation of
chi-squared in bug 244357 but missed this problem, which is much more serious... ;-)

Scott MacGregor

Comment 8

•

20 years ago

I re-ran my benchmark tests with this patch and it checks out ok.

false positives remained the same before and after (5 and 5).

false negatives were reduced from 52 to 41

Of course this was from over 3,000 messages. So percentage wise it's just small
improvement.

Scott MacGregor

Comment 9

•

20 years ago

checked in. Thanks for the patch!

Status: ASSIGNED → RESOLVED

Closed: 20 years ago

Resolution: --- → FIXED

Whiteboard: fixed-aviary1.0

Target Milestone: --- → mozilla1.8alpha2

Scott MacGregor

Updated

•

20 years ago

Attachment #149699 - Flags: review?(mscott) → review+

Lorenzo Colitti

Assignee

Comment 10

•

20 years ago

The effect on my inbox was much worse than that. It probably depends on the
training set: the more tokens matched, the more likely that S becomes less than
1e-200 and the message slips through.

In any case, thanks for the quick review and checkin! Thanks to this and bug
244722, only two or three spams have found their way into my inbox since
yesterday. I am a happy camper again. :)

neil@parkwaycc.co.uk

Comment 11

•

20 years ago

I don't suppose anyone can tell me which release(s) of Mozilla were affected?

Scott MacGregor

Comment 12

•

20 years ago

1.8 alpha 1

Benedikt Kantus

Comment 13

•

20 years ago

*** Bug 248337 has been marked as a duplicate of this bug. ***

brian lehrhoff

Comment 14

•

20 years ago

do we have a build that we can test?

Lorenzo Colitti

Assignee

Comment 15

•

20 years ago

(In reply to comment #14)
> do we have a build that we can test?

This should be fixed in Thunderbird 0.7 and not in 0.6.

Myk Melez [:myk] [@mykmelez]

Updated

•

20 years ago

Product: MailNews → Core

Nobody; OK to take it and work on it

Updated

•

16 years ago

Product: Core → MailNews Core

You need to log in before you can comment on or make changes to this bug.