Closed
Bug 245176
Opened 20 years ago
Closed 20 years ago
Sign error in Bayesian spam filter causes spam to get through
Categories
(MailNews Core :: Filters, defect)
MailNews Core
Filters
Tracking
(Not tracked)
RESOLVED
FIXED
mozilla1.8alpha2
People
(Reporter: lorenzo, Assigned: lorenzo)
References
Details
(Whiteboard: fixed-aviary1.0)
Attachments
(3 files)
802 bytes,
patch
|
Details | Diff | Splinter Review | |
4.33 KB,
text/plain
|
Details | |
721 bytes,
patch
|
mscott
:
review+
|
Details | Diff | Splinter Review |
http://lxr.mozilla.org/seamonkey/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#992 Hexp and Sexp are declared as PRUint32. However, they must be negative. Consider the following code (line 1004): if ( S < 1e-200 ) { S = frexp(S, &e); Sexp += e; } It is obvious that e is negative, since S < 1e-200. Since we are adding e to Sexp, and Sexp starts out as zero, obviously either Sexp = 0 or Sexp < 0. So when we do (line 1017): S = log(S) + Sexp * M_LN2; the compiler thinks Sexp is unsigned and completely messes up the conversion from to floating point (required to multiply by M_LN2). This causes S to assume silly values like 2e-39 with the result that when the chi-squared function is called we get S = 0. Thus emails with many bad tokens get through the spam filter. Changing Sexp from PRUint32 to PRInt32 fixes the problem.
Assignee | ||
Comment 1•20 years ago
|
||
Patch to see what is going wrong
Assignee: mscott → lorenzo
Status: NEW → ASSIGNED
Assignee | ||
Comment 2•20 years ago
|
||
Spam email I used for testing
Assignee | ||
Comment 3•20 years ago
|
||
To see this bug, do the following: 1. Apply the patch in attachment 149697 [details] [diff] [review] 2. Drop the .eml file in a mailbox somewhere 3. Enable junk filter NSPR logging (set NSPR_LOG_MODULES=BayesianFilter:5, etc.) 4. Start mail and run the junk filter This is what I get in the log: 0[3f40a0]: analyze the tokenized message [...] 0[3f40a0]: token.mProbability (undervalued) is 0.992846 0[3f40a0]: token.mProbability (quick) is 0.998941 0[3f40a0]: S = 1.47474e-164, Sexp = -670 0[3f40a0]: S = 2.97704e+009 0[3f40a0]: imap-message://REMOVED@REMOVED/INBOX#14614 is junk probability = (0.500000) HAM SCORE:0.000000 SPAM SCORE:0.000000 If S = 1.47474e-164 and Sexp = -670, doing S = log(S) + Sexp * M_LN2 should give a result of -841.6, not 2.9e9! After changing the PRUint32 to PRInt32, this is what I get: 0[3f40a0]: analyze the tokenized message [...] 0[3f40a0]: token.mProbability (undervalued) is 0.992846 0[3f40a0]: token.mProbability (quick) is 0.998941 0[3f40a0]: S = 1.47474e-164, Sexp = -670 0[3f40a0]: S = -841.644 0[3f40a0]: imap-message://REMOVED@REMOVED/INBOX#14615 is junk probability = (1.000000) HAM SCORE:0.000000 SPAM SCORE:1.000000 which is correct and the message is correctly tagged as spam.
Assignee | ||
Comment 4•20 years ago
|
||
Patch to fix the problem
Assignee | ||
Comment 5•20 years ago
|
||
Comment on attachment 149699 [details] [diff] [review] fix the problem Scott, could you take a look at this? I'm hoping this is the reason why TB >= 0.5 is so much worse at spotting junk than 0.4 (at least for me).
Attachment #149699 -
Flags: review?(mscott)
Comment 6•20 years ago
|
||
hmm but the new junk mail algorithm is in 0.6 not in 0.5. 0.4 and 0.5 use the same code. Also See Bug #244357
Assignee | ||
Comment 7•20 years ago
|
||
(In reply to comment #6) > hmm but the new junk mail algorithm is in 0.6 not in 0.5. 0.4 and 0.5 use the > same code. Yes, sorry. This did not affect 0.5. It started when I started using the "experimental new junk-mail algorithm" build I saw on the mozillazine forums. Then I reverted to 0.4 and now I'm on 0.6+ > Also See Bug #244357 Hmm, I didn't notice that. I just assumed the chi-square function was correct. So maybe bug 244357 was the reason why it was returning exactly zero and not some other value. But that's not so important: even if there is a bug in there, this one is far more serious, since this is the code that determines the input of that function, and if the input is nonsensical (like in this case) there is no way that that that function can return a useful value. So this patch needs to be applied. It's ironic that people are discussing the nuances of implementation of chi-squared in bug 244357 but missed this problem, which is much more serious... ;-)
Comment 8•20 years ago
|
||
I re-ran my benchmark tests with this patch and it checks out ok. false positives remained the same before and after (5 and 5). false negatives were reduced from 52 to 41 Of course this was from over 3,000 messages. So percentage wise it's just small improvement.
Comment 9•20 years ago
|
||
checked in. Thanks for the patch!
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Whiteboard: fixed-aviary1.0
Target Milestone: --- → mozilla1.8alpha2
Updated•20 years ago
|
Attachment #149699 -
Flags: review?(mscott) → review+
Assignee | ||
Comment 10•20 years ago
|
||
The effect on my inbox was much worse than that. It probably depends on the training set: the more tokens matched, the more likely that S becomes less than 1e-200 and the message slips through. In any case, thanks for the quick review and checkin! Thanks to this and bug 244722, only two or three spams have found their way into my inbox since yesterday. I am a happy camper again. :)
Comment 11•20 years ago
|
||
I don't suppose anyone can tell me which release(s) of Mozilla were affected?
Comment 12•20 years ago
|
||
1.8 alpha 1
Comment 13•20 years ago
|
||
*** Bug 248337 has been marked as a duplicate of this bug. ***
Comment 14•20 years ago
|
||
do we have a build that we can test?
Assignee | ||
Comment 15•20 years ago
|
||
(In reply to comment #14) > do we have a build that we can test? This should be fixed in Thunderbird 0.7 and not in 0.6.
Updated•20 years ago
|
Product: MailNews → Core
Updated•16 years ago
|
Product: Core → MailNews Core
You need to log in
before you can comment on or make changes to this bug.
Description
•