Closed
Bug 210215
Opened 21 years ago
Closed 19 years ago
Junk mail probability algorithm ignores over-abundant tokens
Categories
(MailNews Core :: Filters, defect)
MailNews Core
Filters
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: sparr0, Assigned: sspitzer)
References
Details
Attachments
(2 files)
675 bytes, patch
3.84 KB, patch
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030529
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030529

The algorithm for calculating token probabilities, as taken from Paul Graham, is flawed in that it severely distorts the values for tokens that appear more than once per email on average.

For example, say I have 500 emails: 300 junk and 200 nonjunk. In my 300 junk emails the token nbsp (that's an HTML non-breaking space) appears a total of 1100 times; in my 200 nonjunk emails it appears 250 times. The correct, IMHO, spam probability of nbsp is (1100 / 300) / ((250 / 200) + (1100 / 300)) = 74.58%. However, the algorithm we use now throws in a couple of extra dmin() calls, which cut the 1100/300 factor down to 300/300 and the 250/200 down to 200/200, making Mozilla see the probability as a flat 50%. If the 1100 and 250 were instead 800 and 100, the proper probability of about 84% would get cut down to about 67%. (These numbers are fudged a bit, since Mozilla gives good emails double weight, but the basic problem still stands.)

The fix is simply to replace these three lines in nsBayesianFilter.cpp:

  (dmin(1.0, (b / nbad)) /
   (dmin(1.0, (g / ngood)) +
    dmin(1.0, (b / nbad))))));

with these:

  ((b / nbad) /
   ((g / ngood) +
    (b / nbad)))));

The first dmin() call is completely redundant, since it is directly surrounded by a 'dmin(.99, ...)'. The other two are what cause the problem described in this bug.

Reproducible: Always

Steps to Reproduce:
Comment 1•21 years ago
Confirming. dmose, who should own this?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 2•21 years ago
bz: sspitzer is a good owner for this. I'd be interested in beard's take on this proposed patch.
Reporter
Comment 3•21 years ago
*** Bug 209715 has been marked as a duplicate of this bug. ***
Reporter
Comment 4•21 years ago
Actual numbers from my newly trained (and bugged, thanks to Mozilla ignoring half my training, but that's the subject of another bug report) training.dat. The token 'p' (usually representing a <p> in an HTML email) occurs 330 times in my 129 good emails and 463 times in my 24 junk emails. Unfortunately, this bug results in a calculated spam probability of 50%, instead of the more correct probability of (463/24) / ((463/24) + (330/129)*2) = 79% (88% if you don't double the good count). The opposite happens with 'received' (indicative of forwarded email, I think), where I see 708 occurrences in 129 good mails and only 49 in 24 junk mails. Again, the bug makes it 50%, but it should be about 16%.
Assignee
Comment 5•21 years ago
The algorithm we use is based on http://www.paulgraham.com/spam.html and was implemented by beard. Improvements are possible; Paul Graham has some listed at http://www.paulgraham.com/better.html. I'd be nervous about tinkering with the algorithm before we do some serious testing to see what the effects would be.
Reporter
Comment 6•21 years ago
I've made my proposed change into an attachment; hopefully someone else can try it out too. It should improve accuracy quite a bit, especially in early training. Now I'm off to fix the training-is-broken bug :)
Reporter
Comment 7•21 years ago
Comment on attachment 130892 [details] [diff] [review]
fixes dmin() skewing ratios for overabundant tokens

Index: nsBayesianFilter.cpp
===================================================================
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp,v
retrieving revision 1.32
diff -u -w -r1.32 nsBayesianFilter.cpp
--- nsBayesianFilter.cpp    5 Aug 2003 20:09:03 -0000    1.32
+++ nsBayesianFilter.cpp    4 Sep 2003 07:49:00 -0000
@@ -645,11 +645,12 @@
     //   (min .99 (float (/ (min 1 (/ b nbad))
     //                      (+ (min 1 (/ g ngood))
     //                         (min 1 (/ b nbad)))))))
+    // UPDATE: removed two min calls to fix a bug, one because its redundant
     token.mProbability = dmax(.01, dmin(.99,
-        (dmin(1.0, (b / nbad)) /
-         (dmin(1.0, (g / ngood)) +
-          dmin(1.0, (b / nbad))))));
+        ((b / nbad) /
+         ((g / ngood) +
+          (b / nbad)))));
     PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("token.mProbability (%s) is %f", word, token.mProbability));
 } else {
     token.mProbability = 0.4;
Reporter
Comment 8•21 years ago
Comment on attachment 130892 [details] [diff] [review]
fixes dmin() skewing ratios for overabundant tokens

Index: nsBayesianFilter.cpp
===================================================================
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp,v
retrieving revision 1.32
diff -u -w -r1.32 nsBayesianFilter.cpp
--- nsBayesianFilter.cpp    5 Aug 2003 20:09:03 -0000    1.32
+++ nsBayesianFilter.cpp    4 Sep 2003 07:49:00 -0000
@@ -645,11 +645,12 @@
     //   (min .99 (float (/ (min 1 (/ b nbad))
     //                      (+ (min 1 (/ g ngood))
     //                         (min 1 (/ b nbad)))))))
+    // removed two min calls to fix a bug ( http://bugzilla.mozilla.org/show_bug.cgi?id=210215 ) and one because its redundant
     token.mProbability = dmax(.01, dmin(.99,
-        (dmin(1.0, (b / nbad)) /
-         (dmin(1.0, (g / ngood)) +
-          dmin(1.0, (b / nbad))))));
+        ((b / nbad) /
+         ((g / ngood) +
+          (b / nbad)))));
     PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("token.mProbability (%s) is %f", word, token.mProbability));
 } else {
     token.mProbability = 0.4;
Attachment #131044 - Flags: superreview?(sspitzer)
Attachment #131044 - Flags: review?(sspitzer)
Reporter
Comment 10•21 years ago
I've been running my patch for 3 weeks now with no ill effects. Monitoring my token list with Mnenhy (and my own training.csv patch) indicates much more useful percentages for many tokens.
Comment 11•21 years ago
Okay, I'm going to test with attachment 131044 [details] [diff] [review] for a while, to see if it fares any better than the default algorithm in the face of w32.swen (which decimates the junk filter's general effectiveness, it seems).
Reporter
Comment 12•21 years ago
Here are some reformatted results for the top 40 or so tokens in my training.dat, from a modified version of Mnenhy. Token is the word in question, Good is how many times I have seen it in a nonspam email, and Evil is how many times in spam. Patch% is the "correct" spam probability according to the algorithm as I have patched it; Old% is the "bad" probability as calculated by the current straight-from-Paul-Graham algorithm.

The junk mail filter has been trained with 500 messages, whereof 354 (71%) have been rated as solicited and 146 (29%) as junk. This has led to a total of 19964 tokens, with 11279 (56%) rated as good and 8685 (44%) as evil. In the following table, 19926 tokens below the threshold of 1000 appearances have been ignored (and a few of my personal very-low-% tokens have been censored).

Token     Good  Evil  Patch%  Old%
span         0  1484   99.00  99.00
width        0  1533   99.00  99.00
img          0  1001   99.00  99.00
html         0  1041   99.00  99.00
color        0  1225   99.00  99.00
xxxxxxxx     0  1470   99.00  99.00
td           0  2631   99.00  99.00
tr           0  1548   99.00  99.00
www       1062  2124   70.80  50.00
http      1765  3090   67.97  50.00
font      1876  2958   65.65  50.00
href      1274  1627   60.76  50.00
p         1172  1105   53.34  50.00
and       2211  1542   45.81  50.00
a         5703  3973   45.79  50.00
com       5400  3716   45.48  50.00
of        1655  1037   43.17  50.00
net       2248  1357   42.26  50.00
in        1234   693   40.51  50.00
you       1511   655   34.45  50.00
with      1614   625   31.95  50.00
to        4754  1637   29.45  50.00
for       2214   702   27.77  50.00
from      2333   731   27.53  50.00
the       4328  1346   27.38  50.00
b         4756  1472   27.28  50.00
br        9412  2705   25.84  50.00
is        1111   317   25.70  50.00
this      1118   315   25.46  50.00
by        1629   442   24.75  50.00
on        1285   316   22.97  50.00
be        1098   210   18.82  50.00
received  1421   262   18.27  50.00
aug       1169   205   17.53  50.00
i         1405   235   16.86  50.00
nbsp      4834   357    8.22  50.00
yahoo     1579    61    4.47  29.47
xxxxxxxx  1196    40    3.90  21.51

It is plainly obvious that many tokens with useful percentages are being mashed into the 50% bracket, and many more are having their percentages skewed.
Attachment #131044 - Flags: superreview?(sspitzer)
Attachment #131044 - Flags: superreview?(dmose)
Attachment #131044 - Flags: review?(sspitzer)
Attachment #131044 - Flags: review?(dmose)
Comment 13•20 years ago
The changes here are becoming outdated by the progress in bug 181534. You might want to have a look there, Clarence (I didn't see your address in the CC list).
Updated•20 years ago
Product: MailNews → Core
Comment 14•19 years ago
Comment on attachment 131044 [details] [diff] [review]
like attachment 130892 [details] [diff] [review], but remove dmin/dmax, replace values w/ constants, use an if so that veryjunky isn't logically penalized

As far as I can see, because bug 181534 has landed, this patch is no longer needed.
Attachment #131044 - Flags: superreview?(dmose)
Attachment #131044 - Flags: review?(dmose)
Comment 15•19 years ago
Comment on attachment 131044 [details] [diff] [review]
like attachment 130892 [details] [diff] [review], but remove dmin/dmax, replace values w/ constants, use an if so that veryjunky isn't logically penalized

As far as I can see, because bug 181534 has landed, this patch is no longer necessary.
Comment 16•19 years ago
Resolving; feel free to re-open if I'm wrong and this algorithm issue still applies in the new chi-separating world.
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → WONTFIX
Updated•16 years ago
Product: Core → MailNews Core