Closed
Bug 209715
Opened 21 years ago
Closed 21 years ago
Common junk tokens ignored, low-training accuracy skewed, due to mathematical quirk
Categories
(MailNews Core :: Filters, defect)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 210215
People
(Reporter: sparr0, Assigned: sspitzer)
Details
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529

In the expression below, b is the junk count for the token, g is the nonjunk count, nbad is the total number of messages marked junk, and ngood is the total number marked nonjunk:

token.mProbability = dmax(.01, dmin(.99, (dmin(1.0, (b / nbad)) / (dmin(1.0, (g / ngood)) + dmin(1.0, (b / nbad))))));

Wrapping each of the three divisions in dmin(1.0, ...) breaks the cases where a token occurs more than once per email on average. As a result, really common tokens are ignored even though they could be extremely good spam indicators: 1000/100 becomes the same as 200/100 after passing through dmin(), because both are clamped to 1.0, despite the token being something like an 85% spam indicator.

This also skews the accuracy a bit when you first start training, because initially there are quite a few tokens that average more than one occurrence per mail.

Reproducible: Always

Steps to Reproduce:
Comment 1•21 years ago
*** This bug has been marked as a duplicate of 210215 ***
Status: UNCONFIRMED → RESOLVED
Closed: 21 years ago
Resolution: --- → DUPLICATE
Updated•20 years ago
Product: MailNews → Core
Updated•16 years ago
Product: Core → MailNews Core