Closed Bug 209715 Opened 21 years ago Closed 21 years ago

Common junk tokens ignored, low-training accuracy skewed, due to mathematical quirk

Categories

(MailNews Core :: Filters, defect)

Type: defect
Priority: Not set
Severity: minor

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 210215

People

(Reporter: sparr0, Assigned: sspitzer)

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529

In the code below, b is the junk count for the token, g is its nonjunk count,
nbad is the total number of messages marked junk, and ngood is the total number
marked nonjunk:

            token.mProbability = dmax(.01,
                                     dmin(.99,
                                         (dmin(1.0, (b / nbad)) /
                                              (dmin(1.0, (g / ngood)) +
                                               dmin(1.0, (b / nbad))))));

Wrapping each of the three inner divisions in dmin(1.0, ... ) breaks the cases
where a token occurs more than once per email on average. As a result, really
common tokens are ignored even though they could be extremely good spam
indicators: 1000/100 becomes the same as 200/100 after passing through dmin(),
despite the token being something like an 85% spam indicator. It also skews
the accuracy a bit when you first start training, because initially quite a
few tokens average more than one occurrence per mail.
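
The effect can be sketched in a few lines. This is only an illustration under
stated assumptions: dmin and dmax here are hypothetical stand-ins for the
helpers in the quoted snippet, the function names are made up for the example,
and the unclamped variant is one possible alternative, not necessarily the fix
adopted in bug 210215. It also reads 1000/100 as the junk ratio and 200/100 as
the nonjunk ratio, which is an assumption about the example above.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-ins for the dmin/dmax helpers in the quoted snippet.
static double dmin(double x, double y) { return std::min(x, y); }
static double dmax(double x, double y) { return std::max(x, y); }

// Same formula as the quoted snippet: each ratio is capped at 1.0 before
// the two are combined.
double clampedProbability(double b, double g, double nbad, double ngood) {
    assert(nbad > 0 && ngood > 0 && b + g > 0);
    double badRatio = dmin(1.0, b / nbad);
    double goodRatio = dmin(1.0, g / ngood);
    return dmax(.01, dmin(.99, badRatio / (goodRatio + badRatio)));
}

// Assumed alternative for comparison: no per-ratio cap, only the final
// result is limited to the range [.01, .99].
double unclampedProbability(double b, double g, double nbad, double ngood) {
    assert(nbad > 0 && ngood > 0 && b + g > 0);
    double badRatio = b / nbad;
    double goodRatio = g / ngood;
    return dmax(.01, dmin(.99, badRatio / (goodRatio + badRatio)));
}
```

With b = 1000, g = 200 and nbad = ngood = 100, clampedProbability returns 0.5
(both ratios are capped to 1.0), while unclampedProbability returns about 0.83.
For tokens that average less than one occurrence per mail the two agree.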

Reproducible: Always

Steps to Reproduce:

*** This bug has been marked as a duplicate of 210215 ***
Status: UNCONFIRMED → RESOLVED
Closed: 21 years ago
Resolution: --- → DUPLICATE
Product: MailNews → Core
Product: Core → MailNews Core