Closed Bug 209715 Opened 21 years ago Closed 21 years ago

Common junk tokens ignored, low-training accuracy skewed, due to mathematical quirk

Tracking

(Not tracked)

Status:

RESOLVED DUPLICATE of bug 210215

People

(Reporter: sparr0, Assigned: sspitzer)

Details

Clarence Risher

Reporter

Description

21 years ago

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529

In the code below, b is the junk count for the token, g is its non-junk count,
nbad is the total number of messages marked junk, and ngood is the total number
marked non-junk:

            token.mProbability = dmax(.01,
                                     dmin(.99,
                                         (dmin(1.0, (b / nbad)) /
                                              (dmin(1.0, (g / ngood)) +
                                               dmin(1.0, (b / nbad))))));

Wrapping each of the three divisions in dmin(1.0, ... ) breaks the cases where a
token occurs more than once per email on average. Really common tokens end up
ignored, even though they could be extremely good spam indicators: 1000/100
becomes the same as 200/100 after passing through dmin(), because both clamp to
1.0, despite the token being something like an 85% spam indicator. This also
skews the accuracy a bit when you first start training, because initially quite
a few tokens average more than one occurrence per mail.

Reproducible: Always

Steps to Reproduce:

Clarence Risher

Reporter

Comment 1

21 years ago


*** This bug has been marked as a duplicate of 210215 ***

Status: UNCONFIRMED → RESOLVED

Closed: 21 years ago

Resolution: --- → DUPLICATE

Myk Melez [:myk] [@mykmelez]

Updated

20 years ago

Product: MailNews → Core

Nobody; OK to take it and work on it

Updated

16 years ago

Product: Core → MailNews Core


Categories

(MailNews Core :: Filters, defect)
