Auto-classified (non) junk mail should be fed back into the training system

RESOLVED DUPLICATE of bug 197430

Status

Product: MailNews Core
Component: Filters
Severity: enhancement
Status: RESOLVED DUPLICATE of bug 197430
Reported: 15 years ago
Last modified: 10 years ago

People

(Reporter: Clarence Risher, Assigned: (not reading, please use seth@sspitzer.org instead))

Details

(Reporter)

Description

15 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529

I will begin by saying that, to counteract the effect of unbalanced corpus
sizes, it would be appropriate to scale good/bad token counts by the proportion
of good/bad emails when calculating probabilities.  If the token "foo" has 50
good appearances and 70 bad appearances then the spam probability would
normally be 70/(50+70) = 58%, but if you adjust for the number of classified
emails and there have been 1000 good and 1500 bad emails then you would adjust
it to 70/(50*(1500/1000)+70) = 48%, which is what the probability would be if
you got 500 more good emails that followed the pattern of your first 1000.
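
A quick sketch of that calculation, using the numbers above (the function name
is purely illustrative and is not taken from Mozilla's code):

# Worked example of the proposed count adjustment; numbers are from this report.
def adjusted_spam_probability(good_count, bad_count, good_msgs, bad_msgs):
    # Scale the token's good count as if the good corpus were as large
    # as the bad corpus, then take the usual bad/(good+bad) ratio.
    scaled_good = good_count * (bad_msgs / good_msgs)
    return bad_count / (scaled_good + bad_count)

raw = 70 / (50 + 70)                                      # ~0.583
adjusted = adjusted_spam_probability(50, 70, 1000, 1500)  # 70 / (75 + 70), ~0.483
print(f"raw: {raw:.0%}  adjusted: {adjusted:.0%}")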

And, on to the topic of this bug/RFE...  When the junk mail controls (JMC)
classify a mail as junk or nonjunk, its contents should be added to the
training of the JMC as junk or nonjunk accordingly.  This does NOT produce a
feedback effect; it is how Bayesian filtering was designed to work.  In the
words of the SpamBayes project (http://spambayes.sourceforge.net):
"mistake-based training" - where you only train on messages that the system got
wrong - results in the system taking a lot longer to get really good at what it
does. You're better off feeding it the things it gets right, as well as the
things it gets wrong.

You would want to make the three states of junk, nonjunk, and unclassified
VERY apparent to the user so that they know they need to classify their
unclassified mail.

Reproducible: Always

Steps to Reproduce:
(Reporter)

Updated

15 years ago
Depends on: 209483

Comment 1

15 years ago
This really looks to me like two bugs. The first is a request that the words be
weighed by the ratio of junk/non-junk. I'm pretty sure this is already done, but
I could be wrong.

The other is to change from the two-state system to a three-state system. I
actually think this would be a very good idea, but again, I don't know for sure.

Still, this really appears to be two different bugs, and they both should have
the severity of enhancement.
(Reporter)

Comment 2

15 years ago
> This really looks to me like two bugs. The first is a request that the words be
> weighed by the ratio of junk/non-junk. I'm pretty sure this is already done, but
> I could be wrong.
>
> The other is to change from the two-state system to a three-state system. I
> actually think this would be a very good idea, but again, I don't know for sure.
>
> Still, this really appears to be two different bugs, and they both should have
> the severity of enhancement.

As to the point about weighting the values, I think you are right.  I was
mis-reading the code my first time through.

How many states we have right now depends on how you look at it.  Since we
aren't doing automatic training we really have four states: manually tagged
(and trained) junk, manually tagged nonjunk, automatically tagged (and not
trained) junk, and automatically tagged nonjunk (plus the never-tagged nonjunk
you start with, which is kinda the same as #4).  I want to effectively do away
with the tagged-but-not-trained states and replace them with a state for email
that has never been classified.  The fact that we aren't doing automatic
training is a bug, in that the way our Bayesian filter is working is not how a
Bayesian filter is designed to work.  The change to a three-state system would
just be a side effect of implementing the filter correctly.

Also, we would need to mark all emails as unclassified in the event that
training.dat got deleted or corrupted.  We wouldn't be able to cover the odd
case of it being replaced, but that rarely happens.
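
A sketch of that reset behaviour, assuming a deliberately crude availability
check (real corruption detection would have to validate the actual training.dat
format; everything here is illustrative):

import os

UNCLASSIFIED = "unclassified"

def training_data_available(path="training.dat"):
    # Crude check: the file exists and is non-empty.  A real implementation
    # would parse and validate the training.dat contents instead.
    try:
        return os.path.getsize(path) > 0
    except OSError:
        return False

def reset_if_untrained(messages, path="training.dat"):
    # If the trained corpus is gone, earlier junk/nonjunk tags no longer
    # reflect anything the filter knows, so mark every message unclassified.
    if not training_data_available(path):
        for msg in messages:
            msg.state = UNCLASSIFIED
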
Severity: normal → enhancement
(Reporter)

Comment 3

15 years ago
One very relevant comment on this issue from the SpamBayes project (
http://spambayes.sourceforge.net/background.html ):

"mistake-based training" - where you only train on messages that the system got
wrong - results in the system taking a lot longer to get really good at what it
does. You're better off feeding it the things it gets right, as well as the
things it gets wrong.
(Reporter)

Comment 4

15 years ago
OK, now I feel stupid, repeating myself.  One too many bugzilla posts before
going to sleep.  I apologize for that (and this) bug spam to anyone CCing.
(Reporter)

Comment 5

15 years ago
The quote I meant to paste is from SpamAssassin:

Mistake-based training
This means training on a small number of mails, then only training on messages
that SpamAssassin classifies incorrectly. This works, but it takes longer to get
it right than a full training session would.

Comment 6

15 years ago
Is this a dup of bug 197430?
(Reporter)

Comment 7

15 years ago
yes, it is.

*** This bug has been marked as a duplicate of 197430 ***
Status: UNCONFIRMED → RESOLVED
Last Resolved: 15 years ago
Resolution: --- → DUPLICATE
Product: MailNews → Core
Product: Core → MailNews Core