Closed Bug 209510 Opened 21 years ago Closed 21 years ago

Auto-classified (non) junk mail should be fed back into the training system

Categories

(MailNews Core :: Filters, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 197430

People

(Reporter: sparr0, Assigned: sspitzer)

References

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529

I will begin by saying that, to counteract the effect of unbalanced corpus sizes,
it would be appropriate to scale the good/bad token counts by the proportion of
good/bad emails when calculating probabilities.  If the token "foo" has 50 good
appearances and 70 bad appearances, the spam probability would normally be
70/(50+70) = 58%; but if you adjust for the number of classified emails, and
there have been 1000 good and 1500 bad emails, you would adjust it to
70/((50*(1500/1000))+70) = 48%, which is what the probability would be if you
got 500 more good emails that followed the pattern of your first 1000.
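The adjustment described above can be sketched as follows (a minimal illustration of the arithmetic; the function and variable names are hypothetical and not taken from Mozilla's actual filter code):

```python
def spam_probability(bad_count, good_count, total_bad, total_good):
    """Token spam probability, scaling the good count by the ratio of
    bad to good messages to compensate for an unbalanced corpus."""
    scaled_good = good_count * (total_bad / total_good)
    return bad_count / (scaled_good + bad_count)

# Unadjusted: 70 / (50 + 70) ~= 58%
# Adjusted for 1000 good vs. 1500 bad messages: 70 / (75 + 70) ~= 48%
print(round(spam_probability(70, 50, 1500, 1000), 2))  # 0.48
```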

And, on to the topic of this bug/RFE...  When the JMC classifies a mail as junk
or nonjunk, its contents should be added to the training of the JMC as junk or
nonjunk respectively.  This does NOT produce a feedback effect; it is how
Bayesian filtering was designed to work.  In the words of the SpamBayes project
(http://spambayes.sourceforge.net):
"mistake-based training" - where you only train on messages that the system got
wrong - results in the system taking a lot longer to get really good at what it
does. You're better off feeding it the things it gets right, as well as the
things it gets wrong.

You would want to make the three states of junk, nonjunk, and unclassified VERY
apparent to the user, so that they know they need to classify their unclassified
mail.

Reproducible: Always

Steps to Reproduce:
Depends on: 209483
This really looks to me like two bugs. The first is a request that the words be
weighed by the ratio of junk/non-junk. I'm pretty sure this is already done, but
I could be wrong.

The other is to change from the two-state system to a three-state system. I
actually think this would be a very good idea, but again, I don't know for sure.

Still, this really appears to be two different bugs, and they both should have
the severity of enhancement.

As to the point about weighting the values, I think you are right.  I was
mis-reading the code my first time through.

How many states we have right now depends on how you look at it.  Since we
aren't doing automatic training, we really have four states: manually tagged
(and trained) junk, manually tagged nonjunk, automatically tagged (and not
trained) junk, and automatically tagged nonjunk (plus the never-tagged nonjunk
you start with, which is effectively the same as the fourth).  I want to do away
with the tagged-but-not-trained states and replace them with a state for email
that has never been classified.  The fact that we aren't doing automatic
training is a bug, in that our Bayesian filter is not working the way a Bayesian
filter is designed to work.  The change to a three-state system would just be a
side effect of implementing the filter correctly.

Also, we would need to mark all emails as unclassified in the event that
training.dat got deleted or corrupted.  We wouldn't be able to cover the odd
case of it being replaced, but that rarely happens.
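The proposed flow could look something like the following sketch (purely hypothetical names and a placeholder scoring rule, not the real Mozilla implementation): every automatic classification is immediately trained on, so the "tagged but not trained" states disappear and only three states remain.

```python
from enum import Enum

class State(Enum):
    UNCLASSIFIED = 0
    JUNK = 1
    NONJUNK = 2

class AutoTrainingFilter:
    """Toy filter illustrating the auto-training feedback this bug requests."""

    def __init__(self):
        self.good_tokens = {}
        self.bad_tokens = {}

    def train(self, tokens, is_junk):
        table = self.bad_tokens if is_junk else self.good_tokens
        for t in tokens:
            table[t] = table.get(t, 0) + 1

    def classify(self, tokens):
        # Placeholder score: fraction of tokens seen more often in junk.
        hits = sum(1 for t in tokens
                   if self.bad_tokens.get(t, 0) > self.good_tokens.get(t, 0))
        is_junk = bool(tokens) and hits / len(tokens) > 0.5
        # Auto-classified mail is trained on immediately, so there is
        # no "tagged but untrained" state.
        self.train(tokens, is_junk)
        return State.JUNK if is_junk else State.NONJUNK

    def reset(self):
        # If training.dat were lost or corrupted, all mail would revert
        # to UNCLASSIFIED and the corpus would start over.
        self.good_tokens.clear()
        self.bad_tokens.clear()
```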
Severity: normal → enhancement
One very relevant comment on this issue from the SpamBayes project (
http://spambayes.sourceforge.net/background.html ):

"mistake-based training" - where you only train on messages that the system got
wrong - results in the system taking a lot longer to get really good at what it
does. You're better off feeding it the things it gets right, as well as the
things it gets wrong.
OK, now I feel stupid, repeating myself.  One too many Bugzilla posts before
going to sleep.  I apologize for that (and this) bug spam to anyone CCing.
The quote I meant to paste is from SpamAssassin:

Mistake-based training
This means training on a small number of mails, then only training on messages
that SpamAssassin classifies incorrectly. This works, but it takes longer to get
it right than a full training session would.
Is this a dup of bug 197430?
yes, it is.

*** This bug has been marked as a duplicate of 197430 ***
Status: UNCONFIRMED → RESOLVED
Closed: 21 years ago
Resolution: --- → DUPLICATE
Product: MailNews → Core
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.