Closed Bug 209510 Opened 21 years ago Closed 21 years ago

Auto-classified (non) junk mail should be fed back into the training system

Categories

(MailNews Core :: Filters, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 197430

People

(Reporter: sparr0, Assigned: sspitzer)

References

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529

I will begin by saying that, to counteract the effect of unbalanced corpus sizes,
it would be appropriate to scale the good/bad token counts by the proportion of
good/bad emails when calculating probabilities.  If the token "foo" has 50 good
appearances and 70 bad appearances, the spam probability would normally be
70/(50+70) = 58%; but if you adjust for the number of classified emails, and
there have been 1000 good and 1500 bad emails, you would adjust it to
70/((50*(1500/1000))+70) = 48%, which is what the probability would be if you
got 500 more good emails that followed the pattern of your first 1000.
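The adjustment described above can be sketched as follows (a minimal illustration of the arithmetic; the function and variable names are hypothetical and not taken from Mozilla's actual filter code):

```python
def spam_probability(bad_count, good_count, total_bad, total_good):
    """Token spam probability, scaling the good count by the ratio of
    bad to good messages to compensate for an unbalanced corpus."""
    scaled_good = good_count * (total_bad / total_good)
    return bad_count / (scaled_good + bad_count)

# Unadjusted: 70 / (50 + 70) ~= 58%
# Adjusted for 1000 good vs. 1500 bad messages: 70 / (75 + 70) ~= 48%
print(round(spam_probability(70, 50, 1500, 1000), 2))  # 0.48
```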

And, on to the topic of this bug/RFE...  When the JMC classifies a mail as junk
or nonjunk, its contents should be added to the training of the JMC as junk or
nonjunk respectively.  This does NOT produce a feedback effect; it is how
Bayesian filtering was designed to work.  In the words of the SpamBayes project
(http://spambayes.sourceforge.net):
"mistake-based training" - where you only train on messages that the system got
wrong - results in the system taking a lot longer to get really good at what it
does. You're better off feeding it the things it gets right, as well as the
things it gets wrong.

You would want to make the three states of junk, nonjunk, and unclassified VERY
apparent to the user, so that they know they need to classify their unclassified
mail.

Reproducible: Always

Steps to Reproduce:
Depends on: 209483
This really looks to me like two bugs. The first is a request that the words be
weighed by the ratio of junk/non-junk. I'm pretty sure this is already done, but
I could be wrong.

The other is to change from the two-state system to a three-state system. I
actually think this would be a very good idea, but again, I don't know for sure.

Still, this really appears to be two different bugs, and they both should have
the severity of enhancement.

As to the point about weighting the values, I think you are right.  I was
mis-reading the code my first time through.

How many states we have right now depends on how you look at it.  Since we
aren't doing automatic training, we really have four states: manually tagged
(and trained) junk, manually tagged nonjunk, automatically tagged (and not
trained) junk, and automatically tagged nonjunk (plus the never-tagged nonjunk
you start with, which is effectively the same as the fourth).  I want to do away
with the tagged-but-not-trained states and replace them with a state for email
that has never been classified.  The fact that we aren't doing automatic
training is a bug, in that our Bayesian filter is not working the way a Bayesian
filter is designed to work.  The change to a three-state system would just be a
side effect of implementing the filter correctly.

Also, we would need to mark all emails as unclassified in the event that
training.dat got deleted or corrupted.  We wouldn't be able to cover the odd
case of it being replaced, but that rarely happens.
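The proposed flow could look something like the following sketch (purely hypothetical names and a placeholder scoring rule, not the real Mozilla implementation): every automatic classification is immediately trained on, so the "tagged but not trained" states disappear and only three states remain.

```python
from enum import Enum

class State(Enum):
    UNCLASSIFIED = 0
    JUNK = 1
    NONJUNK = 2

class AutoTrainingFilter:
    """Toy filter illustrating the auto-training feedback this bug requests."""

    def __init__(self):
        self.good_tokens = {}
        self.bad_tokens = {}

    def train(self, tokens, is_junk):
        table = self.bad_tokens if is_junk else self.good_tokens
        for t in tokens:
            table[t] = table.get(t, 0) + 1

    def classify(self, tokens):
        # Placeholder score: fraction of tokens seen more often in junk.
        hits = sum(1 for t in tokens
                   if self.bad_tokens.get(t, 0) > self.good_tokens.get(t, 0))
        is_junk = bool(tokens) and hits / len(tokens) > 0.5
        # Auto-classified mail is trained on immediately, so there is
        # no "tagged but untrained" state.
        self.train(tokens, is_junk)
        return State.JUNK if is_junk else State.NONJUNK

    def reset(self):
        # If training.dat were lost or corrupted, all mail would revert
        # to UNCLASSIFIED and the corpus would start over.
        self.good_tokens.clear()
        self.bad_tokens.clear()
```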
Severity: normal → enhancement
One very relevant comment on this issue from the SpamBayes project (
http://spambayes.sourceforge.net/background.html ):

"mistake-based training" - where you only train on messages that the system got
wrong - results in the system taking a lot longer to get really good at what it
does. You're better off feeding it the things it gets right, as well as the
things it gets wrong.
OK, now I feel stupid, repeating myself.  One too many Bugzilla posts before
going to sleep.  I apologize for that (and this) bug spam to anyone CCing.
The quote I meant to paste is from SpamAssassin:

Mistake-based training
This means training on a small number of mails, then only training on messages
that SpamAssassin classifies incorrectly. This works, but it takes longer to get
it right than a full training session would.
Is this a dup of bug 197430?
yes, it is.

*** This bug has been marked as a duplicate of 197430 ***
Status: UNCONFIRMED → RESOLVED
Closed: 21 years ago
Resolution: --- → DUPLICATE
Product: MailNews → Core
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.