Closed
Bug 209510
Opened 21 years ago
Closed 21 years ago
Auto-classified (non) junk mail should be fed back into the training system
Categories
(MailNews Core :: Filters, enhancement)
MailNews Core
Filters
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 197430
People
(Reporter: sparr0, Assigned: sspitzer)
References
Details
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4; MultiZilla v1.4.0.4A) Gecko/20030529

I will begin by saying that, to counteract the effect of unbalanced corpus sizes, it would be appropriate to scale the good/bad token counts by the proportion of good/bad emails when calculating probabilities. If the token "foo" has 50 good appearances and 70 bad appearances, then the spam probability would normally be 70/(50+70) = 58%. But if you adjust for the number of classified emails, and there have been 1000 good and 1500 bad emails, then you would adjust it to 70/(50*(1500/1000)+70) = 48%, which is what the probability would be if you got 500 more good emails that followed the pattern of your first 1000.

And, on to the topic of this bug/RFE: when the JMC classifies a mail as junk or non-junk, its contents should be added to the training of the JMC as junk or non-junk respectively. This does NOT produce a feedback effect; it is how Bayesian filtering was designed to work. In the words of the SpamBayes project (http://spambayes.sourceforge.net): "mistake-based training" - where you only train on messages that the system got wrong - results in the system taking a lot longer to get really good at what it does. You're better off feeding it the things it gets right, as well as the things it gets wrong.

You would want to make the three states of junk, not-junk, and unclassified VERY apparent to the user so that they know they need to classify their unclassified mail.

Reproducible: Always

Steps to Reproduce:
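The corpus-size adjustment described above can be sketched in a few lines. This is a minimal illustration of the reporter's arithmetic, not Mozilla's actual filter code; the function name is hypothetical.

```python
def adjusted_spam_probability(good_count, bad_count, total_good, total_bad):
    """Spam probability for a single token, compensating for an
    unbalanced corpus by scaling the good count up (or down) by the
    ratio of bad-to-good message totals."""
    scaled_good = good_count * (total_bad / total_good)
    return bad_count / (scaled_good + bad_count)

# The reporter's example: token "foo" seen in 50 good / 70 bad mails,
# drawn from a corpus of 1000 good / 1500 bad messages.
naive = 70 / (50 + 70)                                    # ~58%
adjusted = adjusted_spam_probability(50, 70, 1000, 1500)  # ~48%
```

Scaling the good count by 1500/1000 gives 70/(75+70) ≈ 48%, matching the "500 more good emails" thought experiment in the description.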
Comment 1•21 years ago
This really looks to me like two bugs. The first is a request that the words be weighted by the ratio of junk/non-junk. I'm pretty sure this is already done, but I could be wrong. The other is to change from the two-state system to a three-state system. I actually think this would be a very good idea, but again, I don't know for sure. Still, this really appears to be two different bugs, and they both should have the severity of enhancement.
Reporter
Comment 2•21 years ago
> This really looks to me like two bugs. The first is a request that the words be weighted by the ratio of junk/non-junk. I'm pretty sure this is already done, but I could be wrong. The other is to change from the two-state system to a three-state system. I actually think this would be a very good idea, but again, I don't know for sure. Still, this really appears to be two different bugs, and they both should have the severity of enhancement.

As to the point about weighting the values, I think you are right; I was mis-reading the code my first time through.

How many states we have right now depends on how you look at it. Since we aren't doing automatic training, we really have four states: manually tagged (and trained) junk, manually tagged non-junk, automatically tagged (and not trained) junk, and automatically tagged non-junk (plus the never-tagged non-junk you start with, which is effectively the same as the fourth). I want to do away with the tagged-but-not-trained states and replace them with a single state for email that has never been classified.

The fact that we aren't doing automatic training is a bug, in that the way our Bayesian filter is working is not how a Bayesian filter is designed to work. The change to a three-state system would just be a side effect of implementing the filter correctly.

Also, we would need to mark all emails as unclassified in the event that training.dat got deleted or corrupted. We wouldn't be able to cover the odd case of it being replaced, but that rarely happens.
Severity: normal → enhancement
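The proposed collapse from four effective states to three, with automatic classifications fed back into training, can be sketched roughly as follows. All names here are hypothetical illustrations, not Thunderbird's actual code, and the scoring stand-in is deliberately trivial.

```python
from enum import Enum

class JunkState(Enum):
    UNCLASSIFIED = 0   # never tagged by user or filter
    JUNK = 1
    NOT_JUNK = 2

class BayesianFilter:
    def __init__(self):
        # (tokens, state) pairs making up the training corpus
        self.corpus = []

    def train(self, tokens, state):
        self.corpus.append((tokens, state))

    def classify(self, tokens):
        # Stand-in scoring; a real filter would combine per-token
        # probabilities. Here: junk iff a known spammy token appears.
        state = JunkState.JUNK if "viagra" in tokens else JunkState.NOT_JUNK
        # The key point of this bug: the automatic classification is
        # fed straight back into training, so there is no
        # tagged-but-not-trained state.
        self.train(tokens, state)
        return state
```

A message stays `UNCLASSIFIED` until the user or the filter assigns a class, and either path trains the corpus, which is the behavior comment 2 argues a Bayesian filter is designed around.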
Reporter
Comment 3•21 years ago
One very relevant comment on this issue from the SpamBayes project (http://spambayes.sourceforge.net/background.html): "mistake-based training" - where you only train on messages that the system got wrong - results in the system taking a lot longer to get really good at what it does. You're better off feeding it the things it gets right, as well as the things it gets wrong.
Reporter
Comment 4•21 years ago
OK, now I feel stupid, repeating myself. One too many bugzilla posts before going to sleep. I apologize for that (and this) bug spam to anyone CCing.
Reporter
Comment 5•21 years ago
The quote I meant to paste is from SpamAssassin:

"Mistake-based training: This means training on a small number of mails, then only training on messages that SpamAssassin classifies incorrectly. This works, but it takes longer to get it right than a full training session would."
Comment 6•21 years ago
Is this a dup of bug 197430?
Reporter
Comment 7•21 years ago
Yes, it is.

*** This bug has been marked as a duplicate of 197430 ***
Status: UNCONFIRMED → RESOLVED
Closed: 21 years ago
Resolution: --- → DUPLICATE
Updated•20 years ago
Product: MailNews → Core
Updated•16 years ago
Product: Core → MailNews Core