Closed Bug 237095 Opened 20 years ago Closed 20 years ago

Wrong token counts when re-training against msgs already marked as junk / not junk

Categories

(MailNews Core :: Filters, defect)

x86
Windows XP
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.8alpha1

People

(Reporter: mscott, Assigned: mscott)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

While doing some tests on the new junk mail controls, I discovered a problem
with our token accounting which comes up when a user blows away training.dat and
re-trains against good/bad messages that have already been classified. 

Steps:
1) Remove your training.dat and decide you want to retrain. This will become
very common with the new junk controls, which perform best if you retrain.

2) Try to retrain your filters against ham and spam that have already been
classified.

We end up calling nsBayesianFilter::observeMessage
(http://lxr.mozilla.org/seamonkey/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#787)
which gets the current classification and the new classification provided by the
user when they choose Mark as Junk / Mark as Not Junk for re-training.

This method examines the current classification and removes the tokens for this
message from the appropriate training set. We then add the tokens for the
message to the appropriate training set based on the new classification. 

Consider the case where the old classification == the new classification, in
conjunction with a user who has just removed his training.dat file for
retraining. This user selects a bunch of messages already marked as junk and
marks them as junk again for re-training purposes. The first message has its
tokens added to the junk training set. Let's say it contained a word like
'Viagra'. The next junk message first removes itself from the training set.
Let's say this 2nd message also contains the word 'Viagra'. We have one token
for 'Viagra' with a count of one from our first message. By removing the tokens
for the 2nd message we end up removing the 'Viagra' token. Then the 2nd
message's tokens get added back in, leaving us with a 'Viagra' token with a
total message occurrence count of 1. But it should be 2!

Extrapolate this scenario out over the course of retraining against already
classified messages and we'll end up with counts of 1 for any tokens shared
amongst the messages! Oh no!
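The undercount above can be illustrated with a minimal sketch. This is plain Python with illustrative names and messages, not the actual C++ in nsBayesianFilter.cpp:

```python
# Hypothetical sketch of the remove-then-add accounting described above;
# the helper names and sample messages are illustrative only.

def remove_tokens(training_set, tokens):
    """Decrement each token's count, dropping entries that reach zero."""
    for t in tokens:
        if training_set.get(t, 0) > 0:
            training_set[t] -= 1
            if training_set[t] == 0:
                del training_set[t]

def add_tokens(training_set, tokens):
    """Increment each token's count."""
    for t in tokens:
        training_set[t] = training_set.get(t, 0) + 1

junk = {}  # freshly deleted training.dat: the junk set starts empty
msg1 = ["viagra", "cheap"]
msg2 = ["viagra", "pills"]

# Message 1 is already classified as junk and gets re-marked as junk:
# its tokens are removed (a no-op on the empty set) and then added back.
remove_tokens(junk, msg1)
add_tokens(junk, msg1)   # 'viagra' count is now 1

# Message 2, also already junk, is re-marked: the removal step wrongly
# decrements the 'viagra' count contributed by message 1.
remove_tokens(junk, msg2)
add_tokens(junk, msg2)   # 'viagra' count is 1 again, but should be 2
```

Running this leaves every shared token with a count of 1, no matter how many retrained messages contained it.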
Status: NEW → ASSIGNED
I think this solution addresses the problem. Unfortunately we can't distinguish
between the case where the user is re-training vs. just trying to classify an
already classified message with the same classification, 'thinking' he is
improving the filter. The latter case is very bad because it leads to data skew
in the tokenizer. That's why it was important that we removed the tokens and
added them back.

To the good with this patch: I think users that delete training.dat and
re-train are going to get much better results because the token counts will be
right. 

To the bad: this now allows users to classify the same message over and over
again, leading to data skew with the token counts.
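As I understand the patch from this comment, the removal step is skipped when the old and new classifications match. A minimal sketch of that behavior (illustrative Python, not the actual patch) shows both the fix and the skew it permits:

```python
# Hypothetical sketch: names are illustrative, not the real implementation.

def remove_tokens(training_set, tokens):
    """Decrement each token's count, dropping entries that reach zero."""
    for t in tokens:
        if training_set.get(t, 0) > 0:
            training_set[t] -= 1
            if training_set[t] == 0:
                del training_set[t]

def add_tokens(training_set, tokens):
    """Increment each token's count."""
    for t in tokens:
        training_set[t] = training_set.get(t, 0) + 1

def observe_message(junk, tokens, old_class, new_class):
    # Patched behavior (sketch): only back out the old tokens when the
    # classification actually changes.
    if old_class == "junk" and new_class != "junk":
        remove_tokens(junk, tokens)
    if new_class == "junk":
        add_tokens(junk, tokens)

junk = {}
# Retraining two already-junk messages now yields the right count...
observe_message(junk, ["viagra", "cheap"], "junk", "junk")
observe_message(junk, ["viagra", "pills"], "junk", "junk")
assert junk["viagra"] == 2
# ...but re-marking the same message again keeps inflating its counts.
observe_message(junk, ["viagra", "cheap"], "junk", "junk")
assert junk["viagra"] == 3  # data skew from repeated classification
```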
If it were made a preference, then a savvy user (or an extension or script)
could turn on the proposed patch when retraining after blowing away
training.dat, but leave the old behavior in place for normal operation. 
(In reply to comment #2)
> If it were made a preference, then a savvy user (or an extension or script)
> could turn on the proposed patch when retraining after blowing away
> training.dat, but leave the old behavior in place for normal operation. 

Or removal of training.dat might be automatically detected by code (based on a
timestamp or something) placed into it on creation. Not sure if this solves the
problem, though...
> ... just trying to classify a classified message with the same
> classification, 'thinking' he is improving the filter.

So, should I understand this as meaning one should never re-mark as ham what
the junk mail filter already thinks is ham? In other places (MozZine Forums)
I've heard this is exactly what one SHOULD do.
I have also encountered this bug, and I think a *LOT* of junk filter
experimenters have been affected by it without their knowledge.
One possible solution is to store, for each email, the date it was last marked,
and compare that to the creation date of the current training.dat when the
message status is changed. When you change from good to bad, if the last-marked
date is before the creation of training.dat, then you do NOT decrement the good
token counts before incrementing the bad token counts (and likewise for the
other transitions). You would probably want to put the creation date IN
training.dat, recorded by mailnews when the file is created, rather than
relying on the OS to handle it.
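The date comparison proposed above might look something like this sketch (hypothetical names; it assumes the creation timestamp is stored in training.dat as suggested):

```python
def should_back_out_old_tokens(last_marked_time, training_dat_created):
    # Only decrement the old classification's counts if the message was
    # marked while the current training.dat existed; otherwise its tokens
    # were never counted in this file and removal would undercount.
    return last_marked_time >= training_dat_created

# Message marked before the current training.dat was created: skip removal.
assert not should_back_out_old_tokens(100, training_dat_created=200)
# Message marked while this training.dat was live: remove as today.
assert should_back_out_old_tokens(300, training_dat_created=200)
```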
Let's take my change when the tree opens for 1.8a.
Target Milestone: --- → mozilla1.8alpha
Attachment #143576 - Flags: superreview?(bienvenu)
Attachment #143576 - Flags: superreview?(bienvenu) → superreview+
This fix has finally been checked in for SeaMonkey. It was already fixed in
Thunderbird 0.6.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
It seems to me that it'd be much more common for people to toggle the marker
either mistakenly (see a message's address/title/whatever, say "That's not
spam!", then read it and find the classifier was right) or out of confusion
about what does or does not improve the filter, than for people to be
retraining with previously marked mail.
Product: MailNews → Core
Product: Core → MailNews Core