Closed Bug 472005 Opened 15 years ago Closed 15 years ago

Bayes trait analysis sees keyword header, which biases the evaluation

Categories

(MailNews Core :: Filters, defect)

defect
Not set
minor

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0b2

People

(Reporter: rkent, Assigned: rkent)

Details

Attachments

(1 file, 2 obsolete files)

The Bayesian filter tokenizer currently sees headers that are set by mailnews itself, such as the keywords header. This causes some strange feedback effects, particularly when applying soft tags. Since the existence of the keyword itself is obviously a very strong indicator for soft tagging with that header, whether the header is present or not has a strong effect on the calculated trait match percent. As a result, after-the-fact Bayesian trait percentage calculations can differ dramatically from the values computed when the filter is applied during normal message receipt, which is very confusing.

It is actually worse than that. The most common time when you train with a keyword applied is when you are correcting a false positive. In that case, the presence of the keyword in the message is an anti-indicator of the need for the keyword. This causes an oscillation in values, as the keyword is removed if present, or placed if absent, on subsequent after-the-fact calculations.
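
To give a feel for how much a single very strong token can move the result, here is a minimal sketch. It assumes a simplified naive-Bayes combination of per-token probabilities, which is not necessarily the formula the mailnews Bayes filter uses, and the function name and the numbers are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Simplified naive-Bayes combination (an assumption for illustration,
// not the mailnews implementation): each entry is the probability that
// a message containing that token matches the trait.
double CombineTokenProbabilities(const std::vector<double>& probs) {
  double match = 1.0;    // product of p_i
  double noMatch = 1.0;  // product of (1 - p_i)
  for (double p : probs) {
    assert(p > 0.0 && p < 1.0);  // keep both products non-zero
    match *= p;
    noMatch *= (1.0 - p);
  }
  return match / (match + noMatch);
}
```

With three weak, hypothetical body tokens at 0.6, 0.4 and 0.5 this gives 0.5. Adding one keyword-header token at 0.99 pushes the same message to 0.99, so the presence or absence of that one header decides the outcome.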

I need to decide whether to hard-wire the tokenizer to ignore headers set by mailnews itself, or implement some sort of interface to allow extensions to manipulate the tokenizer.
Attachment #359306 - Flags: superreview?(bienvenu)
Attachment #359306 - Flags: review?(bugzilla)
Status: NEW → ASSIGNED
Target Milestone: --- → Thunderbird 3.0b2
Whiteboard: [waiting review bienvenu, standard8]
Comment on attachment 359306
Allow extensions to enable/disable specific headers

+    // arrays of extra headers to tokenize / to not tokenize
+    nsCStringArray mEnabledHeaders;
+    nsCStringArray mDisabledHeaders;

This is the wrong thing to use - mozilla-central has been removing uses of ns(C)StringArray (see bug 466622), so you should now be using nsTArray<nsCString> instead.
Attachment #359306 - Flags: review?(bugzilla) → review-
Attached patch Use nsTArray<CString> (obsolete)
Sorry, I knew better. Changed to nsTArray.
Attachment #359306 - Attachment is obsolete: true
Attachment #359779 - Flags: superreview?(bienvenu)
Attachment #359779 - Flags: review?(bugzilla)
Attachment #359306 - Flags: superreview?(bienvenu)
Now that I have bug 451405 implemented in my tree, so that I can actually see what the Bayes filter is doing, I can see that this is also an issue with the x-mozilla-status:0000 tag. In fact, for a recent uncertain spam that I had, the x-mozilla tag was the second strongest indicator, and it was pointing in the wrong direction. That means that in the past there was a significant difference in the read/unread status of the emails that I trained on, which does not make any sense as an indicator of spamminess. At least for one email, this made the difference in whether it was flagged as spam or not.

So although I would like to add the ability to customize which headers can be tokenized, as in the current patch, I think that I should also disable the tokenization of the x-mozilla headers by default.

I'm going to cancel my review request so that I can add that. I might also add an option about whether the header token will be accepted as a unit, or broken into pieces.
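
As a rough sketch of the two options discussed here (a list of headers that are not tokenized, and a choice between one token per header or one token per word), something like the following could work. It uses standard C++ containers rather than nsTArray<nsCString> so that it stands alone, and HeaderPolicy and TokensForHeader are hypothetical names, not taken from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical policy object, for illustration only.
struct HeaderPolicy {
  std::vector<std::string> disabledHeaders;  // lower-case names never tokenized
  bool splitValue = false;  // false: header is one token; true: one token per word
};

// Returns the tokens produced for one header under the given policy.
std::vector<std::string> TokensForHeader(const HeaderPolicy& policy,
                                         const std::string& lowerName,
                                         const std::string& value) {
  assert(!lowerName.empty());
  std::vector<std::string> tokens;
  const auto& off = policy.disabledHeaders;
  if (std::find(off.begin(), off.end(), lowerName) != off.end())
    return tokens;  // disabled header: contributes nothing
  if (!policy.splitValue) {
    tokens.push_back(lowerName + ":" + value);  // header accepted as a unit
    return tokens;
  }
  std::istringstream words(value);
  std::string word;
  while (words >> word)  // header broken into whitespace-separated pieces
    tokens.push_back(lowerName + ":" + word);
  return tokens;
}
```

The sketch covers only the disabled list. How an enabled list would interact with the default set of tokenized headers is not settled in this comment, so it is left out.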
Attachment #359779 - Flags: superreview?(bienvenu)
Attachment #359779 - Flags: review?(bugzilla)
I am now convinced that adding x-mozilla tokens to the spam filter is bad for everyone, not just for my soft tags work. So I'll request this simple patch to ignore them.

The argument for this is that the user should not have to be aware of the local status of an email in selecting emails for training. Without this patch, if a user trains on "unread" junk mail, but "read" good mail, then "unread" becomes a strong false indicator of spamminess of the email. It's a false indicator, because during normal processing of a junk message, the email is always unread.
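
The idea of the simple patch can be sketched roughly as below. This is not the checked-in code: it is a standalone illustration in standard C++, and the helper names and the name:value token format are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

using Header = std::pair<std::string, std::string>;  // (name, value)

// Hypothetical helper: lower-case an ASCII header name.
static std::string ToLowerAscii(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// True for headers that record local client state (x-mozilla-status,
// x-mozilla-keys, ...), matched case-insensitively by prefix.
static bool IsLocalStateHeader(const std::string& name) {
  static const std::string kPrefix = "x-mozilla";
  return ToLowerAscii(name).compare(0, kPrefix.size(), kPrefix) == 0;
}

// Builds one "name:value" token per header, skipping local-state headers
// so that read/unread status and tags never reach the Bayes filter.
std::vector<std::string> TokenizeHeaders(const std::vector<Header>& headers) {
  std::vector<std::string> tokens;
  for (const Header& h : headers) {
    assert(!h.first.empty());
    if (IsLocalStateHeader(h.first)) continue;
    tokens.push_back(ToLowerAscii(h.first) + ":" + h.second);
  }
  return tokens;
}
```

With this filter in place, a message produces the same header tokens whether it is read or unread, tagged or untagged, so training and after-the-fact scoring see the same input as normal message receipt.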

Unfortunately I cannot test this using a spam corpus, as it is specific to mozilla email.
Attachment #359779 - Attachment is obsolete: true
Attachment #360339 - Flags: superreview?(bugzilla)
Attachment #360339 - Flags: review?(bugzilla)
Whiteboard: [waiting review bienvenu, standard8] → [waiting review standard8]
Comment on attachment 360339
[checked in] Don't add tokens for x-mozilla headers

This seems sensible; let's put it in for beta 2 and see how much it improves things for folks.

I've checked this in already to save the round of adding checkin-needed.

http://hg.mozilla.org/comm-central/rev/e7c3f8cf9566
Attachment #360339 - Attachment description: Don't add tokens for x-mozilla headers → [checked in] Don't add tokens for x-mozilla headers
Attachment #360339 - Flags: superreview?(bugzilla)
Attachment #360339 - Flags: superreview+
Attachment #360339 - Flags: review?(bugzilla)
Attachment #360339 - Flags: review+
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Whiteboard: [waiting review standard8]