Closed Bug 472005 Opened 17 years ago Closed 17 years ago

Bayes trait analysis sees keyword header, which biases the evaluation

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

Thunderbird 3.0b2

People

(Reporter: rkent, Assigned: rkent)

Details

Attachments

(1 file, 2 obsolete files)

Allow extensions to enable/disable specific headers (obsolete), 17 years ago, Kent James (:rkent), 4.44 KB, patch, standard8: review-
Use nsTArray<CString> (obsolete), 17 years ago, Kent James (:rkent), 4.45 KB, patch
[checked in] Don't add tokens for x-mozilla headers, 17 years ago, Kent James (:rkent), 855 bytes, patch, standard8: review+, standard8: superreview+

Kent James (:rkent)

Assignee

Description

•

17 years ago

The Bayesian filter tokenizer currently sees headers that are set by mailnews itself, such as the keywords header. This causes some strange feedback effects, particularly when applying soft tags. Since the existence of the keyword itself is obviously a very strong indicator for soft tagging with that keyword, whether the header is present or not has a strong effect on the calculated trait match percent. As a result, after-the-fact Bayesian trait percentage calculations can differ dramatically from those made when the filter is applied during normal message receipt, which is very confusing.

It is actually worse than that. The most common time when you train with a keyword applied is when you are correcting a false positive. In that case, the presence of the keyword in the message is an anti-indicator of the need for the keyword. This causes an oscillation in values, as the keyword is removed if present, or placed if absent, on subsequent after-the-fact calculations.

I need to decide whether to hard-wire the tokenizer to ignore headers set by mailnews itself, or implement some sort of interface to allow extensions to manipulate the tokenizer.
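To make the feedback effect concrete, here is a minimal sketch of how one header token can dominate a trait score. It is not the actual mailnews Bayesian filter math: the function names, the simple frequency-ratio token probability, and the naive Bayes combination are all assumptions chosen for illustration, as are the training counts in the note below.

```cpp
#include <vector>

// Hypothetical helper (not a mailnews API): how strongly a token indicates
// the trait, based on how often it appeared in trait vs. anti-trait
// training messages. Assumes both totals are non-zero and the token was
// seen at least once.
double TokenProbability(double traitCount, double traitTotal,
                        double antiCount, double antiTotal) {
  const double traitFreq = traitCount / traitTotal;
  const double antiFreq = antiCount / antiTotal;
  return traitFreq / (traitFreq + antiFreq);
}

// Hypothetical naive Bayes combination of per-token probabilities.
// Assumes at least one token and no probability of exactly 0 or 1.
double CombinedScore(const std::vector<double>& tokenProbs) {
  double forTrait = 1.0;
  double againstTrait = 1.0;
  for (double p : tokenProbs) {
    forTrait *= p;
    againstTrait *= 1.0 - p;
  }
  return forTrait / (forTrait + againstTrait);
}
```

Suppose a keyword header token appeared in 40 of 50 trait training messages and in 1 of 50 anti-trait messages. `TokenProbability(40, 50, 1, 50)` is then about 0.976. A message whose other two tokens are neutral (0.5 each) scores 0.5 when the header is absent, as at message receipt, and about 0.976 when the same message is rescored after the keyword has been applied.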

Kent James (:rkent)

Assignee

Comment 1

•

17 years ago

Attached patch Allow extensions to enable/disable specific headers (obsolete) — Details — Splinter Review

Attachment #359306 - Flags: superreview?(bienvenu)

Attachment #359306 - Flags: review?(bugzilla)

Kent James (:rkent)

Assignee

Updated

•

17 years ago

Status: NEW → ASSIGNED

Target Milestone: --- → Thunderbird 3.0b2

Kent James (:rkent)

Assignee

Updated

•

17 years ago

Whiteboard: [waiting review bienvenu, standard8]

Mark Banner (:standard8)

Comment 2

•

17 years ago

Comment on attachment 359306
Allow extensions to enable/disable specific headers

+ // arrays of extra headers to tokenize / to not tokenize
+ nsCStringArray mEnabledHeaders;
+ nsCStringArray mDisabledHeaders;

This is the wrong thing to use: mozilla-central has been removing uses of ns(C)StringArray (see bug 466622), so you should now be using nsTArray<nsCString> instead.

Attachment #359306 - Flags: review?(bugzilla) → review-

Kent James (:rkent)

Assignee

Comment 3

•

17 years ago

Attached patch Use nsTArray<CString> (obsolete) — Details — Splinter Review

Sorry, I knew better. Changed to nsTArray.

Attachment #359306 - Attachment is obsolete: true

Attachment #359779 - Flags: superreview?(bienvenu)

Attachment #359779 - Flags: review?(bugzilla)

Attachment #359306 - Flags: superreview?(bienvenu)

Kent James (:rkent)

Assignee

Comment 4

•

17 years ago

Now that I have bug 451405 implemented in my tree, so that I can actually see what the Bayes filter is doing, I can see that this is also an issue with the x-mozilla-status:0000 token. In fact, for a recent uncertain spam that I had, the x-mozilla token was the second strongest indicator, and it was pointing in the wrong direction. That means that in the past there was a significant difference in the read/unread status of the emails that I trained on, which does not make any sense as an indicator of spamminess. For at least one email, this made the difference in whether it was flagged as spam or not.

So although I would like to add the ability to customize which headers can be tokenized, as in the current patch, I think that I should also disable tokenization of the x-mozilla headers by default. I'm going to cancel my review request so that I can add that. I might also add an option for whether a header token is accepted as a unit or broken into pieces.
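The "unit or pieces" option could look roughly like the sketch below. The enum, the function name, the whitespace splitting rule, and the `name:value` token format are all assumptions made for illustration. No such option appears in the patches on this bug.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical option: keep a header value as one token, or split it.
enum class HeaderTokenMode { kWhole, kPieces };

// Hypothetical helper (not a mailnews API): turns one header into tokens
// of the form "name:value" or "name:piece", depending on the mode.
std::vector<std::string> TokenizeHeaderValue(const std::string& name,
                                             const std::string& value,
                                             HeaderTokenMode mode) {
  std::vector<std::string> tokens;
  if (mode == HeaderTokenMode::kWhole) {
    // The whole header value becomes a single token.
    tokens.push_back(name + ":" + value);
    return tokens;
  }
  // Split the value on whitespace; each piece becomes its own token.
  std::istringstream words(value);
  std::string word;
  while (words >> word) {
    tokens.push_back(name + ":" + word);
  }
  return tokens;
}
```

With `kWhole`, a header such as `x-mozilla-status` with value `0000` yields the single token `x-mozilla-status:0000`. With `kPieces`, a multi-word value yields one token per word.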

Kent James (:rkent)

Assignee

Updated

•

17 years ago

Attachment #359779 - Flags: superreview?(bienvenu)

Attachment #359779 - Flags: review?(bugzilla)

Kent James (:rkent)

Assignee

Comment 5

•

17 years ago

Attached patch [checked in] Don't add tokens for x-mozilla headers — Details — Splinter Review

I am now convinced that adding x-mozilla tokens to the spam filter is bad for everyone, not just for my soft tags work. So I'll request this simple patch to ignore them. The argument for this is that the user should not have to be aware of the local status of an email in selecting emails for training. Without this patch, if a user trains on "unread" junk mail, but "read" good mail, then "unread" becomes a strong false indicator of spamminess of the email. It's a false indicator, because during normal processing of a junk message, the email is always unread. Unfortunately I cannot test this using a spam corpus, as it is specific to mozilla email.
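The idea of the simple patch can be sketched as follows. This is not the checked-in code, which is not shown in this bug. The function names, the use of standard-library strings instead of Mozilla string classes, and the `name:value` token format are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Lowercase a copy of the string so header names compare case-insensitively.
std::string ToLower(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Hypothetical predicate: false for headers that the mail client sets
// itself, identified here by an "x-mozilla" name prefix.
bool ShouldTokenizeHeader(const std::string& headerName) {
  static const std::string kPrefix = "x-mozilla";
  const std::string lower = ToLower(headerName);
  return lower.compare(0, kPrefix.size(), kPrefix) != 0;
}

// Hypothetical header tokenizer: emits one "name:value" token per header,
// skipping any header rejected by ShouldTokenizeHeader.
std::vector<std::string> TokenizeHeaders(
    const std::vector<std::pair<std::string, std::string>>& headers) {
  std::vector<std::string> tokens;
  for (const auto& header : headers) {
    if (!ShouldTokenizeHeader(header.first)) {
      continue;  // local status must not influence training or scoring
    }
    tokens.push_back(ToLower(header.first) + ":" + header.second);
  }
  return tokens;
}
```

With this filter in place, a message carrying `Subject`, `X-Mozilla-Status`, and `X-Mozilla-Keys` headers contributes only the subject token, so the read/unread state and applied keywords no longer reach the classifier.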

Attachment #359779 - Attachment is obsolete: true

Attachment #360339 - Flags: superreview?(bugzilla)

Attachment #360339 - Flags: review?(bugzilla)

Kent James (:rkent)

Assignee

Updated

•

17 years ago

Whiteboard: [waiting review bienvenu, standard8] → [waiting review standard8]

Mark Banner (:standard8)

Comment 6

•

17 years ago

Comment on attachment 360339
[checked in] Don't add tokens for x-mozilla headers

This seems sensible; let's put it in for beta 2 and see how much it improves things for folks. I've already checked this in to save a round of adding checkin-needed.

http://hg.mozilla.org/comm-central/rev/e7c3f8cf9566

Attachment #360339 - Attachment description: Don't add tokens for x-mozilla headers → [checked in] Don't add tokens for x-mozilla headers

Attachment #360339 - Flags: superreview?(bugzilla)

Attachment #360339 - Flags: superreview+

Attachment #360339 - Flags: review?(bugzilla)

Attachment #360339 - Flags: review+

Mark Banner (:standard8)

Updated

•

17 years ago

Status: ASSIGNED → RESOLVED

Closed: 17 years ago

Resolution: --- → FIXED

Whiteboard: [waiting review standard8]


Categories

(MailNews Core :: Filters, defect)
