Open Bug 305764 Opened 20 years ago Updated 3 years ago

The junk filters should be sensitive to missing Message-IDs

Categories

(MailNews Core :: Filters, enhancement)

Tracking

(Not tracked)

UNCONFIRMED

People

(Reporter: usenet, Unassigned)

Details

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

A missing Message-ID in a mail appears to be a strong sign of the mail being spam. The mail filters should use this information as part of their Bayesian filters.

Reproducible: Always

Steps to Reproduce:
1. Get mail
2.
3.

Actual Results:
Lots of spam with missing Message-IDs is received, and the mail filter does not catch it.

Expected Results:
The filter should have spotted the lack of a Message-ID header, combined this with other evidence to confirm that the e-mails were likely to be spam, and marked them as junk.

After getting a load of duplicate e-mails because of another bug, I had to use a script to remove duplicates from the mboxes, using the Message-ID as a key. What I found whilst debugging this was that mails without Message-IDs were all, or almost all, spam.
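The reporter's script is not attached to this bug, so the following is only a minimal Python sketch of the kind of de-duplication described above: keep the first copy of each Message-ID and set aside messages that have none. The function name `split_by_message_id` and the sample messages are illustrative assumptions, not taken from the original script.

```python
from email import message_from_string


def split_by_message_id(messages):
    """Return (unique, missing).

    unique:  the first message seen for each distinct Message-ID.
    missing: every message that has no (or an empty) Message-ID header.
    """
    seen = set()
    unique, missing = [], []
    for msg in messages:
        # Header lookup is case-insensitive; get() returns None if absent.
        msg_id = (msg.get("Message-ID") or "").strip()
        if not msg_id:
            missing.append(msg)
        elif msg_id not in seen:
            seen.add(msg_id)
            unique.append(msg)
    return unique, missing


# Hypothetical sample data: one duplicated message and one without an ID.
raw = [
    "Message-ID: <a@example.com>\nSubject: hello\n\nbody\n",
    "Message-ID: <a@example.com>\nSubject: hello\n\nbody\n",
    "Subject: buy now\n\nbody\n",
]
unique, missing = split_by_message_id(message_from_string(r) for r in raw)
```

For a real mailbox, the same function could be fed from the standard library's `mailbox.mbox(path)`, which yields message objects when iterated. The `missing` list is the set the reporter found to be all, or almost all, spam.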
From looking at training.dat, I see it contains strings like "message-id:<d9c4a72405071603544664a05c@mail.gmail.com>", but never "message-id:" on its own, as far as I can tell. That would seem to indicate that only the presence of a particular message ID will affect the junk score (and the same ID is not really likely to turn up twice), while the mere presence of a message ID won't affect it at all. It looks like headers tokenize to "<headername>:<first-word>"; perhaps they should tokenize to "<headername>:" as well, so that the mere presence of a header could be reflected in the training data, too?
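To make the proposal concrete, here is a small Python sketch of that tokenization scheme. It is not the actual MailNews tokenizer; `header_tokens` and its `include_bare_name` parameter are hypothetical names, and the "<headername>:<first-word>" behaviour is assumed only from what training.dat appears to contain.

```python
from email import message_from_string


def header_tokens(msg, include_bare_name=True):
    """Tokenize headers as "<headername>:<first-word>".

    With include_bare_name=True, also emit the bare "<headername>:" token,
    so that the mere presence of a header shows up in the token stream.
    """
    tokens = []
    for name, value in msg.items():
        name = name.lower()
        words = str(value).split()
        if words:
            tokens.append(f"{name}:{words[0].lower()}")
        if include_bare_name:
            tokens.append(f"{name}:")
    return tokens


# Hypothetical sample message.
sample = message_from_string(
    "Message-ID: <x@example.com>\nSubject: Hi there\n\nbody\n"
)
current_style = header_tokens(sample, include_bare_name=False)
proposed_style = header_tokens(sample)
```

With the bare token in place, a message carrying any Message-ID contributes "message-id:" to training, which a per-ID token never can, since each ID is effectively unique.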
Assignee: mscott → nobody
Component: General → MailNews: Filters
Product: Thunderbird → Core
Version: unspecified → Trunk
OS: Linux → All
QA Contact: filters
Hardware: PC → All
(In reply to comment #0)
> A missing Message-ID in a mail appears to be a strong sign of the mail being
> spam. The mail filters should use this information as part of their Bayesian
> filters.

Bayesian filters only work on text that's actually there, by definition. Wayne's comment 2 is pertinent, but even if that bug is fixed, it won't make automatic junk detection any more reliable. The sort of testing you're looking for is appropriate for something like SpamAssassin; xref bug 235114. Recommend WONTFIX.
One approach to requests like this would be to add an interface to the Bayesian filter store so that arbitrary tokens could be added or deleted. Then people who wanted to play with different types of tokenization could use an extension to do that.
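As a rough illustration of what such an interface might look like, here is a Python sketch of a token store with add and delete operations. It does not reflect any existing MailNews API; the class `TokenStore`, its method names, and the example token "x-missing:message-id" are all hypothetical.

```python
from collections import Counter


class TokenStore:
    """Toy store of per-token counts for good and junk training messages."""

    def __init__(self):
        self.good = Counter()
        self.junk = Counter()

    def add_token(self, token, is_junk, count=1):
        """Record `count` sightings of `token` in junk or good mail."""
        target = self.junk if is_junk else self.good
        target[token] += count

    def remove_token(self, token):
        """Forget a token entirely; a no-op if it was never stored."""
        self.good.pop(token, None)
        self.junk.pop(token, None)

    def counts(self, token):
        """Return (good_count, junk_count); unknown tokens give (0, 0)."""
        return self.good[token], self.junk[token]


# An extension could inject a synthetic token for a missing header.
store = TokenStore()
store.add_token("x-missing:message-id", is_junk=True, count=3)
before = store.counts("x-missing:message-id")
store.remove_token("x-missing:message-id")
after = store.counts("x-missing:message-id")
```

A synthetic token of this kind would be one way around the objection that Bayesian filters only see text that is actually present: the extension turns the absence of a header into a token that is present.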
Kent, there may also be a junk bug filed for messages that are missing addresses. In any case, xref Bug 391717 – filter with from criteria doesn't work if message's From: address is null or missing
Product: Core → MailNews Core
Severity: normal → S3