Open Bug 215941 Opened 21 years ago Updated 2 years ago

spam/junk filter: add a "headers-only" mode for bayes filtering

Categories

(MailNews Core :: Filters, enhancement)

Tracking

(Not tracked)

People

(Reporter: lisken, Unassigned)

Attachments

(1 file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624

I've posted this on netscape.public.mozilla.mail-news and
was asked to file it as a bug.

In all articles/discussions on Bayesian spam filters, the
remark inevitably comes up that perhaps the main disadvantage
of this technique is that emails have to be downloaded before
they can be analyzed. As an IMAP user, I always come up with
an obvious question: why not try out a different mode for the
filter which only analyzes email headers? Running my "human
spam filter" on my IMAP mailbox, I have a near-perfect
recognition rate based on subject and sender alone. This might
be more difficult if I was working in an English environment:
Most but not all "real" emails I receive are German, and most
spam is English or Asian. Also of course I use my "inner
address book". But still I can't believe a Bayesian filter
working on headers only would be useless. Two ways of using it
spring to mind:

* Enable users to switch completely between headers-only and
full-message analysis

* Try a "careful" headers-only analysis first (careful in the
sense that false "spam" judgements should be minimized), and
then download and fully analyse only those emails not yet
classified as spam.
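
The second option could look roughly like the sketch below. It is not
Mozilla's junk filter: the token probabilities, the thresholds and the
function names are all made up for illustration, and `fetch_body` stands in
for whatever would actually download the message from the IMAP server.

```python
import math
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical per-token spam probabilities, as a trained filter might store them.
TOKEN_SPAM_PROB: Dict[str, float] = {
    "viagra": 0.99, "free": 0.90, "meeting": 0.05, "rechnung": 0.02,
}
UNKNOWN_PROB = 0.5  # tokens never seen in training carry no evidence


def spam_probability(tokens: Iterable[str]) -> float:
    """Combine per-token probabilities naive-Bayes style, in log space."""
    log_spam = log_ham = 0.0
    for tok in tokens:
        p = TOKEN_SPAM_PROB.get(tok.lower(), UNKNOWN_PROB)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


def classify(
    header_tokens: List[str],
    fetch_body: Callable[[], List[str]],
    header_threshold: float = 0.99,  # assumed value, deliberately strict
    full_threshold: float = 0.90,    # assumed value
) -> Tuple[bool, bool]:
    """Return (is_spam, body_was_downloaded)."""
    # Stage 1: headers only. The strict threshold keeps false "spam"
    # judgements rare.
    if spam_probability(header_tokens) >= header_threshold:
        return True, False
    # Stage 2: download the body only for messages not yet classified as spam.
    body_tokens = fetch_body()
    score = spam_probability(list(header_tokens) + list(body_tokens))
    return score >= full_threshold, True
```

With these made-up numbers, a message whose headers contain both "free" and
"viagra" is rejected without a download, while one with only "free" in the
headers is inconclusive and triggers the body fetch.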

If this works it should be another good selling point for Mozilla.


Reproducible: Always

Steps to Reproduce:
1. Configure to use the Junk filter.
2. Configure to use an IMAP account.
3. Open that account's Inbox.

Actual Results:  
Junk filter downloads all mail, then analyzes emails.

Expected Results:  
With this new feature, you could:

a) configure the filter to analyze headers only, so that mails won't be
   downloaded for analysis

b) or, you could configure to download and analyze only those
   emails not classified as Junk so far
My English is of the British variety. At least it should be. ;-)

s/analyze/analise/g
Oh dear, it's getting worse!

s/anal[yi][zs]e/analyse/g
This was a reply to my newsgroup post, from Stanimir Stamenkov
<stanio@domain.invalid>:

I posted a similar proposition in this group a while ago.

I've even thought of some kind of customizable targets where you specify exactly
which headers/parts of the message (where the body is one whole part) should be
included in the Bayesian filtering. So I could make a training database/target
which checks only the subject and the sender ('Subject' and 'From' headers) for
example. Filtering on the contents of the 'Received' headers should give pretty
good results, IMHO.
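
A rough sketch of what such a "target" could mean in practice: a list of
header names (plus an optional body flag) that decides which parts of a
message are turned into tokens. The `Target` class and the `header:word`
token format are invented here for illustration and are not Mozilla's
tokenizer.

```python
import re
from dataclasses import dataclass
from email import message_from_string
from typing import List, Tuple


@dataclass(frozen=True)
class Target:
    """Hypothetical filter target: which message parts feed the tokenizer."""
    headers: Tuple[str, ...]
    include_body: bool = False


def tokenize(raw_message: str, target: Target) -> List[str]:
    msg = message_from_string(raw_message)
    tokens: List[str] = []
    for name in target.headers:
        # get_all() covers headers that occur several times, e.g. Received.
        for value in msg.get_all(name, []):
            for word in re.findall(r"[\w.@-]+", value.lower()):
                # Prefix with the header name so that the same word counts
                # separately in From and in Subject.
                tokens.append(f"{name.lower()}:{word}")
    if target.include_body and not msg.is_multipart():
        tokens += re.findall(r"\w+", msg.get_payload().lower())
    return tokens


SUBJECT_AND_SENDER = Target(headers=("Subject", "From"))

raw = (
    "From: Alice <alice@example.org>\n"
    "Subject: Lunch tomorrow\n"
    "\n"
    "See you at noon.\n"
)
print(tokenize(raw, SUBJECT_AND_SENDER))
```

A separate training database per target would then simply be keyed on which
`Target` produced the tokens.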
To keep this feature simple, it should be automatic.

For pop3 -> analyze whole message

For imap -> analyze header


For pop3, there is a definite advantage to checking the entire message (more
spam symptoms are likely in the body than in the headers).

But for IMAP, that has to be weighed against the obvious download issue.

IMHO confirm this as a feature.  But don't make this a UI pref.  On/Off is all
that's needed.  On for spam checking as per mail type.  Off for no spam check.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Suggested UI (dropdown list) in Junk Mail Controls (only for IMAP servers):

Analyze for Junk e-mail by:
    _____________________________________________________________
   [ looking at headers, then bodies (if header is inconclusive) ]
   |-------------------------------------------------------------|
   | looking at headers only (don't download bodies for analysis)|
   | looking at headers, then bodies (if header is inconclusive) | <-- default?
   | always analyzing both headers & bodies                      |
   +-------------------------------------------------------------+
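
For illustration only, the three choices in this list, together with the
"automatic per mail type" idea from comment 4, could be modelled as below.
The enum, its values and both helper functions are hypothetical names, not
existing Mozilla preferences.

```python
from enum import Enum


class JunkAnalysisMode(Enum):
    HEADERS_ONLY = "headers-only"            # never download bodies for analysis
    HEADERS_THEN_BODY = "headers-then-body"  # body only if headers are inconclusive
    HEADERS_AND_BODY = "headers-and-body"    # always analyze both


def default_mode(server_type: str) -> JunkAnalysisMode:
    """Pick a default per account type (the IMAP default is only a suggestion)."""
    if server_type == "imap":
        return JunkAnalysisMode.HEADERS_THEN_BODY
    # Comment 4 suggests analyzing the whole message for POP3.
    return JunkAnalysisMode.HEADERS_AND_BODY


def needs_body(mode: JunkAnalysisMode, header_verdict_conclusive: bool) -> bool:
    """Decide whether the message body has to be fetched for junk analysis."""
    if mode is JunkAnalysisMode.HEADERS_ONLY:
        return False
    if mode is JunkAnalysisMode.HEADERS_THEN_BODY:
        return not header_verdict_conclusive
    return True
```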
To Comment 4: The purpose of the customizable targets was to implement a generic
classification mechanism, not only junk filtering. Here, I've found the original
thread:

http://groups.google.com/groups?threadm=b4gloh%24fge3%40ripley.netscape.com

which may be helpful. If it is implemented that way (maintaining filter targets),
it doesn't matter that in the beginning there would be no UI, only some predefined
settings/behavior; later, when a suitable UI is proposed, it would be an easily
added feature.
Product: MailNews → Core
sorry for the spam.  making bugzilla reflect reality as I'm not working on these bugs.  filter on FOOBARCHEESE to remove these in bulk.
Assignee: sspitzer → nobody
don't know that he had authority to, but bug 219715 (same issue) was wontfixed.
QA Contact: laurel → filters
From my reading of bug 219715, it was marked WONTFIX because the request seemed to imply that TB could analyze messages for junk without downloading the body. An alternate reading, though, might have been that the user intended to manually mark messages as junk, in which case bug 219715 is at least possible. There is no hint of the use of headers only, which is the main point of this current bug.

Concerning headers, I think this is generally a good idea. One reason is that I believe it is becoming quite common for email servers to have some sort of anti-spam processing, which then leaves status information in the headers. SpamAssassin, for example, leaves not only its estimated status but also a list of rules that were fired by their filter. These rules should become header tokens in the TB Bayes filter (though they don't at the moment due to the way tokens are formed. I'd like to change this.) The local TB Bayes filter can optimize the weighting of SpamAssassin rules for the individual user. That, combined with other header information such as addresses, might be sufficient data for many users.
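
As a sketch of what "rules become header tokens" could mean: the snippet below pulls the rule names out of the `tests=` list of an `X-Spam-Status` header, assuming the usual SpamAssassin layout, and emits one token per rule. The function name and the `sa-rule:` prefix are made up here, and this is not how the TB tokenizer currently works.

```python
import re
from email import message_from_string
from typing import List


def spamassassin_rule_tokens(raw_message: str) -> List[str]:
    """Turn the rules listed in X-Spam-Status into one token per rule."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    # The rule list may be folded over several lines after a comma,
    # so allow whitespace (including line breaks) between rule names.
    match = re.search(r"tests=([A-Za-z0-9_]+(?:,\s*[A-Za-z0-9_]+)*)", status)
    if not match:
        return []
    rules = re.split(r",\s*", match.group(1))
    # "tests=none" means no rule fired.
    return [f"sa-rule:{rule}" for rule in rules if rule.lower() != "none"]


raw = (
    "X-Spam-Status: No, score=2.1 required=5.0 tests=HTML_MESSAGE,\n"
    "\tMIME_HTML_ONLY,SUBJ_ALL_CAPS autolearn=no\n"
    "Subject: HELLO\n"
    "\n"
    "body\n"
)
print(spamassassin_rule_tokens(raw))
```

Fed into the local Bayes filter, each such token would get its own per-user weight, which is the "optimize the weighting for the individual user" part.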
(In reply to comment #10)
> From my reading of bug 219715, it was marked WONTFIX because the request seemed
> to imply that TB could analyze messages for junk without downloading the body.

yes

> An alternate reading might have been though that the user intended to manually
> mark messages as junk, in which case bug 219715 is at least possible. 

or because reporter states "... view in junk mail folder ..." he may have meant don't download body of messages which were put in a server side's "junk" folder - which is beyond the scope of TB junk mail processing.

> There is no hint of the use of headers only, which is the main point of this current bug.

you may be correct - but that is in effect what the reporter requested.  In any event, the request is ambiguous as written and for that reason alone the bug probably should not have been WONTFIXED without some exploration with the user.


> Concerning headers, I think this is generally a good idea. One reason is that I
> believe it is becoming quite common for email servers to have some sort of
> anti-spam processing, which then leaves status information in the headers.
> SpamAssassin, for example, leaves not only its estimated status but also a list
> of rules that were fired by their filter. These rules should become header
> tokens in the TB Bayes filter (though they don't at the moment due to the way
> tokens are formed. I'd like to change this.) The local TB Bayes filter can
> optimize the weighting of SpamAssassin rules for the individual user. That,
> combined with other header information such as addresses, might be sufficient
> data for many users.

You have a point. A "headers-only" option may be the easiest way to reduce the overhead of bayes spam filtering.  But its effectiveness for the average user will no doubt be a sticking point.

As for spamassassin and the like ... you add complexity if you start asking what fields and for what packages will bayes parse on.  Besides, if one is able to properly configure SA then one should not need to combine it with bayes, and why invoke the overhead of bayes processing if some other filter has deemed the message to be discardable?  The only big reasons I can see are a) server side SA doesn't interact with one's TB address book and b) some ISPs might not give you access to customize SA to your account.  Unfortunately, both of those are primarily server side issues. 

If TB "trust" spam headers interface were more publicized and usable in more circumstances - i.e. the server side gave enough information in an appropriate form for thunderbird to use, and thunderbird UI is flexible enough to configure it - AND if it could be combined with client side AB whitelist then we might have 99% of what's needed in terms of making server-side filtering more useful.

There is no doubt, however, that improvements can be made - hopefully some small tweaks can be done in TB 3 to get a big payback.  And there are plenty of suggestions: https://bugzilla.mozilla.org/buglist.cgi?query_format=advanced&short_desc_type=anywordssubstr&short_desc=junk+spam+bayes+filter&product=Core&product=Firefox&product=NSPR&product=Thunderbird&product=Toolkit&long_desc_type=allwordssubstr&long_desc=bayes&bug_file_loc_type=allwordssubstr&bug_file_loc=&status_whiteboard_type=allwordssubstr&status_whiteboard=&keywords_type=allwords&keywords=&resolution=WONTFIX&resolution=---&bug_severity=enhancement&emailreporter1=1&emailtype1=substring&email1=&emailassigned_to2=1&emailreporter2=1&emailqa_contact2=1&emailtype2=substring&email2=&bugidtype=include&bug_id=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&cmdtype=doit&order=Reuse+same+sort+as+last+time&field0-0-0=noop&type0-0-0=noop&value0-0-0=
Summary: spam/junk filter: add a "headers-only" mode → spam/junk filter: add a "headers-only" mode for bayes filtering
"why invoke the overhead of bayes processing if some other filter has deemed the message to be discardable?"

At least in theory, bayes processing is supposed to work partly because your own pattern of ham is unique to you. That is an advantage for a local client like TB, particularly if marking of ham can be well integrated into the user interface or message processing. But spam signatures are NOT unique to the individual user. The more users reporting that a particular email is junk, the quicker that tokens can be identified that are part of that junk email. Advantage: server. So the best spam filters will need to combine characteristics of both client and server. The problem is not cases where SA has "deemed the message to be discardable" but rather cases where SA is suspicious, but not ready to make the final call. Local bayes could provide a second opinion.

"you add complexity if you start asking what fields and for what packages will bayes pars on"

I just recompiled TB 3.0 replacing the "." token separator with a "," in the bayes tokenizer.  I think that is all that is necessary to start parsing SpamAssassin header keys. I'll be testing that in the next few days. I suspect that other anti-spam solutions have equally simple requirements to parse their headers. Current TB is particularly bad at processing headers. In any case, improvements need to be made there before this current bug makes any sense at all.
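
To show why the choice of separator matters, here is a toy delimiter-based splitter. The two delimiter sets are assumptions made up for this example; they are not copied from the TB source.

```python
from typing import List


def split_on(text: str, delimiters: str) -> List[str]:
    """Split text on any character in `delimiters`, dropping empty pieces."""
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch in delimiters:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


header_value = "tests=BAYES_50,HTML_MESSAGE,SUBJ_ALL_CAPS"
WITH_DOT = " \t\n=."    # hypothetical delimiter set containing "."
WITH_COMMA = " \t\n=,"  # the same set with "." swapped for ","

print(split_on(header_value, WITH_DOT))    # rule list stays glued together
print(split_on(header_value, WITH_COMMA))  # one token per rule name
```

With "." as the separator the whole rule list ends up as a single token that will almost never repeat between messages; with "," each rule name becomes a token the filter can learn from. The flip side is that dotted values such as host names are no longer split.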
Attached file Message Filter File
I have devised the most efficient junk mail filtering system which is possible with ThunderBird.  It's designed to use the "Fetch Headers only" feature of Server Settings in order to not download junk mail, and to get junk mail Headers to your Junk Box.  It will also work fine if you don't use Fetch Headers only.  My complicated setup works with just 13 Filters, and a special "Junk Address Book".  I've Attached a copy of my Message Filter File if you wish to copy it to your computer.  The File would require the following changes which you can do with any Text Editor like Notepad:
"tlmester@mail.niagara.com" needs to be replaced with your E-Mail Address @ Your Server.
"tlmester@niagara.com" needs to be replaced with your E-Mail Address.
"Terry" needs to be replaced with Your First Name.

The "Junk Address Book" would contain only known Junk Addresses and Personal Addresses from people who sometimes send you large junk messages.  This makes managing Junk much easier.  This setup also takes advantage of your first name to get junk mail moved to your Junk Box, but if you made the mistake of using your first name in your E-Mail Address this benefit is defeated.
Make sure that you make a 'backup copy' of your Message Filter File (msgFilterRules.dat) because mine once got inexplicably deleted by ThunderBird.  So, make a backup copy!  You should find this File under the following Directory:
C:\Documents and Settings\YOUR COMPUTER LOGIN\Application Data\Thunderbird\Profiles\gzggsx71.default\Mail\YOUR E-MAIL ACCOUNT SERVER
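
The backup step can be scripted. The following is only a sketch: the function name is made up, and you have to pass in the real path to your own msgFilterRules.dat. Run it while Thunderbird is closed.

```python
import shutil
import time
from pathlib import Path


def backup_filter_file(rules_file: Path) -> Path:
    """Copy msgFilterRules.dat next to itself with a timestamp suffix."""
    if not rules_file.is_file():
        raise FileNotFoundError(rules_file)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = rules_file.with_name(f"{rules_file.name}.{stamp}.bak")
    shutil.copy2(rules_file, backup)  # copy2 also keeps the modification time
    return backup
```

Usage would be something like `backup_filter_file(Path(r"...\msgFilterRules.dat"))` with the directory shown above filled in for your profile and account.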

NOTE:  Before replacing your msgFilterRules.dat File with mine, ThunderBird must not be running in your computer because it regularly re-saves this File.  Exit ThunderBird before replacing this File.

Here is a summary of the 13 Filters:
Lowest Priority Filter -- this gives ALL Messages the Lowest Priority.

Low Priority Filter -- this gives Low Priority to Messages with YOUR First Name in the "TO", or "re:" or "fwd:" in the Subject Line.

Work Filter -- for Messages from your specified WORK Websites, they are given Highest Priority, Not Junk Status, Work Tag and a Star.

Delete (200) Filter -- Messages from addresses in the Junk Address Book that are over 200KB are deleted from your Thunderbird Account, but left on the POP Server for you to manually delete them or view them.  You can access the POP Server via your Internet Server's Webmail.  What I do is include in my Junk Address Book some of the people in my Personal Address Book who sometimes send me large junk e-mails.  This way you won't get stuck downloading junk mail from friendly addresses.

Safe Filter -- this is a list of Websites which you consider to be 'safe'.  Messages from these Websites are given a Star, High Priority and Not Junk Status.

From Filter -- Messages 'from' the listed Address Books (excluding your Junk Book) are given Highest Priority, Not Junk Status and a Star.  (NOTE:  You will need to reconstitute this Filter with the names of your specific Address Books like Personal and Collected.)

Personal Tag Filter -- Messages from addresses in your Personal Address Book are given a Personal Tag.

Important Tag Filter -- Messages with the Highest Priority (not from addresses in your Personal Book) are given an Important Tag.

Download Filter -- Messages not 'from' addresses in any of your Address Books, addressed TO your E-Mail Address, less than 10KB, with a Priority lower than High (meaning they're not in your Safe Filter), and without a Star are downloaded to your Inbox.

Delete (50) Filter -- Messages in your Junk Address Book (but not in your other From Filter Address Books or the Safe Filter) which are larger than 50KB are deleted from both your Thunderbird Account and the POP Server.

Junk Download Filter -- Messages in your Junk Address Book less than 5KB are downloaded, given Junk Status and moved to your Junk Box.  Messages caught by the Safe or From Filters are not covered by this Filter.

Junk Status Filter -- Messages that still have a Lowest Priority are given Junk Status.

Junk Box Filter -- Messages with a Low or Lowest Priority (that don't contain Your First Name in the TO) are moved to your Junk Box.

This setup will provide you excellent junk mail control -- especially if you're using Dialup.
NOTE:  As of the date of this posting, the SIZE feature of Message Filters is not working properly.  It doesn't presently work with Message Headers, and so some of the Filters are disabled until this feature is fixed.
Product: Core → MailNews Core
Severity: normal → S3