Closed Bug 224318 Opened 17 years ago Closed 16 years ago

Bayes filtering should learn through use of external/serverside filters

Categories

(MailNews Core :: Filters, enhancement)

enhancement
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: raccettura, Assigned: Bienvenu)

References

(Blocks 1 open bug)

Details

(Keywords: fixed1.7, late-l10n)

Attachments

(11 files, 5 obsolete files)

7.56 KB, text/plain
1.59 KB, text/plain (Ham)
1.22 KB, text/plain
17.23 KB, patch (mscott: superreview+)
2.10 KB, patch (mscott: superreview+, chofmann: approval1.7+)
15.43 KB, patch (mscott: superreview+)
1.75 KB, patch (mscott: superreview+)
3.89 KB, patch (mscott: superreview+)
3.26 KB, patch (Stefan.Borggraefe: review+, Bienvenu: superreview+, chofmann: approval1.7+)
818 bytes, patch (Bienvenu: superreview+, chofmann: approval1.7+)
752 bytes, patch (mscott: superreview+)

Bayes filtering should be aware of the increasingly popular X-Spam headers. 
Products such as SpamAssassin use them to mark suspected spam emails.

Ideally, Bayes should ignore the top message that SpamAssassin attaches to
suspected spam.

Invite discussion on how to deal with Bayes and other spam filtering software.

I'm attaching 2 emails, 1 spam, 1 ham, both filtered through SpamAssassin.  The
spam is very distinctive, as it's always attached to a spam notice email.  Both
have X-Spam headers.


There is also the possibility of an option to utilize external spam filters
together with, or instead of, Bayes.  For example, training on SpamAssassin's
spam/ham results (as SA's own Bayes filtering does), or allowing Mozilla to
simply accept SA's decision about what the email is, rather than SA doing the
scan and Mozilla then doing it again.  This would in essence let third-party
filters use Mozilla's Spam UI, or work in conjunction with Mozilla.
Attached file Spam
Spam Sample (note attached original email, and headers).
Attached file Ham
Ham.  Note difference from Spam.
Note messages *can* be inline, though that is no longer the default format in SA
later than 2.50; attachments are now the default behavior.
Severity: normal → enhancement
OS: Windows XP → All
Hardware: PC → All
Scott and I had some ideas about what else we could do with this. We're thinking
a new tab on the spam settings window (which we're proposing to have tabs when
we add more options like this) with the following choices about what to do with
the x-spam-status header:

1. Ignore
2. Trust positives
3. Trust negatives (can trust both pos and neg)
4. Give Weight to x-spam-status (somehow combine x-spam-status result with
bayesian score)

If the user chooses to trust both positives and negatives, then we don't need to
run the bayesian filter.
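
The trust options above could be modelled roughly as follows. This is only a sketch: `Trust` and `server_decision` are hypothetical names, not anything from the Mozilla tree, and the "give weight" option is left out because nothing here says how the two scores would be combined.

```python
from enum import Flag, auto
from typing import Optional


class Trust(Flag):
    """Hypothetical flags for how far to trust a server-side verdict."""
    IGNORE = 0
    POSITIVES = auto()
    NEGATIVES = auto()


def server_decision(verdict: Optional[bool], trust: Trust) -> Optional[bool]:
    """Return True (junk), False (not junk), or None (fall back to Bayes).

    verdict is the server's classification: True for spam, False for ham,
    None when no recognised header was found on the message.
    """
    if verdict is True and Trust.POSITIVES in trust:
        return True
    if verdict is False and Trust.NEGATIVES in trust:
        return False
    return None
```

With both flags set, any message carrying a recognised header gets a definite answer, so the Bayesian filter would only be needed for messages that have no header at all.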


I think an option to "use X-Spam" would be good too, rather than using Mozilla's
bayes filtering: honor the spam filter on the server, and use Mozilla's UI and
spam handling with the server's decision.  That would cut down on CPU for those
users.

I like "give weight"

Would be nice if we can have a checkbox to "feed" the bayes filter and train
with the results from X-Spam.  

Since SpamAssassin as well as other products are relatively accurate, without
user interaction, Bayes could be trained quite quickly in those situations, as
noted here (among other places):
http://www.eweek.com/article2/0,4149,1366242,00.asp

There's a ton of potential.  That little tag can really do a lot to enhance Mozilla.
Attached file Header examples
There is one problem: There are different headers with different products.
I use spampal (Mail-Proxy freeware for win32) and it adds different headers.
Blocks: spam
*Detection Checklist*

SpamAssassin:
Ham    X-Spam-Status: No,
Spam   X-Spam-Status: Yes,
Spam   X-Spam-Flag: YES
(x-Spam-Status can have data after Yes, No)
I believe X-Spam-Flag was added in later versions.

SpamPal:
Ham    X-SpamPal: PASS
Spam   X-SpamPal: SPAM 
(spam can have data after the word spam)

SpamCatcher:
Ham   X-SpamCatcher-Flag: No
Spam  X-SpamCatcher-Flag: Yes
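
As a rough illustration, the checklist above can be written as a small lookup table. This is a sketch only: `CHECKLIST` and `classify` are made-up names, and matching the header values case-insensitively is an assumption rather than something the products are known to guarantee.

```python
from typing import Mapping, Optional

# (header name, value prefix meaning spam, value prefix meaning ham).
# None means the checklist lists no ham marker for that header.
CHECKLIST = [
    ("x-spam-flag", "yes", None),
    ("x-spam-status", "yes", "no"),
    ("x-spampal", "spam", "pass"),
    ("x-spamcatcher-flag", "yes", "no"),
]


def classify(headers: Mapping[str, str]) -> Optional[str]:
    """Return "spam", "ham", or None if no known header is present."""
    lowered = {name.lower(): value.strip().lower()
               for name, value in headers.items()}
    for name, spam_prefix, ham_prefix in CHECKLIST:
        value = lowered.get(name)
        if value is None:
            continue
        # Prefix tests, because data may follow the Yes/No/SPAM keyword.
        if value.startswith(spam_prefix):
            return "spam"
        if ham_prefix is not None and value.startswith(ham_prefix):
            return "ham"
    return None
```

For example, `classify({"X-SpamPal": "PASS"})` gives `"ham"`, and a message with none of these headers gives `None`.
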
It occurs to me that we need this to be extensible, and that we need pattern
matching of some sort. So, I'm thinking I should add a filter action to set the
junk score. Then, users can write their own filters to handle some of the
server-side spam products.  Then, integrating with these products becomes a
matter of defining some filters. We could do like the MDN code does and define
the filters internally on the fly, invisible to the user. So I think I'll add a
filter action that sets a junk score.

 
I agree, there should be a way through filters.

But I'm thinking in Junk Mail Controls, there should be a new tab, with
checkboxes for:

[ ] Enable Support for External Mail Filters (Select..)
[ ] Enable Habeas Support
[ ] More soon...
and a mention they can define their own filters to customize this further.

In the first one, have a button, that brings a popup asking "Spam Assassin",
"SpamPal", "SpamCatcher".  Able to check multiples.

Turn them on by default (if the user doesn't have spamassassin, it just will
never fire, no real harm done).  If it does, it automatically kicks in.  The
other option is to have Thunderbird detect the first instance.

By having the filter, and the tab in junk mail controls, the user can not only
define their own rules with a filter (power user), but the basics are within
easy reach for the general user.

As more products emerge, we could easily add UI options for a few of the most
popular options.  I think the 3 mentioned are the most popular right now.

Doing only filter rules would put this feature beyond the casual user, who may
want filtering but is not geeky enough to make a filter.  Besides, I think the
above 3 will apply to most of those who want the feature anyway, so it would
work out of the box for most people, and the rest can adapt it to their needs.


As a sidenote, bug 11040 is indirectly related to this bug.  I wouldn't say
blocking, but a definite influence.
sorry if I wasn't clear - that's what I meant by "Then, integrating with these
products becomes a matter of defining some filters. We could do like the MDN
code does and define the filters internally on the fly, invisible to the user."
So the implementation of that UI would internally be some filters.

One issue I need to deal with is to propagate the junk status set by filters to
the imap server so that if the message gets moved to another folder, the junk
status is also moved.
David:  Sounds good to me.

Changing summary of this bug a little, since it's about more than X-Spam Headers
now.  
Summary: Bayes filtering should be aware of X-Spam Headers → Bayes filtering should learn through use of external/serverside filters
One drawback of this filter approach, as opposed to putting some code in the
code that parses mail headers, is that filters only run on new mail downloaded
in the inbox. If there are server-side filters that classify messages *and* move
them to other imap folders on the server, the client-side spam header detection
filters won't detect them. Not sure if this is an important issue...if it turns
out to be, we could run the internal filters on folders other than the inbox, I
guess. The advantage of using filters is that they're extensible.
this patch makes it so the user can set a message as junk or not junk through a
mail filter.

I think I'm going to make this three separate bugs.

1. this one - UI and backend for filters to set junk score.
2. adding hidden custom filters for various well-known server-side plugins.
3. Adding ability to train bayesian filter on server-data
Attachment #144072 - Flags: superreview?(mscott)
(In reply to comment #13)
> 2. adding hidden custom filters for various well-known server-side plugins.
> 

I think I might take that one when I get a few free cycles.
Robert, I'll get you started by doing one of them, when I get a few cycles :-)
(In reply to comment #15)
> Robert, I'll get you started by doing one of them, when I get a few cycles :-)
If you're doing these in separate bugs, CC me on them, so I can keep track.  Thanks.
Comment on attachment 144072 [details] [diff] [review]
support for filters setting junk score

awesome!
Attachment #144072 - Flags: superreview?(mscott) → superreview+
I think Habeas should write their own as an XPI, personally. Lazy so-and-so's ;-)

Gerv
Habeas headers - they suggest filtering on #3...
http://www.habeas.com/configurationPages/headers.htm

I'm thinking the way this will work is that we'll add the ability to load this
kind of spam filter from disk, so we'll store the individual filters on disk.
That way dropping in new kinds of filters won't involve changing the code so much.
turns out custom headers were somewhat broken, in terms of what the UI allowed
you to set.
Attachment #144171 - Flags: superreview?(mscott)
Attachment #144171 - Flags: superreview?(mscott) → superreview+
Attached file filters for spamassassin (obsolete)
Attached file filters for SpamCatcher (obsolete)
Attached file filters for Habeas (obsolete)
Attached file filters for SpamPal (obsolete)
I'm thinking the way this might work is we add some attributes to
nsISpamSettings for handling server-side spam filters:

1. ServerSpamFilterName
2. ServerSpamAction - trust yes, trust no, trust both

Then, when we're starting up a server, if the spam filter name is set, we load
the correspondingly named filter file, and enable the Yes and/or No filters,
according to what the user has specified.

As far as the UI for picking the server side spam filter to incorporate is
concerned, I imagine it'll just be a drop down where you can pick from the list
of server-side spam filters we know about (maybe with a default choice of None,
or a checkbox to turn off this behaviour). It would be cool to populate this
list from the .dat files on disk, so that dropping in a new one adds it
automatically to the list, but we might not get there...
Comment on attachment 144072 [details] [diff] [review]
support for filters setting junk score

this would involve an exception for the localization freeze (it adds a few
strings) but we'd really like to get this into tbird .6 and Moz 1.7 - the fix
is fairly safe, and allows you to make filters set a junk score.
Attachment #144072 - Flags: approval1.7?
Comment on attachment 144171 [details] [diff] [review]
fix for custom headers

this is needed because the custom headers stuff was always slightly broken...
Attachment #144171 - Flags: approval1.7?
Comment on attachment 144171 [details] [diff] [review]
fix for custom headers

a=chofmann for 1.7
Attachment #144171 - Flags: approval1.7? → approval1.7+
David:

If we are learning from positive marks from external spam filters, isn't it
necessary to learn from negatives as well?  Otherwise we are essentially
tainting the built in bayesian filters with one sided results.

Just thinking out loud really.
Not sure what you mean - I've added settings to trust both positive and negative
results in my patch, and in the filters (except for Habeas). But I haven't done
anything about actually feeding the data into the spam filter to train it...I'm
probably going to leave that to you or someone else.
Hmm.. I retract my last comment.

I apparently have some networking issues; when I was looking at the filter for
SpamAssassin I saw:
>name="SpamAssasinYes"
>enabled="yes"
>type="1"
>action="JunkScore"
>actionValue="100"
>condition="OR (\"X-Spam-Status\",begins with,Yes) OR (\"x-Spam-Flag\",begins
with,YES)"

and that was it... hence my question.

But now I see the rest.  I've also been double posting on at least one forum,
and having connections time out.  So I think I have some networking problem here
at the minute, though my MRTG graph barely shows a change in ping time.

Anyway.  Disregard my last comment.  
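
For reference, the condition quoted above reads as two OR-ed "begins with" tests on message headers. A minimal sketch of evaluating such a condition follows. It assumes the unescaped form of the string, handles only `OR` terms with the `begins with` operator, and compares values case-sensitively; `TERM` and `matches` are hypothetical names, not the real filter code.

```python
import re
from typing import Mapping

# One term looks like: OR ("Header-Name",begins with,Value)
TERM = re.compile(r'(AND|OR) \("?([^",]+)"?,([^,]+),([^)]*)\)')


def matches(condition: str, headers: Mapping[str, str]) -> bool:
    """Return True if any OR-ed 'begins with' term matches the headers."""
    # Header names are compared case-insensitively, values as written.
    lowered = {name.lower(): value for name, value in headers.items()}
    results = []
    for bool_op, header, op, value in TERM.findall(condition):
        if bool_op != "OR" or op != "begins with":
            raise ValueError(f"unsupported term: {bool_op} ... {op}")
        results.append(lowered.get(header.lower(), "").startswith(value))
    return any(results)
```

Usage, with the condition from the SpamAssassin "Yes" filter:

```python
cond = 'OR ("X-Spam-Status",begins with,Yes) OR ("x-Spam-Flag",begins with,YES)'
matches(cond, {"X-Spam-Status": "Yes, hits=9.1 required=5.0"})
```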
This handles automatically creating hidden filters for a given server-side
filter, if the per-server pref serverFilterName and serverFilterTrustFlags are
set appropriately.
Attachment #144590 - Flags: superreview?(mscott)
diff for filter description files (includes a typo fix in SpamAssassin.sfd)
Attachment #144230 - Attachment is obsolete: true
Attachment #144231 - Attachment is obsolete: true
Attachment #144232 - Attachment is obsolete: true
Attachment #144233 - Attachment is obsolete: true
Attachment #144591 - Flags: superreview?(mscott)
Attachment #144592 - Flags: superreview?(mscott)
Attachment #144592 - Flags: superreview?(mscott) → superreview+
Attachment #144591 - Flags: superreview?(mscott) → superreview+
Comment on attachment 144590 [details] [diff] [review]
backend support for automatic server spam filter  filters

looks great.
Attachment #144590 - Flags: superreview?(mscott) → superreview+
Comment on attachment 144072 [details] [diff] [review]
support for filters setting junk score

a=asa (on behalf of drivers) for checkin to 1.7
Attachment #144072 - Flags: approval1.7? → approval1.7+
front and backend support for filters setting junk score checked in.
Keywords: late-l10n
Bug 181631 was already about having Mark as Junk/Not Junk in the message filter 
actions; I've marked it Fixed with an xref to this bug.

I've opened bug 238816 about adding those enhancements for custom-header 
matching to MailViews and Search.
I think there are some small issues with the strings that were checked in:

> +<!ENTITY setJunkScore.label "Set Junk Status">

The other filter actions that use a combobox all end with a colon. Also I think
this filter action should end with a "to" to be consistent with "Change message
priority to:".

I'm not sure whether "Junk Status" should be upper case or not.

> +<!ENTITY notJunk.label "NotJunk">

There should be a blank between Not and Junk.
I agree with Stefan about the language changes he suggested. This patch does
just that. 

1) It adds a space between Not and Junk
2) It adds a colon to the phrase "Set Junk Status to", to be consistent with
setting the priority
3) I also moved the junk status action in the dialog so it was grouped with the
rest of the combo box driven actions such as setting priority, label the
message, etc. Don't let the weird way cvs diff generated the patch for that
change fool you. It was just moving a few lines of xul higher up in the file. 

Still have one remaining problem...whenever we read the filter in from disk,
this action always resets to Not Junk even if you had it set to Junk.
(In reply to comment #41)
> Created an attachment (id=145191)
>
> 1) It adds a space beteen Not and Junk

This is not included in the patch. :-(
actually it is. But cvs diff -uw ignores white space and it views that change as
white space so it didn't show up. Weird

:)
Attachment #145191 - Flags: superreview?(bienvenu)
Attachment #145191 - Flags: review?(Stefan.Borggraefe)
Attachment #145191 - Flags: review?(Stefan.Borggraefe) → review+
Attachment #145191 - Flags: superreview?(bienvenu) → superreview+
I'm not able to reproduce the filter returning to non-junk problem, even with a
fresh tree from CVS. Maybe it's a release build only issue...
Comment on attachment 145191 [details] [diff] [review]
following up on Stefan's suggestions to the filter UI

asking for 1.7 status for this polish
Attachment #145191 - Flags: approval1.7?
Comment on attachment 145191 [details] [diff] [review]
following up on Stefan's suggestions to the filter UI

a=chofmann for 1.7
Attachment #145191 - Flags: approval1.7? → approval1.7+
Comment on attachment 145191 [details] [diff] [review]
following up on Stefan's suggestions to the filter UI

this patch has been checked in for 1.7 final
(In reply to comment #41)
> Still have one remaining problem...whenever we read the filter in from disk,
> this action always resets to Not Junk even if you had it set to Junk.

I see this too. The actionValue contains a random number instead of 0 or 100 when
the FilterListDialog is opened for the first time after mozilla is started. When
you just open the FilterListDialog and close it immediately without opening the
FilterEditor this value is written to msgFilterRules.dat.

Also in FilterEditor.js sometimes gJunkScoreCheckbox and sometimes 
gChangeJunkScoreCheckbox is used for something that looks like it should be just
one variable instead. But this is unrelated to the random number problem.
this patch fixes one of the issues with this bug fix:

"whenever we read the filter in from disk,
this action always resets to Not Junk even if you had it set to Junk."

We were never reading in the junk mail action value when reading the filter
from disk. Hence, the action value was garbage, causing it to sometimes get set
to mark as junk and sometimes as not junk.

However there is still another really nasty issue out there. Any filter that
fires has the random potential of marking mail as junk. Even if the filter does
not have the junk status action checked. See Bug #239349 for information about
that issue.
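
In outline, the first fix described above (always reading the stored junk action value rather than leaving it uninitialized) might look something like this. The actual patch is C++ and is not reproduced in this thread; `read_junk_action` is a hypothetical name, the field names are borrowed from the filter quoted earlier, and rejecting anything other than 0 or 100 is an assumption of this sketch.

```python
import re
from typing import Iterable, Optional

# Filter fields are stored one per line as key="value".
LINE = re.compile(r'^(\w+)="(.*)"$')


def read_junk_action(lines: Iterable[str]) -> Optional[int]:
    """Return the junk score of a JunkScore action, or None if absent."""
    fields = {}
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            fields[m.group(1)] = m.group(2)
    if fields.get("action") != "JunkScore":
        return None
    raw = fields.get("actionValue")
    if raw is None:
        # Never fall through with an unset value.
        raise ValueError("JunkScore action without actionValue")
    score = int(raw)
    if score not in (0, 100):
        raise ValueError(f"unexpected junk score: {score}")
    return score
```

The point is only that the value is read and checked explicitly, so a missing or corrupt entry fails loudly instead of turning into an arbitrary junk/not junk decision.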
Comment on attachment 145363 [details] [diff] [review]
fixes a bug where the junk action value never gets initialized

david see my comment above that explains this patch.
Attachment #145363 - Flags: superreview?(bienvenu)
Comment on attachment 145363 [details] [diff] [review]
fixes a bug where the junk action value never gets initialized

uninitialized variables leading to random behavior == good candidate for 1.7
final :)
Attachment #145363 - Flags: approval1.7?
I just found the cause of Bug #239349 which caused messages to get randomly
marked as junk or not junk if you had a filter rule that set a label action.
That fix should also go into 1.7

Comment on attachment 145363 [details] [diff] [review]
fixes a bug where the junk action value never gets initialized

a=chofmann for 1.7
Attachment #145363 - Flags: approval1.7? → approval1.7+
Comment on attachment 145363 [details] [diff] [review]
fixes a bug where the junk action value never gets initialized

I swear I wrote that code...
Attachment #145363 - Flags: superreview?(bienvenu) → superreview+
Comment on attachment 145363 [details] [diff] [review]
fixes a bug where the junk action value never gets initialized

this patch has been checked in for 1.7
Keywords: fixed1.7
backend support is checked in. I still need to write some front end code to
allow the user to set this up (though for now you can just set a hidden pref on
the server, serverFilterName, to the appropriate server side filter name:
Habeas, SpamAssassin, SpamCatcher, or SpamPal).
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
this busted balsa tinderbox (gcc3.4):
/builds/tinderbox/SeaMonkey-gcc3.4/Linux_2.4.7-10_Depend/mozilla/mailnews/base/src/nsSpamSettings.cpp:458:
error: extra `;'
Attached patch Fix Bustage (obsolete)
Attachment #146072 - Flags: review?(bienvenu)
Comment on attachment 146072 [details] [diff] [review]
Fix Bustage

I actually already checked in the same fix
Attachment #146072 - Attachment is obsolete: true
Attachment #146072 - Flags: review?(bienvenu)
Blocks: 240476
What about forget headers? Is this code immune to that? E.g. takes only the
later added headers? The spammers could insert their own headers saying
spam-status: 0.

And how are new spam filters added to this? My server has some new stuff, it
inserts a score into the header and even the cause for this score - what was
suspicious in the mail. Something like this:
X-Spam-Status: No, hits=0.1 required=5.0
X-Spam-Level: HTML_MAIL, NO_SENDER
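
A header like the X-Spam-Status example above could be picked apart along these lines. This is a sketch that only assumes the `Yes`/`No` prefix and the `hits=`/`required=` fields shown in the example; `SpamStatus` and `parse_spam_status` are hypothetical names, and it does nothing about the forged-header question.

```python
import re
from typing import NamedTuple, Optional


class SpamStatus(NamedTuple):
    is_spam: bool
    hits: Optional[float]
    required: Optional[float]


VERDICT = re.compile(r'^(Yes|No)\b', re.IGNORECASE)
FIELD = re.compile(r'\b(hits|required)=(-?\d+(?:\.\d+)?)')


def parse_spam_status(value: str) -> Optional[SpamStatus]:
    """Parse an X-Spam-Status value; return None if it has no Yes/No prefix."""
    value = value.strip()
    verdict = VERDICT.match(value)
    if verdict is None:
        return None
    # Numeric fields are optional; absent ones come back as None.
    fields = {key: float(number) for key, number in FIELD.findall(value)}
    return SpamStatus(
        is_spam=verdict.group(1).lower() == "yes",
        hits=fields.get("hits"),
        required=fields.get("required"),
    )
```
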
(In reply to comment #60)
> What about forget headers? Is this code immune to that? E.g. takes only the
> later added headers? The spammers could insert their own headers saying
> spam-status: 0.

Good point.
Some of my mails get filtered two or more times and get different X-Spam
headers. If not all of them mark it as spam, then it might be a problem. 
I meant forged, sorry for the typo.
Attachment #147413 - Flags: superreview?(mscott)
Attachment #147413 - Flags: superreview?(mscott) → superreview+
Comment on attachment 147413 [details] [diff] [review]
fix for pop3 filter junk score action

very safe fix, only affecting setting junk score with pop3 filters...
Attachment #147413 - Flags: approval1.7?
Comment on attachment 147413 [details] [diff] [review]
fix for pop3 filter junk score action

a=asa (on behalf of drivers) for checkin to 1.7
Attachment #147413 - Flags: approval1.7? → approval1.7+
*** Bug 243049 has been marked as a duplicate of this bug. ***
My mail server sends x-junkmail-status headers, an example value is
"score=150/50, host=mx01.versatel.de". There also is a header X-Junkmail-Whitelist.
Product: MailNews → Core
I noticed that the sfd files were added to packages-os2 (and others), but when I
build mailnews, these files never get exported into my dist.

It appears that the Makefile never gets hit?
(In reply to comment #68)
> I noticed that the sfd files were added to packages-os2 (and others), but I
> build mailnews, these files never get exported into my dist.
> 
> It appears that the Makefile never gets hit?

Makefile.in (in mailnews/base/search/src/) has only added SpamAssassin and SpamPal, leaving out the other two, even though the packager scripts try to install them (see attachment 144592 [details] [diff] [review]). Legal issues, or were they simply forgotten?
Product: Core → MailNews Core