Closed Bug 163188 (bayesian) Opened 22 years ago Closed 22 years ago

Add Bayesian antispam filters per Paul Graham's design

Categories

(MailNews Core :: Filters, enhancement)

x86
All
enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ericweb, Assigned: beard)

References

(Blocks 1 open bug, )

Details

Attachments

(3 files, 13 obsolete files)

2.88 KB, patch; dmosedale: review+; sspitzer: superreview+
3.43 KB, patch
1.59 KB, patch; peterv: review+
Lisp guru Paul Graham has written a brilliant spam filtering system (at the
above URL) based on Bayesian statistical evaluation of the tokens of incoming
mail. The spam and non-spam probabilities of individual words are derived by
scanning and comparing corpuses of the individual's received spam mail and
non-spam mail. The brilliant things about this approach are that:

1) it is extremely difficult for spammers to defeat it
2) it automatically evolves as spammers evolve their pitches
3) it tailors its behavior to the individual's actual received emails
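
To make the idea concrete, here is a rough sketch of the per-token scoring and combining step. It is a simplification and not Graham's exact formula: the tokenizer, the 0.01/0.99 clamps, the neutral 0.5 for unseen words, and the choice of the 15 most extreme tokens are all assumptions picked for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def train(messages):
    """Count, for each token, how many messages in the corpus contain it."""
    counts = Counter()
    for message in messages:
        counts.update(set(tokenize(message)))
    return counts


def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) from the two corpora, clamped away from 0 and 1."""
    spam_freq = spam_counts[token] / max(n_spam, 1)
    ham_freq = ham_counts[token] / max(n_ham, 1)
    if spam_freq + ham_freq == 0:
        return 0.5  # never-seen token: treated as neutral (an assumption)
    return min(0.99, max(0.01, spam_freq / (spam_freq + ham_freq)))


def message_spam_probability(text, spam_counts, ham_counts, n_spam, n_ham, top_n=15):
    """Combine the most extreme token probabilities into one message score."""
    probs = [
        token_spam_probability(t, spam_counts, ham_counts, n_spam, n_ham)
        for t in set(tokenize(text))
    ]
    # Keep only the tokens whose probability is farthest from neutral.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam_product = 1.0
    ham_product = 1.0
    for p in probs[:top_n]:
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)
```

Because the counts come from the individual's own spam and non-spam mail, two users running the same code end up with different token weights, which is the property points 1 and 3 rely on.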

A Moz Mail client implementation of this might have the following features:
A) "Scan Spam Email" and "Scan Nonspam Email" context menu options in the mail
folders window that would scan a folder's emails and process them as spam or
nonspam (seeding the Bayesian statistical evaluation); you'd use this for
"initializing" the filter when getting started and training it on your
accumulated nonspam and spam email
B) A "Delete As Spam" right-click menu option that would not only delete the
email but would also use it as input to the spam filter
C) An initial database of default token weights to get the filter started. (I'm
sure Paul Graham would be happy to provide his corpuses and derived weightings
to an open source project such as Mozilla. He clearly wants others to make use
of his work.)
D) A checkbox in prefs that turned this filtering off. (It will be a popular and
accurate enough feature that it should be on by default to promote awareness,
adoption, and viral adoption of Moz Mail.)
E) A "Suspected Spam" folder into which suspected spam email is automatically
filtered.

The first email program with this kind of filtering built in will be the first
killer app since the browser itself as the amount of spam is rapidly increasing.
It could be a major driver of Mozilla adoption if Moz Mail had this feature.
(e.g. I'd switch back to POP, if I had to, to take advantage of this feature.)

Of course, mail server-based implementations are also possible, but mail server
vendors and ISPs may resist implementing and adopting them because of the
processing load it would put on their servers. Client CPU cycles on the other
hand are cheap. It would be great if someone would add this to Moz Mail which is
not supported by commercially available antispam systems I'm aware of.
Bravo!  I was just chatting on #mozillazine that putting this in would be the
thing that would inspire me to learn Mozilla's structure...
Thought implementing it might make a nice easy project that would make my life
sooo much more pleasant. (no more continual .procmailrc updating!)

No knowledge of said structure at the moment, so would just like to register an
interest in helping out, as well as seeking good places to learn about Mail/News
methods.
Oh yeah, this would totally make my day!

Gary Arnold also seems to have written this thing in Perl, which might be easier
to understand for a lot of people: http://www.garyarnold.com/projects.php#bayespam
And more source to analyze:

Written by ESR in C: http://www.tuxedo.org/~esr/bogofilter/ and someone made a
"badwordlist" for that which can be found at
http://www.xtdnet.nl/paul/spam/bogofilter/

Check bug 11035. Someone is working on the foundation work and comment 31 in
that bug by Alec Flett seems like a very good idea. That way I can make the spam
filter run after all my other filters to achieve better results. I
think eventually we would have a number of filter plugins like "Spam Assassin
(bug 11035)", "Bayesian (this bug)", "whitelist (bug 120160)" that users can use
instead of one general solution. Should we add this as a dependency?
I _DEFINITELY_ would like to see this happen; a tool that
integrates into the mail browser to deal with spam is really needed.
However, I think you can make the user interface even simpler.

Create a nice big button named "SPAM" - press it
whenever you see a SPAM message like you would press "Delete",
but it will then do a number of configurable actions -
and give it a useful default, since many users don't configure their systems.
I suggest as the default that it
(1) forge and send back a "no such user" message;
this will remove you from some lists in a few cases and also
warns others about forging emails.
(2) save the message in a specially-named
"spam" folder, for use in a naive Bayesian
statistical analysis program (as Graham describes).
Other options that could be turned on include forwarding a copy
to a list of email addresses (e.g., your local "abuse" account,
the newsgroup news.admin.net-abuse.sightings, and
email addresses of well-known spam killers), or calling on other
spam killers to check it like SpamAssassin.
In the dialogue for configuring the SPAM button's actions,
perhaps there could be a radio selection beside each action like
"don't do it when you press SPAM", "do it when you press SPAM",
or "confirm before doing it when you press SPAM" - that way, you
can turn on "ask me before sending to abuse" or whatever.

Also, you need to pre-create a "Suspected Spam" folder
if one doesn't already exist.

When the "spam" folder grows to 100 messages, start up the
naive Bayesian analysis, and assume that any email saved in
"spam" is spam, and anything saved in a folder
other than the "Suspected Spam" or "Spam" folders is good.
Automatically re-run the analysis every 100 messages or every
week, whichever comes first.
Note that from a user's point of view, the only thing they
need to know is that they need to press the SPAM button
when they see spam.  Everything else is automatic.
Simple, n'est-ce pas?
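
The retraining rule and the folder-based labelling above could be sketched roughly like this. The function names are made up for illustration, and the thresholds are simply the "100 messages or one week" figures from this comment.

```python
from datetime import datetime, timedelta

RETRAIN_EVERY_MESSAGES = 100         # from the suggestion above
RETRAIN_EVERY = timedelta(weeks=1)   # "or every week, whichever comes first"


def should_retrain(new_messages_since_last_run, last_run, now):
    """Re-run the analysis after 100 new messages or one week, whichever is first."""
    return (
        new_messages_since_last_run >= RETRAIN_EVERY_MESSAGES
        or now - last_run >= RETRAIN_EVERY
    )


def training_label(folder_name):
    """Map a folder to its training label; "Suspected Spam" is left out of training."""
    name = folder_name.strip().lower()
    if name == "spam":
        return "spam"
    if name == "suspected spam":
        return None  # unconfirmed guesses should not feed back into the statistics
    return "good"
```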

There's a great deal of study on the topic of Naive Bayesian
approaches to spam, and it's surprisingly effective.
Here are some references studying it:

http://arxiv.org/abs/cs.CL/0006013
  An evaluation of Naive Bayesian anti-spam filtering
http://arxiv.org/abs/cs/0008019
  An Experimental Comparison of Naive Bayesian and
  Keyword-Based Anti-Spam Filtering with Personal E-mail Messages
http://arxiv.org/abs/cs/0009009
  Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian
  and a Memory-Based Approach
http://www.lsi.upc.es/~carreras/pub/boospamev.ps
http://www.monmouth.edu/~drucker/SVM_spam_article_compete.PDF


Others have at least partly implemented it:
http://www.ai.mit.edu/~jrennie/ifile/
  Ifile implemented the idea many years ago.  This is useful to show that
  the idea has been around awhile.
http://crm114.sourceforge.net
  CRM114 can do this, and can extend the probabilities to
  phrases and not just individual words.

Eric Raymond's implementation has already been mentioned; that might
actually be a better starting point for actual code.

Hope all this helps!!

Summary: want Bayesian antispam filters per Paul Graham's design → [RFE] Add Bayesian antispam filters per Paul Graham's design
*** Bug 165725 has been marked as a duplicate of this bug. ***
How about a mail filtering API of some sort that would allow plugable filters. That way I could use a Bayesian filter and someone else could use Vipul's Razor or a whitelist program. That way we might not have to wait as long when Joe Spammer finds a way past the current filter and we have to tighten up the spam filters or combine methods.
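
A minimal sketch of what such a pluggable arrangement might look like, purely as an illustration: the class and function names here are hypothetical and this is not the nsIMsgFilterPlugin interface discussed later in this bug. Each plugin either gives a verdict or abstains, and the first verdict wins.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Message:
    sender: str
    body: str


class FilterPlugin(Protocol):
    """Hypothetical plugin contract: True = junk, False = good, None = no opinion."""
    name: str

    def classify(self, message: Message) -> Optional[bool]: ...


class WhitelistPlugin:
    name = "whitelist"

    def __init__(self, trusted_senders):
        self.trusted = {s.lower() for s in trusted_senders}

    def classify(self, message):
        return False if message.sender.lower() in self.trusted else None


class KeywordPlugin:
    name = "keywords"

    def __init__(self, junk_words):
        self.junk_words = [w.lower() for w in junk_words]

    def classify(self, message):
        body = message.body.lower()
        return True if any(w in body for w in self.junk_words) else None


def run_plugins(message, plugins):
    """Ask each plugin in order; return the first verdict and who gave it."""
    for plugin in plugins:
        verdict = plugin.classify(message)
        if verdict is not None:
            return verdict, plugin.name
    return None, None
```

With this shape, tightening the filters after a spammer gets through means adding or reordering a plugin rather than changing the mail client itself.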

Just for the record: another article, by a statistician this time, on a method
to perhaps improve the original "Bayesian" algorithm to one which actually is
Bayesian. :-)

http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html

Gerv 
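
For readers who do not follow the link: the adjustment usually associated with that essay blends each word's raw spam probability with a prior, so that words seen only a few times do not get extreme scores. The sketch below is written from memory of that idea; the parameter names and the default values of s and x are assumptions, not values taken from this bug.

```python
def robinson_adjusted(p_word, n, s=1.0, x=0.5):
    """Blend a word's raw spam probability with a prior.

    p_word: raw probability that a message containing the word is spam
    n:      number of training messages that contained the word
    s:      strength given to the prior (assumed default)
    x:      probability assumed for a word with no data (assumed default)
    """
    return (s * x + n * p_word) / (s + n)
```

A word seen once in spam and never in good mail scores 0.75 here rather than 1.0, and only approaches 1.0 as the evidence accumulates.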
I've started working on this, I have a simple prototype. I'll post again once I
get something decent.
Make this depend on the plugin interface bug.
Depends on: 167561
Depends on: 169557
No longer depends on: 167561
For algorithmic ideas (if not code :-), the Python community has started a
project aimed at developing optimal parsing and classifier algorithms, varying
within the ideas of Paul Graham.  Using rigorous testing methods, lots of ideas
of the form "wouldn't it work better if you changed this parameter or if you
treated that header specially" are accepted and rejected.

See http://spambayes.sf.net

Developers welcome!
Just thought I'd mention that I'm the author of the essay mentioned in comment #8
above
(http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html)...
we're discussing it on the list Guido mentions (http://spambayes.sf.net). I, for
one, would be very  interested in any discussion of using Graham-derived stuff
in Mozilla.

Gary Robinson
Attached patch bayesian prototype patch, v1 (obsolete) — — Splinter Review
This is the beginnings of a prototype for bayesian spam filtering, courtesy of
peterv.  This code has some issues right now, and is reportedly not catching
spam as well as earlier versions, so I need to track down what's going on
there.  The biggest issue, though, is that right now a JS hash is used for word
counts, and read and written to disk in one shot.  We're gonna want a real
on-disk DB here to avoid tons of bloat, so I'm guessing mork is not gonna be
suitable... but I'm happy to be proven wrong.  Perhaps we need to fall back to
the ancient dbm code that the NSS folks use?
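
To illustrate the difference between a hash that is read and written in one shot and a store that is updated in place on disk, here is a small sketch using SQLite from the Python standard library. This is only an illustration of the storage pattern; it is not what the patch does, and the table layout and class name are invented.

```python
import sqlite3


class TokenStore:
    """Per-word spam/good counts kept in an on-disk table instead of one big hash."""

    def __init__(self, path=":memory:"):
        # Pass a file path to persist; ":memory:" keeps this example self-contained.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tokens ("
            " word TEXT PRIMARY KEY,"
            " spam INTEGER NOT NULL DEFAULT 0,"
            " good INTEGER NOT NULL DEFAULT 0)"
        )

    def add(self, words, is_spam):
        """Increment the counter for each word; only touched rows are rewritten."""
        column = "spam" if is_spam else "good"  # fixed choice, never user input
        with self.db:  # one transaction per message
            for word in words:
                self.db.execute("INSERT OR IGNORE INTO tokens (word) VALUES (?)", (word,))
                self.db.execute(
                    f"UPDATE tokens SET {column} = {column} + 1 WHERE word = ?", (word,)
                )

    def counts(self, word):
        """Return (spam_count, good_count) for a word, or (0, 0) if unseen."""
        row = self.db.execute(
            "SELECT spam, good FROM tokens WHERE word = ?", (word,)
        ).fetchone()
        return row if row else (0, 0)
```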
Attached patch patch, v2 (obsolete) — — Splinter Review
This takes the code from the last patch and hooks it up into the Mozilla build,
and partially hooks it up with Seth's UI.  It still needs a bunch of cleanup,
but it's a start.  This patch includes Seth's "turn on the UI" patch, as
well as the nsIMsgFilterPlugin interface and the hooks into IMAP to use it.
Attachment #99870 - Attachment is obsolete: true
It might be nice to automatically notify the sender of any email that is
automatically sorted as spam that his email was automatically sorted as spam and
not received by the intended recipient.  This would help to catch false
positives that might otherwise go unnoticed.
But if we automatically respond to the spammers, it just confirms that the
address is valid, just as many "opt-out" links just confirm an active email
address.  The best bet is to ignore or send a false undeliverable.
I agree with #16.  If one is concerned about false positives, one should just
scan the spam folder briefly before emptying it.

Replying to spam is the last thing you want to do.  (right up there with opening
them with HTML parsing enabled)
Attached patch patch, v3 (obsolete) — — Splinter Review
Patch v3; includes changes from peterv to the algorithm (starting to work
nicely :-).  Also has various infrastructure changes so that IMAP analyzes
every message as it comes in (though it doesn't yet indicate that in the UI).
Attachment #100102 - Attachment is obsolete: true
over to beard, who now owns this.
Assignee: naving → beard
BTW, we should credit Paul Graham, Gary Robinson and the spambayes list (Tim
Peters et alia) in the filter code. Though I didn't use the lisp or the python
code I stole a lot of the ideas for our implementation from them. ;-)
OK, I've checked in peterv's existing code in
mailnews/extensions/bayesian-spam-filter; it's not yet part of the build.  I
also added a comment crediting various folks as Peter suggested.  The stuff
that's checked in may not quite work, but it's close.  Next step: get the rest of
the infrastructure changes in.
Is it too far beyond the scope of this bug to suggest the capability of pooling
the "good" and "bad" corpus derived from many users?  Hypothetically, in this
way a company could designate a central repository for "interesting" tokens so
that the scores from other employees can be evaluated along with personal dbs -
this immediately reduces the false-positive rates for common contacts with
suppliers, sales, etc.  Or would such a repository of necessity need to be a
transactional-based server (sql of some flavor for example) to avoid
file-locking issues, and therefore would require much more development? 

An interesting side-development of this would be to create an infrastructure
(between many sites) that share their corpus results and use this as a baseline
against which an individual's behaviours (prior picking of spam or removing mail
from the spam-holding folder) are given weight.
Some temporary code to hook the bayesian bits that are currently checked in up to
seth's UI.  Eventually this will be replaced by code that goes through the filtering
plugin interface, but that's not there yet.
Attachment #100530 - Attachment is obsolete: true
Michael: an interesting idea, but definitely beyond the scope of this particular
bug.  Feel free to file another about it, if you wish.
Comment on attachment 100812 [details] [diff] [review]
temporary glue to hook up junk mail button to bayesian code

r=bienvenu
Attachment #100812 - Flags: review+
Comment on attachment 100812 [details] [diff] [review]
temporary glue to hook up junk mail button to bayesian code

I'm not sure this is right.

why are you setting .label on the message header, and not the .score attribute?

instead of function mark() can we call it something like function
setSpamScore()?
Attached patch temp glue patch, v2 (obsolete) — — Splinter Review
Attachment #100812 - Attachment is obsolete: true
Comment on attachment 100858 [details] [diff] [review]
temp glue patch, v2

sr=sspitzer
Attachment #100858 - Flags: superreview+
Re comment 22, comment 24, there are various downsides to a pooled repository:
* Possibility of it being 'polluted' by malicious/incompetent users;
* What different people consider as spam may not be the same;
* Different people get different types of mail, therefore their 'good' lists in
particular are likely to be very different; and
* If everybody uses the same lists, spammers can tune their content based on
these lists and thus avoid the filter - one of the strengths of the system is
that everyone's list is different, so it's impossible to produce mail to get
round every list.
Comment on attachment 100858 [details] [diff] [review]
temp glue patch, v2

r=peterv, note that this will still mark the message as read.
Attachment #100858 - Flags: review+
Is there a target date or milestone on this bug?  There's been a lot of activity
but no milestone has been set.  I see the reviews are done, so I assume once
approval has been obtained the code will be checked into the tree?
This feature is very much a work in progress. The architecture to support filter
plugins is still being worked out.
Status: NEW → ASSIGNED
Attached patch temp glue patch, v3 — — Splinter Review
Fix problems related to the score property; make mark() iterate instead of
using callbacks.
Attachment #100858 - Attachment is obsolete: true
Comment on attachment 100954 [details] [diff] [review]
temp glue patch, v3

sr=sspitzer

sorry about the misleading ".score" thing.

note to the reviewer: we don't need to call setScore() as the doCommand() code
does that.
Attachment #100954 - Flags: superreview+
Comment on attachment 100954 [details] [diff] [review]
temp glue patch, v3

Carrying forward peterv's r=.
Attachment #100954 - Flags: review+
Glue patch checked in.
Attached patch build patch, v1 — — Splinter Review
Patch to turn on the UI that seth put in (duplicated from his patch in the
front-end bug) and cause the bayesian stuff to be installed at build time.
Attached patch Mac build system patch. — — Splinter Review
This implements a new build option, junkmailfilter, and adds appropriate build
steps to the Mac build system.
Can someone please set an eta or milestone?  Also, can we use the temporary
"glue" until the message filtering plugin interface comes through or do we have
to wait on that?
Attached patch Patch for msgcoreidl (obsolete) — — Splinter Review
CW project change to build nsIMsgFilterPlugin.idl on the Mac.
Comment on attachment 100978 [details] [diff] [review]
Mac build system patch.

r=peterv
Attachment #100978 - Flags: review+
Attached patch gcc warning fix, v1 (obsolete) — — Splinter Review
gcc 3.1.1 points out that rv can be used uninitialized if aCount == 0.  Here's
a patch.
Comment on attachment 102510 [details] [diff] [review]
gcc warning fix, v1

r=beard
Attachment #102510 - Flags: review+
This patch uses PLDHashTable instead of nsObjectHashTable. This allows it to
allocate all Token objects in place, and does aggressive key sharing. Still has
some glitches right now, so this is a work in progress.
Attachment #102510 - Attachment is obsolete: true
Attached patch PLDHashTable patch, v2. (obsolete) — — Splinter Review
This simplifies the structure of the Token significantly, and uses the stub
moveEntry and clearEntry PLDHashTable operators. Pooled allocation of token
strings is now unconditionally used.
Attachment #102513 - Attachment is obsolete: true
Comment on attachment 102521 [details] [diff] [review]
PLDHashTable patch, v2.

sr=sspitzer
Attachment #102521 - Flags: superreview+
I've been noticing some weirdness with this patch (things getting slow or not
working, and shutdown taking forever).

I got a crash while classifying:

PL_DHashStringKey(PLDHashTable * 0x04a72344, const void * 0xdddddddd) line 79 + 22 bytes
HashKey(PLDHashTable * 0x04a72344, const void * 0xdddddddd) line 72 + 13 bytes
PL_DHashTableOperate(PLDHashTable * 0x04a72344, const void * 0xdddddddd, int 0) line 479 + 16 bytes
Tokenizer::get(const char * 0xdddddddd) line 134 + 15 bytes
Tokenizer::remove(const char * 0xdddddddd, unsigned int 3722304989) line 157 + 12 bytes
forgetTokens(Tokenizer & {...}, Token * * 0x02de2d68, unsigned int 91) line 593
nsBayesianFilter::observeMessage(Tokenizer & {...}, const char * 0x04a6c3a8, unsigned int 1, unsigned int 2, nsIJunkMailClassificationListener * 0x00000000) line 625 + 20 bytes
MessageObserver::analyzeTokens(const char * 0x04a6c3a8, Tokenizer & {...}) line 578
TokenStreamListener::OnStopRequest(TokenStreamListener * const 0x04a6c310, nsIRequest * 0x04a6c160, nsISupports * 0x00000000, unsigned int 0) line 375
nsStreamConverter::OnStopRequest(nsStreamConverter * const 0x04a59880, nsIRequest * 0x04a6c160, nsISupports * 0x00000000, unsigned int 0) line 1099
nsStreamListenerTee::OnStopRequest(nsStreamListenerTee * const 0x04915f20, nsIRequest * 0x04a6c160, nsISupports * 0x00000000, unsigned int 0) line 66
nsOnStopRequestEvent0::HandleEvent(nsOnStopRequestEvent0 * const 0x04ad7d18) line 319 + 33 bytes
nsStreamListenerEvent0::HandlePLEvent(PLEvent * 0x04ad7d28) line 113 + 12 bytes
PL_HandleEvent(PLEvent * 0x04ad7d28) line 644 + 10 bytes
PL_ProcessPendingEvents(PLEventQueue * 0x01266f20) line 574 + 9 bytes
_md_EventReceiverProc(HWND__ * 0x00660256, unsigned int 49509, unsigned int 0, long 19296032) line 1335 + 9 bytes
USER32! 77e11b60()
USER32! 77e11cca()
USER32! 77e183f1()
nsAppShellService::Run(nsAppShellService * const 0x012db5c8) line 472
main1(int 2, char * * 0x00276ef8, nsISupports * 0x00276f40) line 1522 + 32 bytes
main(int 2, char * * 0x00276ef8) line 1883 + 37 bytes
mainCRTStartup() line 338 + 17 bytes
KERNEL32! 77e8d326()
Comment on attachment 102521 [details] [diff] [review]
PLDHashTable patch, v2.

r=dmose after casts are changed to C++ style
Attachment #102521 - Flags: review+
Comment on attachment 102521 [details] [diff] [review]
PLDHashTable patch, v2.

since I'm crashing with this, and there's some weirdness, marking this needs
work.
Attachment #102521 - Flags: superreview+
Attachment #102521 - Flags: review+
Attachment #102521 - Flags: needs-work+
Attachment #101103 - Attachment is obsolete: true
Attached patch PLDHashTable patch v3 (obsolete) — — Splinter Review
Replaced C style casts with appropriate NS_(STATIC|REINTERPRET)_CAST macros.
Added more error handling and NS_ASSERTIONs.
Attachment #102521 - Attachment is obsolete: true
Attached patch PLDHashTable patch v4 (obsolete) — — Splinter Review
This patch seems to work much better -- no longer getting zero length tokens in
the hash tables, which seemed to stem from using an alignment value of 1 when
calling PL_InitArenaPool() -- now using an alignment of 2, and an arena size of
16K.
Attachment #102527 - Attachment is obsolete: true
Comment on attachment 102658 [details] [diff] [review]
PLDHashTable patch v4

A few error checking nits:

>+static void PR_CALLBACK MoveEntry(PLDHashTable* table,
>+                                  const PLDHashEntryHdr* from,
>+                                  PLDHashEntryHdr* to)
>+{
>+    const Token* fromToken = NS_STATIC_CAST(const Token*, from);
>+    Token* toToken = NS_STATIC_CAST(Token*, to);
>+    if (fromToken->mLength == 0) {
>+        NS_WARNING("zero length token in table!");

Should this really be an assertion rather than just a warning?  I.e., is a
zero-length token ever valid?

>-Tokenizer::Tokenizer() : mTokens(NULL, NULL, NULL, NULL)
>+Tokenizer::Tokenizer()
> {
>-    PL_InitArenaPool(&mTokenPool, "Tokens Arena", 4096 * sizeof(Token), sizeof(double));
>-    PL_InitArenaPool(&mWordPool, "Words Arena", 32768, sizeof(char));
>+    PRBool ok = PL_DHashTableInit(&mTokenTable, &gTokenTableOps, nsnull, sizeof(Token), 256);
>+    NS_ASSERTION(ok, "mTokenTable failed to initialize");

Since PL_DHashTableInit ultimately ends up allocating memory, and a failure
there could be the cause of this failure, failure should probably be checked in
all builds, not just asserted in debug builds.

> Token* Tokenizer::add(const char* word, PRUint32 count)
> {
>-    nsCStringKey key(word);
>-    Token* token = (Token*) mTokens.Get(&key);
>+    Token* token = get(word);
>     if (!token) {
>-        token = newToken(word, count);
>-        if (token && token->mWord.get()) {
>-            // NOTE:  to save space, sharedKey shares the string pointer with the token itself.
>-            // This is safe, as long as the token's lifetime exceeds the hash table / key itself.
>-            // Since the token string is now arena allocated, this will always be true.
>-            nsCStringKey sharedKey(token->mWord.get(), token->mWord.Length(), nsCStringKey::NEVER_OWN);
>-            mTokens.Put(&sharedKey, token);
>+        PLDHashEntryHdr* newEntry = PL_DHashTableOperate(&mTokenTable, word, PL_DHASH_ADD);
>+        if (newEntry) {
>+            PRUint32 len = strlen(word);
>+            token = NS_STATIC_CAST(Token*, newEntry);
>+            token->mWord = copyWord(word, len);
>+            NS_ASSERTION(token->mWord, "copyWord failed");

Same as previous comment: this should probably be more than just an assertion
for the same reason.

Other than these nits, it looks good.  Fix them and you've got r=dmose.
Attachment #102658 - Flags: review+
Summary: [RFE] Add Bayesian antispam filters per Paul Graham's design → Add Bayesian antispam filters per Paul Graham's design
Attached patch PLDHashTable patch v5 (obsolete) — — Splinter Review
This uses a 2-byte aligned string arena (1-byte doesn't seem to work, 4-byte
seems wasteful), and addresses error checking concerns.
Attachment #102658 - Attachment is obsolete: true
Alias: bayesian
Comment on attachment 103035 [details] [diff] [review]
PLDHashTable patch v5

Patch checked in.
Attachment #103035 - Attachment is obsolete: true
Attached patch TokenEnumeration patch v1 (obsolete) — — Splinter Review
This patch introduces a new helper class, TokenEnumeration, to avoid copying
tokens where possible, and renames getTokens() to copyTokens().
Comment on attachment 103118 [details] [diff] [review]
TokenEnumeration patch v1

Are you sure you don't want the fix for bug 174859?  Then you could avoid the
overhead of copying tokens rather than token pointers in classifyMessage's call
to copyTokens.

Comments apart from this patch:

- You might use a better magic number than 0xFEEDFACE -- see the magic strings
used by the XPCOM typelib file format
(http://www.mozilla.org/scriptable/typelib_file.html) and the XPCOM FastLoad
file format
(http://lxr.mozilla.org/mozilla/source/xpcom/io/nsFastLoadFile.h#139), both
inspired by PNG's magic string header.

- Nit: last_delimiter violates the otherwise-prevailing interCaps style for
local variable names.

Food for future revs, sr=brendan@mozilla.org on this one.

/be
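
For anyone unfamiliar with the PNG-style header mentioned above: the idea is a signature containing a high-bit byte and both CR-LF and LF, so that a file mangled by text-mode transfer or 7-bit stripping fails the check immediately. The sketch below shows the pattern only; the magic string and the two-counter header layout are invented for illustration and are not the format used by the filter's training file.

```python
import io
import struct

# Hypothetical signature modelled on PNG's: high-bit byte, name, CR-LF, ^Z, LF.
MAGIC = b"\x89JUNK\r\n\x1a\n"
HEADER = struct.Struct(">II")  # good-message count, bad-message count (big-endian)


def write_training_header(stream, good_messages, bad_messages):
    """Write the signature followed by the two message counters."""
    stream.write(MAGIC)
    stream.write(HEADER.pack(good_messages, bad_messages))


def read_training_header(stream):
    """Validate the signature and return (good_messages, bad_messages)."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not a training file, or mangled in transfer")
    data = stream.read(HEADER.size)
    if len(data) != HEADER.size:
        raise ValueError("truncated header")
    return HEADER.unpack(data)


def header_bytes(good_messages, bad_messages):
    """Convenience helper: build the header in memory and return its bytes."""
    buf = io.BytesIO()
    write_training_header(buf, good_messages, bad_messages)
    return buf.getvalue()
```

A plain 32-bit constant such as 0xFEEDFACE identifies the file but cannot detect line-ending conversion, which is the extra protection the longer signature gives.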
Attachment #103118 - Flags: superreview+
Depends on: 174859
Comment on attachment 103118 [details] [diff] [review]
TokenEnumeration patch v1

Fixed style problem, and checked in.
Attachment #103118 - Attachment is obsolete: true
Please use MPL/GPL/LGPL for the files in
mozilla/mailnews/extensions/bayesian-spam-filter. It looks like
mozilla/mailnews/extensions/bayesian-spam-filter/MANIFEST can be removed. The
project file
(mozilla/mailnews/extensions/bayesian-spam-filter/macbuild/BayesianFilter.xml)
is referring to files for mdn, could you correct it?
Hmmm... Just thought of something.  If the "spam-or-not" rating is stored in the
message header/body, will these algorithms fail when a spammer includes
something like "X-Mozilla-Spam: not" (or whatever the real header might be) in a
spam message?
that's a good point, but I don't think we're planning on storing the spam header
in the message. If we did, we should filter out whatever came from the server,
and put our own in.
Re comment 60 - would this be a problem if you read your mail from two different
places (e.g. with IMAP, from home and from work)? Suppose I read my mail at
work, mark as spam or whatever, then at home I read it again and the header
added by my work installation is stripped as my home installation downloads the
message from the server.

A way round this would be to store something that's unique to this installation
in the header, and ignore the header if it doesn't match our installation, e.g.
X-Mozilla-Spam: 3njh84n5 not
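
A rough sketch of that check, assuming the header really were stored in the message. The header name is just the placeholder used in these comments, and the "install id, then verdict" layout follows the example line above; none of this reflects an actual Mozilla header.

```python
from email import message_from_string

HEADER_NAME = "X-Mozilla-Spam"  # placeholder name from the comments above


def trusted_verdict(raw_message, install_id):
    """Return the stored verdict only if it was written by this installation.

    Headers without our install id (for example one a spammer added, or one
    written by another installation) are ignored and None is returned.
    """
    msg = message_from_string(raw_message)
    for value in msg.get_all(HEADER_NAME, []):
        parts = value.split()
        if len(parts) == 2 and parts[0] == install_id:
            return parts[1]
    return None
```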
Russell, in IMAP, you can't just add a header to a message. The way to solve
this with IMAP is to add a custom keyword to the message, on the server. This
will only work on some IMAP servers, but I don't think there's a solution that
will work on all IMAP servers.
I just thought I'd mention that the SpamBayes project
http://spambayes.sf.net/ has now pretty much completed their algorithm research
phase, and now have a "clear winner" algorithm based on chi-square
probabilities, which you may want to consider upgrading to.  In contrast to
Graham's original algorithm and the Robinson variations in use a month ago, the
"chi-combining" algorithm has a usable "unsure" range where it "knows that it
doesn't know" what a particular e-mail is.  The earlier algorithms were either
sure all the time, or unsure in unusable ways (e.g. drifting "unsure" ranges). 
The new algorithm can still be "trained" incrementally using pretty much the
same word-frequency data, and has far fewer biases and tweak factors.  There
have also been many tokenizer improvements.  Last, but not least, the project
has a usable Outlook 2000 add-in that can be looked at for UI ideas and perhaps
used for testing to verify that same-or-better results are produced by the
Mozilla implementation of the classifier.

One of the most difficult things about implementing this kind of filtering is
validating your algorithms; the spambayes people have put an enormous amount of
both theoretical work and large scale testing for statistical validation.   It
would be good to make use of it here.
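
For reference, a sketch of the chi-square combining step as I understand it from the spambayes discussion. It is written from memory rather than copied from their code, so treat the details as assumptions; the inputs are per-word spam probabilities strictly between 0 and 1.

```python
import math


def chi2q(x2, degrees):
    """Upper-tail probability of a chi-square value, for an even number of degrees."""
    if degrees % 2:
        raise ValueError("degrees of freedom must be even")
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, degrees // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1


def chi_combine(probs):
    """Combine word probabilities into a score: near 1 spam, near 0 good, near 0.5 unsure."""
    if not probs:
        return 0.5
    n = len(probs)
    # Evidence for spam: how unlikely it is that all the (1 - p) values are this small.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Evidence for good mail: the same test applied to the p values.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0
```

When strong spam clues and strong good clues are both present, s and h are both close to 1 and the score lands near 0.5, which is the usable "unsure" range described above.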
Does this just count words in the message, or does it consider other clues as
"words"? I mean things like "Message is HTML", "Out of date", etc.

Bug 151622 is asking for score-based filtering, which covers most of these
ideas, but with manual scoring. It seems a good idea to use the Bayesian engine
to make it automatic.
Some thoughts on the UI for this.

Add a special mail folder called Spam. At some point, add the ability to have
messages older than a certain age purged.

Add another menubar button next to Delete called Spam, when viewing all mail
folders but Spam. When the user presses this on a message, it goes to the Spam
folder and gets processed and added to the filter list. At some point, add
the ability to automatically report the spam to a blacklist/spam maintainer.

When viewing the Spam folder, the button becomes NotSpam (or something that
sounds better). Pressing this moves the mail back into the proper folder (as
determined by other filters the user has) and processes the mail removing it
from the bad mail list and adding it to the good mail list.

The first time the user hits the Spam button, bring up a brief descriptive dialog
explaining what is going on and allowing the user to set other filter options
(such as the report to blacklist maintainer, for instance).

Any mail that comes in and doesn't get filtered as Spam is processed and added
to the good list.

In practice, the user would start off with empty lists for both. He could go
through existing mail and mark it as Spam, starting off the list, but this isn't
necessary. All they would need to do is mark new mail as Spam. This means that
the filter would not have as high an accuracy at first, but would improve
quickly. It is also very simple for the average user to start with.
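
The bookkeeping behind the Spam / NotSpam buttons might look roughly like this. The class and method names are hypothetical; the point is only that correcting a classification means removing the message's tokens from one list before adding them to the other.

```python
from collections import Counter


class TrainingData:
    """Token counts for the good and bad lists, starting empty as described above."""

    def __init__(self):
        self.good = Counter()
        self.bad = Counter()
        self.good_messages = 0
        self.bad_messages = 0

    def learn(self, tokens, is_spam):
        """Add one message's tokens to the bad list (spam) or the good list."""
        if is_spam:
            self.bad.update(set(tokens))
            self.bad_messages += 1
        else:
            self.good.update(set(tokens))
            self.good_messages += 1

    def forget(self, tokens, was_spam):
        """Undo an earlier learn() for the same message."""
        counts = self.bad if was_spam else self.good
        counts.subtract(set(tokens))
        for token in [t for t, c in counts.items() if c <= 0]:
            del counts[token]  # drop words no longer present in any message
        if was_spam:
            self.bad_messages = max(self.bad_messages - 1, 0)
        else:
            self.good_messages = max(self.good_messages - 1, 0)

    def reclassify(self, tokens, now_spam):
        """Move a message from one list to the other (the Spam / NotSpam button)."""
        self.forget(tokens, was_spam=not now_spam)
        self.learn(tokens, is_spam=now_spam)
```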
Tim: see bug 169638. UI which is almost ready seems pretty similar to what you
suggested.

Adding this feature's UI bug, bug 169638, to dependencies so that people become
aware of its existence.
Adding this bug to dependencies in bug 11035. This is after all a spam-blocking
feature, right?
Blocks: 11035
We'd better not use the term "Spam" in the UI. Spam is a registered trademark
and using it within the product UI could draw a lawsuit. The meat company that
owns the trademark *is* actively trying to defend its trademark and would love a
victory against a high-profile target like Mozilla. Suggest "Junk Mail" instead.
Not true, according to http://www.spam.com/ci/ci_in.htm

> We do not object to use of this slang term to describe UCE, although we do
> object to the use of our product image in association with that term. Also, if
> the term is to be used, it should be used in all lower-case letters to
> distinguish it from our trademark SPAM, which should be used with all uppercase
> letters.
How will the filtering work for IMAP connections?  As I understand it, the
filters will run as the headers are generated for the message list, however a
typical IMAP connection does not retrieve any part of the message until it is
read/moved/copied/etc.  Therefore, in a typical IMAP connection, only the
message headers will be available for running through the filter.  Will the
filter work only on the headers, or will it force a download of the text/html,
or text/plain parts of messages so it can run against them (or maybe an option
for either)?  BTW, I don't know how much information is passed via the initial
IMAP listing, but if it lists the attachment headers, that would be good enough
since so many spam messages are obvious with idiotic content like:
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: base64
Michael: the UI for message classification has been enabled in trunk builds (see
bug 169638).

From my testing it seems that bodies of new, incoming messages are fetched and
run through the filter, while pre-existing messages have an icon indicating that
they haven't been processed yet. You can select them, then Tools->Run Junk Mail
Controls on them manually.
What do we have to do to test this?  Current nightly has the UI, but the UI
doesn't do anything at all.
Nathan: There is a problem with this in (at least) the windows installer
versions of the nightlies.  The .zip appears to function correctly.  See bug 179150
it works in installer builds starting with 2002-11-08-13
Been trying this out excitedly.
Few questions.
Does the Junk Mail log do anything?  I don't see anything being written to it.
Messages seem to be labelled as junk after they have all been retrieved.  So my
attempts to use a filter that has all incoming messages with junk flag be sent
to a particular folder fails.  I have to choose to apply filters to current
folder to have them move.  Any chance they can get flagged before the filters
are applied?
My junk mail log shows all occurrences of a mail (in Inbox) getting automatically
marked as junk (after they have been retrieved).

Marking messages as junk (or not junk) automatically after they have been moved to a
folder by a filter is bug 180153 (for POP).
Right.  Mine get marked as well.  And I'm retrieving locally.
It isn't a problem with marking, it is the problem that I have a rule that says
if something is marked junk, move it to another folder.  All the other rules get
applied on retrieval, but that one doesn't, since on retrieval, status is still
undefined.  Status gets switched immediately after all are retrieved, but that
is too late for the rule.
Or at least, that's what it seems is happening.
The junk mail code has landed on the trunk.  New bugs are being filed as
blocking bug 11305, so if you have an issue with Junk Mail, see if it's in the
dependency tree in that bug, and if not, open a new one.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Dan: Did you really mean bug 11305? That's a VERIFIED-WONTFIX about
nsAppleSingleDecoder or something.
He probably meant Bug 11035
IDEA:  Address Book Immunity Rule
-------------------------------------
One of the biggest problems with making spam filters is making a filter that
keeps unwanted stuff out without trashing things you want.

How about making a rule that makes email with an address from the user's address
book immune to any _or_ particular filters?

On a user interface this could be a check box titled:
 "don't apply this rule to people listed in my address book".

This would cut down on email getting lost from an overly tight filter and it
would cut down on having to wade through a folder full of junk mail to make sure
you are not throwing out anything important.

Just a thought...

Steve
we do that - it's called address book whitelisting, and if it's turned on, we
won't apply the spam filter to a message from someone in your addressbook.
There's a plan to extend it so you can check against multiple address books, but
as of today, you can use a single address book as your whitelist.
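
In outline, the whitelist check amounts to something like the following. This is a sketch of the behaviour described, not the actual code; the function name is made up and the address book is modelled as a plain set of lowercase addresses.

```python
from email.utils import parseaddr


def should_run_junk_filter(from_header, address_book, whitelisting_enabled=True):
    """Skip the junk filter when the sender is in the (single) whitelist address book."""
    if not whitelisting_enabled:
        return True
    _display_name, address = parseaddr(from_header)
    return address.lower() not in address_book
```

For example, with an address book of {"steve@example.com"}, a message from "Steve <Steve@Example.com>" bypasses the filter, while a message from any other sender is still classified.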
FYI, a different anti-spam technique is noted in bug #187044, called
"challenge-response".  It appears to me that the Bayesian and challenge-response
techniques can be used together - the combination should REALLY cut down
on spam, and make it harder for spammers to overcome (since they would have
to circumvent TWO mechanisms).



It would also be nice to see bug 184948 completed at some point in the future.
That way, fewer spammers would know their message was received.

- Adam
Product: MailNews → Core
Product: Core → MailNews Core