Closed
Bug 163188
(bayesian)
Opened 22 years ago
Closed 22 years ago
Add Bayesian antispam filters per Paul Graham's design
Categories
(MailNews Core :: Filters, enhancement)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ericweb, Assigned: beard)
References
(Blocks 1 open bug)
Details
Attachments
(3 files, 13 obsolete files)
2.88 KB, patch | dmosedale: review+ | sspitzer: superreview+
3.43 KB, patch
1.59 KB, patch | peterv: review+
Lisp guru Paul Graham has written a brilliant spam filtering system (at the above URL) based on Bayesian statistical evaluation of the tokens of incoming mail. The spam and non-spam probabilities of individual words are derived by scanning and comparing corpuses of the individual's received spam mail and non-spam mail.

The brilliant things about this approach are that:
1) it is extremely difficult for spammers to defeat it
2) it automatically evolves as spammers evolve their pitches
3) it tailors its behavior to the individual's actual received emails

A Moz Mail client implementation of this might have the following features:
A) "Scan Spam Email" and "Scan Nonspam Email" context menu options in the mail folders window that would scan a folder's emails and process them as spam or nonspam (seeding the Bayesian statistical evaluation); you'd use this for "initializing" the filter when getting started and training it on your accumulated nonspam and spam email
B) A "Delete As Spam" right-click menu option that would not only delete the email but would also use it as input to the spam filter
C) An initial database of default token weights to get the filter started. (I'm sure Paul Graham would be happy to provide his corpuses and derived weightings to an open source project such as Mozilla. He clearly wants others to make use of his work.)
D) A checkbox in prefs that turns this filtering off. (It will be a popular and accurate enough feature that it should be on by default to promote awareness and viral adoption of Moz Mail.)
E) A "Suspected Spam" folder into which suspected spam email is automatically filtered.

The first email program with this kind of filtering built in will be the first killer app since the browser itself, as the amount of spam is rapidly increasing. It could be a major driver of Mozilla adoption if Moz Mail had this feature. (e.g. I'd switch back to POP if I had to, to take advantage of this feature.)

Of course, mail server-based implementations are also possible, but mail server vendors and ISPs may resist implementing and adopting them because of the processing load it would put on their servers. Client CPU cycles, on the other hand, are cheap. It would be great if someone would add this to Moz Mail; as far as I'm aware, no commercially available antispam system supports it.
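For readers who haven't seen this style of filter, here is a minimal Python sketch of the kind of token scoring the description refers to. It is a simplified illustration, not Graham's exact algorithm and not code from the Mozilla tree: the function names, the 0.4 value for unseen words, the clamping to [0.01, 0.99], and the 15-token limit are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a message into lowercase word tokens (deliberately naive)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def train(messages):
    """Count how many times each token appears across a corpus."""
    counts = Counter()
    for msg in messages:
        counts.update(tokenize(msg))
    return counts

def token_prob(token, spam, ham, n_spam, n_ham, unknown=0.4):
    """Estimate P(spam | token) from relative frequencies in each corpus."""
    s, h = spam.get(token, 0), ham.get(token, 0)
    if s + h == 0:
        return unknown              # assumption: unseen words lean innocent
    s_freq = min(1.0, s / max(n_spam, 1))
    h_freq = min(1.0, h / max(n_ham, 1))
    p = s_freq / (s_freq + h_freq)
    return max(0.01, min(0.99, p))  # never be fully certain about one token

def spam_score(text, spam, ham, n_spam, n_ham, top_n=15):
    """Combine the most decisive token probabilities with naive Bayes."""
    probs = [token_prob(t, spam, ham, n_spam, n_ham) for t in set(tokenize(text))]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = prod_ham = 1.0
    for p in probs[:top_n]:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)

# Tiny made-up corpora, two messages each.
spam_corpus = ["buy cheap pills now", "cheap pills cheap offer"]
ham_corpus = ["meeting notes attached", "lunch meeting tomorrow"]
spam_counts, ham_counts = train(spam_corpus), train(ham_corpus)
```

A message made only of words seen in the spam corpus scores close to 1, one made only of words from the good corpus scores close to 0, and a message with no tokens scores exactly 0.5.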
Bravo! I was just chatting on #mozillazine that putting this in would be the thing that would inspire me to learn Mozilla's structure... Thought implementing it might make a nice easy project that would make my life sooo much more pleasant. (no more continual .procmailrc updating!) No knowledge of said structure at the moment, so would just like to register an interest in helping out, as well as seeking good places to learn about Mail/News methods.
Oh yeah, this would totally make my day! Gary Arnold also seems to have written this thing in Perl, which might be easier to understand for a lot of people: http://www.garyarnold.com/projects.php#bayespam
And more source to analyze: Written by ESR in C: http://www.tuxedo.org/~esr/bogofilter/ and someone made a "badwordlist" for that which can be found at http://www.xtdnet.nl/paul/spam/bogofilter/
Check bug 11035. Someone is working on the foundation work, and comment 31 in that bug by Alec Flett seems like a very good idea. That way I can make the spam filter run after all my other filters to achieve better results. I think eventually we would have a number of filter plugins like "Spam Assassin (bug 11035)", "Bayesian (this bug)", "whitelist (bug 120160)" that users can choose from instead of one general solution. Should we add this as a dependency?
Comment 5•22 years ago
I _DEFINITELY_ would like to see this happen; a tool that integrates into the mail browser to deal with spam is really needed. However, I think you can make the user interface even simpler.

Create a nice big button named "SPAM" - press it whenever you see a SPAM message like you would press "Delete", but it will then do a number of configurable actions - and give it a useful default, since many users don't configure their systems. I suggest as the default that it
(1) forge and send back a "no such user" message; this will remove you from some lists in a few cases and also warn others about forging emails.
(2) save the message in a specially-named "spam" folder, for use in a naive Bayesian statistical analysis program (as Graham describes).

Other options that could be turned on include forwarding a copy to a list of email addresses (e.g., your local "abuse" account, the newsgroup news.admin.net-abuse.sightings, and email addresses of well-known spam killers), or calling on other spam killers to check it, like SpamAssassin.

In the dialogue for configuring the SPAM button's actions, perhaps there could be a radio selection beside each action like "don't do it when you press SPAM", "do it when you press SPAM", or "confirm before doing it when you press SPAM" - that way, you can turn on "ask me before sending to abuse" or whatever.

Also, you need to pre-create a "Suspected Spam" folder if one doesn't already exist. When the "spam" folder grows to 100 messages, start up the naive Bayesian analysis, and assume that any email saved in "spam" is spam, and anything saved in a folder other than the "Suspected Spam" or "Spam" folders is good. Automatically re-run the analysis every 100 messages or every week, whichever comes first.

Note that from a user's point of view, the only thing they need to know is that they need to press the SPAM button when they see spam. Everything else is automatic. Simple, n'est-ce pas?

There's a great deal of study on the topic of Naive Bayesian approaches to spam, and it's surprisingly effective. Here are some references studying it:
http://arxiv.org/abs/cs.CL/0006013 An evaluation of Naive Bayesian anti-spam filtering
http://arxiv.org/abs/cs/0008019 An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages
http://arxiv.org/abs/cs/0009009 Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach
http://www.lsi.upc.es/~carreras/pub/boospamev.ps
http://www.monmouth.edu/~drucker/SVM_spam_article_compete.PDF

Others have at least partly implemented it:
http://www.ai.mit.edu/~jrennie/ifile/ Ifile implemented the idea many years ago. This is useful to show that the idea has been around a while.
http://crm114.sourceforge.net CRM114 can do this, and can extend the probabilities to phrases and not just individual words.
Eric Raymond's implementation has already been mentioned; that might actually be a better starting point for actual code.

Hope all this helps!!
Updated•22 years ago
Summary: want Bayesian antispam filters per Paul Graham's design → [RFE] Add Bayesian antispam filters per Paul Graham's design
Comment 6•22 years ago
*** Bug 165725 has been marked as a duplicate of this bug. ***
How about a mail filtering API of some sort that would allow pluggable filters? That way I could use a Bayesian filter and someone else could use Vipul's Razor or a whitelist program. That way we might not have to wait as long when Joe Spammer finds a way past the current filter and we have to tighten up the spam filters or combine methods.
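One way to picture the pluggable-filter idea is a small chain of classifiers that each get a vote. The Python sketch below is purely illustrative: the class names, the `classify` method, and the "first definite answer wins" policy are assumptions for the example and say nothing about how nsIMsgFilterPlugin ended up being designed.

```python
from abc import ABC, abstractmethod

class JunkFilterPlugin(ABC):
    """Hypothetical plugin interface: each plugin votes on a message."""
    @abstractmethod
    def classify(self, message: dict):
        """Return True (junk), False (not junk), or None (no opinion)."""

class WhitelistFilter(JunkFilterPlugin):
    def __init__(self, trusted_senders):
        self.trusted = {s.lower() for s in trusted_senders}

    def classify(self, message):
        return False if message.get("from", "").lower() in self.trusted else None

class KeywordFilter(JunkFilterPlugin):
    """Stand-in for a statistical filter; flags a fixed set of words."""
    def __init__(self, bad_words):
        self.bad_words = set(bad_words)

    def classify(self, message):
        words = set(message.get("body", "").lower().split())
        return True if words & self.bad_words else None

def run_filters(plugins, message):
    """Ask each plugin in order; the first definite answer wins."""
    for plugin in plugins:
        verdict = plugin.classify(message)
        if verdict is not None:
            return verdict
    return False  # assumption: unclassified mail is treated as not junk

chain = [WhitelistFilter(["boss@example.com"]), KeywordFilter(["viagra"])]
```

With this ordering a whitelisted sender is never flagged, whatever the body says; swapping the order of the list would change that, which is the kind of choice a plugin API leaves to the user.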
Comment 8•22 years ago
Just for the record: another article, by a statistician this time, on a method to perhaps improve the original "Bayesian" algorithm to one which actually is Bayesian. :-) http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html Gerv
Comment 9•22 years ago
I've started working on this and have a simple prototype. I'll post again once I get something decent.
Updated•22 years ago
Comment 11•22 years ago
For algorithmic ideas (if not code :-), the Python community has started a project aimed at developing optimal parsing and classifier algorithms, varying within the ideas of Paul Graham. Using rigorous testing methods, lots of ideas of the form "wouldn't it work better if you changed this parameter or if you treated that header specially" are accepted and rejected. See http://spambayes.sf.net Developers welcome!
Comment 12•22 years ago
Just thought I'd mention that I'm the author of the essay mentioned in comment #8 above (http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html)... we're discussing it on the list Guido mentions (http://spambayes.sf.net). I, for one, would be very interested in any discussion of using Graham-derived stuff in Mozilla. Gary Robinson
Comment 13•22 years ago
This is the beginnings of a prototype for bayesian spam filtering, courtesy of peterv. This code has some issues right now, and is reportedly not catching spam as well as earlier versions, so I need to track down what's going on there. The biggest issue, though, is that right now a JS hash is used for word counts, and read and written to disk in one shot. We're gonna want a real on-disk DB here to avoid tons of bloat, so I'm guessing mork is not gonna be suitable... but I'm happy to be proven wrong. Perhaps we need to fall back to the ancient dbm code that the NSS folks use?
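To make the "real on-disk DB" concern concrete, here is a rough Python sketch in which token counts are updated incrementally instead of being loaded and rewritten in one shot. SQLite is used only because it ships with Python; it is a stand-in, not one of the stores discussed here (mork, dbm), and the `TokenStore` class and its schema are hypothetical.

```python
import sqlite3

class TokenStore:
    """Token counts kept in SQLite so updates are incremental."""
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tokens ("
            " word TEXT PRIMARY KEY,"
            " good INTEGER NOT NULL DEFAULT 0,"
            " bad INTEGER NOT NULL DEFAULT 0)"
        )

    def add(self, words, junk):
        column = "bad" if junk else "good"   # fixed names, never user input
        with self.db:                        # one transaction per message
            for word in words:
                self.db.execute(
                    "INSERT OR IGNORE INTO tokens (word) VALUES (?)", (word,)
                )
                self.db.execute(
                    f"UPDATE tokens SET {column} = {column} + 1 WHERE word = ?",
                    (word,),
                )

    def counts(self, word):
        """Return (good, bad) for a word, or (0, 0) if it was never seen."""
        row = self.db.execute(
            "SELECT good, bad FROM tokens WHERE word = ?", (word,)
        ).fetchone()
        return row if row else (0, 0)

store = TokenStore()                         # pass a file path to persist
store.add(["cheap", "pills", "cheap"], junk=True)
store.add(["meeting", "cheap"], junk=False)
```

Only the rows touched by a message are written, so memory use no longer grows with the size of the whole vocabulary.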
Comment 14•22 years ago
This takes the code from the last patch and hooks it up into the Mozilla build, and partially hooks it up with Seth's UI. It still needs a bunch of cleanup, but it's a start. This patch includes Seth's "turn on the UI" patch, as well as the nsIMsgFilterPlugin interface and the hooks into IMAP to use it.
Attachment #99870 -
Attachment is obsolete: true
Comment 15•22 years ago
It might be nice to automatically notify the sender of any email that is automatically sorted as spam that his email was sorted as spam and not received by the intended recipient. This would help to catch false positives that might otherwise go unnoticed.
Comment 16•22 years ago
But if we automatically respond to the spammers, it just confirms that the address is valid, just as many "opt-out" links just confirm an active email address. The best bet is to ignore or send a false undeliverable.
Comment 17•22 years ago
I agree with #16. If one is concerned about false positives, one should just scan the spam folder briefly before emptying it. Replying to spam is the last thing you want to do. (right up there with opening them with HTML parsing enabled)
Comment 18•22 years ago
Patch v3; includes changes from peterv to the algorithm (starting to work nicely :-). Also has various infrastructure changes so that IMAP analyzes every message as it comes in (though it doesn't yet indicate that in the UI).
Attachment #100102 -
Attachment is obsolete: true
Comment 20•22 years ago
BTW, we should credit Paul Graham, Gary Robinson and the spambayes list (Tim Peters et alia) in the filter code. Though I didn't use the lisp or the python code I stole a lot of the ideas for our implementation from them. ;-)
Comment 21•22 years ago
OK, I've checked in peterv's existing code in mailnews/extensions/bayesian-spam-filter; it's not yet part of the build. I also added a comment crediting various folks as Peter suggested. The stuff that's checked in may not quite work, but it's close. Next step: get the rest of the infrastructure changes in.
Comment 22•22 years ago
Is it too far beyond the scope of this bug to suggest the capability of pooling the "good" and "bad" corpus derived from many users? Hypothetically, in this way a company could designate a central repository for "interesting" tokens so that the scores from other employees can be evaluated along with personal dbs - this immediately reduces the false-positive rates for common contacts with suppliers, sales, etc. Or would such a repository of necessity need to be a transactional-based server (sql of some flavor for example) to avoid file-locking issues, and therefore would require much more development? An interesting side-development of this would be to create an infrastructure (between many sites) that share their corpus results and use this as a baseline against which an individual's behaviours (prior picking of spam or removing mail from the spam-holding folder) are given weight.
Comment 23•22 years ago
Some temporary code to get the current bayesian bits that are checked in hooked up to seth's UI. Eventually this will be replaced by code that goes through the filtering plugin interface, but that's not there yet.
Attachment #100530 -
Attachment is obsolete: true
Comment 24•22 years ago
Micheal: an interesting idea, but definitely beyond the scope of this particular bug. Feel free to file another about it, if you wish.
Comment 25•22 years ago
Comment on attachment 100812: temporary glue to hook up junk mail button to bayesian code
r=bienvenu
Attachment #100812 -
Flags: review+
Comment 26•22 years ago
Comment on attachment 100812: temporary glue to hook up junk mail button to bayesian code
I'm not sure this is right. Why are you setting .label on the message header, and not the .score attribute? Instead of function mark(), can we call it something like function setSpamScore()?
Comment 27•22 years ago
Attachment #100812 -
Attachment is obsolete: true
Comment 28•22 years ago
Comment on attachment 100858: temp glue patch, v2
sr=sspitzer
Attachment #100858 -
Flags: superreview+
Comment 29•22 years ago
Re comment 22, comment 24, there are various downsides to a pooled repository: * Possibility of it being 'polluted' by malicious/incompetent users; * What different people consider as spam may not be the same; * Different people get different types of mail, therefore their 'good' lists in particular are likely to be very different; and * If everybody uses the same lists, spammers can tune their content based on these lists and thus avoid the filter - one of the strengths of the system is that everyone's list is different, so it's impossible to produce mail to get round every list.
Comment 30•22 years ago
Comment on attachment 100858: temp glue patch, v2
r=peterv, note that this will still mark the message as read.
Attachment #100858 -
Flags: review+
Comment 31•22 years ago
Is there a target date or milestone on this bug? There's been a lot of activity but no milestone has been set. I see the reviews are done, so I assume once approval has been obtained the code will be checked into the tree?
Assignee
Comment 32•22 years ago
This feature is very much a work in progress. The architecture to support filter plugins is still being worked out.
Status: NEW → ASSIGNED
Comment 33•22 years ago
Fix problems related to the score property; make mark() iterate instead of using callbacks.
Attachment #100858 -
Attachment is obsolete: true
Comment 34•22 years ago
Comment on attachment 100954: temp glue patch, v3
sr=sspitzer. Sorry about the misleading ".score" thing. Note to the reviewer: we don't need to call setScore(), as the doCommand() code does that.
Attachment #100954 -
Flags: superreview+
Comment 35•22 years ago
Comment on attachment 100954: temp glue patch, v3
Carrying forward peterv's r=.
Attachment #100954 -
Flags: review+
Comment 36•22 years ago
Glue patch checked in.
Comment 37•22 years ago
Patch to turn on the UI that seth put in (duplicated from his patch in the front-end bug) and cause the bayesian stuff to be installed at build time.
Assignee
Comment 38•22 years ago
This implements a new build option, junkmailfilter, and adds appropriate build steps to the Mac build system.
Comment 39•22 years ago
Can someone please set an eta or milestone? Also, can we use the temporary "glue" until the message filtering plugin interface comes through or do we have to wait on that?
Comment 40•22 years ago
CW project change to build nsIMsgFilterPlugin.idl on the Mac.
Comment 41•22 years ago
Comment on attachment 100978: Mac build system patch.
r=peterv
Attachment #100978 -
Flags: review+
Comment 42•22 years ago
gcc 3.1.1 points out that rv can be used uninitialized if aCount == 0. Here's a patch.
Assignee
Comment 43•22 years ago
Comment on attachment 102510: gcc warning fix, v1
r=beard
Attachment #102510 -
Flags: review+
Assignee
Comment 44•22 years ago
This patch uses PLDHashTable instead of nsObjectHashTable. This allows it to allocate all Token objects in place, and does aggressive key sharing. Still has some glitches right now, so this is a work in progress.
Assignee
Updated•22 years ago
Attachment #102510 -
Attachment is obsolete: true
Assignee
Comment 45•22 years ago
This simplifies the structure of the Token significantly, and uses the stub moveEntry and clearEntry PLDHashTable operators. Pooled allocation of token strings is now unconditionally used.
Attachment #102513 -
Attachment is obsolete: true
Comment 46•22 years ago
Comment on attachment 102521: PLDHashTable patch, v2.
sr=sspitzer
Attachment #102521 -
Flags: superreview+
Comment 47•22 years ago
I've been noticing some weirdness with this patch (things getting slow or not working, taking forever to shut down). I got a crash while classifying:

PL_DHashStringKey(PLDHashTable * 0x04a72344, const void * 0xdddddddd) line 79 + 22 bytes
HashKey(PLDHashTable * 0x04a72344, const void * 0xdddddddd) line 72 + 13 bytes
PL_DHashTableOperate(PLDHashTable * 0x04a72344, const void * 0xdddddddd, int 0) line 479 + 16 bytes
Tokenizer::get(const char * 0xdddddddd) line 134 + 15 bytes
Tokenizer::remove(const char * 0xdddddddd, unsigned int 3722304989) line 157 + 12 bytes
forgetTokens(Tokenizer & {...}, Token * * 0x02de2d68, unsigned int 91) line 593
nsBayesianFilter::observeMessage(Tokenizer & {...}, const char * 0x04a6c3a8, unsigned int 1, unsigned int 2, nsIJunkMailClassificationListener * 0x00000000) line 625 + 20 bytes
MessageObserver::analyzeTokens(const char * 0x04a6c3a8, Tokenizer & {...}) line 578
TokenStreamListener::OnStopRequest(TokenStreamListener * const 0x04a6c310, nsIRequest * 0x04a6c160, nsISupports * 0x00000000, unsigned int 0) line 375
nsStreamConverter::OnStopRequest(nsStreamConverter * const 0x04a59880, nsIRequest * 0x04a6c160, nsISupports * 0x00000000, unsigned int 0) line 1099
nsStreamListenerTee::OnStopRequest(nsStreamListenerTee * const 0x04915f20, nsIRequest * 0x04a6c160, nsISupports * 0x00000000, unsigned int 0) line 66
nsOnStopRequestEvent0::HandleEvent(nsOnStopRequestEvent0 * const 0x04ad7d18) line 319 + 33 bytes
nsStreamListenerEvent0::HandlePLEvent(PLEvent * 0x04ad7d28) line 113 + 12 bytes
PL_HandleEvent(PLEvent * 0x04ad7d28) line 644 + 10 bytes
PL_ProcessPendingEvents(PLEventQueue * 0x01266f20) line 574 + 9 bytes
_md_EventReceiverProc(HWND__ * 0x00660256, unsigned int 49509, unsigned int 0, long 19296032) line 1335 + 9 bytes
USER32! 77e11b60()
USER32! 77e11cca()
USER32! 77e183f1()
nsAppShellService::Run(nsAppShellService * const 0x012db5c8) line 472
main1(int 2, char * * 0x00276ef8, nsISupports * 0x00276f40) line 1522 + 32 bytes
main(int 2, char * * 0x00276ef8) line 1883 + 37 bytes
mainCRTStartup() line 338 + 17 bytes
KERNEL32! 77e8d326()
Comment 48•22 years ago
Comment on attachment 102521: PLDHashTable patch, v2.
r=dmose after casts are changed to C++ style
Attachment #102521 -
Flags: review+
Comment 49•22 years ago
Comment on attachment 102521: PLDHashTable patch, v2.
Since I'm crashing with this, and there's some weirdness, marking this needs work.
Attachment #102521 -
Flags: superreview+
Attachment #102521 -
Flags: review+
Attachment #102521 -
Flags: needs-work+
Assignee
Updated•22 years ago
Attachment #101103 -
Attachment is obsolete: true
Assignee
Comment 50•22 years ago
Replaced C style casts with appropriate NS_(STATIC|REINTERPRET)_CAST macros. Added more error handling and NS_ASSERTIONs.
Attachment #102521 -
Attachment is obsolete: true
Assignee
Comment 51•22 years ago
This patch seems to work much better -- no longer getting zero length tokens in the hash tables, which seemed to stem from using an alignment value of 1 when calling PL_InitArenaPool() -- now using an alignment of 2, and an arena size of 16K.
Attachment #102527 -
Attachment is obsolete: true
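For anyone wondering what the alignment argument controls, the following toy bump allocator in Python shows how an arena rounds each allocation's start offset up to the requested boundary. It is only a sketch of the general technique: the `StringArena` class is hypothetical, it does not reproduce PL_InitArenaPool's behaviour, and it does not explain why an alignment of 1 led to zero-length tokens here.

```python
class StringArena:
    """Toy bump allocator that packs NUL-terminated strings into one buffer."""
    def __init__(self, size=16 * 1024, align=2):
        assert align > 0 and align & (align - 1) == 0, "align must be a power of two"
        self.buf = bytearray(size)
        self.align = align
        self.top = 0                      # next free byte

    def copy_word(self, word: str) -> int:
        """Store word and return its offset, or -1 if the arena is full."""
        data = word.encode("utf-8") + b"\0"
        # Round the next free offset up to the alignment boundary.
        start = (self.top + self.align - 1) & ~(self.align - 1)
        end = start + len(data)
        if end > len(self.buf):
            return -1
        self.buf[start:end] = data
        self.top = end
        return start

    def word_at(self, offset: int) -> str:
        """Read back the string that starts at offset."""
        end = self.buf.index(b"\0", offset)
        return self.buf[offset:end].decode("utf-8")

arena = StringArena(align=2)
first = arena.copy_word("spam")    # 5 bytes with terminator, lands at offset 0
second = arena.copy_word("eggs")   # next free byte is 5, rounded up to 6
```

With 2-byte alignment at most one padding byte is lost per string, which matches the intuition that 4-byte alignment would waste more space for short tokens.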
Comment 52•22 years ago
Comment on attachment 102658: PLDHashTable patch v4
A few error checking nits:

>+static void PR_CALLBACK MoveEntry(PLDHashTable* table,
>+                                  const PLDHashEntryHdr* from,
>+                                  PLDHashEntryHdr* to)
>+{
>+    const Token* fromToken = NS_STATIC_CAST(const Token*, from);
>+    Token* toToken = NS_STATIC_CAST(Token*, to);
>+    if (fromToken->mLength == 0) {
>+        NS_WARNING("zero length token in table!");

Should this really be an assertion rather than just a warning? I.e., is a zero-length token ever valid?

>-Tokenizer::Tokenizer() : mTokens(NULL, NULL, NULL, NULL)
>+Tokenizer::Tokenizer()
> {
>-    PL_InitArenaPool(&mTokenPool, "Tokens Arena", 4096 * sizeof(Token), sizeof(double));
>-    PL_InitArenaPool(&mWordPool, "Words Arena", 32768, sizeof(char));
>+    PRBool ok = PL_DHashTableInit(&mTokenTable, &gTokenTableOps, nsnull, sizeof(Token), 256);
>+    NS_ASSERTION(ok, "mTokenTable failed to initialize");

Since PL_DHashTableInit ultimately ends up allocating memory, and a failure there could be the cause of this failure, failure should probably be checked in all builds, not just asserted in debug builds.

> Token* Tokenizer::add(const char* word, PRUint32 count)
> {
>-    nsCStringKey key(word);
>-    Token* token = (Token*) mTokens.Get(&key);
>+    Token* token = get(word);
>     if (!token) {
>-        token = newToken(word, count);
>-        if (token && token->mWord.get()) {
>-            // NOTE: to save space, sharedKey shares the string pointer with the token itself.
>-            // This is safe, as long as the token's lifetime exceeds the hash table / key itself.
>-            // Since the token string is now arena allocated, this will always be true.
>-            nsCStringKey sharedKey(token->mWord.get(), token->mWord.Length(), nsCStringKey::NEVER_OWN);
>-            mTokens.Put(&sharedKey, token);
>+        PLDHashEntryHdr* newEntry = PL_DHashTableOperate(&mTokenTable, word, PL_DHASH_ADD);
>+        if (newEntry) {
>+            PRUint32 len = strlen(word);
>+            token = NS_STATIC_CAST(Token*, newEntry);
>+            token->mWord = copyWord(word, len);
>+            NS_ASSERTION(token->mWord, "copyWord failed");

Same as previous comment: this should probably be more than just an assertion for the same reason.

Other than these nits, it looks good. Fix them and you've got r=dmose.
Attachment #102658 -
Flags: review+
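The review point about checking failures in all builds, rather than only asserting, has a close analogue in Python, where `assert` statements are dropped when the interpreter runs with `-O`. The sketch below is an analogy only; `TokenTable` and its `allocate` callback are hypothetical and are not the patch's code.

```python
class TokenTable:
    """Hypothetical table whose setup can fail, e.g. when memory is short."""
    def __init__(self, allocate):
        # Debug-only style would be:  assert allocate(), "init failed"
        # Python drops assert statements under -O, much as an assertion
        # that exists only in debug builds, so the failure would go unnoticed.
        self.ok = bool(allocate())       # checked in every build
        self.entries = {} if self.ok else None

    def add(self, word):
        """Return the new count, or None if the table never initialised."""
        if not self.ok:                  # propagate the failure explicitly
            return None
        self.entries[word] = self.entries.get(word, 0) + 1
        return self.entries[word]

good = TokenTable(lambda: True)
broken = TokenTable(lambda: False)
```

Callers of `add` then have to handle `None`, which is the price of surviving an allocation failure instead of crashing later on a half-built table.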
Updated•22 years ago
Summary: [RFE] Add Bayesian antispam filters per Paul Graham's design → Add Bayesian antispam filters per Paul Graham's design
Assignee
Comment 53•22 years ago
This uses a 2-byte aligned string arena (1-byte doesn't seem to work, 4-byte seems wasteful), and addresses error checking concerns.
Attachment #102658 -
Attachment is obsolete: true
Updated•22 years ago
Alias: bayesian
Assignee
Comment 54•22 years ago
Comment on attachment 103035: PLDHashTable patch v5
Patch checked in.
Attachment #103035 -
Attachment is obsolete: true
Assignee
Comment 55•22 years ago
This patch introduces a new helper class, TokenEnumeration, to avoid copying tokens where possible, and renames getTokens() to copyTokens().
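The difference between enumerating tokens and copying them can be shown with a small Python sketch. The method names loosely echo the ones mentioned in this comment, but the `Tokenizer` class below is a hypothetical stand-in; the real TokenEnumeration interface is not shown in this bug.

```python
class Tokenizer:
    def __init__(self):
        self._table = {}                 # word -> count

    def add(self, word, count=1):
        self._table[word] = self._table.get(word, 0) + count

    def copy_tokens(self):
        """Materialise a snapshot list; safe to keep while the table changes."""
        return list(self._table.items())

    def enumerate_tokens(self):
        """Yield entries lazily without building a second container."""
        yield from self._table.items()

t = Tokenizer()
for w in "cheap pills cheap".split():
    t.add(w)
total = sum(count for _, count in t.enumerate_tokens())
```

Enumeration suits a single read-only pass; a copy is still needed when the table may be modified while the caller holds on to the tokens.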
Comment 56•22 years ago
Comment on attachment 103118: TokenEnumeration patch v1
Are you sure you don't want the fix for bug 174859? Then you could avoid the overhead of copying tokens rather than token pointers in classifyMessage's call to copyTokens.

Comments apart from this patch:
- You might use a better magic number than 0xFEEDFACE -- see the magic strings used by the XPCOM typelib file format (http://www.mozilla.org/scriptable/typelib_file.html) and the XPCOM FastLoad file format (http://lxr.mozilla.org/mozilla/source/xpcom/io/nsFastLoadFile.h#139), both inspired by PNG's magic string header.
- Nit: last_delimiter violates the otherwise-prevailing interCaps style for local variable names.

Food for future revs, sr=brendan@mozilla.org on this one.
/be
Attachment #103118 -
Flags: superreview+
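As background on the magic-string suggestion: PNG's eight-byte signature starts with a non-ASCII byte and includes CR LF, Ctrl-Z and LF, so that common kinds of transfer damage corrupt the header and are caught on read. A Python sketch of the same idea follows; the `MAGIC` value, the function names, and the two-counter file layout are invented for illustration and are not the format used by the filter.

```python
import io
import struct

# Hypothetical signature modelled on PNG's: a non-ASCII first byte, a short
# ASCII tag, then CR LF, Ctrl-Z and LF to expose newline or text-mode mangling.
MAGIC = b"\x89JNK\r\n\x1a\n"

def write_training_data(stream, good_count, bad_count):
    """Write the signature followed by two big-endian 32-bit counters."""
    stream.write(MAGIC)
    stream.write(struct.pack(">II", good_count, bad_count))

def read_training_data(stream):
    """Return (good_count, bad_count), rejecting files with a bad signature."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not a training data file (bad magic header)")
    return struct.unpack(">II", stream.read(8))

buf = io.BytesIO()
write_training_data(buf, 120, 45)
buf.seek(0)
counts = read_training_data(buf)
```

A bare four-byte number such as 0xFEEDFACE identifies the file type but cannot reveal, for example, that a transfer rewrote line endings.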
Assignee
Comment 57•22 years ago
Comment on attachment 103118: TokenEnumeration patch v1
Fixed style problem, and checked in.
Attachment #103118 -
Attachment is obsolete: true
Comment 58•22 years ago
Please use MPL/GPL/LGPL for the files in mozilla/mailnews/extensions/bayesian-spam-filter. It looks like mozilla/mailnews/extensions/bayesian-spam-filter/MANIFEST can be removed. The project file (mozilla/mailnews/extensions/bayesian-spam-filter/macbuild/BayesianFilter.xml) is referring to files for mdn, could you correct it?
Comment 59•22 years ago
Hmmm... Just thought of something. If the "spam-or-not" rating is stored in the message header/body, will these algorithms fail when a spammer includes something like "X-Mozilla-Spam: not" (or whatever the real header might be) in a spam message?
Comment 60•22 years ago
that's a good point, but I don't think we're planning on storing the spam header in the message. If we did, we should filter out whatever came from the server, and put our own in.
Comment 61•22 years ago
Re comment 60 - would this be a problem if you read your mail from two different places (e.g. with IMAP, from home and from work)? Suppose I read my mail at work, mark as spam or whatever, then at home I read it again and the header added by my work installation is stripped as my home installation downloads the message from the server. A way round this would be to store something that's unique to this installation in the header, and ignore the header if it doesn't match our installation, e.g. X-Mozilla-Spam: 3njh84n5 not
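The installation-specific marker suggested here could be checked roughly as follows. This Python sketch is hypothetical throughout: `X-Mozilla-Spam` is only the placeholder header name used in this discussion, and the `junk`/`not` values and the `parse_junk_header` function are assumptions for the example.

```python
def parse_junk_header(headers, installation_id):
    """Return 'junk', 'not', or None if no trustworthy marker is present."""
    value = headers.get("X-Mozilla-Spam")
    if value is None:
        return None
    parts = value.split()
    # Only trust markers stamped with this installation's own ID; anything
    # else may have been injected by the sender.
    if len(parts) != 2 or parts[0] != installation_id:
        return None
    return parts[1] if parts[1] in ("junk", "not") else None

MY_ID = "3njh84n5"
```

A spammer who adds a bare `X-Mozilla-Spam: not` header gets no benefit, because the marker is ignored unless it carries an ID the sender has no way of knowing.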
Comment 62•22 years ago
Russell, in IMAP, you can't just add a header to a message. The way to solve this with IMAP is to add a custom keyword to the message, on the server. This will only work on some IMAP servers, but I don't think there's a solution that will work on all IMAP servers.
Comment 63•22 years ago
I just thought I'd mention that the SpamBayes project http://spambayes.sf.net/ has now pretty much completed their algorithm research phase, and now have a "clear winner" algorithm based on chi-square probabilities, which you may want to consider upgrading to. In contrast to Graham's original algorithm and the Robinson variations in use a month ago, the "chi-combining" algorithm has a usable "unsure" range where it "knows that it doesn't know" what a particular e-mail is. The earlier algorithms were either sure all the time, or unsure in unusable ways (e.g. drifting "unsure" ranges). The new algorithm can still be "trained" incrementally using pretty much the same word-frequency data, and has far fewer biases and tweak factors. There have also been many tokenizer improvements. Last, but not least, the project has a usable Outlook 2000 add-in that can be looked at for UI ideas and perhaps used for testing to verify that same-or-better results are produced by the Mozilla implementation of the classifier. One of the most difficult things about implementing this kind of filtering is validating your algorithms; the spambayes people have put an enormous amount of both theoretical work and large scale testing for statistical validation. It would be good to make use of it here.
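For a feel of how chi-square combining yields an "unsure" band, here is a compact Python sketch based on Fisher's method of combining probabilities, which is my understanding of the approach described. It is not the SpamBayes source: the function names and the 0.2/0.9 cutoffs are assumptions, inputs must lie strictly between 0 and 1, and there is no protection against floating-point underflow for very long clue lists.

```python
import math

def chi2q(x2, dof):
    """Survival function of chi-square for an even number of degrees of freedom."""
    assert dof % 2 == 0 and dof > 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-token spam probabilities into a score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    spam_evidence = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0

def verdict(score, ham_cutoff=0.2, spam_cutoff=0.9):
    """Cutoffs are illustrative assumptions, not project defaults."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"
```

When strong spam clues and strong ham clues appear together, both evidence terms approach 1 and the score settles near 0.5, so the message lands in the unsure band rather than being forced to one extreme.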
Comment 64•22 years ago
Does this just count words in the message, or does it consider other clues as "words"? I mean things like "Message is HTML", "Out of date", etc. Bug 151622 is asking for score-based filtering, which covers most of these ideas, but with manual scoring. It seems a good idea to use the Bayesian engine to make it automatic.
Comment 65•22 years ago
Some thoughts on the UI for this. Add a special mail folder called Spam. At some point, add the ability to have messages older than a certain age purged. Add another menubar button next to Delete called Spam, when viewing all mail folders but Spam. When the user presses this on a message, it goes to the Spam folder, it gets processed and added to the filter list and at some point, add the ability to automatically report the spam to black list/spam maintainer. When viewing the Spam folder, the button becomes NotSpam (or something that sounds better). Pressing this moves the mail back into the proper folder (as determined by other filters the user has) and processes the mail removing it from the bad mail list and adding it to the good mail list. First time user hits the Spam button, bring up a brief descriptive dialog explaining what is going on and allowing the user to set other filter options (such as the report to blacklist maintainer, for instance). Any mail that comes in and doesn't get filtered as Spam is processed and added to the good list. In practice, the user would start off with empty lists for both. He could go through existing mail and mark it as Spam, starting off the list, but this isn't necessary. All they would need to do is mark new mail as Spam. This means that the filter would not have as high an accuracy at first, but would improve quickly. It is also very simple for the average user to start with.
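The bookkeeping behind the Spam/NotSpam buttons described above amounts to moving a message's tokens from one list to the other. A minimal Python sketch, with a hypothetical `Trainer` class and made-up label names:

```python
from collections import Counter

class Trainer:
    """Keeps good/bad token counts in step with the user's Spam/NotSpam clicks."""
    def __init__(self):
        self.good, self.bad = Counter(), Counter()
        self.labels = {}                       # message id -> "good" | "bad"

    def _table(self, label):
        return self.bad if label == "bad" else self.good

    def mark(self, msg_id, tokens, label):
        previous = self.labels.get(msg_id)
        if previous == label:
            return                             # already counted this way
        if previous is not None:
            table = self._table(previous)
            table.subtract(tokens)             # undo the earlier training
            for t in set(tokens):
                if table[t] <= 0:
                    del table[t]               # drop tokens that reached zero
        self._table(label).update(tokens)
        self.labels[msg_id] = label

trainer = Trainer()
tokens = ["cheap", "pills"]
trainer.mark(1, tokens, "good")   # arrived unfiltered, counted as good mail
trainer.mark(1, tokens, "bad")    # user pressed Spam
```

Remembering each message's previous label is what lets a later NotSpam click cleanly reverse the earlier training instead of double counting.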
Comment 66•22 years ago
Tim: see bug 169638. UI which is almost ready seems pretty similar to what you suggested. Adding this feature's UI bug, bug 169638, to dependencies so that people become aware of its existence.
Comment 67•22 years ago
Adding this bug to dependencies in bug 11035. This is after all a spam-blocking feature, right?
Blocks: 11035
Reporter
Comment 68•22 years ago
We'd better not use the term "Spam" in the UI. Spam is a registered trademark and using it within the product UI could draw a lawsuit. The meat company that owns the trademark *is* actively trying to defend its trademark and would love a victory against a high-profile target like Mozilla. Suggest "Junk Mail" instead.
Comment 69•22 years ago
Not true, according to http://www.spam.com/ci/ci_in.htm
> We do not object to use of this slang term to describe UCE, although we do
> object to the use of our product image in association with that term. Also, if
> the term is to be used, it should be used in all lower-case letters to
> distinguish it from our trademark SPAM, which should be used with all uppercase
> letters.
Comment 70•22 years ago
How will the filtering work for IMAP connections? As I understand it, the filters will run as the headers are generated for the message list; however, a typical IMAP connection does not retrieve any part of the message until it is read/moved/copied/etc. Therefore, in a typical IMAP connection, only the message headers will be available for running through the filter. Will the filter work only on the headers, or will it force a download of the text/html or text/plain parts of messages so it can run against them (or maybe an option for either)?

BTW, I don't know how much information is passed via the initial IMAP listing, but if it lists the attachment headers, that would be good enough since so many spam messages are obvious with idiotic content like:
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: base64
Comment 71•22 years ago
Michael: the UI for message classification has been enabled in trunk builds (see bug 169638). From my testing it seems that bodies of new, incoming messages are fetched and run through the filter, while pre-existing messages have an icon indicating that they haven't been processed yet. You can select them, then Tools->Run Junk Mail Controls on them manually.
Comment 72•22 years ago
|
||
What do we have to do to test this? Current nightly has the UI, but the UI doesn't do anything at all.
Comment 73•22 years ago
|
||
Nathan: There is a problem with this in (at least) the Windows installer versions of the nightlies. The .zip appears to function correctly. See bug 179150.
Comment 74•22 years ago
|
||
It works in installer builds starting with 2002-11-08-13.
Comment 75•22 years ago
|
||
Been trying this out excitedly. A few questions. Does the Junk Mail log do anything? I don't see anything being written to it. Messages seem to be labelled as junk only after they have all been retrieved, so my attempts to use a filter that sends all incoming messages with the junk flag to a particular folder fail. I have to choose to apply filters to the current folder to have them move. Any chance they can get flagged before the filters are applied?
Comment 76•22 years ago
|
||
My junk mail log shows all occurrences of a mail (in Inbox) getting automatically marked as junk (after they have been retrieved). Marking messages as junk (or not junk) automatically after they have been moved to a folder by a filter is bug 180153 (for POP).
Comment 77•22 years ago
|
||
Right. Mine get marked as well. And I'm retrieving locally. It isn't a problem with marking, it is the problem that I have a rule that says if something is marked junk, move it to another folder. All the other rules get applied on retrieval, but that one doesn't, since on retrieval, status is still undefined. Status gets switched immediately after all are retrieved, but that is too late for the rule. Or at least, that's what it seems is happening.
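The ordering problem described above can be illustrated with a toy model. This is a sketch in Python of the behaviour as the commenter reports it, not the actual Mozilla code; `Message`, `classify`, `move_junk_rule`, and `retrieve` are hypothetical names, and the keyword test merely stands in for the Bayesian classifier.

```python
from dataclasses import dataclass


@dataclass
class Message:
    subject: str
    junk_status: str = "undefined"  # "junk" or "not junk" once classified
    folder: str = "Inbox"


def classify(msg: Message) -> None:
    # Stand-in for the Bayesian classifier (assumption: a keyword test).
    msg.junk_status = "junk" if "free money" in msg.subject.lower() else "not junk"


def move_junk_rule(msg: Message) -> None:
    # User filter: "if something is marked junk, move it to another folder".
    if msg.junk_status == "junk":
        msg.folder = "Junk"


def retrieve(messages, classify_first: bool):
    for msg in messages:
        if classify_first:
            classify(msg)
        move_junk_rule(msg)  # filters run at retrieval time
    if not classify_first:
        # Reported behaviour: status is set only after all mail is retrieved.
        for msg in messages:
            classify(msg)
    return messages
```

With `classify_first=False` a junk message ends up marked as junk but still sitting in the Inbox, which matches the report; with `classify_first=True` the same rule moves it.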
Comment 78•22 years ago
|
||
The junk mail code has landed on the trunk. New bugs are being filed as blocking bug 11305, so if you have an issue with Junk Mail, see if it's in the dependency tree in that bug, and if not, open a new one.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Comment 79•22 years ago
|
||
Dan: Did you really mean bug 11305? That's a VERIFIED-WONTFIX about nsAppleSingleDecoder or something.
Comment 80•22 years ago
|
||
He probably meant Bug 11035
Comment 81•22 years ago
|
||
IDEA: Address Book Immunity Rule
-------------------------------------
One of the biggest problems with making spam filters is making a filter that keeps unwanted stuff out without trashing things you want. How about making a rule that makes email with an address from the user's address book immune to any _or_ particular filters? In the user interface this could be a check box titled: "don't apply this rule to people listed in my address book". This would cut down on email getting lost to an overly tight filter, and it would cut down on having to wade through a folder full of junk mail to make sure you are not throwing out anything important.
Just a thought...
Steve
Comment 82•22 years ago
|
||
We do that - it's called address book whitelisting, and if it's turned on, we won't apply the spam filter to a message from someone in your address book. There's a plan to extend it so you can check against multiple address books, but as of today, you can use a single address book as your whitelist.
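As a rough illustration of the whitelisting behaviour described above, the check might look something like the following Python sketch. It is not the Mozilla implementation: `should_run_junk_filter` and its parameters are hypothetical, and case-insensitive comparison of the bare address is an assumption made here for the example.

```python
from email.utils import parseaddr
from typing import Iterable


def should_run_junk_filter(
    sender: str,
    address_book: Iterable[str],
    whitelist_enabled: bool = True,
) -> bool:
    """Return False when the sender is whitelisted and the junk filter is skipped."""
    if not whitelist_enabled:
        return True
    # parseaddr splits 'Name <user@host>' into (name, address).
    _, address = parseaddr(sender)
    known = {entry.strip().lower() for entry in address_book}
    return address.lower() not in known
```

For example, `should_run_junk_filter("Steve <steve@example.com>", {"steve@example.com"})` skips the filter, while a sender missing from the address book is still classified.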
Comment 83•22 years ago
|
||
FYI, a different anti-spam technique is noted in bug #187044, called "challenge-response". It appears to me that the Bayesian and challenge-response techniques can be used together - the combination should REALLY cut down on spam, and make it harder for spammers to overcome (since they would have to circumvent TWO mechanisms).
Comment 84•22 years ago
|
||
It would also be nice to see bug 184948 completed at some point in the future. That way, fewer spammers would know their message was received. - Adam
Updated•20 years ago
|
Product: MailNews → Core
Updated•16 years ago
|
Product: Core → MailNews Core