Closed Bug 163188 (bayesian) Opened 22 years ago Closed 22 years ago

Add Bayesian antispam filters per Paul Graham's design

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: ericweb, Assigned: beard)

References

(
URL
)

Details

Attachments

(3 files, 13 obsolete files)

bayesian prototype patch, v1; 22 years ago; Dan Mosedale (:dmosedale, :dmose); 36.40 KB, patch
patch, v2; 22 years ago; Dan Mosedale (:dmosedale, :dmose); 63.13 KB, patch
patch, v3; 22 years ago; Dan Mosedale (:dmosedale, :dmose); 90.51 KB, patch
temporary glue to hook up junk mail button to bayesian code; 22 years ago; Dan Mosedale (:dmosedale, :dmose); 3.00 KB, patch; flags: Bienvenu : review+
temp glue patch, v2; 22 years ago; Dan Mosedale (:dmosedale, :dmose); 3.02 KB, patch; flags: peterv : review+, sspitzer : superreview+
temp glue patch, v3; 22 years ago; Dan Mosedale (:dmosedale, :dmose); 2.88 KB, patch; flags: dmosedale : review+, sspitzer : superreview+
build patch, v1; 22 years ago; Dan Mosedale (:dmosedale, :dmose); 3.43 KB, patch
Mac build system patch.; 22 years ago; Patrick C. Beard; 1.59 KB, patch; flags: peterv : review+
Patch for msgcoreidl; 22 years ago; Peter Van der Beken [:peterv]; 2.73 KB, patch
gcc warning fix, v1; 22 years ago; Dan Mosedale (:dmosedale, :dmose); 849 bytes, patch; flags: beard : review+
Fixes warnings, adds use of PLDHashTable to improve performance.; 22 years ago; Patrick C. Beard; 8.65 KB, patch
PLDHashTable patch, v2.; 22 years ago; Patrick C. Beard; 9.83 KB, patch
PLDHashTable patch v3; 22 years ago; Patrick C. Beard; 13.21 KB, patch
PLDHashTable patch v4; 22 years ago; Patrick C. Beard; 14.12 KB, patch; flags: dmosedale : review+
PLDHashTable patch v5; 22 years ago; Patrick C. Beard; 15.00 KB, patch
TokenEnumeration patch v1; 22 years ago; Patrick C. Beard; 9.69 KB, patch; flags: brendan : superreview+

Eric Krock

Reporter

Description

•

22 years ago

Lisp guru Paul Graham has written a brilliant spam filtering system (at the above URL) based on Bayesian statistical evaluation of the tokens of incoming mail. The spam and non-spam probabilities of individual words are derived by scanning and comparing corpuses of the individual's received spam mail and non-spam mail. The brilliant things about this approach are that:

1) it is extremely difficult for spammers to defeat it
2) it automatically evolves as spammers evolve their pitches
3) it tailors its behavior to the individual's actual received emails

A Moz Mail client implementation of this might have the following features:

A) "Scan Spam Email" and "Scan Nonspam Email" context menu options in the mail folders window that would scan a folder's emails and process them as spam or nonspam (seeding the Bayesian statistical evaluation); you'd use this for "initializing" the filter when getting started and training it on your accumulated nonspam and spam email
B) A "Delete As Spam" right-click menu option that would not only delete the email but would also use it as input to the spam filter
C) An initial database of default token weights to get the filter started. (I'm sure Paul Graham would be happy to provide his corpuses and derived weightings to an open source project such as Mozilla. He clearly wants others to make use of his work.)
D) A checkbox in prefs that turns this filtering off. (It will be a popular and accurate enough feature that it should be on by default to promote awareness, adoption, and viral adoption of Moz Mail.)
E) A "Suspected Spam" folder into which suspected spam email is automatically filtered.

The first email program with this kind of filtering built in will be the first killer app since the browser itself, as the amount of spam is rapidly increasing. It could be a major driver of Mozilla adoption if Moz Mail had this feature. (e.g. I'd switch back to POP if I had to in order to take advantage of this feature.)

Of course, mail server-based implementations are also possible, but mail server vendors and ISPs may resist implementing and adopting them because of the processing load it would put on their servers. Client CPU cycles, on the other hand, are cheap. It would be great if someone would add this to Moz Mail, which is not supported by any commercially available antispam system I'm aware of.
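For readers who have not seen the essay, the token-scoring idea can be sketched roughly as below. This is a minimal sketch, assuming the constants Graham's essay describes (good counts doubled, probabilities clamped to [0.01, 0.99], 0.4 for tokens with too little evidence, the 15 most "interesting" tokens combined). The function names and whitespace tokenizer are hypothetical and are not taken from any Mozilla code.

```python
from collections import Counter


def count_tokens(messages):
    """Count token occurrences across a corpus (naive whitespace tokenizer)."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts


def token_probability(token, good, bad, ngood, nbad, min_count=5):
    """Estimate the probability that a message containing `token` is spam."""
    g = 2 * good.get(token, 0)  # good counts are doubled to bias against false positives
    b = bad.get(token, 0)
    if g + b < min_count or g + b == 0:
        return 0.4  # too little evidence: assume slightly innocent
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    p = bad_ratio / (good_ratio + bad_ratio)
    return max(0.01, min(0.99, p))


def spam_score(tokens, good, bad, ngood, nbad, top_n=15, min_count=5):
    """Combine the most extreme token probabilities into one score in [0, 1]."""
    probs = [token_probability(t, good, bad, ngood, nbad, min_count)
             for t in set(tokens)]
    # "Interesting" tokens are those whose probability is farthest from 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = prod_ham = 1.0
    for p in probs[:top_n]:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)
```

Here `ngood` and `nbad` are the numbers of messages in each corpus and must be greater than zero. A message with no tokens scores 0.5, i.e. no evidence either way.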

nemo

Comment 1

•

22 years ago

Bravo! I was just chatting on #mozillazine that putting this in would be the thing that would inspire me to learn Mozilla's structure... Thought implementing it might make a nice easy project that would make my life sooo much more pleasant. (no more continual .procmailrc updating!) No knowledge of said structure at the moment, so would just like to register an interest in helping out, as well as seeking good places to learn about Mail/News methods.

Heikki Toivonen (remove -bugzilla when emailing directly)

Comment 2

•

22 years ago

Oh yeah, this would totally make my day! Gary Arnold also seems to have written this thing in Perl, which might be easier to understand for a lot of people: http://www.garyarnold.com/projects.php#bayespam

Heikki Toivonen (remove -bugzilla when emailing directly)

Comment 3

•

22 years ago

And more source to analyze: Written by ESR in C: http://www.tuxedo.org/~esr/bogofilter/ and someone made a "badwordlist" for that which can be found at http://www.xtdnet.nl/paul/spam/bogofilter/

Mike Lee

Comment 4

•

22 years ago

Check bug 11035. Someone is working on the foundation work, and comment 31 in that bug by Alec Flett seems like a very good idea. That way I can make the spam filter run after all my other filters to achieve better results. I think eventually we would have a number of filter plugins like "Spam Assassin (bug 11035)", "Bayesian (this bug)", "whitelist (bug 120160)" that the user can use instead of one general solution. Should we add this as a dependency?

David A. Wheeler

Comment 5

•

22 years ago

I _DEFINITELY_ would like to see this happen; a tool that integrates into the mail browser to deal with spam is really needed. However, I think you can make the user interface even simpler. Create a nice big button named "SPAM" - press it whenever you see a SPAM message like you would press "Delete", but it will then do a number of configurable actions - and give it a useful default, since many users don't configure their systems. I suggest as the default that it (1) forge and send back a "no such user" message; this will remove you from some lists in a few cases and also warn others about forging emails. (2) save the message in a specially-named "spam" folder, for use in a naive Bayesian statistical analysis program (as Graham describes).

Other options that could be turned on include forwarding a copy to a list of email addresses (e.g., your local "abuse" account, the newsgroup news.admin.net-abuse.sightings, and email addresses of well-known spam killers), or calling on other spam killers like SpamAssassin to check it. In the dialogue for configuring the SPAM button's actions, perhaps there could be a radio selection beside each action like "don't do it when you press SPAM", "do it when you press SPAM", or "confirm before doing it when you press SPAM" - that way, you can turn on "ask me before sending to abuse" or whatever.

Also, you need to pre-create a "Suspected Spam" folder if one doesn't already exist. When the "spam" folder grows to 100 messages, start up the naive Bayesian analysis, and assume that any email saved in "spam" is spam, and anything saved in a folder other than the "Suspected Spam" or "Spam" folders is good. Automatically re-run the analysis every 100 messages or every week, whichever comes first. Note that from a user's point of view, the only thing they need to know is that they need to press the SPAM button when they see spam. Everything else is automatic. Simple, n'est-ce pas?

There's a great deal of study on the topic of Naive Bayesian approaches to spam, and it's surprisingly effective. Here are some references studying it:

http://arxiv.org/abs/cs.CL/0006013 An evaluation of Naive Bayesian anti-spam filtering
http://arxiv.org/abs/cs/0008019 An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages
http://arxiv.org/abs/cs/0009009 Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach
http://www.lsi.upc.es/~carreras/pub/boospamev.ps
http://www.monmouth.edu/~drucker/SVM_spam_article_compete.PDF

Others have at least partly implemented it:

http://www.ai.mit.edu/~jrennie/ifile/ Ifile implemented the idea many years ago. This is useful to show that the idea has been around a while.
http://crm114.sourceforge.net CRM114 can do this, and can extend the probabilities to phrases and not just individual words.

Eric Raymond's implementation has already been mentioned; that might actually be a better starting point for actual code. Hope all this helps!!

Steve Wardell

Updated

•

22 years ago

Summary: want Bayesian antispam filters per Paul Graham's design → [RFE] Add Bayesian antispam filters per Paul Graham's design

Alfonso Martinez

Comment 6

•

22 years ago

*** Bug 165725 has been marked as a duplicate of this bug. ***

Chris

Comment 7

•

22 years ago

How about a mail filtering API of some sort that would allow pluggable filters? That way I could use a Bayesian filter and someone else could use Vipul's Razor or a whitelist program. That way we might not have to wait as long when Joe Spammer finds a way past the current filter and we have to tighten up the spam filters or combine methods.
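One way to picture such a pluggable chain is sketched below. This is only an illustration of the idea, not the nsIMsgFilterPlugin interface discussed later in this bug: the `MailFilter` protocol, the two example filters, and the "first filter with an opinion wins" rule are all assumptions made for the example.

```python
from typing import Iterable, Optional, Protocol, Tuple


class MailFilter(Protocol):
    """Hypothetical plugin contract: True = spam, False = not spam, None = no opinion."""
    name: str

    def classify(self, message: dict) -> Optional[bool]: ...


class WhitelistFilter:
    name = "whitelist"

    def __init__(self, trusted_senders: Iterable[str]):
        self.trusted = {s.lower() for s in trusted_senders}

    def classify(self, message: dict) -> Optional[bool]:
        # Trusted senders are never spam; everyone else is left to later filters.
        return False if message["from"].lower() in self.trusted else None


class KeywordFilter:
    name = "keyword"

    def __init__(self, bad_words: Iterable[str]):
        self.bad_words = {w.lower() for w in bad_words}

    def classify(self, message: dict) -> Optional[bool]:
        words = set(message["body"].lower().split())
        return True if words & self.bad_words else None


def run_filters(filters, message: dict) -> Tuple[Optional[str], bool]:
    """Ask each plugin in order; the first one with an opinion decides."""
    for f in filters:
        verdict = f.classify(message)
        if verdict is not None:
            return f.name, verdict
    return None, False  # nobody objected: treat as not spam
```

Ordering the list is how a user would express "run the whitelist before the statistical filter"; a Bayesian plugin would simply be another object with a `classify` method.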

Gervase Markham [:gerv]

Comment 8

•

22 years ago

Just for the record: another article, by a statistician this time, on a method to perhaps improve the original "Bayesian" algorithm to one which actually is Bayesian. :-) http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html Gerv
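As a rough illustration of one idea from that article: a word seen only a handful of times should not be trusted as much as a word seen hundreds of times. The sketch below assumes Robinson's adjustment f(w) = (s*x + n*p(w)) / (s + n), where n is the number of messages containing the word, x is the probability assumed for a never-seen word, and s is how strongly that assumption is weighted. The default values s = 1 and x = 0.5 are assumptions chosen for the example.

```python
def robinson_adjust(p_w: float, n: int, s: float = 1.0, x: float = 0.5) -> float:
    """Shrink a raw per-word spam probability toward x when evidence is thin.

    p_w -- raw probability estimated from the corpora
    n   -- number of training messages that contained the word
    s   -- strength given to the prior assumption (assumed default)
    x   -- probability assumed for a word never seen before (assumed default)
    """
    return (s * x + n * p_w) / (s + n)
```

With n = 0 the result is exactly x; as n grows the result approaches the raw estimate.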

Peter Van der Beken [:peterv]

Comment 9

•

22 years ago

I've started working on this, I have a simple prototype. I'll post again once I get something decent.

Dan Mosedale (:dmosedale, :dmose)

Comment 10

•

22 years ago

Make this depend on the plugin interface bug.

Depends on: 167561

Dan Mosedale (:dmosedale, :dmose)

Updated

•

22 years ago

Depends on: 169557
No longer depends on: 167561

Guido van Rossum

Comment 11

•

22 years ago

For algorithmic ideas (if not code :-), the Python community has started a project aimed at developing optimal parsing and classifier algorithms, varying within the ideas of Paul Graham. Using rigorous testing methods, lots of ideas of the form "wouldn't it work better if you changed this parameter or if you treated that header specially" are accepted and rejected. See http://spambayes.sf.net Developers welcome!

Gary Robinson

Comment 12

•

22 years ago

Just thought I'd mention that I'm the author of the essay mentioned in comment #8 above (http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html)... we're discussing it on the list Guido mentions (http://spambayes.sf.net). I, for one, would be very interested in any discussion of using Graham-derived stuff in Mozilla. Gary Robinson

Dan Mosedale (:dmosedale, :dmose)

Comment 13

•

22 years ago

Attached patch bayesian prototype patch, v1 (obsolete)

This is the beginnings of a prototype for bayesian spam filtering, courtesy of peterv. This code has some issues right now, and is reportedly not catching spam as well as earlier versions, so I need to track down what's going on there. The biggest issue, though, is that right now a JS hash is used for word counts, and read and written to disk in one shot. We're gonna want a real on-disk DB here to avoid tons of bloat, so I'm guessing mork is not gonna be suitable... but I'm happy to be proven wrong. Perhaps we need to fall back to the ancient dbm code that the NSS folks use?
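To make the storage concern concrete: the alternative to loading and saving one big in-memory hash is a store that updates individual token rows on disk. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in; it is not mork or dbm, and the `TokenStore` class and its table layout are hypothetical.

```python
import sqlite3


class TokenStore:
    """Per-token good/bad counts kept in an on-disk table and updated incrementally."""

    def __init__(self, path: str = ":memory:"):
        # Pass a file path for a persistent store; ":memory:" is handy for trying it out.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tokens ("
            "word TEXT PRIMARY KEY, "
            "good INTEGER NOT NULL DEFAULT 0, "
            "bad INTEGER NOT NULL DEFAULT 0)"
        )

    def add(self, words, is_spam: bool) -> None:
        column = "bad" if is_spam else "good"  # fixed identifiers, never user input
        with self.db:  # one transaction per message: commit on success, roll back on error
            for w in words:
                self.db.execute("INSERT OR IGNORE INTO tokens (word) VALUES (?)", (w,))
                self.db.execute(
                    f"UPDATE tokens SET {column} = {column} + 1 WHERE word = ?", (w,)
                )

    def counts(self, word: str):
        row = self.db.execute(
            "SELECT good, bad FROM tokens WHERE word = ?", (word,)
        ).fetchone()
        return row if row else (0, 0)
```

Only the rows for tokens in the current message are touched, so memory use no longer grows with the size of the whole vocabulary.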

Dan Mosedale (:dmosedale, :dmose)

Comment 14

•

22 years ago

Attached patch patch, v2 (obsolete)

This takes the code from the last patch and hooks it up into the Mozilla build, and partially hooks it up with Seth's UI. It still needs a bunch of cleanup, but it's a start. This patch includes Seth's "turn on the UI" patch, as well as the nsIMsgFilterPlugin interface and the hooks into IMAP to use it.

Attachment #99870 - Attachment is obsolete: true

micah shaw

Comment 15

•

22 years ago

It might be nice to automatically notify the sender of any email that is sorted as spam that the message was classified as spam and not received by the intended recipient. This would help to catch false positives that might otherwise go unnoticed.

Wilbur

Comment 16

•

22 years ago

But if we automatically respond to the spammers, it just confirms that the address is valid, just as many "opt-out" links just confirm an active email address. The best bet is to ignore or send a false undeliverable.

nemo

Comment 17

•

22 years ago

I agree with #16. If one is concerned about false positives, one should just scan the spam folder briefly before emptying it. Replying to spam is the last thing you want to do. (right up there with opening them with HTML parsing enabled)

Dan Mosedale (:dmosedale, :dmose)

Comment 18

•

22 years ago

Attached patch patch, v3 (obsolete)

Patch v3; includes changes from peterv to the algorithm (starting to work nicely :-). Also has various infrastructure changes so that IMAP analyzes every message as it comes in (though it doesn't yet indicate that in the UI).

Attachment #100102 - Attachment is obsolete: true

(not reading, please use seth@sspitzer.org instead)

Comment 19

•

22 years ago

over to beard, who now owns this.

Assignee: naving → beard

Peter Van der Beken [:peterv]

Comment 20

•

22 years ago

BTW, we should credit Paul Graham, Gary Robinson and the spambayes list (Tim Peters et alia) in the filter code. Though I didn't use the lisp or the python code I stole a lot of the ideas for our implementation from them. ;-)

Dan Mosedale (:dmosedale, :dmose)

Comment 21

•

22 years ago

OK, I've checked in peterv's existing code in mailnews/extensions/bayesian-spam-filter; it's not yet part of the build. I also added a comment crediting various folks as Peter suggested. The stuff that's checked in may not quite work, but it's close. Next step: get the rest of the infrastructure changes in.

Michael Baffoni

Comment 22

•

22 years ago

Is it too far beyond the scope of this bug to suggest the capability of pooling the "good" and "bad" corpus derived from many users? Hypothetically, in this way a company could designate a central repository for "interesting" tokens so that the scores from other employees can be evaluated along with personal dbs - this immediately reduces the false-positive rates for common contacts with suppliers, sales, etc. Or would such a repository of necessity need to be a transactional-based server (sql of some flavor for example) to avoid file-locking issues, and therefore would require much more development? An interesting side-development of this would be to create an infrastructure (between many sites) that share their corpus results and use this as a baseline against which an individual's behaviours (prior picking of spam or removing mail from the spam-holding folder) are given weight.

Dan Mosedale (:dmosedale, :dmose)

Comment 23

•

22 years ago

Attached patch temporary glue to hook up junk mail button to bayesian code (obsolete)

Some temporary code to get the current checked-in bayesian bits hooked up to seth's UI. Eventually this will be replaced by code that goes through the filtering plugin interface, but that's not there yet.

Attachment #100530 - Attachment is obsolete: true

Dan Mosedale (:dmosedale, :dmose)

Comment 24

•

22 years ago

Michael: an interesting idea, but definitely beyond the scope of this particular bug. Feel free to file another about it, if you wish.

David :Bienvenu

Comment 25

•

22 years ago

Comment on attachment 100812: temporary glue to hook up junk mail button to bayesian code

r=bienvenu

Attachment #100812 - Flags: review+

(not reading, please use seth@sspitzer.org instead)

Comment 26

•

22 years ago

Comment on attachment 100812: temporary glue to hook up junk mail button to bayesian code

I'm not sure this is right. Why are you setting .label on the message header, and not the .score attribute? Instead of function mark(), can we call it something like function setSpamScore()?

Dan Mosedale (:dmosedale, :dmose)

Comment 27

•

22 years ago

Attached patch temp glue patch, v2 (obsolete)

Attachment #100812 - Attachment is obsolete: true

(not reading, please use seth@sspitzer.org instead)

Comment 28

•

22 years ago

Comment on attachment 100858: temp glue patch, v2

sr=sspitzer

Attachment #100858 - Flags: superreview+

Russell Odom

Comment 29

•

22 years ago

Re comment 22, comment 24, there are various downsides to a pooled repository:

* Possibility of it being 'polluted' by malicious/incompetent users;
* What different people consider as spam may not be the same;
* Different people get different types of mail, therefore their 'good' lists in particular are likely to be very different; and
* If everybody uses the same lists, spammers can tune their content based on these lists and thus avoid the filter - one of the strengths of the system is that everyone's list is different, so it's impossible to produce mail to get round every list.

Peter Van der Beken [:peterv]

Comment 30

•

22 years ago

Comment on attachment 100858: temp glue patch, v2

r=peterv, note that this will still mark the message as read.

Attachment #100858 - Flags: review+

NorthMan

Comment 31

•

22 years ago

Is there a target date or milestone on this bug? There's been a lot of activity but no milestone has been set. I see the reviews are done, so I assume once approval has been obtained the code will be checked into the tree?

Patrick C. Beard

Assignee

Comment 32

•

22 years ago

This feature is very much a work in progress. The architecture to support filter plugins is still being worked out.

Status: NEW → ASSIGNED

Dan Mosedale (:dmosedale, :dmose)

Comment 33

•

22 years ago

Attached patch temp glue patch, v3

Fix problems related to the score property; make mark() iterate instead of using callbacks.

Attachment #100858 - Attachment is obsolete: true

(not reading, please use seth@sspitzer.org instead)

Comment 34

•

22 years ago

Comment on attachment 100954: temp glue patch, v3

sr=sspitzer. Sorry about the misleading ".score" thing. Note to the reviewer: we don't need to call setScore(), as the doCommand() code does that.

Attachment #100954 - Flags: superreview+

Dan Mosedale (:dmosedale, :dmose)

Comment 35

•

22 years ago

Comment on attachment 100954: temp glue patch, v3

Carrying forward peterv's r=.

Attachment #100954 - Flags: review+

Dan Mosedale (:dmosedale, :dmose)

Comment 36

•

22 years ago

Glue patch checked in.

Dan Mosedale (:dmosedale, :dmose)

Comment 37

•

22 years ago

Attached patch build patch, v1

Patch to turn on the UI that seth put in (duplicated from his patch in the front-end bug) and cause the bayesian stuff to be installed at build time.

Patrick C. Beard

Assignee

Comment 38

•

22 years ago

Attached patch Mac build system patch.

This implements a new build option, junkmailfilter, and adds appropriate build steps to the Mac build system.

NorthMan

Comment 39

•

22 years ago

Can someone please set an eta or milestone? Also, can we use the temporary "glue" until the message filtering plugin interface comes through or do we have to wait on that?

Peter Van der Beken [:peterv]

Comment 40

•

22 years ago

Attached patch Patch for msgcoreidl (obsolete)

CW project change to build nsIMsgFilterPlugin.idl on the Mac.

Peter Van der Beken [:peterv]

Comment 41

•

22 years ago

Comment on attachment 100978: Mac build system patch.

r=peterv

Attachment #100978 - Flags: review+

Dan Mosedale (:dmosedale, :dmose)

Comment 42

•

22 years ago

Attached patch gcc warning fix, v1 (obsolete)

gcc 3.1.1 points out that rv can be used uninitialized if aCount == 0. Here's a patch.

Patrick C. Beard

Assignee

Comment 43

•

22 years ago

Comment on attachment 102510: gcc warning fix, v1

r=beard

Attachment #102510 - Flags: review+

Patrick C. Beard

Assignee

Comment 44

•

22 years ago

Attached patch Fixes warnings, adds use of PLDHashTable to improve performance. (obsolete)

This patch uses PLDHashTable instead of nsObjectHashTable. This allows it to allocate all Token objects in place, and does aggressive key sharing. Still has some glitches right now, so this is a work in progress.

Patrick C. Beard

Assignee

Updated

•

22 years ago

Attachment #102510 - Attachment is obsolete: true

Patrick C. Beard

Assignee

Comment 45

•

22 years ago

Attached patch PLDHashTable patch, v2. (obsolete)

This simplifies the structure of the Token significantly, and uses the stub moveEntry and clearEntry PLDHashTable operators. Pooled allocation of token strings is now unconditionally used.

Attachment #102513 - Attachment is obsolete: true

(not reading, please use seth@sspitzer.org instead)

Comment 46

•

22 years ago

Comment on attachment 102521: PLDHashTable patch, v2.

sr=sspitzer

Attachment #102521 - Flags: superreview+

(not reading, please use seth@sspitzer.org instead)

Comment 47

•

22 years ago

I've been noticing some weirdness with this patch (things getting slow, or not working), taking forever to shutdown. I got a crash while classifying:

PL_DHashStringKey(PLDHashTable * 0x04a72344, const void * 0xdddddddd) line 79 + 22 bytes
HashKey(PLDHashTable * 0x04a72344, const void * 0xdddddddd) line 72 + 13 bytes
PL_DHashTableOperate(PLDHashTable * 0x04a72344, const void * 0xdddddddd, int 0) line 479 + 16 bytes
Tokenizer::get(const char * 0xdddddddd) line 134 + 15 bytes
Tokenizer::remove(const char * 0xdddddddd, unsigned int 3722304989) line 157 + 12 bytes
forgetTokens(Tokenizer & {...}, Token * * 0x02de2d68, unsigned int 91) line 593
nsBayesianFilter::observeMessage(Tokenizer & {...}, const char * 0x04a6c3a8, unsigned int 1, unsigned int 2, nsIJunkMailClassificationListener * 0x00000000) line 625 + 20 bytes
MessageObserver::analyzeTokens(const char * 0x04a6c3a8, Tokenizer & {...}) line 578
TokenStreamListener::OnStopRequest(TokenStreamListener * const 0x04a6c310, nsIRequest * 0x04a6c160, nsISupports * 0x00000000, unsigned int 0) line 375
nsStreamConverter::OnStopRequest(nsStreamConverter * const 0x04a59880, nsIRequest * 0x04a6c160, nsISupports * 0x00000000, unsigned int 0) line 1099
nsStreamListenerTee::OnStopRequest(nsStreamListenerTee * const 0x04915f20, nsIRequest * 0x04a6c160, nsISupports * 0x00000000, unsigned int 0) line 66
nsOnStopRequestEvent0::HandleEvent(nsOnStopRequestEvent0 * const 0x04ad7d18) line 319 + 33 bytes
nsStreamListenerEvent0::HandlePLEvent(PLEvent * 0x04ad7d28) line 113 + 12 bytes
PL_HandleEvent(PLEvent * 0x04ad7d28) line 644 + 10 bytes
PL_ProcessPendingEvents(PLEventQueue * 0x01266f20) line 574 + 9 bytes
_md_EventReceiverProc(HWND__ * 0x00660256, unsigned int 49509, unsigned int 0, long 19296032) line 1335 + 9 bytes
USER32! 77e11b60()
USER32! 77e11cca()
USER32! 77e183f1()
nsAppShellService::Run(nsAppShellService * const 0x012db5c8) line 472
main1(int 2, char * * 0x00276ef8, nsISupports * 0x00276f40) line 1522 + 32 bytes
main(int 2, char * * 0x00276ef8) line 1883 + 37 bytes
mainCRTStartup() line 338 + 17 bytes
KERNEL32! 77e8d326()

Dan Mosedale (:dmosedale, :dmose)

Comment 48

•

22 years ago

Comment on attachment 102521: PLDHashTable patch, v2.

r=dmose after casts are changed to C++ style

Attachment #102521 - Flags: review+

(not reading, please use seth@sspitzer.org instead)

Comment 49

•

22 years ago

Comment on attachment 102521: PLDHashTable patch, v2.

Since I'm crashing with this, and there's some weirdness, I'm marking this as needs-work.

Attachment #102521 - Flags: superreview+

Attachment #102521 - Flags: review+

Attachment #102521 - Flags: needs-work+

Patrick C. Beard

Assignee

Updated

•

22 years ago

Attachment #101103 - Attachment is obsolete: true

Patrick C. Beard

Assignee

Comment 50

•

22 years ago

Attached patch PLDHashTable patch v3 (obsolete)

Replaced C style casts with appropriate NS_(STATIC|REINTERPRET)_CAST macros. Added more error handling and NS_ASSERTIONs.

Attachment #102521 - Attachment is obsolete: true

Patrick C. Beard

Assignee

Comment 51

•

22 years ago

Attached patch PLDHashTable patch v4 (obsolete)

This patch seems to work much better -- no longer getting zero length tokens in the hash tables, which seemed to stem from using an alignment value of 1 when calling PL_InitArenaPool() -- now using an alignment of 2, and an arena size of 16K.

Attachment #102527 - Attachment is obsolete: true

Dan Mosedale (:dmosedale, :dmose)

Comment 52

•

22 years ago

Comment on attachment 102658: PLDHashTable patch v4

A few error checking nits:

>+static void PR_CALLBACK MoveEntry(PLDHashTable* table,
>+                                  const PLDHashEntryHdr* from,
>+                                  PLDHashEntryHdr* to)
>+{
>+    const Token* fromToken = NS_STATIC_CAST(const Token*, from);
>+    Token* toToken = NS_STATIC_CAST(Token*, to);
>+    if (fromToken->mLength == 0) {
>+        NS_WARNING("zero length token in table!");

Should this really be an assertion rather than just a warning? I.e., is a zero-length token ever valid?

>-Tokenizer::Tokenizer() : mTokens(NULL, NULL, NULL, NULL)
>+Tokenizer::Tokenizer()
> {
>-    PL_InitArenaPool(&mTokenPool, "Tokens Arena", 4096 * sizeof(Token), sizeof(double));
>-    PL_InitArenaPool(&mWordPool, "Words Arena", 32768, sizeof(char));
>+    PRBool ok = PL_DHashTableInit(&mTokenTable, &gTokenTableOps, nsnull, sizeof(Token), 256);
>+    NS_ASSERTION(ok, "mTokenTable failed to initialize");

Since PL_DHashTableInit ultimately ends up allocating memory, and a failure there could be the cause of this failure, failure should probably be checked in all builds, not just asserted in debug builds.

> Token* Tokenizer::add(const char* word, PRUint32 count)
> {
>-    nsCStringKey key(word);
>-    Token* token = (Token*) mTokens.Get(&key);
>+    Token* token = get(word);
>     if (!token) {
>-        token = newToken(word, count);
>-        if (token && token->mWord.get()) {
>-            // NOTE: to save space, sharedKey shares the string pointer with the token itself.
>-            // This is safe, as long as the token's lifetime exceeds the hash table / key itself.
>-            // Since the token string is now arena allocated, this will always be true.
>-            nsCStringKey sharedKey(token->mWord.get(), token->mWord.Length(), nsCStringKey::NEVER_OWN);
>-            mTokens.Put(&sharedKey, token);
>+        PLDHashEntryHdr* newEntry = PL_DHashTableOperate(&mTokenTable, word, PL_DHASH_ADD);
>+        if (newEntry) {
>+            PRUint32 len = strlen(word);
>+            token = NS_STATIC_CAST(Token*, newEntry);
>+            token->mWord = copyWord(word, len);
>+            NS_ASSERTION(token->mWord, "copyWord failed");

Same as previous comment: this should probably be more than just an assertion for the same reason.

Other than these nits, it looks good. Fix them and you've got r=dmose.

Attachment #102658 - Flags: review+

Sören 'Chucker' Kuklau (gone)

Updated

•

22 years ago

Summary: [RFE] Add Bayesian antispam filters per Paul Graham's design → Add Bayesian antispam filters per Paul Graham's design

Patrick C. Beard

Assignee

Comment 53

•

22 years ago

Attached patch PLDHashTable patch v5 (obsolete)

This uses a 2-byte aligned string arena (1-byte doesn't seem to work, 4-byte seems wasteful), and addresses error checking concerns.

Attachment #102658 - Attachment is obsolete: true

Brendan Eich [:brendan]

Updated

•

22 years ago

Alias: bayesian

Patrick C. Beard

Assignee

Comment 54

•

22 years ago

Comment on attachment 103035: PLDHashTable patch v5

Patch checked in.

Attachment #103035 - Attachment is obsolete: true

Patrick C. Beard

Assignee

Comment 55

•

22 years ago

Attached patch TokenEnumeration patch v1 (obsolete)

This patch introduces a new helper class, TokenEnumeration, to avoid copying tokens where possible, and renames getTokens() to copyTokens().

Brendan Eich [:brendan]

Comment 56

•

22 years ago

Comment on attachment 103118: TokenEnumeration patch v1

Are you sure you don't want the fix for bug 174859? Then you could avoid the overhead of copying tokens rather than token pointers in classifyMessage's call to copyTokens.

Comments apart from this patch:

- You might use a better magic number than 0xFEEDFACE -- see the magic strings used by the XPCOM typelib file format (http://www.mozilla.org/scriptable/typelib_file.html) and the XPCOM FastLoad file format (http://lxr.mozilla.org/mozilla/source/xpcom/io/nsFastLoadFile.h#139), both inspired by PNG's magic string header.
- Nit: last_delimiter violates the otherwise-prevailing interCaps style for local variable names.

Food for future revs, sr=brendan@mozilla.org on this one.

/be
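For context on the magic-string suggestion: PNG's signature mixes a high-bit byte, a CR-LF pair, a Ctrl-Z, and a bare LF, so that common transfer corruptions (7-bit stripping, newline translation) break the header and are detected on read. The sketch below shows the pattern only; the `MAGIC` value and the two-counter payload are made up for illustration and are not the training-file format used by the patch.

```python
import struct

# Hypothetical PNG-style signature: high-bit byte, ASCII tag, CR-LF, Ctrl-Z, LF.
MAGIC = b"\x89BAYES\r\n\x1a\n"


def write_counts(stream, good_count: int, bad_count: int) -> None:
    """Write the signature followed by two big-endian 32-bit message counts."""
    stream.write(MAGIC)
    stream.write(struct.pack(">II", good_count, bad_count))


def read_counts(stream):
    """Return (good_count, bad_count), refusing files with the wrong signature."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not a training-data file (bad magic string)")
    payload = stream.read(8)
    if len(payload) != 8:
        raise ValueError("truncated training-data file")
    return struct.unpack(">II", payload)
```

A plain 32-bit constant such as 0xFEEDFACE identifies the file type but survives none of those checks, which is the gist of the review comment.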

Attachment #103118 - Flags: superreview+

Brendan Eich [:brendan]

Updated

•

22 years ago

Depends on: 174859

Patrick C. Beard

Assignee

Comment 57

•

22 years ago

Comment on attachment 103118: TokenEnumeration patch v1

Fixed style problem, and checked in.

Attachment #103118 - Attachment is obsolete: true

Peter Van der Beken [:peterv]

Comment 58

•

22 years ago

Please use MPL/GPL/LGPL for the files in mozilla/mailnews/extensions/bayesian-spam-filter. It looks like mozilla/mailnews/extensions/bayesian-spam-filter/MANIFEST can be removed. The project file (mozilla/mailnews/extensions/bayesian-spam-filter/macbuild/BayesianFilter.xml) is referring to files for mdn, could you correct it?

mozilla.gv6r

Comment 59

•

22 years ago

Hmmm... Just thought of something. If the "spam-or-not" rating is stored in the message header/body, will these algorithms fail when a spammer includes something like "X-Mozilla-Spam: not" (or whatever the real header might be) in a spam message?

David :Bienvenu

Comment 60

•

22 years ago

that's a good point, but I don't think we're planning on storing the spam header in the message. If we did, we should filter out whatever came from the server, and put our own in.
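If a status header were ever stored, the "filter out whatever came from the server, and put our own in" step could look roughly like this. It is a sketch under assumptions: the header name `X-Mozilla-Spam` is only the placeholder from comment 59, and Python's standard `email` package stands in for the mail client's own parser.

```python
from email import message_from_string

# Placeholder header name taken from the example in comment 59, not a real Mozilla header.
STATUS_HEADER = "X-Mozilla-Spam"


def sanitize_and_tag(raw_message: str, is_spam: bool):
    """Drop any sender-supplied status header, then record the local verdict."""
    msg = message_from_string(raw_message)
    del msg[STATUS_HEADER]  # removes every occurrence; no error if the header is absent
    msg[STATUS_HEADER] = "yes" if is_spam else "not"
    return msg
```

Because the incoming value is discarded before classification results are written, a spammer adding the header gains nothing.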

Russell Odom

Comment 61

•

22 years ago

Re comment 60 - would this be a problem if you read your mail from two different places (e.g. with IMAP, from home and from work)? Suppose I read my mail at work, mark as spam or whatever, then at home I read it again and the header added by my work installation is stripped as my home installation downloads the message from the server. A way round this would be to store something that's unique to this installation in the header, and ignore the header if it doesn't match our installation, e.g. X-Mozilla-Spam: 3njh84n5 not

David :Bienvenu

Comment 62

•

22 years ago

Russell, in IMAP, you can't just add a header to a message. The way to solve this with IMAP is to add a custom keyword to the message, on the server. This will only work on some IMAP servers, but I don't think there's a solution that will work on all IMAP servers.

Phillip J. Eby

Comment 63

•

22 years ago

I just thought I'd mention that the SpamBayes project http://spambayes.sf.net/ has now pretty much completed its algorithm research phase, and now has a "clear winner" algorithm based on chi-square probabilities, which you may want to consider upgrading to. In contrast to Graham's original algorithm and the Robinson variations in use a month ago, the "chi-combining" algorithm has a usable "unsure" range where it "knows that it doesn't know" what a particular e-mail is. The earlier algorithms were either sure all the time, or unsure in unusable ways (e.g. drifting "unsure" ranges). The new algorithm can still be "trained" incrementally using pretty much the same word-frequency data, and has far fewer biases and tweak factors. There have also been many tokenizer improvements.

Last, but not least, the project has a usable Outlook 2000 add-in that can be looked at for UI ideas and perhaps used for testing to verify that same-or-better results are produced by the Mozilla implementation of the classifier. One of the most difficult things about implementing this kind of filtering is validating your algorithms; the spambayes people have put an enormous amount of both theoretical work and large-scale testing into statistical validation. It would be good to make use of it here.
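
For readers unfamiliar with chi-combining, here is a small sketch of the idea as I understand it from the SpamBayes approach: per-word spam probabilities are combined with Fisher's method twice, once as evidence for spam and once as evidence for ham, and the two are averaged. The function names and the cutoffs in `verdict` are assumptions for illustration, not the Mozilla implementation.

```python
import math


def chi2q(x2: float, dof: int) -> float:
    """P(chi-square with `dof` degrees of freedom >= x2); `dof` must be even."""
    if dof % 2:
        raise ValueError("dof must be even")
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(word_probs) -> float:
    """Combine per-word spam probabilities into one score in [0, 1]."""
    # Clamp so that log() never sees exactly 0.
    probs = [min(max(p, 1e-6), 1.0 - 1e-6) for p in word_probs]
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    spam = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam - ham + 1.0) / 2.0


def verdict(score: float, ham_cutoff: float = 0.2, spam_cutoff: float = 0.9) -> str:
    """Map a score to a label; the cutoffs here are illustrative assumptions."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


print(verdict(chi_combine([0.99] * 10)))               # strongly spammy words
print(verdict(chi_combine([0.99] * 5 + [0.01] * 5)))   # conflicting evidence
```

The second call shows the "knows that it doesn't know" behaviour: strong evidence in both directions cancels out and the score lands in the middle instead of being forced to one extreme.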

kirun

Comment 64

•

22 years ago

Does this just count words in the message, or does it consider other clues as "words"? I mean things like "Message is HTML", "Out of date", etc. Bug 151622 is asking for score-based filtering, which covers most of these ideas, but with manual scoring. It seems a good idea to use the Bayesian engine to make it automatic.

Tim McNerney

Comment 65

•

22 years ago

Some thoughts on the UI for this.

Add a special mail folder called Spam. At some point, add the ability to have messages older than a certain age purged.

Add another menubar button next to Delete called Spam, shown when viewing any mail folder except Spam. When the user presses this on a message, the message goes to the Spam folder and is processed and added to the filter list. At some point, add the ability to automatically report the spam to a blacklist/spam maintainer.

When viewing the Spam folder, the button becomes NotSpam (or something that sounds better). Pressing this moves the mail back into the proper folder (as determined by the user's other filters) and processes the mail, removing it from the bad mail list and adding it to the good mail list.

The first time the user hits the Spam button, bring up a brief descriptive dialog explaining what is going on and allowing the user to set other filter options (such as reporting to a blacklist maintainer).

Any mail that comes in and doesn't get filtered as Spam is processed and added to the good list.

In practice, the user would start off with empty lists for both. They could go through existing mail and mark it as Spam to seed the list, but this isn't necessary; all they would need to do is mark new mail as Spam. This means that the filter would not be as accurate at first, but would improve quickly. It is also very simple for the average user to start with.

Aleksander Adamowski

Comment 66

•

22 years ago

Tim: see bug 169638. UI which is almost ready seems pretty similar to what you suggested. Adding this feature's UI bug, bug 169638, to dependencies so that people become aware of its existence.

Aleksander Adamowski

Comment 67

•

22 years ago

Adding this bug to dependencies in bug 11035. This is after all a spam-blocking feature, right?

Blocks: 11035

Eric Krock

Reporter

Comment 68

•

22 years ago

We'd better not use the term "Spam" in the UI. Spam is a registered trademark and using it within the product UI could draw a lawsuit. The meat company that owns the trademark *is* actively trying to defend its trademark and would love a victory against a high-profile target like Mozilla. Suggest "Junk Mail" instead.

Russell Odom

Comment 69

•

22 years ago

Not true, according to http://www.spam.com/ci/ci_in.htm

> We do not object to use of this slang term to describe UCE, although we do object to the use of our product image in association with that term. Also, if the term is to be used, it should be used in all lower-case letters to distinguish it from our trademark SPAM, which should be used with all uppercase letters.

Michael Baffoni

Comment 70

•

22 years ago

How will the filtering work for IMAP connections? As I understand it, the filters will run as the headers are generated for the message list, however a typical IMAP connection does not retrieve any part of the message until it is read/moved/copied/etc. Therefore, in a typical IMAP connection, only the message headers will be available for running through the filter. Will the filter work only on the headers, or will it force a download of the text/html, or text/plain parts of messages so it can run against them (or maybe an option for either)?

BTW, I don't know how much information is passed via the initial IMAP listing, but if it lists the attachment headers, that would be good enough since so many spam messages are obvious with idiotic content like:

Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: base64
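
To illustrate the header-only case, here is a small sketch of how MIME headers could be turned into synthetic tokens for a Bayesian filter, including one for the base64-encoded plain text pattern mentioned above. The token names and the `header_tokens` function are hypothetical; nothing here reflects what the Mozilla tokenizer actually emits.

```python
from email import message_from_string


def header_tokens(raw_headers: str):
    """Turn MIME headers into synthetic tokens a word-based filter could count."""
    msg = message_from_string(raw_headers)
    ctype = msg.get_content_type()  # defaults to text/plain when absent
    encoding = (msg.get("Content-Transfer-Encoding") or "").strip().lower()

    tokens = [f"content-type:{ctype}"]
    if encoding:
        tokens.append(f"cte:{encoding}")
    # Text that is base64-encoded for no obvious reason gets its own token.
    if ctype.startswith("text/") and encoding == "base64":
        tokens.append("meta:base64-text")
    return tokens


raw = (
    'Content-Type: text/plain; charset="iso-8859-1"\n'
    "Content-Transfer-Encoding: base64\n\n"
)
print(header_tokens(raw))
```

Because the tokens are just strings, they can be trained and scored exactly like ordinary words, so the filter decides from the user's own mail how suspicious the pattern really is.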

Aleksander Adamowski

Comment 71

•

22 years ago

Michael: the UI for message classification has been enabled in trunk builds (see bug 169638). From my testing it seems that bodies of new, incoming messages are fetched and run through the filter, while pre-existing messages have an icon indicating that they haven't been processed yet. You can select them, then Tools->Run Junk Mail Controls on them manually.

Nathanael C. Nerode

Comment 72

•

22 years ago

What do we have to do to test this? Current nightly has the UI, but the UI doesn't do anything at all.

Quinn Yost (mythdraug)

Comment 73

•

22 years ago

Nathan: There is a problem with this in (at least) the windows installer versions of the nightlies. The .zip appears to function correctly. See bug 179150

Jon Granrose

Comment 74

•

22 years ago

it works in installer builds starting with 2002-11-08-13

nemo

Comment 75

•

22 years ago

Been trying this out excitedly. A few questions. Does the Junk Mail log do anything? I don't see anything being written to it. Messages seem to be labelled as junk after they have all been retrieved. So my attempt to use a filter that sends all incoming messages with the junk flag to a particular folder fails. I have to choose to apply filters to the current folder to have them move. Any chance they can get flagged before the filters are applied?

Oliver Klee

Comment 76

•

22 years ago

My junk mail log shows all occurrences of a mail (in Inbox) getting automatically marked as junk (after they have been retrieved). Marking messages as junk (or not junk) automatically after they have been moved to a folder by a filter is bug 180153 (for POP).

nemo

Comment 77

•

22 years ago

Right. Mine get marked as well. And I'm retrieving locally. It isn't a problem with marking, it is the problem that I have a rule that says if something is marked junk, move it to another folder. All the other rules get applied on retrieval, but that one doesn't, since on retrieval, status is still undefined. Status gets switched immediately after all are retrieved, but that is too late for the rule. Or at least, that's what it seems is happening.

Dan Mosedale (:dmosedale, :dmose)

Comment 78

•

22 years ago

The junk mail code has landed on the trunk. New bugs are being filed as blocking bug 11305, so if you have an issue with Junk Mail, see if it's in the dependency tree in that bug, and if not, open a new one.

Status: ASSIGNED → RESOLVED

Closed: 22 years ago

Resolution: --- → FIXED

Ashley Bischoff (blog at handcoding.com)

Comment 79

•

22 years ago

Dan: Did you really mean bug 11305? That's a VERIFIED-WONTFIX about nsAppleSingleDecoder or something.

Markus Gerstel

Comment 80

•

22 years ago

He probably meant Bug 11035

Steve

Comment 81

•

22 years ago

IDEA: Address Book Immunity Rule
-------------------------------------

One of the biggest problems with making spam filters is making a filter that keeps unwanted stuff out without trashing things you want. How about making a rule that makes email with an address from the user's address book immune to any _or_ particular filters? In the user interface this could be a check box titled: "don't apply this rule to people listed in my address book". This would cut down on email getting lost from an overly tight filter and it would cut down on having to wade through a folder full of junk mail to make sure you are not throwing out anything important.

Just a thought...

Steve

David :Bienvenu

Comment 82

•

22 years ago

we do that - it's called address book whitelisting, and if it's turned on, we won't apply the spam filter to a message from someone in your address book. There's a plan to extend it so you can check against multiple address books, but as of today, you can use a single address book as your whitelist.
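
As a rough illustration of the whitelisting behaviour described above, the check amounts to skipping the junk filter when the sender's address is in the chosen address book. The function name, the set-based address book, and the `whitelist_enabled` flag are assumptions for this sketch, not the actual Mozilla code.

```python
from email.utils import parseaddr


def should_run_junk_filter(from_header: str, address_book: set,
                           whitelist_enabled: bool = True) -> bool:
    """Return False when the sender is whitelisted via the address book."""
    if not whitelist_enabled:
        return True
    _, addr = parseaddr(from_header)
    known = {entry.lower() for entry in address_book}
    return addr.lower() not in known


book = {"alice@example.com"}
print(should_run_junk_filter("Alice <Alice@Example.com>", book))  # whitelisted
print(should_run_junk_filter("Mallory <m@spam.example>", book))   # filtered
```

Since From headers are easy to forge, a whitelist like this trades some protection for fewer false positives, which is presumably why it is an option and not always on.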

David A. Wheeler

Comment 83

•

22 years ago

FYI, a different anti-spam technique is noted in bug #187044, called "challenge-response". It appears to me that the Bayesian and challenge-response techniques can be used together - the combination should REALLY cut down on spam, and make it harder for spammers to overcome (since they would have to circumvent TWO mechanisms).

Adam Masri

Comment 84

•

22 years ago

It would also be nice to see bug 184948 completed at some point in the future. That way, fewer spammers would know their message was received. - Adam

Myk Melez [:myk] [@mykmelez]

Updated

•

20 years ago

Product: MailNews → Core

Nobody; OK to take it and work on it

Updated

•

16 years ago

Product: Core → MailNews Core
