Closed Bug 181534 Opened 22 years ago Closed 20 years ago

Update Junk mail filtering to use latest Bayesian techniques from spambayes.sf.net

Categories

(MailNews Core :: Filters, enhancement, P3)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.7beta

People

(Reporter: james.a.hunziker, Assigned: mscott)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files, 3 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.2b) Gecko/20021017
Build Identifier: Mozilla 1.3a Nightly 2002/11/19

The current Bayesian Junk filtering uses the original Paul Graham technique for
classifying messages.  The group at http://spambayes.sourceforge.net has done
lots of work on perfecting this technique.

One of the most useful changes is using a chi-squared probability distribution.
 This makes the filter more willing to classify a message's spam status as
unknown in cases where it shouldn't know whether something is spam.

For information, check out the comments in their source code at:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/spambayes/spambayes/classifier.py

Reproducible: Always

Steps to Reproduce:
1.
2.
3.
QA Contact: laurel → dmose
According to bug 163188 comments 63 and 64, this could be integrated with bug 151622
I did find that comment when I searched on this stuff, but I thought it deserved
a bug of its own.  The scoring bug you mention really isn't all that related. 
It describes a bunch of random things that indicate that something is spam.

The real problem here is that the algorithm that is in the Bayesian scoring code
is a little flawed - not flawed enough to cause terrible problems, but flawed
enough that the junk mail filter ends up being a little overconfident.

When the functionality appears that moves messages automatically into a Junk
folder, I imagine this overconfidence will be more obvious.
OS: MacOS X → All
Hardware: Macintosh → All
Yeah, this sounds like a good idea.
Assignee: naving → beard
Blocks: 11035, 168553, 178934
Status: UNCONFIRMED → NEW
Ever confirmed: true
No longer blocks: 168553, 178934
Summary: Update Junk mail filtering to use latest Bayes techniques from spambayes.sf.net → Update Junk mail filtering to use latest Bayesian techniques from spambayes.sf.net
Is this functionality enabled in the daily Win builds? In

Mozilla/5.0 (rv:1.3b) Gecko/20030109

I see the Junk icon showing up, but does not seem to do any logging? (And the
move to folder checkbox is disabled too)
> I see the Junk icon showing up, but does not seem to do any logging? (And the

Seems that logging should happen whenever a junk message is detected if the
logging checkbox is checked -- if it doesn't, file a bug

> move to folder checkbox is disabled too)

http://www.mozilla.org/mailnews/spam-faq.html#q2, bug 181394.
Thanks for the pointer.
I'm starting to think this would be a good idea too. The filter that's currently
in Moz is showing some severe deficiencies for me. 

For example, I've received about 10 emails from "greatdeals@greatoffrs.com"
recently, and marked them all as junk (in addition to other giant piles of spam,
of course). They still keep coming. While they don't have much else in common,
the email address is always the same. 

I would hope that the filter would give at least some weight to the headers.

This isn't the only example, just the most egregious. I'm still seeing 2-3
unfiltered spams a day out of ~30 that are caught (i.e. only about 90% success
rate). A big improvement, no doubt, but...

A concise rundown by Gary Robinson of how to improve spam detection (which was
implemented by SpamBayes) can be seen here:

http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
Blocks: 181631
No longer blocks: 181631
Hi, I'm the author of the "Spam Detection" piece referred to in comment #8. Let me know if I can 
help.
I would also like a separate classification state for "unsure" messages which are
not certain to be spam.
Reply to comment 10: The technique from the Spam Detection piece seems to be good at 
identifying the unsure spams. I think that's the thing the Spambayes and Bogofilter groups 
consider to be its main advantage over the techniques it's been tested against.
Reply to comment 10: This is a good idea, and reflects something that should be
added to the UI, but is off the subject of this bug.  I filed bug 209898 a while
ago, requesting more states in the spam checkbox.  Just having two states,
"spam" and "not spam", is not sufficient.  It leads to problems when seeing new
messages in the mailbox, since new messages automatically default to "not spam"
in the current system.  I also filed bug 209898, requesting a user-visible spam
score or percentage, so that the "unsure" messages can be easily seen, and
perhaps sorted based on likelihood of being spam.  I'd like feedback on these
two bugs.  These bugs are enhancements to the UI and overall mail engine, which
affect the handling of mail determined to be spam, and are separate from this
bug 181534, which affects the underlying Bayesian techniques used to sort the
spam in the first place.
Sorry, I meant to have links to bug 209890 and bug 209898.  Something got
doubled-up in the cut and paste.

I have made some (as yet unpublished) improvements to the chi-square algorithm used by 
SpamBayes, Bogofilter, etc. I'll try and put a comment here when it's available online, but in the 
meantime if someone wants to implement it for Mozilla, please contact me.
Taking, we should look into this.
Assignee: beard → peterv
Priority: -- → P3
Target Milestone: --- → mozilla1.7alpha
Any news on this? The junk filters are *still* letting through messages with
"viagra" in the subject (and body) in plain text. (I assure you that such
messages have been 100% spam, so if I understand the algorithm correctly this
word should weigh down a message like a lead anchor)
Agree 100% with comment #16.  I can't count the number of times I've tagged
emails with "viagra" in the subject as spam, and yet I get several a day that
still show up.  Another point:  Over the last couple of weeks I've seen a surge
in spam subject lines with multiple periods (".") interspersed throughout.  I've
tagged over 100 of those as spam.  You'd think the
if (num_subject_periods > 2)
then spam = true;
rule would have kicked in by now.  But no.
I know this comment won't help make any progress on this bug but I'll make it
anyway :)

I use both Mozilla Mail and SpamBayes' Outlook plugin.  In my experience Mozilla
recognizes about 70-80% of spam and gets about .2% (1 out of 500) false
positives.  SpamBayes recognizes at least 99% of spam with 0 false positives
(because of the "Possible Spam" category).  I haven't seen a single spam in my
Outlook inbox in over a month, so you could say it's 100%.

I have some experience in C++ and I wish I could help, but I can't wrap my head
around Mozilla's source code.  This bug would be a great one to fix.
FYI, here is the function which attempts to classify a message based on a set of
tokens extracted from the message:

http://lxr.mozilla.org/seamonkey/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#604

I'm not an expert on the spam probability work and have not read through all the
documentation. Would the main improvement suggested by Gary Robinson (and the
source of this bug), just involve changes to this one routine? 

i.e. changing the formula we use for classifying a message.

If so, it seems like we should be able to pick up this improvement pretty
easily. Maybe someone with more knowledge can look at this routine and the
current research by Gary and others and advise us.
There's a little more to it than just changing that one routine.  The new
algorithm has the extra 'unsure' category.  The 'unsure' category is essential
when you consider the 3rd graph on this page:
http://spambayes.sourceforge.net/background.html

This extra category would probably require a bunch of new code and also a
redesign of the UI.
implementing the unsure feature is not going to happen for a while, as you said
that's a huge change that will require re-working the UI and changing how we do
things. That technique is referred to in comment #10. Comment #12 cites several
spin off bugs which address an unsure state for messages. 

The initial description of this bug was to change the classification algorithm
to use a chi-squared probability distribution. Can we not improve our algorithm
by taking that approach without going down the path of bringing in a new state:
unsure? I was assuming/hoping so. Maybe Gary can chime in and help direct us here. 
Skimmed through 

http://spambayes.sourceforge.net/background.html

thanks for the link.

For now, can't we treat Unsure messages as not junk and let the user train /
mark it as junk on their own? i.e. "we don't know if this is junk or not junk so
we are going to leave it alone and just not mark it as junk." That seems like a
safe and reasonable assumption.

I would like to have one bug (this one) which would try to improve the core
algorithm by using a chi-squared probability distribution and then a different
bug for later development to possibly show the unsure state in the UI. This
later work is very expensive and the resources are not there to do it. Improving
the algorithm is something we could realistically go after in the short term to
make the experience better.
I'm a little confused by the difficulty of the "Unsure" state. I recall
distinctly during the initial development of the Junk Filter that the Unsure
state existed (though I'm not really certain what constituted "Unsure" at the time).

Was that work not something worth following, or is it worth pulling it out of
mothballs for use here?
Unless things have significantly changed since the code was originally written,
I believe Joe is right, and almost all of the infrastructure is already there. 
There are already 3 UI states, it just so happens that two of them currently
look identical to the end user (well, it's very slightly more complicated than
that, but not much).  Whether or not a message is spam is already expressed
internally as a confidence score, so having a range of that score be interpreted
as "unsure" should be easy enough.
Quick comment... if it really isn't practical to show the "unsure" state, then I agree with the idea of 
grouping the unsures with the good mail.

With the chi-square approach, the unsures are labelled that way, the vast majority of the time, not 
because there is no evidence either way, but because there is significant evidence in both 
directions. So to avoid losing valuable emails, I think it makes sense to group them with the Goods 
rather than junk them.
In SpamBayes the default probabilities for the three categories are
0-.2 Ham
.2-.9 Unsure
.9-1.0 Spam

So if we're not going to have Unsure our categories would be
0-.9 Not Junk
.9-1.0 Junk

Gary, would you agree?
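
For illustration, here is a minimal sketch of how such cutoffs could be applied
to a message score. The names and defaults below are hypothetical, not anything
from the Mozilla tree:

// Minimal sketch of three-way classification using the cutoffs above.
enum MsgClass { eHam, eUnsure, eSpam };

MsgClass ClassifyScore(double prob, double hamCutoff = 0.2,
                       double spamCutoff = 0.9)
{
    if (prob < hamCutoff)
        return eHam;
    if (prob >= spamCutoff)
        return eSpam;
    return eUnsure;   // collapse to eHam if the UI has no "unsure" state
}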
I'd say that if an "unsure" category is too hard to implement now (or deemed undesirable for other 
reasons), then we want to be sure to classify the stuff right around .5 as ham, and anything below 
that, as ham.

Looking at the spambayes chart (3rd chart at http://spambayes.sourceforge.net/background.html), 
which is some of the best testing on this technique, I'd choose a ham/spam cutoff of .51.

But if there is a means to give the user a slider which allows him to choose the sensitivity, that would 
be a good idea. (I personally use SpamSieve on OS X, which uses the chi technique and provides 
such a slider.) 

My suggestion, if Mozilla developers want to try the chi-square approach, would be to use the links 
at http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html, particularly 
those to Greg Louis' work on finding the best parameters.
I think that if Mozilla doesn't handle "unsure" as a separate state that is
somehow distinguishable and filterable, then the advantage of the spambayes
method is greatly reduced.

One of the whole points of going with a scheme like the chi-squared method is
that the user can, with extremely high confidence, ignore the spam category,
enjoy a spam-free ham category, and have a *very small* list of emails that 
must be inspected manually to determine what they are. 

Putting the unsures into either ham or spam is theoretically quite bad, but UI
reality intrudes...

As people have observed, of the two choices, uncertains should *clearly* be put
into the ham category. False positives are devastating and must be avoided at
all costs. Re: comment 28, I think .51 is uncomfortably close to this boundary,
but a moderate value such as .55 would be reasonable, assuming that these
statistics hold in wide use (which might be a bad assumption). 

A dynamic threshold would be better (i.e. adjust it based on when the user
corrects the classification), but tricky to implement right... a pref that set
this manually would be nice for experts though, and wouldn't require UI. 
Blocks: spam
Given that this seems to be the generic Bayesian implementation bug, would
someone care to look over
http://bugzilla.mozilla.org/show_bug.cgi?id=132340#c9 ?

I was making some speculation about Bayesian filtering, and I was wondering if
it was valid.
Depends on: 230093
I've gone through the mozilla and SpamBayes sources and I've created a
semi-pseudo code for the core classification functions.  I hope it's easy to
understand.

This doesn't look like it'll be hard to implement (I wish I knew C++ better, so
I could do it). We're going to have to include the math.h header; the functions
used are log, exp, floor, frexp, and the constants M_E and M_LN2.
Attached file Mozilla & SpamBayes classifier psuedo-code (obsolete) —
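
For anyone who wants to see the combining step in roughly the form it would
take in C++, here is an unreviewed sketch loosely ported from SpamBayes'
classifier.py. It is illustrative only (the names are made up) and is not the
attached pseudo-code or any patch in this bug:

#include <cmath>
#include <vector>
#include <algorithm>

// chi2Q(x2, v): probability that a chi-squared variable with v (even) degrees
// of freedom is >= x2; same series expansion SpamBayes uses in classifier.py.
static double chi2Q(double x2, int v)
{
    double m = x2 / 2.0;
    double sum = std::exp(-m), term = sum;
    for (int i = 1; i < v / 2; ++i) {
        term *= m / i;
        sum  += term;
    }
    return std::min(sum, 1.0);
}

// Combine per-token spam probabilities into one message score in [0, 1].
// frexp() keeps the running products from underflowing, as in classifier.py.
// M_LN2 is the math.h constant mentioned above.
static double ChiSquaredCombine(const std::vector<double>& tokenProbs)
{
    if (tokenProbs.empty())
        return 0.5;                               // no evidence either way

    double H = 1.0, S = 1.0;
    int Hexp = 0, Sexp = 0;
    for (double p : tokenProbs) {
        S *= 1.0 - p;                             // spam evidence accumulator
        H *= p;                                   // ham evidence accumulator
        int e;
        if (S < 1e-200) { S = std::frexp(S, &e); Sexp += e; }
        if (H < 1e-200) { H = std::frexp(H, &e); Hexp += e; }
    }
    double lnS = std::log(S) + Sexp * M_LN2;
    double lnH = std::log(H) + Hexp * M_LN2;

    int n = (int)tokenProbs.size();
    double spamInd = 1.0 - chi2Q(-2.0 * lnS, 2 * n);
    double hamInd  = 1.0 - chi2Q(-2.0 * lnH, 2 * n);
    return (spamInd - hamInd + 1.0) / 2.0;        // ~0.5 when evidence conflicts
}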
I created bug 230093 for improvements to the tokenizer.  I'm assuming this bug
is just for the classification algorithm.
Attached patch First patch attempt (obsolete) — — Splinter Review
Disclaimers:  I am not an experienced C++ programmer.  This is my first attempt
at patching mozilla, so I don't know if this is correct as far as portability
or formatting.	I don't know how to use CVS, so I created this patch by using
'diff -u -8 -p nsBayesianFilter.cpp.orig nsBayesianFilter.cpp'.

In any case, this compiles on Linux and it seems to work.  This is only a
change to the core classification algorithm using the chi-squared combining
method invented by Gary Robinson, this does nothing about the 'unsure'
category.  I think I've copied the idea correctly from SpamBayes but I might be
wrong, I don't actually understand the math involved to make sure.  Also, I
don't know how much of a win this is going to be, I think that improvements to
the tokenizer might have a bigger impact.

I set the spam cutoff at 0.9 even though there was some discussion here about
setting it at 0.5 - 0.55.  0.5 seems dangerous as far as false positives, I
want to be at least 90% sure that an email is spam before it gets deleted. 
This number can be changed on the line that says 'PRBool isJunk = (prob >=
0.90)'.

I am currently testing this patch by simultaneously running a patched and an
unpatched copy of mozilla side-by-side.  I am going to let it run for a week
and then report the stats.  It would be great if other people here would also
test this.  To test you need to copy your training.dat file from your current
profile to the new profile.  You should apply the exact same training to both
copies.  Then you need to keep track of what happens in four categories tn, fn,
tp, fp:
tn: true negative, email correctly identified as not being spam
fn: false negative, email incorrectly identified as not being spam
tp: true positive, email correctly identified as being spam
fp: false positive, email incorrectly identified as being spam
Attachment #138425 - Attachment is obsolete: true
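
For anyone tallying results by hand, here is a trivial sketch for turning
those four counts into rates; the struct and function are hypothetical
helpers, not part of any patch here:

#include <cstdio>

// Summarize a test run from the four counts above (assumes nonzero totals).
struct JunkTestCounts { int tn, fn, tp, fp; };

void ReportRates(const JunkTestCounts& c)
{
    double caught = 100.0 * c.tp / (c.tp + c.fn);   // % of spam caught
    double fpRate = 100.0 * c.fp / (c.fp + c.tn);   // % of ham marked as junk
    printf("caught %.1f%% of spam, false positive rate %.2f%%\n",
           caught, fpRate);
}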
After writing the patch I found some problems with the pseudo-code I wrote, so I
obsoleted that attachment.
Miguel, thanks so much for getting this started. I'll be curious to see how your
results look. You may find it easier to compare the two if you leave the old
classification routine around as well and then use the junk mail log file to log
about messages the new routine caught that the old one did not. And vice versa.

You can do this by adapting the following line to do what you want:

PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("%s is junk probability = (%f)",
messageURI, prob));

You'll want to change the args of course to taste. 

Then you don't need to worry about running two builds at the same time and
comparing their results. Let the junk mail log do it for you in the same build.

Here are the results from my testing.  It looks good, my false negatives were
reduced to less than half!  Unfortunately, I only had a total of 163 emails to
test with and these results are specific to my training file and the types of
emails that I get.  Still, I think this shows that this will work.  I calculated
the numbers using different spam cutoffs to see what works best.

Total emails: 163, spam: 48, ham: 115
original algorithm:	tn:115, tp:12, fn:36, fp: 0
new algo w/ .9 cutoff:	tn:115, tp:22, fn:26, fp: 0
new algo w/ .5 cutoff:	tn:110, tp:35, fn:13, fp: 5
new algo w/ .55 cutoff:	tn:114, tp:35, fn:13, fp: 1

.55 was the best cutoff; anything less increased the fp rate and anything more
increased the fn rate.  The worrisome part is the one fp: that email was a "local breaking
news alert" that I subscribed to and it was full of advertising so it got a .66
score, this is a good example why the 'unsure' category would be nice.  We might
need to consider setting the cutoff at .7 or .8 to reduce the fp rate at the
expense of the fn rate.  Even so, we're better off with the new algorithm.

I'm attaching a spreadsheet with all the numbers in case anyone wants to look,
you can change the spam cutoff to see what happens, it even has a distribution
graph like the one on SpamBayes' web page (ok, I like to play with excel).  The
distribution looks similar to spambayes' considering that I only had 163 emails
and they had several thousand.  The main difference that I see is that in my
emails there are several spams that get low scores; in their graph they don't
seem to get any spams with scores less than .5.  Looking through the spams that
got low scores I have a suspicion that this would be fixed by improving the
tokenizer.
Attached file test result spreadsheet —
Yup, that's pretty exciting!  At what cutoff point does fp=0, though?  1-in-115
good emails erroneously marked as spam seems excessively high, as an fp is very
significantly less desirable than a fn.  A cutoff that can be tuned to spread
the new confidence evenly over a reduced fn rate and a reduced fp rate would
probably be ideal.
.66 was the highest rated ham in my corpus, so a .7 cutoff might be good; that
would result in fn=22.

I just found out that SpamAssassin has a huge 6,047 email public corpus available
for testing at http://spamassassin.org/publiccorpus.  When I get some time I'll
try to test against it.  To do that, I need to figure out how to stuff that
corpus into a POP3 server and then I'll train mozilla on half the emails and
then see how the other half gets rated.  That should give us a much more
accurate idea of what the cutoff should be.
Nice work.  You mentioned (comment 37) that an unsure category would be helpful,
and I definitely can see the logic to that.  Fortunately, something like this
could be a lot easier to pull off with some of the changes I'm working on for a
related bug (bug 181866).  In particular, it changes around the whole
onMessageClassified structure a bit so that it takes a string as the category
instead of an enum, and then instead of directly making decisions about how to
handle the classification (ie, should I move the message or not?), it passes
that off to nsISpamSettings to handle it.

I'm kind of busy with some other stuff at the moment but when I get a chance to
post a more up-to-date patch there it'll have these infrastructure changes that
will hopefully make plugging in alternate filters such as this one a bit easier.
Does anyone know if spam bayes has done work to limit the overall size of the
training set, or do they let it grow unbounded (which is how we behave)?
Regarding training, some ideas tossed around on
http://www.entrian.com/sbwiki/TrainingIdeas should be helpful.
From reading earthsound's link, it looks like the answer to mscott's question is
that SpamBayes is doing some testing but currently they let the database grow
unbounded.

They also say that sometimes it is good to trash your existing training database
and start over from scratch, especially if you've made some training mistakes. 
I've had it happen to me where my spam recognition seems to get worse and worse
over time. I was thinking that maybe it would be good to add an option to the
"Junk Mail Control" window for deleting your existing training database.
I've got the results from testing using SpamAssassin's email corpus.  The results
are mixed, the fn rate was a huge improvement but the fp actually went up.

Total emails: 4300, spam: 949, ham: 3350
original algorithm:	fn=205, fp=45
new algo w/ .5 cutoff:  fn=40,  fp=166
new algo w/ .7 cutoff:  fn=41,  fp=85
new algo w/ .9 cutoff:	fn=42,  fp=63
new algo w/ .99 cutoff: fn=44,  fp=47!

I don't know what to make of the high fp rates; it doesn't make sense to me.
 The fp number is actually low when you consider that it is in the .01%-.02%
range (1 to 2 out of 1,000 emails), but it seems like it should be lower than
that.  This makes me start wondering about the algorithm, I need to go back
through and double check everything against spambayes' source.

I didn't have time to test the tokenizer patches; I'm hoping they'll lower the
fp rates. I'll try to get around to it as soon as I can.
Has anybody tried contacting Tim Peters or Barry Warsaw (the originators of the
SpamBayes project) or any other current SpamBayes developer? It just seems
practical to inform the original Spambayes developers about this effort. They
may be interested in making this happen and can probably contribute ideas and
even code.

I also noticed that someone proposed a Thunderbird plug-in/extension of
SpamBayes on the feature requests tracker in the SourceForge project. It would
be a waste to have someone work on a plug-in when it could be directly
integrated into the app's junk mail functionality.

I'll post a link to this bug on their tracker in the hopes that it will spark
some collaboration effort.
(In reply to comment #46)
> I'll post a link to this bug on their tracker in the hopes that it will spark
> some collaboration effort.

I'm sure that the other spambayes developers and I would be happy to help out
with this if we can.  If you want the same math as spambayes uses, then it
should be pretty simple to copy it out of classifier.py and translate it into C
(I presume that this is what Miguel has done, although I haven't looked at the
code).  That said, if Gary has improvements, then it would be good to include
those, presumably (I'm sure spambayes will try them out, too, once he writes
them up somewhere).

If you want to use similar tokenizing as spambayes, then it would be worth
looking at the various testing results from each of the options.  All the
tokenizing that's currently done has been shown (in testing) to improve things,
so it would probably be worth reading up on the results to see if it's worth using
any that you aren't already (I have no idea what tokenizing is currently done).
 The testing results are scattered about a bit, unfortunately - most are in
tokenizer.py or the spambayes/spambayes-dev archives.

In response to the db size question, there has been a little bit of testing
regarding this, although it was a wee while ago. 
<http://cashew.wolfskeep.com/~popiel/spambayes/>.

Anyway, this was mostly to say that if there are any questions the spambayes
team can help with, by all means post them to the spambayes-dev list.
Here are some testing results comparing Thunderbird 0.5 with JUST the tokenizer
changes; I then did another run with the tokenizer changes plus Miguel's port
of the chi-squared probability distribution algorithm (instead of the pure
Bayesian algorithm) described in this bug.

The test set was somewhat arbitrary. So I don't think the resulting percentages
are important, what's more important is the delta in the percentage of spam
caught between the 3 test runs.

I took SpamAssassin's public corpus, randomly selected 200 HAM and 200 SPAM
from the corpus and used those as the training set. I then tested the training
set against the remaining 5602 messages. I could use help trying to figure out
what to use as a training set for building a baseline of comparisons. I suspect
having a well defined ratio of ham to spam for training would make the numbers
more interesting. Here are the results:

# of Messages: 5602
# SPAM: 1684

                       False Negatives   False Positives   Percentage Caught
Thunderbird 0.5              925                 1              45.07%
Tokenizer Changes            614                 3              63.54%
Tokenizer + Chi              190                22              88.72%

We are almost doubling the percentage of spam caught by combining the algorithm
with the tokenizer changes. However, the huge spike in false positives with the
algorithm change is alarming. Btw, these tests are done with the address book
whitelist filter turned OFF. So in practical use, the fp rate should get cut
down some because we white list folks already in your address book.
FP spike fixable by tuning the cutoff point, perhaps?
Here are some updated numbers by varying the cut off point for spam from 0.9,
0.95 and 0.99. 0.99 had the lowest false positive count, but obviously had the
worst percentage at false negatives.

Summary
# of Messages: 5602
# SPAM: 1684

                       False Negatives   False Positives   Percentage Caught
Thunderbird 0.5              925                 1              45.07%
Tokenizer Changes            614                 3              63.54%
Tokenizer + Chi (.9)         190                22              88.72%
Tokenizer + Chi (.95)        245                16              85.45%
Tokenizer + Chi (.99)        410                 4              75.65%

What does this mean? Maybe .95 is a good cut off point given this test data. We
might also want to allow this to be set via a pref so hackers can tweak it on
their own. Miguel mentioned that one should probably audit his algorithm patch
in case he made a mistake and that's why we have such a high false positive
count when using this new algorithm.
It seems to me that running the entire Spam Assassin corpus against any filter
based on an initial fixed set of training data isn't representative of what will
happen in the real world. 

In actual use, most false negatives will be trained by the user as spam, and 
false positives will be trained as ham. This would be easy to simulate, I would
think.

I suspect (based solely on personal experience and knowledge of Bayesian
analysis, but no true experimental data) that the existing algorithms/tokenizer
would perform proportionally worse as it's trained up during use vs. the
spambayes ones. 

In fact, with the existing mechanisms, I find that after a certain point my
false positive rate actually goes up with additional spam training (with some,
but a much smaller, effect on the false negative rate). I tend to get a lot of
newsletters, though... :-).

How much effort does Spam Assassin go to to "spectrally match" (if you'll pardon
the thinly stretched metaphor) their corpus to field spam/ham?
It seems that what's really needed here is live results from daily use.

Can this go into the nightly builds alongside the current algorithm? If they're
both there, the current one active, and the new one only able to be enabled via
a pref, then we can start to get real world results without disturbing the bulk
of the normal users.
Once I get a little more refinement going for processing headers in the
tokenizer, I plan on releasing a test build of thunderbird with this
functionality so folks can start seeing how it does in the 'real world'.
unfortunately there is no way to pref this new behavior so you can try both ways
with the same build. The changes were too far reaching.

FYI, for devs looking to contribute to this project but aren't fully familiar
with  mozilla, see this comment about the help I could use with the tokenizer:

http://bugzilla.mozilla.org/show_bug.cgi?id=230093#c9
Assignee: peterv → mscott
the latest patch in Bug #230093 includes the tokenizer changes plus the new
probability algorithm. I also added a pref, mail.adaptivefilters.junk_threshold
which will allow testers to change the cut off point for determining if a
message is junk or not. 

I hope to turn this patch into a test build within the next day or so.
FYI, I have made a thunderbird test build (Win32) with both the algorithm
change and the tokenizer changes I've been working on the last week or so.

You can download it here:

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/test/thunderbird-junk-test-win32.zip

I'd recommend re-training with a clean training.dat before judging the
effectiveness of the new stuff.

If you want to play around with the sensitivity of the filter, you can change
the probability threshold for determining if a message is junk or not (the
lower the number the more aggressive) by tweaking the following pref. The current
cut off is .90

+// the probability threshold over which messages are classified as junk
+// this number is divided by 100 before it is used. The classifier can be
+// fine-tuned by changing this pref. Typical values are .99, .95, .90, .5, etc.
+pref("mail.adaptivefilters.junk_threshold", 90);


Status: NEW → ASSIGNED
Target Milestone: mozilla1.7alpha → mozilla1.7beta
(In reply to comment #55)
> FYI, I have made a thunderbird test build (Win32) with both the algorithm
>...

Hi Scott,

I downloaded your test build but I couldn't get it to work for some reason. I
started with a fresh training.dat and activated the Junk Mail Controls through
the GUI interface, however, the application never detects or moves e-mails when
they arrive.

I even tried manually moving all the junk mails from the Junk folder to a new
temp folder and selected the "Run Junk Mail Controls on Folder" option. All it
did was clear the junk e-mail status on all the messages.

Any ideas?
(In reply to comment #56)
> I downloaded your test build but I couldn't get it to work for some reason. I
> started with a fresh training.dat and activated the Junk Mail Controls through
> the GUI interface, however, the application never detects or moves e-mails when
> they arrive.

I can confirm the same problem.  Nothing was marked as junk in 3 days, even
SpamAssassin mail with "***possible spam***" slapped into the header.  And I
probably get 100 pieces of spam a day at least, so there was plenty to learn from.
FWIW, I've been using this downloaded build and it's working fine. I just
dropped in the new exe directory, renamed training.dat, and went for it (ok, I
had to remove a few themes that were incompatible with v 0.5 Thunderbird). 

So far, it is seeming similar (fewer false positives now that I'm even partially
trained up, though, which is interesting... of course whitelisting helps that a
lot).
Attached patch updated patch (obsolete) — — Splinter Review
Miguel has put together a new patch for the new algorithm with some generous
help from the spam bayes folks. This fixes a couple bugs in the new algorithm.

In this patch, I have also included a change to how we count tokens in the
training set to match the approach taken by spam bayes. Before, if a token
occurred n times in a message and the user marked it as junk (or not junk) we
bumped the count for this token in the training set by n. Spam Bayes says when
adding to / removing from the training set, you should ignore the number of times
the token may appear in the message. Just bump the training set token count by
1, regardless of how many times the token in question occurs in the message.

Because of this accounting change, I highly recommend anyone trying these
patches at home to remove their training.dat file and re-train.
Attachment #138741 - Attachment is obsolete: true
Hi, sorry to jump in here suddenly, but reading mscott's last update inspired me
to throw in a few cents from my own research on the message classification. 
Quick background info - I did some work on a related topic for Mozilla - see bug
181866 (and the update on that is that the versions shown in that bug report are
unfortunately pretty far out of date - I'll try to get back to work on that if I
have time over the next month or two).

Anyway, if I'm understanding Scott's proposal, in machine learning terms he is
suggesting a switch from a "Multinomial Model" to a "Multi-variate Bernoulli
Model" - effectively going from a model that includes information about the
number of times a word appears in a document to a binary model.  In general, for
text classification, though, it seems that the multinomial model (that is, the
original method) actually works better, which makes some intuitive sense since
there is a difference between a word being mentioned once and one being
mentioned repeatedly.  Research has empirically confirmed this, a good paper on
the topic is this one: http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

However, of course, spam is a unique beast.  In particular, you don't want to
risk allowing one single word to dominate the classification of a message - if
by some coincidence you happen to get a legit email from a friend about Viagra,
you wouldn't want the frequency of the single spammy word Viagra to cause the
message to be flagged as spam.  Therefore, I propose a compromise, which I have
found in my own tests works well in addition to making linguistic sense.  Check
out section 4 of this paper:
http://haystack.lcs.mit.edu/papers/rennie.icml03.pdf.  The idea is that rather
than using the number of times the word has occurred in the document directly
(the term frequency or TF), you use log(1+TF).  This makes sense linguistically
because of Zipf's Law - basically, in this case, the key observation is that we
want the first time a term shows up to be the most significant, and then each
subsequent term to be less and less significant, but still to count for
something.  The 1+ helps mathematically stabilize the function, and finally if
you use log base 2 you have the nice property that if a word shows up once in a
document, the modified TF vector will still represent it as a 1 (further
occurrences will add fractional amounts).

I have implemented this in my own project (see bug link above), but obviously I
don't expect you all to take my word for it, so I would encourage you to try it
out to see which performs best - it may be that for spam the binary model works
best.  I will try to post a code snippet in a few days if my description was not
adequate.
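
In the meantime, here is a quick sketch of the log(1+TF) weighting described
above, using a base-2 log so a single occurrence still counts as 1; this is
illustrative only:

#include <cmath>

// log2(1 + TF): the first occurrence of a token contributes 1, later
// occurrences contribute progressively smaller fractional amounts.
double WeightedTermFrequency(int occurrencesInMessage)
{
    return std::log(1.0 + occurrencesInMessage) / M_LN2;
}
// WeightedTermFrequency(1) == 1.0, WeightedTermFrequency(10) ~= 3.46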
> http://haystack.lcs.mit.edu/papers/rennie.icml03.pdf.  The idea is that rather
> than using the number of times the word has occurred in the document directly
> (the term frequency or TF), you use log(1+TF).  This makes sense linguistically
> because of Zipf's Law 

Jumping in here --

I have thought about doing experiments with this with the chi-square approach but haven't gotten 
around to it yet. Jason (the Rennie of the above URL; haven't read the paper yet though) mentioned to 
me that they'd been doing a log(1+tf) calculation, and it made sense to me. I wouldn't necessarily 
associate it with the Zipf distribution; rather, to me the root of it is in information theory. In fact for 
information theoretic reasons I'm actually using a log of occurrence counts in a product we're working 
on.

I haven't tried it yet in the context of the "grahamian probability" calcs being used at the root of 
spambayes and  other spam filters that incorporate those calcs. I think it would be interesting to play 
with, and that it has the potential to improve performance over the version being tested now. However, 
I would also like to mention that based on my own experience in using logs of counts, any 
improvement is not likely to be so great that a possible incorporation of the current version of the 
spam filtering into the public-release-versions of mozilla should be held up for long over it.

I hope anyone who tests the log-of-counts approach will report results here.  As a starting point I'd 
suggest simply counting each email in which a word occurs as log TF rather than 1 in calculating the 
original grahamian probabilities.
Attached patch another updated patch — — Splinter Review
This version of the patch uses the same new token counting logic as the last
patch, but in less code.

It also fixes a minor sorting bug Miguel found yesterday.
Attachment #143253 - Attachment is obsolete: true
*** Bug 200087 has been marked as a duplicate of this bug. ***
Yikes, I just found a really bad problem where the token counts get shorted if
you delete training.dat and try to retrain by selecting a bunch of already
classified messages, re-classifying them with the same classification. (i.e. you
select messages in your junk folder and say mark as junk again)

See: Bug #237095 if you are interested in talking about this issue.

I can fix it, but the fix means allowing a user to classify the same message
over and over again which could lead to data skew with the token counts for
tokens in that message. I may have to allow the user to do that in order to fix
this retraining issue though.
I've been running with this patch for about a week now. It seems to work, though
frankly I'm unable to tell if it works better than the previous mechanism.

I still get many false negatives, but so far have not encountered a false
positive (could be due to better training that I had previously).

I've noticed that the spam system deals very badly with selecting 30,000+
messages and marking them as junk. Things get very locked up (still running, but
other mail functions are not executed - feels like all threads are tied up in
the junk classification).

It's also been a challenge actually training the new training.dat due to the bug
mentioned above (reclassifying already classified messages).

What would be extremely useful both for testing as well as a generally cool
feature would be to be able to see the junk score in the message header, and the
ability to set my "junk threshold" via a pref (a slider with values from "loose"
to "strict" would be great and would not needlessly confuse people who are
unfamiliar with how the classification system works).
Does anyone know how Spam Bayes handles multipart/alternative when feeding data
into the tokenizer? Do they feed both the text/html part and the text/plain part
into the tokenizer or just one of the parts?

Currently we feed both parts into the tokenizer. Things would be easier on my
end if I only had to feed one of the parts into the tokenizer which is why I am
asking.

The closest blurb I could find in the tokenizer script was:

# Later:  The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content.  options.ignore_redundant_html was introduced to control this,
# and it defaults to False.  Later:  ignore_redundant_html was also removed.

But I couldn't tell if that means they only use one part and ignore the other or
if they now tokenize on both parts.
I always understood that comment to mean that they tokenize both parts.  Since
the token counting scheme was changed so that multiple instances of a word in a
message are only counted once, it doesn't matter if redundant parts are tokenized.

Here's the comment on it on the SpamBayes mailing list,
http://mail.python.org/pipermail/spambayes/2002-October/001261.html
We tokenise both. A non-zero amount of spam had a text part that was pretty
simple and non-spammy, with all the spamminess in the HTML.
Hello, for anyone interested in the chi-square spam detection method,

1) I've written an article that introduces an improvement to the chi-square
algorithm (http://www.garyrobinson.net/2004/04/improved_chi.html) and includes
test results. The improvement is statistically significant.

2) Greg Louis of the Bogofilter project has also tested the improvement
mentioned above (/bogofilter/esf.html) and obtained positive results.

3) I've written another article about the theory behind the chi-square
technique (http://www.garyrobinson.net/2004/05/why_chi.html) and reasons for
using it.

4) I tried using the log of the number of messages as part of the chi-square algorithm. It didn't 
help, at least not significantly. There is some possibility that there  was an error in the test, but at 
this point I'm not encouraged that it would lead to a big improvement. I mention it because the 
subject came up earlier in these comments.
Interesting stuff, Gary.  Just to quickly respond to your point 4, I'm not sure
if you mis-stated that in your comment or if there was a misunderstanding in
where the log should be used, but to clarify, it should be the log of the number
of times each word occurs in a given document, not the log of the number of
documents.  Also keep in mind that this requires changes to the training set
which unfortunately cannot be retroactively applied unless you still have the
word counts for each individual document (as opposed to just the sums).

An example may be handy.  Consider a document which contains the word "the" 10
times and the word "and" 5 times.

If you were training, you would add log(10) to the number of times you have seen
"the" and log(5) to "and", while if you were classifying, rather than the usual
(for NB) p(the|spam)^10 * p(and|spam)^5 or
10*log(p(the|spam))+5*log(p(and|spam)), you would now have p(the|spam)^log(10) *
p(and|spam)^log(5) or log(10)*log(p(the|spam)) + log(5)*log(p(and|spam)).

If you are normalizing anything, the document length is now the sum of the logs
of the number of times each word occurs, so you would have the length of the
above document as log(10)+log(5) instead of 15.

I'm sure I messed something up there, but hopefully it made sense...
> it should be the log of the number
> of times each word occurs in a given document, not the log of the number of
> documents.  

Yes. That's what I meant (and did in the code).

> Also keep in mind that this requires changes to the training set
> which unfortunately cannot be retroactively applied unless you still have the
> word counts for each individual document (as opposed to just the sums)

Yes, I did that. I had all the original word counts.

I was a bit surprised that it didn't help, because I've used the log-of-the-number-of-things trick in 
other contexts. But it's also true that there is a mathematical basis for the chi-square technique that 
doesn't make complete sense if we use the log of the counts.

But it is also true that my results shouldn't be taken as definitive; there could have been some kind 
of error. It would be good for someone else to duplicate them. 
At long last this patch has finally made its way into the mozilla 1.8 trunk.
This was already part of the recently released Thunderbird 0.6 release.

The patch was checked in as part of Bug #230093.

I'm going to mark this bug as fixed.

However, Gary Robinson and others have already been talking about improvements
to the chi squared prob. approach that we want to continue discussion on. 

I have filed Bug #243430 to further track this aspect. I cc'ed a couple of you
on that bug already. Feel free to cc yourself if I missed you.

Hopefully we can continue to improve the new algorithm in that bug. I can't
thank everyone enough for their help in making this happen!
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Latest builds are having problems creating/adding to the training.dat file.  I've
tried deleting the training.dat file ... It will get re-created, but only the
first entry takes ... after that, it will no longer get updated.

Build ID 2004051608
See bug 243680 for this problem.
My experience with the new filtering - terrible. I open my mail today and it
marks *ALL* of my mail as Junk. In the last few days I got around 2-3 false
positives so I reset my training.dat but seeing all my mail marked as junk takes
the cake.

Does anyone else have a similar experience?
Pratik: the two comments before yours point to a problem that causes
training.dat not to get updated at all, thus not learning. Might this be what
you are seeing?
The behaviour I see (with Build 20040513) is this

- no training.dat is present.
- almost every mail is marked as junk (including the bugmail I got about the
last comment).

I'll try a more recent build just in case...
Pratik:
would you please actually read bug 243680 - which I just pointed you to?
It explicitly says that no training.dat is created.
One patch for that bug is already in and might already solve the problem, but
the real patch is in bug 193625 and still awaiting review.
spam bayes gurus....Someone in mozillazine pointed out the following to me. I'm
looking for validation of their comment about the value of an unknown word
probability:

"spam bayes uses a value of 0.5 where mozilla uses a value of 0.225 as the
coefficient in the calculation of probability of each token"

http://lxr.mozilla.org/mozilla/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#973

The original comment came from:

http://forums.mozillazine.org/viewtopic.php?t=81108

Does this sound like something we need to change or am I looking at the wrong
part of the algorithm?
As mentioned in a follow-up comment to the referenced mozillaZine topic,
SpamBayes multiplies the unknown word prob by the unknown word strength, which
results in 0.225 with the default values.  The exact calculation is as follows:

n = hamcount + spamcount => # of times word has been seen before
S = 0.45 => unknown word strength
X = 0.4 => unknown word probability

For a given word with base probability prob, the adjusted probability is:

prob = ((S * X) + (n * prob)) / (S + n)

If n = 0 indicating that the word has never been seen before then this reduces to:

prob = (S * X) / S = X

So the final probability of an unknown word is always the constant X = 0.5. 
Since SpamBayes ignores any words that have a probability between 0.4 and 0.6,
words that have never been seen before are always discarded from the calculation
of the overall message score.
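
Expressed as code, that adjustment looks roughly like the sketch below;
parameter names follow the comment above, and the X default uses the 0.5 value
from the conclusion (illustrative only):

// Robinson-style smoothing: pull the raw per-word probability toward the
// unknown-word probability X, weighted by the unknown-word strength S, until
// enough observations (timesSeen) accumulate.
double AdjustedWordProbability(double rawProb, int timesSeen,
                               double S = 0.45,   // unknown word strength
                               double X = 0.5)    // unknown word probability
{
    // timesSeen == 0 (never-seen word) reduces to (S * X) / S == X
    return (S * X + timesSeen * rawProb) / (S + timesSeen);
}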
I should probably go research this myself, but has anyone thought about upping
the spam probability of unknown words? The only emails I receive that have
"unknown" words in them these days are spam. I'd be willing to accept a slightly
higher temporary false positive rate in order to get such a greatly increased
permanent reduction in false negatives...
> I should probably go research this myself, but has anyone thought about upping
> the spam probability of unknown words? The only emails I receive that have
> "unknown" words in them these days are spam. I'd be willing to accept a slightly
> higher temporary false positive rate in order to get such a greatly increased
> permanent reduction in false negatives...

Given the fact that so many emails now contain random text as a disguise, that sounds like a good 
idea to me.

I would suggest: all unknown words should be treated as the same word, and a spam probability for 
the unknown word token should be calculated. That probability  should then be used as X in the 
equation in comment #80. In calculating the spam prob for the unknown word token itself, just 
leave X out of the equation, because it would be overwhelmed by the amount of unknown word data 
anyway.
Oh, and for that last suggestion (#82) to make any sense, the unknown words would have to be left 
in the processing and not removed like other "middle probability" words are.

I don't know if that would improve things or make them worse. It would be interesting to test, 
though.
On our iMail systems, we upped the unknown word probability from .4 to .6, and
that caught a LOT of the spams that were getting through before. We haven't
noticed any appreciable rise in false positives since we did that.
(In reply to comment #82)
> I would suggest: all unknown words should be treated as the same word, and a
> spam probability for the unknown word token should be calculated. That
> probability should then be used as X in the equation in comment #80. In
> calculating the spam prob for the unknown word token itself, just leave X out
> of the equation, because it would be overwhelmed by the amount of unknown
> word data anyway.

I'm curious as to how you would calculate the probability of the unknown words?
 At first glance it would seem that n = 0 no matter how many unknown words you
add together because you have no prior info about unknown words.  If you plug n
= 0 into the equation and leave out X then the equation reduces to S / S = 1.

BTW: sorry for the typo in comment #80: should be "X = 0.5", obviously, and not
"X = 0.4"
" At first glance it would seem that n = 0 no matter how many unknown words you
add together because you have no prior info about unknown words."

I'm suggesting we treat all unknown words as if they were the same word for the purpose of 
computing the initial Unknown Word probability. So the second time you get an email containing a 
word you haven't seen before, it won't be unknown in that sense -- it will be the second occurrence 
of The Unknown Word in an email, even though that actual word may not have been seen before.

It's just an idea. I admit I haven't spent a great deal of time thinking about it. But it seemed like a 
decent idea.
How would this affect those of us who work in multilingual environments? 
"Unknown" words might be rare in a unilingual environment but not so uncommon in
bi/tri/lingual nations.  I, for example, deal with French translations for our
bilingual website.  I don't get that many each week but enough that it would be
irritating if they were constantly hidden in the junk mail folder because some
French text is "unknown".

Just a concern.
Re: comment 87, the nice thing about Gary's suggestion is that in your
environment, the "unknown word" probability would quickly be trained down
because unknown words are common in your environment. Whereas in my environment,
I suspect they would get trained up to around .7, because they are relatively
uncommon for me.

I'd put a floor on it, though. Unknown words shouldn't become an active indicator
of ham, because all instances of the database will start out with all words
being unknown (and because they'll never be completely uncommon). 

Also, would there need to be some way of retrograding the unknown word
probability when a word becomes "known" (i.e. seen > once)? I tend to think not,
but I haven't run the math (even in my head :-). 
Before making changes like this, I strongly strongly recommend you make sure you
can test for it. The main thing that let us make SB as strong as it is was that
we  had a test-first approach - if the change didn't make it better, it didn't
go in. Please _don't_ just "check something in because it sounds like it should
make it better". Often the "obvious" fixes actually make stuff worse.
Product: MailNews → Core
Product: Core → MailNews Core