Closed Bug 181534 Opened 22 years ago Closed 20 years ago

Update Junk mail filtering to use latest Bayesian techniques from spambayes.sf.net

Categories

(MailNews Core :: Filters, enhancement, P3)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.7beta

People

(Reporter: james.a.hunziker, Assigned: mscott)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files, 3 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.2b) Gecko/20021017
Build Identifier: Mozilla 1.3a Nightly 2002/11/19

The current Bayesian Junk filtering uses the original Paul Graham technique for
classifying messages.  The group at http://spambayes.sourceforge.net has done
lots of work on perfecting this technique.

One of the most useful changes is using a chi-squared probability distribution.
 This makes the filter more willing to classify a message's spam status as
unknown in cases where it shouldn't know whether something is spam.

For information, check out the comments in their source code at:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/spambayes/spambayes/classifier.py

Reproducible: Always

Steps to Reproduce:
1.
2.
3.
QA Contact: laurel → dmose
According to bug 163188 comments 63 and 64, this could be integrated with bug 151622
I did find that comment when I searched on this stuff, but I thought it deserved
a bug of its own.  The scoring bug you mention really isn't all that related. 
It describes a bunch of random things that indicate that something is spam.

The real problem here is that the algorithm that is in the Bayesian scoring code
is a little flawed - not flawed enough to cause terrible problems, but flawed
enough that the junk mail filter ends up being a little overconfident.

When the functionality appears that moves messages automatically into a Junk
folder, I imagine this overconfidence will be more obvious.
OS: MacOS X → All
Hardware: Macintosh → All
Yeah, this sounds like a good idea.
Assignee: naving → beard
Blocks: 11035, 168553, 178934
Status: UNCONFIRMED → NEW
Ever confirmed: true
No longer blocks: 168553, 178934
Summary: Update Junk mail filtering to use latest Bayes techniques from spambayes.sf.net → Update Junk mail filtering to use latest Bayesian techniques from spambayes.sf.net
Is this functionality enabled in the daily Win builds? In

Mozilla/5.0 (rv:1.3b) Gecko/20030109

I see the Junk icon showing up, but does not seem to do any logging? (And the
move to folder checkbox is disabled too)
> I see the Junk icon showing up, but does not seem to do any logging? (And the

Seems that logging should happen whenever a junk message is detected if the
logging checkbox is checked -- if it doesn't, file a bug

> move to folder checkbox is disabled too)

http://www.mozilla.org/mailnews/spam-faq.html#q2, bug 181394.
Thanks for the pointer.
I'm starting to think this would be a good idea too. The filter that's currently
in Moz is showing some severe deficiencies for me. 

For example, I've received about 10 emails from "greatdeals@greatoffrs.com"
recently, and marked them all as junk (in addition to other giant piles of spam,
of course). They still keep coming. While they don't have much else in common,
the email address is always the same. 

I would hope that the filter would give at least some weight to the headers.

This isn't the only example, just the most egregious. I'm still seeing 2-3
unfiltered spams a day out of ~30 that are caught (i.e. only about 90% success
rate). A big improvement, no doubt, but...

A concise rundown by Gary Robinson of how to improve spam detection (which was
implemented by SpamBayes) can be seen here:

http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
Blocks: 181631
No longer blocks: 181631
Hi, I'm the author of the "Spam Detection" piece referred to in comment #8. Let me know if I can 
help.
I would also like a separate classification state for "unsure" messages which are
not certain to be spam.
Reply to comment 10: The technique from the Spam Detection piece seems to be good at 
identifying the unsure spams. I think that's the thing the Spambayes and Bogofilter groups 
consider to be its main advantage over the techniques it's been tested against.
Reply to comment 10: This is a good idea, and reflects something that should be
added to the UI, but is off the subject of this bug.  I filed bug 209898 a while
ago, requesting more states in the spam checkbox.  Just having two states,
"spam" and "not spam", is not sufficient.  It leads to problems when seeing new
messages in the mailbox, since new messages automatically default to "not spam"
in the current system.  I also filed bug 209898, requesting a user-visible spam
score or percentage, so that the "unsure" messages can be easily seen, and
perhaps sorted based on likelihood of being spam.  I'd like feedback on these
two bugs.  These bugs are enhancements to the UI and overall mail engine, which
affect the handling of mail determined to be spam, and are separate from this
bug 181534, which affects the underlying Bayesian techniques used to sort the
spam in the first place.
Sorry, I meant to have links to bug 209890 and bug 209898.  Something got
doubled-up in the cut and paste.

I have made some (as yet unpublished) improvements to the chi-square algorithm used by 
SpamBayes, Bogofilter, etc. I'll try and put a comment here when it's available online, but in the 
meantime if someone wants to implement it for Mozilla, please contact me.
Taking, we should look into this.
Assignee: beard → peterv
Priority: -- → P3
Target Milestone: --- → mozilla1.7alpha
Any news on this? The junk filters are *still* letting through messages with
"viagra" in the subject (and body) in plain text. (I assure you that such
messages have been 100% spam, so if I understand the algorithm correctly this
word should weigh down a message like a lead anchor)
Agree 100% with comment #16.  I can't count the number of times I've tagged
emails with "viagra" in the subject as spam, and yet I get several a day that
still show up.  Another point:  Over the last couple of weeks I've seen a surge
in spam subject lines with multiple periods (".") interspersed throughout.  I've
tagged over 100 of those as spam.  You'd think the
if (num_subject_periods > 2)
then spam = true;
rule would have kicked in by now.  But no.
I know this comment won't help make any progress on this bug but I'll make it
anyway :)

I use both Mozilla Mail and SpamBayes' Outlook plugin.  In my experience Mozilla
recognizes about 70-80% of spam and gets about .2% (1 out of 500) false
positives.  SpamBayes recognizes at least 99% of spam with 0 false positives
(because of the "Possible Spam" category).  I haven't seen a single spam in my
Outlook inbox in over a month, so you could say it's 100%.

I have some experience in C++ and I wish I could help, but I can't wrap my head
around Mozilla's source code.  This bug would be a great one to fix.
FYI, here is the function which attempts to classify a message based on a set of
tokens extracted from the message:

http://lxr.mozilla.org/seamonkey/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#604

I'm not an expert on the spam probability work and have not read through all the
documentation. Would the main improvement suggested by Gary Robinson (and the
source of this bug), just involve changes to this one routine? 

i.e. changing the formula we use for classifying a message.

If so, it seems like we should be able to pick up this improvement pretty
easily. Maybe someone with more knowledge can look at this routine and the
current research by Gary and others and advise us.
There's a little more to it than just changing that one routine.  The new
algorithm has the extra 'unsure' category.  The 'unsure' category is essential
when you consider the 3rd graph on this page:
http://spambayes.sourceforge.net/background.html

This extra category would probably require a bunch of new code and also a
redesign of the UI.
implementing the unsure feature is not going to happen for a while, as you said
that's a huge change that will require re-working the UI and changing how we do
things. That technique is referred to in comment #10. Comment #12 cites several
spin off bugs which address an unsure state for messages. 

The initial description of this bug was to change the classification algorithm
to use a chi-squared probability distribution. Can we not improve our algorithm
by taking that approach without going down the path of bringing in a new state:
unsure? I was assuming/hoping so. Maybe Gary can chime in and help direct us here. 
Skimmed through 

http://spambayes.sourceforge.net/background.html

thanks for the link.

For now, can't we treat Unsure messages as not junk and let the user train /
mark it as junk on their own? i.e. "we don't know if this is junk or not junk so
we are going to leave it alone and just not mark it as junk." That seems like a
safe and reasonable assumption.

I would like to have one bug (this one) which would try to improve the core
algorithm by using a chi-squared probability distribution and then a different
bug for later development to possibly show the unsure state in the UI. This
later work is very expensive and the resources are not there to do it. Improving
the algorithm is something we could realistically go after in the short term to
make the experience better.
I'm a little confused by the difficulty of the "Unsure" state. I recall
distinctly during the initial development of the Junk Filter that the Unsure
state existed (though I'm not really certain what constituted "Unsure" at the time).

Was that work not something worth following, or is it worth pulling it out of
mothballs for use here?
Unless things have significantly changed since the code was originally written,
I believe Joe is right, and almost all of the infrastructure is already there. 
There are already 3 UI states, it just so happens that two of them currently
look identical to the end user (well, it's very slightly more complicated than
that, but not much).  Whether or not a message is spam is already expressed
internally as a confidence score, so having a range of that score be interpreted
as "unsure" should be easy enough.
Quick comment... if it really isn't practical to show the "unsure" state, then I agree with the idea of 
grouping the unsures with the good mail.

With the chi-square approach, the unsures are labelled that way, the vast majority of the time, not 
because there is no evidence either way, but because there is significant evidence in both 
directions. So to avoid losing valuable emails, I think it makes sense to group them with the Goods 
rather than junk them.
In SpamBayes the default probabilities for the three categories are
0-.2 Ham
.2-.9 Unsure
.9-1.0 Spam

So if we're not going to have Unsure our categories would be
0-.9 Not Junk
.9-1.0 Junk

Gary, would you agree?
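
For illustration, here is a minimal sketch of how such cutoffs could be applied
to a message score. The names and defaults below are hypothetical, not anything
from the Mozilla tree:

// Minimal sketch of three-way classification using the cutoffs above.
enum MsgClass { eHam, eUnsure, eSpam };

MsgClass ClassifyScore(double prob, double hamCutoff = 0.2,
                       double spamCutoff = 0.9)
{
    if (prob < hamCutoff)
        return eHam;
    if (prob >= spamCutoff)
        return eSpam;
    return eUnsure;   // collapse to eHam if the UI has no "unsure" state
}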
I'd say that if an "unsure" category is too hard to implement now (or deemed undesirable for other 
reasons), then we want to be sure to classify the stuff right around .5 as ham, and anything below 
that, as ham.

Looking at the spambayes chart (3rd chart at http://spambayes.sourceforge.net/background.html), 
which is some of the best testing on this technique, I'd choose a ham/spam cutoff of .51.

But if there is a means to give the user a slider which allows him to choose the sensitivity, that would 
be a good idea. (I personally use SpamSieve on OS X, which uses the chi technique and provides 
such a slider.) 

My suggestion, if Mozilla developers want to try the chi-square approach, would be to use the links 
at http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html, particularly 
those to Greg Louis' work on finding the best parameters.
I think that if Mozilla doesn't handle "unsure" as a separate state that is
somehow distinguishable and filterable, then the advantage of the spambayes
method is greatly reduced.

One of the whole points of going with a scheme like the chi-squared method is
that the user can, with extremely high confidence, ignore the spam category,
enjoy a spam-free ham category, and have a *very small* list of emails that 
must be inspected manually to determine what they are. 

Putting the unsures into either ham or spam is theoretically quite bad, but UI
reality intrudes...

As people have observed, of the two choices, uncertains should *clearly* be put
into the ham category. False positives are devastating and must be avoided at
all costs. Re: comment 28, I think .51 is uncomfortably close to this boundary,
but a moderate value such as .55 would be reasonable, assuming that these
statistics hold in wide use (which might be a bad assumption). 

A dynamic threshold would be better (i.e. adjust it based on when the user
corrects the classification), but tricky to implement right... a pref that set
this manually would be nice for experts though, and wouldn't require UI. 
Blocks: spam
Given that this seems to be the generic Bayesian implementation bug, would
someone care to look over
http://bugzilla.mozilla.org/show_bug.cgi?id=132340#c9 ?

I was making some speculation about Bayesian filtering, and I was wondering if
it was valid.
Depends on: 230093
I've gone through the mozilla and SpamBayes sources and I've created a
semi-pseudo code for the core classification functions.  I hope it's easy to
understand.

This doesn't look like it'll be hard to implement (I wish I knew C++ better, so
I could do it). We're going to have to include the math.h header; the functions
used are log, exp, floor, frexp, and the constants M_E and M_LN2.
Attached file Mozilla & SpamBayes classifier psuedo-code (obsolete) —
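
For anyone who wants to see the combining step in roughly the form it would
take in C++, here is an unreviewed sketch loosely ported from SpamBayes'
classifier.py. It is illustrative only (the names are made up) and is not the
attached pseudo-code or any patch in this bug:

#include <cmath>
#include <vector>
#include <algorithm>

// chi2Q(x2, v): probability that a chi-squared variable with v (even) degrees
// of freedom is >= x2; same series expansion SpamBayes uses in classifier.py.
static double chi2Q(double x2, int v)
{
    double m = x2 / 2.0;
    double sum = std::exp(-m), term = sum;
    for (int i = 1; i < v / 2; ++i) {
        term *= m / i;
        sum  += term;
    }
    return std::min(sum, 1.0);
}

// Combine per-token spam probabilities into one message score in [0, 1].
// frexp() keeps the running products from underflowing, as in classifier.py.
// M_LN2 is the math.h constant mentioned above.
static double ChiSquaredCombine(const std::vector<double>& tokenProbs)
{
    if (tokenProbs.empty())
        return 0.5;                               // no evidence either way

    double H = 1.0, S = 1.0;
    int Hexp = 0, Sexp = 0;
    for (double p : tokenProbs) {
        S *= 1.0 - p;                             // spam evidence accumulator
        H *= p;                                   // ham evidence accumulator
        int e;
        if (S < 1e-200) { S = std::frexp(S, &e); Sexp += e; }
        if (H < 1e-200) { H = std::frexp(H, &e); Hexp += e; }
    }
    double lnS = std::log(S) + Sexp * M_LN2;
    double lnH = std::log(H) + Hexp * M_LN2;

    int n = (int)tokenProbs.size();
    double spamInd = 1.0 - chi2Q(-2.0 * lnS, 2 * n);
    double hamInd  = 1.0 - chi2Q(-2.0 * lnH, 2 * n);
    return (spamInd - hamInd + 1.0) / 2.0;        // ~0.5 when evidence conflicts
}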
I created bug 230093 for improvements to the tokenizer.  I'm assuming this bug
is just for the classification algorithm.
Attached patch First patch attempt (obsolete) — — Splinter Review
Disclaimers:  I am not an experienced C++ programmer.  This is my first attempt
at patching mozilla, so I don't know if this is correct as far as portability
or formatting.	I don't know how to use CVS, so I created this patch by using
'diff -u -8 -p nsBayesianFilter.cpp.orig nsBayesianFilter.cpp'.

In any case, this compiles on Linux and it seems to work.  This is only a
change to the core classification algorithm using the chi-squared combining
method invented by Gary Robinson, this does nothing about the 'unsure'
category.  I think I've copied the idea correctly from SpamBayes but I might be
wrong, I don't actually understand the math involved to make sure.  Also, I
don't know how much of a win this is going to be, I think that improvements to
the tokenizer might have a bigger impact.

I set the spam cutoff at 0.9 even though there was some discussion here about
setting it at 0.5 - 0.55.  0.5 seems dangerous as far as false positives, I
want to be at least 90% sure that an email is spam before it gets deleted. 
This number can be changed on the line that says 'PRBool isJunk = (prob >=
0.90)'.

I am currently testing this patch by simultaneously running a patched and an
unpatched copy of mozilla side-by-side.  I am going to let it run for a week
and then report the stats.  It would be great if other people here would also
test this.  To test you need to copy your training.dat file from your current
profile to the new profile.  You should apply the exact same training to both
copies.  Then you need to keep track of what happens in four categories tn, fn,
tp, fp:
tn: true negative, email correctly identified as not being spam
fn: false negative, email incorrectly identified as not being spam
tp: true positive, email correctly identified as being spam
fp: false positive, email incorrectly identified as being spam
Attachment #138425 - Attachment is obsolete: true
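
For anyone tallying results by hand, here is a trivial sketch for turning
those four counts into rates; the struct and function are hypothetical
helpers, not part of any patch here:

#include <cstdio>

// Summarize a test run from the four counts above (assumes nonzero totals).
struct JunkTestCounts { int tn, fn, tp, fp; };

void ReportRates(const JunkTestCounts& c)
{
    double caught = 100.0 * c.tp / (c.tp + c.fn);   // % of spam caught
    double fpRate = 100.0 * c.fp / (c.fp + c.tn);   // % of ham marked as junk
    printf("caught %.1f%% of spam, false positive rate %.2f%%\n",
           caught, fpRate);
}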
After writing the patch I found some problems with the pseudo-code I wrote, so I
obsoleted that attachment.
Miguel, thanks so much for getting this started. I'll be curious to see how your
results look. You may find it easier to compare the two if you leave the old
classification routine around as well and then use the junk mail log file to log
about messages the new routine caught that the old one did not. And vice versa.

You can do this by adapting the following line to do what you want:

PR_LOG(BayesianFilterLogModule, PR_LOG_ALWAYS, ("%s is junk probability = (%f)",
messageURI, prob));

You'll want to change the args of course to taste. 

Then you don't need to worry about running two builds at the same time and
comparing their results. Let the junk mail log do it for you in the same build.

Here are the results from my testing.  It looks good, my false negatives were
reduced to less than half!  Unfortunately, I only had a total of 163 emails to
test with and these results are specific to my training file and the types of
emails that I get.  Still, I think this shows that this will work.  I calculated
the numbers using different spam cutoffs to see what works best.

Total emails: 163, spam: 48, ham: 115
original algorithm:	tn:115, tp:12, fn:36, fp: 0
new algo w/ .9 cutoff:	tn:115, tp:22, fn:26, fp: 0
new algo w/ .5 cutoff:	tn:110, tp:35, fn:13, fp: 5
new algo w/ .55 cutoff:	tn:114, tp:35, fn:13, fp: 1

.55 was the best cutoff; anything less increased the fp rate and anything more
increased the fn rate.  The worrisome part is the one fp: that email was a "local breaking
news alert" that I subscribed to and it was full of advertising so it got a .66
score, this is a good example why the 'unsure' category would be nice.  We might
need to consider setting the cutoff at .7 or .8 to reduce the fp rate at the
expense of the fn rate.  Even so, we're better off with the new algorithm.

I'm attaching a spreadsheet with all the numbers in case anyone wants to look,
you can change the spam cutoff to see what happens, it even has a distribution
graph like the one on SpamBayes' web page (ok, I like to play with excel).  The
distribution looks similar to spambayes' considering that I only had 163 emails
and they had several thousand.  The main difference that I see is that in my
emails there are several spams that get low scores; in their graph they don't
seem to get any spams with scores less than .5.  Looking through the spams that
got low scores I have a suspicion that this would be fixed by improving the
tokenizer.
Attached file test result spreadsheet —
Yup, that's pretty exciting!  At what cutoff point does fp=0, though?  1-in-115
good emails erroneously marked as spam seems excessively high, as an fp is very
significantly less desirable than a fn.  A cutoff that can be tuned to spread
the new confidence evenly over a reduced fn rate and a reduced fp rate would
probably be ideal.
.66 was the highest rated ham in my corpus, so a .7 cutoff might be good; that
would result in fn=22.

I just found out that SpamAssassin has a huge 6,047 email public corpus available
for testing at http://spamassassin.org/publiccorpus.  When I get some time I'll
try to test against it.  To do that, I need to figure out how to stuff that
corpus into a POP3 server and then I'll train mozilla on half the emails and
then see how the other half gets rated.  That should give us a much more
accurate idea of what the cutoff should be.
Nice work.  You mentioned (comment 37) that an unsure category would be helpful,
and I definitely can see the logic to that.  Fortunately, something like this
could be a lot easier to pull off with some of the changes I'm working on for a
related bug (bug 181866).  In particular, it changes around the whole
onMessageClassified structure a bit so that it takes a string as the category
instead of an enum, and then instead of directly making decisions about how to
handle the classification (ie, should I move the message or not?), it passes
that off to nsISpamSettings to handle it.

I'm kind of busy with some other stuff at the moment but when I get a chance to
post a more up-to-date patch there it'll have these infrastructure changes that
will hopefully make plugging in alternate filters such as this one a bit easier.
Does anyone know if spam bayes has done work to limit the overall size of the
training set, or do they let it grow unbounded (which is how we behave)?
Regarding training, some ideas tossed around on
http://www.entrian.com/sbwiki/TrainingIdeas should be helpful.
From reading earthsound's link, it looks like the answer to mscott's question is
that SpamBayes is doing some testing but currently they let the database grow
unbounded.

They also say that sometimes it is good to trash your existing training database
and start over from scratch, especially if you've made some training mistakes. 
I've had it happen to me where my spam recognition seems to get worse and worse
over time. I was thinking that maybe it would be good to add an option to the
"Junk Mail Control" window for deleting your existing training database.
I've got the results from testing using SpamAssassin's email corpus.  The results
are mixed, the fn rate was a huge improvement but the fp actually went up.

Total emails: 4300, spam: 949, ham: 3350
original algorithm:	fn=205, fp=45
new algo w/ .5 cutoff:  fn=40,  fp=166
new algo w/ .7 cutoff:  fn=41,  fp=85
new algo w/ .9 cutoff:	fn=42,  fp=63
new algo w/ .99 cutoff: fn=44,  fp=47!

I don't know what to make of the high fp rates; it doesn't make sense to me.
 The fp number is actually low when you consider that it is in the .01%-.02%
range (1 to 2 out of 1,000 emails), but it seems like it should be lower than
that.  This makes me start wondering about the algorithm, I need to go back
through and double check everything against spambayes' source.

I didn't have time to test the tokenizer patches; I'm hoping they'll lower the
fp rates. I'll try to get around to it as soon as I can.
Has anybody tried contacting Tim Peters or Barry Warsaw (the originators of the
SpamBayes project) or any other current SpamBayes developer? It just seems
practical to inform the original Spambayes developers about this effort. They
may be interested in making this happen and can probably contribute ideas and
even code.

I also noticed that someone proposed a Thunderbird plug-in/extension of
SpamBayes on the feature requests tracker in the SourceForge project. It would
be a waste to have someone work on a plug-in when it could be directly
integrated into the app's junk mail functionality.

I'll post a link to this bug on their tracker in the hopes that it will spark
some collaboration effort.
(In reply to comment #46)
> I'll post a link to this bug on their tracker in the hopes that it will spark
> some collaboration effort.

I'm sure that the other spambayes developers and I would be happy to help out
with this if we can.  If you want the same math as spambayes uses, then it
should be pretty simple to copy it out of classifier.py and translate it into C
(I presume that this is what Miguel has done, although I haven't looked at the
code).  That said, if Gary has improvements, then it would be good to include
those, presumably (I'm sure spambayes will try them out, too, once he writes
them up somewhere).

If you want to use similar tokenizing as spambayes, then it would be worth
looking at the various testing results from each of the options.  All the
tokenizing that's currently done has been shown (in testing) to improve things,
so it would probably be worth reading up on the results to see if it's worth using
any that you aren't already (I have no idea what tokenizing is currently done).
 The testing results are scattered about a bit, unfortunately - most are in
tokenizer.py or the spambayes/spambayes-dev archives.

In response to the db size question, there has been a little bit of testing
regarding this, although it was a wee while ago. 
<http://cashew.wolfskeep.com/~popiel/spambayes/>.

Anyway, this was mostly to say that if there are any questions the spambayes
team can help with, by all means post them to the spambayes-dev list.
Here are some testing results comparing Thunderbird 0.5 with JUST the tokenizer
changes; I then did another run with the tokenizer changes plus Miguel's port
of the chi-squared probability distribution algorithm (instead of the pure
Bayesian algorithm) described in this bug.

The test set was somewhat arbitrary. So I don't think the resulting percentages
are important, what's more important is the delta in the percentage of spam
caught between the 3 test runs.

I took SpamAssassin's public corpus, randomly selected 200 HAM and 200 SPAM
from the corpus and used those as the training set. I then tested the training
set against the remaining 5602 messages. I could use help trying to figure out
what to use as a training set for building a baseline of comparisons. I suspect
having a well defined ratio of ham to spam for training would make the numbers
more interesting. Here are the results:

# of Messages: 5602
# SPAM: 1684

                       False Negatives   False Positives   Percentage Caught
Thunderbird 0.5              925                 1              45.07%
Tokenizer Changes            614                 3              63.54%
Tokenizer + Chi              190                22              88.72%

We are almost doubling the percentage of spam caught by combining the algorithm
with the tokenizer changes. However, the huge spike in false positives with the
algorithm change is alarming. Btw, these tests are done with the address book
whitelist filter turned OFF. So in practical use, the fp rate should get cut
down some because we white list folks already in your address book.
FP spike fixable by tuning the cutoff point, perhaps?
Here are some updated numbers by varying the cut off point for spam from 0.9,
0.95 and 0.99. 0.99 had the lowest false positive count, but obviously had the
worst percentage at false negatives.

Summary
# of Messages: 5602
# SPAM: 1684

                       False Negatives   False Positives   Percentage Caught
Thunderbird 0.5              925                 1              45.07%
Tokenizer Changes            614                 3              63.54%
Tokenizer + Chi (.9)         190                22              88.72%
Tokenizer + Chi (.95)        245                16              85.45%
Tokenizer + Chi (.99)        410                 4              75.65%

What does this mean? Maybe .95 is a good cut off point given this test data. We
might also want to allow this to be set via a pref so hackers can tweak it on
their own. Miguel mentioned that one should probably audit his algorithm patch
in case he made a mistake and that's why we have such a high false positive
count when using this new algorithm.
It seems to me that running the entire Spam Assassin corpus against any filter
based on an initial fixed set of training data isn't representative of what will
happen in the real world. 

In actual use, most false negatives will be trained by the user as spam, and 
false positives will be trained as ham. This would be easy to simulate, I would
think.

I suspect (based solely on personal experience and knowledge of Bayesian
analysis, but no true experimental data) that the existing algorithms/tokenizer
would perform proportionally worse as it's trained up during use vs. the
spambayes ones. 

In fact, with the existing mechanisms, I find that after a certain point my
false positive rate actually goes up with additional spam training (with some,
but a much smaller, effect on the false negative rate). I tend to get a lot of
newsletters, though... :-).

How much effort does Spam Assassin go to to "spectrally match" (if you'll pardon
the thinly stretched metaphor) their corpus to field spam/ham?
It seems that what's really needed here is live results from daily use.

Can this go into the nightly builds alongside the current algorithm? If they're
both there, the current one active, and the new one only able to be enabled via
a pref, then we can start to get real world results without disturbing the bulk
of the normal users.
Once I get a little more refinement going for processing headers in the
tokenizer, I plan on releasing a test build of thunderbird with this
functionality so folks can start seeing how it does in the 'real world'.
unfortunately there is no way to pref this new behavior so you can try both ways
with the same build. The changes were too far reaching.

FYI, for devs looking to contribute to this project but aren't fully familiar
with  mozilla, see this comment about the help I could use with the tokenizer:

http://bugzilla.mozilla.org/show_bug.cgi?id=230093#c9
Assignee: peterv → mscott
the latest patch in Bug #230093 includes the tokenizer changes plus the new
probability algorithm. I also added a pref, mail.adaptivefilters.junk_threshold
which will allow testers to change the cut off point for determining if a
message is junk or not. 

I hope to turn this patch into a test build within the next day or so.
FYI, I have made a thunderbird test build (Win32) with both the algorithm
change and the tokenizer changes I've been working on the last week or so.

You can download it here:

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/test/thunderbird-junk-test-win32.zip

I'd recommend re-training with a clean training.dat before judging the
effectiveness of the new stuff.

If you want to play around with the sensitivity of the filter, you can change
the probability threshold for determining if a message is junk or not (the
lower the number the more aggressive) by tweaking the following pref. The current
cut off is .90

+// the probability threshold over which messages are classified as junk
+// this number is divided by 100 before it is used. The classifier can be
+// fine-tuned by changing this pref. Typical values are .99, .95, .90, .5, etc.
+pref("mail.adaptivefilters.junk_threshold", 90);


Status: NEW → ASSIGNED
Target Milestone: mozilla1.7alpha → mozilla1.7beta
(In reply to comment #55)
> FYI, I have made a thunderbird test build (Win32) with both the algorithm
>...

Hi Scott,

I downloaded your test build but I couldn't get it to work for some reason. I
started with a fresh training.dat and activated the Junk Mail Controls through
the GUI interface, however, the application never detects or moves e-mails when
they arrive.

I even tried manually moving all the junk mails from the Junk folder to a new
temp folder and selected the "Run Junk Mail Controls on Folder" option. All it
did was clear the junk e-mail status on all the messages.

Any ideas?
(In reply to comment #56)
> I downloaded your test build but I couldn't get it to work for some reason. I
> started with a fresh training.dat and activated the Junk Mail Controls through
> the GUI interface, however, the application never detects or moves e-mails when
> they arrive.

I can confirm the same problem.  Nothing was marked as junk in 3 days, even
SpamAssassin mail with "***possible spam***" slapped into the header.  And I
probably get 100 pieces of spam a day at least, so there was plenty to learn from.
FWIW, I've been using this downloaded build and it's working fine. I just
dropped in the new exe directory, renamed training.dat, and went for it (ok, I
had to remove a few themes that were incompatible with v 0.5 Thunderbird). 

So far, it is seeming similar (fewer false positives now that I'm even partially
trained up, though, which is interesting... of course whitelisting helps that a
lot).
Attached patch updated patch (obsolete) — — Splinter Review
Miguel has put together a new patch for the new algorithm with some generous
help from the spam bayes folks. This fixes a couple bugs in the new algorithm.

In this patch, I have also included a change to how we count tokens in the
training set to match the approach taken by spam bayes. Before, if a token
occurred n times in a message and the user marked it as junk (or not junk) we
bumped the count for this token in the training set by n. Spam Bayes says when
adding to / removing from the training set, you should ignore the number of times
the token may appear in the message. Just bump the training set token count by
1, regardless of how many times the token in question occurs in the message.

Because of this accounting change, I highly recommend anyone trying these
patches at home to remove their training.dat file and re-train.
Attachment #138741 - Attachment is obsolete: true
Hi, sorry to jump in here suddenly, but reading mscott's last update inspired me
to throw in a few cents from my own research on the message classification. 
Quick background info - I did some work on a related topic for Mozilla - see bug
181866 (and the update on that is that the versions shown in that bug report are
unfortunately pretty far out of date - I'll try to get back to work on that if I
have time over the next month or two).

Anyway, if I'm understanding Scott's proposal, in machine learning terms he is
suggesting a switch from a "Multinomial Model" to a "Multi-variate Bernoulli
Model" - effectively going from a model that includes information about the
number of times a word appears in a document to a binary model.  In general, for
text classification, though, it seems that the multinomial model (that is, the
original method) actually works better, which makes some intuitive sense since
there is a difference between a word being mentioned once and one being
mentioned repeatedly.  Research has empirically confirmed this, a good paper on
the topic is this one: http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf

However, of course, spam is a unique beast.  In particular, you don't want to
risk allowing one single word to dominate the classification of a message - if
by some coincidence you happen to get a legit email from a friend about Viagra,
you wouldn't want the frequency of the single spammy word Viagra to cause the
message to be flagged as spam.  Therefore, I propose a compromise, which I have
found in my own tests works well in addition to making linguistic sense.  Check
out section 4 of this paper:
http://haystack.lcs.mit.edu/papers/rennie.icml03.pdf.  The idea is that rather
than using the number of times the word has occurred in the document directly
(the term frequency or TF), you use log(1+TF).  This makes sense linguistically
because of Zipf's Law - basically, in this case, the key observation is that we
want the first time a term shows up to be the most significant, and then each
subsequent term to be less and less significant, but still to count for
something.  The 1+ helps mathematically stabilize the function, and finally if
you use log base 2 you have the nice property that if a word shows up once in a
document, the modified TF vector will still represent it as a 1 (further
occurrences will add fractional amounts).

I have implemented this in my own project (see bug link above), but obviously I
don't expect you all to take my word for it, so I would encourage you to try it
out to see which performs best - it may be that for spam the binary model works
best.  I will try to post a code snippet in a few days if my description was not
adequate.
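
In the meantime, here is a quick sketch of the log(1+TF) weighting described
above, using a base-2 log so a single occurrence still counts as 1; this is
illustrative only:

#include <cmath>

// log2(1 + TF): the first occurrence of a token contributes 1, later
// occurrences contribute progressively smaller fractional amounts.
double WeightedTermFrequency(int occurrencesInMessage)
{
    return std::log(1.0 + occurrencesInMessage) / M_LN2;
}
// WeightedTermFrequency(1) == 1.0, WeightedTermFrequency(10) ~= 3.46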
> http://haystack.lcs.mit.edu/papers/rennie.icml03.pdf.  The idea is that rather
> than using the number of times the word has occurred in the document directly
> (the term frequency or TF), you use log(1+TF).  This makes sense linguistically
> because of Zipf's Law 

Jumping in here --

I have thought about doing experiments with this with the chi-square approach but haven't gotten 
around to it yet. Jason (the Rennie of the above URL; haven't read the paper yet though) mentioned to 
me that they'd been doing a log(1+tf) calculation, and it made sense to me. I wouldn't necessarily 
associate it with the Zipf distribution; rather, to me the root of it is in information theory. In fact for 
information theoretic reasons I'm actually using a log of occurrence counts in a product we're working 
on.

I haven't tried it yet in the context of the "grahamian probability" calcs being used at the root of 
spambayes and  other spam filters that incorporate those calcs. I think it would be interesting to play 
with, and that it has the potential to improve performance over the version being tested now. However, 
I would also like to mention that based on my own experience in using logs of counts, any 
improvement is not likely to be so great that a possible incorporation of the current version of the 
spam filtering into the public-release-versions of mozilla should be held up for long over it.

I hope anyone who tests the log-of-counts approach will report results here.  As a starting point I'd 
suggest simply counting each email in which a word occurs as log TF rather than 1 in calculating the 
original grahamian probabilities.
Attached patch another updated patch — — Splinter Review
This version of the patch uses the same new token counting logic as the last
patch, but in less code.

It also fixes a minor sorting bug Miguel found yesterday.
Attachment #143253 - Attachment is obsolete: true
*** Bug 200087 has been marked as a duplicate of this bug. ***
Yikes, I just found a really bad problem where the token counts get shorted if
you delete training.dat and try to retrain by selecting a bunch of already
classified messages, re-classifying them with the same classification. (i.e. you
select messages in your junk folder and say mark as junk again)

See: Bug #237095 if you are interested in talking about this issue.

I can fix it, but the fix means allowing a user to classify the same message
over and over again which could lead to data skew with the token counts for
tokens in that message. I may have to allow the user to do that in order to fix
this retraining issue though.
I've been running with this patch for about a week now. It seems to work, though
frankly I'm unable to tell if it works better than the previous mechanism.

I still get many false negatives, but so far have not encountered a false
positive (could be due to better training that I had previously).

I've noticed that the spam system deals very badly with selecting 30,000+
messages and marking them as junk. Things get very locked up (still running, but
other mail functions are not executed - feels like all threads are tied up in
the junk classification).

It's also been a challenge actually training the new training.dat due to the bug
mentioned above (reclassifying already classified messages).

What would be extremely useful both for testing as well as a generally cool
feature would be to be able to see the junk score in the message header, and the
ability to set my "junk threshold" via a pref (a slider with values from "loose"
to "strict" would be great and would not needlessly confuse people who are
unfamiliar with how the classification system works).
Does anyone know how Spam Bayes handles multipart/alternative when feeding data
into the tokenizer? Do they feed both the text/html part and the text/plain part
into the tokenizer or just one of the parts?

Currently we feed both parts into the tokenizer. Things would be easier on my
end if I only had to feed one of the parts into the tokenizer which is why I am
asking.

The closest blurb I could find in the tokenizer script was:

# Later:  The decision to ignore "redundant" HTML is also dubious, since
# the text/plain and text/html alternatives may have entirely different
# content.  options.ignore_redundant_html was introduced to control this,
# and it defaults to False.  Later:  ignore_redundant_html was also removed.

But I couldn't tell if that means they only use one part and ignore the other or
if they now tokenize on both parts.
I always understood that comment to mean that they tokenize both parts.  Since
the token counting scheme was changed so that multiple instances of a word in a
message are only counted once, it doesn't matter if redundant parts are tokenized.

Here's the comment on it on the SpamBayes mailing list,
http://mail.python.org/pipermail/spambayes/2002-October/001261.html
We tokenise both. A non-zero amount of spam had a text part that was pretty
simple and non-spammy, with all the spamminess in the HTML.
Hello, for anyone interested in the chi-square spam detection method,

1) I've written an article that introduces an improvement to the chi-square
algorithm (http://www.garyrobinson.net/2004/04/improved_chi.html) and includes
test results. The improvement is statistically significant.

2) Greg Louis of the Bogofilter project has also tested the improvement
mentioned above (/bogofilter/esf.html) and obtained positive results.

3) I've written another article about the theory behind the chi-square
technique (http://www.garyrobinson.net/2004/05/why_chi.html) and reasons for
using it.

4) I tried using the log of the number of messages as part of the chi-square algorithm. It didn't 
help, at least not significantly. There is some possibility that there  was an error in the test, but at 
this point I'm not encouraged that it would lead to a big improvement. I mention it because the 
subject came up earlier in these comments.
Interesting stuff, Gary.  Just to quickly respond to your point 4, I'm not sure
if you mis-stated that in your comment or if there was a misunderstanding in
where the log should be used, but to clarify, it should be the log of the number
of times each word occurs in a given document, not the log of the number of
documents.  Also keep in mind that this requires changes to the training set
which unfortunately cannot be retroactively applied unless you still have the
word counts for each individual document (as opposed to just the sums).

An example may be handy.  Consider a document which contains the word "the" 10
times and the word "and" 5 times.

If you were training, you would add log(10) to the number of times you have seen
"the" and log(5) to "and", while if you were classifying, rather than the usual
(for NB) p(the|spam)^10 * p(and|spam)^5 or
10*log(p(the|spam))+5*log(p(and|spam)), you would now have p(the|spam)^log(10) *
p(and|spam)^log(5) or log(10)*log(p(the|spam)) + log(5)*log(p(and|spam)).

If you are normalizing anything, the document length is now the sum of the logs
of the number of times each word occurs, so you would have the length of the
above document as log(10)+log(5) instead of 15.

I'm sure I messed something up there, but hopefully it made sense...
> it should be the log of the number
> of times each word occurs in a given document, not the log of the number of
> documents.  

Yes. That's what I meant (and did in the code).

> Also keep in mind that this requires changes to the training set
> which unfortunately cannot be retroactively applied unless you still have the
> word counts for each individual document (as opposed to just the sums)

Yes, I did that. I had all the original word counts.

I was a bit surprised that it didn't help, because I've used the log-of-the-number-of-things trick in 
other contexts. But it's also true that there is a mathematical basis for the chi-square technique that 
doesn't make complete sense if we use the log of the counts.

But it is also true that my results shouldn't be taken as definitive; there could have been some kind 
of error. It would be good for someone else to duplicate them. 
At long last this patch has finally made its way into the mozilla 1.8 trunk.
This was already part of the recently released Thunderbird 0.6 release.

The patch was checked in as part of Bug #230093.

I'm going to mark this bug as fixed.

However, Gary Robinson and others have already been talking about improvements
to the chi squared prob. approach that we want to continue discussion on. 

I have filed Bug #243430 to further track this aspect. I cc'ed a couple of you
on that bug already. Feel free to cc yourself if I missed you.

Hopefully we can continue to improve the new algorithm in that bug. I can't
thank everyone enough for their help in making this happen!
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Latest builds are having problems creating/adding to the training.dat file.  I've
tried deleting the training.dat file ... It will get re-created, but only the
first entry takes ... after that, it will no longer get updated.

Build ID 2004051608
See bug 243680 for this problem.
My experience with the new filtering - terrible. I open my mail today and it
marks *ALL* of my mail as Junk. In the last few days I got around 2-3 false
positives so I reset my training.dat but seeing all my mail marked as junk takes
the cake.

Does anyone else have a similar experience?
Pratik: the two comments before yours point to a problem that causes
training.dat not to get updated at all, thus not learning. Might this be what
you are seeing?
The behaviour I see (with Build 20040513) is this

- no training.dat is present.
- almost every mail is marked as junk (including the bugmail I got about the
last comment).

I'll try a more recent build just in case...
Pratik:
would you please actually read bug 243680 - which I just pointed you to?
It explicitly says that no training.dat is created.
One patch for that bug is already in and might already solve the problem, but
the real patch is in bug 193625 and still awaiting review.
spam bayes gurus....Someone in mozillazine pointed out the following to me. I'm
looking for validation of their comment about the value of an unknown word
probability:

"spam bayes uses a value of 0.5 where mozilla uses a value of 0.225 as the
coefficient in the calculation of probability of each token"

http://lxr.mozilla.org/mozilla/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#973

The original comment came from:

http://forums.mozillazine.org/viewtopic.php?t=81108

Does this sound like something we need to change or am I looking at the wrong
part of the algorithm?
As mentioned in a follow-up comment to the referenced mozillaZine topic,
SpamBayes multiplies the unknown word prob by the unknown word strength, which
results in 0.225 with the default values.  The exact calculation is as follows:

n = hamcount + spamcount => # of times word has been seen before
S = 0.45 => unknown word strength
X = 0.4 => unknown word probability

For a given word with base probability prob, the adjusted probability is:

prob = ((S * X) + (n * prob)) / (S + n)

If n = 0 indicating that the word has never been seen before then this reduces to:

prob = (S * X) / S = X

So the final probability of an unknown word is always the constant X = 0.5. 
Since SpamBayes ignores any words that have a probability between 0.4 and 0.6,
words that have never been seen before are always discarded from the calculation
of the overall message score.
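
Expressed as code, that adjustment looks roughly like the sketch below;
parameter names follow the comment above, and the X default uses the 0.5 value
from the conclusion (illustrative only):

// Robinson-style smoothing: pull the raw per-word probability toward the
// unknown-word probability X, weighted by the unknown-word strength S, until
// enough observations (timesSeen) accumulate.
double AdjustedWordProbability(double rawProb, int timesSeen,
                               double S = 0.45,   // unknown word strength
                               double X = 0.5)    // unknown word probability
{
    // timesSeen == 0 (never-seen word) reduces to (S * X) / S == X
    return (S * X + timesSeen * rawProb) / (S + timesSeen);
}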
I should probably go research this myself, but has anyone thought about upping
the spam probability of unknown words? The only emails I receive that have
"unknown" words in them these days are spam. I'd be willing to accept a slightly
higher temporary false positive rate in order to get such a greatly increased
permanent reduction in false negatives...
> I should probably go research this myself, but has anyone thought about upping
> the spam probability of unknown words? The only emails I receive that have
> "unknown" words in them these days are spam. I'd be willing to accept a slightly
> higher temporary false positive rate in order to get such a greatly increased
> permanent reduction in false negatives...

Given the fact that so many emails now contain random text as a disguise, that sounds like a good 
idea to me.

I would suggest: all unknown words should be treated as the same word, and a spam probability for 
the unknown word token should be calculated. That probability  should then be used as X in the 
equation in comment #80. In calculating the spam prob for the unknown word token itself, just 
leave X out of the equation, because it would be overwhelmed by the amount of unknown word data 
anyway.
Oh, and for that last suggestion (#82) to make any sense, the unknown words would have to be left 
in the processing and not removed like other "middle probability" words are.

I don't know if that would improve things or make them worse. It would be interesting to test, 
though.
On our iMail systems, we upped the unknown word probability from .4 to .6, and
that caught a LOT of the spams that were getting through before. We haven't
noticed any appreciable rise in false positives since we did that.
(In reply to comment #82)
> I would suggest: all unknown words should be treated as the same word, and a
> spam probability for the unknown word token should be calculated. That
> probability should then be used as X in the equation in comment #80. In
> calculating the spam prob for the unknown word token itself, just leave X out
> of the equation, because it would be overwhelmed by the amount of unknown
> word data anyway.

I'm curious as to how you would calculate the probability of the unknown words?
 At first glance it would seem that n = 0 no matter how many unknown words you
add together because you have no prior info about unknown words.  If you plug n
= 0 into the equation and leave out X then the equation reduces to S / S = 1.

BTW: sorry for the typo in comment #80: should be "X = 0.5", obviously, and not
"X = 0.4"
" At first glance it would seem that n = 0 no matter how many unknown words you
add together because you have no prior info about unknown words."

I'm suggesting we treat all unknown words as if they were the same word for the purpose of 
computing the initial Unknown Word probability. So the second time you get an email containing a 
word you haven't seen before, it won't be unknown in that sense -- it will be the second occurrence 
of The Unknown Word in an email, even though that actual word may not have been seen before.

It's just an idea. I admit I haven't spent a great deal of time thinking about it. But it seemed like a 
decent idea.
How would this affect those of us who work in multilingual environments? 
"Unknown" words might be rare in a unilingual environment but not so uncommon in
bi/tri/lingual nations.  I, for example, deal with French translations for our
bilingual website.  I don't get that many each week but enough that it would be
irritating if they were constantly hidden in the junk mail folder because some
French text is "unknown".

Just a concern.
Re: comment 87, the nice thing about Gary's suggestion is that in your
environment, the "unknown word" probability would quickly be trained down
because unknown words are common in your environment. Whereas in my environment,
I suspect they would get trained up to around .7, because they are relatively
uncommon for me.

I'd put a floor on it, though. Unknown words shouldn't become an active indicator
of ham, because all instances of the database will start out with all words
being unknown (and because they'll never be completely uncommon). 

Also, would there need to be some way of retrograding the unknown word
probability when a word becomes "known" (i.e. seen > once)? I tend to think not,
but I haven't run the math (even in my head :-). 
Before making changes like this, I strongly strongly recommend you make sure you
can test for it. The main thing that let us make SB as strong as it is was that
we  had a test-first approach - if the change didn't make it better, it didn't
go in. Please _don't_ just "check something in because it sounds like it should
make it better". Often the "obvious" fixes actually make stuff worse.
Product: MailNews → Core
Product: Core → MailNews Core