Closed Bug 230093 Opened 21 years ago Closed 21 years ago

Improve the tokenizer for the spam filter using ideas from SpamBayes

Categories

(MailNews Core :: Filters, enhancement)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: unroar, Assigned: mscott)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 6 obsolete files)

User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.7a) Gecko/20031231 MultiZilla/1.6.0.0e
Build Identifier:

I've been looking through the SpamBayes Python source code and I see a couple of improvements to the tokenizer that would be easy to implement. Our tokenizer code is located here: http://lxr.mozilla.org/seamonkey/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#261

The SpamBayes tokenizer splits words on whitespace; ours does strtok with the delimiters " \t\n\r\f!\"#%&()*+,./:;<=>?@[\\]^_`{|}~". So we're ignoring punctuation, which means we can't tell the difference between 'free' and 'free!!', the latter being a much higher spam indicator.

The other easy change is that SpamBayes ignores tokens that are < 3 characters. Tokens that are > 12 characters are replaced by a special token made up of the first letter of the token and its character count.

Besides those two changes, there are other improvements that could be made, but I'm not sure how they would be implemented. Is the message header fed into our tokenizer? Are HTML tags stripped from the message before they get to the tokenizer?

Reproducible: Always

Steps to Reproduce:
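For illustration, here is a small standalone C++ sketch (not the actual nsBayesianFilter.cpp code) contrasting the current strtok delimiter split with a whitespace-only split, showing that 'free!!' only survives as its own token in the latter:

    #include <cstring>
    #include <cstdio>

    // Split a copy of the text the way the current filter does: strtok with a
    // large delimiter set, so "free!!" collapses to the same token as "free".
    // A whitespace-only split (the SpamBayes approach) keeps them distinct.
    int main() {
        char current[] = "Get it FREE!! free offer";
        const char *delims = " \t\n\r\f!\"#%&()*+,./:;<=>?@[\\]^_`{|}~";
        for (char *tok = strtok(current, delims); tok; tok = strtok(nullptr, delims))
            printf("current tokenizer: '%s'\n", tok);   // punctuation stripped away

        char whitespaceOnly[] = "Get it FREE!! free offer";
        for (char *tok = strtok(whitespaceOnly, " \t\n\r\f"); tok; tok = strtok(nullptr, " \t\n\r\f"))
            printf("whitespace split:  '%s'\n", tok);   // "FREE!!" survives as its own token
        return 0;
    }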
Blocks: 181534
Assignee: sspitzer → mscott
Status: UNCONFIRMED → NEW
Ever confirmed: true
Blocks: spam
Low-hanging fruit first: ignore tokens less than 3 characters in length. This check should work fine for both ASCII and UTF-8 strings. I'm using the following as my guide: http://cvs.sourceforge.net/viewcvs.py/spambayes/spambayes/spambayes/tokenizer.py?rev=1.29&view=markup
This patch adds:
1) Ignore tokens less than 3 bytes in length.
2) If a token is more than 12 bytes, add a special token, "skip:K N", where K is the first character in the word and N is an approximate length rounded to the nearest 10.
3) If the token is too long but looks like an email address, add two tokens: one for the email name and one for the domain name.
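A minimal standalone sketch of those three rules, using std::string rather than the Mozilla string classes; the exact "skip:" and "email" token spellings are assumptions for illustration:

    #include <string>
    #include <vector>
    #include <cstdio>

    // Drop tokens shorter than 3 bytes; replace tokens longer than 12 bytes
    // with "skip:<first char> <length rounded to nearest 10>", except that long
    // tokens containing '@' are split into a name token and a domain token.
    static void addLengthFilteredTokens(const std::string& word, std::vector<std::string>& out) {
        if (word.size() < 3)
            return;                                   // too short to be a useful clue
        if (word.size() <= 12) {
            out.push_back(word);
            return;
        }
        std::string::size_type at = word.find('@');
        if (at != std::string::npos && at > 0 && at + 1 < word.size()) {
            out.push_back("email name:" + word.substr(0, at));   // hypothetical token labels
            out.push_back("email addr:" + word.substr(at + 1));
            return;
        }
        size_t rounded = ((word.size() + 5) / 10) * 10;           // nearest multiple of 10
        out.push_back("skip:" + std::string(1, word[0]) + " " + std::to_string(rounded));
    }

    int main() {
        std::vector<std::string> tokens;
        for (const char* w : {"hi", "free", "unsubscribe-now-please", "someone@example.com"})
            addLengthFilteredTokens(w, tokens);
        for (const auto& t : tokens) printf("%s\n", t.c_str());
    }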
Attachment #139618 - Attachment is obsolete: true
Miguel, I'd be curious how this patch improves your test results for the algorithm change you made in Bug #181534.

So much more to do here with the tokenizer. Next steps are:

1) I think we tokenize HTML tags and the stuff inside the HTML tags. We need to stop that.
2) We tokenize the headers as if they were the body (i.e. they are fed into the tokenizer). I need some help confirming, but I don't think SpamBayes lumps all the headers into the tokenizer. Instead, it picks a couple of headers (subject, received headers, content type, content disposition, content encoding) and does special magic to generate some tokens for those values. Or is it double counting the headers it explicitly handles along with the rest when it tokenizes the message?
3) I'm still not tokenizing on just whitespace yet. I'm slowly easing into that with this patch by removing ".@!" from our delimiter set. This means separate tokens for "free" and "free!!" now, since the latter is a higher indication of spam.
4) What to do about non-ASCII messages. This current patch just does all this work for ASCII words and not UTF-8 words.
5) Add code to strip out HTML comments and replace them with nothing, to handle: FR<!--asdfjklsdjf-->EE (see the sketch below).

That should keep us plenty busy.
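A minimal sketch of the HTML-comment stripping in item 5, standalone for illustration rather than the approach any eventual patch takes:

    #include <string>
    #include <cstdio>

    // Remove HTML comments before tokenizing, so a word split by a comment
    // ("FR<!--junk-->EE") is seen as the single token "FREE".
    static std::string stripHtmlComments(const std::string& body) {
        std::string out;
        std::string::size_type pos = 0;
        while (pos < body.size()) {
            std::string::size_type start = body.find("<!--", pos);
            if (start == std::string::npos) {
                out.append(body, pos, std::string::npos);
                break;
            }
            out.append(body, pos, start - pos);
            std::string::size_type end = body.find("-->", start + 4);
            if (end == std::string::npos)
                break;                      // unterminated comment: drop the rest
            pos = end + 3;
        }
        return out;
    }

    int main() {
        // prints "Get it FREE today"
        printf("%s\n", stripHtmlComments("Get it FR<!--asdfjklsdjf-->EE today").c_str());
    }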
Status: NEW → ASSIGNED
More low-hanging fruit. Start adding specific tokens based on certain headers such as the content type, the charset encoding and the x-mailer/user-agent header. Hopefully pulling out a token for the charset will help us do better with the international spam we get a lot of.
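A rough sketch of what such header-derived tokens could look like; the function name and token spellings are illustrative, not necessarily what the patch emits:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Emit a few synthetic tokens derived from selected headers instead of
    // feeding the raw header text through the body tokenizer.
    static void addHeaderTokens(const std::string& contentType,
                                const std::string& charset,
                                const std::string& mailer,
                                std::vector<std::string>& tokens) {
        if (!contentType.empty()) tokens.push_back("content-type:" + contentType);
        if (!charset.empty())     tokens.push_back("charset:" + charset);
        if (!mailer.empty())      tokens.push_back("mailer:" + mailer);
    }

    int main() {
        std::vector<std::string> tokens;
        addHeaderTokens("text/html", "ks_c_5601-1987", "Some Bulk Mailer 1.0", tokens);
        for (const auto& t : tokens) printf("%s\n", t.c_str());
    }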
Attachment #139642 - Attachment is obsolete: true
Depends on: 231873
Here's an interesting idea that I think could help. For each attachment we discover in the message, add a token for the following properties: content-type (aren't a lot of virus attachments certain content types?) and file name (don't a lot of them use the same name, like 'test.exe'?). I wonder if those two techniques would help improve our ability to detect incoming virus mail.
This patch adds special tokens for the name and content type of each attachment. It also adds code to strip html tags from the body text before tokenizing the body. We no longer break the body tokenizer on :, /, ? and &. This helps us tokenize complete urls instead of breaking the urls into words. More work still needs to be done on this front.
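A hedged sketch of those per-attachment tokens; the AttachmentInfo struct and token labels are stand-ins rather than the patch's actual code:

    #include <string>
    #include <vector>

    struct AttachmentInfo {
        std::string contentType;   // e.g. "application/x-msdownload"
        std::string fileName;      // e.g. "test.exe"
    };

    // One token for the declared content type and one for the file name of
    // each attachment found in the message.
    static void addAttachmentTokens(const std::vector<AttachmentInfo>& attachments,
                                    std::vector<std::string>& tokens) {
        for (const AttachmentInfo& a : attachments) {
            if (!a.contentType.empty()) tokens.push_back("attachment type:" + a.contentType);
            if (!a.fileName.empty())    tokens.push_back("attachment name:" + a.fileName);
        }
    }

    int main() {
        std::vector<AttachmentInfo> atts = { {"application/x-msdownload", "test.exe"} };
        std::vector<std::string> tokens;
        addAttachmentTokens(atts, tokens);
    }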
This is all quite interesting. Do you consider these patches safely 'testable'? I'll give them a spin if that's helpful. If so, should I reset my weighting before testing?
Duplicating my comment in Bug #181534 since it applies to the tokenizer work here too. Here are some testing results comparing 0.5 with JUST these tokenizer changes, and then another run with the tokenizer changes + Miguel's port of the chi-squared probability distribution algorithm instead of a pure Bayesian algorithm.

# of Messages: 5602
# SPAM: 1684

                     False Negatives   False Positives   Percentage Caught
Thunderbird 0.5      925               1                 45.07%
Tokenizer Changes    614               3                 63.54%
Tokenizer + Chi      190               22                88.72%

We are almost doubling the percentage of spam caught by combining the algorithm with the tokenizer changes. However, the huge spike in false positives with the algorithm change is alarming. Btw, these tests are done with the address book whitelist filter turned OFF. So in practical use, the fp rate should get cut down some because we already whitelist folks in your address book.
Now that I have individual headers being broadcast to the classifier, it is really easy to customize the rules / parsing used on each header. SpamBayes has explicit steps it takes for headers like date, received, message ID, to/cc/bcc, etc. It gets tricky for us because we are doing all this in C++ and we don't have access to regular expressions, so we have to figure out creative C++ ways to do some of the things SpamBayes does. My hope is that once this stuff lands, some volunteers from the community will be able to step up and translate the header code found at http://cvs.sourceforge.net/viewcvs.py/spambayes/spambayes/spambayes/tokenizer.py?rev=1.29&view=markup into C++ code. The good news is that with my changes, you won't have to be familiar with Mozilla to do this. Everything is already contained in one little method (Tokenizer::tokenizeHeaders). You just need to read the SpamBayes code, figure out how to do the same mutation for a particular header in C++ without regular expressions, and you are in business. This could be a great project for some folks looking to get their feet wet with Mozilla without having to know all the ins and outs.
This patch also uses a pref, "mail.adaptivefilters.junk_threshold", so testers can adjust the probability value used to determine if a message is junk or not.

You can see that we treat almost all the headers the same; this is where I'm hoping volunteers will step in:

    default:
      addTokenForHeader(lowerCaseHeaderName.get(), nsDependentCString(headerValue));
      break;

Right now I'm not even tokenizing these header values, I'm just adding a token for "headername:headervalue". SpamBayes has lots of little rules for how each header should be broken down.
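As one example of the kind of per-header rule a volunteer could drop into that default: branch, here is a rough, regex-free Received-header breakdown; this is purely illustrative and not something the current patch does:

    #include <cctype>
    #include <string>
    #include <vector>

    // Scan a Received header value and emit a "received:" token for each
    // dotted host name or IP address found, using plain character scanning
    // instead of regular expressions.
    static void tokenizeReceivedHeader(const std::string& value,
                                       std::vector<std::string>& tokens) {
        std::string word;
        for (char c : value) {
            if (std::isalnum((unsigned char)c) || c == '.' || c == '-') {
                word += (char)std::tolower((unsigned char)c);
            } else {
                if (word.find('.') != std::string::npos)   // keep only dotted names/addresses
                    tokens.push_back("received:" + word);
                word.clear();
            }
        }
        if (word.find('.') != std::string::npos)
            tokens.push_back("received:" + word);
    }

    int main() {
        std::vector<std::string> tokens;
        tokenizeReceivedHeader("from mail.example.com ([192.0.2.1]) by mx.example.org", tokens);
    }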
Attachment #139674 - Attachment is obsolete: true
Attachment #141078 - Attachment is obsolete: true
I just had a look at the latest patch. As many users may have noticed, the current implementation has problems identifying emails containing viruses because each one is slightly different and in most cases the body is very short. Has anyone already done a test run with this patch to see whether adding tokens for the filename and content type of the attachments does a good job on this type of spam? What about perhaps also adding a token for the size of attachments?
That's a fair observation -- for example, for the past few weeks I've been able to pretty much reflexively delete the MyDoom false-negatives just by eyeballing that they're almost always 31K.
See this comment about the availability of a test build containing the tokenizer changes and the new algorithm: http://bugzilla.mozilla.org/show_bug.cgi?id=181534#c55
Scott: just wanted to make sure you've read http://www.paulgraham.com/better.html
Attached patch updated patch (obsolete) — Splinter Review
Attachment #141299 - Attachment is obsolete: true
One thing to be very wary of is that just cherry-picking the "easy" chunks of the tokenisation bits from spambayes may in fact not produce the best results. The current set of tokeniser tricks is an accumulation of code, each piece of which was tested thoroughly before being accepted. But they were tested _in_ _conjunction_ _with_ _all_ _other_ tricks. They inter-relate in a way that's far too scary to contemplate. I'd strongly recommend you set up a solid testing harness, with a couple of different data sets, for checking new code. In particular, something like the SB 'timcv' cross-validation driver would seem to be essential. This breaks the messages up into N buckets; picks a bucket, trains on all other buckets, then scores the bucket chosen. Repeat for all buckets.
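A bare-bones C++ sketch of that N-bucket cross-validation loop, with Message and Classifier as placeholders for whatever a real harness would plug in:

    #include <cstdio>
    #include <vector>

    struct Message { bool isSpam; /* plus the raw text in a real harness */ };

    struct Classifier {
        void train(const Message&) { /* update token counts from the message */ }
        bool classify(const Message&) const { return false; /* placeholder verdict */ }
    };

    // Hold out one bucket, train on the remaining buckets, score the held-out
    // bucket, and repeat for every bucket.
    static void crossValidate(const std::vector<std::vector<Message>>& buckets) {
        for (size_t held = 0; held < buckets.size(); ++held) {
            Classifier c;
            for (size_t b = 0; b < buckets.size(); ++b)
                if (b != held)
                    for (const Message& m : buckets[b]) c.train(m);

            int falsePositives = 0, falseNegatives = 0;
            for (const Message& m : buckets[held]) {
                bool judgedSpam = c.classify(m);
                if (judgedSpam && !m.isSpam) ++falsePositives;
                if (!judgedSpam && m.isSpam) ++falseNegatives;
            }
            printf("bucket %zu: fp=%d fn=%d\n", held, falsePositives, falseNegatives);
        }
    }

    int main() {
        // Two tiny buckets, just to exercise the loop.
        std::vector<std::vector<Message>> buckets = { { {true}, {false} }, { {false}, {true} } };
        crossValidate(buckets);
    }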
Has there been any thought given to letting the user set the ham/spam cutoff percentages like SpamBayes does? Building on that, the inclusion of a "Junk Suspects" folder would be great. This is what the SpamBayes Outlook plugin does, and it's easy to set up for Mozilla/Thunderbird when using the SpamBayes POP3 proxy. So the client would include the following configuration settings:

- Spam Cutoff Percentage, default 90% (scores at or above this number are spam)
- Ham Cutoff Percentage, default 10% (scores at or below this number are ham)
- A Junk Suspects folder that gets any messages classified as unknown (score falls between the two numbers set above)
Yes, there has been thought on this. It is currently pref driven:

    +// the probability threshold over which messages are classified as junk
    +// this number is divided by 100 before it is used. The classifier can be fine tuned
    +// by changing this pref. Typical values are .99, .95, .90, .5, etc.
    +pref("mail.adaptivefilters.junk_threshold", 90);
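A minimal sketch of how such a percentage pref would be applied at classification time; the pref-service call itself is omitted, so reading the integer is left as a stand-in parameter:

    #include <cstdio>

    // Divide the integer pref by 100 and compare against the message's
    // computed spam probability.
    static bool isJunk(double spamProbability, int junkThresholdPref /* e.g. 90 */) {
        double threshold = junkThresholdPref / 100.0;   // 90 -> 0.90
        return spamProbability >= threshold;
    }

    int main() {
        printf("%d %d\n", isJunk(0.95, 90), isJunk(0.80, 90));   // prints "1 0"
    }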
Scott, I spent some time going over my patch and testing to try to fix the fp problem. I asked for and got a lot of help from the SpamBayes developers, and we've found several mistakes in my previous patch; testing is showing more big improvements. What I've now discovered is that SpamBayes uses a different method for counting tokens during training than Mozilla does, and since we're now using their formulas the numbers are coming out wrong. In Mozilla every occurrence of the same token in a single message is counted and added to the training database; in SpamBayes they only count that token once. For example, if a spam message has the word 'free' in it 5 times, Mozilla increases the spamcount for 'free' by 5 in training.dat while SpamBayes increases it by 1. Since you spent some time digging through the tokenizer code, I thought you could point me to where this change could be made.
Miguel, that's awesome news!! As soon as you give me a new algorithm patch, I'll re-run my tests to measure the performance and put up a new test build. Here's where we update the token count: http://lxr.mozilla.org/mozilla/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#202 If the token is already in the hash table, we bump the mCount field on the token right there. Change that to do nothing, and the count would stay at 1.
That didn't work. That made it so that the count is never increased past 1. I need it so that it's increased once per message, but not more than once.
Well, that hash table contains all of the tokens for the current message being parsed. If the count stays at 1 here, then later on, if we end up adding those tokens to the training set, we copy out mCount for each token and add it to the training set of tokens. Maybe this method is used in two ways: one for the tokenizer that builds up the tokens for a specific message (in which case we want that line to never increment past 1), and secondly for a tokenizer that organizes the tokens for the training set. When adding tokens to that tokenizer, we do want to increment mCount past 1. If that's the case, we may have to set a flag on the tokenizer so we know which way it is being used.
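A small sketch of the "count a token at most once per message during training" behaviour under discussion, using standard containers in place of the filter's hash tables:

    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // Collapse the message's tokens to a set before bumping the training
    // counts, so five occurrences of "free" in one spam still add only 1 to
    // its spam count.
    static void trainOnMessage(const std::vector<std::string>& messageTokens,
                               bool isSpam,
                               std::map<std::string, int>& spamCounts,
                               std::map<std::string, int>& hamCounts) {
        std::set<std::string> unique(messageTokens.begin(), messageTokens.end());
        for (const std::string& token : unique) {
            if (isSpam) ++spamCounts[token];
            else        ++hamCounts[token];
        }
    }

    int main() {
        std::map<std::string, int> spam, ham;
        trainOnMessage({"free", "free", "free", "offer"}, true, spam, ham);
        // spam["free"] == 1, not 3
    }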
This patch merges the new tokenizing work with an updated version of the new algorithm. See this comment for more details: http://bugzilla.mozilla.org/show_bug.cgi?id=181534#c59
Attachment #141577 - Attachment is obsolete: true
Is HTML still being stripped from the body? Paul Graham has made references to the fact that the FF0000 token (HTML color red) ranked as high as "teens". There is some discussion about his updated tokenizer (which reads a, img, and font tags) at http://www.paulgraham.com/better.html
Disregard my last comment. I just read the SpamBayes background page: http://spambayes.sourceforge.net/background.html Apparently, HTML tokenizing unfairly penalized ham with HTML. "In the end, the best results were found by stripping out most HTML clues."
Comment on attachment 143254 [details] [diff] [review]
updated patch that includes bug fixes to the core algorithm

Let's start getting these changes ready for a landing during 1.8. David, breakdown of the changes:

1) Modify libmime: when fetching a message for junk mail, set the output type to text/html to force the HTML MIME emitter to get invoked.
2) Allow a header sink to be set on individual urls instead of just the nsIMsgWindow.
3) Modify our header sink broadcaster class to use an array of nsAUTF8Strings to avoid unicode / utf8 mismatches when calling into the JS front end for message display and the C++ junk mail plugin header sink listener.
4) Modify msgHdrViewOverlay.js to account for this API change.
5) Modify libmime to properly call through this new version of the interface.
6) Land the new algorithm which uses different math for the junk scores.
7) Tokenizer changes based on some of the SpamBayes rules talked about in this bug, including stripping out HTML, creating special tokens for some headers (we still ignore some headers for now), ignoring words > 13 characters and turning them into n-gram tokens, and adding special tokens for each broadcast attachment.
Attachment #143254 - Flags: superreview?(bienvenu)
These are all nits (Sethisms).

Can we get rid of the dumps?

    +    dump(srcUrlPrimitive + '\n');
         for (index in currentAttachments)
         {
           attachment = currentAttachments[index];
    +      dump('attach.url: ' + attachment.url + '\n');
           if (attachment.url == srcUrlPrimitive)

    +  if(!aMsgHdrSink)
    +    return NS_ERROR_NULL_POINTER;
    +
    +  *aMsgHdrSink = mMsgHeaderSink;
    +  NS_IF_ADDREF(*aMsgHdrSink);
    +  return NS_OK;

can be:

    NS_ENSURE_ARG_POINTER(aMsgHdrSink)
    NS_IF_ADDREF(*aMsgHdrSink = mMsgHeaderSink);

    +nsresult nsMimeHtmlDisplayEmitter::BroadcastHeaders(nsIMsgHeaderSink * aHeaderSink, PRInt32 aHeaderMode, PRBool aFromNewsgroup)
    +{
    +  nsresult rv = NS_OK;

if you move the decl of rv closer to where it's used, and remove the unneeded init...

    +  {
    +    if (word[0] == '\0') continue;

    if (!*word) continue

    +  // important: leave out sender field. To strong of an indicator

should be "too strong..."

    +  nsresult rv = NS_OK;
    +  // Create a parser
    +  nsCOMPtr<nsIParser> parser = do_CreateInstance(kParserCID);
    +  NS_ENSURE_TRUE(parser, NS_ERROR_FAILURE);

slightly cleaner to do (partly because I hate returning NS_ERROR_FAILURE...):

    +  nsresult rv;
    +  // Create a parser
    +  nsCOMPtr<nsIParser> parser = do_CreateInstance(kParserCID, &rv);
    +  NS_ENSURE_SUCCESS(rv, rv);

similarly, here:

    +  nsCOMPtr<nsIDTD> dtd = do_CreateInstance(kNavDTDCID, &rv);
    +  NS_ENSURE_SUCCESS(rv, rv);

here, just return parser->Parse...

    +  rv = parser->Parse(inString, 0, NS_LITERAL_CSTRING("text/html"), PR_FALSE, PR_TRUE);
    +  return rv;

    +  if (goodclues > 150)
    +    first = count - 150;
    +  else first = 0;

can just be:

    first = (goodclues > 150) ? count - 150 : 0;
    return NS_OK;
Comment on attachment 143254 [details] [diff] [review] updated patch that includes bug fixes to the core algorithm sr=bienvenu with those nits.
Attachment #143254 - Flags: superreview?(bienvenu) → superreview+
At long last this patch has finally made its way onto the 1.8 trunk. It was already part of the latest Thunderbird 0.6 release. I'm going to mark this as fixed, but there is a lot more tokenizer work to do here. We'll spin up new bugs to track further improvements.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Product: MailNews → Core
Product: Core → MailNews Core