Closed
Bug 230093
Opened 21 years ago
Closed 21 years ago
Improve the tokenizer for the spam filter using ideas from SpamBayes
Categories
(MailNews Core :: Filters, enhancement)
MailNews Core
Filters
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: unroar, Assigned: mscott)
References
(Blocks 1 open bug)
Details
Attachments
(1 file, 6 obsolete files)
50.36 KB, patch
Bienvenu: superreview+
User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.7a) Gecko/20031231 MultiZilla/1.6.0.0e
Build Identifier:
I've been looking through the SpamBayes Python source code and I see a couple of
improvements to the tokenizer that would be easy to implement.
Our tokenizer code is located here:
http://lxr.mozilla.org/seamonkey/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#261
The SpamBayes tokenizer splits words on whitespace; ours does strtok with the
delimiters " \t\n\r\f!\"#%&()*+,./:;<=>?@[\\]^_`{|}~". So we're ignoring
punctuation, but that means we can't tell the difference between 'free' and
'free!!', which is a much higher spam indicator.
The other easy change is that they ignore tokens that are < 3 characters.
Tokens that are > 12 characters are replaced by a special token that has the
first letter of the token and how many characters there were.
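The two length rules described above can be sketched like this (an illustrative sketch, not the eventual patch; the function name and the exact rounding detail are assumptions):

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical sketch of the SpamBayes length rules: tokens shorter
// than 3 characters are dropped, and tokens longer than 12 are
// collapsed into a synthetic "skip" token carrying the first
// character and an approximate (rounded) length.
std::string filterToken(const std::string& word) {
    if (word.size() < 3)
        return "";                     // too short: contributes nothing
    if (word.size() > 12) {
        char buf[32];
        // round the length to the nearest multiple of 10
        int rounded = static_cast<int>((word.size() + 5) / 10) * 10;
        std::snprintf(buf, sizeof(buf), "skip:%c %d", word[0], rounded);
        return buf;
    }
    return word;                       // normal-sized token passes through
}
```

So 'free' survives as-is, 'hi' is dropped, and a 20-character run of letters becomes a single "skip:s 20" style token.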
Besides those 2 changes, there are other improvements that could be made but I'm
not sure how they would be implemented. Is the message header fed into our
tokenizer? Are html tags stripped from the message before they get to the
tokenizer?
Reproducible: Always
Steps to Reproduce:
Assignee | Updated • 21 years ago
Assignee: sspitzer → mscott
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee | Comment 1 • 21 years ago
Low hanging fruit part first. Ignore tokens less than 3 characters in length.
This check should work fine for both ascii and utf-8 strings.
I'm trying to use the following as my guide:
http://cvs.sourceforge.net/viewcvs.py/spambayes/spambayes/spambayes/tokenizer.py?rev=1.29&view=markup
Assignee | Comment 2 • 21 years ago
This patch adds:
1) Ignore tokens less than 3 bytes in length
2) If a token is more than 12 bytes, add a special token: "skip:K N" where K is
the first character in the word and N is an approximate length rounded to the
nearest 10.
3) If the token is too long, but it looks like an email address, add two
tokens, one for the email name and one for the domain name.
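The email-address case in point 3 might look roughly like this (a hedged sketch; the "email name:" / "email addr:" token prefixes are illustrative, not taken from the patch):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch: when an over-long token contains '@', split it into a local-part
// token and a domain token instead of collapsing it into one "skip" token.
std::vector<std::string> splitEmailToken(const std::string& word) {
    std::vector<std::string> tokens;
    std::string::size_type at = word.find('@');
    if (at != std::string::npos && at > 0 && at + 1 < word.size()) {
        tokens.push_back("email name:" + word.substr(0, at));
        tokens.push_back("email addr:" + word.substr(at + 1));
    } else {
        tokens.push_back(word);   // not an address: leave it alone
    }
    return tokens;
}
```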
Attachment #139618 - Attachment is obsolete: true
Assignee | Comment 3 • 21 years ago
Miguel, I'd be curious how this patch improves your test results for the
algorithm change you made in Bug #181534.
So much more to do here with the tokenizer. Next steps are:
1) I think we tokenize html tags and the stuff inside the html tags. We need to
stop that.
2) We tokenize the headers as if they were the body (i.e. they are fed into the
tokenizer). I need some help confirming this, but I don't think SpamBayes lumps all
the headers into the tokenizer. Instead, it picks a couple of headers (subject,
received headers, content type, content disposition, content encoding) and does
special magic to generate some tokens for those values. Or do they double-count
the headers they explicitly handle when they tokenize the message?
3) I'm still not tokenizing on just whitespace yet. I'm slowly easing into that
with this patch by removing ".@!" from our delimiter set. This means separate
tokens for "free" and "free!!" now, since the latter is a higher indication of spam.
4) What to do about non-ASCII messages? This current patch only does all this
work for ASCII words, not UTF-8 words.
5) Add code to strip out HTML comments and replace them with nothing to handle:
FR<!--asdfjklsdjf-->EE
That should keep us plenty busy.
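For point 5, a minimal comment-stripping pass could look like this (a sketch only; as the later patches show, the eventual fix routes the body through the Gecko HTML parser rather than hand-rolling this):

```cpp
#include <cassert>
#include <string>

// Strip HTML comments so that obfuscations like FR<!--junk-->EE
// collapse back into the word FREE before tokenizing.
std::string stripHtmlComments(std::string text) {
    std::string::size_type start;
    while ((start = text.find("<!--")) != std::string::npos) {
        std::string::size_type end = text.find("-->", start + 4);
        if (end == std::string::npos)
            break;                           // unterminated comment: stop
        text.erase(start, end + 3 - start);  // drop "<!-- ... -->"
    }
    return text;
}
```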
Status: NEW → ASSIGNED
Assignee | Comment 4 • 21 years ago
More low hanging fruit. Start adding specific tokens based on certain headers
such as the content type, the charset encoding and the x-mailer/user-agent
header.
Hopefully pulling out a token for the charset will help make us better at catching
the international spam that we get a lot of.
Assignee | Updated • 21 years ago
Attachment #139642 - Attachment is obsolete: true
Assignee | Comment 5 • 21 years ago
Here's an interesting idea that I think could help. For each attachment we
discover in the message, add a token for the following properties:
content-type (aren't a lot of virus attachments certain content types?)
file name (don't a lot of them use the same name like 'test.exe')
I wonder if those two techniques would help improve our ability to detect
incoming virus mail.
Assignee | Comment 6 • 21 years ago
This patch adds special tokens for the name and content type of each
attachment. It also adds code to strip html tags from the body text before
tokenizing the body.
We no longer break the body tokenizer on :, /, ? and &. This helps us tokenize
complete urls instead of breaking the urls into words. More work still needs to
be done on this front.
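The delimiter change described here can be illustrated with a small sketch (assuming the delimiter set from the bug description minus ".@!:/?&"; the strtok use mirrors the description, not the patch itself):

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Keep ':', '/', '?' and '&' (and the earlier-removed '.', '@', '!')
// out of the strtok() delimiter set so a URL survives as one token
// instead of being shredded into words.
std::vector<std::string> tokenize(const char* text) {
    const char* delims = " \t\n\r\f\"#%()*+,;<=>[\\]^_`{|}~";
    std::vector<std::string> tokens;
    std::string buf(text);               // strtok mutates its input
    for (char* w = std::strtok(&buf[0], delims); w;
         w = std::strtok(nullptr, delims))
        tokens.push_back(w);
    return tokens;
}
```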
Comment 7 • 21 years ago
This is all quite interesting. Do you consider these patches safely 'testable'?
I'll give them a spin if that's helpful. If so, should I reset my weighting
before testing?
Assignee | Comment 8 • 21 years ago
Duplicating my comment in Bug #181534 since it applies to the tokenizer work
here too:
Here are some of the testing results comparing 0.5 with JUST these tokenizer
changes, and then I did another run with the tokenizer changes + Miguel's port
of the chi-squared probability distribution algorithm instead of a pure bayesian
algorithm.
# of Messages: 5602
# Spam: 1684

                    False Negatives   False Positives   Percentage Caught
Thunderbird 0.5           925                1               45.07%
Tokenizer Changes         614                3               63.54%
Tokenizer + Chi           190               22               88.72%
We are almost doubling the percentage of spam caught by combining the algorithm
with the tokenizer changes. However, the huge spike in false positives with the
algorithm change is alarming. Btw, these tests were done with the address book
whitelist filter turned OFF, so in practical use the false-positive rate should get
cut down some, because we whitelist folks already in your address book.
Assignee | Comment 9 • 21 years ago
Now that I have individual headers being broadcast to the classifier, it is
really easy to customize the rules / parsing used on each header. SpamBayes
has explicit steps it takes for headers like date, received, message ID,
to/cc/bcc, etc. It gets tricky for us because we are doing all this in C++,
and we don't have access to regular expressions, so we have to figure out C++
ways to do some of the things SpamBayes does.
My hope is that once this stuff lands, some volunteers from the community will
be able to step up and translate the header code found at
http://cvs.sourceforge.net/viewcvs.py/spambayes/spambayes/spambayes/tokenizer.py?rev=1.29&view=markup
into C++ code.
The good news is that with my changes, you won't have to be familiar with
Mozilla to do this. Everything is already contained in one little method
(Tokenizer::tokenizeHeaders). You just need to read the SpamBayes document,
figure out how to do the same mutation for a particular header in C++ without
regular expressions, and you are in business. This could be a great project for
some folks looking to get their feet wet with Mozilla without having to know all
the ins and outs.
Assignee | Comment 10 • 21 years ago
This patch also uses a pref, "mail.adaptivefilters.junk_threshold", so testers
can adjust the probability value used to determine whether a message is junk or not.
You can see that we treat almost all the headers the same, this is where I'm
hoping volunteers will step in:
default:
addTokenForHeader(lowerCaseHeaderName.get(),
nsDependentCString(headerValue));
break;
Right now I'm not even tokenizing these header values; I'm just adding a token
for "headername:headervalue". SpamBayes has lots of little rules for how each
header should be broken down.
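A sketch of the default "headername:headervalue" scheme (the table type and the lowercasing step are illustrative assumptions, not the patch's actual hash table):

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Illustrative token table: token text -> occurrence count.
typedef std::map<std::string, int> TokenTable;

// By default a header contributes a single "name:value" token.
void addTokenForHeader(TokenTable& table,
                       std::string name, const std::string& value) {
    for (char& c : name)   // normalize the header name
        c = std::tolower(static_cast<unsigned char>(c));
    if (!value.empty())
        ++table[name + ":" + value];
}
```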
Assignee | Updated • 21 years ago
Attachment #139674 - Attachment is obsolete: true
Attachment #141078 - Attachment is obsolete: true
Comment 11 • 21 years ago
I just had a look at the latest patch. As many users may have noticed, the
current implementation has problems identifying emails containing viruses,
because each is slightly different and in most cases the body is very short. Did
anyone already do a test run with this patch to see whether adding tokens for the
filename and content type of the attachments does a good job on this type of
spam? What about eventually also adding a token for the size of attachments?
Comment 12 • 21 years ago
That's a fair observation -- for example, for the past few weeks I've been able
to pretty much reflexively delete the MyDoom false-negatives just by eyeballing
that they're almost always 31K.
Assignee | Comment 13 • 21 years ago
See this comment about the availability of a test build containing the tokenizer
changes and the new algorithm:
http://bugzilla.mozilla.org/show_bug.cgi?id=181534#c55
Comment 14 • 21 years ago
Scott: just wanted to make sure you've read http://www.paulgraham.com/better.html
Assignee | Comment 15 • 21 years ago
Attachment #141299 - Attachment is obsolete: true
Comment 16 • 21 years ago
One thing to be very wary of is that just cherry-picking the "easy" chunks of
the tokenisation bits from spambayes may in fact not produce the best results.
The current set of tokeniser tricks is an accumulation of code, each piece of which
was tested thoroughly before being accepted. But they were tested _in_
_conjunction_ _with_ _all_ _other_ tricks. They inter-relate in a way that's far
too scary to contemplate.
I'd strongly recommend you set up a solid testing harness, with a couple of
different data sets, for checking new code. In particular, something like the SB
'timcv' cross-validation driver would seem to be essential. This breaks the
messages up into N buckets; picks a bucket, trains on all other buckets, then
scores the bucket chosen. Repeat for all buckets.
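The timcv-style driver splits indices along these lines (a sketch assuming simple round-robin bucketing; SpamBayes' actual driver may deal messages into buckets differently):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Deal the corpus into `buckets` buckets; for a given fold, return
// (training indices, held-out indices). A real driver trains a fresh
// classifier on the first list, scores the second, and repeats for
// every fold.
std::pair<std::vector<size_t>, std::vector<size_t> >
foldIndices(size_t messageCount, size_t buckets, size_t fold) {
    std::vector<size_t> train, test;
    for (size_t i = 0; i < messageCount; ++i) {
        if (i % buckets == fold)
            test.push_back(i);    // held out for scoring this fold
        else
            train.push_back(i);   // used for training this fold
    }
    return std::make_pair(train, test);
}
```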
Comment 17 • 21 years ago
Has there been any thought given to letting the user set the ham/spam cutoff
percentages like SpamBayes does? Building on that, the inclusion of a "Junk
Suspects" folder would be great. This is what the SpamBayes Outlook plugin does,
and it's easy to set up for Mozilla/Thunderbird when using the SpamBayes POP3
proxy.
So the client would include the following configuration settings:
Spam Cutoff Percentage, default 90% (scores at or above this number are spam)
Ham Cutoff Percentage, default 10% (scores at or below this number are ham)
Add a Junk Suspects folder that gets any messages classified as unknown (score
falls between the two numbers set above).
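The proposed two-cutoff scheme amounts to something like this (the thresholds are the defaults suggested above; the function itself is illustrative, not shipped code):

```cpp
#include <cassert>
#include <string>

// Scores at or above the spam cutoff are junk, at or below the ham
// cutoff are good, and anything in between is a "Junk Suspects"
// candidate.
std::string classify(double score,
                     double hamCutoff = 0.10, double spamCutoff = 0.90) {
    if (score >= spamCutoff) return "spam";
    if (score <= hamCutoff)  return "ham";
    return "unsure";
}
```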
Assignee | Comment 18 • 21 years ago
Yes, there has been thought on this. It is currently pref driven.
+// the probability threshold over which messages are classified as junk
+// this number is divided by 100 before it is used. The classifier can be fine-tuned
+// by changing this pref. Typical values are .99, .95, .90, .5, etc.
+pref("mail.adaptivefilters.junk_threshold", 90);
Reporter | Comment 19 • 21 years ago
Scott,
I spent some time going over my patch and testing to try to fix the fp problem.
I asked for and got a lot of help from the SpamBayes developers; we've found
several mistakes in my previous patch, and testing is showing more big
improvements.
What I've now discovered is that SpamBayes uses a different method for counting
tokens during training than Mozilla does; since we're now using their formulas,
the numbers are coming out wrong. In Mozilla, every occurrence of the same token
in a single message is counted and added to the training database; in SpamBayes,
they only count that token once. For example, if a spam message has the word
'free' in it 5 times, Mozilla increases the spamcount for 'free' by 5 in
training.dat while SpamBayes increases it by 1.
Since you spent some time digging through the tokenizer code, I thought you could
point me to where this change could be made.
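The SpamBayes counting rule being described could be sketched like this (illustrative only; the real code keeps counts in nsBayesianFilter's token hash table, not a std::map):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// During training, each distinct token in a message bumps its corpus
// count by exactly 1, no matter how many times it occurs in that
// message.
void trainOnMessage(std::map<std::string, int>& spamCounts,
                    const std::vector<std::string>& messageTokens) {
    std::set<std::string> seen(messageTokens.begin(), messageTokens.end());
    for (const std::string& tok : seen)
        ++spamCounts[tok];    // once per message, not once per occurrence
}
```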
Assignee | Comment 20 • 21 years ago
Miguel, that's awesome news!! As soon as you give me a new algorithm patch, I'll
re-run my tests to measure the performance and put up a new test build.
Here's where we update the token count:
http://lxr.mozilla.org/mozilla/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#202
if the token is already in the hashtable, we bump the mCount field on the token
right there. Change that to do nothing, and the count would stay at 1.
Reporter | Comment 21 • 21 years ago
That didn't work: it made the count never increase past 1. I need it to be
increased once per message, but not more than once.
Assignee | Comment 22 • 21 years ago
Well, that hash table contains all of the tokens for the current message being
parsed. If the count stays at 1 here, then later on, if we end up adding those
tokens to the training set, we copy out mCount for each token and add it to the
training set of tokens.
Maybe this method is used in two ways: one as the tokenizer that builds up the
tokens for a specific message (in which case we want that line to never
increment past 1), and secondly as the tokenizer that organizes the tokens for the
training set. When adding tokens to that tokenizer, we really do want to
increment mCount past 1. If that's the case, we may have to set a flag on the
tokenizer so we know which way it is being used.
Assignee | Comment 23 • 21 years ago
This patch merges the new tokenizing work with an updated version of the new
algorithm. See this comment for more details:
http://bugzilla.mozilla.org/show_bug.cgi?id=181534#c59
Attachment #141577 - Attachment is obsolete: true
Comment 24 • 21 years ago
Is HTML still being stripped from the body? Paul Graham has made references to
the fact that the FF0000 token (HTML color red) ranked as high as "teens".
There is some discussion about his updated tokenizer (which reads a, img, and
font tags) at http://www.paulgraham.com/better.html
Comment 25 • 21 years ago
Disregard my last comment. I just read the SpamBayes background page:
http://spambayes.sourceforge.net/background.html
Apparently, HTML tokenizing unfairly penalized ham with HTML. "In the end, the
best results were found by stripping out most HTML clues."
Assignee | Comment 26 • 21 years ago
Comment on attachment 143254 [details] [diff] [review]
updated patch that includes bug fixes to the core algorithm
Let's start getting these changes ready for a landing during 1.8.
David, here's a breakdown of the changes:
1) Modify libmime: when fetching a message for junk mail, set the output type
to text/html to force the HTML MIME emitter to get invoked
2) Allow a header sink to be set on individual urls instead of just the
nsIMsgWindow
3) Modify our header sink broadcaster class to use an array of nsAUTF8Strings
to avoid unicode / utf8 mismatches when calling into the JS front end for
message display and the CPP junk mail plugin header sink listener
4) Modify msgHdrViewOverlay.js to account for this API change
5) Modify libmime to properly call through this new version of the interface
6) Land the new algorithm, which uses different math for the junk scores
7) Tokenizer changes based on some of the SpamBayes rules talked about in this
bug, including stripping out HTML, creating special tokens for some headers (we
still ignore some headers for now), ignoring words longer than 13 characters and
turning them into n-gram tokens, and adding special tokens for each broadcast
attachment.
Attachment #143254 - Flags: superreview?(bienvenu)
Comment 27 • 21 years ago
These are all nits (Sethisms).
Can we get rid of the dumps?
+ dump(srcUrlPrimitive + '\n');
for (index in currentAttachments)
{
attachment = currentAttachments[index];
+ dump('attach.url: ' + attachment.url + '\n');
if (attachment.url == srcUrlPrimitive)
+ if(!aMsgHdrSink)
+ return NS_ERROR_NULL_POINTER;
+
+ *aMsgHdrSink = mMsgHeaderSink;
+ NS_IF_ADDREF(*aMsgHdrSink);
+ return NS_OK;
can be:
NS_ENSURE_ARG_POINTER(aMsgHdrSink)
NS_IF_ADDREF(*aMsgHdrSink = mMsgHeaderSink);
+nsresult nsMimeHtmlDisplayEmitter::BroadcastHeaders(nsIMsgHeaderSink *
aHeaderSink, PRInt32 aHeaderMode, PRBool aFromNewsgroup)
+{
+ nsresult rv = NS_OK;
if you move the decl of rv closer to where it's used, and remove the unneeded
init...
+ {
+ if (word[0] == '\0') continue;
if (!*word) continue
+
+ // important: leave out sender field. To strong of an indicator
should be "too strong..."
+ nsresult rv = NS_OK;
+ // Create a parser
+ nsCOMPtr<nsIParser> parser = do_CreateInstance(kParserCID);
+ NS_ENSURE_TRUE(parser, NS_ERROR_FAILURE);
slightly cleaner to do: (partly because I hate returning NS_ERROR_FAILURE...
+ nsresult rv;
+ // Create a parser
+ nsCOMPtr<nsIParser> parser = do_CreateInstance(kParserCID, &rv);
+ NS_ENSURE_SUCCESS(rv, rv);
similarly, here:
+ nsCOMPtr<nsIDTD> dtd = do_CreateInstance(kNavDTDCID, &rv);
+ NS_ENSURE_SUCCESS(rv, rv);
here, just return parser->Parse...
+ rv = parser->Parse(inString, 0, NS_LITERAL_CSTRING("text/html"), PR_FALSE,
PR_TRUE);
+ return rv;
+ if (goodclues > 150)
+ first = count - 150;
+ else
first = 0;
can just be
first = (goodclues > 150) ? count - 150 : 0;
return NS_OK;
Comment 28 • 21 years ago
Comment on attachment 143254 [details] [diff] [review]
updated patch that includes bug fixes to the core algorithm
sr=bienvenu with those nits.
Attachment #143254 - Flags: superreview?(bienvenu) → superreview+
Assignee | Comment 29 • 21 years ago
At long last this patch has finally made its way onto the 1.8 trunk. It was
already part of the latest Thunderbird 0.6 release.
I'm going to mark this as fixed, but there is a lot more tokenizer work to do
here. We'll spin up new bugs to track further improvements.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Updated • 20 years ago
Product: MailNews → Core
Updated • 17 years ago
Product: Core → MailNews Core