Bug 228675 - Limit growth of junk token store
Status: RESOLVED FIXED
Keywords: footprint
Product: MailNews Core
Classification: Components
Component: Backend
Version: Trunk
Hardware/OS: All / All
Importance: -- normal, 2 votes
Target Milestone: mozilla1.9
Assigned To: Kent James (:rkent)
URL: http://www.entrian.com/sbwiki/Adaptiv...
Duplicates: 219949 302567 309620 442183
Depends on: 432812
Blocks: spam 294647 437098
Reported: 2003-12-16 09:52 PST by Scott MacGregor
Modified: 2009-12-31 12:50 PST
CC: 25 users
Flags: mscott: blocking-thunderbird2-; standard8: in-testsuite+


Attachments:
- Limit training.dat by simple /2 algorithm (9.07 KB, patch), 2008-03-28 09:46 PDT, Kent James (:rkent), no flags
- Fixed Standard8's nits (9.08 KB, patch), 2008-04-14 16:40 PDT, Kent James (:rkent), standard8: review+
- Fixed more nits (9.02 KB, patch), 2008-04-16 12:38 PDT, Kent James (:rkent), rkent: review+
- Added unit tests (25.11 KB, patch), 2008-05-07 13:14 PDT, Kent James (:rkent), no flags
- Fixed issues, added support for head and tail in unit test (29.03 KB, patch), 2008-05-23 16:48 PDT, Kent James (:rkent), standard8: review+
- [checked in] Updated unit tests to latest format (26.78 KB, patch), 2008-06-02 22:17 PDT, Kent James (:rkent), rkent: review+, mozilla: superreview+
- [checked in] Unit test classification forced sequential (2.93 KB, patch), 2008-06-04 10:28 PDT, Kent James (:rkent), standard8: review+

Description Scott MacGregor 2003-12-16 09:52:09 PST
Per David:

"we need to do something to limit/manage the growth of the token store. It takes
up a huge amount of memory"
Comment 1 Scott MacGregor 2004-01-22 08:40:07 PST
See: http://www.entrian.com/sbwiki/AdaptiveTraining for a discussion about
pruning the token database.

We may have to change the kind of information we write out to the token store in
order to make effective pruning decisions.
Comment 2 Patrick 2004-02-24 11:54:23 PST
*** Bug 219949 has been marked as a duplicate of this bug. ***
Comment 3 Scott MacGregor 2006-02-08 13:39:47 PST
This won't be a real blocker but I don't want to lose track of it.

I'd like to revisit some SPAM ideas such as this one for Thunderbird 2.0/3.0.
Comment 4 AlexIhrig 2006-07-16 03:22:53 PDT
Scott/David, should we set a target milestone?

My training.dat is >= 30 MByte today.
Comment 5 Scott MacGregor 2007-01-10 13:12:53 PST
punting for tb2, we didn't get any traction on this.
Comment 6 Worcester12345 2007-02-12 12:21:41 PST
What does "punting" mean in terms of software development? Does that mean "does not make ship date due to deadline"?  I thought mozilla software releases were determined by feature readiness, not timetables.
Comment 7 zug_treno 2008-02-07 07:24:42 PST
*** Bug 309620 has been marked as a duplicate of this bug. ***
Comment 8 Worcester12345 2008-02-07 15:36:31 PST
(In reply to comment #4)
> Scott/David, should we set a target milestone?
> 
> My training.dat is >= 30 MByte today.
> 
How does one determine this, and what does that mean and tell us? I'll report back here if you want. Junk mail filtering/blocking has not been working properly since maybe 1.5.

Comment 9 Kent James (:rkent) 2008-02-07 16:22:43 PST
"How does one determine this" - look at the file training.dat in your profile, and check its size.

"What does this mean and tell us" - that he has a very large amount of training data, and that will slow down his junk mail processing.

"... has not been working properly since maybe 1.5 ..." - all of the tests that I have done show that it IS working properly. What has changed is that spam senders have adapted to bayesian filters.

My initial tests on the TREC 2005 spam corpus generally showed that the effectiveness of the spam filter continually increased as additional tokens were added. That means that there will always be a tradeoff between the size of the training dataset and the effectiveness of the filter.

What is needed is:

1) Some method of token pruning. This will probably require maintaining additional information about each token, such as the date when it was created, or some measure of its effectiveness in use. An alternative might be to simply divide all token counts by 2 periodically, eliminating at that point all tokens with a count of 1.
2) A limit to the number of tokens used, with pruning occurring when that limit is reached. This would probably be a hidden preference.
3) Elimination of certain types of tokens that are virtually always hapaxes (such as message IDs)
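The periodic divide-by-2 idea in point 1 can be sketched in a few lines. This is a standalone JavaScript illustration, not the mailnews C++ code; the plain token-to-count map and the function name are assumptions made for the example:

```javascript
// Sketch of the "divide all counts by 2" pruning idea: every count is
// halved, and tokens seen only once (hapaxes) halve to 0 and are dropped.
// Illustration only; not the actual nsBayesianFilter data structures.
function shrinkTokenCounts(tokens) {
  const shrunk = new Map();
  for (const [word, count] of tokens) {
    const halved = Math.floor(count / 2);
    if (halved > 0) {
      shrunk.set(word, halved);
    }
  }
  return shrunk;
}

const before = new Map([["money", 6], ["viagra", 2], ["IAmAHapax", 1]]);
const after = shrinkTokenCounts(before);
// "money" -> 3, "viagra" -> 1; the hapax "IAmAHapax" is dropped
```

Since roughly half of the tokens in a typical store are hapaxes, one pass of this kind cuts the store size approximately in half while keeping the most frequently seen tokens.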
Comment 10 Worcester12345 2008-02-08 10:10:09 PST
(In reply to comment #9)

> "What does this mean and tell us" - that he has a very large amount of training
> data, and that will slow down his junk mail processing.



> That means that there will always be a tradeoff between size of the
> training dataset, and effectiveness.
> 
> What is needed is:

What would be nice is a simple tool to edit what is in there. If there are 2 entries marking a message as spam and 1 telling it it is not spam, those extra entries can go. As far as the rest, my training.dat file is 3865 KB, which is far less than the other one mentioned above. Thanks.
Comment 11 John T 2008-02-08 13:53:07 PST
Re: comment 8: the majority of complaints about junk filtering effectiveness, including my own, started with the mozilla 1.8a versions, predecessors of SM. Then my ISP changed its name and I haven't gotten enough spam to judge. Lately there has been more, and I'm seeing some filtering. There's not enough to compare to Moz 1.7.13 though.
Comment 12 Kent James (:rkent) 2008-02-08 15:29:43 PST
Re: comment 10: It is not as simple as that. Each message may have as many as 150 different tokens that are combined to determine spam versus ham (though typically, particularly in marginal cases, it is much less than that). I have found that eliminating all tokens that are seen, for example, only once in every 10000 messages still results in a measurable decrease in the spam rating effectiveness. When you have 100,000 tokens you are not likely to be able to select for yourself which ones to keep and which to eliminate. That's why you need a statistical method of determining that.

Another complication is that TB encourages you to do very little training, particularly of ham. You might have a few tokens that were only seen once during training, but now are important in determining ham vs spam. So we really need some way to determine the historical effectiveness of tokens before we eliminate them. Be that as it may, I may try the simple approach of dividing the counts in half as I described in comment 9 and see how much damage that causes, at least in analysis with a standard spam corpus. It is difficult to test the wide range of possible training systems that people might have used in the past, as there is no standard way to do this.

Generally spam management seems to be of less interest than in the past, if I judge by discussions on m.d.a.thunderbird. I think that most people now have server-based spam filtering that accomplishes most of what they need.
Comment 13 Worcester12345 2008-02-08 15:37:12 PST
The point is that there is no access to this for the casual user. If you mark something as spam and it is deleted, there is no calling it back to unmark it. This needs to change. 
Comment 14 Tony Mechelynck [:tonymec] 2008-02-09 02:37:18 PST
(In reply to comment #13)
> The point is that there is no access to this for the casual user. If you mark
> something as spam and it is deleted, there is no calling it back to unmark it.
> This needs to change. 
> 

That's why spam should not be deleted out of hand as soon as it is detected. Suspected spam should be moved to the Junk folder, where the user can inspect it, mark false positives as non-junk (thus contributing to the training), and move them elsewhere. IIUC, this is also the default JMC setting when junk filtering is first enabled.

If you mean the handling of spam by ISPs, yes, deleting spam out of hand without any possibility of user intervention is BAD because false positives get lost, thus preventing the user from seeing possibly important mail. I prefer the mail servers which separate spam away from "legit mail" but still give the user access to what got sorted away as spam, with the possibility to mark it as non-spam (example: gmail), or even those where the user has the option of doing no filtering at all, so he receives everything, losing no important mail at all.
Comment 15 Chris Andrichak 2008-02-10 16:59:03 PST
If you want to edit the training.dat file, there is this tool, which is a couple years old:

http://bayesjunktool.mozdev.org/

Like Kent says, though, it's more about useful marking than messing with the training file. I used the tool a few times and it didn't really seem to help much in the ways that I thought it would.
Comment 16 Kent James (:rkent) 2008-03-28 09:46:20 PDT
Created attachment 312297 [details] [diff] [review]
Limit training.dat by simple /2 algorithm

This patch limits the token store, without the need to change its format, by the process of dividing data by two when a token limit is reached. This has the effect of removing all tokens with a count of 1 (which is typically half of the tokens). Also, it discounts data in the past relative to future data.

I've done a bit of testing on this using the TREC 2005 and 2006 databases. First, understand that for a static database the more tokens that you use, the better the results. So implementing this token limit is done for performance reasons, not for reasons of improving the junk detection. My tests with a static corpus show that limiting tokens in this way gives similar junk detection performance compared to not limiting the tokens, for the same number of tokens used in both cases. The benefit comes with dynamic data. That is, using the training data from the TREC 2005 corpus to score the TREC 2006 corpus gives lousy results. If you limit the tokens, start with the 2005 data, and begin scoring 2006 data and also continuing to train, the junk detection very quickly reaches a level appropriate to the number of tokens being used. So while you would achieve similar or better results by continuing to train but not limiting the tokens, in real cases this is not practical because the token store gets larger and larger. Hmmm, I'm not sure how to make all of these points without writing a paper with graphs.

Anyway, this patch does nothing by default. There is a new preference that is the token limit. It is set to 0 by default, which does not limit tokens. Perhaps later we can set it to a real value, particularly if we incorporate some automatic training features.
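The trigger logic described above can be sketched roughly as follows, assuming a simplified in-memory store. The names (maybeShrink, goodTokens, goodMessageCount, and so on) are hypothetical, and this is a sketch of the behavior, not the actual patch code:

```javascript
// Sketch of the shrink trigger: a token limit of 0 disables the feature
// (the default); otherwise, once the combined good+junk token count
// exceeds the limit, every count is halved and tokens that fall to 0 are
// dropped. The good/junk message counts are halved too, preserving their
// ratio. Names and data layout are assumptions for this illustration.
function maybeShrink(store, maxTokenCount) {
  if (maxTokenCount === 0) {
    return false; // default preference value: no limit
  }
  if (store.goodTokens.size + store.badTokens.size <= maxTokenCount) {
    return false;
  }
  for (const tokens of [store.goodTokens, store.badTokens]) {
    for (const [word, count] of tokens) {
      const halved = Math.floor(count / 2);
      if (halved > 0) tokens.set(word, halved);
      else tokens.delete(word); // count-1 tokens disappear
    }
  }
  store.goodMessageCount = Math.floor(store.goodMessageCount / 2);
  store.badMessageCount = Math.floor(store.badMessageCount / 2);
  return true;
}

const store = {
  goodTokens: new Map([["cheap", 2], ["IAmAHapax", 1]]),
  badTokens: new Map([["money", 4]]),
  goodMessageCount: 10,
  badMessageCount: 20,
};
maybeShrink(store, 2); // 3 tokens > limit of 2, so everything is halved
```

Because old counts are halved while new training adds full counts, recent data is effectively weighted more heavily than older data, which is the discounting effect described above.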
Comment 17 Kent James (:rkent) 2008-03-30 17:47:36 PDT
I've put up a wiki page at http://wiki.mozilla.org/User:Rkentjames:Bug228675 that shows some results from using my patch. The conclusion is that, for the same number of average tokens with static data, you get similar performance from using a static training set, or continuing to train but limiting the token size with this patch. (The point of the patch is to allow you to continually train when the nature of your ham and spam is changing, so a static training set is not appropriate, but neither is a continually growing training.dat.)

I'm generally satisfied that this simple patch works, and want a review. I'll flag bienvenu for that review.
Comment 18 Chris Andrichak 2008-03-30 18:03:45 PDT
What was the actual number you used for your max token store size? I'm guessing about 500 from what you have on the wiki?

Generally, this patch looks like a good idea to me.
Comment 19 Kent James (:rkent) 2008-03-30 20:00:49 PDT
For the combined TREC 2005+2006 corpus, there were 52309 good messages, 77568 junk messages, 593,186 good tokens, and 484,172 junk tokens. training.dat size is 43 megabytes. That's what gives the best results!

Training using 5/100 of the TREC 2006 corpus gave 27108 good tokens, 31002 junk tokens, combined is 58110 tokens. For the test using the patch, I limited tokens to 4/3 of that, or 77480.

Comment 20 Tony Mechelynck [:tonymec] 2008-04-08 03:50:03 PDT
The way I read this bug, it should be cross-platform.
Comment 21 Tony Mechelynck [:tonymec] 2008-04-08 03:52:29 PDT
*** Bug 302567 has been marked as a duplicate of this bug. ***
Comment 22 Mark Banner (:standard8) 2008-04-14 00:58:35 PDT
Comment on attachment 312297 [details] [diff] [review]
Limit training.dat by simple /2 algorithm

 void Tokenizer::remove(const char* word, PRUint32 count)

I don't see how the changes in this function relate to this bug.

+      for (PRUint32 i=0; i<tokenCount; ++i) {

nit: we generally prefer more spaces i.e.

for (PRUint32 i = 0; i < tokenCount; ++i) {

+  if ( (mMaximumTokenCount > 0) && // if 0, do not limit tokens
+       (mGoodTokens.countTokens() + mBadTokens.countTokens() > mMaximumTokenCount) ) {

nit: no need for the extra space in between (( and ))

+      // Shrinking the token database is accomplished by dividing all token counts by 2.
+      // Recalculate the shrunk token count, keeping tokens with a count > 1
...
   if (!((fwrite(kMagicCookie, sizeof(kMagicCookie), 1, stream) == 1) &&
-        (writeUInt32(stream, mGoodCount) == 1) &&
-        (writeUInt32(stream, mBadCount) == 1) &&
-         writeTokens(stream, mGoodTokens) &&
-         writeTokens(stream, mBadTokens)))
+        (writeUInt32(stream, shrink ? mGoodCount/2 : mGoodCount) == 1) &&
+        (writeUInt32(stream, shrink ? mBadCount/2 : mBadCount) == 1) &&
+         writeTokens(stream, mGoodTokens, shrink) &&
+         writeTokens(stream, mBadTokens, shrink)))

The shrink ? mGoodCount/2 : mGoodCount seems wrong to me. Take a really simple example:

Good Tokens -> Count (mGoodCount = 6):

a -> 2
b -> 1
c -> 1
d -> 1
e -> 1

So b to e get dropped, a remains, that gets divided by two, giving you a token count of 1, rather than a token count of 3.
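Worked through numerically, the example above looks like this. This is a standalone sketch of the reviewer's arithmetic, not the patch itself:

```javascript
// Five good tokens with counts a:2, b:1, c:1, d:1, e:1, and mGoodCount = 6.
// Halving every token count drops b..e (they halve to 0) and leaves a at 1,
// so the surviving token total is 1, while mGoodCount/2 writes out 3.
const goodTokens = new Map([["a", 2], ["b", 1], ["c", 1], ["d", 1], ["e", 1]]);
const mGoodCount = 6;

let survivingTotal = 0;
for (const count of goodTokens.values()) {
  survivingTotal += Math.floor(count / 2);
}

console.log(survivingTotal);             // 1: only "a" survives, with count 1
console.log(Math.floor(mGoodCount / 2)); // 3: what shrink ? mGoodCount/2 : mGoodCount writes
```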

Also, although we're not turning it on by default, this seems like a really simple case where we could write an xpcshell test case that would compare the before and after files (which would only need to contain a few tokens). I would be willing to help you construct the basics/point you in the right direction if you want help with that (and it's not necessary to get review on this code at this stage).
Comment 23 Kent James (:rkent) 2008-04-14 09:48:41 PDT
The Tokenizer::remove changes are necessary because it is possible now for someone to request removal of a message from the database after the token counts have been shrunk. So if token IAmAHapax previously had a count of 1, and the database is shrunk, it will be removed. If later someone tries to remove the original message from the database, then IAmAHapax will be missing. Previously this was asserted as an error. With these changes, that can happen normally so should not be asserted.

Re shrink ? mGoodCount/2 : mGoodCount

mGoodCount is the number of good messages, not a count of tokens. The filter is only weakly dependent on this value, and always in the ratio of mGoodCount/mBadCount. There is no fully "correct" way to selectively remove tokens from the database and keep these numbers correct - that's one reason people have hesitated to suggest changes like this in the past. But recognizing that these numbers are relatively unimportant, and that only the ratio needs to be maintained, the approximation here seems reasonable. Why do it at all? Because someone might change their strategy, so that for example previously they had an excess of GoodCounts, while later they have an excess of BadCounts. This is just a smooth transition between these strategies.
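The claim that only the mGoodCount/mBadCount ratio matters, and that halving both counts preserves it, can be checked with a quick numeric sketch (the message counts below are made up for illustration):

```javascript
// Halving both message counts leaves the good/bad ratio, the quantity the
// filter actually depends on, unchanged. Values are illustrative only.
const mGoodCount = 500;
const mBadCount = 2000;

const ratioBefore = mGoodCount / mBadCount;
const ratioAfter = (mGoodCount / 2) / (mBadCount / 2);

console.log(ratioBefore, ratioAfter); // 0.25 0.25
```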

Re: xpcshell tests. I'd like to do that. Are they normally submitted as part of the main patch, or separately? Can you point me to a mailnews bug where an xpcshell test was added? How do I actually run the tests?

Also, for the record, although I am not proposing now to enable these limits, if we implement some form of automated training, which I think is a likely direction, then I will request this be enabled.

I'll fix your nits and resubmit. Thanks for the review!
Comment 24 Kent James (:rkent) 2008-04-14 16:40:10 PDT
Created attachment 315670 [details] [diff] [review]
Fixed Standard8's nits

Fixed Standard8's nits.
Comment 25 Mark Banner (:standard8) 2008-04-16 02:22:40 PDT
(In reply to comment #23)
> The Tokenizer::remove changes are necessary ... With these changes, that can happen normally so should not be asserted.

ok, that's fine, thanks for the explanation.

> Re shrink ? mGoodCount/2 : mGoodCount
> 
> mGoodCount is the number of good messages, not a count of tokens. The filter is
> only weakly dependent on this value, and always in the ratio of
> mGoodCount/mBadCount There is no fully "correct" way to selectively remove
> tokens from the database, and keep these numbers correct - that's one reason
> people have hesitated to suggest changes like this in the past. But recognizing
> that these numbers are relatively unimportant, and only the ratio needs to be
> maintained, then the approximation here seems reasonable. Why do it at all?
> Because someone might change their strategy, so that for example previously
> they had an excess of GoodCounts, while later they have an excess of BadCounts.
> This is just a smooth transition between these strategies.

This sounds reasonable. One thought though, would it be worth correcting the count on the next load so that it doesn't get too far out over long periods of time?

> Re: xpcshell tests. I'd like to do that. Are they normally submitted as part of
> the main patch, or separately? Can you point me to a mailnews bug where an
> xpcshell test was added? How do I actually run the tests?

The starting place is:

http://developer.mozilla.org/en/docs/Writing_xpcshell-based_unit_tests

Here's a list of the current mailnews tests:

http://mxr.mozilla.org/seamonkey/find?string=%2Fmailnews%2F.*test_.*&tree=seamonkey&hint=

Bug 415942 is probably a good example as any. You'll probably also want to apply the patch from bug 426615 as that will set up your test "profile" directory in a suitable place. Please also clean up the file(s) on exit.

Also, as I expect you'll be loading in one or more files, some of the tests in the address book copy book data into the test profile, and so they would be good to look at for how to do that.

Any more questions, feel free to ping me on irc.
Comment 26 Mark Banner (:standard8) 2008-04-16 02:28:46 PDT
Comment on attachment 315670 [details] [diff] [review]
Fixed Standard8's nits

+        PR_LOG(BayesianFilterLogModule, PR_LOG_DEBUG, ("remove word: %s (count=%d) (mCount=%d)", word, count, token->mCount));

Could you put this on two lines:

+        PR_LOG(BayesianFilterLogModule, PR_LOG_DEBUG,
       ("remove word: %s (count=%d) (mCount=%d)", word, count, token->mCount));

+      // We'll clear the tokens, and read them back in from the file. Yes this is slower than in place, but this is a rare event.
+      
+         // clear out our in memory training tokens...
+      if (mGoodTokens && mGoodTokens.countTokens())

The first comment needs splitting onto two lines, the second one needs correct indentation.

r=me with those fixed. Please do have a think about resetting the counts as well.
Comment 27 Kent James (:rkent) 2008-04-16 09:27:36 PDT
(In reply to comment #25)
> 
> This sounds reasonable. One thought though, would it be worth correcting the
> count on the next load so that it doesn't get too far out over long periods of
> time?
> 

I don't see how one could "correct the count". The counts represent a long history of training for messages that may have been deleted. What we are doing is discounting by half old message counts relative to the new messages that we will add in future training, since we have discounted the old token counts by half. This is my best attempt to be self correcting. It seems to perform in a stable manner in the tests I have done - though I have not tried a test where there has been a radical change in the (goodcount/badcount) ratio.
Comment 28 Kent James (:rkent) 2008-04-16 12:38:48 PDT
Created attachment 316077 [details] [diff] [review]
Fixed more nits

Carrying over r from Standard8.
Comment 29 Kent James (:rkent) 2008-05-07 13:14:37 PDT
Created attachment 319852 [details] [diff] [review]
Added unit tests

Carrying over sr, asking for new review on unit tests. There are no new changes to the main code from the last patch.
Comment 30 Kent James (:rkent) 2008-05-07 13:31:48 PDT
Comment on attachment 319852 [details] [diff] [review]
Added unit tests

Oops, not carrying over sr
Comment 31 Mark Banner (:standard8) 2008-05-09 05:36:43 PDT
Comment on attachment 319852 [details] [diff] [review]
Added unit tests

+do_import_script("mailnews/test/resources/mailDirService.js");

...
+      shutdownAcctMgr(nsIMsgAccountManager);

If bug 432812 goes in before this one, please change to use the new head_/tail_ format.

+// Create a local mail account (we need this first)
+nsIMsgAccountManager.createLocalMailAccount();

Does this need to be done outside run_test? Could you move it inside please.

+  try {nsIJunkMailPlugin.resetTrainingData();}
+  catch (e) {do_throw("resetTrainingData failed");}

nit: this is really hard to read. Also, do you really need to catch it here? You could just remove the try/catch and the test harness should catch it for you.

+  //dump("profileDir exist? " + profileDir.exists() + "\n");

I think this could probably be removed.

+  return uri.spec;
+  dump("uri.spec is " + uri.spec + "\n");
+}

nit: I don't think you need this dump here any more as well, if you do, then it should be on the line above.

+  const kError = 0; // Error return
+  const kOK = 1; // No error return
+   
+  // public methods
+  
+  this.read = read; // Returns kError=0 for error, kOK=1 for OK

Except it doesn't. So I think you can drop the constants and the comment.

+      checkToken("money", 0, 2);
+      checkToken("subject:report", 0, 0);
+      checkToken("to:careful reader <reader@example.org>", 1, 2);
+      checkToken("make", 0, 1);
+      checkToken("important", 1, 0);

Can you put some notes in regarding how these are calculated. As far as I understand it, you halve the counts when we get to the limit. So, in the spam messages I read 6 instances of "money", but the number is only 2. So are these counts the sum of occurrences across messages, or the number of messages that contain the token?
Comment 32 Kent James (:rkent) 2008-05-13 00:46:59 PDT
Removing dependency on bug 235226 - that bug seems unrelated to me.

Added dependency on bug 432812, since unit tests will be changed when this bug is available.
Comment 33 Mark Banner (:standard8) 2008-05-19 10:59:51 PDT
Comment on attachment 319852 [details] [diff] [review]
Added unit tests

Cancelling review per comment 31; this is about half-way between an r- and an r+, and I think I'd prefer to see an updated patch once we sort out your other problems.
Comment 34 David :Bienvenu 2008-05-19 12:36:55 PDT
Comment on attachment 319852 [details] [diff] [review]
Added unit tests

cancelling sr request, if I understand Standard8's previous comment, a new patch is coming?
Comment 35 Kent James (:rkent) 2008-05-19 13:16:23 PDT
(In reply to comment #34)
> (From update of attachment 319852 [details] [diff] [review])
> cancelling sr request, if I understand Standard8's previous comment, a new
> patch is coming?
> 
Yes. I've got several patches that I am trying to get working with the testing framework from bug 432812, this one included. I'll give another patch when I get that resolved.
Comment 36 Kent James (:rkent) 2008-05-23 16:48:42 PDT
Created attachment 322327 [details] [diff] [review]
Fixed issues, added support for head and tail in unit test

Unit tests follow the latest checked-in head and tail philosophy. I'm continuing to duplicate the loadLocalMailAccount() functionality from bug 434810 - and expect it will need to exist in virtually every unit test that I do. We need to at least resolve where we will keep such common functions, and whether to follow the load/unload naming pattern.

I also believe I fixed all of the comments on the previous patch.
Comment 37 Mark Banner (:standard8) 2008-05-29 07:46:00 PDT
Comment on attachment 322327 [details] [diff] [review]
Fixed issues, added support for head and tail in unit test

+// before shrink, the trained messages have 84 tokens. Force shrink.
+nsIPrefBranch.setIntPref("mailnews.bayesian_spam_filter.junk_maxtokens", 83);

nit: Please put this inside run_test - it'll be much clearer that it is happening, rather than potentially getting lost amongst the constants.

return true also isn't necessary in run_test afaik.

Please rework the bug 434810 items to be in line with what we've agreed.

r=me with those comments fixed.
Comment 38 Kent James (:rkent) 2008-06-02 22:17:13 PDT
Created attachment 323503 [details] [diff] [review]
[checked in] Updated unit tests to latest format

Standard8 commented:

"+// before shrink, the trained messages have 84 tokens. Force shrink.
+nsIPrefBranch.setIntPref("mailnews.bayesian_spam_filter.junk_maxtokens", 83);

nit: Please put this inside run_test - it'll be much clearer that it is
happening, rather than potentially getting lost amongst the constants."

Unfortunately I can't easily do that. The preference must be set before the junk filter instance is created, as it is read during initialization. Other constant definitions rely on the junk filter being created, so no matter what I do the majority of the constant definitions will be after the preference setting. So I left this alone.

The patch was adjusted to match the latest unit test environment. Carrying over the r, requesting sr.
Comment 39 David :Bienvenu 2008-06-03 09:32:42 PDT
Comment on attachment 323503 [details] [diff] [review]
[checked in] Updated unit tests to latest format

I was also confused by the mGoodCount/2, mBadCount/2 code, and then I went back and read the comments, so it makes a little more sense. It still seems like a pretty rough approximation, but I don't really know what the distribution of counts looks like for tokens... I was worried that mGoodCount and mBadCount might get to zero incorrectly.
Comment 40 Kent James (:rkent) 2008-06-03 10:14:57 PDT
(In reply to comment #39)
> (From update of attachment 323503 [details] [diff] [review])
> I was also confused by the  mGoodCount/2, mBadCount/2 code, and then I went
> back and read the comments, so it makes a little more sense. It still seems
> like a pretty rough approximation

No doubt it is. It is a pragmatic approximation designed to work with the existing code and available data, without needing to change data file formats. But I did test it, and it gives performance roughly equivalent to the existing filter after correcting for the number of tokens in use. 

There is a whole body of knowledge surrounding token pruning, so there are opportunities to improve this so that pruning can give better performance. However, they require additional information that we currently don't maintain. Perhaps at some point in the future we can revisit the statistical processing of email data, and incorporate additional information that would be useful to improve token pruning. For now, this patch needs to be compared to the current recommended option, which is to delete training.dat and start over. Compared to that, this is less disruptive and more effective.

The current utility of this patch will be for options involving drastically increasing the junk training of email, probably by automating it. You would need token pruning to manage training.dat growth in that case.

The question we will face before next release is whether to enable this by default. I would probably argue that it makes sense to prune tokens by default when they reach some very high value, such as when the memory requirements for the bayes filter are equal to 50% of the memory for the rest of the application.
Comment 41 David :Bienvenu 2008-06-03 10:41:13 PDT
I think if we might want to turn this on before the next release, we should try turning it on in nightly builds, sooner rather than later.

I think devoting 33% of app memory to the bayes filter (if I understood your math correctly) might be a bit high. There's also a point at which loading training.dat from disk into memory is a performance problem. Do you have any data on where the sweet spot is, beyond which adding more tokens has very little effect on the accuracy of the filter?
Comment 42 Kent James (:rkent) 2008-06-03 11:24:54 PDT
(In reply to comment #41)
> Do you have any
> data on where the sweet spot where adding more tokens has very little effect on
> the accuracy of the filter?
> 

"Very little effect" is a relative term. I assume you have looked at the wiki page I mentioned in comment 17.  To roughly generalize from that, using 10 times the data reduces the classification error by a factor of 2. Still, the error is still significant even for an unreasonable number of tokens, so we want to use the most available data. For that reason, I think it is better if the decision is scaled to available resources (such as existing memory or cpu usage) rather than to some concept of a sweet spot.

I think that the main risk of turning this on deals with the good tokens. Current TB design does not encourage people to train good messages, even though it is important. Somebody could have a lot of tokens, but the vast majority might be junk tokens. Pruning tokens in that case might reduce their already paltry collection of good tokens to an untenable value.

While I don't think it is good in general for spam processing to take 33% of available memory, the question here is whether you want to silently limit the total number of tokens for someone who is a very aggressive bayes trainer, since those are the people who will reach the limits. They are hoping that the filtering will respond to their aggressive training, so they may not appreciate too much pruning, which will hurt the bayes effectiveness. I don't think this will affect the vast majority of users, who probably train more casually.

Comment 43 Kent James (:rkent) 2008-06-03 12:09:50 PDT
I filed bug 437098 to continue the discussion of whether to enable this feature by default in Thunderbird.
Comment 44 Mark Banner (:standard8) 2008-06-03 13:18:55 PDT
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/Makefile.in,v
done
Checking in mailnews/extensions/bayesian-spam-filter/test/Makefile.in;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/Makefile.in,v  <--  Makefile.in
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/unit/head_bayes.js,v
done
Checking in mailnews/extensions/bayesian-spam-filter/test/unit/head_bayes.js;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/unit/head_bayes.js,v  <--  head_bayes.js
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/unit/test_bug228675.js,v
done
Checking in mailnews/extensions/bayesian-spam-filter/test/unit/test_bug228675.js;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/unit/test_bug228675.js,v  <--  test_bug228675.js
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/ham1.eml,v
done
Checking in mailnews/extensions/bayesian-spam-filter/test/resources/ham1.eml;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/ham1.eml,v  <--  ham1.eml
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/ham2.eml,v
done
Checking in mailnews/extensions/bayesian-spam-filter/test/resources/ham2.eml;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/ham2.eml,v  <--  ham2.eml
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/spam1.eml,v
done
Checking in mailnews/extensions/bayesian-spam-filter/test/resources/spam1.eml;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/spam1.eml,v  <--  spam1.eml
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/spam2.eml,v
done
Checking in mailnews/extensions/bayesian-spam-filter/test/resources/spam2.eml;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/spam2.eml,v  <--  spam2.eml
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/spam3.eml,v
done
Checking in mailnews/extensions/bayesian-spam-filter/test/resources/spam3.eml;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/spam3.eml,v  <--  spam3.eml
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/spam4.eml,v
done
Checking in mailnews/extensions/bayesian-spam-filter/test/resources/spam4.eml;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/spam4.eml,v  <--  spam4.eml
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/trainingfile.js,v
done
Checking in mailnews/extensions/bayesian-spam-filter/test/resources/trainingfile.js;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/resources/trainingfile.js,v  <--  trainingfile.js
initial revision: 1.1
done
Checking in mailnews/extensions/bayesian-spam-filter/Makefile.in;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/Makefile.in,v  <--  Makefile.in
new revision: 1.5; previous revision: 1.4
done
Checking in mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp,v  <--  nsBayesianFilter.cpp
new revision: 1.75; previous revision: 1.74
done
Checking in mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.h;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.h,v  <--  nsBayesianFilter.h
new revision: 1.21; previous revision: 1.20
done
Comment 45 Mark Banner (:standard8) 2008-06-04 02:32:59 PDT
test_bug228675.js was randomly failing; we think it was due to the classifications not always completing in sequence. Kent is going to come up with a new patch later. For now I've commented out the contents of run_test() so that we don't keep getting random failures on the SeaMonkey tinderboxes.
Comment 46 Kent James (:rkent) 2008-06-04 10:28:50 PDT
Created attachment 323729 [details] [diff] [review]
[checked in] Unit test classification forced sequential

This patch forces the classification in the unit test to be done sequentially. Try it out, Standard8, to see if it fails on your Mac.
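
The intermittent failures came from classifications completing in an unpredictable order, so the fix chains them. A minimal sketch of that pattern (hypothetical helper names, not the actual test_bug228675.js code):

```javascript
// Force asynchronous classifications to run one at a time: each
// completion callback kicks off the next message, instead of firing
// all classify calls up front and letting them finish in any order.
function classifySequentially(messages, classifyOne, done) {
  let index = 0;
  function next() {
    if (index >= messages.length) {
      done(); // all messages classified, in order
      return;
    }
    const msg = messages[index++];
    // classifyOne(msg, callback) invokes callback when this
    // classification finishes, which starts the next one.
    classifyOne(msg, next);
  }
  next();
}
```

This removes the race at the cost of parallelism, which is an acceptable trade in a unit test where deterministic ordering matters more than speed.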
Comment 47 Mark Banner (:standard8) 2008-06-05 02:25:18 PDT
Comment on attachment 323729 [details] [diff] [review]
[checked in] Unit test classification forced sequential

Much better; I couldn't get this to fail on my Mac, even with 5 builds running alongside it.
Comment 48 Mark Banner (:standard8) 2008-06-05 02:29:10 PDT
Comment on attachment 323729 [details] [diff] [review]
[checked in] Unit test classification forced sequential

Checking in mailnews/extensions/bayesian-spam-filter/test/unit/test_bug228675.js;
/cvsroot/mozilla/mailnews/extensions/bayesian-spam-filter/test/unit/test_bug228675.js,v  <--  test_bug228675.js
new revision: 1.3; previous revision: 1.2
done
Comment 49 Mark Banner (:standard8) 2008-06-05 06:41:25 PDT
This test seems to be running fine now on the tinderboxes.
Comment 50 Wayne Mery (:wsmwk, NI for questions) 2008-11-17 11:35:45 PST
*** Bug 442183 has been marked as a duplicate of this bug. ***