Closed Bug 437098 Opened 16 years ago Closed 16 years ago

Enable junk token limits

Categories

(MailNews Core :: Filters, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.9

People

(Reporter: rkent, Assigned: rkent)

References

Details

Attachments

(1 file, 1 obsolete file)

We started a discussion in bug 228675 about whether to enable by default the pruning of junk tokens in Thunderbird, a capability added by that bug. This bug continues that discussion and proposes that the token limit preference be set by default, but to a relatively large value that will not affect the vast majority of users.

The relevant preference, defined in bug 228675, is mailnews.bayesian_spam_filter.junk_maxtokens.
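(For reference, such a default is a single line in mailnews.js using the standard Mozilla pref syntax; the patch below amounts to an entry like the following, with the value that review settles on:)

  pref("mailnews.bayesian_spam_filter.junk_maxtokens", 100000);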
Flags: blocking-thunderbird3?
Here are some rough numbers to use as a basis for setting this limit.

On a Windows 2000 system, I installed TB trunk and then loaded training.dat files of different sizes. The training.dat files came from different runs I did using the TREC 2006 corpus. Here is TB memory use versus number of tokens; note there is some noise in the memory data:

Token Counts   TB memory (MB) training.dat size (MB)
 58110            64.7               1.3
 92114            64.4               2.1
119972            70.2               3.1
390577            90.4               9.6
524622           106.2              16.2

A linear regression on these data gives the relationship between memory usage and total token count:

(TB Memory) = 58.2 MB + (8.83E-5)(Total Counts)

With these numbers, bayes filter memory usage is 10% of the 58.2 MB baseline at 66,000 counts, and 50% at 330,000 counts. By comparison, the training.dat I am currently running has 124,000 token counts. Also by comparison, the testing graph at http://wiki.mozilla.org/User:Rkentjames:Bug228675, using the TREC 2006 corpus, shows an average false-negative rate of about 12% with a limit of 77,480 tokens.

I propose that the maximum count be set at 200,000. At that value, the bayes filter should add at most about 30% to the memory of the base TB program, and achieve an error rate of about 10%.
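For anyone who wants to check the arithmetic, here is a short illustrative Python sketch (not part of any patch; numpy's polyfit recovers the quoted coefficients from the table above):

  # Illustrative only: reproduce the regression and percentage figures
  # quoted above from the measured data.
  import numpy as np

  tokens = np.array([58110, 92114, 119972, 390577, 524622])
  mem_mb = np.array([64.7, 64.4, 70.2, 90.4, 106.2])

  slope, intercept = np.polyfit(tokens, mem_mb, 1)
  print("TB memory ~= %.1f MB + (%.2e)(token count)" % (intercept, slope))
  # -> TB memory ~= 58.2 MB + (8.83e-05)(token count)

  # Bayes overhead as a fraction of the ~58 MB baseline:
  for limit in (66000, 200000, 330000):
      print("%d -> %.0f%%" % (limit, 100 * slope * limit / intercept))
  # -> ~10% at 66,000, ~30% at 200,000, ~50% at 330,000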
Attached patch Token limit set to 200000 (obsolete) — Splinter Review
Just to keep this moving, here's the patch setting the limit at 200,000. I added Neil to the review mix to make sure we hear his opinion.
Attachment #324738 - Flags: superreview?(bienvenu)
Attachment #324738 - Flags: review?
Attachment #324738 - Flags: review? → review?(neil)
blocking-thunderbird3+, as this could give significant memory-consumption improvements according to the numbers in comment 1.
Flags: blocking-thunderbird3? → blocking-thunderbird3+
Comment on attachment 324738 [details] [diff] [review]
Token limit set to 200000

I'd probably be quite happy with a smaller figure (e.g. 100,000) if you prefer.
Attachment #324738 - Flags: review?(neil) → review+
I'd also be happier with a default limit of 100,000
OK, here's the patch for 100,000
Attachment #324738 - Attachment is obsolete: true
Attachment #324819 - Flags: superreview?(bienvenu)
Attachment #324738 - Flags: superreview?(bienvenu)
Comment on attachment 324819 [details] [diff] [review]
Token limit set to 100000

thx, Kent
Attachment #324819 - Flags: superreview?(bienvenu) → superreview+
Let's talk about risk, then. When this hits, some people are going to see their training.dat shrink. As a rough guess, the filter error rate will initially change from about 8% to 11% for people affected by this, but with a significant boost in performance and better startup time. Really aggressive trainers who have huge training files, though, might see the error rate go from, say, 6% to 11% and really notice it. That is the best case, assuming it works as designed.

Do you think that we need to warn anyone about this, and if so how?
For the upcoming Shredder 3.0a2 preview release, it could be a release-note item listed as a performance-enhancing feature. This has been needed for a long time.
Keywords: checkin-needed
Checking in mailnews/mailnews.js;
/cvsroot/mozilla/mailnews/mailnews.js,v  <--  mailnews.js
new revision: 3.316; previous revision: 3.315
done
Keywords: checkin-needed
Target Milestone: --- → Thunderbird 3
This will affect SeaMonkey too, right?
-> Core / Backend?
(In reply to comment #11)
> This will affect SeaMonkey too, right?
> -> Core / Backend?

Yes, you're right. Changing to Core / MailNews: Filters.
Assignee: kent → nobody
Status: ASSIGNED → NEW
Component: Preferences → MailNews: Filters
Product: Thunderbird → Core
QA Contact: preferences → filters
Target Milestone: Thunderbird 3 → ---
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee: nobody → kent
Status: REOPENED → NEW
Target Milestone: --- → mozilla1.9
Mkmelin, why did you reopen this?
Ah sorry, just meant to assign it back to you...
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: Core → MailNews Core
FWIW, I've _really_ noticed the decrease in junk filtering performance from this change: my old 4.1 MB training.dat with the threshold at 85 had me look at maybe two spam emails a month; now (with 1.4 MB) I see three a day (out of a total of 10,000/month).
Could it be worth investigating setting this number dynamically, based on the total number of incoming junk mails over a given time period? Those with a lot of spam might be much more willing to accept the higher memory usage.
Sander, if you want to adjust the limit, it is easily done by setting the integer preference mailnews.bayesian_spam_filter.junk_maxtokens. A UI to set the limit is also available in my JunQuilla extension at https://addons.mozilla.org/en-US/thunderbird/addon/9886. JunQuilla by default increases the limit to 300,000.
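(For example, a user.js entry like the following, in standard Mozilla pref syntax, would raise the limit to JunQuilla's default; shown here as an illustration only:)

  user_pref("mailnews.bayesian_spam_filter.junk_maxtokens", 300000);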

But you should also know that your experience is not typical; by that I mean that the numbers you are reporting are not realistic for the junk filter in Thunderbird. Typical performance of the internal bayes filter by itself would be a 10% false-negative rate, which in your case would equate to around 30 spam emails per day getting through. So I doubt you are going to go from your current impossibly good results back to your former, even more impossibly good results simply by increasing the junk token limit.
I just installed JunQuilla and manually increased the limit to 300k, but after training and then restarting Thunderbird (24.1.0, btw), the reported number of tokens drops to about 70k...
Does that mean this setting is ignored now?
Konstantin: I am unaware of any changes that would disable the token limits. That being said, when you hit the limit and tokens are pruned, there is no simple formula to determine how many tokens will remain, because it depends on how many single-use tokens you had. But I would expect that a remainder of 70K is more likely from a 300K starting place than from a 100K one.

In short, you have not presented any evidence that this is not working. Do some more training while monitoring the token count, and see whether it grows beyond 100K.
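To make the mechanism concrete, here is a toy Python model of the pruning behavior described above (an assumption-laden sketch, not the actual Thunderbird implementation; it assumes single-use tokens are what get dropped when the corpus exceeds the limit):

  # Toy model only -- NOT the real Thunderbird pruning code. Assumes
  # tokens seen exactly once are discarded when the corpus exceeds the
  # configured limit, which is why the post-prune size depends on the
  # distribution of token counts rather than on a fixed formula.
  def prune(token_counts, max_tokens):
      """token_counts: dict mapping token -> occurrence count."""
      if len(token_counts) <= max_tokens:
          return token_counts
      return {tok: n for tok, n in token_counts.items() if n > 1}

Under this model, a corpus just over 300K tokens of which roughly 77% are single-use would prune down to about 70K, consistent with the report above.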