Closed Bug 437098 Opened 16 years ago Closed 16 years ago

Enable junk token limits

Categories

(MailNews Core :: Filters, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.9

People

(Reporter: rkent, Assigned: rkent)

References

Details

Attachments

(1 file, 1 obsolete file)

We started a discussion in bug 228675 about whether to enable by default the pruning of junk tokens in Thunderbird, a capability added by that bug. This bug continues that discussion and proposes that the token limit preference be set by default, but to a relatively large value that will not affect the vast majority of users.

The relevant preference, defined in bug 228675, is mailnews.bayesian_spam_filter.junk_maxtokens.
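(For reference, such a default is a single line in mailnews.js using the standard Mozilla pref syntax; the patch below amounts to an entry like the following, with the value that review settles on:)

  pref("mailnews.bayesian_spam_filter.junk_maxtokens", 100000);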
Flags: blocking-thunderbird3?
Here are some rough numbers to use as a basis for setting this limit.

On a Windows 2000 system, I installed TB trunk and then loaded training.dat files of different sizes. The training.dat files came from different runs I did using the TREC 2006 corpus. Here is TB memory use versus number of tokens; note there is some noise in the memory data:

Token Counts   TB memory (MB) training.dat size (MB)
 58110            64.7               1.3
 92114            64.4               2.1
119972            70.2               3.1
390577            90.4               9.6
524622           106.2              16.2

A linear regression on these data gives the relationship between memory usage and total token count:

(TB Memory) = 58.2 MB + (8.83E-5)(Total Counts)

With these numbers, bayes filter memory usage is 10% of the 58.2 MB baseline at 66,000 counts, and 50% at 330,000 counts. By comparison, the training.dat I am currently running has 124,000 token counts. Also by comparison, the testing graph at http://wiki.mozilla.org/User:Rkentjames:Bug228675, using the TREC 2006 corpus, shows an average false-negative rate of about 12% with a limit of 77,480 tokens.

I propose that the maximum count be set at 200,000. At that value, the bayes filter should add at most about 30% to the memory of the base TB program, and achieve an error rate of about 10%.
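For anyone who wants to check the arithmetic, here is a short illustrative Python sketch (not part of any patch; numpy's polyfit recovers the quoted coefficients from the table above):

  # Illustrative only: reproduce the regression and percentage figures
  # quoted above from the measured data.
  import numpy as np

  tokens = np.array([58110, 92114, 119972, 390577, 524622])
  mem_mb = np.array([64.7, 64.4, 70.2, 90.4, 106.2])

  slope, intercept = np.polyfit(tokens, mem_mb, 1)
  print("TB memory ~= %.1f MB + (%.2e)(token count)" % (intercept, slope))
  # -> TB memory ~= 58.2 MB + (8.83e-05)(token count)

  # Bayes overhead as a fraction of the ~58 MB baseline:
  for limit in (66000, 200000, 330000):
      print("%d -> %.0f%%" % (limit, 100 * slope * limit / intercept))
  # -> ~10% at 66,000, ~30% at 200,000, ~50% at 330,000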
Attached patch Token limit set to 200000 (obsolete) — Splinter Review
Just to keep this moving, here's the patch setting the limit at 200,000. I added Neil to the review mix to make sure we hear his opinion.
Attachment #324738 - Flags: superreview?(bienvenu)
Attachment #324738 - Flags: review?
Attachment #324738 - Flags: review? → review?(neil)
blocking-thunderbird3+, as this could give significant memory-consumption improvements according to the numbers in comment 1.
Flags: blocking-thunderbird3? → blocking-thunderbird3+
Comment on attachment 324738 [details] [diff] [review]
Token limit set to 200000

I'd probably be quite happy with a smaller figure (e.g. 100,000) if you prefer.
Attachment #324738 - Flags: review?(neil) → review+
I'd also be happier with a default limit of 100,000
OK, here's the patch for 100,000
Attachment #324738 - Attachment is obsolete: true
Attachment #324819 - Flags: superreview?(bienvenu)
Attachment #324738 - Flags: superreview?(bienvenu)
Comment on attachment 324819 [details] [diff] [review]
Token limit set to 100000

thx, Kent
Attachment #324819 - Flags: superreview?(bienvenu) → superreview+
Let's talk about risk, then. When this hits, some people are going to see their training.dat shrink. As a rough guess, the filter error rate will initially change from about 8% to 11% for people affected by this, but with a significant boost in performance and better startup time. Really aggressive trainers who have huge training files, though, might see the error rate go from, say, 6% to 11% and really notice it. That is the best case, assuming it works as designed.

Do you think that we need to warn anyone about this, and if so how?
For the upcoming Shredder 3.0a2 preview release, it could be a release-note item listed as a performance-enhancing feature. This has been needed for a long time.
Keywords: checkin-needed
Checking in mailnews/mailnews.js;
/cvsroot/mozilla/mailnews/mailnews.js,v  <--  mailnews.js
new revision: 3.316; previous revision: 3.315
done
Keywords: checkin-needed
Target Milestone: --- → Thunderbird 3
This will affect SeaMonkey too, right?
-> Core / Backend?
(In reply to comment #11)
> This will affect SeaMonkey too, right?
> -> Core / Backend?

Yes, you're right. Changing to Core / MailNews: Filters.
Assignee: kent → nobody
Status: ASSIGNED → NEW
Component: Preferences → MailNews: Filters
Product: Thunderbird → Core
QA Contact: preferences → filters
Target Milestone: Thunderbird 3 → ---
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee: nobody → kent
Status: REOPENED → NEW
Target Milestone: --- → mozilla1.9
Mkmelin, why did you reopen this?
Ah sorry, just meant to assign it back to you...
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: Core → MailNews Core
FWIW, I've _really_ noticed the decrease in junk filtering performance from this change: my old 4.1 MB training.dat with the threshold at 85 had me look at maybe two spam emails a month; now (with 1.4 MB) I see three a day (out of a total of 10,000/month).
Could it be worth investigating setting this number dynamically, based on the total number of incoming junk mails over a given time period? Those with a lot of spam might be much more willing to accept the higher memory usage.
Sander, if you want to adjust the limit, it is easily done by setting the integer preference mailnews.bayesian_spam_filter.junk_maxtokens. A UI to set the limit is also available in my JunQuilla extension at https://addons.mozilla.org/en-US/thunderbird/addon/9886. JunQuilla by default increases the limit to 300,000.
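(For example, a user.js entry like the following, in standard Mozilla pref syntax, would raise the limit to JunQuilla's default; shown here as an illustration only:)

  user_pref("mailnews.bayesian_spam_filter.junk_maxtokens", 300000);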

But you should also know that your experience is not typical; by that I mean that the numbers you are reporting are not realistic for the junk filter in Thunderbird. Typical performance of the internal bayes filter by itself would be a 10% false-negative rate, which in your case would equate to around 30 spam emails per day getting through. So I doubt you are going to go from your current impossibly good results back to your former, even more impossibly good results simply by increasing the junk token limit.
I just installed JunQuilla and manually increased the limit to 300k, but after training and then restarting Thunderbird (24.1.0, btw), the reported number of tokens drops to about 70k...
Does that mean this setting is ignored now?
Konstantin: I am unaware of any changes that would disable the token limits. That being said, when you hit the limit and tokens are pruned, there is no simple formula to determine how many tokens will remain, because it depends on how many single-use tokens you had. But I would expect that a remainder of 70K is more likely from a 300K starting place than from a 100K one.

In short, you have not presented any evidence that this is not working. Do some more training while monitoring the token count, and see whether it grows beyond 100K.
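To make the mechanism concrete, here is a toy Python model of the pruning behavior described above (an assumption-laden sketch, not the actual Thunderbird implementation; it assumes single-use tokens are what get dropped when the corpus exceeds the limit):

  # Toy model only -- NOT the real Thunderbird pruning code. Assumes
  # tokens seen exactly once are discarded when the corpus exceeds the
  # configured limit, which is why the post-prune size depends on the
  # distribution of token counts rather than on a fixed formula.
  def prune(token_counts, max_tokens):
      """token_counts: dict mapping token -> occurrence count."""
      if len(token_counts) <= max_tokens:
          return token_counts
      return {tok: n for tok, n in token_counts.items() if n > 1}

Under this model, a corpus just over 300K tokens of which roughly 77% are single-use would prune down to about 70K, consistent with the report above.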