Closed Bug 309620 Opened 17 years ago Closed 15 years ago

Discard old junk filter data from training.dat

Categories

(SeaMonkey :: MailNews: Account Configuration, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 228675

People

(Reporter: allltaken, Unassigned)

Details

User-Agent:       Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.9a1) Gecko/20050919 SeaMonkey/1.1a
Build Identifier: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.9a1) Gecko/20050919 SeaMonkey/1.1a

I just noticed that my training.dat file is almost 6 MB, and suspect that the
earlier entries in it are only wasting space and processing time. What's the
feasibility of purging it of old entries once a week or so? "If the date is after
???, purge training.dat of entries from before ??? minus a year, and move the
next purge date forward by a week."

This would help to clear old useless entries out of training.dat.
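
The scheduling rule quoted above could be sketched roughly as follows. This is only an illustration of the date arithmetic: `check_purge`, `PURGE_INTERVAL` and `MAX_AGE` are made-up names, not anything in the SeaMonkey code, and the sketch assumes entries carry a date that can be compared against the cutoff.

```python
from datetime import datetime, timedelta

PURGE_INTERVAL = timedelta(weeks=1)  # "once a week or so" (assumed value)
MAX_AGE = timedelta(days=365)        # "a year" (assumed value)

def check_purge(now, next_purge):
    """Return (cutoff, new_next_purge).

    cutoff is None when no purge is due yet; otherwise entries dated
    before cutoff are the candidates for removal.
    """
    if now < next_purge:
        return None, next_purge
    # Purge date minus a year, then push the purge date on by a week.
    return next_purge - MAX_AGE, next_purge + PURGE_INTERVAL

cutoff, next_purge = check_purge(datetime(2005, 9, 20), datetime(2005, 9, 19))
print(cutoff, next_purge)
```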


Reproducible: Always

Steps to Reproduce:
1. Set up junk mail controls.
2. Watch the training.dat file grow for a year or two.
3. Watch the time and memory demands of processing the file grow.

Actual Results:  
After some length of time, the file becomes large and contains a smaller
percentage of useful junk filter information.

Expected Results:  
The software should purge old junk data, perhaps older than 6 months or a year,
as specified by the user (file size? date? other criteria?)

FWIW, my training.dat is 856 KB; as far as I can recall it's the same file that
I've been training to since junk filtering was first introduced (Moz 1.4?) --
when I switched to TB in August '04, I copied training.dat over to the new
profile.

We don't store any dates in the training.dat, so we can't age entries there.

You may want to have a look at <http://bayesjunktool.mozdev.org>, though,
and use it to remove all entries in your training.dat with fewer than ~20
occurrences...
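
As a rough illustration of pruning by occurrence count: the sketch below treats the training data as a plain token-to-count mapping, which is an assumption made for the example and not the actual training.dat layout. `prune_rare_tokens` is a hypothetical helper and does not describe how the tool linked above works internally.

```python
def prune_rare_tokens(token_counts, min_count=20):
    """Keep only tokens seen at least min_count times.

    token_counts is a hypothetical {token: occurrences} mapping; the
    threshold of 20 mirrors the "~20 occurrences" suggested above.
    """
    return {tok: n for tok, n in token_counts.items() if n >= min_count}

counts = {"viagra": 150, "hello": 19, "meeting": 20}
print(prune_rare_tokens(counts))
```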
Status: UNCONFIRMED → RESOLVED
Closed: 17 years ago
Resolution: --- → INVALID
Marked invalid? This is an enhancement request. 

Adding a last-modified date to the entries along with counts would make it
possible to select and discard outdated entries in training.dat. I haven't
looked at the junk log, but if its entries are dated and go back far enough, they
might show which entries were last modified at or before a specified cutoff date.
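
A minimal sketch of what that could look like, assuming each entry kept a count plus a last-modified date. `TokenEntry`, `record_token` and `discard_outdated` are hypothetical names for illustration; training.dat does not currently store dates.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TokenEntry:
    count: int
    last_modified: datetime  # hypothetical field, not in today's file

def record_token(entries, token, now):
    """Bump the count for token and stamp it with the current time."""
    entry = entries.get(token)
    if entry is None:
        entries[token] = TokenEntry(count=1, last_modified=now)
    else:
        entry.count += 1
        entry.last_modified = now

def discard_outdated(entries, cutoff):
    """Drop entries last modified at or before cutoff."""
    return {t: e for t, e in entries.items() if e.last_modified > cutoff}

entries = {}
record_token(entries, "offer", datetime(2004, 1, 1))
record_token(entries, "offer", datetime(2005, 9, 1))
record_token(entries, "stale", datetime(2004, 1, 1))
print(sorted(discard_outdated(entries, datetime(2004, 9, 19))))
```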
Status: RESOLVED → UNCONFIRMED
Resolution: INVALID → ---
> This is an enhancement request. 

True. Sorry.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Here's a suggestion for an approach that would have about the same effect as
aging, if the new records entered in the training.dat file are either appended
or prepended: Limit the size of training.dat. A preference and maybe a dialog to
specify the maximum size of training.dat, and some code to remove records from
the old part of the file would do the job.
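
The size-limit idea could be sketched like this, assuming records are appended so that the oldest come first. The record representation (a list of byte strings) and the name `trim_to_size` are assumptions for the example; the real file format is not modeled here.

```python
def trim_to_size(records, max_bytes):
    """Return the newest records whose combined size fits in max_bytes.

    records is a list of bytes objects ordered oldest first, as it would
    be if new records were always appended to the file.
    """
    total = 0
    kept = []
    for rec in reversed(records):      # walk from newest to oldest
        if total + len(rec) > max_bytes:
            break                      # everything older is dropped
        total += len(rec)
        kept.append(rec)
    kept.reverse()                     # restore oldest-first order
    return kept

print(trim_to_size([b"aaaa", b"bbb", b"cc", b"d"], max_bytes=6))
```

If records were prepended instead, the same loop would walk the list from the front and skip the final reversal.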

This has proved to be a problem with at least one other junk filter, SpamAssassin. My email started bouncing because the junk file used up my disk quota at my ISP. The lack of a tool to remove old blacklist data that is probably no longer valid is an oversight in the junk filtering process. Removing whitelist data is more problematic, and I wouldn't suggest removing any of it, even if it hasn't been used for a long time.

I noticed an addition to the preferences in Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20060921 SeaMonkey/1.5a: a button to reset training.dat. Does this button reset the whole file, or just the blacklist data? It seems to me that the whitelist should probably be left as is.
Status: NEW → RESOLVED
Closed: 17 years ago → 15 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 228675