283493 - flushing of training data during session practically never happens due to too high "# of changes" threshold

Assignee

Description

•

19 years ago

If I'm reading the nsBayesianFilter code right, it looks like it requires 50
changes to the training data during a session, or else it won't ever update
training.dat while the program is running. (cf bug 283372 comment 2)

Theoretically training.dat should be updated on exit, but that doesn't seem to
work reliably (bug 283372 comment 0).

I doubt anyone in normal everyday use marks 50 spam/ham during a single session.
This means that, in practice, training data is never flushed during the session.
Also, if for some reason the flush on exit keeps failing, no training data is
ever saved.

My suggestion is to change

-#define DEFAULT_WRITE_TRAINING_DATA_MESSAGES_THRESHOLD  50
+#define DEFAULT_WRITE_TRAINING_DATA_MESSAGES_THRESHOLD  1

or, preferably, just go bool again on mNumDirtyingMessages.

Eyal, afaict you you wrote the timed flushing code in bug 245499; does this
suggestion make sense?

Eyal Rozenberg

Comment 1

•

19 years ago

It seems to me like you're mixing two different issues. One issue is the problem
with flushing on exit due to the inavailabity of the profile directory name.
That has been / is being addressed AFAIK; it would not be reasonable to make
flush policy changes based on this issue. Once the patch has been checked in,
flushing on exit is quite reliable.

As for the number of classified messages, IIRC the filter also includes in the
training set messages classified automatically rather than manually, which means
50 shouldn't be that much. Which is probably why David (Bienvenu) didn't
complain. But on second though, I'm not too sure about this point.

My idea was simply to prevent a flush after each 'batch' of messages. If nobody
(specifically, David Bienvenu and Scott McGregor) thinks there's a need for
having a threshold higher than one, than the pref should indeed be removed
entirely and instead of an integral number of dirtying messages we would go back
to having a flag (but only flush every 15 minutes at most frequent). I would
have no objection to this.

Tuukka Tolvanen (sp3000)

Assignee

Comment 2

•

19 years ago

I'm not mixing the two issues -- that's why I spun this off bug 283372. If we
want to put all our eggs in the flush-on-exit basket, why does the code for
flush-while-running exist in the first place?

If the junk filter did have autolearning, balancing autolearning with
user-applied training would probably be a major issue. Seeing as I've noticed no
discussion on that, I assume there is no autolearning taking place :)

Regardless of that, learning takes place only due to events that cause immediate
disk writes anyway, be it new mail arrival for autolearn or the user marking
messages manually causing status flag updates. Flushing the training data once
many minutes later for a bunch of such events, even one, just can't be
considered too expensive to do.

(summoning people from comment 1 for comment)

Eyal Rozenberg

Comment 3

•

19 years ago

But appending a message to an existing mailbox file is not the same as rewriting
the training data which is AFAIK what happens on a flush. Plus, the less
unnecessary disk writes the better.

Tuukka Tolvanen (sp3000)

Assignee

Comment 4

•

19 years ago

right, right... I'm not arguing for more disk writes for the sake of disk writes
:) just that, barring reliable flush-on-exit, this is afaik the only user data
we essentially elect to never save in practice, and that seems wrong.

Tuukka Tolvanen (sp3000)

Assignee

Comment 5

•

19 years ago

Attached patch go bool — Details — Splinter Review

*cough* sorry about getting argumentative above, that happens when a bug dares
intrude on /my/ routines, heh

bottom line, neither trigger for flushing training data is happening:
 1) even after bug 283080, flush on exit can apparently be fudged into failing
    without spectacular user effort, I can try to retrace the steps to that in
    bug 283372 once real life stops interfering
 2) flush while running doesn't happen because it requires too many changes
    to training data before flushing

This takes care of (2) by considering one change to training data enough to
warrant flushing. Doing flushing no less than 15 minutes apart is perhaps a
longish interval, but I'm not changing that in this patch.

Tuukka Tolvanen (sp3000)

Assignee

Updated

•

19 years ago

Assignee: sspitzer → tuukka.tolvanen

Status: NEW → ASSIGNED

Attachment #175681 - Flags: superreview?(bienvenu)

Attachment #175681 - Flags: review?(mscott)

Worcester12345

Updated

•

19 years ago

Flags: blocking-aviary1.1?

Tuukka Tolvanen (sp3000)

Assignee

Updated

•

19 years ago

Summary: Don't prevent periodic flushing of training data due to not "enough" changes → flushing of training data during session practically never happens due to too high "# of changes" treshold

Jestre

Comment 6

•

19 years ago

(In reply to comment #2)
> I'm not mixing the two issues -- that's why I spun this off bug 283372. If we
> want to put all our eggs in the flush-on-exit basket, why does the code for
> flush-while-running exist in the first place?
> 
> If the junk filter did have autolearning, balancing autolearning with
> user-applied training would probably be a major issue. Seeing as I've noticed no
> discussion on that, I assume there is no autolearning taking place :)
> 
> Regardless of that, learning takes place only due to events that cause immediate
> disk writes anyway, be it new mail arrival for autolearn or the user marking
> messages manually causing status flag updates. Flushing the training data once
> many minutes later for a bunch of such events, even one, just can't be
> considered too expensive to do.
> 
> (summoning people from comment 1 for comment)

There is a possibility of auto learning, it is possible to set up a standard
filter to mark as junk.
Over half of my junk is currently caught, marked as junk and deleted with
several filters.  (I know I don't get mail from .TV & .CZ domains etc.)
These filters are autotraining my Bayesian spam filter.  (when it works)
As for only processing a couple spam at a time, if I go one day with out getting
my email, I get 150 spam. For a weekend (Fri-Mon) I may have in excess of 500
spam, 250+ which get marked as junk by filters while downloading on Monday.
FWIW
Limiting writes is important to me as I run my TB off of a USB drive.
Jestre

Asa Dotzler [:asa]

Updated

•

19 years ago

Flags: blocking-aviary1.1? → blocking-aviary1.1-

Tuukka Tolvanen (sp3000)

Assignee

Comment 7

•

19 years ago

Should the timed flushing code just go away, instead, now that flush on exit
works (bug 283372)? It's only doing work in extreme cases such as e.g. the
mass-manualfilter-training of comment 6, and is adding code obfuscation to
debugging possible future training flush problems. The patch to make it work
more predictably could at times cause large writes in response to events that as
such only caused small writes (comment 3).

Eyal Rozenberg

Comment 8

•

19 years ago

(In reply to comment #7)
I don't agree. See also bug 245499 in which this code was introduced. The point
is that the previous state of affairs was that after every 'batch' of messages,
i.e. after filtering the messages on a 'get new messages', or some set of
messages you've selected and marked manually - even one - the training data is
rewritten. This is overkill - if you get new messages and then you look at at
some of them, and click, say, to mark one as junk, a second, and then a third -
you get a total of 4 writes within the span of a minute. Instead, it is more
reasonable to space out the writes so that in such cases only one write will
happen - about . Right now the default min time interval is 1 second (the number
of dirtying messages was the limiting factor earlier), I think raising it to 60
seconds or 120 seconds would be a better idea.

Worcester12345

Updated

•

19 years ago

Flags: blocking-aviary2.0?

Scott MacGregor

Comment 9

•

19 years ago

Comment on attachment 175681 [details] [diff] [review]
go bool

I'm inclined to listen to Eyal who had some comments that seemed to disagree with this patch if I understood the conversation correctly.

Attachment #175681 - Flags: review?(mscott)

Worcester12345

Updated

•

19 years ago

Flags: blocking-thunderbird2?

Scott MacGregor

Comment 10

•

18 years ago

Comment on attachment 175681 [details] [diff] [review]
go bool

I think Eyal had some concerns about this patch in comment 8. I'll let him review (which could mean minusing) this patch as we work with Tuukka to figure out the right thing to do.

Attachment #175681 - Flags: review?(eyalroz)

Eyal Rozenberg

Comment 11

•

18 years ago

Comment on attachment 175681 [details] [diff] [review]
go bool

My concerns were not with the patch, but rather with the idea of making timed flushing go away (which the patch does not do). Removing the number-of-learned-messages threshold is fine by me as long as the minimum delay stays.

Attachment #175681 - Flags: review?(eyalroz) → review+

Scott MacGregor

Comment 12

•

18 years ago

Comment on attachment 175681 [details] [diff] [review]
go bool

This patch looks good to me. I'll let David jump in if he disagrees.

Let me know if you need someone to check this in.

Attachment #175681 - Flags: superreview?(bienvenu) → superreview+

Tuukka Tolvanen (sp3000)

Assignee

Comment 13

•

18 years ago

> Let me know if you need someone to check this in. 

please do, thanks

Whiteboard: [checkin needed]

:Gavin Sharp [email: gavin@gavinsharp.com]

Comment 14

•

18 years ago

mozilla/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.h 1.18
mozilla/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp 1.55

Status: ASSIGNED → RESOLVED

Closed: 18 years ago

OS: Linux → All

Hardware: PC → All

Resolution: --- → FIXED

Whiteboard: [checkin needed]

Worcester12345

Comment 15

•

18 years ago

Any chance this can get fixed on the branch for 2.0? Junk mail is having real issues there.

Scott MacGregor

Comment 16

•

18 years ago

not a blocker, but thanks for nominating. I'll look at approving the patch for the branch.

Flags: blocking-thunderbird2? → blocking-thunderbird2-

Worcester12345

Comment 17

•

17 years ago

(In reply to comment #16)
> not a blocker, but thanks for nominating. I'll look at approving the patch for
> the branch. 
> 
This would be great. Junk blocking still not right on nightly branch builds.

Wayne Mery (:wsmwk)

Comment 18

•

17 years ago

two questions:
1. v2 still doesn't have this, right?  (based on my reading of the checkins)
2. is this a big enough win that a v2 patch should be done? (key points comment 8 and comment 11?)  The patch seems like it would be low risk.

Wayne Mery (:wsmwk)

Comment 19

•

17 years ago

Comment on attachment 175681 [details] [diff] [review]
go bool

see comment 18

Attachment #175681 - Flags: approval1.8.1.10?

Daniel Veditz [:dveditz]

Comment 20

•

17 years ago

Comment on attachment 175681 [details] [diff] [review]
go bool

next TB is 1.8.1.11

Attachment #175681 - Flags: approval1.8.1.10? → approval1.8.1.11?

Mike Schroepfer

Updated

•

17 years ago

Flags: blocking1.8.1.12?

Daniel Veditz [:dveditz]

Updated

•

17 years ago

Flags: blocking1.8.1.12? → blocking1.8.1.12+

Daniel Veditz [:dveditz]

Comment 21

•

17 years ago

Comment on attachment 175681 [details] [diff] [review]
go bool

approved for 1.8.1.12, a=dveditz for release-drivers

Who's going to land this?

Attachment #175681 - Flags: approval1.8.1.12? → approval1.8.1.12+

Samuel Sidler (old account; do not CC)

Comment 22

•

17 years ago

QA: To test, mark a message as spam and see if the training.dat file gets updated.

Phil Ringnalda (:philor)

Comment 23

•

17 years ago

mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp 1.50.4.5
mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.h 1.17.8.1

Keywords: fixed1.8.1.12

QA Contact: filters

Summary: flushing of training data during session practically never happens due to too high "# of changes" treshold → flushing of training data during session practically never happens due to too high "# of changes" threshold

Joe Sabash [:JoeS1]

Comment 24

•

17 years ago

If I'm reading the checkins correctly this went into branch with previous checkin.
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12pre) Gecko/20080115 Thunderbird/2.0.0.12pre ID:2008011503
After send mail, on shutdown:
VC++ runtime library R6025 -pure virtual function call
Not sure exactly when this started, as I haven't been updating my branch build daily. Can't be more than a few days though.
If mail not sent shutdown is normal.

Samuel Sidler (old account; do not CC)

Comment 25

•

17 years ago

Joe, can you please file a new bug for that issue? When you do, please make it blocking this bug (for now) and request blocking1.8.1.12.

I kind of think it's unlikely that your issue was caused by this patch, but let's triage it in another bug.

Joe Sabash [:JoeS1]

Updated

•

17 years ago

Depends on: 412708

Joe Sabash [:JoeS1]

Comment 26

•

17 years ago

filed bug 412708

Daniel Veditz [:dveditz]

Updated

•

17 years ago

No longer depends on: 412708

juan becerra [:juanb]

Comment 27

•

16 years ago

Verified on: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12pre) Gecko/2008012903 Thunderbird/2.0.0.12pre

Marked a message as Junk (per comment #22) and shutdown Tbird. At that point the training.dat file was updated. If no messages are flagged as Junk, the file does not change on exit.

Added the "verified1.8.1.12" keyword.

Keywords: fixed1.8.1.12 → verified1.8.1.12

Nobody; OK to take it and work on it

Updated

•

16 years ago

Product: Core → MailNews Core