Closed Bug 243680 Opened 20 years ago Closed 20 years ago

training.dat is not updated/created by junk/spam filter, so no learning across sessions

Categories

(MailNews Core :: Filters, defect)

defect
Not set
major

Tracking

(Not tracked)

VERIFIED FIXED
mozilla1.8alpha1

People

(Reporter: bugzilla.spam2, Assigned: mscott)

References

Details

(Whiteboard: fixed-aviary1.0)

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8a) Gecko/20040514
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8a) Gecko/20040514

In order to retrain junk mail controls for the updated junk mail filtering
techniques I deleted the training.dat. Then I tried to train the junk mail
filter by marking my old mails as "not junk". But no training.dat was created.
When I marked one mail as "junk" the training.dat was created. Changing the mail
back to "not junk" the training.dat was not updated.

Same problem with Thunderbird 0.6+.

Reproducible: Always
Steps to Reproduce:
1.
2.
3.
I deleted training.dat, to get it retrained. I noticed very a vert high false
positive rate. Lots of messages wrongly marked as junk. But some correct.
Then, i discovered i had no training.dat at all. marking as junk and not junk
never created it.
recent cvs build (20040515), linux gtk2, seamonkey.
Current WinXP trunk build:
training.dat is not created in a new profile and I cannot make mailnews create
it by any means (marking as junk/not junk, changing junk prefs etc.).
Als the tool Filemon (it shows every kind of access to files) shows, Mozilla
does not even try to find/create/read/whtever it.
Only when new mail is downloaded, Mozilla tries to lookup the training data, but
of course fails: Filemon shows that Mozilla is asking for the existence of this
file and is answered that the file does not exist. That's all.

When an older profile is used with the same build, a training.dat is even
created if deleted before, and accessed when downloading new mails, but marking
a message (just received and found to be spam by mozilla's junk filter) as "not
junk" does not cause any access to training.dat - not when doing so, not on
closing mailnews, not when mozilla is closed completely.

-> confirming, All/All, major (tends to cause false positives), cc'ing mail
people, extending summary and requesting blocking1.8a?. I guess that's all... ;-)
Severity: normal → major
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: blocking1.8a?
OS: Windows XP → All
Hardware: PC → All
Summary: Marking mails as "not junk" does not update training.dat → Marking mails as "not junk" does not update training.dat / spam filter does not create that file at all
I can't reproduce this but I'll keep it on my plate
Assignee: sspitzer → mscott
Target Milestone: --- → mozilla1.8alpha
Latest builds are having problems creating/adding to training.dat file.   I've
tried deleting the training.dat file ... It will get re-created, but only the
first entry takes ... after that, it will no longer get updated.

Build ID 2004051608
would you please attach the filemon log? i actually read such things...
Attached file Filemon log
timeless:
Ok, I made a manual list of various steps, after each step I pasted the Filemon
output, if any.
I did four test runs, the first with a new profile, the second with an existing
training.dat copied into this profile, then with an old profile. While these
were done with a Mozilla cvs build from 2004-5-17, I did a test run with 1.7rc2
and a new profile to compare it.

This was all done with a pop mailbox on WinXP. My profile directory was
replaced/shortened to "T:\[path]\x.slt\" to improve readability and protect
privacy.

The result as far as I can see is that current Mozilla builds do only access
training.dat to read from it.
This happens as soon as knowledge about it is needed and is not repeated as
long as mailnews is running, so it seems just to stay in memory.
And this is what 1.7rc2 also does.

BUT 1.7rc2 also ALWAYS makes write accesses within 1-2 seconds whenever a
message is marked as Junk/not Junk - during my tests the current build did
never do this.

So it seems like current Mozilla builds do never write to training.dat and
therefore may either not learn at all or forget everything upon restart.
I was filtering Filemon output for training (but NOT limiting logging in any
other way, e.g. to my profile folder), so if just a different file is used,
just forget this all.
Attachment #148731 - Attachment mime type: application/octet-stream → text/plain
In case anyone *cough* wonders about my profile path...
This is the regular profile directory (....application data\mozilla...) on my
Windows partition.
However, I just also tested with trunk build 2004051709 from mozilla.org on a
different Windows installation on drive C:

There the problems are just the same. One example log entry including my test
profile path is:
13:38:26	mozilla.exe:1056	DIRECTORY	C:\DOKUMENTE UND
EINSTELLUNGEN\ADMINISTRATOR\ANWENDUNGSDATEN\Mozilla\Profiles\akutrain\ta32o2jv.slt\
NO SUCH FILE	FileBothDirectoryInformation: training.dat	
I can't reproduce this. Here's what I see:

1) in thunderbird: reset training dat. Or just remove training.dat by hand
2) quit the app
3) Restart again. 
4) Mark a message as junk
5) In nsBayesianSpamFilter::WriteTrainingData (
http://lxr.mozilla.org/mozilla/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#1300
)
 
We call:

rv = file->OpenANSIFileDesc("wb", &stream);

which creates the file and opens it
 
The next line then writes out the new training data.

6) After quitting I see a newly created training.dat.

I wonder if we are passing the wrong arguments to OpenANSIFileDesc for some
platforms?
mscott:
what system are you using?

I just set breakpoints at line 1304 and 1309 in this file and they are not hit
when manually changing the junk status of a mail. (or better: they were never
hit during my tests)
Ah I did find one problem with batching. It looks like the fix for Bug #193625
which went in last month (4/16) has an impedence mismatch for junk batching.

This patch added the following conditional at the start of
nsMsgDBView::OnMessageClassified:

  // this check is necessary because it is theoretically 
  // possible for a message to be flagged as non-junk,
  // and before the classifier can finish with it, for it
  // to be again classified as junk, and thus to have
  // mLastJunkURIInBatch set with its URI - and it should
  // not be considered the last junk messages during
  // the first call of this function
  if (aClassification == nsIJunkMailPlugin::GOOD)
    return NS_OK;

If you have a message marked as junk and you mark it as NOT JUNK. We enter
batching mode for the junk plugin. Then when the classification is done we enter
nsMsgDBView::OnMessageClassified which hits this new conditional and kicks out
right away. So we never end the batch mode for the junk plugin.

As a result the junk plugin is left in batch mode for the rest of the session
and we never write anything out to the training file again.

cc'ing folks who were involved in that bug fix because this seems like a pretty
serious problem.


Status: NEW → ASSIGNED
mscott:
I also set breakpoints to each caller of writeTrainingData. There is always an
if clause containing it and preventing it from being called because the
!mBatchLevel part is always "false". (mBatchLevel is 1 on first run, later 2, I
got it up to 4)

(those if clauses were hit in nsBayesianFilter::EndBatch and
nsBayesianFilter::observeMessage)
Ah, ok, I sent my comment before reading yours, but looks like it fits.
Keywords: regression
making this batching mismatch even worse is the fact that nsBayesianFilter.cpp
has a method called Shutdown which writes out the training set if it is dirty
regardless of the batching count.

Presumably this should be called when the app is shutting down but I see no
callers of: nsBayesianFilter::Shutdown
Keywords: regression
I don't see anyone calling nsIFilterPlugin::Shutdown in mailnews. Step 1, when
the plugin is getting destroyed, be sure to flush any changes to the training
set.

We still need to solve the mismatched batch calls though.
Comment on attachment 148768 [details] [diff] [review]
first part of the fix (call shutdown on exit)

Regardless of the mismatchedd batch notification calls in nsMsgDBView, I think
we'll want to make sure Shutdown gets called from the dtor of the bayesian
filter
Attachment #148768 - Flags: superreview?(bienvenu)
while I am thinking of a fix for the batch start/end mismatch in my code (the 
fix for 193625), it occurs to me that this may be a good time to quote my 
comment from nsMsgDBView saying:

...
nsMsgDBView::OnMessageClassified(const char *aMsgURI,
                                 nsMsgJunkStatus aClassification)
...
    // it seems EndBatch must be called here, rather
    // than after the last message has been dispatched for
    // classification. One would think when the UI tells the
    // classifier "I am done with sending you this batch of 
    // messages" the classifer should say "ok, now as soon
    // as I finish classification I should flush my data file"
    // ... but that's not the way it works.
Comment on attachment 148768 [details] [diff] [review]
first part of the fix (call shutdown on exit)

assuming that the destructor is called early enough (e.g., before some
interesting xpcom components are unloaded, sr=bienvenu)
Attachment #148768 - Flags: superreview?(bienvenu) → superreview+
fortunately the routine to write out the training set is extremely simple. It
opens an ANSI file descriptor then in a single pass writes out the bytes. So it
isn't relying on anything that would have gone away (at least on the surface)
I've posted a fix for the batch start/end at bug 193625. It's pretty
straightforward but it still needs r,sr... or you could just use it yourselves
without it being checked in.
Flags: blocking1.8a? → blocking1.8a-
It looks like this first checkin did not help at all.

It was checked in at 2004-05-18 16:41 and I was just testing with build
2004051909 - which is supposed to have the fix, no? - on WinXP.
Steps:
- Created new profile and mail account
- sent myself 2 identical mails with typical spam subject from another account
- those were marked as spam, I manually marked both of them as not-spam
- sent myself one more such mail, it now was automatically classified as not-spam.
So far so good, in-memory update of learning data does work.
- Then I shut down Mozilla completely, but no training.dat was written! No write
access even tried.
- For re-checking I started Mozilla/mailnews again and sent myself one more such
mail, it was marked as spam agin although it was supposed to have learned that
this kind of mail is not spam. (But since training.dat still did not exist, it
of course did not "remember".)

So spam filters will not learn anything across sessions and this way (as several
people reported) you will have a *lot* of false positives.
I can't believe this should be in 1.8a1 :-/
Maybe we should re-request blocking1.8a because of these new findings, but of
course it would be better, if someone could confirm my findings before.


I can confirm that this fix is not working in 2004051909.   No new updates are
made to training.dat.
the first part of the problem has been checked into the aviary 1.0 branch
Whiteboard: fixed-aviary1.0
The summary was not really reflecting reality... changing.
Summary: Marking mails as "not junk" does not update training.dat / spam filter does not create that file at all → training.dat is not updated/created by junk/spam filter, so no learning across sessions
*** Bug 244402 has been marked as a duplicate of this bug. ***
Silly question:  what moz branch did bug 193625 land on?  If it was trunk and
not branch, then why the whiteboard item "fixed-aviary1.0" ?
blocking1.8a2 ?
Flags: blocking1.8a2?
*** Bug 245168 has been marked as a duplicate of this bug. ***
Can I get an "amen" ?  This has been busted too long now.  This is a MAJOR
feature functionality, can someone please make this a priority to fix in the
trunk builds?

I definitely vote for a BLOCKER on this one.
My fix for bug 193625, which prevents this bug from occuring (although it does
not fix it in the deeper sense), has just gotten sr+, and since it has r+ as
well, it can now be checked in, to spare us the annoyance. Now all we need is a
person with CVS access and some iniative.

BTW (in continuation of my previous comment), instead of the patch above, a
'dirty' flag could be added, which is turned on whenever any training occurs,
off whenever training data is written to disk, and is checked periodically, and
at shutdown, to decide whether data needs to be written.
the cause of this bug has just been fixed.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
(In reply to comment #30)
> the cause of this bug has just been fixed.

No it hasn't. The current _trigger_ for this bug has been fixed. Please revoke
the FIXED status.
Have filed http://bugzilla.mozilla.org/show_bug.cgi?id=245499 regarding this matter.
Flags: blocking1.8a2?
This has been working now for a while on the trunk Windows builds, verified
FIXED using 2004-11-18-05, Windows XP Seamonkey trunk.
Status: RESOLVED → VERIFIED
Product: MailNews → Core
Is it possible anything could have regressed when "Aviary branch" started
landing on the trunk? My spam detection stinks. Just wondering if this should be
reopened.
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: