243680 - training.dat is not updated/created by junk/spam filter, so no learning across sessions

Reporter

Description

•

21 years ago

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8a) Gecko/20040514 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8a) Gecko/20040514 In order to retrain junk mail controls for the updated junk mail filtering techniques I deleted the training.dat. Then I tried to train the junk mail filter by marking my old mails as "not junk". But no training.dat was created. When I marked one mail as "junk" the training.dat was created. Changing the mail back to "not junk" the training.dat was not updated. Same problem with Thunderbird 0.6+. Reproducible: Always Steps to Reproduce: 1. 2. 3.

Michiel van Leeuwen (email: mvl+moz@)

Comment 1

•

21 years ago

I deleted training.dat, to get it retrained. I noticed very a vert high false positive rate. Lots of messages wrongly marked as junk. But some correct. Then, i discovered i had no training.dat at all. marking as junk and not junk never created it. recent cvs build (20040515), linux gtk2, seamonkey.

Andreas Kunz

Comment 2

•

21 years ago

Current WinXP trunk build: training.dat is not created in a new profile and I cannot make mailnews create it by any means (marking as junk/not junk, changing junk prefs etc.). Als the tool Filemon (it shows every kind of access to files) shows, Mozilla does not even try to find/create/read/whtever it. Only when new mail is downloaded, Mozilla tries to lookup the training data, but of course fails: Filemon shows that Mozilla is asking for the existence of this file and is answered that the file does not exist. That's all. When an older profile is used with the same build, a training.dat is even created if deleted before, and accessed when downloading new mails, but marking a message (just received and found to be spam by mozilla's junk filter) as "not junk" does not cause any access to training.dat - not when doing so, not on closing mailnews, not when mozilla is closed completely. -> confirming, All/All, major (tends to cause false positives), cc'ing mail people, extending summary and requesting blocking1.8a?. I guess that's all... ;-)

Severity: normal → major

Status: UNCONFIRMED → NEW

Ever confirmed: true

Flags: blocking1.8a?

OS: Windows XP → All

Hardware: PC → All

Summary: Marking mails as "not junk" does not update training.dat → Marking mails as "not junk" does not update training.dat / spam filter does not create that file at all

Scott MacGregor

Assignee

Comment 3

•

21 years ago

I can't reproduce this but I'll keep it on my plate

Assignee: sspitzer → mscott

Target Milestone: --- → mozilla1.8alpha

Andrew Perry

Comment 4

•

21 years ago

Latest builds are having problems creating/adding to training.dat file. I've tried deleting the training.dat file ... It will get re-created, but only the first entry takes ... after that, it will no longer get updated. Build ID 2004051608

timeless

Comment 5

•

21 years ago

would you please attach the filemon log? i actually read such things...

Andreas Kunz

Comment 6

•

21 years ago

Attached file Filemon log — Details

timeless: Ok, I made a manual list of various steps, after each step I pasted the Filemon output, if any. I did four test runs, the first with a new profile, the second with an existing training.dat copied into this profile, then with an old profile. While these were done with a Mozilla cvs build from 2004-5-17, I did a test run with 1.7rc2 and a new profile to compare it. This was all done with a pop mailbox on WinXP. My profile directory was replaced/shortened to "T:\[path]\x.slt\" to improve readability and protect privacy. The result as far as I can see is that current Mozilla builds do only access training.dat to read from it. This happens as soon as knowledge about it is needed and is not repeated as long as mailnews is running, so it seems just to stay in memory. And this is what 1.7rc2 also does. BUT 1.7rc2 also ALWAYS makes write accesses within 1-2 seconds whenever a message is marked as Junk/not Junk - during my tests the current build did never do this. So it seems like current Mozilla builds do never write to training.dat and therefore may either not learn at all or forget everything upon restart. I was filtering Filemon output for training (but NOT limiting logging in any other way, e.g. to my profile folder), so if just a different file is used, just forget this all.

Andreas Kunz

Updated

•

21 years ago

Attachment #148731 - Attachment mime type: application/octet-stream → text/plain

Andreas Kunz

Comment 7

•

21 years ago

In case anyone *cough* wonders about my profile path... This is the regular profile directory (....application data\mozilla...) on my Windows partition. However, I just also tested with trunk build 2004051709 from mozilla.org on a different Windows installation on drive C: There the problems are just the same. One example log entry including my test profile path is: 13:38:26 mozilla.exe:1056 DIRECTORY C:\DOKUMENTE UND EINSTELLUNGEN\ADMINISTRATOR\ANWENDUNGSDATEN\Mozilla\Profiles\akutrain\ta32o2jv.slt\ NO SUCH FILE FileBothDirectoryInformation: training.dat

Scott MacGregor

Assignee

Comment 8

•

21 years ago

I can't reproduce this. Here's what I see: 1) in thunderbird: reset training dat. Or just remove training.dat by hand 2) quit the app 3) Restart again. 4) Mark a message as junk 5) In nsBayesianSpamFilter::WriteTrainingData ( http://lxr.mozilla.org/mozilla/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#1300 ) We call: rv = file->OpenANSIFileDesc("wb", &stream); which creates the file and opens it The next line then writes out the new training data. 6) After quitting I see a newly created training.dat. I wonder if we are passing the wrong arguments to OpenANSIFileDesc for some platforms?

Andreas Kunz

Comment 9

•

21 years ago

mscott: what system are you using? I just set breakpoints at line 1304 and 1309 in this file and they are not hit when manually changing the junk status of a mail. (or better: they were never hit during my tests)

Scott MacGregor

Assignee

Comment 10

•

21 years ago

Ah I did find one problem with batching. It looks like the fix for Bug #193625 which went in last month (4/16) has an impedence mismatch for junk batching. This patch added the following conditional at the start of nsMsgDBView::OnMessageClassified: // this check is necessary because it is theoretically // possible for a message to be flagged as non-junk, // and before the classifier can finish with it, for it // to be again classified as junk, and thus to have // mLastJunkURIInBatch set with its URI - and it should // not be considered the last junk messages during // the first call of this function if (aClassification == nsIJunkMailPlugin::GOOD) return NS_OK; If you have a message marked as junk and you mark it as NOT JUNK. We enter batching mode for the junk plugin. Then when the classification is done we enter nsMsgDBView::OnMessageClassified which hits this new conditional and kicks out right away. So we never end the batch mode for the junk plugin. As a result the junk plugin is left in batch mode for the rest of the session and we never write anything out to the training file again. cc'ing folks who were involved in that bug fix because this seems like a pretty serious problem.

Status: NEW → ASSIGNED

Andreas Kunz

Comment 11

•

21 years ago

mscott: I also set breakpoints to each caller of writeTrainingData. There is always an if clause containing it and preventing it from being called because the !mBatchLevel part is always "false". (mBatchLevel is 1 on first run, later 2, I got it up to 4) (those if clauses were hit in nsBayesianFilter::EndBatch and nsBayesianFilter::observeMessage)

Andreas Kunz

Comment 12

•

21 years ago

Ah, ok, I sent my comment before reading yours, but looks like it fits.

Keywords: regression

Scott MacGregor

Assignee

Comment 13

•

21 years ago

making this batching mismatch even worse is the fact that nsBayesianFilter.cpp has a method called Shutdown which writes out the training set if it is dirty regardless of the batching count. Presumably this should be called when the app is shutting down but I see no callers of: nsBayesianFilter::Shutdown

Keywords: regression

Scott MacGregor

Assignee

Comment 14

•

21 years ago

Attached patch first part of the fix (call shutdown on exit) — Details — Splinter Review

I don't see anyone calling nsIFilterPlugin::Shutdown in mailnews. Step 1, when the plugin is getting destroyed, be sure to flush any changes to the training set. We still need to solve the mismatched batch calls though.

Scott MacGregor

Assignee

Comment 15

•

21 years ago

Comment on attachment 148768 [details] [diff] [review] first part of the fix (call shutdown on exit) Regardless of the mismatchedd batch notification calls in nsMsgDBView, I think we'll want to make sure Shutdown gets called from the dtor of the bayesian filter

Attachment #148768 - Flags: superreview?(bienvenu)

Eyal Rozenberg

Comment 16

•

21 years ago

while I am thinking of a fix for the batch start/end mismatch in my code (the fix for 193625), it occurs to me that this may be a good time to quote my comment from nsMsgDBView saying: ... nsMsgDBView::OnMessageClassified(const char *aMsgURI, nsMsgJunkStatus aClassification) ... // it seems EndBatch must be called here, rather // than after the last message has been dispatched for // classification. One would think when the UI tells the // classifier "I am done with sending you this batch of // messages" the classifer should say "ok, now as soon // as I finish classification I should flush my data file" // ... but that's not the way it works.

David :Bienvenu

Comment 17

•

21 years ago

Comment on attachment 148768 [details] [diff] [review] first part of the fix (call shutdown on exit) assuming that the destructor is called early enough (e.g., before some interesting xpcom components are unloaded, sr=bienvenu)

Attachment #148768 - Flags: superreview?(bienvenu) → superreview+

Scott MacGregor

Assignee

Comment 18

•

21 years ago

fortunately the routine to write out the training set is extremely simple. It opens an ANSI file descriptor then in a single pass writes out the bytes. So it isn't relying on anything that would have gone away (at least on the surface)

Eyal Rozenberg

Comment 19

•

21 years ago

I've posted a fix for the batch start/end at bug 193625. It's pretty straightforward but it still needs r,sr... or you could just use it yourselves without it being checked in.

Asa Dotzler [:asa]

Updated

•

21 years ago

Flags: blocking1.8a? → blocking1.8a-

Andreas Kunz

Comment 20

•

21 years ago

It looks like this first checkin did not help at all. It was checked in at 2004-05-18 16:41 and I was just testing with build 2004051909 - which is supposed to have the fix, no? - on WinXP. Steps: - Created new profile and mail account - sent myself 2 identical mails with typical spam subject from another account - those were marked as spam, I manually marked both of them as not-spam - sent myself one more such mail, it now was automatically classified as not-spam. So far so good, in-memory update of learning data does work. - Then I shut down Mozilla completely, but no training.dat was written! No write access even tried. - For re-checking I started Mozilla/mailnews again and sent myself one more such mail, it was marked as spam agin although it was supposed to have learned that this kind of mail is not spam. (But since training.dat still did not exist, it of course did not "remember".) So spam filters will not learn anything across sessions and this way (as several people reported) you will have a *lot* of false positives. I can't believe this should be in 1.8a1 :-/ Maybe we should re-request blocking1.8a because of these new findings, but of course it would be better, if someone could confirm my findings before.

Andrew Perry

Comment 21

•

21 years ago

I can confirm that this fix is not working in 2004051909. No new updates are made to training.dat.

Scott MacGregor

Assignee

Comment 22

•

21 years ago

the first part of the problem has been checked into the aviary 1.0 branch

Whiteboard: fixed-aviary1.0

Andreas Kunz

Comment 23

•

21 years ago

The summary was not really reflecting reality... changing.

Summary: Marking mails as "not junk" does not update training.dat / spam filter does not create that file at all → training.dat is not updated/created by junk/spam filter, so no learning across sessions

Scott MacGregor

Assignee

Comment 24

•

21 years ago

*** Bug 244402 has been marked as a duplicate of this bug. ***

alanjstr

Comment 25

•

21 years ago

Silly question: what moz branch did bug 193625 land on? If it was trunk and not branch, then why the whiteboard item "fixed-aviary1.0" ?

Chris Thomas (CTho) [formerly cst@andrew.cmu.edu cst@yecc.com]

Comment 26

•

21 years ago

blocking1.8a2 ?

Flags: blocking1.8a2?

Frank Wein [:mcsmurf]

Comment 27

•

21 years ago

*** Bug 245168 has been marked as a duplicate of this bug. ***

Andrew Perry

Comment 28

•

21 years ago

Can I get an "amen" ? This has been busted too long now. This is a MAJOR feature functionality, can someone please make this a priority to fix in the trunk builds? I definitely vote for a BLOCKER on this one.

Eyal Rozenberg

Comment 29

•

21 years ago

My fix for bug 193625, which prevents this bug from occuring (although it does not fix it in the deeper sense), has just gotten sr+, and since it has r+ as well, it can now be checked in, to spare us the annoyance. Now all we need is a person with CVS access and some iniative. BTW (in continuation of my previous comment), instead of the patch above, a 'dirty' flag could be added, which is turned on whenever any training occurs, off whenever training data is written to disk, and is checked periodically, and at shutdown, to decide whether data needs to be written.

Scott MacGregor

Assignee

Comment 30

•

21 years ago

the cause of this bug has just been fixed.

Status: ASSIGNED → RESOLVED

Closed: 21 years ago

Resolution: --- → FIXED

Eyal Rozenberg

Comment 31

•

21 years ago

(In reply to comment #30) > the cause of this bug has just been fixed. No it hasn't. The current _trigger_ for this bug has been fixed. Please revoke the FIXED status.

Eyal Rozenberg

Comment 32

•

21 years ago

Have filed http://bugzilla.mozilla.org/show_bug.cgi?id=245499 regarding this matter.

Asa Dotzler [:asa]

Updated

•

21 years ago

Flags: blocking1.8a2?

Stephen Donner [:stephend] Not actively reading bugmail

Comment 33

•

20 years ago

This has been working now for a while on the trunk Windows builds, verified FIXED using 2004-11-18-05, Windows XP Seamonkey trunk.

Status: RESOLVED → VERIFIED

Myk Melez [:myk] [@mykmelez]

Updated

•

20 years ago

Product: MailNews → Core

Worcester12345

Comment 34

•

20 years ago

Is it possible anything could have regressed when "Aviary branch" started landing on the trunk? My spam detection stinks. Just wondering if this should be reopened.

Nobody; OK to take it and work on it

Updated

•

17 years ago

Product: Core → MailNews Core

Filemon log 21 years ago Andreas Kunz 6.73 KB, text/plain		Details
first part of the fix (call shutdown on exit) 21 years ago Scott MacGregor 828 bytes, patch	Bienvenu : superreview+	Details \| Diff \| Splinter Review