Closed
      
        Bug 243680
      
      
        Opened 21 years ago
          Closed 21 years ago
      
        
    
  
training.dat is not updated/created by junk/spam filter, so no learning across sessions 
    Categories
(MailNews Core :: Filters, defect)
        MailNews Core
          
        
        
      
        
    
        Filters
          
        
        
      
        
    Tracking
(Not tracked)
        VERIFIED
        FIXED
        
    
  
        
            mozilla1.8alpha1
        
    
  
People
(Reporter: bugzilla.spam2, Assigned: mscott)
References
Details
(Whiteboard: fixed-aviary1.0)
Attachments
(2 files)
| 6.73 KB,
          text/plain         | Details | |
| 828 bytes,
          patch         | Bienvenu
:
              
              superreview+ | Details | Diff | Splinter Review | 
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8a) Gecko/20040514
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8a) Gecko/20040514
In order to retrain junk mail controls for the updated junk mail filtering
techniques I deleted the training.dat. Then I tried to train the junk mail
filter by marking my old mails as "not junk". But no training.dat was created.
When I marked one mail as "junk" the training.dat was created. Changing the mail
back to "not junk" the training.dat was not updated.
Same problem with Thunderbird 0.6+.
Reproducible: Always
Steps to Reproduce:
1.
2.
3.
|   | ||
| Comment 1•21 years ago
           | ||
I deleted training.dat, to get it retrained. I noticed very a vert high false
positive rate. Lots of messages wrongly marked as junk. But some correct.
Then, i discovered i had no training.dat at all. marking as junk and not junk
never created it.
recent cvs build (20040515), linux gtk2, seamonkey.
|   | ||
| Comment 2•21 years ago
           | ||
Current WinXP trunk build:
training.dat is not created in a new profile and I cannot make mailnews create
it by any means (marking as junk/not junk, changing junk prefs etc.).
Als the tool Filemon (it shows every kind of access to files) shows, Mozilla
does not even try to find/create/read/whtever it.
Only when new mail is downloaded, Mozilla tries to lookup the training data, but
of course fails: Filemon shows that Mozilla is asking for the existence of this
file and is answered that the file does not exist. That's all.
When an older profile is used with the same build, a training.dat is even
created if deleted before, and accessed when downloading new mails, but marking
a message (just received and found to be spam by mozilla's junk filter) as "not
junk" does not cause any access to training.dat - not when doing so, not on
closing mailnews, not when mozilla is closed completely.
-> confirming, All/All, major (tends to cause false positives), cc'ing mail
people, extending summary and requesting blocking1.8a?. I guess that's all... ;-)
Severity: normal → major
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: blocking1.8a?
OS: Windows XP → All
Hardware: PC → All
Summary: Marking mails as "not junk" does not update training.dat → Marking mails as "not junk" does not update training.dat / spam filter does not create that file at all
|   | Assignee | |
| Comment 3•21 years ago
           | ||
I can't reproduce this but I'll keep it on my plate
Assignee: sspitzer → mscott
Target Milestone: --- → mozilla1.8alpha
|   | ||
| Comment 4•21 years ago
           | ||
Latest builds are having problems creating/adding to training.dat file.   I've
tried deleting the training.dat file ... It will get re-created, but only the
first entry takes ... after that, it will no longer get updated.
Build ID 2004051608
would you please attach the filemon log? i actually read such things...
|   | ||
| Comment 6•21 years ago
           | ||
timeless:
Ok, I made a manual list of various steps, after each step I pasted the Filemon
output, if any.
I did four test runs, the first with a new profile, the second with an existing
training.dat copied into this profile, then with an old profile. While these
were done with a Mozilla cvs build from 2004-5-17, I did a test run with 1.7rc2
and a new profile to compare it.
This was all done with a pop mailbox on WinXP. My profile directory was
replaced/shortened to "T:\[path]\x.slt\" to improve readability and protect
privacy.
The result as far as I can see is that current Mozilla builds do only access
training.dat to read from it.
This happens as soon as knowledge about it is needed and is not repeated as
long as mailnews is running, so it seems just to stay in memory.
And this is what 1.7rc2 also does.
BUT 1.7rc2 also ALWAYS makes write accesses within 1-2 seconds whenever a
message is marked as Junk/not Junk - during my tests the current build did
never do this.
So it seems like current Mozilla builds do never write to training.dat and
therefore may either not learn at all or forget everything upon restart.
I was filtering Filemon output for training (but NOT limiting logging in any
other way, e.g. to my profile folder), so if just a different file is used,
just forget this all.
|   | ||
| Updated•21 years ago
           | 
        Attachment #148731 -
        Attachment mime type: application/octet-stream → text/plain
|   | ||
| Comment 7•21 years ago
           | ||
In case anyone *cough* wonders about my profile path...
This is the regular profile directory (....application data\mozilla...) on my
Windows partition.
However, I just also tested with trunk build 2004051709 from mozilla.org on a
different Windows installation on drive C:
There the problems are just the same. One example log entry including my test
profile path is:
13:38:26	mozilla.exe:1056	DIRECTORY	C:\DOKUMENTE UND
EINSTELLUNGEN\ADMINISTRATOR\ANWENDUNGSDATEN\Mozilla\Profiles\akutrain\ta32o2jv.slt\
NO SUCH FILE	FileBothDirectoryInformation: training.dat	
|   | Assignee | |
| Comment 8•21 years ago
           | ||
I can't reproduce this. Here's what I see:
1) in thunderbird: reset training dat. Or just remove training.dat by hand
2) quit the app
3) Restart again. 
4) Mark a message as junk
5) In nsBayesianSpamFilter::WriteTrainingData (
http://lxr.mozilla.org/mozilla/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#1300
)
 
We call:
rv = file->OpenANSIFileDesc("wb", &stream);
which creates the file and opens it
 
The next line then writes out the new training data.
6) After quitting I see a newly created training.dat.
I wonder if we are passing the wrong arguments to OpenANSIFileDesc for some
platforms?
|   | ||
| Comment 9•21 years ago
           | ||
mscott:
what system are you using?
I just set breakpoints at line 1304 and 1309 in this file and they are not hit
when manually changing the junk status of a mail. (or better: they were never
hit during my tests)
|   | Assignee | |
| Comment 10•21 years ago
           | ||
Ah I did find one problem with batching. It looks like the fix for Bug #193625
which went in last month (4/16) has an impedence mismatch for junk batching.
This patch added the following conditional at the start of
nsMsgDBView::OnMessageClassified:
  // this check is necessary because it is theoretically 
  // possible for a message to be flagged as non-junk,
  // and before the classifier can finish with it, for it
  // to be again classified as junk, and thus to have
  // mLastJunkURIInBatch set with its URI - and it should
  // not be considered the last junk messages during
  // the first call of this function
  if (aClassification == nsIJunkMailPlugin::GOOD)
    return NS_OK;
If you have a message marked as junk and you mark it as NOT JUNK. We enter
batching mode for the junk plugin. Then when the classification is done we enter
nsMsgDBView::OnMessageClassified which hits this new conditional and kicks out
right away. So we never end the batch mode for the junk plugin.
As a result the junk plugin is left in batch mode for the rest of the session
and we never write anything out to the training file again.
cc'ing folks who were involved in that bug fix because this seems like a pretty
serious problem.
Status: NEW → ASSIGNED
|   | ||
| Comment 11•21 years ago
           | ||
mscott:
I also set breakpoints to each caller of writeTrainingData. There is always an
if clause containing it and preventing it from being called because the
!mBatchLevel part is always "false". (mBatchLevel is 1 on first run, later 2, I
got it up to 4)
(those if clauses were hit in nsBayesianFilter::EndBatch and
nsBayesianFilter::observeMessage)
|   | ||
| Comment 12•21 years ago
           | ||
Ah, ok, I sent my comment before reading yours, but looks like it fits.
Keywords: regression
|   | Assignee | |
| Comment 13•21 years ago
           | ||
making this batching mismatch even worse is the fact that nsBayesianFilter.cpp
has a method called Shutdown which writes out the training set if it is dirty
regardless of the batching count.
Presumably this should be called when the app is shutting down but I see no
callers of: nsBayesianFilter::Shutdown
Keywords: regression
|   | Assignee | |
| Comment 14•21 years ago
           | ||
I don't see anyone calling nsIFilterPlugin::Shutdown in mailnews. Step 1, when
the plugin is getting destroyed, be sure to flush any changes to the training
set.
We still need to solve the mismatched batch calls though.
|   | Assignee | |
| Comment 15•21 years ago
           | ||
Comment on attachment 148768 [details] [diff] [review]
first part of the fix (call shutdown on exit)
Regardless of the mismatchedd batch notification calls in nsMsgDBView, I think
we'll want to make sure Shutdown gets called from the dtor of the bayesian
filter
        Attachment #148768 -
        Flags: superreview?(bienvenu)
|   | ||
| Comment 16•21 years ago
           | ||
while I am thinking of a fix for the batch start/end mismatch in my code (the 
fix for 193625), it occurs to me that this may be a good time to quote my 
comment from nsMsgDBView saying:
...
nsMsgDBView::OnMessageClassified(const char *aMsgURI,
                                 nsMsgJunkStatus aClassification)
...
    // it seems EndBatch must be called here, rather
    // than after the last message has been dispatched for
    // classification. One would think when the UI tells the
    // classifier "I am done with sending you this batch of 
    // messages" the classifer should say "ok, now as soon
    // as I finish classification I should flush my data file"
    // ... but that's not the way it works.
|   | ||
| Comment 17•21 years ago
           | ||
Comment on attachment 148768 [details] [diff] [review]
first part of the fix (call shutdown on exit)
assuming that the destructor is called early enough (e.g., before some
interesting xpcom components are unloaded, sr=bienvenu)
        Attachment #148768 -
        Flags: superreview?(bienvenu) → superreview+
|   | Assignee | |
| Comment 18•21 years ago
           | ||
fortunately the routine to write out the training set is extremely simple. It
opens an ANSI file descriptor then in a single pass writes out the bytes. So it
isn't relying on anything that would have gone away (at least on the surface)
|   | ||
| Comment 19•21 years ago
           | ||
I've posted a fix for the batch start/end at bug 193625. It's pretty
straightforward but it still needs r,sr... or you could just use it yourselves
without it being checked in.
| Updated•21 years ago
           | 
Flags: blocking1.8a? → blocking1.8a-
|   | ||
| Comment 20•21 years ago
           | ||
It looks like this first checkin did not help at all.
It was checked in at 2004-05-18 16:41 and I was just testing with build
2004051909 - which is supposed to have the fix, no? - on WinXP.
Steps:
- Created new profile and mail account
- sent myself 2 identical mails with typical spam subject from another account
- those were marked as spam, I manually marked both of them as not-spam
- sent myself one more such mail, it now was automatically classified as not-spam.
So far so good, in-memory update of learning data does work.
- Then I shut down Mozilla completely, but no training.dat was written! No write
access even tried.
- For re-checking I started Mozilla/mailnews again and sent myself one more such
mail, it was marked as spam agin although it was supposed to have learned that
this kind of mail is not spam. (But since training.dat still did not exist, it
of course did not "remember".)
So spam filters will not learn anything across sessions and this way (as several
people reported) you will have a *lot* of false positives.
I can't believe this should be in 1.8a1 :-/
Maybe we should re-request blocking1.8a because of these new findings, but of
course it would be better, if someone could confirm my findings before.
|   | ||
| Comment 21•21 years ago
           | ||
I can confirm that this fix is not working in 2004051909.   No new updates are
made to training.dat.
|   | Assignee | |
| Comment 22•21 years ago
           | ||
the first part of the problem has been checked into the aviary 1.0 branch
Whiteboard: fixed-aviary1.0
|   | ||
| Comment 23•21 years ago
           | ||
The summary was not really reflecting reality... changing.
Summary: Marking mails as "not junk" does not update training.dat / spam filter does not create that file at all → training.dat is not updated/created by junk/spam filter, so no learning across sessions
|   | Assignee | |
| Comment 24•21 years ago
           | ||
*** Bug 244402 has been marked as a duplicate of this bug. ***
|   | ||
| Comment 25•21 years ago
           | ||
Silly question:  what moz branch did bug 193625 land on?  If it was trunk and
not branch, then why the whiteboard item "fixed-aviary1.0" ?
blocking1.8a2 ?
Flags: blocking1.8a2?
| Comment 27•21 years ago
           | ||
*** Bug 245168 has been marked as a duplicate of this bug. ***
|   | ||
| Comment 28•21 years ago
           | ||
Can I get an "amen" ?  This has been busted too long now.  This is a MAJOR
feature functionality, can someone please make this a priority to fix in the
trunk builds?
I definitely vote for a BLOCKER on this one.
|   | ||
| Comment 29•21 years ago
           | ||
My fix for bug 193625, which prevents this bug from occuring (although it does
not fix it in the deeper sense), has just gotten sr+, and since it has r+ as
well, it can now be checked in, to spare us the annoyance. Now all we need is a
person with CVS access and some iniative.
BTW (in continuation of my previous comment), instead of the patch above, a
'dirty' flag could be added, which is turned on whenever any training occurs,
off whenever training data is written to disk, and is checked periodically, and
at shutdown, to decide whether data needs to be written.
|   | Assignee | |
| Comment 30•21 years ago
           | ||
the cause of this bug has just been fixed.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
|   | ||
| Comment 31•21 years ago
           | ||
(In reply to comment #30)
> the cause of this bug has just been fixed.
No it hasn't. The current _trigger_ for this bug has been fixed. Please revoke
the FIXED status.
|   | ||
| Comment 32•21 years ago
           | ||
Have filed http://bugzilla.mozilla.org/show_bug.cgi?id=245499 regarding this matter.
| Updated•21 years ago
           | 
Flags: blocking1.8a2?
This has been working now for a while on the trunk Windows builds, verified
FIXED using 2004-11-18-05, Windows XP Seamonkey trunk.
Status: RESOLVED → VERIFIED
| Updated•20 years ago
           | 
Product: MailNews → Core
| Comment 34•20 years ago
           | ||
Is it possible anything could have regressed when "Aviary branch" started
landing on the trunk? My spam detection stinks. Just wondering if this should be
reopened.
| Updated•17 years ago
           | 
Product: Core → MailNews Core
          You need to log in
          before you can comment on or make changes to this bug.
        
Description
•