Closed Bug 200087 Opened 21 years ago Closed 20 years ago

Junk classification effectiveness peaks and then falls over time (spam)

Categories

(MailNews Core :: Filters, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 181534

People

(Reporter: adam, Assigned: sspitzer)

Attachments

(2 files)

I've been using MailNews' spam classification for about five weeks now.

I've been quite happy with it.  Within a few days of training it was doing a
pretty impressive job and at the 3-week mark it was >~95% flawless.

But in the last week something a bit strange has been happening: more and more
spams have been getting through (while the number of messages falsely identified
as spam hasn't dropped, and has possibly even risen), slowly at first, but
seemingly a greater proportion every day, despite my continuous correction.

Either spammers are suddenly changing the nature of their spams (as far as the
Bayes-like technique perceives) faster than Mozilla can be retrained, or
something is slowly going awry (i.e. some weights have overflowed, wrapped,
whatever).

This is with CVS HEAD, Linux/x86.
I've noticed the spam filter becoming less accurate lately too, using the 1.4
nightlies on MacOS X, so the hardware type and OS should probably be set to All.
I have not tried to determine if this is because of a change in the filtering code
(in which case going back to 1.3 or even 1.3b would result in an immediate
improvement) or if it is overcorrecting. I suspect it is overcorrecting, and
probably just needs a way to tell Mozilla to stop learning what is spam while
keeping the filters active.

It does seem (based on subject lines only) that the messages missed all have
similar subjects. One day it will be "low mortgage rate" messages, another day
it will classify those as junk but miss the "see my webcam" messages, and after
a few days it will be back to the mortgage messages being problematic. 
I've experienced this too, with Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3)
Gecko/20030313 on Red Hat Linux 8. Apparently, after the junk mail gets moved
into the junk folder it sometimes gets unmarked as junk. I think this is causing
the Bayesian filter to get confused and consider the messages as "good" again,
because if I manually mark them as junk again the filter regains its
effectiveness. But left on its own, the junk mail reverts to a non-junk state
and the filter loses its effectiveness.
These messages were missed by the spam filter. I manually marked them as junk
and moved them to a separate folder. Notice that the messages that arrive on
April 2 all have similar subjects (digital cable boxes).

There was one similar message that arrived the same day and was caught by the 
filters, that will be attached in a second file.
This is one message copied and pasted from my "Junk" mbox file, that is very
similar to other messages received the same day. This message was automatically
tagged as junk, the others (in the attachment called "Uncaught.txt") had to be
manually marked as junk.

Notice the odd code at the bottom. Perhaps the spammers are tricking out the
filters? That would be a good reason to have the ability to "lock" the spam
filters in a mode where they do not try to adapt anymore.
I've seen the exact same problem on my system.  Over time it's gotten much
less effective.  I'm using Mozilla 1.3 and Red Hat 9.

Is there a workaround?  Would resetting the learned database help?  If
so, how do I do that?
There is a file called "training.dat" in your profile directory; you could try
deleting that to set it back to zero.

I had some success in locking it (Mac OS X/BSD has a "chflags uchg" command, but
"chmod a-w" would probably work just as well), which keeps the filtering working
in the exact same state as when you locked the file. It will keep missing
whatever it misses at that point and you won't be able to train it to recognize
new junk, but at least it will be consistent and not grow worse.
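For anyone who wants to script the "chmod a-w" workaround described above, here is a minimal Python sketch. The function names are my own, and the path in the usage note is a placeholder; the real location of training.dat depends on your profile directory. I have not checked how Mozilla reacts when it cannot write the file, so treat this as an experiment rather than a fix.

```python
import os
import stat


def lock_read_only(path):
    """Clear every write bit on path (same effect as `chmod a-w`)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, mode & ~write_bits)


def unlock(path):
    """Give the owner write permission back so training can resume."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IWUSR)
```

Usage would be something like lock_read_only("/path/to/your/profile/training.dat") with Mozilla closed, and unlock(...) on the same path when you want to train again.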
Regarding my comment #2 above, I think I've figured out why the Junk messages
are getting unmarked.

Sometimes, and this is true of Inbox, Junk, and all of my custom folders, when I
start Mozilla Mail I can go straight to the folder; but sometimes, it has to say
"Building summary file for..." and then, after a moment, it displays my folder.

In the former case, when it goes straight in, Mozilla remembers my folder
settings, such as which column I'm sorting and in ascending or descending order;
when it has to rebuild the summary file, it forgets that and goes back to the
default of sorting by date in ascending order. See Bug 93271.

In the Junk folder, the same thing applies, and additionally, after it rebuilds
the summary file, the messages are not marked as Junk. As I said above, I think
this is why it "forgets" what it has learned because if I manually mark the
messages as Junk then it gets better.

So, I think at least in my case, Bug 93271 is related to this bug. Is anyone
else here experiencing the same thing?
n.b. I haven't seen any of the strange moving/marking/unmarking behaviour
mentioned by other commenters.  Just the spotty effectiveness mentioned in my
bug report.
mass re-assign.
Assignee: naving → sspitzer
I've been seeing this behavior on Win 98 using Mozilla 1.3.1. I'm not certain,
but I'm fairly sure the Junk controls quit working correctly after I set up a
second profile. My wife's junk filter (the second profile) works fine, but mine
hasn't identified any junk for weeks. I tried deleting training.dat, but that
had no effect.
Using 1.3.1 - the Bayesian(sp?) filter got to about 95-96% effective.  I
downloaded 4 emails containing 2 MB JPEG files and after that the filtering rate
is down to about 80%.  It's staying there even with additional training.

I also see the hourglass continuing for at least 5-10 minutes after getting
email messages.  In my view - it never seems to finish.  My training.dat file is
about 512K.

Suggestions?

Ah, that's a very interesting data-point.  I'm not sure what to make of it,
though -- maybe the strings from the encoded data are being entered into the
good/bad filters and flushing-out/unweighting a lot of the more relevant
weighting data (I don't know how, or whether, older weights decay over time though).
I think the bayesian filter needs to be examined and fixed in its entirety.  It 
behaves VERY badly under a number of situations including what you have 
noticed.  I am using it from a blank state and trying to train it from my 7200 
known non-spam emails and 1200 known spam emails but it refuses to learn 
properly.  With a corpus of this size any decent bayesian filter (including one 
I wrote myself as an exercise) should well exceed 90% accuracy, approaching 
99.9% with more complex tokenization rules.  Mozilla seems to be hovering in 
the 50-60% range though, which means either the filter is broken or the 
training is broken, or both.
This may be what beard has called overfitting, which (iirc) happens if the
filter is overtrained.  Not sure if this is easily fixable or not.  At some
point, we will likely want to give up on bayesian and move to a chi-separating
algorithm, like the spambayes folks did.  There is an open bug on that somewhere.
OS: Linux → All
Hardware: PC → All
I'm way out of my depth here, and have no idea what a "chi-separating" algorithm
is. But I did find one paper talking about the problems with the pseudo-Bayesian
approach and suggesting a fairly simple substitute:

http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html

And he mentions that the "spambayes" project has decided to use it. So I think
that's the one Dan just mentioned.

Just for everyone's info.
Dmose mentioned bug 210215 as a possible cause of this bug (I've been bothered
by declining filter effectiveness too, ever since a week or two after training).

/be
*** Bug 196213 has been marked as a duplicate of this bug. ***
*** Bug 211976 has been marked as a duplicate of this bug. ***
Regarding comment 14, what you're looking for is bug 181534. 

IMHO, fixing bug 181534 would fix this bug. 
FYI: I have been using the spam filter for 7 weeks now, and I don't see a significant
decline. I have a hit rate of (guessed) >90-95% and <= 0.1% false positives.
That's with 100 spam and 100 ham msgs/day.
This problem is most likely caused by the spam filter not training itself on its
own correct classifications.  What you are seeing is probably a change in the
headers, maybe something even as simple as the date (if you train on all spam
from 2002 and all ham from 2003, the token 2002 will have a very high spam
probability).  If the filter learned from its own work then any time it DID
catch a spam with the new feature (date, mail server, sender, whatever) that
feature would become more well known to it.  As it is now you can only wait for
the problem to become evident and then begin combatting it, which is less than
optimal because while you may be marking 10 mails a day with the new feature as
junk there are probably already 500 of them sitting in your junk folder that
haven't been used for training.  I have spent days trying to get the filter to
train itself to no avail, mostly due to my lack of coding experience.  I have no
doubt that a proficient coder could implement it rapidly.  It would solve a
number of problems, including this one most likely.
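To make the date-token example above concrete, here is a rough Python sketch of the per-token probability from Paul Graham's "A Plan for Spam" as I understand it. Whether Mozilla's code matches this exactly is an assumption, and the corpus sizes and counts below are invented for illustration.

```python
def token_spam_probability(good_count, bad_count, n_good, n_bad):
    """Graham-style spam probability for one token.

    good_count / bad_count: occurrences of the token in ham / spam.
    n_good / n_bad: number of ham / spam messages trained on.
    Returns None for tokens seen too rarely to be trusted.
    """
    g = 2 * good_count  # ham occurrences are doubled to bias against false positives
    b = bad_count
    if g + b < 5:
        return None
    good_ratio = min(1.0, g / n_good)
    bad_ratio = min(1.0, b / n_bad)
    p = bad_ratio / (good_ratio + bad_ratio)
    return max(0.01, min(0.99, p))  # clamp so no single token is decisive


# Hypothetical corpus: 7200 ham and 1200 spam messages.
# "2002" was seen only in spam, "2003" only in ham.
p_2002 = token_spam_probability(0, 300, 7200, 1200)
p_2003 = token_spam_probability(500, 0, 7200, 1200)
```

With these made-up counts "2002" comes out at 0.99 and "2003" at 0.01, so a new spam dated 2003 carries a strongly "good" token until enough spam with that token has been trained on.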
I have seen the same issues reported in this bug with Mozilla v1.4 and v1.5.
I have noticed a lot of spammers put 10+ lines of random words at the bottom of
their messages, presumably to add noise to the messages and make them harder to
filter out. The same is true in a smaller way of the message titles and sender
names.
Yeah, they take advantage of the fact that Paul Graham recommended a .4
probability for unknown words (which, by his own admission, was based solely on
trial and error and not anything objective). On our IMail systems, we raised
this probability to .6 and saw a dramatic decrease in the amount of delivered
spams while still noticing no false positives. Perhaps the settings for the Junk
Mail system could be put into preferences? Even if they're hidden preferences,
there would still be at least a way to go in and tweak the settings if they
don't work well for you.
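A toy Python sketch of why the unknown-word value could matter when a message is padded with random words. It uses the combining rule from Graham's essay (the 15 tokens furthest from 0.5, naive Bayes product) as I understand it; the token probabilities and counts are invented, and I am not claiming this is what Mozilla or IMail actually do.

```python
def combined_spam_score(known_probs, n_unknown, unknown_prob, top_n=15):
    """Combine token probabilities Graham-style.

    known_probs: probabilities of tokens already in the training data.
    n_unknown: number of never-seen tokens (e.g. random padding words).
    unknown_prob: probability assigned to each never-seen token.
    """
    probs = list(known_probs) + [unknown_prob] * n_unknown
    # Keep only the most "interesting" tokens: those furthest from neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam = 1.0
    ham = 1.0
    for p in probs[:top_n]:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


# Hypothetical short spam: 3 spammy tokens plus 12 random padding words.
score_at_04 = combined_spam_score([0.9, 0.9, 0.9], 12, 0.4)
score_at_06 = combined_spam_score([0.9, 0.9, 0.9], 12, 0.6)
```

With these numbers the score is roughly 0.85 at .4 (under the 0.9 cutoff Graham suggests) and above 0.99 at .6. The effect only shows up when the message has fewer than 15 strongly weighted tokens, so it is an illustration, not a measurement.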
It's trivial to patch moz to change this value (I just did it, going to give it
a spin :)).  I don't think it's necessarily related to this bug, but it is
potentially one of the things that is making Moz's spam-filtering less than optimal.
*** Bug 202066 has been marked as a duplicate of this bug. ***
This bug is still around with Moz 1.5 and Moz 1.6. At home I set up SpamBayes as
a proxy mail server, and it performs much better than Mozilla's junk filter. My
ISP pre-filters spam based on header/server info, and they cull about 95% of my
spam (with an occasional false positive). I second (or third?) the idea of
making the spam classification parameters configurable to allow power users to
tweak their settings.
Tweaking the broken Paul Graham "1.0" filter seems worse than using spambayes'
chi-squared reformulation with its fixes and improvements.  mscott@mozilla.org
is working with spambayes folks to do just that, so maybe we can just let the
first junk mail filter retire soon.

I'm impatient too -- I want better filtering.  mscott says it's coming soon. 
Maybe he can give us an update.  Cc'ing him.

/be
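For readers wondering what the chi-squared reformulation mentioned above looks like, here is a minimal Python sketch of Fisher-style chi-squared combining along the lines of what SpamBayes describes. It is written from memory, the function names are mine, and it is not the code planned for bug 181534.

```python
import math


def chi2q(x2, v):
    """Probability that a chi-squared variable with v degrees of
    freedom is at least x2. v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combined_score(probs):
    """Combine per-token spam probabilities into one score in [0, 1].

    Each probability must lie strictly between 0 and 1.
    Scores near 0.5 mean "unsure" instead of a forced ham/spam verdict.
    """
    n = len(probs)
    ln_ham = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)   # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)  # evidence for ham
    return (s - h + 1.0) / 2.0
```

One property worth noting: a message with equally strong spam and ham evidence lands near 0.5, whereas the product rule in the original filter tends to push such messages to one extreme.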
duping this against the new algorithm bug.

*** This bug has been marked as a duplicate of 181534 ***
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → DUPLICATE
Product: MailNews → Core
Product: Core → MailNews Core