Closed
Bug 200087
Opened 21 years ago
Closed 20 years ago
Junk classification effectiveness peaks and then falls over time (spam)
Categories
(MailNews Core :: Filters, defect)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 181534
People
(Reporter: adam, Assigned: sspitzer)
Attachments
(2 files)
I've been using MailNews' spam classification for about five weeks now, and I've been quite happy with it. Within a few days of training it was doing a pretty impressive job, and at the 3-week mark it was classifying roughly 95% or more of my mail correctly. But in the last week something a bit strange has been happening: more and more spams have been getting through (while the number of messages falsely identified as spam hasn't dropped, and has possibly even risen), slowly at first, but seemingly in a greater proportion every day, despite my continuous correction. Either spammers are suddenly changing the nature of their spams (as far as the Bayes-like technique perceives) faster than Mozilla can be retrained, or something is slowly going awry (e.g. some weights have overflowed, wrapped, whatever). This is with CVS HEAD, Linux/x86.
Comment 1•21 years ago
I've noticed the spam filter becoming less accurate lately too, using the 1.4 nightlies on Mac OS X, so the hardware type and OS should probably be set to All. I have not tried to determine whether this is because of a change in the filtering code (in which case going back to 1.3 or even 1.3b would result in an immediate improvement) or because it is overcorrecting. I suspect it is overcorrecting, and probably just needs a way to tell Mozilla to stop learning what is spam while keeping the filters active. It does seem (based on subject lines only) that the messages missed all have similar subjects. One day it will be "low mortgage rate" messages, another day it will classify those as junk but miss the "see my webcam" messages, and after a few days it will be back to the mortgage messages being problematic.
Comment 2•21 years ago
I've experienced this too, with Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030313 on Red Hat Linux 8. Apparently, after the junk mail gets moved into the junk folder it sometimes gets unmarked as junk. I think this is causing the Bayesian filter to get confused and consider the messages as "good" again, because if I manually mark them as junk again the filter regains its effectiveness. But left on its own, the junk mail reverts to a non-junk state and the filter loses its effectiveness.
Comment 3•21 years ago
These messages were missed by the spam filter. I manually marked them as junk and moved them to a separate folder. Notice that the messages that arrived on April 2 all have similar subjects (digital cable boxes). There was one similar message that arrived the same day and was caught by the filters; I will attach it in a second file.
Comment 4•21 years ago
This is one message copied and pasted from my "Junk" mbox file that is very similar to other messages received the same day. This message was automatically tagged as junk; the others (in the attachment called "Uncaught.txt") had to be manually marked as junk. Notice the odd code at the bottom. Perhaps the spammers are tricking the filters? That would be a good reason to have the ability to "lock" the spam filters in a mode where they do not try to adapt anymore.
Comment 5•21 years ago
I've seen the exact same problem on my system. Over time it's gotten much less effective. I'm using Mozilla 1.3 and Red Hat 9. Is there a workaround? Would resetting the learned database help? If so, how do I do that?
Comment 6•21 years ago
There is a file called "training.dat" in your profile directory; you could try deleting that to set it back to zero. I had some success in locking it (Mac OS X/BSD has a "chflags uchg" command, but "chmod a-w" would probably work just as well), which keeps the filtering working in the exact same state as when you locked the file. It will keep missing whatever it misses at that point and you won't be able to train it to recognize new junk, but at least it will be consistent and not grow worse.
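For anyone who wants to script the locking workaround described above, here is a minimal sketch in Python. It only mirrors the effect of "chmod a-w" and "chmod u+w" on a file; the function names are made up for illustration, the location of training.dat depends on your profile, and whether Mozilla copes gracefully with a read-only training file is only what the comment above reports, not something verified here.

```python
import os
import stat


def lock_training_file(path):
    """Remove every write permission bit, like `chmod a-w path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    read_only = mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
    os.chmod(path, read_only)
    return read_only


def unlock_training_file(path):
    """Give the owner write permission back, like `chmod u+w path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    writable = mode | stat.S_IWUSR
    os.chmod(path, writable)
    return writable
```

Call lock_training_file() with the full path to the training.dat in your own profile while Mozilla is closed, and unlock_training_file() when you want training to resume. Keeping a backup copy of the file first is probably wise.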
Comment 7•21 years ago
Regarding my comment #2 above, I think I've figured out why the Junk messages are getting unmarked. Sometimes, and this is true of Inbox, Junk, and all of my custom folders, when I start Mozilla Mail I can go straight to the folder; but sometimes it has to say "Building summary file for..." and then, after a moment, it displays my folder. In the former case, when it goes straight in, Mozilla remembers my folder settings, such as which column I'm sorting by and whether the order is ascending or descending; when it has to rebuild the summary file, it forgets that and goes back to the default of sorting by date in ascending order. See Bug 93271. In the Junk folder, the same thing applies, and additionally, after it rebuilds the summary file, the messages are not marked as Junk. As I said above, I think this is why it "forgets" what it has learned, because if I manually mark the messages as Junk then it gets better. So, I think at least in my case, Bug 93271 is related to this bug. Is anyone else here experiencing the same thing?
Reporter
Comment 8•21 years ago
N.B. I haven't seen any of the strange moving/marking/unmarking behaviour mentioned by other commenters. Just the spotty effectiveness mentioned in my bug report.
Comment 10•21 years ago
I've been seeing this behavior on Win 98 using Mozilla 1.3.1. I'm not certain, but I'm fairly sure the Junk controls quit working correctly after I set up a second profile. My wife's junk filter (the second profile) works fine, but mine hasn't identified any junk for weeks. I tried deleting training.dat, but that had no effect.
Comment 11•21 years ago
Using 1.3.1, the Bayesian filter got to about 95-96% effective. I then downloaded 4 emails carrying 2M files of JPGs, and after that the filtering rate dropped to about 80%. It's staying there even with additional training. I also see the hourglass continuing for at least 5-10 minutes after getting email messages; in my view, it never seems to finish. My training.dat file is about 512K. Suggestions?
Reporter
Comment 12•21 years ago
Ah, that's a very interesting data-point. I'm not sure what to make of it, though -- maybe the strings from the encoded data are being entered into the good/bad filters and flushing-out/unweighting a lot of the more relevant weighting data (I don't know how, or whether, older weights decay over time though).
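To make the hypothesis above concrete, here is a rough Python sketch of how encoded attachment data could flood a token table. It is not Mozilla's tokenizer: the regular expression, the function names and the sample message are all assumptions for illustration. It compares tokenizing the raw message (base64 lines included) against tokenizing only the decoded text parts, using the standard `email` package.

```python
import email
import re
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical token pattern, not the one Mozilla uses.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def naive_tokens(raw_message):
    """Tokenize the raw message, base64 attachment lines included."""
    return set(TOKEN_RE.findall(raw_message.lower()))


def text_only_tokens(raw_message):
    """Tokenize only the decoded text/* parts (headers are ignored here)."""
    tokens = set()
    for part in email.message_from_string(raw_message).walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers, images and other binaries
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        tokens.update(TOKEN_RE.findall(text.lower()))
    return tokens


def sample_message():
    """Build a made-up message: one short text part plus a fake JPEG."""
    msg = MIMEMultipart()
    msg["Subject"] = "holiday photos"
    msg.attach(MIMEText("here are the photos from the trip"))
    msg.attach(MIMEImage(bytes(range(256)) * 40, _subtype="jpeg"))
    return msg.as_string()


raw = sample_message()
print(len(naive_tokens(raw)), "tokens from the raw message")
print(len(text_only_tokens(raw)), "tokens from the text parts only")
```

If the real tokenizer behaves like naive_tokens(), a handful of large attachments in "good" mail would add many meaningless tokens on the non-junk side, which is consistent with (but does not prove) the drop reported in comment 11.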
Comment 13•21 years ago
I think the Bayesian filter needs to be examined and fixed in its entirety. It behaves VERY badly in a number of situations, including what you have noticed. I am using it from a blank state and trying to train it from my 7200 known non-spam emails and 1200 known spam emails, but it refuses to learn properly. With a corpus of this size any decent Bayesian filter (including one I wrote myself as an exercise) should well exceed 90% accuracy, approaching 99.9% with more complex tokenization rules. Mozilla seems to be hovering in the 50-60% range though, which means either the filter is broken or the training is broken, or both.
Comment 14•21 years ago
This may be what beard has called overfitting, which (iirc) happens if the filter is overtrained. Not sure if this is easily fixable or not. At some point, we will likely want to give up on Bayesian and move to a chi-separating algorithm, like the spambayes folks did. There is an open bug on that somewhere.
OS: Linux → All
Hardware: PC → All
Comment 15•21 years ago
I'm way out of my depth here, and have no idea what a "chi-separating" algorithm is. But I did find one paper talking about the problems with the pseudo-Bayesian approach and suggesting a fairly simple substitute: http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html The author mentions that the "spambayes" project has decided to use it, so I think that's the one Dan just mentioned. Just for everyone's info.
Comment 16•21 years ago
Dmose mentioned bug 210215 as a possible cause of this bug (I've been bothered by declining filter effectiveness too, ever since a week or two after training). /be
Comment 17•21 years ago
*** Bug 196213 has been marked as a duplicate of this bug. ***
Comment 18•21 years ago
*** Bug 211976 has been marked as a duplicate of this bug. ***
Comment 19•21 years ago
Regarding comment 14, what you're looking for is bug 181534. IMHO, fixing bug 181534 would fix this bug.
Comment 20•21 years ago
FYI: I have been using the spam filter for 7 weeks now, and I don't see a significant decline. I have a hit rate of (guessed) >90-95% and <= 0.1% false positives. That's with 100 spam and 100 ham msgs/day.
Comment 21•21 years ago
This problem is most likely caused by the spam filter not training itself on its own correct classifications. What you are seeing is probably a change in the headers, maybe something even as simple as the date (if you train on all spam from 2002 and all ham from 2003, the token 2002 will have a very high spam probability). If the filter learned from its own work, then any time it DID catch a spam with the new feature (date, mail server, sender, whatever) that feature would become better known to it. As it is now, you can only wait for the problem to become evident and then begin combatting it, which is less than optimal: while you may be marking 10 mails a day with the new feature as junk, there are probably already 500 of them sitting in your junk folder that haven't been used for training. I have spent days trying to get the filter to train itself, to no avail, mostly due to my lack of coding experience. I have no doubt that a proficient coder could implement it rapidly. It would solve a number of problems, most likely including this one.
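The date example above can be reproduced with a small Python sketch. This is a simplified version of the per-token probability from Paul Graham's "A Plan for Spam", not Mozilla's actual code: it counts messages rather than occurrences, omits Graham's minimum-occurrence threshold, and the training messages are invented.

```python
from collections import Counter


def train(messages):
    """Count, per token, how many messages contain it."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts


def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Graham-style per-token spam probability, clamped to [0.01, 0.99]."""
    bad = min(1.0, spam_counts[token] / n_spam)
    good = min(1.0, 2 * ham_counts[token] / n_ham)  # ham counts weighted double
    if bad + good == 0:
        return 0.4  # never-seen token: Graham's default for unknown words
    return max(0.01, min(0.99, bad / (bad + good)))


# Invented corpus: all spam is dated 2002, all ham is dated 2003.
spam = ["date 2002 low mortgage rate", "date 2002 see my webcam"]
ham = ["date 2003 meeting notes", "date 2003 lunch on friday"]
spam_counts, ham_counts = train(spam), train(ham)

for token in ("2002", "2003", "date"):
    p = token_spam_probability(token, spam_counts, ham_counts, len(spam), len(ham))
    print(token, p)
```

With this corpus the token "2002" ends up at the spam ceiling and "2003" at the ham floor, so a 2003-dated spam starts out with a strong vote for "not junk" until enough new spam has been trained.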
Comment 22•21 years ago
I have seen the same issues reported in this bug with Mozilla v1.4 and v1.5.
Comment 23•21 years ago
I have noticed a lot of spammers put 10+ lines of random words at the bottom of their messages, presumably to add noise to the messages and make them harder to filter out. The same is true in a smaller way of the message titles and sender names.
Comment 24•21 years ago
Yeah, they take advantage of the fact that Paul Graham recommended a .4 probability for unknown words (which, by his own admission, was based solely on trial and error and not anything objective). On our IMail systems, we raised this probability to .6 and saw a dramatic decrease in the amount of delivered spams while still noticing no false positives. Perhaps the settings for the Junk Mail system could be put into preferences? Even if they're hidden preferences, there would still be at least a way to go in and tweak the settings if they don't work well for you.
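A sketch of why the unknown-word probability matters, assuming the combining rule from Paul Graham's "A Plan for Spam" (take the 15 tokens furthest from 0.5 and combine them naively). The function name, the token probabilities and the padding words are made up, and this is not the Mozilla implementation.

```python
from math import prod


def graham_score(tokens, known_probs, unknown_prob=0.4, top_n=15):
    """Combine the `top_n` most extreme token probabilities, Graham-style."""
    probs = [known_probs.get(t, unknown_prob) for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:top_n]
    spammy = prod(top)
    hammy = prod(1.0 - p for p in top)
    return spammy / (spammy + hammy)


# Three mildly spammy known words plus twelve padding words never seen before.
known = {"low": 0.7, "mortgage": 0.7, "rate": 0.7}
message = ["low", "mortgage", "rate"] + ["padding%d" % i for i in range(12)]

print(graham_score(message, known, unknown_prob=0.4))
print(graham_score(message, known, unknown_prob=0.6))
```

In this toy case the padding words alone decide the outcome: at 0.4 they outvote the three known spam words, and at 0.6 they push the score the other way. Real messages mix known and unknown tokens, so the effect will be less clean than this.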
Reporter
Comment 25•21 years ago
It's trivial to patch moz to change this value (I just did it, going to give it a spin :)). I don't think it's necessarily related to this bug, but it is potentially one of the things that is making Moz's spam-filtering less than optimal.
Comment 26•21 years ago
*** Bug 202066 has been marked as a duplicate of this bug. ***
Comment 27•20 years ago
This bug is still around with Moz 1.5 and Moz 1.6. At home I set up SpamBayes as a proxy mail server, and it performs much better than Mozilla's junk filter. My ISP pre-filters spam based on header/server info, and they cull about 95% of my spam (with an occasional false positive). I second (or third?) the idea of making the spam classification parameters configurable to allow power users to tweak their settings.
Comment 28•20 years ago
Tweaking the broken Paul Graham "1.0" filter seems worse than using spambayes' chi-squared reformulation with its fixes and improvements. mscott@mozilla.org is working with spambayes folks to do just that, so maybe we can just let the first junk mail filter retire soon. I'm impatient too -- I want better filtering. mscott says it's coming soon. Maybe he can give us an update. Cc'ing him. /be
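For readers unfamiliar with the chi-squared approach mentioned above, the following Python sketch shows the general idea of Fisher-style chi-squared combining as the spambayes project describes it. It is an illustration written for this thread, with assumed function names and an assumed clamping range, and is not the code from bug 181534 or from spambayes itself.

```python
import math


def chi2q(x2, dof):
    """P(chi-squared with `dof` degrees of freedom >= x2); dof must be even."""
    assert dof % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi2_score(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    # Clamp to avoid log(0); the bounds are an assumption of this sketch.
    clamped = [min(0.99, max(0.01, p)) for p in probs]
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in clamped), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in clamped), 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0


print(chi2_score([0.99] * 10))             # strong spam evidence only
print(chi2_score([0.01] * 10))             # strong ham evidence only
print(chi2_score([0.99] * 5 + [0.01] * 5))  # conflicting evidence
```

One property worth noting is that evenly conflicting evidence lands in the middle of the range rather than being forced to one extreme, which leaves room for an "unsure" band between junk and not junk.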
Comment 29•20 years ago
Duping this against the new algorithm bug. *** This bug has been marked as a duplicate of 181534 ***
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → DUPLICATE
Updated•20 years ago
Product: MailNews → Core
Updated•16 years ago
Product: Core → MailNews Core