Closed Bug 284308 Opened 19 years ago Closed 14 years ago

Flawed SPAM filtering won't learn from many messages

Categories

(MailNews Core :: Filters, defect)

x86
Windows 2000
defect
Not set
major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: stan, Unassigned)

References

(Blocks 1 open bug)

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.6) Gecko/20050223 Firefox/1.0.1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.6) Gecko/20050223 Firefox/1.0.1

I get probably 200-300 SPAMs a day, and 150 of those end up in my Inbox.  This is
despite the fact that I trained Thunderbird with over 40000 SPAM messages from the
past 3 years.  With Evolution+SpamAssassin I'd get 2/300 SPAMs in my Inbox.  So I
started trying to figure out what was going on, and then I found the problem:

No matter how many times you manually mark a particular message as SPAM,
Thunderbird will never automatically recognize that exact message as SPAM,
however easy it is to tell it's SPAM, and even after training with the
message dozens of times.

So basically 2/10 of the SPAMs I get Thunderbird will never actually mark as SPAM
on its own.  EVER.  Try as I might, it will never recognize them...  The only
explanation is that it can't learn from these messages, and this is clearly a bad
bug.  I can send 100+ such messages if need be.  But until this behaviour is resolved
it makes Thunderbird practically useless to anyone who gets large volumes of
mail :-/

Reproducible: Always




Frustration comes in 10s with this terrible SPAM filtering.
I'd like to add an example email which is NEVER recognized as spam, no matter
how often I mark it as such.

The sender is always the same (yes, I know, I could add a message filter and
filter it out myself), but the content changes (different parties etc.). But
what remains the same is the "[SPAM?]" in the subject! I find this highly
suspicious!

Is this an easy trick to cheat the spam filter of Thunderbird (currently 1.0.6;
also observed with Mozilla Mail and Thunderbird 1.0.2)?

Adding the mail now (removed embedded JPEG data though and my email, marked as
[DELETED]):

From - Mon Aug 22 20:41:33 2005
X-Account-Key: account1
X-UIDL: 9586
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
Return-Path: <info@djdeep.ch>
Received: from mx28.bluewin.ch (195.186.19.40) by mssazhh-int.msg.bluewin.ch
(Bluewin 7.0.045)
        id 42F5CA6300BE0037 for [DELETED]; Mon, 22 Aug 2005 15:10:11 +0000
Received: from smtp.messaging.ch (81.221.252.201) by mx28.bluewin.ch (Bluewin
7.2.063)
        id 42EA3AD401026FA7 for [DELETED]; Mon, 22 Aug 2005 15:10:11 +0000
Received: from neo-kdbug5b04ws ([80.219.188.154]) by smtp.messaging.ch with
Microsoft SMTPSVC(6.0.3790.211);
	 Mon, 22 Aug 2005 17:10:06 +0200
Message-ID: <4147-220058122151239906@neo-kdbug5b04ws>
Errors-to: unscribe@djdeep.ch
From: <info@djdeep.ch>
To: [DELETED]
Subject: [SPAM?] PLAY IT LOUD @ LE BAL
Date: Mon, 22 Aug 2005 17:12:39 +0200
MIME-Version: 1.0
Content-Type: multipart/related; 
	boundary="----=_NextPart_94915C5ABAF209EF376268C8"
X-OriginalArrivalTime: 22 Aug 2005 15:10:07.0075 (UTC) FILETIME=[94BCA330:01C5A72B]

This is a multi-part message in MIME format.

------=_NextPart_94915C5ABAF209EF376268C8
Content-Type: multipart/alternative; 
	boundary="----=_NextPart_84815C5ABAF209EF376268C8"

------=_NextPart_84815C5ABAF209EF376268C8
Content-type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable


LATIN DANCE @ LE BAL

Mit Stargast Leo Garcia

Dienstag 23=2E August' 05 von 21=2E00 - 02=2E00 Uhr

SALSA, MERENGUE, BACHATA,  MIT MAMBO KING LEO GARCIA

LEO GARCIA WIRD FUER EUCH ZUSAMMEN MIT DEN DJ'S PEPPE UND DEEP DIE SCHOENS=
TE TANZMUSIK AUF'S BARKETT ZAUBERN=2E MIT EINER KLEINEN TANZSHOW, WIRD DER=
 GANZE ABEND VOLLER FEUER UND PASSION ABGERUNDET=2E



=20



MIT EINEM MAIL AN INFO@DJDEEP=2ECH K=D6NNT IHR EUCH AUF DIE G=C4STELISTE S=
ETZTEN=2E

=20

NEWSLETTER ABMELDEN

------=_NextPart_84815C5ABAF209EF376268C8
Content-Type: text/html; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

<html>

<head>
<meta http-equiv=3D"Content-Language" content=3D"de-ch">
<meta name=3D"GENERATOR" content=3D"Microsoft FrontPage 5=2E0">
<meta name=3D"ProgId" content=3D"FrontPage=2EEditor=2EDocument">
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dwindows-=
1252">
<title>LATIN DANCE</title>
</head>

<body text=3D"#FFFFFF" bgcolor=3D"#000000">

<p align=3D"center"><b><font size=3D"7" color=3D"#FFFF00">LATIN DANCE @ LE=
 BAL</font></b></p>

<p align=3D"center"><font size=3D"7" color=3D"#FFFF00"><b>Mit Stargast Leo=
 Garcia</b></font></p>
<p align=3D"center"><font size=3D"6">Dienstag 23=2E August' 05 von 21=2E00=
 - 02=2E00 Uhr</font></p>
<p align=3D"center"><font size=3D"4">SALSA, MERENGUE, BACHATA,  MIT MAMBO =
KING LEO=20
GARCIA</font></p>
<p align=3D"center"><font size=3D"4">LEO GARCIA WIRD FUER EUCH ZUSAMMEN MI=
T DEN DJ'S=20
PEPPE UND DEEP DIE SCHOENSTE TANZMUSIK AUF'S BARKETT ZAUBERN=2E MIT EINER =
KLEINEN=20
TANZSHOW, WIRD DER GANZE ABEND VOLLER FEUER UND PASSION ABGERUNDET=2E</fon=
t></p>

<center>
<p>
<img border=3D"0" src=3D"cid:290170-22005812215933828127@neo-kdbug5b04ws" =
width=3D"267" height=3D"369"></p>
</center>
<p align=3D"center">&nbsp;</p>
<p align=3D"center">
<img border=3D"0" src=3D"cid:263501-22005812215933828128@neo-kdbug5b04ws" =
width=3D"730" height=3D"527"></p>
<p align=3D"center">MIT EINEM MAIL AN
<a href=3D"mailto:INFO@DJDEEP=2ECH?subject=3DG=C4STELISTE Latin Dance">INF=
O@DJDEEP=2ECH</a> K=D6NNT IHR=20
EUCH AUF DIE G=C4STELISTE SETZTEN=2E</p>
<p align=3D"center">&nbsp;</p>

<p align=3D"center"><font size=3D"2">
<a href=3D"mailto:unscribe@djdeep=2Ech?subject=3Daustragen">NEWSLETTER ABM=
ELDEN</a></font></p>

</body>

</html>
------=_NextPart_84815C5ABAF209EF376268C8--

------=_NextPart_94915C5ABAF209EF376268C8
Content-Type: image/jpeg; name="leo_garcia.jpg"
Content-Transfer-Encoding: base64
Content-Description: leo_garcia.jpg
Content-Id: <290170-22005812215933828127@neo-kdbug5b04ws>

[base64 JPEG data removed]

This post is right. Thunderbird has a long, slow learning curve and is not as
effective as I had hoped. I'm a new user.

I had previously "programmed" my old OE to filter better than Thunderbird is doing.

I'm using WinXP Pro.


 
See also Bug 321712 and the MozillaZine forum threads "Thunderbird SPAM solution doesn't work" <http://forums.mozillazine.org/viewtopic.php?t=345221> and "TB 1.0 Junk Mail Filter is dumb as a post" <http://forums.mozillazine.org/viewtopic.php?t=188486>.
*** Bug 321712 has been marked as a duplicate of this bug. ***
I have found that reducing the size of my token database seems to IMPROVE performance temporarily, until training.dat bloats again. I removed all tokens where both values were below 3 using http://bayesjunktool.mozdev.org/index.html. After performing this edit, I saved training.dat and retested.

I'm also using PopFile, and when I open the [SPAM] folder where PopFile spam gets filtered to in Thunderbird, the messages Thunderbird thinks are junk disappear as they are moved to the Junk Folder. Messages which remain in the SPAM folder are messages which are correctly identified as SPAM by PopFile, but incorrectly identified as NOT JUNK by Thunderbird.

After performing the edit to reduce the number of questionable tokens, Thunderbird doesn't miss as many of the messages in the SPAM folder I created for PopFile.
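
The pruning rule described in this comment (drop every token whose two counts are both below 3) can be sketched roughly as follows. This is only an illustration: it assumes the token database has already been loaded into a plain dictionary mapping each token to a (good count, junk count) pair. The function name `prune_tokens` and this in-memory layout are hypothetical, and the sketch does not read or write the actual training.dat format.

```python
def prune_tokens(tokens, threshold=3):
    """Return a copy of `tokens` without the entries whose good and junk
    counts are BOTH below `threshold` (hypothetical in-memory layout)."""
    return {
        token: (good, junk)
        for token, (good, junk) in tokens.items()
        if good >= threshold or junk >= threshold
    }


# Made-up sample counts, for illustration only.
sample = {
    "viagra": (0, 57),   # strong junk signal: kept
    "meeting": (41, 1),  # strong good signal: kept
    "xqzt": (1, 2),      # both counts below 3: dropped
    "hello": (2, 0),     # both counts below 3: dropped
}

pruned = prune_tokens(sample)
print(sorted(pruned))
```

Tokens seen only once or twice carry little evidence either way, which may be why removing them appeared to help in this case.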
I've been able to reproduce this bug on version 1.5 (20051201) after resetting the training database. The statistics below are for the most recent 7 days of spam. I have set up IMAP and POP3 Thunderbird accounts which connect to the same email account on the server. The same messages are downloaded by each Thunderbird account and are not deleted from the server. Each time a false negative is reported, I manually mark it as "Junk". I manually mark all good messages I receive as "Not Junk".

           Email Account A     Email Account B
TB IMAP     182/272 : 67%        78/99 : 79%
TB POP3     121/242 : 50%        54/93 : 58%

N.B. Differences in the total number of spam messages between IMAP and POP3 appear to be due to the different methods of determining message age. Email accounts A and B are on different servers.
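
For reference, the percentages in the table above can be recomputed from the raw counts with a short script. This is a minimal sketch, assuming each pair is "spam flagged automatically / total spam received"; the function name `catch_rate` and the dictionary layout are illustrative only.

```python
def catch_rate(caught, total):
    """Share of spam flagged automatically, as a whole-number percentage."""
    return round(100 * caught / total)


# Counts copied from the table above: (caught, total) per protocol and account.
results = {
    ("IMAP", "A"): (182, 272),
    ("IMAP", "B"): (78, 99),
    ("POP3", "A"): (121, 242),
    ("POP3", "B"): (54, 93),
}

for (protocol, account), (caught, total) in results.items():
    rate = catch_rate(caught, total)
    print(f"TB {protocol}, account {account}: {caught}/{total} = {rate}%")
```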

 First, this performance is much less than it ought to be. Second, why is there such a consistent difference between the performance of the filter on the same messages when downloaded from the server by IMAP vs. POP3?

 Perhaps this bug could be marked confirmed by someone with the appropriate privs (assuming they agree that it should be confirmed)?
 I reset the database last Monday (the 13th of March) to determine Thunderbird's spam filter performance over a single week. The results show the same trend as I reported earlier:

           Email Account A     Email Account B
TB IMAP     167/246 : 68%       118/137 : 86%
TB POP3      87/217 : 40%        78/119 : 65%


Scott, could you please confirm this bug and take a look at the filter code to see if there is anything obvious which would cause poor spam filter performance?
Thunderbird version 1.5.0.2 (20060308); Mac OS 10.3.9

 I've just noticed that when the automatic run of the junk filter doesn't catch all spam messages, manually running the junk filter directly after (Tools -> Run Junk Mail Controls on Folder) will recognise some - but only some - of the spam messages as junk and classify them as such. The difference in classification between IMAP and POP3 accounts I noted above is still present.
I note that Bug 280716 reports the problem referred to in the second paragraph of Comment #0 - certain identical spam messages are never identified as junk, no matter how many times you receive and mark those messages as junk.
we can't index the text in the jpeg images, of course...
Status: UNCONFIRMED → NEW
Ever confirmed: true
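
To illustrate the remark about JPEG images, here is a hedged sketch of how little text a mostly-image message offers a token-based filter. It uses Python's standard `email` package, not Thunderbird's own tokenizer, and the message in `RAW` is a shortened stand-in for the sample mail earlier in this bug; the function name `indexable_text` is hypothetical.

```python
import email

# Shortened stand-in message: one small text part plus one JPEG attachment.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/related; boundary="outer"

--outer
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

LATIN DANCE @ LE BAL=2E

--outer
Content-Type: image/jpeg; name="flyer.jpg"
Content-Transfer-Encoding: base64

/9j/4AAQSkZJRgABAQEASABIAAD/2wBD
--outer--
"""


def indexable_text(raw_message):
    """Collect the decoded text/* parts of a message; other parts are skipped."""
    msg = email.message_from_string(raw_message)
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True)  # undoes quoted-printable
            charset = part.get_content_charset() or "ascii"
            chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks)


text = indexable_text(RAW)
print(text.split())
```

Whatever wording appears only inside the image never reaches the list of words, so under this assumption the filter has only the short text part and the headers to learn from.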
 As I described above, most of my TB installs have IMAP and POP3 accounts pointing at the same server email account. I do have one install which is only IMAP accounts and one install which is only POP3 accounts. On both of those, the spam filter accuracy is >95%. Some of the IMAP-only environment accounts point to the same server accounts which exhibit poor and inconsistent (the spam filter recognises a message as spam in IMAP, but not in POP3 or vice versa) spam filter performance in the mixed environment.

 As only on my mixed IMAP and POP3 environments do I see poor filter performance, I hypothesise that there is a bug which manifests itself in poor spam filter performance when a mix of POP3 and IMAP accounts are defined.
QA Contact: general
 Once again I've observed a spam message being recognised as spam by only 1 out of 3 of the accounts at which it was received (recognised in original account, accessed by IMAP; not recognised in second account to which it was forwarded, accessed by both IMAP and POP3).

This is clearly not the intended behaviour of the spam filter. Some time ago, a call was placed on the roadmap page <http://www.mozilla.org/projects/thunderbird/roadmap.html> for volunteers with maths skills to look into improving the spam filter. Was this ever answered?
My personal experience is that I have to periodically reset training data to make Thunderbird SPAM filter work again in a decent way for a while...
But it's quite frustrating to know that my long training has to be periodically trashed away because Thunderbird gets "confused"...

Previously, I used POPFile with Eudora and it worked great... The only reasons I stopped using it are that its performance tends to become quite poor (it's written in Perl...) and that I thought Thunderbird's built-in filter should have been enough...

Mauro.
Assignee: mscott → nobody
Blocks: junktracker
Is this still an issue with version 3.0?  If it is, then we need to get down to specifics - with testcases
Component: General → Filters
Product: Thunderbird → MailNews Core
QA Contact: general → filters
Whiteboard: [closeme 2010-05-10]
Version: unspecified → Trunk
In my own experience, after upgrading to TB 3.0 the filter seems to be learning again; however, I would need more time to see how it is going. Since this is not easily reproducible, it's hard to say whether it's working or not now...

On the other hand, I was wondering if any work on this has been done here for TB 3.0: if no work has been done in this area, it's hard to believe that the problem has gone away on its own. Maybe it can appear again in the future.
While there was some refactoring of the bayes filter in 3.0, and adding of some additional options, there was no real change to the underlying algorithms. There was also a new feature added that controls the size of the token database. It is possible that your issue was caused by some older tokens that have gradually been removed by that pruning algorithm in TB 3.
So given comment 15 and 16, and dead reporter address, Douglas no longer uses thunderbird, and [1] comment 1/Oliver is WFM ... I propose closing because we don't have a testcase for comment 0 (except perhaps Oliver, who is WFM) and there are plenty of other junk bugs. 

jwq, you seem to have followed this closely...sound good?
Whiteboard: [closeme 2010-05-10] → closeme 2010-05-25
FWIW, the spam filter in TB 2 performs better than that in 1.5. It's not perfect, but it is better. It's certainly handling Japanese-language spam a *lot* better. I'm not using TB 3, probably won't be in a position to upgrade for some time, and thus can't comment on the latest performance.

I agree that this bug is probably best closed.
thanks jwq. using WFM from comment 1
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
Whiteboard: closeme 2010-05-25