junk training isn't helping, keeps misclassifying emails

RESOLVED INCOMPLETE

Status

MailNews Core
Filters
RESOLVED INCOMPLETE
6 years ago
5 years ago

People

(Reporter: a_geek, Unassigned)

Tracking

({testcase-wanted})

x86
Linux
testcase-wanted

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

6 years ago
I receive ham in several languages including Chinese and Japanese. I receive spam in more languages than I can read. However much I click on messages to manually classify them as spam or ham, the next time I start TB, my messages get misclassified. Either far too many spams slip through, or quite a few hams get flagged as spam. The filter sometimes even decides to flag messages sent by me as spam, too! This way, the filter is almost useless. It would be nice if the filter could get a serious overhaul.
Any idea what you would change?

This bug report is very difficult to take action on as it's very general.
Component: General → Filters
Product: Thunderbird → MailNews Core
QA Contact: general → filters

Comment 2

6 years ago
The spam filter works very well (at least in English) IF you regularly train both uncertain ham and spam. I have about 95% correct categorization of spam messages, with maybe one or two misclassifications of good emails as spam per year. So this works extremely well.

But nobody has figured out an unintrusive user interface that permits proper, regular training. My JunQuilla extension allows this to some extent. The main feature of it that I use is simply showing the junk percentage of each email in my inbox. I train spam, which is usually obvious because spam in my inbox usually has a junk % of 60% - 70%. I also train any ham that has a percentage over 10%. But this requires diligence on the part of the user. It is this required diligence that has always been the rub, such that I could not get TB drivers convinced of the value even of showing junk percentage the last time I tried.
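
The triage routine described in this comment can be written down as a small rule. This is only an illustrative sketch, not JunQuilla code: the function name `training_action` and its parameters are hypothetical, and the 10% cutoff is simply the figure quoted above.

```python
def training_action(junk_percent: float, is_spam: bool, ham_cutoff: float = 10.0) -> str:
    """Decide what to do with a message that landed in the inbox.

    junk_percent: the junk score shown for the message (0-100).
    is_spam:      the user's own judgement after looking at the message.
    ham_cutoff:   assumed cutoff above which good mail is worth training.
    """
    if is_spam:
        # Spam that slipped into the inbox is always trained as junk.
        return "train as junk"
    if junk_percent > ham_cutoff:
        # Good mail with a suspicious score is trained as ham.
        return "train as ham"
    # Good mail with a low score needs no action.
    return "leave alone"


print(training_action(65, is_spam=True))
print(training_action(15, is_spam=False))
print(training_action(3, is_spam=False))
```

The point of the rule is that only uncertain or wrong messages need attention, which keeps the daily training effort small.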

Not sure about Chinese and Japanese, especially with respect to tokenization. There is specific support for Japanese tokenization in the core code though. Anyone with knowledge of Chinese, and some idea of what would represent appropriate tokenization, could always view the tokenization and calculation of junk score using JunQuilla.
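
To illustrate why tokenization matters for Chinese, here is a minimal sketch comparing a whitespace tokenizer with character bigrams. It is not the tokenizer used in the core code; bigrams are just one common approach for text written without spaces, and the function names and sample strings are made up for the example.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace, which works for space-delimited languages."""
    return text.split()


def char_bigrams(text: str) -> list[str]:
    """Overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


english = "get your free offer"
chinese = "免费获取优惠"  # hypothetical sample; no spaces between words

print(whitespace_tokens(english))  # four separate word tokens
print(whitespace_tokens(chinese))  # the whole phrase comes out as one token
print(char_bigrams(chinese))       # five overlapping two-character tokens
```

If a whole phrase becomes a single token, it will almost never be seen again in another message, so training on it teaches the filter very little.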

As to this bug, we probably need to decide what is the master bug for all rather vague complaints about the performance of the spam filter, and dup to that one.

If the OP wants to investigate further, I would be interested in knowing the training count for ham and spam for his situation (also available in JunQuilla), and comments on the tokenization and spam calculations.
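
As background on why the training counts matter, here is a generic textbook-style sketch of a per-token spam probability. It is not taken from the Thunderbird source, and the function name and the numbers are hypothetical. The same raw token counts lead to very different probabilities depending on how many ham and spam messages were trained.

```python
def token_spam_probability(spam_hits: int, ham_hits: int,
                           spam_trained: int, ham_trained: int) -> float:
    """Relative frequency of a token in spam versus ham.

    spam_hits / ham_hits:       trained messages of each kind containing the token.
    spam_trained / ham_trained: total trained messages of each kind.
    """
    spam_freq = spam_hits / spam_trained if spam_trained else 0.0
    ham_freq = ham_hits / ham_trained if ham_trained else 0.0
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# Same token counts, different training totals.
lopsided = token_spam_probability(30, 30, spam_trained=100, ham_trained=1000)
balanced = token_spam_probability(30, 30, spam_trained=1000, ham_trained=1000)
print(round(lopsided, 3), round(balanced, 3))
```

With lopsided training totals, a token that appears equally often in both piles can still look strongly spammy, which is one reason to ask for the ham and spam counts first.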
(In reply to Kent James (:rkent) from comment #2)
> As to this bug, we probably need to decide what is the master bug for all
> rather vague complaints about the performance of the spam filter, and dup to
> that one.
I like that idea. Wayne, thoughts?
(Reporter)

Comment 4

6 years ago
From my point of view, the major misclassifications that TB makes, are these:

1. On technical mailing lists, especially where they send patches (like eg. C code), these emails _frequently_ get classified as spam.

2. Mails in languages other than English often get classified as spam, but with no obvious pattern, and often, spam emails in those foreign languages (Greek, Portuguese etc) slip through.

3. Mails with attachments - in my case often invoices and such - also frequently get tagged as spam.

As for training ham and spam... If I find an email misclassified as spam, I untick the "spam" flag and thought that that would be training as ham. No? I have disabled the spam filter due to the high error rate.

So far, spam filtering seems to be Bayesian training _only_, with no way of influencing the heuristics. This might work well in English, but I'd say that this is not a satisfactory solution for speakers of other languages. The configuration of the spam filter could be more complex than "on/off", but so far I will need to rely on a different spam filter mechanism.

I am interested in investigating the matter further, but don't know what would be an efficient way to proceed. Wrt. JunQuilla: TB's use of screen real estate leaves much to be desired, imho, but that's certainly outside the scope of this bug.

Comment 5

6 years ago
"I am interested in investigating the matter further, but don't know what would be an efficient way to proceed"

I tried to tell you. Load JunQuilla, and report back on some of the parameters that it reports.


But in truth, a server-based anti-spam solution is technically superior to a client-only solution, hence the vast majority of users will need to subscribe to a commercial anti-spam solution (or use an ad-based system like Gmail).

But there is the option of using TB's internal filter, which works well if someone trains it diligently (which is not possible IMHO with the standard UI, but needs something like JunQuilla). It works really well in combination with a SpamAssassin front end, particularly if you reconfigure and retrain TB to take advantage of the SpamAssassin-supplied tokens as I show in http://mesquilla.com/2010/02/12/combining-thunderbird-with-spamassassin/ ("Really well" is fewer than 4 false positives per year, with >95% rejection rate of spam).
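
The linked post has the actual setup. As a rough illustration only of what "SpamAssassin-supplied tokens" could look like, the sketch below pulls the rule names out of an `X-Spam-Status` header value. The function name and the sample header are hypothetical, and how Thunderbird would be configured to use such tokens is not shown here.

```python
import re


def spamassassin_rule_tokens(header_value: str) -> list[str]:
    """Extract rule names from the tests= field of an X-Spam-Status value."""
    # The tests list may be folded across lines after a comma; undo that.
    flat = re.sub(r",\s+", ",", header_value)
    match = re.search(r"tests=(\S+)", flat)
    if not match:
        return []
    # "none" is treated as meaning no rule fired (assumption).
    return [name for name in match.group(1).split(",") if name and name != "none"]


# Hypothetical header value, folded the way mail headers often are.
sample = ("Yes, score=7.2 required=5.0 tests=BAYES_99,HTML_MESSAGE,\n"
          "\tRDNS_NONE autolearn=no version=3.3.1")
print(spamassassin_rule_tokens(sample))
```

Each rule name then acts like one more word in the message, so the client-side filter can learn which server-side rules tend to go with junk for this particular user.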

"Wrt. JunQuilla: TB's use of screen real estate leaves much to be desired, imho, but that's certainly outside the scope of this bug." To the contrary, IMHO the main issue with the TB anti spam filter IS the user interface to it, not the underlying bayes algorithms. But I have been unable to convince others of this, including not only yourself but also the TB drivers.
(Reporter)

Comment 6

6 years ago
I have installed JunQuilla (great tool, btw), but now what? I mean, it looks like it will only gather such parameters IF I have adaptive filtering enabled. My false positives have spam scores of up to 100%, if that's what you are after.

My remark about TB's usage of screen real estate wasn't limited to spam filtering, but imho applies across the board to many functions of TB. If you want an idea, JunQuilla could imho be more user-friendly if you hooked some of the options/functions into the context menu. And yes, I have my own instance of SpamAssassin, but find it very tedious to adjust the rules all the way, and have disabled Bayes because of ill training effects much like those that I see in TB, too.

Comment 7

6 years ago
(In reply to a_geek from comment #6)
> I have installed JunQuilla (great tool, btw), but now what? 

the information at the junquilla site describes how to interpret the info

> I mean, it looks
> like it will only gather such parameters IF I have adaptive filtering
> enabled. 

Of course. do you expect something different?


(In reply to a_geek from comment #4)
> From my point of view, the major misclassifications that TB makes, are these:
> 
> 1. On technical mailing lists, especially where they send patches (like eg.
> C code), these emails _frequently_ get classified as spam.

have you marked such messages as not junk?
your alternative, when bayes doesn't work, is to whitelist the ML 


> 2. Mails in languages other than English often get classified as spam, but
> with no obvious pattern, and often, spam emails in those foreign languages
> (Greek, Portuguese etc) slip through.

Bug 234411 Make Bayesian filter (junk mail) work both on 'bytes' and 'characters' 

> 3. Mails with attachments - in my case often invoices and such - also
> frequently get tagged as spam.

rkent, does bayes ignore attachment data?

> As for training ham and spam... If I find an email misclassified as spam, I
> untick the "spam" flag and thought that that would be training as ham. No? I
> have disabled the spam filter due to the high error rate.
 
> So far, spam filtering seems to be Bayesian training _only_, with no way of
> influencing the heuristics. This might work well in English, but I'd say
> that this is not a satisfactory solution for speakers of other languages.
> The configuration of the spam filter could be more complex than "on/off",
> but so far I will need to rely on a different spam filter mechanism.

have you tried adjusting the junk threshold value, as described at http://mesquilla.com/extensions/junquilla/ ?

Comment 8

6 years ago
"rkent, does bayes ignore attachment data?"

Looking briefly at the code, for attachments the file name and content type only are added to the token list.
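
A minimal sketch of that behaviour, using Python's standard `email` package rather than the actual core code: only the file name and content type of each attachment become tokens, and the attachment body is never read. The token prefixes and the sample message are made up for the illustration.

```python
from email.message import EmailMessage


def attachment_tokens(msg: EmailMessage) -> list[str]:
    """Build tokens from attachment metadata only, never from the contents."""
    tokens = []
    for part in msg.iter_attachments():
        name = part.get_filename()
        if name:
            tokens.append(f"attachment/filename:{name.lower()}")
        tokens.append(f"attachment/content-type:{part.get_content_type()}")
    return tokens


# Hypothetical invoice mail with one PDF attachment.
msg = EmailMessage()
msg["Subject"] = "Invoice"
msg.set_content("Please find the invoice attached.")
msg.add_attachment(b"%PDF-1.4 placeholder bytes", maintype="application",
                   subtype="pdf", filename="Invoice-2012.pdf")

print(attachment_tokens(msg))
```

Under this scheme an invoice with a short body contributes very few tokens, so a handful of words in the body plus the file name end up deciding the score.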
(Reporter)

Comment 9

6 years ago
In reply to Wayne Mery's comment #7:

Yes, I frequently click on ham messages to mark them as ham, and on spam messages to mark them as spam. I did not yet see much of an improvement, if any.

I also adjusted the spam threshold from 80% to 90%.
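
To make the effect of that change concrete, here is a small sketch with made-up scores. It assumes a message is flagged when its junk score is at or above the cutoff; the exact comparison TB uses is an assumption here.

```python
def flagged_as_junk(scores: list[int], cutoff: int) -> list[int]:
    """Return the scores that would be flagged as junk at the given cutoff."""
    return [s for s in scores if s >= cutoff]


# Hypothetical junk percentages for six incoming messages.
scores = [5, 40, 85, 88, 95, 100]

print(flagged_as_junk(scores, 80))
print(flagged_as_junk(scores, 90))
```

Raising the cutoff rescues messages scoring between 80% and 90%, but a false positive that scores 100% (as reported in comment 6) is flagged either way, so only retraining can fix those.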

Comment 10

6 years ago
(In reply to a_geek from comment #9)
> In reply to Wayne Mery's comment #7:
> 
> Yes, I frequently click on ham messages to mark them as ham, and on spam
> messages to mark them as spam. I did not yet see much of an improvement, if
> any.

please answer the question - how many messages marked ham and how many marked spam, per JunQuilla?


> I also adjusted the spam threshold from 80% to 90%.

and what effect did that have?
(Reporter)

Comment 11

6 years ago
Wayne, you mean, I should post numbers about how many messages I classified? Or re-classified?

I have to pass on exact numbers, but as a rule of thumb, I'd say I mark maybe 20-50 messages per day as non-junk (sometimes more), and confirm another 50 messages per day to be junk. After raising the threshold, I'd say that probably fewer than 30 of the messages I see per day get classified the wrong way (I receive much more email than I actually see; most of it is automatically filed to the corresponding mailing list folders). I'd like to get closer to 3 misclassified messages per day. JQ frequently marks messages with a red question mark. I make sure to mark them all appropriately by hand.

Comment 12

6 years ago
(In reply to Wayne Mery (:wsmwk) from comment #10)
> (In reply to a_geek from comment #9)
> > In reply to Wayne Mery's comment #7:
> 
> > I also adjusted the spam threshold from 80% to 90%.
> 
> and what effect did that have?

(In reply to a_geek from comment #4)
> From my point of view, the major misclassifications that TB makes, are these:
> 
> 1. On technical mailing lists, especially where they send patches (like eg.
> C code), these emails _frequently_ get classified as spam.
> 
> 2. Mails in languages other than English often get classified as spam, but
> with no obvious pattern, and often, spam emails in those foreign languages
> (Greek, Portuguese etc) slip through.

you earlier gave an example of Chinese, which would be Bug 234411 - Make Bayesian filter (junk mail) work both on 'bytes' and 'characters' 


> 3. Mails with attachments - in my case often invoices and such - also
> frequently get tagged as spam.

this bug?  Bug 280716 - image spam evades bayesian junk control filter


rkent, how do we handle attachments?
Summary: spam filter doesn't yield to training, keeps misclassifying emails → junk training isn't helping, keeps misclassifying emails

Comment 13

5 years ago
(In reply to Ludovic Hirlimann [:Usul] from comment #3)
> (In reply to Kent James (:rkent) from comment #2)
> > As to this bug, we probably need to decide what is the master bug for all
> > rather vague complaints about the performance of the spam filter, and dup to
> > that one.
> I like that idea. Wayne thoughts ?

sure, you just need to find someone to sift through them. But I'm not sure that will accomplish much other than create a junky junk bug. 

this bug ... is stuck on comment 12

Comment 14

5 years ago
"rkent, how do we handle attachments?"

Just noticed this question. I believe that we just add tokens for the filename and content type, and ignore the actual contents.
(Reporter)

Comment 15

5 years ago
Although I still try to train TB, it is still mostly unable to "make up its mind", and frequently displays the red question mark for good and bad messages alike, even when TB knows that I wrote the message.

Comment 16

5 years ago
Thunderbird doesn't attempt to classify messages you sent, hence the question mark.

have you trained at least 100 varied messages as Junk?
and 100 varied messages as NOT junk?  (including messages with attachments?)
(Reporter)

Comment 17

5 years ago
I didn't count, but probably yes - ever since I installed JunQuilla, I have been manually training every message that I found not clearly enough, or wrongly, labelled. The question mark is not only displayed on messages I sent, but also on messages I receive, including ones from long-time correspondents. Some messages are directly categorized as junk, though.

Comment 18

5 years ago
messages "you sent" are virtually irrelevant - it doesn't act on them, and it doesn't necessarily make sense to process them because "you are trusted". (besides, bug 179518 already exists for it) 

this probably should have been asked a while back ... please attach some testcase messages to the bug.
Keywords: testcase-wanted

Comment 19

5 years ago
lacks testcase
Status: UNCONFIRMED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → INCOMPLETE