Open Bug 62598 Opened 24 years ago Updated 2 years ago

Allow to filter mail messages by language (charset)

Categories

(MailNews Core :: Filters, enhancement, P3)

x86
Linux
enhancement

Tracking

(Not tracked)

People

(Reporter: dr, Unassigned)

I get a fair amount of Japanese spam. I don't speak Japanese, so I can safely
filter all jp-encoded mail to my trash. I'd like to see a filter-by-language or
by-charset criterion added to filters.
Not a bad idea. Here is a rough description of what could be done with it: in
preferences there could be an option menu just for languages, so that languages
could be set up for ChatZilla, Mail, Composer, and Navigator. It could be set up
for both reading and writing in all components. Just a visualization of how this
could be carried out. Hope it helps.
QA Contact: esther → laurel
The languages don't have to be installed. I think the header just has to be
searched to see what language it's in.
Depends on: 59368
Blocks: 66425
I wonder how practical this is.
In any case, if we want this kind of feature,
maybe we should allow for a general 'header' key in which
the user can manually specify a header attribute.
For example,

Key:      relationship  value:

Headers   include       charset=iso-2022-jp
Headers   include       Content-type: text/html
etc.
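
To make the generic 'header' rule above concrete, here is a minimal sketch in Python. It is not Mozilla's filter code; the function name header_includes and the sample message are made up for illustration, and the match is assumed to be a case-insensitive substring test.

```python
from email import message_from_string


def header_includes(raw_message, header_name, needle):
    """Return True if any occurrence of header_name contains needle.

    Hypothetical helper: the comparison is a case-insensitive substring match.
    """
    msg = message_from_string(raw_message)
    values = msg.get_all(header_name, [])
    return any(needle.lower() in str(value).lower() for value in values)


# Made-up sample message used only to exercise the rule.
SAMPLE = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "Content-Type: text/plain; charset=ISO-2022-JP\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(header_includes(SAMPLE, "Content-Type", "charset=iso-2022-jp"))
    print(header_includes(SAMPLE, "Content-Type", "text/html"))
```

A rule like this only sees what the sender declared in the header, so it says nothing about the language of the body itself.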
A specific language filter rule would be more discoverable by novice users.
In that case, there needs to be a backend way to
map a 'language' name to a corresponding set of 'charsets'.
This is because one language may use more than one charset.
Coming up with such a list is non-trivial as there are
quite a few languages. For an encoding such as
ISO-8859-1, there is no way to distinguish the languages
that it supports without additional lang info buried
in the messages.

That is why I asked above how practical this is.
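
A language-to-charsets mapping of the kind described above might look like the following Python sketch. The table is a small illustrative assumption, not a complete or authoritative list, and the names LANGUAGE_CHARSETS, languages_for_charset and message_languages are hypothetical. Note that ISO-8859-1 and UTF-8 deliberately map to nothing, which is exactly the ambiguity raised above.

```python
from email import message_from_string

# Illustrative, incomplete mapping; a real list would be much longer.
LANGUAGE_CHARSETS = {
    "Japanese": {"iso-2022-jp", "shift_jis", "euc-jp"},
    "Korean": {"euc-kr", "iso-2022-kr", "ks_c_5601-1987"},
    "Chinese": {"big5", "gb2312", "gbk", "gb18030"},
    "Russian": {"koi8-r", "windows-1251", "iso-8859-5"},
}


def languages_for_charset(charset):
    """Return the sorted list of languages whose charset set contains charset."""
    wanted = charset.lower()
    return sorted(
        language
        for language, charsets in LANGUAGE_CHARSETS.items()
        if wanted in charsets
    )


def message_languages(raw_message):
    """Guess candidate languages from the declared charset of a message."""
    msg = message_from_string(raw_message)
    charset = msg.get_content_charset()  # None when no charset is declared
    if charset is None:
        return []
    return languages_for_charset(charset)


if __name__ == "__main__":
    print(languages_for_charset("Big5"))
    print(message_languages("Content-Type: text/plain; charset=EUC-KR\n\nhi\n"))
```

An empty result means "cannot tell from the charset alone", not "no language".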
Yep.  Maybe you could do something like:

Encoding is <blah> (Language, Language, ...)
"Language Encoding" or similar might be easier to understand.
Mm, "language encoding" sounds good to me. There are descriptions for each
encoding ("Central European" for example) which serve to better describe each
encoding than enumerating each language using it (since there are often
political issues attached to the names of languages -- Serbo-Croat vs. Croatian
vs. Bosnian, etc.)
reassigning to naving
Assignee: gayatrib → naving
This is a bad way to filter spam.  Many Japanese users send all mail (including 
mail written in English) in the Japanese charset, etc.
*** Bug 129263 has been marked as a duplicate of this bug. ***
AFAIK such a filter is useless - the spammers can always switch to UTF-8
encoding... what do you do in that case?
Also, as comment 10 points out, many non-spammers use Asian encodings even when
sending messages in English. This would probably cause a lot of trouble for such
users sending mail to a Mozilla mailnews user. I hate spam, but I believe
there must be better ways to filter out junk -- this would be an ugly hack.
Suggesting wontfix.
I would encourage discussion regarding clever ways / algorithms to detect spam,
so we could build those features instead. netscape.public.mail-news, anyone?
Håkan Waara wrote:
> Suggesting wontfix.
It is not a way to defeat spammers - but it may be useful in other ways.

Just implement it - it won't hurt... :)
This kind of filtering can already be done by users who really want it with the
current Mozilla, even if it's not very easy.

The steps could just be documented somewhere.

- Create a new filter
- choose the Customize header
- create a new customized header named "Content-Type"
- When it contains "ks_c_5601" (Korean spam), "iso-2022-jp" (Japanese spam), or
"big5" (Chinese spam), set the rule to destroy the message.
- add a rule to also destroy the mail when one of the above strings is found in
the subject. (I haven't tested whether this really works; I don't know if the
filter is applied before or after the subject is decoded. If it's after, and
thinking about it it should be after, this won't work.)
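
On the subject question in the last step: a raw Subject header names its charset inside RFC 2047 encoded-words, so a check on the undecoded header can see it. The Python sketch below only illustrates that idea; it says nothing about when Mozilla actually applies its filters, and the name subject_charsets and the sample subject are assumptions for this example.

```python
from email.header import decode_header


def subject_charsets(raw_subject):
    """Return the set of charsets named in RFC 2047 encoded-words.

    Works on the raw (undecoded) Subject value; plain ASCII subjects
    carry no charset label and yield an empty set.
    """
    return {
        charset.lower()
        for _decoded, charset in decode_header(raw_subject)
        if charset
    }


# Made-up sample: a base64 encoded-word labelled ISO-2022-JP.
RAW_SUBJECT = "=?ISO-2022-JP?B?GyRCJDMkcyRLJEEkTxsoQg==?="

if __name__ == "__main__":
    print(subject_charsets(RAW_SUBJECT))
    print(subject_charsets("Plain ASCII subject"))
```

Once the subject has been decoded to Unicode, the charset label is gone, which is why a substring match on the decoded subject would not work.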
Wontfix.  Filtering by language would be an ineffective and dangerous way to
block spam, so we should not encourage Mozilla users to use language to filter spam.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → WONTFIX
Jesse Ruderman wrote:
> Wontfix.  Filtering by language would be an ineffective and dangerous way to
> block spam, so we should not encourage Mozilla users to use language to filter
> spam.

Please read comment #15 - and consider reopening this bug. "Implement
filter-by-language" may not be effective to filter spam but it may have other
(useful) purposes...
There are many headers that people might want to filter on, but they can't all
be listed in the filter dialog.  Language encoding would be one of the less
reliable and more confusing headers to filter on.
I don't know Japanese, Chinese, or any other oriental language for that matter,
and somehow I manage to get all these emails in Outlook that look like absolute
gibberish because I'm on some stupid mailing list, and it's absolutely annoying!
If Mailnews is able to show Chinese, etc. characters for an email, then doesn't
it KNOW it's in Chinese?
*** Bug 154811 has been marked as a duplicate of this bug. ***
*** Bug 159150 has been marked as a duplicate of this bug. ***
The filter would also have to take advantage of Unicode character ranges. Maybe
one solution would be to provide the ability to add mail filter plug-ins. This
way Mozilla need not provide all these filters, but would at least offer the
possibility for someone to provide their own filter, independent of the Mozilla
distribution. The filter dialog would then be coupled with an 'advanced' filter
section where you would select the filter by name and then hit a configure
button which would bring up the settings panel of the filter. Below is a quick
rendition of what the code interface could look like:

   Filter 
     - getName() : String
     - getConfigPanel( SettingsRef ) : Panel
     - matches ( e-mail ) : boolean

BTW I certainly feel that this needs to be reopened
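
For illustration only, the Filter interface proposed above could be written as a Python abstract base class. This is a sketch of the proposal, not an existing Mozilla plug-in API; the settings "panel" is reduced to a plain dict, and CharsetFilter is a hypothetical example plug-in.

```python
from abc import ABC, abstractmethod
from email import message_from_string
from email.message import Message


class Filter(ABC):
    """Hypothetical plug-in interface mirroring the three methods above."""

    @abstractmethod
    def get_name(self) -> str:
        """Name shown in the 'advanced' filter list."""

    @abstractmethod
    def get_config_panel(self, settings: dict) -> dict:
        """Stand-in for a settings panel: fill in and return the settings."""

    @abstractmethod
    def matches(self, message: Message) -> bool:
        """Return True if the filter applies to the message."""


class CharsetFilter(Filter):
    """Example plug-in: match messages declaring one of the given charsets."""

    def __init__(self, charsets):
        self.charsets = {c.lower() for c in charsets}

    def get_name(self) -> str:
        return "Charset filter"

    def get_config_panel(self, settings: dict) -> dict:
        settings.setdefault("charsets", sorted(self.charsets))
        return settings

    def matches(self, message: Message) -> bool:
        # get_content_charset() returns a lowercased name, or None if absent.
        return message.get_content_charset() in self.charsets


if __name__ == "__main__":
    msg = message_from_string("Content-Type: text/plain; charset=Big5\n\nhi\n")
    print(CharsetFilter(["big5"]).matches(msg))
```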
Ok, you have good points. I agree that it wouldn't have the intended effect as 
much as I'd hope. Your argument of Japanese users sending English mail in Shift-
JIS encoding is particularly compelling. But I wouldn't consider it 
categorically useless, or harmful.

I don't ever expect to get any email from Russia or China, for example. Even 
English email. I don't have any friends or family there, and I'm certainly not 
interested in any "business opportunities" there. So if I see KOI8 or Big5 
emails, I don't care what the content is, I want that mail in the trash.

As for UTF-8, it'd be pretty easy to deal with character ranges... Actually,
come to think of it, do we convert all encodings to Unicode internally? (That'd
help with the other encodings). Regardless, it'd be plenty good to take a
random sampling of characters in the email, determine their Unicode range, and
filter based on that. I'd be very happy to say "only send me email in Basic
Latin and Latin-1 Supplement."
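
The sampling idea could be sketched roughly as follows in Python, assuming the message body has already been decoded to a Unicode string. The 90% threshold, the sample size and the function names are arbitrary assumptions chosen for the example.

```python
import random

# Basic Latin (U+0000..U+007F) and Latin-1 Supplement (U+0080..U+00FF).
ALLOWED_RANGES = [(0x0000, 0x007F), (0x0080, 0x00FF)]


def in_allowed_ranges(ch):
    """True if the character's code point falls in one of ALLOWED_RANGES."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ALLOWED_RANGES)


def fraction_allowed(text, sample_size=200, seed=0):
    """Fraction of sampled non-whitespace characters inside the allowed ranges."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # nothing to object to in an empty body
    if len(chars) > sample_size:
        # Fixed seed keeps the sketch reproducible.
        chars = random.Random(seed).sample(chars, sample_size)
    return sum(in_allowed_ranges(c) for c in chars) / len(chars)


def looks_latin(text, threshold=0.9):
    """Assumed policy: accept when at least threshold of the sample is Latin."""
    return fraction_allowed(text) >= threshold


if __name__ == "__main__":
    print(looks_latin("Café au lait, s'il vous plaît"))
    print(looks_latin("こんにちは、世界"))
```

A check like this tells scripts apart, not languages: French, German and English all pass, as would any other language written in these two ranges.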

I think behavior like that, specified by the mail recipient's expectations of 
where -- in a very broad-brush approximation sense -- they expect to be 
receiving mail from, would be plenty good.

So I apologize if this is a nuisance but I'd like to reopen this request for 
enhancement. It shouldn't clutter your radar as such, and I think it would be a 
rather useful feature for many users, even if it's not perfect for everybody.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
I don't get much legitimate mail from China or Japan, but I do get legitimate
mail from friends who live in the US and send all of their mail in strange
character sets.
Summary: Implement filter-by-language → Filter by language
mass re-assign.
Assignee: naving → sspitzer
Status: REOPENED → NEW
*** Bug 225784 has been marked as a duplicate of this bug. ***
Has anything ever happened here? I have become a favorite of Korean and
Japanese spammers, and since I have not spoken Korean EVER, I would love to
turn these off.

*** Bug 268646 has been marked as a duplicate of this bug. ***
Product: MailNews → Core
I disagree with Jesse's reasoning.  That people post to lists in other than their native (or default) language and forget to change the language type represents incorrect behaviour on their part...

This is not an invalidation of the basic idea and its merit of filtering based on language.

The people doing the above will eventually learn to do the right thing.  In any case, TB and the rest of the Mozilla clan allows for one to easily set up multiple user profiles.  One could be a profile used exclusively for posting to mailing lists, that has default options such as:

* charset 8859-1
* top-posting
* text only encoding
* etc.

It's been possible for a long time now to add Content-Type to the list of searchable headers, and then filter on a charset name (similar to comment 3).

Comment 4 and 5 are still valid, if it's actually worth anyone's effort to set up a mapping between, say, Japanese and several encodings on the behalf of those novice users who want an easily discoverable way to ignore such messages.

But if the encoding is UTF-8, you can't tell what language it's in.  Either you'll end up filtering out, say, French or German messages, or you'll allow some Japanese messages.  Either way, these novice users are going to be confused by the situation.  And I agree with Jesse Ruderman's basic premise: this is a dangerous feature; therefore, I'd argue it shouldn't be discoverable.

Given that, combined with the pretty high quality of the Junk Controls feature, this bug really should be WontFix'd, for good.
(In reply to comment #31)

> But if the encoding is UTF-8, you can't tell what language it's in.  Either
> you'll end up filtering out, say, French or German messages, or you'll allow
> some Japanese messages.  Either way, these novice users are going to be
> confused by the situation.  And I agree with Jesse Ruderman's basic premise:
> this is a dangerous feature; therefore, I'd argue it shouldn't be discoverable.

True, but there are a lot of spammers out there that try to imitate what Outlook does, and Outlook likes Windows-125[0-8] charset encoding... God knows why.

There isn't much that can't be encoded in US-ASCII, ISO-8859-1, or UTF-8 (in that order of trying).

In fact, there's an RFC out there (forget which) that says that these are the recommended encodings, and that nothing else should be used.

This applies to comment #10 as well: since messages should be encoded in the smallest encoding that they will fit ("Be conservative in what you end...", to quote Jon Postel), since this has the highest probability of being supported, then English would be encoded in US-ASCII or at worst ISO-8859-1, not any Japanese native charsets.

As for comment #12: when that happens, we'll evolve, as they have.

(In reply to comment #32)

Umm... "Conservative in what you send..."  Fat fingers.
(In reply to comment #33)
> (In reply to comment #32)
> 
> Umm... "Conservative in what you send..."  Fat fingers.

This bug isn't about sending, it's about receiving.  And the rest of that aphorism, "be liberal in what you accept," exactly countermands your so-called "argument" for keeping this bug open.
Just a question, are there any hooks available to be able to write a plugin to do this? In a worst case scenario I can imagine being able to add custom filter plugins until there is a large enough demand for such a feature.
(In reply to comment #34)
> 
> This bug isn't about sending, it's about receiving.  And the rest of that
> aphorism, "be liberal in what you accept," exactly countermands your so-called
> "argument" for keeping this bug open.

The flip-side is that it's fairly clear what the correct behavior should have been for Outlook in how they select their encodings... and that there's a limit in how much another MTA should be willing to "be liberal" in order to accept things that are just plain wrong.

Demonstrably, Outlook is just plain wrong.

And it shouldn't matter that they have an 80% market share (or whatever it is).
*** Bug 354445 has been marked as a duplicate of this bug. ***
It could be done if Thunderbird analysed the content. I'm of the opinion that the key is not the encoding of the message (several languages share the same encoding) but the words in the message.

It could be possible:
- to list the words that appear in the message ("ham", "cosa", "equus", ...),
- to classify each word by the languages it may belong to ("ham" belongs to {English}, "cosa" belongs to {Spanish, Catalan, Italian}...),
- to determine the most probable language of the message (the message is written in language L if L is the language to which the most words could belong),
- and, then, to determine whether we should filter the message or not....

It's a draft.

The classification of the words could be done automatically if we know which words belong to each language. It could be built, for example, from a public database of such words. Translations of programs, Wikipedia pages, etc. could serve as sources.

Thanks,
Xan.
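
The word-voting draft above could be sketched like this in Python. The word table is a tiny made-up sample (the draft suggests building it from public word lists), the names WORD_LANGUAGES and guess_languages are hypothetical, and ties are returned as-is rather than broken.

```python
import re
from collections import Counter

# Tiny illustrative table; a real one would come from public word lists.
WORD_LANGUAGES = {
    "ham": {"English"},
    "the": {"English"},
    "cosa": {"Spanish", "Catalan", "Italian"},
    "equus": {"Latin"},
}


def guess_languages(text):
    """Return the languages with the most word votes, sorted by name.

    Every known word casts one vote for each language it may belong to.
    An empty list means no known word was found.
    """
    votes = Counter()
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for language in WORD_LANGUAGES.get(word, ()):
            votes[language] += 1
    if not votes:
        return []
    best = max(votes.values())
    return sorted(lang for lang, count in votes.items() if count == best)


if __name__ == "__main__":
    print(guess_languages("The ham and the eggs"))
    print(guess_languages("una cosa"))
```

A word shared by several languages cannot separate them on its own, so short messages will often produce ties.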
sorry for the spam.  making bugzilla reflect reality as I'm not working on these bugs.  filter on FOOBARCHEESE to remove these in bulk.
Assignee: sspitzer → nobody
Filter on "Nobody_NScomTLD_20080620"
QA Contact: laurel → filters
Product: Core → MailNews Core
Summary: Filter by language → Allow to filter mail messages by language (charset)
Severity: normal → S3