Can't filter/search based on charset of RFC 2047 (MIME-encoded) header (subject)

RESOLVED WONTFIX

Status

--
enhancement
RESOLVED WONTFIX
13 years ago
4 years ago

People

(Reporter: philipp, Unassigned)

Description

13 years ago
User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc3 Firefox/1.0.7
Build Identifier: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc3 Firefox/1.0.7

I tried to filter out messages that come in languages and/or character sets that I don't read/understand, so that I can at least limit my spam to the English language. ;-)

Filters with subjects containing =?...? never match.

Reproducible: Always

Steps to Reproduce:
1. create a filter on "Subject" "contains" "=?windows-1255?"
2. Send yourself a message in this character set
3. check your filter log for a match

Actual Results:  
A match on =?windows-1255? will never occur.

Expected Results:  
It should have matched, but didn't.


Having literal, exact matches could be useful.

So could having regex matches, for that matter.

Comment 1

13 years ago
(In reply to comment #0)

> I tried to filter out messages that come in languages and/or character sets
> that I don't read/understand, so that I can at least limit my spam to the
> English language. ;-)

Duplicate of Core bug 62598, related to Suite bug 152888?

> Having literal, exact matches could be useful.

Maybe related to Core bug 16913?

> So could having regex matches, for that matter.

Duplicate of Core bug 19442?

Comment 2

12 years ago
The problem here is that, by the time the header reaches the filter/search mechanism, it has already been decoded and the MIME encoding information (charset and transfer encoding) has been discarded.
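To illustrate the point, here is a minimal Python sketch using the standard `email.header` module. It is not Thunderbird code; the sample subject and the variable names are made up for illustration. It only shows that the `=?windows-1255?` marker is present in the raw header and gone once the header has been decoded, which is the form the filter sees.

```python
import base64
from email.header import decode_header

# Hypothetical sample: a short Hebrew word as an RFC 2047 encoded-word.
payload = "\u05e9\u05dc\u05d5\u05dd".encode("cp1255")
raw_subject = (
    "=?windows-1255?B?" + base64.b64encode(payload).decode("ascii") + "?="
)

# Decode every chunk with the charset named in its encoded-word.
decoded_subject = "".join(
    chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
    for chunk, charset in decode_header(raw_subject)
)

needle = "=?windows-1255?"
print(needle in raw_subject)      # searching the raw header
print(needle in decoded_subject)  # searching the decoded header
```

A "contains" filter that only ever receives `decoded_subject` has nothing left to match the charset marker against.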


(In reply to comment #1)
> (In reply to comment #0)
> 
> > I tried to filter out messages that come in languages and/or character sets
> > that I don't read/understand, so that I can at least limit my spam to the
> > English language. ;-)
> 
> Duplicate of Core bug 62598, related to Suite bug 152888?

62598 is about filtering based on the charset spec'd in the Content-Type (message-body), not in the MIME encoding of the header (RFC 2047).


> > Having literal, exact matches could be useful.
> 
> Maybe related to Core bug 16913?

No; that's only for News. The Mail equivalent of that bug was fixed long ago.

> > So could having regex matches, for that matter.
> 
> Duplicate of Core bug 19442?

That's the regexp feature, but this bug is not a dupe of that.
Severity: normal → enhancement
Status: UNCONFIRMED → NEW
Component: General → General
Ever confirmed: true
OS: Linux → All
Product: Thunderbird → Core
Hardware: PC → All
Summary: Filtering based on language, charset, etc. in Subject: line fails → Can't filter/search based on charset of RFC 2047 (MIME-encoded) header (subject)
Version: unspecified → Trunk

Updated

12 years ago
Component: General → MailNews: Search

Comment 3

12 years ago
No, wait, that's wrong -- in fact, that bug is the opposite of this.

Comment 4

12 years ago
(Whoops -- that comment was intended for bug 355018.)
Assignee: mscott → nobody
QA Contact: general → search

Updated

10 years ago
Product: Core → MailNews Core

Comment 5

4 years ago
Matching on charsets is generally a dumb idea, because charsets and languages are actually very orthogonal. And we're basically moving towards a UTF-8-only world, which means that you get pretty much 0 information from charsets anyways.

And this request is especially WONTFIX because we need to be eradicating knowledge of the 2047 encoding within Thunderbird, not encouraging it.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WONTFIX
(Reporter)

Comment 6

4 years ago
(In reply to Joshua Cranmer [:jcranmer] from comment #5)
> Matching on charsets is generally a dumb idea, because charsets and
> languages are actually very orthogonal. And we're basically moving towards a
> UTF-8-only world, which means that you get pretty much 0 information from
> charsets anyways.

That's wishful thinking.

We have the means to move to a UTF-8-only world, but Mail.app, for instance, forcibly defaults the charset to windows-1252 for users in the US.  Why?  I have no idea.  iso-8859-1 or utf-8 would be more than adequate, and indeed if you look at:

https://discussions.apple.com/message/26274457

there's no reasonable explanation why Apple chose to ignore RFC 6657, sections 3 and 4.

There's a gulf, whether you see it or not, between what people can or should do and what they in fact choose to do.

Previous versions of Mail.app (up to Mac OS X 10.6) did in fact allow you to select the character set.  Now it's forced by your configured locale.  Go figure: they've gotten further from doing the right thing, not closer to it.

In any case, it's perhaps because languages and character sets ARE orthogonal that people test the intersections of those.  I have SpamAssassin rules that give extra points to emails sent to English-language mailing lists but encoded in GB-2312, for instance, because 95% of the time they're Spam.

I could write an entirely English language message in KOI-8 or GB-2312, but most people DON'T.  Unless it's Spam.  I wouldn't call that "zero information".
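As a rough sketch of what a charset test on the raw header could look like (plain Python, not an actual SpamAssassin rule; the function names and the list of unexpected charsets are hypothetical):

```python
import re

# Matches an RFC 2047 encoded-word: =?charset?encoding?text?=
# The optional "*lang" suffix after the charset comes from RFC 2231.
ENCODED_WORD = re.compile(r"=\?([^?*\s]+)(?:\*[^?\s]*)?\?[bBqQ]\?[^?\s]*\?=")

def header_charsets(raw_header):
    """Return the lower-cased charsets named in a raw, undecoded header."""
    return {m.group(1).lower() for m in ENCODED_WORD.finditer(raw_header)}

def looks_suspicious(raw_subject, unexpected):
    """Hypothetical rule: flag a subject labelled with an unexpected charset."""
    return bool(header_charsets(raw_subject) & unexpected)
```

For example, `looks_suspicious("=?GB2312?B?xOO6ww==?= hello", {"gb2312", "koi8-r"})` would flag the subject, while a plain ASCII subject yields an empty charset set and is left alone. The test only works on the raw header, before any decoding.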

> And this request is especially WONTFIX because we need to be eradicating
> knowledge of the 2047 encoding within Thunderbird, not encouraging it.

That's not a decision you can make in a vacuum.  Everyone else needs to make the move more or less at the same time.

But you're conflating 2 issues, since we're on the subject of orthogonality: whether or not you choose to do the right thing and SEND in UTF-8 is irrelevant to whether you have to deal with messages you've RECEIVED in each of those 2047 different encodings...

Comment 7

4 years ago
(In reply to Philip Prindeville from comment #6)
> I could write an entirely English language message in KOI-8 or GB-2312, but
> most people DON'T.  Unless it's Spam.  I would't call that "zero
> information".

Evidently, you never have reason to communicate with Chinese, Russian, or Japanese people in English. Most of the Japanese I communicate with (in English!) always use ISO-2022-JP for their posts, and they're not spam.

> > And this request is especially WONTFIX because we need to be eradicating
> > knowledge of the 2047 encoding within Thunderbird, not encouraging it.
> 
> That's not a decision you can make in a vacuum.  Everyone else needs to make
> the move more or less at the same time.

I found this bug while I was filing a bug to move our database from storing the Subject header internally in the original, raw form to the MIME-decoded form. If that bug is fixed, then it becomes impossible to implement this feature because the filter code will never see the 2047 encoding in the first place.
(Reporter)

Comment 8

4 years ago
(In reply to Joshua Cranmer [:jcranmer] from comment #7)
> (In reply to Philip Prindeville from comment #6)
> > I could write an entirely English language message in KOI-8 or GB-2312, but
> > most people DON'T.  Unless it's Spam.  I would't call that "zero
> > information".
> 
> Evidently, you never have reason to communicate with Chinese, Russian, or
> Japanese people in English. Most of the Japanese I communicate with (in
> English!) always use ISO-2022-JP for their posts, and they're not spam.

I communicate with such people all the time, and they're mostly using (like the rest of the world) broken MUAs!!!

Quoting RFC-1521, Section "7.1.1 The charset parameter":

  In general, mail-sending software must always use the "lowest common
  denominator" character set possible. For example, if a body contains
  only US-ASCII characters, it must be marked as being in the US-ASCII
  character set, not ISO-8859-1, which, like all the ISO-8859 family of
  character sets, is a superset of US-ASCII. More generally, if a
  widely-used character set is a subset of another character set, and a
  body contains only characters in the widely-used subset, it must be
  labeled as being in that subset. This will increase the chances that
  the recipient will be able to view the mail correctly.

If someone is emailing me English text (only) but using GB-2312, KOI-8, or ISO-2022-JP, their mailer is SEVERELY BROKEN.

The above text couldn't be more clear.
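A minimal Python sketch of the "lowest common denominator" rule quoted above. The candidate list and the function names are assumptions for illustration; the RFC does not prescribe this particular list or algorithm.

```python
# Candidate labels, ordered from smallest repertoire to largest.
# This ordering is an illustrative assumption, not part of the RFC.
CANDIDATES = ("us-ascii", "iso-8859-1", "utf-8")

def lowest_common_charset(text, candidates=CANDIDATES):
    """Return the first candidate charset that can represent all of text."""
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue
        return name
    raise ValueError("no candidate charset can encode this text")

def ascii_mislabelled(text, declared):
    """True when pure-ASCII text carries a label other than US-ASCII."""
    return text.isascii() and declared.lower() not in ("us-ascii", "ascii")
```

Under these assumptions, `lowest_common_charset("Hello")` picks `us-ascii`, and `ascii_mislabelled("Hello", "GB2312")` reports the kind of labelling the RFC text forbids.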

Respectfully, rather than addressing the fact that people you can't control (those using the broken mailers above) are doing the wrong thing, you're trying to force unnecessary change on those you feel you can control.

This is misdirected desperation.

I'm onboard for a UTF-8 universe some day, but let's not whip the wrong donkey.

The proof of the utility of the feature described by this issue is that I might want to search for all messages I've received IN THE WRONG CHARSET and send their senders an email citing chapter and verse (as I have done here) of the IETF requirements for proper email interoperability, and ask them to file bugs against their respective broken MUAs.


> > > And this request is especially WONTFIX because we need to be eradicating
> > > knowledge of the 2047 encoding within Thunderbird, not encouraging it.
> > 
> > That's not a decision you can make in a vacuum.  Everyone else needs to make
> > the move more or less at the same time.
> 
> I found this bug while I was filing a bug to move our database from storing
> the Subject header internally in the original, raw form to the MIME-decoded
> form. If that bug is fixed, then it becomes impossible to implement this
> feature because the filter code will never see the 2047 encoding in the
> first place.

That's an artifact of someone's implementation, rather than any real litmus test of whether this functionality is merited or not.

We deal with similar issues in SpamAssassin all the time, and that's why we have BODY and RAWBODY rules.  One applies to the "COOKED" or decoded text, and the other applies to the literal, canonicalized text as seen on the wire, i.e. "text" vs. "dGV4dAo=" for instance.

Turns out, they're both equally useful.
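To make the raw versus decoded distinction concrete, here is a small Python sketch. It is not how SpamAssassin implements BODY and RAWBODY; `rule_hits` is a hypothetical helper that applies the same pattern to both views of a base64-encoded part.

```python
import base64

def rule_hits(pattern, wire_text):
    """Return (raw_hit, cooked_hit) for a base64-encoded body part."""
    cooked = base64.b64decode(wire_text)
    return pattern in wire_text, pattern in cooked

wire = b"dGV4dAo="                # the part as seen on the wire
print(base64.b64decode(wire))     # b'text\n'
print(rule_hits(b"text", wire))   # (False, True): only the decoded view hits
print(rule_hits(b"dGV4", wire))   # (True, False): only the raw view hits
```

Each view can see things the other cannot, which is why a rule may want either one.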

Spammers often base64-encode messages unnecessarily, or mark 8-bit ISO Latin 1 as "content-transfer-encoding: 7bit", etc. Being able to detect malformed or improperly encoded text is a good hint at what's Spam and what's not.
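A check for the mislabelled 7bit case can be sketched in a few lines of Python. The function name is hypothetical, and the caller is assumed to pass the declared Content-Transfer-Encoding value together with the undecoded body bytes.

```python
def mislabelled_7bit(declared_cte, body):
    """True if the part claims 7bit but contains bytes above 0x7F."""
    claims_7bit = declared_cte.strip().lower() == "7bit"
    return claims_7bit and any(byte > 0x7F for byte in body)
```

Like the charset test, this needs the raw bytes; once the body has been decoded to text, the evidence is gone.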

Similarly, we pay attention to both the charset (CHARSET_FARAWAY) as well as which sections of that charset actually got used (UNWANTED_LANGUAGE_BODY).  You could write in GB-2312 but only use Roman letters, in which case the first rule would match but the second wouldn't.

You could also write a message in iso-8859-1 and mark it as "Language: de" in which case the 1st rule wouldn't match, but the 2nd one would.

Look at the ACKNOWLEDGEMENTS section of RFC-1345.  I've been dealing with this issue for more than 25 years (yeah, that's my French-speaking alter ego).

Comment 9

4 years ago
Let me be forcefully clear here:

The cost of implementing this feature in the codebase and supporting it in the future, combined with its political ramifications, far outweighs any utility it has. (Note: filtering on charsets is precisely why we have to mojibake messages in the jp locale instead of falling back to UTF-8. I wish I were making this up.) Nothing you have said has introduced any new information to me or changed those opinions in the slightest.

Filtering on the scripts used would be a much better idea. Although they still aren't 1-1 with languages, it is a much better and less destructive way to achieve most of the same goals.
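A rough Python sketch of script-based detection, working on the decoded text only. It approximates a character's script by the first word of its Unicode name, which is a heuristic and not the real Unicode Script property; `scripts_used` is a hypothetical name.

```python
import unicodedata

def scripts_used(text):
    """Approximate the scripts in text from Unicode character names.

    The first word of a letter's Unicode name ("LATIN", "CYRILLIC",
    "CJK", ...) stands in for its script. This is a rough heuristic.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts
```

With this approach, Roman-letter text that arrived labelled as GB-2312 still reports only `LATIN`, so the charset label plays no part in the decision.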

Comment 10

4 years ago
Philip, I don't feel as strongly as Joshua does that it is pointless to filter the way that you wish, but I do share his concern that we need to control complexity within Thunderbird. This feature has not received much attention within Bugzilla, so your need is fairly rare per that metric. Also clearly the normal way filtering should be done is using the decoded subject, so it makes sense that you do not normally see the MIME-encoded header when you filter on Subject. Your expected "It should have matched, but didn't." in comment 0 is not something I would support.

You could probably accomplish your current request with a custom JavaScript search term (such as is incorporated in FiltaQuilla) that searches the actual msgHdr.subject rather than the mime2DecodedSubject. I don't know if that will still be possible after the MIME changes that Joshua proposes to make, but I would hope that there would still be some way to programmatically get access to the encoded subject.

In any case, this request is specialized enough that it does not make sense to support it in core; an extension is the correct place for it. WONTFIX is appropriate.

Philip, if you still want to do this, try a custom JavaScript search term in FiltaQuilla. Feel free to contact me if you have issues.