Content-Type charset too strictly interpreted (e.g., ISO8859_1 != ISO-8859-1)

RESOLVED FIXED in Thunderbird 65.0

Status

defect
RESOLVED FIXED
7 years ago
8 months ago

People

(Reporter: david, Assigned: hsivonen)

Tracking

Thunderbird 65.0
x86_64
Windows 7

Thunderbird Tracking Flags

(thunderbird_esr6064+ fixed, thunderbird65 fixed)

Details

Attachments

(3 attachments, 2 obsolete attachments)

Posted file vm.eml
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11

Steps to reproduce:

Read an HTML-formatted email


Actual results:

Pound signs (that is, UK currency symbol, Unicode 00A3, £) came out as unknown characters (black diamond with question mark). 

Appears correctly in GMail.

Complete email source attached (slightly redacted for my security).


Expected results:

I think it should have shown pound signs correctly.

I looked at the header and I see that in the HTML part it has 
  Content-Type: text/html; charset="ISO8859_1"
  Content-Transfer-Encoding: quoted-printable
and the HTML that follows itself has
  <head>
      ...
      <meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Diso-8=
859-1">
  </head>

If I change
  Content-Type: text/html; charset="ISO8859_1"
to
  Content-Type: text/html; charset="ISO-8859-1"
(by editing the raw mail file in emacs), it then works.
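
The symptom can be reproduced outside Thunderbird with a minimal Python sketch. It assumes that a client which does not recognize the label ends up decoding the body as UTF-8 (a later comment in this bug suggests that is what happens); the sample body text is made up for illustration.

```python
import quopri

# Hypothetical body fragment: a pound sign sent as ISO-8859-1, quoted-printable.
raw = quopri.decodestring(b"Total: =A3100")  # the =A3 escape becomes byte 0xA3

# Decoded with the charset the sender meant, the pound sign survives.
as_latin1 = raw.decode("iso-8859-1")

# Decoded as UTF-8 (the assumed fallback), the lone 0xA3 byte is invalid and
# is replaced by U+FFFD, the "black diamond with question mark".
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_latin1)
print(as_utf8)
```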

Some research (http://www.w3.org/Protocols/rfc1341/7_1_Text.html )
suggests that ISO8859_1 is wrong:

         "The defined charset values are:
          US-ASCII       as defined in [US-ASCII].
          ISO-8859-X
          where "X" is to be replaced, as necessary, for the parts of ISO-8859 [ISO- 8859]. ...No other 
          character set name may be used in Internet mail without the publication of a formal specification 
          and its registration with IANA as described in Appendix F, or by private agreement, in which case
          the character set name must begin with "X-".

and I don't know where the correspondent's mail system (JavaMail) got ISO8859_1 from as an alternative form (it's not listed among the recognised aliases I found at http://www.iana.org/assignments/character-sets/character-sets.xml ).

HOWEVER, GMail displays this "correctly", that is with pound signs intact, and I suspect many other mail clients would as well. 

So my reading is that the sender email is strictly speaking wrong, but on the principle of "write strictly, read relaxed", and the competition gets it right, TB ought to be more flexible in its interpretation of the charset.
My guess is that Gmail just strips any special characters from those attributes, thus recognizing "ISO88591" as a valid encoding (but I don't know for sure). This would be prudent handling for Mozilla applications as well, making them more error tolerant with non-compliant messages (while still encoding correctly in any messages sent, of course) as long as there is no ambiguity.
Component: Untriaged → MIME
Product: Thunderbird → MailNews Core
Summary: Content-Type charset too strictly interpreted → Content-Type charset too strictly interpreted (e.g., ISO8859_1 != ISO-8859-1)
Attachment #691280 - Attachment mime type: application/octet-stream → text/plain
As an interesting observation, the <head><meta> attributes are using the correct ISO identifier in the charset specification:

> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
(In reply to rsx11m from comment #2)
> As an interesting observation, the <head><meta> attributes are using the
> correct ISO identifier in the charset specification:
> 
> > <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

The transport level charset would take precedence over the meta.
(In reply to Masatoshi Kimura [:emk] from comment #3)
> (In reply to rsx11m from comment #2)
> > As an interesting observation, the <head><meta> attributes are using the
> > correct ISO identifier in the charset specification:
> > 
> > > <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
> 
> The transport level charset would take precedence over the meta.

That said, if the transport level charset label is unrecognizable, maybe the charset should be taken from the meta instead of finalizing to UTF-8, yes.
That would certainly solve this particular case (and indeed if you Save As HTML from TB and view in a browser, it is correct because of this). But what's worrying me is that something prompted the sender's mail client to use this particular form of ISO8859_1; I don't know why, but they must have done this deliberately and got that string from somewhere. The fact that Google recognises it too may mean they are aware of it too - or maybe, since they are using a browser to display, they are just using the meta, so they don't even see there's a problem.

Another possibility: if you don't recognise the charset string, strip all punctuation (dashes, underscores, periods etc) from it and from each name in your table of charset names, then look for a match between the two, case-insensitively of course.
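
As a rough illustration of the matching proposed here (not something Thunderbird is known to do), the sketch below compares labels after lowercasing and dropping punctuation. The `KNOWN` table and the helper names are hypothetical.

```python
import re

# Hypothetical, tiny table of recognized charset names.
KNOWN = ["ISO-8859-1", "ISO-8859-15", "UTF-8", "EUC-JP"]

def squash(label: str) -> str:
    """Lowercase a label and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())

# Index the known names by their squashed form.
BY_SQUASHED = {squash(name): name for name in KNOWN}

def loose_match(label: str):
    """Return the known name whose squashed form equals the label's, or None."""
    return BY_SQUASHED.get(squash(label))

print(loose_match("ISO8859_1"))
print(loose_match("iso_8859-15"))
print(loose_match("EUC_JP"))
```

Note that this kind of matching also maps `EUC_JP` to `EUC-JP`, which is the sort of side effect later comments in this bug object to for the Web.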
(In reply to David Earl from comment #5)
> Another possibility would be to say if you don't recognise the charset
> string, to see if you can get a match between a form which has all
> punctuation (dashes, underscores, periods etc) removed against the same for
> your table of charset names treated the same way, case-insensitively of
> course.

There is a strict registration procedure for character sets at <http://www.iana.org/assignments/character-sets/character-sets.xml>; an algorithm for matching character sets that is web-compatible is at <http://encoding.spec.whatwg.org/>. Following your recommendation would be in violation of pretty much every specification that partially occupies this space; given that there are other mechanisms which would correctly detect this issue, I don't think it qualifies as a useful workaround at this point in time.

In this scenario, we should be rejecting the charset type of the protocol as invalid and the GUI should fall back onto the <meta> declaration instead. Actually, given that the default presumed type is US-ASCII, which in practice is best mapped to ISO-8859-1 for decoding, we shouldn't even need to have the meta...
OK, so what are you going to do with the text/plain alternative multipart content? This example didn't actually have one, but many corresponding examples would.

I did find the iana page when researching this, and the key thing is that the erroneous name is not among the aliases. It's also not listed on your second reference, so neither will help. I'm just suggesting a better backstop position when all else fails than a purist "this is an error" response. The worst that happens is you get errors displayed differently if it isn't correct, but still errors.

I'm sure if it can happen in this context, it will happen in other contexts which don't have a meta available.
(In reply to rsx11m from comment #2)
> As an interesting observation, the <head><meta> attributes are using the
> correct ISO identifier in the charset specification:
> 
> > <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

I imagine that the HTML content of the email was produced in entirely separate software from the software used for the mail distribution.
(In reply to David Earl from comment #7)
> OK, so what are you going to do with the text/plain alternative multipart
> content? This example didn't actually have one, but many corresponding
> examples would.

Are you sure that Gmail, Outlook, etc., would actually decode the message properly in such a scenario? I'm willing to bet that most programs delegate the charset decision to whatever library looks up charsets; neither the Java runtime nor the C# runtime appear to do the kind of normalization you talk about, so I highly doubt that we would be interoperable with most mail clients if we implemented your suggestion.
No, I'm not sure, and it would take a bit of effort to find out. One lesser test that might be enlightening is whether gmail still displays the message as intended if the meta charset is removed.

But I don't understand where interoperability comes into it: all I'm saying is if TB can't work out how to display a message because it doesn't understand the charset, then make an intelligent guess rather than giving up entirely.
(In reply to Masatoshi Kimura [:emk] from comment #4)
> (In reply to Masatoshi Kimura [:emk] from comment #3)
> > (In reply to rsx11m from comment #2)
> > > As an interesting observation, the <head><meta> attributes are using the
> > > correct ISO identifier in the charset specification:
> > > 
> > > > <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
> > 
> > The transport level charset would take precedence over the meta.
> 
> That said, if the transport level charset label is unrecognizable, maybe the
> charset should be taken from the meta instead of finalizing to UTF-8, yes.

This indeed is what the HTML spec requires.

> (In reply to David Earl from comment #5)
> > Another possibility would be to say if you don't recognise the charset
> > string, to see if you can get a match between a form which has all
> > punctuation (dashes, underscores, periods etc) removed against the same for
> > your table of charset names treated the same way, case-insensitively of
> > course.

Opera tried that for a short while and it was not Web compatible. r- as far as landing that sort of thing to the m-c side goes.

(In reply to Joshua Cranmer [:jcranmer] (busy until March 8) from comment #6)
> There is a strict registration procedure for character sets at
> <http://www.iana.org/assignments/character-sets/character-sets.xml>;

We don't support the IANA registry for this stuff.

> an
> algorithm for matching character sets that is web-compatible is at
> <http://encoding.spec.whatwg.org/>.

That's the spec we implement. No new aliases should land on the m-c side unless they go into that spec as well.

But in this case, the right fix would be to check the meta if the MIME-level charset is unknown.
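
A minimal sketch of that order of precedence, assuming a hypothetical `RECOGNIZED` set standing in for the real label table, a UTF-8 default, and a deliberately naive regular expression for the `<meta>` scan (a real HTML prescan is more involved):

```python
import re

# Hypothetical stand-in for the set of labels the client recognizes.
RECOGNIZED = {"iso-8859-1", "utf-8", "windows-1252"}

# Naive scan for charset=... inside the markup; illustration only.
META_CHARSET = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.I)

def pick_charset(mime_label, html, default="utf-8"):
    """Prefer a recognized MIME-level label, then a recognized <meta>, then default."""
    if mime_label and mime_label.strip().lower() in RECOGNIZED:
        return mime_label.strip().lower()
    match = META_CHARSET.search(html)
    if match and match.group(1).lower() in RECOGNIZED:
        return match.group(1).lower()
    return default

html = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
print(pick_charset("ISO8859_1", html))  # MIME label unknown, so the meta is used
print(pick_charset("UTF-8", html))      # MIME label known, so it takes precedence
```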
There is a problem that Java has "ISO8859_1" as a canonical name for the charset.
See https://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html

This leads many users of JavaMail etc. to use this "wrong" charset name in the email header.

It would be nice if thunderbird could compensate for this, and recognise "ISO8859_1" as an alias for "ISO-8859-1"

Terje B.
(In reply to Terje Bråten from comment #12)
> There is a problem that Java has "ISO8859_1" as a canonical name for the
> charset.
> See https://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html

"ISO8859_1" is canonical in pre-java.nio APIs. java.nio shipped in Java 1.4 in February 2002. Is the issue with ancient JavaMail jars floating around out there or has JavaMail not been updated to use the java.nio APIs since 2002?

On the bright side, senders that do the right thing and use UTF-8 for all outgoing mail end up using an Encoding Standard-compatible label even if they use the pre-java.nio canonical name.
I just got an email today that had this problem. I could see in the header that it was sent by JavaMail and the charset was set to "ISO8859_1". So yes, it is still an issue.
Per https://lists.w3.org/Archives/Public/public-whatwg-archive/2009Aug/0077.html , we can't just put all pre-java.nio names in the spec. (The Web relies on "EUC_JP" not being treated as "EUC-JP".)
Considering that in bug 1511950 Thunderbird decided to deviate from the list of labels in the Encoding Standard instead of trying to push email-relevant labels into the Encoding Standard, does it mean that pre-java.nio Java names should be added to Thunderbird's alias list?

Specifically, should these labels be added to Thunderbird?
ISO8859_1
ISO8859_2
ISO8859_3
ISO8859_4
ISO8859_5
ISO8859_6
ISO8859_7
ISO8859_8
ISO8859_9
ISO8859_13
ISO8859_15
UnicodeBigUnmarked
UnicodeLittleUnmarked
Cp874
MS874
MacCyrillic
MacRoman
MS950_HKSCS
Big5_HKSCS
MS949
MS950
TIS620
EUC_JP
EUC_KR
EUC_CN
ISO2022JP
ISO2022KR
ISO2022CN
ISO2022_CN_GB
ISO2022_CN_CNS

For reference, these pre-java.nio-compatible names are Encoding Standard-compatible:
UTF8
Cp866
KOI8_R
Big5
GB18030
Cp1250
Cp1251
Cp1252
Cp1253
Cp1254
Cp1255
Cp1256
Cp1257
Cp1258
SJIS
MS932

(The last one on the above list was added to the Encoding Standard to cater to Thunderbird.)

Unclear without testing:
UTF-16

Irrelevant because the BOM overrides label:
UnicodeBig
UnicodeLittle
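
One way to picture an alias list like this is as a lookup consulted only after the standard labels fail. The sketch below is illustrative: `STANDARD_LABELS` is a small hypothetical subset of Encoding Standard labels, and the targets in `MAIL_ALIASES` (in particular for `iso8859_1` and `euc_kr`) are assumptions rather than the content of any patch in this bug.

```python
# Hypothetical subset of labels the Encoding Standard already maps.
STANDARD_LABELS = {
    "utf8": "UTF-8",
    "utf-8": "UTF-8",
    "sjis": "Shift_JIS",
    "ms932": "Shift_JIS",
    "cp1252": "windows-1252",
}

# Extra mail-only aliases, keyed in lowercase. Targets are assumptions.
MAIL_ALIASES = {
    "iso8859_1": "ISO-8859-1",
    "euc_kr": "EUC-KR",
    "iso2022jp": "ISO-2022-JP",
}

def resolve(label: str):
    """Look up a label case-insensitively: standard labels first, then mail aliases."""
    key = label.strip().lower()
    return STANDARD_LABELS.get(key) or MAIL_ALIASES.get(key)

print(resolve("ISO8859_1"))
print(resolve("MS932"))
print(resolve("nonsense"))
```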
Assignee: nobody → hsivonen
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Attachment #9030094 - Flags: review?(jorgk)
Comment on attachment 9030094 [details] [diff] [review]
Adjust charsetalias.properties

Hmm. Maybe it's a bad idea to add aliases that there is no demonstrated compat need for. I'll scope this down.
Attachment #9030094 - Attachment is obsolete: true
Attachment #9030094 - Flags: review?(jorgk)
Thanks for looking into this. I looked at charsetalias.properties recently and thought we should remove some junk from it. Given that TB 60 ran without the file completely and only now we got a complaint about cp932, I don't think we should add more stuff here.
Posted patch Support more labels (obsolete) — Splinter Review
Attachment #9030102 - Flags: review?(jorgk)
(In reply to Jorg K (GMT+1) from comment #20)
> Thanks for looking into this. I looked at charsetalias.properties recently
> and thought we should remove some junk from it. Given that TB 60 ran without
> the file completely and only now we got a complaint about cp932, I don't
> think we should add more stuff here.

Oops. Sorry, I attached the new patch before seeing your comment.
(In reply to Jorg K (GMT+1) from comment #20)
> only now we got a complaint about cp932

Comment 12 was 2 months ago, and, in retrospect, bug 1120813, whose fix made it all the way to the Encoding Standard (as the only non-Web-motivated label there), was probably also caused by JavaMail. So there seems to be a valid concern to support the kind of labels (old versions of?) JavaMail can emit.
Hmm, I'm a little confused now. First you said: "I'll scope this down" and now the second patch is about as long as the first, only in a different order.

I think I'll have to sort original and patched file to see what the real change is. Or what's the difference between the two apart from the order?
Comment on attachment 9030102 [details] [diff] [review]
Support more labels

Review of attachment 9030102 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/intl/charsetalias.properties
@@ -15,4 @@
>  
>  646=windows-1252
> -iso-8859-1=ISO-8859-1
> -utf-16=UTF-16

What about utf-16?

@@ +69,5 @@
>  
> +# ISO 2022 series without hyphens
> +iso2022jp=ISO-2022-JP
> +iso2022kr=replacement
> +iso2022cn=replacement

What is "replacement"?
Henri, while we're here, can you shed some light onto this code:
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/compose/src/nsMsgCompose.cpp#1611-1616

This was added in bug 234958 comment #4 in a different location and has since been shuffled around. Can't that just be another label? We already have cp949=EUC-KR and ms949=EUC-KR.
(In reply to Jorg K (GMT+1) from comment #24)
> Hmm, I'm a little confused now. First you said: "I'll scope this down" and
> now the second patch is about as long as the first, only in a different
> order.
> 
> I think I'll have to sort original and patched file to see what the real
> change is. Or what's the difference between the two apart from the order?

The difference was not adding Solaris oddities or UTF-16BE/LE aliases. The latest patch also omits Mac encodings and mappings to replacement.

(In reply to Jorg K (GMT+1) from comment #25)
> > -utf-16=UTF-16
> 
> What about utf-16?

utf-16 is an Encoding Standard label, so it doesn't need to be here.

> @@ +69,5 @@
> >  
> > +# ISO 2022 series without hyphens
> > +iso2022jp=ISO-2022-JP
> > +iso2022kr=replacement
> > +iso2022cn=replacement
> 
> What is "replacement"?

An XSS avoidance measure for ISO-2022-KR and ISO-2022-CN on the Web. Possibly not relevant to email.
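
For context, the Encoding Standard's replacement decoder does not interpret the bytes at all: any non-empty input decodes to a single U+FFFD. A minimal sketch (the function name is hypothetical):

```python
def decode_replacement(data: bytes) -> str:
    """Mimic the 'replacement' encoding: non-empty input yields one U+FFFD."""
    return "\ufffd" if data else ""

# An ISO-2022-KR style escape sequence followed by markup is never interpreted.
result = decode_replacement(b"\x1b$)C\x0e<script>")
print(len(result))
```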

(In reply to Jorg K (GMT+1) from comment #26)
> Henri, while we're here, can you shed some light onto this code:
> https://dxr.mozilla.org/comm-central/rev/
> 2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/compose/src/
> nsMsgCompose.cpp#1611-1616
> 
> This was added in bug 234958 comment #4 in a different location and has
> since been shuffled around. Can't that just be another label? We already
> have cp949=EUC-KR and ms949=EUC-KR.

That code seems to be for blocking a Netscape-made-up internal label that no longer exists internally. Later these were blocked using a property file instead of one-off checks like that.

The latest patch has comments for everything in the file.
Attachment #9030102 - Attachment is obsolete: true
Attachment #9030102 - Flags: review?(jorgk)
Attachment #9030134 - Flags: review?(jorgk)
Oh, and the reason why the lack of MS932 or cp932 gets reported but other similar labels don't get reported is that the fallback for Japanese is ISO-2022-JP and not Shift_JIS, but for the other cases, the fallback saves the day if the recipient is using the localization that the encoding is affiliated with. It still makes sense to fix these issues in a way that doesn't depend on the recipient using a localization with just the right fallback.
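
The effect of the fallback can be illustrated with Python's own codecs, which only approximate Thunderbird's decoders. The sample strings are made up, and the Korean case assumes a localization whose fallback is EUC-KR.

```python
text_ja = "日本語"
text_ko = "한국어"

# Shift_JIS bytes decoded with the Japanese fallback (ISO-2022-JP) do not
# round-trip, so an unrecognized MS932/cp932 label is visible to the user.
sjis_bytes = text_ja.encode("shift_jis")
ja_result = sjis_bytes.decode("iso2022_jp", errors="replace")

# EUC-KR bytes decoded with a fallback that happens to be EUC-KR round-trip,
# so an unrecognized Korean label can go unnoticed.
ko_result = text_ko.encode("euc_kr").decode("euc_kr")

print(ja_result == text_ja)
print(ko_result == text_ko)
```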
(In reply to Henri Sivonen (:hsivonen) from comment #27)
> That code seems to be for blocking a Netscape-made-up internal label that no
> longer exists internally. Later these were blocked using a property file
> instead of one-off checks like that.
So can we remove the
  if (aCharset.Equals("x-windows-949", nsCaseInsensitiveCStringComparator()))
    aCharset = "EUC-KR";
or is there a better way to achieve the desired effect?

> The latest patch has comments for everything in the file.
Thanks, I'll take a closer look when I'm back at my desk.
(In reply to Jorg K (GMT+1) from comment #29)
> (In reply to Henri Sivonen (:hsivonen) from comment #27)
> > That code seems to be for blocking a Netscape-made-up internal label that no
> > longer exists internally. Later these were blocked using a property file
> > instead of one-off checks like that.
> So can we remove the
>   if (aCharset.Equals("x-windows-949", nsCaseInsensitiveCStringComparator()))
>     aCharset = "EUC-KR";
> or is there a better way to achieve the desired effect?

It's a bit unclear where the argument for that function comes from. If the argument is already supposed to be a Gecko-canonical name, then the bit of code can be removed.
I think what happens is this:
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/compose/src/nsMsgCompose.cpp#1611
fixCharset() is taking the charset it found in a message in the wild and is turning it into a canonical name:

https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/compose/src/nsMsgCompose.cpp#1618-1626

So if we receive cp932 and want to answer in the *same* charset, we still answer in Shift_JIS.

I'd love to remove that hard-coded check, but I have the feeling that
x-windows-949=EUC-KR
should be added in your patch.

Would you agree?
Bug 234958 comment #0 says (quote) "Mozilla generates an outgoing message with 'x-windows-949' label", which was undesired. Given that Mozilla most likely won't do this any more, the code could be dropped. However, since we can still expect to find that label in the wild, adding it as an alias for EUC-KR wouldn't hurt.
Comment on attachment 9030134 [details] [diff] [review]
Add labels and document all of them

Review of attachment 9030134 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the careful clean-up. Looks OK to me. r+ with the comments below considered. And please let me land it since I land patches after merges to M-C to trigger builds.

I've checked that the removed aliases latin1, l1, cp819, csisolatin1, etc. are all aliases for windows-1252 now.

::: mailnews/intl/charsetalias.properties
@@ +20,1 @@
>  x-imap4-modified-utf7=x-imap4-modified-utf7

Agreed, please remove. MUTF-7 is handled internally now.

@@ +98,1 @@
>  5601=EUC-KR

As per our discussion, I'd add x-windows-949=EUC-KR here and remove the code snippet. You can quote bug 234958.
Attachment #9030134 - Flags: review?(jorgk) → review+
The issue is a little more twisted. Some of our testing relies on recognising x-mac-croatian:
https://searchfox.org/comm-central/search?q=x-mac-croatian&case=true&regexp=false&path=

Also, if those Mac encodings were removed, we'd also need to clean up
mailnews/intl/charsetData.properties.

I guess to bring this bug to a close we should take the clean-up but leave the Mac charsets in place for testing.
Blocks: 1512977
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/0b7555f983a5
Recognize plausible legacy Java-style encoding names and comment the alias file. r=jorgk DONTBUILD
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Henri, I went ahead and landed this. I filed bug 1512977 as a follow-up.
Target Milestone: --- → Thunderbird 65.0
(In reply to Jorg K (GMT+1) from comment #37)
> Henri, I went ahead and landed this. I filed bug 1512977 as a follow-up.

Thanks. I was about to file a follow-up about removing the Solaris nl_langinfo-motivated items, but it seems that at least 646 might be worth keeping. :-/
(In reply to Henri Sivonen (:hsivonen) from comment #39)
> (In reply to Jorg K (GMT+1) from comment #37)
> > Henri, I went ahead and landed this. I filed bug 1512977 as a follow-up.
> 
> Thanks. I was about to file a follow-up about removing the Solaris
> nl_langinfo-motivated items, but it seems that at least 646 might be worth
> keeping. :-/

(Or maybe not, since all fallbacks handle ASCII anyway.)