820767 - Content-Type charset too strictly interpreted (e.g., ISO8859_1 != ISO-8859-1)

Reporter

Description

•

13 years ago

Attached file vm.eml — Details

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11

Steps to reproduce:

Read an HTML-formatted email

Actual results:

Pound signs (that is, UK currency symbol, Unicode 00A3, £) came out as unknown characters (black diamond with question mark). Appears correctly in GMail. Complete email source attached (slightly redacted for my security).

Expected results:

I think it should have shown pound signs correctly.

I looked at the header and I see that in the HTML part it has

Content-Type: text/html; charset="ISO8859_1"
Content-Transfer-Encoding: quoted-printable

and the HTML that follows itself has

<head>
...
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Diso-8=
859-1">
</head>

If I change

Content-Type: text/html; charset="ISO8859_1"

to

Content-Type: text/html; charset="ISO-8859-1"

(by editing the raw mail file in emacs), it then works.

Some research (http://www.w3.org/Protocols/rfc1341/7_1_Text.html ) suggests that ISO8859_1 is wrong:

"The defined charset values are: US-ASCII as defined in [US-ASCII]. ISO-8859-X where "X" is to be replaced, as necessary, for the parts of ISO-8859 [ISO-8859]. ...No other character set name may be used in Internet mail without the publication of a formal specification and its registration with IANA as described in Appendix F, or by private agreement, in which case the character set name must begin with "X-"."

and I don't know where the correspondent's mail system (JavaMail) got ISO8859_1 from as an alternative form (it's not listed among the recognised aliases I found at http://www.iana.org/assignments/character-sets/character-sets.xml ).

HOWEVER, GMail displays this "correctly", that is with pound signs intact, and I suspect many other mail clients would as well.
So my reading is that the sender email is strictly speaking wrong, but on the principle of "write strictly, read relaxed", and the competition gets it right, TB ought to be more flexible in its interpretation of the charset.
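The symptom described above can be reproduced outside any mail client. The following is a minimal Python sketch, assuming (as comment 4 suggests) that the unrecognised label causes the body to be decoded as UTF-8; the sample text "Total: =A3100" is made up for illustration and is not taken from the attached message.

```python
import quopri

# Hypothetical body fragment. "=A3" is the quoted-printable form of the
# single byte 0xA3, which is the pound sign in ISO-8859-1.
raw = quopri.decodestring(b"Total: =A3100")

# Decoded with the charset the sender meant: the pound sign survives.
as_latin1 = raw.decode("iso-8859-1")

# Decoded as UTF-8: a lone 0xA3 is not valid UTF-8, so it becomes
# U+FFFD, the "black diamond with question mark".
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_latin1)
print(as_utf8)
```

Note that this only illustrates the byte-level effect; it says nothing about how Thunderbird's own decoder is structured.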

rsx11m

Comment 1

•

13 years ago

My guess is that Gmail just strips any special characters from those attributes, thus recognizing "ISO88591" as a valid encoding (but I don't know for sure). This would be prudent handling for Mozilla applications as well, making them more error tolerant with non-compliant messages (while still encoding correctly in any messages sent, of course) as long as there is no ambiguity.

Component: Untriaged → MIME

Product: Thunderbird → MailNews Core

Summary: Content-Type charset too strictly interpreted → Content-Type charset too strictly interpreted (e.g., ISO8859_1 != ISO-8859-1)

rsx11m

Updated

•

13 years ago

Attachment #691280 - Attachment mime type: application/octet-stream → text/plain

rsx11m

Comment 2

•

13 years ago

As an interesting observation, the <head><meta> attributes are using the correct ISO identifier in the charset specification:

> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Masatoshi Kimura [:emk]

Comment 3

•

13 years ago

(In reply to rsx11m from comment #2)
> As an interesting observation, the <head><meta> attributes are using the
> correct ISO identifier in the charset specification:
>
> > <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

The transport level charset would take precedence over the meta.

Masatoshi Kimura [:emk]

Comment 4

•

13 years ago

(In reply to Masatoshi Kimura [:emk] from comment #3)
> (In reply to rsx11m from comment #2)
> > As an interesting observation, the <head><meta> attributes are using the
> > correct ISO identifier in the charset specification:
> >
> > > <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
>
> The transport level charset would take precedence over the meta.

That said, if the transport level charset label is unrecognizable, maybe the charset should be taken from the meta instead of finalizing to UTF-8, yes.

David Earl

Reporter

Comment 5

•

13 years ago

That would certainly solve this particular case (and indeed if you Save As HTML from TB and view in a browser, it is correct because of this). But what's worrying me is that something prompted the sender's mail client to use this particular form of ISO8859_1; I don't know why, but they must have done this deliberately and got that string from somewhere. The fact that Google recognises it too may mean they are aware of it too - or maybe, since they are using a browser to display, they are just using the meta, so they don't even see there's a problem. Another possibility would be to say if you don't recognise the charset string, to see if you can get a match between a form which has all punctuation (dashes, underscores, periods etc) removed against the same for your table of charset names treated the same way, case-insensitively of course.
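The punctuation-insensitive matching proposed in the comment above could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the tiny `KNOWN_LABELS` table and the `squash` and `resolve_charset` names are hypothetical, and this is not what any Mozilla code does. The collision check reflects the "as long as there is no ambiguity" condition from comment 1.

```python
import re

# Hypothetical, deliberately tiny table of known labels -> canonical names.
# A real client would carry its full alias list here.
KNOWN_LABELS = {
    "iso-8859-1": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "us-ascii": "US-ASCII",
    "utf-8": "UTF-8",
}


def squash(label: str) -> str:
    """Lowercase the label and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


# Index the table by squashed form, refusing entries that would make two
# different charsets indistinguishable after squashing.
SQUASHED = {}
for _label, _canonical in KNOWN_LABELS.items():
    _key = squash(_label)
    if SQUASHED.setdefault(_key, _canonical) != _canonical:
        raise ValueError(f"ambiguous squashed label: {_key}")


def resolve_charset(label: str):
    """Try an exact case-insensitive match, then a squashed match, else None."""
    exact = KNOWN_LABELS.get(label.strip().lower())
    if exact is not None:
        return exact
    return SQUASHED.get(squash(label))


print(resolve_charset("ISO8859_1"))
print(resolve_charset("x-unknown"))
```

The exact match is tried first so that correctly labelled messages never go through the relaxed path; the squashed lookup is only a backstop for labels that would otherwise be rejected.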

Joshua Cranmer [:jcranmer]

Comment 6

•

13 years ago

(In reply to David Earl from comment #5)
> Another possibility would be to say if you don't recognise the charset
> string, to see if you can get a match between a form which has all
> punctuation (dashes, underscores, periods etc) removed against the same for
> your table of charset names treated the same way, case-insensitively of
> course.

There is a strict registration procedure for character sets at <http://www.iana.org/assignments/character-sets/character-sets.xml>; an algorithm for matching character sets that is web-compatible is at <http://encoding.spec.whatwg.org/>. Following your recommendation would be in violation of pretty much every specification that partially occupies this space; given that there are other mechanisms which would correctly detect this issue, I don't think it qualifies as a useful workaround at this point in time.

In this scenario, we should be rejecting the charset type of the protocol as invalid and the GUI should fall back onto the <meta> declaration instead. Actually, given that the default presumed type is US-ASCII, which in practice is best mapped to ISO-8859-1 for decoding, we shouldn't even need to have the meta...
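The fallback order discussed in comments 3, 4 and 6 (header label first, then the <meta> declaration, then a default) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `RECOGNISED` set, the `DEFAULT_CHARSET` value and the `pick_charset` function are hypothetical names, the <meta> scan is a simplistic regular expression rather than a real HTML prescan, and the ISO-8859-1 default is taken from the suggestion in comment 6, not from actual Thunderbird behaviour.

```python
import re

# Hypothetical strict label table; a real client would use its full alias list.
RECOGNISED = {"us-ascii", "iso-8859-1", "utf-8"}

# Assumed default when nothing usable is declared (suggested in comment 6).
DEFAULT_CHARSET = "iso-8859-1"

# Simplistic scan for a charset inside a <meta> tag; not a full HTML prescan.
META_CHARSET = re.compile(r"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def recognised(label):
    """Return the lowercased label if it is in the strict table, else None."""
    if label is None:
        return None
    label = label.strip().strip('"').lower()
    return label if label in RECOGNISED else None


def pick_charset(header_label, html_text=None):
    """Use the header label if valid, then the <meta> label, then the default."""
    from_header = recognised(header_label)
    if from_header:
        return from_header
    if html_text is not None:
        match = META_CHARSET.search(html_text)
        if match:
            from_meta = recognised(match.group(1))
            if from_meta:
                return from_meta
    return DEFAULT_CHARSET


html = '<head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head>'
print(pick_charset('"ISO8859_1"', html))
```

Calling `pick_charset("ISO8859_1")` with no HTML text models a text/plain part: there is no <meta> to consult, so the result depends entirely on the assumed default.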

David Earl

Reporter

Comment 7

•

13 years ago

OK, so what are you going to do with the text/plain alternative multipart content? This example didn't actually have one, but many corresponding examples would. I did find the iana page when researching this, and the key thing is that the erroneous name is not among the aliases. It's also not listed on your second reference, so neither will help. I'm just suggesting a better backstop position when all else fails than a purist "this is an error" response. The worst that happens is you get errors displayed differently if it isn't correct, but still errors. I'm sure if it can happen in this context, it will happen in other contexts which don't have a meta available.

David Earl

Reporter

Comment 8

•

13 years ago

(In reply to rsx11m from comment #2)
> As an interesting observation, the <head><meta> attributes are using the
> correct ISO identifier in the charset specification:
>
> > <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

I imagine that the HTML content of the email was produced in entirely separate software from the software used for the mail distribution.

Joshua Cranmer [:jcranmer]

Comment 9

•

13 years ago

(In reply to David Earl from comment #7)
> OK, so what are you going to do with the text/plain alternative multipart
> content? This example didn't actually have one, but many corresponding
> examples would.

Are you sure that Gmail, Outlook, etc., would actually decode the message properly in such a scenario? I'm willing to bet that most programs delegate the charset decision to whatever library looks up charsets; neither the Java runtime nor the C# runtime appear to do the kind of normalization you talk about, so I highly doubt that we would be interoperable with most mail clients if we implemented your suggestion.

David Earl

Reporter

Comment 10

•

13 years ago

No, I'm not sure, and it would take a bit of effort to find out. One lesser test that might be enlightening is whether gmail still displays the message as intended if the meta charset is removed. But I don't understand where interoperability comes into it: all I'm saying is if TB can't work out how to display a message because it doesn't understand the charset, then make an intelligent guess rather than giving up entirely.

vm.eml, 13 years ago, David Earl, 27.90 KB, text/plain
Adjust charsetalias.properties, 7 years ago, Henri Sivonen (:hsivonen), 3.73 KB, patch
Support more labels, 7 years ago, Henri Sivonen (:hsivonen), 4.76 KB, patch
Add labels and document all of them, 7 years ago, Henri Sivonen (:hsivonen), 7.45 KB, patch, jorgk-bmo : review+
Same patch but left x-mac-NNNNN in place for now, 7 years ago, Jorg K (CEST = GMT+2), 7.61 KB, patch, jorgk-bmo : review+