Closed Bug 936466 Opened 11 years ago Closed 10 years ago

React to the removal of TIS-620, ISO-8859-11 and ISO-8859-9

Categories

(MailNews Core :: Internationalization, defect)

Tracking

(thunderbird36 fixed)

RESOLVED FIXED
Thunderbird 36.0

People

(Reporter: hsivonen, Assigned: mkmelin)

Attachments

(1 file)

Firefox no longer uses the code for ISO-8859-1, TIS-620, ISO-8859-11, ISO-8859-9 and GB2312, because those labels map to windows-1252, windows-874, windows-874, windows-1254 and gbk, respectively.

I suggest Thunderbird aliases them likewise for mail and we get rid of the code. The main foreseeable problem would be that some other MUA doesn't recognize the windows-1252, windows-874, windows-1254 and gbk labels. Alternatively, please move the code to comm-central.
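
The label mapping described above can be sketched as a small lookup table. This is an illustrative Python sketch, not Gecko code: `LABEL_ALIASES` and `resolve_label` are hypothetical names, and the table holds only the five labels named in this comment.

```python
# Hypothetical alias table covering only the labels named in this bug.
LABEL_ALIASES = {
    "iso-8859-1": "windows-1252",
    "tis-620": "windows-874",
    "iso-8859-11": "windows-874",
    "iso-8859-9": "windows-1254",
    "gb2312": "gbk",
}


def resolve_label(label):
    """Return the encoding name a charset label maps to.

    Labels are matched case-insensitively after trimming whitespace.
    Unknown labels are returned lowercased and otherwise unchanged.
    """
    key = label.strip().lower()
    return LABEL_ALIASES.get(key, key)
```

With a table like this, `resolve_label("ISO-8859-1")` would yield `windows-1252`, so one decoder could serve both labels.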
Why would we want to promote the windows-* nonstandard labels instead of the ISO ones? Could you link the Firefox bug that made this change?

I for one would not understand what windows-1252 is as I was always drilled that I want ISO-8859-1 for western european text.

If the exercise is only to merge the code as it is identical, I am all for it. But we could keep both labels to point to the same decoder/encoder code.
(In reply to :aceman from comment #1)
> Why would we want to promote the windows-* nonstandard labels instead of the
> ISO ones?

The windows-* labels are not non-standard. They are specified in http://encoding.spec.whatwg.org/ and are registered with the IANA, too.

> Could you link the Firefox bug that made this change?

Bug 801402.

> I for one would not understand what windows-1252 is as I was always drilled
> that I want ISO-8859-1 for western european text.

Why? windows-1252 is a superset of ISO-8859-1. To avoid breaking recipients that only do ISO-8859-1? Because of some anti-Microsoft sentiment? Nowadays, people should be drilled to use UTF-8 for all kinds of text. (See bug 862292.)
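
The relationship between the two encodings can be checked with a short Python sketch. It assumes Python's built-in `latin-1` and `cp1252` codecs are acceptable stand-ins for ISO-8859-1 and windows-1252; Gecko's own decoders may treat a few byte values differently. The two agree on the ASCII range and on 0xA0-0xFF, and differ only in 0x80-0x9F, where ISO-8859-1 has C1 control codes and windows-1252 has printable characters.

```python
# Bytes 0x00-0x7F and 0xA0-0xFF decode to the same characters in both codecs.
shared = list(range(0x00, 0x80)) + list(range(0xA0, 0x100))
same_in_both = all(
    bytes([b]).decode("latin-1") == bytes([b]).decode("cp1252") for b in shared
)

# 0x80 is a C1 control code in ISO-8859-1 but the euro sign in windows-1252.
byte_80_latin1 = b"\x80".decode("latin-1")
byte_80_cp1252 = b"\x80".decode("cp1252")
```

So text labelled ISO-8859-1 that uses only printable characters decodes the same under windows-1252, while the reverse is not guaranteed.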

> If the exercise is only to merge the code as it is identical, I am all for
> it. But we could keep both labels to point to the same decoder/encoder code.

I'm okay with Thunderbird opting to do that for email if you do the work. Firefox will treat the non-windows labels as aliases for the windows labels, though.
(In reply to Henri Sivonen (:hsivonen) from comment #0)
> Firefox no longer uses the code for ISO-8859-1, TIS-620, ISO-8859-11,
> ISO-8859-9 and GB2312, because those labels map to windows-1252,
> windows-874, windows-874, windows-1254 and gbk, respectively.
> 
> I suggest Thunderbird aliases them likewise for mail and we get rid of the
> code. The main foreseeable problem would be that some other MUA doesn't
> recognize the windows-1252, windows-874, windows-1254 and gbk labels.
> Alternatively, please move the code to comm-central.

How is the aliasing done? Would nsICharsetConverterManager::GetUnicode*("ISO-8859-1") still work properly? If so, then I don't see a reason to keep separate encoders specifically for Thunderbird.

We probably ought to prefer writing ISO-8859-1 instead of Windows-1252 as the charset in our bodies (IIRC, the MIME specifications mandate support only for ISO-8859-* charsets), but I suspect nearly every MUA is using an external charset conversion library, so the equivalencies here shouldn't be an issue. If it is an issue, it might be better worked around by hard-coding some preferred aliases.
(In reply to Joshua Cranmer [:jcranmer] from comment #3)
> (In reply to Henri Sivonen (:hsivonen) from comment #0)
> > Firefox no longer uses the code for ISO-8859-1, TIS-620, ISO-8859-11,
> > ISO-8859-9 and GB2312, because those labels map to windows-1252,
> > windows-874, windows-874, windows-1254 and gbk, respectively.
> > 
> > I suggest Thunderbird aliases them likewise for mail and we get rid of the
> > code. The main foreseeable problem would be that some other MUA doesn't
> > recognize the windows-1252, windows-874, windows-1254 and gbk labels.
> > Alternatively, please move the code to comm-central.
> 
> How is the aliasing done?

In Firefox, Encoding Standard-compliant aliasing is done using mozilla::dom::EncodingUtils::FindEncodingForLabel(). This frees up the old alias table in charsetalias.properties for other uses, so Thunderbird could do its aliasing in charsetalias.properties or introduce a (comm-central-specific) new mechanism analogous to mozilla::dom::EncodingUtils::FindEncodingForLabel().

> Would
> nsICharsetConverterManager::GetUnicode*("ISO-8859-1") still work properly?

"Properly" might be understood differently by different people. nsICharsetConverterManager::GetUnicode[En|De]coder() go through charsetalias.properties as before. As before, nsICharsetConverterManager::GetUnicode[En|De]coderRaw() don't.

Bug 919935 is about introducing a deCOMtaminated replacement for nsICharsetConverterManager::GetUnicode[En|De]coderRaw() for use in Firefox.

> If so, then I don't see a reason to keep separate encoders specifically for
> Thunderbird.

Depends on whether you are worried about emitting windows-125*-only or GBK-only bytes under ISO-8859-* or GB2312 labels. You probably shouldn't be too worried, since Thunderbird already intentionally encodes in windows-1252 when supposedly sending ISO-8859-1.

> We probably ought to prefer writing ISO-8859-1 instead of Windows-1252 as
> the charset in our bodies

Do you mean "headers" instead of "bodies"?

> (IIRC, the MIME specifications mandate support
> only for ISO-8859-* charsets),

I see no such requirement in http://tools.ietf.org/html/rfc2045 . Where's the requirement? And even if there was such a requirement, would it have any practical effect?

The windows-125* labels and gbk are IANA-registered just like ISO-8859-*. https://www.iana.org/assignments/character-sets/character-sets.xhtml So in that sense, it's not wrong to use those labels in MIME.

> but I suspect nearly every MUA is using an
> external charset conversion library, so the equivalencies here shouldn't be
> an issue.

Better yet, you could sidestep the entire issue by using UTF-8 for outgoing email...

> If it is an issue, it might be better worked around by hard-coding
> some preferred aliases.

Yes.
(In reply to Henri Sivonen (:hsivonen) from comment #4)
> > We probably ought to prefer writing ISO-8859-1 instead of Windows-1252 as
> > the charset in our bodies
> 
> Do you mean "headers" instead of "bodies"?

What I meant is we ought to prefer labelling the charset. My plan for headers is to force them to UTF-8 unconditionally, but I'm not quite ready to make that jump for bodies.

> > (IIRC, the MIME specifications mandate support
> > only for ISO-8859-* charsets),
> 
> I see no such requirement in http://tools.ietf.org/html/rfc2045 . Where's
> the requirement? And even if there was such a requirement, would it have any
> practical effect?

RFC 2047, section 3 recommends that ISO-8859-* be used in preference if there is no "private agreements between sender and recipients of a message." RFC 2047 section 7 further says that "For the ISO-8859-* character sets, the mail reading program must at least be able to display the characters which are also in the ASCII set."
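
For readers unfamiliar with RFC 2047, the encoded-word form it defines for headers can be illustrated with Python's standard `email.header` module. This is only a sketch of the header syntax under discussion, not of Thunderbird's implementation, and the sample word is arbitrary.

```python
from email.header import Header, decode_header

# Encode a header value containing a non-ASCII Latin-1 character.
# The charset label travels inside the encoded-word itself.
encoded = Header("p\xf6stal", "iso-8859-1").encode()

# Decode it back into (bytes, charset) pairs; the charset is whatever
# label the sender wrote, so the receiver must recognise that label.
decoded = decode_header(encoded)
raw_bytes, charset = decoded[0]
text = raw_bytes.decode(charset)
```

The label chosen by the sender (`iso-8859-1` here) is exactly what the receiving MUA has to map to a decoder, which is why the choice of outgoing label matters.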

> The windows-125* labels and gbk are IANA-registered just like ISO-8859-*.
> https://www.iana.org/assignments/character-sets/character-sets.xhtml So in
> that sense, it's not wrong to use those labels in MIME.

That list also includes a lot of charsets which are not likely to find widespread support, e.g., Adobe-Symbol-Encoding.

> Better yet, you could sidestep the entire issue by using UTF-8 for outgoing
> email...

This is a major step, and I'm not quite ready to make that step yet. My concern here is East Asian locales--primarily Japan and China--as well as people running old Usenet servers that assume some legacy charset instead of UTF-8.
(In reply to Joshua Cranmer [:jcranmer] from comment #5)
> (In reply to Henri Sivonen (:hsivonen) from comment #4)
> > > We probably ought to prefer writing ISO-8859-1 instead of Windows-1252 as
> > > the charset in our bodies
> > 
> > Do you mean "headers" instead of "bodies"?
> 
> What I meant is we ought to prefer labelling the charset. My plan for
> headers is to force them to UTF-8 unconditionally, but I'm not quite ready
> to make that jump for bodies.

I wonder how making headers UTF-8 unconditionally is going to work for the feature phones in Japan that, according to rumor, can't handle UTF-8 in the message body.

> > > (IIRC, the MIME specifications mandate support
> > > only for ISO-8859-* charsets),
> > 
> > I see no such requirement in http://tools.ietf.org/html/rfc2045 . Where's
> > the requirement? And even if there was such a requirement, would it have any
> > practical effect?
> 
> RFC 2047, section 3 recommends that ISO-8859-* be used in preference if
> there is no "private agreements between sender and recipients of a message."
> RFC 2047 section 7 further says that "For the ISO-8859-* character sets, the
> mail reading program must at least be able to display the characters which
> are also in the ASCII set."

OK. Interesting. And annoying, considering that there are ISO-8859-* encodings that were minted after RFC 2047 (and after UTF-8, for shame!). The Breton, German and Italian localizations of Thunderbird default to ISO-8859-15, which can't have any compatibility justification that the writers of RFC 2047 had in mind. :-(

> This is a major step, and I'm not quite ready to make that step yet. My
> concern here is East Asian locales--primarily Japan and China--as well as
> people running old Usenet servers that assume some legacy charset instead of
> UTF-8.

You could start with bug 941545 (and with other ISO-8859-* locales) without risking Japan and China. (zh-TW Thunderbird is already defaulting to UTF-8!)
(In reply to Henri Sivonen (:hsivonen) from comment #6)
> > RFC 2047, section 3 recommends that ISO-8859-* be used in preference if
> > there is no "private agreements between sender and recipients of a message."
> > RFC 2047 section 7 further says that "For the ISO-8859-* character sets, the
> > mail reading program must at least be able to display the characters which
> > are also in the ASCII set."
> 
> OK. Interesting. And annoying, considering that there are ISO-8859-*
> encodings that were minted after RFC 2047 (and after UTF-8, for shame!).

Also, TIS-620 is an older label than ISO-8859-11 and IANA prefers TIS-620, so if you don't want the windows-874 label for that one, you probably don't really want the ISO-8859-11 label, either.
Depends on: 1070992
Bug 1070992 removed TIS-620, ISO-8859-11 and ISO-8859-9 as Gecko-canonical encodings from m-c.

Please do one of the following:
 1) Adjust charsetalias.properties to make TIS-620 and ISO-8859-11 aliases of windows-874 and ISO-8859-9 an alias of windows-1254.
 2) Adjust charsetalias.properties per above *and* add logic to use the TIS-620 and ISO-8859-9 labels instead of the windows-874 and windows-1254 labels for outgoing email. (As noted in comment 7, ISO-8859-11 isn't even in the IANA registry.)
 3) Import the removed code into c-c.

I prefer #1 (and, of course, would prefer even more using UTF-8 for outgoing email).
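
The difference between options 1 and 2 can be sketched in Python. All names here (`INCOMING_ALIASES`, `OUTGOING_LABELS`, `canonical_encoding`, `outgoing_label`) are hypothetical and the tables contain only the labels from this comment; the real change would live in charsetalias.properties and the compose code.

```python
# Option 1: incoming labels collapse to the remaining canonical encodings.
INCOMING_ALIASES = {
    "tis-620": "windows-874",
    "iso-8859-11": "windows-874",
    "iso-8859-9": "windows-1254",
}

# Option 2 only: preferred labels to write into outgoing mail.
# ISO-8859-11 is deliberately absent, so Thai mail is labelled TIS-620.
OUTGOING_LABELS = {
    "windows-874": "TIS-620",
    "windows-1254": "ISO-8859-9",
}


def canonical_encoding(label):
    """Map an incoming label to the encoding used internally."""
    key = label.strip().lower()
    return INCOMING_ALIASES.get(key, key)


def outgoing_label(label):
    """Pick the label to emit when sending (option 2 behaviour)."""
    canonical = canonical_encoding(label)
    return OUTGOING_LABELS.get(canonical, canonical)
```

Under option 1 only `canonical_encoding` would exist, so a reply to a TIS-620 message would go out labelled windows-874. Option 2 adds the second table so that the older labels are still what recipients see.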
Blocks: cc-backlog
Summary: Alias ISO-8859-1, TIS-620, ISO-8859-11, ISO-8859-9 and GB2312 per the Encoding Standard, get rid of dedicated encoders/decoders → React to the removal of TIS-620, ISO-8859-11 and ISO-8859-9
I guess this should be it. 

I'm a bit confused though, as I don't see any difference. Things work on trunk and things work after this patch - but exactly the same and using the charset this bug says it should. (Tried viewing, replying)
Assignee: nobody → mkmelin+mozilla
Status: NEW → ASSIGNED
Attachment #8495434 - Flags: review?(Pidgeot18)
Attachment #8495434 - Flags: feedback?(hsivonen)
Comment on attachment 8495434 [details] [diff] [review]
bug936466_react_to_tis620_etc_removal.patch

Review of attachment 8495434 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Magnus Melin from comment #10)
> I'm a bit confused though, as I don't see any difference. Things work on
> trunk and things work after this patch - but exactly the same and using the
> charset this bug says it should. (Tried viewing, replying)

So before and after windows-1254 is used when replying to ISO-8859-9 email and windows-874 is used when replying to TIS-620 email? Maybe that's an effect of the patch jcranmer made to limit the outgoing encodings to the ones in the menu. I guess that one uses the browser label mappings.

Did you check in a debugger if the ISO-8859-9 changes to windows-1254 at the point of reply without this patch but at the point of opening the email with this patch?

::: mailnews/intl/charsetalias.properties
@@ +297,5 @@
> +iso_8859-9=windows-1254
> +iso_8859-9:1989=windows-1254
> +iso-ir-148=windows-1254
> +l5=windows-1254
> +csisolatin5=windows-1254

This is missing the line
iso-8859-9=windows-1254
with a hyphen between iso and 8859 that is. f+ with that fixed.
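
A check for this kind of omission could look like the following Python sketch. It is an assumption-laden simplification: `parse_properties` is a hypothetical helper that splits only on `=` and ignores the other rules of the real .properties format, and the input is just the five lines quoted above.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    aliases = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        aliases[key.strip().lower()] = value.strip()
    return aliases


# The lines added by the patch under review, as quoted in this comment.
PATCH_LINES = """\
iso_8859-9=windows-1254
iso_8859-9:1989=windows-1254
iso-ir-148=windows-1254
l5=windows-1254
csisolatin5=windows-1254
"""

aliases = parse_properties(PATCH_LINES)

# Both spellings should be present; report the ones that are not.
missing = [label for label in ("iso-8859-9", "iso_8859-9") if label not in aliases]
```

Run against the quoted lines, `missing` contains only the hyphenated `iso-8859-9` spelling, which is the entry requested here.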
Attachment #8495434 - Flags: feedback?(hsivonen) → feedback+
(In reply to Henri Sivonen (:hsivonen) from comment #11)
> So before and after windows-1254 is used when replying to ISO-8859-9 email
> and windows-874 is used when replying to TIS-620 email? 

Yes.

> Did you check in a debugger if the ISO-8859-9 changes to windows-1254 at the
> point of reply without this patch but at the point of opening the email with
> this patch?

msgHdr.Charset is "ISO-8859-9" with and without this patch at display time. I think that's taken directly from the message source though, as for a TIS-620 message it's still "TIS-620" after the patch - but "TIS-620" no longer exists w/ that casing when the patch is applied.
Comment on attachment 8495434 [details] [diff] [review]
bug936466_react_to_tis620_etc_removal.patch

Review of attachment 8495434 [details] [diff] [review]:
-----------------------------------------------------------------

So, in reviewing this patch, it struck me that perhaps a more tenable long-term approach is to make the mailnews charset management do an EncodingUtils++ kind of scenario--we default to following EncodingUtils, and fall back to a smaller list of non-EncodingUtils-based charsets outside of that (hopefully, only UTF-7, though I think we need to carry some of the x-mac-* stuff until post-38).
Attachment #8495434 - Flags: review?(Pidgeot18) → review+
(In reply to Joshua Cranmer [:jcranmer] from comment #14)
> So, in reviewing this patch, it struck me that perhaps a more tenable
> long-term approach is to make the mailnews charset management do
> an EncodingUtils++ kind of scenario--we default to following EncodingUtils, and
> fall back to a smaller list of non-EncodingUtils-based charsets outside of
> that (hopefully, only UTF-7, though I think we need to carry some of the
> x-mac-* stuff until post-38).

Filed bug 1074125. (I suggest consulting charsetalias.properties first and then consulting EncodingUtils, in case c-c ends up importing one of ISO-2022-KR, ISO-2022-CN or HZ-GB-2312.)
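
The suggested lookup order can be sketched as a two-stage lookup in Python. `find_encoding` and both tables are hypothetical, and the data is illustrative only: the shared table stands in for EncodingUtils, and its `replacement` entry for hz-gb-2312 follows the Encoding Standard rather than any particular Gecko version.

```python
def find_encoding(label, mail_aliases, shared_aliases):
    """Consult the mail-specific table first, then the shared one.

    Returns None when neither table knows the label.
    """
    key = label.strip().lower()
    if key in mail_aliases:
        return mail_aliases[key]
    return shared_aliases.get(key)


# Illustrative data only: a mail-specific entry shadows the shared mapping.
MAIL_ALIASES = {"hz-gb-2312": "HZ-GB-2312"}
SHARED_ALIASES = {"hz-gb-2312": "replacement", "iso-8859-9": "windows-1254"}
```

Checking the mail-specific table first means an encoding imported into c-c wins over the browser's mapping for the same label, while every other label still follows the shared table.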
(In reply to Magnus Melin from comment #13)
> msgHdr.Charset is "ISO-8859-9" with and without this patch at display time.

That's unfortunate. I think it would make sense to resolve the labels into Gecko-canonical names as close to the initial message parsing as possible to avoid subtle bugs arising from non-canonical labels flowing inside the app internals. (The browser resolves labels as early as possible.)
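
Resolving the label at parse time might look like the following Python sketch, which uses the standard `email` package in place of Thunderbird's MIME parser. `CANONICAL` and `parsed_charset` are hypothetical names, and the table holds only two of the labels from this bug.

```python
from email import message_from_string

# Hypothetical canonicalisation table, limited to labels from this bug.
CANONICAL = {"iso-8859-9": "windows-1254", "tis-620": "windows-874"}


def parsed_charset(raw_message):
    """Return the canonical charset of a message, resolved while parsing."""
    msg = message_from_string(raw_message)
    label = msg.get_content_charset()  # lowercased label, or None if absent
    if label is None:
        return None
    return CANONICAL.get(label, label)


RAW = "MIME-Version: 1.0\nContent-Type: text/plain; charset=ISO-8859-9\n\nbody\n"
charset = parsed_charset(RAW)
```

Code after this point only ever sees `windows-1254`, never the original `ISO-8859-9` spelling, so later comparisons do not depend on how the sender wrote the label.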
https://hg.mozilla.org/comm-central/rev/9c786619a5f4 -> FIXED
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 36.0
No longer blocks: cc-backlog
The upshot is that Thunderbird no longer supports ISO-8859-1! (and it is _not_ a subset of the Microsoft-proprietary set windows-1252; both have 256 distinct assignments)
I got a message in Portuguese. In Thunderbird's rendering of the comment with "From" (I do not know the right name for it) there were substitution question marks, but when I viewed the message under "View Source" I saw that they were non-ASCII Latin-1 characters, and the message had
MIME-Version: 1.0
Content-type: text/html; charset=iso-8859-1
Thunderbird misrendered this message.