Last Comment Bug 712876 - Replace ISO-8859-9 (latin5, etc.) decoder with windows-1254 decoder per HTML5/Encoding spec
Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: Internationalization
Version: unspecified
Platform: All All
Importance: -- normal
Target Milestone: mozilla12
Assigned To: Masatoshi Kimura [:emk]
http://dvcs.w3.org/hg/encoding/raw-fi...
Depends on:
Blocks: 562096
Reported: 2011-12-21 23:14 PST by Gordon P. Hemsley [:GPHemsley]
Modified: 2011-12-28 11:10 PST (History)


Attachments
Encoding label selection test (1.73 KB, text/html)
2011-12-23 20:07 PST, Masatoshi Kimura [:emk]
Compare iso-8859-9 encoder vs. windows-1254 encoder (641 bytes, application/hta)
2011-12-24 04:34 PST, Masatoshi Kimura [:emk]
Compare iso-8859-9 encoder vs. windows-1254 encoder (641 bytes, application/hta)
2011-12-24 04:57 PST, Masatoshi Kimura [:emk]
patch (6.67 KB, patch)
2011-12-24 07:09 PST, Masatoshi Kimura [:emk]
smontagu: review+
patch for check in. r=smontagu (6.76 KB, patch)
2011-12-25 05:33 PST, Masatoshi Kimura [:emk]
VYV03354: review+

Description Gordon P. Hemsley [:GPHemsley] 2011-12-21 23:14:37 PST
According to the recently-spun-off Encoding Standard [1], Gecko does not currently support the full list of aliases for the windows-1254 encoding, which are as follows:
"csisolatin5", "iso-8859-9", "iso-ir-148", "l5", "latin5", and "windows-1254". 

It is noted in [1] that these aliases should already be supported per the HTML(5| Living) Standard.

For the most recent version of the Encoding Standard, see [2].

I don't know the implementation details of such a thing, but this seems to me to be a candidate Good First Bug.

[1] http://dvcs.w3.org/hg/encoding/raw-file/8cafea8b65f9/Overview.html#windows-1254
[2] http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#windows-1254
Comment 1 Boris Zbarsky [:bz] (Out June 25-July 6) 2011-12-21 23:16:12 PST
> I don't know the implementation details of such a thing

You add the relevant entries to intl/locale/src/charsetalias.properties
Comment 2 Gordon P. Hemsley [:GPHemsley] 2011-12-21 23:29:12 PST
Oh, perhaps the problem here is that they're all mapped to ISO-8859-9 instead of windows-1254:
http://hg.mozilla.org/mozilla-central/file/ed47a41ba26a/intl/locale/src/charsetalias.properties#l270

   270 #
   271 # Aliases for ISO-8859-9
   272 #
   273 latin5=ISO-8859-9
   274 iso_8859-9=ISO-8859-9
   275 # Currently .properties cannot handle : in key
   276 #iso_8859-9:1989=ISO-8859-9
   277 iso-ir-148=ISO-8859-9
   278 l5=ISO-8859-9
   279 csisolatin5=ISO-8859-9
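
For illustration only, here is a minimal Python sketch of what the Encoding Standard asks for: every label listed in the description resolves to windows-1254. The names `WINDOWS_1254_LABELS`, `ALIASES`, and `resolve_label` are hypothetical and are not Gecko code; the real change would be made in charsetalias.properties.

```python
# Hypothetical alias table, loosely modeled on charsetalias.properties.
# The labels are the ones quoted from the Encoding Standard in the description.
WINDOWS_1254_LABELS = (
    "csisolatin5",
    "iso-8859-9",
    "iso-ir-148",
    "l5",
    "latin5",
    "windows-1254",
)

# Map every label to the single canonical encoding name.
ALIASES = {label: "windows-1254" for label in WINDOWS_1254_LABELS}


def resolve_label(label):
    """Return the canonical encoding name for a label, or None if unknown.

    Labels are matched case-insensitively, ignoring surrounding whitespace.
    """
    return ALIASES.get(label.strip().lower())
```

With this table, `resolve_label("Latin5")` and `resolve_label("iso-8859-9")` both give `"windows-1254"`, whereas the properties file quoted above sends them to ISO-8859-9.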
Comment 3 Masatoshi Kimura [:emk] 2011-12-22 02:53:04 PST
What label should be used on sending?
For example, we use ISO-8859-1 instead of windows-1252 unless the text contains windows-1252-specific characters.
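
A rough Python sketch of that label selection, assuming the sender simply tries the narrower encoding first and falls back to the wider one. The function name `choose_send_label` and the final UTF-8 fallback are illustrative assumptions, not a description of Gecko's actual code.

```python
def choose_send_label(text, candidates=("ISO-8859-1", "windows-1252")):
    """Pick the first candidate label whose encoder can represent the text.

    The candidates are tried in order, so the narrower ISO label wins
    unless the text needs characters only the Windows code page has.
    Falling back to UTF-8 is an assumption made for this sketch.
    """
    for label in candidates:
        try:
            text.encode(label)
        except UnicodeEncodeError:
            continue
        return label
    return "UTF-8"
```

The same function can be called with `candidates=("ISO-8859-9", "windows-1254")` for the Turkish pair discussed in this bug.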
Comment 4 Anne (:annevk) 2011-12-22 05:46:34 PST
Is that difference required? It seems better to always use windows-1252 and windows-1254. Pretty sure that is how it works in Opera.
Comment 5 Masatoshi Kimura [:emk] 2011-12-22 15:12:00 PST
Our charset converter is not only for the Web browser.
It will violate RFCs to always use windows-1252/1254 in mail messages.
Comment 6 Anne (:annevk) 2011-12-23 03:32:33 PST
How exactly would that violate those RFCs? The sender is in charge of picking the encoding, no?
Comment 7 Masatoshi Kimura [:emk] 2011-12-23 20:07:53 PST
Created attachment 584157 [details]
Encoding label selection test

At least IE9, Chrome for Win, Safari for Win, and Firefox Nightly do not always use windows-1252/1254. I know Opera is the dominant browser in the world, but I don't think it's a good idea to change all other browsers to align with Opera.
(This test doesn't work with Opera. Is there a way to detect the internal encoding name on Opera?)
Comment 8 Masatoshi Kimura [:emk] 2011-12-23 20:44:09 PST
Our behavior is consistent with IE9 and "correct" per IANA registry. Although the Encoding Standard can override any other standards by using the magic word "willful violation", it needs a good reason.
Comment 9 Anne (:annevk) 2011-12-24 03:02:36 PST
When it comes to decoding the actual octets Gecko is the only browser that decodes iso-8859-9 per iso-8859-9 rather than windows-1254. It may be that implementations do not do correct label reporting, but that does not mean they decode it differently.

I was just saying that it seems simpler to always use windows-1252/windows-1254 but if there are good reasons not to do that we can certainly change things around, but the original bug as filed still seems accurate.
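
To see where the two decoders can disagree, one can compare them byte by byte. The sketch below uses Python's built-in `iso8859_9` and `cp1254` codecs as stand-ins for the browser decoders, which is an assumption; the helper names are hypothetical.

```python
def decode_byte(value, codec):
    """Decode a single byte with the given codec; None if the byte is undefined."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


def differing_bytes(codec_a="iso8859_9", codec_b="cp1254"):
    """List the byte values that the two codecs decode differently."""
    return [
        value
        for value in range(256)
        if decode_byte(value, codec_a) != decode_byte(value, codec_b)
    ]


if __name__ == "__main__":
    for value in differing_bytes():
        print(hex(value),
              repr(decode_byte(value, "iso8859_9")),
              repr(decode_byte(value, "cp1254")))
```

In Python's tables the differences are confined to the 0x80-0x9F range, where ISO-8859-9 has C1 control characters and windows-1254 has printable characters such as the euro sign.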
Comment 10 Masatoshi Kimura [:emk] 2011-12-24 04:13:29 PST
We can replace the ISO-8859-9 decoder's mapping table with the windows-1254 one, as we did for the ISO-8859-11 decoder.
However, we will not simply turn ISO-8859-9 into an alias of windows-1254. That would affect our mail message headers, and some recipients may not handle the label windows-1254 even if they can handle "incorrectly" labeled ISO-8859-9 messages which actually contain windows-1254-specific characters.
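
A minimal Python sketch of this asymmetric arrangement, assuming the decoder for the ISO-8859-9 label uses the windows-1254 table while the encoder stays strict ISO-8859-9. The function names are hypothetical, and mapping bytes that windows-1254 leaves undefined to the same-valued code point is an assumption of this sketch.

```python
def decode_iso_8859_9_label(data):
    """Decode bytes labeled ISO-8859-9 using the windows-1254 table.

    Bytes that cp1254 leaves undefined fall back to the code point with
    the same value (an assumption made for this sketch).
    """
    chars = []
    for value in data:
        try:
            chars.append(bytes([value]).decode("cp1254"))
        except UnicodeDecodeError:
            chars.append(chr(value))
    return "".join(chars)


def encode_iso_8859_9_label(text):
    """Encode with the strict ISO-8859-9 table.

    Raises UnicodeEncodeError for windows-1254-only characters, so the
    caller can choose a different label for outgoing text.
    """
    return text.encode("iso8859_9")
```

Here `decode_iso_8859_9_label(b"\x80")` yields the euro sign, but `encode_iso_8859_9_label` refuses to produce it, so outgoing mail keeps the ISO-8859-9 label only when the text really fits that encoding.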
Comment 11 Masatoshi Kimura [:emk] 2011-12-24 04:34:25 PST
Created attachment 584182 [details]
Compare iso-8859-9 encoder vs. windows-1254 encoder

IE's iso-8859-9 encoder and windows-1254 encoder actually differ, while the two decoders are the same.
Comment 12 Masatoshi Kimura [:emk] 2011-12-24 04:57:38 PST
Created attachment 584185 [details]
Compare iso-8859-9 encoder vs. windows-1254 encoder

Sorry, the previous test had a bug. (But the difference is still present)
Comment 13 Masatoshi Kimura [:emk] 2011-12-24 07:09:11 PST
Created attachment 584194 [details] [diff] [review]
patch
Comment 14 Masatoshi Kimura [:emk] 2011-12-24 07:10:00 PST
I think the Encoding Standard should have an explanation about these "asymmetric" encodings so that browser vendors do not have to reverse engineer to find the trick.
Comment 15 Masatoshi Kimura [:emk] 2011-12-24 07:16:30 PST
https://tbpl.mozilla.org/?tree=Try&rev=8d8012438396
Comment 16 Anne (:annevk) 2011-12-24 07:16:56 PST
If there is agreement that we should do this for iso-8859-9 / windows-1254 it should certainly reflect that. I'm not convinced this is an actual problem for email clients though. And as far as browsers go Opera and Chrome both use the iso-8859-9 labels to mean windows-1254. However, please file a bug on the Encoding Standard so it can be considered. There's a pointer at the top of the document.
Comment 17 Masatoshi Kimura [:emk] 2011-12-24 07:41:31 PST
Filed W3C bug 15332.
Comment 18 Gordon P. Hemsley [:GPHemsley] 2011-12-24 08:32:37 PST
I'm not sure how this bug got so cranky so quickly, but I ask that we all please Assume Good Faith here.[1] I'm fairly certain that Anne's goal is to create an interoperable standard for encodings, not impose Opera's methods on everyone else.

From the point of view of someone not familiar with the inner workings of all these things, it is not clear to me whether the question of "which RFCs would this change violate?" has been answered, nor which IANA registry we are discussing. Also, I was not aware that IANA registries themselves could articulate rules; aren't they just databases of information that are relevant to certain RFCs?

Also, I should note that this new Encoding Standard is barely two weeks old. There is no reason to criticize its contents just yet. File bugs and participate in discussion first.

One final thought: Anne has collected a lot of data about how browsers handle these various encodings.[2][3] They might be worth a look.

[1] http://en.wikipedia.org/wiki/Wikipedia:Assume_good_faith
[2] http://dvcs.w3.org/hg/encoding/raw-file/8cafea8b65f9/single-octet-research.html
[3] http://lists.w3.org/Archives/Public/www-archive/2011Dec/att-0021/encoding-labels.html
Comment 19 Simon Montagu :smontagu 2011-12-25 01:26:24 PST
Comment on attachment 584194 [details] [diff] [review]
patch

Review of attachment 584194 [details] [diff] [review]:
-----------------------------------------------------------------

This is consistent with what we do with other encodings (e.g. EUC-JP[1] and Big5[2]). It doesn't conform with what the HTML5 spec currently says wrt "misinterpreting encodings for compatibility", which expects the misinterpretation to be symmetric. Will https://www.w3.org/Bugs/Public/show_bug.cgi?id=15332 get backported to HTML5?

[1]Bug 600715
[2]Bug 310299
Comment 20 Masatoshi Kimura [:emk] 2011-12-25 05:33:30 PST
Created attachment 584256 [details] [diff] [review]
patch for check in. r=smontagu
Comment 22 Matt Brubeck (:mbrubeck) 2011-12-28 11:10:12 PST
https://hg.mozilla.org/mozilla-central/rev/4fb24658d1f2