Closed Bug 712876 Opened 13 years ago Closed 13 years ago

Replace ISO-8859-9 (latin5, etc.) decoder with windows-1254 decoder per HTML5/Encoding spec

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

RESOLVED FIXED
mozilla12

People

(Reporter: GPHemsley, Assigned: emk)

Attachments

(3 files, 2 obsolete files)

According to the recently-spun-off Encoding Standard [1], Gecko does not currently support the full list of aliases for the windows-1254 encoding, which are as follows: "csisolatin5", "iso-8859-9", "iso-ir-148", "l5", "latin5", and "windows-1254". It is noted in [1] that these aliases should already be supported per the HTML(5| Living) Standard. For the most recent version of the Encoding Standard, see [2].

I don't know the implementation details of such a thing, but this seems to me to be a candidate Good First Bug.

[1] http://dvcs.w3.org/hg/encoding/raw-file/8cafea8b65f9/Overview.html#windows-1254
[2] http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#windows-1254
> I don't know the implementation details of such a thing You add the relevant entries to intl/locale/src/charsetalias.properties
Oh, perhaps the problem here is that they're all mapped to ISO-8859-9 instead of windows-1254: http://hg.mozilla.org/mozilla-central/file/ed47a41ba26a/intl/locale/src/charsetalias.properties#l270

    #
    # Aliases for ISO-8859-9
    #
    latin5=ISO-8859-9
    iso_8859-9=ISO-8859-9
    # Currently .properties cannot handle : in key
    #iso_8859-9:1989=ISO-8859-9
    iso-ir-148=ISO-8859-9
    l5=ISO-8859-9
    csisolatin5=ISO-8859-9
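For readers unfamiliar with alias tables, the lookup being discussed can be sketched roughly as follows. This is Python, not Gecko's .properties machinery; the `ALIASES` dict and `resolve_label` helper are hypothetical names, the label list is the one the report quotes from the Encoding Standard, and the whitespace/case normalization is an assumption of the sketch.

```python
# Hypothetical alias table: the labels the report quotes from the
# Encoding Standard, all resolving to windows-1254.
ALIASES = {
    "csisolatin5": "windows-1254",
    "iso-8859-9": "windows-1254",
    "iso-ir-148": "windows-1254",
    "l5": "windows-1254",
    "latin5": "windows-1254",
    "windows-1254": "windows-1254",
}


def resolve_label(label):
    """Return the canonical encoding name for a label, or None if unknown."""
    # Assumption of this sketch: labels are matched after trimming
    # surrounding whitespace and lowercasing.
    return ALIASES.get(label.strip().lower())
```

With a table like this, `resolve_label("Latin5")` and `resolve_label("iso-8859-9")` would both yield windows-1254, which is the mapping the report asks about.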
What label should be used on sending? For example, we use ISO-8859-1 instead of windows-1252 unless the text contains windows-1252-specific characters.
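The sending behavior described in this comment could look something like the sketch below. It is only an illustration in Python using the standard `latin_1` and `cp1252` codecs; `label_for_sending` is a hypothetical helper, and the UTF-8 fallback is an assumption that is not taken from this bug.

```python
def label_for_sending(text):
    """Pick the narrowest label whose codec can represent the text."""
    # Try ISO-8859-1 first, and only move to windows-1252 when the text
    # contains characters that ISO-8859-1 cannot encode.
    for label, codec in (("ISO-8859-1", "latin_1"), ("windows-1252", "cp1252")):
        try:
            text.encode(codec)
            return label
        except UnicodeEncodeError:
            continue
    # Assumed fallback for this sketch only.
    return "UTF-8"
```

Text containing only Latin-1 characters keeps the ISO-8859-1 label, while curly quotes or the euro sign push it to windows-1252.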
Is that difference required? It seems better to always use windows-1252 and windows-1254. Pretty sure that is how it works in Opera.
Our charset converter is not only for the Web browser. It will violate RFCs to always use windows-1252/1254 in mail messages.
How exactly would that violate those RFCs? The sender is in charge of picking the encoding, no?
At least IE9, Chrome for Win, Safari for Win, and Firefox Nightly do not always use windows-1252/1254. I know Opera is the dominant browser in the world, but I don't think it's a good idea to change all other browsers to align with Opera. (This test doesn't work with Opera. Is there a way to detect the internal encoding name on Opera?)
Attachment #584157 - Attachment is patch: false
Attachment #584157 - Attachment mime type: text/plain → text/html
Our behavior is consistent with IE9 and "correct" per IANA registry. Although the Encoding Standard can override any other standards by using the magic word "willful violation", it needs a good reason.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX
When it comes to decoding the actual octets Gecko is the only browser that decodes iso-8859-9 per iso-8859-9 rather than windows-1254. It may be that implementations do not do correct label reporting, but that does not mean they decode it differently. I was just saying that it seems simpler to always use windows-1252/windows-1254 but if there are good reasons not to do that we can certainly change things around, but the original bug as filed still seems accurate.
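To make the decoding difference concrete, here is a small Python sketch using the standard library codecs `iso8859_9` and `cp1254` (Python's names for the two encodings; this is not Gecko code). The sample bytes are chosen for illustration only.

```python
# Bytes 0x80, 0x93 and 0x94 fall in the 0x80-0x9F range where the two
# encodings differ; 0xD0 and 0xFD are Turkish letters shared by both.
data = bytes([0x80, 0x93, 0xD0, 0xFD, 0x94])

# ISO-8859-9 maps 0x80-0x9F to C1 control characters.
as_iso = data.decode("iso8859_9")

# windows-1254 maps the same bytes to printable characters
# (euro sign and curly quotes here).
as_win = data.decode("cp1254")

# From 0xA0 upward the two decoders agree byte for byte.
upper = bytes(range(0xA0, 0x100))
same_upper_half = upper.decode("iso8859_9") == upper.decode("cp1254")
```

A page labeled iso-8859-9 that actually contains such bytes only renders the quotes and euro sign when it is decoded with the windows-1254 table.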
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
We can replace the ISO-8859-9 decoder's mapping table with the windows-1254 table, as we did for the ISO-8859-11 decoder. However, we will not simply turn ISO-8859-9 into an alias of windows-1254. That would affect our mail message headers, and some recipients may not handle the label windows-1254 even if they can handle "incorrectly" labeled ISO-8859-9 messages that actually contain windows-1254-specific characters.
IE actually distinguishes its iso-8859-9 encoder from its windows-1254 encoder, while both decoders are the same.
Summary: Support aliases for windows-1254 encoding (latin5, iso-8859-9, etc.) → Replace iso-8859-9 (, latin5, etc.) decoder with windows-1254 decoder per HTML5/Encoding spec
Sorry, the previous test had a bug. (But the difference is still present)
Attachment #584182 - Attachment is obsolete: true
Attached patch (obsolete)
Assignee: smontagu → VYV03354
Status: REOPENED → ASSIGNED
Attachment #584194 - Flags: review?(smontagu)
I think the Encoding Standard should have an explanation about these "asymmetric" encodings so that browser vendors do not have to reverse engineer to find the trick.
If there is agreement that we should do this for iso-8859-9 / windows-1254 it should certainly reflect that. I'm not convinced this is an actual problem for email clients though. And as far as browsers go Opera and Chrome both use the iso-8859-9 labels to mean windows-1254. However, please file a bug on the Encoding Standard so it can be considered. There's a pointer at the top of the document.
I'm not sure how this bug got so cranky so quickly, but I ask that we all please Assume Good Faith here.[1] I'm fairly certain that Anne's goal is to create an interoperable standard for encodings, not impose Opera's methods on everyone else.

From the point of view of someone not familiar with the inner workings of all these things, it is not clear to me whether the question of "which RFCs would this change violate?" has been answered, nor which IANA registry we are discussing. Also, I was not aware that IANA registries themselves could articulate rules; aren't they just databases of information that are relevant to certain RFCs?

Also, I should note that this new Encoding Standard is barely two weeks old. There is no reason to criticize its contents just yet. File bugs and participate in discussion first.

One final thought: Anne has collected a lot of data about how browsers handle these various encodings.[2][3] They might be worth a look.

[1] http://en.wikipedia.org/wiki/Wikipedia:Assume_good_faith
[2] http://dvcs.w3.org/hg/encoding/raw-file/8cafea8b65f9/single-octet-research.html
[3] http://lists.w3.org/Archives/Public/www-archive/2011Dec/att-0021/encoding-labels.html
Status: ASSIGNED → NEW
Summary: Replace iso-8859-9 (, latin5, etc.) decoder with windows-1254 decoder per HTML5/Encoding spec → Replace ISO-8859-9 (latin5, etc.) decoder with windows-1254 decoder per HTML5/Encoding spec
Status: NEW → ASSIGNED
Comment on attachment 584194 (patch)

Review of attachment 584194:

This is consistent with what we do with other encodings (e.g. EUC-JP[1] and Big5[2]). It doesn't conform with what the HTML5 spec currently says wrt "misinterpreting encodings for compatibility", which expects the misinterpretation to be symmetric. Will https://www.w3.org/Bugs/Public/show_bug.cgi?id=15332 get backported to HTML5?

[1] Bug 600715
[2] Bug 310299
Attachment #584194 - Flags: review?(smontagu) → review+
Attachment #584194 - Attachment is obsolete: true
Attachment #584256 - Flags: review+
Keywords: checkin-needed
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED