Closed Bug 562091 Opened 10 years ago Closed 7 years ago

Make Unicode => EUC-KR converter identical to Unicode => UHC / Windows-949

Categories

(Core :: Internationalization, defect)

RESOLVED FIXED
mozilla19

People

(Reporter: jshin1987, Assigned: emk)

References

(Blocks 1 open bug)

Attachments

(1 file, 1 obsolete file)

Historically, we have been stricter than other browsers when it comes to EUC-KR.

1. In the ToUnicode direction, we are as lenient as other browsers: we accept code points outside the 94x94 grid and treat them as Windows-949 (UHC). In Windows-949, those 2-byte sequences are used for the Hangul syllables outside KS X 1001 (8,822 of them in total, i.e. 11,172 - 2,350).

   In addition, we can also convert Hangul syllables represented as 8-byte sequences, as specified in KS X 1001.

2. In the FromUnicode direction, when the output encoding is EUC-KR (as opposed to Windows-949), we convert those 8,822 Hangul syllables to 8-byte sequences instead of the 2-byte sequences used in Windows-949.
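
The difference between the two FromUnicode behaviours can be reproduced outside Gecko. The following is a minimal sketch that assumes CPython's bundled `euc_kr` and `cp949` codecs, which (as far as I know) follow the same strict/lenient split described above; it says nothing about Gecko's own converter code. The helper names `encoded_lengths` and `count_outside_ksx1001` are made up for this illustration.

```python
# Illustration with CPython's bundled codecs, NOT Gecko's converters.
# Assumption: CPython's "euc_kr" codec emits the KS X 1001 8-byte form for
# syllables outside KS X 1001, while "cp949" uses the 2-byte UHC extension.

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # Unicode precomposed Hangul syllables


def encoded_lengths(ch):
    """Return (EUC-KR byte length, Windows-949 byte length) for one character."""
    return len(ch.encode("euc_kr")), len(ch.encode("cp949"))


def count_outside_ksx1001():
    """Count syllables that the strict EUC-KR codec does not encode in 2 bytes."""
    return sum(
        1
        for cp in range(HANGUL_FIRST, HANGUL_LAST + 1)
        if len(chr(cp).encode("euc_kr")) != 2
    )


sample = "\ub620"  # a Hangul syllable that is not part of KS X 1001
strict_len, uhc_len = encoded_lengths(sample)

# The strict codec should also be able to read back its own 8-byte output.
roundtrip_ok = sample.encode("euc_kr").decode("euc_kr") == sample

print(strict_len, uhc_len, roundtrip_ok, count_outside_ksx1001())
```

A receiver that only understands the 2-byte Windows-949 form will not reassemble the longer output into a single syllable, which is the interoperability concern behind this bug.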


We can leave the ToUnicode direction as it is now because some web pages contain 8-byte sequences. They were mainly generated by Firefox users posting to forums in EUC-KR; mozilla.or.kr, for instance, has some postings with them.

However, in the FromUnicode direction, I think we have to stop being so strict about EUC-KR, especially considering that HTML5 stipulates that EUC-KR be treated as a synonym for Windows-949.

I see no problem at all with this change in Firefox (and other Gecko-based browsers). It might be problematic in some cases for Thunderbird, but I expect it to be OK in the vast majority of cases. Testing with Outlook/Outlook Express and popular webmail services in Korea may be necessary.

HTML5 has this about the issue: http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html

---------------

When a user agent would otherwise use an encoding given in the first column of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row. When a byte or sequence of bytes is treated differently due to this encoding aliasing, it is said to have been misinterpreted for compatibility.

Character encoding overrides
Input encoding    Replacement encoding   References
EUC-KR            windows-949            [EUCKR] [WIN949]
GB2312            GBK                    [RFC1345] [GBK]
GB_2312-80        GBK                    [RFC1345] [GBK]
ISO-8859-1        windows-1252           [RFC1345] [WIN1252]
ISO-8859-9        windows-1254           [RFC1345] [WIN1254]
ISO-8859-11       windows-874            [ISO885911] [WIN874]
KS_C_5601-1987    windows-949            [RFC1345] [WIN949]
Shift_JIS         Windows-31J            [SHIFTJIS] [WIN31J]
TIS-620           windows-874            [TIS620] [WIN874]
US-ASCII          windows-1252           [RFC1345] [WIN1252]

--------------------
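
The override rule quoted above amounts to a label lookup performed before any conversion. Below is a rough sketch of that lookup, not code from any browser. The codec names on the right-hand side (`cp949`, `gbk`, `cp1252`, `cp1254`, `cp874`, `cp932`) are Python's names, chosen here as stand-ins for the replacement encodings in the table, and `effective_codec` and `decode_with_override` are hypothetical helpers.

```python
import codecs

# Rows of the quoted override table. The values are Python codec names used
# as stand-ins for the replacement encodings (an assumption of this sketch).
OVERRIDES = {
    "euc-kr": "cp949",
    "gb2312": "gbk",
    "gb_2312-80": "gbk",
    "iso-8859-1": "cp1252",
    "iso-8859-9": "cp1254",
    "iso-8859-11": "cp874",
    "ks_c_5601-1987": "cp949",
    "shift_jis": "cp932",
    "tis-620": "cp874",
    "us-ascii": "cp1252",
}


def effective_codec(label):
    """Map a declared encoding label to the codec that is actually used."""
    name = OVERRIDES.get(label.strip().lower(), label)
    return codecs.lookup(name).name


def decode_with_override(data, label):
    """Decode bytes under the replacement encoding for the declared label."""
    return data.decode(effective_codec(label))


# Content declared as EUC-KR but containing a 2-byte UHC extension sequence
# still decodes, because the label is overridden before decoding.
uhc_bytes = "\ub620".encode("cp949")
decoded = decode_with_override(uhc_bytes, "EUC-KR")
```

Labels that are absent from the table pass through unchanged, so `effective_codec("utf-8")` simply resolves to Python's own UTF-8 codec.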

I filed bug 562096 for the other charsets (I filed EUC-KR vs Windows-949 separately because it has an additional complication).
Blocks: encoding
Assignee: smontagu → VYV03354
Status: NEW → ASSIGNED
Attachment #680396 - Flags: review?(smontagu)
Comment on attachment 680396 [details] [diff] [review]
Remove the EUC-KR converter and rename x-windows-949 to EUC-KR

Review of attachment 680396 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/plugins/base/nsPluginInstanceOwner.cpp
@@ +991,5 @@
>      {"x-mac-icelandic", "MacIceland"},
>      {"macintosh",       "MacRoman"},
>      {"x-mac-romanian",  "MacRomania"},
>      {"x-mac-ukrainian", "MacUkraine"},
> +    {"Shift_JIS",       "MS932"},

This looks like part of another patch
Attachment #680396 - Flags: review?(smontagu) → review+
patch for checkin
Attachment #680396 - Attachment is obsolete: true
Attachment #681005 - Flags: review+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/fd7a0ace6b0e
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
http://mxr.mozilla.org/l10n-mozilla-aurora/search?string=x-windows-949 shows some usage of x-windows-949 outside of charsetTitles.properties, should we get bugs filed on replacing those?
Sure. x-windows-949 should be removed from intl.charsetmenu.browser.more3, and it should be replaced with EUC-KR in the ko searchplugins.
Actually intl.charsetmenu.browser.more* should be just removed entirely because those properties are no longer localizable.
Depends on: 812027
Filed bug 812027 for ko searchplugins.
Removing intl.charsetmenu.browser.more* is low priority because the leftover entries are harmless.
Thanks, agreed.