Closed Bug 562096 Opened 15 years ago Closed 11 years ago

Support charset aliasing per Encoding Standard

Categories

(Core :: Internationalization, defect)

RESOLVED DUPLICATE of bug 801402

People

(Reporter: jshin1987, Assigned: smontagu)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 1 obsolete file)

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html has the following:

-------------------
When a user agent would otherwise use an encoding given in the first column of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row. When a byte or sequence of bytes is treated differently due to this encoding aliasing, it is said to have been misinterpreted for compatibility.

Character encoding overrides

Input encoding    Replacement encoding    References
EUC-KR            windows-949             [EUCKR] [WIN949]
GB2312            GBK                     [RFC1345] [GBK]
GB_2312-80        GBK                     [RFC1345] [GBK]
ISO-8859-1        windows-1252            [RFC1345] [WIN1252]
ISO-8859-9        windows-1254            [RFC1345] [WIN1254]
ISO-8859-11       windows-874             [ISO885911] [WIN874]
KS_C_5601-1987    windows-949             [RFC1345] [WIN949]
Shift_JIS         Windows-31J             [SHIFTJIS] [WIN31J]
TIS-620           windows-874             [TIS620] [WIN874]
US-ASCII          windows-1252            [RFC1345] [WIN1252]
--------------------

We already do some of the above aliasing (e.g. ISO-8859-1 -> windows-1252), but not all. For the Korean-specific issue, see bug 562091. Because HTML5 stipulates that we do the above aliasing in both directions, we can get rid of some of the converters to save some space.

I tried to find an existing bug on this but couldn't. If it's already filed or resolved on trunk, please accept my apology.
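To make the effect of the override table concrete, here is a minimal Python sketch. It is not Gecko code: the function names are made up for illustration, and the mapping from the replacement encodings to Python codec names (cp949, cp1252, and so on) is an assumption. Python's codecs are close to, but not guaranteed identical with, what a browser uses.

```python
# Override table as quoted from the HTML5 draft in this comment.
OVERRIDES = {
    "euc-kr": "windows-949",
    "gb2312": "gbk",
    "gb_2312-80": "gbk",
    "iso-8859-1": "windows-1252",
    "iso-8859-9": "windows-1254",
    "iso-8859-11": "windows-874",
    "ks_c_5601-1987": "windows-949",
    "shift_jis": "windows-31j",
    "tis-620": "windows-874",
    "us-ascii": "windows-1252",
}

# Assumption: Python codec names standing in for the replacement encodings.
PYTHON_CODECS = {
    "windows-949": "cp949",
    "gbk": "gbk",
    "windows-1252": "cp1252",
    "windows-1254": "cp1254",
    "windows-874": "cp874",
    "windows-31j": "cp932",
}


def effective_encoding(label):
    """Return the encoding to actually use for a declared label (hypothetical helper)."""
    key = label.strip().lower()
    return OVERRIDES.get(key, key)


def decode_with_overrides(data, label):
    """Decode bytes, applying the override table before picking a codec."""
    enc = effective_encoding(label)
    return data.decode(PYTHON_CODECS.get(enc, enc))


if __name__ == "__main__":
    # Byte 0x80 is a C1 control in ISO-8859-1 but the euro sign in windows-1252.
    print(repr(decode_with_overrides(b"\x80", "ISO-8859-1")))
```

The byte 0x80 case is the kind of difference the draft calls "misinterpreted for compatibility": content labelled ISO-8859-1 is decoded as if it were windows-1252.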
Depends on: 600715
Depends on: 712876
Blocks: encoding
This section would be superseded by the Encoding Standard.
Summary: Support charset aliasing per HTML5 → Support charset aliasing per Encoding Standard
This charsetalias.properties will have the following effects.

A. The following encodings will no longer be available.
A.1 XSS vulnerable encodings: x-mac-arabic, x-mac-farsi, x-mac-hebrew, x-imap4-modified-utf7, UTF-7, T.61-8bit
A.2 IBM encodings other than ibm864 and ibm866: IBM850, IBM852, IBM855, IBM857, IBM862, IBM864i
A.3 Mac encodings other than macintosh (MacRoman): x-mac-ce, x-mac-croatian, x-mac-devanagari, x-mac-greek, x-mac-gujarati, x-mac-gurmukhi, x-mac-icelandic, x-mac-romanian, x-mac-turkish
A.4 Vietnamese encodings: x-viet-tcvn5712, x-viet-vps, VISCII
A.5 Others: x-euc-tw, armscii-8, x-johab, x-user-defined, ISO-IR-111, ISO-2022-CN, ISO-8859-6-E, ISO-8859-6-I, ISO-8859-8-E, ISO-8859-8-I

B. The following aliases will be removed.
UTF-16BE: csunicode11, csunicode, csunicodeascii, csunicodelatin1, iso-10646-j-1, iso-10646-ucs-2, iso-10646-ucs-basic, iso-10646-unicode-latin1, iso-10646, x-iso-10646-ucs-2-be, x-iso-10646-ucs-2-le
IBM864: 864, csibm864, ibm-864
IBM866: 866, csibm866, cp-866
windows-1250: cp1250
windows-1251: cp1251, ansi-1251
windows-1252: cp1252, x-cp1252
windows-1253: x-cp1253
windows-1254: cp1254, x-cp1254
windows-1255: x-cp1255
windows-1256: x-cp1256
windows-1257: cp1257, x-cp1257
windows-1258: x-cp1258
windows-874: ibm874
ISO-8859-1: ibm819, cp819, iso-ir-100, iso88591
ISO-8859-2: iso88592, iso8859-3, iso88593
ISO-8859-4: iso8859-4, iso88594
ISO-8859-5: iso8859-5, iso88595
ISO-8859-6: asmo-708, iso8859-6, iso88596
ISO-8859-7: iso8859-7, iso88597, sun_eu_greek
ISO-8859-8: iso8859-8, iso88598
ISO-8859-9: iso8859-9, iso88599, iso_8859-9
ISO-8859-10: iso885910
ISO-8859-11: iso8859-11, iso885911
ISO-8859-12: iso885912
ISO-8859-13: iso8859-13, iso885913
ISO-8859-14: iso885914
ISO-8859-15: iso8859-15, iso885915
EUC-KR: 5601
us-ascii: 646
Shift_JIS: cp932
ISO-2022-JP: csiso2022jp2, iso-2022-jp-2
TIS-620: tis620
gbk: windows-936
GB2312: zh_cn.euc
Big5: zh_tw-big5

C. The following encodings will be replaced with other (usually superset) encodings.
GB2312->gbk
us-ascii->windows-1252
ISO-8859-1->windows-1252
ISO-8859-9->windows-1254
ISO-8859-11->windows-874
TIS-620->windows-874
iso-8859-8-i->ISO-8859-8
Big5-HKSCS->Big5
UTF-16->UTF-16LE
EUC-KR->x-windows-949 (the Encoding Standard calls windows-949 EUC-KR)

D. The following aliases will be added.
cn-big5=Big5
sjis=Shift_JIS
windows-949=EUC-KR
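As a rough illustration of how an alias table like this is consulted, the Python sketch below parses `alias=canonical` lines and resolves labels case-insensitively. The line format is assumed from the entries in section D, the sample text is a hypothetical excerpt limited to mappings named in sections C and D, and `parse_aliases` and `resolve` are illustrative names, not functions from the patch.

```python
# Hypothetical excerpt: only mappings named in sections C and D of this comment.
SAMPLE = """\
# alias=canonical, one per line (format assumed from section D)
cn-big5=Big5
sjis=Shift_JIS
windows-949=EUC-KR
gb2312=gbk
us-ascii=windows-1252
iso-8859-1=windows-1252
tis-620=windows-874
"""


def parse_aliases(text):
    """Build a lowercase alias -> canonical name table from properties-style text."""
    table = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        alias, _, canonical = line.partition("=")
        table[alias.strip().lower()] = canonical.strip()
    return table


def resolve(label, table):
    """Return the canonical encoding for a label, or None if the label is unknown."""
    return table.get(label.strip().lower())


if __name__ == "__main__":
    aliases = parse_aliases(SAMPLE)
    print(resolve("SJIS", aliases))   # a label added in section D
    print(resolve("UTF-7", aliases))  # a removed encoding has no entry
```

Under this model, removing an encoding or alias (sections A and B) means deleting its entry so that lookup fails, while a replacement (section C) simply points the old label at the superset encoding.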
Attachment #617222 - Attachment is patch: false
The following aliases (and more) have been added back.

IBM864: csibm864, ibm-864
IBM866: 866, csibm866
windows-1250: cp1250
windows-1251: cp1251
windows-1252: cp1252, x-cp1252
windows-1253: x-cp1253
windows-1254: cp1254, x-cp1254
windows-1255: x-cp1255
windows-1256: x-cp1256
windows-1257: cp1257, x-cp1257
windows-1258: x-cp1258
ISO-8859-1: ibm819, cp819, iso-ir-100, iso88591
ISO-8859-2: iso88592, iso8859-3, iso88593
ISO-8859-4: iso8859-4, iso88594
ISO-8859-5: iso8859-5, iso88595
ISO-8859-6: asmo-708, iso8859-6, iso88596
ISO-8859-7: iso8859-7, iso88597, sun_eu_greek
ISO-8859-8: iso8859-8, iso88598
ISO-8859-9: iso8859-9, iso88599, iso_8859-9
ISO-8859-10: iso885910
ISO-8859-11: iso8859-11, iso885911
ISO-8859-12: iso885912
ISO-8859-13: iso8859-13, iso885913
ISO-8859-14: iso885914
ISO-8859-15: iso8859-15, iso885915
Attachment #617222 - Attachment is obsolete: true
Depends on: 802030
Does it make sense to do this all in one big change? Some of these changes have bigger compat implications than others, and some might just be spec bugs. Wouldn't it be more prudent to make this a tracker bug and pursue changes bit by bit in individual bugs?
Depends on: 802059
The browser side of this was fixed by bug 801402. Moving the legacy code to comm-central is bug 943268.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE