Closed Bug 470523 Opened 16 years ago Closed 10 years ago

ISO-2022-CN decoder is horribly broken

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: emk, Assigned: emk)

Details

Attachments

(2 files, 1 obsolete file)

It can't decode even RFC 1922 sample text correctly.
Attached patch overhauled ISO-2022-CN decoder (obsolete) — Splinter Review
Assignee: smontagu → VYV03354
Status: NEW → ASSIGNED
Attachment #353945 - Flags: review?(smontagu)
Oops, the previous patch was still vulnerable to some XSS attacks.
Attachment #353945 - Attachment is obsolete: true
Attachment #354014 - Flags: review?(smontagu)
Attachment #353945 - Flags: review?(smontagu)
The patch looks technically correct, but I'm not really sure that we need it. I don't know how commonly used ISO-2022-CN is in the first place, and it must be very unusual to have traditional and simple Chinese mixed in the same message, let alone the same line.

FWIW, iconv doesn't decode the sample text correctly either (the expected decoding is 交换交換)
> The patch looks technically correct, but I'm not really sure that we need it.
If we don't need it, we should completely remove it rather than leaving it broken. Broken support is worse than no support. (Remember Nav4 CSS support!)
> I don't know how commonly used ISO-2022-CN is in the first place,
I'm afraid that it's rarely used. ISO-2022-JP is an only survivor of 7-bit ISO/IEC 2022 family encodings. But it's defined in RFC, registered by IANA, and supported by Microsoft Windows (CP50227 and 50229).
> and it must be
> very unusual to have traditional and simple Chinese mixed in the same message,
> let alone the same line.
The current decoder treats a single shift as it's a locking shift. So traditional Chinese only text will not be parsed correctly if it contains plane 2 or 3 characters.
> FWIW, iconv doesn't decode the sample text correctly either (the expected
decoding is 交换交換)
Which iconv? GNU iconv 1.11 which is included in MozillaBuild handled it correctly.
I'm going to consider dropping support for ISO-2022-CN because:
1. IE and WebKit do not support this encoding.
2. Our implementation has been broken for a long time.
Why ISO-2022-CN is defined in Encoding Standard?
http://dvcs.w3.org/hg/encoding/raw-file/78076fc4d607/Overview.html#iso-2022-cn
I'm open to removing it. I believe Chrome specifically added basic support (not backed by encoding tables though) for it to prevent XSS exploits.
http://code.google.com/p/chromium/issues/detail?id=21354 is what Chrome has done. For "iso-2022-cn" and "iso-2022-cn-ext".
One idea was to always treat text/html; charset=iso-2022-cn as text/plain to prevent XSS problems.  That makes it explicitly not work at all, rather than be parsed as HTML with the wrong charset (leading to the XSS issue).

I don't know if that would complicate the parser; for example, meta charset could cause the content type to change instead of just the charset.
Given that Internet Explorer does not support this encoding at all, maybe we should just remove it from all browsers (and the spec)?
The ISO-2022-CN decoder will soon be removed (bug 997115).
We have introduced the "replacement" encoding to avoid the security issue.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Attachment #354014 - Flags: review?(smontagu)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: