Closed Bug 473539 Opened 16 years ago Closed 10 years ago

Add HTML5-compliant way to check if charsets are equivalent for decoding

Categories

(Core :: Internationalization, enhancement)

enhancement
Not set
normal

RESOLVED WORKSFORME

People

(Reporter: hsivonen, Assigned: smontagu)

Details

(Keywords: html5)

HTML5 parsing needs a method for checking if two charsets are aliases of each other for decoding purposes. This makes it possible to avoid pointless parser restarts. The HTML5 aliases that the IANA does not recognize are listed at http://www.whatwg.org/specs/web-apps/current-work/#character-encoding-requirements
Apart from adding extra aliases to the properties file, isn't it sufficient to do GetCharsetAlias on both charsets and check if the results are the same?
If there's an alias API that returns the same value for both windows-1252 and iso-8859-1, comparing the alias value would work, yes. However, I was assuming that you wouldn't alias those for encoding. I'd be OK with having two alias APIs (one for decoding aliases and another for encoding aliases) or making it so that you can't actually output iso-8859-1-labeled content at all--only windows-1252-labeled content.
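A minimal sketch of the comparison discussed in the two comments above, assuming separate alias tables for decoding and encoding. The table contents and the names DECODE_ALIASES, ENCODE_ALIASES, resolve, same_decoder and same_encoder are hypothetical illustrations, not Gecko's actual GetCharsetAlias API or properties-file data.

```python
# Hypothetical alias tables, for illustration only.
# For decoding, iso-8859-1 is treated as windows-1252 (as the thread discusses).
DECODE_ALIASES = {
    "iso-8859-1": "windows-1252",
    "windows-1252": "windows-1252",
}

# For encoding, this sketch assumes the two labels stay distinct.
ENCODE_ALIASES = {
    "iso-8859-1": "iso-8859-1",
    "windows-1252": "windows-1252",
}


def resolve(label: str, table: dict) -> str:
    """Return the canonical name for a label, or the normalized label itself."""
    key = label.strip().lower()
    return table.get(key, key)


def same_decoder(a: str, b: str) -> bool:
    """True if both labels resolve to the same canonical name for decoding."""
    return resolve(a, DECODE_ALIASES) == resolve(b, DECODE_ALIASES)


def same_encoder(a: str, b: str) -> bool:
    """True if both labels resolve to the same canonical name for encoding."""
    return resolve(a, ENCODE_ALIASES) == resolve(b, ENCODE_ALIASES)
```

With tables like these, a parser that started decoding as windows-1252 and then meets a meta declaring iso-8859-1 could call same_decoder and skip the restart, while an encoder could still tell the two labels apart.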
(In reply to comment #2)
> If there's an alias API that returns the same value for both windows-1252 and
> iso-8859-1, comparing the alias value would work, yes. However, I was assuming
> that you wouldn't alias those for encoding.
Currently we don't, but I thought that HTML5 didn't distinguish between encoding and decoding here. The reference in comment 0 seems to be obsolete, but I'm basing that on http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#character-encodings-0: "When a user agent would otherwise use an encoding specified by a label given in the first column of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row."
> I'd be OK with having two alias APIs (one for decoding aliases and another for
> encoding aliases) or making it so that you can't actually output
> iso-8859-1-labeled content at all--only windows-1252-labeled content.
I think two alias APIs is the way to go. It would bloat the properties file/s a bit, but we will be able to compensate for that by removing some of the current decoders. Also if we implement the Charset Alias Matching rules from http://www.unicode.org/unicode/reports/tr22/#Charset_Alias_Matching, as specified in HTML5, we can cut out a lot of the properties file.
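To illustrate why the Charset Alias Matching rules would shrink the alias list, here is a rough sketch of the loose comparison as I read it in UTS #22: drop everything except ASCII letters and digits, lowercase the letters, and delete each 0 that is not preceded by a digit. The function names loose_key and labels_match are made up for this example, and the details should be checked against the report itself.

```python
def loose_key(label: str) -> str:
    """Reduce a charset label to a comparison key using the loose matching rules."""
    out = []
    for ch in label:
        # Rule 1: keep only ASCII letters and digits.
        if not (ch.isascii() and ch.isalnum()):
            continue
        # Rule 2: fold uppercase to lowercase.
        ch = ch.lower()
        # Rule 3: working left to right, drop a 0 not preceded by a digit.
        if ch == "0" and not (out and out[-1].isdigit()):
            continue
        out.append(ch)
    return "".join(out)


def labels_match(a: str, b: str) -> bool:
    """True if two labels are equal under the loose matching rules."""
    return loose_key(a) == loose_key(b)
```

Under these rules spelling variants such as "UTF-8", "utf8" and "u.t.f-008" reduce to the same key, so only one entry per charset name would need to be listed rather than one per variant.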
On second thought, since submitting a form in 'iso-8859-1' requires encoding as windows-1252, the API dichotomy shouldn't be encode/decode but Web content / editor component. But then it makes no sense to support outputting real ISO-8859-1 from an editor component (Composer, BlueGriffon), since the editor might as well output UTF-8.
No longer needed, since we now implement Encoding Standard-compliant labels.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME