Closed Bug 473539 Opened 17 years ago Closed 10 years ago

Add HTML5-compliant way to check if charsets are equivalent for decoding

Status:

RESOLVED WORKSFORME

People

(Reporter: hsivonen, Assigned: smontagu)

Details

(Keywords: html5)

Henri Sivonen (:hsivonen)

Reporter

Description

•

17 years ago

HTML5 parsing needs a method for checking if two charsets are aliases of each other for decoding purposes. This makes it possible to avoid pointless parser restarts. The HTML5 aliases that the IANA does not recognize are listed at http://www.whatwg.org/specs/web-apps/current-work/#character-encoding-requirements

Simon Montagu :smontagu

Assignee

Comment 1

•

16 years ago

Apart from adding extra aliases to the properties file, isn't it sufficient to do GetCharsetAlias on both charsets and check if the results are the same?

Henri Sivonen (:hsivonen)

Reporter

Comment 2

•

16 years ago

If there's an alias API that returns the same value for both windows-1252 and iso-8859-1, comparing the alias value would work, yes. However, I was assuming that you wouldn't alias those for encoding. I'd be OK with having two alias APIs (one for decoding aliases and another for encoding aliases) or making it so that you can't actually output iso-8859-1-labeled content at all--only windows-1252-labeled content.
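The two-alias-API idea can be sketched in Python as two lookup tables, one consulted for decoding and one for encoding. This is only an illustration under stated assumptions: `DECODE_ALIASES`, `ENCODE_ALIASES`, `canonical_name`, `equivalent_for_decoding`, and `equivalent_for_encoding` are hypothetical names, not Gecko APIs, and the tables are tiny hand-picked excerpts.

```python
# Hypothetical alias tables; a real implementation would carry many more rows.
DECODE_ALIASES = {
    "iso-8859-1": "windows-1252",  # for decoding, treat as windows-1252
    "latin1": "windows-1252",
    "cp1252": "windows-1252",
}
ENCODE_ALIASES = {
    "latin1": "iso-8859-1",        # for encoding, the two stay distinct
    "cp1252": "windows-1252",
}


def canonical_name(label, aliases):
    """Return the canonical name for a label, or the lowercased label itself."""
    key = label.strip().lower()
    return aliases.get(key, key)


def equivalent_for_decoding(a, b):
    """True if both labels select the same decoder."""
    return canonical_name(a, DECODE_ALIASES) == canonical_name(b, DECODE_ALIASES)


def equivalent_for_encoding(a, b):
    """True if both labels select the same encoder."""
    return canonical_name(a, ENCODE_ALIASES) == canonical_name(b, ENCODE_ALIASES)
```

With these tables, `equivalent_for_decoding("ISO-8859-1", "windows-1252")` is true while `equivalent_for_encoding` returns false for the same pair, which is the asymmetry being discussed here.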

Simon Montagu :smontagu

Assignee

Comment 3

•

16 years ago

(In reply to comment #2)
> If there's an alias API that returns the same value for both windows-1252 and
> iso-8859-1, comparing the alias value would work, yes. However, I was assuming
> that you wouldn't alias those for encoding.

Currently we don't, but I thought that HTML5 didn't distinguish between encoding and decoding here. The reference in comment 0 seems to be obsolete, but I'm basing that on http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#character-encodings-0: "When a user agent would otherwise use an encoding specified by a label given in the first column of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row."

> I'd be OK with having two alias APIs (one for decoding aliases and another for
> encoding aliases) or making it so that you can't actually output
> iso-8859-1-labeled content at all--only windows-1252-labeled content.

I think two alias APIs is the way to go. It would bloat the properties file/s a bit, but we will be able to compensate for that by removing some of the current decoders. Also if we implement the Charset Alias Matching rules from http://www.unicode.org/unicode/reports/tr22/#Charset_Alias_Matching, as specified in HTML5, we can cut out a lot of the properties file.
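For reference, the Charset Alias Matching rules in UTS #22 amount to three steps: delete every character other than a-z, A-Z, and 0-9; map uppercase to lowercase; then, from left to right, delete each 0 that is not preceded by a digit. A minimal Python sketch follows; `normalize_charset_name` and `names_match` are hypothetical helper names chosen for illustration, not part of any existing API.

```python
import string

_ALNUM = set(string.ascii_letters + string.digits)


def normalize_charset_name(name):
    """Apply the UTS #22 alias-matching normalization to a charset name."""
    out = []
    for ch in name:
        if ch not in _ALNUM:
            continue  # step 1: keep only ASCII letters and digits
        ch = ch.lower()  # step 2: fold ASCII uppercase to lowercase
        # step 3: drop a 0 unless the character kept just before it is a digit
        if ch == "0" and not (out and out[-1].isdigit()):
            continue
        out.append(ch)
    return "".join(out)


def names_match(a, b):
    """True if two charset names are the same after normalization."""
    return normalize_charset_name(a) == normalize_charset_name(b)
```

Under this rule "UTF-8", "utf8", and "u.t.f-008" all match each other, while "utf-80" and "ut8" match none of them. Because punctuation and case stop mattering, many spelling variants would no longer need their own rows in an alias table.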

Henri Sivonen (:hsivonen)

Reporter

Comment 4

•

16 years ago

On second thought, since submitting a form in 'iso-8859-1' requires encoding as windows-1252, the API dichotomy shouldn't be encode/decode but Web content / editor component. But then it makes no sense to support outputting real ISO-8859-1 from an editor component (Composer, BlueGriffon), since the editor might as well output UTF-8.

Henri Sivonen (:hsivonen)

Reporter

Comment 5

•

10 years ago

No longer needed, since we now implement Encoding Standard-compliant labels.
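To illustrate why label resolution makes a separate equivalence check unnecessary: in the Encoding Standard, a label is resolved by stripping leading and trailing ASCII whitespace and matching it ASCII case-insensitively against a fixed table, so two labels are equivalent exactly when they resolve to the same encoding. The Python sketch below uses only a small excerpt of that table. `get_encoding` mirrors the standard's "get an encoding" step, while `needs_restart` is a hypothetical, heavily simplified stand-in for the parser's decision and not the actual Gecko logic.

```python
import string

_ASCII_WHITESPACE = "\t\n\x0c\r "
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Excerpt of the Encoding Standard label table (not the full list).
LABELS = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "unicode-1-1-utf-8": "UTF-8",
    "windows-1252": "windows-1252",
    "cp1252": "windows-1252",
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "us-ascii": "windows-1252",
}


def get_encoding(label):
    """Resolve a label to an encoding name, or None if it is unknown."""
    key = label.strip(_ASCII_WHITESPACE).translate(_ASCII_LOWER)
    return LABELS.get(key)


def needs_restart(current_label, declared_label):
    """Simplified check: restart only if the declared encoding really differs."""
    declared = get_encoding(declared_label)
    return declared is not None and declared != get_encoding(current_label)
```

Here `needs_restart("iso-8859-1", "windows-1252")` is false because both labels resolve to the same encoding, which is the pointless restart the original report wanted to avoid.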

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → WORKSFORME

Categories

(Core :: Internationalization, enhancement)
