Bug 473539
Opened 16 years ago
Closed 10 years ago
Add HTML5-compliant way to check if charsets are equivalent for decoding
Categories
(Core :: Internationalization, enhancement)
RESOLVED
WORKSFORME
People
(Reporter: hsivonen, Assigned: smontagu)
Details
(Keywords: html5)
HTML5 parsing needs a method for checking if two charsets are aliases of each other for decoding purposes. This makes it possible to avoid pointless parser restarts.
The HTML5 aliases that the IANA does not recognize are listed at
http://www.whatwg.org/specs/web-apps/current-work/#character-encoding-requirements
Comment 1 (Assignee) • 16 years ago
Apart from adding extra aliases to the properties file, isn't it sufficient to do GetCharsetAlias on both charsets and check if the results are the same?
Comment 2 (Reporter) • 16 years ago
If there's an alias API that returns the same value for both windows-1252 and iso-8859-1, comparing the alias value would work, yes. However, I was assuming that you wouldn't alias those for encoding.
I'd be OK with having two alias APIs (one for decoding aliases and another for encoding aliases) or making it so that you can't actually output iso-8859-1-labeled content at all--only windows-1252-labeled content.
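The two-API idea discussed above can be sketched roughly as follows. This is a minimal illustration in Python, not Gecko code: the tables `DECODE_ALIASES` and `ENCODE_ALIASES`, their entries, and the helper names are all hypothetical and only show how a decoding lookup could merge iso-8859-1 into windows-1252 while an encoding lookup keeps them apart.

```python
# Hypothetical alias tables; the entries are illustrative, not Gecko's data.
DECODE_ALIASES = {
    "windows-1252": "windows-1252",
    "iso-8859-1": "windows-1252",  # treated as windows-1252 when decoding
    "utf-8": "utf-8",
}
ENCODE_ALIASES = {
    "windows-1252": "windows-1252",
    "iso-8859-1": "iso-8859-1",    # kept distinct when producing output
    "utf-8": "utf-8",
}


def canonical(label, table):
    """Return the canonical name for a label in the given table, or None."""
    return table.get(label.strip().lower())


def same_for_decoding(a, b):
    """True if both labels are known and select the same decoder."""
    ca = canonical(a, DECODE_ALIASES)
    cb = canonical(b, DECODE_ALIASES)
    return ca is not None and ca == cb


def needs_parser_restart(current, declared):
    """A restart is pointless when both labels decode identically."""
    return not same_for_decoding(current, declared)
```

With tables like these, a document sniffed as windows-1252 that later declares iso-8859-1 would not trigger a restart, while the encoding side could still tell the two labels apart.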
Comment 3 (Assignee) • 16 years ago
(In reply to comment #2)
> If there's an alias API that returns the same value for both windows-1252 and
> iso-8859-1, comparing the alias value would work, yes. However, I was assuming
> that you wouldn't alias those for encoding.
Currently we don't, but I thought that HTML5 didn't distinguish between encoding and decoding here. The reference in comment 0 seems to be obsolete, but I'm basing that on http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#character-encodings-0:
"When a user agent would otherwise use an encoding specified by a label given in the first column of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row."
> I'd be OK with having two alias APIs (one for decoding aliases and another for
> encoding aliases) or making it so that you can't actually output
> iso-8859-1-labeled content at all--only windows-1252-labeled content.
I think two alias APIs is the way to go. It would bloat the properties file(s) a bit, but we will be able to compensate for that by removing some of the current decoders. Also, if we implement the Charset Alias Matching rules from http://www.unicode.org/unicode/reports/tr22/#Charset_Alias_Matching, as specified in HTML5, we can cut out a lot of the properties file.
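For reference, the Charset Alias Matching rules in TR22 amount to: drop everything except ASCII letters and digits, lowercase the letters, and delete each 0 that is not preceded by a digit. A minimal Python sketch of that normalization follows; the function names are made up for illustration, and this is not how Gecko implements alias lookup.

```python
def tr22_normalize(name):
    """Normalize a charset name per the TR22 alias matching rules."""
    out = []
    for ch in name:
        # Rule 1: keep only ASCII letters and digits.
        if not (ch.isascii() and ch.isalnum()):
            continue
        # Rule 2: map uppercase A-Z to lowercase a-z.
        ch = ch.lower()
        # Rule 3: left to right, delete each 0 not preceded by a digit.
        if ch == "0" and not (out and out[-1].isdigit()):
            continue
        out.append(ch)
    return "".join(out)


def tr22_match(a, b):
    """True if two charset names are equal after normalization."""
    return tr22_normalize(a) == tr22_normalize(b)
```

Under these rules "UTF-8", "utf8" and "u.t.f-008" all match one another, while "utf-80" and "ut8" do not match "utf8". Variants that differ only in punctuation or case would therefore not need separate entries in an alias table.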
Comment 4 (Reporter) • 16 years ago
On second thought, since submitting a form in 'iso-8859-1' requires encoding as windows-1252, the API dichotomy shouldn't be encode/decode but Web content / editor component.
But then it makes no sense to support outputting real ISO-8859-1 from an editor component (Composer, BlueGriffon), since the editor might as well output UTF-8.
Comment 5 (Reporter) • 10 years ago
No longer needed, since we now implement Encoding Standard-compliant labels.
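To illustrate why label resolution makes a separate equivalence check unnecessary, here is a rough Python sketch of label lookup in the style of the Encoding Standard: strip leading and trailing ASCII whitespace, compare ASCII case-insensitively, and map the label to an encoding name. The `LABELS` table is a small hand-picked excerpt and the helper names are hypothetical; this is not Gecko's implementation.

```python
import string

ASCII_WHITESPACE = "\t\n\f\r "
# ASCII-only lowercasing, so non-ASCII look-alikes are never folded.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Partial excerpt of label -> encoding name; the real table is much longer.
LABELS = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "unicode-1-1-utf-8": "UTF-8",
    "windows-1252": "windows-1252",
    "cp1252": "windows-1252",
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "us-ascii": "windows-1252",
    "ascii": "windows-1252",
}


def get_encoding(label):
    """Resolve a label to an encoding name, or None if it is unknown."""
    key = label.strip(ASCII_WHITESPACE).translate(_ASCII_LOWER)
    return LABELS.get(key)


def same_encoding(a, b):
    """True if both labels resolve to the same known encoding."""
    ea, eb = get_encoding(a), get_encoding(b)
    return ea is not None and ea == eb
```

Once every label resolves to a single encoding, checking whether two charsets are equivalent reduces to comparing the resolved values.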
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME