Closed Bug 890478 Opened 12 years ago Closed 12 years ago

document.characterSet and document.inputEncoding wrong for iso-8859-1

Categories

(Core :: DOM: HTML Parser, defect)

Tracking

()

VERIFIED INVALID

People

(Reporter: glazou, Unassigned)

Attachments

(3 files)

Both document.characterSet and document.inputEncoding incorrectly report windows-1252 when the encoding of the document is iso-8859-1. See the attached test case.
I hit this bug while working on a new feature of BlueGriffon: I have to serialize a subtree of the currently edited document into a standalone XML document. I need to output the XML declaration, but since the inputEncoding and the characterSet of the whole document are not reliable, I can't... Blocker.
FWIW:
Chrome and Blink return the correct iso-8859-1 values.
IE 10 returns the correct iso-8859-1 values.
IE 11 returns the correct iso-8859-1 values.
Opera (Presto) returns windows-1252 for document.characterSet and does not implement document.inputEncoding.
So, since Opera is now based on Blink, Firefox is the only major browser choking on this.
(Note: I assumed this is related to the HTML parser; please fix the Component of the bug if this is not the case.)
(In reply to Daniel Glazman (:glazou) from comment #0)
> Both document.characterSet and document.inputEncoding incorrectly report
> windows-1252 when the encoding of the document is iso-8859-1.
windows-1252 is the preferred label for iso-8859-1 per the Encoding Standard, so Gecko is correct per spec. http://encoding.spec.whatwg.org/
> See test case
> attached. I hit this bug working on a new feature of BlueGriffon: I have to
> serialize in a standalone XML document a subtree of the currently edited
> document. I need to output the xml declaration but since the inputEncoding
> and the characterSet of the whole document are not reliable, I can't...
> Blocker.
All XML parsers MUST support UTF-8. Hence, using UTF-8 when serializing to XML is always the right thing to do.
> IE 10 replies correct iso-8859-1 values.
> IE 11 replies correct iso-8859-1 values.
This surprises me. What about older versions of IE?
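
As a minimal sketch of the UTF-8 advice in the comment above: a subtree can be serialized as a standalone XML document with a fixed UTF-8 declaration, without consulting the encoding the source document reports. The example uses Python's xml.etree.ElementTree purely for illustration; it is not BlueGriffon or Gecko code, and the helper name and element names are hypothetical.

```python
import xml.etree.ElementTree as ET


def serialize_subtree_utf8(element):
    """Serialize `element` as a standalone XML document encoded as UTF-8."""
    # xml_declaration=True forces the <?xml ...?> line (Python 3.8 or later).
    return ET.tostring(element, encoding="utf-8", xml_declaration=True)


# Hypothetical subtree containing characters outside ASCII.
section = ET.Element("section")
para = ET.SubElement(section, "p")
para.text = "caf\u00e9 \u20ac"

data = serialize_subtree_utf8(section)

# Parsing the bytes back recovers the same text, whatever the encoding of
# the document the subtree originally came from.
roundtrip = ET.fromstring(data)
```

Because the output encoding is chosen by the serializer, the declared encoding of the edited document never has to be read back from the DOM.
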
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> > IE 10 replies correct iso-8859-1 values.
> > IE 11 replies correct iso-8859-1 values.
>
> This surprises me. What about older versions of IE?
Nothing, since the test case is application/xhtml+xml.
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> windows-1252 is the preferred label for iso-8859-1 per the Encoding
> Standard, so Gecko is correct per spec. http://encoding.spec.whatwg.org/
Can I say here how ridiculous I find this decision? Most people have no idea what windows-1252 is, while iso-8859-1 has been a well-known name for ages. Honestly, having iso-8859-1, a non-exotic name, explicitly listed in the XML encoding declaration and the meta charset while the DOM returns windows-1252 is hard to believe.
Whether that is good or bad, Gecko is the only rendering engine returning windows-1252 here. And that is an issue.
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> (In reply to Daniel Glazman (:glazou) from comment #0)
> > Both document.characterSet and document.inputEncoding incorrectly report
> > windows-1252 when the encoding of the document is iso-8859-1.
>
> windows-1252 is the preferred label for iso-8859-1 per the Encoding
> Standard, so Gecko is correct per spec. http://encoding.spec.whatwg.org/
>
> > See test case
> > attached. I hit this bug working on a new feature of BlueGriffon: I have to
> > serialize in a standalone XML document a subtree of the currently edited
> > document. I need to output the xml declaration but since the inputEncoding
> > and the characterSet of the whole document are not reliable, I can't...
> > Blocker.
>
> All XML parsers MUST support UTF-8. Hence, using UTF-8 when serializing to
> XML is always the right thing to do.
>
> > IE 10 replies correct iso-8859-1 values.
> > IE 11 replies correct iso-8859-1 values.
>
> This surprises me. What about older versions of IE?
Internet Explorer uses "iso-8859-1" as the canonical name for "windows-1252" encoding.
Attached file text/html testcase
IE8 also used "iso-8859-1".
Attached file windows-1252 testcase
Looks like IE and WebKit treat windows-1252 as a different encoding from iso-8859-1 even if the mapping is identical.
(In reply to Daniel Glazman (:glazou) from comment #3)
> Can I say here how ridiculous I find this decision?
Anne is in the CC.
> Whatever is good or bad, Gecko is the only rendering engine returning
> windows-1252 here. And that is an issue.
Does it break any Web sites? (If you are generating XML today and you use an encoding other than UTF-8, you are adding to the encoding mess, frankly.)
FWIW, this was based on Rebel Opera and the assumption we could get away with it. We investigated this at the time we switched. We can make the Encoding Standard say anything, but our current behavior makes the most sense.
(In reply to Anne (:annevk) from comment #8)
> FWIW, this was based on Rebel Opera and the assumption we could get away
> with it. We investigated this at the time we switched. We can make the
> Encoding Standard say anything, but our current behavior makes the most
> sense.
I would really like to understand better "makes the most sense" when ISO-8859-1 is identical to Windows-1252 ***EXCEPT FOR*** the code points 128-159 (0x80-0x9F) and that means they're ***NOT*** identical...
They are in implementations.
(In reply to Daniel Glazman (:glazou) from comment #9)
> I would really like to understand better "makes the most sense" when
> ISO-8859-1
> is identical to Windows-1252 ***EXCEPT FOR*** the code points 128-159
> (0x80-0x9F)
> and that means they're ***NOT*** identical...
Unfortunately, that is a fiction that doesn't reflect reality. Virtually all browsers (at least Gecko, Trident, Presto, WebKit, and Blink) have an iso-8859-1 decoder that is exactly the same as the windows-1252 decoder (including the 0x80-0x9F range). We shouldn't pretend that we really support iso-8859-1.
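
To make the 0x80-0x9F disagreement concrete, here is a small sketch using Python's codecs as stand-ins: "latin-1" is a strict ISO-8859-1 decoder, while "cp1252" is Python's windows-1252 codec. Neither is a browser decoder, so this only shows where the two tables differ on paper. One caveat is marked in the code: Python's cp1252 leaves five bytes undefined, whereas the Encoding Standard maps them to C1 control characters.

```python
def differing_bytes():
    """Return the byte values that the two codecs decode differently."""
    differing = []
    for value in range(256):
        raw = bytes([value])
        # Strict ISO-8859-1: byte N always decodes to code point U+00NN.
        as_latin1 = raw.decode("latin-1")
        # Python's cp1252 has no mapping for 0x81, 0x8D, 0x8F, 0x90 and 0x9D,
        # so errors="replace" turns those into U+FFFD instead of raising.
        as_cp1252 = raw.decode("cp1252", errors="replace")
        if as_latin1 != as_cp1252:
            differing.append(value)
    return differing


diff = differing_bytes()

# Curly quotes and a euro sign under windows-1252; C1 controls under latin-1.
sample = b"\x93quoted\x94 \x80"
as_windows_1252 = sample.decode("cp1252")
as_iso_8859_1 = sample.decode("latin-1")
```

Every byte outside 0x80-0x9F decodes identically under both codecs, which is why documents that avoid that range cannot tell the two encodings apart.
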
From bug 897302: same problem with text/plain and encoding shown in "View Page Info".
Not a problem. We've just stopped telling a lie.
The lie is that the server declares iso-8859-1 and "Page Info" says "Encoding: windows-1252". While I wanted to test the encoding after upgrading Apache and saw this "Page Info" result, I thought that there was a problem with the upgrade, while Firefox is just displaying incorrect information.
(In reply to Vincent Lefevre from comment #15)
> The lie is that the server declares iso-8859-1 and "Page Info" says
> "Encoding: windows-1252".
>
> While I wanted to test the encoding after upgrading Apache and saw this
> "Page Info" result, I thought that there was a problem with the upgrade,
> while Firefox is just displaying incorrect information.
We are honest that we alias iso-8859-1 to windows-1252 even though it is actually a different encoding. Other browsers pretend to support iso-8859-1 when they do not.
(In reply to Masatoshi Kimura [:emk] from comment #16)
> We are honest about that we are aliasing iso-8859-1 to windows-1252 despite
> that it is actually a different encoding. Other browsers pretend to
> supporting iso-8859-1 while it is not.
Excuse me, emk, but I don't care about "being honest". I care about not breaking applications that rely on document.inputEncoding reflecting the encoding declared in the document's instance as it is, and this change breaks at least two. If you want an API reflecting the "real" charset, add one, but don't change what has been stable for 15 years just in the name of "purity". Sigh.
(In reply to Masatoshi Kimura [:emk] from comment #16)
> We are honest about that we are aliasing iso-8859-1 to windows-1252 despite
> that it is actually a different encoding.
But this is not clear from the "Page Info". And aliasing iso-8859-1 to windows-1252 is a bug anyway, which leads to inconsistencies and may confuse the user/developer. For instance, converting the control character to a Numeric Character Reference would yield a different rendering. Firefox should use the right rendering in the first place.
Moreover, documents which don't use character codes for which there is a difference between iso-8859-1 and windows-1252 don't care about this aliasing. The announced encoding must be the correct one (the one declared by the server, or from the document e.g. for XML).
Daniel, Vincent: If a page declared "latin1", what would you expect document.characterSet and Page Info to say and why?
(In reply to Henri Sivonen (:hsivonen) from comment #19)
> Daniel, Vincent: If a paged declared "latin1", what would you expect
> document.characterSet and Page Info to say and why?
Is "latin1" standard (official)?
If yes, I would expect "latin1" (if and only if "iso-8859-1" is completely synonymous including for the 0x80-0x9f range[*], it may be OK to say "iso-8859-1", just because of the "iso-" prefix).
If no, then I'd expect some kind of failure (because users should not be encouraged to use non-standard things), possibly with a user-configurable fallback (such as charset auto-detect).
[*] AFAIK, 0x80-0x9f is undefined in some variants, but I don't know about latin1.
IMHO, Page Info should always display what charset has been served (from HTTP headers or from the document) + the charset actually used for interpretation if different, but only if interpretation was needed.
That seems way complex for something that is close to obsolete. The standard HTML, CSS et al use is http://encoding.spec.whatwg.org/ and per that what we do is correct. This is INVALID unless there are compatibility issues or other browser vendors refuse to implement the relevant standard in which case we need discussion elsewhere first.
(In reply to Vincent Lefevre from comment #20)
> Is "latin1" standard (official)?
"latin1" is (and even "windows-1252" is) a standard. http://encoding.spec.whatwg.org/#concept-encoding-get
> it may be OK to say "iso-8859-1", just because of the "iso-" prefix
That's nothing but a cargo-cult belief.
> If no, then I'd expect some kind of failure (because users should
> not be encouraged to use non-standard things), possibly with a
> user-configurable fallback (such as charset auto-detect).
It is completely backward-incompatible. Your fiction does not match the real world at all.
(In reply to Anne (:annevk) from comment #21)
> This is INVALID unless there are compatibility issues or other browser
> vendors refuse to implement the relevant standard in which case we need
> discussion elsewhere first.
Agreed.
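
The label lookup linked above ("get an encoding") can be sketched roughly as follows: strip ASCII whitespace from the declared label, compare it ASCII case-insensitively against a table of labels, and return the encoding's name. This is an illustration under stated assumptions, not Gecko's implementation: the table holds only a hand-picked subset of the labels the Encoding Standard lists, and the function and constant names are hypothetical.

```python
# Subset of the labels that the Encoding Standard maps to windows-1252.
WINDOWS_1252_LABELS = (
    "ascii", "cp1252", "iso-8859-1", "iso8859-1", "iso_8859-1",
    "l1", "latin1", "us-ascii", "windows-1252", "x-cp1252",
)

LABEL_TO_NAME = {label: "windows-1252" for label in WINDOWS_1252_LABELS}
LABEL_TO_NAME.update({"utf-8": "UTF-8", "utf8": "UTF-8"})

# ASCII whitespace: tab, line feed, form feed, carriage return, space.
ASCII_WHITESPACE = "\t\n\f\r "


def ascii_lowercase(text):
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def get_encoding(label):
    """Return the encoding name for `label`, or None if it is unknown."""
    key = ascii_lowercase(label.strip(ASCII_WHITESPACE))
    return LABEL_TO_NAME.get(key)


declared = [" Latin1\n", "ISO-8859-1", "windows-1252", "no-such-label"]
resolved = [get_encoding(label) for label in declared]
```

Under this model, "latin1", "iso-8859-1" and "windows-1252" are all labels for one encoding, and the name reported back is the same regardless of which label the page declared.
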
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID
(In reply to Masatoshi Kimura [:emk] from comment #22)
> http://encoding.spec.whatwg.org/#concept-encoding-get
This is in complete contradiction with: http://www.w3.org/International/questions/qa-controls (old, but not obsoleted).
That's not a specification. You can give them feedback though.
Status: RESOLVED → VERIFIED