890478 - document.characterSet and document.inputEncoding wrong for iso-8859-1

Daniel Glazman (:glazou) (not active in Mozilla any more)

Reporter

Description

12 years ago

Attached file: test case showing issue

Both document.characterSet and document.inputEncoding incorrectly report windows-1252 when the encoding of the document is iso-8859-1. See test case attached.

I hit this bug working on a new feature of BlueGriffon: I have to serialize in a standalone XML document a subtree of the currently edited document. I need to output the xml declaration but since the inputEncoding and the characterSet of the whole document are not reliable, I can't... Blocker.

FWIW,
Chrome and Blink reply correct iso-8859-1 values.
IE 10 replies correct iso-8859-1 values.
IE 11 replies correct iso-8859-1 values.
Opera (Presto) replies windows-1252 for document.characterSet and does not implement document.inputEncoding.

So since Opera is now based on Blink, Firefox is the only major browser choking on this.

(Note: I supposed this is related to the HTML parser, please fix the Component of the bug if this is not the case)

Henri Sivonen (:hsivonen)

Comment 1

12 years ago

(In reply to Daniel Glazman (:glazou) from comment #0)
> Both document.characterSet and document.inputEncoding incorrectly report
> windows-1252 when the encoding of the document is iso-8859-1.

windows-1252 is the preferred label for iso-8859-1 per the Encoding Standard, so Gecko is correct per spec.
http://encoding.spec.whatwg.org/

> See test case
> attached. I hit this bug working on a new feature of BlueGriffon: I have to
> serialize in a standalone XML document a subtree of the currently edited
> document. I need to output the xml declaration but since the inputEncoding
> and the characterSet of the whole document are not reliable, I can't...
> Blocker.

All XML parsers MUST support UTF-8. Hence, using UTF-8 when serializing to XML is always the right thing to do.

> IE 10 replies correct iso-8859-1 values.
> IE 11 replies correct iso-8859-1 values.

This surprises me. What about older versions of IE?
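The UTF-8 advice can be illustrated outside the browser. What follows is a minimal sketch, assuming Python's standard xml.etree.ElementTree stands in for the editor's serializer; the `subtree` element and its text are hypothetical, and this is not BlueGriffon's actual code. The point is only that the XML declaration can be generated together with the bytes, so it never has to be derived from document.characterSet or document.inputEncoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical subtree standing in for the fragment being exported.
subtree = ET.Element("p")
subtree.text = "caf\u00e9 \u20ac5"  # non-ASCII characters on purpose

# Serialize as UTF-8 and let the library write the matching XML declaration,
# so the output does not depend on the source document's reported encoding.
# The xml_declaration argument requires Python 3.8 or later.
xml_bytes = ET.tostring(subtree, encoding="utf-8", xml_declaration=True)

# Round trip: parsing the serialized bytes gives back the same text.
round_tripped = ET.fromstring(xml_bytes).text

print(xml_bytes.decode("utf-8"))
```

Because the encoding is chosen at serialization time, the label reported for the source document does not affect the result.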

Henri Sivonen (:hsivonen)

Comment 2

12 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #1)
> > IE 10 replies correct iso-8859-1 values.
> > IE 11 replies correct iso-8859-1 values.
>
> This surprises me. What about older versions of IE?

Nothing, since the test case is application/xhtml+xml.

Daniel Glazman (:glazou) (not active in Mozilla any more)

Reporter

Comment 3

12 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #1)
> windows-1252 is the preferred label for iso-8859-1 per the Encoding
> Standard, so Gecko is correct per spec. http://encoding.spec.whatwg.org/

Can I say here how ridiculous I find this decision? Most people have no idea what windows-1252 is, while iso-8859-1 has been a well-known name for ages. Honestly, having iso-8859-1, a non-exotic name, explicitly listed in the xml encoding and the meta charset with the DOM returning windows-1252 is hard to believe.

Whatever is good or bad, Gecko is the only rendering engine returning windows-1252 here. And that is an issue.

Masatoshi Kimura [:emk]

Comment 4

12 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #1)
> (In reply to Daniel Glazman (:glazou) from comment #0)
> > Both document.characterSet and document.inputEncoding incorrectly report
> > windows-1252 when the encoding of the document is iso-8859-1.
>
> windows-1252 is the preferred label for iso-8859-1 per the Encoding
> Standard, so Gecko is correct per spec. http://encoding.spec.whatwg.org/
>
> > See test case
> > attached. I hit this bug working on a new feature of BlueGriffon: I have to
> > serialize in a standalone XML document a subtree of the currently edited
> > document. I need to output the xml declaration but since the inputEncoding
> > and the characterSet of the whole document are not reliable, I can't...
> > Blocker.
>
> All XML parsers MUST support UTF-8. Hence, using UTF-8 when serializing to
> XML is always the right thing to do.
>
> > IE 10 replies correct iso-8859-1 values.
> > IE 11 replies correct iso-8859-1 values.
>
> This surprises me. What about older versions of IE?

Internet Explorer uses "iso-8859-1" as the canonical name for "windows-1252" encoding.

Masatoshi Kimura [:emk]

Comment 5

12 years ago

Attached file: text/html testcase

IE8 also used "iso-8859-1".

Masatoshi Kimura [:emk]

Comment 6

12 years ago

Attached file: windows-1252 testcase

Looks like IE and WebKit treat windows-1252 as a different encoding from iso-8859-1 even if the mapping is identical.

Henri Sivonen (:hsivonen)

Comment 7

12 years ago

(In reply to Daniel Glazman (:glazou) from comment #3)
> Can I say here how ridiculous I find this decision?

Anne is in the CC.

> Whatever is good or bad, Gecko is the only rendering engine returning
> windows-1252 here. And that is an issue.

Does it break any Web sites?

(If you are generating XML today and you use an encoding other than UTF-8, you are adding to the encoding mess, frankly.)

Anne (:annevk)

Comment 8

12 years ago

FWIW, this was based on Rebel Opera and the assumption we could get away with it. We investigated this at the time we switched. We can make the Encoding Standard say anything, but our current behavior makes the most sense.

Daniel Glazman (:glazou) (not active in Mozilla any more)

Reporter

Comment 9

12 years ago

(In reply to Anne (:annevk) from comment #8)
> FWIW, this was based on Rebel Opera and the assumption we could get away
> with it. We investigated this at the time we switched. We can make the
> Encoding Standard say anything, but our current behavior makes the most
> sense.

I would really like to understand better "makes the most sense" when ISO-8859-1 is identical to Windows-1252 ***EXCEPT FOR*** the code points 128-159 (0x80-0x9F) and that means they're ***NOT*** identical...

Anne (:annevk)

Comment 10

12 years ago

They are in implementations.

Masatoshi Kimura [:emk]

Comment 11

12 years ago

(In reply to Daniel Glazman (:glazou) from comment #9)
> I would really like to understand better "makes the most sense" when
> ISO-8859-1
> is identical to Windows-1252 ***EXCEPT FOR*** the code points 128-159
> (0x80-0x9F)
> and that means they're ***NOT*** identical...

Unfortunately, that is a fiction which doesn't reflect reality. Virtually all browsers (at least Gecko, Trident, Presto, WebKit, and Blink) have an iso-8859-1 decoder that is exactly the same as the windows-1252 decoder (including the 0x80-0x9F range). We shouldn't pretend that we really support iso-8859-1.
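The disputed 0x80-0x9F range can be made concrete with a short sketch. It assumes Python's built-in codecs, which, unlike the browser decoders described above, keep a strict "latin-1" (ISO-8859-1) codec separate from "cp1252" (windows-1252). It illustrates the two character tables only; it does not model how any browser behaves.

```python
# Compare Python's strict ISO-8859-1 codec with its windows-1252 codec
# over every possible byte value.
differing = []
undefined_in_cp1252 = []

for value in range(256):
    raw = bytes([value])
    latin1_char = raw.decode("latin-1")  # always maps byte N to U+00NN
    try:
        cp1252_char = raw.decode("cp1252")
    except UnicodeDecodeError:
        # Python's cp1252 leaves a few bytes unassigned; the Encoding
        # Standard's windows-1252 index maps those to C1 controls instead.
        undefined_in_cp1252.append(value)
        continue
    if latin1_char != cp1252_char:
        differing.append(value)

print(f"differing bytes: 0x{min(differing):02X}-0x{max(differing):02X}")
print("0x80 as latin-1:", repr(b"\x80".decode("latin-1")))
print("0x80 as cp1252: ", repr(b"\x80".decode("cp1252")))
```

Every byte outside 0x80-0x9F decodes identically under both codecs, which is why documents that avoid that range are unaffected by the aliasing.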

Vincent Lefevre

Comment 13

12 years ago

From bug 897302: same problem with text/plain and encoding shown in "View Page Info".

Masatoshi Kimura [:emk]

Comment 14

12 years ago

Not a problem. We've just stopped telling a lie.

Vincent Lefevre

Comment 15

12 years ago

The lie is that the server declares iso-8859-1 and "Page Info" says "Encoding: windows-1252".

While I wanted to test the encoding after upgrading Apache and saw this "Page Info" result, I thought that there was a problem with the upgrade, while Firefox is just displaying incorrect information.

Masatoshi Kimura [:emk]

Comment 16

12 years ago

(In reply to Vincent Lefevre from comment #15)
> The lie is that the server declares iso-8859-1 and "Page Info" says
> "Encoding: windows-1252".
>
> While I wanted to test the encoding after upgrading Apache and saw this
> "Page Info" result, I thought that there was a problem with the upgrade,
> while Firefox is just displaying incorrect information.

We are honest that we are aliasing iso-8859-1 to windows-1252 even though it is actually a different encoding. Other browsers pretend to support iso-8859-1 when they do not.

Daniel Glazman (:glazou) (not active in Mozilla any more)

Reporter

Comment 17

12 years ago

(In reply to Masatoshi Kimura [:emk] from comment #16)
> We are honest that we are aliasing iso-8859-1 to windows-1252 even though
> it is actually a different encoding. Other browsers pretend to
> support iso-8859-1 when they do not.

Excuse me, emk, but I don't care about "being honest"; I care about not breaking applications that rely on document.inputEncoding reflecting the encoding declared in the document instance as it is, and this change breaks at least two. If you want an API reflecting the "real" charset, add one, but don't change what has been stable for 15 years just in the name of "purity". Sigh.

Vincent Lefevre

Comment 18

12 years ago

(In reply to Masatoshi Kimura [:emk] from comment #16)
> We are honest that we are aliasing iso-8859-1 to windows-1252 even though
> it is actually a different encoding.

But this is not clear from the "Page Info". And aliasing iso-8859-1 to windows-1252 is a bug anyway, which leads to inconsistencies and may confuse the user/developer. For instance, converting the control character to a Numeric Character Reference would yield a different rendering. Firefox should use the right rendering in the first place.

Moreover, documents which don't use character codes for which there is a difference between iso-8859-1 and windows-1252 don't care about this aliasing. The announced encoding must be the correct one (the one declared by the server, or from the document e.g. for XML).
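The numeric character reference inconsistency can be sketched for the XML case, which is the case of the original application/xhtml+xml test. This assumes Python's "cp1252" codec as a stand-in for the windows-1252 table and the standard xml.etree.ElementTree parser as a stand-in for an XML processor. HTML parsers remap such references, so the sketch does not describe text/html.

```python
import xml.etree.ElementTree as ET

# Byte 0x80 decoded with the windows-1252 table, which is what the aliasing
# applies to content labelled iso-8859-1.
from_byte = b"\x80".decode("cp1252")

# The same number written as a numeric character reference in XML, where it
# denotes the Unicode code point itself.
from_reference = ET.fromstring("<p>&#128;</p>").text

print(repr(from_byte))       # euro sign, U+20AC
print(repr(from_reference))  # C1 control character, U+0080
```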

Henri Sivonen (:hsivonen)

Comment 19

12 years ago

Daniel, Vincent: If a page declared "latin1", what would you expect document.characterSet and Page Info to say and why?

Vincent Lefevre

Comment 20

12 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #19)
> Daniel, Vincent: If a page declared "latin1", what would you expect
> document.characterSet and Page Info to say and why?

Is "latin1" standard (official)?

If yes, I would expect "latin1" (if and only if "iso-8859-1" is completely synonymous including for the 0x80-0x9f range[*], it may be OK to say "iso-8859-1", just because of the "iso-" prefix).

If no, then I'd expect some kind of failure (because users should not be encouraged to use non-standard things), possibly with a user-configurable fallback (such as charset auto-detect).

[*] AFAIK, 0x80-0x9f is undefined in some variants, but I don't know about latin1.

IMHO, Page Info should always display what charset has been served (from HTTP headers or from the document) + the charset actually used for interpretation if different, but only if interpretation was needed.

Anne (:annevk)

Comment 21

12 years ago

That seems way complex for something that is close to obsolete. The standard HTML, CSS et al use is http://encoding.spec.whatwg.org/ and per that what we do is correct. This is INVALID unless there are compatibility issues or other browser vendors refuse to implement the relevant standard in which case we need discussion elsewhere first.

Masatoshi Kimura [:emk]

Comment 22

12 years ago

(In reply to Vincent Lefevre from comment #20)
> Is "latin1" standard (official)?

"latin1" is (and even "windows-1252" is) a standard.
http://encoding.spec.whatwg.org/#concept-encoding-get

> it may be OK to say "iso-8859-1", just because of the "iso-" prefix

That's nothing but a cargo-cult belief.

> If no, then I'd expect some kind of failure (because users should
> not be encouraged to use non-standard things), possibly with a
> user-configurable fallback (such as charset auto-detect).

It is completely backward-incompatible. Your fiction does not match the real world at all.

(In reply to Anne (:annevk) from comment #21)
> This is INVALID unless there are compatibility issues or other browser
> vendors refuse to implement the relevant standard in which case we need
> discussion elsewhere first.

Agreed.
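The label lookup linked above can be sketched roughly as follows. This is a simplified illustration, not the specification's text: the table holds only a handful of the labels the Encoding Standard defines, and the names `LABEL_TO_NAME`, `ascii_lower` and `get_encoding_name` are hypothetical. It shows why "latin1", "iso-8859-1" and "windows-1252" all resolve to the same encoding name.

```python
# Partial label table (assumption: a small subset chosen for illustration;
# the Encoding Standard lists many more labels and encodings).
LABEL_TO_NAME = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "l1": "windows-1252",
    "ascii": "windows-1252",
    "us-ascii": "windows-1252",
    "cp1252": "windows-1252",
    "windows-1252": "windows-1252",
}

ASCII_WHITESPACE = "\t\n\f\r "


def ascii_lower(text):
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def get_encoding_name(label):
    """Return the encoding name for a label, or None if it is unknown."""
    key = ascii_lower(label.strip(ASCII_WHITESPACE))
    return LABEL_TO_NAME.get(key)


for label in ("latin1", " ISO-8859-1 ", "windows-1252", "klingon"):
    print(repr(label), "->", get_encoding_name(label))
```

Under this model the declared label is only an input to the lookup, and the resulting encoding name is what document.characterSet reports in the behaviour discussed in this bug.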

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → INVALID

Vincent Lefevre

Comment 23

12 years ago

(In reply to Masatoshi Kimura [:emk] from comment #22)
> http://encoding.spec.whatwg.org/#concept-encoding-get

This is in complete contradiction with:
http://www.w3.org/International/questions/qa-controls
(old, but not obsoleted).

Anne (:annevk)

Comment 24

12 years ago

That's not a specification. You can give them feedback though.

Status: RESOLVED → VERIFIED

test case showing issue 12 years ago Daniel Glazman (:glazou) (not active in Mozilla any more) 819 bytes, application/xhtml+xml		Details
text/html testcase 12 years ago Masatoshi Kimura [:emk] 970 bytes, text/html		Details
windows-1252 testcase 12 years ago Masatoshi Kimura [:emk] 978 bytes, text/html		Details