Closed
Bug 890478
Opened 12 years ago
Closed 12 years ago
document.characterSet and document.inputEncoding wrong for iso-8859-1
Categories
(Core :: DOM: HTML Parser, defect)
VERIFIED
INVALID
People
(Reporter: glazou, Unassigned)
Attachments
(3 files)
Both document.characterSet and document.inputEncoding incorrectly report windows-1252 when the encoding of the document is iso-8859-1. See test case attached. I hit this bug working on a new feature of BlueGriffon: I have to serialize in a standalone XML document a subtree of the currently edited document. I need to output the xml declaration but since the inputEncoding and the characterSet of the whole document are not reliable, I can't... Blocker.
FWIW, Chrome and Blink reply correct iso-8859-1 values.
IE 10 replies correct iso-8859-1 values.
IE 11 replies correct iso-8859-1 values.
Opera (Presto) replies windows-1252 for document.characterSet
and does not implement document.inputEncoding.
So since Opera is now based on Blink, Firefox is the only major browser choking on this.
(Note: I suppose this is related to the HTML parser; please fix the Component of the bug if this is not the case.)
Comment 1•12 years ago
(In reply to Daniel Glazman (:glazou) from comment #0)
> Both document.characterSet and document.inputEncoding incorrectly report
> windows-1252 when the encoding of the document is iso-8859-1.
windows-1252 is the preferred label for iso-8859-1 per the Encoding Standard, so Gecko is correct per spec. http://encoding.spec.whatwg.org/
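As a rough illustration of the label resolution described above, here is a minimal Python sketch. The table is only a small excerpt of the labels the Encoding Standard assigns to windows-1252 and UTF-8, and the helper name `get_encoding_name` is hypothetical; it is not Gecko code.

```python
# Illustrative excerpt only: a handful of the labels that the Encoding
# Standard resolves to windows-1252 or UTF-8. The real table is much longer.
LABELS_TO_NAME = {
    "windows-1252": "windows-1252",
    "cp1252": "windows-1252",
    "iso-8859-1": "windows-1252",
    "iso8859-1": "windows-1252",
    "latin1": "windows-1252",
    "l1": "windows-1252",
    "us-ascii": "windows-1252",
    "ascii": "windows-1252",
    "utf-8": "utf-8",
    "utf8": "utf-8",
}

# ASCII whitespace as the standard defines it: tab, LF, FF, CR, space.
ASCII_WHITESPACE = "\t\n\x0c\r "


def get_encoding_name(label: str):
    """Hypothetical helper: resolve a declared label to an encoding name.

    Strips surrounding ASCII whitespace and compares case-insensitively.
    str.lower() is used here as an approximation of ASCII lowercasing.
    Returns None when the label is not in the excerpt above.
    """
    key = label.strip(ASCII_WHITESPACE).lower()
    return LABELS_TO_NAME.get(key)


declared = "ISO-8859-1"
resolved = get_encoding_name(declared)
```

Under this sketch, a document declaring "ISO-8859-1" or "latin1" resolves to the single name windows-1252, which is the value the DOM then reports.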
> See test case
> attached. I hit this bug working on a new feature of BlueGriffon: I have to
> serialize in a standalone XML document a subtree of the currently edited
> document. I need to output the xml declaration but since the inputEncoding
> and the characterSet of the whole document are not reliable, I can't...
> Blocker.
All XML parsers MUST support UTF-8. Hence, using UTF-8 when serializing to XML is always the right thing to do.
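A minimal sketch of that advice, using Python's standard `xml.etree.ElementTree` rather than the DOM APIs BlueGriffon would actually use: the subtree is serialized as UTF-8 with an explicit XML declaration, regardless of the encoding the source document declared. The helper name `serialize_subtree_utf8` and the sample markup are assumptions for illustration; the `xml_declaration` argument requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def serialize_subtree_utf8(element: ET.Element) -> bytes:
    """Hypothetical helper: serialize a subtree as a standalone UTF-8 XML document.

    The output encoding is always UTF-8, so the encoding of the document the
    subtree came from never needs to be consulted.
    """
    return ET.tostring(element, encoding="utf-8", xml_declaration=True)


# Sample document containing characters outside ASCII.
root = ET.fromstring("<doc><p>caf\u00e9 \u20ac</p></doc>")
subtree = root.find("p")
xml_bytes = serialize_subtree_utf8(subtree)

# Any conforming XML parser can read the result back.
round_tripped = ET.fromstring(xml_bytes)
```

Because the declaration always names UTF-8, the serializer does not depend on document.characterSet or document.inputEncoding at all.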
> IE 10 replies correct iso-8859-1 values.
> IE 11 replies correct iso-8859-1 values.
This surprises me. What about older versions of IE?
Comment 2•12 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> > IE 10 replies correct iso-8859-1 values.
> > IE 11 replies correct iso-8859-1 values.
>
> This surprises me. What about older versions of IE?
Nothing, since the test case is application/xhtml+xml.
Reporter | ||
Comment 3•12 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> windows-1252 is the preferred label for iso-8859-1 per the Encoding
> Standard, so Gecko is correct per spec. http://encoding.spec.whatwg.org/
Can I say here how ridiculous I find this decision? Most people have no
idea what windows-1252 is, while iso-8859-1 has been a well-known name for ages.
Honestly, having iso-8859-1, a non-exotic name, explicitly listed in the XML
encoding declaration and the meta charset while the DOM returns windows-1252 is hard to believe.
Good or bad, Gecko is the only rendering engine returning windows-1252
here. And that is an issue.
Comment 4•12 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> (In reply to Daniel Glazman (:glazou) from comment #0)
> > Both document.characterSet and document.inputEncoding incorrectly report
> > windows-1252 when the encoding of the document is iso-8859-1.
>
> windows-1252 is the preferred label for iso-8859-1 per the Encoding
> Standard, so Gecko is correct per spec. http://encoding.spec.whatwg.org/
>
> > See test case
> > attached. I hit this bug working on a new feature of BlueGriffon: I have to
> > serialize in a standalone XML document a subtree of the currently edited
> > document. I need to output the xml declaration but since the inputEncoding
> > and the characterSet of the whole document are not reliable, I can't...
> > Blocker.
>
> All XML parsers MUST support UTF-8. Hence, using UTF-8 when serializing to
> XML is always the right thing to do.
>
> > IE 10 replies correct iso-8859-1 values.
> > IE 11 replies correct iso-8859-1 values.
>
> This surprises me. What about older versions of IE?
Internet Explorer uses "iso-8859-1" as the canonical name for "windows-1252" encoding.
Comment 5•12 years ago
IE8 also used "iso-8859-1".
Comment 6•12 years ago
Looks like IE and WebKit treat windows-1252 as a different encoding from iso-8859-1 even though the mapping is identical.
Comment 7•12 years ago
(In reply to Daniel Glazman (:glazou) from comment #3)
> Can I say here how ridiculous I find this decision?
Anne is in the CC.
> Whatever is good or bad, Gecko is the only rendering engine returning
> windows-1252 here. And that is an issue.
Does it break any Web sites?
(If you are generating XML today and you use an encoding other than UTF-8, you are adding to the encoding mess, frankly.)
Comment 8•12 years ago
FWIW, this was based on Rebel Opera and the assumption we could get away with it. We investigated this at the time we switched. We can make the Encoding Standard say anything, but our current behavior makes the most sense.
Reporter | ||
Comment 9•12 years ago
(In reply to Anne (:annevk) from comment #8)
> FWIW, this was based on Rebel Opera and the assumption we could get away
> with it. We investigated this at the time we switched. We can make the
> Encoding Standard say anything, but our current behavior makes the most
> sense.
I would really like to understand better "makes the most sense" when ISO-8859-1
is identical to Windows-1252 ***EXCEPT FOR*** the code points 128-159 (0x80-0x9F)
and that means they're ***NOT*** identical...
Comment 10•12 years ago
They are in implementations.
Comment 11•12 years ago
(In reply to Daniel Glazman (:glazou) from comment #9)
> I would really like to understand better "makes the most sense" when
> ISO-8859-1
> is identical to Windows-1252 ***EXCEPT FOR*** the code points 128-159
> (0x80-0x9F)
> and that means they're ***NOT*** identical...
Unfortunately, that is a fiction that doesn't reflect reality. Virtually all browsers (at least Gecko, Trident, Presto, WebKit, and Blink) have an iso-8859-1 decoder that is exactly the same as their windows-1252 decoder (including the 0x80-0x9F range).
We shouldn't pretend that we really support iso-8859-1.
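To make the disputed 0x80-0x9F range concrete, the sketch below compares Python's `latin-1` and `cp1252` codecs byte by byte. Unlike the browser decoders described above, Python keeps the two apart, so it shows what a strict ISO-8859-1 decoder would do differently. One caveat: Python's `cp1252` codec leaves a few bytes in that range unassigned and raises an error for them, whereas the Encoding Standard's windows-1252 maps those bytes to control characters, so this is an approximation and not the browsers' exact table.

```python
def bytes_decoded_differently() -> dict:
    """Map each byte value to (latin-1 result, cp1252 result) where the two differ."""
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        # latin-1 maps every byte to the code point with the same number.
        as_latin1 = raw.decode("latin-1")
        try:
            as_cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves a few bytes unassigned.
            as_cp1252 = None
        if as_cp1252 != as_latin1:
            diffs[value] = (as_latin1, as_cp1252)
    return diffs


diffs = bytes_decoded_differently()
lowest, highest = min(diffs), max(diffs)
```

For byte 0x80, `latin-1` yields the control character U+0080 while `cp1252` yields U+20AC (the euro sign); every byte outside 0x80-0x9F decodes identically under both codecs.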
Comment 13•12 years ago
From bug 897302: same problem with text/plain and encoding shown in "View Page Info".
Comment 14•12 years ago
Not a problem. We've just stopped telling a lie.
Comment 15•12 years ago
The lie is that the server declares iso-8859-1 and "Page Info" says "Encoding: windows-1252".
While I wanted to test the encoding after upgrading Apache and saw this "Page Info" result, I thought that there was a problem with the upgrade, while Firefox is just displaying incorrect information.
Comment 16•12 years ago
(In reply to Vincent Lefevre from comment #15)
> The lie is that the server declares iso-8859-1 and "Page Info" says
> "Encoding: windows-1252".
>
> While I wanted to test the encoding after upgrading Apache and saw this
> "Page Info" result, I thought that there was a problem with the upgrade,
> while Firefox is just displaying incorrect information.
We are honest that we alias iso-8859-1 to windows-1252 even though it is actually a different encoding. Other browsers pretend to support iso-8859-1 when they do not.
Reporter | ||
Comment 17•12 years ago
(In reply to Masatoshi Kimura [:emk] from comment #16)
> We are honest that we alias iso-8859-1 to windows-1252 even though
> it is actually a different encoding. Other browsers pretend to
> support iso-8859-1 when they do not.
Excuse me, emk, but I don't care about "being honest". I care about not breaking
applications that rely on document.inputEncoding reflecting
the encoding declared in the document instance as it is,
and this change breaks at least two. If you want an API reflecting the
"real" charset, add one, but don't change what has been stable for 15 years just
in the name of "purity". Sigh.
Comment 18•12 years ago
(In reply to Masatoshi Kimura [:emk] from comment #16)
> We are honest that we alias iso-8859-1 to windows-1252 even though
> it is actually a different encoding.
But this is not clear from the "Page Info". And aliasing iso-8859-1 to windows-1252 is a bug anyway, which leads to inconsistencies and may confuse the user/developer. For instance, converting the control character to a Numeric Character Reference would yield a different rendering. Firefox should use the right rendering in the first place.
Moreover, documents which don't use character codes for which there is a difference between iso-8859-1 and windows-1252 don't care about this aliasing. The announced encoding must be the correct one (the one declared by the server, or from the document e.g. for XML).
Comment 19•12 years ago
Daniel, Vincent: If a page declared "latin1", what would you expect document.characterSet and Page Info to say and why?
Comment 20•12 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #19)
> Daniel, Vincent: If a page declared "latin1", what would you expect
> document.characterSet and Page Info to say and why?
Is "latin1" standard (official)? If yes, I would expect "latin1" (if and only if "iso-8859-1" is completely synonymous including for the 0x80-0x9f range[*], it may be OK to say "iso-8859-1", just because of the "iso-" prefix). If no, then I'd expect some kind of failure (because users should not be encouraged to use non-standard things), possibly with a user-configurable fallback (such as charset auto-detect).
[*] AFAIK, 0x80-0x9f is undefined in some variants, but I don't know about latin1.
IMHO, Page Info should always display what charset has been served (from HTTP headers or from the document) + the charset actually used for interpretation if different, but only if interpretation was needed.
Comment 21•12 years ago
That seems way complex for something that is close to obsolete. The standard HTML, CSS et al use is http://encoding.spec.whatwg.org/ and per that what we do is correct.
This is INVALID unless there are compatibility issues or other browser vendors refuse to implement the relevant standard in which case we need discussion elsewhere first.
Comment 22•12 years ago
(In reply to Vincent Lefevre from comment #20)
> Is "latin1" standard (official)?
"latin1" is (and even "windows-1252" is) a standard.
http://encoding.spec.whatwg.org/#concept-encoding-get
> it may be OK to say "iso-8859-1", just because of the "iso-" prefix
That's nothing but a cargo-cult belief.
> If no, then I'd expect some kind of failure (because users should
> not be encouraged to use non-standard things), possibly with a
> user-configurable fallback (such as charset auto-detect).
It is completely backward-incompatible. Your fiction does not match the real world at all.
(In reply to Anne (:annevk) from comment #21)
> This is INVALID unless there are compatibility issues or other browser
> vendors refuse to implement the relevant standard in which case we need
> discussion elsewhere first.
agreed.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID
Comment 23•12 years ago
(In reply to Masatoshi Kimura [:emk] from comment #22)
> http://encoding.spec.whatwg.org/#concept-encoding-get
This is in complete contradiction with: http://www.w3.org/International/questions/qa-controls (old, but not obsoleted).
Comment 24•12 years ago
That's not a specification. You can give them feedback though.
Status: RESOLVED → VERIFIED