User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040113
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040113
With Windows XP Notepad you can save a file (e.g. an HTML page)
in the UTF-8 charset. Notepad then adds a 3-byte UTF-8 header (the byte order mark, BOM) to the file.
If the file claims to use a different charset (for example through the meta tag <meta
http-equiv="Content-Type" content="text/html; charset=ISO8859-1"> or because the
web server adds a different encoding to the response), the UTF-8 header is
displayed in the page.
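To make the symptom concrete, here is a small Python sketch (not part of the original report) of what the three header bytes turn into when they are decoded with the wrong charset. The sample markup is made up for illustration; the byte values are the standard UTF-8 byte order mark.

```python
import codecs

# The 3-byte "UTF-8 header" that Notepad writes is the byte order mark (BOM).
bom = codecs.BOM_UTF8  # b'\xef\xbb\xbf'

# Hypothetical page content, saved the way Notepad would save it as UTF-8.
page = bom + "<html><body>Hello</body></html>".encode("utf-8")

# Decoding with the (wrongly) declared charset turns the BOM into 3 visible chars.
as_latin1 = page.decode("iso-8859-1")
as_cp1252 = page.decode("windows-1252")

# Decoding as UTF-8 with BOM handling drops the header instead.
as_utf8 = page.decode("utf-8-sig")

print(repr(as_latin1[:3]))  # 'ï»¿'
print(repr(as_utf8[:6]))    # '<html>'
```

The same three characters appear under both ISO-8859-1 and Windows-1252, because the two charsets agree on the bytes 0xEF, 0xBB and 0xBF.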
Steps to Reproduce:
1. Create an HTML page with Windows Notepad and save it as UTF-8.
2. Make the page claim that it is e.g. ISO8859-1.
3. View the page in Mozilla (see attached file "WrongCharsetDeclared.html").
1. Download and install a default Apache web server.
2. If the server uses the default configuration, the httpd.conf file should
contain the following line:
(this creates a response header specifying the charset ISO8859-1 for the
returned html file, no matter how the file actually is encoded).
If not, add it.
3. Open the page with Mozilla and see 3 unexpected chars.
I think Mozilla should check for the presence of these UTF-8 header bytes instead
of trying to render them.
I know that Apache is arguably misconfigured, but this was the default
install, and I am not the only one who runs into this problem.
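The check suggested above (look for the UTF-8 header bytes before rendering) could look roughly like the following Python sketch. This only illustrates the reporter's proposal, not what Mozilla does; the function name decode_page and its parameters are hypothetical.

```python
import codecs


def decode_page(raw: bytes, declared_charset: str) -> str:
    """Decode raw page bytes, letting a UTF-8 BOM win over the declared charset.

    Hypothetical helper: 'declared_charset' stands for whatever the meta tag
    or the HTTP response claimed.
    """
    if raw.startswith(codecs.BOM_UTF8):
        # Strip the 3 header bytes and trust them over the declaration.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: fall back to the declared charset.
    return raw.decode(declared_charset)


# A UTF-8 file with BOM that is (wrongly) declared as ISO-8859-1.
print(decode_page(b"\xef\xbb\xbfgr\xc3\xbc", "iso-8859-1"))
# A genuine ISO-8859-1 file without BOM is decoded as declared.
print(decode_page(b"gr\xfc", "iso-8859-1"))
```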
I am not the only one who has this problem.
Go to http://www.aopen.nl/products/vga/ (hardware manufacturer) and you will
find the same problem (this page also claims to be Windows-1252).
Suse 9.0 Konqueror has the same problem, IE does not. It does not depend on the
Mozilla version (I verified it with 1.5, too) or the OS (Windows, Linux).
Created attachment 144765
This page displays the UTF-8 header with a misconfigured Apache
Created attachment 144766
Empty UTF-8 file, just contains the header
Created attachment 144767
This page declares a wrong charset, so UTF-8 header is displayed
hmm, this is interesting. should mozilla use the BOM or the meta charset when
both are present?
Note that if the webserver sends a charset, mozilla will not look at any other
source of charset information; this is intentional.
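As a rough illustration of the precedence described here, the Python sketch below lets an HTTP charset short-circuit every other source. Only that first rule comes from the comment above. The order of the remaining steps (BOM before meta), the simplified meta regex, the 1024-byte scan window and the fallback value are assumptions made for the sketch, and the function name is hypothetical.

```python
import codecs
import re
from typing import Optional

# Simplified pattern for a charset inside a <meta> tag (assumption, not a real parser).
META_RE = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def pick_charset_http_first(http_charset: Optional[str], raw: bytes) -> str:
    # Rule from the comment: if the server sent a charset, nothing else is consulted.
    if http_charset:
        return http_charset
    # Assumed order for the remaining sources: BOM first, then meta.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    match = META_RE.search(raw[:1024])
    if match:
        return match.group(1).decode("ascii")
    return "windows-1252"  # placeholder fallback, an assumption


page = b'\xef\xbb\xbf<meta http-equiv="Content-Type" content="text/html; charset=ISO8859-1">'
print(pick_charset_http_first("ISO-8859-1", page))  # the server's value wins
print(pick_charset_http_first(None, page))          # no HTTP charset: BOM is seen
```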
IMO there is no bug here. If the meta charset is inaccurate, the document is
displayed incorrectly. I don't see any reason why the BOM should override the
*** Bug 445108 has been marked as a duplicate of this bug. ***
(In reply to comment #4)
> hmm, this is interesting. should mozilla use the BOM or the meta charset when
> both are present?
> Note that if the webserver sends a charset, mozilla will not look at any
> other source of charset information; this is intentional.
The intention is good and "politically correct". However, it is wrong.
* IE and WebKit give the BOM higher priority than the Content-Type header.
* The IE/WebKit behaviour is in tune with XML 1.0.
* The Firefox/Opera behaviour triggers Quirks-Mode in HTML and the Yellow Screen of Death in XML; those errors are not seen in WebKit or IE.
Summary: for the encoding, the BOM should have higher priority than the HTTP header.
Test cases: http://malform.no/testing/html5/bom/
HTML5 bug: http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
XML spec: http://www.w3.org/TR/xml/#sec-guessing-with-ext-info
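For comparison, the ordering argued for in this comment (BOM ahead of the HTTP header) can be sketched in Python as follows. Covering the UTF-16 BOMs as well as the UTF-8 one is an assumption of the sketch, as are the function names and the fallback value; it is not taken from the IE or WebKit sources.

```python
import codecs
from typing import Optional

# BOM signatures checked by this sketch (the UTF-16 entries are an assumption).
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(raw: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None


def pick_charset_bom_first(http_charset: Optional[str], raw: bytes,
                           fallback: str = "windows-1252") -> str:
    # The BOM wins; the HTTP charset is only used when no BOM is present.
    return sniff_bom(raw) or http_charset or fallback


print(pick_charset_bom_first("ISO-8859-1", b"\xef\xbb\xbf<html>"))
print(pick_charset_bom_first("ISO-8859-1", b"<html>"))
```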
It is illogical to even allow the user to override the UTF-8 encoding, because doing so will *either* make the page render in Quirks-Mode *or* make the page suffer the Yellow Screen of Death.
(In reply to comment #7)
> * The IE/WebKit behaviour is in tune with XML 1.0.
That's not true.
> XML spec: http://www.w3.org/TR/xml/#sec-guessing-with-ext-info
... which clearly defers to RFC 3023, which says that the charset parameter is authoritative.
(In reply to comment #8)
> (In reply to comment #7)
> > * The IE/WebKit behaviour is in tune with XML 1.0.
> That's not true.
Beg to differ - or at least question it. See below.
> > XML spec: http://www.w3.org/TR/xml/#sec-guessing-with-ext-info
> ... which clearly defers to RFC 3023, which says that the charset parameter
> is authoritative.
Quoting XML 1.0:
]] F.2 Priorities in the Presence of External Encoding Information [[
Thus, Appendix F.2 talks about the presence of external encoding info. The preceding F.1 speaks about internal encoding info. A bit later, F.2 says:
]] their relative priority and the preferred method of handling conflict should be specified [[
Thus, F.2 explains how derived specifications (like the XHTML specs) should behave.
Note as well that it refers to RFC 3023 as "useful guidance", and nothing more. The most important part of F.2 is clearly the last two sentences, which I'll quote. And remember once more that F.2 speaks about "Presence of External Encoding Information". Hence, the last two sentences should also apply to a situation where there is external encoding info:
In the interests of interoperability, however, the following rule is recommended.
If an XML entity is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding.
I don't know whether it is contested that an XML entity served via HTTP "is in a file". And even if it is contested, I would like to know, in detail, what WebKit and IE are breaking w.r.t. the XML spec.
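The rule quoted from F.2 (use the BOM and the encoding declaration, if present) could be sketched like this in Python. It is a simplification: it only handles ASCII-compatible declarations, it lets a BOM win if the declaration disagrees (which the XML spec treats as an error), and the function name is hypothetical. The UTF-8 default for entities with neither a BOM nor a declaration follows XML 1.0.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of the entity.
XML_DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_file_encoding(raw: bytes) -> str:
    """Guess the encoding of an XML entity from its BOM and encoding declaration."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    match = XML_DECL_RE.match(raw)
    if match:
        return match.group(1).decode("ascii")
    # Neither BOM nor declaration: XML 1.0 requires UTF-8 in that case.
    return "utf-8"


print(xml_file_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(xml_file_encoding(b'\xef\xbb\xbf<?xml version="1.0"?><a/>'))
```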
"in a file" means in a file on the local filesystem, not something retrieved via HTTP (note the distinction between files and network protocols that it makes earlier in F.2: "as in some file systems and some network protocols".)
(In reply to comment #10)
I think that if it had wanted to remove all ambiguity, it should have said "in a file in a file system".
I would think that a far more important contrast is "in a database record" versus "in a file, including a file served via HTTP".
*** Bug 783946 has been marked as a duplicate of this bug. ***