Bug 238694 - UTF-8 BOM are rendered if charset is malconfigured
Status: RESOLVED INVALID
Product: Core
Classification: Components
Component: Internationalization
Version: Trunk
Platform: x86 Linux
Importance: -- normal
Target Milestone: ---
Assigned To: Simon Montagu :smontagu
QA Contact: Yuying Long
Duplicates: 445108 783946
 
Reported: 2004-03-25 13:38 PST by Wolfgang Knauf
Modified: 2012-08-21 04:08 PDT


Attachments
This page displays the UTF-8 header with a malconfigured apache (216 bytes, text/html)
2004-03-25 13:40 PST, Wolfgang Knauf
Empty UTF-8 file, just contains the header (3 bytes, text/html)
2004-03-25 13:40 PST, Wolfgang Knauf
This page declares a wrong charset, so UTF-8 header is displayed (220 bytes, text/html)
2004-03-25 13:41 PST, Wolfgang Knauf

Description Wolfgang Knauf 2004-03-25 13:38:55 PST
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040113
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040113

Windows XP Notepad can save a file (e.g. an HTML page) in the UTF-8 charset. When
it does, it prepends a 3-byte UTF-8 header (the byte order mark, EF BB BF) to the file.
If the file claims to use a different charset (for example through the meta tag <meta
http-equiv="Content-Type" content="text/html; charset=ISO8859-1"> or because the
web server adds a different encoding to the response), the UTF-8 header is
displayed in the page.
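
A minimal Python sketch of the effect described above, assuming a tiny made-up page body (the `page` bytes are illustrative, not one of the attachments). It only shows how the same three bytes decode under the declared charset versus the real one:

```python
# The three bytes Notepad writes at the start of a UTF-8 file (the UTF-8 BOM).
UTF8_BOM = b"\xef\xbb\xbf"

# Hypothetical page content, used only for illustration.
page = UTF8_BOM + b"<html><body>Hello</body></html>"

# Decoded with the charset the page wrongly claims, every BOM byte turns
# into its own visible character.
as_latin1 = page.decode("iso-8859-1")

# Decoded as UTF-8, the BOM becomes the single code point U+FEFF.
as_utf8 = page.decode("utf-8")

# Python's "utf-8-sig" codec drops a leading BOM entirely.
as_utf8_sig = page.decode("utf-8-sig")

print(repr(as_latin1[:3]))
print(repr(as_utf8[:1]))
print(as_utf8_sig)
```

The first three characters of `as_latin1` are the stray characters that show up at the top of the rendered page.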

Reproducible: Always
Steps to Reproduce:
1. Create an HTML page with Windows Notepad and save it as UTF-8.
2. Make the page claim that it is e.g. ISO8859-1.
3. View the page in Mozilla (see attached file "WrongCharsetDeclared.html").

Other approach:
1. Download and install a default Apache web server.
2. If the server uses the default configuration, the httpd.conf file should
contain the following line:
AddDefaultCharset ISO8859-1
(this makes the server send a response header specifying the charset ISO8859-1 for the
returned HTML file, no matter how the file is actually encoded).
If not, add it.
3. Open a UTF-8 page saved by Notepad through this server with Mozilla and see three stray characters (ï»¿).



Expected Results:  
I think Mozilla should check for the presence of these UTF-8 header bytes instead
of trying to render them.
I know that Apache is arguably misconfigured here, but this was the default
install, and I am not the only one who runs into this problem.
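
The check requested above could look roughly like the following Python sketch. The function name `strip_utf8_bom` is hypothetical and this is not Mozilla's code; it only illustrates testing for the three header bytes before handing the content to a decoder:

```python
import codecs


def strip_utf8_bom(data: bytes):
    """Return (had_bom, payload) for a byte string.

    had_bom is True when data starts with the UTF-8 BOM (EF BB BF);
    payload is data with that BOM removed, or data unchanged otherwise.
    """
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


# A file as Notepad would save it, and one without the header.
with_bom = codecs.BOM_UTF8 + b"<p>caf\xc3\xa9</p>"
without_bom = b"<p>plain</p>"

print(strip_utf8_bom(with_bom))
print(strip_utf8_bom(without_bom))
```

Whether the detected BOM should also override a conflicting charset declaration is a separate question, and it is the one debated in the comments.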

Go to http://www.aopen.nl/products/vga/ (hardware manufacturer) and you will
find the same problem (that page declares itself as Windows-1252).

SuSE 9.0 Konqueror has the same problem; IE does not. The problem does not depend on the
Mozilla version (I verified it with 1.5, too) or the OS (Windows, Linux).
Comment 1 Wolfgang Knauf 2004-03-25 13:40:10 PST
Created attachment 144765
This page displays the UTF-8 header with a malconfigured apache
Comment 2 Wolfgang Knauf 2004-03-25 13:40:40 PST
Created attachment 144766
Empty UTF-8 file, just contains the header
Comment 3 Wolfgang Knauf 2004-03-25 13:41:11 PST
Created attachment 144767
This page declares a wrong charset, so UTF-8 header is displayed
Comment 4 Christian :Biesinger (don't email me, ping me on IRC) 2004-05-24 09:21:18 PDT
hmm, this is interesting. should mozilla use the BOM or the meta charset when
both are present?

Note that if the webserver sends a charset, mozilla will not look at any other
source of charset information; this is intentional.
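
The precedence described in this comment can be sketched in Python as follows. This is an illustration only, not Gecko's implementation: the helper names and the simplified meta regex are assumptions, and real browsers prescan markup far more carefully.

```python
import re
from email.message import Message

# Simplified, hypothetical pattern for a charset inside a <meta> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_:.-]+)', re.IGNORECASE
)


def charset_from_http(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def charset_http_first(content_type, body):
    """HTTP charset wins outright; the document is only inspected without it."""
    http = charset_from_http(content_type)
    if http:
        return http, "http"
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    return None, "unknown"


body = (b"\xef\xbb\xbf<html><head><meta http-equiv=\"Content-Type\" "
        b"content=\"text/html; charset=ISO8859-1\"></head></html>")

print(charset_http_first("text/html; charset=ISO8859-1", body))
print(charset_http_first("text/html", body))
```

Under this policy the BOM at the start of `body` never influences the result when the server sends a charset, which is why the three bytes end up rendered.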
Comment 5 Simon Montagu :smontagu 2004-05-24 12:07:45 PDT
IMO there is no bug here. If the meta charset is inaccurate, the document is
displayed incorrectly. I don't see any reason why the BOM should override the
meta charset.
Comment 6 Jo Hermans 2008-07-14 06:07:31 PDT
*** Bug 445108 has been marked as a duplicate of this bug. ***
Comment 7 Leif Halvard Silli 2011-06-06 18:48:06 PDT
(In reply to comment #4)
> hmm, this is interesting. should mozilla use the BOM or the meta charset when
> both are present?
> 
> Note that if the webserver sends a charset, mozilla will not look at any
> other source of charset information; this is intentional.

The intention is well intended and "politically correct". However it is wrong.

 * IE and WebKit give the BOM higher priority than the Content-Type header.
 * The IE/WebKit behaviour is in tune with XML 1.0.
 * The Firefox/Opera behaviour triggers Quirks-Mode in HTML and the Yellow Screen of Death in XML; those errors are not seen in WebKit or IE.

Summary: for the encoding, the BOM should have higher priority than the HTTP header.

Test cases: http://malform.no/testing/html5/bom/
 HTML5 bug: http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
  XML spec: http://www.w3.org/TR/xml/#sec-guessing-with-ext-info

It is illogical to even allow the user to override the UTF-8 encoding, because doing so will *either* make the page render in Quirks-Mode *or* make the page suffer the Yellow Screen of Death.
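
The ordering argued for in this comment, with the BOM outranking the HTTP header, can be sketched in Python like this. The function name `charset_bom_first` is hypothetical, only the UTF-8 BOM is handled, and this is not the code of IE, WebKit, or any other browser:

```python
import codecs


def charset_bom_first(http_charset, body):
    """Pick an encoding, letting a UTF-8 BOM outrank the HTTP charset.

    http_charset is the charset parameter from the Content-Type header
    (or None); body is the raw response bytes.
    """
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8", "bom"
    if http_charset:
        return http_charset.lower(), "http"
    return None, "unknown"


body = codecs.BOM_UTF8 + b"<!DOCTYPE html><p>caf\xc3\xa9</p>"

encoding, source = charset_bom_first("ISO8859-1", body)
print(encoding, source)

# With the BOM-derived encoding, "utf-8-sig" drops the BOM so the
# doctype is the first thing the parser sees.
print(body.decode("utf-8-sig"))
```

With this ordering the doctype stays at the very start of the decoded text, which is the Quirks-Mode point the comment raises.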
Comment 8 David Baron :dbaron: ⌚️UTC-7 (busy September 14-25) 2011-06-06 19:37:11 PDT
(In reply to comment #7)
>  * The IE/WebKit behaviour is in tune with XML 1.0.

That's not true.

>   XML spec: http://www.w3.org/TR/xml/#sec-guessing-with-ext-info

... which clearly defers to RFC 3023, which says that the charset parameter is authoritative.
Comment 9 Leif Halvard Silli 2011-06-06 20:07:10 PDT
(In reply to comment #8)
> (In reply to comment #7)
> >  * The IE/WebKit behaviour is in tune with XML 1.0.
> 
> That's not true.

Beg to differ - or at least question it. See below.

> >   XML spec: http://www.w3.org/TR/xml/#sec-guessing-with-ext-info
> 
> ... which clearly defers to RFC 3023, which says that the charset parameter
> is authoritative.

Quoting XML 1.0:

]] F.2 Priorities in the Presence of External Encoding Information [[

Thus, Appendix F.2 talks about presence of external encoding info. The preceding F.1 speaks about internal encoding info. F.2 a bit later says:

]] their relative priority and the preferred method of handling conflict should be specified [[

Thus, F.2 explains how derived specifications (like the XHTML specs) should behave.

Note as well that it refers to RFC 3023 as "useful guidance", and nothing more. The most important part of F.2, is clearly the last two sentences, which I'll quote. And remember once more that F.2 speaks about "Presence of External Encoding Information". Hence, the last two sentences should also be applied to a situation where there is external encoding info:

]]
In the interests of interoperability, however, the following rule is recommended.

    If an XML entity is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding.
[[

I don't know if it is contested that an XML entity served via HTTP "is in a file"? And even if it is contested, I would like to know, in detail, what WebKit and IE are breaking w.r.t. the XML spec.
Comment 10 David Baron :dbaron: ⌚️UTC-7 (busy September 14-25) 2011-06-06 20:38:48 PDT
"in a file" means in a file on the local filesystem, not something retrieved via HTTP (note the distinction between files and network protocols that it makes earlier in F.2:  "as in some file systems and some network protocols".)
Comment 11 Leif Halvard Silli 2011-06-07 07:54:10 PDT
(In reply to comment #10)

I think that if it had wanted to remove all ambiguity, it should have said "in a file in a file system".

I would think that a far more important contrast is "in a database record" versus "in a file, including a file served via HTTP".
Comment 12 [:Aleksej] 2012-08-21 04:08:03 PDT
*** Bug 783946 has been marked as a duplicate of this bug. ***
