Closed Bug 238694 Opened 20 years ago Closed 20 years ago

UTF-8 BOM are rendered if charset is malconfigured

Categories

(Core :: Internationalization, defect)

x86
Linux
defect
Not set
normal

RESOLVED INVALID

People

(Reporter: wolfgang.knauf, Assigned: smontagu)

Attachments

(3 files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040113
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040113

With Windows XP Notepad you can save a file (e.g. an HTML page) in the UTF-8
charset. A 3-byte UTF-8 header is then added to the start of the file.
If the file claims to use a different charset (for example via the meta tag <meta
http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> or because the
web server adds a different encoding to the response), the UTF-8 header is
displayed in the page.

Reproducible: Always
Steps to Reproduce:
1. Create an HTML page with Windows Notepad and save it as UTF-8.
2. Make the page claim that it is e.g. ISO-8859-1.
3. View the page in Mozilla (see attached file "WrongCharsetDeclared.html").

Other approach:
1. Download and install a default Apache web server.
2. If the server uses the default configuration, the httpd.conf file should
contain the following line (if not, add it):
AddDefaultCharset ISO-8859-1
This creates a response header specifying the charset ISO-8859-1 for the
returned HTML file, no matter how the file is actually encoded.
3. Open the page with Mozilla and see 3 stray characters rendered in the page.
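
For illustration, the 3 characters are simply what the UTF-8 header bytes (EF BB BF) look like when they are decoded as ISO-8859-1. A minimal Python sketch, assuming a tiny made-up page (the `page` content and the `decode_as` helper are hypothetical and not taken from the attached files):

```python
# The three bytes Notepad writes at the start of a file saved as UTF-8.
BOM = b"\xef\xbb\xbf"


def decode_as(data: bytes, charset: str) -> str:
    """Decode raw page bytes with the charset the page or server claims."""
    return data.decode(charset)


# Hypothetical page: UTF-8 header bytes followed by plain ASCII markup.
page = BOM + b"<html><body>hello</body></html>"

# Declared charset is wrong: every header byte becomes a visible character.
as_latin1 = decode_as(page, "iso-8859-1")

# Python's "utf-8-sig" codec strips the header bytes before decoding.
as_utf8 = decode_as(page, "utf-8-sig")

print(repr(as_latin1[:3]))  # the three stray characters
print(as_utf8[:6])          # <html>
```

The same three characters appear whenever a single-byte Western charset such as Windows-1252 is applied to the file, since those byte values map to printable characters there as well.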



Expected Results:
I think Mozilla should check for the presence of these UTF-8 header bytes instead
of trying to render them.
I know that the Apache server is arguably misconfigured, but this was the default
install, and I am not the only one who runs into this problem.

Go to http://www.aopen.nl/products/vga/ (hardware manufacturer) and you will
find the same problem (that page claims to be Windows-1252).

SuSE 9.0 Konqueror has the same problem; IE does not. It is not specific to a
Mozilla version (I verified it with 1.5, too) or to an OS (Windows, Linux).

Hmm, this is interesting. Should Mozilla use the BOM or the meta charset when
both are present?

Note that if the web server sends a charset, Mozilla will not look at any other
source of charset information; this is intentional.
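
As a rough sketch of the precedence described in this comment: a charset from the web server wins outright, and nothing else is consulted. The function name, the position of the BOM after the meta charset, and the `windows-1252` fallback are assumptions made for illustration only; this is not Mozilla's actual code.

```python
from typing import Optional


def pick_charset_http_first(
    http_charset: Optional[str],
    meta_charset: Optional[str],
    bom_charset: Optional[str],
    fallback: str = "windows-1252",  # assumed default, for illustration
) -> str:
    """Return the charset to decode with, letting the HTTP header win."""
    if http_charset:
        # Server-sent charset: no other source is looked at.
        return http_charset
    if meta_charset:
        # Assumption: the meta charset is consulted next and is not
        # overridden by a BOM.
        return meta_charset
    if bom_charset:
        # Assumption: the BOM only matters when nothing was declared.
        return bom_charset
    return fallback


# The Apache case from the report: header says ISO-8859-1, file has a UTF-8 BOM.
print(pick_charset_http_first("ISO-8859-1", None, "utf-8"))  # ISO-8859-1
```

Under this ordering the page from the report is decoded as ISO-8859-1, which is why the BOM bytes end up as visible characters.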
Assignee: general → smontagu
Component: Browser-General → Internationalization
QA Contact: general → amyy
Summary: UTF-8 Header bytes are rendered if charset is malconfigured → UTF-8 BOM are rendered if charset is malconfigured

IMO there is no bug here. If the meta charset is inaccurate, the document is
displayed incorrectly. I don't see any reason why the BOM should override the
meta charset.

Status: UNCONFIRMED → RESOLVED
Closed: 20 years ago
Resolution: --- → INVALID

(In reply to comment #4)
> hmm, this is interesting. should mozilla use the BOM or the meta charset when
> both are present?
> 
> Note that if the webserver sends a charset, mozilla will not look at any
> other source of charset information; this is intentional.

The intention is good and "politically correct". However, it is wrong.

 * IE and WebKit give the BOM higher priority than the Content-Type header.
 * The IE/WebKit behaviour is in tune with XML 1.0.
 * The Firefox/Opera behaviour triggers quirks mode in HTML and the Yellow Screen of Death in XML; those errors are not seen in WebKit or IE.

Summary: for the encoding, the BOM should have higher priority than the HTTP header.
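
A minimal sketch of the BOM-first ordering proposed here, where a byte-order mark at the start of the data overrides whatever the Content-Type header says. The helper names and the `windows-1252` fallback are hypothetical; the BOM constants come from Python's standard `codecs` module. It is only meant to show the idea, not how IE or WebKit implement it.

```python
import codecs
from typing import Optional

# Byte-order marks checked at the very start of the response body.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def pick_charset_bom_first(data: bytes, http_charset: Optional[str] = None) -> str:
    """BOM beats the HTTP header; the header is used only without a BOM."""
    return sniff_bom(data) or http_charset or "windows-1252"  # assumed fallback


# Same Apache scenario as in the report: header says ISO-8859-1, body has a BOM.
print(pick_charset_bom_first(b"\xef\xbb\xbf<p>hi</p>", "ISO-8859-1"))  # utf-8
print(pick_charset_bom_first(b"<p>hi</p>", "ISO-8859-1"))              # ISO-8859-1
```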

Test cases: http://malform.no/testing/html5/bom/
 HTML5 bug: http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
  XML spec: http://www.w3.org/TR/xml/#sec-guessing-with-ext-info

It is illogical to even allow the user to override the UTF-8 encoding, because doing so will *either* make the page render in quirks mode *or* make the page suffer the Yellow Screen of Death.

(In reply to comment #7)
>  * The IE/WebKit behaviour is in tune with XML 1.0.

That's not true.

>   XML spec: http://www.w3.org/TR/xml/#sec-guessing-with-ext-info

... which clearly defers to RFC 3023, which says that the charset parameter is authoritative.

(In reply to comment #8)
> (In reply to comment #7)
> >  * The IE/WebKit behaviour is in tune with XML 1.0.
> 
> That's not true.

Beg to differ - or at least question it. See below.

> >   XML spec: http://www.w3.org/TR/xml/#sec-guessing-with-ext-info
> 
> ... which clearly defers to RFC 3023, which says that the charset parameter
> is authoritative.

Quoting XML 1.0:

]] F.2 Priorities in the Presence of External Encoding Information [[

Thus, Appendix F.2 talks about the presence of external encoding info. The preceding F.1 speaks about internal encoding info. A bit later, F.2 says:

]] their relative priority and the preferred method of handling conflict should be specified [[

Thus, F.2 explains how derived specifications (like the XHTML specs) should behave.

Note as well that it refers to RFC 3023 as "useful guidance", and nothing more. The most important part of F.2 is clearly the last two sentences, which I'll quote. And remember once more that F.2 speaks about "Presence of External Encoding Information". Hence, the last two sentences should also be applied to a situation where there is external encoding info:

]]
In the interests of interoperability, however, the following rule is recommended.

    If an XML entity is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding.
[[
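
To make the quoted rule concrete, here is a rough sketch of deciding the encoding of an XML entity from the BOM and the encoding declaration alone. The function name and the regular expression are hypothetical, the regex only handles declarations written in an ASCII-compatible encoding, and letting a BOM win over a conflicting declaration is a simplification of mine rather than something the quote spells out.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_file_encoding(data: bytes) -> str:
    """Pick an encoding from the BOM, then the declaration, else UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # XML's default when there is neither a BOM nor a declaration.
    return "utf-8"


print(xml_file_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(xml_file_encoding(codecs.BOM_UTF8 + b"<a/>"))
```

Whether this rule also applies to an entity retrieved over HTTP is exactly the point under dispute in this thread; the sketch takes no side on that.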

I don't know if it is contested that an XML entity served via HTTP "is in a file"? And even if it is contested, I would like to know, in detail, what WebKit and IE are breaking w.r.t. the XML spec.

"in a file" means in a file on the local filesystem, not something retrieved via HTTP (note the distinction between files and network protocols that it makes earlier in F.2:  "as in some file systems and some network protocols".)

(In reply to comment #10)

I think that if it had wanted to remove all ambiguity, it should have said "in a file in a file system".

I would think that a far more important contrast is "in a database record" versus "in a file, including a file served via HTTP".