Closed Bug 662458 Opened 13 years ago Closed 12 years ago

To avoid Quirks-Mode or Yellow Screen of Death, the UTF-8 BOM must have higher encoding priority than HTTP Content-Type:

Categories

(Core :: DOM: HTML Parser, defect)

defect
Not set
major

RESOLVED DUPLICATE of bug 716579

People

(Reporter: xn--mlform-iua, Unassigned)

Details

User-Agent:       Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Build Identifier: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1

Currently Mozilla gives higher priority to the charset=*; attribute of the HTTP Content-Type: header than it gives to the UTF-8 Byte-Order Mark (BOM). 

As a consequence, if the Content-Type: header is wrong (e.g. "charset=KOI8-R"), the HTML parser as well as the XML parser will see illegal characters at the beginning of the document (before the DOCTYPE or before the XML declaration etc.).

This, in turn, will cause Quirks-Mode, "gibberish" characters and visible BOM characters in the HTML parser, and a fatal error (Yellow Screen of Death) in the XML parser.

The issue can be solved by giving the UTF-8 BOM higher priority than the HTTP Content-Type: header. XML 1.0 in fact gives this interoperability recommendation! Whereas for HTML5, there is still time to better specify how to handle the BOM. Internet Explorer and Webkit already behave as XML 1.0 suggests.
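The proposed precedence can be sketched in a few lines of Python. This is only an illustration of the rule suggested in this bug, not Gecko's actual code; the names `pick_encoding` and `decode_body`, and the `windows-1252` fallback, are assumptions made for the example.

```python
import codecs
from typing import Optional

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def pick_encoding(body: bytes, http_charset: Optional[str],
                  default: str = "windows-1252") -> str:
    """Choose the decoding charset; a UTF-8 BOM outranks the HTTP charset."""
    if body.startswith(UTF8_BOM):
        return "utf-8"          # BOM wins, whatever Content-Type says
    if http_charset:
        return http_charset     # no BOM: trust the HTTP header
    return default              # hypothetical fallback for this sketch


def decode_body(body: bytes, http_charset: Optional[str]) -> str:
    """Decode the body and drop the BOM so it never reaches the parser."""
    encoding = pick_encoding(body, http_charset)
    if body.startswith(UTF8_BOM):
        body = body[len(UTF8_BOM):]
    return body.decode(encoding)
```

With this ordering, a page that starts with a BOM but is served as "charset=KOI8-R" is decoded as UTF-8, so the parser sees the DOCTYPE or XML declaration first.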

Reproducible: Always

Steps to Reproduce:
1. Create two copies of a UTF-8 encoded XHTML page.
2. Let the copies contain a BOM, some non-ASCII chars + Quirks-Mode CSS (width:100; instead of width:100px;)
3. Serve the copies with the name of a non-UTF-8 charset in the HTTP Content-Type: header:
3.a. Serve one copy as HTML + "Content-Type: text/html; charset=KOI8-R"
3.b. Serve other copy as XHTML + "Content-Type: application/xhtml+xml; charset=KOI8-R"
4. Load both files in Firefox, one after another, and check for the following artefacts:
   For HTML:
   - correct rendering of non-ASCII letters, 
   - is the BOM visible?
   - is quirks-mode triggered?
   For XHTML:
   - yellow screen of death
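
For anyone who wants to reproduce this locally rather than via the linked test case, the steps above can be approximated with a small Python server. This is a sketch under assumptions: the paths `/test.html` and `/test.xhtml`, the page content and the helper `make_server` are invented for the example and are not part of the original test case.

```python
import codecs
from http.server import BaseHTTPRequestHandler, HTTPServer

# One XHTML page, served twice. "width:100" (no unit) is meant to take
# effect only in quirks mode; the div text supplies the non-ASCII letters.
PAGE = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    '<head><title>BOM test</title>\n'
    '<style>div { width:100; background:lime }</style></head>\n'
    '<body><div>blåbærsyltetøy</div></body>\n'
    '</html>\n'
)
BODY = codecs.BOM_UTF8 + PAGE.encode("utf-8")  # UTF-8 bytes with a BOM

# Hypothetical paths; both deliberately announce the wrong charset.
ROUTES = {
    "/test.html": "text/html; charset=KOI8-R",
    "/test.xhtml": "application/xhtml+xml; charset=KOI8-R",
}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        content_type = ROUTES.get(self.path)
        if content_type is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)


def make_server(port: int = 8000) -> HTTPServer:
    """Build the test server; call .serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), Handler)
```

Running `make_server().serve_forever()` and loading http://127.0.0.1:8000/test.html and http://127.0.0.1:8000/test.xhtml should then exercise steps 3.a and 3.b.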

Actual Results:  
For the HTML page:
* non-ASCII characters are rendered as mojibake/gibberish/unreadable
* the BOM is visible in the document
* the page is in quirks-mode

For the XHTML page:
* yellow screen of death (because Mozilla sees the BOM characters as illegal gibberish before the actual XML document begins).
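
Why the parsers trip can be seen by decoding the same bytes both ways. The snippet below is a rough illustration using Python's built-in `koi8_r` and `utf-8-sig` codecs; the sample document is made up for the example.

```python
import codecs

# A UTF-8 document that starts with a BOM, as in the test case.
document = codecs.BOM_UTF8 + "<!DOCTYPE html><p>æøå</p>".encode("utf-8")

# What the parser gets when it trusts "charset=KOI8-R":
# every byte maps to some KOI8-R character, so nothing fails,
# but the three BOM bytes become three visible junk characters.
as_koi8r = document.decode("koi8_r")
bom_garbage = as_koi8r[:3]

# What the parser gets when the BOM is honoured:
# "utf-8-sig" decodes as UTF-8 and strips the leading BOM.
as_utf8 = document.decode("utf-8-sig")

print(repr(bom_garbage))               # junk characters before the DOCTYPE
print(as_koi8r.startswith("<!DOCTYPE"))
print(as_utf8.startswith("<!DOCTYPE"))
```

In the KOI8-R reading, the text no longer begins with the DOCTYPE, which matches the quirks-mode result in HTML and the "junk before the document" fatal error in XML.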

Expected Results:  
For the HTML file:
* Same behavior as in Internet Explorer and Webkit browsers.
* That is: no non-ASCII problems, no visible-BOM, only 'no-Quirks' (aka strict) mode
For the XML/XHTML file:
* Same behavior as Webkit (and probably IE9) shows
* Adherence to XML 1.0's recommendation to respect BOM more than HTTP
  http://www.w3.org/TR/xml/#sec-guessing-with-ext-info
* That is: no Yellow Screen of Death. Instead, normal rendering

UTF-16 BOM:
This bug focuses only on the UTF-8 BOM. One would think that the same or similar rules apply to the UTF-16 BOM. However, a separate "shadow bug" for the UTF-16 BOM may be advisable. Reasons: I have not studied the UTF-16 BOM very much; the UTF-8 BOM has particular relevance in HTML5; and how to handle the BOM in HTML is, for the first time, specced in HTML5 (HTML4 only speaks about the UTF-16 BOM). Also, the UTF-8 BOM has often been misunderstood and questioned.

User overriding:
Note as well that Webkit and IE do not allow you to override the encoding whenever the encoding is UTF-8 *and* there is a UTF-8 Byte Order Mark. This is very logical because, should the user change the encoding manually, the HTML page would land in Quirks-Mode while the XML page would land in Yellow Screen of Death. The IE/Webkit behaviour turns UTF-8 + BOM into a particularly safe, secure and efficient encoding.

Some related links:
Test case: http://malform.no/testing/html5/bom/
HTML5 bug: http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
 XML spec: http://www.w3.org/TR/xml/#sec-guessing-with-ext-info

Some related bugs:
 Bug 236325; bug 287553; bug 238694;

Priority:
  I suggest bug priority "Major" because: 
  * it relates to HTML5, 
  * it breaks with IE and Webkit, 
  * it triggers Quirks-Mode and
  * it unnecessarily triggers YSOD in XML.
XML 1.0 mentions RFC3023. It is notable that RFC3023 only speaks about the UTF-16 BOM and not about the UTF-8 BOM.

http://tools.ietf.org/html/rfc3023#page-15
Status: UNCONFIRMED → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Before I can accept that you resolve this bug as a duplicate of bug 716579, please confirm that the outcome of bug 716579 will be the same.

In that regard, XML says: <http://www.w3.org/TR/REC-xml/> 

]] It is a fatal error if an XML entity is determined (via default, encoding declaration, or **higher-level protocol**) to be in a certain encoding but contains byte sequences that are not legal in that encoding.[[

The purpose of this bug (bug 662458) is to make Firefox *ignore* what XML says as far as BOM is concerned. Namely: The bug suggests that even if a higher protocol (such as HTTP) says e.g. "charset=KOI8-R", Firefox' XML parser will not consider it as a fatal error *provided* that the document starts with a BOM.

Also: One thing that this bug (bug 662458) does not take up is when BOM contradicts <?xml version="1.0" encoding="charset-name" ?>. Is bug 716579 going to *not* consider it a fatal error if BOM contradicts the XML declaration, too? (That is certainly fully OK to me - I just want to know.)

Finally, what about situations when there is no BOM? For example, what if the higher protocol says "KOI8-R", but the declaration says <?xml version="1.0" encoding="UTF-8"?>. This, too, is currently an error. Should it not continue to be?

PS: I would say that XML 1.0 5th edition does give higher priority to document-internal encoding info (such as BOM and XML declaration) than it gives to HTTP. In fact: That is precisely why it requires that any contradiction between higher-level and document-level information is a fatal error. And thus, when it comes to XML, it isn't 100% correct to say that this has anything to do with priority. Rather, it has solely to do with what is considered a fatal error. 

Thus, for XML, this new direction - that I am in favor of - would do two things to the XML parser: 
1) It would change some errors from 'fatal error' to non-fatal errors
2) And "replace" them with encoding declaration hierarchy rules.
Status: RESOLVED → UNCONFIRMED
Resolution: DUPLICATE → ---
> XML parser will not consider it as a fatal error *provided* that the document starts
> with a BOM.

Once the other bug is fixed, XML parser would just ignore the HTTP charset if the document has a BOM.  If the document does not decode vie the charset the BOM specified, that will still be a fatal error.
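
A minimal sketch of that behaviour, for illustration only: the BOM overrides the HTTP charset, but undecodable bytes remain fatal. The names `FatalEncodingError` and `decode_xml_entity` are invented here, and the sketch ignores the XML declaration entirely, which a real XML parser would not.

```python
import codecs
from typing import Optional


class FatalEncodingError(Exception):
    """Stands in for the XML parser's fatal error (the YSOD case)."""


def decode_xml_entity(body: bytes, http_charset: Optional[str]) -> str:
    if body.startswith(codecs.BOM_UTF8):
        # BOM present: ignore the HTTP charset and decode as UTF-8.
        encoding = "utf-8"
        payload = body[len(codecs.BOM_UTF8):]
    else:
        # No BOM: unchanged behaviour, the HTTP charset is used
        # (UTF-8 is assumed here when no charset was sent).
        encoding = http_charset or "utf-8"
        payload = body
    try:
        return payload.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        # Bytes that are illegal in the chosen encoding stay fatal.
        raise FatalEncodingError(f"not valid {encoding}") from exc
```

So a BOM-prefixed document served as KOI8-R parses normally, while a BOM followed by bytes that are not valid UTF-8 still ends in the fatal error.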

> Finally, what about situations when there is no BOM?

They won't change from current behavior.
I was not able to parse your second sentence unless I added some "error correction". 

But I suppose you meant that "if the document does not decode **via the BOM specified**, that will still be a fatal error".

With that, it sounds like this is a duplicate of the newer bug 716579.
Status: UNCONFIRMED → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE