Closed Bug 716579 Opened 13 years ago Closed 12 years ago

BOM should override HTTP-level charset

Categories

(Core :: DOM: HTML Parser, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla19

People

(Reporter: jsmith, Assigned: hsivonen)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 4 obsolete files)

Step: Go to http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html

Expected: See Chrome portion of screenshot.

Actual: See Firefox portion of screenshot. HTML showed up with two "?" characters at the beginning, with the page itself not rendered. When looking at the source of the page, there appears to be many random characters showing up, making it impossible to understand the underlying source of the page in Firefox.

Additional notes:

http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html renders correctly also in IE 9.
The site sends:

  Content-Type: text/html; charset=utf-8

then follows with UTF-encoded data.  I believe that we're doing the right thing here and respecting the Content-Type header.  I have no idea what WebKit and IE are doing...  Henri, do you?
I believe Anne has been looking into this stuff recently too.
The site also has <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">, but of course everyone involved ignores that.
I guess one question is what should happen to the nulls if the page _is_ decoded as UTF-8.  The page is all ASCII, so hard to say what UAs are really doing with it.
Additional notes:

Opera has similar behavior to Firefox (see screenshot attached).
Chrome's View > Encoding menu has "UTF-16LE" checked on this page, for what it's worth.
(In reply to Boris Zbarsky (:bz) from comment #1)
> I have no idea what
> WebKit and IE are doing...  Henri, do you?

They are allowing the UTF-16 BOM override HTTP. Per earlier discussion with Anne, we should do this, too, and the Encoding spec is expected to require it.
Assignee: nobody → hsivonen
OS: Windows 7 → All
Hardware: x86 → All
Summary: http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html does not render correctly in Firefox → BOM should override HTTP-level charset
Version: 9 Branch → Trunk
Does it override all HTTP charsets, or just "unicode" ones?
All. Basically you have an algorithm you pass the incoming stream of octets and the determined label, if the stream of octets starts with one of the three BOMs you skip the BOM and pick UTF-8 / UTF-16LE / UTF-16BE based on it, otherwise you use the label. (In due course the standards will use an algorithm like this if I get my way.)
Blocks: encoding
Please verify that bug 662458 really is a duplicate: https://bugzilla.mozilla.org/show_bug.cgi?id=662458#c3
Blocks: 786149
In the first phase, I’m going to leave the possibility of overriding the charset from the UI in place. (Mainly because our code assumes that the UI is *always* able to override and changing that is harder than changing the precedence of BOM relative to HTTP.)

(In reply to Leif Halvard Silli from comment #12)
> Please verify that bug 662458 really is a duplicate:
> https://bugzilla.mozilla.org/show_bug.cgi?id=662458#c3

The XML part is not duplicate. I’m inclined to violate the infamous RFC 3023 and let the BOM take precedence in the XML case as well.
What do other browsers do for XML documents?
(In reply to Henri Sivonen (:hsivonen) from comment #14)
> The XML part is not duplicate. I’m inclined to violate the infamous RFC 3023
> and let the BOM take precedence in the XML case as well.

Ignore RFC 3023 with respect to defaulting for now. First of all, the HTTP-specific default of ISO-8859-1 is gone in HTTPbis (in WGLC /really/ /soon/ /now/). Also, a revision of RFC 3023bis is supposed to be in progress as well.
 (In reply to Masatoshi Kimura [:emk] from comment #15)
> What do other browsers do for XML documents?

For mainstream browsers, then, according to the data I have collected[*], Webkit, Chrome and IE9 let the BOM override also for XML. But note that according to the same data, IE10 consumer preview, contrary to IE9 and earlier, does not let BOM override HTTP whether in HTML or in XML.

[*] http://malform.no/blog/white-spots-in-html5-s-encoding-sniffing-algorithm

    NOTE regarding Chrome and Webkit: 

      XML
   * has a default encoding (UTF-8)
   * does not allow user overrides. 

     vs

      Chrome and Webkit's XML handling:
   * allow user override also for XML
   * default to parent browsing context's encoding [#]

[#] http://www.xn--mlform-iua.no/blog/utf8files/locale_default_vs_doc_of_parent_browsing_context/#frame6

    These peculiarities in Webkit and Chrome makes it more secure and useful, for authors, to use the BOM. Because, withuot the BOM, then XML does not work reliably, in these browsers. But OTOH, one could argue that it would have been much more important that Webkit and Chrome, for XML files, *stopped* defaulting to the parent browsing context and stopped allowing user overrides. So while it may be OK for Firefox to let BOM override everything also for XML, it is far less necessary to do so and probably without effect on interoperability.

PPS: For non-mainstrean XML parsers, such as libxml2 and many others, then they show many ideosyncrasies with regard to the BOM versus other encoding information - I collected many data on this in the following bug report: https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
One more NOTE on Chrome and Webkit: If the XML file (for example an SVG file) is used as an image (e.g. <img src="svg-file.svg">), then parent browsing context and user override is (fortunately) ignored.
Attached patch Fix for HTML (obsolete) — Splinter Review
Attached patch Fix for HTML and XML (obsolete) — Splinter Review
I rolled XML into the same patch. Fixing HTML and XML out of sync would have cause broken code, because they share the charset source constant order.
Attachment #672750 - Attachment is obsolete: true
Attachment #673807 - Attachment is obsolete: true
Attachment #673808 - Flags: review?(bugs)
This is pretty horrible to review. Would -w make the patch more readable?
Attached patch hg diff -w version (obsolete) — Splinter Review
Comment on attachment 673808 [details] [diff] [review]
Fix for HTML and XML, without printf

>- return !oCharset.IsEmpty();
>+  // This code is rather pointless to have. Might as well reuse expat as
>+  // seen in nsHtml5StreamParser. -- hsivonen
>+  oCharset.Truncate();
>+  if ((0x3C == aBytes[0]) &&
>+      (0x3F == aBytes[1]) &&
>+      (0x78 == aBytes[2]) &&
>+      (0x6D == aBytes[3]) &&
>+      (0 == PL_strncmp("<?xml", (char*) aBytes, 5))) {
I must be missing something here. Why you compare 4 bytes ("<?xm") first and then
5 bytes.


>+    // This code was bogus when I found it. It expects the BOM or the XML
>+    // declaration to be entirely in the first network buffer. -- hsivonen
>+    if (source < kCharsetFromByteOrderMark &&
>+        count >= 2 && ((buf[0] == 0xFE && buf[1] == 0xFF) ||
>+                       (buf[0] == 0xFF && buf[1] == 0xFE))) {
>+      // Don't bother with different code path for little-endian and
>+      // big-endian. Our decoder will re-sniff anyway.
>+      preferred.AssignLiteral("UTF-16");
>+      source = kCharsetFromByteOrderMark;
So to make this somehow readable, could we please add something like
bool
nsContentUtils::IsUTF16BOM(const unsigned char* aData, uint32_t aLen);



>+    } else if (source < kCharsetFromByteOrderMark &&
>+               count >= 3 &&
>+               buf[0] == 0xEF &&
>+               buf[1] == 0xBB &&
>+               buf[2] == 0xBF) {
>+      // The decoder will swallow the BOM
>+      preferred.AssignLiteral("UTF-8");
>+      source = kCharsetFromByteOrderMark;
Similar for UTF8
Attachment #673808 - Flags: review?(bugs) → review-
(In reply to Olli Pettay [:smaug] from comment #24)
> Comment on attachment 673808 [details] [diff] [review]
> Fix for HTML and XML, without printf
> 
> >- return !oCharset.IsEmpty();
> >+  // This code is rather pointless to have. Might as well reuse expat as
> >+  // seen in nsHtml5StreamParser. -- hsivonen
> >+  oCharset.Truncate();
> >+  if ((0x3C == aBytes[0]) &&
> >+      (0x3F == aBytes[1]) &&
> >+      (0x78 == aBytes[2]) &&
> >+      (0x6D == aBytes[3]) &&
> >+      (0 == PL_strncmp("<?xml", (char*) aBytes, 5))) {
> I must be missing something here. Why you compare 4 bytes ("<?xm") first and
> then
> 5 bytes.

I was wondering what that was about but I didn’t change it, because I was just reindenting old code there. I made more reasonable now. In general, I tried to resist fixing more bugs than the one at hand. When you look at nsParser/nsScanner to fix one bug, you see three more.

> So to make this somehow readable, could we please add something like
> bool
> nsContentUtils::IsUTF16BOM(const unsigned char* aData, uint32_t aLen);

Used nsContentUtils::CheckForBOM().
Attachment #673808 - Attachment is obsolete: true
Attachment #675485 - Flags: review?(bugs)
Attachment #675063 - Attachment is obsolete: true
Comment on attachment 675485 [details] [diff] [review]
Address review comments

>+  if (mBomState == BOM_SNIFFING_OVER &&
>+      mCharsetSource >= kCharsetFromChannel) {
>+      nsCOMPtr<nsICharsetConverterManager> convManager =
>+        do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID);
>+      convManager->GetUnicodeDecoder(mCharset.get(),
>+                                       getter_AddRefs(mUnicodeDecoder));
Fix the indentation of the latter parameter 

>+      if (mUnicodeDecoder) {
>+        mUnicodeDecoder->SetInputErrorBehavior(
>+            nsIUnicodeDecoder::kOnError_Recover);
>+        mFeedChardet = false;
>+        mTreeBuilder->SetDocumentCharset(mCharset, mCharsetSource);
>+        mMetaScanner = nullptr;
>+        return WriteSniffingBufferAndCurrentSegment(aFromSegment,
>+                                                    aCount,
>+                                                    aWriteCount);
>+      } else {
>+        // Strange. nsHTMLDocument is supposed to make sure this does not
>+        // happen. Let's deal with this anyway, since who knows how
>+        // kCharsetFromOtherComponent is used.
>+        mCharsetSource = kCharsetFromWeakDocTypeDefault;
>+      }
>+  }
2 space indent, please

The changes to nsHtml5StreamParser::SniffStreamBytes could use some comments.
Like how mCharset ends up being set.


>   ParserWriteStruct* pws = static_cast<ParserWriteStruct*>(closure);
>-  const char* buf = fromRawSegment;
>+  const unsigned char* buf = (const unsigned char*)fromRawSegment;
I'd prefer static_cast<...>

>-  if(NS_FAILED(res) && (mCharsetSource == kCharsetUninitialized))
>-  {
>-     // failed - unknown alias , fallback to ISO-8859-1
>-    mCharset.AssignLiteral("ISO-8859-1");
>+  if (NS_FAILED(res)) {
>+    return NS_OK;
Assert


I'll
Attachment #675485 - Flags: review?(bugs) → review+
Thanks. Landed with review comments addressed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/12288a8a5037
https://hg.mozilla.org/mozilla-central/rev/12288a8a5037
Status: NEW → RESOLVED
Closed: 12 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
I think the change to the XML parsing may have caused bug 814900
(In reply to Leif Halvard Silli from comment #30)
> I think the change to the XML parsing may have caused bug 814900

Nope. The regression range points to bug 801402.
Depends on: 844007

Restricted during to spam for now.

Restrict Comments: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: