Last Comment Bug 716579 - BOM should override HTTP-level charset
: BOM should override HTTP-level charset
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: HTML: Parser (show other bugs)
: Trunk
: All All
: -- normal with 1 vote (vote)
: mozilla19
Assigned To: Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
:
Mentors:
: 662458 (view as bug list)
Depends on: 844007
Blocks: encoding 786149
  Show dependency treegraph
 
Reported: 2012-01-09 09:45 PST by Jason Smith [:jsmith]
Modified: 2013-02-22 11:21 PST (History)
7 users (show)
ryanvm: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
HP Shopping Website No Firefox Render (600.89 KB, image/png)
2012-01-09 09:45 PST, Jason Smith [:jsmith]
no flags Details
IE and Opera Comparison for HP Shopping Website Render (609.18 KB, image/png)
2012-01-09 10:41 PST, Jason Smith [:jsmith]
no flags Details
Fix for HTML (7.16 KB, patch)
2012-10-18 05:19 PDT, Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
no flags Details | Diff | Splinter Review
Fix for HTML and XML (35.84 KB, patch)
2012-10-22 03:18 PDT, Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
no flags Details | Diff | Splinter Review
Fix for HTML and XML, without printf (35.20 KB, patch)
2012-10-22 03:20 PDT, Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
bugs: review-
Details | Diff | Splinter Review
hg diff -w version (31.27 KB, patch)
2012-10-25 04:12 PDT, Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
no flags Details | Diff | Splinter Review
Address review comments (35.31 KB, patch)
2012-10-26 01:05 PDT, Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
bugs: review+
Details | Diff | Splinter Review
Address review comments, with -w (31.86 KB, patch)
2012-10-26 01:11 PDT, Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01)
no flags Details | Diff | Splinter Review

Description Jason Smith [:jsmith] 2012-01-09 09:45:30 PST
Created attachment 587025 [details]
HP Shopping Website No Firefox Render

Step: Go to http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html

Expected: See Chrome portion of screenshot.

Actual: See Firefox portion of screenshot. HTML showed up with two "?" characters at the beginning, with the page itself not rendered. When looking at the source of the page, there appears to be many random characters showing up, making it impossible to understand the underlying source of the page in Firefox.

Additional notes:

http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html renders correctly also in IE 9.
Comment 1 Boris Zbarsky [:bz] 2012-01-09 10:33:35 PST
The site sends:

  Content-Type: text/html; charset=utf-8

then follows with UTF-encoded data.  I believe that we're doing the right thing here and respecting the Content-Type header.  I have no idea what WebKit and IE are doing...  Henri, do you?
Comment 2 Boris Zbarsky [:bz] 2012-01-09 10:36:27 PST
I believe Anne has been looking into this stuff recently too.
Comment 3 Boris Zbarsky [:bz] 2012-01-09 10:38:50 PST
The site also has <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">, but of course everyone involved ignores that.
Comment 4 Boris Zbarsky [:bz] 2012-01-09 10:40:24 PST
I guess one question is what should happen to the nulls if the page _is_ decoded as UTF-8.  The page is all ASCII, so hard to say what UAs are really doing with it.
Comment 5 Jason Smith [:jsmith] 2012-01-09 10:41:22 PST
Created attachment 587043 [details]
IE and Opera Comparison for HP Shopping Website Render

Additional notes:

Opera has similar behavior to Firefox (see screenshot attached).
Comment 6 Boris Zbarsky [:bz] 2012-01-09 10:48:13 PST
Chrome's View > Encoding menu has "UTF-16LE" checked on this page, for what it's worth.
Comment 7 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2012-01-09 21:44:44 PST
(In reply to Boris Zbarsky (:bz) from comment #1)
> I have no idea what
> WebKit and IE are doing...  Henri, do you?

They are allowing the UTF-16 BOM override HTTP. Per earlier discussion with Anne, we should do this, too, and the Encoding spec is expected to require it.
Comment 8 Boris Zbarsky [:bz] 2012-01-09 21:52:34 PST
Does it override all HTTP charsets, or just "unicode" ones?
Comment 9 Anne (:annevk) 2012-01-09 22:43:28 PST
All. Basically you have an algorithm you pass the incoming stream of octets and the determined label, if the stream of octets starts with one of the three BOMs you skip the BOM and pick UTF-8 / UTF-16LE / UTF-16BE based on it, otherwise you use the label. (In due course the standards will use an algorithm like this if I get my way.)
Comment 10 Anne (:annevk) 2012-01-09 23:21:55 PST
FYI: https://www.w3.org/Bugs/Public/show_bug.cgi?id=15359
Comment 11 Masatoshi Kimura [:emk] 2012-04-20 07:26:09 PDT
*** Bug 662458 has been marked as a duplicate of this bug. ***
Comment 12 Leif Halvard Silli 2012-04-20 08:36:13 PDT
Please verify that bug 662458 really is a duplicate: https://bugzilla.mozilla.org/show_bug.cgi?id=662458#c3
Comment 13 Leif Halvard Silli 2012-04-20 11:08:34 PDT
*** Bug 662458 has been marked as a duplicate of this bug. ***
Comment 14 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2012-10-02 04:40:28 PDT
In the first phase, I’m going to leave the possibility of overriding the charset from the UI in place. (Mainly because our code assumes that the UI is *always* able to override and changing that is harder than changing the precedence of BOM relative to HTTP.)

(In reply to Leif Halvard Silli from comment #12)
> Please verify that bug 662458 really is a duplicate:
> https://bugzilla.mozilla.org/show_bug.cgi?id=662458#c3

The XML part is not duplicate. I’m inclined to violate the infamous RFC 3023 and let the BOM take precedence in the XML case as well.
Comment 15 Masatoshi Kimura [:emk] 2012-10-02 04:49:41 PDT
What do other browsers do for XML documents?
Comment 16 Julian Reschke 2012-10-02 05:17:39 PDT
(In reply to Henri Sivonen (:hsivonen) from comment #14)
> The XML part is not duplicate. I’m inclined to violate the infamous RFC 3023
> and let the BOM take precedence in the XML case as well.

Ignore RFC 3023 with respect to defaulting for now. First of all, the HTTP-specific default of ISO-8859-1 is gone in HTTPbis (in WGLC /really/ /soon/ /now/). Also, a revision of RFC 3023bis is supposed to be in progress as well.
Comment 17 Leif Halvard Silli 2012-10-02 05:50:22 PDT
 (In reply to Masatoshi Kimura [:emk] from comment #15)
> What do other browsers do for XML documents?

For mainstream browsers, then, according to the data I have collected[*], Webkit, Chrome and IE9 let the BOM override also for XML. But note that according to the same data, IE10 consumer preview, contrary to IE9 and earlier, does not let BOM override HTTP whether in HTML or in XML.

[*] http://malform.no/blog/white-spots-in-html5-s-encoding-sniffing-algorithm

    NOTE regarding Chrome and Webkit: 

      XML
   * has a default encoding (UTF-8)
   * does not allow user overrides. 

     vs

      Chrome and Webkit's XML handling:
   * allow user override also for XML
   * default to parent browsing context's encoding [#]

[#] http://www.xn--mlform-iua.no/blog/utf8files/locale_default_vs_doc_of_parent_browsing_context/#frame6

    These peculiarities in Webkit and Chrome makes it more secure and useful, for authors, to use the BOM. Because, withuot the BOM, then XML does not work reliably, in these browsers. But OTOH, one could argue that it would have been much more important that Webkit and Chrome, for XML files, *stopped* defaulting to the parent browsing context and stopped allowing user overrides. So while it may be OK for Firefox to let BOM override everything also for XML, it is far less necessary to do so and probably without effect on interoperability.

PPS: For non-mainstrean XML parsers, such as libxml2 and many others, then they show many ideosyncrasies with regard to the BOM versus other encoding information - I collected many data on this in the following bug report: https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
Comment 18 Leif Halvard Silli 2012-10-02 06:08:12 PDT
One more NOTE on Chrome and Webkit: If the XML file (for example an SVG file) is used as an image (e.g. <img src="svg-file.svg">), then parent browsing context and user override is (fortunately) ignored.
Comment 19 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2012-10-18 05:19:28 PDT
Created attachment 672750 [details] [diff] [review]
Fix for HTML
Comment 20 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2012-10-22 03:18:12 PDT
Created attachment 673807 [details] [diff] [review]
Fix for HTML and XML

I rolled XML into the same patch. Fixing HTML and XML out of sync would have cause broken code, because they share the charset source constant order.
Comment 21 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2012-10-22 03:20:26 PDT
Created attachment 673808 [details] [diff] [review]
Fix for HTML and XML, without printf
Comment 22 Olli Pettay [:smaug] 2012-10-25 04:10:04 PDT
This is pretty horrible to review. Would -w make the patch more readable?
Comment 23 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2012-10-25 04:12:48 PDT
Created attachment 675063 [details] [diff] [review]
hg diff -w version
Comment 24 Olli Pettay [:smaug] 2012-10-25 16:44:19 PDT
Comment on attachment 673808 [details] [diff] [review]
Fix for HTML and XML, without printf

>- return !oCharset.IsEmpty();
>+  // This code is rather pointless to have. Might as well reuse expat as
>+  // seen in nsHtml5StreamParser. -- hsivonen
>+  oCharset.Truncate();
>+  if ((0x3C == aBytes[0]) &&
>+      (0x3F == aBytes[1]) &&
>+      (0x78 == aBytes[2]) &&
>+      (0x6D == aBytes[3]) &&
>+      (0 == PL_strncmp("<?xml", (char*) aBytes, 5))) {
I must be missing something here. Why you compare 4 bytes ("<?xm") first and then
5 bytes.


>+    // This code was bogus when I found it. It expects the BOM or the XML
>+    // declaration to be entirely in the first network buffer. -- hsivonen
>+    if (source < kCharsetFromByteOrderMark &&
>+        count >= 2 && ((buf[0] == 0xFE && buf[1] == 0xFF) ||
>+                       (buf[0] == 0xFF && buf[1] == 0xFE))) {
>+      // Don't bother with different code path for little-endian and
>+      // big-endian. Our decoder will re-sniff anyway.
>+      preferred.AssignLiteral("UTF-16");
>+      source = kCharsetFromByteOrderMark;
So to make this somehow readable, could we please add something like
bool
nsContentUtils::IsUTF16BOM(const unsigned char* aData, uint32_t aLen);



>+    } else if (source < kCharsetFromByteOrderMark &&
>+               count >= 3 &&
>+               buf[0] == 0xEF &&
>+               buf[1] == 0xBB &&
>+               buf[2] == 0xBF) {
>+      // The decoder will swallow the BOM
>+      preferred.AssignLiteral("UTF-8");
>+      source = kCharsetFromByteOrderMark;
Similar for UTF8
Comment 25 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2012-10-26 01:05:40 PDT
Created attachment 675485 [details] [diff] [review]
Address review comments

(In reply to Olli Pettay [:smaug] from comment #24)
> Comment on attachment 673808 [details] [diff] [review]
> Fix for HTML and XML, without printf
> 
> >- return !oCharset.IsEmpty();
> >+  // This code is rather pointless to have. Might as well reuse expat as
> >+  // seen in nsHtml5StreamParser. -- hsivonen
> >+  oCharset.Truncate();
> >+  if ((0x3C == aBytes[0]) &&
> >+      (0x3F == aBytes[1]) &&
> >+      (0x78 == aBytes[2]) &&
> >+      (0x6D == aBytes[3]) &&
> >+      (0 == PL_strncmp("<?xml", (char*) aBytes, 5))) {
> I must be missing something here. Why you compare 4 bytes ("<?xm") first and
> then
> 5 bytes.

I was wondering what that was about but I didn’t change it, because I was just reindenting old code there. I made more reasonable now. In general, I tried to resist fixing more bugs than the one at hand. When you look at nsParser/nsScanner to fix one bug, you see three more.

> So to make this somehow readable, could we please add something like
> bool
> nsContentUtils::IsUTF16BOM(const unsigned char* aData, uint32_t aLen);

Used nsContentUtils::CheckForBOM().
Comment 26 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2012-10-26 01:11:18 PDT
Created attachment 675486 [details] [diff] [review]
Address review comments, with -w
Comment 27 Olli Pettay [:smaug] 2012-10-30 05:09:58 PDT
Comment on attachment 675485 [details] [diff] [review]
Address review comments

>+  if (mBomState == BOM_SNIFFING_OVER &&
>+      mCharsetSource >= kCharsetFromChannel) {
>+      nsCOMPtr<nsICharsetConverterManager> convManager =
>+        do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID);
>+      convManager->GetUnicodeDecoder(mCharset.get(),
>+                                       getter_AddRefs(mUnicodeDecoder));
Fix the indentation of the latter parameter 

>+      if (mUnicodeDecoder) {
>+        mUnicodeDecoder->SetInputErrorBehavior(
>+            nsIUnicodeDecoder::kOnError_Recover);
>+        mFeedChardet = false;
>+        mTreeBuilder->SetDocumentCharset(mCharset, mCharsetSource);
>+        mMetaScanner = nullptr;
>+        return WriteSniffingBufferAndCurrentSegment(aFromSegment,
>+                                                    aCount,
>+                                                    aWriteCount);
>+      } else {
>+        // Strange. nsHTMLDocument is supposed to make sure this does not
>+        // happen. Let's deal with this anyway, since who knows how
>+        // kCharsetFromOtherComponent is used.
>+        mCharsetSource = kCharsetFromWeakDocTypeDefault;
>+      }
>+  }
2 space indent, please

The changes to nsHtml5StreamParser::SniffStreamBytes could use some comments.
Like how mCharset ends up being set.


>   ParserWriteStruct* pws = static_cast<ParserWriteStruct*>(closure);
>-  const char* buf = fromRawSegment;
>+  const unsigned char* buf = (const unsigned char*)fromRawSegment;
I'd prefer static_cast<...>

>-  if(NS_FAILED(res) && (mCharsetSource == kCharsetUninitialized))
>-  {
>-     // failed - unknown alias , fallback to ISO-8859-1
>-    mCharset.AssignLiteral("ISO-8859-1");
>+  if (NS_FAILED(res)) {
>+    return NS_OK;
Assert


I'll
Comment 28 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2012-11-06 03:59:36 PST
Thanks. Landed with review comments addressed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/12288a8a5037
Comment 29 Ryan VanderMeulen [:RyanVM] 2012-11-06 18:24:46 PST
https://hg.mozilla.org/mozilla-central/rev/12288a8a5037
Comment 30 Leif Halvard Silli 2012-11-24 14:49:38 PST
I think the change to the XML parsing may have caused bug 814900
Comment 31 Henri Sivonen (:hsivonen) (Not doing reviews or reading bugmail until 2016-08-01) 2012-11-26 00:49:20 PST
(In reply to Leif Halvard Silli from comment #30)
> I think the change to the XML parsing may have caused bug 814900

Nope. The regression range points to bug 801402.

Note You need to log in before you can comment on or make changes to this bug.