BOM should override HTTP-level charset

RESOLVED FIXED in mozilla19

Status

()

Core
HTML: Parser
RESOLVED FIXED
6 years ago
4 years ago

People

(Reporter: jsmith, Assigned: hsivonen)

Tracking

(Blocks: 1 bug)

Trunk
mozilla19
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments, 4 obsolete attachments)

(Reporter)

Description

6 years ago
Created attachment 587025 [details]
HP Shopping Website No Firefox Render

Step: Go to http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html

Expected: See Chrome portion of screenshot.

Actual: See Firefox portion of screenshot. HTML showed up with two "?" characters at the beginning, with the page itself not rendered. When looking at the source of the page, there appears to be many random characters showing up, making it impossible to understand the underlying source of the page in Firefox.

Additional notes:

http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html renders correctly also in IE 9.
The site sends:

  Content-Type: text/html; charset=utf-8

then follows with UTF-encoded data.  I believe that we're doing the right thing here and respecting the Content-Type header.  I have no idea what WebKit and IE are doing...  Henri, do you?
I believe Anne has been looking into this stuff recently too.
The site also has <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">, but of course everyone involved ignores that.
I guess one question is what should happen to the nulls if the page _is_ decoded as UTF-8.  The page is all ASCII, so hard to say what UAs are really doing with it.
(Reporter)

Comment 5

6 years ago
Created attachment 587043 [details]
IE and Opera Comparison for HP Shopping Website Render

Additional notes:

Opera has similar behavior to Firefox (see screenshot attached).
Chrome's View > Encoding menu has "UTF-16LE" checked on this page, for what it's worth.
(Assignee)

Comment 7

6 years ago
(In reply to Boris Zbarsky (:bz) from comment #1)
> I have no idea what
> WebKit and IE are doing...  Henri, do you?

They are allowing the UTF-16 BOM override HTTP. Per earlier discussion with Anne, we should do this, too, and the Encoding spec is expected to require it.
Assignee: nobody → hsivonen
OS: Windows 7 → All
Hardware: x86 → All
Summary: http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html does not render correctly in Firefox → BOM should override HTTP-level charset
Version: 9 Branch → Trunk
Does it override all HTTP charsets, or just "unicode" ones?

Comment 9

6 years ago
All. Basically you have an algorithm you pass the incoming stream of octets and the determined label, if the stream of octets starts with one of the three BOMs you skip the BOM and pick UTF-8 / UTF-16LE / UTF-16BE based on it, otherwise you use the label. (In due course the standards will use an algorithm like this if I get my way.)

Comment 10

6 years ago
FYI: https://www.w3.org/Bugs/Public/show_bug.cgi?id=15359
Duplicate of this bug: 662458
Blocks: 746911

Comment 12

5 years ago
Please verify that bug 662458 really is a duplicate: https://bugzilla.mozilla.org/show_bug.cgi?id=662458#c3

Updated

5 years ago
Duplicate of this bug: 662458
Blocks: 786149
(Assignee)

Comment 14

5 years ago
In the first phase, I’m going to leave the possibility of overriding the charset from the UI in place. (Mainly because our code assumes that the UI is *always* able to override and changing that is harder than changing the precedence of BOM relative to HTTP.)

(In reply to Leif Halvard Silli from comment #12)
> Please verify that bug 662458 really is a duplicate:
> https://bugzilla.mozilla.org/show_bug.cgi?id=662458#c3

The XML part is not duplicate. I’m inclined to violate the infamous RFC 3023 and let the BOM take precedence in the XML case as well.
What do other browsers do for XML documents?

Comment 16

5 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #14)
> The XML part is not duplicate. I’m inclined to violate the infamous RFC 3023
> and let the BOM take precedence in the XML case as well.

Ignore RFC 3023 with respect to defaulting for now. First of all, the HTTP-specific default of ISO-8859-1 is gone in HTTPbis (in WGLC /really/ /soon/ /now/). Also, a revision of RFC 3023bis is supposed to be in progress as well.

Comment 17

5 years ago
 (In reply to Masatoshi Kimura [:emk] from comment #15)
> What do other browsers do for XML documents?

For mainstream browsers, then, according to the data I have collected[*], Webkit, Chrome and IE9 let the BOM override also for XML. But note that according to the same data, IE10 consumer preview, contrary to IE9 and earlier, does not let BOM override HTTP whether in HTML or in XML.

[*] http://malform.no/blog/white-spots-in-html5-s-encoding-sniffing-algorithm

    NOTE regarding Chrome and Webkit: 

      XML
   * has a default encoding (UTF-8)
   * does not allow user overrides. 

     vs

      Chrome and Webkit's XML handling:
   * allow user override also for XML
   * default to parent browsing context's encoding [#]

[#] http://www.xn--mlform-iua.no/blog/utf8files/locale_default_vs_doc_of_parent_browsing_context/#frame6

    These peculiarities in Webkit and Chrome makes it more secure and useful, for authors, to use the BOM. Because, withuot the BOM, then XML does not work reliably, in these browsers. But OTOH, one could argue that it would have been much more important that Webkit and Chrome, for XML files, *stopped* defaulting to the parent browsing context and stopped allowing user overrides. So while it may be OK for Firefox to let BOM override everything also for XML, it is far less necessary to do so and probably without effect on interoperability.

PPS: For non-mainstrean XML parsers, such as libxml2 and many others, then they show many ideosyncrasies with regard to the BOM versus other encoding information - I collected many data on this in the following bug report: https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897

Comment 18

5 years ago
One more NOTE on Chrome and Webkit: If the XML file (for example an SVG file) is used as an image (e.g. <img src="svg-file.svg">), then parent browsing context and user override is (fortunately) ignored.
(Assignee)

Comment 19

5 years ago
Created attachment 672750 [details] [diff] [review]
Fix for HTML
(Assignee)

Comment 20

5 years ago
Created attachment 673807 [details] [diff] [review]
Fix for HTML and XML

I rolled XML into the same patch. Fixing HTML and XML out of sync would have cause broken code, because they share the charset source constant order.
Attachment #672750 - Attachment is obsolete: true
(Assignee)

Comment 21

5 years ago
Created attachment 673808 [details] [diff] [review]
Fix for HTML and XML, without printf
Attachment #673807 - Attachment is obsolete: true
(Assignee)

Updated

5 years ago
Attachment #673808 - Flags: review?(bugs)
This is pretty horrible to review. Would -w make the patch more readable?
(Assignee)

Comment 23

5 years ago
Created attachment 675063 [details] [diff] [review]
hg diff -w version
Comment on attachment 673808 [details] [diff] [review]
Fix for HTML and XML, without printf

>- return !oCharset.IsEmpty();
>+  // This code is rather pointless to have. Might as well reuse expat as
>+  // seen in nsHtml5StreamParser. -- hsivonen
>+  oCharset.Truncate();
>+  if ((0x3C == aBytes[0]) &&
>+      (0x3F == aBytes[1]) &&
>+      (0x78 == aBytes[2]) &&
>+      (0x6D == aBytes[3]) &&
>+      (0 == PL_strncmp("<?xml", (char*) aBytes, 5))) {
I must be missing something here. Why you compare 4 bytes ("<?xm") first and then
5 bytes.


>+    // This code was bogus when I found it. It expects the BOM or the XML
>+    // declaration to be entirely in the first network buffer. -- hsivonen
>+    if (source < kCharsetFromByteOrderMark &&
>+        count >= 2 && ((buf[0] == 0xFE && buf[1] == 0xFF) ||
>+                       (buf[0] == 0xFF && buf[1] == 0xFE))) {
>+      // Don't bother with different code path for little-endian and
>+      // big-endian. Our decoder will re-sniff anyway.
>+      preferred.AssignLiteral("UTF-16");
>+      source = kCharsetFromByteOrderMark;
So to make this somehow readable, could we please add something like
bool
nsContentUtils::IsUTF16BOM(const unsigned char* aData, uint32_t aLen);



>+    } else if (source < kCharsetFromByteOrderMark &&
>+               count >= 3 &&
>+               buf[0] == 0xEF &&
>+               buf[1] == 0xBB &&
>+               buf[2] == 0xBF) {
>+      // The decoder will swallow the BOM
>+      preferred.AssignLiteral("UTF-8");
>+      source = kCharsetFromByteOrderMark;
Similar for UTF8
Attachment #673808 - Flags: review?(bugs) → review-
(Assignee)

Comment 25

5 years ago
Created attachment 675485 [details] [diff] [review]
Address review comments

(In reply to Olli Pettay [:smaug] from comment #24)
> Comment on attachment 673808 [details] [diff] [review]
> Fix for HTML and XML, without printf
> 
> >- return !oCharset.IsEmpty();
> >+  // This code is rather pointless to have. Might as well reuse expat as
> >+  // seen in nsHtml5StreamParser. -- hsivonen
> >+  oCharset.Truncate();
> >+  if ((0x3C == aBytes[0]) &&
> >+      (0x3F == aBytes[1]) &&
> >+      (0x78 == aBytes[2]) &&
> >+      (0x6D == aBytes[3]) &&
> >+      (0 == PL_strncmp("<?xml", (char*) aBytes, 5))) {
> I must be missing something here. Why you compare 4 bytes ("<?xm") first and
> then
> 5 bytes.

I was wondering what that was about but I didn’t change it, because I was just reindenting old code there. I made more reasonable now. In general, I tried to resist fixing more bugs than the one at hand. When you look at nsParser/nsScanner to fix one bug, you see three more.

> So to make this somehow readable, could we please add something like
> bool
> nsContentUtils::IsUTF16BOM(const unsigned char* aData, uint32_t aLen);

Used nsContentUtils::CheckForBOM().
Attachment #673808 - Attachment is obsolete: true
Attachment #675485 - Flags: review?(bugs)
(Assignee)

Comment 26

5 years ago
Created attachment 675486 [details] [diff] [review]
Address review comments, with -w
Attachment #675063 - Attachment is obsolete: true
Comment on attachment 675485 [details] [diff] [review]
Address review comments

>+  if (mBomState == BOM_SNIFFING_OVER &&
>+      mCharsetSource >= kCharsetFromChannel) {
>+      nsCOMPtr<nsICharsetConverterManager> convManager =
>+        do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID);
>+      convManager->GetUnicodeDecoder(mCharset.get(),
>+                                       getter_AddRefs(mUnicodeDecoder));
Fix the indentation of the latter parameter 

>+      if (mUnicodeDecoder) {
>+        mUnicodeDecoder->SetInputErrorBehavior(
>+            nsIUnicodeDecoder::kOnError_Recover);
>+        mFeedChardet = false;
>+        mTreeBuilder->SetDocumentCharset(mCharset, mCharsetSource);
>+        mMetaScanner = nullptr;
>+        return WriteSniffingBufferAndCurrentSegment(aFromSegment,
>+                                                    aCount,
>+                                                    aWriteCount);
>+      } else {
>+        // Strange. nsHTMLDocument is supposed to make sure this does not
>+        // happen. Let's deal with this anyway, since who knows how
>+        // kCharsetFromOtherComponent is used.
>+        mCharsetSource = kCharsetFromWeakDocTypeDefault;
>+      }
>+  }
2 space indent, please

The changes to nsHtml5StreamParser::SniffStreamBytes could use some comments.
Like how mCharset ends up being set.


>   ParserWriteStruct* pws = static_cast<ParserWriteStruct*>(closure);
>-  const char* buf = fromRawSegment;
>+  const unsigned char* buf = (const unsigned char*)fromRawSegment;
I'd prefer static_cast<...>

>-  if(NS_FAILED(res) && (mCharsetSource == kCharsetUninitialized))
>-  {
>-     // failed - unknown alias , fallback to ISO-8859-1
>-    mCharset.AssignLiteral("ISO-8859-1");
>+  if (NS_FAILED(res)) {
>+    return NS_OK;
Assert


I'll
Attachment #675485 - Flags: review?(bugs) → review+
(Assignee)

Comment 28

5 years ago
Thanks. Landed with review comments addressed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/12288a8a5037
https://hg.mozilla.org/mozilla-central/rev/12288a8a5037
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla19

Comment 30

5 years ago
I think the change to the XML parsing may have caused bug 814900
(Assignee)

Comment 31

5 years ago
(In reply to Leif Halvard Silli from comment #30)
> I think the change to the XML parsing may have caused bug 814900

Nope. The regression range points to bug 801402.
Depends on: 844007
You need to log in before you can comment on or make changes to this bug.