Closed Bug 716579 Opened 13 years ago Closed 12 years ago

BOM should override HTTP-level charset

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla19

People

(Reporter: jsmith, Assigned: hsivonen)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 4 obsolete files)

HP Shopping Website No Firefox Render 13 years ago Jason Smith [:jsmith] 600.89 KB, image/png		Details
IE and Opera Comparison for HP Shopping Website Render 13 years ago Jason Smith [:jsmith] 609.18 KB, image/png		Details
Fix for HTML 12 years ago Henri Sivonen (:hsivonen) 7.16 KB, patch		Details \| Diff \| Splinter Review
Fix for HTML and XML 12 years ago Henri Sivonen (:hsivonen) 35.84 KB, patch		Details \| Diff \| Splinter Review
Fix for HTML and XML, without printf 12 years ago Henri Sivonen (:hsivonen) 35.20 KB, patch	smaug : review-	Details \| Diff \| Splinter Review
hg diff -w version 12 years ago Henri Sivonen (:hsivonen) 31.27 KB, patch		Details \| Diff \| Splinter Review
Address review comments 12 years ago Henri Sivonen (:hsivonen) 35.31 KB, patch	smaug : review+	Details \| Diff \| Splinter Review
Address review comments, with -w 12 years ago Henri Sivonen (:hsivonen) 31.86 KB, patch		Details \| Diff \| Splinter Review

Jason Smith [:jsmith]

Reporter

Description

•

13 years ago

Attached image HP Shopping Website No Firefox Render — Details

Step: Go to http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html

Expected: See Chrome portion of screenshot.

Actual: See Firefox portion of screenshot. HTML showed up with two "?" characters at the beginning, with the page itself not rendered. When looking at the source of the page, there appears to be many random characters showing up, making it impossible to understand the underlying source of the page in Firefox.

Additional notes:

http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html renders correctly also in IE 9.

Boris Zbarsky [:bzbarsky]

Comment 1

•

13 years ago

The site sends:

  Content-Type: text/html; charset=utf-8

then follows with UTF-encoded data.  I believe that we're doing the right thing here and respecting the Content-Type header.  I have no idea what WebKit and IE are doing...  Henri, do you?

Boris Zbarsky [:bzbarsky]

Comment 2

•

13 years ago

I believe Anne has been looking into this stuff recently too.

Boris Zbarsky [:bzbarsky]

Comment 3

•

13 years ago

The site also has <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">, but of course everyone involved ignores that.

Boris Zbarsky [:bzbarsky]

Comment 4

•

13 years ago

I guess one question is what should happen to the nulls if the page _is_ decoded as UTF-8.  The page is all ASCII, so hard to say what UAs are really doing with it.

Jason Smith [:jsmith]

Reporter

Comment 5

•

13 years ago

Attached image IE and Opera Comparison for HP Shopping Website Render — Details

Additional notes:

Opera has similar behavior to Firefox (see screenshot attached).

Boris Zbarsky [:bzbarsky]

Comment 6

•

13 years ago

Chrome's View > Encoding menu has "UTF-16LE" checked on this page, for what it's worth.

Henri Sivonen (:hsivonen)

Assignee

Comment 7

•

13 years ago

(In reply to Boris Zbarsky (:bz) from comment #1)
> I have no idea what
> WebKit and IE are doing...  Henri, do you?

They are allowing the UTF-16 BOM override HTTP. Per earlier discussion with Anne, we should do this, too, and the Encoding spec is expected to require it.

Assignee: nobody → hsivonen

OS: Windows 7 → All

Hardware: x86 → All

Summary: http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html does not render correctly in Firefox → BOM should override HTTP-level charset

Version: 9 Branch → Trunk

Boris Zbarsky [:bzbarsky]

Comment 8

•

13 years ago

Does it override all HTTP charsets, or just "unicode" ones?

Anne (:annevk)

Comment 9

•

13 years ago

All. Basically you have an algorithm you pass the incoming stream of octets and the determined label, if the stream of octets starts with one of the three BOMs you skip the BOM and pick UTF-8 / UTF-16LE / UTF-16BE based on it, otherwise you use the label. (In due course the standards will use an algorithm like this if I get my way.)

Anne (:annevk)

Comment 10

•

13 years ago

FYI: https://www.w3.org/Bugs/Public/show_bug.cgi?id=15359

Masatoshi Kimura [:emk]

Updated

•

12 years ago

Blocks: encoding

Leif Halvard Silli

Comment 12

•

12 years ago

Please verify that bug 662458 really is a duplicate: https://bugzilla.mozilla.org/show_bug.cgi?id=662458#c3

Masatoshi Kimura [:emk]

Updated

•

12 years ago

Blocks: 786149

Henri Sivonen (:hsivonen)

Assignee

Comment 14

•

12 years ago

In the first phase, I’m going to leave the possibility of overriding the charset from the UI in place. (Mainly because our code assumes that the UI is *always* able to override and changing that is harder than changing the precedence of BOM relative to HTTP.)

(In reply to Leif Halvard Silli from comment #12)
> Please verify that bug 662458 really is a duplicate:
> https://bugzilla.mozilla.org/show_bug.cgi?id=662458#c3

The XML part is not duplicate. I’m inclined to violate the infamous RFC 3023 and let the BOM take precedence in the XML case as well.

Masatoshi Kimura [:emk]

Comment 15

•

12 years ago

What do other browsers do for XML documents?

Julian Reschke

Comment 16

•

12 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #14)
> The XML part is not duplicate. I’m inclined to violate the infamous RFC 3023
> and let the BOM take precedence in the XML case as well.

Ignore RFC 3023 with respect to defaulting for now. First of all, the HTTP-specific default of ISO-8859-1 is gone in HTTPbis (in WGLC /really/ /soon/ /now/). Also, a revision of RFC 3023bis is supposed to be in progress as well.

Leif Halvard Silli

Comment 17

•

12 years ago

 (In reply to Masatoshi Kimura [:emk] from comment #15)
> What do other browsers do for XML documents?

For mainstream browsers, then, according to the data I have collected[*], Webkit, Chrome and IE9 let the BOM override also for XML. But note that according to the same data, IE10 consumer preview, contrary to IE9 and earlier, does not let BOM override HTTP whether in HTML or in XML.

[*] http://malform.no/blog/white-spots-in-html5-s-encoding-sniffing-algorithm

    NOTE regarding Chrome and Webkit: 

      XML
   * has a default encoding (UTF-8)
   * does not allow user overrides. 

     vs

      Chrome and Webkit's XML handling:
   * allow user override also for XML
   * default to parent browsing context's encoding [#]

[#] http://www.xn--mlform-iua.no/blog/utf8files/locale_default_vs_doc_of_parent_browsing_context/#frame6

    These peculiarities in Webkit and Chrome makes it more secure and useful, for authors, to use the BOM. Because, withuot the BOM, then XML does not work reliably, in these browsers. But OTOH, one could argue that it would have been much more important that Webkit and Chrome, for XML files, *stopped* defaulting to the parent browsing context and stopped allowing user overrides. So while it may be OK for Firefox to let BOM override everything also for XML, it is far less necessary to do so and probably without effect on interoperability.

PPS: For non-mainstrean XML parsers, such as libxml2 and many others, then they show many ideosyncrasies with regard to the BOM versus other encoding information - I collected many data on this in the following bug report: https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897

Leif Halvard Silli

Comment 18

•

12 years ago

One more NOTE on Chrome and Webkit: If the XML file (for example an SVG file) is used as an image (e.g. <img src="svg-file.svg">), then parent browsing context and user override is (fortunately) ignored.

Henri Sivonen (:hsivonen)

Assignee

Comment 19

•

12 years ago

Attached patch Fix for HTML (obsolete) — Details — Splinter Review

Henri Sivonen (:hsivonen)

Assignee

Comment 20

•

12 years ago

Attached patch Fix for HTML and XML (obsolete) — Details — Splinter Review

I rolled XML into the same patch. Fixing HTML and XML out of sync would have cause broken code, because they share the charset source constant order.

Attachment #672750 - Attachment is obsolete: true

Henri Sivonen (:hsivonen)

Assignee

Comment 21

•

12 years ago

Attached patch Fix for HTML and XML, without printf (obsolete) — Details — Splinter Review

Attachment #673807 - Attachment is obsolete: true

Henri Sivonen (:hsivonen)

Assignee

Updated

•

12 years ago

Attachment #673808 - Flags: review?(bugs)

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 22

•

12 years ago

This is pretty horrible to review. Would -w make the patch more readable?

Henri Sivonen (:hsivonen)

Assignee

Comment 23

•

12 years ago

Attached patch hg diff -w version (obsolete) — Details — Splinter Review

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 24

•

12 years ago

Comment on attachment 673808 [details] [diff] [review]
Fix for HTML and XML, without printf

>- return !oCharset.IsEmpty();
>+  // This code is rather pointless to have. Might as well reuse expat as
>+  // seen in nsHtml5StreamParser. -- hsivonen
>+  oCharset.Truncate();
>+  if ((0x3C == aBytes[0]) &&
>+      (0x3F == aBytes[1]) &&
>+      (0x78 == aBytes[2]) &&
>+      (0x6D == aBytes[3]) &&
>+      (0 == PL_strncmp("<?xml", (char*) aBytes, 5))) {
I must be missing something here. Why you compare 4 bytes ("<?xm") first and then
5 bytes.


>+    // This code was bogus when I found it. It expects the BOM or the XML
>+    // declaration to be entirely in the first network buffer. -- hsivonen
>+    if (source < kCharsetFromByteOrderMark &&
>+        count >= 2 && ((buf[0] == 0xFE && buf[1] == 0xFF) ||
>+                       (buf[0] == 0xFF && buf[1] == 0xFE))) {
>+      // Don't bother with different code path for little-endian and
>+      // big-endian. Our decoder will re-sniff anyway.
>+      preferred.AssignLiteral("UTF-16");
>+      source = kCharsetFromByteOrderMark;
So to make this somehow readable, could we please add something like
bool
nsContentUtils::IsUTF16BOM(const unsigned char* aData, uint32_t aLen);



>+    } else if (source < kCharsetFromByteOrderMark &&
>+               count >= 3 &&
>+               buf[0] == 0xEF &&
>+               buf[1] == 0xBB &&
>+               buf[2] == 0xBF) {
>+      // The decoder will swallow the BOM
>+      preferred.AssignLiteral("UTF-8");
>+      source = kCharsetFromByteOrderMark;
Similar for UTF8

Attachment #673808 - Flags: review?(bugs) → review-

Henri Sivonen (:hsivonen)

Assignee

Comment 25

•

12 years ago

Attached patch Address review comments — Details — Splinter Review

(In reply to Olli Pettay [:smaug] from comment #24)
> Comment on attachment 673808 [details] [diff] [review]
> Fix for HTML and XML, without printf
> 
> >- return !oCharset.IsEmpty();
> >+  // This code is rather pointless to have. Might as well reuse expat as
> >+  // seen in nsHtml5StreamParser. -- hsivonen
> >+  oCharset.Truncate();
> >+  if ((0x3C == aBytes[0]) &&
> >+      (0x3F == aBytes[1]) &&
> >+      (0x78 == aBytes[2]) &&
> >+      (0x6D == aBytes[3]) &&
> >+      (0 == PL_strncmp("<?xml", (char*) aBytes, 5))) {
> I must be missing something here. Why you compare 4 bytes ("<?xm") first and
> then
> 5 bytes.

I was wondering what that was about but I didn’t change it, because I was just reindenting old code there. I made more reasonable now. In general, I tried to resist fixing more bugs than the one at hand. When you look at nsParser/nsScanner to fix one bug, you see three more.

> So to make this somehow readable, could we please add something like
> bool
> nsContentUtils::IsUTF16BOM(const unsigned char* aData, uint32_t aLen);

Used nsContentUtils::CheckForBOM().

Attachment #673808 - Attachment is obsolete: true

Attachment #675485 - Flags: review?(bugs)

Henri Sivonen (:hsivonen)

Assignee

Comment 26

•

12 years ago

Attached patch Address review comments, with -w — Details — Splinter Review

Attachment #675063 - Attachment is obsolete: true

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 27

•

12 years ago

Comment on attachment 675485 [details] [diff] [review]
Address review comments

>+  if (mBomState == BOM_SNIFFING_OVER &&
>+      mCharsetSource >= kCharsetFromChannel) {
>+      nsCOMPtr<nsICharsetConverterManager> convManager =
>+        do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID);
>+      convManager->GetUnicodeDecoder(mCharset.get(),
>+                                       getter_AddRefs(mUnicodeDecoder));
Fix the indentation of the latter parameter 

>+      if (mUnicodeDecoder) {
>+        mUnicodeDecoder->SetInputErrorBehavior(
>+            nsIUnicodeDecoder::kOnError_Recover);
>+        mFeedChardet = false;
>+        mTreeBuilder->SetDocumentCharset(mCharset, mCharsetSource);
>+        mMetaScanner = nullptr;
>+        return WriteSniffingBufferAndCurrentSegment(aFromSegment,
>+                                                    aCount,
>+                                                    aWriteCount);
>+      } else {
>+        // Strange. nsHTMLDocument is supposed to make sure this does not
>+        // happen. Let's deal with this anyway, since who knows how
>+        // kCharsetFromOtherComponent is used.
>+        mCharsetSource = kCharsetFromWeakDocTypeDefault;
>+      }
>+  }
2 space indent, please

The changes to nsHtml5StreamParser::SniffStreamBytes could use some comments.
Like how mCharset ends up being set.


>   ParserWriteStruct* pws = static_cast<ParserWriteStruct*>(closure);
>-  const char* buf = fromRawSegment;
>+  const unsigned char* buf = (const unsigned char*)fromRawSegment;
I'd prefer static_cast<...>

>-  if(NS_FAILED(res) && (mCharsetSource == kCharsetUninitialized))
>-  {
>-     // failed - unknown alias , fallback to ISO-8859-1
>-    mCharset.AssignLiteral("ISO-8859-1");
>+  if (NS_FAILED(res)) {
>+    return NS_OK;
Assert


I'll

Attachment #675485 - Flags: review?(bugs) → review+

Henri Sivonen (:hsivonen)

Assignee

Comment 28

•

12 years ago

Thanks. Landed with review comments addressed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/12288a8a5037

Ryan VanderMeulen [:RyanVM]

Comment 29

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/12288a8a5037

Status: NEW → RESOLVED

Closed: 12 years ago

Flags: in-testsuite+

Resolution: --- → FIXED

Target Milestone: --- → mozilla19

Leif Halvard Silli

Comment 30

•

12 years ago

I think the change to the XML parsing may have caused bug 814900

Henri Sivonen (:hsivonen)

Assignee

Comment 31

•

12 years ago

(In reply to Leif Halvard Silli from comment #30)
> I think the change to the XML parsing may have caused bug 814900

Nope. The regression range points to bug 801402.

Boris Zbarsky [:bzbarsky]

Updated

•

11 years ago

Depends on: 844007

Comment hidden (spam)