Closed
Bug 716579
Opened 13 years ago
Closed 12 years ago
BOM should override HTTP-level charset
Categories
(Core :: DOM: HTML Parser, defect)
Core
DOM: HTML Parser
Tracking
()
RESOLVED
FIXED
mozilla19
People
(Reporter: jsmith, Assigned: hsivonen)
References
(Blocks 1 open bug)
Details
Attachments
(4 files, 4 obsolete files)
600.89 KB,
image/png
|
Details | |
609.18 KB,
image/png
|
Details | |
35.31 KB,
patch
|
smaug
:
review+
|
Details | Diff | Splinter Review |
31.86 KB,
patch
|
Details | Diff | Splinter Review |
Step: Go to http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html
Expected: See Chrome portion of screenshot.
Actual: See Firefox portion of screenshot. HTML showed up with two "?" characters at the beginning, with the page itself not rendered. When looking at the source of the page, there appears to be many random characters showing up, making it impossible to understand the underlying source of the page in Firefox.
Additional notes:
http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html renders correctly also in IE 9.
Comment 1•13 years ago
|
||
The site sends:
Content-Type: text/html; charset=utf-8
then follows with UTF-encoded data. I believe that we're doing the right thing here and respecting the Content-Type header. I have no idea what WebKit and IE are doing... Henri, do you?
Comment 2•13 years ago
|
||
I believe Anne has been looking into this stuff recently too.
Comment 3•13 years ago
|
||
The site also has <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">, but of course everyone involved ignores that.
Comment 4•13 years ago
|
||
I guess one question is what should happen to the nulls if the page _is_ decoded as UTF-8. The page is all ASCII, so hard to say what UAs are really doing with it.
Reporter | ||
Comment 5•13 years ago
|
||
Additional notes:
Opera has similar behavior to Firefox (see screenshot attached).
Comment 6•13 years ago
|
||
Chrome's View > Encoding menu has "UTF-16LE" checked on this page, for what it's worth.
Assignee | ||
Comment 7•13 years ago
|
||
(In reply to Boris Zbarsky (:bz) from comment #1)
> I have no idea what
> WebKit and IE are doing... Henri, do you?
They are allowing the UTF-16 BOM override HTTP. Per earlier discussion with Anne, we should do this, too, and the Encoding spec is expected to require it.
Assignee: nobody → hsivonen
OS: Windows 7 → All
Hardware: x86 → All
Summary: http://www.shopping.hp.com/shopping/html/popup/mtfs_webdetails_master.html does not render correctly in Firefox → BOM should override HTTP-level charset
Version: 9 Branch → Trunk
Comment 8•13 years ago
|
||
Does it override all HTTP charsets, or just "unicode" ones?
Comment 9•13 years ago
|
||
All. Basically you have an algorithm you pass the incoming stream of octets and the determined label, if the stream of octets starts with one of the three BOMs you skip the BOM and pick UTF-8 / UTF-16LE / UTF-16BE based on it, otherwise you use the label. (In due course the standards will use an algorithm like this if I get my way.)
Comment 10•13 years ago
|
||
Comment 12•13 years ago
|
||
Please verify that bug 662458 really is a duplicate: https://bugzilla.mozilla.org/show_bug.cgi?id=662458#c3
Assignee | ||
Comment 14•12 years ago
|
||
In the first phase, I’m going to leave the possibility of overriding the charset from the UI in place. (Mainly because our code assumes that the UI is *always* able to override and changing that is harder than changing the precedence of BOM relative to HTTP.)
(In reply to Leif Halvard Silli from comment #12)
> Please verify that bug 662458 really is a duplicate:
> https://bugzilla.mozilla.org/show_bug.cgi?id=662458#c3
The XML part is not duplicate. I’m inclined to violate the infamous RFC 3023 and let the BOM take precedence in the XML case as well.
Comment 15•12 years ago
|
||
What do other browsers do for XML documents?
Comment 16•12 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #14)
> The XML part is not duplicate. I’m inclined to violate the infamous RFC 3023
> and let the BOM take precedence in the XML case as well.
Ignore RFC 3023 with respect to defaulting for now. First of all, the HTTP-specific default of ISO-8859-1 is gone in HTTPbis (in WGLC /really/ /soon/ /now/). Also, a revision of RFC 3023bis is supposed to be in progress as well.
Comment 17•12 years ago
|
||
(In reply to Masatoshi Kimura [:emk] from comment #15)
> What do other browsers do for XML documents?
For mainstream browsers, then, according to the data I have collected[*], Webkit, Chrome and IE9 let the BOM override also for XML. But note that according to the same data, IE10 consumer preview, contrary to IE9 and earlier, does not let BOM override HTTP whether in HTML or in XML.
[*] http://malform.no/blog/white-spots-in-html5-s-encoding-sniffing-algorithm
NOTE regarding Chrome and Webkit:
XML
* has a default encoding (UTF-8)
* does not allow user overrides.
vs
Chrome and Webkit's XML handling:
* allow user override also for XML
* default to parent browsing context's encoding [#]
[#] http://www.xn--mlform-iua.no/blog/utf8files/locale_default_vs_doc_of_parent_browsing_context/#frame6
These peculiarities in Webkit and Chrome makes it more secure and useful, for authors, to use the BOM. Because, withuot the BOM, then XML does not work reliably, in these browsers. But OTOH, one could argue that it would have been much more important that Webkit and Chrome, for XML files, *stopped* defaulting to the parent browsing context and stopped allowing user overrides. So while it may be OK for Firefox to let BOM override everything also for XML, it is far less necessary to do so and probably without effect on interoperability.
PPS: For non-mainstrean XML parsers, such as libxml2 and many others, then they show many ideosyncrasies with regard to the BOM versus other encoding information - I collected many data on this in the following bug report: https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
Comment 18•12 years ago
|
||
One more NOTE on Chrome and Webkit: If the XML file (for example an SVG file) is used as an image (e.g. <img src="svg-file.svg">), then parent browsing context and user override is (fortunately) ignored.
Assignee | ||
Comment 19•12 years ago
|
||
Assignee | ||
Comment 20•12 years ago
|
||
I rolled XML into the same patch. Fixing HTML and XML out of sync would have cause broken code, because they share the charset source constant order.
Attachment #672750 -
Attachment is obsolete: true
Assignee | ||
Comment 21•12 years ago
|
||
Attachment #673807 -
Attachment is obsolete: true
Assignee | ||
Updated•12 years ago
|
Attachment #673808 -
Flags: review?(bugs)
Comment 22•12 years ago
|
||
This is pretty horrible to review. Would -w make the patch more readable?
Assignee | ||
Comment 23•12 years ago
|
||
Comment 24•12 years ago
|
||
Comment on attachment 673808 [details] [diff] [review]
Fix for HTML and XML, without printf
>- return !oCharset.IsEmpty();
>+ // This code is rather pointless to have. Might as well reuse expat as
>+ // seen in nsHtml5StreamParser. -- hsivonen
>+ oCharset.Truncate();
>+ if ((0x3C == aBytes[0]) &&
>+ (0x3F == aBytes[1]) &&
>+ (0x78 == aBytes[2]) &&
>+ (0x6D == aBytes[3]) &&
>+ (0 == PL_strncmp("<?xml", (char*) aBytes, 5))) {
I must be missing something here. Why you compare 4 bytes ("<?xm") first and then
5 bytes.
>+ // This code was bogus when I found it. It expects the BOM or the XML
>+ // declaration to be entirely in the first network buffer. -- hsivonen
>+ if (source < kCharsetFromByteOrderMark &&
>+ count >= 2 && ((buf[0] == 0xFE && buf[1] == 0xFF) ||
>+ (buf[0] == 0xFF && buf[1] == 0xFE))) {
>+ // Don't bother with different code path for little-endian and
>+ // big-endian. Our decoder will re-sniff anyway.
>+ preferred.AssignLiteral("UTF-16");
>+ source = kCharsetFromByteOrderMark;
So to make this somehow readable, could we please add something like
bool
nsContentUtils::IsUTF16BOM(const unsigned char* aData, uint32_t aLen);
>+ } else if (source < kCharsetFromByteOrderMark &&
>+ count >= 3 &&
>+ buf[0] == 0xEF &&
>+ buf[1] == 0xBB &&
>+ buf[2] == 0xBF) {
>+ // The decoder will swallow the BOM
>+ preferred.AssignLiteral("UTF-8");
>+ source = kCharsetFromByteOrderMark;
Similar for UTF8
Attachment #673808 -
Flags: review?(bugs) → review-
Assignee | ||
Comment 25•12 years ago
|
||
(In reply to Olli Pettay [:smaug] from comment #24)
> Comment on attachment 673808 [details] [diff] [review]
> Fix for HTML and XML, without printf
>
> >- return !oCharset.IsEmpty();
> >+ // This code is rather pointless to have. Might as well reuse expat as
> >+ // seen in nsHtml5StreamParser. -- hsivonen
> >+ oCharset.Truncate();
> >+ if ((0x3C == aBytes[0]) &&
> >+ (0x3F == aBytes[1]) &&
> >+ (0x78 == aBytes[2]) &&
> >+ (0x6D == aBytes[3]) &&
> >+ (0 == PL_strncmp("<?xml", (char*) aBytes, 5))) {
> I must be missing something here. Why you compare 4 bytes ("<?xm") first and
> then
> 5 bytes.
I was wondering what that was about but I didn’t change it, because I was just reindenting old code there. I made more reasonable now. In general, I tried to resist fixing more bugs than the one at hand. When you look at nsParser/nsScanner to fix one bug, you see three more.
> So to make this somehow readable, could we please add something like
> bool
> nsContentUtils::IsUTF16BOM(const unsigned char* aData, uint32_t aLen);
Used nsContentUtils::CheckForBOM().
Attachment #673808 -
Attachment is obsolete: true
Attachment #675485 -
Flags: review?(bugs)
Assignee | ||
Comment 26•12 years ago
|
||
Attachment #675063 -
Attachment is obsolete: true
Comment 27•12 years ago
|
||
Comment on attachment 675485 [details] [diff] [review]
Address review comments
>+ if (mBomState == BOM_SNIFFING_OVER &&
>+ mCharsetSource >= kCharsetFromChannel) {
>+ nsCOMPtr<nsICharsetConverterManager> convManager =
>+ do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID);
>+ convManager->GetUnicodeDecoder(mCharset.get(),
>+ getter_AddRefs(mUnicodeDecoder));
Fix the indentation of the latter parameter
>+ if (mUnicodeDecoder) {
>+ mUnicodeDecoder->SetInputErrorBehavior(
>+ nsIUnicodeDecoder::kOnError_Recover);
>+ mFeedChardet = false;
>+ mTreeBuilder->SetDocumentCharset(mCharset, mCharsetSource);
>+ mMetaScanner = nullptr;
>+ return WriteSniffingBufferAndCurrentSegment(aFromSegment,
>+ aCount,
>+ aWriteCount);
>+ } else {
>+ // Strange. nsHTMLDocument is supposed to make sure this does not
>+ // happen. Let's deal with this anyway, since who knows how
>+ // kCharsetFromOtherComponent is used.
>+ mCharsetSource = kCharsetFromWeakDocTypeDefault;
>+ }
>+ }
2 space indent, please
The changes to nsHtml5StreamParser::SniffStreamBytes could use some comments.
Like how mCharset ends up being set.
> ParserWriteStruct* pws = static_cast<ParserWriteStruct*>(closure);
>- const char* buf = fromRawSegment;
>+ const unsigned char* buf = (const unsigned char*)fromRawSegment;
I'd prefer static_cast<...>
>- if(NS_FAILED(res) && (mCharsetSource == kCharsetUninitialized))
>- {
>- // failed - unknown alias , fallback to ISO-8859-1
>- mCharset.AssignLiteral("ISO-8859-1");
>+ if (NS_FAILED(res)) {
>+ return NS_OK;
Assert
I'll
Attachment #675485 -
Flags: review?(bugs) → review+
Assignee | ||
Comment 28•12 years ago
|
||
Thanks. Landed with review comments addressed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/12288a8a5037
Comment 29•12 years ago
|
||
Status: NEW → RESOLVED
Closed: 12 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
Comment 30•12 years ago
|
||
I think the change to the XML parsing may have caused bug 814900
Assignee | ||
Comment 31•12 years ago
|
||
(In reply to Leif Halvard Silli from comment #30)
> I think the change to the XML parsing may have caused bug 814900
Nope. The regression range points to bug 801402.
You need to log in
before you can comment on or make changes to this bug.
Description
•