Last Comment Bug 687859 - BOM should take precedence for charset determination for scripts
: BOM should take precedence for charset determination for scripts
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: DOM (show other bugs)
: unspecified
: x86 All
: -- normal (vote)
: mozilla19
Assigned To: Henri Sivonen (:hsivonen)
:
Mentors:
Depends on:
Blocks: encoding
  Show dependency treegraph
 
Reported: 2011-09-20 08:36 PDT by dindog
Modified: 2012-10-23 03:55 PDT (History)
7 users (show)
ryanvm: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
testcase.html & foo.js included (567 bytes, application/octet-stream)
2011-09-20 08:36 PDT, dindog
no flags Details
testcase (630 bytes, application/octet-stream)
2012-02-10 03:24 PST, dindog
no flags Details
Make the BOM take precedence (11.04 KB, patch)
2012-10-18 07:00 PDT, Henri Sivonen (:hsivonen)
bugs: review+
Details | Diff | Review

Description dindog 2011-09-20 08:36:20 PDT
Created attachment 561201 [details]
testcase.html & foo.js included

User Agent: Mozilla/5.0 (Windows NT 6.1; rv:7.0) Gecko/20100101 Firefox/7.0
Build ID: 20110918030911

Steps to reproduce:

We find out that when setting a wrong charset for external sources, Firefox is the only browser fail to switch to correct charset.
testcase is a GB2312 charset HTML, with src='foo.js' in utf-8 bom, wrong charset by intention

Chrome, IE, Opera is fine. Firefox is the only one will fail in this testcase. It should improve
Comment 1 dindog 2012-02-10 03:24:46 PST
Created attachment 596008 [details]
testcase

Update the testcase and set the OS to all for I reproduce it in linxu too.
Comment 2 Boris Zbarsky [:bz] (Out June 25-July 6) 2012-02-10 10:53:05 PST
Wrong charset in what sense?  What exact charset information is being provided?  The "charset" attribute on the element and the BOM in the file, with no HTTP-level charset?  Which charset is the "right" one?

The current HTML5 spec text about this says:

To obtain the Unicode string, the user agent run the following steps:

    If the resource's Content Type metadata, if any, specifies a character encoding, and
    the user agent supports that encoding, then let character encoding be that encoding,
    and jump to the bottom step in this series of steps.

    If the algorithm above set the script block's character encoding, then let character
    encoding be that encoding, and jump to the bottom step in this series of steps.
 
    For each of the rows in the following table, starting with the first one and going
    down, if the file has as many or more bytes available than the number of bytes in the
    first column, and the first bytes of the file match the bytes given in the first
    column, then set character encoding to the encoding given in the cell in the second
    column of that row, and jump to the bottom step in this series of steps:
    Bytes in Hexadecimal 	Encoding
    FE FF 	Big-endian UTF-16
    FF FE 	Little-endian UTF-16
    EF BB BF 	UTF-8

    Note: This step looks for Unicode Byte Order Marks (BOMs).

So the charset attribute overrides the BOM, per spec.  If you think the spec is wrong, please raise a spec issue.
Comment 3 dindog 2012-02-10 23:07:19 PST
(In reply to Boris Zbarsky (:bz) from comment #2)
> Wrong charset in what sense?  What exact charset information is being
> provided?  The "charset" attribute on the element and the BOM in the file,
> with no HTTP-level charset?  Which charset is the "right" one?
> 
> The current HTML5 spec text about this says:
> 
> To obtain the Unicode string, the user agent run the following steps:
> 
>     If the resource's Content Type metadata, if any, specifies a character
> encoding, and
>     the user agent supports that encoding, then let character encoding be
> that encoding,
>     and jump to the bottom step in this series of steps.
> 
>     If the algorithm above set the script block's character encoding, then
> let character
>     encoding be that encoding, and jump to the bottom step in this series of
> steps.
>  
>     For each of the rows in the following table, starting with the first one
> and going
>     down, if the file has as many or more bytes available than the number of
> bytes in the
>     first column, and the first bytes of the file match the bytes given in
> the first
>     column, then set character encoding to the encoding given in the cell in
> the second
>     column of that row, and jump to the bottom step in this series of steps:
>     Bytes in Hexadecimal 	Encoding
>     FE FF 	Big-endian UTF-16
>     FF FE 	Little-endian UTF-16
>     EF BB BF 	UTF-8
> 
>     Note: This step looks for Unicode Byte Order Marks (BOMs).
> 
> So the charset attribute overrides the BOM, per spec.  If you think the spec
> is wrong, please raise a spec issue.

I am not an developer, but here is the thing:
Our forum people found an website which Firefox acted different from other browser, and we found is the website's false, sort of. It return a wrong charset in HTTP header, and obviously I can't do that with a local testcase, but the issue remain if I declare a wrong charset in script tag.

Firefox is the only mainstream browsers will fail to run the script if the charset is incorrect, don't you think this should be fixed? The site's false indeed, but users won't think that way when they try every other browser work out fine, they will think it's Firefox's false
Comment 4 dindog 2012-02-10 23:32:40 PST
(In reply to Boris Zbarsky (:bz) from comment #2)
> Wrong charset in what sense?  What exact charset information is being
> provided?  The "charset" attribute on the element and the BOM in the file,
> with no HTTP-level charset?  Which charset is the "right" one?
Simple answer is both. Wrong HTTP header charset or attribute on the element will go wrong. (and the testcase demonstrates latter)
Comment 5 Boris Zbarsky [:bz] (Out June 25-July 6) 2012-02-11 05:56:59 PST
Henri, do IE and Opera allow a BOM in a script to override HTTP headers and @charset?
Comment 6 dindog 2012-05-07 08:55:12 PDT
Any news about this?

I see a similar bug fixed recently: Bug 693806, according to the spec though.

I mean, now that Firefox is the only one fail to run the testcase, should we pay a little attention to it?
Comment 7 Henri Sivonen (:hsivonen) 2012-10-18 07:00:16 PDT
Created attachment 672774 [details] [diff] [review]
Make the BOM take precedence
Comment 8 Henri Sivonen (:hsivonen) 2012-10-19 04:52:30 PDT
(In reply to Boris Zbarsky (:bz) from comment #5)
> Henri, do IE and Opera allow a BOM in a script to override HTTP headers and
> @charset?

I tested IE7, Opera.next and Chrome. BOM takes precedence over HTTP in all of them: http://hsivonen.iki.fi/test/moz/bom/no-charset-attribute.html1251
Comment 9 Masatoshi Kimura [:emk] 2012-10-19 05:05:38 PDT
(In reply to Henri Sivonen (:hsivonen) from comment #8)
> (In reply to Boris Zbarsky (:bz) from comment #5)
> > Henri, do IE and Opera allow a BOM in a script to override HTTP headers and
> > @charset?
> 
> I tested IE7, Opera.next and Chrome. BOM takes precedence over HTTP in all
> of them: http://hsivonen.iki.fi/test/moz/bom/no-charset-attribute.html1251
IE10 displayed 'โ‚ฌ' in all documents mode.
Comment 10 Henri Sivonen (:hsivonen) 2012-10-19 05:40:30 PDT
(In reply to Masatoshi Kimura [:emk] from comment #9)
> IE10 displayed 'โ‚ฌ' in all documents mode.

It strips the BOM and then decodes according to HTTP charset.
Comment 11 Henri Sivonen (:hsivonen) 2012-10-22 04:53:57 PDT
Filed bug 804113 about a potential pre-existing problem preserved by this patch.
Comment 12 Olli Pettay [:smaug] (high review load, please consider other reviewers) 2012-10-22 11:26:30 PDT
Comment on attachment 672774 [details] [diff] [review]
Make the BOM take precedence

># HG changeset patch
># User Henri Sivonen <hsivonen@iki.fi>
># Date 1350558618 -10800
># Node ID 755369f4f020ee6983b3e312dd05e05c7f34ab8d
># Parent  89ff8e55638111c977538dbd22edf00b82f013ee
>Bug 687859 - Make the BOM take precedence when determining the character encoding of scripts. r=NOT_REVIEWED.
>
>diff --git a/content/base/src/nsScriptLoader.cpp b/content/base/src/nsScriptLoader.cpp
>--- a/content/base/src/nsScriptLoader.cpp
>+++ b/content/base/src/nsScriptLoader.cpp
>@@ -999,94 +999,103 @@ nsScriptLoader::ConvertToUTF16(nsIChanne
>                                uint32_t aLength, const nsAString& aHintCharset,
>                                nsIDocument* aDocument, nsString& aString)
> {
>   if (!aLength) {
>     aString.Truncate();
>     return NS_OK;
>   }
> 
>-  nsAutoCString characterSet;
>+  // The encoding info precedence is as follows from high to low:
>+  // The BOM
>+  // HTTP Content-Type (if name recognized)
>+  // charset attribute (if name recognized)
>+  // The encoding of the document
> 
>-  nsresult rv = NS_OK;
>-  if (aChannel) {
>-    rv = aChannel->GetContentCharset(characterSet);
>-  }
>-
>-  if (!aHintCharset.IsEmpty() && (NS_FAILED(rv) || characterSet.IsEmpty())) {
>-    // charset name is always ASCII.
>-    LossyCopyUTF16toASCII(aHintCharset, characterSet);
>-  }
>-
>-  if (NS_FAILED(rv) || characterSet.IsEmpty()) {
>-    DetectByteOrderMark(aData, aLength, characterSet);
>-  }
>-
>-  if (characterSet.IsEmpty() && aDocument) {
>-    // charset from document default
>-    characterSet = aDocument->GetDocumentCharacterSet();
>-  }
>-
>-  if (characterSet.IsEmpty()) {
>-    // fall back to ISO-8859-1, see bug 118404
>-    characterSet.AssignLiteral("ISO-8859-1");
>-  }
>+  nsAutoCString charset;
> 
>   nsCOMPtr<nsICharsetConverterManager> charsetConv =
>-    do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID, &rv);
>+    do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID);
> 
>   nsCOMPtr<nsIUnicodeDecoder> unicodeDecoder;
> 
>-  if (NS_SUCCEEDED(rv) && charsetConv) {
>-    rv = charsetConv->GetUnicodeDecoder(characterSet.get(),
>+  if (DetectByteOrderMark(aData, aLength, charset)) {
>+    // charset is now "UTF-8" or "UTF-16". The UTF-16 decoder will re-sniff
>+    // the BOM for endianness. Both the UTF-16 and the UTF-8 decoder will
>+    // take care of swallowing the BOM.
>+    charsetConv->GetUnicodeDecoderRaw(charset.get(),
>                                         getter_AddRefs(unicodeDecoder));
Could you fix the indentation of the latter parameter.

>+      NS_SUCCEEDED(aChannel->GetContentCharset(charset))) {
>+    charsetConv->GetUnicodeDecoder(charset.get(),
>+                                     getter_AddRefs(unicodeDecoder));
ditto

>+  }
> 
>-    rv = unicodeDecoder->GetMaxLength(reinterpret_cast<const char*>(aData),
>-                                      aLength, &unicodeLength);
>-    if (NS_SUCCEEDED(rv)) {
>-      if (!EnsureStringLength(aString, unicodeLength))
>-        return NS_ERROR_OUT_OF_MEMORY;
>+  if (!unicodeDecoder) {
>+    CopyUTF16toUTF8(aHintCharset, charset);
>+    charsetConv->GetUnicodeDecoder(charset.get(),
>+                                     getter_AddRefs(unicodeDecoder));
ditto

>-      PRUnichar *ustr = aString.BeginWriting();
>+  if (!unicodeDecoder && aDocument) {
>+    charset = aDocument->GetDocumentCharacterSet();
>+    charsetConv->GetUnicodeDecoderRaw(charset.get(),
>+                                        getter_AddRefs(unicodeDecoder));
>+  }
ditto

>+  if (!unicodeDecoder) {
>+    // Curiously, there are various callers that don't pass aDocument. The
>+    // fallback in the old code was ISO-8859-1, which behaved like
>+    // windows-1252. Saying windows-1252 for clarity and for compliance
>+    // with the Encoding Standard.
>+    charsetConv->GetUnicodeDecoderRaw("windows-1252",
>+                                        getter_AddRefs(unicodeDecoder));
ditto
Comment 13 Henri Sivonen (:hsivonen) 2012-10-22 23:28:05 PDT
Thanks. Landed with indent fixed.
https://hg.mozilla.org/integration/mozilla-inbound/rev/369d723c52ee
Comment 14 Ryan VanderMeulen [:RyanVM] 2012-10-23 03:55:04 PDT
https://hg.mozilla.org/mozilla-central/rev/369d723c52ee

Note You need to log in before you can comment on or make changes to this bug.