Closed Bug 687859 Opened 13 years ago Closed 12 years ago

BOM should take precedence for charset determination for scripts

Categories

(Core :: DOM: Core & HTML, defect)

Platform: x86
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
mozilla19

People

(Reporter: dindog, Assigned: hsivonen)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 1 obsolete file)

Attached file testcase.html & foo.js included (obsolete) —
User Agent: Mozilla/5.0 (Windows NT 6.1; rv:7.0) Gecko/20100101 Firefox/7.0
Build ID: 20110918030911

Steps to reproduce:

We found that when a wrong charset is declared for an external script, Firefox is the only browser that fails to switch to the correct charset.
The testcase is a GB2312-encoded HTML page that loads src='foo.js', which is saved as UTF-8 with a BOM; the declared charset is intentionally wrong.

Chrome, IE, and Opera are fine; Firefox is the only one that fails this testcase. This should be improved.
Attached file testcase
Updated the testcase and set the OS to All, since I can reproduce it on Linux too.
Attachment #561201 - Attachment is obsolete: true
Status: UNCONFIRMED → NEW
Component: General → File Handling
Ever confirmed: true
OS: Windows 7 → All
Product: Firefox → Core
QA Contact: general → file-handling
Version: 7 Branch → unspecified
Wrong charset in what sense?  What exact charset information is being provided?  The "charset" attribute on the element and the BOM in the file, with no HTTP-level charset?  Which charset is the "right" one?

The current HTML5 spec text about this says:

To obtain the Unicode string, the user agent run the following steps:

    If the resource's Content Type metadata, if any, specifies a character encoding, and
    the user agent supports that encoding, then let character encoding be that encoding,
    and jump to the bottom step in this series of steps.

    If the algorithm above set the script block's character encoding, then let character
    encoding be that encoding, and jump to the bottom step in this series of steps.
 
    For each of the rows in the following table, starting with the first one and going
    down, if the file has as many or more bytes available than the number of bytes in the
    first column, and the first bytes of the file match the bytes given in the first
    column, then set character encoding to the encoding given in the cell in the second
    column of that row, and jump to the bottom step in this series of steps:
    Bytes in Hexadecimal 	Encoding
    FE FF 	Big-endian UTF-16
    FF FE 	Little-endian UTF-16
    EF BB BF 	UTF-8

    Note: This step looks for Unicode Byte Order Marks (BOMs).

So the charset attribute overrides the BOM, per spec.  If you think the spec is wrong, please raise a spec issue.
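The spec's BOM table quoted above can be sketched as a standalone helper. This is an illustrative sketch, not Gecko's actual DetectByteOrderMark; the function name and signature are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Illustrative BOM sniffer following the spec table quoted above.
// Returns the detected encoding name, or "" when no BOM is present.
std::string SniffScriptBOM(const uint8_t* data, size_t length) {
  if (length >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
    return "UTF-16BE";  // big-endian UTF-16
  }
  if (length >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
    return "UTF-16LE";  // little-endian UTF-16
  }
  if (length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
    return "UTF-8";
  }
  return "";  // no BOM: fall through to the later steps
}
```

Per the spec steps, this sniff only runs when neither Content-Type metadata nor the charset attribute already chose an encoding, which is exactly the ordering this bug disputes.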
(In reply to Boris Zbarsky (:bz) from comment #2)
> Wrong charset in what sense?  What exact charset information is being
> provided?  The "charset" attribute on the element and the BOM in the file,
> with no HTTP-level charset?  Which charset is the "right" one?
> [...]
> So the charset attribute overrides the BOM, per spec.  If you think the spec
> is wrong, please raise a spec issue.

I am not a developer, but here is the thing:
people on our forum found a website on which Firefox behaved differently from the other browsers, and we found it is the website's fault, sort of. It returns a wrong charset in the HTTP header. Obviously I can't do that with a local testcase, but the issue remains if I declare a wrong charset in the script tag.

Firefox is the only mainstream browser that fails to run the script when the declared charset is incorrect; don't you think this should be fixed? The site is indeed at fault, but users won't see it that way when every other browser works fine: they will think it's Firefox's fault.
(In reply to Boris Zbarsky (:bz) from comment #2)
> Wrong charset in what sense?  What exact charset information is being
> provided?  The "charset" attribute on the element and the BOM in the file,
> with no HTTP-level charset?  Which charset is the "right" one?
The simple answer is both: a wrong charset in the HTTP header or in the attribute on the element goes wrong (the testcase demonstrates the latter).
Henri, do IE and Opera allow a BOM in a script to override HTTP headers and @charset?
Component: File Handling → DOM
QA Contact: file-handling → general
Any news about this?

I see a similar bug was fixed recently (bug 693806), according to the spec though.

I mean, now that Firefox is the only browser that fails this testcase, shouldn't we pay a little attention to it?
Summary: firefox should verify external sources' charset → BOM should take precedence for charset determination for scripts
Assignee: nobody → hsivonen
Blocks: encoding
Status: NEW → ASSIGNED
Attachment #672774 - Flags: review?(bugs)
(In reply to Boris Zbarsky (:bz) from comment #5)
> Henri, do IE and Opera allow a BOM in a script to override HTTP headers and
> @charset?

I tested IE7, Opera.next and Chrome. BOM takes precedence over HTTP in all of them: http://hsivonen.iki.fi/test/moz/bom/no-charset-attribute.html1251
(In reply to Henri Sivonen (:hsivonen) from comment #8)
> (In reply to Boris Zbarsky (:bz) from comment #5)
> > Henri, do IE and Opera allow a BOM in a script to override HTTP headers and
> > @charset?
> 
> I tested IE7, Opera.next and Chrome. BOM takes precedence over HTTP in all
> of them: http://hsivonen.iki.fi/test/moz/bom/no-charset-attribute.html1251
IE10 displayed '€' in all document modes.
(In reply to Masatoshi Kimura [:emk] from comment #9)
> IE10 displayed '€' in all document modes.

It strips the BOM and then decodes according to HTTP charset.
Filed bug 804113 about a potential pre-existing problem preserved by this patch.
Comment on attachment 672774 [details] [diff] [review]
Make the BOM take precedence

># HG changeset patch
># User Henri Sivonen <hsivonen@iki.fi>
># Date 1350558618 -10800
># Node ID 755369f4f020ee6983b3e312dd05e05c7f34ab8d
># Parent  89ff8e55638111c977538dbd22edf00b82f013ee
>Bug 687859 - Make the BOM take precedence when determining the character encoding of scripts. r=NOT_REVIEWED.
>
>diff --git a/content/base/src/nsScriptLoader.cpp b/content/base/src/nsScriptLoader.cpp
>--- a/content/base/src/nsScriptLoader.cpp
>+++ b/content/base/src/nsScriptLoader.cpp
>@@ -999,94 +999,103 @@ nsScriptLoader::ConvertToUTF16(nsIChanne
>                                uint32_t aLength, const nsAString& aHintCharset,
>                                nsIDocument* aDocument, nsString& aString)
> {
>   if (!aLength) {
>     aString.Truncate();
>     return NS_OK;
>   }
> 
>-  nsAutoCString characterSet;
>+  // The encoding info precedence is as follows from high to low:
>+  // The BOM
>+  // HTTP Content-Type (if name recognized)
>+  // charset attribute (if name recognized)
>+  // The encoding of the document
> 
>-  nsresult rv = NS_OK;
>-  if (aChannel) {
>-    rv = aChannel->GetContentCharset(characterSet);
>-  }
>-
>-  if (!aHintCharset.IsEmpty() && (NS_FAILED(rv) || characterSet.IsEmpty())) {
>-    // charset name is always ASCII.
>-    LossyCopyUTF16toASCII(aHintCharset, characterSet);
>-  }
>-
>-  if (NS_FAILED(rv) || characterSet.IsEmpty()) {
>-    DetectByteOrderMark(aData, aLength, characterSet);
>-  }
>-
>-  if (characterSet.IsEmpty() && aDocument) {
>-    // charset from document default
>-    characterSet = aDocument->GetDocumentCharacterSet();
>-  }
>-
>-  if (characterSet.IsEmpty()) {
>-    // fall back to ISO-8859-1, see bug 118404
>-    characterSet.AssignLiteral("ISO-8859-1");
>-  }
>+  nsAutoCString charset;
> 
>   nsCOMPtr<nsICharsetConverterManager> charsetConv =
>-    do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID, &rv);
>+    do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID);
> 
>   nsCOMPtr<nsIUnicodeDecoder> unicodeDecoder;
> 
>-  if (NS_SUCCEEDED(rv) && charsetConv) {
>-    rv = charsetConv->GetUnicodeDecoder(characterSet.get(),
>+  if (DetectByteOrderMark(aData, aLength, charset)) {
>+    // charset is now "UTF-8" or "UTF-16". The UTF-16 decoder will re-sniff
>+    // the BOM for endianness. Both the UTF-16 and the UTF-8 decoder will
>+    // take care of swallowing the BOM.
>+    charsetConv->GetUnicodeDecoderRaw(charset.get(),
>                                         getter_AddRefs(unicodeDecoder));
Could you fix the indentation of the latter parameter?

>+      NS_SUCCEEDED(aChannel->GetContentCharset(charset))) {
>+    charsetConv->GetUnicodeDecoder(charset.get(),
>+                                     getter_AddRefs(unicodeDecoder));
ditto

>+  }
> 
>-    rv = unicodeDecoder->GetMaxLength(reinterpret_cast<const char*>(aData),
>-                                      aLength, &unicodeLength);
>-    if (NS_SUCCEEDED(rv)) {
>-      if (!EnsureStringLength(aString, unicodeLength))
>-        return NS_ERROR_OUT_OF_MEMORY;
>+  if (!unicodeDecoder) {
>+    CopyUTF16toUTF8(aHintCharset, charset);
>+    charsetConv->GetUnicodeDecoder(charset.get(),
>+                                     getter_AddRefs(unicodeDecoder));
ditto

>-      PRUnichar *ustr = aString.BeginWriting();
>+  if (!unicodeDecoder && aDocument) {
>+    charset = aDocument->GetDocumentCharacterSet();
>+    charsetConv->GetUnicodeDecoderRaw(charset.get(),
>+                                        getter_AddRefs(unicodeDecoder));
>+  }
ditto

>+  if (!unicodeDecoder) {
>+    // Curiously, there are various callers that don't pass aDocument. The
>+    // fallback in the old code was ISO-8859-1, which behaved like
>+    // windows-1252. Saying windows-1252 for clarity and for compliance
>+    // with the Encoding Standard.
>+    charsetConv->GetUnicodeDecoderRaw("windows-1252",
>+                                        getter_AddRefs(unicodeDecoder));
ditto
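The precedence order the patch implements (spelled out in the comment at the top of the new code) can be modeled as a standalone sketch. ChooseScriptCharset and its parameters are hypothetical names for illustration, not Gecko APIs:

```cpp
#include <string>

// Illustrative model of the patch's decoder-selection order:
// BOM > HTTP Content-Type > charset attribute > document encoding >
// windows-1252 fallback. Empty string means "not provided / not recognized".
std::string ChooseScriptCharset(const std::string& bomCharset,
                                const std::string& httpCharset,
                                const std::string& attrCharset,
                                const std::string& docCharset) {
  if (!bomCharset.empty()) {   // the BOM wins over everything else
    return bomCharset;
  }
  if (!httpCharset.empty()) {  // then HTTP Content-Type
    return httpCharset;
  }
  if (!attrCharset.empty()) {  // then the charset attribute
    return attrCharset;
  }
  if (!docCharset.empty()) {   // then the embedding document's encoding
    return docCharset;
  }
  return "windows-1252";       // Encoding Standard fallback (see the patch comment)
}
```

In the real patch each step also requires that a decoder can actually be obtained for the name; this sketch only captures the ordering.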
Attachment #672774 - Flags: review?(bugs) → review+
https://hg.mozilla.org/mozilla-central/rev/369d723c52ee
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
Component: DOM → DOM: Core & HTML