Closed Bug 809934 Opened 12 years ago Closed 7 years ago

Remove reliance on BOM-sniffing UTF-16 decoder

Categories

(Core :: DOM: Core & HTML, defect)

RESOLVED DUPLICATE of bug 504831

People

(Reporter: emk, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 4 obsolete files)

Our current UTF-16 usage is based on the Unicode Standard and IANA registry. That is, UTF-16 sniffs the BOM.
However, the Encoding Standard says utf-16 is the same as utf-16le. We should revert bug 335531 to comply with the new spec.
We don't have to consider mailnews here because UTF-16 is not suitable for MIME.
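To make the difference concrete, here is a minimal sketch contrasting a BOM-sniffing UTF-16 decoder with a fixed little-endian one. Python's built-in "utf-16" and "utf-16-le" codecs are used purely as stand-ins for the two behaviors; they are an assumption for illustration, not Gecko's decoders, and the sample bytes and function names are hypothetical.

```python
# Illustration only: Python's codecs stand in for the two decoder styles
# discussed in this bug; this is not Gecko's implementation.

BE_SAMPLE = b"\xfe\xff\x00z"   # big-endian BOM followed by "z"
LE_SAMPLE = b"\xff\xfez\x00"   # little-endian BOM followed by "z"


def decode_sniffing(data: bytes) -> str:
    """BOM-sniffing decoder: the BOM selects the byte order and is consumed."""
    return data.decode("utf-16")


def decode_fixed_le(data: bytes) -> str:
    """Fixed little-endian decoder: no sniffing, BOM bytes are ordinary input."""
    return data.decode("utf-16-le")


if __name__ == "__main__":
    for name, sample in (("BE", BE_SAMPLE), ("LE", LE_SAMPLE)):
        print(name, ascii(decode_sniffing(sample)), ascii(decode_fixed_le(sample)))
```

With the sniffing decoder both samples come out as "z". With the fixed little-endian decoder the little-endian sample keeps U+FEFF as a leading character, and the big-endian sample is misread as U+FFFE U+7A00.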
(In reply to Masatoshi Kimura [:emk] from comment #0)
> However, Encoding Standard says utf-16 is the same as utf-16le.

More to the point, the Encoding Standard sniffs the BOM first as part of the "decode" algorithm, so if HTTP says "utf-16", the "decode" algorithm changes the label to "utf-16be" before invoking the actual decoder if there is a big-endian BOM.

So if we proceed with this, pages that have a big-endian UTF-16 BOM would report "utf-16be" as document.characterSet even if HTTP said "utf-16". Will that break scripts?
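A rough sketch of the behavior described above, assuming the "decode" step checks for a UTF-8, UTF-16BE or UTF-16LE BOM, lets a BOM override the supplied label, and strips it before the actual decoder runs. The function names and the returned (encoding, text) pair are hypothetical; the returned encoding name only approximates what document.characterSet would report.

```python
# Sketch of BOM sniffing ahead of the decoder; names are illustrative,
# and Python codecs stand in for the real decoders.

BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
)


def sniff_bom(data: bytes, fallback: str) -> tuple[str, int]:
    """Return (encoding, BOM length); a BOM overrides the fallback encoding."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return fallback, 0


def decode(data: bytes, label: str) -> tuple[str, str]:
    """Decode data for a protocol-supplied label such as "utf-16"."""
    label = label.strip().lower()
    # Assumption from this bug: "utf-16" is a label for utf-16le.
    fallback = "utf-16le" if label == "utf-16" else label
    encoding, skip = sniff_bom(data, fallback)
    # The BOM is removed before the bytes reach the decoder.
    return encoding, data[skip:].decode(encoding, errors="replace")


if __name__ == "__main__":
    # HTTP says "utf-16", but the body starts with a big-endian BOM.
    print(decode(b"\xfe\xff\x00z", "utf-16"))
```

Under these assumptions a "utf-16" label with a big-endian BOM yields "utf-16be" as the effective encoding, which is the document.characterSet change in question.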
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> So if we proceed with this, pages that have a big-endian UTF-16 BOM would
> report "utf-16be" as document.characterSet even if HTTP said "utf-16". Will
> that break scripts?

Based on Google Code search, I’m guessing the answer is “No”. Search for .characterSet and UTF-16 or utf-16 in .js finds just Firefox sources.
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> More to the point, the Encoding Standard sniffs the BOM first as part of the
> "decode" algorithm, so if HTTP says "utf-16", the "decode" algorithm changes
> the label to "utf-16be" before invoking the actual decoder if there is a
> big-endian BOM.
The "decode" algorithm will override not only "utf-16" but also any other encoding. IIUC it implements the "BOM trumps everything" rule.

> So if we proceed with this, pages that have a big-endian UTF-16 BOM would
> report "utf-16be" as document.characterSet even if HTTP said "utf-16". Will
> that break scripts?
I thought it was already implemented by bug 716579. No?
(In reply to Masatoshi Kimura [:emk] from comment #3)
> > So if we proceed with this, pages that have a big-endian UTF-16 BOM would
> > report "utf-16be" as document.characterSet even if HTTP said "utf-16". Will
> > that break scripts?
> I thought it was already implemented by bug 716579. No?
Ah, bug 716579 always uses the "utf-16" label and expects the decoder to sniff the BOM.
(In reply to Henri Sivonen (:hsivonen) from comment #2)
> Based on Google Code search, I’m guessing the answer is “No”. Search for
> .characterSet and UTF-16 or utf-16 in .js finds just Firefox sources.
I don't think Web pages can rely on the .characterSet label. For example, IE returns "unicode" or "unicodeFFFE" on utf-16 pages. WebKit uses "UTF-16LE" as a canonical name for little endian utf-16 pages (neither "UTF-16" nor "utf-16").
Attached file utf-16 sample (obsolete) —
Attached file utf-16le sample (obsolete) —
Attached file utf-16be sample (obsolete) —
Attached file Test (obsolete) —
Attached file utf-16 sample
Attached file utf-16le sample
Attached file utf-16be sample
Attached file Test
Attachment #680041 - Attachment is obsolete: true
Attachment #680042 - Attachment is obsolete: true
Attachment #680043 - Attachment is obsolete: true
Attachment #680045 - Attachment is obsolete: true
Firefox: UTF-16 UTF-16LE UTF-16BE (UTF-16 sniffs the BOM)
Chrome: UTF-16LE UTF-16LE UTF-16BE
IE10: unicode unicode unicodeFEFF (in all document modes)
Opera: utf-16 utf-16 utf-16 (utf-16 sniffs the BOM?)

No browsers behave as the Encoding Standard defines!
Anne, why did you make "utf-16" the canonical name of utf-16 little-endian?
It conflicts with all browsers, the Unicode Standard and the IANA registry. I think "utf-16le" would be better for the canonical name.
> Firefox: UTF-16 UTF-16LE UTF-16BE (UTF-16 sniffs the BOM)
This is Firefox 16.0.2. On the latest Nightly, it was UTF-16LE UTF-16LE UTF-16BE.
Genuine IE8: unicode unicode unicodeFFFE (Unlike IE8 mode of IE10, the testcase didn't work. I loaded individual pages to get the result.)
MS changed the encoding name from "unicodeFFFE" to "unicodeFEFF". Maybe nobody cares about the encoding name...
Safari 5.1.7: UTF-16LE UTF-16LE UTF-16BE
emk, I think I have been tripped up by utf-16 sniffing. http://lists.w3.org/Archives/Public/public-whatwg-archive/2011Dec/0256.html was my reasoning (that month contains some other messages that may be of interest here).
(In reply to Anne van Kesteren from comment #18)
> emk, I think I have been tripped up by utf-16 sniffing.
> http://lists.w3.org/Archives/Public/public-whatwg-archive/2011Dec/0256.html
> was my reasoning (that month contains some other messages that may be of
> interest here).
I read the thread, but I still didn't see the reasoning behind that.
> utf-16le becomes a label for utf-16.
As my test has shown, at least Safari and Chrome treat UTF-16 as a label for UTF-16LE, not the other way round. Nightly also follows them now. Leif's reasoning doesn't hold either, because the BOM will be removed inside the "decode" algorithm before the stream is passed to the decoder.
Moreover UTF-16LE was already consistent between browsers if the BOM is absent.
> * Gecko decodes FFFE 007A as FFFD followed by FE00 presumably dropping the 7A.
> ** Gecko decodes FEFF 007A as FFFD followed by 00FF presumably dropping the 7A.
I believe it was fixed by bug 716579.
emk, my apologies for making you read through that. I misunderstood your comment. The standard is now updated.
Thank you. Filed bug 811127 to catch up with the spec.
Please see bug 814900 (test file: https://bug814900.bugzilla.mozilla.org/attachment.cgi?id=684883)

I don't know whether that bug is due to the solution to _this_ bug, to the solution to bug 716579, or to something else.

But the fact is that, right now, Firefox 19 and 20 issue a fatal error for UTF-16 encoded, big-endian XML files with the BOM.

My suspicion is that it is this bug, 809934, since this bug is about treating "utf-16" as a label for "utf-16le". And "utf-16le" does not permit the BOM, hence the XML parser should issue a fatal error.

Which kind of demonstrates the problems I have both with this bug and with what the Encoding Standard says about this.
Is this a duplicate of bug 504831?
(In reply to Anne (:annevk) from comment #23)
> Is this a duplicate of bug 504831?

I believe it is.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
Component: DOM → DOM: Core & HTML