Closed
Bug 809934
Opened 12 years ago
Closed 8 years ago
Remove reliance on BOM-sniffing UTF-16 decoder
Categories
(Core :: DOM: Core & HTML, defect)
RESOLVED
DUPLICATE
of bug 504831
People
(Reporter: emk, Unassigned)
References
(Blocks 1 open bug)
Details
Attachments
(4 files, 4 obsolete files)
Our current UTF-16 usage is based on the Unicode Standard and IANA registry. That is, UTF-16 sniffs the BOM.
However, Encoding Standard says utf-16 is the same as utf-16le. We should revert bug 335531 to comply with the new spec.
We don't have to consider mailnews here because UTF-16 is not suitable for MIME.
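The difference between the two definitions can be illustrated with Python's built-in codecs, used here only as a stand-in for the browser decoders: Python's `utf-16` codec sniffs the BOM in the Unicode Standard sense, while `utf-16-le` never does. The sample bytes are made up for illustration and do not come from any attachment on this bug.

```python
# Big-endian BOM (FE FF) followed by "z" encoded as UTF-16BE.
data = b"\xfe\xff\x00z"

# BOM-sniffing decoder: the BOM selects big-endian and is consumed.
sniffed = data.decode("utf-16")

# Fixed little-endian decoder: the BOM bytes are decoded as ordinary data,
# so every 16-bit code unit comes out byte-swapped.
fixed_le = data.decode("utf-16-le")

print(repr(sniffed))
print([hex(ord(c)) for c in fixed_le])
```

If "utf-16" simply means "utf-16le", something outside the decoder has to look at the BOM first, otherwise big-endian content like the sample above is misread.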
Comment 1•12 years ago
(In reply to Masatoshi Kimura [:emk] from comment #0)
> However, Encoding Standard says utf-16 is the same as utf-16le.
More to the point, the Encoding Standard sniffs the BOM first as part of the "decode" algorithm, so if HTTP says "utf-16", the "decode" algorithm changes the label to "utf-16be" before invoking the actual decoder if there is a big-endian BOM.
So if we proceed with this, pages that have a big-endian UTF-16 BOM would report "utf-16be" as document.characterSet even if HTTP said "utf-16". Will that break scripts?
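A minimal sketch of the behaviour described in this comment, assuming the BOM check runs before the label is consulted. The function name `decode`, the `LABELS` table, and the use of Python codec names in place of Encoding Standard names are all illustrative assumptions, not Gecko code or the spec text.

```python
from typing import Tuple

# Assumption taken from comment 0: the label "utf-16" is treated as
# little-endian. Python codec names stand in for the spec's names.
LABELS = {"utf-16": "utf-16-le", "utf-16le": "utf-16-le", "utf-16be": "utf-16-be"}

BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)


def decode(data: bytes, label: str) -> Tuple[str, str]:
    """Return (encoding actually used, decoded text)."""
    encoding = LABELS.get(label.lower(), label)
    for bom, bom_encoding in BOMS:
        if data.startswith(bom):
            # The BOM overrides the label and is stripped, so the
            # underlying decoder never sees it.
            encoding = bom_encoding
            data = data[len(bom):]
            break
    return encoding, data.decode(encoding, errors="replace")


used_with_bom, text_with_bom = decode(b"\xfe\xff\x00z", "utf-16")
used_without_bom, text_without_bom = decode(b"z\x00", "utf-16")
print(used_with_bom, repr(text_with_bom))
print(used_without_bom, repr(text_without_bom))
```

In this sketch the first element of the result plays the role of what document.characterSet would report: big-endian when a big-endian BOM is present, even though the caller passed "utf-16".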
Comment 2•12 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> So if we proceed with this, pages that have a big-endian UTF-16 BOM would
> report "utf-16be" as document.characterSet even if HTTP said "utf-16". Will
> that break scripts?
Based on Google Code search, I’m guessing the answer is “No”. Search for .characterSet and UTF-16 or utf-16 in .js finds just Firefox sources.
Reporter | ||
Comment 3•12 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> More to the point, the Encoding Standard sniffs the BOM first as part of the
> "decode" algorithm, so if HTTP says "utf-16", the "decode" algorithm changes
> the label to "utf-16be" before invoking the actual decoder if there is a
> big-endian BOM.
The "decode" algorithm will override not only "utf-16" but also any other encoding. IIUC, it implements the "BOM trumps everything" rule.
> So if we proceed with this, pages that have a big-endian UTF-16 BOM would
> report "utf-16be" as document.characterSet even if HTTP said "utf-16". Will
> that break scripts?
I thought it was already implemented by bug 716579. No?
Reporter | ||
Comment 4•12 years ago
(In reply to Masatoshi Kimura [:emk] from comment #3)
> > So if we proceed with this, pages that have a big-endian UTF-16 BOM would
> > report "utf-16be" as document.characterSet even if HTTP said "utf-16". Will
> > that break scripts?
> I thought it was already implemented by bug 716579. No?
Ah, bug 716579 always uses the "utf-16" label and expects the decoder to sniff the BOM.
Reporter | ||
Comment 5•12 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #2)
> Based on Google Code search, I’m guessing the answer is “No”. Search for
> .characterSet and UTF-16 or utf-16 in .js finds just Firefox sources.
I don't think Web pages can rely on the .characterSet label. For example, IE returns "unicode" or "unicodeFFFE" on utf-16 pages. WebKit uses "UTF-16LE" as the canonical name for little-endian utf-16 pages (neither "UTF-16" nor "utf-16").
Reporter | ||
Comment 6•12 years ago
Reporter | ||
Comment 7•12 years ago
Reporter | ||
Comment 8•12 years ago
Reporter | ||
Comment 9•12 years ago
Reporter | ||
Comment 10•12 years ago
Reporter | ||
Comment 11•12 years ago
Reporter | ||
Comment 12•12 years ago
Reporter | ||
Comment 13•12 years ago
Attachment #680041 -
Attachment is obsolete: true
Attachment #680042 -
Attachment is obsolete: true
Attachment #680043 -
Attachment is obsolete: true
Attachment #680045 -
Attachment is obsolete: true
Reporter | ||
Comment 14•12 years ago
Firefox: UTF-16    UTF-16LE  UTF-16BE     (UTF-16 sniffs the BOM)
Chrome:  UTF-16LE  UTF-16LE  UTF-16BE
IE10:    unicode   unicode   unicodeFEFF  (in all document modes)
Opera:   utf-16    utf-16    utf-16       (utf-16 sniffs the BOM?)
No browsers behave as the Encoding Standard defines!
Anne, why did you make "utf-16" the canonical name of utf-16 little-endian?
It conflicts with all browsers, the Unicode Standard, and the IANA registry. I think "utf-16le" would be better for the canonical name.
Reporter | ||
Comment 15•12 years ago
> Firefox: UTF-16 UTF-16LE UTF-16BE (UTF-16 sniffs the BOM)
This is Firefox 16.0.2. On the latest Nightly, it was UTF-16LE UTF-16LE UTF-16BE.
Reporter | ||
Comment 16•12 years ago
Genuine IE8: unicode unicode unicodeFFFE (Unlike the IE8 mode of IE10, the testcase didn't work, so I loaded the individual pages to get the result.)
MS changed the encoding name from "unicodeFFFE" to "unicodeFEFF". Maybe nobody cares about the encoding name...
Reporter | ||
Comment 17•12 years ago
Safari 5.1.7: UTF-16LE UTF-16LE UTF-16BE
Comment 18•12 years ago
emk, I think I have been tripped by utf-16 sniffing. http://lists.w3.org/Archives/Public/public-whatwg-archive/2011Dec/0256.html was my reasoning (that month contains some other messages that may be of interest here).
Reporter | ||
Comment 19•12 years ago
(In reply to Anne van Kesteren from comment #18)
> emk, I think I have been tripped by utf-16 sniffing.
> http://lists.w3.org/Archives/Public/public-whatwg-archive/2011Dec/0256.html
> was my reasoning (that month contains some other messages that may be of
> interest here).
I read the thread, but I still don't see the reasoning behind that.
> utf-16le becomes a label for utf-16.
As my test has shown, at least Safari and Chrome treat UTF-16 as a label for UTF-16LE, not the other way round. Nightly also follows them now. Leif's reasoning doesn't hold either, because the BOM will be removed inside the "decode" algorithm before the stream is passed to the decoder.
Moreover, UTF-16LE was already consistent across browsers when the BOM is absent.
> * Gecko decodes FFFE 007A as FFFD followed by FE00 presumably dropping the
> 7A.
> ** Gecko decodes FEFF 007A as FFFD followed by 00FF presumably dropping the
> 7A.
I believe it was fixed by bug 716579.
Comment 20•12 years ago
emk, my apologies for making you read through that. I misunderstood your comment. The standard is now updated.
Reporter | ||
Comment 21•12 years ago
Thank you. Filed bug 811127 to catch up with the spec.
Comment 22•12 years ago
Please see bug 814900 (test file: https://bug814900.bugzilla.mozilla.org/attachment.cgi?id=684883)
I don't know if that bug is due to the solution to _this_ bug, or due to the solution to bug 716579, or due to something else.
But the fact is that, right now, Firefox 19 and 20 issue a fatal error for UTF-16-encoded, big-endian XML files with the BOM.
My suspicion is that it is this bug, 809934, since this bug is about treating "utf-16" as a label for "utf-16le". And "utf-16le" does not permit the BOM, hence the XML parser should issue a fatal error.
Which kind of demonstrates the problems I have both with this bug and with what the Encoding Standard says about this.
Comment 23•10 years ago
Is this a duplicate of bug 504831?
Comment 24•8 years ago
(In reply to Anne (:annevk) from comment #23)
> Is this a duplicate of bug 504831?
I believe it is.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
Assignee | ||
Updated•6 years ago
Component: DOM → DOM: Core & HTML