Closed Bug 809934 Opened 12 years ago Closed 7 years ago

Remove reliance on BOM-sniffing UTF-16 decoder

Status:

RESOLVED DUPLICATE of bug 504831

People

(Reporter: emk, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 4 obsolete files)

utf-16 sample 12 years ago Masatoshi Kimura [:emk] 2 bytes, text/plain;charset=utf-16		Details
utf-16le sample 12 years ago Masatoshi Kimura [:emk] 2 bytes, text/plain;charset=utf-16le		Details
utf-16be sample 12 years ago Masatoshi Kimura [:emk] 2 bytes, text/plain;charset=utf-16be		Details
Test 12 years ago Masatoshi Kimura [:emk] 650 bytes, text/html		Details
utf-16 sample 12 years ago Masatoshi Kimura [:emk] 4 bytes, text/plain;charset=utf-16		Details
utf-16le sample 12 years ago Masatoshi Kimura [:emk] 4 bytes, text/plain;charset=utf-16le		Details
utf-16be sample 12 years ago Masatoshi Kimura [:emk] 4 bytes, text/plain;charset=utf-16be		Details
Test 12 years ago Masatoshi Kimura [:emk] 650 bytes, text/html		Details

Masatoshi Kimura [:emk]

Reporter

Description

•

12 years ago

Our current UTF-16 usage is based on the Unicode Standard and IANA registry. That is, UTF-16 sniffs the BOM.
However, the Encoding Standard says utf-16 is the same as utf-16le. We should revert bug 335531 to comply with the new spec.
We don't have to consider mailnews here because UTF-16 is not suitable for MIME.
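
For illustration only, the difference between a BOM-sniffing "utf-16" decoder and a fixed little-endian one can be reproduced with Python's standard codecs. This is a sketch, not Gecko code: Python's "utf-16" codec stands in for the Unicode Standard / IANA behavior, and "utf-16-le" for a decoder that never looks at the BOM. The sample bytes are made up for the example.

```python
import codecs

# A big-endian BOM (FE FF) followed by U+007A ("z") in big-endian byte order.
data = codecs.BOM_UTF16_BE + "z".encode("utf-16-be")  # b"\xfe\xffz" preceded by nothing else

# Python's "utf-16" codec sniffs the BOM, selects big-endian, and strips the BOM.
sniffed = data.decode("utf-16")

# "utf-16-le" never sniffs: the same bytes are read as little-endian code units,
# so FE FF becomes U+FFFE and 00 7A becomes U+7A00.
fixed_le = data.decode("utf-16-le")

print(ascii(sniffed))   # 'z'
print(ascii(fixed_le))  # '\ufffe\u7a00'
```

The same four bytes therefore yield different text depending on whether "utf-16" means "sniff the BOM" or "little-endian", which is the compatibility question raised here.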

Henri Sivonen (:hsivonen) (on leave)

Comment 1

•

12 years ago

(In reply to Masatoshi Kimura [:emk] from comment #0)
> However, Encoding Standard says utf-16 is the same as utf-16le.

More to the point, the Encoding Standard sniffs the BOM first as part of the "decode" algorithm, so if HTTP says "utf-16", the "decode" algorithm changes the label to "utf-16be" before invoking the actual decoder if there is a big-endian BOM.

So if we proceed with this, pages that have a big-endian UTF-16 BOM would report "utf-16be" as document.characterSet even if HTTP said "utf-16". Will that break scripts?
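
A minimal sketch of that BOM-first behavior, assuming a heavily reduced label table. The names `LABELS`, `PY_CODEC` and `decode` are hypothetical, and the encoding names are written in lowercase as in this bug; this is not the Encoding Standard's full algorithm nor Gecko's implementation.

```python
import codecs
from typing import Tuple

# BOMs checked before the label is honored: UTF-8 first, then the UTF-16 pair.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]

# Hypothetical, reduced label table: "utf-16" is treated as a label for utf-16le.
LABELS = {
    "utf-8": "utf-8",
    "utf-16": "utf-16le",
    "utf-16le": "utf-16le",
    "utf-16be": "utf-16be",
    "windows-1252": "windows-1252",
}

# Mapping from the names above to Python codec names.
PY_CODEC = {
    "utf-8": "utf-8",
    "utf-16le": "utf-16-le",
    "utf-16be": "utf-16-be",
    "windows-1252": "cp1252",
}


def decode(data: bytes, label: str) -> Tuple[str, str]:
    """Return (text, effective encoding); a BOM overrides the supplied label."""
    encoding = LABELS[label.strip().lower()]
    for bom, bom_encoding in BOMS:
        if data.startswith(bom):
            encoding = bom_encoding   # the BOM wins over the label
            data = data[len(bom):]    # the BOM is removed before decoding
            break
    return data.decode(PY_CODEC[encoding], errors="replace"), encoding


print(decode(b"\xfe\xff\x00z", "utf-16"))        # ('z', 'utf-16be')
print(decode(b"z\x00", "utf-16"))                # ('z', 'utf-16le')
print(decode(b"\xef\xbb\xbfz", "windows-1252"))  # ('z', 'utf-8')
```

The second element of the returned tuple is what a page would then see as its character set in this model, which is why a big-endian BOM under an HTTP "utf-16" label ends up reported as "utf-16be".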

Henri Sivonen (:hsivonen) (on leave)

Comment 2

•

12 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #1)
> So if we proceed with this, pages that have a big-endian UTF-16 BOM would
> report "utf-16be" as document.characterSet even if HTTP said "utf-16". Will
> that break scripts?

Based on Google Code search, I’m guessing the answer is “No”. Search for .characterSet and UTF-16 or utf-16 in .js finds just Firefox sources.

Masatoshi Kimura [:emk]

Reporter

Comment 3

•

12 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #1)
> More to the point, the Encoding Standard sniffs the BOM first as part of the
> "decode" algorithm, so if HTTP says "utf-16", the "decode" algorithm changes
> the label to "utf-16be" before invoking the actual decoder if there is a
> big-endian BOM.
The "decode" algorithm will change not only "utf-16" but also any other encoding. IIUC it represents the "BOM trumps everything" rule.

> So if we proceed with this, pages that have a big-endian UTF-16 BOM would
> report "utf-16be" as document.characterSet even if HTTP said "utf-16". Will
> that break scripts?
I thought it was already implemented by bug 716579. No?

Masatoshi Kimura [:emk]

Reporter

Comment 4

•

12 years ago

(In reply to Masatoshi Kimura [:emk] from comment #3)
> > So if we proceed with this, pages that have a big-endian UTF-16 BOM would
> > report "utf-16be" as document.characterSet even if HTTP said "utf-16". Will
> > that break scripts?
> I thought it was already implemented by bug 716579. No?
Ah, bug 716579 always uses the "utf-16" label and expects that it will sniff the BOM.

Masatoshi Kimura [:emk]

Reporter

Comment 5

•

12 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #2)
> Based on Google Code search, I’m guessing the answer is “No”. Search for
> .characterSet and UTF-16 or utf-16 in .js finds just Firefox sources.
I don't think Web pages can rely on the .characterSet label. For example, IE returns "unicode" or "unicodeFFFE" on utf-16 pages. WebKit uses "UTF-16LE" as a canonical name for little endian utf-16 pages (neither "UTF-16" nor "utf-16").

Masatoshi Kimura [:emk]

Reporter

Comment 6

•

12 years ago

Attached file utf-16 sample (obsolete) — Details

Masatoshi Kimura [:emk]

Reporter

Comment 7

•

12 years ago

Attached file utf-16le sample (obsolete) — Details

Masatoshi Kimura [:emk]

Reporter

Comment 8

•

12 years ago

Attached file utf-16be sample (obsolete) — Details

Masatoshi Kimura [:emk]

Reporter

Comment 9

•

12 years ago

Attached file Test (obsolete) — Details

Masatoshi Kimura [:emk]

Reporter

Comment 10

•

12 years ago

Attached file utf-16 sample — Details

Masatoshi Kimura [:emk]

Reporter

Comment 11

•

12 years ago

Attached file utf-16le sample — Details

Masatoshi Kimura [:emk]

Reporter

Comment 12

•

12 years ago

Attached file utf-16be sample — Details

Masatoshi Kimura [:emk]

Reporter

Comment 13

•

12 years ago

Attached file Test — Details

Attachment #680041 - Attachment is obsolete: true

Attachment #680042 - Attachment is obsolete: true

Attachment #680043 - Attachment is obsolete: true

Attachment #680045 - Attachment is obsolete: true

Masatoshi Kimura [:emk]

Reporter

Comment 14

•

12 years ago

Firefox: UTF-16 UTF-16LE UTF-16BE (UTF-16 sniffs the BOM)
Chrome: UTF-16LE UTF-16LE UTF-16BE
IE10: unicode unicode unicodeFEFF (in all document modes)
Opera: utf-16 utf-16 utf-16 (utf-16 sniffs the BOM?)

No browsers behave as the Encoding Standard defines!
Anne, why did you make "utf-16" the canonical name of utf-16 little-endian?
It conflicts with all browsers, the Unicode Standard and the IANA registry. I think "utf-16le" would be better for the canonical name.

Masatoshi Kimura [:emk]

Reporter

Comment 15

•

12 years ago

> Firefox: UTF-16 UTF-16LE UTF-16BE (UTF-16 sniffs the BOM)
This is Firefox 16.0.2. On the latest Nightly, it was UTF-16LE UTF-16LE UTF-16BE.

Masatoshi Kimura [:emk]

Reporter

Comment 16

•

12 years ago

Genuine IE8: unicode unicode unicodeFFFE (Unlike IE8 mode of IE10, the testcase didn't work. I loaded individual pages to get the result.)
MS changed the encoding name from "unicodeFFFE" to "unicodeFEFF". Maybe nobody cares about the encoding name...

Masatoshi Kimura [:emk]

Reporter

Comment 17

•

12 years ago

Safari 5.1.7: UTF-16LE UTF-16LE UTF-16BE

Anne (:annevk)

Comment 18

•

12 years ago

emk, I think I have been tripped by utf-16 sniffing. http://lists.w3.org/Archives/Public/public-whatwg-archive/2011Dec/0256.html was my reasoning (that month contains some other messages that may be of interest here).

Masatoshi Kimura [:emk]

Reporter

Comment 19

•

12 years ago

(In reply to Anne van Kesteren from comment #18)
> emk, I think I have been tripped by utf-16 sniffing.
> http://lists.w3.org/Archives/Public/public-whatwg-archive/2011Dec/0256.html
> was my reasoning (that month contains some other messages  that may be of
> interest here).
I read the thread, but I still didn't see the reasoning behind that.
> utf-16le becomes a label for utf-16.
As my test has shown, at least Safari and Chrome treat UTF-16 as a label for UTF-16LE, not the other way round. Nightly also follows them now. Leif's reasoning also doesn't hold because the BOM will be removed inside the "decode" algorithm before the stream is passed to the decoder.
Moreover UTF-16LE was already consistent between browsers if the BOM is absent.
> * Gecko decodes FFFE 007A as FFFD followed by FE00 presumably dropping the 7A.
> ** Gecko decodes FEFF 007A as FFFD followed by 00FF presumably dropping the 7A.
I believe it was fixed by bug 716579.

Anne (:annevk)

Comment 20

•

12 years ago

emk, my apologies for making you read through that. I misunderstood your comment. The standard is now updated.

Masatoshi Kimura [:emk]

Reporter

Comment 21

•

12 years ago

Thank you. Filed bug 811127 to catch up with the spec.

Leif Halvard Silli

Comment 22

•

12 years ago

Please see bug 814900 (test file: https://bug814900.bugzilla.mozilla.org/attachment.cgi?id=684883)

I don't know if that bug is due to the solution to _this_ bug, or due to the solution to bug 716579, or due to something else.

But the fact is that, right now, Firefox 19 and 20 issue a fatal error for UTF-16-encoded, big-endian XML files with the BOM.

My suspicion is that it is this bug, 809934, since this bug is about treating "utf-16" as a label for "utf-16le". And "utf-16le" does not permit the BOM, hence the XML parser should issue a fatal error.

Which kind of demonstrates the problems I have both with this bug and with what the Encoding Standard says about this.
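
To make the failure mode concrete, here is a small sketch using Python's standard library rather than Gecko. The one-element document is made up for the example and is not the attachment from bug 814900. Python's expat-based parser honors the big-endian BOM, whereas reading the same bytes as fixed little-endian without BOM handling produces text that no XML parser could accept.

```python
import codecs
import xml.etree.ElementTree as ET

# A made-up big-endian UTF-16 XML document that starts with a BOM.
xml_text = '<?xml version="1.0" encoding="UTF-16"?><a>z</a>'
payload = codecs.BOM_UTF16_BE + xml_text.encode("utf-16-be")

# A parser that sniffs the BOM reads the document without error.
root = ET.fromstring(payload)
print(root.tag, root.text)  # a z

# Treating the bytes as little-endian, with no BOM sniffing, swaps every code unit:
# the BOM becomes U+FFFE and "<" (00 3C) becomes U+3C00.
garbled = payload.decode("utf-16-le")
print(ascii(garbled[:2]))  # '\ufffe\u3c00'
```

Whether Firefox 19 and 20 fail for this exact reason is not established by the sketch; it only shows what happens when the byte order implied by the BOM is ignored.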

Anne (:annevk)

Comment 23

•

10 years ago

Is this a duplicate of bug 504831?

Henri Sivonen (:hsivonen) (on leave)

Comment 24

•

7 years ago

(In reply to Anne (:annevk) from comment #23)
> Is this a duplicate of bug 504831?

I believe it is.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → DUPLICATE

Nobody; OK to take it and work on it

Assignee

Updated

•

5 years ago

Component: DOM → DOM: Core & HTML
