Closed Bug 782412 Opened 12 years ago Closed 12 years ago

Improve UTF-16 decoders

Categories

(Core :: Internationalization, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla20

People

(Reporter: bsurender, Assigned: emk)

References

(Blocks 1 open bug)

Details

Attachments

(6 files)

The current encoder for UTF-16 code units does not support checking, validating and handling of unpaired surrogates in the incoming UTF-16 code units.
Component: DOM → Internationalization
Blocks: 764234
Component: Internationalization → DOM
Component: DOM → Internationalization
The decoder (nsUCS2BEToUnicode.cpp) should be able to handle arbitrary Web content that might have unpaired surrogates in it, but shouldn't the encoders (in general) be able to assume they're being given sane content?
The API Bonnie is implementing takes arbitrary stuff from JS and feeds it to the encoder so we have to do sanitization somewhere, either in the encoder or at a previous stage (and looping through the text twice seems undesirable).
Is this really the only encoder with problems (of the ones you'll ever use, if it's clearly commented explaining what needs to be done to expand the set)?
Decoders should have implemented some error handling for bug 174351.
Blocks: encoding
Renamed misleading filenames. Also, these files contains not only BE converter but also LE and BOM-sniffing converters.
Assignee: nobody → VYV03354
Status: NEW → ASSIGNED
Attachment #680387 - Flags: review?(smontagu)
This is required to support TextDecoder's "stream" option correctly.
Attachment #680388 - Flags: review?(smontagu)
Attachment #680389 - Flags: review?(smontagu)
This is needed for TextDecoder's "fatal" option.
Attachment #680390 - Flags: review?(smontagu)
In theory, BOM detection is not a part of decoders. But this will greatly simplify the implement of TextDecoder. And part 3 will make it easier to add an option to disable BOM detection.
Attachment #680391 - Flags: review?(hsivonen)
Attachment #680388 - Attachment is patch: true
(In reply to Masatoshi Kimura [:emk] from comment #9)
> Created attachment 680391 [details] [diff] [review]
> Part 5: Remove a workaround from dom/encoding
> 
> In theory, BOM detection is not a part of decoders. But this will greatly
> simplify the implement of TextDecoder. And part 3 will make it easier to add
> an option to disable BOM detection.

I’m a bit confused about the purpose of this bug vs. the patches vs. other potential patches.

This bug says it’s about encoders. Why do the patches change decoders?

How exactly does decoder BOM handling change here? E.g. nsScriptLoader relies on the decoders to swallow the BOM.
(In reply to Henri Sivonen (:hsivonen) from comment #11)
> I’m a bit confused about the purpose of this bug vs. the patches vs. other
> potential patches.
> 
> This bug says it’s about encoders. Why do the patches change decoders?
Oh, I misread the bug title. I've morphed the bug because I already attached patches. I'll file a new bug if encoders needs to be fixed.

> How exactly does decoder BOM handling change here? E.g. nsScriptLoader
> relies on the decoders to swallow the BOM.
UTF-16BE decoder and UTF-16LE decoder will continue to swallow the BOM. This patch will only fix their poor behaviors on some edge cases.
For example, If only one byte was fed to the decoder, it returned NS_ERROR_ILLEGAL_INPUT. I added the own BOM handling to TextDecoder to avoid the bug. Without the workaround, TextDecoder("utf-16").decode(Uint8Array([0xFF]),{stream:true}) would throw the exception wrongly.
With this patch, UTF-16 decoders will work saner so that the workaround is no longer needed.
I didn't touch the BOM-sniffing decoder because it will not be used from the DOM after bug 801402 and eventually it will be removed.
Summary: Current nsIUnicodeEncoder, nsBasicEncoder, encoder implementation for UTF-16 does not support unpaired surrogates checking. → Improve UTF-16 decoders
Attachment #680387 - Flags: review?(smontagu) → review+
Attachment #680390 - Flags: review?(smontagu) → review+
Comment on attachment 680391 [details] [diff] [review]
Part 5: Remove a workaround from dom/encoding

r+ on the assumption that the other patches work as described.
Attachment #680391 - Flags: review?(hsivonen) → review+
Blocks: 814900
Attachment #680388 - Flags: review?(smontagu) → review+
Attachment #680389 - Flags: review?(smontagu) → review+
Keywords: checkin-needed
[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 801402
User impact if declined: users will not be able to open Web pages encoded in UTF-16 big endian.
Testing completed (on m-c, etc.): baked on m-c for one weeks.
Risk to taking this patch (and alternatives if risky): moderately. If this is too risky, take attachment 684919 [details] [diff] [review] from bug 814900 instead.
String or UUID changes made by this patch: none.
Attachment #688972 - Flags: approval-mozilla-aurora?
(In reply to Masatoshi Kimura [:emk] from comment #16)
> Created attachment 688972 [details] [diff] [review]
> Folded patch for aurora
> 
> [Approval Request Comment]
> Bug caused by (feature/regressing bug #): bug 801402
> User impact if declined: users will not be able to open Web pages encoded in
> UTF-16 big endian.
> Testing completed (on m-c, etc.): baked on m-c for one weeks.
> Risk to taking this patch (and alternatives if risky): moderately. If this
> is too risky, take attachment 684919 [details] [diff] [review] from bug
> 814900 instead.
> String or UUID changes made by this patch: none.

What kind of risk are we looking at here?  What sort of fallout would we see, could we potentially test for and catch early?  Anything we can point QA to in order to be more sure of taking this fix over the one in but 814900?  What is the difference between the two patches?
(In reply to Lukas Blakk [:lsblakk] from comment #17)
> What kind of risk are we looking at here?  What sort of fallout would we
> see, could we potentially test for and catch early?  Anything we can point
> QA to in order to be more sure of taking this fix over the one in but
> 814900?  What is the difference between the two patches?

This patch rewritten UTF-16 decoder substantially. There might be a undiscovered problem in the new code.
Second patch just changes back the canonical name of UTF-16 to UTF-16 (rather than UTF-16) which is the same setting as Firefox 18 or earlier.
But I didn't find any specific concern at the moment.
(In reply to Masatoshi Kimura [:emk] from comment #18)
> (In reply to Lukas Blakk [:lsblakk] from comment #17)
> > What kind of risk are we looking at here?  What sort of fallout would we
> > see, could we potentially test for and catch early?  Anything we can point
> > QA to in order to be more sure of taking this fix over the one in but
> > 814900?  What is the difference between the two patches?
> 
> This patch rewritten UTF-16 decoder substantially. There might be a
> undiscovered problem in the new code.

Sounds like we should back out bug 801402 instead of taking this patch, since it wasn't ready to ride the trains.
Attachment #688972 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
(In reply to Alex Keybl [:akeybl] from comment #20)
> (In reply to Masatoshi Kimura [:emk] from comment #18)
> > (In reply to Lukas Blakk [:lsblakk] from comment #17)
> > > What kind of risk are we looking at here?  What sort of fallout would we
> > > see, could we potentially test for and catch early?  Anything we can point
> > > QA to in order to be more sure of taking this fix over the one in but
> > > 814900?  What is the difference between the two patches?
> > 
> > This patch rewritten UTF-16 decoder substantially. There might be a
> > undiscovered problem in the new code.
> 
> Sounds like we should back out bug 801402 instead of taking this patch,
> since it wasn't ready to ride the trains.

Nevermind, we'll take bug 814900.
You need to log in before you can comment on or make changes to this bug.