Last Comment Bug 520095 - Should encoded U+FFFE and U+FFFF decode to U+FFFD?
: Should encoded U+FFFE and U+FFFF decode to U+FFFD?
Status: VERIFIED FIXED
[inbound]
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla8
Assigned To: Jeff Walden [:Waldo] (remove +bmo to email)
:
:
Mentors:
javascript: alert(decodeURIComponent(...
Depends on:
Blocks: test262 505991
  Show dependency treegraph
 
Reported: 2009-10-01 17:40 PDT by Jeff Walden [:Waldo] (remove +bmo to email)
Modified: 2011-11-24 15:52 PST (History)
5 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Adjust formatting, whitespace in Utf8ToOneUcs4Char (2.23 KB, patch)
2011-07-14 17:41 PDT, Jeff Walden [:Waldo] (remove +bmo to email)
jwalden+bmo: review+
Details | Diff | Splinter Review
Patch and test (2.16 KB, patch)
2011-07-14 17:47 PDT, Jeff Walden [:Waldo] (remove +bmo to email)
masa141421356: review+
Details | Diff | Splinter Review

Description Jeff Walden [:Waldo] (remove +bmo to email) 2009-10-01 17:40:28 PDT
Bug 172699 made these characters decode to the replacement character.  Why?  Spec sez this:

Let V be the value obtained by applying the UTF-8 transformation to Octets, that is, from an array of octets into a 32-bit value. If Octects does not contain a valid UTF-8 encoding of a Unicode code point throw a URIError exception.

According to Table 21 in the ES5 final draft (pending approval in December), U+FFFE and U+FFFF are valid (only the surrogate range is not valid).

Do we preserve our still-non-standard behavior?  Should this be fixed in ES5 errata (or ES6)?  I have no idea, hence this bug.
Comment 1 Jeff Walden [:Waldo] (remove +bmo to email) 2009-10-01 17:48:58 PDT
(Yet another option would be to throw, which might be better than replacing silently if we actually want to treat U+FFF[EF] as error-ful, and it's arguably more consistent with how we'll handle error-ful overlong UTF-8 once I land bug 511859.)
Comment 2 John G. Myers 2009-10-07 12:16:43 PDT
U+FFFE has security implications similar to those for overlong sequences.  By embedding U+FFFE noncharacters, attackers can get an insufficiently careful UTF-16 decoder to byte swap subsequent input, sneaking prohibited characters past security-enforcing input validators.
Comment 3 Jeff Walden [:Waldo] (remove +bmo to email) 2009-10-07 17:24:53 PDT
Hm, why not U+FEFF then, if that were part of the reasoning?  Why why why couldn't the existing behavior have had an explanatory comment by it...
Comment 4 John G. Myers 2009-10-07 18:46:48 PDT
U+FEFF is a valid, assigned character.  There is no reason to prohibit it.

The code would be quite unmaintainable if every bug fix added explanatory comments.  Such things generally belong in Bugzilla, for example bug 74198 comment 3.
Comment 5 Jeff Walden [:Waldo] (remove +bmo to email) 2009-10-07 19:00:50 PDT
Er, oh, never mind, I don't know what I was thinking with respect to U+FEFF.

I disagree about where explanatory comments should go, but do note that the bug that introduced that code didn't have such a comment.
Comment 6 John G. Myers 2009-10-07 19:06:38 PDT
Bug 172699, which introduced this code, has an initial comment referring to bug 172699, which has the comment.
Comment 7 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-14 17:28:35 PDT
This causes test262 failures:

S15.1.3.1_A2.4_T1	Complex tests, use RFC 3629	fail
S15.1.3.2_A2.4_T1	Complex tests, use RFC 3629	fail

Regardless what Unicode wants here, or what is purest from that point of view (and I do sympathize with it), I think ES5 wants us to treat Table 21 - UTF-8 Encodings as canonical.  Only bad or mismatched surrogates, or overlong sequences, are to be treated as errors.  That's what test262 wants.  It's what Chrome, Opera, IE, and Safari implement.  If anyone actually were to be broken by letting a couple more code points through, they'd also be broken in every other browser.  And if we did this, it looks like it'd address the dependency bug 505991.  We should make the change.
Comment 8 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-14 17:41:44 PDT
Created attachment 546056 [details] [diff] [review]
Adjust formatting, whitespace in Utf8ToOneUcs4Char

This is all straightforward changes to whitespace, variable declaration location, and so on, so it doesn't need review.  I'm just posting it in case anyone wants to actually apply the next patch, which will modify a tree with this patch in it.
Comment 9 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-14 17:47:34 PDT
Created attachment 546057 [details] [diff] [review]
Patch and test

This undoes replacement for these two characters and adds a test for non-replacing behavior.
Comment 10 John G. Myers 2011-07-14 18:59:28 PDT
I think it is a poor choice to introduce a potential security vulnerability just because some standard or test suite seems to require it. I believe the security vulnerability should be brought up with the maintainers of the standard/suite in question.
Comment 11 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-15 01:00:52 PDT
I disagree, obviously, but the concern is at least plausible.  es-discuss can take a pass at it, although I doubt it will codify either current replace-with-U+FFFD behavior or throw-a-URIError behavior not implemented by any browser to date.

https://mail.mozilla.org/pipermail/es-discuss/2011-July/015993.html

Hopefully that request will receive more responses than it did the first time I made it:

https://mail.mozilla.org/pipermail/es-discuss/2009-October/010014.html

My query of a few hours ago is actually in response to this message, which received no replies when I originally posted it.
Comment 12 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-18 16:26:55 PDT
The response so far:

https://mail.mozilla.org/pipermail/es-discuss/2011-July/016000.html
Comment 13 Masahiro YAMADA 2011-07-19 08:20:12 PDT
Comment on attachment 546057 [details] [diff] [review]
Patch and test

Looks fine.
Comment 14 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-20 14:53:33 PDT
In the absence of feedback from ECMA saying that censoring U+FFF{E,F} is appropriate, and in the presence of tentative feedback saying that it's the wrong thing to do, I landed the patch to make the switch:

http://hg.mozilla.org/integration/mozilla-inbound/rev/ffc7c2391d3f
Comment 15 Marco Bonardo [::mak] 2011-07-21 06:15:27 PDT
http://hg.mozilla.org/mozilla-central/rev/ffc7c2391d3f

Note You need to log in before you can comment on or make changes to this bug.