Last Comment Bug 660612 - Utf8ToOneUcs4Char passes invalid UTF-8 octets '%ED%A0%80', so decodeURIComponent('%ED%A0%80') doesn't throw
: Utf8ToOneUcs4Char passes invalid UTF-8 octets '%ED%A0%80', so decodeURICompon...
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla8
Assigned To: Masahiro YAMADA
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-05-30 05:07 PDT by Yusuke Suzuki
Modified: 2011-07-12 03:51 PDT (History)
6 users (show)
mounir: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch rev.1 (1.52 KB, patch)
2011-06-04 05:43 PDT, Masahiro YAMADA
jwalden+bmo: review+
Details | Diff | Review
Test (4.06 KB, text/plain)
2011-06-04 05:45 PDT, Masahiro YAMADA
no flags Details
Patch rev.1.1 (553 bytes, patch)
2011-06-11 06:02 PDT, Masahiro YAMADA
jwalden+bmo: review+
Details | Diff | Review
Test rev.1.1 (276 bytes, patch)
2011-06-11 06:03 PDT, Masahiro YAMADA
no flags Details | Diff | Review
Test rev.1.2 (290 bytes, patch)
2011-06-11 06:08 PDT, Masahiro YAMADA
jwalden+bmo: review+
Details | Diff | Review
Unified patch (2.44 KB, patch)
2011-06-22 08:11 PDT, Masahiro YAMADA
jwalden+bmo: review+
Details | Diff | Review
Unified patch rev.2 (2.44 KB, patch)
2011-06-24 08:16 PDT, Masahiro YAMADA
jwalden+bmo: review+
jwalden+bmo: checkin+
Details | Diff | Review

Description Yusuke Suzuki 2011-05-30 05:07:26 PDT
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.0; ja; rv:1.9) Gecko/2008052906 Firefox/3.0
Build Identifier: 

according to the RFC3629, http://www.ietf.org/rfc/rfc3629.txt
following octets
%xED %A0 %80 (U+D800. this code point is in surrogate area)
is invalid UTF-8 octets.

so, according to the ES5.1, section 15.1.3 Decode Algorithm, step 4-d-vii-8,
"Let V be the value obtained by applying the UTF-8 transformation to Octets, that is, from an array of octets into a 32-bit value. If Octets does not contain a valid UTF-8 encoding of a Unicode code point throw a URIError exception."

but, decodeURIComponent('%ED%A0%80') doesn't throw URIError.

Reproducible: Always

Steps to Reproduce:
1. evaluates `decodeURIComponent('%ED%A0%80')`

Actual Results:  
1. returns '\uDC00'. This code point is in surrogate area.

Expected Results:  
1. throw URIError

I found the same bug in V8, and report the issue. http://code.google.com/p/v8/issues/detail?id=1415
JSC is checking in wtf/unicode/UTC8.cpp, decodeUTF8Sequence function, Handle 3-byte sequences section.

code:
  // UTF-16 surrogates should never appear in UTF-8 data
  if (c >= 0xD800 && c <= 0xDFFFF)
    return -1;
Comment 1 Boris Zbarsky [:bz] (Out June 25-July 6) 2011-05-30 14:16:47 PDT
What do Chakra and Carakan do?  This might need a spec change, depending....
Comment 2 Yusuke Suzuki 2011-05-30 14:55:32 PDT
> What do Chakra and Carakan do?
Carakan doesn't throw URIError when this '%ED%A0%80' representation comes. (in Opera version 11.11)

But, Chakra (in IE9 and IE10 platform preview) and IE6, 7, 8 rejects this.
Comment 3 Boris Zbarsky [:bz] (Out June 25-July 6) 2011-05-30 17:45:11 PDT
Probably safe to do this, then....  hopefully.
Comment 4 Masahiro YAMADA 2011-05-31 05:46:42 PDT
Rhino seems to have same problem.
I filed new bug for Rhino (bug 660799).
Comment 5 Masahiro YAMADA 2011-05-31 23:06:19 PDT
FYI: Safari5.0.5(Windows) rejects this.
Comment 6 Masahiro YAMADA 2011-06-04 05:43:50 PDT
Created attachment 537339 [details] [diff] [review]
patch rev.1
Comment 7 Masahiro YAMADA 2011-06-04 05:45:53 PDT
Created attachment 537340 [details]
Test
Comment 8 Jeff Walden [:Waldo] (remove +bmo to email) 2011-06-06 18:08:28 PDT
Comment on attachment 537339 [details] [diff] [review]
patch rev.1

Review of attachment 537339 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsstr.cpp
@@ +6062,5 @@
>              ucs4Char = ucs4Char<<6 | (*utf8Buffer++ & 0x3F);
>          }
>          if (JS_UNLIKELY(ucs4Char < minucs4Char)) {
> +            ucs4Char = INVALID_UTF8;
> +        } else if (JS_UNLIKELY(ucs4Char >= 0xD800 && ucs4Char <= 0xDFFF)) {

I'd combine the two ifs into a single if, but it's not that important, really.
Comment 9 Jeff Walden [:Waldo] (remove +bmo to email) 2011-06-06 18:10:07 PDT
Actually: add a test for this to js/src/tests/ecma_5/Global, please, with the patch.  Nothing fancy, just check decodeURIComponent("%ED%A0%80") throws a URIError.
Comment 10 Masahiro YAMADA 2011-06-11 06:02:56 PDT
Created attachment 538687 [details] [diff] [review]
Patch rev.1.1
Comment 11 Masahiro YAMADA 2011-06-11 06:03:46 PDT
Created attachment 538688 [details] [diff] [review]
Test rev.1.1
Comment 12 Masahiro YAMADA 2011-06-11 06:08:31 PDT
Created attachment 538689 [details] [diff] [review]
Test rev.1.2
Comment 13 Jeff Walden [:Waldo] (remove +bmo to email) 2011-06-17 16:25:26 PDT
Comment on attachment 538687 [details] [diff] [review]
Patch rev.1.1

>-        if (JS_UNLIKELY(ucs4Char < minucs4Char)) {
>-            ucs4Char = OVERLONG_UTF8;
>+        if (JS_UNLIKELY((ucs4Char < minucs4Char) || (ucs4Char >= 0xD800 && ucs4Char <= 0xDFFF))) {
>+            ucs4Char = INVALID_UTF8;

Nitpicky, yeah, but this is what I meant by combining them:

if (JS_UNLIKELY(ucs4Char < minucs4Char || (ucs4Char >= 0xD800 && ucs4Char <= 0xDFFF)))

Looks fine with that.
Comment 14 Masahiro YAMADA 2011-06-22 08:11:11 PDT
Created attachment 541051 [details] [diff] [review]
Unified patch
Comment 15 Jeff Walden [:Waldo] (remove +bmo to email) 2011-06-22 10:46:51 PDT
Comment on attachment 541051 [details] [diff] [review]
Unified patch

Review of attachment 541051 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsstr.cpp
@@ +5973,5 @@
>          while (--utf8Length) {
>              JS_ASSERT((*utf8Buffer & 0xC0) == 0x80);
>              ucs4Char = ucs4Char<<6 | (*utf8Buffer++ & 0x3F);
>          }
> +        if (JS_UNLIKELY(ucs4Char < minucs4Char || ucs4Char >= 0xD800 && ucs4Char <= 0xDFFF)) {

You should over-parenthesize the second condition here, as I did in comment 13 -- first because I suspect people generally don't fully internalize the precedence of || versus &&, and second because some compilers will warn precisely for that reason.  r+ with that change, and I see no reason to take a last look after you've made it.  :-)
Comment 16 Masahiro YAMADA 2011-06-24 08:16:47 PDT
Created attachment 541685 [details] [diff] [review]
Unified patch rev.2

added ().
Comment 17 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-01 19:20:01 PDT
Masahiro, do you need this landed?  I can't remember if I've seen you commit patches or not.
Comment 18 Masahiro YAMADA 2011-07-02 00:21:06 PDT
(In reply to comment #17)
> Masahiro, do you need this landed?  I can't remember if I've seen you commit
> patches or not.

Jeff, I want to land.(But I'm not commiter).
Comment 19 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-06 16:24:59 PDT
I've got it queued up locally -- next time I have things to push, it'll get pushed to TM.
Comment 20 Jeff Walden [:Waldo] (remove +bmo to email) 2011-07-11 16:24:33 PDT
http://hg.mozilla.org/integration/mozilla-inbound/rev/0d5626387b8a
Comment 21 Mounir Lamouri (:mounir) 2011-07-12 03:51:34 PDT
Merged:
http://hg.mozilla.org/mozilla-central/rev/0d5626387b8a

Note You need to log in before you can comment on or make changes to this bug.