Utf8ToOneUcs4Char passes invalid UTF-8 octets '%ED%A0%80', so decodeURIComponent('%ED%A0%80') doesn't throw

RESOLVED FIXED in mozilla8

Status

()

Core
JavaScript Engine
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: Yusuke Suzuki, Assigned: Masahiro YAMADA)

Tracking

Trunk
mozilla8
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 6 obsolete attachments)

(Reporter)

Description

6 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.0; ja; rv:1.9) Gecko/2008052906 Firefox/3.0
Build Identifier: 

according to the RFC3629, http://www.ietf.org/rfc/rfc3629.txt
following octets
%xED %A0 %80 (U+D800. this code point is in surrogate area)
is invalid UTF-8 octets.

so, according to the ES5.1, section 15.1.3 Decode Algorithm, step 4-d-vii-8,
"Let V be the value obtained by applying the UTF-8 transformation to Octets, that is, from an array of octets into a 32-bit value. If Octets does not contain a valid UTF-8 encoding of a Unicode code point throw a URIError exception."

but, decodeURIComponent('%ED%A0%80') doesn't throw URIError.

Reproducible: Always

Steps to Reproduce:
1. evaluates `decodeURIComponent('%ED%A0%80')`

Actual Results:  
1. returns '\uDC00'. This code point is in surrogate area.

Expected Results:  
1. throw URIError

I found the same bug in V8, and report the issue. http://code.google.com/p/v8/issues/detail?id=1415
JSC is checking in wtf/unicode/UTC8.cpp, decodeUTF8Sequence function, Handle 3-byte sequences section.

code:
  // UTF-16 surrogates should never appear in UTF-8 data
  if (c >= 0xD800 && c <= 0xDFFFF)
    return -1;
(Assignee)

Updated

6 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
What do Chakra and Carakan do?  This might need a spec change, depending....
(Reporter)

Comment 2

6 years ago
> What do Chakra and Carakan do?
Carakan doesn't throw URIError when this '%ED%A0%80' representation comes. (in Opera version 11.11)

But, Chakra (in IE9 and IE10 platform preview) and IE6, 7, 8 rejects this.
Probably safe to do this, then....  hopefully.
(Assignee)

Comment 4

6 years ago
Rhino seems to have same problem.
I filed new bug for Rhino (bug 660799).
(Assignee)

Comment 5

6 years ago
FYI: Safari5.0.5(Windows) rejects this.
(Assignee)

Comment 6

6 years ago
Created attachment 537339 [details] [diff] [review]
patch rev.1
Assignee: general → masa141421356
Status: NEW → ASSIGNED
Attachment #537339 - Flags: review?(jwalden+bmo)
(Assignee)

Comment 7

6 years ago
Created attachment 537340 [details]
Test
Comment on attachment 537339 [details] [diff] [review]
patch rev.1

Review of attachment 537339 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsstr.cpp
@@ +6062,5 @@
>              ucs4Char = ucs4Char<<6 | (*utf8Buffer++ & 0x3F);
>          }
>          if (JS_UNLIKELY(ucs4Char < minucs4Char)) {
> +            ucs4Char = INVALID_UTF8;
> +        } else if (JS_UNLIKELY(ucs4Char >= 0xD800 && ucs4Char <= 0xDFFF)) {

I'd combine the two ifs into a single if, but it's not that important, really.
Attachment #537339 - Flags: review?(jwalden+bmo) → review+
Actually: add a test for this to js/src/tests/ecma_5/Global, please, with the patch.  Nothing fancy, just check decodeURIComponent("%ED%A0%80") throws a URIError.
(Assignee)

Comment 10

6 years ago
Created attachment 538687 [details] [diff] [review]
Patch rev.1.1
Attachment #537339 - Attachment is obsolete: true
(Assignee)

Comment 11

6 years ago
Created attachment 538688 [details] [diff] [review]
Test rev.1.1
Attachment #537340 - Attachment is obsolete: true
(Assignee)

Updated

6 years ago
Attachment #538687 - Flags: review?(jwalden+bmo)
(Assignee)

Comment 12

6 years ago
Created attachment 538689 [details] [diff] [review]
Test rev.1.2
Attachment #538688 - Attachment is obsolete: true
Attachment #538689 - Flags: review?(jwalden+bmo)
Comment on attachment 538687 [details] [diff] [review]
Patch rev.1.1

>-        if (JS_UNLIKELY(ucs4Char < minucs4Char)) {
>-            ucs4Char = OVERLONG_UTF8;
>+        if (JS_UNLIKELY((ucs4Char < minucs4Char) || (ucs4Char >= 0xD800 && ucs4Char <= 0xDFFF))) {
>+            ucs4Char = INVALID_UTF8;

Nitpicky, yeah, but this is what I meant by combining them:

if (JS_UNLIKELY(ucs4Char < minucs4Char || (ucs4Char >= 0xD800 && ucs4Char <= 0xDFFF)))

Looks fine with that.
Attachment #538687 - Flags: review?(jwalden+bmo) → review+
Attachment #538689 - Flags: review?(jwalden+bmo) → review+
(Assignee)

Comment 14

6 years ago
Created attachment 541051 [details] [diff] [review]
Unified patch
Attachment #538687 - Attachment is obsolete: true
Attachment #538689 - Attachment is obsolete: true
Attachment #541051 - Flags: review?(jwalden+bmo)
Comment on attachment 541051 [details] [diff] [review]
Unified patch

Review of attachment 541051 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsstr.cpp
@@ +5973,5 @@
>          while (--utf8Length) {
>              JS_ASSERT((*utf8Buffer & 0xC0) == 0x80);
>              ucs4Char = ucs4Char<<6 | (*utf8Buffer++ & 0x3F);
>          }
> +        if (JS_UNLIKELY(ucs4Char < minucs4Char || ucs4Char >= 0xD800 && ucs4Char <= 0xDFFF)) {

You should over-parenthesize the second condition here, as I did in comment 13 -- first because I suspect people generally don't fully internalize the precedence of || versus &&, and second because some compilers will warn precisely for that reason.  r+ with that change, and I see no reason to take a last look after you've made it.  :-)
Attachment #541051 - Flags: review?(jwalden+bmo) → review+
(Assignee)

Comment 16

6 years ago
Created attachment 541685 [details] [diff] [review]
Unified patch rev.2

added ().
Attachment #541051 - Attachment is obsolete: true
Attachment #541685 - Flags: review?(jwalden+bmo)
Attachment #541685 - Flags: review?(jwalden+bmo) → review+
Masahiro, do you need this landed?  I can't remember if I've seen you commit patches or not.
Summary: Utf8ToOneUcs4Char passes invalid UTF-8 octets '%ED%A0%80' → Utf8ToOneUcs4Char passes invalid UTF-8 octets '%ED%A0%80', so decodeURIComponent('%ED%A0%80') doesn't throw
(Assignee)

Updated

6 years ago
Attachment #541685 - Flags: checkin?(jwalden+bmo)
(Assignee)

Comment 18

6 years ago
(In reply to comment #17)
> Masahiro, do you need this landed?  I can't remember if I've seen you commit
> patches or not.

Jeff, I want to land.(But I'm not commiter).
I've got it queued up locally -- next time I have things to push, it'll get pushed to TM.
http://hg.mozilla.org/integration/mozilla-inbound/rev/0d5626387b8a
Whiteboard: [inbound]
Target Milestone: --- → mozilla8
Attachment #541685 - Flags: checkin?(jwalden+bmo) → checkin+
Merged:
http://hg.mozilla.org/mozilla-central/rev/0d5626387b8a
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Whiteboard: [inbound]
Version: unspecified → Trunk
You need to log in before you can comment on or make changes to this bug.