Closed Bug 331773 Opened 19 years ago Closed 12 years ago

encodeURI fails on decodeURI("%ED%A0%80")

Categories

(Core :: JavaScript Engine, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED

People

(Reporter: danswer, Unassigned)

References

(Blocks 1 open bug)

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1

It is possible to give decodeURI a string whose result encodeURI cannot process.  The same string could just as well come from HTML markup.

Reproducible: Always

Steps to Reproduce:
var js = decodeURI ("%ED%A0%80");
alert (js.length + "\n" + js.charCodeAt(0));    // 1, 0xD800
alert (escape (js));                            // %uD800
alert (encodeURI (js));                         // fails here

Example 2:
<span id=myspan>&#55296;&#65538;</span>
<script type='text/javascript'>
var txt = document.getElementById('myspan').innerHTML;
alert (txt.length + "\n" + txt);        // length is 3
alert (escape (txt));                   // %uD800%uD800%uDC02
alert (encodeURI(txt));                 // fails here
</script>
Actual Results:  
In both examples, encodeURI fails on lone surrogates (\uD800-\uDFFF) even though decodeURI and escape handle them without error.

Expected Results:  
I expect encodeURI to return "%ED%A0%80", the UTF-8 style equivalent of the %uD800 that escape returns.

I have encountered this while trying to safely pass strings between the client and server.  Going from the server to the browser is not so bad, because one can use any of the following (see the sketch after this list):
1. decodeURI(utf-8 encoded string);
2. reading the characters out of an HTML element (such as a span) into which they have been encoded as &#unicodePointInDecimal;;
3. "\xHH", "\uHHHH", or "\uHHHH\uHHHH" escapes, the last being a UTF-16 surrogate pair for Unicode characters above U+FFFF (those needing 17-21 bits).  E.g. &#65538; (decimal) -> U+10002 -> decodeURI("%F0%90%80%82") (UTF-8) -> "\uD800\uDC02" (UTF-16)

To go from a JavaScript string to an ASCII representation, one would expect to use encodeURI, but this fails on the result of decodeURI("%ED%A0%80").  The reason it fails, I presume, is that the character in question is not valid (it is an unpaired surrogate).  But in that case escape should not work either.  I would rather have consistent behaviour: if we don't die on creating the string, then I would rather not die upon manipulating it, especially for user-entered string data.
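A workaround sketch (the helper stripLoneSurrogates is hypothetical, not a proposed API): replace unpaired surrogates with U+FFFD before encoding, so encodeURI never throws on user-entered data:

function stripLoneSurrogates (s) {
    var out = "";
    for (var i = 0; i < s.length; i++) {
        var c = s.charCodeAt (i);
        if (c >= 0xD800 && c <= 0xDBFF) {             // high surrogate
            var d = s.charCodeAt (i + 1);             // NaN at end of string
            if (d >= 0xDC00 && d <= 0xDFFF) {         // properly paired
                out += s.charAt (i) + s.charAt (i + 1);
                i++;
            } else {
                out += "\uFFFD";                      // unpaired high surrogate
            }
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            out += "\uFFFD";                          // unpaired low surrogate
        } else {
            out += s.charAt (i);
        }
    }
    return out;
}
alert (encodeURI (stripLoneSurrogates ("\uD800")));   // "%EF%BF%BD" instead of an error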

Csaba Gabor from Vienna

For unicode charts see: http://www.macchiato.com/unicode/chart/
References: http://en.wikipedia.org/wiki/UTF-8 and 
            http://en.wikipedia.org/wiki/UTF-16/UCS-2
Example 1 in comment 0 is pretty much a dupe of bug 316338. Example 2 is fixed in trunk by bug 316394.
So basically, decodeURI can produce bogus UTF-16?  Sounds like we should fix that in the JS engine.  Same for decodeURIComponent.
Assignee: smontagu → general
Blocks: 316338
Status: UNCONFIRMED → NEW
Component: Internationalization → JavaScript Engine
Ever confirmed: true
OS: Windows XP → All
QA Contact: amyy → general
Hardware: PC → All
Blocks: test262
No longer blocks: test262
Now decodeURI("%ED%A0%80") throws a URIError (see bug 660612)
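A minimal check of the new behaviour (my snippet, not taken from the bug):

try {
    decodeURI ("%ED%A0%80");              // would have produced a lone \uD800
} catch (e) {
    alert (e instanceof URIError);        // true: malformed URI sequence
}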
Throwing is okay.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED