Closed Bug 44272 Opened 24 years ago Closed 21 years ago

javascript escape and unescape don't work properly with unicode chars

Categories

(Core :: DOM: Core & HTML, defect, P2)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: pete_a90, Assigned: jshin1987)

References

Details

(Keywords: intl, Whiteboard: [nsbeta3-])

Attachments

(4 files)

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
BuildID:    2000062908

The escape and unescape javascript functions have slight problems with unicode 
escaped fields.

Reproducible: Always
Steps to Reproduce:
1. Load javascript code to a new file
2. Open file
3. Sigh

Actual Results:
1.    <-- (nothing)
2.    <-- (nothing)
3. E85ow   <-- (E85ow)
4. heycool   <-- (this one works)

Expected Results:
1. %u015Fello
2. hello
3. %u1E85ow
4. There is no problem with this one, I'm just showing \u works.

<PRE>
<SCRIPT TYPE="text/javascript"><!--

document.writeln("Actual output:".bold());
document.writeln("1. " + escape("\u015Fello") + "   &lt;-- (nothing)");
document.writeln("2. " + unescape("%u0068ello") + "   &lt;-- (nothing)");
document.writeln("3. " + unescape("%u1E85ow") + "   &lt;-- (E85ow)");
document.writeln("4. \u0068eycool   &lt;-- (this one works)\n");

document.writeln("Expected output:".bold());
document.writeln("1. %u015Fello");
document.writeln("2. hello");
document.writeln("3. \u1E85ow");
document.writeln("4. There is no problem with this one, I'm just showing \\u 
works.");

//--></SCRIPT>
</PRE>
-->DOM, the [un]escape functions being called come from nsJSWindow, they 
override the engine versions.
Assignee: rogerl → jst
Status: UNCONFIRMED → NEW
Component: Javascript Engine → DOM Level 0
Ever confirmed: true
QA Contact: pschwartau → desale
Reassigning to ftang for investigation as an i18n content support problem.
Assignee: jst → ftang
Status: NEW → ASSIGNED
cc vidur and nisheeth. 
Keywords: nsbeta3
I take a look at ECMA-262 from 
http://developer.netscape.com/docs/javascript/e262-pdf.pdf
Here is some related information


FROM PAGE 9

Source Text
ECMAScript source text is represented as a sequence of characters representable 
using the Unicode version 2.0
character encoding.
SourceCharacter ::
any Unicode character
However, it is possible to represent every ECMAScript program using only ASCII 
characters (which are equivalent to
the first 128 Unicode characters). Non-ASCII Unicode characters may appear only 
within comments and string literals.
In string literals, any Unicode character may also be expressed as a Unicode 
escape sequence consisting of six ASCII
characters, namely \u plus four hexadecimal digits. Within a comment, such an 
escape sequence is effectively ignored
as part of the comment. Within a string literal, the Unicode escape sequence 
contributes one character to the string
value of the literal.
Note that ECMAScript differs from the Java programming language in the behavior 
of Unicode escape sequences. In a
Java program, if the Unicode escape sequence \u000A, for example, occurs within 
a single-line comment, it is
interpreted as a line terminator (Unicode character 000A is line feed) and 
therefore the next character is not part of the
comment. Similarly, if the Unicode escape sequence \u000A occurs within a string 
literal in a Java program, it is
likewise interpreted as a line terminator, which is not allowed within a string 
literal—one must write \n instead of
\u000A to cause a line feed to be part of the string value of a string literal. 
In an ECMAScript program, a Unicode
escape sequence occurring within a comment is never interpreted and therefore 
cannot contribute to termination of the
comment. Similarly, a Unicode escape sequence occurring within a string literal 
in an ECMAScript program always
contributes a character to the string value of the literal and is never 
interpreted as a line terminator or as a quote mark
that might terminate the string literal.

FROM PAGE 60-61

15.1.2.4 escape(string)
The escape function computes a new version of a string value in which certain 
characters have been
replaced by a hexadecimal escape sequence. The result thus contains no special 
characters that might have
special meaning within a URL.
For characters whose Unicode encoding is 0xFF or less, a two-digit escape 
sequence of the form %xx is used
in accordance with RFC1738. For characters whose Unicode encoding is greater 
than 0xFF, a four-digit
escape sequence of the form %uxxxx is used
When the escape function is called with one argument string, the following steps 
are taken:
1. Call ToString(string).
2. Compute the number of characters in Result(1).
3. Let R be the empty string.
4. Let k be 0.
5. If k equals Result(2), return R.
6. Get the character at position k within Result(1).
7. If Result(6) is one of the 69 nonblank ASCII characters 
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz 0123456789 @*_+-./, go to step 14.
8. Compute the 16-bit unsigned integer that is the Unicode character encoding of 
Result(6).
9. If Result(8), is less than 256, go to step 12.
10. Let S be a string containing six characters “%uwxyz” where wxyz are four 
hexadecimal digits encoding the
value of Result(8).
11. Go to step 15.
12. Let S be a string containing three characters “%xy” where xy are two 
hexadecimal digits encoding the
value of Result(8).
13. Go to step 15.
14. Let S be a string containing the single character Result(6).
15.Let R be a new string value computed by concatenating the previous value of R 
and S.
16. Increase k by 1.
17. Go to step 5.
15.1.2.5 unescape(string)
The unescape function computes a new version of a string value in which each 
escape sequences of the sort
that might be introduced by the escape function is replaced with the character 
that it represents.
When the unescape function is called with one argument string, the following 
steps are taken:
1. Call ToString(string).
2. Compute the number of characters in Result(1).
3. Let R be the empty string.
4. Let k be 0.
5. If k equals Result(2), return R.
6. Let c be the character at position k within Result(1).
7. If c is not %, go to step 18.
8. If k is greater than Result(2)-6, go to step 14.
9. If the character at position k+1 within result(1) is not u, go to step 14.
10. If the four characters at positions k+2, k+3, k+4, and k+5 within Result(1) 
are not all hexadecimal digits,
go to step 14.
11. Let c be the character whose Unicode encoding is the integer represented by 
the four hexadecimal digits at
positions k+2, k+3, k+4, and k+5 within Result(1).
12. Increase k by 5.
13. Go to step 18.
14. If k is greater than Result(2)-3, go to step 18.
15. If the two characters at positions k+1 and k+2 within Result(1) are not both 
hexadecimal digits, go to step
18.
16. Let c be the character whose Unicode encoding is the integer represented by 
two zeroes plus the two
hexadecimal digits at positions k+1 and k+2 within Result(1).
17. Increase k by 2.
18. Let R be a new string value computed by concatenating the previous value of 
R and c.
19. Increase k by 1.
20. Go to step 5.
ekrock, how important is this bug ? We override the escape and unescape so we 
won't break existing JavaScript code. However, we may need to do more to make it 
also follow ECMA Script.
There are really two issues here, right ?
Issue 1. \uxxxx in JavaScript
Issue 2. escape() and unescape() function with %uxxxx

We need ekrock's oppinion before decide +/-
clayton, jband, mccabe, brendan -- How important is this bug from the standpoint 
of ECMA-262 Version 3 compliance? We *are* shooting for full compliance with 
that specification.

From a practical standpoint: 
#1: We don't want to break existing code, and we don't here, so that's good. 
#2: escape and unescape are used primarily for encoding strings to be passed as 
URLs. If we don't fix this, it basically means that the first release may not be 
able to pass unicode characters in URLs to CGIs. Unfortunate, but probably not 
the end of the world.

My vote: if this is required for ECMA-262 V3 compliance, I'd like it fixed. If 
not, we'd love to fix it, but it could be Futured if you're out of time.
Marking nsbeta3- per I18N Bug Triage.
Whiteboard: [nsbeta3-]
I don't think this is a bug; escape and unescape aren't a required part of the
ECMA-262 spec.  Legacy server interactions require that they know about the
document encoding; this is what the DOM version of the escape and unescape
functions do.

(The engine itself has its own versions, which aren't visible within the
browser.  These versions follow the 'recommended specification' from the ECMA
spec.)

Please see extensive comments to 42221.

See also encodeURI and friends, now officially part of the ECMA spec.  They're
an effort to solve this problem without interacting with legacy server behavior.

Marking WONTFIX... but many thanks for the close look at how we behave.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → WONTFIX
Marking verified.
Status: RESOLVED → VERIFIED
*** Bug 96716 has been marked as a duplicate of this bug. ***
I would love to see the "extensive comments" in bug 42221, but it is protected. 
Please open it. This is still a bug, especially since it intentionally violates 
the EMCA spec for %uwxyz output from the escape function. IE does this as 
spec'd, so I don't see a valid argument about legacy server behavior--the 
server will still need to deal with IE, which is more and more likely as time 
goes on. 

As near as I can tell, encodeURI and encodeURIComponent still require the 
server to know the document encoding (correct me if I'm misunderstanding this). 
Using escape as spec'd, the document encoding is not necessary. The argument 
about breaking legacy JavaScript code also isn't persuasive: IE made the change 
quite a while ago and there are many other places in Mozilla that break legacy 
code (lack of support for layers among other things).
Bug 42221 is now world readable, reopening this to reconsider given that this
breaks IE compatibility. What about NS4x compatibility, does this break that too?
Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
Thanks for opening bug 42221, JST. In bug 42221 comment #44 from 2000-08-24, 
Mike McCabe says:

"The issue isn't with escape and unescape per se, but with the fact that string
information that is to be transmitted to servers primarily passes through them. 
Interaction with servers evolved back when JavaScript used 8-bit characters in
strings, and more importantly, when JavaScript worked directly with the document
charactacter encoding rather than with a unicode reflection of it.

After we moved escape and unescape into the JS engine in an effort to keep up
with the evolving ECMA standard, all of these server interactions broke, as they
were now getting unicode instead of, say, shift-JIS.  We ended up keeping the
ECMA-draft versions in the engine (note that they are now not part of the
standard, because of this issue) and Joki implemented escape and unescape
functions in the DOM, where they had access to the document character encoding
and could to back-and-forth translations."

If I'm reading this correctly, the DOM escape will escape a character according 
to the document character encoding, while the JS engine's escape puts it into 
the %uwxyz unicode format. Given IE's behavior of doing the latter and it's 
market dominance, I find it hard to believe that there are many servers that 
still use the legacy behavior. Whether or not this is officially part of the 
EMCA standard, I'd rather see Mozilla be compatible with IE at this point than 
Netscape 4.
nhotta, can you take a look at this ?
Assignee: ftang → nhotta
Status: REOPENED → NEW
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.2alpha
*** Bug 121275 has been marked as a duplicate of this bug. ***
Sounds little serious & can cause serious effects on data security.
Raising priority to p2.
Severity major. Backward compatibility is defeating too.
Severity: normal → major
Priority: P3 → P2
Attached file testcase
Target Milestone: mozilla1.2alpha → ---
In the test case, 2&3 are not bug.

2. unescape("%u0068ello")
The % encode is % followed by hex digits. This data is not URL encoded correctly.

3. unescape("%u1E85ow")
he % encode is % followed by hex digits. This data is not URL encoded correctly.


The first case is a bug. 
1. escape("\u015Fello")

The character \u015F cannot be encoded to the document charset which is
ISO-8859-1 for the test case (since no charset is specified in the document so
the default ISO-8859-1 is used). If I set the document charset to UTF-8 using
the menu then the character is escaped correctly as "%C5%9F".
So we could change to fallback to UTF-8 when escaping character is out of the
current document charset range. 


*** Bug 191139 has been marked as a duplicate of this bug. ***
OS: Windows NT → All
Hardware: PC → All
Attached file interactive testcase
Anyone still paying attention to this one?  It seems like an easy fix,
considering the correct escape() function already exists.  I'm sure we could
come up with a compromise if anyone is actually worried about 4x compatibility
here.

I've attached a small html file that makes it easy to run strings through
window.escape() to see what comes out.
This is indeed a serious bug. I've just tested Opera 7.2 and
Konqueror(http://i18nl10n.com/moztest/jstest.euckr.html). Both of them, in
addition to MS IE, do the right thing. Only Mozilla is NOT compliant to the
standard. It's rather ironic that Mozilla doesn't abide by the standard given
that ECMAscript originated as Javascript at Netscape. 

It's a rather easy fix except for some code duplication issue. See also bug
224772 and bug 200984.
Blocks: 224772
taking (hope nhotta doesn't mind :-)). patch coming up shortly
Assignee: nhottanscp → jshin
Status: ASSIGNED → NEW
Attached patch patchSplinter Review
Escape() loops twice through the input string, once for the output length
calculation and once for the actual conversion. We may be better off just
looping once with the capaticy set to a reasonable estimate. What do you think,
bz? 

BTW, note that this doesn't result in any regression against bug 200984. With
MS IE, Konqueror, and Opera/Safari doing all what this patch does, the chance
of breaking some javascript code in the wild is very low.
How well can we guess the needed capacity?  It's only to within a factor of 6 or
so, no?
be, who would be a reviewer for JS compliance? Would you ? (waldemar is not
available, right?).

bz, in general factor of six, but we can call IsASCII (or equivalent) to narrow
it down to factor of 3. Alternatively, if the penalty for the wrong guess is not
so high, we can take a chance by using the expectation value assuming all ASCII
character has the same frequency (if ISASCII is true). I'm not familiar with the
string internal. So, I'll do what you think is the best. 
Status: NEW → ASSIGNED
Keywords: intl
Frankly, I think we should do the simplest code we can here, as a start.  Either
the factor-of-6 guess or what you did is just fine.
Comment on attachment 135177 [details] [diff] [review]
patch

Asking for r/sr.

Let's just go with the current patch as bz is all right with either way. 

I guess JS compliance review is not really necessary. Nonetheless, I'd be glad
to get review of brendan as well if he has something to say here.
Attachment #135177 - Flags: superreview?(bz-vacation)
Attachment #135177 - Flags: review?(smontagu)
I won't be able to review this for probably at least a week.
Comment on attachment 135177 [details] [diff] [review]
patch

Why not just use the JS engine's escape and unescape (don't #ifndef
MOZILLA_CLIENT them in js/src/js{api,str}.c), rather than have the DOM
implement them over again?

/be
Also note the encodeURI() and decodeURI() functions. I don't know if
these would be appropriate or better for handling Unicode characters -
> Why not just use the JS engine's escape and unescape

Because that would unfortunately break the world. Or it did when it was done
that way.  See bug 22594
bz: first, isn't the patch in this bug also going to break the world?  I don't
see it using the document charset (maybe I'm blind).

Second, this bug and others have argued that IE won, it doesn't have the legacy
escape/unescape hack, so why should we?  Does Safari?

/be
bz, brendan is right that my patch behaves exactly the same as JS engine 
version. I thought there was another reason to roll out a separate DOM 
version, but it turned out that that's not the case. so, it makes sense just to use 
JS engine version.  
Brendan, I haven't read this patch yet... ;)

jshin, what does passing a random accented char to escape() on an ISO-8859-1
page produce in IE, out of curiousity?

I'm all for removing redundant code if we can.  ;)
bz : If they're below U+0100, they'll be escaped as '%hh'. If not, they'll be 
escaped as '%uhhhh'.  It doesn't matter whether the page is in ISO-8859-1 or 
Shift_JIS or UTF-8.  
 
The behavior is all identical across the spectrum of major browsers, MS IE, 
Opera, Konqueror (and I guess Safari, too because Safari is based on 
Konqueror). Only Mozilla is NOT compliant to the ECMA standard and doing 
so breaks the world in 2003 :-) 
 
brendan, do you know which of a dozen interfaces (implemented by 
nsGlobalWindowImpl) defines escape/unescape overriding the JS engine? 
bug 22594 doesn't have any patch and the software archealogy is not fun :-) If 
you can't give an answer off the top of the head, I'll track it down 
 
I found it.  It's in nsIDOMWindowInternal.idl. Patch coming up. 
 
Ah, cool.  Bring on the code removal, then.  On the assumption that we will be
going with that patch, adding all sorts of fun deps that should be fixed or
maybe fixed (one of these is invalid, but may get fixed anyway).

Also, I thought we had a bug on window.escape totally failing to deal on UTF-16
encoded pages, but I can't find it.
Blocks: 192148, 207261, 212308
I wonder which has better code footprint, the C++ escape and unescape in the
patch here, or the C ones mccabe or rogerl wrote in js/src/jsstr.c?

/be
brendan,  do you want me to build an optimized build and check it out? just
measuring the binary size is sufficient or should I use a more sophisticated
method? is it worth the trouble?
it seems like we don't have a choice because escape/unescape in the JS engine 
are  also used internally so that we can't remove them, can we? 
 
Comment on attachment 135217 [details] [diff] [review]
alternative :(removing DOM version of escape/unescape)

r=me on the js changes, and if I'm a dom peer, on all changes.

jshin, just curious about code size.  If you have time and are curious, you can
use size on Linux (or codesighs) on builds with the first patch and with this
patch.

/be
Attachment #135217 - Flags: review+
Comment on attachment 135217 [details] [diff] [review]
alternative :(removing DOM version of escape/unescape)

sr=bzbarsky.  If this is what IE does, r=bzbarsky on the DOM changes too.

As a bonus, this makes escape() work in JS components with no windows, no?  ;)
Attachment #135217 - Flags: superreview+
The JS engine escape() function escapes ASCII characters '&', '=', and '?':

js> escape('&');
%26
js> escape('=');
%3D
js> escape('?');
%3F

If we revert to this, won't we break form submission?  See bug 174628.
Phil, do you know why does the JS engine's escape does that?  Crazy.

/be
Phil, the patch in bug 174628 involved switching
GlobalWindowImpl::OpenInternal() from using GlobalWindowImpl::Escape to escape
the URL to directly calling nsEscape with a slightly different set of escape
flags that does not escape those chars.  So changes to how window.escape works
will not affect the testcases in bug 174628 if I understand them correctly.  See
comment at
http://lxr.mozilla.org/seamonkey/source/dom/src/base/nsGlobalWindow.cpp#4587 and
code following.
bz is right. this change doesn't affect what's fixed in bug 174628. 

I just checked in the fix. 
Status: ASSIGNED → RESOLVED
Closed: 24 years ago21 years ago
Resolution: --- → FIXED
Attachment #135177 - Flags: superreview?(bz-vacation)
Attachment #135177 - Flags: review?(smontagu)
brendan, what do you think of back-porting this to 1.4.2? 
jshin, once this cycles and all, test the bugs blocked by it?
mkaply, are you driving 1.4.2?  jshin, set the blocking1.4.2? if this tests well.

/be
Firebird nightly works well. I'm making this block 1.4.2. 
Flags: blocking1.4.2?
Let's give this a little time to bake and then I'll approve it for 1.4.2.
Looks like this has broke my super-duper XMLHTTPRequest-Powered chat :-).

I used escape() to have strings written in Russian escaped as %NN using current
page's charset (namely, "windows-1251"). Now escape does %uNNNN which server
code doesn't understand. So, is there a way to escape characters using some
specified charset?
Ivan: You can use encodeURIComponent() to use encode in UTF-8.  I don't know if
that helps.

[Note to self: escape()'s new behavior is described in comment 37.]
This new behaviour is inacceptable like each dropping backward compatibility.
The bug is not the old behaviour but the the ECMA specification, because this
spec unnecessarily breaks compatibility. If the protocol between sender and
receiver was to speak windows-1251 encoded via escape, then it is inacceptable
to change the behaviour of escape to do a transformation to Unicode. The
internal coding of the JS engine is not of interest. If the traditional
behaviour was to send hexcodes using the character encoding of the page, then
this is the correct implementation and can not be changed.

The ECMA specification is only valid for the core language. This core language
does not know anything about document encodings. This means the specification is
contextless. It must not break older traditions implemented in given contexts
like webpages.

If this is possible, then implementing the traditional behaviour as method of
the window object hiding the equal named core function may give the old
behaviour back to those who need it.

escape('Ä'); // escape as unicode
window.escape('Ä'); // escape using the encoding of the page.

This suggestion may be confusing to beginners, but it gives back the traditional
behaviour to those who need it.

if(escape !== window.escape) alert('Yeah, found two implementations.');

This suggestion is ECMA compliant.
Georg, Netscape lost the browser wars.  Mozilla breaking compatibilty with old
Netscape releases may be bad, but it seems worse to be incompatible with Safari,
Opera, and IE.

What's more, your suggestion that escape !== window.escape is not exactly
kosher, and I don't see how it will help the Netscape-only scripts that call
escape, not window.escape, and want it to behave as it did in 4.x.

Can you point to some pages that counted on the old Netscape behavior?

/be
so are we 100% convinced of this for 1.4.2?
We might have to give up getting this into 1.4.2 because we also have to fix bug 
225695 along with this. The change necessary to fix  bug 225695 is a lot more 
extensive than the fix for this one.
without a fix for 225695, this is missing the boat.
Flags: blocking1.4.2? → blocking1.4.2-
Blocks: 118373
*** Bug 232841 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: