Closed Bug 44272 Opened 24 years ago Closed 21 years ago

javascript escape and unescape don't work properly with unicode chars

Categories

(Core :: DOM: Core & HTML, defect, P2)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: pete_a90, Assigned: jshin1987)

References

Details

(Keywords: intl, Whiteboard: [nsbeta3-])

Attachments

(4 files)

From Bugzilla Helper: User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt) BuildID: 2000062908 The escape and unescape javascript functions have slight problems with unicode escaped fields. Reproducible: Always Steps to Reproduce: 1. Load javascript code to a new file 2. Open file 3. Sigh Actual Results: 1. <-- (nothing) 2. <-- (nothing) 3. E85ow <-- (E85ow) 4. heycool <-- (this one works) Expected Results: 1. %u015Fello 2. hello 3. %u1E85ow 4. There is no problem with this one, I'm just showing \u works. <PRE> <SCRIPT TYPE="text/javascript"><!-- document.writeln("Actual output:".bold()); document.writeln("1. " + escape("\u015Fello") + " &lt;-- (nothing)"); document.writeln("2. " + unescape("%u0068ello") + " &lt;-- (nothing)"); document.writeln("3. " + unescape("%u1E85ow") + " &lt;-- (E85ow)"); document.writeln("4. \u0068eycool &lt;-- (this one works)\n"); document.writeln("Expected output:".bold()); document.writeln("1. %u015Fello"); document.writeln("2. hello"); document.writeln("3. \u1E85ow"); document.writeln("4. There is no problem with this one, I'm just showing \\u works."); //--></SCRIPT> </PRE>
-->DOM, the [un]escape functions being called come from nsJSWindow, they override the engine versions.
Assignee: rogerl → jst
Status: UNCONFIRMED → NEW
Component: Javascript Engine → DOM Level 0
Ever confirmed: true
QA Contact: pschwartau → desale
Reassigning to ftang for investigation as an i18n content support problem.
Assignee: jst → ftang
Status: NEW → ASSIGNED
cc vidur and nisheeth.
Keywords: nsbeta3
I take a look at ECMA-262 from http://developer.netscape.com/docs/javascript/e262-pdf.pdf Here is some related information FROM PAGE 9 Source Text ECMAScript source text is represented as a sequence of characters representable using the Unicode version 2.0 character encoding. SourceCharacter :: any Unicode character However, it is possible to represent every ECMAScript program using only ASCII characters (which are equivalent to the first 128 Unicode characters). Non-ASCII Unicode characters may appear only within comments and string literals. In string literals, any Unicode character may also be expressed as a Unicode escape sequence consisting of six ASCII characters, namely \u plus four hexadecimal digits. Within a comment, such an escape sequence is effectively ignored as part of the comment. Within a string literal, the Unicode escape sequence contributes one character to the string value of the literal. Note that ECMAScript differs from the Java programming language in the behavior of Unicode escape sequences. In a Java program, if the Unicode escape sequence \u000A, for example, occurs within a single-line comment, it is interpreted as a line terminator (Unicode character 000A is line feed) and therefore the next character is not part of the comment. Similarly, if the Unicode escape sequence \u000A occurs within a string literal in a Java program, it is likewise interpreted as a line terminator, which is not allowed within a string literal—one must write \n instead of \u000A to cause a line feed to be part of the string value of a string literal. In an ECMAScript program, a Unicode escape sequence occurring within a comment is never interpreted and therefore cannot contribute to termination of the comment. Similarly, a Unicode escape sequence occurring within a string literal in an ECMAScript program always contributes a character to the string value of the literal and is never interpreted as a line terminator or as a quote mark that might terminate the string literal. FROM PAGE 60-61 15.1.2.4 escape(string) The escape function computes a new version of a string value in which certain characters have been replaced by a hexadecimal escape sequence. The result thus contains no special characters that might have special meaning within a URL. For characters whose Unicode encoding is 0xFF or less, a two-digit escape sequence of the form %xx is used in accordance with RFC1738. For characters whose Unicode encoding is greater than 0xFF, a four-digit escape sequence of the form %uxxxx is used When the escape function is called with one argument string, the following steps are taken: 1. Call ToString(string). 2. Compute the number of characters in Result(1). 3. Let R be the empty string. 4. Let k be 0. 5. If k equals Result(2), return R. 6. Get the character at position k within Result(1). 7. If Result(6) is one of the 69 nonblank ASCII characters ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 @*_+-./, go to step 14. 8. Compute the 16-bit unsigned integer that is the Unicode character encoding of Result(6). 9. If Result(8), is less than 256, go to step 12. 10. Let S be a string containing six characters “%uwxyz” where wxyz are four hexadecimal digits encoding the value of Result(8). 11. Go to step 15. 12. Let S be a string containing three characters “%xy” where xy are two hexadecimal digits encoding the value of Result(8). 13. Go to step 15. 14. Let S be a string containing the single character Result(6). 15.Let R be a new string value computed by concatenating the previous value of R and S. 16. Increase k by 1. 17. Go to step 5. 15.1.2.5 unescape(string) The unescape function computes a new version of a string value in which each escape sequences of the sort that might be introduced by the escape function is replaced with the character that it represents. When the unescape function is called with one argument string, the following steps are taken: 1. Call ToString(string). 2. Compute the number of characters in Result(1). 3. Let R be the empty string. 4. Let k be 0. 5. If k equals Result(2), return R. 6. Let c be the character at position k within Result(1). 7. If c is not %, go to step 18. 8. If k is greater than Result(2)-6, go to step 14. 9. If the character at position k+1 within result(1) is not u, go to step 14. 10. If the four characters at positions k+2, k+3, k+4, and k+5 within Result(1) are not all hexadecimal digits, go to step 14. 11. Let c be the character whose Unicode encoding is the integer represented by the four hexadecimal digits at positions k+2, k+3, k+4, and k+5 within Result(1). 12. Increase k by 5. 13. Go to step 18. 14. If k is greater than Result(2)-3, go to step 18. 15. If the two characters at positions k+1 and k+2 within Result(1) are not both hexadecimal digits, go to step 18. 16. Let c be the character whose Unicode encoding is the integer represented by two zeroes plus the two hexadecimal digits at positions k+1 and k+2 within Result(1). 17. Increase k by 2. 18. Let R be a new string value computed by concatenating the previous value of R and c. 19. Increase k by 1. 20. Go to step 5.
ekrock, how important is this bug ? We override the escape and unescape so we won't break existing JavaScript code. However, we may need to do more to make it also follow ECMA Script.
There are really two issues here, right ? Issue 1. \uxxxx in JavaScript Issue 2. escape() and unescape() function with %uxxxx We need ekrock's oppinion before decide +/-
clayton, jband, mccabe, brendan -- How important is this bug from the standpoint of ECMA-262 Version 3 compliance? We *are* shooting for full compliance with that specification. From a practical standpoint: #1: We don't want to break existing code, and we don't here, so that's good. #2: escape and unescape are used primarily for encoding strings to be passed as URLs. If we don't fix this, it basically means that the first release may not be able to pass unicode characters in URLs to CGIs. Unfortunate, but probably not the end of the world. My vote: if this is required for ECMA-262 V3 compliance, I'd like it fixed. If not, we'd love to fix it, but it could be Futured if you're out of time.
Marking nsbeta3- per I18N Bug Triage.
Whiteboard: [nsbeta3-]
I don't think this is a bug; escape and unescape aren't a required part of the ECMA-262 spec. Legacy server interactions require that they know about the document encoding; this is what the DOM version of the escape and unescape functions do. (The engine itself has its own versions, which aren't visible within the browser. These versions follow the 'recommended specification' from the ECMA spec.) Please see extensive comments to 42221. See also encodeURI and friends, now officially part of the ECMA spec. They're an effort to solve this problem without interacting with legacy server behavior. Marking WONTFIX... but many thanks for the close look at how we behave.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → WONTFIX
Marking verified.
Status: RESOLVED → VERIFIED
*** Bug 96716 has been marked as a duplicate of this bug. ***
I would love to see the "extensive comments" in bug 42221, but it is protected. Please open it. This is still a bug, especially since it intentionally violates the EMCA spec for %uwxyz output from the escape function. IE does this as spec'd, so I don't see a valid argument about legacy server behavior--the server will still need to deal with IE, which is more and more likely as time goes on. As near as I can tell, encodeURI and encodeURIComponent still require the server to know the document encoding (correct me if I'm misunderstanding this). Using escape as spec'd, the document encoding is not necessary. The argument about breaking legacy JavaScript code also isn't persuasive: IE made the change quite a while ago and there are many other places in Mozilla that break legacy code (lack of support for layers among other things).
Bug 42221 is now world readable, reopening this to reconsider given that this breaks IE compatibility. What about NS4x compatibility, does this break that too?
Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
Thanks for opening bug 42221, JST. In bug 42221 comment #44 from 2000-08-24, Mike McCabe says: "The issue isn't with escape and unescape per se, but with the fact that string information that is to be transmitted to servers primarily passes through them. Interaction with servers evolved back when JavaScript used 8-bit characters in strings, and more importantly, when JavaScript worked directly with the document charactacter encoding rather than with a unicode reflection of it. After we moved escape and unescape into the JS engine in an effort to keep up with the evolving ECMA standard, all of these server interactions broke, as they were now getting unicode instead of, say, shift-JIS. We ended up keeping the ECMA-draft versions in the engine (note that they are now not part of the standard, because of this issue) and Joki implemented escape and unescape functions in the DOM, where they had access to the document character encoding and could to back-and-forth translations." If I'm reading this correctly, the DOM escape will escape a character according to the document character encoding, while the JS engine's escape puts it into the %uwxyz unicode format. Given IE's behavior of doing the latter and it's market dominance, I find it hard to believe that there are many servers that still use the legacy behavior. Whether or not this is officially part of the EMCA standard, I'd rather see Mozilla be compatible with IE at this point than Netscape 4.
nhotta, can you take a look at this ?
Assignee: ftang → nhotta
Status: REOPENED → NEW
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.2alpha
*** Bug 121275 has been marked as a duplicate of this bug. ***
Sounds little serious & can cause serious effects on data security. Raising priority to p2. Severity major. Backward compatibility is defeating too.
Severity: normal → major
Priority: P3 → P2
Attached file testcase
Target Milestone: mozilla1.2alpha → ---
In the test case, 2&3 are not bug. 2. unescape("%u0068ello") The % encode is % followed by hex digits. This data is not URL encoded correctly. 3. unescape("%u1E85ow") he % encode is % followed by hex digits. This data is not URL encoded correctly. The first case is a bug. 1. escape("\u015Fello") The character \u015F cannot be encoded to the document charset which is ISO-8859-1 for the test case (since no charset is specified in the document so the default ISO-8859-1 is used). If I set the document charset to UTF-8 using the menu then the character is escaped correctly as "%C5%9F". So we could change to fallback to UTF-8 when escaping character is out of the current document charset range.
*** Bug 191139 has been marked as a duplicate of this bug. ***
OS: Windows NT → All
Hardware: PC → All
Attached file interactive testcase
Anyone still paying attention to this one? It seems like an easy fix, considering the correct escape() function already exists. I'm sure we could come up with a compromise if anyone is actually worried about 4x compatibility here. I've attached a small html file that makes it easy to run strings through window.escape() to see what comes out.
This is indeed a serious bug. I've just tested Opera 7.2 and Konqueror(http://i18nl10n.com/moztest/jstest.euckr.html). Both of them, in addition to MS IE, do the right thing. Only Mozilla is NOT compliant to the standard. It's rather ironic that Mozilla doesn't abide by the standard given that ECMAscript originated as Javascript at Netscape. It's a rather easy fix except for some code duplication issue. See also bug 224772 and bug 200984.
Blocks: 224772
taking (hope nhotta doesn't mind :-)). patch coming up shortly
Assignee: nhottanscp → jshin
Status: ASSIGNED → NEW
Attached patch patchSplinter Review
Escape() loops twice through the input string, once for the output length calculation and once for the actual conversion. We may be better off just looping once with the capaticy set to a reasonable estimate. What do you think, bz? BTW, note that this doesn't result in any regression against bug 200984. With MS IE, Konqueror, and Opera/Safari doing all what this patch does, the chance of breaking some javascript code in the wild is very low.
How well can we guess the needed capacity? It's only to within a factor of 6 or so, no?
be, who would be a reviewer for JS compliance? Would you ? (waldemar is not available, right?). bz, in general factor of six, but we can call IsASCII (or equivalent) to narrow it down to factor of 3. Alternatively, if the penalty for the wrong guess is not so high, we can take a chance by using the expectation value assuming all ASCII character has the same frequency (if ISASCII is true). I'm not familiar with the string internal. So, I'll do what you think is the best.
Status: NEW → ASSIGNED
Keywords: intl
Frankly, I think we should do the simplest code we can here, as a start. Either the factor-of-6 guess or what you did is just fine.
Comment on attachment 135177 [details] [diff] [review] patch Asking for r/sr. Let's just go with the current patch as bz is all right with either way. I guess JS compliance review is not really necessary. Nonetheless, I'd be glad to get review of brendan as well if he has something to say here.
Attachment #135177 - Flags: superreview?(bz-vacation)
Attachment #135177 - Flags: review?(smontagu)
I won't be able to review this for probably at least a week.
Comment on attachment 135177 [details] [diff] [review] patch Why not just use the JS engine's escape and unescape (don't #ifndef MOZILLA_CLIENT them in js/src/js{api,str}.c), rather than have the DOM implement them over again? /be
Also note the encodeURI() and decodeURI() functions. I don't know if these would be appropriate or better for handling Unicode characters -
> Why not just use the JS engine's escape and unescape Because that would unfortunately break the world. Or it did when it was done that way. See bug 22594
bz: first, isn't the patch in this bug also going to break the world? I don't see it using the document charset (maybe I'm blind). Second, this bug and others have argued that IE won, it doesn't have the legacy escape/unescape hack, so why should we? Does Safari? /be
bz, brendan is right that my patch behaves exactly the same as JS engine version. I thought there was another reason to roll out a separate DOM version, but it turned out that that's not the case. so, it makes sense just to use JS engine version.
Brendan, I haven't read this patch yet... ;) jshin, what does passing a random accented char to escape() on an ISO-8859-1 page produce in IE, out of curiousity? I'm all for removing redundant code if we can. ;)
bz : If they're below U+0100, they'll be escaped as '%hh'. If not, they'll be escaped as '%uhhhh'. It doesn't matter whether the page is in ISO-8859-1 or Shift_JIS or UTF-8. The behavior is all identical across the spectrum of major browsers, MS IE, Opera, Konqueror (and I guess Safari, too because Safari is based on Konqueror). Only Mozilla is NOT compliant to the ECMA standard and doing so breaks the world in 2003 :-) brendan, do you know which of a dozen interfaces (implemented by nsGlobalWindowImpl) defines escape/unescape overriding the JS engine? bug 22594 doesn't have any patch and the software archealogy is not fun :-) If you can't give an answer off the top of the head, I'll track it down
I found it. It's in nsIDOMWindowInternal.idl. Patch coming up.
Ah, cool. Bring on the code removal, then. On the assumption that we will be going with that patch, adding all sorts of fun deps that should be fixed or maybe fixed (one of these is invalid, but may get fixed anyway). Also, I thought we had a bug on window.escape totally failing to deal on UTF-16 encoded pages, but I can't find it.
Blocks: 192148, 207261, 212308
I wonder which has better code footprint, the C++ escape and unescape in the patch here, or the C ones mccabe or rogerl wrote in js/src/jsstr.c? /be
brendan, do you want me to build an optimized build and check it out? just measuring the binary size is sufficient or should I use a more sophisticated method? is it worth the trouble?
it seems like we don't have a choice because escape/unescape in the JS engine are also used internally so that we can't remove them, can we?
Comment on attachment 135217 [details] [diff] [review] alternative :(removing DOM version of escape/unescape) r=me on the js changes, and if I'm a dom peer, on all changes. jshin, just curious about code size. If you have time and are curious, you can use size on Linux (or codesighs) on builds with the first patch and with this patch. /be
Attachment #135217 - Flags: review+
Comment on attachment 135217 [details] [diff] [review] alternative :(removing DOM version of escape/unescape) sr=bzbarsky. If this is what IE does, r=bzbarsky on the DOM changes too. As a bonus, this makes escape() work in JS components with no windows, no? ;)
Attachment #135217 - Flags: superreview+
The JS engine escape() function escapes ASCII characters '&', '=', and '?': js> escape('&'); %26 js> escape('='); %3D js> escape('?'); %3F If we revert to this, won't we break form submission? See bug 174628.
Phil, do you know why does the JS engine's escape does that? Crazy. /be
Phil, the patch in bug 174628 involved switching GlobalWindowImpl::OpenInternal() from using GlobalWindowImpl::Escape to escape the URL to directly calling nsEscape with a slightly different set of escape flags that does not escape those chars. So changes to how window.escape works will not affect the testcases in bug 174628 if I understand them correctly. See comment at http://lxr.mozilla.org/seamonkey/source/dom/src/base/nsGlobalWindow.cpp#4587 and code following.
bz is right. this change doesn't affect what's fixed in bug 174628. I just checked in the fix.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago21 years ago
Resolution: --- → FIXED
Attachment #135177 - Flags: superreview?(bz-vacation)
Attachment #135177 - Flags: review?(smontagu)
brendan, what do you think of back-porting this to 1.4.2?
jshin, once this cycles and all, test the bugs blocked by it?
mkaply, are you driving 1.4.2? jshin, set the blocking1.4.2? if this tests well. /be
Firebird nightly works well. I'm making this block 1.4.2.
Flags: blocking1.4.2?
Let's give this a little time to bake and then I'll approve it for 1.4.2.
Looks like this has broke my super-duper XMLHTTPRequest-Powered chat :-). I used escape() to have strings written in Russian escaped as %NN using current page's charset (namely, "windows-1251"). Now escape does %uNNNN which server code doesn't understand. So, is there a way to escape characters using some specified charset?
Ivan: You can use encodeURIComponent() to use encode in UTF-8. I don't know if that helps. [Note to self: escape()'s new behavior is described in comment 37.]
This new behaviour is inacceptable like each dropping backward compatibility. The bug is not the old behaviour but the the ECMA specification, because this spec unnecessarily breaks compatibility. If the protocol between sender and receiver was to speak windows-1251 encoded via escape, then it is inacceptable to change the behaviour of escape to do a transformation to Unicode. The internal coding of the JS engine is not of interest. If the traditional behaviour was to send hexcodes using the character encoding of the page, then this is the correct implementation and can not be changed. The ECMA specification is only valid for the core language. This core language does not know anything about document encodings. This means the specification is contextless. It must not break older traditions implemented in given contexts like webpages. If this is possible, then implementing the traditional behaviour as method of the window object hiding the equal named core function may give the old behaviour back to those who need it. escape('Ä'); // escape as unicode window.escape('Ä'); // escape using the encoding of the page. This suggestion may be confusing to beginners, but it gives back the traditional behaviour to those who need it. if(escape !== window.escape) alert('Yeah, found two implementations.'); This suggestion is ECMA compliant.
Georg, Netscape lost the browser wars. Mozilla breaking compatibilty with old Netscape releases may be bad, but it seems worse to be incompatible with Safari, Opera, and IE. What's more, your suggestion that escape !== window.escape is not exactly kosher, and I don't see how it will help the Netscape-only scripts that call escape, not window.escape, and want it to behave as it did in 4.x. Can you point to some pages that counted on the old Netscape behavior? /be
so are we 100% convinced of this for 1.4.2?
We might have to give up getting this into 1.4.2 because we also have to fix bug 225695 along with this. The change necessary to fix bug 225695 is a lot more extensive than the fix for this one.
without a fix for 225695, this is missing the boat.
Flags: blocking1.4.2? → blocking1.4.2-
Blocks: 118373
*** Bug 232841 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: