Closed
Bug 44272
Opened 24 years ago
Closed 21 years ago
javascript escape and unescape don't work properly with unicode chars
Categories
(Core :: DOM: Core & HTML, defect, P2)
Core
DOM: Core & HTML
Tracking
()
RESOLVED
FIXED
People
(Reporter: pete_a90, Assigned: jshin1987)
References
Details
(Keywords: intl, Whiteboard: [nsbeta3-])
Attachments
(4 files)
734 bytes,
text/html
|
Details | |
719 bytes,
text/html
|
Details | |
8.32 KB,
patch
|
Details | Diff | Splinter Review | |
7.14 KB,
patch
|
brendan
:
review+
bzbarsky
:
superreview+
|
Details | Diff | Splinter Review |
From Bugzilla Helper: User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt) BuildID: 2000062908 The escape and unescape javascript functions have slight problems with unicode escaped fields. Reproducible: Always Steps to Reproduce: 1. Load javascript code to a new file 2. Open file 3. Sigh Actual Results: 1. <-- (nothing) 2. <-- (nothing) 3. E85ow <-- (E85ow) 4. heycool <-- (this one works) Expected Results: 1. %u015Fello 2. hello 3. %u1E85ow 4. There is no problem with this one, I'm just showing \u works. <PRE> <SCRIPT TYPE="text/javascript"><!-- document.writeln("Actual output:".bold()); document.writeln("1. " + escape("\u015Fello") + " <-- (nothing)"); document.writeln("2. " + unescape("%u0068ello") + " <-- (nothing)"); document.writeln("3. " + unescape("%u1E85ow") + " <-- (E85ow)"); document.writeln("4. \u0068eycool <-- (this one works)\n"); document.writeln("Expected output:".bold()); document.writeln("1. %u015Fello"); document.writeln("2. hello"); document.writeln("3. \u1E85ow"); document.writeln("4. There is no problem with this one, I'm just showing \\u works."); //--></SCRIPT> </PRE>
Comment 1•24 years ago
|
||
-->DOM, the [un]escape functions being called come from nsJSWindow, they override the engine versions.
Assignee: rogerl → jst
Status: UNCONFIRMED → NEW
Component: Javascript Engine → DOM Level 0
Ever confirmed: true
QA Contact: pschwartau → desale
Comment 2•24 years ago
|
||
Reassigning to ftang for investigation as an i18n content support problem.
Assignee: jst → ftang
Updated•24 years ago
|
Status: NEW → ASSIGNED
Comment 4•24 years ago
|
||
I take a look at ECMA-262 from http://developer.netscape.com/docs/javascript/e262-pdf.pdf Here is some related information FROM PAGE 9 Source Text ECMAScript source text is represented as a sequence of characters representable using the Unicode version 2.0 character encoding. SourceCharacter :: any Unicode character However, it is possible to represent every ECMAScript program using only ASCII characters (which are equivalent to the first 128 Unicode characters). Non-ASCII Unicode characters may appear only within comments and string literals. In string literals, any Unicode character may also be expressed as a Unicode escape sequence consisting of six ASCII characters, namely \u plus four hexadecimal digits. Within a comment, such an escape sequence is effectively ignored as part of the comment. Within a string literal, the Unicode escape sequence contributes one character to the string value of the literal. Note that ECMAScript differs from the Java programming language in the behavior of Unicode escape sequences. In a Java program, if the Unicode escape sequence \u000A, for example, occurs within a single-line comment, it is interpreted as a line terminator (Unicode character 000A is line feed) and therefore the next character is not part of the comment. Similarly, if the Unicode escape sequence \u000A occurs within a string literal in a Java program, it is likewise interpreted as a line terminator, which is not allowed within a string literal—one must write \n instead of \u000A to cause a line feed to be part of the string value of a string literal. In an ECMAScript program, a Unicode escape sequence occurring within a comment is never interpreted and therefore cannot contribute to termination of the comment. Similarly, a Unicode escape sequence occurring within a string literal in an ECMAScript program always contributes a character to the string value of the literal and is never interpreted as a line terminator or as a quote mark that might terminate the string literal. FROM PAGE 60-61 15.1.2.4 escape(string) The escape function computes a new version of a string value in which certain characters have been replaced by a hexadecimal escape sequence. The result thus contains no special characters that might have special meaning within a URL. For characters whose Unicode encoding is 0xFF or less, a two-digit escape sequence of the form %xx is used in accordance with RFC1738. For characters whose Unicode encoding is greater than 0xFF, a four-digit escape sequence of the form %uxxxx is used When the escape function is called with one argument string, the following steps are taken: 1. Call ToString(string). 2. Compute the number of characters in Result(1). 3. Let R be the empty string. 4. Let k be 0. 5. If k equals Result(2), return R. 6. Get the character at position k within Result(1). 7. If Result(6) is one of the 69 nonblank ASCII characters ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz 0123456789 @*_+-./, go to step 14. 8. Compute the 16-bit unsigned integer that is the Unicode character encoding of Result(6). 9. If Result(8), is less than 256, go to step 12. 10. Let S be a string containing six characters “%uwxyz” where wxyz are four hexadecimal digits encoding the value of Result(8). 11. Go to step 15. 12. Let S be a string containing three characters “%xy” where xy are two hexadecimal digits encoding the value of Result(8). 13. Go to step 15. 14. Let S be a string containing the single character Result(6). 15.Let R be a new string value computed by concatenating the previous value of R and S. 16. Increase k by 1. 17. Go to step 5. 15.1.2.5 unescape(string) The unescape function computes a new version of a string value in which each escape sequences of the sort that might be introduced by the escape function is replaced with the character that it represents. When the unescape function is called with one argument string, the following steps are taken: 1. Call ToString(string). 2. Compute the number of characters in Result(1). 3. Let R be the empty string. 4. Let k be 0. 5. If k equals Result(2), return R. 6. Let c be the character at position k within Result(1). 7. If c is not %, go to step 18. 8. If k is greater than Result(2)-6, go to step 14. 9. If the character at position k+1 within result(1) is not u, go to step 14. 10. If the four characters at positions k+2, k+3, k+4, and k+5 within Result(1) are not all hexadecimal digits, go to step 14. 11. Let c be the character whose Unicode encoding is the integer represented by the four hexadecimal digits at positions k+2, k+3, k+4, and k+5 within Result(1). 12. Increase k by 5. 13. Go to step 18. 14. If k is greater than Result(2)-3, go to step 18. 15. If the two characters at positions k+1 and k+2 within Result(1) are not both hexadecimal digits, go to step 18. 16. Let c be the character whose Unicode encoding is the integer represented by two zeroes plus the two hexadecimal digits at positions k+1 and k+2 within Result(1). 17. Increase k by 2. 18. Let R be a new string value computed by concatenating the previous value of R and c. 19. Increase k by 1. 20. Go to step 5.
Comment 5•24 years ago
|
||
ekrock, how important is this bug ? We override the escape and unescape so we won't break existing JavaScript code. However, we may need to do more to make it also follow ECMA Script.
Comment 6•24 years ago
|
||
There are really two issues here, right ? Issue 1. \uxxxx in JavaScript Issue 2. escape() and unescape() function with %uxxxx We need ekrock's oppinion before decide +/-
Comment 7•24 years ago
|
||
clayton, jband, mccabe, brendan -- How important is this bug from the standpoint of ECMA-262 Version 3 compliance? We *are* shooting for full compliance with that specification. From a practical standpoint: #1: We don't want to break existing code, and we don't here, so that's good. #2: escape and unescape are used primarily for encoding strings to be passed as URLs. If we don't fix this, it basically means that the first release may not be able to pass unicode characters in URLs to CGIs. Unfortunate, but probably not the end of the world. My vote: if this is required for ECMA-262 V3 compliance, I'd like it fixed. If not, we'd love to fix it, but it could be Futured if you're out of time.
Comment 9•24 years ago
|
||
I don't think this is a bug; escape and unescape aren't a required part of the ECMA-262 spec. Legacy server interactions require that they know about the document encoding; this is what the DOM version of the escape and unescape functions do. (The engine itself has its own versions, which aren't visible within the browser. These versions follow the 'recommended specification' from the ECMA spec.) Please see extensive comments to 42221. See also encodeURI and friends, now officially part of the ECMA spec. They're an effort to solve this problem without interacting with legacy server behavior. Marking WONTFIX... but many thanks for the close look at how we behave.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → WONTFIX
Comment 11•23 years ago
|
||
*** Bug 96716 has been marked as a duplicate of this bug. ***
Comment 12•23 years ago
|
||
I would love to see the "extensive comments" in bug 42221, but it is protected. Please open it. This is still a bug, especially since it intentionally violates the EMCA spec for %uwxyz output from the escape function. IE does this as spec'd, so I don't see a valid argument about legacy server behavior--the server will still need to deal with IE, which is more and more likely as time goes on. As near as I can tell, encodeURI and encodeURIComponent still require the server to know the document encoding (correct me if I'm misunderstanding this). Using escape as spec'd, the document encoding is not necessary. The argument about breaking legacy JavaScript code also isn't persuasive: IE made the change quite a while ago and there are many other places in Mozilla that break legacy code (lack of support for layers among other things).
Comment 13•23 years ago
|
||
Bug 42221 is now world readable, reopening this to reconsider given that this breaks IE compatibility. What about NS4x compatibility, does this break that too?
Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
Comment 14•23 years ago
|
||
Thanks for opening bug 42221, JST. In bug 42221 comment #44 from 2000-08-24, Mike McCabe says: "The issue isn't with escape and unescape per se, but with the fact that string information that is to be transmitted to servers primarily passes through them. Interaction with servers evolved back when JavaScript used 8-bit characters in strings, and more importantly, when JavaScript worked directly with the document charactacter encoding rather than with a unicode reflection of it. After we moved escape and unescape into the JS engine in an effort to keep up with the evolving ECMA standard, all of these server interactions broke, as they were now getting unicode instead of, say, shift-JIS. We ended up keeping the ECMA-draft versions in the engine (note that they are now not part of the standard, because of this issue) and Joki implemented escape and unescape functions in the DOM, where they had access to the document character encoding and could to back-and-forth translations." If I'm reading this correctly, the DOM escape will escape a character according to the document character encoding, while the JS engine's escape puts it into the %uwxyz unicode format. Given IE's behavior of doing the latter and it's market dominance, I find it hard to believe that there are many servers that still use the legacy behavior. Whether or not this is officially part of the EMCA standard, I'd rather see Mozilla be compatible with IE at this point than Netscape 4.
Comment 15•23 years ago
|
||
nhotta, can you take a look at this ?
Assignee: ftang → nhotta
Status: REOPENED → NEW
Updated•22 years ago
|
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.2alpha
Comment 16•22 years ago
|
||
*** Bug 121275 has been marked as a duplicate of this bug. ***
Comment 17•22 years ago
|
||
Sounds little serious & can cause serious effects on data security. Raising priority to p2. Severity major. Backward compatibility is defeating too.
Severity: normal → major
Priority: P3 → P2
Comment 18•22 years ago
|
||
Updated•22 years ago
|
Target Milestone: mozilla1.2alpha → ---
Comment 19•22 years ago
|
||
In the test case, 2&3 are not bug. 2. unescape("%u0068ello") The % encode is % followed by hex digits. This data is not URL encoded correctly. 3. unescape("%u1E85ow") he % encode is % followed by hex digits. This data is not URL encoded correctly. The first case is a bug. 1. escape("\u015Fello") The character \u015F cannot be encoded to the document charset which is ISO-8859-1 for the test case (since no charset is specified in the document so the default ISO-8859-1 is used). If I set the document charset to UTF-8 using the menu then the character is escaped correctly as "%C5%9F". So we could change to fallback to UTF-8 when escaping character is out of the current document charset range.
Comment 20•22 years ago
|
||
*** Bug 191139 has been marked as a duplicate of this bug. ***
Comment 21•21 years ago
|
||
Anyone still paying attention to this one? It seems like an easy fix, considering the correct escape() function already exists. I'm sure we could come up with a compromise if anyone is actually worried about 4x compatibility here. I've attached a small html file that makes it easy to run strings through window.escape() to see what comes out.
Assignee | ||
Comment 22•21 years ago
|
||
This is indeed a serious bug. I've just tested Opera 7.2 and Konqueror(http://i18nl10n.com/moztest/jstest.euckr.html). Both of them, in addition to MS IE, do the right thing. Only Mozilla is NOT compliant to the standard. It's rather ironic that Mozilla doesn't abide by the standard given that ECMAscript originated as Javascript at Netscape. It's a rather easy fix except for some code duplication issue. See also bug 224772 and bug 200984.
Assignee | ||
Comment 23•21 years ago
|
||
taking (hope nhotta doesn't mind :-)). patch coming up shortly
Assignee: nhottanscp → jshin
Status: ASSIGNED → NEW
Assignee | ||
Comment 24•21 years ago
|
||
Escape() loops twice through the input string, once for the output length calculation and once for the actual conversion. We may be better off just looping once with the capaticy set to a reasonable estimate. What do you think, bz? BTW, note that this doesn't result in any regression against bug 200984. With MS IE, Konqueror, and Opera/Safari doing all what this patch does, the chance of breaking some javascript code in the wild is very low.
Comment 25•21 years ago
|
||
How well can we guess the needed capacity? It's only to within a factor of 6 or so, no?
Assignee | ||
Comment 26•21 years ago
|
||
be, who would be a reviewer for JS compliance? Would you ? (waldemar is not available, right?). bz, in general factor of six, but we can call IsASCII (or equivalent) to narrow it down to factor of 3. Alternatively, if the penalty for the wrong guess is not so high, we can take a chance by using the expectation value assuming all ASCII character has the same frequency (if ISASCII is true). I'm not familiar with the string internal. So, I'll do what you think is the best.
Status: NEW → ASSIGNED
Keywords: intl
Comment 27•21 years ago
|
||
Frankly, I think we should do the simplest code we can here, as a start. Either the factor-of-6 guess or what you did is just fine.
Assignee | ||
Comment 28•21 years ago
|
||
Comment on attachment 135177 [details] [diff] [review] patch Asking for r/sr. Let's just go with the current patch as bz is all right with either way. I guess JS compliance review is not really necessary. Nonetheless, I'd be glad to get review of brendan as well if he has something to say here.
Attachment #135177 -
Flags: superreview?(bz-vacation)
Attachment #135177 -
Flags: review?(smontagu)
Comment 29•21 years ago
|
||
I won't be able to review this for probably at least a week.
Comment 30•21 years ago
|
||
Comment on attachment 135177 [details] [diff] [review] patch Why not just use the JS engine's escape and unescape (don't #ifndef MOZILLA_CLIENT them in js/src/js{api,str}.c), rather than have the DOM implement them over again? /be
Comment 31•21 years ago
|
||
Also note the encodeURI() and decodeURI() functions. I don't know if these would be appropriate or better for handling Unicode characters -
Comment 32•21 years ago
|
||
> Why not just use the JS engine's escape and unescape Because that would unfortunately break the world. Or it did when it was done that way. See bug 22594
Comment 33•21 years ago
|
||
In particular, see http://bugzilla.mozilla.org/show_bug.cgi?id=22594#c12
Comment 34•21 years ago
|
||
bz: first, isn't the patch in this bug also going to break the world? I don't see it using the document charset (maybe I'm blind). Second, this bug and others have argued that IE won, it doesn't have the legacy escape/unescape hack, so why should we? Does Safari? /be
Assignee | ||
Comment 35•21 years ago
|
||
bz, brendan is right that my patch behaves exactly the same as JS engine version. I thought there was another reason to roll out a separate DOM version, but it turned out that that's not the case. so, it makes sense just to use JS engine version.
Comment 36•21 years ago
|
||
Brendan, I haven't read this patch yet... ;) jshin, what does passing a random accented char to escape() on an ISO-8859-1 page produce in IE, out of curiousity? I'm all for removing redundant code if we can. ;)
Assignee | ||
Comment 37•21 years ago
|
||
bz : If they're below U+0100, they'll be escaped as '%hh'. If not, they'll be escaped as '%uhhhh'. It doesn't matter whether the page is in ISO-8859-1 or Shift_JIS or UTF-8. The behavior is all identical across the spectrum of major browsers, MS IE, Opera, Konqueror (and I guess Safari, too because Safari is based on Konqueror). Only Mozilla is NOT compliant to the ECMA standard and doing so breaks the world in 2003 :-) brendan, do you know which of a dozen interfaces (implemented by nsGlobalWindowImpl) defines escape/unescape overriding the JS engine? bug 22594 doesn't have any patch and the software archealogy is not fun :-) If you can't give an answer off the top of the head, I'll track it down
Assignee | ||
Comment 38•21 years ago
|
||
I found it. It's in nsIDOMWindowInternal.idl. Patch coming up.
Comment 39•21 years ago
|
||
Ah, cool. Bring on the code removal, then. On the assumption that we will be going with that patch, adding all sorts of fun deps that should be fixed or maybe fixed (one of these is invalid, but may get fixed anyway). Also, I thought we had a bug on window.escape totally failing to deal on UTF-16 encoded pages, but I can't find it.
Comment 40•21 years ago
|
||
I wonder which has better code footprint, the C++ escape and unescape in the patch here, or the C ones mccabe or rogerl wrote in js/src/jsstr.c? /be
Assignee | ||
Comment 41•21 years ago
|
||
brendan, do you want me to build an optimized build and check it out? just measuring the binary size is sufficient or should I use a more sophisticated method? is it worth the trouble?
Assignee | ||
Comment 42•21 years ago
|
||
it seems like we don't have a choice because escape/unescape in the JS engine are also used internally so that we can't remove them, can we?
Comment 43•21 years ago
|
||
Comment on attachment 135217 [details] [diff] [review] alternative :(removing DOM version of escape/unescape) r=me on the js changes, and if I'm a dom peer, on all changes. jshin, just curious about code size. If you have time and are curious, you can use size on Linux (or codesighs) on builds with the first patch and with this patch. /be
Attachment #135217 -
Flags: review+
Comment 44•21 years ago
|
||
Comment on attachment 135217 [details] [diff] [review] alternative :(removing DOM version of escape/unescape) sr=bzbarsky. If this is what IE does, r=bzbarsky on the DOM changes too. As a bonus, this makes escape() work in JS components with no windows, no? ;)
Attachment #135217 -
Flags: superreview+
Comment 45•21 years ago
|
||
The JS engine escape() function escapes ASCII characters '&', '=', and '?': js> escape('&'); %26 js> escape('='); %3D js> escape('?'); %3F If we revert to this, won't we break form submission? See bug 174628.
Comment 46•21 years ago
|
||
Phil, do you know why does the JS engine's escape does that? Crazy. /be
Comment 47•21 years ago
|
||
Phil, the patch in bug 174628 involved switching GlobalWindowImpl::OpenInternal() from using GlobalWindowImpl::Escape to escape the URL to directly calling nsEscape with a slightly different set of escape flags that does not escape those chars. So changes to how window.escape works will not affect the testcases in bug 174628 if I understand them correctly. See comment at http://lxr.mozilla.org/seamonkey/source/dom/src/base/nsGlobalWindow.cpp#4587 and code following.
Assignee | ||
Comment 48•21 years ago
|
||
bz is right. this change doesn't affect what's fixed in bug 174628. I just checked in the fix.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago → 21 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•21 years ago
|
Attachment #135177 -
Flags: superreview?(bz-vacation)
Attachment #135177 -
Flags: review?(smontagu)
Assignee | ||
Comment 49•21 years ago
|
||
brendan, what do you think of back-porting this to 1.4.2?
Comment 50•21 years ago
|
||
jshin, once this cycles and all, test the bugs blocked by it?
Comment 51•21 years ago
|
||
mkaply, are you driving 1.4.2? jshin, set the blocking1.4.2? if this tests well. /be
Assignee | ||
Comment 52•21 years ago
|
||
Firebird nightly works well. I'm making this block 1.4.2.
Flags: blocking1.4.2?
Comment 53•21 years ago
|
||
Let's give this a little time to bake and then I'll approve it for 1.4.2.
Comment 54•21 years ago
|
||
Looks like this has broke my super-duper XMLHTTPRequest-Powered chat :-). I used escape() to have strings written in Russian escaped as %NN using current page's charset (namely, "windows-1251"). Now escape does %uNNNN which server code doesn't understand. So, is there a way to escape characters using some specified charset?
Comment 55•21 years ago
|
||
Ivan: You can use encodeURIComponent() to use encode in UTF-8. I don't know if that helps. [Note to self: escape()'s new behavior is described in comment 37.]
Comment 56•21 years ago
|
||
This new behaviour is inacceptable like each dropping backward compatibility. The bug is not the old behaviour but the the ECMA specification, because this spec unnecessarily breaks compatibility. If the protocol between sender and receiver was to speak windows-1251 encoded via escape, then it is inacceptable to change the behaviour of escape to do a transformation to Unicode. The internal coding of the JS engine is not of interest. If the traditional behaviour was to send hexcodes using the character encoding of the page, then this is the correct implementation and can not be changed. The ECMA specification is only valid for the core language. This core language does not know anything about document encodings. This means the specification is contextless. It must not break older traditions implemented in given contexts like webpages. If this is possible, then implementing the traditional behaviour as method of the window object hiding the equal named core function may give the old behaviour back to those who need it. escape('Ä'); // escape as unicode window.escape('Ä'); // escape using the encoding of the page. This suggestion may be confusing to beginners, but it gives back the traditional behaviour to those who need it. if(escape !== window.escape) alert('Yeah, found two implementations.'); This suggestion is ECMA compliant.
Comment 57•21 years ago
|
||
Georg, Netscape lost the browser wars. Mozilla breaking compatibilty with old Netscape releases may be bad, but it seems worse to be incompatible with Safari, Opera, and IE. What's more, your suggestion that escape !== window.escape is not exactly kosher, and I don't see how it will help the Netscape-only scripts that call escape, not window.escape, and want it to behave as it did in 4.x. Can you point to some pages that counted on the old Netscape behavior? /be
Comment 58•21 years ago
|
||
so are we 100% convinced of this for 1.4.2?
Assignee | ||
Comment 59•21 years ago
|
||
We might have to give up getting this into 1.4.2 because we also have to fix bug 225695 along with this. The change necessary to fix bug 225695 is a lot more extensive than the fix for this one.
Comment 60•21 years ago
|
||
without a fix for 225695, this is missing the boat.
Flags: blocking1.4.2? → blocking1.4.2-
Comment 61•20 years ago
|
||
*** Bug 232841 has been marked as a duplicate of this bug. ***
You need to log in
before you can comment on or make changes to this bug.
Description
•