Closed
Bug 44272
Opened 24 years ago
Closed 21 years ago
javascript escape and unescape don't work properly with unicode chars
Categories
(Core :: DOM: Core & HTML, defect, P2)
Core
DOM: Core & HTML
Tracking
()
RESOLVED
FIXED
People
(Reporter: pete_a90, Assigned: jshin1987)
References
Details
(Keywords: intl, Whiteboard: [nsbeta3-])
Attachments
(4 files)
734 bytes,
text/html
|
Details | |
719 bytes,
text/html
|
Details | |
8.32 KB,
patch
|
Details | Diff | Splinter Review | |
7.14 KB,
patch
|
brendan
:
review+
bzbarsky
:
superreview+
|
Details | Diff | Splinter Review |
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
BuildID: 2000062908
The escape and unescape javascript functions have slight problems with unicode
escaped fields.
Reproducible: Always
Steps to Reproduce:
1. Load javascript code to a new file
2. Open file
3. Sigh
Actual Results:
1. <-- (nothing)
2. <-- (nothing)
3. E85ow <-- (E85ow)
4. heycool <-- (this one works)
Expected Results:
1. %u015Fello
2. hello
3. %u1E85ow
4. There is no problem with this one, I'm just showing \u works.
<PRE>
<SCRIPT TYPE="text/javascript"><!--
document.writeln("Actual output:".bold());
document.writeln("1. " + escape("\u015Fello") + " <-- (nothing)");
document.writeln("2. " + unescape("%u0068ello") + " <-- (nothing)");
document.writeln("3. " + unescape("%u1E85ow") + " <-- (E85ow)");
document.writeln("4. \u0068eycool <-- (this one works)\n");
document.writeln("Expected output:".bold());
document.writeln("1. %u015Fello");
document.writeln("2. hello");
document.writeln("3. \u1E85ow");
document.writeln("4. There is no problem with this one, I'm just showing \\u
works.");
//--></SCRIPT>
</PRE>
Comment 1•24 years ago
|
||
-->DOM, the [un]escape functions being called come from nsJSWindow, they
override the engine versions.
Assignee: rogerl → jst
Status: UNCONFIRMED → NEW
Component: Javascript Engine → DOM Level 0
Ever confirmed: true
QA Contact: pschwartau → desale
Comment 2•24 years ago
|
||
Reassigning to ftang for investigation as an i18n content support problem.
Assignee: jst → ftang
Updated•24 years ago
|
Status: NEW → ASSIGNED
Comment 4•24 years ago
|
||
I take a look at ECMA-262 from
http://developer.netscape.com/docs/javascript/e262-pdf.pdf
Here is some related information
FROM PAGE 9
Source Text
ECMAScript source text is represented as a sequence of characters representable
using the Unicode version 2.0
character encoding.
SourceCharacter ::
any Unicode character
However, it is possible to represent every ECMAScript program using only ASCII
characters (which are equivalent to
the first 128 Unicode characters). Non-ASCII Unicode characters may appear only
within comments and string literals.
In string literals, any Unicode character may also be expressed as a Unicode
escape sequence consisting of six ASCII
characters, namely \u plus four hexadecimal digits. Within a comment, such an
escape sequence is effectively ignored
as part of the comment. Within a string literal, the Unicode escape sequence
contributes one character to the string
value of the literal.
Note that ECMAScript differs from the Java programming language in the behavior
of Unicode escape sequences. In a
Java program, if the Unicode escape sequence \u000A, for example, occurs within
a single-line comment, it is
interpreted as a line terminator (Unicode character 000A is line feed) and
therefore the next character is not part of the
comment. Similarly, if the Unicode escape sequence \u000A occurs within a string
literal in a Java program, it is
likewise interpreted as a line terminator, which is not allowed within a string
literal—one must write \n instead of
\u000A to cause a line feed to be part of the string value of a string literal.
In an ECMAScript program, a Unicode
escape sequence occurring within a comment is never interpreted and therefore
cannot contribute to termination of the
comment. Similarly, a Unicode escape sequence occurring within a string literal
in an ECMAScript program always
contributes a character to the string value of the literal and is never
interpreted as a line terminator or as a quote mark
that might terminate the string literal.
FROM PAGE 60-61
15.1.2.4 escape(string)
The escape function computes a new version of a string value in which certain
characters have been
replaced by a hexadecimal escape sequence. The result thus contains no special
characters that might have
special meaning within a URL.
For characters whose Unicode encoding is 0xFF or less, a two-digit escape
sequence of the form %xx is used
in accordance with RFC1738. For characters whose Unicode encoding is greater
than 0xFF, a four-digit
escape sequence of the form %uxxxx is used
When the escape function is called with one argument string, the following steps
are taken:
1. Call ToString(string).
2. Compute the number of characters in Result(1).
3. Let R be the empty string.
4. Let k be 0.
5. If k equals Result(2), return R.
6. Get the character at position k within Result(1).
7. If Result(6) is one of the 69 nonblank ASCII characters
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz 0123456789 @*_+-./, go to step 14.
8. Compute the 16-bit unsigned integer that is the Unicode character encoding of
Result(6).
9. If Result(8), is less than 256, go to step 12.
10. Let S be a string containing six characters “%uwxyz” where wxyz are four
hexadecimal digits encoding the
value of Result(8).
11. Go to step 15.
12. Let S be a string containing three characters “%xy” where xy are two
hexadecimal digits encoding the
value of Result(8).
13. Go to step 15.
14. Let S be a string containing the single character Result(6).
15.Let R be a new string value computed by concatenating the previous value of R
and S.
16. Increase k by 1.
17. Go to step 5.
15.1.2.5 unescape(string)
The unescape function computes a new version of a string value in which each
escape sequences of the sort
that might be introduced by the escape function is replaced with the character
that it represents.
When the unescape function is called with one argument string, the following
steps are taken:
1. Call ToString(string).
2. Compute the number of characters in Result(1).
3. Let R be the empty string.
4. Let k be 0.
5. If k equals Result(2), return R.
6. Let c be the character at position k within Result(1).
7. If c is not %, go to step 18.
8. If k is greater than Result(2)-6, go to step 14.
9. If the character at position k+1 within result(1) is not u, go to step 14.
10. If the four characters at positions k+2, k+3, k+4, and k+5 within Result(1)
are not all hexadecimal digits,
go to step 14.
11. Let c be the character whose Unicode encoding is the integer represented by
the four hexadecimal digits at
positions k+2, k+3, k+4, and k+5 within Result(1).
12. Increase k by 5.
13. Go to step 18.
14. If k is greater than Result(2)-3, go to step 18.
15. If the two characters at positions k+1 and k+2 within Result(1) are not both
hexadecimal digits, go to step
18.
16. Let c be the character whose Unicode encoding is the integer represented by
two zeroes plus the two
hexadecimal digits at positions k+1 and k+2 within Result(1).
17. Increase k by 2.
18. Let R be a new string value computed by concatenating the previous value of
R and c.
19. Increase k by 1.
20. Go to step 5.
Comment 5•24 years ago
|
||
ekrock, how important is this bug ? We override the escape and unescape so we
won't break existing JavaScript code. However, we may need to do more to make it
also follow ECMA Script.
Comment 6•24 years ago
|
||
There are really two issues here, right ?
Issue 1. \uxxxx in JavaScript
Issue 2. escape() and unescape() function with %uxxxx
We need ekrock's oppinion before decide +/-
Comment 7•24 years ago
|
||
clayton, jband, mccabe, brendan -- How important is this bug from the standpoint
of ECMA-262 Version 3 compliance? We *are* shooting for full compliance with
that specification.
From a practical standpoint:
#1: We don't want to break existing code, and we don't here, so that's good.
#2: escape and unescape are used primarily for encoding strings to be passed as
URLs. If we don't fix this, it basically means that the first release may not be
able to pass unicode characters in URLs to CGIs. Unfortunate, but probably not
the end of the world.
My vote: if this is required for ECMA-262 V3 compliance, I'd like it fixed. If
not, we'd love to fix it, but it could be Futured if you're out of time.
Comment 9•24 years ago
|
||
I don't think this is a bug; escape and unescape aren't a required part of the
ECMA-262 spec. Legacy server interactions require that they know about the
document encoding; this is what the DOM version of the escape and unescape
functions do.
(The engine itself has its own versions, which aren't visible within the
browser. These versions follow the 'recommended specification' from the ECMA
spec.)
Please see extensive comments to 42221.
See also encodeURI and friends, now officially part of the ECMA spec. They're
an effort to solve this problem without interacting with legacy server behavior.
Marking WONTFIX... but many thanks for the close look at how we behave.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → WONTFIX
Comment 11•23 years ago
|
||
*** Bug 96716 has been marked as a duplicate of this bug. ***
Comment 12•23 years ago
|
||
I would love to see the "extensive comments" in bug 42221, but it is protected.
Please open it. This is still a bug, especially since it intentionally violates
the EMCA spec for %uwxyz output from the escape function. IE does this as
spec'd, so I don't see a valid argument about legacy server behavior--the
server will still need to deal with IE, which is more and more likely as time
goes on.
As near as I can tell, encodeURI and encodeURIComponent still require the
server to know the document encoding (correct me if I'm misunderstanding this).
Using escape as spec'd, the document encoding is not necessary. The argument
about breaking legacy JavaScript code also isn't persuasive: IE made the change
quite a while ago and there are many other places in Mozilla that break legacy
code (lack of support for layers among other things).
Comment 13•23 years ago
|
||
Bug 42221 is now world readable, reopening this to reconsider given that this
breaks IE compatibility. What about NS4x compatibility, does this break that too?
Status: VERIFIED → REOPENED
Resolution: WONTFIX → ---
Comment 14•23 years ago
|
||
Thanks for opening bug 42221, JST. In bug 42221 comment #44 from 2000-08-24,
Mike McCabe says:
"The issue isn't with escape and unescape per se, but with the fact that string
information that is to be transmitted to servers primarily passes through them.
Interaction with servers evolved back when JavaScript used 8-bit characters in
strings, and more importantly, when JavaScript worked directly with the document
charactacter encoding rather than with a unicode reflection of it.
After we moved escape and unescape into the JS engine in an effort to keep up
with the evolving ECMA standard, all of these server interactions broke, as they
were now getting unicode instead of, say, shift-JIS. We ended up keeping the
ECMA-draft versions in the engine (note that they are now not part of the
standard, because of this issue) and Joki implemented escape and unescape
functions in the DOM, where they had access to the document character encoding
and could to back-and-forth translations."
If I'm reading this correctly, the DOM escape will escape a character according
to the document character encoding, while the JS engine's escape puts it into
the %uwxyz unicode format. Given IE's behavior of doing the latter and it's
market dominance, I find it hard to believe that there are many servers that
still use the legacy behavior. Whether or not this is officially part of the
EMCA standard, I'd rather see Mozilla be compatible with IE at this point than
Netscape 4.
Comment 15•23 years ago
|
||
nhotta, can you take a look at this ?
Assignee: ftang → nhotta
Status: REOPENED → NEW
Updated•23 years ago
|
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.2alpha
Comment 16•22 years ago
|
||
*** Bug 121275 has been marked as a duplicate of this bug. ***
Comment 17•22 years ago
|
||
Sounds little serious & can cause serious effects on data security.
Raising priority to p2.
Severity major. Backward compatibility is defeating too.
Severity: normal → major
Priority: P3 → P2
Comment 18•22 years ago
|
||
Updated•22 years ago
|
Target Milestone: mozilla1.2alpha → ---
Comment 19•22 years ago
|
||
In the test case, 2&3 are not bug.
2. unescape("%u0068ello")
The % encode is % followed by hex digits. This data is not URL encoded correctly.
3. unescape("%u1E85ow")
he % encode is % followed by hex digits. This data is not URL encoded correctly.
The first case is a bug.
1. escape("\u015Fello")
The character \u015F cannot be encoded to the document charset which is
ISO-8859-1 for the test case (since no charset is specified in the document so
the default ISO-8859-1 is used). If I set the document charset to UTF-8 using
the menu then the character is escaped correctly as "%C5%9F".
So we could change to fallback to UTF-8 when escaping character is out of the
current document charset range.
Comment 20•22 years ago
|
||
*** Bug 191139 has been marked as a duplicate of this bug. ***
Comment 21•21 years ago
|
||
Anyone still paying attention to this one? It seems like an easy fix,
considering the correct escape() function already exists. I'm sure we could
come up with a compromise if anyone is actually worried about 4x compatibility
here.
I've attached a small html file that makes it easy to run strings through
window.escape() to see what comes out.
Assignee | ||
Comment 22•21 years ago
|
||
This is indeed a serious bug. I've just tested Opera 7.2 and
Konqueror(http://i18nl10n.com/moztest/jstest.euckr.html). Both of them, in
addition to MS IE, do the right thing. Only Mozilla is NOT compliant to the
standard. It's rather ironic that Mozilla doesn't abide by the standard given
that ECMAscript originated as Javascript at Netscape.
It's a rather easy fix except for some code duplication issue. See also bug
224772 and bug 200984.
Assignee | ||
Comment 23•21 years ago
|
||
taking (hope nhotta doesn't mind :-)). patch coming up shortly
Assignee: nhottanscp → jshin
Status: ASSIGNED → NEW
Assignee | ||
Comment 24•21 years ago
|
||
Escape() loops twice through the input string, once for the output length
calculation and once for the actual conversion. We may be better off just
looping once with the capaticy set to a reasonable estimate. What do you think,
bz?
BTW, note that this doesn't result in any regression against bug 200984. With
MS IE, Konqueror, and Opera/Safari doing all what this patch does, the chance
of breaking some javascript code in the wild is very low.
Comment 25•21 years ago
|
||
How well can we guess the needed capacity? It's only to within a factor of 6 or
so, no?
Assignee | ||
Comment 26•21 years ago
|
||
be, who would be a reviewer for JS compliance? Would you ? (waldemar is not
available, right?).
bz, in general factor of six, but we can call IsASCII (or equivalent) to narrow
it down to factor of 3. Alternatively, if the penalty for the wrong guess is not
so high, we can take a chance by using the expectation value assuming all ASCII
character has the same frequency (if ISASCII is true). I'm not familiar with the
string internal. So, I'll do what you think is the best.
Status: NEW → ASSIGNED
Keywords: intl
Comment 27•21 years ago
|
||
Frankly, I think we should do the simplest code we can here, as a start. Either
the factor-of-6 guess or what you did is just fine.
Assignee | ||
Comment 28•21 years ago
|
||
Comment on attachment 135177 [details] [diff] [review]
patch
Asking for r/sr.
Let's just go with the current patch as bz is all right with either way.
I guess JS compliance review is not really necessary. Nonetheless, I'd be glad
to get review of brendan as well if he has something to say here.
Attachment #135177 -
Flags: superreview?(bz-vacation)
Attachment #135177 -
Flags: review?(smontagu)
Comment 29•21 years ago
|
||
I won't be able to review this for probably at least a week.
Comment 30•21 years ago
|
||
Comment on attachment 135177 [details] [diff] [review]
patch
Why not just use the JS engine's escape and unescape (don't #ifndef
MOZILLA_CLIENT them in js/src/js{api,str}.c), rather than have the DOM
implement them over again?
/be
Comment 31•21 years ago
|
||
Also note the encodeURI() and decodeURI() functions. I don't know if
these would be appropriate or better for handling Unicode characters -
Comment 32•21 years ago
|
||
> Why not just use the JS engine's escape and unescape
Because that would unfortunately break the world. Or it did when it was done
that way. See bug 22594
Comment 33•21 years ago
|
||
In particular, see http://bugzilla.mozilla.org/show_bug.cgi?id=22594#c12
Comment 34•21 years ago
|
||
bz: first, isn't the patch in this bug also going to break the world? I don't
see it using the document charset (maybe I'm blind).
Second, this bug and others have argued that IE won, it doesn't have the legacy
escape/unescape hack, so why should we? Does Safari?
/be
Assignee | ||
Comment 35•21 years ago
|
||
bz, brendan is right that my patch behaves exactly the same as JS engine
version. I thought there was another reason to roll out a separate DOM
version, but it turned out that that's not the case. so, it makes sense just to use
JS engine version.
Comment 36•21 years ago
|
||
Brendan, I haven't read this patch yet... ;)
jshin, what does passing a random accented char to escape() on an ISO-8859-1
page produce in IE, out of curiousity?
I'm all for removing redundant code if we can. ;)
Assignee | ||
Comment 37•21 years ago
|
||
bz : If they're below U+0100, they'll be escaped as '%hh'. If not, they'll be
escaped as '%uhhhh'. It doesn't matter whether the page is in ISO-8859-1 or
Shift_JIS or UTF-8.
The behavior is all identical across the spectrum of major browsers, MS IE,
Opera, Konqueror (and I guess Safari, too because Safari is based on
Konqueror). Only Mozilla is NOT compliant to the ECMA standard and doing
so breaks the world in 2003 :-)
brendan, do you know which of a dozen interfaces (implemented by
nsGlobalWindowImpl) defines escape/unescape overriding the JS engine?
bug 22594 doesn't have any patch and the software archealogy is not fun :-) If
you can't give an answer off the top of the head, I'll track it down
Assignee | ||
Comment 38•21 years ago
|
||
I found it. It's in nsIDOMWindowInternal.idl. Patch coming up.
Comment 39•21 years ago
|
||
Ah, cool. Bring on the code removal, then. On the assumption that we will be
going with that patch, adding all sorts of fun deps that should be fixed or
maybe fixed (one of these is invalid, but may get fixed anyway).
Also, I thought we had a bug on window.escape totally failing to deal on UTF-16
encoded pages, but I can't find it.
Comment 40•21 years ago
|
||
I wonder which has better code footprint, the C++ escape and unescape in the
patch here, or the C ones mccabe or rogerl wrote in js/src/jsstr.c?
/be
Assignee | ||
Comment 41•21 years ago
|
||
brendan, do you want me to build an optimized build and check it out? just
measuring the binary size is sufficient or should I use a more sophisticated
method? is it worth the trouble?
Assignee | ||
Comment 42•21 years ago
|
||
it seems like we don't have a choice because escape/unescape in the JS engine
are also used internally so that we can't remove them, can we?
Comment 43•21 years ago
|
||
Comment on attachment 135217 [details] [diff] [review]
alternative :(removing DOM version of escape/unescape)
r=me on the js changes, and if I'm a dom peer, on all changes.
jshin, just curious about code size. If you have time and are curious, you can
use size on Linux (or codesighs) on builds with the first patch and with this
patch.
/be
Attachment #135217 -
Flags: review+
Comment 44•21 years ago
|
||
Comment on attachment 135217 [details] [diff] [review]
alternative :(removing DOM version of escape/unescape)
sr=bzbarsky. If this is what IE does, r=bzbarsky on the DOM changes too.
As a bonus, this makes escape() work in JS components with no windows, no? ;)
Attachment #135217 -
Flags: superreview+
Comment 45•21 years ago
|
||
The JS engine escape() function escapes ASCII characters '&', '=', and '?':
js> escape('&');
%26
js> escape('=');
%3D
js> escape('?');
%3F
If we revert to this, won't we break form submission? See bug 174628.
Comment 46•21 years ago
|
||
Phil, do you know why does the JS engine's escape does that? Crazy.
/be
Comment 47•21 years ago
|
||
Phil, the patch in bug 174628 involved switching
GlobalWindowImpl::OpenInternal() from using GlobalWindowImpl::Escape to escape
the URL to directly calling nsEscape with a slightly different set of escape
flags that does not escape those chars. So changes to how window.escape works
will not affect the testcases in bug 174628 if I understand them correctly. See
comment at
http://lxr.mozilla.org/seamonkey/source/dom/src/base/nsGlobalWindow.cpp#4587 and
code following.
Assignee | ||
Comment 48•21 years ago
|
||
bz is right. this change doesn't affect what's fixed in bug 174628.
I just checked in the fix.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago → 21 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•21 years ago
|
Attachment #135177 -
Flags: superreview?(bz-vacation)
Attachment #135177 -
Flags: review?(smontagu)
Assignee | ||
Comment 49•21 years ago
|
||
brendan, what do you think of back-porting this to 1.4.2?
Comment 50•21 years ago
|
||
jshin, once this cycles and all, test the bugs blocked by it?
Comment 51•21 years ago
|
||
mkaply, are you driving 1.4.2? jshin, set the blocking1.4.2? if this tests well.
/be
Assignee | ||
Comment 52•21 years ago
|
||
Firebird nightly works well. I'm making this block 1.4.2.
Flags: blocking1.4.2?
Comment 53•21 years ago
|
||
Let's give this a little time to bake and then I'll approve it for 1.4.2.
Comment 54•21 years ago
|
||
Looks like this has broke my super-duper XMLHTTPRequest-Powered chat :-).
I used escape() to have strings written in Russian escaped as %NN using current
page's charset (namely, "windows-1251"). Now escape does %uNNNN which server
code doesn't understand. So, is there a way to escape characters using some
specified charset?
Comment 55•21 years ago
|
||
Ivan: You can use encodeURIComponent() to use encode in UTF-8. I don't know if
that helps.
[Note to self: escape()'s new behavior is described in comment 37.]
Comment 56•21 years ago
|
||
This new behaviour is inacceptable like each dropping backward compatibility.
The bug is not the old behaviour but the the ECMA specification, because this
spec unnecessarily breaks compatibility. If the protocol between sender and
receiver was to speak windows-1251 encoded via escape, then it is inacceptable
to change the behaviour of escape to do a transformation to Unicode. The
internal coding of the JS engine is not of interest. If the traditional
behaviour was to send hexcodes using the character encoding of the page, then
this is the correct implementation and can not be changed.
The ECMA specification is only valid for the core language. This core language
does not know anything about document encodings. This means the specification is
contextless. It must not break older traditions implemented in given contexts
like webpages.
If this is possible, then implementing the traditional behaviour as method of
the window object hiding the equal named core function may give the old
behaviour back to those who need it.
escape('Ä'); // escape as unicode
window.escape('Ä'); // escape using the encoding of the page.
This suggestion may be confusing to beginners, but it gives back the traditional
behaviour to those who need it.
if(escape !== window.escape) alert('Yeah, found two implementations.');
This suggestion is ECMA compliant.
Comment 57•21 years ago
|
||
Georg, Netscape lost the browser wars. Mozilla breaking compatibilty with old
Netscape releases may be bad, but it seems worse to be incompatible with Safari,
Opera, and IE.
What's more, your suggestion that escape !== window.escape is not exactly
kosher, and I don't see how it will help the Netscape-only scripts that call
escape, not window.escape, and want it to behave as it did in 4.x.
Can you point to some pages that counted on the old Netscape behavior?
/be
Comment 58•21 years ago
|
||
so are we 100% convinced of this for 1.4.2?
Assignee | ||
Comment 59•21 years ago
|
||
We might have to give up getting this into 1.4.2 because we also have to fix bug
225695 along with this. The change necessary to fix bug 225695 is a lot more
extensive than the fix for this one.
Comment 60•21 years ago
|
||
without a fix for 225695, this is missing the boat.
Flags: blocking1.4.2? → blocking1.4.2-
Comment 61•20 years ago
|
||
*** Bug 232841 has been marked as a duplicate of this bug. ***
You need to log in
before you can comment on or make changes to this bug.
Description
•