Closed Bug 51355 Opened 24 years ago Closed 23 years ago

moz escapes javascript: urls containing non-ascii chars

Categories

(Core :: DOM: HTML Parser, defect, P2)

defect

Tracking

()

VERIFIED FIXED
mozilla0.9.2

People

(Reporter: mikko.rantalainen, Assigned: nhottanscp)

References

Details

(Keywords: highrisk, intl, Whiteboard: [PDT+]wait for tree open to check in)

Attachments

(11 files)

Any "special" character in href causes javascript to fail because browser
escapes all "special" characters. Following code works in Netscape Communicator
4.73 but two last fail under mozilla (build ID:2000090421 - quite probably
others builds too).

<html>
<a href='javascript:alert("Hello World!")'>Hello World!</a><br>
<a href='javascript:alert("Hello Wörld!")'>Hello Wörld!</a><br>
<a href='javascript:alert("Hello W&#246;rld!")'>Hello W&#246;rld!</a><br> </html>

As seen in the status bar mozilla translates second and third href as
'javascript:alert(%22Hello%20W%F6rld!%22)' which doesn't work. I'm not sure
whether the second href should work or not but IMHO third one should because I
have escaped &ouml; character. (Escaping " as %22 doesn't make any difference
either).

IMO correct solution would be to use string between " or ' characters in whole
with no other translation but *unescaping* before scripting.
Even NN4.73, if you type this into the URL bar:

              javascript:alert("Hello W&#246;rld!");

it doesn't work. There seems to be browser code that escapes the
characters if they are part of HTML. Therefore I'm sending this bug
over to the Parser component for further triage, as it doesn't seem
to be a JS Engine issue -
Assignee: rogerl → rickg
Status: UNCONFIRMED → NEW
Component: Javascript Engine → Parser
Ever confirmed: true
QA Contact: pschwartau → janc
I don't believe we plan to support javascript in attributes. 
Vidur?
marking wont fix, since I don't think we're supporting js attributes.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → WONTFIX
javascript: in a href is not the same as JavaScript entities. We're not 
supporting the latter. The former should and does work. The entities and Unicode 
characters in the attributes in the examples shown should be correctly 
preserved.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Vidur's right, my mistake. Note that the parsing engine is passing the href's 
along unchanged. I suspect the culprit is either the link handling code or the 
sink. I'll dig a bit more.
Status: REOPENED → ASSIGNED
I've now confirmed that the javascript attributes are being parsed and stored as 
written. I now suspect that the anchor handling code is converting these before 
showing them.
Reassigning to trudele for triage. I think the content model is well formed, and 
that the problem is in the link handling code.
Assignee: rickg → trudelle
Status: ASSIGNED → NEW
Keywords: nsbeta3
Priority: P3 → P2
Thanks to billlaw, I've tracked this down to nsIOService::Escape(). The valid 
nsString is being converted to a whacky C-String (hence the %20's). Please take 
a look.
Assignee: trudelle → warren
Note: this may also be the solution to bug 40469 (-?)
The problem specifically is the use of Escape here: 
http://lxr.mozilla.org/seamonkey/source/layout/html/content/src/nsHTMLUtils.cpp#
150

We are converting all questionable characters in the string to their escaped 
values. Probably what we should do is construct the appropriate kind of url for 
the protocol in question (here, javascript:), and let the protocol's url parser 
deal with the escaping (if necessary). For javascript: we'll want to escape the 
things inside the quotes (I think), but not the entire url string. For "standard 
URLs" (most of the other protocols) we'll escape according to their rules.
Not working on any platform:
2000-10-05-09-MN6 : Windows
2000-10-05-13-MN6 : Mac
2000-10-05-09-MN6 : Linux
 
Changing platform/os to all/all

Nominating for rtm
Keywords: rtm
OS: Linux → All
Hardware: PC → All
Warren, is there a trivial fix for this?  This bug has languished almost a month
in the rtm nomination state.  If we don't need this for rtm, please mark it rtm-.
Whiteboard: [need info]
rtm-, no activity.  please update the bug if you're actually working on it.
Whiteboard: [need info] → [rtm-]
Even a classic non breakable space html entity gives same wrong behavior.  
Try this (attached next) : 
<html><head><title>non breakable space</title></head><body>
<a href='javascript:alert("Hello&nbsp;World!")'>Hello&nbsp;World!</a><br>
or<br>
<a href="javascript:alert('Hello&nbsp;World!')">Hello&nbsp;World!</a>
</body></html>
updated qa contact.
QA Contact: janc → bsharma
*** Bug 75542 has been marked as a duplicate of this bug. ***
The escapes are part of the URL syntax, so it should be just as valid if the
page contained the escaped characters.  Isn't the problem really that
something needs to unescape them?
Reassigning to component's default owner. Warren, hope you don't mind.
Assignee: warren → harishd
The problem is either in the JS engine or in layout.

Reassigning bug to waterson. ccing attinasi & rogerl
Assignee: harishd → waterson
*** Bug 77043 has been marked as a duplicate of this bug. ***
*** Bug 60576 has been marked as a duplicate of this bug. ***
Attached file extended testcase
*** Bug 55468 has been marked as a duplicate of this bug. ***
*** Bug 63314 has been marked as a duplicate of this bug. ***
*** Bug 66055 has been marked as a duplicate of this bug. ***
cc jst@netscape.com,ftang@netscape.com since I'm copying their comments from 
duplicate bugs.  (pschwartau@netscape.com and jag@tty.nl are already cc'ed 
here.)


From bug 55468:

------- Additional Comments From Johnny Stenback 2000-11-21 02:01 -------

Mozilla's javascript: URL handling is a mess, we escape the non-ASCII characters
in their original charset but we never unescape them - also, the escaped URL has
no charset info in it so even if we did unescape the JS URL before executing it
there's no way to know what charset the URL is encoded as... oh my



From bug 66055:

------- Additional Comments From Frank Tang 2001-01-30 11:32 -------

We probably don't care if this only a issue in alert. But I think the alert here
is just use as a simplified test case. The real problem is these JavaScript is
inside the HREF. We escape them since other URL need to be escaped but we didn't
consider the case for JavaScript. What we should do is find out the code between
the layout engine and the javaScript code and unescape it back by using the
document charset.

I think this bug is in the border line of fixing or not fixing. Let's keep this
as future untill we find breakage in the real world case.



From bug 63314:

------- Additional Comments From Peter ``jag'' Annema 2000-12-19 12:29 -------

Actually, this should be escaped (it's supposed to be an URI[0]), so I guess it 
should be unescaped before handing it to the JS engine. I strongly suspect this 
to be a dupe of an existing bug.

[0] http://www.w3.org/TR/html4/struct/links.html#adef-href


------- Additional Comments From Phil Schwartau 2001-04-26 11:06 -------

This WW3 link may be of interest:

     http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars
Keywords: 4xp, intl
Summary: "javascript:code" doesn't work in hrefs properly → moz escapes javascript: urls containing non-ascii chars
cc jst@netscape.com,ftang@netscape.com since I just copied their comments from 
duplicate bugs.
My attached testcase doesn't cover:
- %hh in javascript: URLs that contain non-ascii characters
- %hh in javascript: URLs that do not contain non-ascii characters
- unescaping with the correct charset (?) (from bug 55468)
Keywords: mozilla1.0
*** Bug 69043 has been marked as a duplicate of this bug. ***
Keywords: mostfreq
*** Bug 79602 has been marked as a duplicate of this bug. ***
How did I get this bug?
Assignee: waterson → harishd
The problem seems to be around line: 150 in nsHTMLUtils.cpp ( what rickg had 
pointed out ).

http://lxr.mozilla.org/seamonkey/source/content/html/content/src/nsHTMLUtils.cpp
#150

Back to waterson :-)
Assignee: harishd → waterson
This should probably go to ftang.
Assignee: waterson → ftang
nhotta- can you help this one?
Assignee: ftang → nhotta
I tried additional cases.

<a href='javascript:alert("Hello W\u00F6rld!")'>Hello W&#246;rld!</a><br>
this works.
<a href='javascript:alert("Hello W%F6rld!")'>Hello W&#246;rld!</a><br>
this works but does not unescape "%F6".

Not sure which spec the string should follow, URI, HTML, JS or all?
Is there a spec for the case, href=javascript?
Only \uxxxx seems to be working now.
nhotta: try this
 <a href="Hello W&#246;rld">Hello W&#246;rld</a>
I tried that (it was in the original report) and it did not work.
I assume "&#246;" is decoded to unicode and if that is escaped to "%F6", it
still should works even if that is not unescaped (as my second example).
I am going to look at the actual value in nsHTMLUtils.cpp.
But I would like to know what is really allowed in that string (%xx, \uxxxx,
&#xxx;).
I looked at the value, that was "%22Hello%20W%F6rld!%22" as mentioned in the
original report.
So it has to be unescaped before passed to JavaScript. The problem is when
unescape and convert to unicode, you don't know about the charset.

There is a proposal of adding a hint_charset field to nsIURI. It also proposes
to use UTF-8, so the current conversion and escape code may not be needed, you
just need to set the hint_charset and call NS_ConvertUCS2toUTF8.
It was posted on 5/11/2001 to n.p.m.netlib, cc to people from that post. Is
there a bug for that proposal?
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.2
Filed bug 84032 for adding "hint_charset" to nsIURI.
Asked jst and rpotts for reviews.
Should charset conversion apply to all new protocols by default?  CF bug 81717
(converting \ to / in links should happen for only http,https,ftp rather than
everything except javascript).
I had a look at the attached patch and there's a bunch of issues with this patch:

In nsGenericHTMLElement.cpp:

       nsAutoString href;
-      href.AssignWithConversion(hrefCStr);
+      href = NS_ConvertUTF8toUCS2(hrefCStr);

If you'd make this:

-      nsAutoString href;
-      href.AssignWithConversion(hrefCStr);
+      NS_ConvertUTF8toUCS2 href(hrefCStr);

you'd save a string copy, no?

The changes in nsHTMLUtils.cpp seem like major overkill to me, we convert a
unicode string into a utf8 cstring (and this conversion is very likely to end up
doing an malloc since URL's are often longer than what fits in nsAutoString's
internal buffer) and then create a new URI object and parse the whole URI string
only to find out if the URI string starts with "javascript:"? Wouldn't simple
string manipulation be more efficient (and less code) here?

Oh, is the attached patch a cvs diff -w? If not you better fix the indentation
in nsHTMLUtils.cpp:

+    if (bIsJavascript) {
+      spec = utf8Spec;
+    }
+    else {
     // Otherwise, we'll need to use aDocument to cough up a character
     // set converter, and re-encode the relative portion of the URL as
     ...
     ^
     |____ indent!

nsInputStreamChannel.cpp, can this get more inefficient? :-)

     nsString name;
-    name.AppendWithConversion(urlStr);
+    name.Append(NS_ConvertUTF8toUCS2(urlStr));
     *result = name.ToNewUnicode();

How about:

-    nsString name;
-    name.AppendWithConversion(urlStr);
-    *result = name.ToNewUnicode();
+    *result = ToNewUnicode(NS_ConvertUTF8toUCS2(urlStr));

in nsSimpleURI::SetSpec():

     nsAutoString spec;
-    spec.AssignWithConversion(aSpec);
+    spec = NS_ConvertUTF8toUCS2(aSpec);

Any reason to not make 'spec' be of type NS_ConvertUTF8toUCS2 in stead of being
of type nsAutoString?

nsJSProtocolHandler.cpp:

         nsAutoString scriptString;
-        scriptString.AssignWithConversion(script);
+        scriptString.Assign(NS_ConvertUTF8toUCS2(script));
         rv = scriptContext->EvaluateString(scriptString,

Why not:

-        nsAutoString scriptString;
-        scriptString.AssignWithConversion(script);
+        NS_ConvertUTF8toUCS2 scriptString(script);
         rv = scriptContext->EvaluateString(scriptString,

Fix those issues and I'll have one more look...
Jesse, I am not clear about your question. Charset conversion does not convert \
to /, that unrelated.
Much better! One more thing, in nsHTMLUtils.cpp:

+    PRInt32 pos = aSpec.Find(":");
+    nsAutoString scheme;
+    if ((pos != -1) &&
+        (aSpec.Left(scheme, pos) != -1) &&
+         scheme.EqualsIgnoreCase("javascript")) {
+      spec.Assign(NS_ConvertUCS2toUTF8(aSpec.GetUnicode()));
+    }

This code will end up doing aSpec.Left() and scheme.EqualsIgnoreCase() for every
URL when we could limit the cases to where the ':' is found at the right place,
i.e. if ((pos == nsCRT::strlen("javascript")) && ...), or something like that.

Oh, and use aSpec.FindChar(':') when looking for a single character.

Make those changes and you'll have sr=jst
That part only executed if the URI contains non ASCII characters. Anyway, faster
is better, I will update the patch.
I don't see a reason for the pos != -1 check here, if you're worried about the
performance hit from nsCRT::strlen("javascript") in the cases where a ':' isn't
found in the URI then you could make the check be simply:

  if (pos == 10 /* strlen("javascript") */ &&
      ...

which would be even faster and even less code, and you could actually optimize
this even further by never searching for the ':' character, in stead you could
simply check if the 10th character is a ':' (after making sure the string is
longer than 10 characters) and then do the check if the data before the ':' was
"javascript" or something else, but since this is not performance critical or
anything, I'm ok with this as is, except for the "pos != -1" check. sr=jst if
you remove that check...
sr on the nsSimpleURI and nsInputStreamChannel string changes.  I wonder if 
there are other places in necko that we should be doing this?.  
Under mozilla/netwerk, there are
59 of ToNewCstring,
43 of AssignWithConversion,
11 of NS_ConvertASCIItoUCS2

Changing them to ToNewUTF8String, NS_ConvertUTF8toUCS2 would not break ASCII
(UTF-8 is a superset of ASCII), though I am not sure about performance difference.
remove the N6 time [rtm-] since it cause confusion.
Whiteboard: [rtm-]
Blocks: 83989
The current patch makes it so javascript: is the only protocol where the URL
isn't subject to charset conversion.  Is this what we want to do, or do we want
to list the protocols that *do* get converted?  (I'm bringing this up because in
bug 81717, another URL conversion, converting \ to /, is being applied to
everything except javascript: and causing data: URLs to break.)
>I wonder if 
there are other places in necko that we should be doing this?.  
Let's sperate that into different bug. Let's deal with known issue for now and 
unknown issue later. 

Whiteboard: r/sr=jst+dougt, need a=
Jesse, I think it's fine to add other protocols if they also want UTF-8 string.
If there is a specific example, please file a separate bug. Note that protocol
handlers may also need change in order to accept UTF-8 URI.
Testing information:
I ran http://www.mozilla.org/quality/precheckin-tests.html which include http
and ftp. Tested links in 4.01 spec http://www.w3.org/TR/html4/.
Also tried non ASCII cases http://www.fcenter.ru/wn/hn02082000.htm to make sure
non javascript case is not affected.
Filed separately as bug 85310 about the remaining ToNewCString in necko.
pdt+ base on 6/11 pdt meeting.
Whiteboard: r/sr=jst+dougt, need a= → [PDT+]r/sr=jst+dougt, need a=
I review the 06/08/01 10:02 patch, I feel the change is right but have high risk 
except the nsHTMLUtils.cpp part. I afraid this may cause some unknown regression 
since some changs is in the low leverl code. 

In this late cycle, I suggest we hold this patch and ask naoki to work on a work 
around changes which only in nsHTMLUtils.cpp. In the case of JavaScript protocol, 
instead of convert from Unicode to UTF8, we could convert to the javascript 
escape \u for non ASCII data. In this way, we can gunarantee it only impact 
JavaScript protocol but not other code. 

We should definitely consider the rest of the 06/08/01 10:02 patch after we 
branch off from moz.0.9.3. It is safer to test that kind of change in the early 
development cycle. 

Adding highrisk and patch keywords based on ftang's comments.
Keywords: highrisk, patch
Frank reviewed the patch. His comments,
* use nsCRT::IsAscii
* check ECMA script spec for \uxxxx format
From the ECMA script spec,
http://www.ecma.ch/ecma1/stand/ecma-262.htm
> UnicodeEscapeSequence :: See 7.6
> \u HexDigit HexDigit HexDigit HexDigit

So it looks like it expects four digits although I didn't find a place which
says it had to be four digits.
I found that the current mozilla does not accept non four digits like \uF6, it
has to be \u00F6.

r=ftang
I would've put the \u as part of the format string to PR_snprintf() (and made
sure there's room for that too in buf) in stead of first appending \u to the
string and then doing the printf and then appending the result of printf to the
string, that would've been less code, but both ways work.

sr=jst
+    nsAutoString scheme;
+    if ((pos == (PRInt32) nsCRT::strlen("javascript")) &&
+        (aSpec.Left(scheme, pos) != -1) &&
+         scheme.EqualsIgnoreCase("javascript")) {

FYI, here's a way to consolidate the string constant and avoid the strlen of
that constant, without wiring in 10 (although 10 is well-known to be the length
of "javascript"!):

+    static const char kJavaScript[] = "javascript";
+    nsAutoString scheme;
+    if ((pos == (PRInt32)(sizeof kJavaScript - 1)) &&
+        (aSpec.Left(scheme, pos) != -1) &&
+         scheme.EqualsIgnoreCase(kJavaScript")) {

I agree with what jst said.  Instead of:

+      char buf[5];
+      spec.Truncate(0);
+      for (const PRUnichar* uch = aSpec.GetUnicode(); *uch; ++uch) {
+        if (!nsCRT::IsAscii(*uch)) {
+          spec.Append("\\u");
+          PR_snprintf(buf, sizeof(buf), "%.4x", *uch);
+          spec.Append(buf);

Why not

+      char buf[6+1];	// space for \uXXXX plus a NUL at the end
+      spec.Truncate(0);
+      for (const PRUnichar* uch = aSpec.GetUnicode(); *uch; ++uch) {
+        if (!nsCRT::IsAscii(*uch)) {
+          PR_snprintf(buf, sizeof(buf), "\\u%.4x", *uch);
+          spec.Append(buf);

/be
Whiteboard: [PDT+]r/sr=jst+dougt, need a= → [PDT+]need sr=
*** Bug 86092 has been marked as a duplicate of this bug. ***
sr=jst
Whiteboard: [PDT+]need sr= → [PDT+]need a=
a= asa@mozilla.org for checkin to the trunk.
(on behalf of drivers)
Whiteboard: [PDT+]need a= → [PDT+]wait for tree open to check in
checked in
Status: ASSIGNED → RESOLVED
Closed: 24 years ago23 years ago
Resolution: --- → FIXED
I checked now. It work! Thanx!  from Japan
verified win2k 2001061905
         macos 2001061908
         linux 2001061914
Status: RESOLVED → VERIFIED
*** Bug 88220 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: