Closed Bug 51355 Opened 24 years ago Closed 23 years ago

moz escapes javascript: urls containing non-ascii chars

Categories

(Core :: DOM: HTML Parser, defect, P2)

defect

Tracking

()

VERIFIED FIXED
mozilla0.9.2

People

(Reporter: mikko.rantalainen, Assigned: nhottanscp)

References

Details

(Keywords: highrisk, intl, Whiteboard: [PDT+]wait for tree open to check in)

Attachments

(11 files)

Any "special" character in href causes javascript to fail because browser escapes all "special" characters. Following code works in Netscape Communicator 4.73 but two last fail under mozilla (build ID:2000090421 - quite probably others builds too). <html> <a href='javascript:alert("Hello World!")'>Hello World!</a><br> <a href='javascript:alert("Hello Wörld!")'>Hello Wörld!</a><br> <a href='javascript:alert("Hello W&#246;rld!")'>Hello W&#246;rld!</a><br> </html> As seen in the status bar mozilla translates second and third href as 'javascript:alert(%22Hello%20W%F6rld!%22)' which doesn't work. I'm not sure whether the second href should work or not but IMHO third one should because I have escaped &ouml; character. (Escaping " as %22 doesn't make any difference either). IMO correct solution would be to use string between " or ' characters in whole with no other translation but *unescaping* before scripting.
Even NN4.73, if you type this into the URL bar: javascript:alert("Hello W&#246;rld!"); it doesn't work. There seems to be browser code that escapes the characters if they are part of HTML. Therefore I'm sending this bug over to the Parser component for further triage, as it doesn't seem to be a JS Engine issue -
Assignee: rogerl → rickg
Status: UNCONFIRMED → NEW
Component: Javascript Engine → Parser
Ever confirmed: true
QA Contact: pschwartau → janc
I don't believe we plan to support javascript in attributes. Vidur?
marking wont fix, since I don't think we're supporting js attributes.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → WONTFIX
javascript: in a href is not the same as JavaScript entities. We're not supporting the latter. The former should and does work. The entities and Unicode characters in the attributes in the examples shown should be correctly preserved.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Vidur's right, my mistake. Note that the parsing engine is passing the href's along unchanged. I suspect the culprit is either the link handling code or the sink. I'll dig a bit more.
Status: REOPENED → ASSIGNED
I've now confirmed that the javascript attributes are being parsed and stored as written. I now suspect that the anchor handling code is converting these before showing them.
Reassigning to trudele for triage. I think the content model is well formed, and that the problem is in the link handling code.
Assignee: rickg → trudelle
Status: ASSIGNED → NEW
Keywords: nsbeta3
Priority: P3 → P2
Thanks to billlaw, I've tracked this down to nsIOService::Escape(). The valid nsString is being converted to a whacky C-String (hence the %20's). Please take a look.
Assignee: trudelle → warren
Note: this may also be the solution to bug 40469 (-?)
The problem specifically is the use of Escape here: http://lxr.mozilla.org/seamonkey/source/layout/html/content/src/nsHTMLUtils.cpp# 150 We are converting all questionable characters in the string to their escaped values. Probably what we should do is construct the appropriate kind of url for the protocol in question (here, javascript:), and let the protocol's url parser deal with the escaping (if necessary). For javascript: we'll want to escape the things inside the quotes (I think), but not the entire url string. For "standard URLs" (most of the other protocols) we'll escape according to their rules.
Not working on any platform: 2000-10-05-09-MN6 : Windows 2000-10-05-13-MN6 : Mac 2000-10-05-09-MN6 : Linux Changing platform/os to all/all Nominating for rtm
Keywords: rtm
OS: Linux → All
Hardware: PC → All
Warren, is there a trivial fix for this? This bug has languished almost a month in the rtm nomination state. If we don't need this for rtm, please mark it rtm-.
Whiteboard: [need info]
rtm-, no activity. please update the bug if you're actually working on it.
Whiteboard: [need info] → [rtm-]
Even a classic non breakable space html entity gives same wrong behavior. Try this (attached next) : <html><head><title>non breakable space</title></head><body> <a href='javascript:alert("Hello&nbsp;World!")'>Hello&nbsp;World!</a><br> or<br> <a href="javascript:alert('Hello&nbsp;World!')">Hello&nbsp;World!</a> </body></html>
updated qa contact.
QA Contact: janc → bsharma
*** Bug 75542 has been marked as a duplicate of this bug. ***
The escapes are part of the URL syntax, so it should be just as valid if the page contained the escaped characters. Isn't the problem really that something needs to unescape them?
Reassigning to component's default owner. Warren, hope you don't mind.
Assignee: warren → harishd
The problem is either in the JS engine or in layout. Reassigning bug to waterson. ccing attinasi & rogerl
Assignee: harishd → waterson
*** Bug 77043 has been marked as a duplicate of this bug. ***
*** Bug 60576 has been marked as a duplicate of this bug. ***
Attached file extended testcase
*** Bug 55468 has been marked as a duplicate of this bug. ***
*** Bug 63314 has been marked as a duplicate of this bug. ***
*** Bug 66055 has been marked as a duplicate of this bug. ***
cc jst@netscape.com,ftang@netscape.com since I'm copying their comments from duplicate bugs. (pschwartau@netscape.com and jag@tty.nl are already cc'ed here.) From bug 55468: ------- Additional Comments From Johnny Stenback 2000-11-21 02:01 ------- Mozilla's javascript: URL handling is a mess, we escape the non-ASCII characters in their original charset but we never unescape them - also, the escaped URL has no charset info in it so even if we did unescape the JS URL before executing it there's no way to know what charset the URL is encoded as... oh my From bug 66055: ------- Additional Comments From Frank Tang 2001-01-30 11:32 ------- We probably don't care if this only a issue in alert. But I think the alert here is just use as a simplified test case. The real problem is these JavaScript is inside the HREF. We escape them since other URL need to be escaped but we didn't consider the case for JavaScript. What we should do is find out the code between the layout engine and the javaScript code and unescape it back by using the document charset. I think this bug is in the border line of fixing or not fixing. Let's keep this as future untill we find breakage in the real world case. From bug 63314: ------- Additional Comments From Peter ``jag'' Annema 2000-12-19 12:29 ------- Actually, this should be escaped (it's supposed to be an URI[0]), so I guess it should be unescaped before handing it to the JS engine. I strongly suspect this to be a dupe of an existing bug. [0] http://www.w3.org/TR/html4/struct/links.html#adef-href ------- Additional Comments From Phil Schwartau 2001-04-26 11:06 ------- This WW3 link may be of interest: http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars
Keywords: 4xp, intl
Summary: "javascript:code" doesn't work in hrefs properly → moz escapes javascript: urls containing non-ascii chars
cc jst@netscape.com,ftang@netscape.com since I just copied their comments from duplicate bugs.
My attached testcase doesn't cover: - %hh in javascript: URLs that contain non-ascii characters - %hh in javascript: URLs that do not contain non-ascii characters - unescaping with the correct charset (?) (from bug 55468)
Keywords: mozilla1.0
*** Bug 69043 has been marked as a duplicate of this bug. ***
Keywords: mostfreq
*** Bug 79602 has been marked as a duplicate of this bug. ***
How did I get this bug?
Assignee: waterson → harishd
The problem seems to be around line: 150 in nsHTMLUtils.cpp ( what rickg had pointed out ). http://lxr.mozilla.org/seamonkey/source/content/html/content/src/nsHTMLUtils.cpp #150 Back to waterson :-)
Assignee: harishd → waterson
This should probably go to ftang.
Assignee: waterson → ftang
nhotta- can you help this one?
Assignee: ftang → nhotta
I tried additional cases. <a href='javascript:alert("Hello W\u00F6rld!")'>Hello W&#246;rld!</a><br> this works. <a href='javascript:alert("Hello W%F6rld!")'>Hello W&#246;rld!</a><br> this works but does not unescape "%F6". Not sure which spec the string should follow, URI, HTML, JS or all? Is there a spec for the case, href=javascript? Only \uxxxx seems to be working now.
nhotta: try this <a href="Hello W&#246;rld">Hello W&#246;rld</a>
I tried that (it was in the original report) and it did not work. I assume "&#246;" is decoded to unicode and if that is escaped to "%F6", it still should works even if that is not unescaped (as my second example). I am going to look at the actual value in nsHTMLUtils.cpp. But I would like to know what is really allowed in that string (%xx, \uxxxx, &#xxx;).
I looked at the value, that was "%22Hello%20W%F6rld!%22" as mentioned in the original report. So it has to be unescaped before passed to JavaScript. The problem is when unescape and convert to unicode, you don't know about the charset. There is a proposal of adding a hint_charset field to nsIURI. It also proposes to use UTF-8, so the current conversion and escape code may not be needed, you just need to set the hint_charset and call NS_ConvertUCS2toUTF8. It was posted on 5/11/2001 to n.p.m.netlib, cc to people from that post. Is there a bug for that proposal?
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.2
Filed bug 84032 for adding "hint_charset" to nsIURI.
Asked jst and rpotts for reviews.
Should charset conversion apply to all new protocols by default? CF bug 81717 (converting \ to / in links should happen for only http,https,ftp rather than everything except javascript).
I had a look at the attached patch and there's a bunch of issues with this patch: In nsGenericHTMLElement.cpp: nsAutoString href; - href.AssignWithConversion(hrefCStr); + href = NS_ConvertUTF8toUCS2(hrefCStr); If you'd make this: - nsAutoString href; - href.AssignWithConversion(hrefCStr); + NS_ConvertUTF8toUCS2 href(hrefCStr); you'd save a string copy, no? The changes in nsHTMLUtils.cpp seem like major overkill to me, we convert a unicode string into a utf8 cstring (and this conversion is very likely to end up doing an malloc since URL's are often longer than what fits in nsAutoString's internal buffer) and then create a new URI object and parse the whole URI string only to find out if the URI string starts with "javascript:"? Wouldn't simple string manipulation be more efficient (and less code) here? Oh, is the attached patch a cvs diff -w? If not you better fix the indentation in nsHTMLUtils.cpp: + if (bIsJavascript) { + spec = utf8Spec; + } + else { // Otherwise, we'll need to use aDocument to cough up a character // set converter, and re-encode the relative portion of the URL as ... ^ |____ indent! nsInputStreamChannel.cpp, can this get more inefficient? :-) nsString name; - name.AppendWithConversion(urlStr); + name.Append(NS_ConvertUTF8toUCS2(urlStr)); *result = name.ToNewUnicode(); How about: - nsString name; - name.AppendWithConversion(urlStr); - *result = name.ToNewUnicode(); + *result = ToNewUnicode(NS_ConvertUTF8toUCS2(urlStr)); in nsSimpleURI::SetSpec(): nsAutoString spec; - spec.AssignWithConversion(aSpec); + spec = NS_ConvertUTF8toUCS2(aSpec); Any reason to not make 'spec' be of type NS_ConvertUTF8toUCS2 in stead of being of type nsAutoString? nsJSProtocolHandler.cpp: nsAutoString scriptString; - scriptString.AssignWithConversion(script); + scriptString.Assign(NS_ConvertUTF8toUCS2(script)); rv = scriptContext->EvaluateString(scriptString, Why not: - nsAutoString scriptString; - scriptString.AssignWithConversion(script); + NS_ConvertUTF8toUCS2 scriptString(script); rv = scriptContext->EvaluateString(scriptString, Fix those issues and I'll have one more look...
Jesse, I am not clear about your question. Charset conversion does not convert \ to /, that unrelated.
Much better! One more thing, in nsHTMLUtils.cpp: + PRInt32 pos = aSpec.Find(":"); + nsAutoString scheme; + if ((pos != -1) && + (aSpec.Left(scheme, pos) != -1) && + scheme.EqualsIgnoreCase("javascript")) { + spec.Assign(NS_ConvertUCS2toUTF8(aSpec.GetUnicode())); + } This code will end up doing aSpec.Left() and scheme.EqualsIgnoreCase() for every URL when we could limit the cases to where the ':' is found at the right place, i.e. if ((pos == nsCRT::strlen("javascript")) && ...), or something like that. Oh, and use aSpec.FindChar(':') when looking for a single character. Make those changes and you'll have sr=jst
That part only executed if the URI contains non ASCII characters. Anyway, faster is better, I will update the patch.
I don't see a reason for the pos != -1 check here, if you're worried about the performance hit from nsCRT::strlen("javascript") in the cases where a ':' isn't found in the URI then you could make the check be simply: if (pos == 10 /* strlen("javascript") */ && ... which would be even faster and even less code, and you could actually optimize this even further by never searching for the ':' character, in stead you could simply check if the 10th character is a ':' (after making sure the string is longer than 10 characters) and then do the check if the data before the ':' was "javascript" or something else, but since this is not performance critical or anything, I'm ok with this as is, except for the "pos != -1" check. sr=jst if you remove that check...
sr on the nsSimpleURI and nsInputStreamChannel string changes. I wonder if there are other places in necko that we should be doing this?.
Under mozilla/netwerk, there are 59 of ToNewCstring, 43 of AssignWithConversion, 11 of NS_ConvertASCIItoUCS2 Changing them to ToNewUTF8String, NS_ConvertUTF8toUCS2 would not break ASCII (UTF-8 is a superset of ASCII), though I am not sure about performance difference.
remove the N6 time [rtm-] since it cause confusion.
Whiteboard: [rtm-]
Blocks: 83989
The current patch makes it so javascript: is the only protocol where the URL isn't subject to charset conversion. Is this what we want to do, or do we want to list the protocols that *do* get converted? (I'm bringing this up because in bug 81717, another URL conversion, converting \ to /, is being applied to everything except javascript: and causing data: URLs to break.)
>I wonder if there are other places in necko that we should be doing this?. Let's sperate that into different bug. Let's deal with known issue for now and unknown issue later.
Whiteboard: r/sr=jst+dougt, need a=
Jesse, I think it's fine to add other protocols if they also want UTF-8 string. If there is a specific example, please file a separate bug. Note that protocol handlers may also need change in order to accept UTF-8 URI.
Testing information: I ran http://www.mozilla.org/quality/precheckin-tests.html which include http and ftp. Tested links in 4.01 spec http://www.w3.org/TR/html4/. Also tried non ASCII cases http://www.fcenter.ru/wn/hn02082000.htm to make sure non javascript case is not affected.
Filed separately as bug 85310 about the remaining ToNewCString in necko.
pdt+ base on 6/11 pdt meeting.
Whiteboard: r/sr=jst+dougt, need a= → [PDT+]r/sr=jst+dougt, need a=
I review the 06/08/01 10:02 patch, I feel the change is right but have high risk except the nsHTMLUtils.cpp part. I afraid this may cause some unknown regression since some changs is in the low leverl code. In this late cycle, I suggest we hold this patch and ask naoki to work on a work around changes which only in nsHTMLUtils.cpp. In the case of JavaScript protocol, instead of convert from Unicode to UTF8, we could convert to the javascript escape \u for non ASCII data. In this way, we can gunarantee it only impact JavaScript protocol but not other code. We should definitely consider the rest of the 06/08/01 10:02 patch after we branch off from moz.0.9.3. It is safer to test that kind of change in the early development cycle.
Adding highrisk and patch keywords based on ftang's comments.
Keywords: highrisk, patch
Frank reviewed the patch. His comments, * use nsCRT::IsAscii * check ECMA script spec for \uxxxx format
From the ECMA script spec, http://www.ecma.ch/ecma1/stand/ecma-262.htm > UnicodeEscapeSequence :: See 7.6 > \u HexDigit HexDigit HexDigit HexDigit So it looks like it expects four digits although I didn't find a place which says it had to be four digits. I found that the current mozilla does not accept non four digits like \uF6, it has to be \u00F6.
r=ftang
I would've put the \u as part of the format string to PR_snprintf() (and made sure there's room for that too in buf) in stead of first appending \u to the string and then doing the printf and then appending the result of printf to the string, that would've been less code, but both ways work. sr=jst
+ nsAutoString scheme; + if ((pos == (PRInt32) nsCRT::strlen("javascript")) && + (aSpec.Left(scheme, pos) != -1) && + scheme.EqualsIgnoreCase("javascript")) { FYI, here's a way to consolidate the string constant and avoid the strlen of that constant, without wiring in 10 (although 10 is well-known to be the length of "javascript"!): + static const char kJavaScript[] = "javascript"; + nsAutoString scheme; + if ((pos == (PRInt32)(sizeof kJavaScript - 1)) && + (aSpec.Left(scheme, pos) != -1) && + scheme.EqualsIgnoreCase(kJavaScript")) { I agree with what jst said. Instead of: + char buf[5]; + spec.Truncate(0); + for (const PRUnichar* uch = aSpec.GetUnicode(); *uch; ++uch) { + if (!nsCRT::IsAscii(*uch)) { + spec.Append("\\u"); + PR_snprintf(buf, sizeof(buf), "%.4x", *uch); + spec.Append(buf); Why not + char buf[6+1]; // space for \uXXXX plus a NUL at the end + spec.Truncate(0); + for (const PRUnichar* uch = aSpec.GetUnicode(); *uch; ++uch) { + if (!nsCRT::IsAscii(*uch)) { + PR_snprintf(buf, sizeof(buf), "\\u%.4x", *uch); + spec.Append(buf); /be
Whiteboard: [PDT+]r/sr=jst+dougt, need a= → [PDT+]need sr=
*** Bug 86092 has been marked as a duplicate of this bug. ***
sr=jst
Whiteboard: [PDT+]need sr= → [PDT+]need a=
a= asa@mozilla.org for checkin to the trunk. (on behalf of drivers)
Whiteboard: [PDT+]need a= → [PDT+]wait for tree open to check in
checked in
Status: ASSIGNED → RESOLVED
Closed: 24 years ago23 years ago
Resolution: --- → FIXED
I checked now. It work! Thanx! from Japan
verified win2k 2001061905 macos 2001061908 linux 2001061914
Status: RESOLVED → VERIFIED
*** Bug 88220 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: