Closed Bug 153325 Opened 22 years ago Closed 17 years ago

javascript href broken when a variable contains percent escaped umlaut

Categories

(Core :: DOM: Core & HTML, defect)

defect
Not set
normal

RESOLVED WORKSFORME

People

(Reporter: s.a.moeller, Unassigned)

Details

(Keywords: intl)

Attachments

(1 file, 2 obsolete files)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1a) Gecko/20020610
BuildID:    2002061108

A javascript: href is broken when a function argument contains a percent-escaped
umlaut (e.g. %c4).

Reproducible: Always
Steps to Reproduce:
1. use anything like <a href="javascript:alert('%c4')">js link</a>
2. click the link

Actual Results:  nothing happens

Expected Results:  alert box opens (with either %c4, or Ä, or whatever displayed
on it)
Attached file: testcase (obsolete)
Confirmed on 2002 061908 Win2k.  Sending to parser since that's where the
similar bug 51355 was.
Assignee: rogerl → harishd
Component: JavaScript Engine → Parser
OS: Linux → All
QA Contact: pschwartau → moied
Hardware: PC → All
Status: UNCONFIRMED → NEW
Ever confirmed: true
Parser does not process attribute values.

I get an assertion ("not a UTF8 string") when I click on the problem link.

ConvertUTF8toUCS2::write(const char * 0x0012f5f8, unsigned int 10) line 660 + 20 bytes
nsCharSinkTraits<ConvertUTF8toUCS2>::write(ConvertUTF8toUCS2 & {...}, const char * 0x0012f5f8, unsigned int 10) line 571
copy_string(nsReadingIterator<char> & {"????????????????"}, const nsReadingIterator<char> & {"???????????"}, ConvertUTF8toUCS2 & {...}) line 90 + 39 bytes
NS_ConvertUTF8toUCS2::Init(const nsACString & {...}) line 1350 + 35 bytes
NS_ConvertUTF8toUCS2::NS_ConvertUTF8toUCS2(const nsACString & {...}) line 558
nsJSThunk::EvaluateScript() line 284
nsJSChannel::AsyncOpen(nsJSChannel * const 0x03b6e458, nsIStreamListener * 0x03990b98, nsISupports * 0x00000000) line 619 + 11 bytes
nsDocumentOpenInfo::Open(nsIChannel * 0x03b6e458, int 1, nsISupports * 0x03a21270) line 170 + 18 bytes
nsURILoader::OpenURIVia(nsURILoader * const 0x01a4c208, nsIChannel * 0x03b6e458, int 1, nsISupports * 0x03a21270, unsigned int 0) line 538 + 20 bytes
nsURILoader::OpenURI(nsURILoader * const 0x01a4c208, nsIChannel * 0x03b6e458, int 1, nsISupports * 0x03a21270) line 500
nsDocShell::DoChannelLoad(nsIChannel * 0x03b6e458, nsIURILoader * 0x01a4c208) line 5184 + 39 bytes
nsDocShell::DoURILoad(nsIURI * 0x03cb17c8, nsIURI * 0x03c38e48, nsISupports * 0x031c5ce8, nsIInputStream * 0x00000000, nsIInputStream * 0x00000000, int 1, nsIDocShell * * 0x00000000, nsIRequest * * 0x00000000) line 4959 + 38 bytes
nsDocShell::InternalLoad(nsDocShell * const 0x03a21270, nsIURI * 0x03cb17c8, nsIURI * 0x03c38e48, nsISupports * 0x00000000, int 1, const unsigned short * 0x0012fc0c, nsIInputStream * 0x00000000, nsIInputStream * 0x00000000, unsigned int 2097153, nsISHEntry * 0x00000000, int 1, nsIDocShell * * 0x00000000, nsIRequest * * 0x00000000) line 4752 + 51 bytes
nsWebShell::OnLinkClickSync(nsWebShell * const 0x03a213b4, nsIContent * 0x034ce0a0, nsLinkVerb eLinkVerb_Replace, const unsigned short * 0x03d1b9a8, const unsigned short * 0x100e5b00 gCommonEmptyBuffer, nsIInputStream * 0x00000000, nsIInputStream * 0x00000000, nsIDocShell * * 0x00000000, nsIRequest * * 0x00000000) line 619 + 91 bytes
OnLinkClickEvent::HandleEvent() line 462
HandlePLEvent(OnLinkClickEvent * 0x03c8ea80) line 476
PL_HandleEvent(PLEvent * 0x03c8ea80) line 596 + 10 bytes
PL_ProcessPendingEvents(PLEventQueue * 0x01497b18) line 526 + 9 bytes
_md_EventReceiverProc(HWND__ * 0x000702f8, unsigned int 49272, unsigned int 0, long 21592856) line 1077 + 9 bytes
Assignee: harishd → rogerl
Component: Parser → JavaScript Engine
QA Contact: moied → pschwartau
Stefan: nice testcase! Reassigning to DOM Level 0.

Stefan's testcase shows that certain %XX sequences work, but 
that others don't. That reminds me of bug 144429, 
"URL encoding in window.open regressed, URL parsing problem?"


------- Additional Comment_ #11 From Henrik Rundqvist 2002-05-16 05:11 ------- 
Another interesting thing about all this is why "%7E" (tilde)
slips through but "%E4" (a Swedish character) does not?

Study the attached testcase and you'll see what I mean.
Response if you click on "window.open":

  "The requested URL /~Gle was not found on this server."

Notice that the tilde is there. So only some %XX codes get affected?


------- Additional Comment_ #12 From Johnny Stenback 2002-05-16 09:18 ------- 
The reason for tilde going through but Scandinavian characters not going
through is that we assume that the string is a UTF8 string once we've
unescaped it. We try to convert the UTF8 string to a Unicode string,
and that can't be done for non-UTF8 encoded non-ASCII characters such as %E4.
It's a bug in nsJSProtocolHandler.cpp...
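
The behaviour described above can be approximated in plain JavaScript. This is only a sketch by analogy, not the Gecko code: it assumes that decodeURIComponent, which unescapes and then requires valid UTF-8, behaves like the unescape-then-convert step in nsJSProtocolHandler.cpp. The helper name unescapeAsUtf8 is made up for illustration.

```javascript
// Percent-decode a string and interpret the resulting bytes strictly as UTF-8.
// Returns the decoded text, or null when the bytes are not valid UTF-8.
function unescapeAsUtf8(escaped) {
  try {
    return decodeURIComponent(escaped);
  } catch (e) {
    if (e instanceof URIError) {
      return null; // e.g. a lone 0xE4 byte is not a complete UTF-8 sequence
    }
    throw e;
  }
}

console.log(unescapeAsUtf8("%7E"));    // "~": plain ASCII is valid UTF-8
console.log(unescapeAsUtf8("%E4"));    // null: 0xE4 alone is malformed UTF-8
console.log(unescapeAsUtf8("%C3%A4")); // "ä": the UTF-8 encoding of U+00E4
```

The tilde survives because every ASCII byte is also valid UTF-8, while a single byte in the 0x80-0xFF range never is.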
Assignee: rogerl → jst
Component: JavaScript Engine → DOM Level 0
QA Contact: pschwartau → desale
Mass-reassigning bugs to dom_bugs@netscape.com
Assignee: jst → dom_bugs
With the fix for bug 44272, testcase #4 gives me A-umlaut in the alert box if the
character encoding is ISO-8859-1/15. But there's something more in this bug, so
I'm keeping it open. It seems wrong to interpret '%C4' as U+00C4 in JS
string literal. Shouldn't it be considered literal '%C4' (three characters)? I
have to look up the ECMAScript standard.

BTW, if we want to be purists, this bug would be 'WONT FIX'. 'javascript:'
url-scheme is not allowed in href. 

Keywords: intl
jshin: what's the minimal testcase showing

> It seems wrong to interpret '%C4' as U+00C4 in JS
> string literal. Shouldn't it be considered literal '%C4' (three characters).

?

/be
note that javascript: is supported by many browsers including IE (which doesn't
support data:). wontfixing just because the scheme isn't standardized isn't
acceptable.
timeless: yes, javascript: is a de-facto standard we must and will support.  But
who suggested otherwise?

/be
jshin, tail of comment 6, presumably not particularly seriously.
Of course, I was not serious. Note that it's qualified by 'if we .... a purist'.
Anyway, it's a little bit surprising that you're the first to pick it up and
write that 'wontfix' is not acceptable. I thought you'd be more likely to be on
the other side :-p. Also note that this bug was 'almost' fixed thanks to the fix
for bug 44272.

re: comment #6

brendan, what is the following JS code supposed to produce?

document.write('%C6')? 

Three characters, <U+0025 U+0043 U+0036> or a single character <U+00C6>? That
was my question. 

The testcase presents an interesting problem if the answer to the above is the
former because 'JS string literal' is used in a URL.

<a href="javascript:alert('%B0%A1')">Alert</a> 

What should show up inside the alert box if the above is in a
non-ISO-8859-1/non-ISO-8859-15 page? Currently, the URL-unescaping is done
before the JS part is handed over to the JS engine. So, the result is dependent on
the current character encoding. How about this one?

<a href="javascript:alert(unescape('%B0%A1'))">Alert</a>

Compare it with the following, whose result is charset-independent.

<script type="text/javascript">
document.write('%B0%A1');
document.write(unescape('%B0%A1'));
</script>
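
For reference, a minimal sketch of what the two calls in that script operate on when no URL layer is involved. It assumes a standard ECMAScript engine where the legacy unescape() function is available; the variable names are illustrative only.

```javascript
// A string literal is not %-decoded by the language itself:
// the six characters % B 0 % A 1 are kept as written.
const literal = "%B0%A1";
console.log(literal.length); // 6

// unescape() maps each %XX to the single code point U+00XX,
// independently of any document charset.
const decoded = unescape(literal);
console.log(decoded.length);             // 2
console.log(decoded === "\u00B0\u00A1"); // true
```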


 
i'm rarely a standards purist. what i care about are things i use. and i use
data:, javascript:, view-source, and about: urls heavily.
> document.write('%C6')? 
> 
> Three characters, <U+0025 U+0043 U+0036> or a single character <U+00C6>? That
> was my question. 

If that document.write is in a .js file, then the answer is obvious: three
characters: '%C6' (or the U+0025 U+0043 U+0036 sequence, if you prefer that
spelling).  JS string literals are well-specified by ECMA-262 and there is
nothing about %-escaping in that spec.

If the string literal is embedded in an href= attribute value, then the JS
engine may not see the verbatim source string -- there may be a layer of
interpretation when the attribute is parsed, and one when the href url is loaded
(when the link is clicked).  The attribute value interpretation should handle
&amp; and other such entities, but not mess with %, right?

But the link url loading step will unescape (in nsJSProtocolHandler.cpp),
because it expects that the url was escaped (by whom?  by the page author?). 
Why is that unescaping dependent on the document charset?

/be
Thanks for bearing with my 'laziness' and confirming that '%XX' is not subject to
any special processing.

> But the link url loading step will unescape (in nsJSProtocolHandler.cpp),
> because it expects that the url was escaped (by whom?  by the page author?). 

 This is the conflict between JS's notion of 'escaping' and url-escaping.
Why is it escaped? Because in a URL, every non-ASCII character (plus some
ASCII chars) has to be url-escaped, and some authors apparently escape non-ASCII
characters in javascript: urls. Unfortunately, url-escaping was not
well-defined when it comes to which charset its byte sequence (when unescaped)
is in. Most of the time, it has to be interpreted as being in the document charset. That
is, what file is referred to by 'http://www.example.com/%b0%91.png' depends on
the document charset, although an increasing number of sites (still very few,
though) have begun to support url-escaped UTF-8 URLs.
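
To illustrate the charset dependence, here is a rough sketch that unescapes '%b0%a1' to raw bytes and then decodes those bytes under several charset labels. The helper names are hypothetical, and it assumes a runtime with the standard TextDecoder API; whether legacy labels such as 'euc-kr' are available depends on how the runtime was built, so unsupported labels simply yield null here.

```javascript
// Turn "%b0%a1" into raw bytes without assuming any charset.
function percentDecodeToBytes(escaped) {
  const bytes = [];
  for (let i = 0; i < escaped.length; i++) {
    const hex = escaped.slice(i + 1, i + 3);
    if (escaped[i] === "%" && /^[0-9a-fA-F]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(escaped.charCodeAt(i) & 0xff); // sketch: assumes ASCII input
    }
  }
  return Uint8Array.from(bytes);
}

// Decode bytes under a charset label. Returns null when the label is
// unsupported in this runtime or the bytes are invalid in that charset.
function decodeAs(bytes, label) {
  try {
    return new TextDecoder(label, { fatal: true }).decode(bytes);
  } catch (e) {
    return null;
  }
}

const bytes = percentDecodeToBytes("%b0%a1");
for (const label of ["iso-8859-1", "euc-kr", "utf-8"]) {
  console.log(label, JSON.stringify(decodeAs(bytes, label)));
}
```

The same two bytes, 0xB0 0xA1, are two Latin-1 characters, a single Hangul syllable in EUC-KR, and not valid UTF-8 at all, which is why the unescaped result cannot be interpreted without knowing the charset.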
 
http://lxr.mozilla.org/seamonkey/source/dom/src/jsurl/nsJSProtocolHandler.cpp#195
http://lxr.mozilla.org/seamonkey/source/dom/src/jsurl/nsJSProtocolHandler.cpp#717
http://lxr.mozilla.org/seamonkey/source/dom/src/jsurl/nsJSProtocolHandler.cpp#774

> Why is that unescaping dependent on the document charset?

 See above. This bug is kind of a tech-evangelism bug. The '\uHHHH' notation has to be
used in a javascript: url if necessary.
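
A possible way for a page author to follow that advice is sketched below. The helper toAsciiJsLiteral is hypothetical, not an existing API. It writes every character outside printable ASCII as a \uHHHH escape, and does the same for quotes, backslash and '%', so that neither the charset nor the URL-unescaping step can change the string. HTML attribute escaping of the final URL is out of scope here.

```javascript
// Hypothetical helper: render a string as an ASCII-only JS string literal.
function toAsciiJsLiteral(text) {
  let out = "'";
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i); // UTF-16 code unit, so surrogates work too
    const ch = text[i];
    const plain =
      unit >= 0x20 && unit <= 0x7e &&
      ch !== "'" && ch !== '"' && ch !== "\\" && ch !== "%";
    out += plain
      ? ch
      : "\\u" + unit.toString(16).toUpperCase().padStart(4, "0");
  }
  return out + "'";
}

const href = "javascript:alert(" + toAsciiJsLiteral("\u00C4") + ")";
console.log(href); // javascript:alert('\u00C4')
```

Because the resulting URL contains only ASCII and no %XX sequences, it should evaluate to the same string whatever the document charset is.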
 
Summary: javascript href broken when a variable is containing escaped umlaut → javascript href broken when a variable contains percent escaped umlaut
Stefan, does this now work for you?
WFM last item in testcase, FF & SM
Yes, WFM now.
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.0.7) Gecko/20060910 SeaMonkey/1.0.5
->WFM then
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → WORKSFORME
This bug is still present, if the charset is UTF-8. See new test case.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Attached file: UTF-8 test case (obsolete)
Attachment #88625 - Attachment is obsolete: true
Do things right!
\u00C4 (Latin capital letter A with diaeresis) is a Unicode value; in a URL it must be encoded in UTF-8, i.e. as the two bytes { 0xC3, 0x84 }.
<a href="javascript:alert('%C3%84')"> - will work as expected.
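
The two-byte form can be checked with the standard encodeURIComponent function, which percent-escapes the UTF-8 encoding of its argument. A small sketch, with illustrative variable names:

```javascript
// U+00C4 encodes to the UTF-8 bytes 0xC3 0x84, hence "%C3%84".
const aUmlaut = "\u00C4";
const escaped = encodeURIComponent(aUmlaut);
console.log(escaped); // "%C3%84"

// Decoding as UTF-8 gives the original character back.
console.log(decodeURIComponent(escaped) === aUmlaut); // true
```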
Roman is right. In a UTF-8 document <a href="javascript:alert('%C4')"> is just malformed.
Status: REOPENED → RESOLVED
Closed: 18 years ago → 17 years ago
Resolution: --- → WORKSFORME
Why is it malformed? '%C4' is a valid string expression, isn't it? I have no interest in the particular letter it may or may not represent. It is just any string. And if I call alert(), or any other function, with a valid string as its parameter, that function should be executed. That's the problem: I see no alert!

By the way: <a href="javascript:alert('%C3%84')"> does not work either.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
I'm not going to get into a closing/reopening war here, but '%C4' is *not* a valid string expression in an href in a UTF-8 document. RFC 2396 defines that it has to be interpreted as the octet 0xC4, which is invalid UTF-8 on its own.

Also, there is a typo in the last line of attachment 292585 [details]: you updated the text but not the actual href attribute.
Oops. Sorry. My conclusion was based on that testcase. Indeed, it does not work because of my typo. You're right. So I'm convinced. Resolving as WORKSFORME, again.
Status: REOPENED → RESOLVED
Closed: 17 years ago → 17 years ago
Resolution: --- → WORKSFORME