Closed Bug 135309 Opened 22 years ago Closed 20 years ago

location.hash should not be unescaped

Categories

(Core :: DOM: Core & HTML, defect)

defect
Not set
minor

Tracking

()

RESOLVED FIXED

People

(Reporter: sschinke, Assigned: bzbarsky)

References

()

Details

Attachments

(3 files)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:0.9.9+) Gecko/20020403
BuildID:    2002040303

When viewing the DOM of a document with a complex hash containing escaped characters, the location.hash property unescapes those characters.

For example, #abc%21abc as anchor to a URL will give location.hash equal to #abc!abc

Location.hash is defined as how IE and Netscape did it (from the defenition of DOM level 0 from w3.org), and this is not how IE does it. I'm not sure how netscape does it, but the JS reference docs available from them (that I could find) indicate no escaping.

This is also inconsistent with how location.query is treated. Location.query doesn't unescape anything.

Reproducible: Always
Steps to Reproduce:
1.Open the attached test-case
2.Type in an anchor with escaped characters
3.refresh the page


Actual Results:  location.hash contains different characters than everything after and including the hash in location.href, as shown by the document.write statements

Expected Results:  location.hash should have contained the same string as everything after and including the hash in location.href.

This script will demonstrate the differences:
<script language="JavaScript">
document.write( "location.hash: " + location.hash );
document.write( "<br>");
document.write( "Intended location.hash: ");
var index = location.href.indexOf("#");
if (index == -1) {
	index = location.href.length;
}
document.write( location.href.substring(index));
</script>
Attached file testcase
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: 4xp
OS: Windows 98 → All
Hardware: PC → All
This bug appears to happen here:
http://lxr.mozilla.org/mozilla/source/dom/src/base/nsLocation.cpp#282

With the actual unescaping happening on line 299.

I'm not familiar with how to make any changes to remove that, or how to test for side effects. But hopefully this'll make it easier for whoever does patch it.

Unless this isn't the method the javascript uses. I have no idea how to view the JS bindings, so maybe it would be worth checking.

I've included the code segment below:

282 NS_IMETHODIMP
283 LocationImpl::GetHash(nsAString& aHash)
284 {
285   aHash.SetLength(0);
286 
287   nsCOMPtr<nsIURI> uri;
288   nsresult result = NS_OK;
289 
290   result = GetURI(getter_AddRefs(uri));
291 
292   nsCOMPtr<nsIURL> url(do_QueryInterface(uri));
293 
294   if (url) {
295     nsCAutoString ref;
296 
297     result = url->GetRef(ref);
298     // XXX danger... this may result in non-ASCII octets!
299     NS_UnescapeURL(ref);
300 
301     if (NS_SUCCEEDED(result) && !ref.IsEmpty()) {
302       aHash.Assign(NS_LITERAL_STRING("#") + NS_ConvertASCIItoUCS2(ref));
303     }
304   }
305 
306   return result;
307 }
Naoki, could you give this code some intl love like you've done to other methods
in this same class?
Assignee: jst → nhotta
Keywords: mozilla1.0
> Reproducible: Always
> Steps to Reproduce:
> 1.Open the attached test-case
> 2.Type in an anchor with escaped characters
Where to type? Can I have the exact string that I can paste?
> 3.refresh the page
> 



Status: NEW → ASSIGNED
Sorry,

The hash is typed in the location bar, where the URL is displayed when the test-case is open.

The sample string in my report will demonstrate this: #abc%21abc

Or, for a more general format for an escaped character:
<"%"><hex-digit><hex-digit>

Where hex-digit is [a-z,A-Z,0-9]

Regards,
Sam
I typed to the localtion bar #abc%21abc also tried #abc, nothing happened. Is
that supposed to show something?

Hi nhotta,

The test-case will write the hash to the document, as well as the desired hash (not unescaped).

Step 3 is important to let the javascript process the hash though.

Let me revise the steps.

To reproduce:

1)click on the attachment "testcase"
2)type a hash at the end of the URL, with unescaped characters into the location bar
3)press enter
4)refresh

Alternatively, follow this link to the attachment (provided bugzilla anchors the whole thing):
http://bugzilla.mozilla.org/attachment.cgi?id=77559&action=view#abc%21abc

Regards,
Sam
Also, if you'll place your cursor over the link
http://bugzilla.mozilla.org/attachment.cgi?id=77559&action=view#abc%21abc
You'll notice the status bar also displays it incorrectly.

Possibly the same bug?

Regards,
Sam
Thanks, I can reproduce this now.

I think this is not nsbeta1 type of bug, jst what do you think?
If this is trivial to fix, we might as well try, if not, lets wait until past
mozilla1.0. Can someone provide a URL for a real website out there that breaks
because of this?
If want to get escaped string by location.hash then it just need to take out the
unescaping line. Is location.hash supposed to perform unescaping?

The current code causes dataloss for non ASCII data because of the line,
302       aHash.Assign(NS_LITERAL_STRING("#") + NS_ConvertASCIItoUCS2(ref));

which can be avoided if we keep it escaped.
I know of no sites that even use escaped characters in hashes.

I have some dhtml/javascript/whatever that passes stuff around as a hash, and tested it on mozilla, which showed up as broken. It can be coded around client-side, but that makes that much more platform-dependent code.

It looks like it is just a cosmetic thing other than the possibility of some anchors being missed (but I havn't verified that mozilla will miss an anchor, or jump to one with the unescaped string).
cc to momoi, do you know any real sites use this and have the problem?
Summary: location.hash is set to unescaped hash portion of URL → location.hash should not be unescaped
We _do_ want to jump to anchors with the unescaped string.  Otherwise anchors on
all sorts of non-English sites will break.

Further, if the hash is non-ASCII, we really do want to unescape it because
otherwise it will all be url-encoded garbage.

So we have to either unescape completely (as we do) or unescape with the
esc_OnlyNonASCII flag.

I would like to know what IE's exact behavior is here before we make any changes...
Thanks for the comment. Finally a useful comment. anchor name attributes do 
indeed contain CDATA, so they can have any character in the document character 
set and allow character entities. I hadn't realized this.

So this does mean that in order for these arbitrary characters to be usable in 
a URL, they must be escaped.

I am not concerned with where Mozilla jumps (or doesn't) in a page, though, as 
this can be acheived whether or not location.hash or LocationImpl::GetHash 
unescapes the string. I am concerned with what shows up in the javascript DOM 
operation, though.

If Mozilla needs to unescape the anchor to jump to the correct anchor in a page 
it should be able to do so outside of the routine used to retrieve the anchor 
from the URL (rather, do it in the routine that jumps to the anchor). The 
anchor string is a literal.

If a web-page script developer wishes to take the anchor value and do something 
with it, they can always call .unescape on it if they need the unescaped data. 
However, location.hash can destroy data by ALWAYS unescaping the data.

IE's behavior as far as location.hash is to return the literal. As far as 
jumping to anchors, it looks like it doesn't use the escaped value at all. 
However, the first behaviour doesn't imply the second.

I have created a testcase to test jumping to encoded anchors and can upload it 
if needed.

My personal investment in this bug has passed as I was using location.hash as a 
workaround for location.query being unavailable/unusable in some situations 
(eg, when loading a local disk file though some proxies). This has now been 
remedied in the places where I need it.

However, IMO it is still the incorrect thing to do.

Regards,
Sam

Regards,
Sam
Yeah, the jumping code doesn't use location.hash.  I was just mentioning that
because it was brought up in the bug...  What I'm really curious about is what
IE's behavior is for location.hash in the following situations:

1)  The hash is in Chinese
2)  The hash contains non-ASCII ISO-8859-1 characters
3)  The hash contains literal %-signs

For example, what happens if the hash (as given on the linking page) is %21
followed by a c with a cedilla?  What if the c with a cedilla is URL-escaped in
the linking page?

The core issue is that the original string is not really stored anywhere.  We
have a URI object and it's been canonicalized to a certain extent.  So
retrieving the original string can be nontrivial, even if desirable.
I can answer case 3).

IE uses the hash as a literal (I believe I answered this much in my last) and 
matches anchors after decoding HTML character entities.

If you have the following anchors:

foo!
foo%21

The following link: http://site.com#foo%21 will lead to the second anchor in 
IE. Admittedly, the anchors are ambiguous, but if Mozilla and IE do this 
differently some pretty basic things can be broken. Does mozilla require #foo%
2521 to create an anchor to the second anchor?

W3.org and the HTML specifications _should_ have something to say about this. 
Do they say anything?

I don't know how to evaluate cases 1) and 2) though. I believe these _should_ 
be constrained by legal URL characters (though the hash need not be 
transmitted) but HTML 4.0 seems to say that named anchors can have any 
character-set character.

Regards,
Sam
> Does mozilla require #foo% 2521 to create an anchor to the second anchor?

It should, yes.

The HTML spec has nothing to do with this; this is pure URI handling.... The
HTML spec normatively references the URI RFCs.

The way to test the first two is to have a document with those sort of
characters in the anchor... I'll upload one in a second.
Attached file Testcase
Note that Mozilla currently screws up the encoding here; that's the intl issue
some of the earlier conversation in the bug was about.
The unescape call was added by darin in bug 124042, by the way; before that I
don't think we did that....
So it seems that on my testcase IE reports _different_ location.hash values,
which is just totally wrong -- the URIs are identical.

So I'm afraid there is no sane way to do what IE does on this without breaking
legitimate non-ascii anchors.

The intl issue should be fixed, though, which is what this patch does.
Attachment #148050 - Flags: superreview?(jst)
Attachment #148050 - Flags: review?(jshin)
The other self-consistent option would be to remove the unescaping altogether,
of course.  It was added for bug 124042, looks like.  Darin, do you know why you
added it?
Comment on attachment 148050 [details] [diff] [review]
Fix the intl issues

r=jshin

nit:

CopyASCIItoUCS2(ref, unicodeRef);

Pls, use CopyASCIItoUTF16 (although they're identical)
Attachment #148050 - Flags: review?(jshin) → review+
Comment on attachment 148050 [details] [diff] [review]
Fix the intl issues

   nsresult result = NS_OK;

I'd rename that to rv while poking around here...

but sr=jst either way.
Attachment #148050 - Flags: superreview?(jst) → superreview+
(In reply to comment #22)
> The other self-consistent option would be to remove the unescaping altogether,
> of course.  It was added for bug 124042, looks like.  Darin, do you know why you
> added it?

IIRC, that was the bug in which I made nsStandardURL::GetRef not return an
unescaped value.  So, I believe (and my memory may be failing me) that I simply
moved the unescaping from nsStandardURL::GetRef out into this DOM code.
Taking.
Assignee: nhottanscp → bzbarsky
Status: ASSIGNED → NEW
Checked in the patch to fix intl issues for 1.8a1.  Marking fixed, though
perhaps "wontfix" is more appropriate, since the fact is that we don't want to
duplicate IE's buggy behavior in general and that makes unescaping necessary
when dealing with non-ascii refs.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
if IE's jumping to an anchor w/ hash is a bug, wtf is is hash for??? htf would i
jump to an anchor???
You seem to have fundamentally misunderstood that comment.  The bug in IE is
that it treats <a name="#foo"> as being the same as <a name="foo">.  Which is wrong.
According to HTML5 and according to Safari, Opera, and IE, location.hash doesn't unescape the <fragment> part before returning it.

Do you have more documentation on why Gecko does do unescaping here? Is there an edge case I should be testing instead of just "#test%20test" and "?test%20test"?
Attachment #141381 - Attachment mime type: text/html → text/html; charset=windows-1251
That's not quite true for IE (or its URI representation is pretty broken) -- see the testcase attached to this bug (https://bugzilla.mozilla.org/attachment.cgi?id=141381).  Compare to Safari, which consistently escapes all the non-ASCII stuff on that testcase.  Opera does what IE does.

In any case, the point of Gecko's behavior is to have compat with IE in the non-ascii char case, since we had to pick one case or the other and compat for pages that happen to have non-ascii text in refs seemed like a better idea than compat for pages that explicitly put %-escapes into their refs...  It wouldn't be hard to change, but I still think this is the better behavior.
Am I missing something? IE7 on that testcase always returns what you see in the location bar, which is to say escaped URIs when they're escaped and not escaped when they're not, ascii or not.
Sure.  But the point is that while the IRI forms are different (escaped or not) the URI forms are the same (escaped) in Gecko for both of the tests in that testcase (though the encoding used for the escaping might be either the page encoding or UTF-8 depending on whether the page does the escaping or we do).  So in Gecko the URL bar shows escaped stuff for both tests.

We can't affect that in the location code, but we _can_ affect what's returned to JS if it asks for the location.hash...
I'd personally prefer it if location.hash never unescaped (i.e. returned whatever the URI was); if we have code that normalises the URI when you navigate, that's another issue altogether, IMHO.
The normalizing code is not really subject to change in the near (or possibly long term) -- it's all frozen APIs.  So the real question is which would break more web pages, and how badly, in terms of location.hash behavior.
biesi, darin, if you have opinions please let me know.

My thinking here is that we should assume browsers are using IRIs internally, only converting to URIs at the edges (e.g. when doing HTTP requests). I know this means big changes for Gecko, but I don't really see any other way to get alignment on this. Other browsers aren't going to abandon IRIs.
Just for the record, the behavior I describe in comment 33 change somewhat, in Firefox, between 2007-07-05-04 and 2007-07-06-04.  Bonsai range:

http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=all&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2007-07-05+01&maxdate=2007-07-06+07&cvsroot=%2Fcvsroot

Or at least what the url bar shows changed.  The actual data stored in the URI object has not, as far as I can tell, and if I remove the unescaping from location.href I get the two (different) escaped strings in location.hash.  As far as I can tell, the url bar just started doing unescaping, but only if the result is valid UTF-8.
For what it's worth, I filed bug 483660 on necko.
Depends on: 483660
No longer depends on: 483660
(In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #21)
> So it seems that on my testcase IE reports _different_ location.hash values,
> which is just totally wrong -- the URIs are identical.
At the time the comment was written, the URIs were idential on Gecko because URIs were escaped in the document encoding.
Now (more precisely, after bug 261929 is fixed) it does no longer hold. URIs will always be escaped in UTF-8, except the query part. So the URIs are different on all browsers (including Gecko) while Gecko returns the same value in location.hash which is not correct anymore.

(In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #37)
> Or at least what the url bar shows changed.  The actual data stored in the
> URI object has not, as far as I can tell, and if I remove the unescaping
> from location.href I get the two (different) escaped strings in
> location.hash.  As far as I can tell, the url bar just started doing
> unescaping, but only if the result is valid UTF-8.
Actually bug 261929 changed the stored data in the URI object.

I think we should revisit the issue based on these facts. At least, location.hash should not unescape *ASCII* characters blindly which causes a problem such as bug 483304.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: