Closed Bug 135309 Opened 24 years ago Closed 22 years ago

location.hash should not be unescaped

Categories

(Core :: DOM: Core & HTML, defect)

Product:

Component:

Type:

defect

Priority:

Not set

Severity:

minor

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: sschinke, Assigned: bzbarsky)

References

(
URL
)

Details

Attachments

(3 files)

testcase 24 years ago Sam Schinke 374 bytes, text/html		Details
Testcase 22 years ago Boris Zbarsky [:bzbarsky] 473 bytes, text/html; charset=windows-1251		Details
Fix the intl issues 22 years ago Boris Zbarsky [:bzbarsky] 1.95 KB, patch	jshin1987 : review+ jst : superreview+	Details \| Diff \| Splinter Review

Reporter

Description

•

24 years ago

From Bugzilla Helper: User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:0.9.9+) Gecko/20020403 BuildID: 2002040303 When viewing the DOM of a document with a complex hash containing escaped characters, the location.hash property unescapes those characters. For example, #abc%21abc as anchor to a URL will give location.hash equal to #abc!abc Location.hash is defined as how IE and Netscape did it (from the defenition of DOM level 0 from w3.org), and this is not how IE does it. I'm not sure how netscape does it, but the JS reference docs available from them (that I could find) indicate no escaping. This is also inconsistent with how location.query is treated. Location.query doesn't unescape anything. Reproducible: Always Steps to Reproduce: 1.Open the attached test-case 2.Type in an anchor with escaped characters 3.refresh the page Actual Results: location.hash contains different characters than everything after and including the hash in location.href, as shown by the document.write statements Expected Results: location.hash should have contained the same string as everything after and including the hash in location.href. This script will demonstrate the differences: <script language="JavaScript"> document.write( "location.hash: " + location.hash ); document.write( "<br>"); document.write( "Intended location.hash: "); var index = location.href.indexOf("#"); if (index == -1) { index = location.href.length; } document.write( location.href.substring(index)); </script>

Reporter

Comment 1

•

24 years ago

Attached file testcase — Details

Boris Zbarsky [:bzbarsky]

Assignee

Updated

•

24 years ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

Keywords: 4xp

OS: Windows 98 → All

Hardware: PC → All

Reporter

Comment 2

•

24 years ago

This bug appears to happen here: http://lxr.mozilla.org/mozilla/source/dom/src/base/nsLocation.cpp#282 With the actual unescaping happening on line 299. I'm not familiar with how to make any changes to remove that, or how to test for side effects. But hopefully this'll make it easier for whoever does patch it. Unless this isn't the method the javascript uses. I have no idea how to view the JS bindings, so maybe it would be worth checking. I've included the code segment below: 282 NS_IMETHODIMP 283 LocationImpl::GetHash(nsAString& aHash) 284 { 285 aHash.SetLength(0); 286 287 nsCOMPtr<nsIURI> uri; 288 nsresult result = NS_OK; 289 290 result = GetURI(getter_AddRefs(uri)); 291 292 nsCOMPtr<nsIURL> url(do_QueryInterface(uri)); 293 294 if (url) { 295 nsCAutoString ref; 296 297 result = url->GetRef(ref); 298 // XXX danger... this may result in non-ASCII octets! 299 NS_UnescapeURL(ref); 300 301 if (NS_SUCCEEDED(result) && !ref.IsEmpty()) { 302 aHash.Assign(NS_LITERAL_STRING("#") + NS_ConvertASCIItoUCS2(ref)); 303 } 304 } 305 306 return result; 307 }

Johnny Stenback (:jst)

Comment 3

•

24 years ago

Naoki, could you give this code some intl love like you've done to other methods in this same class?

Assignee: jst → nhotta

Keywords: mozilla1.0

Comment 4

•

24 years ago

> Reproducible: Always > Steps to Reproduce: > 1.Open the attached test-case > 2.Type in an anchor with escaped characters Where to type? Can I have the exact string that I can paste? > 3.refresh the page >

Status: NEW → ASSIGNED

Reporter

Comment 5

•

24 years ago

Sorry, The hash is typed in the location bar, where the URL is displayed when the test-case is open. The sample string in my report will demonstrate this: #abc%21abc Or, for a more general format for an escaped character: <"%"><hex-digit><hex-digit> Where hex-digit is [a-z,A-Z,0-9] Regards, Sam

Comment 6

•

24 years ago

I typed to the localtion bar #abc%21abc also tried #abc, nothing happened. Is that supposed to show something?

Reporter

Comment 7

•

24 years ago

Hi nhotta, The test-case will write the hash to the document, as well as the desired hash (not unescaped). Step 3 is important to let the javascript process the hash though. Let me revise the steps. To reproduce: 1)click on the attachment "testcase" 2)type a hash at the end of the URL, with unescaped characters into the location bar 3)press enter 4)refresh Alternatively, follow this link to the attachment (provided bugzilla anchors the whole thing): http://bugzilla.mozilla.org/attachment.cgi?id=77559&action=view#abc%21abc Regards, Sam

Reporter

Comment 8

•

24 years ago

Also, if you'll place your cursor over the link http://bugzilla.mozilla.org/attachment.cgi?id=77559&action=view#abc%21abc You'll notice the status bar also displays it incorrectly. Possibly the same bug? Regards, Sam

Comment 9

•

24 years ago

Thanks, I can reproduce this now. I think this is not nsbeta1 type of bug, jst what do you think?

Johnny Stenback (:jst)

Comment 10

•

24 years ago

If this is trivial to fix, we might as well try, if not, lets wait until past mozilla1.0. Can someone provide a URL for a real website out there that breaks because of this?

Comment 11

•

24 years ago

If want to get escaped string by location.hash then it just need to take out the unescaping line. Is location.hash supposed to perform unescaping? The current code causes dataloss for non ASCII data because of the line, 302 aHash.Assign(NS_LITERAL_STRING("#") + NS_ConvertASCIItoUCS2(ref)); which can be avoided if we keep it escaped.

Reporter

Comment 12

•

24 years ago

I know of no sites that even use escaped characters in hashes. I have some dhtml/javascript/whatever that passes stuff around as a hash, and tested it on mozilla, which showed up as broken. It can be coded around client-side, but that makes that much more platform-dependent code. It looks like it is just a cosmetic thing other than the possibility of some anchors being missed (but I havn't verified that mozilla will miss an anchor, or jump to one with the unescaped string).

Comment 13

•

24 years ago

cc to momoi, do you know any real sites use this and have the problem?

Updated

•

23 years ago

Summary: location.hash is set to unescaped hash portion of URL → location.hash should not be unescaped

Boris Zbarsky [:bzbarsky]

Assignee

Comment 14

•

22 years ago

We _do_ want to jump to anchors with the unescaped string. Otherwise anchors on all sorts of non-English sites will break. Further, if the hash is non-ASCII, we really do want to unescape it because otherwise it will all be url-encoded garbage. So we have to either unescape completely (as we do) or unescape with the esc_OnlyNonASCII flag. I would like to know what IE's exact behavior is here before we make any changes...

Reporter

Comment 15

•

22 years ago

Thanks for the comment. Finally a useful comment. anchor name attributes do indeed contain CDATA, so they can have any character in the document character set and allow character entities. I hadn't realized this. So this does mean that in order for these arbitrary characters to be usable in a URL, they must be escaped. I am not concerned with where Mozilla jumps (or doesn't) in a page, though, as this can be acheived whether or not location.hash or LocationImpl::GetHash unescapes the string. I am concerned with what shows up in the javascript DOM operation, though. If Mozilla needs to unescape the anchor to jump to the correct anchor in a page it should be able to do so outside of the routine used to retrieve the anchor from the URL (rather, do it in the routine that jumps to the anchor). The anchor string is a literal. If a web-page script developer wishes to take the anchor value and do something with it, they can always call .unescape on it if they need the unescaped data. However, location.hash can destroy data by ALWAYS unescaping the data. IE's behavior as far as location.hash is to return the literal. As far as jumping to anchors, it looks like it doesn't use the escaped value at all. However, the first behaviour doesn't imply the second. I have created a testcase to test jumping to encoded anchors and can upload it if needed. My personal investment in this bug has passed as I was using location.hash as a workaround for location.query being unavailable/unusable in some situations (eg, when loading a local disk file though some proxies). This has now been remedied in the places where I need it. However, IMO it is still the incorrect thing to do. Regards, Sam Regards, Sam

Boris Zbarsky [:bzbarsky]

Assignee

Comment 16

•

22 years ago

Yeah, the jumping code doesn't use location.hash. I was just mentioning that because it was brought up in the bug... What I'm really curious about is what IE's behavior is for location.hash in the following situations: 1) The hash is in Chinese 2) The hash contains non-ASCII ISO-8859-1 characters 3) The hash contains literal %-signs For example, what happens if the hash (as given on the linking page) is %21 followed by a c with a cedilla? What if the c with a cedilla is URL-escaped in the linking page? The core issue is that the original string is not really stored anywhere. We have a URI object and it's been canonicalized to a certain extent. So retrieving the original string can be nontrivial, even if desirable.

Reporter

Comment 17

•

22 years ago

I can answer case 3). IE uses the hash as a literal (I believe I answered this much in my last) and matches anchors after decoding HTML character entities. If you have the following anchors: foo! foo%21 The following link: http://site.com#foo%21 will lead to the second anchor in IE. Admittedly, the anchors are ambiguous, but if Mozilla and IE do this differently some pretty basic things can be broken. Does mozilla require #foo% 2521 to create an anchor to the second anchor? W3.org and the HTML specifications _should_ have something to say about this. Do they say anything? I don't know how to evaluate cases 1) and 2) though. I believe these _should_ be constrained by legal URL characters (though the hash need not be transmitted) but HTML 4.0 seems to say that named anchors can have any character-set character. Regards, Sam

Boris Zbarsky [:bzbarsky]

Assignee

Comment 18

•

22 years ago

> Does mozilla require #foo% 2521 to create an anchor to the second anchor? It should, yes. The HTML spec has nothing to do with this; this is pure URI handling.... The HTML spec normatively references the URI RFCs. The way to test the first two is to have a document with those sort of characters in the anchor... I'll upload one in a second.

Boris Zbarsky [:bzbarsky]

Assignee

Comment 19

•

22 years ago

Attached file Testcase — Details

Note that Mozilla currently screws up the encoding here; that's the intl issue some of the earlier conversation in the bug was about.

Boris Zbarsky [:bzbarsky]

Assignee

Comment 20

•

22 years ago

The unescape call was added by darin in bug 124042, by the way; before that I don't think we did that....

Boris Zbarsky [:bzbarsky]

Assignee

Comment 21

•

22 years ago

Attached patch Fix the intl issues — Details — Splinter Review

So it seems that on my testcase IE reports _different_ location.hash values, which is just totally wrong -- the URIs are identical. So I'm afraid there is no sane way to do what IE does on this without breaking legitimate non-ascii anchors. The intl issue should be fixed, though, which is what this patch does.

Boris Zbarsky [:bzbarsky]

Assignee

Updated

•

22 years ago

Attachment #148050 - Flags: superreview?(jst)

Attachment #148050 - Flags: review?(jshin)

Boris Zbarsky [:bzbarsky]

Assignee

Comment 22

•

22 years ago

The other self-consistent option would be to remove the unescaping altogether, of course. It was added for bug 124042, looks like. Darin, do you know why you added it?

Comment 23

•

22 years ago

Comment on attachment 148050 [details] [diff] [review] Fix the intl issues r=jshin nit: CopyASCIItoUCS2(ref, unicodeRef); Pls, use CopyASCIItoUTF16 (although they're identical)

Attachment #148050 - Flags: review?(jshin) → review+

Johnny Stenback (:jst)

Comment 24

•

22 years ago

Comment on attachment 148050 [details] [diff] [review] Fix the intl issues nsresult result = NS_OK; I'd rename that to rv while poking around here... but sr=jst either way.

Attachment #148050 - Flags: superreview?(jst) → superreview+

Comment 25

•

22 years ago

(In reply to comment #22) > The other self-consistent option would be to remove the unescaping altogether, > of course. It was added for bug 124042, looks like. Darin, do you know why you > added it? IIRC, that was the bug in which I made nsStandardURL::GetRef not return an unescaped value. So, I believe (and my memory may be failing me) that I simply moved the unescaping from nsStandardURL::GetRef out into this DOM code.

Boris Zbarsky [:bzbarsky]

Assignee

Comment 26

•

22 years ago

Taking.

Assignee: nhottanscp → bzbarsky

Status: ASSIGNED → NEW

Boris Zbarsky [:bzbarsky]

Assignee

Comment 27

•

22 years ago

Checked in the patch to fix intl issues for 1.8a1. Marking fixed, though perhaps "wontfix" is more appropriate, since the fact is that we don't want to duplicate IE's buggy behavior in general and that makes unescaping necessary when dealing with non-ascii refs.

Status: NEW → RESOLVED

Closed: 22 years ago

Resolution: --- → FIXED

Comment 28

•

21 years ago

if IE's jumping to an anchor w/ hash is a bug, wtf is is hash for??? htf would i jump to an anchor???

Boris Zbarsky [:bzbarsky]

Assignee

Comment 29

•

21 years ago

You seem to have fundamentally misunderstood that comment. The bug in IE is that it treats <a name="#foo"> as being the same as <a name="foo">. Which is wrong.

Hixie (not reading bugmail)

Comment 30

•

19 years ago

According to HTML5 and according to Safari, Opera, and IE, location.hash doesn't unescape the <fragment> part before returning it. Do you have more documentation on why Gecko does do unescaping here? Is there an edge case I should be testing instead of just "#test%20test" and "?test%20test"?

Boris Zbarsky [:bzbarsky]

Assignee

Updated

•

19 years ago

Attachment #141381 - Attachment mime type: text/html → text/html; charset=windows-1251

Boris Zbarsky [:bzbarsky]

Assignee

Comment 31

•

19 years ago

That's not quite true for IE (or its URI representation is pretty broken) -- see the testcase attached to this bug (https://bugzilla.mozilla.org/attachment.cgi?id=141381). Compare to Safari, which consistently escapes all the non-ASCII stuff on that testcase. Opera does what IE does. In any case, the point of Gecko's behavior is to have compat with IE in the non-ascii char case, since we had to pick one case or the other and compat for pages that happen to have non-ascii text in refs seemed like a better idea than compat for pages that explicitly put %-escapes into their refs... It wouldn't be hard to change, but I still think this is the better behavior.

Hixie (not reading bugmail)

Comment 32

•

18 years ago

Am I missing something? IE7 on that testcase always returns what you see in the location bar, which is to say escaped URIs when they're escaped and not escaped when they're not, ascii or not.

Boris Zbarsky [:bzbarsky]

Assignee

Comment 33

•

18 years ago

Sure. But the point is that while the IRI forms are different (escaped or not) the URI forms are the same (escaped) in Gecko for both of the tests in that testcase (though the encoding used for the escaping might be either the page encoding or UTF-8 depending on whether the page does the escaping or we do). So in Gecko the URL bar shows escaped stuff for both tests. We can't affect that in the location code, but we _can_ affect what's returned to JS if it asks for the location.hash...

Hixie (not reading bugmail)

Comment 34

•

18 years ago

I'd personally prefer it if location.hash never unescaped (i.e. returned whatever the URI was); if we have code that normalises the URI when you navigate, that's another issue altogether, IMHO.

Boris Zbarsky [:bzbarsky]

Assignee

Comment 35

•

18 years ago

The normalizing code is not really subject to change in the near (or possibly long term) -- it's all frozen APIs. So the real question is which would break more web pages, and how badly, in terms of location.hash behavior.

Hixie (not reading bugmail)

Comment 36

•

18 years ago

biesi, darin, if you have opinions please let me know. My thinking here is that we should assume browsers are using IRIs internally, only converting to URIs at the edges (e.g. when doing HTTP requests). I know this means big changes for Gecko, but I don't really see any other way to get alignment on this. Other browsers aren't going to abandon IRIs.

WADA:World Anti-bad-Duping Agency

Updated

•

17 years ago

Blocks: 378962

Boris Zbarsky [:bzbarsky]

Assignee

Comment 37

•

17 years ago

Just for the record, the behavior I describe in comment 33 change somewhat, in Firefox, between 2007-07-05-04 and 2007-07-06-04. Bonsai range: http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=all&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2007-07-05+01&maxdate=2007-07-06+07&cvsroot=%2Fcvsroot Or at least what the url bar shows changed. The actual data stored in the URI object has not, as far as I can tell, and if I remove the unescaping from location.href I get the two (different) escaped strings in location.hash. As far as I can tell, the url bar just started doing unescaping, but only if the result is valid UTF-8.

Boris Zbarsky [:bzbarsky]

Assignee

Comment 39

•

17 years ago

For what it's worth, I filed bug 483660 on necko.

Boris Zbarsky [:bzbarsky]

Assignee

Updated

•

17 years ago

Depends on: 483660

:Gavin Sharp [email: gavin@gavinsharp.com]

Updated

•

17 years ago

No longer depends on: 483660

Masatoshi Kimura [:emk]

Comment 40

•

13 years ago

(In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #21) > So it seems that on my testcase IE reports _different_ location.hash values, > which is just totally wrong -- the URIs are identical. At the time the comment was written, the URIs were idential on Gecko because URIs were escaped in the document encoding. Now (more precisely, after bug 261929 is fixed) it does no longer hold. URIs will always be escaped in UTF-8, except the query part. So the URIs are different on all browsers (including Gecko) while Gecko returns the same value in location.hash which is not correct anymore. (In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #37) > Or at least what the url bar shows changed. The actual data stored in the > URI object has not, as far as I can tell, and if I remove the unescaping > from location.href I get the two (different) escaped strings in > location.hash. As far as I can tell, the url bar just started doing > unescaping, but only if the result is valid UTF-8. Actually bug 261929 changed the stored data in the URI object. I think we should revisit the issue based on these facts. At least, location.hash should not unescape *ASCII* characters blindly which causes a problem such as bug 483304.

You need to log in before you can comment on or make changes to this bug.