16 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows 98 → All
Hardware: PC → All
Naoki, could you give this code some intl love like you've done to other methods in this same class?
Assignee: jst → nhotta
> Reproducible: Always > Steps to Reproduce: > 1.Open the attached test-case > 2.Type in an anchor with escaped characters Where to type? Can I have the exact string that I can paste? > 3.refresh the page >
Status: NEW → ASSIGNED
Sorry, The hash is typed in the location bar, where the URL is displayed when the test-case is open. The sample string in my report will demonstrate this: #abc%21abc Or, for a more general format for an escaped character: <"%"><hex-digit><hex-digit> Where hex-digit is [a-z,A-Z,0-9] Regards, Sam
I typed to the localtion bar #abc%21abc also tried #abc, nothing happened. Is that supposed to show something?
Also, if you'll place your cursor over the link http://bugzilla.mozilla.org/attachment.cgi?id=77559&action=view#abc%21abc You'll notice the status bar also displays it incorrectly. Possibly the same bug? Regards, Sam
Thanks, I can reproduce this now. I think this is not nsbeta1 type of bug, jst what do you think?
If this is trivial to fix, we might as well try, if not, lets wait until past mozilla1.0. Can someone provide a URL for a real website out there that breaks because of this?
If want to get escaped string by location.hash then it just need to take out the unescaping line. Is location.hash supposed to perform unescaping? The current code causes dataloss for non ASCII data because of the line, 302 aHash.Assign(NS_LITERAL_STRING("#") + NS_ConvertASCIItoUCS2(ref)); which can be avoided if we keep it escaped.
cc to momoi, do you know any real sites use this and have the problem?
Summary: location.hash is set to unescaped hash portion of URL → location.hash should not be unescaped
We _do_ want to jump to anchors with the unescaped string. Otherwise anchors on all sorts of non-English sites will break. Further, if the hash is non-ASCII, we really do want to unescape it because otherwise it will all be url-encoded garbage. So we have to either unescape completely (as we do) or unescape with the esc_OnlyNonASCII flag. I would like to know what IE's exact behavior is here before we make any changes...
Yeah, the jumping code doesn't use location.hash. I was just mentioning that because it was brought up in the bug... What I'm really curious about is what IE's behavior is for location.hash in the following situations: 1) The hash is in Chinese 2) The hash contains non-ASCII ISO-8859-1 characters 3) The hash contains literal %-signs For example, what happens if the hash (as given on the linking page) is %21 followed by a c with a cedilla? What if the c with a cedilla is URL-escaped in the linking page? The core issue is that the original string is not really stored anywhere. We have a URI object and it's been canonicalized to a certain extent. So retrieving the original string can be nontrivial, even if desirable.
I can answer case 3). IE uses the hash as a literal (I believe I answered this much in my last) and matches anchors after decoding HTML character entities. If you have the following anchors: foo! foo%21 The following link: http://site.com#foo%21 will lead to the second anchor in IE. Admittedly, the anchors are ambiguous, but if Mozilla and IE do this differently some pretty basic things can be broken. Does mozilla require #foo% 2521 to create an anchor to the second anchor? W3.org and the HTML specifications _should_ have something to say about this. Do they say anything? I don't know how to evaluate cases 1) and 2) though. I believe these _should_ be constrained by legal URL characters (though the hash need not be transmitted) but HTML 4.0 seems to say that named anchors can have any character-set character. Regards, Sam
> Does mozilla require #foo% 2521 to create an anchor to the second anchor? It should, yes. The HTML spec has nothing to do with this; this is pure URI handling.... The HTML spec normatively references the URI RFCs. The way to test the first two is to have a document with those sort of characters in the anchor... I'll upload one in a second.
Created attachment 141381 [details] Testcase Note that Mozilla currently screws up the encoding here; that's the intl issue some of the earlier conversation in the bug was about.
The unescape call was added by darin in bug 124042, by the way; before that I don't think we did that....
Created attachment 148050 [details] [diff] [review] Fix the intl issues So it seems that on my testcase IE reports _different_ location.hash values, which is just totally wrong -- the URIs are identical. So I'm afraid there is no sane way to do what IE does on this without breaking legitimate non-ascii anchors. The intl issue should be fixed, though, which is what this patch does.
14 years ago
The other self-consistent option would be to remove the unescaping altogether, of course. It was added for bug 124042, looks like. Darin, do you know why you added it?
Comment on attachment 148050 [details] [diff] [review] Fix the intl issues r=jshin nit: CopyASCIItoUCS2(ref, unicodeRef); Pls, use CopyASCIItoUTF16 (although they're identical)
Attachment #148050 - Flags: review?(jshin) → review+
Comment on attachment 148050 [details] [diff] [review] Fix the intl issues nsresult result = NS_OK; I'd rename that to rv while poking around here... but sr=jst either way.
Attachment #148050 - Flags: superreview?(jst) → superreview+
(In reply to comment #22) > The other self-consistent option would be to remove the unescaping altogether, > of course. It was added for bug 124042, looks like. Darin, do you know why you > added it? IIRC, that was the bug in which I made nsStandardURL::GetRef not return an unescaped value. So, I believe (and my memory may be failing me) that I simply moved the unescaping from nsStandardURL::GetRef out into this DOM code.
Assignee: nhottanscp → bzbarsky
Status: ASSIGNED → NEW
Checked in the patch to fix intl issues for 1.8a1. Marking fixed, though perhaps "wontfix" is more appropriate, since the fact is that we don't want to duplicate IE's buggy behavior in general and that makes unescaping necessary when dealing with non-ascii refs.
Status: NEW → RESOLVED
Last Resolved: 14 years ago
Resolution: --- → FIXED
if IE's jumping to an anchor w/ hash is a bug, wtf is is hash for??? htf would i jump to an anchor???
You seem to have fundamentally misunderstood that comment. The bug in IE is that it treats <a name="#foo"> as being the same as <a name="foo">. Which is wrong.
According to HTML5 and according to Safari, Opera, and IE, location.hash doesn't unescape the <fragment> part before returning it. Do you have more documentation on why Gecko does do unescaping here? Is there an edge case I should be testing instead of just "#test%20test" and "?test%20test"?
11 years ago
Attachment #141381 - Attachment mime type: text/html → text/html; charset=windows-1251
That's not quite true for IE (or its URI representation is pretty broken) -- see the testcase attached to this bug (https://bugzilla.mozilla.org/attachment.cgi?id=141381). Compare to Safari, which consistently escapes all the non-ASCII stuff on that testcase. Opera does what IE does. In any case, the point of Gecko's behavior is to have compat with IE in the non-ascii char case, since we had to pick one case or the other and compat for pages that happen to have non-ascii text in refs seemed like a better idea than compat for pages that explicitly put %-escapes into their refs... It wouldn't be hard to change, but I still think this is the better behavior.
Am I missing something? IE7 on that testcase always returns what you see in the location bar, which is to say escaped URIs when they're escaped and not escaped when they're not, ascii or not.
Sure. But the point is that while the IRI forms are different (escaped or not) the URI forms are the same (escaped) in Gecko for both of the tests in that testcase (though the encoding used for the escaping might be either the page encoding or UTF-8 depending on whether the page does the escaping or we do). So in Gecko the URL bar shows escaped stuff for both tests. We can't affect that in the location code, but we _can_ affect what's returned to JS if it asks for the location.hash...
I'd personally prefer it if location.hash never unescaped (i.e. returned whatever the URI was); if we have code that normalises the URI when you navigate, that's another issue altogether, IMHO.
The normalizing code is not really subject to change in the near (or possibly long term) -- it's all frozen APIs. So the real question is which would break more web pages, and how badly, in terms of location.hash behavior.
biesi, darin, if you have opinions please let me know. My thinking here is that we should assume browsers are using IRIs internally, only converting to URIs at the edges (e.g. when doing HTTP requests). I know this means big changes for Gecko, but I don't really see any other way to get alignment on this. Other browsers aren't going to abandon IRIs.
Just for the record, the behavior I describe in comment 33 change somewhat, in Firefox, between 2007-07-05-04 and 2007-07-06-04. Bonsai range: http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=all&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2007-07-05+01&maxdate=2007-07-06+07&cvsroot=%2Fcvsroot Or at least what the url bar shows changed. The actual data stored in the URI object has not, as far as I can tell, and if I remove the unescaping from location.href I get the two (different) escaped strings in location.hash. As far as I can tell, the url bar just started doing unescaping, but only if the result is valid UTF-8.
9 years ago
Duplicate of this bug: 483304
For what it's worth, I filed bug 483660 on necko.
9 years ago
Depends on: 483660
(In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #21) > So it seems that on my testcase IE reports _different_ location.hash values, > which is just totally wrong -- the URIs are identical. At the time the comment was written, the URIs were idential on Gecko because URIs were escaped in the document encoding. Now (more precisely, after bug 261929 is fixed) it does no longer hold. URIs will always be escaped in UTF-8, except the query part. So the URIs are different on all browsers (including Gecko) while Gecko returns the same value in location.hash which is not correct anymore. (In reply to Boris Zbarsky (:bz) [In and out Aug 1 - 10, out Aug 11-20] from comment #37) > Or at least what the url bar shows changed. The actual data stored in the > URI object has not, as far as I can tell, and if I remove the unescaping > from location.href I get the two (different) escaped strings in > location.hash. As far as I can tell, the url bar just started doing > unescaping, but only if the result is valid UTF-8. Actually bug 261929 changed the stored data in the URI object. I think we should revisit the issue based on these facts. At least, location.hash should not unescape *ASCII* characters blindly which causes a problem such as bug 483304.
You need to log in before you can comment on or make changes to this bug.