Closed Bug 87996 Opened 24 years ago Closed 24 years ago

Anchors with special characters do not work properly

Categories

(Core :: DOM: Navigation, defect)

x86
Linux
defect
Not set
minor

Tracking

()

RESOLVED FIXED
mozilla0.9.7

People

(Reporter: eh, Assigned: adamlock)

References

()

Details

Attachments

(6 files)

From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux 2.2.16-22 i586; en-US; rv:0.9.1) Gecko/20010608 BuildID: 2001060810 The standards (at least HTML 4.01) have limitations on what can be included in a NAME tag ("ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".")." (http://www.w3.org/TR/html4/types.html#type-cdata)) and my example doesn't quite fit to those limitations, but it's kind of strange that if open a link with special characters in the fragment identifier in a new window, it works just fine (goes where it's supposed to), but if I try to use a link like this in the page, it doesn't do anything at all. So either these special characters in the fragment identifier (#foo) should be supported or then they shouldn't, but the way it works (and doesn't work) now is completely illogical. Reproducible: Always Steps to Reproduce: 1. Go to a page with an anchor with special characters (and a link to that anchor) 2. Try to press the link (doesn't do anything) 3. Try to open the link in a new window (opens the page and scrolls straight to the anchor) Actual Results: Described above Expected Results: Either it should ignore anchors with special characters all the time or then it should fully support them (better idea IMHO).
-> HTML Element
Assignee: asa → clayton
Status: UNCONFIRMED → NEW
Component: Browser-General → HTML Element
Ever confirmed: true
QA Contact: doronr → bsharma
Linux build 2001-06-26-06 I see this with the link to http://www.ifsociety.com/herodishonest/lyrics.php#Don%27t+you+fucking+get+it%3F Now, the name attribute of the target anchor is invalid (apart from using weird chars). It should not be HTML-escaped. Replacing %27 with ' and %3F with ? in the name attribute (but not the href attribute, which _should_ be escaped) makes this link work fine in Mozilla in both cases... We should still behave consistently, however. Over to docshell.
Assignee: clayton → adamlock
Component: HTML Element → Embedding: Docshell
QA Contact: bsharma → adamlock
This works in things like Sun's javadoc http://java.sun.com/j2se/1.4/docs/api/index.html (where there are wierd chars (parens, spaces) but not URL-encoded) and completely breaks in, for example, the New Hacker Dictionary where anchors often have spaces. Wierd.
The Perl manpages all show this problem. What's happening there is, the link says http://www.perldoc.com/perl5.6/lib/ExtUtils/MakeMaker.html#make%20install and the anchor says <a NAME="make%20install">. (I checked the source.) Not being able to jump around in the Perl manpages makes this a larger problem, IMO, since I anything *anywhere* generated by Perldoc will exhibit this problem--and that means all Perl packages. While these are identical, I suspect Mozilla is translating one to "make install" and not the other, and thus is unable to find it. It seems like there would be zero negative impact and 100% positive if we translated %20 to space (among others), since right now it's *impossible* to find these links with Mozilla (if I understand the problem right). The only case where it could affect browser behavior in an unexpected way is if someone had *both* <A NAME="make install"> and <A NAME="make%20install"> in their page. IMO, the way to fix this problem is to politely shoot the HTML developer in question.
I just made a test page that includes all three variants of link ("#make%20install", "#make+install", "#make install") and all three variants of <A NAME=...> (same three variants). None of the three links does anything at all. Specifying them on the browser address bar doesn't work either. So now the question is: how do you put a space into <A NAME> and have it work? My answer is: read A NAME the same way you read URIs. I think the spec intends the same, but I can't even find the place in the spec where it says %20 is legal in a URI--maybe you can. It does seem to be implied by the excerpted text below, however. http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars , which admittedly describes handling an illegal case, seems to indicate that both URIs and NAME attribute of anchors should be escaped in the same way. 1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes. 2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value). They even include a special note: "The same conversion based on UTF-8 should be applied to values of the name attribute for the A element." If they were being escaped in the same way in Mozilla, I doubt this bug would be here:
*** Bug 94361 has been marked as a duplicate of this bug. ***
Urlencoded anchors often apear in texinfo documents converted to html, as stated by John Keiser (I currenty have this problem). A node like "Some node" will become #some%20node in a html document; these nodes properly work in netscape 4x, I'd recomment adding 4xp keyword.
Check out your own last two examples and you will see that it still does not work: <li><a href="lyrics.php#We+give%2C+they+take">We give, they take</a> <li><a href="lyrics.php#Won%27t+be+defeated">Won't be defeated</a> suomi
*** Bug 96218 has been marked as a duplicate of this bug. ***
*** Bug 100531 has been marked as a duplicate of this bug. ***
Attached patch Fixed test caseSplinter Review
Attachment #53271 - Attachment mime type: text/plain → text/html
Run the updated test case and you should see the second anchor works but the others do not. The reasons for failure/success in each case: 1. The anchor is unescaped "make install1", but the element name remains escaped in the DOM as "make%20install1" so they do not match. 2. Anchor and element name match as "make install2" so scrolling happens. 3. Anchor is "make+install3" whereas element name is "make install3" so they do not match. The code that attempts to create a list of matching named elements is in nsHTMLDocument::GetElementsByName() which is called by PresShell::GoToAnchor(). Therefore the proper fix is to probably to modify GetElementsByName to unescape the element name before doing the comparison. I'll work on a patch and CC the DOM people in the meantime for their opinion.
The attached patch fixes the problem with the first anchor, but probably degrades performance in this oft-called function. It works by unescaping the element name before comparing it in nsHTMLDocument::GetElementsByName(). Do the DOM folks know a better way to do this? Where is the element name set in the first place? Ideally the name should be unescaped when the element is built and that way we avoid having to do this every time GetElementsByName is called.
The patch I just attached makes us store the unescaped name in the DOM, that should IMO fix case 1 too, I don't have a build handy where I can test this now, can someone else test the patch and report back?
Yep. jst's patch fixes case 1 as well.
I like the content sink patch, it works and was what I was getting at. r=adamlock Test 3 still doesn't work however. That would require unescaping the anchor text too. We could justify doing nothing about this because it's an illegal anchor name anyway. For that matter, so is an anchor containing space characters. Perhaps we should be strict about this and simply toss non-conformant name/id attributes away?
Ack, it just occured to me, you'd have to do this for ID too.
I think it has to be this way around since the id attribute can be on any type of element: if (keyAtom == nsHTMLAtoms::id || (nodeType == eHTMLTag_a && keyAtom == nsHTMLAtoms::name)) {
But is #foo type URI's supposed to scroll to any element with the ID "foo"? I don't like doing this for all ID attributes, does IE do that?
IDs have more far-ranging consequences than names ... and the spec only talks about converting NAME. I don't think we should do it. r=jkeiser,jst-review.pl on either patch, depending on whether it's decided ID should be converted or not.
See: http://www.w3.org/TR/html4/struct/links.html#h-12.1.1 "Destination anchors in HTML documents may be specified either by the A element (naming it with the name attribute), or by any other element (naming with the id attribute)."
Agreed, but again, we're dealing with a weird case here. IDs are (IMO) more meant to be programmatically accessed, and thus need to retain their original values. Additionally, the spec only calls for converting NAME. In IE, what seems to happen is, the NAME or ID is converted on-the-fly when the anchor is actually compared. When you do myElement.id, it shows "x%20y", not "x y". Because of the consequences here, I think we should leave ID the same.
John Keiser, do you have a link pointing to where it says name values should be unescaped? It is my understanding that neither names nor ids should even contain spaces and we would be perfectly entitled to just throw away such values. This behaviour of this patch is not that strict but if we really followed the spec then none of these anchors should work.
Er... we need to scroll to all IDs in a document, regardless of element. HTML4 deprecated name, and gave id to just about all the elements. Also, since space in URL works and space in name works NOW (there are also a number of websites abusing this behavior), I think we need to support space in HTML ids as well (not in XML though, validated parsers would flag such documents invalid). By the way, would some of this be needed for XML, or is it taken care of by default?
If we need to do this for ID's we should IMO do this conversion in the code that searches for the element, not store the converted ID in the DOM.
Adam, see http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars (second note). And my comment on this bug at 2001-08-07 12:08. Heikki, no worries, spaces will still work with NAMEs and IDs, it's %20 that doesn't work right now. If I have my way, %20 will continue to not work with ID (and we will be able to safely have IDs with % in them) but will work with <A NAME>. JST, I agree.
jst, a space char is illegal in an id or name anyway so one could argue that it doesn't matter that you unescape it when you create the element. The reason I suggested doing it by putting the unescaping into the parsing is because I was uneasy about slowing down every call to the GetElementsByName function for the sake of an edge case. John Keiser, we're not talking about the href here. Yes this can contain escaped chars, but AFAIK the id and name shouldn't and don't need to be escaped. Basically the only legal form for these attrs is [a-zA-Z][a-zA-z0-9:.-]*. This bug deals with the fact that when they are escaped, they are not matched against the unescaped anchor in the scrolling code.
I think we need to decide: 1. If illegal chars are allowed in these elements and what to do if they're not 2. If illegal char means any illegal char or just a select few such as spaces 3. If chars may be escaped and where unescaping occurs
Adam, the second note at that URL specifically says, "The same conversion based on UTF-8 should be applied to values of the name attribute for the A element." In fact, that entire link says that these things are illegal but shows how to implement them if you do allow--and we are talking about allowing them because, as Heikki points out, many websites are currently "abusing" the behavior (including the Perl docs). I think that section of the spec adequately answers to the questions you asked--it says to escape non-URI characters using the URI encoding scheme. In actuality it recommends the opposite of what we are doing (unescaping them), but I think the effect is pretty much the same. What else do you think needs to be said? All I can see that's left to decide is whether to make ID escape, and I don't think we should do that unless we do something like what JST proposes (escape only when looking for it as an anchor). I don't think we should ever change the actual ID attribute. For example, some authors might want to use the % char in ID as something similar to . or -> in C++. But I have a funny feeling that fix is not going to be as quick as this one, so we may want to put in the fix for NAME and leave ID for another bug.
Do we know of existing sites that rely on being able to use %xx fu in element ID's?
Attachment #53613 - Flags: review+
So does anyone have a problem with committing the A NAME= patch now and dealing with ID in a separate bug (that fix would be larger)? This will improve usability in Mozilla directly.
I don't have a problem with the NAME patch for the sake of compatibility though I think ID should strictly conform to the spec.
Target Milestone: --- → mozilla0.9.6
-> 0.9.7
Target Milestone: mozilla0.9.6 → mozilla0.9.7
Comment on attachment 53613 [details] [diff] [review] Same as above, but only unescape name for anchors, as jkeiser pointed out to me. sr=rpotts@netscape.com
Attachment #53613 - Flags: superreview+
Comment on attachment 53613 [details] [diff] [review] Same as above, but only unescape name for anchors, as jkeiser pointed out to me. sr=rpotts@netscape.com
Fix checked in.
Marking this bug fixed. ID attributes are untouched, NAME attributes are unescaped.
Status: NEW → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
For the record, for HTML5-compliance, the HTML5 parser doesn't have special treatment for the name attribute on <a>. Any evidence of that part of the patch still being necessary?
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: