I loaded this test page using Mozilla M14 (build ID 2000022820):

----------cut here
<a href="#x&nbsp;x">Link here</a><p>
<a name="x&nbsp;x">Target here</a>
----------cut here

When I click on the "Link here" link, nothing happens (to test internal links like this, you have to reduce the height of the browser so it's shorter than the page, and/or add additional lines between the link and the target).

The URL it was trying to go to was "file:///C|/TEMP/spacelink.html#x x" - it inserted an extra character before the space.

Note that this name attribute isn't actually valid HTML, but Navigator 4.61 does handle it properly. At the very least, the extra character inserted into the URL looks like a bug.

This might be related to bug #29312.
The attachment is the reporter's testcase; the problem is confirmed with 2000-03-04-17-M15 on WinNT. The incorrect "#xÂ x" shows up in the status bar after clicking on the "Link here" link, but not in the URL bar. Each click on that link adds another blank entry onto the Go menu session history; it takes the same number of presses on the [Back] button to return to the testcase page.

The HTML 4.0 DTD defines URIs to be CDATA, which can contain character entities. URIs are further restricted to ASCII characters by the HTML 4 spec:
http://18.104.22.168/html/struct/links.html#h-12.2.1
and
http://www.w3.org/TR/REC-html40/appendix/notes.html#non-ascii-chars

On the other hand, at the latter URL under B.2.2 the use of & as the start character of a character entity takes precedence over its use as a delimiter between form fields. The spec makes no mention of character entities within fragment identifiers, but if they are legal inside the query part of the URI, they probably should be in the fragment identifier too, especially since RFC 2396 puts few limits on the legal ASCII characters in a fragment.

The legal characters for URIs are defined in section 2 of RFC 2396:
http://www.ietf.org/rfc/rfc2396.txt
The fragment identifier is defined in section 4.1. No characters are reserved in the fragment identifier portion of a URI reference. The only characters excluded will be those listed in section 2.4.3 - control characters, space, and delimiters. Quoting from Appendix A:

> fragment = *uric
> uric = reserved | unreserved | escaped
> reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
> unreserved = alphanum | mark
> mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
[alphanum means what you'd expect]

Whether the fragment should appear in the URL bar as "x&nbsp;x" or as "x x" is another question, but since the RFC allows those characters and the HTML 4.01 DTD calls URIs CDATA, it should be as legal as "x;acirc;x".
If character entities are not allowable in fragment identifiers for some reason given by the HTML spec that I have missed, "x x" should still be legal as a literal in the fragment identifier only, and that fragment identifier ought to work. Section 2.1 of RFC 2396 states that the problem of identifying the correct character encoding for byte sequences defined by %xx in URIs is one left for a future version of the spec. Given that, allowing (non-numerical) character entities in all parts of a URL where they are not otherwise prohibited seems sensible.
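To make the Appendix A grammar quoted above concrete, here is a minimal Python sketch that tests a string against "fragment = *uric". The function name is hypothetical and the regular expression is a hand transcription of the quoted productions, not code from Mozilla or from the RFC.

```python
import re

# Hand transcription of RFC 2396 Appendix A, as quoted in this comment:
#   fragment = *uric
#   uric     = reserved | unreserved | escaped
# "escaped" is "%" followed by two hex digits.
_FRAGMENT = re.compile(
    r"(?:[A-Za-z0-9;/?:@&=+$,\-_.!~*'()]|%[0-9A-Fa-f]{2})*"
)


def is_rfc2396_fragment(fragment: str) -> bool:
    """Return True if the whole string matches the *uric production."""
    return _FRAGMENT.fullmatch(fragment) is not None


# The unresolved entity text uses only reserved and alphanumeric characters.
print(is_rfc2396_fragment("x&nbsp;x"))
# An escaped byte is allowed.
print(is_rfc2396_fragment("x%A0x"))
# A literal space is excluded by section 2.4.3.
print(is_rfc2396_fragment("x x"))
# A raw non-ASCII character (U+00A0) is not part of uric either.
print(is_rfc2396_fragment("x\u00a0x"))
```

Under this reading, the entity text and the %xx form both pass as fragments, while a literal space or a raw non-breaking space character would have to be escaped first.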
Status: UNCONFIRMED → NEW
Ever confirmed: true
This works for me on NT; chris, can you confirm?
Assignee: rickg → petersen
With the April 12th build, I get the "Â" character appearing in the status bar of the window when a mouseover occurs on the link. Clicking on the link doesn't go to the target in the file. I have attached a modified version of the original testcase with BR elements to separate the link and target.
Back to Rick.
Assignee: petersen → rickg
Assignee: rickg → harishd
A bit more data here: I fixed the ReduceEntities() function this weekend, and used this as a testcase: <a href="foo ¢®">. The odd thing is that the sink is doing exactly the right thing, but when you mouse over the link we see the extra funky "Â" character. I think it's in the HTML attribute handling of the content model or in the link handling code.
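This thread does not identify where the "Â" comes from, so the following is only a guess, not a diagnosis of the Mozilla code. One common way to get exactly "Â" in front of a non-breaking space is to encode U+00A0 as UTF-8 and then read those bytes back as Latin-1. A short Python sketch of that effect (the variable names are illustrative):

```python
# Assumption: the no-break space (U+00A0) is encoded as UTF-8 somewhere
# and later decoded as if it were ISO-8859-1 (Latin-1).
NBSP = "\u00a0"

fragment = "x" + NBSP + "x"

# UTF-8 represents U+00A0 as the two bytes 0xC2 0xA0.
utf8_bytes = fragment.encode("utf-8")

# Reading those bytes as Latin-1 turns 0xC2 into "Â" and leaves 0xA0
# as a no-break space, so one extra character appears before the space.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes)
print(repr(misread))
```

If something like this is happening, it would match the symptom reported here: the parser output is correct, and the stray character only appears once the attribute value is converted for display or link matching.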
1) Can anyone explain a valid reason *why* someone would use &nbsp; or other entities as the target of a HREF?
2) Can anyone demonstrate that this usage is common (or even present at all) on the Web, esp. Top 100?
4xp, relnote, FUTURE unless someone demonstrates this is a real problem for real sites.
Keywords: 4xp, relnote
Target Milestone: --- → Future
This bug report originated from a real web page at http://www.geraldmweinberg.com. The maintainer changed the offending code when I suggested that it wasn't compliant with the standard. I don't have any further data on how common this construct is in general.
It's indeed html attribute handling. Over to Waterson.
Assignee: harishd → waterson
marking assigned, unless jst wants to look at it first.
Status: NEW → ASSIGNED
The simplest testcase is already attached. Marking "testcase"
massive update for QA contact.
QA Contact: petersen → lorca
Composer is also seeing some strangeness with entities in links/anchors. CC Charley so he can follow up when he returns from sabbatical.
Nom. nsbeta1 for backward compatibility with existing (****) HTML content.
Reassigning QA Contact for all open and unverified bugs previously under Lorca's care to Gerardo as per phone conversation this morning.
QA Contact: lorca → gerardok
qa contact updated.
QA Contact: gerardok → bsharma
Both attached test cases are working for me using the Mozilla 0.9.1 build (Build ID: 2001060713) on Linux. Is this still a bug?
Build 2001061304 win32 installer sea talkback trunk
1 The "strange" character Â no longer appears
2 The "strange" character is now replaced with %A0
3 In the second testcase the link is proven to actually work!
Questions: Is this correct behavior? Should the note in: http://developer.netscape.com/docs/technote/gecko/n6release.html be corrected?
I get the same results ("#xÂ x" is now "x%A0x", and the link works) using Mozilla _0.9_ on WinNT.

> Questions:
> Is this correct behavior?

Well, the important thing is that it is not incorrect: "%A0" is (roughly speaking) a synonym for " ", which is a synonym for " " in HTML's default character set. More to the point, this is exactly what the HTML 4 spec says should happen for non-ascii characters:
http://www.w3.org/TR/REC-html40/appendix/notes.html#non-ascii-chars
-- and since character entity references are meant to stand in for single non-ascii characters, this behaviour is sensible enough. Unless someone can point to a spec that says otherwise, the difference (in display or in HREF-matching) between " " and "%A0" for " " is a difference that makes no difference.

> Should the note in:
> http://developer.netscape.com/docs/technote/gecko/n6release.html
> be corrected?

No, those notes are for Netscape 6, which is finalized; this bug is as active as ever in all N6 binaries. Before Netscape 6.next comes out, I'm sure someone will go through all closed "relnote" bugs and prune notes accordingly.

Quoting from above:
> --- Additional Comments From ekrock 2000-05-20 23:14 ---
> 1) Can anyone explain a valid reason *why* someone would use &nbsp; or other
> entities as the target of a HREF?

Valid, no, but I'd guess cut-n-paste from a heading element ;->

Calling this FIXED.
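As a rough illustration of the "%A0" form discussed above, the sketch below resolves the entity and then percent-escapes the resulting character using Python's standard library. The helper name is hypothetical and this is not the Mozilla implementation. It shows that "%A0" corresponds to escaping the Latin-1 byte for U+00A0; for comparison, the section of the HTML 4 notes linked above recommends converting to UTF-8 before escaping, which yields a two-byte form instead.

```python
import html
from urllib.parse import quote, unquote


def escape_fragment(attr_value: str, encoding: str) -> str:
    """Resolve character entities, then percent-escape every
    character outside the unreserved set using the given encoding."""
    text = html.unescape(attr_value)  # "&nbsp;" becomes U+00A0
    return quote(text, safe="", encoding=encoding)


# Escaping the single Latin-1 byte gives the form seen in the status bar.
latin1_form = escape_fragment("x&nbsp;x", "latin-1")
print(latin1_form)

# Escaping the UTF-8 bytes gives a different, two-byte form.
utf8_form = escape_fragment("x&nbsp;x", "utf-8")
print(utf8_form)

# Each form decodes back to the same character only when the reader
# assumes the same encoding that was used for escaping.
print(unquote(latin1_form, encoding="latin-1") == "x\u00a0x")
```

Either form matches the name attribute only if both sides of the comparison are normalized the same way, which is presumably why the link works once the href and the anchor name go through the same handling.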
Status: ASSIGNED → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → FIXED
Verified on:
build: 2001-07-02-04-Trunk
platform: WinNT
Loaded both the test cases and they load fine.
Status: RESOLVED → VERIFIED
SPAM. HTML Element component deprecated, changing component to Layout. See bug 88132 for details.
Component: HTML Element → Layout