Closed Bug 30386 Opened 25 years ago Closed 23 years ago

"&nbsp;" in internal href adds strange "Â" character

Tracking

Status:

VERIFIED FIXED

Milestone:

Future

People

(Reporter: faught, Assigned: waterson)

Details

(Keywords: relnote, testcase, Whiteboard: relnote-devel)

Attachments

(2 files)

Testcase for &nbsp; in fragment identifier 25 years ago Sean Richardson 175 bytes, text/html		Details
A link and target seperated by BR elements 25 years ago Chris Petersen 620 bytes, text/html		Details

Danny Faught

Reporter

Description

•

25 years ago

I loaded this test page using Mozilla M14 (build ID 2000022820):

----------cut here
<a href="#x&nbsp;x">Link here</a><p>
<a name="x&nbsp;x">Target here</a>
----------cut here

When I click on the "Link here" link, nothing happens. (To test internal links like this, you have to reduce the height of the browser so it's shorter than the page, and/or add additional lines between the link and the target.) The URL it was trying to go to was "file:///C|/TEMP/spacelink.html#xÂ x"; it inserted an extra character before the space.

Note that this name attribute isn't actually valid HTML, but Navigator 4.61 does handle it properly. At the very least, the extra character inserted into the URL looks like a bug. This might be related to bug #29312.

Sean Richardson

Comment 1

•

25 years ago

Attached file: Testcase for &nbsp; in fragment identifier

Sean Richardson

Comment 2

•

25 years ago

The attachment is the reporter's testcase; the problem is confirmed with 2000-03-04-17-M15 on WinNT. The incorrect "#xÂ x" shows up in the status bar after clicking on the "Link here" link, but not in the URL bar. Each click on that link adds another blank entry to the Go menu session history; it takes the same number of presses on the [Back] button to return to the testcase page.

The HTML 4.0 DTD defines URIs to be CDATA, which can contain character entities. URIs are further restricted to ASCII characters by the HTML 4 spec:
http://206.51.27.220/html/struct/links.html#h-12.2.1
and
http://www.w3.org/TR/REC-html40/appendix/notes.html#non-ascii-chars

On the other hand, at the latter URL under B.2.2 the use of & as the start character of a character entity takes precedence over its use as a delimiter between form fields. The spec makes no mention of character entities within fragment identifiers, but if they are legal inside the query part of the URI, they probably should be in the fragment identifier too, especially since RFC 2396 puts few limits on the legal ASCII characters in a fragment.

The legal characters for URIs are defined in section 2 of RFC 2396:
http://www.ietf.org/rfc/rfc2396.txt
The fragment identifier is defined in section 4.1. No characters are reserved in the fragment identifier portion of a URI reference. The only characters excluded will be those listed in section 2.4.3: control characters, space, and delimiters. Quoting from Appendix A:

> fragment = *uric
> uric = reserved | unreserved | escaped
> reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
> unreserved = alphanum | mark
> mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

[alphanum means what you'd expect]

Whether the fragment should appear in the URL bar as "x&nbsp;x" or as "x x" is another question, but since the RFC allows those characters and the HTML 4.01 DTD calls URIs CDATA, it should be as legal as "x&acirc;x".

If character entities are not allowable in fragment identifiers for some reason given by the HTML spec that I have missed, "x&nbsp;x" should still be legal as a literal in the fragment identifier only, and that fragment identifier ought to work. Section 2.1 of RFC 2396 states that the problem of identifying the correct character encoding for byte sequences defined by %xx in URIs is one left for a future version of the spec. Given that, allowing (non-numerical) character entities in all parts of a URL where they are not otherwise prohibited seems sensible.
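The Appendix A productions quoted in comment 2 can be turned into a small checker. This is only a sketch of those five quoted rules, not of the full RFC 2396 grammar; the function name `is_rfc2396_fragment` is made up for illustration.

```python
import re

# fragment = *uric, with uric = reserved | unreserved | escaped,
# transcribed from the RFC 2396 Appendix A lines quoted above.
_FRAGMENT = re.compile(
    r"""(?:
        [;/?:@&=+$,]        # reserved
      | [A-Za-z0-9]         # alphanum (part of unreserved)
      | [-_.!~*'()]         # mark (part of unreserved)
      | %[0-9A-Fa-f]{2}     # escaped
    )*""",
    re.VERBOSE,
)


def is_rfc2396_fragment(fragment: str) -> bool:
    """Return True if every character of `fragment` fits the quoted grammar."""
    return _FRAGMENT.fullmatch(fragment) is not None


if __name__ == "__main__":
    for candidate in ["x&nbsp;x", "x%A0x", "x x", "x\u00a0x"]:
        print(ascii(candidate), is_rfc2396_fragment(candidate))
```

Read purely as a character sequence, the undecoded text "x&nbsp;x" fits the grammar, because "&" and ";" are reserved characters and the rest are alphanumerics. A literal space or a literal no-break space does not fit, and would have to be escaped.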

Status: UNCONFIRMED → NEW

Ever confirmed: true

rickg

Comment 3

•

25 years ago

This works for me on NT; chris, can you confirm?

Assignee: rickg → petersen

Chris Petersen

Comment 4

•

25 years ago

With the April 12th build, I get the "Â" character appearing in the status bar of the window when a mouseover occurs on the link. Clicking on the link doesn't go to the target in the file. I have attached a modified version of the original testcase with BR elements to separate the link and target.

Chris Petersen

Comment 5

•

25 years ago

Attached file: A link and target seperated by BR elements

Chris Petersen

Comment 6

•

25 years ago

Back to Rick.

Assignee: petersen → rickg

rods (gone)

Comment 7

•

24 years ago

reassigning

Assignee: rickg → harishd

rickg

Comment 8

•

24 years ago

A bit more data here: I fixed the ReduceEntities() function this weekend, and used this as a testcase: <a href="foo ¢®">. The odd thing is that the sink is doing exactly the right thing, but when you mouse over the link we see the extra funky "Â" character. I think it's in the HTML attribute handling of the content model or in the link handling code.
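The thread never states where the extra "Â" comes from. One common way to get exactly this symptom, offered here as an assumption and not as the confirmed cause of this bug, is that the no-break space (U+00A0) is encoded as UTF-8 (bytes C2 A0) and those bytes are then read back as Latin-1, where C2 is "Â". A minimal sketch, with a hypothetical helper name:

```python
NBSP = "\u00a0"  # the character that &nbsp; stands for


def misdecode_utf8_as_latin1(text: str) -> str:
    """Encode as UTF-8, then (wrongly) decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


fragment = "x" + NBSP + "x"
garbled = misdecode_utf8_as_latin1(fragment)

# The single no-break space has become two characters: U+00C2 ("Â")
# followed by U+00A0, which matches the "xÂ x" seen in the status bar.
print(ascii(garbled))  # 'x\xc2\xa0x'
```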

ekrock's old account (dead)

Comment 9

•

24 years ago

1) Can anyone explain a valid reason *why* someone would use &nbsp; or other entities as the target of a HREF?
2) Can anyone demonstrate that this usage is common (or even present at all) on the Web, esp. Top 100?

4xp, relnote, FUTURE unless someone demonstrates this is a real problem for real sites.

Keywords: 4xp, relnote

Target Milestone: --- → Future

Danny Faught

Reporter

Comment 10

•

24 years ago

This bug report originated from a real web page at http://www.geraldmweinberg.com. The maintainer changed the offending code when I suggested that it wasn't compliant with the standard. I don't have any further data on how common this construct is in general.

harishd

Comment 11

•

24 years ago

It's indeed html attribute handling. Over to Waterson.

Assignee: harishd → waterson

Chris Waterson

Assignee

Comment 12

•

24 years ago

marking assigned, unless jst wants to look at it first.

Status: NEW → ASSIGNED

Jeffrey Baker

Comment 13

•

24 years ago

The simplest testcase is already attached. Marking "testcase"

Keywords: testcase

gerardok

Comment 14

•

24 years ago

massive update for QA contact.

QA Contact: petersen → lorca

Kathleen :Brade

Comment 15

•

24 years ago

Composer is also seeing some strangeness with entities in links/anchors. CC Charley so he can followup when he returns from sabbatical.

Gervase Markham [:gerv]

Updated

•

24 years ago

Whiteboard: relnote-devel

ekrock's old account (dead)

Comment 16

•

24 years ago

Nom. nsbeta1 for backward compatibility with existing (****) HTML content.

Keywords: nsbeta1

Hixie (not reading bugmail)

Comment 17

•

24 years ago

Reassigning QA Contact for all open and unverified bugs previously under Lorca's care to Gerardo as per phone conversation this morning.

QA Contact: lorca → gerardok

bsharma

Comment 18

•

24 years ago

qa contact updated.

QA Contact: gerardok → bsharma

Fuzzy Gorilla

Comment 19

•

23 years ago

Both attached test cases are working for me using the Mozilla 0.9.1 build (Build ID: 2001060713) on Linux. Is this still a bug?

basic

Comment 20

•

23 years ago

Build 2001061304 win32 installer sea talkback trunk

1. The "strange" character Â no longer appears.
2. The "strange" character is now replaced with %A0.
3. In the second testcase the link is proven to actually work!

Questions:
Is this correct behavior?
Should the note in http://developer.netscape.com/docs/technote/gecko/n6release.html be corrected?

Sean Richardson

Comment 21

•

23 years ago

I get the same results ("#xÂ x" is now "x%A0x", and the link works) using Mozilla _0.9_ on WinNT.

> Questions:
> Is this correct behavior?

Well, the important thing is that it is not incorrect: "%A0" is (roughly speaking) a synonym for "&#160;", which is a synonym for "&nbsp;" in HTML's default character set. More to the point, this is exactly what the HTML 4 spec says should happen for non-ascii characters:
http://www.w3.org/TR/REC-html40/appendix/notes.html#non-ascii-chars
-- and since character entity references are meant to stand in for single non-ascii characters, this behaviour is sensible enough. Unless someone can point to a spec that says otherwise, the difference (in display or in HREF-matching) between " " and "%A0" for "&nbsp;" is a difference that makes no difference.

> Should the note in:
> http://developer.netscape.com/docs/technote/gecko/n6release.html
> be corrected?

No, those notes are for Netscape 6, which is finalized; this bug is as active as ever in all N6 binaries. Before Netscape 6.next comes out, I'm sure someone will go through all closed "relnote" bugs and prune notes accordingly.

Quoting from above:

> --- Additional Comments From ekrock 2000-05-20 23:14 ---
> 1) Can anyone explain a valid reason *why* someone would use &nbsp; or other
> entities as the target of a HREF?

Valid, no, but I'd guess cut-n-paste from a heading element ;->

Calling this FIXED.
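The equivalence described in comment 21 can be reproduced with the Python standard library. This is a sketch under two assumptions: that "HTML's default character set" means ISO-8859-1 (Latin-1), and that the entity is decoded before the fragment is escaped. The helper name `escape_fragment` is made up; it is not Mozilla code.

```python
from html import unescape
from urllib.parse import quote, unquote


def escape_fragment(raw_attr_fragment: str, charset: str = "latin-1") -> str:
    """Decode character entities, then %-escape every non-safe character."""
    decoded = unescape(raw_attr_fragment)  # "&nbsp;" becomes U+00A0
    return quote(decoded, safe="", encoding=charset)


if __name__ == "__main__":
    print(escape_fragment("x&nbsp;x"))           # x%A0x
    print(escape_fragment("x&nbsp;x", "utf-8"))  # x%C2%A0x
    # Going the other way, "%A0" read as Latin-1 is the same character
    # that "&nbsp;" names, so the href and the name attribute can match.
    print(unquote("x%A0x", encoding="latin-1") == unescape("x&nbsp;x"))  # True
```

The second call shows why the charset matters: escaping the same character as UTF-8 yields "%C2%A0", not the "%A0" reported in comments 20 and 21.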

Status: ASSIGNED → RESOLVED

Closed: 23 years ago

Resolution: --- → FIXED

bsharma

Comment 22

•

23 years ago

Verified on: build: 2001-07-02-04-Trunk platform: WinNT Loaded both the test cases and they load fine.

Status: RESOLVED → VERIFIED

Heikki Toivonen (remove -bugzilla when emailing directly)

Comment 23

•

23 years ago

SPAM. HTML Element component deprecated, changing component to Layout. See bug 88132 for details.

Component: HTML Element → Layout
