Closed Bug 30386 Opened 25 years ago Closed 23 years ago

" " in internal href adds strange "Â" character

Categories

(Core :: Layout, defect, P3)

x86
Windows NT
defect

Tracking

()

VERIFIED FIXED
Future

People

(Reporter: faught, Assigned: waterson)

Details

(Keywords: relnote, testcase, Whiteboard: relnote-devel)

Attachments

(2 files)

I loaded this test page using Mozilla M14 (build ID 2000022820): 

----------cut here
<a href="#x&nbsp;x">Link here</a><p>

<a name="x&nbsp;x">Target here</a>
----------cut here

When I click on the "Link here" link, nothing happens (to test internal links 
like this, you have to reduce the height of the browser so it's shorter than the 
page, and/or add additional lines between the link and the target).  The URL it 
was trying to go to was "file:///C|/TEMP/spacelink.html#xÂ x" - it inserted an 
extra character before the space.

Note that this name attribute isn't actually valid HTML, but Navigator 4.61 does 
handle it properly.  At the very least, the extra character inserted into the 
URL looks like a bug.

This might be related to bug #29312.
The attachment is the reporter's testcase; the problem is confirmed with
2000-03-04-17-M15 on WinNT.

The incorrect "#xÂ x" shows up in the status bar after clicking on the
"Link here" link, but not in the URL bar. Each click on that link adds 
another blank entry onto the Go menu session history; it takes the same number
of presses on the [Back] button to return to the testcase page.

The HTML 4.0 DTD defines URIs to be CDATA, which can contain character entities.
URIs are further restricted to ASCII characters by the
HTML 4 spec: http://206.51.27.220/html/struct/links.html#h-12.2.1 and
http://www.w3.org/TR/REC-html40/appendix/notes.html#non-ascii-chars 

On the other hand, at the latter URL under B.2.2 the use of & as the start
character of a character entity takes precedence over its use as a delimiter
between form fields. The spec makes no mention of character entities within
fragment identifiers, but if they are legal inside the query part of the URI,
they probably should be in the fragment identifier too, especially since
RFC 2396 puts few limits on the legal ASCII characters in a fragment.

The legal characters for URIs are defined in section 2 of RFC 2396
http://www.ietf.org/rfc/rfc2396.txt  The fragment identifier is defined in
section 4.1. No characters are reserved in the fragment identifier portion of
a URI reference. The only characters excluded will be those listed in section 
2.4.3 - control characters, space, and delimiters. Quoting from Appendix A:
      > fragment      = *uric
      > uric          = reserved | unreserved | escaped
      > reserved      = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                      "$" | ","
      > unreserved    = alphanum | mark
      > mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
                      "(" | ")"
      [alphanum means what you'd expect]
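
The grammar quoted above can be turned into a small checker. This is a minimal sketch, assuming only the Appendix A productions shown here; the name is_rfc2396_fragment is hypothetical and nothing below is Mozilla code. It illustrates that the literal string "x&nbsp;x" is made entirely of uric characters, while a raw space or a raw U+00A0 is not.

```python
import re

# Character classes transcribed from the RFC 2396 Appendix A grammar quoted above.
RESERVED = ";/?:@&=+$,"
MARK = "-_.!~*'()"
ALPHANUM = "A-Za-z0-9"
ESCAPED = r"%[0-9A-Fa-f]{2}"

# fragment = *uric ; uric = reserved | unreserved | escaped
URIC = "(?:[" + ALPHANUM + re.escape(RESERVED + MARK) + "]|" + ESCAPED + ")"
FRAGMENT_RE = re.compile(URIC + "*")


def is_rfc2396_fragment(text: str) -> bool:
    """Return True if text matches the `fragment = *uric` production."""
    return FRAGMENT_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["x&nbsp;x", "x%A0x", "x x", "x\u00a0x"]:
        print(repr(candidate), is_rfc2396_fragment(candidate))
```

Under this reading, the entity reference survives as a literal fragment and so does the %A0 escape, but the decoded no-break space itself would have to be escaped before it could appear in a URI.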

Whether the fragment should appear in the URL bar as "x x" or as "x&nbsp;x" is
a separate question, but since the RFC allows those characters and
the HTML 4.01 DTD calls URIs CDATA, it should be as legal as "x;acirc;x".

If character entities are not allowable in fragment identifiers for some
reason given by the HTML spec that I have missed, "x&nbsp;x" should still
be legal as a literal in the fragment identifier only, and that fragment 
identifier ought to work.

Section 2.1 of RFC 2396 states that the problem of identifying the correct
character encoding for byte sequences defined by %xx in URIs is one left
for a future version of the spec. Given that, allowing (non-numerical) 
character entities in all parts of a URL where they are not otherwise 
prohibited seems sensible.
Status: UNCONFIRMED → NEW
Ever confirmed: true
This works for me on NT; chris, can you confirm?
Assignee: rickg → petersen
With the April 12th build, I get the "Â" character appearing in the status bar of 
the window when a mouseover occurs on the link. Clicking on the link doesn't go to the 
target in the file. I have attached a modified version of the original testcase 
with BR elements to separate the link and target.
Back to Rick.
Assignee: petersen → rickg
reassigning
Assignee: rickg → harishd
A bit more data here: I fixed the ReduceEntities() function this weekend, and 
used this as a testcase: <a href="foo &cent;&reg;">.  The odd thing is that the 
sink is doing exactly the right thing, but when you mouse over the link we see 
the extra funky "Â" character. I think it's in the HTML attribute handling of 
the content model or in the link handling code. 
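
One plausible explanation for the extra "Â", not confirmed anywhere in this report, is that the already-decoded character is written out as UTF-8 and then read back one byte at a time as ISO-8859-1. The sketch below only reproduces that symptom in Python; the helper name misread_utf8_as_latin1 is hypothetical and says nothing about where in the attribute or link handling code such a round trip would occur.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


NBSP = "\u00a0"   # the character &nbsp; stands for
CENT = "\u00a2"   # &cent;
REG = "\u00ae"    # &reg;

# U+00A0 is the two bytes C2 A0 in UTF-8; read as ISO-8859-1 they become
# U+00C2 (the "Â" character) followed by U+00A0.
print(NBSP.encode("utf-8").hex())
print([hex(ord(ch)) for ch in misread_utf8_as_latin1(NBSP)])

# The same pattern would put a "Â" in front of each entity in the
# href="foo &cent;&reg;" testcase.
print(misread_utf8_as_latin1("foo " + CENT + REG))
```

Every character in the U+0080 to U+00BF range has C2 as its UTF-8 lead byte, which would explain why the same stray character shows up for &nbsp;, &cent; and &reg; alike.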
1) Can anyone explain a valid reason *why* someone would use &nbsp; or other 
entities as the target of a HREF?
2) Can anyone demonstrate that this usage is common (or even present at all) on 
the Web, esp. Top 100?

4xp, relnote, FUTURE unless someone demonstrates this is a real problem for real 
sites.
Keywords: 4xp, relnote
Target Milestone: --- → Future
This bug report originated from a real web page at
http://www.geraldmweinberg.com.  The maintainer changed the offending code when
I suggested that it wasn't compliant with the standard.  I don't have any
further data on how common this construct is in general.
It's indeed html attribute handling. Over to Waterson.
Assignee: harishd → waterson
marking assigned, unless jst wants to look at it first.
Status: NEW → ASSIGNED
The simplest testcase is already attached.  Marking "testcase"
Keywords: testcase
massive update for QA contact.
QA Contact: petersen → lorca
Composer is also seeing some strangeness with entities in links/anchors.  CC 
Charley so he can followup when he returns from sabbatical.
Whiteboard: relnote-devel
Nom. nsbeta1 for backward compatibility with existing (****) HTML content.
Keywords: nsbeta1
Reassigning QA Contact for all open and unverified bugs previously under Lorca's
care to Gerardo as per phone conversation this morning.
QA Contact: lorca → gerardok
qa contact updated.
QA Contact: gerardok → bsharma
Both attached test cases are working for me using the Mozilla 0.9.1 build
(Build ID: 2001060713) on Linux.  Is this still a bug?
Build 2001061304 win32 installer sea talkback trunk
1 The "strange" character no longer appears
2 The "strange" character is now replaced with %A0
3 In the second testcase the link is proven to actually work!
Questions:
Is this correct behavior?
Should the note in:
http://developer.netscape.com/docs/technote/gecko/n6release.html
be corrected?
I get the same results ("#x x" is now "x%A0x", and the link works) using 
Mozilla _0.9_ on WinNT.

 > Questions:
 > Is this correct behavior?

Well, the important thing is that it is not incorrect: "%A0" is (roughly 
speaking) a synonym for "&#160;", which is a synonym for "&nbsp;" in HTML's 
default character set. More to the point, this is exactly what the HTML 4 spec 
says should happen for non-ascii characters:
  http://www.w3.org/TR/REC-html40/appendix/notes.html#non-ascii-chars
-- and since character entity references are meant to stand in for single
non-ascii characters, this behaviour is sensible enough.

Unless someone can point to a spec that says otherwise, the difference 
(in display or in HREF-matching) between " " and "%A0" for "&nbsp;" is a 
difference that makes no difference.
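
The equivalences claimed above can be checked with the Python standard library. This is an illustrative sketch, not a description of what Mozilla does internally. It also shows one nuance: "%A0" is the escape for the single ISO-8859-1 byte, whereas the procedure in the HTML 4 note cited above (represent the character in UTF-8, then escape each byte as %HH) would yield "%C2%A0".

```python
import html
from urllib.parse import quote, unquote

# &nbsp; and &#160; both resolve to U+00A0.
nbsp = html.unescape("&nbsp;")
same_as_numeric = nbsp == html.unescape("&#160;") == "\u00a0"

# Escaping the single ISO-8859-1 byte gives the form seen in the newer builds.
latin1_escaped = quote(nbsp, encoding="latin-1")

# Escaping the UTF-8 bytes gives the form the HTML 4 note describes.
utf8_escaped = quote(nbsp, encoding="utf-8")

# Either escape decodes back to U+00A0 when read with the matching charset.
round_trip_latin1 = unquote("%A0", encoding="latin-1")
round_trip_utf8 = unquote("%C2%A0", encoding="utf-8")

print(same_as_numeric, latin1_escaped, utf8_escaped)
```

Whichever escape is used, href-to-name matching only works if the link and the anchor are normalized the same way before they are compared.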

 > Should the note in:
 > http://developer.netscape.com/docs/technote/gecko/n6release.html 
 > be corrected?

No, those notes are for Netscape 6, which is finalized; this bug is
as active as ever in all N6 binaries. Before Netscape 6.next comes out,
I'm sure someone will go through all closed "relnote" bugs and prune notes
accordingly.

Quoting from above: 
 > --- Additional Comments From ekrock 2000-05-20 23:14 ---
 > 1) Can anyone explain a valid reason *why* someone would use &nbsp; or other 
 > entities as the target of a HREF?
Valid, no, but I'd guess cut-n-paste from a heading element ;->

Calling this FIXED.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Verified on:
build: 2001-07-02-04-Trunk
platform: WinNT

Loaded both the test cases and they load fine.
Status: RESOLVED → VERIFIED
SPAM. HTML Element component deprecated, changing component to Layout. See bug
88132 for details.
Component: HTML Element → Layout