42287 - Need to strip whitespace from ends of non-CDATA attributes in HTML

Reporter

Description

•

25 years ago

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Win95; en-US; m16) Gecko/20000609
BuildID:    2000060908

If a page as a url reference like:

<A   href="  /categoria/doc.asp?cd=6365 " >


the URL that will be formed is:
http://www.negocios.pt/  /categoria/doc.asp?cd=6365


It seems that the URL parser doesn't remove the spaces before and after.

It works fine in Netscape 4.73

Reproducible: Always
Steps to Reproduce:
1.open site
2.click on link
3.bad URL

Actual Results:  gets a page saying the invalid URL

Expected Results:  Show the page

Bernd Mielke

Updated

•

25 years ago

Summary: URL malformed when HTML code like <A href=" /categoria/doc.asp?cd=6365 " > → URL malformed when HTML code like <A href=" /categoria/doc.asp?cd=6365 " >

Jerry Baker

Comment 1

•

25 years ago

I would think that the bug is actually in Netscape 4.x if it properly constructs 
that link. Quote marks are used to delimit the string that is contained between 
them. Mozilla is simply appending the data contained in the quote marks to the 
base URL. That is good and well. Spaces are illegal in URL's, and so the URL 
will fail. That is what should happen.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 2

•

25 years ago

I think stripping the spaces may actually be correct, but I'd have to check the 
spec.  Even so, this is a 4xp issue...

Hixie (not reading bugmail)

Comment 3

•

25 years ago

David is correct; per SGML and leading or trailing spaces or linefeeds should 
be trimmed in attributes.

I think.

Asa Dotzler [:asa]

Comment 4

•

25 years ago

setting bug status to New

Status: UNCONFIRMED → NEW

Ever confirmed: true

Jerry Baker

Comment 5

•

25 years ago

Learn something new everyday. So <a href="
                     htt
p://ww
w.
yahoo.co
m                                 ">

is OK?

Hixie (not reading bugmail)

Comment 6

•

25 years ago

Not really, that should be turned into:

   <a href="htt p://ww w. yahoo.co m">

...which is probably not what you meant. (And I may be wrong about the 
normalisation in the middle as well; I can't remember if HREF is CDATA and if
CDATA is treated like an SGML minimum literal. There is already a bug on this
somewhere in Bugzilla, it may be closed.)

Jerry Baker

Comment 7

•

25 years ago

Linefeeds are ok, no?

rickg

Comment 8

•

25 years ago

Ian's right -- but linefeeds are not ok.

karnaze (gone)

Comment 9

•

25 years ago

Reassigning to Harish.

Assignee: clayton → harishd

harishd

Assignee

Comment 10

•

25 years ago

This is something that could wait. I think.  FUTURING the bug.

Let me know your concerns. Thanx.

Status: NEW → ASSIGNED

Target Milestone: --- → Future

Hixie (not reading bugmail)

Comment 11

•

24 years ago

*** Bug 22836 has been marked as a duplicate of this bug. ***

Hixie (not reading bugmail)

Comment 12

•

24 years ago

*** Bug 49532 has been marked as a duplicate of this bug. ***

Hixie (not reading bugmail)

Updated

•

24 years ago

Keywords: correctness, helpwanted, mostfreq, testcase

OS: Windows 95 → All

Hardware: PC → All

gerardok

Comment 13

•

24 years ago

massive update for QA contact.

QA Contact: petersen → lorca

harishd

Assignee

Updated

•

24 years ago

Target Milestone: Future → mozilla0.9

Hixie (not reading bugmail)

Comment 14

•

24 years ago

Reassigning QA Contact for all open and unverified bugs previously under Lorca's
care to Gerardo as per phone conversation this morning.

QA Contact: lorca → gerardok

harishd

Assignee

Comment 15

•

24 years ago

This bug has been marked "future" because the original netscape engineer working 
on this is over-burdened. If you feel this is an error, that you or another
known resource will be working on this bug,or if it blocks your work in some way 
-- please attach your concern to the bug for reconsideration -----

Target Milestone: mozilla0.9 → Future

bsharma

Comment 16

•

24 years ago

qa contact updated.

QA Contact: gerardok → bsharma

Christopher Hoess (gone)

Comment 17

•

23 years ago

First off, href is CDATA.  Goldfarb, p. 331 (on deriving the attribute value 
from the attribute value literal): "The interpretation is accomplished by 
replacing any entity references or character references within the literal and 
then normalizing the resulting attribute value by throwing out any entity ends 
and record starts and replacing any record end or separator characters with a 
space...At this point, if the declared value of the attribute is character data 
(CDATA), derivation of the attribute value is complete.  If the declared value 
is anything else...there is a further process in which any sequence of space 
characters is replaced by a single space.  Also, if there are any space 
characters at the beginning or end of the attribute value, those are deleted." 

In summary: we should be replacing linefeeds with whitespace (bug 47078) in all 
attributes.  This bug as currently phrased is invalid because "href" is CDATA 
and doesn't get the whitespace normalization.  However, the summary should be 
changed to reflect the need for whitespace stripping/normalization on all non-
CDATA attributes.

Christopher Hoess (gone)

Comment 18

•

23 years ago

After seeing bug 82243, I was inspired to compile a list of the attributes 
which are, in fact, in need of whitespace stripping in HTML 4.01 Strict (non-
CDATA): there are surprisingly few.  These are:
"lang", "id", "dir", and "tabindex" attributes.
A "shape"
PARAM "valuetype"
FORM "method"
LABEL "for"
INPUT "maxlength"
SELECT "size"
TEXTAREA "rows"
TEXTAREA "cols"
In tables:
"align", "valign", "frame", "rules", "span", "headers", "rowspan", "colspan", "s
cope"
META "http-equiv", "name"

Frameset adds "frameborder", "scrolling", and "align".

I haven't listed the monovalued attributes (i.e. foo="foo"), but if they're 
found in full form, they should also have whitespace normalized.

Christopher Hoess (gone)

Comment 19

•

23 years ago

Oops, that should be bug 88243 above.  Changing the summary on this to reflect 
the nature of the bug.  Note that the rules for XML normalization are, as far 
as I can tell, exactly the same as these (just restated); since we're a non-
validating parser, all we have to do is transform linefeeds to whitespace.  
Spun off bug 94174 to deal with the original problem.  Adding html4 keyword, as 
this is an issue with proper SGML behavior in HTML.  Changing component to 
Parser (HTML ELement is deprecated, yes?)

URL: http://www.negocios.pt

Component: HTML Element → Parser

Keywords: html4

Summary: URL malformed when HTML code like <A href=" /categoria/doc.asp?cd=6365 " > → Need to strip whitespace from ends of non-CDATA attributes in HTML

Mats Palmgren (inactive)

Comment 20

•

23 years ago

*** Bug 96092 has been marked as a duplicate of this bug. ***

Andreas M. "Clarence" Schneider

Comment 21

•

23 years ago

Actually, Mozilla strips whitespace from the ends of *every* HTML attribute:
http://lxr.mozilla.org/seamonkey/source/content/html/document/src/nsHTMLContentSink.cpp#709
But the list of whitespace chars contains only '\n', '\t', '\r' and '\b'.

Andreas M. "Clarence" Schneider

Comment 22

•

23 years ago

The patch in attachment 47271 [details] [diff] [review] to bug 95754 fixes this bug too.

Moied

Updated

•

23 years ago

QA Contact: bsharma → moied

R.K.Aa.

Comment 23

•

23 years ago

*** Bug 104571 has been marked as a duplicate of this bug. ***

R.K.Aa.

Comment 24

•

23 years ago

*** Bug 107352 has been marked as a duplicate of this bug. ***

Jerry Baker

Comment 25

•

23 years ago

Mass removing self from CC list.

Jerry Baker

Comment 26

•

23 years ago

Now I feel sumb because I have to add back. Sorry for the spam.

Susie Wyshak

Comment 27

•

23 years ago

Attencion: This bug has not received enough visibility.

It is affecting some key AOL & parter sites, which we have evangelized in the
meantime but are an indicator that it could be prolific. 

 - http://www.atlantic-records.com/ (thumbnails)
 - http://www.mimiessentials.com (product thumbnails)
 - http://www.metlife.com (menu images)

Can we please beef up the priority?

URL: http://www.metlife.com, http://www.mi...

Keywords: nsbeta1, topembed

Susie Wyshak

Comment 28

•

23 years ago

Tracking broken sites in bugscape 14779

Whiteboard: [bugscape 14779 ]

Heikki Toivonen (remove -bugzilla when emailing directly)

Comment 29

•

23 years ago

I don't think those sites Suzie listed describe this bug. Those sites have
problems with images: leading/trailing spaces in src attribute value. I somehow
think I've seen a bug about that before and I am almost sure it has been fixed
at some point as well. Anyone find it? I believe the img src issue should be
fixed in the img element code or possibly in image loader code. Not in parser in
my opinion.

Heikki Toivonen (remove -bugzilla when emailing directly)

Comment 30

•

23 years ago

I think the fix might need to go into nsHTMLImageElement::SetSrcInner().

Susie Wyshak

Comment 31

•

23 years ago

Bclary believes bug 87894 is causing the bugs in comment #27. So THAT's the one
we woud like fixed by rtm :-)

Jaime Rodriguez, Jr.

Comment 32

•

23 years ago

Removing [bugscape 14779 ] and nsbeta1 nomination as this bug was mistakedly
marked as the 'cause for problems with images: leading/trailing spaces in src
attribute value.

Keywords: nsbeta1

Whiteboard: [bugscape 14779 ]

harishd

Assignee

Comment 33

•

23 years ago

>I don't think those sites Suzie listed describe this bug. Those sites have
>problems with images: leading/trailing spaces in src attribute value.

Sure it's not an exact dupe of this bug but very similar. I'll have to revisit
the spec. to see if this problem could be fixed in the parser itself.

Heikki Toivonen (remove -bugzilla when emailing directly)

Comment 34

•

23 years ago

Of course, the problem of removing the whitespace in the parser means dataloss
when you save the page. Generally this might not be a problem, but it might.
Having the element/layout code deal with whitespace seems like the safer approach.

harishd

Assignee

Comment 35

•

23 years ago

Well, it's not really a data loss since we're only removing unnecessary data. 

And according to the spec.

"User agents may ignore leading and trailing white space in CDATA attribute
values (e.g., "   myval   " may be interpreted as "myval")."

harishd

Assignee

Comment 36

•

23 years ago

After investigating the bug I belive that this bug should in fact be fixed in
|layout|.

Reasons are:
1) We can target the fix to a specific attribute. Whereas in parser it would be
a more generic fix and therefore could have an impact on the performance.
2) Setting attribute values via DOM will benifit from layout fix. That is, if
the bug is fixed in the parser then attribute values, with leading/trailing
whitespace, set via DOM will exhibit the same bug.
3) With a layout fix we don't need additional code to preserve view-source.

Dave Barrowman

Comment 37

•

23 years ago

topembed- per EDT.  Susie, if you think this is the wrong decision, please let
us know.

Keywords: topembed → topembed-

Susie Wyshak

Comment 38

•

23 years ago

Hi Dave. Good news. This was fixed by
http://bugzilla.mozilla.org/show_bug.cgi?id=87894

(not sure if this bug should be marked a dupe or if it's another issue that
needs to be fixed.)

Christopher Hoess (gone)

Comment 39

•

23 years ago

No, there's SGML parsing issues separate from the fix that was made. This should 
remain open.

Christopher Hoess (gone)

Updated

•

23 years ago

Whiteboard: [HTML4-6.2]

Chris Nalls

Comment 40

•

22 years ago

This was marked topembed- per EDT in June.  Susie, please check these sites and let
us know if this needs to be plussed.

Michael Dunn

Comment 41

•

22 years ago

Per Susie comment fixed with: bug 87894

Status: ASSIGNED → RESOLVED

Closed: 22 years ago

Resolution: --- → WORKSFORME

Christopher Hoess (gone)

Comment 42

•

22 years ago

No, it's not. See comment #39 and comment #18, for instance. Bug 87894 affects 
a single *CDATA* HTML attribute, and should have been a quirks-mode-only patch 
(as I mentioned in the bug), but standards are evidently below us, here.

URL: http://www.metlife.com, http://www.mi...

Status: RESOLVED → REOPENED

Resolution: WORKSFORME → ---

Christopher Hoess (gone)

Comment 43

•

22 years ago

n.b.--I looked back at the HTML 4.01 spec, and as Harish said, it does allow 
UAs to trim leading and trailing whitespace on all attributes. I'm not sure 
whether it's actually allowed this leeway as an SGML application, but it's at 
least arguably correct, so I apologize for the snarkiness of my last sentence.

Michael Dunn

Updated

•

22 years ago

Whiteboard: [HTML4-6.2] → [HTML4-6.2] [t2]

Boris Zbarsky [:bzbarsky]

Comment 44

•

20 years ago

Quite frankly, implementing this would be extra complexity for no essential
benefit.  Trying to really follow SGML for tag-soup parsing is really just not
worth it in the end...

Does anyone have a good reason I shouldn't just wontfix this?

Blake Kaplan (:mrbkap) (inactive)

Comment 45

•

20 years ago

This, as a general bug is WONTFIX. While HTML is a SGML derived language, our
parser is not a SGML parser. SGML has some really funky rules regarding
whitespace (as far as I can tell) and things like this add complexity to the
parser (which really doesn't need it). If we need to fix this in specific cases
(like IMG src tags), it should be fixed in those places.

Status: REOPENED → RESOLVED

Closed: 22 years ago → 20 years ago

Resolution: --- → WONTFIX