Closed Bug 42287 Opened 25 years ago Closed 20 years ago

Need to strip whitespace from ends of non-CDATA attributes in HTML

Categories

(Core :: DOM: HTML Parser, defect, P3)

defect

Tracking

()

RESOLVED WONTFIX
Future

People

(Reporter: fpcosta, Assigned: harishd)

References

Details

(4 keywords, Whiteboard: [HTML4-6.2] [t2])

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Win95; en-US; m16) Gecko/20000609
BuildID:    2000060908

If a page as a url reference like:

<A   href="  /categoria/doc.asp?cd=6365 " >


the URL that will be formed is:
http://www.negocios.pt/  /categoria/doc.asp?cd=6365


It seems that the URL parser doesn't remove the spaces before and after.

It works fine in Netscape 4.73

Reproducible: Always
Steps to Reproduce:
1.open site
2.click on link
3.bad URL

Actual Results:  gets a page saying the invalid URL

Expected Results:  Show the page
Summary: URL malformed when HTML code like <A href=" /categoria/doc.asp?cd=6365 " > → URL malformed when HTML code like <A href=" /categoria/doc.asp?cd=6365 " >
I would think that the bug is actually in Netscape 4.x if it properly constructs 
that link. Quote marks are used to delimit the string that is contained between 
them. Mozilla is simply appending the data contained in the quote marks to the 
base URL. That is good and well. Spaces are illegal in URL's, and so the URL 
will fail. That is what should happen.
I think stripping the spaces may actually be correct, but I'd have to check the 
spec.  Even so, this is a 4xp issue...
David is correct; per SGML and leading or trailing spaces or linefeeds should 
be trimmed in attributes.

I think.
setting bug status to New
Status: UNCONFIRMED → NEW
Ever confirmed: true
Learn something new everyday. So <a href="
                     htt
p://ww
w.
yahoo.co
m                                 ">

is OK?
Not really, that should be turned into:

   <a href="htt p://ww w. yahoo.co m">

...which is probably not what you meant. (And I may be wrong about the 
normalisation in the middle as well; I can't remember if HREF is CDATA and if
CDATA is treated like an SGML minimum literal. There is already a bug on this
somewhere in Bugzilla, it may be closed.)
Linefeeds are ok, no?
Ian's right -- but linefeeds are not ok.
Reassigning to Harish.
Assignee: clayton → harishd
This is something that could wait. I think.  FUTURING the bug.

Let me know your concerns. Thanx.
Status: NEW → ASSIGNED
Target Milestone: --- → Future
*** Bug 22836 has been marked as a duplicate of this bug. ***
*** Bug 49532 has been marked as a duplicate of this bug. ***
OS: Windows 95 → All
Hardware: PC → All
massive update for QA contact.
QA Contact: petersen → lorca
Target Milestone: Future → mozilla0.9
Reassigning QA Contact for all open and unverified bugs previously under Lorca's
care to Gerardo as per phone conversation this morning.
QA Contact: lorca → gerardok
This bug has been marked "future" because the original netscape engineer working 
on this is over-burdened. If you feel this is an error, that you or another
known resource will be working on this bug,or if it blocks your work in some way 
-- please attach your concern to the bug for reconsideration -----
Target Milestone: mozilla0.9 → Future
qa contact updated.
QA Contact: gerardok → bsharma
First off, href is CDATA.  Goldfarb, p. 331 (on deriving the attribute value 
from the attribute value literal): "The interpretation is accomplished by 
replacing any entity references or character references within the literal and 
then normalizing the resulting attribute value by throwing out any entity ends 
and record starts and replacing any record end or separator characters with a 
space...At this point, if the declared value of the attribute is character data 
(CDATA), derivation of the attribute value is complete.  If the declared value 
is anything else...there is a further process in which any sequence of space 
characters is replaced by a single space.  Also, if there are any space 
characters at the beginning or end of the attribute value, those are deleted." 

In summary: we should be replacing linefeeds with whitespace (bug 47078) in all 
attributes.  This bug as currently phrased is invalid because "href" is CDATA 
and doesn't get the whitespace normalization.  However, the summary should be 
changed to reflect the need for whitespace stripping/normalization on all non-
CDATA attributes.
After seeing bug 82243, I was inspired to compile a list of the attributes 
which are, in fact, in need of whitespace stripping in HTML 4.01 Strict (non-
CDATA): there are surprisingly few.  These are:
"lang", "id", "dir", and "tabindex" attributes.
A "shape"
PARAM "valuetype"
FORM "method"
LABEL "for"
INPUT "maxlength"
SELECT "size"
TEXTAREA "rows"
TEXTAREA "cols"
In tables:
"align", "valign", "frame", "rules", "span", "headers", "rowspan", "colspan", "s
cope"
META "http-equiv", "name"

Frameset adds "frameborder", "scrolling", and "align".

I haven't listed the monovalued attributes (i.e. foo="foo"), but if they're 
found in full form, they should also have whitespace normalized.
Oops, that should be bug 88243 above.  Changing the summary on this to reflect 
the nature of the bug.  Note that the rules for XML normalization are, as far 
as I can tell, exactly the same as these (just restated); since we're a non-
validating parser, all we have to do is transform linefeeds to whitespace.  
Spun off bug 94174 to deal with the original problem.  Adding html4 keyword, as 
this is an issue with proper SGML behavior in HTML.  Changing component to 
Parser (HTML ELement is deprecated, yes?)
Component: HTML Element → Parser
Keywords: html4
Summary: URL malformed when HTML code like <A href=" /categoria/doc.asp?cd=6365 " > → Need to strip whitespace from ends of non-CDATA attributes in HTML
*** Bug 96092 has been marked as a duplicate of this bug. ***
Actually, Mozilla strips whitespace from the ends of *every* HTML attribute:
http://lxr.mozilla.org/seamonkey/source/content/html/document/src/nsHTMLContentSink.cpp#709
But the list of whitespace chars contains only '\n', '\t', '\r' and '\b'.
The patch in attachment 47271 [details] [diff] [review] to bug 95754 fixes this bug too.
QA Contact: bsharma → moied
*** Bug 104571 has been marked as a duplicate of this bug. ***
*** Bug 107352 has been marked as a duplicate of this bug. ***
Mass removing self from CC list.
Now I feel sumb because I have to add back. Sorry for the spam.
Attencion: This bug has not received enough visibility.

It is affecting some key AOL & parter sites, which we have evangelized in the
meantime but are an indicator that it could be prolific. 

 - http://www.atlantic-records.com/ (thumbnails)
 - http://www.mimiessentials.com (product thumbnails)
 - http://www.metlife.com (menu images)

Can we please beef up the priority?
Tracking broken sites in bugscape 14779  
Whiteboard: [bugscape 14779 ]
I don't think those sites Suzie listed describe this bug. Those sites have
problems with images: leading/trailing spaces in src attribute value. I somehow
think I've seen a bug about that before and I am almost sure it has been fixed
at some point as well. Anyone find it? I believe the img src issue should be
fixed in the img element code or possibly in image loader code. Not in parser in
my opinion.
I think the fix might need to go into nsHTMLImageElement::SetSrcInner().
Bclary believes bug 87894 is causing the bugs in comment #27. So THAT's the one
we woud like fixed by rtm :-)
Removing [bugscape 14779 ] and nsbeta1 nomination as this bug was mistakedly
marked as the 'cause for problems with images: leading/trailing spaces in src
attribute value.
Keywords: nsbeta1
Whiteboard: [bugscape 14779 ]
>I don't think those sites Suzie listed describe this bug. Those sites have
>problems with images: leading/trailing spaces in src attribute value.

Sure it's not an exact dupe of this bug but very similar. I'll have to revisit
the spec. to see if this problem could be fixed in the parser itself.
Of course, the problem of removing the whitespace in the parser means dataloss
when you save the page. Generally this might not be a problem, but it might.
Having the element/layout code deal with whitespace seems like the safer approach.
Well, it's not really a data loss since we're only removing unnecessary data. 

And according to the spec.

"User agents may ignore leading and trailing white space in CDATA attribute
values (e.g., "   myval   " may be interpreted as "myval")."
After investigating the bug I belive that this bug should in fact be fixed in
|layout|.

Reasons are:
1) We can target the fix to a specific attribute. Whereas in parser it would be
a more generic fix and therefore could have an impact on the performance.
2) Setting attribute values via DOM will benifit from layout fix. That is, if
the bug is fixed in the parser then attribute values, with leading/trailing
whitespace, set via DOM will exhibit the same bug.
3) With a layout fix we don't need additional code to preserve view-source.
topembed- per EDT.  Susie, if you think this is the wrong decision, please let
us know.
Keywords: topembedtopembed-
Hi Dave. Good news. This was fixed by
http://bugzilla.mozilla.org/show_bug.cgi?id=87894

(not sure if this bug should be marked a dupe or if it's another issue that
needs to be fixed.)
No, there's SGML parsing issues separate from the fix that was made. This should 
remain open.
Whiteboard: [HTML4-6.2]
This was marked topembed- per EDT in June.  Susie, please check these sites and let
us know if this needs to be plussed.
Per Susie comment fixed with: bug 87894
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → WORKSFORME
No, it's not. See comment #39 and comment #18, for instance. Bug 87894 affects 
a single *CDATA* HTML attribute, and should have been a quirks-mode-only patch 
(as I mentioned in the bug), but standards are evidently below us, here.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
n.b.--I looked back at the HTML 4.01 spec, and as Harish said, it does allow 
UAs to trim leading and trailing whitespace on all attributes. I'm not sure 
whether it's actually allowed this leeway as an SGML application, but it's at 
least arguably correct, so I apologize for the snarkiness of my last sentence.
Whiteboard: [HTML4-6.2] → [HTML4-6.2] [t2]
Quite frankly, implementing this would be extra complexity for no essential
benefit.  Trying to really follow SGML for tag-soup parsing is really just not
worth it in the end...

Does anyone have a good reason I shouldn't just wontfix this?
This, as a general bug is WONTFIX. While HTML is a SGML derived language, our
parser is not a SGML parser. SGML has some really funky rules regarding
whitespace (as far as I can tell) and things like this add complexity to the
parser (which really doesn't need it). If we need to fix this in specific cases
(like IMG src tags), it should be fixed in those places.
Status: REOPENED → RESOLVED
Closed: 22 years ago20 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.