Closed Bug 322270 Opened 19 years ago Closed 15 years ago

Convert all whitespace to spaces in (most) HTML attributes (title, alt, javascript,...)

Categories

(Core :: DOM: HTML Parser, defect, P3)

defect

Tracking

()

RESOLVED WONTFIX
mozilla1.9alpha1

People

(Reporter: steuard+moz, Unassigned)

References

(Blocks 1 open bug, )

Details

Attachments

(2 files)

Several longstanding bugs (such as bug 67127 and an issue seen in bug 92799) have the same root cause: whitespace characters like carriage returns that appear in HTML attribute values are not converted to spaces as the HTML specification dictates. (For type CDATA, user agents should "Replace each carriage return or tab with a single space.")  Effects of this problem range from ugly extra characters to broken scripts.  While it might be possible to identify and patch every resulting bug individually, a global solution would be preferred.

As discussed in bug 228099 comment 7, it could be dangerous to make this change for every attribute.  Some pages might break if this altered "value" attributes, and perhaps other CDATA attributes are similarly "fragile"(though I haven't thought of any yet).  But the list of attributes where this bug is known to cause problems is already rather long: title (bug 67127), alt (bug 67127 as originally filed), ondblclick (bug 92779 comment 11) and all other Javascript attributes.  So treating "value" as the special case might be the best approach.

Line breaks in attributes are not uncommon: anyone who composes HTML using an editor with word wrap (or otherwise limits line lengths for legibility) will
eventually create them, quite possibly without noticing or thinking anything of
it.  And some page authors intentionally include line breaks in "title" or "alt" attributes to force multi-line tooltips in MSIE (see bug 67127 comment 131 for discussion of that independent issue).  As it stands, these linebreaks lead to ugly black squares or symbols when seen in many Mozilla browsers (I've observed it in Firefox on both Mac and Windows and in Camino).

This bug report obviously contains substantial overlap with several others, but I feel that the current state of affairs justifies this separate entry.  Bug 67127 as originally filed was directly relevant, but it seems to have morphed into an independent effort to enable multi-line tooltips.  Bug 228099 spun off of that bug and includes additional suggestions for entity handling, but it did not indicate the need for an exception for the "value" attribute and was thus WONTFIXed.  Bug 47078 is almost an exact duplicate of this, but again lacked the "value" exception.  I believe that with that exception explicitly recognized, the decision to WONTFIX this issue deserves to be reconsidered.
This testcase shows how the bug affects "alt" text in images with an explicit size (in this case, large enough to show all of the text).  Specifically, the newlines in the HTML source show up as boxes in the image replacement text.  (Similar examples, for both alt and title, can be found in bug 67127.)

When the image is not given an explicit size, the alt text is eventually displayed as inline HTML or something like it (once the image has finally failed to load).  In that case, the newline characters are not displayed, and in fact all whitespace seems to be collapsed to a single space.  That may not be quite what the HTML 4.01 spec calls for (is the intended behavior clear?), but it certainly looks best to me.  For the record, it IS what the XHTML spec calls for: "When user agents process attributes... Map sequences of one or more white space characters (including line breaks) to a single inter-word space."  (http://www.w3.org/TR/xhtml1/#h-4.7)
Compare bug 114997
Depends on: 114997
Looking at the source, it seems that at least one place where line breaks aren't being handled properly is here: http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsHTMLTokens.cpp#1948.

The comments there refer to bug 15204 and to bug 47535 (a duplicate); the relevant discussion starts at bug 15204 comment 53 (and extending roughly through comment 72).  Most relevant may be bug 15204 comment 55, which includes a suggested change to the source (updated to reflect aRetainWhitespace -> aRetain in the current source):

# if(!aRetain) {
#   if (<we have a CDATA attribute>) {
#     mTextValue.StripChars("\n"); //per the HTML spec, ignore linefeeds...
#     mTextValue.ReplaceChar("\t"," "); // replace tabs with spaces and...
#     mTextValue.ReplaceChar("\r"," "); // replace CRs with spaces...
#   } else {
#     mTextValue.StripChars("\r\n"); //per the HTML spec, ignore linefeeds...
#   }
# }

For this to do what we want, we would need to set aRetain to True when reading a "value" attribute (where?), and recognize CDATA attributes somehow.  I have no idea if there's already a variable or flag available that would recognize that.

For consistency across platforms, we might also want to violate the HTML spec so as to treat all platforms' line breaks the same (as bug 15204 comment 55 noted, the code above would react differently to the same text written on UNIX vs. Windows).  That's unpleasant, but at least it looks like this behavior is a "should" rather than a "must".

To make approaches like bug 67127 comment 131 workable, it's important that all of this processing happen before character entites are replaced with characters.  I suspect that's what would happen if this code were put in place, but someone who actually knows the source should make sure.  (I know nothing about where entity substitution happens in the parsing process.)
We are considering changing these rules in HTML5. What do other browsers do? I tested Safari and found it did no stripping at all; it seems Mozilla does normalisation for "class" but not other attributes (on <p>, at least). I haven't tested other browsers. Maybe we should say that only CR->LF normalisation happens and nothing else?
Err. Where do you see aRetainWhitespace anywhere? aRetain died a couple of years ago. I suppose the proper check nowadays would be:
if (!(aFlag & NS_IPARSER_FLAG_VIEW_SOURCE)).

Of course, there is the problem that entities are expanded in attributes before we reach this code, which is an annoying bug that bites view source in a couple of places. It sort of bothers me to do this stripping in the tokenizer, but that seems like what is called for.
(In reply to comment #5)
> Err. Where do you see aRetainWhitespace anywhere? aRetain died a couple of
> years ago. I suppose the proper check nowadays would be:
> if (!(aFlag & NS_IPARSER_FLAG_VIEW_SOURCE)).

It's commented out, but you can see where I got the aRetain in the LXR link up in comment 3.  The aRetainWhitespace syntax came from bug 15204, which is obviously ancient.  I'm glad to know the new syntax (and I'd wondered why a search for aRetain wasn't turning anything up...).  So would it be dangerous to set this "view source" flag when parsing "value" attributes?

Actually, asuming this is the only point in the code that needs changing, we probably don't need to worry about this at all: we should have access to mTextFlag directly, so we can just test to see if it equals "value".  Is that safe?  (And is there any way to also check if we're inside an "input" element, as suggested recently in bug 114997 comment 15?)

> Of course, there is the problem that entities are expanded in attributes before
> we reach this code, which is an annoying bug that bites view source in a couple
> of places. It sort of bothers me to do this stripping in the tokenizer, but
> that seems like what is called for.

Can you give bug references for those?  From your comments, I guess that if it were easy to change the order, it would have been done already; that is unfortunate considering the sensible ideas for enabling _intentional_ line breaks developed in bug 67127.  Just how much re-engineering of the browser would be necessary to do this stripping first?  And would we expect the tokenizer to know the difference between a "value" attribute and an "onclick" attribute?  (I don't know where in the code to look for this, to be honest.)

Technically, doing this stripping before entity processing is a separate issue, but if it's feasible to do it "right" it will avoid wasted work when fixing other bugs.  On the other hand, if it looks like a proper fix would take a while, this bug causes enough headaches that it might be worth fixing it "wrong" in the meantime.
> it seems Mozilla does normalisation for "class"

We actually have bugs on that, fwiw (though they may be wontfix).  This is a consequence of us parsing "class" into an array of classes that we serialize on get.  We do similar "normalization" for "style" (it's parsed into CSS data and serialized on get).  Similar for cols/rows on frameset, width/height on image/applet/object/embed, and so forth.
Blocks: 322413
That's another reason I want to do this stripping in content. When we're consuming attributes, we have no way of retreiving the current tag name. It's possible to hack in special cases with flags (e.g. we could make a NS_HTMLTOKENIZER_FLAG_IN_INPUT_TAG flag to tell us that we're in an <input ...>) but that's a hack. This sounds like a stronger reason to allow attributes to somehow create entity tokens (and fix both bugs in one swell foop) than to simply fix view-source.
No longer blocks: 322413
Priority: -- → P3
Target Milestone: --- → mozilla1.9alpha
Based on discussion in bug 114997 comment 30 and later, it sounds like we _do_ want to strip whitespace characters from the value attribute in cases like <input type="text" value="...">.  It seems reasonably likely that the only value attribute that should preserve whitespace verbatim is <input type="hidden" value="...">.  Comments on other special cases and on how to achieve that are welcome.

As an explicit example, note that a bug fix here would mostly eliminate bug 256027 as well, leaving just a few loose ends as in bug 322413.
Blocks: 256027
Blocks: 358452
Assignee: mrbkap → nobody
No longer blocks: 67127
Titles in Acronym tags are not sanitized and use LF and TAB as format codes, see testcase in https://bug67127.bugzilla.mozilla.org/attachment.cgi?id=173547

Will this bug ever get FIXED when code is already marked with this bugs number?
http://mxr.mozilla.org/mozilla/source/browser/base/content/browser.js#2561
Hb, I'm not clear on what you're saying is wrong with the current behavior of the title text in that example (from bug 67127).  On my platform (Firefox 3.5.3 on Mac), I see the title text "Scalable Vector Graphics" without any special formatting or unusual spacing, which I believe is the correct behavior.

As for fixing this bug, my understanding is that doing so "correctly" would be Very Hard, requiring a substantial rewrite of the low-level parser.  I suspect that most of those who have enough experience to even contemplate such a fundamental change have many demands on their time, and this isn't that big of an issue.

Incidentally, could you add me to the CC lists of any of the bugs on "remaining issues" related to this (and to bug 67127) that you've filed recently?
We are now tracking HTML5--not HTML 4. HTML5 requires the original white space characters to be preserved in the parse.

If there are higher-level operations that need normalized whitespace, those operations should normalize whitespace on a higher layer without touching the DOM.

For the title attribute in particular, it's even desirable in some cases to have multi-line tooltips where the line breaks come from the line feeds in the attribute value.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX
(In reply to comment #11)
> ... I see the title text "Scalable Vector Graphics" without any special
> formatting or unusual spacing, which I believe is the correct behavior.

See attachment for display on SM 2.0 RC1 under XP. Acronym shows formatting from TAB and LF.

> Incidentally, could you add me to the CC lists of any of the bugs on "remaining
> issues" related to this (and to bug 67127) that you've filed recently?

I did not file bugs on these issues.
(In reply to comment #12)
> We are now tracking HTML5--not HTML 4. HTML5 requires the original white space
> characters to be preserved in the parse.

Ah, I see.  So I take it that means that the Mozilla parser is either already doing this or will do it soon?  That's good to hear!

> If there are higher-level operations that need normalized whitespace, those
> operations should normalize whitespace on a higher layer without touching the
> DOM.

Excellent.  So, er, doesn't that mean that this bug is now *more* relevant rather than less?  Or is the idea that a single global bug on the issue is no longer relevant, and that we should instead shift to focus on individual whitespace issues one at a time?  That sounds reasonable, though it would be very handy if there were some common function that we could call to normalize whitespace consistently.  I had trouble finding such a thing back when I was working on this stuff.

> For the title attribute in particular, it's even desirable in some cases to
> have multi-line tooltips where the line breaks come from the line feeds in the
> attribute value.

I'd say instead that it's desirable to have SOME way to specify line breaks in multi-line tooltips.  At least in the days of HTML4, white space wasn't supposed to be semantically meaningful, so a lot of people who line-wrapped their HTML for readability would end up with unexpected formatting problems or broken scripts (often without having any idea why).  See bug 67127 comment 131 for the more-or-less consensus that emerged (after painfully long discussion) on the best resolution for this issue.

Would that solution still be reasonable under HTML5?  Does the HTML5 parser finally make it possible to implement it?  (That is, to normalize whitespace in the original source and THEN substitute entities, so that "&#10;" would give a line break in a tooltip but an actual line break in the HTML would not?)
(In reply to comment #14)
> (In reply to comment #12)
> > We are now tracking HTML5--not HTML 4. HTML5 requires the original white space
> > characters to be preserved in the parse.
> 
> Ah, I see.  So I take it that means that the Mozilla parser is either already
> doing this or will do it soon?  That's good to hear!

It's already present but preffed off by default. You can get the new parser by setting the pref html5.enable to true on trunk.

> > If there are higher-level operations that need normalized whitespace, those
> > operations should normalize whitespace on a higher layer without touching the
> > DOM.
> 
> Excellent.  So, er, doesn't that mean that this bug is now *more* relevant
> rather than less?  Or is the idea that a single global bug on the issue is no
> longer relevant, and that we should instead shift to focus on individual
> whitespace issues one at a time?

It's possible that a tracking bug (maybe this one reopened) is still relevant, but it doesn't belong in the HTML: Parser component.

> > For the title attribute in particular, it's even desirable in some cases to
> > have multi-line tooltips where the line breaks come from the line feeds in the
> > attribute value.
> 
> I'd say instead that it's desirable to have SOME way to specify line breaks in
> multi-line tooltips.  At least in the days of HTML4, white space wasn't
> supposed to be semantically meaningful, so a lot of people who line-wrapped
> their HTML for readability would end up with unexpected formatting problems or
> broken scripts (often without having any idea why).  See bug 67127 comment 131
> for the more-or-less consensus that emerged (after painfully long discussion)
> on the best resolution for this issue.

The consensus seems to date from the pre-HTML5 era when people were trying to fit implementations to interpretations of HTML 4.01 instead of trying to fit specs to what was already deployed.

> Would that solution still be reasonable under HTML5?

http://hsivonen.iki.fi/test/moz/linebreak-in-title.html shows a two-line tooltip in Safari and IE8 (and the line break collapses to nothingness in Firefox and Opera), so no, a proposal that doesn't match either of the existing behaviors is probably not going to fly for HTML5.

> Does the HTML5 parser
> finally make it possible to implement it?  (That is, to normalize whitespace in
> the original source and THEN substitute entities, so that "&#10;" would give a
> line break in a tooltip but an actual line break in the HTML would not?)

I think it's pretty unlikely that the HTML5 were changed to normalize whitespace anywhere in the parser (apart from turning CR and CRLF to LF). However, if you believe the HTML5 spec is wrong, you could try filing a bug against the HTML5 spec, but it's highly likely that the bug report would be WONTFIXed citing the behavior of Safari and IE being compatible with existing content.
Note that HTML5 already endorses the rendering behavior of Safari and IE:
"If the title  attribute's value contains U+000A LINE FEED (LF) characters, the content is split into multiple lines. Each U+000A LINE FEED (LF) character represents a line break."
Many thanks for your detailed reply here.  I'm glad to know how this situation has changed (even if I'm ultimately disappointed by the result).

(In reply to comment #15)
> (In reply to comment #14)
[...]
> It's possible that a tracking bug (maybe this one reopened) is still relevant,
> but it doesn't belong in the HTML: Parser component.

Agreed.  I'll probably leave the management of that to others (at least until and unless I find some time to devote to it).  Could you suggest an appropriate component for this if we do reopen it as a tracking bug?

> > I'd say instead that it's desirable to have SOME way to specify line breaks
> > in multi-line tooltips. [...] See bug 67127 comment 131 for the
> > more-or-less consensus [...] on the best resolution for this issue.

> The consensus seems to date from the pre-HTML5 era when people were trying to
> fit implementations to interpretations of HTML 4.01 instead of trying to fit
> specs to what was already deployed.

That's accurate, and it's clear that the situation has changed.  I rather loathe the "already deployed" IE behavior, since it effectively means one cannot use word-wrap when editing HTML.  But I understand that there's only so long that the Web can be asked to wait for a better option to actually be implemented.  It's just disappointing that the opportunity to finally make it work has arrived hand in hand with the mandate to leave it broken.

Thanks again for your detailed reply.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: