Closed Bug 47078 Opened 24 years ago Closed 20 years ago

Newlines are not converted to whitespace in attributes

Categories

(Core :: Layout, defect, P3)

Tracking

()

VERIFIED WONTFIX
Future

People

(Reporter: ian, Assigned: harishd)

References

()

Details

(Keywords: html4, testcase, Whiteboard: [Hixie-P5][HTML4-6.2] (correctness issue for standard mode, doesn't apply to XML))

Newlines are not being converted to whitespace in attributes.

STEPS TO REPRODUCE:
   Visit the testcase:
      http://www.bath.ac.uk/%7Epy8ieh/internet/eviltests/attrlinewrap.html
   It is self explanatory.

EXPECTED RESULTS:
   For example, title="this is
   a test" should be interpreted as "this is a test" and not "this isatest".


This is what HTML4 spec says about the contents of CDATA attributes such as
value, alt, title, href, longdesc...:

# CDATA is a sequence of characters from the document character set and
# may include character entities. User agents should interpret attribute
# values as follows:
#  o Replace character entities with characters,
#  o Ignore line feeds,
#  o Replace each carriage return or tab with a single space.
# User agents may ignore leading and trailing white space in CDATA attribute
# values (e.g., "   myval   " may be interpreted as "myval").
  -- http://www.w3.org/TR/REC-html40/types.html#h-6.2  
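
For reference, here is a minimal standalone C++ sketch of the three rules quoted above, leaving out entity replacement. The function name is made up for illustration and this is not Mozilla parser code. It shows why the literal rules only give the expected result for CR-LF files: with a bare LF the words run together.

```cpp
#include <string>

// Sketch of the HTML 4 section 6.2 rules for CDATA attribute values.
// Entity replacement is deliberately left out. The function name is
// hypothetical and not taken from the Mozilla source.
std::string InterpretCdataAttribute(const std::string& raw) {
  std::string out;
  out.reserve(raw.size());
  for (char c : raw) {
    if (c == '\n') {
      continue;            // "Ignore line feeds"
    }
    if (c == '\r' || c == '\t') {
      out.push_back(' ');  // "Replace each carriage return or tab with a single space"
    } else {
      out.push_back(c);
    }
  }
  return out;
}
```

Under these rules "this is\r\na test" becomes "this is a test", while the same text saved with Unix line endings ("this is\na test") becomes "this isa test".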

This broke out of bug 22707.
Blocks: html4.01
Target Milestone: --- → Future
I think this may cause correctness problems with form submission that break
sites or corrupt data - there were a few examples dup'd on bug 22707. 
Nominating for nsbeta3.  Thanks Ian!
Status: NEW → ASSIGNED
Keywords: nsbeta3
OS: Windows 2000 → All
Hardware: PC → All
Target Milestone: Future → M18
This most DEFINITELY causes problems on all kinds of web sites.
(Swoon being one of them). There are some sites that won't accept form submissions
from mozilla at all because of this.

Adding myself to cc, I'll do anything to fix this.
Marking nsbeta3+. Adding html4 keyword
Keywords: html4
Whiteboard: [nsbeta3+]
Dup of 15204.  The logic where we strip newlines is probably the same place we 
should be replacing carriage returns and tabs with spaces...

*** This bug has been marked as a duplicate of 15204 ***
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → DUPLICATE
Reopening. The duplicate bug is for quirks mode, this is for standards mode and
is actually the reverse of the other bug.

The original description of this bug still stands.
We should not be _stripping_ newlines, we should be converting them to spaces.

(Note that we are ignoring the requirement that \r and \n be handled differently
since that requirement is an SGML legacy where they were supposed to be record
delimiters. I have proposed a change to the spec in www-html, and there was no
overwhelming response either pro or against, so I think it is safe to say that
nobody actually cares that \r and \n should be treated differently per the 
SGML specification.)
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Removing nsbeta3+

It looks like it's a parser problem.
Assignee: pollmann → harishd
Status: REOPENED → NEW
Whiteboard: [nsbeta3+]
Strict DTD will not be supported in Mozilla. Marking bug nsbeta3- and marking 
FUTURE.
Status: NEW → ASSIGNED
Whiteboard: [nsbeta3-]
Target Milestone: M18 → Future
If I insert newlines into for example title attribute they are not shown as
spaces but as black bars (one if saved as UNIX file and two if saved as Windows
file).
There is an example on http://www.holodeck1.com/mozilla/break.html

Using Mozilla 2001011620 on Windows 2000
See also:
bug 67127 Newline in ALT attribute of IMG tag causes black square to be 
displayed.
bug 92779 newline in ondblclick attribute is sent to JS engine
*** Bug 67127 has been marked as a duplicate of this bug. ***
*** Bug 63824 has been marked as a duplicate of this bug. ***
Whiteboard: [nsbeta3-] → [Hixie-P5] (correctness issue for standard mode, doesn't apply to XML)
Hmm...I think the thing to do here may be to overload CreateTokenOfType in
nsDTDUtils with a third function that's aware of the mode we're parsing in...we
can call it when we're in strict mode or parsing XML, and it will contain a
modified function to strip the whitespace.  I don't really know much about the
parser internals; would anyone care to comment?
Aack, I goofed.  This ground has been gone over already in bug 15204.
By replacing the current lines 1684-1691 of nsHTMLTokens.cpp with the following 
code:

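// See bugs 47078 and 15204
// Assumption: the parser mode flags are available here as aFlag. Testing the
// NS_IPARSER_FLAG_STRICT_MODE constant on its own would always be true.
if (aFlag & NS_IPARSER_FLAG_STRICT_MODE) { // stripping linefeeds breaks old pages
  mTextValue.StripChars("\n");       // per the HTML spec, ignore linefeeds...
  mTextValue.ReplaceChar('\t', ' '); // replace tabs with spaces and...
  mTextValue.ReplaceChar('\r', ' '); // replace carriage returns with spaces...
}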

We should parse things fine in strict mode once this goes in, but we'll need to 
do this somewhere else in quirks mode to deal with the "black bars in tooltips" 
problem.  (I'd suggest replacing "\n" with a space and replacing the other two 
in all title attributes in quirks mode.)  I don't have a build environment 
available, and won't for a while, so I'll try to find some help to apply and 
test this patch.
If you're going to strip out \n's, wouldn't it make sense to strip out \r's as
well?  It would be confusing if an HTML file created on Linux changed meaning
after being saved on Windows.
For the "black bars" problem, a couple of comments:

First of all, if you look at IE's behavior on this, it actually retains 
newlines in tooltips, sort of like "\n" in javascript alert boxes. Some sites 
take advantage of this behavior. Since this is non-standard, however, I don't 
think it's important to emulate it (unless we think it would be nice for quirks 
mode). For strict, we should use a space per newline.

Second, considering that Windows uses \r\n for newlines, *nix uses \n, and Mac 
uses \r, I think the following three pattern matches should be used (in order) 
for portability's sake:

s/\r\n/ /g
s/\n/ /g
s/\r/ /g

\r and \n used above for legibility, should actually be \015 and \012, 
respectively. Does the above sound reasonable?
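
As a rough illustration, the three substitutions above can be written in standalone C++ like this. Both function names are made up for the example. The order matters: handling "\r\n" first is what keeps a Windows line ending from turning into two spaces.

```cpp
#include <string>

// Replace every occurrence of `from` in `text` with `to`.
// Hypothetical helper for illustration; `from` must not be empty.
std::string ReplaceAll(std::string text, const std::string& from,
                       const std::string& to) {
  std::string::size_type pos = 0;
  while ((pos = text.find(from, pos)) != std::string::npos) {
    text.replace(pos, from.size(), to);
    pos += to.size();  // continue after the inserted text
  }
  return text;
}

// Apply the three pattern matches in the order given above.
std::string LineBreaksToSpaces(std::string value) {
  value = ReplaceAll(value, "\r\n", " ");  // Windows
  value = ReplaceAll(value, "\n", " ");    // *nix
  value = ReplaceAll(value, "\r", " ");    // Mac
  return value;
}
```

With this ordering, a line break written in any of the three styles ends up as exactly one space.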
Guys, please read the bug. :-)

From my comments almost exactly a year ago (to within 2 days):

(Note that we are ignoring the requirement that \r and \n be handled differently
since that requirement is an SGML legacy where they were supposed to be record
delimiters. I have proposed a change to the spec in www-html, and there was no
overwhelming response either pro or against, so I think it is safe to say that
nobody actually cares that \r and \n should be treated differently per the 
SGML specification.)

Be warned that this WILL break sites, which is why we are ONLY doing this in
quirks mode.
er. Standards mode.
Ian, I read your comment but wasn't exactly sure what you meant...

So you are saying that for _all_ HTML pages hosted on a Windows box, a windows-
standard newline in an ALT attribute (which is \r\n) will be replaced with 
_two_ spaces. This, in Mozilla, as opposed to the actual newlines in ALT 
attributes (as I described above) that IE provides.

Is this just to keep it simple to code or what? This will look like a BUG to 
most users and developers and is not consistent with how the rest of HTML 
responds to newlines. Windows newlines should be one space; at least it would 
look like it was done on purpose.
I think what's being overlooked here is that people should not be putting
newlines in attributes anyway (that's the place of elements, at least IMO).  I
do see the point about ALT attributes, though, especially as that's what people
are likely to be changing as they go over to strict mode.  Perhaps this is where
we should use our Goldfarb-footnote-mandated obligatory bug in record-start and
record-end handling? :-)  On the gripping hand, it's not unlikely that people
who have CR's, or LF's, or both, stuck in the middle of their "alt" attributes
will have whitespace before the line-break, so that we'll get weird results with
those documents on Mac and Windows anyway, and not Unix.  I think we'll get some
breakage no matter which way we do this, so we might as well DTRT, but I don't
have any hard assessment of how people are putting their linefeeds in.  (BTW,
the XML parser does this already, right?)
Michael: No, in Quirks mode we should carry on doing the right thing for
compatibility, which is probably letting the new lines through and making that
all work as authors expect, including having carriage returns in tooltips if
that is part of the issue. This bug is only about standards mode, where any
line endings should be converted to a space. (The special case of "\r\n" or
"\n\r" or whatever should be converted to one space for sanity, sure...)
Ian: Oh... from your post about reading the bug I thought you disagreed with my 
initial suggestion, but it seems like that is not the case? :)

I just want to make sure the decision on what to do here makes sense, since it 
affects the overall polish of the product. I agree that the HTML 4 spec doesn't 
make a lot of sense in this case, especially if you consider files hosted on a 
Mac.
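
To make the split discussed in the last few comments concrete, here is a small standalone C++ sketch: quirks mode passes the value through untouched, standards mode turns each line ending into one space and treats a "\r\n" or "\n\r" pair as a single line ending. The enum and function names are assumptions for illustration and are not the parser's real mode flags.

```cpp
#include <string>

// Hypothetical stand-in for the parser's mode flags.
enum class ParseMode { Quirks, Standards };

std::string NormalizeAttributeForMode(const std::string& raw, ParseMode mode) {
  if (mode == ParseMode::Quirks) {
    return raw;  // let line endings through for compatibility
  }
  std::string out;
  out.reserve(raw.size());
  for (std::string::size_type i = 0; i < raw.size(); ++i) {
    const char c = raw[i];
    if (c == '\r' || c == '\n') {
      // A CR followed by LF (or LF followed by CR) counts as one line ending.
      if (i + 1 < raw.size() && raw[i + 1] != c &&
          (raw[i + 1] == '\r' || raw[i + 1] == '\n')) {
        ++i;
      }
      out.push_back(' ');
    } else {
      out.push_back(c);
    }
  }
  return out;
}
```

Two identical characters in a row ("\n\n") are still treated as two line endings here, so a blank line in the source yields two spaces.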
Wait a minute, we're going to display newlines in tooltips in quirks mode? 
That's going to discourage a lot of authors from moving to standards mode,
because there isn't a standard way to make newlines appear in tooltips.
Maybe newlines in tooltips should be standard. I do not know much about the W3C 
standards process but I do not see why this feature could not be incorporated 
into HTML 4(.02?). It's a really useful feature which doesn't block 
accessibility and seems easy to implement.

Is there currently a group continuing work on HTML 4?
Hmmm... sorry for more spam... seeing as the last revision of HTML 4.01 was 
released in 1999 and the latest work is being done on XHTML (which doesn't look 
like a lot but I am not involved in any way), it doesn't look like *anyone* 
could get anything new into the spec...

It is interesting to note that while standards promote interoperability, in 
this case it is preventing the adoption of a useful feature for the simple 
reason that it was not considered during the standardization process. The M$ 
embrace-and-extend strategy is starting to make sense, in a twisted 
way... ::shudder::
Jesse: yep.
*** Bug 85232 has been marked as a duplicate of this bug. ***
Are tabs going to be handled in this bug or are they going to be handled
separately?
Jesse: to answer your earlier question, the issue just came up on c.i.w.a.h
today, and it was pointed out that using &#10; should produce a linefeed
(but bug 67127 blocks that, currently).  So, there is a way to put linefeeds in
tooltips in strict mode.

Justin H.: If the patch from bug 15204 is adopted, tabs (\t) will be handled, yes.
Depends on: 67127
Ian: it looks like both of us missed something WRT the SGML spec and RE/RS.
It turns out that SGML applications are supposed to implement an "entity
manager" which converts the record boundaries from the native format to the
proper RE & RS before applying SGML whitespace rules. In other words, documents
formatted to only use CR or only use LF should have their linebreaks transformed
into full CR-LFs. Thus do we avoid having whitespace handling vary by platform
of original creation.

Right now, we do this normalization in the content sink; this should probably be
moved into the HTML tokenizer, because it already searches for newlines to do
line counting. As we convert, we can also take care of this case (CR-LF, CR, LF,
and tab all become a single space each) and the RE-ignoring-after-tags (separate
bug).
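
As a rough illustration of the two-stage idea above (normalize record boundaries first, then apply the HTML 4 rules), here is a standalone C++ sketch. All function names are made up for this example and nothing here is taken from the content sink or the tokenizer.

```cpp
#include <string>

// Stage 1: rewrite CR-LF, lone CR and lone LF so that every line break is
// CR-LF, as an SGML entity manager would. Hypothetical name.
std::string NormalizeToCrLf(const std::string& raw) {
  std::string out;
  out.reserve(raw.size());
  for (std::string::size_type i = 0; i < raw.size(); ++i) {
    if (raw[i] == '\r') {
      if (i + 1 < raw.size() && raw[i + 1] == '\n') {
        ++i;  // already a CR-LF pair, consume the LF too
      }
      out += "\r\n";
    } else if (raw[i] == '\n') {
      out += "\r\n";
    } else {
      out.push_back(raw[i]);
    }
  }
  return out;
}

// Stage 2: the HTML 4 section 6.2 rules (ignore LF, CR and tab become a space).
std::string ApplyCdataRules(const std::string& normalized) {
  std::string out;
  out.reserve(normalized.size());
  for (char c : normalized) {
    if (c == '\n') continue;
    out.push_back((c == '\r' || c == '\t') ? ' ' : c);
  }
  return out;
}

std::string InterpretAttribute(const std::string& raw) {
  return ApplyCdataRules(NormalizeToCrLf(raw));
}
```

Because every line break is a CR-LF pair by the time the second stage runs, each one collapses to exactly one space regardless of the platform the file was saved on.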
>Right now, we do this normalization in the content sink; this should probably be
>moved into the HTML tokenizer.
No no. Normalization happens in the sink because it's not easy to perform
compression on sliding strings ( in html tokenizer ).
*** Bug 126405 has been marked as a duplicate of this bug. ***
*** Bug 147640 has been marked as a duplicate of this bug. ***
*** Bug 152157 has been marked as a duplicate of this bug. ***
Whiteboard: [Hixie-P5] (correctness issue for standard mode, doesn't apply to XML) → [Hixie-P5][HTML4-6.2] (correctness issue for standard mode, doesn't apply to XML)
*** Bug 183858 has been marked as a duplicate of this bug. ***
*** Bug 186700 has been marked as a duplicate of this bug. ***
verifying: this happens on Apple's own website
http://www.apple.com/
The entire alternate description text is run together with boxes instead of
spaces, until the images have finished loading.

Does this also relate to pasting into TextEdit? (in the middle of a file I
suddenly start getting 0d instead of 0a for linebreaks, after pasting from
Camino, which never happens with Safari when I copy and paste from a web page; I
also have to set translate line breaks in Terminal in order for pasting to work,
in emacs, for example).
After reading other people's comments, I disagree that it is impossible to solve
this, at least for the case of alt attributes. Any combination of newlines
(CRLF, CR, LF) should be reduced to one kind (linebreak normalization) and
displayed as a space. This is the intention of the image alternate attribute,
anyway, is it not? The text in the alt="blah blah" attribute is just text, so it is
safe to do this linebreak conversion, and definitely proper to do the space
conversion. Tabs also should be replaced with spaces--their intent is to format
html, not change the text in the alternate image description.

Additionally, the reason this comes up is not because it is dumb for people to
put linebreaks into the attribute (someone says they "should not be putting"
them in), but because they are trying to keep their HTML looking nice.

p.s. Apple is using LFs in their web page:

doing "wget -O - -i - | hex | grep 0d" matches nothing, so they are using Unix
linebreaks.
Sorry, I wasn't thinking. I should have said "wget -O - http://www.apple.com/ |
hex | grep 0d" so you could see what I was doing.
In the W3C page
http://www.w3.org/TR/html4/types.html#h-6.2
(similarly referenced in the initial report of this bug), the following
instructions are given:

CDATA is a sequence of characters from the document character set and may
include character entities. User agents should interpret attribute values as
follows:

    * Replace character entities with characters,
    * Ignore line feeds,
    * Replace each carriage return or tab with a single space.

  However, this was before the following instructions (in XHTML 1 and
following), and was done because it was assumed "canonical convention" for CRLF
to be the new line sequence. In the later instructions, the processing behavior
was clearly changed.

http://www.w3.org/TR/xhtml1/#h-4.7

4.7. White Space handling in attribute values

When user agents process attributes, they do so according to Section 3.3.3 of [XML]:

    * Strip leading and trailing white space.
    * Map sequences of one or more white space characters (including line
breaks) to a single inter-word space.
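
For comparison with the HTML 4 rules, the two XHTML rules as quoted above can be sketched in standalone C++ like this. The function names are made up for illustration, and the set of white space characters (space, tab, CR, LF) is the XML definition.

```cpp
#include <string>

// XML white space: space, tab, carriage return, line feed.
bool IsXmlWhiteSpace(char c) {
  return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Strip leading and trailing white space and map each run of white space
// to a single space. Hypothetical name, not an existing Mozilla function.
std::string NormalizeXhtmlAttribute(const std::string& raw) {
  std::string out;
  bool pendingSpace = false;
  for (char c : raw) {
    if (IsXmlWhiteSpace(c)) {
      // Remember the run, but only if some text has already been emitted,
      // which drops leading white space.
      pendingSpace = !out.empty();
    } else {
      if (pendingSpace) {
        out.push_back(' ');
      }
      pendingSpace = false;
      out.push_back(c);
    }
  }
  return out;  // a trailing run is never emitted
}
```

Here the line ending style no longer matters at all: CR, LF and CR-LF are all part of a white space run that becomes one space.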

  Obviously, images are indiscriminate of whether they reside in an XHTML 1
document or in an HTML 4.01 document, so the behavior should be updated as per
these established instructions (which have changed in that they do not assume
CRLF as the expected form).

  In Camino, and in Mozilla 1.3 final, this behavior is worst of all at present,
because LFs are neither deleted nor replaced with whitespace, but mis-displayed
completely as "missing character" boxes.

  These rules cannot be followed for every attribute, but because HTML images
are _indiscriminate_ of their surrounding HTML document type, they may follow
the new rules.
*sigh* you probably want bug 67127. comment #29 is, roughly speaking, an
adequate statement of the problem (SGML requires us, in HTML 4, to replace line
breaks in attributes with spaces, and it is done in a platform-independent
manner). Further comments here are probably not useful, unless they run to the
effect of "I have a clever way of making the content sink do this".
Blocks: 67127
No longer depends on: 67127
Depends on: 228099
*** Bug 229237 has been marked as a duplicate of this bug. ***
I feel that this bug should be wontfix, for the reasons given in bug 228099
comment 7.  Speak now (within the next few weeks or so) or forever hold your peace.
I don't understand. Do you mean that using a &#10; in an attribute is incorrect?
Otherwise I don't see why Mozilla should display garbage.
No, I mean that converting all newlines in attributes into spaces would break
pages that use newlines in "value" attributes (especially for hidden inputs, to
move around information).   Doing that is pretty unacceptable, since that's a
reasonably common technique.

In short, I am replying to comment 0, not to all the semi-related stuff that's
happened since then.
OK, so bug 67127 would be the real bug for this problem.

However, not converting newlines to whitespace could break pages that use
newlines only to split lines in the source (but really meaning a whitespace).
Are such pages common?  Perhaps what needs to happen is that different
attributes need to be treated differently.
Wontfix.  Bug 67127 needs to be fixed in the tooltip-attr parsing code (which is
NOT part of parser, btw, so don't reassign the bug there).
Status: ASSIGNED → RESOLVED
Closed: 24 years ago → 20 years ago
Resolution: --- → WONTFIX
Status: RESOLVED → VERIFIED
How about bug 152157 then? It should still be fixed.
Vincent, that's bug 67127
No longer blocks: 67127
(Comment #46 said:)
> OK, so bug 67127 would be the real bug for this problem.
> 
> However, not converting newlines to whitespace could break pages that use
> newlines only to split lines in the source (but really meaning a whitespace).

(And comment #47 replied:)
> Are such pages common?  Perhaps what needs to happen is that different
> attributes need to be treated differently.

I'm not convinced that bug 67127 is the right place for this: work there is
currently aimed primarily at enabling multi-line tooltips, while this issue is
both much simpler and much more general.  In particular, newlines are not just a
problem for the "title" attribute: they cause similar visual problems in images'
"alt" text (see the original report for bug 67127) and they break Javascript
attributes (see bug 92779, where the first attached testcase fails due to a line
break in an "ondblclick" caused by word-wrap).  I wouldn't be surprised if other
attributes are potentially vulnerable as well.

Regarding whether such pages are common, anyone who composes HTML using an
editor with word wrap (or otherwise limits line lengths for legibility) will
eventually create them, quite possibly without noticing or thinking anything of
it.  And bug 67127 comment 13 points out that pages at www.w3.org include
newlines in attributes quite frequently.

I agree that different attributes should probably be treated differently.  I
have no reason to doubt your warnings about people putting meaningful newlines
in "value" attributes (though I don't think I've seen it myself).  On the other
hand, for "title", "alt", "ondblclick", and probably many (most?) others,
replacing the newlines with spaces is the right thing to do.  (For most of
those, my intuition would actually be to collapse any amount of whitespace to a
single space, but it sounds like that's not what the specification indicates.) 
In my mind, replacing newlines with spaces should be the default behavior, with
the "value" attribute as a possibly unique exception.

I don't know the code well enough to know how to implement this, or even where
in the parsing/processing chain it should happen.  But it would be highly
inefficient (and error-prone) to produce separate patches to convert newlines
for each affected attribute as it was identified.
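
Purely as an illustration of the default-with-exception policy described above, and not as a proposed patch, the idea could be expressed in one place roughly like this in standalone C++. The function names are made up, only "value" is listed as an exception because it is the only one named in this discussion, and attribute names are assumed to be lowercased already.

```cpp
#include <string>

// Attributes whose newlines are kept as-is. Only "value" comes from the
// discussion; any other entries would be guesses.
bool PreservesNewlines(const std::string& attributeName) {
  return attributeName == "value";
}

// Default behavior: each line ending (CR-LF, CR or LF) becomes one space.
std::string NormalizeAttribute(const std::string& attributeName,
                               const std::string& raw) {
  if (PreservesNewlines(attributeName)) {
    return raw;
  }
  std::string out;
  out.reserve(raw.size());
  for (std::string::size_type i = 0; i < raw.size(); ++i) {
    const char c = raw[i];
    if (c == '\r' || c == '\n') {
      if (c == '\r' && i + 1 < raw.size() && raw[i + 1] == '\n') {
        ++i;  // treat CR-LF as a single line ending
      }
      out.push_back(' ');
    } else {
      out.push_back(c);
    }
  }
  return out;
}
```

Keeping the exception list in a single predicate means a newly identified attribute needs a one-line change rather than its own patch.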
Based on my arguments in comment 51, I have filed bug 322270 to address this issue, but with the need for an exception for "value" attributes explicitly recognized.  Comments (or criticism) would be very welcome.  (I wish that I had the knowledge to suggest an actual fix, but I'm afraid that I don't.)