Closed Bug 15378 Opened 21 years ago Closed 10 years ago

Newlines and spaces in html src get passed through to output


(Core :: DOM: HTML Parser, defect, P5)






(Reporter: akkzilla, Unassigned)


(Blocks 1 open bug)


When parsing html, newlines in the html source get passed through to the output
sink as part of the string in a parser node of type eHTMLTag_text.  The output
sink is passing these verbatim into the plaintext output.

I'm not sure what the parser should be doing in this case; I was under the
impression that the newline should be passed as a separate tag of type
eHTMLTag_newline, to make it easier to separate out.  But if the parser is
behaving as expected, I can change the output sink to parse the string and
filter the newlines out.
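
For reference, the sink-side fallback mentioned above (re-parsing the text string and filtering newlines out) could look roughly like the following. This is a minimal Python sketch for illustration only, not the actual C++ output sink; the function name collapse_whitespace is hypothetical, and it assumes the plaintext output wants each run of spaces, tabs, and newlines in an eHTMLTag_text string reduced to a single space (preformatted content would need different handling).

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of spaces, tabs, and line breaks with one space.

    Hypothetical helper: a stand-in for what an output sink might do to the
    string of a text node before writing plaintext.
    """
    return re.sub(r"[ \t\r\n]+", " ", text)


if __name__ == "__main__":
    # The chunk reported in this bug: the embedded newline becomes a space.
    print(repr(collapse_whitespace("page.\nHere is some ")))
```

With the string from this report, the embedded newline is replaced by a single space and the trailing space is kept.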

An example is in htmlparser/tests/outsinks/simple.html when converting to
plaintext.  On Linux, you can do this easily by running the test:
TestOutput -i text/html -o text/plain -f 0 -w 0 OutTestData/simple.html
; on other platforms try highlighting text and pasting into something that
accepts plaintext, or loading the file in the editor and doing "Debug->Output to
Text" or highlighting and doing "test selection".

The parser passes "page.\nHere is some " all as one text node, where I would
have expected separate nodes for "page.", eHTMLTag_newline, "Here is some", and
eHTMLTag_whitespace (and I thought the parser used to separate the tags out that
Summary: Newlines in html src get passed through to output → Newlines and spaces in html src get passed through to output
Another, related, question.  On the same page, there's the line:

Here is a <a href="">link to the</a> page.

The parser sends "Here is a " as a chunk (note the trailing space), but after
the link, it sends "page.\nHere is some " -- note the space between the link and
"page" isn't part of the string (and it wasn't sent as a separate text node,

What's the rule on which spaces/newlines get embedded into text nodes and which
ones don't?
To me the rule seems to be that newline and whitespace are part of a text but
not of a tag, i.e., in the example

Here is a <a href="">link to the</a> page.

"Here is a " --->  Trailing whitespace part of the string
"<a href="">link to the</a>" --> No whitespace
"Whitespace token" ---> separate token is created since it follows a tag.
"page" ---> No leading whitespace.
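
The rule as described in the comment above (whitespace stays inside a text chunk, except that whitespace directly following a tag becomes its own token) can be sketched as a toy tokenizer. This is a hypothetical Python illustration, not the real parser: the tokenize function, the token kind names, and the regular expression are all assumptions, and a stray "<" with no closing ">" is simply skipped.

```python
import re

# A tag is "<...>"; anything else up to the next "<" is a text chunk.
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")


def tokenize(source: str) -> list[tuple[str, str]]:
    """Split source into (kind, value) tokens following the rule above."""
    tokens = []
    after_tag = False
    for chunk in TOKEN_RE.findall(source):
        if chunk.startswith("<"):
            tokens.append(("tag", chunk))
            after_tag = True
            continue
        if after_tag:
            # Whitespace that directly follows a tag gets its own token.
            stripped = chunk.lstrip(" \t")
            leading = chunk[: len(chunk) - len(stripped)]
            if leading:
                tokens.append(("whitespace", leading))
            chunk = stripped
        if chunk:
            # Trailing whitespace stays part of the text chunk.
            tokens.append(("text", chunk))
        after_tag = False
    return tokens


if __name__ == "__main__":
    for token in tokenize('Here is a <a href="">link to the</a> page.'):
        print(token)
```

For the example line, this yields the text "Here is a " with its trailing space, the two tags around "link to the", a separate whitespace token after the end tag, and then "page." with no leading space.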
Target Milestone: M12
It's not a matter of whether it's part of the string or part of the tag, since
in both cases the whitespace is adjacent to a text node.  It seems like trailing
whitespace is being considered part of the tag, but leading whitespace isn't.

This is going to take a fair amount of effort to re-parse in the output sinks,
so I want to make sure that this is really what's intended (and exactly what the
rules are) before trying to write that parsing code.  The rule it's following
right now doesn't make much sense to me.

Are things like this (e.g. the meaning of whitespace and newline nodes and when
they're used) written down anywhere?
Akkana, what should I do with this bug?

BTW, I'm still looking for proper documentation on whitespace handling whenever
I find time.
Could you put a short document up on explaining the current
whitespace handling (trailing whitespace gets included in the node but leading
whitespace doesn't, or whatever the rule is) then assign the bug back to me?
I'll make the content sinks do whatever they have to, but I'd like to have some
constant document to refer to so that I have a reminder of what I'm trying to
make them do.
Target Milestone: M12 → M14
Priority: P3 → P4
Target Milestone: M13 → M14
Priority: P4 → P5
Target Milestone: M14 → M16
Bulk move of all "Output" component bugs to new "DOM to Text Conversion" 
component.  Output will be deleted as a component.
Component: Output → DOM to Text Conversion
Moving to M19..
Target Milestone: M16 → M19
Target Milestone: M19 → Future
Here's a quote that says what authoring tools should do. Maybe
this will give us a clue as to what ``user agents'' like Mozilla should do.
(Where are the W3C recommendations for user agents?)
In order to avoid problems with SGML line break rules and inconsistencies among
extant implementations, authors should not rely on user agents to render white
space immediately after a start tag or immediately before an end tag. Thus,
authors, and in particular authoring tools, should write:
  <P>We offer free <A>technical support</A> for subscribers.</P>
and not:
  <P>We offer free<A> technical support </A>for subscribers.</P>
Component -> Parser.

This is definitely a bug with Parser in handling the new lines at least for the 
ones appearing immediately after an opening tag or before a closing tag. 

The HTML specification indicates the same. It says that
"The following two HTML examples must be rendered identically:

<P>Thomas is watching TV.</P>

<P>
Thomas is watching TV.
</P>"
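
As an illustration of the behavior this comment asks for (ignoring a line break immediately after an opening tag or immediately before a closing tag), here is a small Python sketch. It is not parser code: the function name drop_tag_adjacent_line_breaks and the regular expressions are assumptions, and it works on raw source text rather than on parser tokens.

```python
import re

LINE_BREAK = r"(?:\r\n|\r|\n)"


def drop_tag_adjacent_line_breaks(html: str) -> str:
    """Remove one line break after each start tag and before each end tag."""
    # One line break immediately following a start tag such as <P>.
    html = re.sub(r"(<[A-Za-z][^>]*>)" + LINE_BREAK, r"\1", html)
    # One line break immediately preceding an end tag such as </P>.
    html = re.sub(LINE_BREAK + r"(</[A-Za-z][^>]*>)", r"\1", html)
    return html


if __name__ == "__main__":
    print(drop_tag_adjacent_line_breaks("<P>\nThomas is watching TV.\n</P>"))
```

Under these assumptions, a paragraph written with its text on a separate line from the tags reduces to the same string as the single-line form.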

This is what I mentioned in bug#75283, comment #18. In fact if we can do 
something nicer in the parser to overcome this problem, many (most) of the DOM-TXT 
serialization bugs would be resolved retaining a nice view source and  
Component: DOM to Text Conversion → Parser
Blocks: 107927
But they are rendered identically, at least as far as I can see. What actually
is in the content tree is a different matter.
Blocks: 147355
At the same time, Composer adds whitespace to the saved HTML.  It seems to
just stick whitespace in at random.  After editing with Composer, all my HTML
now looks like:

    "<p>We are a      small local firm             offering reliable</p>"

If I have a CDATA like this:

<style type="text/css">
table.all {
    document.body.clientWidth > (650/12) *
        "auto" );

Composer adds an extra newline between each line, every time the file is edited.

I'd like the final HTML result from composer to be neatly formatted, so that
hand-editing is possible.
Assignee: harishd → nobody
QA Contact: sujay → parser
This should be INVALID/WONTFIX per HTML5.
Closed: 10 years ago
Resolution: --- → WONTFIX