15378 - Newlines and spaces in html src get passed through to output

Reporter

Description

•

25 years ago

When parsing html, newlines in the html source get passed through to the output
sink as part of the string in a parser node of type eHTMLTag_text.  The output
sink is passing these verbatim into the plaintext output.

I'm not sure what the parser should be doing in this case; I was under the
impression that the newline should be passed as a separate tag of type
eHTMLTag_newline, to make it easier to separate out.  But if the parser is
behaving as expected, I can change the output sink to parse the string and
filter the newlines out.

An example is in htmlparser/tests/outsinks/simple.html when converting to
plaintext.  On Linux, you can do this easily by running the test:
  TestOutput -i text/html -o text/plain -f 0 -w 0 OutTestData/simple.html
On other platforms, try highlighting text and pasting into something that
accepts plaintext, or loading the file in the editor and doing "Debug->Output to
Text" or highlighting and doing "test selection".

The parser passes "page.\nHere is some " all as one text node, where I would
have expected separate nodes for "page.", eHTMLTag_newline, "Here is some", and
eHTMLTag_whitespace (and I thought the parser used to separate the tags out that
way).
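
For illustration only, here is a minimal Python sketch of the split the reporter expected, as an output sink might do it if it had to re-parse the string itself. It is not Mozilla code: the function name split_text_run and the token labels are hypothetical, chosen to echo the eHTMLTag_* names above.

```python
import re

# Hypothetical token labels, loosely echoing eHTMLTag_text, eHTMLTag_newline
# and eHTMLTag_whitespace from the report. These are not real parser constants.
TEXT, NEWLINE, WHITESPACE = "text", "newline", "whitespace"

# Alternatives are tried left to right, so a line break is never swallowed by
# the whitespace branch. A text token may contain inner spaces, but never
# leading or trailing whitespace and never a line break.
_TOKEN_RE = re.compile(
    r"(?P<newline>\r\n|\r|\n)"
    r"|(?P<whitespace>[^\S\r\n]+)"
    r"|(?P<text>\S+(?:[^\S\r\n]+\S+)*)"
)


def split_text_run(run):
    """Split one text run into (kind, value) pairs."""
    return [(m.lastgroup, m.group()) for m in _TOKEN_RE.finditer(run)]


if __name__ == "__main__":
    for kind, value in split_text_run("page.\nHere is some "):
        print(kind, repr(value))
```

Applied to the string from the report, this yields "page.", a newline token, "Here is some", and a trailing whitespace token, which is the separation described above.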

Akkana Peck

Reporter

Updated

•

25 years ago

Summary: Newlines in html src get passed through to output → Newlines and spaces in html src get passed through to output

Akkana Peck

Reporter

Comment 1

•

25 years ago

Another, related, question.  On the same page, there's the line:

Here is a <a href="http://www.mozilla.org">link to the mozilla.org</a> page.

The parser sends "Here is a " as a chunk (note the trailing space), but after
the link, it sends "page.\nHere is some " -- note the space between the link and
"page" isn't part of the string (and it wasn't sent as a separate text node,
either).

What's the rule on which spaces/newlines get embedded into text nodes and which
ones don't?

harishd

Comment 2

•

25 years ago

To me the rule seems to be that newline and whitespace are part of a text but
not of a tag, i.e., in the example

Here is a <a href="http://www.mozilla.org">link to the mozilla.org</a> page.

"Here is a " --->  Trailing whitespace part of the string
"<a href="http://www.mozilla.org">link to the mozilla.org</a>" --> No whitespace
"Whitespace token" ---> separate token is created since it follows a tag.
"page" ---> No leading whitespace.
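
One possible reading of the rule in this comment can be sketched as a toy tokenizer in Python. This is an assumption-laden illustration, not the parser's actual algorithm: the names tokenize and _emit_text are hypothetical, and tags are found with a naive regular expression that would break on a ">" inside an attribute value.

```python
import re

# Naive tag matcher, good enough for the example line only.
_TAG_RE = re.compile(r"<[^>]*>")


def _emit_text(chunk, after_tag, tokens):
    """Append tokens for one stretch of character data between tags."""
    if not chunk:
        return
    if after_tag:
        # Whitespace that directly follows a tag becomes its own token.
        stripped = chunk.lstrip()
        lead = chunk[: len(chunk) - len(stripped)]
        if lead:
            tokens.append(("whitespace", lead))
        chunk = stripped
    if chunk:
        # Any other whitespace (inner or trailing) stays inside the text.
        tokens.append(("text", chunk))


def tokenize(html):
    """Return a list of (kind, value) pairs for a fragment of HTML."""
    tokens = []
    pos = 0
    after_tag = False
    for m in _TAG_RE.finditer(html):
        _emit_text(html[pos:m.start()], after_tag, tokens)
        tokens.append(("tag", m.group()))
        pos = m.end()
        after_tag = True
    _emit_text(html[pos:], after_tag, tokens)
    return tokens


if __name__ == "__main__":
    line = 'Here is a <a href="http://www.mozilla.org">link to the mozilla.org</a> page.'
    for token in tokenize(line):
        print(token)
```

Under this reading, "Here is a " keeps its trailing space, while the space after the closing tag is emitted as a separate whitespace token and "page." has no leading whitespace.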

Akkana Peck

Reporter

Updated

•

25 years ago

Target Milestone: M12

Akkana Peck

Reporter

Comment 3

•

25 years ago

It's not a matter of whether it's part of the string or part of the tag, since
in both cases the whitespace is adjacent to a text node.  It seems like trailing
whitespace is being considered part of the tag, but leading whitespace isn't.

This is going to take a fair amount of effort to re-parse in the output sinks,
so I want to make sure that this is really what's intended (and exactly what the
rules are) before trying to write that parsing code.  The rule it's following
right now doesn't make much sense to me.

Are things like this (e.g. the meaning of whitespace and newline nodes and when
they're used) written down anywhere?

harishd

Comment 4

•

25 years ago

Akkana, what should I do with this bug?

BTW, I'm still looking for proper documentation on whitespace handling whenever
I find time.

Akkana Peck

Reporter

Comment 5

•

25 years ago

Could you put a short document up on mozilla.org explaining the current
whitespace handling (trailing whitespace gets included in the node but leading
whitespace doesn't, or whatever the rule is) then assign the bug back to me?
I'll make the content sinks do whatever they have to, but I'd like to have some
constant document to refer to so that I have a reminder of what I'm trying to
make them do.

harishd

Updated

•

25 years ago

Target Milestone: M12 → M14

harishd

Updated

•

25 years ago

Priority: P3 → P4

harishd

Updated

•

25 years ago

Target Milestone: M13 → M14

harishd

Updated

•

25 years ago

Priority: P4 → P5

Target Milestone: M14 → M16

leger

Comment 6

•

25 years ago

Bulk move of all "Output" component bugs to the new "DOM to Text Conversion"
component.  Output will be deleted as a component.

leger

Updated

•

25 years ago

Component: Output → DOM to Text Conversion

harishd

Comment 7

•

24 years ago

Moving to M19..

Target Milestone: M16 → M19

harishd

Updated

•

24 years ago

Status: NEW → ASSIGNED

Target Milestone: M19 → Future

dcary

Comment 8

•

23 years ago

Here's a quote that says what authoring tools should do. Maybe
this will give us a clue as to what ``user agents'' like Mozilla should do.
(Where are the W3C recommendations for user agents?)
``
In order to avoid problems with SGML line break rules and inconsistencies among
extant implementations, authors should not rely on user agents to render white
space immediately after a start tag or immediately before an end tag. Thus,
authors, and in particular authoring tools, should write:
  <P>We offer free <A>technical support</A> for subscribers.</P>
and not:
  <P>We offer free<A> technical support </A>for subscribers.</P>
''
--
http://www.w3.org/TR/html4/struct/text.html

Tanu Mutreja

Comment 9

•

22 years ago

Component -> Parser.

This is definitely a Parser bug in the handling of newlines, at least for
those appearing immediately after an opening tag or before a closing tag.

The following HTML specification note indicates the same:

http://www.w3.org/TR/html4/appendix/notes.html#notes-line-breaks says that 
"The following two HTML examples must be rendered identically:

<P>Thomas is watching TV.</P>

<P>
Thomas is watching TV.
</P>", 

This is what I mentioned in bug#75283, comment #18. In fact, if we can do
something nicer in the parser to overcome this problem, many (most) of the
DOM-TXT serialization bugs would be resolved while retaining a nice view source
and indentation.
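
As a rough sketch of the behaviour that the quoted HTML 4 note asks for, the following Python function drops a single line break immediately after a start tag and immediately before an end tag. The function name and the regular expressions are hypothetical and deliberately simplistic; a real parser would do this during tokenization, not on the raw source string.

```python
import re

_LINE_BREAK = r"(?:\r\n|\r|\n)"

# A start tag directly followed by one line break.
_AFTER_START = re.compile(r"(<[A-Za-z][^<>]*>)" + _LINE_BREAK)
# One line break directly followed by an end tag.
_BEFORE_END = re.compile(_LINE_BREAK + r"(</[A-Za-z][^<>]*>)")


def drop_tag_adjacent_line_breaks(html):
    """Remove one line break after each start tag and before each end tag."""
    html = _AFTER_START.sub(r"\1", html)
    return _BEFORE_END.sub(r"\1", html)


if __name__ == "__main__":
    compact = "<P>Thomas is watching TV.</P>"
    spread = "<P>\nThomas is watching TV.\n</P>"
    print(drop_tag_adjacent_line_breaks(spread) == compact)
```

With this normalization, the two examples from the specification note produce the same text content, so a plaintext serializer would no longer see the extra newlines.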

Component: DOM to Text Conversion → Parser

Tanu Mutreja

Updated

•

22 years ago

Blocks: 107927

Heikki Toivonen (remove -bugzilla when emailing directly)

Comment 10

•

22 years ago

But they are rendered identically, at least as far as I can see. What actually
is in the content tree is a different matter.

Tanu Mutreja

Updated

•

22 years ago

Blocks: 147355

Bryce Mozilla Nesbitt

Comment 11

•

21 years ago

At the same time, Composer adds whitespace to the saved HTML.  It seems to
just stick whitespace in at random.  After editing with Composer, all my html
now looks like:

    "<p>We are a      small local firm             offering reliable</p>"


If I have a CDATA like this:

<style type="text/css">
/*<![CDATA[*/
table.all {
max-width:43em;
width:expression(
    document.body.clientWidth > (650/12) *
    parseInt(document.body.currentStyle.fontSize)?
        "40em":
        "auto" );
}
/*]]>*/
  </style>


Composer adds an extra newline between each line, every time the file is edited.

I'd like the final HTML result from composer to be neatly formatted, so that
hand-editing is possible.

Phil Ringnalda (:philor)

Updated

•

15 years ago

Assignee: harishd → nobody

Status: ASSIGNED → NEW

QA Contact: sujay → parser

Henri Sivonen (:hsivonen)

Comment 12

•

15 years ago

This should be INVALID/WONTFIX per HTML5.

Henri Sivonen (:hsivonen)

Updated

•

15 years ago

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → WONTFIX

Categories: Core :: DOM: HTML Parser, defect, P5

People: Reporter: akkzilla, Unassigned
