Mozilla reports existence of phantom text nodes in the DOM

RESOLVED DUPLICATE of bug 26179

Core
DOM: Core & HTML
--
enhancement
12 years ago
10 years ago

People

(Reporter: VK, Unassigned)

Tracking

Trunk
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

12 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3
Build Identifier: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3

This bug is a "re-open for further action" of Phantom Nodes Bug #26179. Anyone willing to answer is invited to read the previous discussion in full beforehand and to become familiar with the arguments of both sides.


Reproducible: Always

Steps to Reproduce:
ABSTRACT
The description below is purely technical: it tries to describe the effect without implying anything about correctness or incorrectness with respect to the W3C specifications:

The HTML parsing currently implemented in Gecko creates additional whitespace-only text nodes in the resulting DOM tree in place of line breaks in the source code. In many cases this produces a DOM tree the source author did not intend (the pretty-printing is reflected in the DOM), and it requires specially adjusted tree walkers instead of the default DOM methods. The aim of this filing is to address the issue without breaking existing solutions, while leaving an option to preserve the behavior described above where it is needed.
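
For readers unfamiliar with the "specially adjusted tree walkers" mentioned above, here is a minimal sketch of what such helpers tend to look like. The function names (isIgnorable, nextSignificantSibling, firstSignificantChild) are hypothetical and are not part of any DOM specification; the code only relies on the standard nodeType, data, firstChild and nextSibling properties, so it works on real DOM nodes as well as on plain objects that mimic them.

```javascript
// Hypothetical helpers, not part of any DOM specification.
const TEXT_NODE = 3; // same value as Node.TEXT_NODE

// True for a text node that contains nothing but space, tab, CR, LF or FF.
function isIgnorable(node) {
  return node.nodeType === TEXT_NODE && /^[ \t\r\n\f]*$/.test(node.data);
}

// Like node.nextSibling, but skips whitespace-only text nodes.
function nextSignificantSibling(node) {
  let sibling = node.nextSibling;
  while (sibling && isIgnorable(sibling)) {
    sibling = sibling.nextSibling;
  }
  return sibling;
}

// Like parent.firstChild, but skips whitespace-only text nodes.
function firstSignificantChild(parent) {
  const child = parent.firstChild;
  return child && isIgnorable(child) ? nextSignificantSibling(child) : child;
}
```

A script would then call firstSignificantChild(list) where it would otherwise read list.firstChild. Text nodes that contain any non-whitespace character are still returned, so real content is never skipped.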


RATIONALE
Almost exactly the desired behavior can be found in another Gecko module: XSLT.
Before proceeding with further explanations, anyone with DOM Inspector installed is welcome to compare the DOM trees for an HTML 4.01 Transitional document:
   <http://www.nskom.com/external/tmp/pnodes/index.html>
and for the same document obtained over XSLT transformer:
   <http://www.nskom.com/external/tmp/pnodes/index_strip.xml>

As anyone can see, in the second case (XSLT) the DOM tree is exactly what anyone (I strongly believe) would expect from the source code.

I'm really not a great W3C docs expert, so either the XSLT specs indeed allow this long-awaited convenience, or while Mr. Zbarsky was fighting with dupes for #26179, he was betrayed behind his back by the XSLT team (? :-)

It is interesting to mention that this convenient XSLT behavior is consistent even without the strip-space command. While the first XSL template contains
   <xsl:strip-space elements="*"/>
the following sample uses a template w/o any strip-space commands:
   <http://www.nskom.com/external/tmp/pnodes/index_nostrip.xml>
yet it produces the same DOM tree (which is great and /please/ do not mark it as a bug).

This way, besides the utter inconvenience of the phantom nodes effect, we also have two distinct results depending only on how an HTML document was retrieved. IMHO one of the results has to be brought into accordance with the other ASAP.


PROPOSAL

1-A (Optimum) Just make the HTML parser work as XSLT does and forget about phantom nodes as one forgets an occasional nightmare. Custom tree walkers will not be affected; they will just continue to make unnecessary checks.


1-B (Minimum) Add new CSS selectors
 -moz-strip-space
 -moz-preserve-space
with arguments and functionality as close as possible to their XSL counterparts.

See <http://www.w3.org/TR/1999/REC-xslt-19991116#strip> with further explanations at <http://www.w3.org/1999/11/REC-xslt-19991116-errata/>


2-B Make the -moz-strip-space rule "inherit" by default. This way, say, the rule:

 html {
  -moz-strip-space: *;
 }

will ensure phantom nodes removal all across the document and

 html {
  -moz-strip-space: *;
  -moz-preserve-space: P ADDRESS;
 }

will ensure phantom node removal across the whole document except in <p> and <address> elements.

I place the rule on the html element instead of the body element because currently phantom nodes are added even in non-rendering sections (head), so the cleanup has to start from the very top.
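
To make the intended effect of the two rules concrete, here is a rough sketch of the stripping semantics, modeled loosely on xsl:strip-space / xsl:preserve-space. Everything in it is an assumption for illustration only: the tree model (an element is { name, children }, a text node is a plain string), the function name stripSpace, and the choice to decide per parent element name as XSLT does. It says nothing about how, or whether, a browser could apply such a rule.

```javascript
// Hypothetical model: an element is { name, children }, a text node is a string.
// `preserve` plays the role of -moz-preserve-space; stripping applies everywhere
// else, which corresponds to "-moz-strip-space: *" set on the root element.
function stripSpace(element, preserve = new Set()) {
  const keepWhitespace = preserve.has(element.name);
  const children = [];
  for (const child of element.children) {
    if (typeof child === 'string') {
      // Keep text that has any non-whitespace character; keep whitespace-only
      // text only when the parent element is listed in `preserve`.
      if (keepWhitespace || /[^ \t\r\n]/.test(child)) {
        children.push(child);
      }
    } else {
      children.push(stripSpace(child, preserve));
    }
  }
  return { name: element.name, children };
}
```

Calling stripSpace(root, new Set(['P', 'ADDRESS'])) would correspond to the second rule above, and stripSpace(root) to the first.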


3 File a request to W3C to change or remove the section <http://www.w3.org/TR/REC-xml/#sec-white-space> as an erratum. The statement in question is a wrongly extended requirement for XML-to-XML transformations. In such cases, intermediary transformation points indeed cannot know about the final document and accordingly /should not/ take any decisions about which data will be needed and which will not (unless explicitly instructed by the programmer, see strip-space above). This rule /must not/ apply to direct data source --> viewable final document transformations. The key point of distinction is "validating XML processor MUST also inform the application", which shows clearly that the paragraph was brutally pulled out of some description of an XML process /external/ to the recipient (say server-side processor <--> client UA).


For a viewable final (X)HTML document, one /could/ apply the rule from
<http://www.w3.org/TR/html4/appendix/notes.html#notes-line-breaks>
"SGML (see [ISO8879], section 7.6.1) specifies that a line break immediately following a start tag must be ignored, as must a line break immediately before an end tag. This applies to all HTML elements without exception. ...<samples showing that it is applied to the physical line breaks, not br elements>..."
(IMHO that was the hook for the chosen XSLT behavior in Gecko)
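
As an illustration of the quoted SGML rule only, here is a small sketch that applies it to a markup string: one line break directly after a start tag and one line break directly before an end tag are dropped, and nothing else. The function name applySgmlLineBreakRule is hypothetical, and the regular expressions are a simplification that assumes no ">" inside attribute values; this is not how any parser actually implements it.

```javascript
// Hypothetical helper: drops a single line break right after a start tag and a
// single line break right before an end tag, as in the quoted SGML rule.
function applySgmlLineBreakRule(markup) {
  const lineBreak = '(?:\\r\\n|\\r|\\n)';
  // "<" followed by a letter is a start tag; "</" is an end tag.
  const afterStartTag = new RegExp('(<[A-Za-z][^<>]*>)' + lineBreak, 'g');
  const beforeEndTag = new RegExp(lineBreak + '(</[A-Za-z][^<>]*>)', 'g');
  return markup.replace(afterStartTag, '$1').replace(beforeEndTag, '$1');
}
```

Note that this rule is narrower than the stripping asked for elsewhere in this report: indentation and any second line break are left untouched.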


4. Dynamic rendered <--> <pre> switching can be implemented by using the default pretty-print format built into the browser (as if you were viewing an .xml file directly). Though I would just block <pre> switching as an unnecessary gizmo. Either you want your code rendered or not, and that's it. If anyone needs to show the source later, let them do it manually through their own scripting with whatever pretty-print they like.


5. All this is only to be /as good as IE/ in this particular aspect. As the task is to be /better than IE/ (at least I hope it is :-), it should also be ensured that the phantom nodes issue is resolved for white-space between single tags as well.
If -moz-strip-space: * also worked for:

<p>
  <img src...>
  <img src...>
</p>

(so that white-space is removed, giving the equivalent of <p><img src...><img src...></p>) then many web developers would get the greatest gift they have dreamed of since 1995 or so.


P.S. So, dropping all these "this spec, that spec" quotations, a truly user-friendly strip-space behavior would be equivalent to
 replace(/>[ \t\r\n\f]+</g,'><');
so that white-space in any amount and any combination is removed between gt and lt. (\u00A0 NON-BREAKING SPACE is not treated as white-space here and is preserved; the pattern lists the characters explicitly because \s would also match it.)
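
A runnable sketch of that replacement, applied to a markup string rather than to a live document. The function name stripInterTagWhitespace and the image file names are made up for the example. JavaScript's \s also matches U+00A0, so the sketch spells out the ASCII white-space characters to keep non-breaking spaces intact.

```javascript
// Hypothetical helper: removes runs of ASCII white-space that sit directly
// between a ">" and the next "<". U+00A0 is not in the class, so it survives.
function stripInterTagWhitespace(markup) {
  return markup.replace(/>[ \t\r\n\f]+</g, '><');
}

const pretty = '<p>\n  <img src="a.png">\n  <img src="b.png">\n</p>';
console.log(stripInterTagWhitespace(pretty));
// prints: <p><img src="a.png"><img src="b.png"></p>
```

White-space next to text content is left alone, because it is not directly enclosed by ">" and "<".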

Comment 1

12 years ago
Please don't file intentional dups unless you're a module owner or peer.

*** This bug has been marked as a duplicate of 26179 ***
Status: UNCONFIRMED → RESOLVED
Last Resolved: 12 years ago
Resolution: --- → DUPLICATE
(Reporter)

Comment 2

12 years ago
(In reply to comment #1)
> Please don't file intentional dups unless you're a module owner or peer.
> 
> *** This bug has been marked as a duplicate of 26179 ***

It is a feature request, not a bug. Before clicking, please read the rationale and study the linked testcases (for XSL transformations).

Status: RESOLVED → UNCONFIRMED
Resolution: DUPLICATE → ---

Comment 3

I have read the bug before - in fact, I was very vocal on it, advocating first the validity of the bug, and then when I realized I was wrong, advocating it be killed as invalid.

Bugzilla is the wrong format for a debate like this.  I've learned that lesson the hard way too.  This should be discussed in mozilla.dev.tech.xml or in a WHATWG e-mail conversation before coming to a Bugzilla report.  A Bugzilla report means "Yes, it is presumed there is a bug."  (I've made this mistake before too.)

Please take this conversation to one of the above places, and if there's a general consensus to do something like what you propose, then we can reopen this bug.
Status: UNCONFIRMED → RESOLVED
Last Resolved: 12 years ago
Resolution: --- → INVALID
(Reporter)

Comment 4

12 years ago
(In reply to comment #3)
> Bugzilla is the wrong format for a debate like this.  I've learned that lesson
> the hard way too.  This should be discussed in mozilla.dev.tech.xml or in a
> WHATWG e-mail conversation before coming to a Bugzilla report.  A Bugzilla
> report means "Yes, it is presumed there is a bug."  (I've made this mistake
> before too.)
> Please take this conversation to one of the above places, and if there's a
> general consensus to do something like you , then we can reopen this bug.

I would ask again that you read feature request 339511 itself, not just the keywords. There is a test case showing the XML parser discarding phantom nodes, which I presume is correct per the XML specs (as there is no bug filed on it). There is a test case showing the HTML parser keeping phantom nodes, which I presume is correct per some specs (as the relevant bug is marked as invalid). Obviously there is something fishy here (besides the huge inconvenience of phantom nodes themselves).

Most probably it will require a discussion at moz.dev and on a W3C mailing list (not sure what the mentioned WHATWG has to do with the reading of official W3C specs??).

At the same time, this feature request was filed after the arguments with Boris Zbarsky over bug #26179 and at his suggestion (see the bug posts). Boris Zbarsky returns home on Monday and most probably will post some decision on it, up to suggesting to file a bug on the current XML parser for stripping phantom nodes (I hope not, but it is possible).

Until then I would ask everyone just to relax and stop playing ping-pong with the bug status. Thanks in advance.
Status: RESOLVED → UNCONFIRMED
Resolution: INVALID → ---

Comment 5

(In reply to comment #0)
> I'm really not a great W3C docs expert, so either XSLT specs indeed allow this
> long-waited convenience, or while Mr.Zbarsky was fighting with dupes for
> #26179, he was betrayed behind his back by the XSLT team (? :-)

Most of the people working on the XSLT engine also work on the DOM. I'm one of them, and I think removing the whitespace-only textnodes is wrong.

> It is interesting to mention that this XSLT convenient behavior is consistent
> even without strip-space command. While the first XSL template contains
>    <xsl:strip-space elements="*"/>
> the following sample uses template w/o any strip-space commands:
>    <http://www.nskom.com/external/tmp/pnodes/index_nostrip.xml>
> yet it produces the same DOM tree (which is great and /please/ do not mark it
> as bug).

"For stylesheets, the set of whitespace-preserving element names consists of just xsl:text."

Your stylesheet doesn't even use the source document, so strip-space is irrelevant.

> This rule /must not/ have an application
> for direct data source --> viewable final document transformations. The key
> point of distinction is "validating XML processor MUST also inform the
> application" which shows clearly that the paragraph is brutaly out of some
> description of a XML process /external/ to the recipient (say server-side
> processor <--> client UA)

Your interpretations aren't convincing. The fact is that in the case of a non-validating parser, there's no sound way to distinguish relevant and irrelevant character data.

I vote for WONTFIXING this too. IMHO you're just wasting people's time by opening an obvious dupe.

Comment 6

> 1-B (Minimum) Add new CSS selectors

CSS selectors can't affect the DOM.  So this whole part of the proposal is not going to work.

> File a request to W3C to change or remove the section

We have no interest in doing so, especially since it's interoperably implemented in modern browsers.  You're free to send your own feedback to the XML working group, of course.

> Though I would just block <pre> switching as unnecessary gismo.

That would violate the CSS spec.

> As the task is to be /better than IE/ (at least I hope it is :-)

The task is to be maximally predictable and interoperable, so as to cause web developers minimal pain.  IE is neither; the current behavior is both.

> thus all white-spaces in any amount and any combinations being removed between

  <p><b>Bold</b> <i>Italic</i></p>

would have no space then.  VERY bad.
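
The effect can be checked on the markup string itself. This is only a sketch under assumptions: removeWhitespaceBetweenTags stands in for the blanket stripping proposed in comment 0, and visibleText is a crude tag remover used for illustration, not a renderer.

```javascript
// Hypothetical stand-in for the proposed "remove all white-space between > and <".
function removeWhitespaceBetweenTags(markup) {
  return markup.replace(/>[ \t\r\n\f]+</g, '><');
}

// Crude approximation of the rendered text: drop every tag.
function visibleText(markup) {
  return markup.replace(/<[^>]*>/g, '');
}

const inlineMarkup = '<p><b>Bold</b> <i>Italic</i></p>';
console.log(visibleText(inlineMarkup));
// prints: Bold Italic
console.log(visibleText(removeWhitespaceBetweenTags(inlineMarkup)));
// prints: BoldItalic
```

The white-space-only text node between the two inline elements is exactly the word separator, so a blanket rule cannot tell it apart from pretty-print indentation.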

I told you to file a bug about DOM extensions to replace .firstChild and .nextSibling if that was what you wanted, but you have, in fact, filed a duplicate of bug 26179.  Clearly some miscommunication there...

To prevent further bugzilla clutter, I suggest taking this to the newsgroups as Alex already suggested.  Keep in mind that any suggestions that we change the current behavior of the Core DOM methods are duplicates of bug 26179.


*** This bug has been marked as a duplicate of 26179 ***
Status: UNCONFIRMED → RESOLVED
Last Resolved: 12 years ago
Resolution: --- → DUPLICATE

Updated

10 years ago
Component: DOM: Core → DOM: Core & HTML
QA Contact: ian → general