Closed Bug 207377 Opened 21 years ago Closed 21 years ago

txMozillaXMLOutput::endHTMLElement needs improvement

Categories

(Core :: XSLT, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: emrick, Assigned: emrick)

Details

(Keywords: perf)

Attachments

(2 files, 5 obsolete files)

User-Agent:       Mozilla/5.0 (OS/2; U; Warp 4.5; en-US; rv:1.3) Gecko/20030313
Build Identifier: Mozilla/5.0 Gecko/20030313

Testcase measurement shows potential for overall 8% improvement from
txMozillaXMLOutput::endHTMLElement
http://lxr.mozilla.org/seamonkey/source/extensions/transformiix/source/xslt/txMozillaXMLOutput.cpp#529
specificly, in the: 
if (atom == txHTMLAtoms::table && !aXHTML) { ... }
code block. 

Testcase is NOT marked as xhtml, yet renders correctly if above is changed to:
 if (0) { ... } 
to no-op the tbody insertion code block.

Some means of short-cutting this check to do tbody insertion processing is
needed. The means to do that is not exactly clear to me right now, but I'm
studying it, and by the bugzilla, hoping to get ideas from others about it.    


Reproducible: Always

Steps to Reproduce:
there's also a crasher bug very close to the lines you want to touch: bug 202765.
Interesting data.
 
The best performance result is obtained by following the XHTML path, either by
hacking the code, or by putting the right XHTML incantation into the
xsl:stylesheet attributtes. 

The document does NOT have well-formed tbody's. In fact, it does not use tbody
at all anyplace. So, curious is this... it appears tbody and /tbody are
redundant anyway... table and /table appear to 'wrap' table-tbody elements just
fine without the need for tbody and /tbody............ yes? no? maybe?  

Good but not best performance result is obtained by editing the document and
adding the tbody (just after each table) and /tbody (just before each /table).
Having the tbody and /tbody already there avoids the wrapChildren path. 

It would seem to me, if it is the case that tbody REALLY needs to be inserted,
then the place to do this would be at compile-time, not execute time. Yes/No ?
It does not look to hard to do there.

I'm going to play with what happens if 

<table blah blah>
<tbody>
<tbody>
...
</tbody>
</tbody>
</table>

if its not bad, then could just unconditionally add a tbody after each table and
not need to 'lookahead' to see if one is already there....

Other ideas or comments on this?




    
1) Mozilla expects the tbody to be there for HTML, see bug 30378.
2) XHTML output is always going to be faster, less fixup to do.
3) When you say the tbody is redundant, are you sure? Does the C in attachment
65021 [details] have a colored background if you skip the tbody insertion code?
4) Let's not make the output even worse by unconditionally adding a tbody.
5) It is not possible to do this at compile time afaik.

My advice is to change the stylesheet to output XHTML and add tbody's in the
stylesheet. We could make this code smarter, but it is going to be nasty because
of nested tables, you'll at least need a stack.
Olivier: bug 202765 is totally unrelated to this one, please don't mix them.
> 1) Mozilla expects the tbody to be there for HTML, see bug 30378.
Much ado about nothing. The code does whats right regardless of what 30378 says. 
If and only if there are multiple table sections (multiple tbody's or tbody with
tfoot and thead) then, of course, tbody tags required. Otherwise optional.  

>2) XHTML output is always going to be faster, less fixup to do.
That will be sweet then... when everybody uses it. 
Unfortunatly, neither you or I will live that long.

>3) When you say the tbody is redundant, are you sure? Does the C in attachment
65021 have a colored background if you skip the tbody insertion code?
If I add the xmlns="http://www.w3.org/1999/xhtml" incantation to thet stylesheet
then ix goes through XHTML path, <tbodys> are not added... and it still looks
same to me. Do you get a different result?
   
>4) Let's not make the output even worse by unconditionally adding a tbody.
I'm not sure I understand your point here?????

>5) It is not possible to do this at compile time afaik.
Why do you think so?

>My advice is to change the stylesheet to output XHTML and add tbody's in the
>stylesheet. 
Change every stylesheet in the world ??? Talk about steady work ! 
Nobody uses tbody's unless they need multiple table sections. 

>We could make this code smarter, but it is going to be nasty because
>of nested tables, you'll at least need a stack.
I am not afraid of a stack. :-)
I guess we could discuss removing this code if it doesn't regress bug 53944. I'd
personally be in favor of keeping the code, in HTML tbody is an inferred element
(and I'd rather be consistent with what our parser does for regular HTML, it
avoids nasty surprises for users). And don't make output nasty with two levels
of tbody if the user has been so good to provide one himself.
In short, I'm more likely to review a patch that improves this code than one
that removes it.
Assignee: peterv → emrick
Status: UNCONFIRMED → NEW
Ever confirmed: true
And there's no way to do this at compile time because you don't have all the
cases where a stylesheet will produce a table in the output. The simplest
example is a copying stylesheet that you feed a document with a table in it.
I've been working with varients of
http://lxr.mozilla.org/seamonkey/source/layout/html/tests/table/viewer_tests/test4.html
and sadness prevails..... and some surprise.... there is no means to
consistently insert a <tbody> in a visually invisable way.... and yes I'm saying
that impacts the rendering of documents processed by xslt and the base gkparser.
(IE produces the intended invisible result, Moz does not). 

(scratches head icon)
Sam, how do you get an 8% number anyway? That pretty much puzzles me.
That's an if(), so at most one branch prediction miss in any real-world testcase.
With all the bad perf problems we have 8% for an if() come as quite a suprise
to me.
First of all, we definitely have to keep the same logic as exists now and let
the outputhandler insert tbody-elements if they are "missing". In theory this
can sometimes be done at compiletime, but not always, see comment 6.

There are two things we can do however:

1. I suspect the content-code ends up doing way too much work when we move the
<table>-children from the <table> to the <tbody>, this is because we use
DOM-methods instead of nsIContent-methods. This is a problem in general with the
output handlers. I have a patch that is half-way done that should improve things.

2. Instead of putting the elements into the <table> and then move them to the
<tbody> we could instead hold off putting childnodes into a <table> until we
know for sure if a <tbody> needs to be inserted or not. That should make the
insertion of the <tbody> next to free.

IMHO we should first do 1 since that should be done anyway. After that we should
investigate if we're still paying a penilty for the moving and if so do 2.
oh, an alternative to doing 2 in my previous comment is to always insert the
<tbody> and then remove it if we end up inserting another table-section-element.
That should make us do the 'optimal' think in almost all realworld cases.
>First of all, we definitely have to keep the same logic as exists
>now and let the outputhandler insert tbody-elements...

Yea, I said that(clumsy way, I guess) with comment #7 - to maintain same visual 
result as base Mozilla, then XSLT must insert tbody's too, and also, XSLT 
cannot insert additional 'redundant' tbody's either. I think this is a base bug 
in tbody-attributes handling, but that's another issue. For now, pertinent 
point is, XSLT must do tbody's to produce same visual results as base Mozilla). 

I think compile-time tbody insertion is worth considering more. Even if it is 
not 100%, it would 'fix-up' many cases... and then the currnet execute-time 
code would 'catch the rest'.   

>"doing way too much work when we move the <table>-children from the <table> to 
the <tbody>"

Exactly so. And because code does not detect the need to do this 'move' until 
the end (endElement). Better would be to detect the condition on the first 
element not appropriate to live under tbody. 

That is, as such: if after a <table>, then first <tr>, before <tbody> would 
signal need to insert <tbody>.... or something like that.

One attribute of today's code... the intent this tbody insertion business is to 
fix-up the most simple kind of tables, tables with just one table secion, those 
tables where tbody is by-spec an optional keyword. 

However, the logic here in XSLT and in mozilla base actually does more, as it 
will do same fix-up to malformed multi-section tables. Simple tables w/o tbody 
are by spec okay. Complex tables with malformed sections are by spec not 
allowed, but the code today attempts to do same sort of tbody insertion fix-up. 
Don't know if anybody thinks this is a problem. (I hope not, solution space 
we're talking about in this bug has same attribute).
Only after instrumenting complex tables, have I come to appreciate the power
allowed by xsl logic and flow control statements embedded into html table
building code....

The result is such -- the potential for COMPILE SIDE optimization of tables
seems limited to a specific but important case -- the case where <TABLE>
StartElement is immediantly followed by a <TR> StartElement. When this is the
case, then insert between these two a <TBODY> StartElement, 

This would just catch and fix-up the most simple cases. The more complex cases
must be, can only be, detected and fixed-up at execution time.

Nevertheless, I think it's useful to consider adding the compile-side fixup, to
handle the most ssimple and most common cases. Looks to me, the place to do it
is in txStylesheetCompiler::startElement or
txStylesheetCompiler::startElementInternal. Means to do this would be, each step
save current startElement atom as previous atom. If previous atom == "table" and
current atom == "tr" then do special code to insert tbody into compiled code
stream. Some care would need be taken in this code to do minimal conversion in
compares and assignments, That is, table and tr tags are a small percent of any
given input tag stream, and overhead to detect the sequence <table><tr> must be
minimal to make this a good optimization. 

Comments? 
- HTML fixup is module only code
- it would take more than a trout to make me wanna have TX_EXE in stylesheet
  compilation
- the very common case of default-html-output-method cannot be caught during
  compilation

I would still like to see what's actually taking the time.

On how to fix this in startElement, the parser code that does this is using
http://lxr.mozilla.org/seamonkey/source/htmlparser/src/COtherElements.h#819
which adds a tbody if you insert a tr or th directly into a table.

We could mockup that, but then we had to have a stack telling us if the current
element is inserted. Not too bad, but I'd prefer to see what's really going on
inside the childmoving, first.
per Axel's request, a trace of wrapChildren portion of endHTMLElement
processing
aaaaahhhh.
We should definitly do sicking-1.
This stuff does security checks and modification notifications and all. No wonder
it's slow as molasses.
Okay..... Jonas convinced me it is NOT possible to do a fix at compile-time with
a testcase that I cannot fugure out how to handle. 

Next try. At execute-time, SIMPLE code to handle the most common case of simple
tables at startElement time (rather than endElement time). The fastest way to do
wrapChildren is to NOT do it at all! This code provides that for the simple
table cases. The endElement is still needed for cases that this code does not
catch. But this code will catch a lot of tables and do them most quickly.  

This is NOT to say that Jonas's 'solution 1' is not needed. That is a bigger
fish. This is a small fish that is good bang for the buck.

Not even big enough to do a patch file. Here's the concept and code:
                                                                               
         
1) In txMozillaXMLOutput.h ......                                              
         
                                                                               
         
In txMozillaXMLOutput class private data, add:                                 
         
+ PRBool mPreviousStartHTMLWasTable,                                           
         
and in Constructor init this bool to PR_FALSE                                  
         
                                                                               
         
2) In txHTMLAtomList.h add a element for the 'tr' tag                          
         
                                                                               
         
3) Then, in txMozillaXMLOutput.cpp, just a bit of added code:                  
         
                                                                               
         
void txMozillaXMLOutput::startHTMLElement(nsIDOMElement* aElement)             
         
{                                                                              
         
    nsCOMPtr<nsIAtom> atom;                                                    
         
    nsCOMPtr<nsIContent> content = do_QueryInterface(aElement);                
         
    content->GetTag(*getter_AddRefs(atom));                                    
         
                                                                               
         
+  // Is this flag necessary? If actually good XHTML, not needed!              
         
+   PRBool aXHTML = PR_FALSE; // Is this needed to reject bad XHTML?           
         
+                                                                              
         
+  // Do insert of <tbody> case where <tr> immediantly follows <table>         
         
+   if (mPreviousStartHTMLWasTable == PR_TRUE &&                               
         
+      atom == txHTMLAtoms::tr && !aXHTML) {                                   
         
+      nsresult rv;                                                            
         
+      nsCOMPtr<nsIDOMElement> elem;                                           
         
+      rv = mDocument->CreateElement(NS_LITERAL_STRING("tbody"),
getter_AddRefs(elem));  
+      NS_ASSERTION(NS_SUCCEEDED(rv), "Can't create tbody element");           
         
+      nsCOMPtr<nsIDOMNode> resultNode;                                        
         
+      rv = mCurrentNode->AppendChild(elem, getter_AddRefs(resultNode));       
         
+      NS_ASSERTION(NS_SUCCEEDED(rv), "Can't append tbody elem node");         
         
+      }                                                                       
         
+                                                                              
         
+   if (atom == txHTMLAtoms::table && !aXHTML)                                 
         
+      mPreviousStartHTMLWasTable = PR_TRUE;                                   
         
+   else mPreviousStartHTMLWasTable = PR_FALSE;                                
         
                                                                               
         
    mDontAddCurrent = (atom == txHTMLAtoms::script);                           
         
    ...                                                                        
         
                                                                               
         

 Please?
That won't work since we need to add the <tr>s inside the inserted <tbody>, not
after it. But the idea is definitly right, we should default to inserting the
<tbody> instead of defaulting to not insert it. See comment 10.
Ew, that's going to get messy in EndElement, we'll have to walk up to the parent
twice if we inserted the tbody ourselves.
>That won't work since we need to add the <tr>s
>inside the inserted <tbody>, not after it. 

Jonas, not sure that I understand you. What this code does is, if and 
only if, the html tag sequence is this:
<table>
<tr>

then it changes the sequence to this:
<table)
<tbody>
<tr>

such that, for example, if the tbody already appears in the html 
source, then this will NOT add another. (startElement code flow wise, 
the new code inserts the tbody if needed, then the code below this does 
the <tr> insert that was the reason this code was entered. 

>Ew, that's going to get messy in EndElement, we'll have 
>to walk up to the parent twice if we inserted the tbody ourselves.

There are no changes required to endElement.

I have tested this code and it works.
 
No, you're actually creating a DOM like

<table>
<tbody></tbody>
<tr>...</tr>
<tr>...</tr>
...
</table>

whereas what we want is

<table>
<tbody>
<tr>...</tr>
<tr>...</tr>
</tbody>
</table>

To do this you need to set mCurrentNode to the newly inserted <tbody>. And then
modify endElement to walk up two nodes when appropriate.

An alternative to walking up two steps is to keep a stack of the current-node at
all levels, which is what the contentsinks do. This is probably something we
should do since that is a requirement to supporting document.write, and it also
protects us from modifications of the document during the transformation.
Keywords: perf
Proposed patch, following suggested implementation. 
Jonas was already so kind as to do first level review.
Additional review requested. Thanks.
Sam
Peterv: Does it matter what we do for tables like:

<table>
  <tr>...</tr>
  <tbody>
    <tr>...</tr>
    <tr>...</tr>
  </tbody>
  <tr>...</tr>
</table>
?

The current code doesn't do anything if any child is a tablesection. The
HTML-parser creates a table with 3 <tbody>s by wrapping a <tbody> around the
first and the last <tr>.

>Peterv: Does it matter what we do for tables like:

Well, "what is right to do" for malformed html is not clear to me. Tbody's are
ONLY optional in a 1-section table. (Basicly, backward compatibility to simple
html 3.2 tables).

Whenever multiple sections are present, (html 4.0, including thead, tfoot, or
multiple tbody's) then tbody is not optional.

There's a more well-formed and more common case that neither current code nor
proposed code handles. The </tbody> tags are optional in html 4.0 even in the
cases where the tbody tag is requeired. With a bit more work, the code here
could handle that case too, I think.  
Jonas points out by email that my mPreviousHTMLTableAddedTbody[32] 'fixed size
stack' is insufficient and this needs to use txStack instead. I agree (and admit
I was being lazy). Assuming we're on the right track here, I'll modify patch to
use txStack. 

Other option is to eliminate the stack altogether. 
Do this as such: 
1) Detect and 'fix up' all instances of "found /TABLE when was expecting /TBODY". 
2) Since TBODY we insert always follow TABLE, no risk of adding an /TBODY in the
wrong place. 
3) This would also fix-up the most simple cases of missing /TBODY in html source.

What do y'all think? Would this be better or worse?
this one insures only the specific end-tbody matching our inserted start-tbody
tag is ever manipulated. 

The added compare, using TX_StringEqualsAtom, is rather performance expensive.
Is there a cheaper way to compare aName equals literal string "TABLE" here?
Sam
Attachment #125773 - Attachment is obsolete: true
Comment on attachment 125844 [details] [diff] [review]
patch varient, insures no other end-tbody is manipulated

We shouldn't add more code that uses the name arg to endElement. That's
supposed to die.
I'm still not sure about the good way to do this.
I wonder if we should make a few odd table constructs and check what the html
parser does to them. For example,
<table>
 <tfoot><!-- stuff -->
 </tfoot>
 <tr/>
</table>
and stuff like that
Okay, I've created a patch varient that does not rely on a stack. This one adds,
same as before, adds <tbody> if <tr> or <th> immediantly follows <table>.

In endElement, it handles case where name was </table>, but element was actually
</tbody> and the parent element is indeed the expected </table>.  

By this means no stack is required. Very specific start tag insertion rule and
likewise specific end tag insertion rule make it such that no state is required.

I'd hoped that this code would also allow/support fix-up of missing 'optional'
end-tbody tags in user html shource. Turns out that is a moot point because
expat refuses to parse such source. (its probably a bug that expat does this,
but not my worry right now). 

Regarding <thead> and <tfoot>, the proposed patch simply ignores such cases and
passes them on to current code to potentially do tbody child creation and
element wrappering.

The intent and function of proposed patch is to handle the most simple forms of
tables in most performance-efficient manner. (based on my opinion that simple
html 3.2-flavor tables are by far cost common form of tables).  

This patch should, ideally at least, make no change to comples table processing.
Ideally, it should make not change 'malformed table' processing either. (

However, that's a lot to ask... as in, current code does not handle malformed
cases either, so I'm not sure I understand if and why there is a requirement
that this patch needs to handle junk input better than current code.)   



 
Attached patch new version (obsolete) — Splinter Review
This one will probably deal with tables containing mixed <tbody>s and <tr>s. It
even does what the html-parser does and wrapps groups of <tr>s (turned out that
was the easiest way to do it).
Attachment #125844 - Attachment is obsolete: true
Attached patch new version that works (obsolete) — Splinter Review
sorry about that guys, this one actually works
Attachment #127831 - Attachment is obsolete: true
Attached patch new version that *really* works (obsolete) — Splinter Review
sorry, forgot to run a final cvs-diff before attaching. The only change between
this one and attachment 127850 [details] [diff] [review] is that this one doesn't set mTableState when
popping from the mTableStateStack in startHTMLElement in the |if (atom ==
tbody)| block
Comment on attachment 127886 [details] [diff] [review]
new version that *really* works

r=axel@pike.org with a comment in txMozillaXMLOutput.h that ADDED_TBODY only
signals that our fixup routine added the tbody, not tbodies added thru the
stylesheet. That should make the trap I almost ran into harmless.
Attachment #127886 - Flags: review+
I just noticed a nice sideeffect from this patch. For tables like:

<table>
  <caption>...</caption>
  <tr>...</tr>
  <tr>...</tr>
</table>

we ended up creating a DOM like

<table>
  <tbody>
    <caption>...</caption>
    <tr>...</tr>
    <tr>...</tr>
  </tbody>
</table>

However with my patch only the 'tr's are wrapped inside the tbody. However that
made me realize that we should not only move out tbodies from the our created
tbody, all non-tr elements should make us step one step up the tree and place
that element outside the tbody.

New patch comming up for that
Attached patch This is goldSplinter Review
This also fixes the problem mentioned in comment 32. Basically I just
rearranged the |if|s in txMozillaXMLOutput::startHTMLElement so that we break
out of an added <tbody> if we try to insert anything other then <tr>s in it.
Attachment #127886 - Attachment is obsolete: true
Attachment #128874 - Flags: superreview?(peter)
Attachment #128874 - Flags: review?(axel)
Comment on attachment 128874 [details] [diff] [review]
This is gold

>Index: source/xslt/txMozillaXMLOutput.cpp
>===================================================================

>@@ -234,26 +236,29 @@ void txMozillaXMLOutput::endElement(cons

> #ifdef DEBUG
>-    nsAutoString nodeName;
>-    mCurrentNode->GetNodeName(nodeName);
>-    NS_ASSERTION(nodeName.Equals(aName, nsCaseInsensitiveStringComparator()),
>-                 "Unbalanced startElement and endElement calls!");
>+    if (mTableState != ADDED_TBODY) {
>+        nsAutoString nodeName;
>+        mCurrentNode->GetNodeName(nodeName);
>+        NS_ASSERTION(nodeName.Equals(aName,
>+                                     nsCaseInsensitiveStringComparator()),
>+                     "Unbalanced startElement and endElement calls!");
>+    }

Do we want to check the parent's nodename for ADDED_TBODY?

>@@ -488,21 +500,60 @@ void txMozillaXMLOutput::closePrevious(P
>         NS_ASSERTION(NS_SUCCEEDED(rv), "Can't append text node");
> 
>         mText.Truncate();
>     }
> }
> 
>-void txMozillaXMLOutput::startHTMLElement(nsIDOMElement* aElement)
>+void txMozillaXMLOutput::startHTMLElement(nsIDOMElement* aElement, PRBool aXHTML)
> {
>+    nsresult rv = NS_OK;
>     nsCOMPtr<nsIAtom> atom;
>     nsCOMPtr<nsIContent> content = do_QueryInterface(aElement);
>     content->GetTag(getter_AddRefs(atom));
> 
>     mDontAddCurrent = (atom == txHTMLAtoms::script);
> 
>-    if (atom == txHTMLAtoms::head) {
>+    if ((atom != txHTMLAtoms::tr || aXHTML) &&
>+             NS_PTR_TO_INT32(mTableStateStack.peek()) == ADDED_TBODY) {

Line this up.

>+        nsCOMPtr<nsIDOMNode> parent;
>+        mCurrentNode->GetParentNode(getter_AddRefs(parent));
>+        mCurrentNode = parent;

mCurrentNode.swap(parent)?

>+        mTableStateStack.pop();
>+    }
>+
>+    if (atom == txHTMLAtoms::table && !aXHTML) {
>+        mTableState = TABLE;
>+    }
>+    else if (atom == txHTMLAtoms::tr && !aXHTML &&
>+        NS_PTR_TO_INT32(mTableStateStack.peek()) == TABLE) {
>+        nsCOMPtr<nsIDOMElement> elem;
>+        if (mDocumentIsHTML) {
>+            rv = mDocument->CreateElement(NS_LITERAL_STRING("tbody"),
>+                                          getter_AddRefs(elem));
>+        }
>+        else {
>+            rv = mDocument->CreateElementNS(NS_LITERAL_STRING(kXHTMLNameSpaceURI),
>+                                            NS_LITERAL_STRING("tbody"),
>+                                            getter_AddRefs(elem));
>+        }
>+        if (NS_FAILED(rv)) {
>+            return;
>+        }
>+        nsCOMPtr<nsIDOMNode> dummy;
>+        rv = mCurrentNode->AppendChild(elem, getter_AddRefs(dummy));
>+        if (NS_FAILED(rv)) {
>+            return;
>+        }
>+        nsresult rv = mTableStateStack.push(NS_INT32_TO_PTR(ADDED_TBODY));
>+        if (NS_FAILED(rv)) {
>+            return;
>+        }
>+        mCurrentNode = elem;

mCurrentNode.swap(elem)?

>+void txMozillaXMLOutput::endHTMLElement(nsIDOMElement* aElement)

>+    if (mTableState == ADDED_TBODY) {
>+        NS_ASSERTION(atom == txHTMLAtoms::tbody, "Added tbody in nontable");

I don't really understand this assertion's message.

>+        nsCOMPtr<nsIDOMNode> parent;
>+        mCurrentNode->GetParentNode(getter_AddRefs(parent));
>+        mCurrentNode = parent;

mCurrentNode.swap(parent)?

>Index: source/xslt/txMozillaXMLOutput.h
>===================================================================

>+        ADDED_TBODY  // the current node is an inserted tbody not comming from
>+                     // the stylesheet

coming
Attachment #128874 - Flags: superreview?(peterv) → superreview+
Comment on attachment 128874 [details] [diff] [review]
This is gold

r=me.

> Do we want to check the parent's nodename for ADDED_TBODY?

no. that stuff is supposed to die. IMHO.

NS_ASSERTION(atom == txHTMLAtoms::tbody, "This should have been the added
tbody");

?

And I like swap().
Attachment #128874 - Flags: review?(axel) → review+
fix checked in
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: