Closed Bug 272702 Opened 20 years ago Closed 19 years ago

Remove skipped content

Categories

(Core :: DOM: HTML Parser, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: mrbkap, Assigned: mrbkap)

References

Details

Attachments

(3 files, 10 obsolete files)

There was some talk about removing skipped content in bug 88952 and it certainly
seems like it'd be possible to do. Skipped content apparently had two goals: one
was to make trailing content work correctly (we we'd be sure to collect all of
the intermediate text tokens). The other was a bit more obscure: it lets us
easily special-case tags (such as <script> and <style>) that need extra
processing done on their content (i.e., we call mSink->ProcessSSCRIPTTag() which
can then get all of the information needed to load a script and process it).

Once skipped content is removed, content will have to do more processing. This
will make processing <script> and <style> very interesting, but I guess we'll
cross that bridge when we come to it.
actually, most of that work has already been done. Script, style and textareas
already get thier 'data' by looking at thier childnodes. When we parse XHTML we
just create the contained text as normal childnodes, there is very little
special work to be done.

That said, for scripts and style there defenetly is a little extra work that
needs to be done. For scripts we need to hold off inserting the script-tag into
the main DOM until the entire tag and all its contents has been parsed,
otherwise we might kick off the script after just half the script is inserted.

For styles we want to signal to the element that it's about to be parsed, and
once it's fully parsed we want to signal that. Otherwise every single
modification to the style-element will cause it to reparse its contained CSS
with a perfhit as result.

Additionally we need to block the parser if either of the elements are loading
external stuff.

Textareas should be strait ahead though. It shouldn't need any special treatment
at all.


An alternative solution to script/style is to use the
DoneCreatingElement/DoneAddingChildren mechanism. That would allow elements to
not do anything (execute scripts, parse stylesheet) until they and their
children are fully parsed. The elements themselvs could then probably block and
unblock the parser, or at least signal that it was needed through returnvalues
or some such.

Bug 239152 is about enabling that mechanism for all elements, however that isn't
as trivial as it sounds since we're creating elements in a whole lot of places
(half of which are in nsHTMLContentSink.cpp). For now it might be easier to just
enable it for scripts and styles.
actually, it might not be as hard as i make it sound. The codepaths for style
and script are already separate enough that we could just insert code to do the
special handling we need there.
> Textareas should be strait ahead though. It shouldn't need any special
> treatment at all.

Would that it were so!  See bug 152329.  I bet it's a bug in the frame code, but
I haven't quite tracked down where yet...
(In reply to comment #1)
> For styles we want to signal to the element that it's about to be parsed, and
> once it's fully parsed we want to signal that. Otherwise every single
> modification to the style-element will cause it to reparse its contained CSS
> with a perfhit as result.

Note that we don't even construct <script> or <style> tokens (in the tokenizer)
until we've either: 1) Run out of room (i.e., found the last chunk of incoming
data, in which case they will be childrenless) or 2) Found and consumed all of
the child data, so I'm not certain if we need to worry about this.

The same goes for <textarea>, <xmp>, <server>, and <title>.
Blake: unfortunatly that won't help since in the contentsink we will still treat
the element and its childnode as a separate transaction. So once we get the
script-starttag-notification we'll create the element with all its attributes
and insert that in the tree. This will cause the scriptelement to try to execute
immideatly, even before we return to the contentsink.

Though you might be saved by the fact that if the script loads an external file
we'll load that asyncronously I think, even if it's cashed. So we'd still have
time to insert the text childnode if it's immideatly available as you say.
Though this seems like a very risky thing to do and I'd prefer if we just held
off inserting the script in the DOM until it was fully created (including its
children)
At the moment I'm employing the hack from 254989 in my fix for bug 277469. Once
this is fix I shouldn't need that code in nsTextEquivContentSink::AddLeaf, in
the accessibility module.
Blocks: 288526
Status: NEW → ASSIGNED
Blocks: 293162
Blocks: 208869
What's the timeframe for this fix?  No chance for Gecko 1.8?
I'm not going to get to this for 1.8b2, but it sounds like there's going to be
enough time to shoot for 1.8b3 (assuming the patch is safe enough to land in beta).
Attached patch WIP in the parser (obsolete) — Splinter Review
This isn't even close to being finished (I haven't even touched
nsHTMLContentSink yet). I'm just attaching it so I don't lose it.
Attached patch more WIP (obsolete) — Splinter Review
More WIP. This runs and does a bunch of the right stuff. The only real stuff
left to do is <script> and <style> handling. This patch is slightly bloated
because I chose a large context setting and it has another patch included in
it. It turns out that this is mostly code removal.
Attachment #183833 - Attachment is obsolete: true
Can someone explain what this bug is about in general terms, so I know what I
should be writing test cases for? Thanks...
Attached patch patch v1 (obsolete) — Splinter Review
As far as I know, this patch actually works. I'd need convincing that it
doesn't. In fact, I'm so sure that it works, I'm going to attach a diff -w 
here also. I'm not going to ask for review yet only because I want to look over
it more.
Attachment #184736 - Attachment is obsolete: true
Attached patch patch v1 -w (obsolete) — Splinter Review
This is simply the -w version of patch v1. For the record, I'm not removing the
"skipped content" field from the element table here (unless people want me to)
since that'll just be more unnecessary complexity to this patch.
Attached file builderror
With that patch I get a build error, I'm using a MingW build on WindowsXP.
There are two problems with the patch:
-- The nsHTMLTokenizer.cpp changes don't belong in it (and are causing the build
error). Removing them should give you a mostly working build.
-- Gmail doesn't work correctly (I think the reduced test case is:
<body><script>document.write('<div>')</script><script>document.write('in the
div?')</script></body>). I'm working on that now.

Thanks for trying to try it! ;-)
Hixie: This bug is about how we deal with parsing elements that require extra
treatment during parsing, such as <textarea>, <title>, <style> and <script>
(blake, am I missing some element?). So it'd be great with some testcases that
stick weird content inside these tags to make sure that they are parsed
correctly. In particular make sure that <script> and <style> work as they
should, even when their content is big enough to arrive to necko in chunks.

Blake: I'm a little bit unweary about using aNotify as a mechanism for not
updating <style> during parsing. I can't off the top of my head think of any
case where it'd fail, but I don't remember all the places where we pass |false|
to the various functions involved here.
In the XML sink we manually disable and enable updates on the style element
while it's being parsed:
http://lxr.mozilla.org/mozilla/source/content/xml/document/src/nsXMLContentSink.cpp#457
http://lxr.mozilla.org/mozilla/source/content/xml/document/src/nsXMLContentSink.cpp#536
I'd feel better if you did that, or used the DoneAddingChildren mechanism as
you've done for <script>
I have gmail fixed and line numbers in <script> and <style> now reflect
something closer to reality. The only known question that's left is how to
handle the <style> content in the content sink. Boris seemed OK with using
aNotify to determine whether or not to update the style sheet, but Jonas wasn't
so sure. I'm inclined to keep things as-is because this way makes sense to me.
Peterv, do you have any ideas?

I'll attach an updated patch in case anybody wants to test it.
Attached patch patch v2 (obsolete) — Splinter Review
This should build and work. I don't know of any bugs with this patch (yet!)
[note my dampened enthusiasm after being betrayed by patch v1!].
Attachment #184956 - Attachment is obsolete: true
Attachment #184957 - Attachment is obsolete: true
> Boris seemed OK with using aNotify to determine whether or not to update the
> style sheet

I thought about it some more, and we want to use DoneCreatingElement no matter
what.  Otherwise we can end up getting ContentAppended notifications when only
some of the kids have been parsed, and those do cause us to recreate the sheet.
 they shouldn't unless DoneCreatingElement happened.
Are you sure that ContentAppended notifications can break this? Wouldn't those
notifications just go to all nsIDocumentObservers and the element would be left
as is? Though I do agree that DoneCreatingElement is the safest approach (mostly
because the aNotify parameter is poorly specified)
Blocks: 277469
Hmm...  Yeah, the element wouldn't see ContentAppended.

And I think I meant DoneAddingChildren, not DoneCreatingElement.
Attached patch patch v2, merged (obsolete) — Splinter Review
This is patch v2, just merged with my checkins today (one of which was making
<textarea> not skipped content). Gmail still works, so I don't think I've
managed to break anything. (actually, the the XML content sink changes here, I
think I've managed to fix <script> in XHTML, but I haven't tested yet).

The current plan WRT <style> is to do the same thing that the XML content sink
does. That is to disable style loading on <style>, and to re-enable it on
</style>. I haven't figured out whether this involves DoneAddingChildren or
not, yet.
Attachment #185045 - Attachment is obsolete: true
Some test cases:
   http://hixie.ch/tests/adhoc/html/parsing/title/
   http://hixie.ch/tests/adhoc/html/parsing/cdata-blocks/
   http://hixie.ch/tests/adhoc/html/parsing/textarea/

...and other directories around there (I'll hopefully be adding more tests in
due course but at the moment, e.g., the last directory above is empty).
(In reply to comment #22)
> The current plan WRT <style> is to do the same thing that the XML content sink
> does. That is to disable style loading on <style>, and to re-enable it on
> </style>. I haven't figured out whether this involves DoneAddingChildren or
> not, yet.

All you should need to do is to stick:

     if (!mInsideNoXXXTag) {
       ssle->InitStyleLinkElement(mParser, PR_FALSE);
       ssle->SetEnableUpdates(PR_FALSE);
     } else {
       ssle->InitStyleLinkElement(nsnull, PR_TRUE);
     }

in the |case eHTMLTag_style| in OpenContainer and remove the same code from
ProcessSTYLEEndTag and you should be good to go.
Attached patch patch v3 (obsolete) — Splinter Review
As far as I can tell, this patch is actually ready for review. If people want
to start taking a look (and/or testing it) that'd be awesome. <style> is
implemented in the same sort of way as in the XML content sink. <script> isn't
broken in XHTML and no changes are required in the XML content sink because it
creates elements with aFromParser = PR_FALSE, so DoneAddingChildren() isn't
used.
Attachment #185108 - Attachment is obsolete: true
No longer blocks: 293162
Comment on attachment 185233 [details] [diff] [review]
patch v3

>Index: content/html/document/src/nsHTMLContentSink.cpp
>@@ -1349,22 +1389,37 @@ SinkContext::CloseContainer(const nsHTML
>     break;
>   case eHTMLTag_iframe:
>     mSink->mNumOpenIFRAMES--;
> 
>     break;
>   case eHTMLTag_select:
>   case eHTMLTag_textarea:
>   case eHTMLTag_object:
>   case eHTMLTag_applet:
>     content->DoneAddingChildren();
>+    break;
>+
>+  case eHTMLTag_script:
>+    result = mSink->ProcessSCRIPTEndTag(content);
>+    break;
>+
>+  case eHTMLTag_style:
>+    result = mSink->ProcessSTYLEEndTag(content);
>+    break;
> 
>+  case eHTMLTag_title:
>+    if (mSink->mInTitle) {
>+      mSink->SetDocumentTitle();
>+      mSink->mInTitle = PR_FALSE;
>+    }

I wouldn't mind seing SetDocumentTitle() die entierly and just inline the code
here. The timing code for that function hardly seems usefull and other then
that it's just a couple of lines of code. Up to you.

>Index: parser/htmlparser/src/CNavDTD.cpp
>@@ -467,24 +465,23 @@ nsresult CNavDTD::BuildModel(nsIParser* 
>           // The content sink has requested that DTD interrupt processing tokens
>           // So we need to make sure the parser is in a state where it can be
>           // interrupted. 
>           // The mParser->CanInterrupt will return TRUE if BuildModel was called
>           // from a place in the parser where it prepared to handle a return value of
>           // NS_ERROR_HTMLPARSER_INTERRUPTED.
>           // If the parser has mPrevContext then it may be processing
>           // Script so we should not allow it to be interrupted.
>           // We also need to make sure that an interruption does not override
>           // a request to block the parser.
>-          if ((mParser->CanInterrupt()) && 
>-            (nsnull == mParser->PeekContext()->mPrevContext) && 
>-            (eHTMLTag_unknown==mSkipTarget) &&
>-            NS_SUCCEEDED(result)) {
>+          if (mParser->CanInterrupt() && 
>+             !mParser->PeekContext()->mPrevContext && 
>+              NS_SUCCEEDED(result)) {

Indent the second test one more char.

Still digging into the parser changes
Could you attach a -w version of this patch? I'm trying to grok the
CNavDTD::HandleToken changes
Attached patch merged to trunk (obsolete) — Splinter Review
This is patch v3, merged to the trunk (it should apply cleanly). I haven't
addressed any comments in this, yet (esp. since they're minor so far). -w
version coming up.
Attachment #185233 - Attachment is obsolete: true
Attached patch -w version for easier reviewing (obsolete) — Splinter Review
Attachment #189783 - Flags: review?(bugmail)
Blocks: 279676
Comment on attachment 189783 [details] [diff] [review]
-w version for easier reviewing

>Index: parser/htmlparser/src/CNavDTD.cpp
>@@ -884,20 +766,27 @@ nsresult CNavDTD::HandleToken(CToken* aT
...
>+              eHTMLTags top = mBodyContext->Last();
>+              if (top != eHTMLTag_head &&
>+                  gHTMLElements[top].CanContain(theTag, mDTDMode)) {
>+                // Some tags (such as <object> and <script>) are opened in the
>+                // head and allow other non-head content to be children.
>+                break;
>+              }

Might be nice to mention that these commonly are textnodes under <script> and
<style>


>@@ -1866,20 +1700,32 @@ nsresult CNavDTD::HandleEndToken(CToken*
...
>+    case eHTMLTag_script:
>+      // Note: we don't fall through to the default case because
>+      // CloseContainersTo() has the bad habit of closing tags that are opened
>+      // by document.write(). Fortunately, the tokenizer guarantees that no
>+      // actual tags appear between <script> and </script> so we won't be
>+      // closing the wrong tag.

This could be a bit more descriptive. Something like:

Note: we don't fall through to the default case because CloseContainersTo()
will loop calling CloseContainer() and close first the <script> as then
containers opened by document.write. Fortunately...

>+      NS_ASSERTION(mBodyContext->Last() == eHTMLTag_script,
>+                   "The tokenizer isn't enforcing proper <script> behavior");
>+      mBodyContext->Pop();
>+      result = CloseContainer(eHTMLTag_script, theChildTag, PR_FALSE);

You need to handle the case where there's a stray '</script>' in the document.
You can probably just ignore it.

>@@ -3311,26 +3077,40 @@ CNavDTD::CloseContainer(const eHTMLTags 
...
>+      if ((mFlags & NS_DTD_FLAG_HAD_BODY) &&
>+          mBodyContext->GetCount() + 1 == mHeadContainerPosition) {
>+        result = CloseHead();
>+        mHeadContainerPosition = -1;
>+      }

why the +1?

> nsresult CNavDTD::OpenHead(const nsIParserNode *aNode)
...
>-    result = mSink ? mSink->OpenHead(*aNode) : NS_OK;
>+    if (mSink && aNode) {
>+      result = mSink->OpenHead(*aNode);
>+    }
>+    else if (mSink) {
>+      result = mSink->OpenHead();
>+    }

if (mSink) {
  result = aNode ? mSink->OpenHead(*aNode) :
		   mSink->OpenHead();
}

maybe? Or maybe even only have one OpenHead signature that allows null to be
passed?


>+nsresult CNavDTD::AddHeadContent(nsIParserNode *aNode){
...
>+    else {
>+      mBodyContext->Push(NS_CONST_CAST(nsCParserNode*,
>+                           NS_STATIC_CAST(const nsCParserNode*, aNode)),
>+                         nsnull, PR_FALSE);

That const-cast seems unneccesary


>Index: parser/htmlparser/src/COtherElements.h
>       result=theHead->OpenContext(aNode,aTag,aContext,aSink);
>       if(NS_SUCCEEDED(result)) {
>-        result=aSink->SetTitle(mText);
>+        //result=aSink->SetTitle(mText);

Just remove the line compleatly.

This is a great patch! Codesimplification rocks! I still don't quite understand
the way the calls to CloseHead is done. It seemed like we used to always call
it on non-head-element-start-tags, but now we just call it when we open the
body and after we've reopened the head. Are we always calling OpenBody at the
right place?
(In reply to comment #30)
> You need to handle the case where there's a stray '</script>' in the document.
> You can probably just ignore it.

I've turned this assertion into:
      if (mBodyContext->Last() != eHTMLTag_script) {
        // Except if we're here, then there's probably a stray script tag.
        NS_ASSERTION(mBodyContext->LastOf(eHTMLTag_script) == kNotFound,
                     "Mishandling scripts in CNavDTD");
        break;
      }

> >+          mBodyContext->GetCount() + 1 == mHeadContainerPosition) {
> why the +1?

When I was setting mHeadContainerPosition, I was pushing the new node onto the
DTD context before actually setting it. By the time we're here, we've already
popped the head container off of the context stack, so I needed to add 1. I've
flipped the code in CNavDTD::AddHeadContent to set mHeadContainerPosition before
pushing the node onto the context stack and removed this + 1.


> That const-cast seems unneccesary

Thanks for catching this!

> >-        result=aSink->SetTitle(mText);
> >+        //result=aSink->SetTitle(mText);
> 
> Just remove the line compleatly.

Heh, the other DTD gets no lovin' ;-). I commented this out before I was
entirely confident with my nsIHTMLContentSink changes and never really revisited
it. I've removed it locally, but with the note that the other DTD is going away
entirely very soon (thanks peterv!).

> 
> This is a great patch! Codesimplification rocks! I still don't quite understand
> the way the calls to CloseHead is done. It seemed like we used to always call
> it on non-head-element-start-tags, but now we just call it when we open the
> body and after we've reopened the head. Are we always calling OpenBody at the
> right place?
> 

We are indeed calling OpenBody at the right places. On IRC we discussed not
doing this juggling at all (i.e., not moving <style> to the head) and that would
actually simplify things a lot in these cases. Once this bug is checked in, I'll
open up a followup bug to look at that more carefully.

I've addressed all of the other comments locally (though
HTMLContentSink::SetDocumentTitle still does live).
Comment on attachment 189783 [details] [diff] [review]
-w version for easier reviewing

sounds great, r=me
Attachment #189783 - Flags: review?(bugmail) → review+
Attachment #189783 - Flags: superreview?(jst)
Note to self: <script>, <style>, <title> and <server> all need to have the
kNoStyleLeaksIn special prop set on them.
Attached patch merged to trunk, again (obsolete) — Splinter Review
jst asked for an updated patch. This patch addresses all review comments so far
and fixes the userdefined tag without explicitly declared head problem that I
recently fixed. I'll attach a -w in a second for review. (note: I'm carrying
over sicking's review).
Attachment #189782 - Attachment is obsolete: true
Attachment #189783 - Attachment is obsolete: true
Attached patch better updatesSplinter Review
Attachment #195162 - Attachment is obsolete: true
This is the latest and greatest.
Attachment #195189 - Flags: superreview?(jst)
Attachment #195189 - Flags: review+
Comment on attachment 195189 [details] [diff] [review]
better updates -w

- In parser/htmlparser/public/nsIDTD.h:

-    NS_IMETHOD CollectSkippedContent(PRInt32 aTag, nsAString& aContent,
PRInt32 &aLineNo) = 0;

Not that it probably makes a difference, but you should update the IID for this
interface, same for nsIHTMLContentSink

- In SinkContext::OpenContainer():

+    case eHTMLTag_title:
+      if (mSink->mDocument->GetDocumentTitle().IsVoid()) {

Don't you mean .IsEmpty() here?

- In SinkContext::CloseContainer():

+  case eHTMLTag_title:
+    if (mSink->mInTitle) {
+      mSink->SetDocumentTitle();

This seems like it should be renamed now, as it doesn't really "set" the title
(doesn't get a title value or anything). Maybe UpdateDocumentTitle() in stead?

sr=jst
Attachment #195189 - Flags: superreview?(jst) → superreview+
Fix checked into trunk. Looks like Tp on btek dropped ~20ms or so. I've already
found a couple of regressions this patch has introduced. I'll file new bugs on
those.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Yeah, this helped out Tp on other tinderboxen as well.
Depends on: 307797
Depends on: 307821
After the fix for bug 307821 Tp is back where it started.
Depends on: 308140
Attachment #189783 - Flags: superreview?(jst)
Depends on: 308592
Depends on: 308781
Beginning the day this appeared on the trunk, the STYLE data from the header of
HTML e-mail is displayed in the body when viewing it as simple HTML.  

This looks like a likely culprit - is it related?
Probably, please file new bugs and mark them as blocking this one.
Depends on: 309012
Blocks: 258538
Depends on: 328751
No longer depends on: 328751
Depends on: 328751
No longer depends on: 328751
Depends on: 328751
Depends on: 329007
Blocks: 396568
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: