If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

nsDocument::Destroy needs to die

RESOLVED FIXED

Status

()

Core
DOM
P1
normal
RESOLVED FIXED
11 years ago
6 years ago

People

(Reporter: sicking, Assigned: sicking)

Tracking

(Blocks: 1 bug)

Trunk
Points:
---
Dependency tree / graph
Bug Flags:
blocking1.9 +
in-testsuite -

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [sg:critical])

Attachments

(1 attachment, 6 obsolete attachments)

We need to get rid of the hidiousness of nsDocument::Destroy. All sorts of code gets confused by the fact that we call UnbindFromTree on all the children of the document without neither sending out any nsIMutationObserver notifications, or actually removing the nodes from the childlist of the document.

The attachment in bug 326645 comment 20 had to be backed out because the layout code tried to render the half-destroyed document. Bug 330302 has some analysis of what went on there.

Bug 348062 is another example where we end up crashing because we end up with dangling pointers in nsContentList.
Blocks: 348062
Adding some more bugs that this blocks...
Btw, It would be great if we could use graydons cycle collector rather than nsDocument::Destroy to fix these leaks.
Blocks: 266754
Blocks: 361226

Comment 3

11 years ago
Is this easier now that Graydon's cycle collector is in?  I'd really like to see bug 335896 and bug 361226 go away, since they interfere with my testing.
Oh, yes, this'd be awesome to fix now with the collector. Do we know which cycles cause leaks?
We should just go through the UnbindFromTree impls and hook up everything cleaned up by them to cycle collection.  As I recall, the main issue was XUL stuff (controllers, etc).
I'll get on this one.
Assignee: general → bugmail

Updated

11 years ago
Blocks: 370810
Depends on: 367779, 368530, 368549
Depends on: 368773
Blocks: 359821

Updated

11 years ago
Flags: blocking1.9+
Target Milestone: --- → mozilla1.9alpha5
Jonas?  What's the story here?  Gonna make A5?  Let's change this if necessary to A6.
Target Milestone: mozilla1.9alpha5 → mozilla1.9alpha6

Updated

10 years ago
Whiteboard: [sg:critical]
Created attachment 268039 [details] [diff] [review]
Patch to fix

So this makes us not UnbindFromTree nodes in nsDocument::Destroy. It also makes us not kill the hash that keeps wrappers alive.

I audited all ::UnbindFromTree implementations and the only thing I found not hooked up to the cycle collector were these three things:

1. mControllers. There are piles of different controllers and many implemented
   in javascript. The common ownership seems to be
     mControllers->nsXULControllers->nsBaseCommandController->
     nsControllerCommandTable->{{stuff}}
   However there might be other ones too for all I know and there is plenty of
   "stuff".
   But I think that we want to kill all controllers from nsDocument::Destroy
   anyway, so this shouldn't be a problem. I simply walk the DOM and manually
   unhook all mControllers.
   I did add code to unhook mControllers in Unlink though since we already
   traverse it.
2. nsHTMLFormElement::mWebProgress. I made this a weak reference instead.
3. Access keys. A few elements unregister their access keys from UnbindFromTree.
   This makes the nsEventStateManager no longer hold strong refs to these nodes.
   However I can't see a way to get the nsEventStateManager to participate in a
   cycle so I think we should be fine there.

Boris, let me know if you won't have time to review this tonight or tomorrow and I'll try to get someone else to do it.
Attachment #268039 - Flags: superreview?
Attachment #268039 - Flags: review?(bzbarsky)
Attachment #268039 - Flags: superreview? → superreview?(bzbarsky)
Comment on attachment 268039 [details] [diff] [review]
Patch to fix

Arg, I'm still leaking with this :(
Attachment #268039 - Attachment is obsolete: true
Attachment #268039 - Flags: superreview?(bzbarsky)
Attachment #268039 - Flags: review?(bzbarsky)
Comment on attachment 268039 [details] [diff] [review]
Patch to fix

>Index: content/base/src/nsDocument.cpp
>-    // mRootContent.  If we did this (or at least the latter), we could remove
>-    // the silly null-checks in nsHTMLDocument::MatchLinks.

File a followup bug to remove the null-check?

>+  nsNodeUtils::KillControllers(this);

I think we should work on making all controllers participate in cycle collection (the JS ones do automatically, right?) and removing this code, possibly.

>Index: content/base/src/nsNodeUtils.cpp
>+nsNodeUtils::KillControllers(nsINode* aRoot)

>+  if (aRoot->IsNodeOfType(nsINode::eXUL)) {
>+    nsXULElement* xul = NS_STATIC_CAST(nsXULElement*, aRoot);

Ugh.  Maybe nsXULElement::FromContent should take an nsINode* instead.  Followup bug?

Thinking about why this change starts to leak.
We might want to add nsXTFElementWrapper and stuff it points to(?) to cycle collection.  I doubt that's your issue, though.

Are you leaking on documents with subframes?  Or just on starting and quitting the browser?

Could we be leaking through XBL?  Does calling 
ChangeDocumentFor to null for each node on the binding manager help the leak?

Past that, I don't really know what could be going on.  Stylesheets don't hold strong refs into the DOM, that I can see...
Created attachment 268305 [details] [diff] [review]
Patch v2

This also kills all XBL bindings from a document when it's being destroyed. It's unfortunate, but it's no worse then we are currently. I'll file a bug on getting it removed.
Attachment #268305 - Flags: superreview?(jst)
Attachment #268305 - Flags: review?(jst)
Created attachment 268307 [details] [diff] [review]
Patch v2.1

Forgot to do a final diff before attaching
Attachment #268305 - Attachment is obsolete: true
Attachment #268307 - Flags: superreview?(jst)
Attachment #268307 - Flags: review?(jst)
Attachment #268305 - Flags: superreview?(jst)
Attachment #268305 - Flags: review?(jst)
I added nsXTFElementWrapper, but the stuff that it points to is implemented by extensions. But at least this way XTF extensions could take part of cycle collection.
Right.  And if it's implemented in JS (which I think is the idea), it participates automatically, right?
Yeah, though XForms uses C++
Comment on attachment 268307 [details] [diff] [review]
Patch v2.1

r+sr=jst
Attachment #268307 - Flags: superreview?(jst)
Attachment #268307 - Flags: superreview+
Attachment #268307 - Flags: review?(jst)
Attachment #268307 - Flags: review+
Checked in. Thanks for the review!
So this bug is fixed then?
Status: NEW → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED

Comment 20

10 years ago
RLk on the Firefox Linux tinderbox jumped from 3KB to 982KB around the time this was checked in :(
Backed out :(
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Have to push this unfortunately :(
Target Milestone: mozilla1.9alpha6 → mozilla1.9beta1

Updated

10 years ago
Target Milestone: mozilla1.9 M8 → ---
Created attachment 282791 [details] [diff] [review]
Sync to tip

Same as above but synced to tip. Testing now I don't seem to leak any more.
Attachment #268307 - Attachment is obsolete: true
Comment on attachment 282791 [details] [diff] [review]
Sync to tip

got a=jst on landing to see if it still leaks.
Attachment #282791 - Flags: approval1.9+
It looks like this turned the talos tinderboxes red (crashing).

(It's really confusing to understand what the talos tinderboxes are doing because of bug 393338, though.)
In particular, it seems (from the log) that it's the talos Tp test, although it's a little hard to tell from the log.
I backed this out to see if it fixes the talos tinderbox red.
[28 21:48:37] <dbaron> bhearsum, did you see what it was doing when it crashed?
[28 21:48:42] <dbaron> bhearsum, (which test it was running?)
[28 21:48:45] <bhearsum> it was running through the pageset
[28 21:48:45] <bhearsum> nothing special
[28 21:48:48] <bhearsum> tp

Comment 29

10 years ago
I think this patch made http://www.somethingawful.com/d/photoshop-phriday/ leak.
I have the leak pinned down. Gotta get Tp2 installed to try to reproduce the crash.
Status: REOPENED → ASSIGNED
Created attachment 283108 [details] [diff] [review]
patch v3

Turns out the crash and the leak was caused by the same problem. The leak problem was that frame-elements holds a strong reference to the nsFrameLoader, which holds a strong reference to the child-frames nsDocShell which holds a strong reference to its document. So if there are any references from the inner document to the outer we have a cycle, and since we still can't cyclecollect nsDocShells we leak.

I fixed this by destroying the nsFrameLoader while recursively "destroying" the content tree. We should additionally hook up nsDocShell to the cycle collector, but that's more work so I'll do that separately.

Not destroying the frameloader also caused crashes as we are currently relying on nsFrameLoader::Destroy to null out mParentDocument on the inner document. This is not very reliable as the outer document can get destroyed before the frameloader. I bet it's possible to come up with testcases that crashes because of this.

Instead I made the nsDocument code that deals with the sub-document hash guarantee that if a subdocument is removed from the hash (as happens when the outer document is destroyed) it nulls out the mParentDocument pointer of the subdocument.
Attachment #282791 - Attachment is obsolete: true
Attachment #283108 - Flags: superreview?(jst)
Attachment #283108 - Flags: review?(jst)
Attachment #283108 - Attachment is patch: true
Attachment #283108 - Attachment mime type: application/octet-stream → text/plain
Comment on attachment 283108 [details] [diff] [review]
patch v3

Oops, i still have some #if 0 code in there. Removed locally.

Updated

10 years ago
Attachment #283108 - Flags: superreview?(jst)
Attachment #283108 - Flags: superreview+
Attachment #283108 - Flags: review?(jst)
Attachment #283108 - Flags: review+
Had to back this out since it caused perf regressions on the talos Tp runs. My theory is this:

So for a lot of documents we now rely on the cycle collector to delete the document. This could cause Tp regressions for two reasons:

1. We used to tear down the document as soon as it was navigated away from (modulo things like bfcache, but we'd still throw out one document right at the end of each navigation). This had a higher chance of being between two page-loads of the Tp test.
2. Cycle collection is probably slower than normal tear-down since we call unlink on each node in the collected cycle. For cycles that involve the document-node these cycles can presumably be pretty big.

One way to fix 1 would be to make Tp force a cycle collection between page-loads. During normal browsing this should normally happen, but since Tp tests continuously loads page after page chances of cycle collecting between two pages gets pretty small.

Not sure if 2 is an actual problem as long as we are able to schedule cycle collections in situations where the user isn't bothered. Fixing bug 373462 should take care of that hopefully.
There's a third possible reason:

3.  Talos is not measuring the right thing.

Exactly what quantity does Talos measure?  Note that the old Tp test was not affected, and it usually shows 3-5% across-the-board regressions just fine.
The numbers that you see reported by Talos on the Firefox tinderbox page are generated by Tp2 - so effectively you are seeing the same page load time collection as on the other tinderbox test machines.

The difference that we see in results is usually presumed to be because of the complexity of the talos page set - compared to the 40 circa year 2000 web pages being used on test tinderboxes talos uses ~400 pages that are now about 6 months old.

We are also starting to build up a history of talos showing regressions where the test tinderboxes do not.  We are not yet totally, 100% convinced that this is because the talos numbers are correct, but we have seen regression causing patches backed out causing an improvement in talos results.  At the very least, I believe that talos regressions are worthy of investigation and as they are starting to develop a history they will become more and more trustworthy.
OK.  Note that we've had issues with Tp2 in the past that I'm not sure we ever resolved, both in terms of measurement methodology (it had some bugs in the number-crunching, pure and simple) and in terms of what it's measuring (loads from file:// and loads from http:// often have somewhat different performance characteristics).  Tp2 is useful, but not a replacement for Tp.

Jonas, if this is Tp2 there should be no bfcache involved: the test runs in an iframe.  It also means that the time is measured from right before window.location is set to when onload fires on the subframe.  So I would have really expected cycle collection to happen during the timing, both before and now.
Just for clarity the tp2 test that talos is running loads from http:// with pages served by a local apache install.

I can also say things about Vlad's new pageloader, but then I fear that I'm getting farther and farther off topic.
Jonas, I wonder whether it helps to change the UnbindFromTree() calls in nsGenericElement::Unlink to be shallow.  Right now if we happen to unlink in more or less root-to-leaf order we end up unbinding the leaves once for each ancestor.

And because we know that all descendants will be unlinked if we're being unlinked, we don't have to worry about leaving dangling document pointers or whatnot... I think.
Unfortunately doing a shallow unbind would not be entierly correct. Imagine a situation like (ext is an external reference):

   Y         D -
    \       /    \
 +-->X  B  C     |
 |    \ | /      |
ext    \|/      /
        A <----

In this case we would cycle collect the A-B-C-D cycle, but only shallowly unbinding X, leaving Y to still believe that it still is connected to A.

This might actually be good enough as in most cases Y will only care about if it's connected to A or not when it's part of a document. I think forms and SVG are cases that care, but only when connected to the document.

However ideally cycle collecting shouldn't be timed as part of the Tp tests as we should try to move it to a point when it won't interfere with the user (see also bug 373462).

The weird part is that when I do some logging it looks like at least locally on my machine cycle collection happens between pageloads. I logged timing-start, timing-end, cyclecollection and document-dtor. In basically all cases cyclecollection and document-dtor happened between timing-end and timing-start.
I'm not sure what that diagram represents.  Are those DOM nodes?  Which way are the parent/child relationships?

In general, since parents own their kids we can't collect a kid while the parent is alive.  So if we're unlinking a node, that means all its ancestors are being collected as well.  We already have that situation on branch right now:  if you build up a DOM in JS, never stick it in a document, and then forget about it, we'll GC that and as the nodes go away we'll do shallow unbinds on the kids.

I guess we could have an issue where a kid is holding a weak ref to some ancestor and doesn't clear the ref because while the ancestor is being cycle-collected the kid isn't?  Or something like that?

Ideally we'd do a deep unbind on the root of any subtree being collected, and shallow unbinds on the rest of it, right?

As for timing this as part of Tp and interference with the user...  In practice it interferes with the user, imo.  We might as well be measuring the extent (and trying to mitigate it).
In the graph all nodes except 'ext' are DOM elements. A is the root of the tree.

I didn't realize that we do shallow unbinds from the dtor of nsGenericElement, but we do indeed. This is actually a bit risky as not necessarily all descendants in the tree will be destroyed. So for example an <input> element that is a descendant of a <form> will not notice that the form is going away and will keep a weak pointer to it :(
I think we should make that pointer in particular a strong pointer that gets nulled out on Unlink.
Attachment #282791 - Flags: approval1.9+

Comment 43

10 years ago
Now that bug 373462 is fixed, is this patch ready to land (after beta)?
Depends on: 373462
Priority: -- → P1

Comment 44

10 years ago
should we land this?
Depends on: 403835

Comment 45

10 years ago
Jonas - we makin progress here?
Whiteboard: [sg:critical] → [sg:critical] Blocked by bug 403835
Created attachment 290639 [details] [diff] [review]
Patch v3.5

This syncs to tip and also adds a couple of performance improvements:

1. Call DestroyLinkMap first in the nsDocument unlink function
2. Don't do deep unbind in the nsGenericElement unlink function

Both these are just mimicking what we're doing in destructors of these classes.

Got r/sr=jst over irl
Attachment #283108 - Attachment is obsolete: true
Attachment #290639 - Flags: superreview+
Attachment #290639 - Flags: review+
Created attachment 290799 [details] [diff] [review]
Patch v4

So this moves back to using recursive UnbindFromTree. We definitely still have to do that when IsInDoc is true, but we can possibly do non-recusive when it's not. But I'll do that as a separate patch instead, if we feel that we want to.

I also added a SaveState call to nsGenericHTMLFormElement::DestroyContent so that we're still saving state where we used to.
Attachment #290639 - Attachment is obsolete: true
Looks like it finally stuck!

/me does victory dance
Status: ASSIGNED → RESOLVED
Last Resolved: 10 years ago10 years ago
Resolution: --- → FIXED

Comment 49

10 years ago
Congrats!
Tests for this should be associated with the dependencies, so nothing to do here.
Flags: in-testsuite-
Depends on: 406840

Updated

10 years ago
Depends on: 406684
Depends on: 406852

Updated

10 years ago
Whiteboard: [sg:critical] Blocked by bug 403835 → [sg:critical]
Depends on: 408268
Blocks: 370811
So does this mean that the InZombieDocument() checks in presshell need updating?
Not really sure what the purpose of that code is, so not sure. Could be that things work fine since the document is 'intact', or we might want to test mIsGoingAway
roc, mats, do you remember what the point of the InZombieDocument() stuff is?

I'd do the CVS archeology, but I'm not sure I'll have time for it until a few days from now.  :(
I think at least part of it was fixing the bug where app shortcut keys didn't work at all during the transition between documents (assuming the content area was focused) -- originally bug 110718, with some later regression fixes on top.
Jonas, can you check on that, please?  That's been a major annoyance in that past; I'm hoping we didn't regress it...
It seems to work fine for me.
Depends on: 412920
Depends on: 419132

Updated

10 years ago
Depends on: 415192
Group: core-security
Depends on: 620122
You need to log in before you can comment on or make changes to this bug.