Closed
Bug 348156
Opened 18 years ago
Closed 17 years ago
nsDocument::Destroy needs to die
Categories
(Core :: DOM: Core & HTML, defect, P1)
Core
DOM: Core & HTML
Tracking
()
RESOLVED
FIXED
People
(Reporter: sicking, Assigned: sicking)
References
(Blocks 1 open bug)
Details
(Whiteboard: [sg:critical])
Attachments
(1 file, 6 obsolete files)
21.83 KB,
patch
|
Details | Diff | Splinter Review |
We need to get rid of the hidiousness of nsDocument::Destroy. All sorts of code gets confused by the fact that we call UnbindFromTree on all the children of the document without neither sending out any nsIMutationObserver notifications, or actually removing the nodes from the childlist of the document.
The attachment in bug 326645 comment 20 had to be backed out because the layout code tried to render the half-destroyed document. Bug 330302 has some analysis of what went on there.
Bug 348062 is another example where we end up crashing because we end up with dangling pointers in nsContentList.
Comment 1•18 years ago
|
||
Adding some more bugs that this blocks...
Assignee | ||
Comment 2•18 years ago
|
||
Btw, It would be great if we could use graydons cycle collector rather than nsDocument::Destroy to fix these leaks.
Comment 3•18 years ago
|
||
Is this easier now that Graydon's cycle collector is in? I'd really like to see bug 335896 and bug 361226 go away, since they interfere with my testing.
Assignee | ||
Comment 4•18 years ago
|
||
Oh, yes, this'd be awesome to fix now with the collector. Do we know which cycles cause leaks?
Comment 5•18 years ago
|
||
We should just go through the UnbindFromTree impls and hook up everything cleaned up by them to cycle collection. As I recall, the main issue was XUL stuff (controllers, etc).
Assignee | ||
Updated•18 years ago
|
Updated•18 years ago
|
Flags: blocking1.9+
Assignee | ||
Updated•18 years ago
|
Target Milestone: --- → mozilla1.9alpha5
Comment 7•18 years ago
|
||
Jonas? What's the story here? Gonna make A5? Let's change this if necessary to A6.
Assignee | ||
Updated•18 years ago
|
Target Milestone: mozilla1.9alpha5 → mozilla1.9alpha6
Updated•18 years ago
|
Whiteboard: [sg:critical]
Assignee | ||
Comment 8•18 years ago
|
||
So this makes us not UnbindFromTree nodes in nsDocument::Destroy. It also makes us not kill the hash that keeps wrappers alive.
I audited all ::UnbindFromTree implementations and the only thing I found not hooked up to the cycle collector were these three things:
1. mControllers. There are piles of different controllers and many implemented
in javascript. The common ownership seems to be
mControllers->nsXULControllers->nsBaseCommandController->
nsControllerCommandTable->{{stuff}}
However there might be other ones too for all I know and there is plenty of
"stuff".
But I think that we want to kill all controllers from nsDocument::Destroy
anyway, so this shouldn't be a problem. I simply walk the DOM and manually
unhook all mControllers.
I did add code to unhook mControllers in Unlink though since we already
traverse it.
2. nsHTMLFormElement::mWebProgress. I made this a weak reference instead.
3. Access keys. A few elements unregister their access keys from UnbindFromTree.
This makes the nsEventStateManager no longer hold strong refs to these nodes.
However I can't see a way to get the nsEventStateManager to participate in a
cycle so I think we should be fine there.
Boris, let me know if you won't have time to review this tonight or tomorrow and I'll try to get someone else to do it.
Attachment #268039 -
Flags: superreview?
Attachment #268039 -
Flags: review?(bzbarsky)
Assignee | ||
Updated•18 years ago
|
Attachment #268039 -
Flags: superreview? → superreview?(bzbarsky)
Assignee | ||
Comment 9•18 years ago
|
||
Comment on attachment 268039 [details] [diff] [review]
Patch to fix
Arg, I'm still leaking with this :(
Attachment #268039 -
Attachment is obsolete: true
Attachment #268039 -
Flags: superreview?(bzbarsky)
Attachment #268039 -
Flags: review?(bzbarsky)
Comment 10•18 years ago
|
||
Comment on attachment 268039 [details] [diff] [review]
Patch to fix
>Index: content/base/src/nsDocument.cpp
>- // mRootContent. If we did this (or at least the latter), we could remove
>- // the silly null-checks in nsHTMLDocument::MatchLinks.
File a followup bug to remove the null-check?
>+ nsNodeUtils::KillControllers(this);
I think we should work on making all controllers participate in cycle collection (the JS ones do automatically, right?) and removing this code, possibly.
>Index: content/base/src/nsNodeUtils.cpp
>+nsNodeUtils::KillControllers(nsINode* aRoot)
>+ if (aRoot->IsNodeOfType(nsINode::eXUL)) {
>+ nsXULElement* xul = NS_STATIC_CAST(nsXULElement*, aRoot);
Ugh. Maybe nsXULElement::FromContent should take an nsINode* instead. Followup bug?
Thinking about why this change starts to leak.
Comment 11•18 years ago
|
||
We might want to add nsXTFElementWrapper and stuff it points to(?) to cycle collection. I doubt that's your issue, though.
Are you leaking on documents with subframes? Or just on starting and quitting the browser?
Could we be leaking through XBL? Does calling
ChangeDocumentFor to null for each node on the binding manager help the leak?
Past that, I don't really know what could be going on. Stylesheets don't hold strong refs into the DOM, that I can see...
Assignee | ||
Comment 12•18 years ago
|
||
This also kills all XBL bindings from a document when it's being destroyed. It's unfortunate, but it's no worse then we are currently. I'll file a bug on getting it removed.
Attachment #268305 -
Flags: superreview?(jst)
Attachment #268305 -
Flags: review?(jst)
Assignee | ||
Comment 13•18 years ago
|
||
Forgot to do a final diff before attaching
Attachment #268305 -
Attachment is obsolete: true
Attachment #268307 -
Flags: superreview?(jst)
Attachment #268307 -
Flags: review?(jst)
Attachment #268305 -
Flags: superreview?(jst)
Attachment #268305 -
Flags: review?(jst)
Assignee | ||
Comment 14•18 years ago
|
||
I added nsXTFElementWrapper, but the stuff that it points to is implemented by extensions. But at least this way XTF extensions could take part of cycle collection.
Comment 15•18 years ago
|
||
Right. And if it's implemented in JS (which I think is the idea), it participates automatically, right?
Assignee | ||
Comment 16•18 years ago
|
||
Yeah, though XForms uses C++
Comment 17•18 years ago
|
||
Comment on attachment 268307 [details] [diff] [review]
Patch v2.1
r+sr=jst
Attachment #268307 -
Flags: superreview?(jst)
Attachment #268307 -
Flags: superreview+
Attachment #268307 -
Flags: review?(jst)
Attachment #268307 -
Flags: review+
Assignee | ||
Comment 18•18 years ago
|
||
Checked in. Thanks for the review!
Comment 19•18 years ago
|
||
So this bug is fixed then?
Assignee | ||
Updated•18 years ago
|
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Comment 20•18 years ago
|
||
RLk on the Firefox Linux tinderbox jumped from 3KB to 982KB around the time this was checked in :(
Assignee | ||
Comment 22•18 years ago
|
||
Have to push this unfortunately :(
Target Milestone: mozilla1.9alpha6 → mozilla1.9beta1
Updated•17 years ago
|
Target Milestone: mozilla1.9 M8 → ---
Assignee | ||
Comment 23•17 years ago
|
||
Same as above but synced to tip. Testing now I don't seem to leak any more.
Attachment #268307 -
Attachment is obsolete: true
Assignee | ||
Comment 24•17 years ago
|
||
Comment on attachment 282791 [details] [diff] [review]
Sync to tip
got a=jst on landing to see if it still leaks.
Attachment #282791 -
Flags: approval1.9+
It looks like this turned the talos tinderboxes red (crashing).
(It's really confusing to understand what the talos tinderboxes are doing because of bug 393338, though.)
In particular, it seems (from the log) that it's the talos Tp test, although it's a little hard to tell from the log.
I backed this out to see if it fixes the talos tinderbox red.
[28 21:48:37] <dbaron> bhearsum, did you see what it was doing when it crashed?
[28 21:48:42] <dbaron> bhearsum, (which test it was running?)
[28 21:48:45] <bhearsum> it was running through the pageset
[28 21:48:45] <bhearsum> nothing special
[28 21:48:48] <bhearsum> tp
Comment 29•17 years ago
|
||
I think this patch made http://www.somethingawful.com/d/photoshop-phriday/ leak.
Assignee | ||
Comment 30•17 years ago
|
||
I have the leak pinned down. Gotta get Tp2 installed to try to reproduce the crash.
Status: REOPENED → ASSIGNED
Assignee | ||
Comment 31•17 years ago
|
||
Turns out the crash and the leak was caused by the same problem. The leak problem was that frame-elements holds a strong reference to the nsFrameLoader, which holds a strong reference to the child-frames nsDocShell which holds a strong reference to its document. So if there are any references from the inner document to the outer we have a cycle, and since we still can't cyclecollect nsDocShells we leak.
I fixed this by destroying the nsFrameLoader while recursively "destroying" the content tree. We should additionally hook up nsDocShell to the cycle collector, but that's more work so I'll do that separately.
Not destroying the frameloader also caused crashes as we are currently relying on nsFrameLoader::Destroy to null out mParentDocument on the inner document. This is not very reliable as the outer document can get destroyed before the frameloader. I bet it's possible to come up with testcases that crashes because of this.
Instead I made the nsDocument code that deals with the sub-document hash guarantee that if a subdocument is removed from the hash (as happens when the outer document is destroyed) it nulls out the mParentDocument pointer of the subdocument.
Attachment #282791 -
Attachment is obsolete: true
Attachment #283108 -
Flags: superreview?(jst)
Attachment #283108 -
Flags: review?(jst)
Assignee | ||
Updated•17 years ago
|
Attachment #283108 -
Attachment is patch: true
Attachment #283108 -
Attachment mime type: application/octet-stream → text/plain
Assignee | ||
Comment 32•17 years ago
|
||
Comment on attachment 283108 [details] [diff] [review]
patch v3
Oops, i still have some #if 0 code in there. Removed locally.
Updated•17 years ago
|
Attachment #283108 -
Flags: superreview?(jst)
Attachment #283108 -
Flags: superreview+
Attachment #283108 -
Flags: review?(jst)
Attachment #283108 -
Flags: review+
Assignee | ||
Comment 33•17 years ago
|
||
Had to back this out since it caused perf regressions on the talos Tp runs. My theory is this:
So for a lot of documents we now rely on the cycle collector to delete the document. This could cause Tp regressions for two reasons:
1. We used to tear down the document as soon as it was navigated away from (modulo things like bfcache, but we'd still throw out one document right at the end of each navigation). This had a higher chance of being between two page-loads of the Tp test.
2. Cycle collection is probably slower than normal tear-down since we call unlink on each node in the collected cycle. For cycles that involve the document-node these cycles can presumably be pretty big.
One way to fix 1 would be to make Tp force a cycle collection between page-loads. During normal browsing this should normally happen, but since Tp tests continuously loads page after page chances of cycle collecting between two pages gets pretty small.
Not sure if 2 is an actual problem as long as we are able to schedule cycle collections in situations where the user isn't bothered. Fixing bug 373462 should take care of that hopefully.
Comment 34•17 years ago
|
||
There's a third possible reason:
3. Talos is not measuring the right thing.
Exactly what quantity does Talos measure? Note that the old Tp test was not affected, and it usually shows 3-5% across-the-board regressions just fine.
Comment 35•17 years ago
|
||
The numbers that you see reported by Talos on the Firefox tinderbox page are generated by Tp2 - so effectively you are seeing the same page load time collection as on the other tinderbox test machines.
The difference that we see in results is usually presumed to be because of the complexity of the talos page set - compared to the 40 circa year 2000 web pages being used on test tinderboxes talos uses ~400 pages that are now about 6 months old.
We are also starting to build up a history of talos showing regressions where the test tinderboxes do not. We are not yet totally, 100% convinced that this is because the talos numbers are correct, but we have seen regression causing patches backed out causing an improvement in talos results. At the very least, I believe that talos regressions are worthy of investigation and as they are starting to develop a history they will become more and more trustworthy.
Comment 36•17 years ago
|
||
OK. Note that we've had issues with Tp2 in the past that I'm not sure we ever resolved, both in terms of measurement methodology (it had some bugs in the number-crunching, pure and simple) and in terms of what it's measuring (loads from file:// and loads from http:// often have somewhat different performance characteristics). Tp2 is useful, but not a replacement for Tp.
Jonas, if this is Tp2 there should be no bfcache involved: the test runs in an iframe. It also means that the time is measured from right before window.location is set to when onload fires on the subframe. So I would have really expected cycle collection to happen during the timing, both before and now.
Comment 37•17 years ago
|
||
Just for clarity the tp2 test that talos is running loads from http:// with pages served by a local apache install.
I can also say things about Vlad's new pageloader, but then I fear that I'm getting farther and farther off topic.
Comment 38•17 years ago
|
||
Jonas, I wonder whether it helps to change the UnbindFromTree() calls in nsGenericElement::Unlink to be shallow. Right now if we happen to unlink in more or less root-to-leaf order we end up unbinding the leaves once for each ancestor.
And because we know that all descendants will be unlinked if we're being unlinked, we don't have to worry about leaving dangling document pointers or whatnot... I think.
Assignee | ||
Comment 39•17 years ago
|
||
Unfortunately doing a shallow unbind would not be entierly correct. Imagine a situation like (ext is an external reference):
Y D -
\ / \
+-->X B C |
| \ | / |
ext \|/ /
A <----
In this case we would cycle collect the A-B-C-D cycle, but only shallowly unbinding X, leaving Y to still believe that it still is connected to A.
This might actually be good enough as in most cases Y will only care about if it's connected to A or not when it's part of a document. I think forms and SVG are cases that care, but only when connected to the document.
However ideally cycle collecting shouldn't be timed as part of the Tp tests as we should try to move it to a point when it won't interfere with the user (see also bug 373462).
The weird part is that when I do some logging it looks like at least locally on my machine cycle collection happens between pageloads. I logged timing-start, timing-end, cyclecollection and document-dtor. In basically all cases cyclecollection and document-dtor happened between timing-end and timing-start.
Comment 40•17 years ago
|
||
I'm not sure what that diagram represents. Are those DOM nodes? Which way are the parent/child relationships?
In general, since parents own their kids we can't collect a kid while the parent is alive. So if we're unlinking a node, that means all its ancestors are being collected as well. We already have that situation on branch right now: if you build up a DOM in JS, never stick it in a document, and then forget about it, we'll GC that and as the nodes go away we'll do shallow unbinds on the kids.
I guess we could have an issue where a kid is holding a weak ref to some ancestor and doesn't clear the ref because while the ancestor is being cycle-collected the kid isn't? Or something like that?
Ideally we'd do a deep unbind on the root of any subtree being collected, and shallow unbinds on the rest of it, right?
As for timing this as part of Tp and interference with the user... In practice it interferes with the user, imo. We might as well be measuring the extent (and trying to mitigate it).
Assignee | ||
Comment 41•17 years ago
|
||
In the graph all nodes except 'ext' are DOM elements. A is the root of the tree.
I didn't realize that we do shallow unbinds from the dtor of nsGenericElement, but we do indeed. This is actually a bit risky as not necessarily all descendants in the tree will be destroyed. So for example an <input> element that is a descendant of a <form> will not notice that the form is going away and will keep a weak pointer to it :(
Comment 42•17 years ago
|
||
I think we should make that pointer in particular a strong pointer that gets nulled out on Unlink.
Assignee | ||
Updated•17 years ago
|
Attachment #282791 -
Flags: approval1.9+
Comment 43•17 years ago
|
||
Now that bug 373462 is fixed, is this patch ready to land (after beta)?
Depends on: 373462
Assignee | ||
Updated•17 years ago
|
Priority: -- → P1
Comment 44•17 years ago
|
||
should we land this?
Comment 45•17 years ago
|
||
Jonas - we makin progress here?
Assignee | ||
Updated•17 years ago
|
Whiteboard: [sg:critical] → [sg:critical] Blocked by bug 403835
Assignee | ||
Comment 46•17 years ago
|
||
This syncs to tip and also adds a couple of performance improvements:
1. Call DestroyLinkMap first in the nsDocument unlink function
2. Don't do deep unbind in the nsGenericElement unlink function
Both these are just mimicking what we're doing in destructors of these classes.
Got r/sr=jst over irl
Attachment #283108 -
Attachment is obsolete: true
Attachment #290639 -
Flags: superreview+
Attachment #290639 -
Flags: review+
Assignee | ||
Comment 47•17 years ago
|
||
So this moves back to using recursive UnbindFromTree. We definitely still have to do that when IsInDoc is true, but we can possibly do non-recusive when it's not. But I'll do that as a separate patch instead, if we feel that we want to.
I also added a SaveState call to nsGenericHTMLFormElement::DestroyContent so that we're still saving state where we used to.
Attachment #290639 -
Attachment is obsolete: true
Assignee | ||
Comment 48•17 years ago
|
||
Looks like it finally stuck!
/me does victory dance
Status: ASSIGNED → RESOLVED
Closed: 18 years ago → 17 years ago
Resolution: --- → FIXED
Comment 49•17 years ago
|
||
Congrats!
Comment 50•17 years ago
|
||
Tests for this should be associated with the dependencies, so nothing to do here.
Flags: in-testsuite-
Depends on: 406840
Updated•17 years ago
|
Whiteboard: [sg:critical] Blocked by bug 403835 → [sg:critical]
Comment 51•17 years ago
|
||
So does this mean that the InZombieDocument() checks in presshell need updating?
Assignee | ||
Comment 52•17 years ago
|
||
Not really sure what the purpose of that code is, so not sure. Could be that things work fine since the document is 'intact', or we might want to test mIsGoingAway
Comment 53•17 years ago
|
||
roc, mats, do you remember what the point of the InZombieDocument() stuff is?
I'd do the CVS archeology, but I'm not sure I'll have time for it until a few days from now. :(
I think at least part of it was fixing the bug where app shortcut keys didn't work at all during the transition between documents (assuming the content area was focused) -- originally bug 110718, with some later regression fixes on top.
Comment 55•17 years ago
|
||
Jonas, can you check on that, please? That's been a major annoyance in that past; I'm hoping we didn't regress it...
Assignee | ||
Comment 56•17 years ago
|
||
It seems to work fine for me.
Updated•14 years ago
|
Group: core-security
Updated•6 years ago
|
Component: DOM → DOM: Core & HTML
You need to log in
before you can comment on or make changes to this bug.
Description
•