Closed Bug 705582 Opened 13 years ago Closed 12 years ago

add async purplebuffer-cleanup phase to CC?

Categories

(Core :: XPCOM, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
mozilla12

People

(Reporter: smaug, Assigned: smaug)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [Snappy:P1])

Attachments

(3 files, 15 obsolete files)

9.26 KB, patch
77.28 KB, patch
94.40 KB, patch
At some point before a CC run, we could perhaps
remove certainly-black objects from the purple
buffer. This could reduce the graph size.
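As a rough model of the idea (the types and field names below are made up for illustration, not the real nsCycleCollector structures): an entry whose JS wrapper the GC has already marked black is certainly alive, so it can never be part of a garbage cycle and need not enter the CC graph at all.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of the proposed cleanup phase; the types are invented.
struct Suspect {
  bool hasBlackWrapper;  // known reachable from a root
};

// Drop certainly-black entries from the purple buffer;
// return how many were removed.
static std::size_t CleanupPurpleBuffer(std::vector<Suspect*>& buffer) {
  std::vector<Suspect*> kept;
  std::size_t removed = 0;
  for (Suspect* s : buffer) {
    if (s->hasBlackWrapper) {
      ++removed;          // skip: cannot be part of a garbage cycle
    } else {
      kept.push_back(s);  // still suspected: keep for the next CC
    }
  }
  buffer.swap(kept);
  return removed;
}
```

A smaller purple buffer means a smaller starting set when the CC later builds its graph, which is the pause-time win being discussed here.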
Interesting idea.  I just worry about what would happen if the information used to remove something from the buffer is invalidated before the CC runs.

Maybe something like this: an object is suspected, and it is the only object in its cycle that is suspected; the purple-buffer cleanup phase sees that it has a black preserved wrapper and removes the object; then the JS side of the heap changes around so that the wrapper isn't reachable from a root any more... Though I guess in a cross-domain cycle, the objects that cross the boundary are always traversed by the CC, independent of the purple buffer.

So the problems, if any, would be with cycles that don't involve JS.  Maybe if you had a DOM subtree that was somehow in a loop, and another part of the DOM has a black wrapper, then you split out the DOM subtree after removing it from the purple buffer...
(In reply to Andrew McCreight [:mccr8] from comment #1)
> So the problems, if any, would be with cycles that don't involve JS.  Maybe
> if you had a DOM subtree that was somehow in a loop,
Well, a DOM tree is always a loop: the parent owns its children and the children own their parent.

> and another part of the
> DOM has a black wrapper, then you split out the DOM subtree after removing
> it from the purple buffer...
When you split a DOM tree, a refcount is decreased, and the object is put into the purple buffer.
Attached patch wip (obsolete) — Splinter Review
This also contains a patch for beforeTraverse.

With this patch, I'm rarely getting any CCs when I have lots of static
pages and Chatzilla running.
I need to test some other kinds of pages.

One reasonably easy thing to add to the patch would be
async deletion of DOM subtrees that are owned only by themselves.
Talking with smaug in IRC, my guess is that what is happening is that by removing things from the purple buffer that we know will never be part of garbage cycles (for instance, because they have wrappers that have been marked non-gray by the GC), the cycle collector trigger that is based on the size of the purple buffer will rarely fire.
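A minimal model of that trigger (the constant's name and value are assumptions, not the real nsJSEnvironment numbers): the CC is scheduled once enough objects are suspected, so a cleanup phase that keeps removing certainly-black entries keeps the count below the threshold and the CC rarely fires.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative purple-buffer-size trigger; threshold value is assumed.
constexpr std::size_t kCCPurpleThreshold = 1000;

static bool ShouldTriggerCC(std::size_t suspectedCount) {
  // Fire a CC only once the number of suspected objects is large enough.
  return suspectedCount >= kCCPurpleThreshold;
}
```

With cleanup removing, say, most of the suspected-but-live entries before the check runs, `ShouldTriggerCC` stays false and the browser skips the CC pause.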
Blocks: 698919
Yeah, that is the idea. I still need to investigate the behavior some more.

https://tbpl.mozilla.org/?tree=Try&rev=609a735db334
For some reason there are no debug tests atm, so I don't know about possible leaks.

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/opettay@mozilla.com-609a735db334/
Comment on attachment 577618 [details] [diff] [review]
wip

Ok, now there are some test results. This leaks.
I'll investigate tomorrow why.
Attachment #577618 - Attachment is obsolete: true
And I can see at least one bug.
(The leak isn't happening in all the mochitests. 4 seems to pass.)
Whiteboard: [MemShrink]
Whiteboard: [MemShrink] → [Snappy]
Attached patch wip2 (obsolete) — Splinter Review
need to update CCGeneration. Posted to tryserver
(In reply to Olli Pettay [:smaug] from comment #8)
> Created attachment 577919 [details] [diff] [review] [diff] [details] [review]
> wip2
> 
> need to update CCGeneration. Posted to tryserver
https://tbpl.mozilla.org/?tree=Try&rev=961ef9d1bfac
Attached patch WIP4 (obsolete) — Splinter Review
Silly me. The previous patch was leaking the timer which clears the purple buffer.
This patch is also more effective, since it optimizes those (common) subtrees
whose root is a document.


https://tbpl.mozilla.org/?tree=Try&rev=40ef9f5015e5
Attachment #577919 - Attachment is obsolete: true
Much better. No leaks, but one crash in a crashtest.
Marking this P1.  This seems pretty experimental so maybe it won't work out, but in comment 3, smaug said he found that it makes the CC pause less in situations where you aren't doing much, and the CC is a major source of pauses.
Whiteboard: [Snappy] → [Snappy:P1]
When I'm using Nightlies, I usually get GC, CC, GC, CC...
but with the patch it is closer to every 10th collection being a CC.
Of course running the cleanup phase takes some time, but it is usually 0 or just a
few ms.

In my case there are 100+ Bugzilla pages open, a few hg.mozilla.org pages, TBPL,
and ChatZilla running as an addon.
I should actually try the latest patch for daily use, since it should be able to optimize
certain things better than the older patches.
Woah, this is pretty crazy, smaug.  I'm seeing the CC not running for 40 seconds at a time.  If I didn't know better, I'd think it was broken.

Is there some way to see how much time the async cleanup is taking?  Should that get logged to the error console?

bz, jesup, since you've been having some CC problems, you could try out the build in comment 12.  It doesn't pass one crashtest, so feel free to pass on it.
https://tbpl.mozilla.org/?tree=Try&rev=41750f757148 
has some more logging to the error console.
hmm, the patch can trigger a bad pause, but that can be optimized out.
/me continues profiling and debugging
No no, that will leak.
(In reply to Andrew McCreight [:mccr8] from comment #15)
> Is there some way to see how much time the async cleanup is taking?  Should
> that get logged to the error console?
I'm adding some logging code.
Attached patch WIP5 (obsolete) — Splinter Review
Attachment #578279 - Attachment is obsolete: true
(In reply to Olli Pettay [:smaug] from comment #21)
> Maybe this is finally a bit better
> https://tbpl.mozilla.org/?tree=Try&rev=a801a583bae4

Seems to pass on try. At least I'm not seeing the crash that was happening before (during shutdown).
Assignee: nobody → bugs
Attached patch WIP7, better logs (obsolete) — Splinter Review
Comment on attachment 578907 [details] [diff] [review]
WIP7, better logs

I think I need some comments.

This patch is something we could easily try out on trunk for some time to
get more feedback.
Attachment #578907 - Flags: feedback?(jst)
Attachment #578907 - Flags: feedback?(continuation)
And still more changes... https://tbpl.mozilla.org/?tree=Try&rev=88bb85a2fb75
Don't run the purple cleanup so often if there are just a few new suspected objects.
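An illustrative throttle for that behavior (names and the threshold are assumptions, not values from the patch): skip the async purple-buffer scan when few objects were suspected since the last pass, so an idle browser doesn't pay for repeated cleanups.

```cpp
#include <cassert>
#include <cstddef>

// Assumed minimum number of newly suspected objects before another
// cleanup pass is worth running.
constexpr std::size_t kMinNewSuspected = 100;

static bool ShouldRunPurpleCleanup(std::size_t suspectedNow,
                                   std::size_t suspectedAtLastCleanup) {
  return suspectedNow - suspectedAtLastCleanup >= kMinNewSuspected;
}
```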
Attached patch WIP8 (obsolete) — Splinter Review
(testing this patch locally...)
Attachment #578922 - Flags: feedback?(jst)
Attachment #578922 - Flags: feedback?(continuation)
Comment on attachment 578907 [details] [diff] [review]
WIP7, better logs

WIP8 is probably better; at least a lot less time is spent
cleaning up the purple buffer.
Attachment #578907 - Flags: feedback?(jst)
Attachment #578907 - Flags: feedback?(continuation)
Attached patch WIP9 (obsolete) — Splinter Review
Attachment #578632 - Attachment is obsolete: true
Attachment #578907 - Attachment is obsolete: true
Attachment #578922 - Attachment is obsolete: true
Attachment #578922 - Flags: feedback?(jst)
Attachment #578922 - Flags: feedback?(continuation)
Attachment #578929 - Flags: feedback?(jst)
Attachment #578929 - Flags: feedback?(continuation)
OS: Linux → All
Hardware: Other → All
Version: unspecified → Trunk
I think the body of nsGenericElement::UnoptimizableCCNode would look a little better implemented as one big OR-ed return statement instead of having separate return true and return false branches.

Is there a better way to set up a callback that only needs to be done once than putting a check in nsGenericElement::BeforeTraverse?
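A sketch of the suggested shape (the predicate names here are placeholders; the actual conditions live in nsGenericElement::UnoptimizableCCNode):

```cpp
#include <cassert>

// One OR-ed return expression instead of a chain of
// `if (...) return true; ... return false;` branches.
static bool UnoptimizableCCNode(bool aIsInAnonymousSubtree,
                                bool aHasListenerManager,
                                bool aIsPurple) {
  return aIsInAnonymousSubtree ||
         aHasListenerManager ||
         aIsPurple;
}
```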
Attached patch unlink subtreesSplinter Review
An additional patch on top of WIP9. In some cases we can unlink lots of stuff
without a CC, but only in some.
(In reply to Olli Pettay [:smaug] from comment #37)
> Created attachment 579722 [details] [diff] [review] [diff] [details] [review]
> unlink subtrees
> 
> An additional patch over the WIP9. In some cases we can unlink lots of stuff
> without CC, but only in some.
https://tbpl.mozilla.org/?tree=Try&rev=574f3c524546
Attached patch WIP10 (obsolete) — Splinter Review
Simpler way to init the callback functions. Some other minor clean-ups.
Attachment #578929 - Attachment is obsolete: true
Attachment #578929 - Flags: feedback?(jst)
Attachment #578929 - Flags: feedback?(continuation)
Attachment #579739 - Flags: feedback?(continuation)
(In reply to Olli Pettay [:smaug] from comment #39)
> Created attachment 579739 [details] [diff] [review] [diff] [details] [review]
> WIP10
https://tbpl.mozilla.org/?tree=Try&rev=261a354902c2
I should have some time to look at this tomorrow in more detail.

One concern I have is that it may not be running the cycle collector enough.  The purple buffer can be empty, and there still can be garbage cycles (involving JS and C++), so we should still run it occasionally, if the JS GC is running.  I'm not entirely sure of the mechanism that is causing it to not run CCs very much.
I'm looking around at the CC graph that is produced from the latest version you posted here.  With JS and garbage stripped out it looks pretty amazing!

I noticed a few minor issues that might be fixable.  They seem like a few specific issues, so they are probably best addressed in a followup patch, as your base patch is pretty long and complex as it is.

One odd thing I noticed is that there's an nsBaseContentList with about 700 children.  The children are all (or at least almost all) HTML nodes without any children.  There's also an nsContentSink with about 300 children that again appear to be mostly HTML nodes without any children.  Well, examining it, it must actually be an nsHtml5TreeOpExecutor, not an actual nsContentSink.  These children are all elements of an array: nsTArray< nsCOMPtr<nsIContent> > in the first case, and nsCOMArray<nsIContent> mOwnedElements in the second case.  I wonder if there's some way to check if an element of an array is not actually going to be traversed before we report it as a child?

Another large glob is an nsJSContext that has a refcount of 458, with 457 things pointing at it.  This part of the graph is much better now that you removed the other pointer in bug 702036, but I wonder if we could get rid of these altogether?  I wonder if GetOutstandingRequests() is non-zero for this nsJSContext?  In that case we root the nsJSContext, so there's no point in adding any of the nsJSEventListeners to the graph.  So we could add a similar optimization for these nodes as for DOMs where the document is live.  Though you'd have to be careful about the last request going away and somehow causing an orphan garbage cycle.
(In reply to Andrew McCreight [:mccr8] from comment #42)
> One odd thing I noticed is that there's a nsBaseContentList with about 700
> children.
Strange.


> There's also an nsContentSink with about 300 children that
> again appear to be mostly HTML nodes without any children.  Well, examining
> it, it must actually be a nsHtml5TreeOpExecutor, not an actual
> nsContentSink.  These children are all elements of an array.   nsTArray<
> nsCOMPtr<nsIContent> > in the first case, and nsCOMArray<nsIContent>
> mOwnedElements in the second case.  I wonder if there's some way to check if
> an element of an array is not actually going to be traversed before we
> report it as a child?
I think there is a bug open to fix HTML5 parser's handling in that case.
Henri would know more.

> 
> Another large glob is an nsJSContext that has a refcount of 458, with 457
> things pointing at it.  This part of the graph is much better now that you
> removed the other pointer in bug 702036, but I wonder if we could get rid of
> these altogether?  I wonder if GetOutstandingRequests() is non-zero for this
> nsJSContext?  In that case we root the nsJSContext, so there's no point in
> adding any of the nsJSEventListeners to the graph.  So we could add a
> similar optimization for these nodes as for DOMs where the document is live.
Indeed. Optimizing nsJSEventListeners that way should be reasonably easy.
Here's what the largest glob looks like in this particular graph, after I removed all JS and garbage nodes.  Purple nodes are nsJSEventListeners.  They all fan in to an nsJSContext in the middle.  Blue nodes are HTML elements.  They are reached from a single node in the middle, and don't have any children.  The bigger glob at the bottom has a nsContentSink at the middle, and the smaller one in the upper right has the nsHtml5TreeOpExecutor at the middle.  There's other stuff in this glob, but it is fairly inconsequential.

Disclaimer: I haven't looked much at other graphs from this session, so I don't know how representative it is.  But nsJSEventListener blooms like that are a fairly common problem I've noticed.
The nsContentSink looks like this:
0x124c82838 [rc=3] nsContentSink
> 0x14b5ab800 mDocument
> 0x12119d700 mParser
> 0x1162d7ec0 mNodeInfoManager
> 0x14cdbec10 mScriptElements[i]
> 0x12835e7f0 mOwnedElements[i]
(then a huge number of mOwnedElements)

The nsBaseContentList just has a bunch of mElements[i] fields.
I should also point out that of the 5 or so graphs I've looked at for this execution only the first one I looked at had the nsContentSink and nsHtml5TreeOpExecutor blobs, so maybe it isn't a big deal.
(In reply to Olli Pettay [:smaug] from comment #43)
> > There's also an nsContentSink with about 300 children that
> > again appear to be mostly HTML nodes without any children.  Well, examining
> > it, it must actually be a nsHtml5TreeOpExecutor, not an actual
> > nsContentSink.

nsContentSink is never instantiated directly. It only shows up when the actual object is an instance of a subclass of nsContentSink.

> > These children are all elements of an array.   nsTArray<
> > nsCOMPtr<nsIContent> > in the first case, and nsCOMArray<nsIContent>
> > mOwnedElements in the second case.  I wonder if there's some way to check if
> > an element of an array is not actually going to be traversed before we
> > report it as a child?

The key is that the executor owns these elements, the elements own their parent and the parent owns the executor, so failure to traverse these would mean a failure to detect a reference cycle. That is, it's not so much about traversing the child nodes of these nodes but about traversing their parent references.

I suppose one possible avenue of optimization would be treating any active parser and document that has an active parser as a known-uncollectable cycle and then using some other mechanism to make sure that we never let go of a document with an active parser without telling the parser to stop.

> I think there is a bug open to fix HTML5 parser's handling in that case.
> Henri would know more.

Do you mean bug 515941? The fix would be another special-purpose stop-the-world collector, so it might not be a snappiness win.
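The ownership loop Henri describes can be shown with a toy model (nothing here corresponds to real Gecko types): the executor owns the elements, each element owns its parent, and the parent owns the executor, so skipping the traversal of any one edge would hide the cycle from the CC.

```cpp
#include <cassert>

// Single ownership edge per node, purely for illustration.
struct Node {
  Node* owns = nullptr;
};

// Follow ownership edges from `start`; report whether we return to it.
static bool InOwnershipCycle(Node* start) {
  Node* cur = start->owns;
  for (int hops = 0; cur && hops < 16; ++hops) {
    if (cur == start) {
      return true;
    }
    cur = cur->owns;
  }
  return false;
}
```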
(In reply to Henri Sivonen (:hsivonen) from comment #47)
> I suppose one possible avenue of optimization would be treating any active
> parser and document that has an active parser as a known-uncollectable cycle
> and then using some other mechanism to make sure that we never let go of a
> document with an active parser without telling the parser to stop.

Maybe parser could
do something like
if (nsCCUncollectableMarker::InGeneration(cb, mDocument->GetMarkedCCGeneration())) {
  return;
}
in traverse.
(In reply to Olli Pettay [:smaug] from comment #48)
> Maybe parser could
> do something like
> if (nsCCUncollectableMarker::InGeneration(cb,
> mDocument->GetMarkedCCGeneration())) {
>   return;
> }
> in traverse.

I'm not familiar enough with the CC to be sure what that means. Is an uncollectable marker something that's maintained on a per-document basis by a mechanism outside the CC traverse itself?
Yes, it is.  That check sees whether the document is currently being displayed.  It is updated by iterating over all of the top level windows before a CC.

That optimization is okay to do only if you know that all of the elements held by the nsContentSink are reachable from the document.
I think gray nodes in black trees could just be unmarked (recursively) and unpurpled.
The problem is that the next GC might then re-mark those nodes gray.
I need to figure out how the GC could leave those nodes black.
Target Milestone: --- → mozilla11
But even that patch isn't effective enough. We seem to traverse _lots_ of script objects, which are
kept alive. Need to figure out why. Perhaps XPConnect causes us to traverse most of the
event listeners all the time (not necessarily nsJSEventListeners), and that could cause us to
traverse all the relevant JS objects which aren't connected to a global scope.
Attached patch more crazyness, back up (obsolete) — Splinter Review
Attachment #584354 - Attachment is obsolete: true
Attached patch some more crazy ideas (obsolete) — Splinter Review
Try to mark JS objects black rather aggressively, if the C++ object owning them is certainly black.
(In reply to Andrew McCreight [:mccr8] from comment #50)
> Yes, it is.  That checks sees if the document is currently being displayed. 
> It is updated by iterating over all of the top level windows before a CC.
> 
> That optimization is okay to do only if you know that all of the elements
> held by the nsContentSink are reachable from the document.

nsContentSink is reachable from the document during the parse, so anything held by the nsContentSink is transitively reachable from the document, isn't it? After the parse finishes and the document drops the sink, the sink might remain alive for a moment so that it holds some elements that are no longer reachable from the document if a script has removed some parser-inserted elements from the tree.
It sounds to me like optimizing the traversal of nsContentSink isn't worth doing at this point, due to some possible odd cases, as I imagine that nsContentSink doesn't really stay around that long.
(This discussion about content sink optimization should go to some other bug.)
I think the content sink can stay alive quite a long time when document.write is used but
document.close isn't, right Henri?
Or when an iframe is used for loading data slowly.
(In reply to Olli Pettay [:smaug] from comment #58)
> (this discussion about contentsink optimization should go to some other bug.)
> I think contentsink can stay alive quite long time when document.write is
> used but document.close isn't, right Henri?
> Or when iframe is used for loading data slowly.

Correct.
(In reply to Olli Pettay [:smaug] from comment #58)
> (this discussion about contentsink optimization should go to some other bug.)
Filed bug 714830.
Depends on: 716518
Attachment #579739 - Flags: feedback?(continuation)
Attached patch backup, v21 (obsolete) — Splinter Review
https://tbpl.mozilla.org/?tree=Try&rev=c5240065b0fe
Attachment #579739 - Attachment is obsolete: true
Attachment #582996 - Attachment is obsolete: true
Attachment #584449 - Attachment is obsolete: true
Attachment #585235 - Attachment is obsolete: true
Attached patch v22 (obsolete) — Splinter Review
Depends on: 718297
Attached patch v28 (obsolete) — Splinter Review
Attachment #588383 - Attachment is obsolete: true
Attachment #588479 - Attachment is obsolete: true
It would be nice if there were an inline function or something for this nsCCUncollectableMarker::InGeneration(foo->GetMarkedCCGeneration()) pattern that appears all over the place, defined in nsCCUncollectableMarker.h.
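Such a helper might look like this sketch, with the marker and the document modeled by plain integers (the real nsCCUncollectableMarker in nsCCUncollectableMarker.h tracks the generation of the last "mark live" pass before a CC; everything else here is simplified):

```cpp
#include <cassert>
#include <cstdint>

struct CCUncollectableMarkerModel {
  static uint32_t sGeneration;  // generation of the last mark-live pass

  static bool InGeneration(uint32_t aGeneration) {
    return aGeneration != 0 && aGeneration == sGeneration;
  }

  // The proposed shorthand: take the owner directly instead of repeating
  // the GetMarkedCCGeneration() call at every site.
  template <typename T>
  static bool InGeneration(const T& aOwner) {
    return InGeneration(aOwner.GetMarkedCCGeneration());
  }
};

uint32_t CCUncollectableMarkerModel::sGeneration = 0;
```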
Attached patch v29Splinter Review
This doesn't remove the ABP optimization
Attachment #589460 - Attachment is obsolete: true
Blocks: 720423
No longer blocks: 720423
Depends on: 720423
Depends on: 719949
Depends on: 720536
Depends on: 714642
Depends on: 716502
Depends on: 720630
Depends on: 720647
Depends on: 720686
Blocks: 721067
Depends on: 720808
Depends on: 721420
Depends on: 721515
Depends on: 721543
Depends on: 721548
Depends on: 716014
Depends on: 721794
I got this crash in your latest try build: https://crash-stats.mozilla.com/report/index/bp-137929e4-21ff-4bc7-94b3-4d9962120127
I don't know if it is related or not.  Maybe this is something the null check you added after the xpcshell test failures would fix?
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/opettay@mozilla.com-8eebbfd69115/ 
should have the null check, which fixed xpcshell tests.

Hard to say anything about that crash, since the stack is just broken. It could be the
same null-check problem.
Depends on: 722057
Target Milestone: mozilla11 → mozilla12
It might make sense to move the Compartment GC bug over to the general CC responsiveness meta bug, then mark this closed.
Depends on: 723157
Depends on: 725768
Depends on: 725752
I think we can consider this done, and just put further work over in bug 716598.
Status: NEW → RESOLVED
Closed: 12 years ago
No longer depends on: 716014, 725752, 725768
Resolution: --- → FIXED
Depends on: 726007
Depends on: 720879
Depends on: 727313
No longer blocks: 698919
Blocks: 698919