Closed Bug 698629 Opened 13 years ago Closed 11 years ago

investigate non-JS cycle collections

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: mccr8, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [Snappy:P3])

Attachments

(2 files, 1 obsolete file)

The cycle collector traverses a lot of small pure DOM subgraphs.  One idea to reduce cycle collector pause times (at the possible cost of increased total time spent CCing) would be to have two kinds of cycle collections, "minor" and "major".  A minor collection drains the purple buffer, but does not trace into JS.  A major collection leaves the purple buffer alone, but traces through objects that link JS and DOM.

The names are by-analogy to major and minor GCs, but the analogy is not quite correct, because a major GC looks at everything, whereas a major CC only looks for cross-domain cycles, so perhaps other names are better.

This may result in an increase in cycle collector time because an object can be reachable both from a cross-domain wrapper and from a suspected object.  If CCs are split into major and minor, then it will be visited twice.

As a first step, I was going to look at switching between major and minor CCs at every step and see what that does.  It does both if a major CC is scheduled, but the purple buffer is full (has >= 1000 things in it).

On a major CC, BeginCycleCollection is run.  On a minor CC, the purple buffer is drained.  The trickiness I've seen so far is that BeginCycleCollection sets up a special JS context for the cycle collector to use, but there are some DOM things that use this context.

It looks like there are two uses that may not be not easy to avoid.

1) XPCWrappedNatives.  I would expect that these can only be reached from JS objects, and thus don't need to ever be added to the purple buffer (even on trunk), and there's a note to that effect on NS_DECL_CYCLE_COLLECTION_UNMARK_PURPLE_STUB (which is also used by nsXPCWrappedJS).  However, as far as I can tell from looking at the code, XPCWrappedNatives and nsXPCWrappedJS are in fact suspected as any other object.  All UnmarkPurple does is restore the ref count after an object has been removed from the purple buffer.  Additionally, when I disable BeginCycleCollection, I crash in XPCWrappedNative::Traverse, so they are being reached from somewhere.  I'll investigate this and figure out what I'm not understanding, or file a bug.

2) nsFrameScriptExecutor::Traverse.  This calls nsContentUtils::XPConnect()->NoteJSContext(), which in turn accesses the special CC context.  I have no idea what this is.
Okay, I figured out what I was misunderstanding about XPCWrappedNatives and nsXPCWrappedJS: these use NS_DECL_ISUPPORTS (and nsAutoRefCnt) instead of NS_DECL_CYCLE_COLLECTING_ISUPPORTS (and nsCycleCollectingAutoRefCnt), so they are never added to the purple buffer.  The overload of UnmarkPurple is thus okay because we never need to restore a ref count.

The mystery of how an XPCWrappedNative is getting into the graph without BeginCollection (and with NoteScriptChild etc. disabled) remains.
Looks like an XPCWrappedNative is getting added to the graph as a NoteXPComChild of an nsFrameScriptExecutor.  Could be, then, that problem (1) above is really another instance of problem (2).

Here's what the stack looked like when I do a dummy Traverse immediately upon adding a node to the graph in order to figure out where the XPCWrappedNative is being added:

Thread 3 Crashed:
0x0000000101a2c913 XPCWrappedNative::cycleCollection::Traverse(void*, nsCycleCollectionTraversalCallback&) + 243 (XPCWrappedNative.cpp:146)
0x0000000101ef5820 GCGraphBuilder::AddNode(void*, nsCycleCollectionParticipant*) + 192 (nsCycleCollector.cpp:1643)
0x0000000101ef5bc5 GCGraphBuilder::NoteXPCOMChild(nsISupports*) + 149 (nsCycleCollector.cpp:1498)
0x00000001014db629 nsFrameScriptExecutor::Traverse(nsFrameScriptExecutor*, nsCycleCollectionTraversalCallback&) + 41 (nsFrameMessageManager.cpp:810)
0x00000001014dfb5f nsInProcessTabChildGlobal::cycleCollection::Traverse(void*, nsCycleCollectionTraversalCallback&) + 63 (nsInProcessTabChildGlobal.cpp:159)
0x0000000101ef6b1a nsCycleCollector::BeginCollection(nsICycleCollectorListener*) + 650 (nsCycleCollector.cpp:1660)

nsFrameScriptExecutor::Traverse does a NoteXPComChild of its mGlobal field, which has type nsCOMPtr<nsIXPConnectJSObjectHolder>.  Looks like that is the problem.  Of course, the next line in the Traverse is a nsXPConnect::NoteJSContext which will probably cause problems by accessing the null cycle collector global.
Blocks: 698919
Another advantage of this approach is that it would allow finer-grained scheduling.  I've read that Chrome only breaks JS-DOM cycles at major collections, so maybe we could do CCs that touch JS less often.
This works well enough to browse around without crashing.  I didn't check shutdown leaks or whatnot.
I'm not sure how accurate or anything this is, but here's a timing log from a little browsing around I did.

In the patch I gave above, I alternate doing JS-crossing CC with pure DOM CC.  But, one of the triggers for a CC is that there are 1000 or more suspected nodes in the graph.  To keep the CC triggerer from going berserk if it expects the purple buffer to be drained when it is not, I always add purple roots if there are 1000 or more suspected nodes.

If you look at the timing logs, SelectPurple actually runs every time, so the CC, even in this simple run with not many tabs, has 1000 suspected objects.  So that could be improved.

But it looks like pure DOM CCs are indeed faster.  They have length: 25ms, 7ms, 6ms, 20ms, 14ms, 19ms, 25ms, 10ms, 19ms

the mixed CCs are: 38ms, 33ms, 70ms, 200ms, 116ms, 131ms, 204ms, 99ms, 111ms

Of course, the minor CCs may not be collecting much of anything.

Another thing that needs to be fixed is that shutdown CCs should use both kinds of roots.

I fixed the crash problems I talked about above by simply making XPCWrappedNatives not traverse their expandos if the cycle collection context is NULL, which happens during minor CCs.
I iterated on this a little bit.  Currently, I'm looking at two types of CCs.  One only adds purple roots, the other only adds XPConnect roots.

There are no real obvious wins in terms of reducing pause times.  To try to figure out why, I tested to see how much overlap there is between successive CCs, eg how many nodes get added in a purple roots CC vs an XPConnect CC?  This is a measure of how much work we're redoing.

In my browsing session, this was about 3500 to 7600 nodes.  The overlap was about 50% to 80% of the size of the purple root CC graph.  This matches my experience that about half of the stuff in the graph is small independent free-floating things, and the other half is one or two giant knots that go all over the place.

As I noted in bug 701878, the large blobs are going to be trouble for any kind of simple CC incrementalization work.  For this particular approach, if we can quickly determine that a node is in this big blob, we could skip it and only deal with it in an XPCConnect collection, but that seems tricky.

A heuristic like "don't add anything XUL related to the CC graph" would probably would to avoid the large blobs, but implementing that would be tricky.  (Namely, if you hit a XUL or whatever node during traversal that you decide to skip, you'd probably have to add that node to a separate "major CC" purple-esque buffer.)  Probably better to just try to figure out what the blob is.

Here are the top ten types of nodes in the overlap between two CCs I looked at (with some kind of bucketing together of nodes):
    1765 nsGenericDOMDataNode
    1410 nsGenericElement (xhtml)
     961 nsGenericElement (XUL)
     796 nsXULPrototypeNode
     424 nsJSEventListener
     312 nsGenericElement (XBL)
     303 nsEventListenerManager
     224 nsXBLBinding
     209 nsNodeInfo (XUL)
     112 nsChildContentList


One caveat.  It mostly works when browsing, except that cnn.com fails to load with a message in the console:

Sat Nov 12 09:22:56 host-7-174.mv.mozilla.com plugin-container[57360] <Error>: unknown error code: invalid context

Maybe the plugin container is involved in CC somehow in a way that I'm not handling properly.  Not a big deal right now, as it isn't clear this is even worth doing.
Attachment #571511 - Attachment is obsolete: true
Depends on: 702032
Whiteboard: [Snappy]
Whiteboard: [Snappy] → [Snappy:P3]
No longer blocks: 698919
Blocks: 698919
Nowadays, the graph is small enough that this probably isn't worth the complexity.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: