Created attachment 723712 [details] [diff] [review]
Cycle collector graph building can be done incrementally if we record all new references to cycle collected objects. This idea is from "An on-the-fly Reference Counting Garbage Collector for Java" by Levanoni and Petrank. Any such newly referenced objects are treated as being alive during the ICC. Any entries added to the purple buffer during an ICC are treated as suspects for garbage during the next CC, not the current one. Once the graph is built, everything else can be done incrementally without much trouble.
I think finding new references to JS objects can be done by checking all gray JS objects at the end of graph building to see if they are black. Blackness is a monotonic property (assuming we bail out of an ICC if a GC occurs), so this can be done incrementally if desired.
We can find some of the new references to C++ objects by modifying the addref for cycle collected objects to have a dirty bit that gets set on the purple buffer entry. At the end of graph building, we scan the purple buffer for objects that have this dirty bit set. As with JS objects, this can be done incrementally.
The limitation of this approach is that any sort of swap() operation, where a reference is moved from one object to the other without any refcount operations, will be missed. I'm not sure how sound this actually is. If it is a problem, it may be possible to work around this by modifying swap() somehow, or we may need to use a more heavyweight approach, like adding write barriers for CCed fields in CCed objects.
The WIP is not super great. I'm not entirely convinced it collects everything it should, it crashes on shutdown due to a null deref somewhere in storage code, and the pause times are pretty awful. Part of the reason for the bad pause times is that I disabled all cycle collector optimizations, so the graphs are huge.
Created attachment 723740 [details] [diff] [review]
Created attachment 725450 [details] [diff] [review]
WIP, vaguely stable, still leaky (patches 1-22)
Rather than rooting all objects in the graph for the duration of the ICC, dying objects now tell the cycle collector they are going away, and then they are nulled out of the graph. Graph building now uses vaguely time based slices.
This is what the pause times look like for TechCrunch, with graph building slice times set to around 5ms (note that we don't always hit the mark...). Note that JS compartment merging (bug 754495) is still disabled, so the graph is much bigger than it would be once I hook that up again. Note also that this is a debug build.
CC: BeginIncrementalCollection took 42.30 ms
// this is pretty bad. I'm not sure why, but I suspect it is because
// I have yet to reenable RemoveSkippable.
CC: GraphBuildSlice took 5.22ms to traverse 3300 nodes.
CC: GraphBuildSlice took 5.45ms to traverse 2500 nodes.
CC: GraphBuildSlice took 5.00ms to traverse 2700 nodes.
CC: GraphBuildSlice took 4.91ms to traverse 5400 nodes.
CC: GraphBuildSlice took 5.32ms to traverse 2400 nodes.
CC: GraphBuildSlice took 8.08ms to traverse 1300 nodes.
// This slice is way longer! The last 100 nodes must have taken 3ms+ for some reason
CC: GraphBuildSlice took 4.96ms to traverse 2800 nodes.
CC: GraphBuildSlice took 5.10ms to traverse 2500 nodes.
CC: GraphBuildSlice took 5.04ms to traverse 2100 nodes.
CC: GraphBuildSlice took 4.94ms to traverse 2600 nodes.
CC: GraphBuildSlice took 4.91ms to traverse 2500 nodes.
CC: GraphBuildSlice took 5.12ms to traverse 2000 nodes.
CC: GraphBuildSlice took 5.12ms to traverse 2600 nodes.
// The rest of these are part of the final work slice, because we decided
// enough was enough.
cc: FinishICC::GraphBuildSlice took 27ms
// We finish building the graph synchronously. Better scheduling of work slices
// hopefully will avoid this happening too much.
CC: Colors after newly black flooding: black: 98, white: 0, grey: 47899, wc: 0
CC: Colors after normal scanning: black: 1008, white: 46987, grey: 2, wc: 46987
cc: FinishICC::scan roots took 5ms
// Scanning barely takes any time, even with this huge graph, and the extra phases
// needed for ICC.
cc: FinishICC::collect white took 24ms
// This takes a long time, but it should be easy to make it incremental.
cc: total cycle collector time was 5748ms
// total time is the time from when we started, not max or total pause time.
cc: visited 10146 ref counted and 37849 GCed objects, freed 46987.
cc: FinishICC::misc cleanup took 2ms
CC: FinishIncrementalCollection took 59.47 ms
// The total time for this final slice was 60ms, which is not great. :)
Created attachment 725917 [details] [diff] [review]
WIP, vaguely stable, vaguely non-leaky (patches 1-31)
The focus of this iteration is removing the sync CC stuff.
- reenabled nsPurpleBuffer::RemoveSkippable
- moved ICC state into nsCycleCollector
- moved ScanRoots into its own time slice. This is usually fast, around 8ms on TC closing, so it is mostly useful as a stepping stone towards making CollectWhite incremental. The crumminess of top level slice scheduling triggering synch markroots also stops us from getting the most out of this.
- Use "incremental" cycle collection synchronously at shutdown, so we can remove all the unneeded synch CC stuff
- remove the CC thread runner stuff, as I'm not using that right now
- replace various unused synch versions of methods with their incremental versions (BeginCollection, CollectWhite, FinishCollection, MarkRoots, ScanRoots) and removed one or two now-unneeded methods.
Just for fun, let's see how hard a try run crash-and-burns:
Okay, I got a linking error on Linux64 for some reason. Maybe OSX will work better.
Created attachment 727022 [details] [diff] [review]
WIP, non-incremental, less leaky (patches 1-35)
I made the cycle collector non-incremental for the moment, to investigate the leaks I was seeing. I found a few random Traverse/Unlink problems in the course of things, but they didn't seem to be related. After carefully comparing my patch to the base CC, I noticed that I wasn't calling FixGrayBits, and I was keeping mScanInProgress true during CollectWhite, which is different from base CC. Some combination of the two seemed to fix the local leaks from running crash test for a minute that I was seeing, and it is plausible that the former will fix the crashes. (LazyContentBlaster seemed to be causing problems, but maybe it was just interacting badly with mScanInProgress.) We'll see.
I don't actually want to log all cycle collections on my try push...
Fun fact: apparently we can end up forcing a GC during the unlinking phase of a CC (perhaps from JS::DestroyContext()). Of course, my code tries to finish a CC when we start a GC, leading to reentrancy, leading to various badness (resetting a counter in a bad place caused it to underflow, leading us to OOM when we tried to allocate an array of that size). This isn't a problem in a regular CC because it doesn't trigger a new CC at the start of a GC, and because it never touches any JS objects once CollectWhite() begins, so it doesn't matter if they go away.
To see if it fixes crashes I'm getting, I added AddRef/Release to swap() to ensure that newly stored objects get marked as live, but it turns out swap is used many places off-main-thread, and thus requires swap be implemented without any AddRef/Release calls. This is sound from the perspective of ICC, because the original COM pointer is just getting zeroed out, and the original value is getting moved into a stack location, which is never traced by the CC, so there's no danger of it getting traced at the wrong time, but it does require tracking down the various places this is done, and changing them to use the current swap.
Created attachment 729272 [details] [diff] [review]
Created attachment 731528 [details] [diff] [review]
WIP (patches 1-42c)
This version is incremental, and had a mostly green try run on OSX (10.6, 10.7, 10.8). BC is pretty orange, so there are likely still some problems there. The synchronous garbage verifier I wrote still gets a few hits during mochitest runs, so there are probably things still to fix there.
Changes since the previous version:
- reenable incremental CC fixups
- implemented synchronous CC in CollectWhite to verify that any object being freed by ICC is actually garbage
- use XPC callback jst mentioned to me to call objectIsDying for all dying JSContexts
- prepare for GC in the IGC DOM callback in addition to the other callback
- fix reentrancy from CC triggering a GC triggering a CC
- rewrite PrepareForGarbageCollection to not CollectWhite
- reset mWhiteNodesCount in BeginCollection, not ScanRoots
- don't wrongly treat JSCompartments as GCthings during pre-scan fixups
- XPCWrappedNatives had a few problems
- tell the CC when they die
- always add live XPCWN reflectors to the CC graph, to keep them from rooting dead XPCWNs
- make nsTimeout not cycle collected
- various refactoring
This version does not include any of the changes I made to make swap() tell the CC objects are alive (mentioned in comment 9), as the current state of affairs appears to be stable enough without it. I'll deal with that at some point.
One outstanding mystery is why my patch causes thousands of "can't access dead object" errors in each Mochitest run, even when running the CC synchronously. I haven't looked into that yet.
This was the try run: https://tbpl.mozilla.org/?tree=Try&rev=fb3c377f449f
Created attachment 737064 [details] [diff] [review]
WIP (patches 1-60)
This patch incrementalizes more things. This is what the TechCrunch teardown looks like with an opt debug build:
CC: BeginCollection took 2.10 ms
CC: MarkRoots slice took 5.08ms to traverse 500 nodes.
CC: MarkRoots slice took 4.91ms to traverse 4700 nodes.
CC: MarkRoots slice took 4.80ms to traverse 6697 nodes.
CC: Colors after normal scanning: black: 552, white: 11344, grey: 1, wc: 11344
CC: ScanRoots took 0.91ms.
CC: Checking CC results:
cc: ScanRoots::WalkFromRoots took 6ms
CC: Colors after normal scanning: black: 165, white: 58516, grey: 0, wc: 58516
CC: ICC identified 10643 garbage nodes, sync checker found 10643.
CC: Done checking CC results in 126.09ms.
CC: RootGarbage took 2.32ms to root 10643, skipped 701 white GCed, and ignored 553.
CC: ForgetSkippable took 1.79 ms
CC: UnlinkGarbage took 5.84ms to unlink 500
CC: UnlinkGarbage took 5.88ms to unlink 1000
CC: UnlinkGarbage took 3.99ms to unlink 9143
CC: ForgetSkippable took 4.06 ms
CC: UnrootGarbage took 9.61ms to unroot 1498 objects and skip 2 objects.
CC: UnrootGarbage took 4.94ms to unroot 2500 objects and skip 0 objects.
CC: UnrootGarbage took 5.37ms to unroot 3000 objects and skip 0 objects.
CC: UnrootGarbage took 5.79ms to unroot 3000 objects and skip 0 objects.
CC: UnrootGarbage took 1.50ms to unroot 643 objects and skip 0 objects.
cc: total cycle collector time was 803ms
cc: visited 11104 ref counted and 792 GCed objects, freed 10643 ref counted and 701 GCed objects.
CC: CleanupAfterCollection took 1.47ms
Aside from checking (which is only done for debugging purposes), all slices but 1 are all under 6ms, with a target time of 5ms.
One slice of unrooting takes almost 10ms, but I think that's mostly due to a single nsCSSStyleSheet taking a huge amount of time to destruct. I measured one style sheet on TechCrunch taking about 4ms, which would account for blowing past the budget entirely, if we got unlucky and encountered it near the end of the slice. I filed bug 861449 to investigate ways to incrementally tear down these. We already lazily tear down DOM trees, and with ICC we may find other data structures where it makes sense to do.
As you can see it is still running ForgetSkippable during the ICC, which is probably not awesome. The one during teardown takes 4ms, probably because tons of things are getting thrown into the purple buffer. These things are going to die after unrootGarbage, so we're probably better off locking out the cleanup during at least that part of ICC. I probably need to fix up how I am scheduling the timers.
The ICC work slices are 50ms apart which was fairly arbitrary. IGC uses 100ms, but I think their slices tend to be longer.
I forgot to make ICC slices respect CC lockout, so I was getting assertion failures from touching the JS heap. This actually is only bad due to the debug-checking of the CC results: other than that, after graph building and scanning we never touch the JS heap, so we could in theory interleave slices of IGC and ICC. However, for debugging purposes, before ICC starts freeing things it checks that the objects are actually garbage, by doing a synchronous CC, which is bad if we're in the middle of IGC.
I fixed that, but I still got a failure. I think what happens is this sequence:
- start an ICC
- start an IGC, making the CC finish graph building
- force a CC, which finishes the current CC before starting a new CC, triggering checking, which touches the JS heap and causes an assertion failure as before.
Easy enough to fix.
Created attachment 741104 [details] [diff] [review]
WIP, patches 1-70
This is green again. I fixed some silliness with scheduling between ICC vs IGC vs Forgetskippable.
I also removed the mAlive flag, because it only does something when an object is released, but not addrefed. Instead we can drop the flag and treat everything in the purple buffer as being alive.
Next I'm going to rebase my patch.
I have a patch in bug 861449 that eliminates the long pauses from destroying nsCSSStyleSheets (turns out it is from destroying giant arrays of CSS rules, whatever those are) in TechCrunch teardown, but it is orthogonal to ICC, so I'm not marking it as blocking or blocked by.
Created attachment 743399 [details]
A bunch of my pre-ICC refactorings are getting close to landing, so I'm getting close to rebasing, which will involve constructing a new patch stack, so for posterity here's the current one. It is green on OSX on try, perhaps modulo the lazy rule destroyer. It doesn't build on Linux or Windows due to XPCOM fun I'll need to address.
Created attachment 745541 [details] [diff] [review]
rebased, a bit crashier than before
The crashes are mostly GC-related crashes on Linux, so there's probably some remaining issue to be worked out with CC/GC interaction.
Created attachment 749582 [details] [diff] [review]
I'm going to try running this under ASAN in the hope it will illuminate what is going wrong.
I rewrote the synchronous verification. It now runs at the start of a CC, on the same purple buffer. It computes a set of garbage objects, without disrupting the purple buffer. Then when we've computed the set of objects ICC thinks are garbage, we check that the sets are identical. This gives us both soundness (all things ICC thinks are garbage sync CC thinks are garbage) as well as completeness (everything the sync CC thinks are garbage are freed in that ICC).
Any objects that die during the ICC are removed from the set of found objects.
One tricky case is objects that are not garbage at the start, but become garbage during the CC. However, ICC won't treat these as garbage because such objects will necessarily have had a decref on them after the start of an ICC, so they will be in the purple buffer at the end of ScanRoots() and thus be treated as live by the ICC. So I think it is okay.
In a few logs I checked, I didn't see any violations of soundness, but I did see some completeness problems, often involving just nsGlobalWindows, but sometimes other things. I haven't had a chance to investigate.
At Bill's recommendation I increased the slice time from 5ms every 50ms to 40ms every 100ms, to try to fix the OOMy timeouts I was getting.
This seems to have improved things. Some stats from Win7 BC:
regular CC: Ran CC 419 times. Average graph size (rc/gc): 23065 59863. Average freed (rc/gc): 11033 42496.
ICC: Ran CC 408 times. Average graph size (rc/gc): 23076 63906. Average freed (rc/gc): 10730 45582.
So, ICC ran about as many times as we used to run the CC, and the average graph sizes and number of freed objects are very similar, which seems like a good sign.
Here's some stats about average slice times across 408 CCs:
410 BeginCollection slices, avg len of 3 ms (excluding checking CC time)
1228 MarkRoots slices, avg len of 35 ms
186 ScanRoots slices, avg len of 9 ms
355 Rooting slices, avg len of 6 ms
432 Unlinking slices, avg len of 17 ms
426 Unrooting slices, avg len of 18 ms
410 Cleanup slices, avg len of 2 ms
This does not include the MarkRoots and ScanRoots we do before a GC, which in many cases are quite awful.
Created attachment 752275 [details] [diff] [review]
One fun fix this includes is fixing a weak cache of nodes so that things are removed when the nodes die. I found that one using ASAN.
A problem I noticed using dynamic completeness checking is that weak references to objects that are otherwise garbage can be promoted to strong references, then dropped, and that will keep the object alive for the current CC, because the object will be put in the purple buffer. This happens in Mochitests with things like nsCaret::NotifySelectionChanged, imgRequestProxy::OnDiscard and the Ghost Window detector.
This isn't a problem by itself, as we'll just clean it up in the next CC, but in theory a weak reference to an object could be repeatedly turned into a strong ref, then dropped, which would keep the object alive forever. Essentially, due to ICC, the weak reference would become a strong reference, which is surprising. Continually polling a weak reference doesn't seem like a totally ridiculous thing to do, either.
Our shutdown leak detection would fail to notice this (the object with the weak reference would go away before shutdown most likely), plus in Mochitests we don't really keep pages alive for long enough (like 15 seconds or something) to hit multiple CCs while the page is alive to notice this. I'm not really sure how to fix this.
One approach would be to come up with some kind of special stack-only smart pointer to use for weak references that carefully tracks and manages the purpleness of its referent. This has a number of problems. It isn't entirely clear to me that this can be even implemented safely. We'd also have to convert every existing place where weak pointers are promoted to strong pointers on the stack over to this, which might be a lot of code. Finally, we'd have to come up with some way to stop people from people promoting weak pointers to strong pointers without being aware of this possible problem.
Another approach would be to detect that an object that we otherwise determined to be garbage was in the purple buffer for multiple cycle collections in a row, and this happened multiple cycle collections in a row. In that case, we could trigger a synchronous CC with only the suspicious living-dead objects as purple roots (plus XPC roots). Hopefully this would be fairly rare, like the synch GCs to fix gray bits, and thus not have an effect on overall performance. The main drawback to this approach that I see is that it would be hard to tests, and, unlike the fixbits stuff, would be a little complex. I suppose we could run a normal synchronous CC, which might be less efficient, but would be a codepath we normally exercise.
The latter seems far more palatable, so I'll investigate that. It will require that I run the "fixup" pass, which in theory makes ScanRoot more inefficient, but that pass takes almost no time at all (I haven't even bothered to make it incremental yet) so we can probably just deal with it.
Thanks to the dynamic correctness analysis, I realized there's a hole in the barriers: XPCWN and XPCWJS are ref counted, but don't participate in the purple buffer. This isn't needed for leak finding, because XPConnect keeps a table of all of them, but we need to add them to the purple buffer so we treat them as live if they are being manipulated during an ICC.
Created attachment 768653 [details] [diff] [review]
More investigation of weak references interacting with ICC. The logging is better, still haven't figured out exactly what to do. My next big task is rebasing over the CC-on-workers patches that have landed and SnowWhite, which is about to land.
Created attachment 772617 [details] [diff] [review]
rebased on top of snow-white, still a bit buggy, on change set 136661:52f605debfd4
/home/smaug/mozilla/hg/mozilla/xpcom/base/nsCycleCollector.cpp: In member function ‘void nsCycleCollector::ScanIncrementalRoots()’:
/home/smaug/mozilla/hg/mozilla/xpcom/base/nsCycleCollector.cpp:2583:32: error: ‘LogObjectDescription’ was not declared in this scope
/home/smaug/mozilla/hg/mozilla/xpcom/base/nsCycleCollector.cpp: In constructor ‘nsCycleCollector::nsCycleCollector(CCThreadingModel)’:
/home/smaug/mozilla/hg/mozilla/xpcom/base/nsCycleCollector.cpp:3015:5: error: class ‘nsCycleCollector’ does not have any field named ‘mInner’
/home/smaug/mozilla/hg/mozilla/xpcom/base/nsCycleCollector.cpp:3016:5: error: class ‘nsCycleCollector’ does not have any field named ‘mComputedGarbage’
Ah, sorry, it has been a bit since I tried a non-DEBUG build. I can fix that up tomorrow, hopefully... All of those fields that are missing are stuff that shouldn't be run in a non-DEBUG build, so I guess I'm missing some #ifdef blocks.
Created attachment 773511 [details] [diff] [review]
WIP rebased including snow-white 136661:52f605debfd4
This should work in opt builds now. At least, I fixed the errors you posted before.
rebasing over SnowWhite and CC on worker threads
Created attachment 795717 [details] [diff] [review]
Created attachment 799095 [details] [diff] [review]
Fixed some telemetry and extraForgetSkippable problems.
The high-level goal of this current set of patches (bug 913080, bug 913130, bug 913527, bug 913666) is to change the Collect method, which is the top level method for invoking a cycle collection, into a series of 5 method calls:
1. BeginCollection - set up data structures, add the roots to the graph.
2. MarkRoots - build the graph
3. ScanRoots - determine which objects are garbage
4. CollectWhite - free garbage objects
5. CleanupAfterCollection - clean up temporary data structures
Each of these will become a separate phase in incremental cycle collection.
Bug 913130 unifies the set of top level methods invoked by ShutdownCollect with those in Collect, so subsequent patches only have to change a single location. Bug 913080 and bug 913527 eliminate weird control flow in between the various phases, which makes moving things around easier. Bug 913666 reduces the span of code involved with setting up the cycle collector listeners, which lets us move the prelude code in Collect() into BeginCollection().
After that, I'm going to move MarkRoots and ScanRoots out of BeginCollection, which will require moving the graph builder and listener, respectively, from the stack onto the nsCycleCollector data structure. That's a little unfortunate, but it will need to be done anyways once the cycle collector is actually incremental.
The final step after that to get to the 5 phases described above is to move PrepareForCollection into BeginCollection. The problem there is that Prepare takes results and whiteNodes as arguments, but Begin does not. The simplest approach would just be to add those arguments into BeginCollection, but I have to move them onto nsCycleCollector anyways, so I'll just wait until I've done that before I move in PrepareForCollection.
Oh, and after that, I'm going to split CollectWhite into three subphases: RootGarbage, UnlinkGarbage, UnrootGarbage, but that's trivial.
Created attachment 804105 [details] [diff] [review]
I've been splitting up the work in WIP and landing some of it, then breaking it off onto a bunch of local patches, and here is what I have relative to m-c. Individual phases still need to be made incremental, and this is only lightly tested thus far.
Created attachment 804107 [details] [diff] [review]
Well, this is a better diff, as it excludes the changes from the unlanded bug 915488.
I've been investigating a timeout in content/canvas/test/crossorigin/test_video_crossorigin.html for a while, and I figured out what the problem is. The sequence of steps is something like:
1. Create an XPCWrappedJS (WJS) x for a JS object y (for use as an event listener).
2. The WJS becomes part of a garbage cycle, but the JS object is still alive.
3. Start an ICC, do MarkRoots and ScanRoots. We decide the WJS is dead.
4. Try to create a WJS for y. We see that there's already one around, x, in the WrappedJSMap table, so use that. (again, as an event listener)
5. Finish the ICC. We still think x is dead, so we unlink it, zeroing out the pointer to y.
6. The event fires on x, but x doesn't point to anything, so nothing happens.
7. The event handler was supposed to finish the test, so we time out.
The general problem is that if any kind of weak reference becomes strong after ScanRoots, we can end up unlinking a live object, because we don't change our mind about something being dead at that point. This isn't a problem if it happens before ScanRoots, because we won't free anything that was AddRefed or Released during the current CC.
I'm not sure what the best way to solve this is. I maybe could add a mechanism for letting the weakly-held object know that it is going to go away in ScanRoots.
In the short term, I'm going to work around this by doing ScanRoots and everything after it non-incrementally, meaning that pretty much only MarkRoots is going to be done incrementally, but MarkRoots normally takes 90% or more of the CC time, so it should be a good improvement just with that, while I figure out a longer term solution for the rest of it.
I don't understand (2). If the event listener (WJS) is in ELM, it shouldn't become a part
of a garbage cycle.
As part of 2, the event fires and the element the listener was on are destroyed.
To be more concrete, the sequence is a little like:
let v1 = document.createElement('video');
v1.addEventListener("loadeddata", someTopLevelFunction, false); // create XPCWJS
// the loaded data event fires and runs someTopLevelFunction, which throws out v1:
// do MarkRoots/ScanRoots. v1 and the XPCWJS for someTopLevelFunction are garbage
let v2 = document.createElement('video');
v2.addEventListener("loadeddata", someTopLevelFunction, false); // reuse XPCWJS
// do Unlink, so the XPCWJS for someTopLevelFunction is unlinked
// the loaded data event fires, the listener runs and nothing happens
In a recent version of ICC, in a debug build, tearing down 5 TechCrunch tabs at once (about 200,000 objects in the CC graph) takes 10 slices of just graph building, at 40 to 47ms each, then a final slice of around 200ms to actually do all of the unlinking. So that's a reduction of max pause time from about 600ms to 200ms. Reducing that further will require dealing with weak reference resurrecting dead objects (to allow unlinking incrementally) and figuring out how to deal with Mochitests without falling over, which should allow reducing the slice budget below 40ms.
Never do any perf testing using a debug build ;)
You can now go into about:config and set dom.cycle_collector.incremental to true and ICC will be enabled. Unfortunately it doesn't record max pause time yet (bug 948554), so the only sign you are using it is that long CCs become even longer (because it is recording the time from start to end). Remaining work to enable ICC by default will be tracked in bug 911246.