Closed Bug 1368972 Opened 7 years ago Closed 7 years ago

Increase slice time for long-running incremental GCs

Categories: Core :: JavaScript: GC, enhancement
Version: 55 Branch
Priority: Not set
Severity: normal

Tracking:
Status: RESOLVED FIXED
Target Milestone: mozilla57
Performance Impact: medium
firefox57: fixed

People: Reporter: jonco, Assigned: smaug

References: Blocks 1 open bug

Attachments: 1 file

If a GC is not making sufficient progress, for whatever reason, we should increase the slice budget to ensure that it completes. If we hit the next allocation threshold we will do a non-incremental GC, which will cause jank.

We could work out how many slices are 'normal' for a GC and start increasing the budget once a collection exceeds some threshold (see the sketch after this comment). Or we could look at allocation pressure: how far over the incremental threshold we are, or how far from the non-incremental threshold we are.

This was done for the cycle collector in bug 1304205.
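
A minimal sketch of the slice-count heuristic suggested above, assuming some fixed notion of a 'normal' slice count. None of these names (kNormalSliceCount, ComputeSliceBudgetMs) or constants are real SpiderMonkey API; they are placeholders for illustration:

#include <algorithm>
#include <cstdint>

static const uint32_t kNormalSliceCount = 100;  // assumed "normal" number of slices
static const double kBaseSliceBudgetMs = 5.0;   // the 5ms default budget mentioned later
static const double kMaxSliceBudgetMs = 50.0;

double ComputeSliceBudgetMs(uint32_t aSlicesSoFar)
{
  if (aSlicesSoFar <= kNormalSliceCount) {
    return kBaseSliceBudgetMs;
  }
  // Once the GC has run for more slices than expected, grow the budget in
  // proportion to how far past the threshold we are, capped at the maximum.
  double factor = double(aSlicesSoFar) / double(kNormalSliceCount);
  return std::min(kBaseSliceBudgetMs * factor, kMaxSliceBudgetMs);
}

The linear growth here is arbitrary; the allocation-pressure variant mentioned above would feed a different ratio into the same cap.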
There's a trade-off here (of course): we don't want to get into a situation where we jank at the end of every GC when we could have finished it in time without adjusting the slice length. Ideally we would be able to quantify how much of the GC we've completed, so we can predict how many more milliseconds we need, then compare that to how many more milliseconds we expect the mutator to be able to run without triggering the next GC (and schedule the mutator-to-GC ratio accordingly); a rough sketch of that prediction follows this comment.

Then we could build in an element of persistence so that if we consistently end up behind, we start out with longer slices - and if we consistently finish with loads of time left, we start out with shorter slices. We can do something like that anyway, but without some sort of linear/exponential back-off the only metrics would be 'did we have a lot of time left over before the next trigger' and 'did we have to do a full GC because we were too slow' (the latter being the problem case).
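
A rough sketch of the prediction described in the previous comment, under the simplifying assumption that collection proceeds at a constant rate. The names and parameters are invented for illustration and do not exist in Gecko:

#include <algorithm>

double PredictedRemainingGCMs(double aFractionDone, double aElapsedGCMs)
{
  // If a fraction F of the work took T ms, the rest should take roughly
  // T * (1 - F) / F ms at the same collection rate.
  return aElapsedGCMs * (1.0 - aFractionDone) / std::max(aFractionDone, 0.01);
}

bool ShouldLengthenSlices(double aFractionDone, double aElapsedGCMs,
                          double aMsUntilNonIncrementalTrigger)
{
  // Lengthen slices when we expect to run out of allocation headroom before
  // the incremental GC can finish at its current pace.
  return PredictedRemainingGCMs(aFractionDone, aElapsedGCMs) >
         aMsUntilNonIncrementalTrigger;
}

The persistence idea would then bias the starting budget using the outcome of previous collections, e.g. with a linear or exponential back-off on the starting slice length.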
In the CC, the time it runs is bounded (some number of slices), so I used that to guide how desperate the CC becomes. I'm not sure what the main causes are for the GC to stop running incrementally. I know the CC forces it to stop if the GC takes too long. Also maybe the GC will finish up if there's a need for memory.
Whiteboard: [qf] → [qf:p2]
(In reply to Andrew McCreight [:mccr8] from comment #2)
> I'm not sure what the main
> causes are for the GC to stop running incrementally. I know the CC forces it
> to stop if the GC takes too long. Also maybe the GC will finish up if
> there's a need for memory.

75% ZoneChange
20% AbortRequested
most of the rest is NonIncrementalRequested

https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-05-30&keys=__none__!__none__!__none__&max_channel_version=nightly%252F55&measure=GC_RESET_REASON&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-03-08&table=0&trim=1&use_submission_date=0

So currently it's mostly because we add a new zone to the set we're collecting. I'm not sure how often that's from hitting a GC trigger vs. the CC or something else adding a zone.

I also don't know which one the CC-forced GCs show up as. I failed at tracing the code. I did see that if we get to the case where we need a GC after the CC, then it does it by scheduling all of the zones for GC, which would presumably show up as ZoneChange if we were in the middle of an incremental GC? Or maybe that doesn't happen, because we wouldn't be GCing during a CC anyway?

/me confused
This, or something similar, is showing up quite badly in BHR hang reports. The CC has been waiting for the GC to finish, but if that doesn't happen, it needs to finish the GC synchronously.
I'll write some kind of patch for this.
Assignee: nobody → bugs
Perhaps something like this. It should usually start to kick in after 10% of the locked-out time, so after 3s, increasing the budget from there up to 50ms (which also happens to be the max idle time); the ramp is sketched below.
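
A rough sketch of that ramp, assuming the CC waits up to 30s for the GC before forcing it to finish. The actual change is the attached longer_gc_slices_near_cc.diff and may differ in detail; the names here are placeholders:

#include <algorithm>

static const double kCCLockoutMs = 30000.0;  // assumed CC wait before it forces the GC to finish
static const double kBaseBudgetMs = 5.0;     // sActiveIntersliceGCBudget default (see below)
static const double kMaxBudgetMs = 50.0;     // also the max idle time

double InterSliceBudgetMs(double aMsSinceCCWantedToRun)
{
  // Keep the normal 5ms budget for the first ~10% of the lockout (3s);
  // after that, scale the budget up linearly so it reaches 50ms at 30s.
  double fraction = aMsSinceCCWantedToRun / kCCLockoutMs;
  if (fraction <= 0.1) {
    return kBaseBudgetMs;
  }
  return std::min(kBaseBudgetMs * (fraction * 10.0), kMaxBudgetMs);
}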

remote: View your change here:
remote:   https://hg.mozilla.org/try/rev/3bc55d7ad05eeebcc94e3e00bc08d466328a8588
remote: 
remote: Follow the progress of your build on Treeherder:
remote:   https://treeherder.mozilla.org/#/jobs?repo=try&revision=3bc55d7ad05eeebcc94e3e00bc08d466328a8588
remote: 
remote: It looks like this try push has talos jobs. Compare performance against a baseline revision:
remote:   https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=3bc55d7ad05eeebcc94e3e00bc08d466328a8588
Attachment #8904973 - Flags: review?(continuation)
Comment on attachment 8904973 [details] [diff] [review]
longer_gc_slices_near_cc.diff

Review of attachment 8904973 [details] [diff] [review]:
-----------------------------------------------------------------

This looks reasonable to me. Maybe you should get feedback from a GC person, or at least give them a heads up.

::: dom/base/nsJSEnvironment.cpp
@@ +1785,5 @@
>  InterSliceGCRunnerFired(TimeStamp aDeadline, void* aData)
>  {
>    nsJSContext::KillInterSliceGCRunner();
>    MOZ_ASSERT(sActiveIntersliceGCBudget > 0);
> +  // We use longer budgets when CC has been locked since that means

The CC is always locked out when the GC is running (the lockout flag is set to true in JS::GC_CYCLE_BEGIN). Maybe you mean something more like: the CC has been locked out, but the CC has also tried to run? (See the sketch after this comment.)
Attachment #8904973 - Flags: review?(continuation) → review+
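
Purely illustrative of the distinction raised in the review above: a lockout flag alone is true whenever an incremental GC is running, so what matters for lengthening slices is whether the CC actually tried to run while locked out. These names are placeholders, not the real nsJSEnvironment.cpp globals:

#include <chrono>
#include <optional>

static bool sGCIsRunning = false;  // set when an incremental GC begins
static std::optional<std::chrono::steady_clock::time_point> sCCBlockedSince;  // set when the CC tries to run and is blocked

bool CCHasTriedToRunDuringThisGC()
{
  return sGCIsRunning && sCCBlockedSince.has_value();
}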
(In reply to Andrew McCreight [:mccr8] from comment #6)
> This looks reasonable to me. Maybe you should get feedback from a GC person,
> or at least give them a heads up.
Sure
 

> The CC is always locked out when the GC is running (it is set to true in
> JS::GC_CYCLE_BEGIN). Maybe you mean something more like, when the CC has
> been locked out but the CC has tried to run?
Indeed
Comment on attachment 8904973 [details] [diff] [review]
longer_gc_slices_near_cc.diff

s/when CC has been locked/when the CC has been locked out but the CC has tried to run/

sActiveIntersliceGCBudget is 5ms by default.
Attachment #8904973 - Flags: review?(sphink)
Attachment #8904973 - Flags: review?(sphink) → review?(jcoppeard)
Comment on attachment 8904973 [details] [diff] [review]
longer_gc_slices_near_cc.diff

Review of attachment 8904973 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the heads up.

This looks good, although I think that 30s is still too long to let a GC run before increasing the slice time to the max.

We can experiment with changing the settings later though when we see what effect this has on telemetry.
Attachment #8904973 - Flags: review?(jcoppeard) → review+
Yeah, 30s is a lot, but at least the budget is already increased gradually before that.

Perhaps the multiplier shouldn't be * 10 but something like * 50: better to eventually have a 250ms slice if needed than sync GCs right before the CC. But I'll try with * 10.
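
For reference, with the 5ms default inter-slice budget mentioned earlier, the two multipliers work out to:

5ms * 10 = 50ms maximum slice (what the patch tries first)
5ms * 50 = 250ms maximum slice (the alternative mentioned here)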
Pushed by opettay@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/01d736f31b39
Increase slice time for long-running incremental GCs when CC has tried to run, r=mccr8,jonco
https://hg.mozilla.org/mozilla-central/rev/01d736f31b39
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Performance Impact: --- → P2
Whiteboard: [qf:p2]