Increase slice time for long-running incremental GCs

RESOLVED FIXED in Firefox 57

Status

enhancement
RESOLVED FIXED
Opened 2 years ago
Closed 2 years ago

People

(Reporter: jonco, Assigned: smaug)

Tracking

(Blocks 1 bug)

Version: 55 Branch
Target Milestone: mozilla57
Points: ---

Firefox Tracking Flags

(firefox57 fixed)

Details

(Whiteboard: [qf:p2])

Attachments

(1 attachment)

Reporter

Description

2 years ago
If a GC is not making sufficient progress for whatever reason, we should increase the slice budget to ensure that it completes. If we hit the next allocation threshold we will do a non-incremental GC, which will cause jank.

We could work out how many slices are 'normal' for a GC and start increasing the budget if a collection exceeds some threshold. Or we could look at allocation pressure: how far over the incremental threshold we are, or how far from the non-incremental threshold we are.

This was done for the cycle collector in bug 1304205.
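
As a rough sketch of the first idea above (grow the budget once a collection has taken more slices than is 'normal'), a policy could look something like the following. The names, thresholds and linear growth are assumptions for illustration, not SpiderMonkey's actual heuristics.

#include <algorithm>
#include <cstdint>
#include <cstdio>

// Illustrative slice-budget policy: use the default budget until a GC has run
// more slices than is considered "normal", then grow the budget linearly,
// capped so a single slice never gets too long.
struct SliceBudgetPolicy {
  uint32_t normalSliceCount = 20;  // assumed typical slice count per GC
  int64_t baseBudgetMs = 5;        // assumed default per-slice budget
  int64_t maxBudgetMs = 50;        // cap on a single slice

  int64_t NextBudgetMs(uint32_t slicesSoFar) const {
    if (slicesSoFar <= normalSliceCount) {
      return baseBudgetMs;
    }
    int64_t extra = int64_t(slicesSoFar - normalSliceCount);
    return std::min(baseBudgetMs + extra, maxBudgetMs);
  }
};

int main() {
  SliceBudgetPolicy policy;
  for (uint32_t slices : {10u, 20u, 40u, 80u}) {
    std::printf("%u slices -> %lld ms budget\n", (unsigned)slices,
                (long long)policy.NextBudgetMs(slices));
  }
  return 0;
}
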
Comment 1

2 years ago
There's a trade-off here (of course): we don't want to get into a situation where we jank at the end of every GC when we could've finished it in time without adjusting the slice length. Ideally we would be able to quantify how much of the GC we've completed, so we can predict how many more milliseconds we need, then compare that to how many more milliseconds we expect the mutator to be able to run without triggering the next GC (and schedule the mutator-to-GC ratio accordingly).

Then we could build in an element of persistence so that if we consistently end up behind, we start out with longer slices - and if we consistently finish with loads of time left, we start out with shorter slices. We can do something like that anyway, but without some sort of linear/exponential back-off the only metrics would be 'did we have a lot of time left over before the next trigger' and 'did we have to do a full GC because we were too slow' (the latter being the problem case).
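
A minimal sketch of that scheduling idea, assuming we can estimate the fraction of the GC already completed; the names and the linear extrapolation are assumptions for illustration only.

#include <algorithm>
#include <cstdio>

// Extrapolate how many more milliseconds of GC work remain from the fraction
// already completed and the time spent so far.
static double EstimateRemainingGCMs(double fractionComplete, double gcMsSoFar) {
  if (fractionComplete <= 0.0) {
    return gcMsSoFar;  // no data yet; assume at least as much work again
  }
  return gcMsSoFar * (1.0 - fractionComplete) / fractionComplete;
}

// Size the per-slice budget so the remaining work fits into the headroom we
// expect before the next allocation trigger forces a non-incremental GC.
static double PickSliceBudgetMs(double remainingGCMs, double msUntilNextTrigger,
                                double sliceIntervalMs, double minBudgetMs,
                                double maxBudgetMs) {
  double slicesLeft = std::max(1.0, msUntilNextTrigger / sliceIntervalMs);
  double needed = remainingGCMs / slicesLeft;
  return std::max(minBudgetMs, std::min(needed, maxBudgetMs));
}

int main() {
  double remaining = EstimateRemainingGCMs(0.4, 60.0);  // 40% done after 60ms of GC
  double budget = PickSliceBudgetMs(remaining, 500.0, 100.0, 5.0, 50.0);
  std::printf("remaining %.1f ms, next slice budget %.1f ms\n", remaining, budget);
  return 0;
}
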
Comment 2

2 years ago
In the CC, the time it runs is bounded (some number of slices), so I used that to guide how desperate the CC becomes. I'm not sure what the main causes are for the GC to stop running incrementally. I know the CC forces it to stop if the GC takes too long. Also maybe the GC will finish up if there's a need for memory.

Updated

2 years ago
Whiteboard: [qf] → [qf:p2]
Comment 3

2 years ago
(In reply to Andrew McCreight [:mccr8] from comment #2)
> I'm not sure what the main
> causes are for the GC to stop running incrementally. I know the CC forces it
> to stop if the GC takes too long. Also maybe the GC will finish up if
> there's a need for memory.

75% ZoneChange
20% AbortRequested
most of the rest is NonIncrementalRequested

https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2017-05-30&keys=__none__!__none__!__none__&max_channel_version=nightly%252F55&measure=GC_RESET_REASON&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-03-08&table=0&trim=1&use_submission_date=0

So currently, it's mostly because we add a new zone to the set we're collecting. I'm not sure how often that's from hitting a GC trigger vs. the CC or somebody else adding a zone.

I also don't know which one the CC-forced GCs show up as. I failed at tracing the code. I did see that if we get to the case where we need a GC after the CC, then it does it by scheduling all of the zones for GC, which would presumably show up as ZoneChange if we were in the middle of an incremental GC? Or maybe that doesn't happen, because we wouldn't be GCing during a CC anyway?

/me confused
Assignee

Comment 4

2 years ago
This, or something similar, is showing up quite badly in BHR hang reports. The CC has been waiting for the GC to finish, but if that doesn't happen, it needs to finish the GC synchronously.
I'll write some kind of patch for this.
Assignee: nobody → bugs
Assignee

Comment 5

2 years ago
Perhaps something like this. It should usually start to kick in after 10% of the locked-out time (so around 3s), increasing the slice budget from there up to 50ms (which also happens to be the max idle time).
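
For illustration, the scaling described here might look roughly like the sketch below. The constants, names and formula are assumptions, not the actual patch (which changes InterSliceGCRunnerFired in dom/base/nsJSEnvironment.cpp).

#include <algorithm>
#include <cstdint>
#include <cstdio>

static const int64_t kDefaultBudgetMs = 5;       // sActiveIntersliceGCBudget default
static const int64_t kMaxBudgetMs = 50;          // also the max idle time
static const int64_t kCCLockoutLimitMs = 30000;  // assumed limit on CC lockout (30s)

// Grow the inter-slice budget once the CC has been locked out (and has tried
// to run) for more than ~10% of the lockout limit, reaching the cap at the limit.
static int64_t BudgetForCCLockout(int64_t lockedOutMs) {
  const int64_t kickInMs = kCCLockoutLimitMs / 10;  // ~3s
  if (lockedOutMs <= kickInMs) {
    return kDefaultBudgetMs;
  }
  double progress = double(lockedOutMs - kickInMs) / double(kCCLockoutLimitMs - kickInMs);
  int64_t budget = kDefaultBudgetMs + int64_t(progress * (kMaxBudgetMs - kDefaultBudgetMs));
  return std::min(budget, kMaxBudgetMs);
}

int main() {
  for (int64_t ms : {1000, 3000, 10000, 30000}) {
    std::printf("locked out %lld ms -> %lld ms budget\n",
                (long long)ms, (long long)BudgetForCCLockout(ms));
  }
  return 0;
}
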

remote: View your change here:
remote:   https://hg.mozilla.org/try/rev/3bc55d7ad05eeebcc94e3e00bc08d466328a8588
remote: 
remote: Follow the progress of your build on Treeherder:
remote:   https://treeherder.mozilla.org/#/jobs?repo=try&revision=3bc55d7ad05eeebcc94e3e00bc08d466328a8588
remote: 
remote: It looks like this try push has talos jobs. Compare performance against a baseline revision:
remote:   https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=3bc55d7ad05eeebcc94e3e00bc08d466328a8588
Attachment #8904973 - Flags: review?(continuation)
Comment 6

2 years ago
Comment on attachment 8904973 [details] [diff] [review]
longer_gc_slices_near_cc.diff

Review of attachment 8904973 [details] [diff] [review]:
-----------------------------------------------------------------

This looks reasonable to me. Maybe you should get feedback from a GC person, or at least give them a heads up.

::: dom/base/nsJSEnvironment.cpp
@@ +1785,5 @@
>  InterSliceGCRunnerFired(TimeStamp aDeadline, void* aData)
>  {
>    nsJSContext::KillInterSliceGCRunner();
>    MOZ_ASSERT(sActiveIntersliceGCBudget > 0);
> +  // We use longer budgets when CC has been locked since that means

The CC is always locked out when the GC is running (it is set to true in JS::GC_CYCLE_BEGIN). Maybe you mean something more like, when the CC has been locked out but the CC has tried to run?
Attachment #8904973 - Flags: review?(continuation) → review+
Assignee

Comment 7

2 years ago
(In reply to Andrew McCreight [:mccr8] from comment #6)
> This looks reasonable to me. Maybe you should get feedback from a GC person,
> or at least give them a heads up.
Sure
 

> The CC is always locked out when the GC is running (it is set to true in
> JS::GC_CYCLE_BEGIN). Maybe you mean something more like, when the CC has
> been locked out but the CC has tried to run?
Indeed
Assignee

Comment 8

2 years ago
Comment on attachment 8904973 [details] [diff] [review]
longer_gc_slices_near_cc.diff

s/when CC has been locked/when the CC has been locked out but the CC has tried to run/

sActiveIntersliceGCBudget is 5ms by default.
Attachment #8904973 - Flags: review?(sphink)
Assignee

Updated

2 years ago
Attachment #8904973 - Flags: review?(sphink) → review?(jcoppeard)
Reporter

Comment 9

2 years ago
Comment on attachment 8904973 [details] [diff] [review]
longer_gc_slices_near_cc.diff

Review of attachment 8904973 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the heads up.

This looks good, although I think that 30s is still too long to let a GC run before increasing the slice time to the max.

We can experiment with changing the settings later though when we see what effect this has on telemetry.
Attachment #8904973 - Flags: review?(jcoppeard) → review+
Assignee

Comment 10

2 years ago
Yeah, 30s is a lot, but at least the budget is already increased gradually before that.

Perhaps it shouldn't be * 10, but something like * 50. It's better to eventually have a 250ms slice if needed than sync GCs right before the CC. But I'll try with * 10.
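
(For reference, with the default 5ms inter-slice budget mentioned in comment 8: the * 10 multiplier caps a slice at 50ms, which is also the max idle time, while * 50 would cap it at the 250ms mentioned above.)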

Comment 11

2 years ago
Pushed by opettay@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/01d736f31b39
Increase slice time for long-running incremental GCs when CC has tried to run, r=mccr8,jonco
https://hg.mozilla.org/mozilla-central/rev/01d736f31b39
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57