Open Bug 1346943 Opened 7 years ago Updated 2 years ago

No-added-insult GC mode

Categories

(Core :: JavaScript: GC, defect, P5)

People

(Reporter: ehsan.akhgari, Unassigned)

Details

This is a very vague and hand wavy request, but when I brought it up with Kannan he didn't seem to think I'm insane, so I decided to file it.  Apologies in advance for the vague nature of what follows.  :-)

Sometimes in profiles of websites such as Google Suite we spend *tons* of time running JS.  Something like 10 seconds spent entirely inside SpiderMonkey is not that uncommon.  Looking at our BHR data, multi-second script runs also happen in the wild, so I'm not alone in seeing them.

When profiling these workloads, GC typically ends up being a good portion of the time we spend in SpiderMonkey; for example, we can spend 10% of the total time just GCing.  When I profile cases like that and look at all of the free memory I have on my machine, I can't help thinking: what if SM decided not to add insult to injury when JS execution is taking so long, and just didn't run any GC in the middle of super long runs of JS?  Or at least not when you have tons of free memory available.  Or something along those lines.

Is this remotely possible?
There are some prefs (dating back to the B2G era) starting with javascript.options.mem that you could try tweaking. The effectiveness will depend on how much time we spend actually freeing garbage vs. tracing live things. (Excluding, of course, benchmark cheesing by delaying a GC until after we record how long a test is taking.)
I'm also worried that we'd be dooming ourselves by avoiding, for example, nursery GCs while we can, only to have to do more expensive ones later.
IMO delaying nursery GCs is not an option anyway because it would slow down other parts of the system. For instance, the nursery's bump allocation for slots/elements is a lot faster than malloc/free.
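
To illustrate why that matters, here is a minimal sketch of nursery-style bump allocation (the class and names are purely illustrative, not SpiderMonkey's actual Nursery API): allocating is a bounds check plus a pointer increment, and "freeing" is implicit when a minor GC resets the chunk, whereas every malloc/free pair pays the full cost of the general-purpose allocator.

#include <cstddef>
#include <cstdint>

// Illustrative sketch only -- not the real SpiderMonkey Nursery API.
class BumpNursery {
  uint8_t* start_;
  uint8_t* cur_;
  uint8_t* limit_;

 public:
  BumpNursery(uint8_t* chunk, size_t size)
      : start_(chunk), cur_(chunk), limit_(chunk + size) {}

  // Allocation is a bounds check plus a pointer bump: no free lists,
  // no locking, no per-object bookkeeping.
  void* allocate(size_t bytes) {
    if (size_t(limit_ - cur_) < bytes) {
      return nullptr;  // nursery full: caller would trigger a minor GC
    }
    void* result = cur_;
    cur_ += bytes;
    return result;
  }

  // After a minor GC evacuates survivors, the whole chunk becomes
  // reusable in a single step.
  void reset() { cur_ = start_; }
};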

It would be good to know which GCs (full/Zone/nursery) take most of the time.
Whiteboard: [qf:p1]
There are a couple of risks in delaying GCs.

The main one is probably OOM. For the most part, SpiderMonkey is OK running right up to the limit, because it detects OOM and triggers a last-ditch GC. But from what I understand, that mechanism is largely useless: if you're close to OOM, you're just as likely to hit it in Gecko code, and that will generally bring down the browser. So the GC is necessarily conservative, trying to keep things generally under control to avoid nearing the unknown upper limit where it will OOM and crash.
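
For reference, the last-ditch mechanism has roughly this shape (a sketch of the pattern only, not SpiderMonkey's actual allocator code): an allocation path that notices failure falls back to a full collection and retries once before reporting OOM. The problem described above is that only allocations routed through such a path get the retry; a plain allocation elsewhere in the browser does not.

#include <cstddef>
#include <cstdlib>

// Placeholder standing in for a full, synchronous collection.
static void RunFullGC() { /* would collect everything it can */ }

// Hypothetical sketch of a "last-ditch GC" allocation path.
void* AllocateWithLastDitchGC(size_t bytes) {
  if (void* p = std::malloc(bytes)) {
    return p;  // common case: plenty of headroom
  }
  // Allocation failed: free what we can with a full GC, then retry once.
  // If the retry also fails, the engine can report OOM to script instead
  // of crashing.
  RunFullGC();
  return std::malloc(bytes);  // may still be null; caller must handle OOM
}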

An obvious fix would be to guess that limit based on available memory. I should dig up bugs for that, but my general understanding is that:

(1) we didn't find a reliable cross-platform way of reporting available memory (see the sketch below);
(2) the number you do get is often one that will avoid OOM but will throw you into swap hell instead, which means this can often make things worse;
(3) users complain if you start using "too much" memory even if it's totally reasonable to do so, and figuring out how reasonable it is can be tricky (we don't have a good feedback mechanism if some other process starts needing a bunch of memory); and
(4) for a number of workloads, especially ones that fit the generational hypothesis (high infant mortality) well, this will slow things down because you're blowing out your cache.
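
On point (1), here is roughly what "ask the OS how much memory is available" looks like on one platform (a sketch; sysinfo is Linux-only, macOS and Windows need entirely different calls, and none of these numbers tell you how close you are to pushing the system into swap):

#include <cstdint>
#include <cstdio>

#if defined(__linux__)
#  include <sys/sysinfo.h>
#endif

// Rough sketch: query "free" physical memory on Linux. Other platforms need
// different APIs (e.g. host_statistics64 on macOS, GlobalMemoryStatusEx on
// Windows), and "free" RAM is not the same as "memory we can use without
// swapping".
uint64_t AvailablePhysicalMemoryBytes() {
#if defined(__linux__)
  struct sysinfo info;
  if (sysinfo(&info) == 0) {
    return uint64_t(info.freeram) * info.mem_unit;
  }
#endif
  return 0;  // unknown: caller should fall back to conservative limits
}

int main() {
  printf("available (approx): %llu bytes\n",
         (unsigned long long)AvailablePhysicalMemoryBytes());
  return 0;
}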

Terrence wrote a great comment at http://searchfox.org/mozilla-central/source/js/src/gc/GCRuntime.h#225 laying a lot of this out.

mccr8's suggestion of setting a preference sidesteps most of this, but it isn't well exposed and we haven't really figured out what "good" values are (since GC memory is a workload-dependent percentage of the overall heap, you can't just tell the GC to use your full available memory).

Other risks beyond OOM are the ones I mentioned above: swapping and cache effects.

However, you're not wrong. Allocation behavior really matters here. If you're doing a 10-second computation before returning from JS, there's a good chance that you're in an allocation-heavy phase and GC is a waste of time, because you're just tracing through lots of live stuff that you won't end up freeing anyway (as mccr8 said in comment 1). We have very limited heuristics to detect this -- if you start GCing frequently, we'll decide that you're in "high-frequency GC mode" and allow the heap to grow much more (300%) before triggering the next heuristic collection (see the sketch below). Also, we are more trigger-happy with GCs when you return from JS to the event loop.
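
A rough sketch of that kind of trigger heuristic (names, constants, and the time window are made up for illustration; the real thresholds live in the GC's scheduling code and are pref-controlled): if collections keep arriving close together, allow the heap to grow by a much larger factor before scheduling the next one.

#include <cstdint>

// Illustrative sketch of a heap-growth trigger heuristic; not SpiderMonkey's
// actual values or structure.
struct GCScheduler {
  static constexpr double kNormalGrowthFactor = 1.5;        // assumption
  static constexpr double kHighFrequencyGrowthFactor = 3.0; // the "300%" above
  static constexpr uint64_t kHighFrequencyWindowMs = 1000;  // assumption

  uint64_t lastGCTimeMs = 0;
  bool highFrequencyMode = false;

  // Called at the end of each collection: if GCs are arriving close together,
  // enter high-frequency mode so the next trigger threshold is much higher.
  void noteGCFinished(uint64_t nowMs) {
    highFrequencyMode = (nowMs - lastGCTimeMs) < kHighFrequencyWindowMs;
    lastGCTimeMs = nowMs;
  }

  // Threshold for the next heuristic (non-OOM) collection, based on heap size
  // when the last collection finished.
  uint64_t nextTriggerBytes(uint64_t heapBytesAfterGC) const {
    double factor =
        highFrequencyMode ? kHighFrequencyGrowthFactor : kNormalGrowthFactor;
    return uint64_t(heapBytesAfterGC * factor);
  }
};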

It would be good to see what type of GCs these are, where they are slow, and how much memory is getting collected. The easiest way to do that is to run your browser with the env var MOZ_GCTIMER set to a filename.

I'm confident that our current set of heuristics is inadequate for some workloads, and that we're not propagating or making good use of some of the information we have available. If you have examples like this where we really seem to be falling down, I'd like to see the GC logs and try to figure out if we can tune the existing heuristics or add new ones to hopefully handle it better without harming other scenarios too much. We can also add prefs to force-disable GC in certain scenarios, just to see what effect it has. It's still hard to predict how much it will mess things up on someone else's machine or workload, but it's definitely useful data.
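
If someone does want to experiment with a force-disable pref, the check could look something like this sketch (entirely hypothetical: the pref, thresholds, and hook point are made up, and critical collections such as last-ditch or shutdown GCs would still have to run; only purely heuristic collections are candidates for suppression):

#include <cstdint>

// Hypothetical sketch only -- nothing like this exists in SpiderMonkey.
struct GCSuppressionPolicy {
  bool prefSuppressDuringLongRuns = false;  // imagined pref, off by default
  uint64_t longRunThresholdMs = 1000;       // assumption: "long" means > 1s
  double memoryPressureCeiling = 0.5;       // assumption: < 50% of budget used

  // Skip a heuristic GC only while a long uninterrupted script run is in
  // progress and the heap is nowhere near its budget.
  bool shouldSkipHeuristicGC(uint64_t scriptRunningMs,
                             uint64_t heapBytes,
                             uint64_t heapBudgetBytes) const {
    if (!prefSuppressDuringLongRuns) {
      return false;
    }
    bool longRun = scriptRunningMs > longRunThresholdMs;
    bool plentyOfHeadroom =
        heapBytes < uint64_t(heapBudgetBytes * memoryPressureCeiling);
    return longRun && plentyOfHeadroom;
  }
};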
This sounds like an interesting avenue to explore, but I don't think it's a P1.  This also overlaps with the scheduler work.
Whiteboard: [qf:p1]
Priority: -- → P5
Severity: normal → S3