Closed Bug 1149752 Opened 5 years ago Closed 5 years ago

Large increase in GC pauses (with beachballing) in the last week of 2015-03

Categories

(Core :: JavaScript: GC, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox39 --- affected
firefox40 --- fixed

People

(Reporter: MattN, Assigned: jonco)

Details

(Keywords: regression)

Attachments

(6 files, 3 obsolete files)

In the last week I've been getting large multi-second GC pauses which lead to a beachball (Not Responding) on OS X.

Platform: OS X
Channel: Nightly
e10s: Off
Tabs: ~580 (most unloaded due to lazy loading)
App Tab highlights: Gmail (x4), BMO, etherpads, Google Docs/Spreadsheets, Lucidchart
Extensions: Many

I do have Treeherder as an app tab for my own Fx-Team landings so maybe this is related to bug 1145831 (though I've had that app tab for a few weeks before the problem started).

I'm marking this as a regression until we know that it was caused by something outside of Firefox (like an extension or web content change).

Has there been any recent landings related to GC?
Note that 7% of GC_MAX_PAUSE_MS are >= 1s
about:memory shows:
448.28 MB (12.09%) ── heap-unclassified
Did you get the entire giant GC report in your first attachment?  I only see slice 0, 1 and 2, and none of them seem that long.  Please attach the entire thing.

Anecdotally, I've noticed my browser has been more sluggish in the last week or so, but I don't know why.
(In reply to Andrew McCreight [:mccr8] from comment #3)
> Did you get the entire giant GC report in your first attachment?  I only see
> slice 0, 1 and 2, and none of them seem that long.  Please attach the entire
> thing.

I thought I had the entire thing but maybe not. For some reason I'm not hitting the pauses now to capture more output. I'll attach the log when the pauses come back.
Also, the CC one might be useful, too, once you are seeing the problems.
It looks like the 7183.4ms output line is truncated. Is this better?
Attachment #8586333 - Attachment is obsolete: true
Flags: needinfo?(continuation)
Yeah, there may be some kind of bug with the logging code.  Olli was also complaining about the line being truncated.

I do see slice information for one fairly long slice:

Slice: 44, Pause: 222.5 (When: 4779.6ms, Reason: INTER_SLICE_GC): Mark: 0.3ms, Mark Delayed: 0.2ms, Sweep: 216.3ms, Mark During Sweeping: 144.0ms, Mark Weak: 73.9ms, Mark Gray: 59.9ms, Mark Gray and Weak: 10.0ms, Finalize Start Callback: 9.5ms, Sweep Compartments: 28.4ms, Sweep Discard Code: 11.3ms, Sweep Inner Views: 0.2ms, Sweep Cross Compartment Wrappers: 27.2ms, Sweep Base Shapes: 4.0ms, Sweep Initial Shapes: 10.5ms, Sweep Type Objects: 16.0ms, Sweep Breakpoints: 3.1ms, Sweep Regexps: 2.1ms, Sweep Miscellaneous: 15.9ms, Sweep type information: 0.2ms, Sweep Object: 16.6ms, Minor GCs to Evict Nursery: 0.8ms, Mark Roots: 0.5ms, Mark Rooters: 0.1ms

I don't know how useful that is.
Flags: needinfo?(continuation)
I got the untruncated version from "Save verbose" on about:memory. It seems like the output gets truncated before getting to the Browser Console.
Attachment #8586446 - Attachment is obsolete: true
I think a GC person needs to look at this further.  Is this something you could look at Terrence?  Thanks.
Flags: needinfo?(terrence)
I'm seeing something similar. CC pauses of 1.3 seconds, with over a million GCed objects in the graph.  GC max pauses "only" around 160ms, looking like comment 9.
I have a bunch of ghost windows, from TechCrunch and Facebook.  Running minimize memory usage did not fix them.
    Slice: 0, Pause: 1467.5 (When: 0.0ms, Reason: ALLOC_TRIGGER): Begin Callback: 1458.8ms, Mark Discard Code: 0.3ms, Purge: 1.3ms, Mark: 5.1ms, Mark Roots: 4.9ms, Buffer Gray Roots: 3.5ms, Mark Cross Compartment Wrappers: 0.9ms, Mark Compartments: 0.3ms, Minor GCs to Evict Nursery: 1.8ms
Depends on: 1150127
I filed bug 1150127 for my issue which may or may not be the same thing.  The about:memory report here shows 4 ghost-windows, so maybe it is the same thing.
(In reply to Matthew N. [:MattN] from comment #10)
> Created attachment 8586465 [details]
> javascript.options.mem.log snippet showing Max Pause: 7183.4ms (untruncated)
> 
> I got the untruncated version from "Save verbose" on about:memory. It seems
> like the output gets truncated before getting to the Browser Console.

That still appears truncated (although much further down) -- at least I can't find the slice that has the listed Max Pause. Also it seems to cut of in the mid

If you set MOZ_GCTIMER=path/to/some/file.txt we'll save the output directly to a file, nicely formatted for viewing in a terminal, as opposed to the squished up format we use for the browser console.
Flags: needinfo?(terrence)
(In reply to Andrew McCreight [:mccr8] from comment #15)
> I filed bug 1150127 for my issue which may or may not be the same thing. 
> The about:memory report here shows 4 ghost-windows, so maybe it is the same
> thing.

When I got a long hang today then opened about:memory I didn't see any ghost windows at that time so it may not be related.

Thanks Terrence. I documented that env. var on MDN: https://developer.mozilla.org/en-US/docs/Mozilla/Performance/GC_and_CC_logs

Here is a gzipped log for most of today. Near the end of it I had a max pause of 6307.657ms. Do you need any more info?
Attachment #8586465 - Attachment is obsolete: true
Flags: needinfo?(terrence)
Analyzing the long GC:

This is a compacting GC triggered because of inactivity. The heap is fairly large ~1GiB, so we might expect a few slices to take a bit longer than budget, but not 6 seconds over budget.

First slice: 96ms total
  50ms in MarkRoots
  25ms in Unmark
  10ms in DiscardCode
  10ms by 1's and 2's elsewhere
    This is, unfortunately, a long pause, but not unexpected as we don't incrementalize any of this stuff yet.

Slice 1 to 32: 40ms
  Marking is incrementalized perfectly and working exactly as expected for INTER_SLICE_GC.

Slice 33: 25ms
  We get a single slice of REFRESH_FRAME for some animation on the first slice of Sweeping. Still behaving roughly as expected.

Slice 33 - 103: 40ms
  25ms in FinalizeStartCallback
  10ms in MarkDuringSweep
   6ms in SweepCompartments
   0.05ms in Sweep GC Things.

We're hitting our pause times on the nose, but we have so much fixed overhead per sweeping slice that we're only effectively getting 50us of real work done per frame. Since we have ~1s of sweeping to do, we'd take days to finish the GC. Long before then, the CC needs to run and causes us to just do a full GC.

Slice 104: 6.3s -- an all-zones, non-incremental, compacting GC
  Mark: 1.4s
  Sweep: 1s
  Compact: 3.7s

So the majority of the 6.3s of pause time is 3.7s of non-incremental compacting GC.

There are a couple of problems here.

(1) We need to figure out what our fixed overheads are and see if we can't incrementalize them. At the very least we need to detect when we're not making adequate forward progress and adjust our slice times accordingly.

(2) We schedule compacting at the beginning of the GC, but the GC takes so long that the user has interacted with the browser by the time we get to the end of the GC and could feasibly do the long compacting operation. It would be nice if we could not do compacting if the user interacts with the session between when we start the compacting GC and when the compacting actually happens.

Jon, is (2) something you could take a look at?
Flags: needinfo?(terrence) → needinfo?(jcoppeard)
No longer depends on: 1150127
For (1), could we get a profile of the offending functions while one of these GCs is going on? Maybe we recently started doing something slow during the fixed phase without realizing it.
Assignee: nobody → jcoppeard
Flags: needinfo?(jcoppeard)
This could actually still be a dupe of bug 1150127.  If you have a leaking window for, say, bugzilla.mozilla.org, but you still have some windows open to bugzilla.mozilla.org, then the ghost-window analysis won't report any ghost windows.  billm had something like that yesterday.  So it might be good to wait a day or two to see if MattN still is seeing this problem.
I am still seeing this with 40.0a1 (2015-04-09).
Patch to add AbortIncrementalGC() API and call this if the user becomes active again while we are collecting.
Attachment #8591032 - Flags: review?(terrence)
I haven't fully read up on this problem or the patch, but completely aborting a GC seems fairly heavy-handed to me. If the GC is incremental anyway, could we add, say, a static member (bool stopASAP) to SliceBudget and check it in SliceBudget::checkOverBudget? The caveat with that specific approach would be that it would also affect the cycle collector (and other potential users of SliceBudget, though currently the GC and the CC are the only ones), but maybe the same reasoning applies there anyway.
Comment on attachment 8591032 [details] [diff] [review]
bug1149752-cancel-inactivity-gc

Review of attachment 8591032 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/nsJSEnvironment.cpp
@@ +219,5 @@
>  
>  static bool sGCOnMemoryPressure;
> +
> +// nsJSEnvironmentObserver observes the user-interaction-inactive notifications
> +// and triggers a shrinking a garbage collection if the user is still active

s/active/inactive/?

::: js/src/jit-test/tests/gc/incremental-abort.js
@@ +7,5 @@
> +{
> +    // Allocate objectCount objects in zoneCount zones and run a incremental
> +    // shrinking GC.
> +
> +    print("test abort in state: " + abortState);

If you've tested that this works, you should probably remove the prints. We don't currently filter the output to add test markers, so this will confuse treeherder and annoy the sheriffs.
Attachment #8591032 - Flags: review?(terrence) → review+
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #23)
What's being proposed here is a way to cancel the current ongoing GC.  This may perform some amount of work to get the collector back into its idle state.  For example, if we are currently sweeping a zone group then we will continue sweeping until the end of that group.

This would be used to halt a speculative GC caused by the user inactive notification if the user becomes active again.  The abort would happen on the main thread in between GC slices.

Let me know if you can see a problem with this.
Comment on attachment 8591032 [details] [diff] [review]
bug1149752-cancel-inactivity-gc

Requesting review for browser changes.
Attachment #8591032 - Flags: review?(bugs)
Attachment #8591032 - Flags: review?(bugs) → review+
(In reply to Jon Coppeard (:jonco) from comment #25)
> (In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #23)
> This would be used to halt a speculative GC caused by the user inactive
> notification if the user becomes active again.  The abort would happen on
> the main thread in between GC slices.

I see, it seems I misunderstood - thanks for explaining!
https://hg.mozilla.org/mozilla-central/rev/4654c5e880d3
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
You need to log in before you can comment on or make changes to this bug.