Closed Bug 1374797 Opened 7 years ago Closed 7 years ago

Facebook is intermittently unusable on load due to GC

Categories

(Core :: JavaScript: GC, defect)

x86_64
macOS
defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla56
Performance Impact high
Tracking Status
platform-rel --- ?
firefox-esr52 --- unaffected
firefox54 - wontfix
firefox55 + fixed
firefox56 + fixed

People

(Reporter: jdm, Assigned: jonco)

References

(Depends on 1 open bug)

Details

(Whiteboard: [platform-rel-Facebook])

Attachments

(3 files)

I finally captured a profile of something I see on a frequent but intermittent basis. At least once a day, using FF Nightly, I visit facebook.com (logged in) and the page appears but refuses to let me interact with it. At some point the yellow "A page is slowing down your browser" warning appears; if I wait long enough (around 60s) it eventually goes away. Inverting the call stack in the profiler shows that the vast majority of the time is being spent in the GC.

http://bit.ly/2sxTeW5
OS: Unspecified → Mac OS X
Hardware: Unspecified → x86_64
Version: unspecified → Trunk
Can we figure out why these aren't incremental or something?
Flags: needinfo?(jcoppeard)
Flags: needinfo?(bugs)
Do you know the reason for these GCs?

Set javascript.options.mem.log to true and look at the browser console for messages from the relevant child process.
Flags: needinfo?(bugs) → needinfo?(josh)
Hmm, these are all JS-engine-triggered GCs, so my guess is the reason is ALLOC or some such.
Attached file GC log
There are some gems in this log:
Pause: 300.663ms of unlimited budget (@ 31.816ms); Reason: API;
Pause: 281.301ms of 5ms budget (@ 0.000ms); Reason: ALLOC_TRIGGER;
Pause: 265.338ms of 5ms budget (@ 0.000ms); Reason: ALLOC_TRIGGER;
Pause: 275.415ms of 5ms budget (@ 0.000ms); Reason: ALLOC_TRIGGER;
Pause: 280.692ms of 5ms budget (@ 0.000ms); Reason: ALLOC_TRIGGER;
etc.

In total it was probably 15 or 20 seconds of unresponsiveness.
Flags: needinfo?(josh)
I assume those aren't interruptible (trying to recall what the log says in that case).

I don't quite understand why the profile doesn't show anything before the GC for that particular child process.
That log is filtered to only include messages containing "content-". If that's a problem I can do it without filtering.
Whiteboard: [qf] → [qf:p1]
Depends on: GCScheduling
I also experience this periodically, fwiw.
Another profile if it's useful: http://bit.ly/2rL6QLI
The profiles have lots of GC information embedded in them. If I look at /threads/2/markers/data/*/timings/nonincremental_reason, all but one of the 50 or so GCs in there show "GCBytesTrigger". (And yes, the reason is ALLOC_TRIGGER.) Which implies that something is furiously allocating a massive number of GC things.

They are all collecting 6 zones out of a total of 7.

chunks_added and chunks_removed are both 0 throughout. 'allocated' is 32, which appears to mean that the GC heap size is a steady 32MB.

I guess that means we're allocating an enormous amount of short-lived stuff in the tenured heap? It would be nice to have some kind of information on what that is, like a histogram at least by AllocKind.

I downloaded 3 of the leaf script files just above the GC on the stack to look for commonalities, but nothing stood out. (It might work better if we had column numbers!) Clicking around on the stacks to see what part of the engine triggered the GC didn't really suggest anything to me either.
I see this quite often too (I visit Facebook once a day while logged in, and it happens about every third time). Here's another profile from today: https://perfht.ml/2vbpRJi (I clicked "Wait" about ten times on the "is slowing down [Wait] [Stop]" dialog). If I eventually press Stop, FB does not finish loading :( and if I keep clicking "Wait" it seems to run forever... Let me know if I can help by providing more information. FWIW, I am on Linux.
platform-rel: --- → ?
Whiteboard: [qf:p1] → [qf:p1][platform-rel-Facebook]
(In reply to gerd.neumann from comment #10)
> I see this quite often too (I visit Facebook once a day while logged in, and
> it happens about every third time). Here's another profile from today:
> https://perfht.ml/2vbpRJi

Just for the record, this profile *is* the same issue (a long series of back-to-back GCs.)
I think I'm seeing the same behavior in Bug 1382814. I'm able to reproduce reasonably consistently using Netflix.
> Which implies that something is furiously allocating a massive number of GC things.

I wonder if we are triggering GC because the atoms zone is over its threshold but find we can't collect it due to ongoing parsing activity.
Ted's profile is a pretty extreme one. The heap is holding steady at around 16MB, though it slowly crept up to that size. The collections are nonincremental GCs of 3 out of 4 zones. Does it mean something that we always seem to be doing n-1 out of n?
Okay, I looked at my Netflix example (on nightly 2017-05-01, since I can hit the bug predictably there) and can attach a debugger while the GC is acting up.

Jonco's theory about the atoms zone seems to be correct. We wind up with the atoms zone over its threshold, schedule a full GC, and then skip the atoms zone because a background parse is in progress. This hammers the main thread as the script makes lock-step progress with full GCs.

To enter the stalled state, we have an async parse (from <script async=""> in the Netflix case) that is waiting on the parseFinishedList_ queue until the main thread can do finalization tasks (copying to the correct compartment, etc.). This is blocked until the current script on the main thread completes.

On each major GC, we check if the atoms zone is over its threshold and, if so, we create a deferred request to do a full GC in the future. This flag is checked for every new Parser run on the main thread, which means every lazy script or eval the main thread triggers will cause a full GC. For a complex script, this can mean thousands of full GCs on page load (see the sketch at the end of this comment).

Perhaps someone can help create a testcase. I'm not sure how to reliably force the atoms zone over its threshold (especially if it is 10s of MB). I could not reproduce the issue with MOZ_GCTIMER active, and the timing seems to be sensitive. I can look around more in the debugger if there are specific questions.
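
To make the loop concrete, here is a self-contained toy model of the scheduling behaviour described above. None of these names are real SpiderMonkey identifiers; it only simulates how the deferred request keeps getting re-raised while a helper-thread parse holds the atoms zone:

    // Toy model only: not engine code.
    #include <cstdio>

    struct ToyGC {
        bool atomsOverThreshold = true;   // atoms zone exceeded its trigger
        bool helperThreadParsing = true;  // async <script> parse holds the atoms zone
        bool fullGCRequested = false;

        void maybeRequestAtomsGC() {
            if (atomsOverThreshold)
                fullGCRequested = true;   // deferred: serviced on a later parser run
        }
        void runFullGC() {
            if (helperThreadParsing)
                fullGCRequested = atomsOverThreshold;  // atoms zone skipped, request re-raised
            else
                atomsOverThreshold = fullGCRequested = false;
        }
    };

    int main() {
        ToyGC gc;
        gc.maybeRequestAtomsGC();
        // Every lazy script or eval on the main thread sees the pending request
        // and performs a full, nonincremental GC that cannot collect atoms:
        for (int run = 0; run < 5 && gc.fullGCRequested; run++) {
            gc.runFullGC();
            printf("parser run %d: full GC done, request still pending = %d\n",
                   run, gc.fullGCRequested);
        }
        return 0;
    }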
As discussed on #jsapi, the immediate problem is that ~AutoKeepAtoms doesn't check hasHelperThreadZones() before scheduling the GC. This leads to the back-to-back GC problem.

https://searchfox.org/mozilla-central/rev/3a3af33f513071ea829debdfbc628caebcdf6996/js/src/jscntxt.h#1260

There is also the matter that while an async script is waiting for the main thread in order to finalize its parse (copying compartments from the parser to the target), we will never be able to GC the atoms zone during that main-thread script. Jon is going to investigate this some more.
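
For illustration, a simplified sketch of the shape of that destructor; the Runtime/Context types and the flag and method names here are stand-ins rather than the real classes at the linked location:

    struct Runtime {
        bool fullGCForAtomsRequested = false;                // stand-in for the deferred request flag
        bool hasHelperThreadZones() const { return true; }   // off-thread parse in progress
        void triggerFullGC() { /* schedule a full, nonincremental GC */ }
    };
    struct Context {
        Runtime* rt = nullptr;
        unsigned keepAtoms = 0;                              // count of live AutoKeepAtoms
    };

    class AutoKeepAtoms {
        Context* cx_;
      public:
        explicit AutoKeepAtoms(Context* cx) : cx_(cx) { cx_->keepAtoms++; }
        ~AutoKeepAtoms() {
            cx_->keepAtoms--;
            // Only the keep-atoms count is consulted here. Without also checking
            // cx_->rt->hasHelperThreadZones(), the triggered GC skips the atoms
            // zone and the request is immediately raised again.
            if (cx_->keepAtoms == 0 && cx_->rt->fullGCForAtomsRequested)
                cx_->rt->triggerFullGC();
        }
    };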
Assignee: nobody → jcoppeard
Flags: needinfo?(jcoppeard)
The problem is that we're repeatedly triggering a full GC to collect the atoms zone, but then not actually collecting it due to the presence of helper threads.

The patch factors out the check for whether we can collect the atoms zone (no AutoKeepAtoms, no helper threads) into JSContext::canCollectAtoms() and replaces the original checks with it (both of which, it appears, were wrong).

I added a specific reason code for this type of GC so it will show up in telemetry, and added an assertion that it always collects the atoms zone.  I refactored the zone selection logic at the start of the GC: the atoms zone used to be handled separately because it could only be collected in a full GC, but this is no longer the case.

I also made sure that the relevant zone was scheduled in cases where we collect non-incrementally due to hitting triggers, as that could potentially cause similar issues.
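
A sketch of what the factored-out check looks like under those constraints; the surrounding Runtime/Context types are stand-ins, and only the condition itself (no AutoKeepAtoms on the stack, no helper threads using the atoms zone) is taken from the description above:

    struct Runtime {
        // Stand-in: returns true while an off-thread parse is using the atoms zone.
        bool hasHelperThreadZones() const { return false; }
    };
    struct Context {
        unsigned keepAtoms = 0;    // incremented by each live AutoKeepAtoms
        Runtime* rt = nullptr;

        bool canCollectAtoms() const {
            // Atoms can only be collected when nothing on the stack is keeping
            // them alive and no helper thread is parsing into the atoms zone.
            return keepAtoms == 0 && rt && !rt->hasHelperThreadZones();
        }
    };

    // Zone selection at the start of a GC can then ask a single question:
    //     if (zone->isAtomsZone())
    //         return TlsContext.get()->canCollectAtoms();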
Attachment #8889527 - Flags: review?(sphink)
See Also: → 1383833
By the way, the same thing happens on treeherder, but to a lesser extent than on Facebook.
Comment on attachment 8889527 [details] [diff] [review]
bug1374797-atoms-gcs

Review of attachment 8889527 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!! It's scary to think how much we may be suffering from this right now.

I traced through the logic as much as I could, and it looks good from what I can tell.

::: js/src/jsgc.cpp
@@ +3870,1 @@
>      // If we are repeating a GC becuase we noticed dead compartments haven't

*because

@@ +3884,5 @@
> +
> +    // If keepAtoms() is true then either an instance of AutoKeepAtoms is
> +    // currently on the stack or parsing is currently happening on another
> +    // thread. In either case we don't have information about which atoms are
> +    // roots, so we must skip collecting atoms.

Comment needs an update. keepAtoms appears to not be a function, and it does not encompass parsing happening (hasHelperThreadZones).

Oh. I guess this can just do s/keepAtoms/canCollectAtoms/

@@ +3898,5 @@
> +    // the other collected zones are using are marked, and we can update the
> +    // set of atoms in use by the other collected zones at the end of the GC.
> +    if (zone->isAtomsZone()) {
> +        JSRuntime* rt = zone->runtimeFromActiveCooperatingThread();
> +        return !TlsContext.get()->keepAtoms && !rt->hasHelperThreadZones();

...at least if this can be simplified to 

   if (zone->isAtomsZone())
       return TlsContext.get()->canCollectAtoms();

?

@@ +3961,2 @@
>       */
> +    MOZ_ASSERT_IF(reason == JS::gcreason::DELAYED_ATOMS_GC, atomsZone->isGCMarking());

Ooh, I like this assert.

@@ +6738,5 @@
>              }
> +
> +            // This ensures we collect zones that have reached the malloc limit.
> +            // TODO: Start collecting these zones earlier like the we do for the
> +            // GC bytes trigger above.

Please file a bug, and s/the we/we/
Attachment #8889527 - Flags: review?(sphink) → review+
Is this nightly only or does it affect other branches?
In my attempts to bisect on Bug 1382814, I got failures as far back as nightly 2017-03-01, which I think would be FF55. Possibly earlier too.
(In reply to Ted Campbell [:tcampbell] from comment #22)
> In my attempts to bisect on Bug 1382814, I got failures as far back as
> nightly 2017-03-01, which I think would be FF55. Possibly earlier too.

54 left central on 2017-03-06, so it sounds like it could also be affected.

[Tracking Requested - why for this release]:
Important perf issue affecting commonly used pages.
You are correct. I can reproduce on both 54 and 55. A quick look at the code suggests ESR-52 is also affected, but I can't get the timing to appear with my test case.
We are only one week from the 55 RC. Marking 54 as wontfix. Let's see if we can get it fixed in 55.
(In reply to Steve Fink [:sfink] [:s:] from comment #20)
> Comment needs an update. keepAtoms appears to not be a function, and it does
> not encompass parsing happening (hasHelperThreadZones).

Good point, and yes we can simplify this to just call canCollectAtoms().

> > +            // This ensures we collect zones that have reached the malloc limit.
> > +            // TODO: Start collecting these zones earlier like the we do for the
> > +            // GC bytes trigger above.
> 
> Please file a bug, and s/the we/we/

Done, filed bug 1384049.
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e33e38d61829
Fix logic around triggering atoms GCs r=sfink
https://hg.mozilla.org/mozilla-central/rev/e33e38d61829
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Depends on: 1384550
Hi Jon, Julien, should we consider uplifting this to 55? I know it's pretty late and the change doesn't seem trivial. Thoughts?
Flags: needinfo?(jcristau)
Flags: needinfo?(jcoppeard)
I think it's probably too late for 55 unfortunately, unless we want to delay the release over this to get some extra confidence (and that seems unlikely).
Flags: needinfo?(jcristau)
The good news here is that scrolling on both Facebook and Treeherder is great now for me.  Those are the two places where I saw major issues.
(In reply to Julien Cristau [:jcristau] from comment #32)
RyanVM asked the same question on IRC and we came to the conclusion that it's probably too late.
Flags: needinfo?(jcoppeard)
While I understand the rationale for holding off on uplifting this, the number of bugs that are turning out to be dupes of this one is kind of worrying me. It seems that both Facebook and Gmail (from bug 1242861) were affected, which are not insignificant websites, so I'm worried that a lot of users are quietly having a pretty miserable experience using Firefox at the moment.
I agree we should consider this for 55 or at least a dot release shortly afterwards.

The short summary is that some of the most popular websites (Facebook, Netflix, etc) hang the content process for at least a minute on load.
I think we're all worried about the implications of this not making 55. The concern is how much time this patch needs to bake before we can feel comfortable with uplifting it into the RC build on Monday (or potentially a dot release in a few weeks). Fuzzing, telemetry, something else?
Well, here I am with 20/20 hindsight to say this should have been uplifted as soon as we had a patch, so that we got more testing on the beta channel.  It seems a bit late now, and now that we have eliminated Aurora we no longer have the luxury of doing a beta of a 55.0.1.  It seems anything we do here will be wrong.
What probably should have been done (and I am not posting this to criticize, but to show a better way to handle it if something like this comes up again): it should have been uplifted to beta immediately and then, if testing was not thought to be long enough, backed out there before release and held for a later point release, so that there would have been more widespread testing because it was on beta at one time.  Having more testing on Nightly is fine but does not seem to compare to the testing on the beta channel.
Attached patch bug1374797-betaSplinter Review
Here's a backport of the patch to beta that includes the fix for bug 1384550.

I had to give DELAYED_ATOMS_GC a different reason code for this version.
Comment on attachment 8891338 [details] [diff] [review]
bug1374797-beta

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1325050.
[User impact if declined]: Possibility of very slow browsing due to excessive GCs.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: Yes.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None (this patch includes fix for bug 1384550).
[Is the change risky?]: Some risk.
[Why is the change risky/not risky?]: There is some risk due to the possibility of unforeseen side effects from changing a complex area of code, and the short time this has had to bake on central.
[String changes made/needed]: None.
Attachment #8891338 - Flags: approval-mozilla-beta?
Comment on attachment 8891338 [details] [diff] [review]
bug1374797-beta

Green on Try too.
Attachment #8891338 - Flags: feedback+
Comment on attachment 8891338 [details] [diff] [review]
bug1374797-beta

let's try this in 55 rc1, the bug sounds bad enough to take this risk
Attachment #8891338 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
This feels baked enough now that I think it's time to consider rebasing this for ESR52 and nominating it for uplift. WDYT, Jon? Could roll the fix for bug 1390087 into it at the same time.
Flags: needinfo?(jcoppeard)
This problem affects 54 onwards.
Flags: needinfo?(jcoppeard)
Even better! Thanks and my bad :)
Performance Impact: --- → P1
Whiteboard: [qf:p1][platform-rel-Facebook] → [platform-rel-Facebook]