Bug 1374797 (Closed) - Facebook is intermittently unusable on load due to GC
Opened 7 years ago • Closed 7 years ago
Categories: Core :: JavaScript: GC, defect
People: Reporter: jdm, Assigned: jonco
References: Depends on 1 open bug
Whiteboard: [platform-rel-Facebook]
Attachments (3 files):
  220.37 KB, text/plain
  15.58 KB, patch - sfink: review+
  17.29 KB, patch - RyanVM: feedback+; jcristau: approval-mozilla-beta+
I finally captured a profile of something I see on a frequent but intermittent basis. At least once a day using FF Nightly I visit facebook.com (logged in) and the page appears but refuses to let me interact with it. At some point the yellow "A page is slowing down your browser" warning appears; if I wait long enough (60s) it eventually goes away. Inverting the call stack in the profiler shows that the vast majority of the time is being spent in the GC.
http://bit.ly/2sxTeW5
Updated•7 years ago (Reporter)
OS: Unspecified → Mac OS X
Hardware: Unspecified → x86_64
Version: unspecified → Trunk
Comment 1•7 years ago
Can we figure out why these aren't incremental or something?
Flags: needinfo?(jcoppeard)
Flags: needinfo?(bugs)
Comment 2•7 years ago
Do you know the reason for these GCs?
Set javascript.options.mem.log to true and look at the Browser Console for messages from the relevant child process.
Flags: needinfo?(bugs) → needinfo?(josh)
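For anyone following along, that pref can be flipped in about:config or set from a user.js file; a minimal sketch, using the pref name from comment 2 and standard prefs syntax:

// user.js - sketch: enable the GC log messages referred to in comment 2
// so they show up in the Browser Console.
user_pref("javascript.options.mem.log", true);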
Comment 3•7 years ago
Hmm, these are all JS-engine-triggered GCs, so my guess is the reason is ALLOC or some such.
Comment 4•7 years ago (Reporter)
There are some gems in this log:
Pause: 300.663ms of unlimited budget (@ 31.816ms); Reason: API;
Pause: 281.301ms of 5ms budget (@ 0.000ms); Reason: ALLOC_TRIGGER;
Pause: 265.338ms of 5ms budget (@ 0.000ms); Reason: ALLOC_TRIGGER;
Pause: 275.415ms of 5ms budget (@ 0.000ms); Reason: ALLOC_TRIGGER;
Pause: 280.692ms of 5ms budget (@ 0.000ms); Reason: ALLOC_TRIGGER;
etc.
In total it was probably 15 or 20 seconds of unresponsiveness.
Flags: needinfo?(josh)
Comment 5•7 years ago
I assume those aren't interruptible (trying to recall what the log says in such a case).
I don't quite understand why the profile doesn't show anything before the GC for that particular child process.
Comment 6•7 years ago (Reporter)
That log is filtered to only include messages containing "content-". If that's a problem I can do it without filtering.
Updated•7 years ago
Whiteboard: [qf] → [qf:p1]
Updated•7 years ago
Depends on: GCScheduling
Comment 7•7 years ago
I also experience this periodically, fwiw.
Comment 8•7 years ago
Another profile if it's useful: http://bit.ly/2rL6QLI
Comment 9•7 years ago
The profiles have lots of GC information embedded in them. If I look at /threads/2/markers/data/*/timings/nonincremental_reason all but one of the 50 or so GCs in there shows "GCBytesTrigger". (And yes, reason is ALLOC_TRIGGER.) Which implies that something is furiously allocating a massive number of GC things.
They are all collecting 6 zones out of a total of 7.
chunks_added and chunks_removed are both 0 throughout. 'allocated' is 32, which appears to mean that the GC heap size is a steady 32MB.
I guess that means we're allocating an enormous amount of short-lived stuff in the tenured heap? It would be nice to have some kind of information on what that is, like a histogram at least by AllocKind.
I downloaded 3 of the leaf script files just above the GC on the stack to look for commonalities, but nothing stood out. (It might work better if we had column numbers!) Clicking around on the stacks to see what part of the engine triggered the GC didn't really suggest anything to me either.
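For anyone else digging through these profiles, here is a rough sketch of how the nonincremental reasons could be tallied from an exported profile JSON. It deliberately walks the whole tree rather than assuming the exact marker layout (which varies between profiler versions); the file name and invocation are made up for illustration.

// count-nonincremental.js - sketch: tally every nonincremental_reason value
// found anywhere in a saved Gecko profile JSON.
// Usage: node count-nonincremental.js profile.json
const fs = require('fs');

const profile = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'));
const counts = new Map();

function walk(node) {
  if (Array.isArray(node)) {
    node.forEach(walk);
  } else if (node && typeof node === 'object') {
    if (typeof node.nonincremental_reason === 'string')
      counts.set(node.nonincremental_reason,
                 (counts.get(node.nonincremental_reason) || 0) + 1);
    Object.values(node).forEach(walk);
  }
}

walk(profile);
for (const [reason, n] of counts)
  console.log(reason + ': ' + n);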
Comment 10•7 years ago
I see this quite often too (I visit Facebook once a day while logged in, and it happens about every third time). Here's another profile from today: https://perfht.ml/2vbpRJi (I clicked "Wait" about 10 times at the "is slowing down [Wait] [Stop]" dialog). In the end FB does not load even when I eventually press Stop :( Clicking "Wait" just seems to let it run forever... Let me know if I can help by giving more information. FWIW, I am on Linux.
Updated•7 years ago
platform-rel: --- → ?
Whiteboard: [qf:p1] → [qf:p1][platform-rel-Facebook]
Comment 11•7 years ago
(In reply to gerd.neumann from comment #10)
> I see this quite often too (I go to facebook once a day being logged in, and
> it happens about every 3rd time). Here's another profile:
> https://perfht.ml/2vbpRJi from today (I clicked "Wait" about 10 times at the
> "Is slowing down. [Wait] [Stop]" dialog). In the end FB does not load when
> pressing Stop eventually :( Keeping "Wait" seems like it is running like
> forever... Let me know if I can help giving more information. FWIW, I am on
> Linux
Just for the record, this profile *is* the same issue (a long series of back-to-back GCs).
Comment 12•7 years ago
I think I'm seeing the same behavior in Bug 1382814. I'm able to reproduce reasonably consistently using Netflix.
Comment 14•7 years ago (Assignee)
> Which implies that something is furiously allocating a massive number of GC things.
I wonder if we are triggering GC because the atoms zone is over its threshold but find we can't collect it due to ongoing parsing activity.
Comment 15•7 years ago
Ted's profile is a pretty extreme one. The heap is holding steady at around 16MB, though it slowly crept up to that size. The collections are nonincremental GCs of 3 out of 4 zones. Does it mean something that we always seem to be doing n-1 out of n?
Comment 16•7 years ago
Okay, I looked at my Netflix example (on nightly 2017-05-01 since I can hit the bug predictably) and can attach a debugger while the GC is acting up.
Jonco's theory about the atoms zone seems to be correct. We wind up with the atoms zone over its threshold, schedule a full GC, and then skip the atoms zone because a background parse is in progress. This hammers the main thread as the script makes lock-step progress with full GCs.
To enter the stalled state, we have an async parse (from <script async=""> in the Netflix case) that is waiting on the parseFinishedList_ queue until the main thread can do finalization tasks (copy to the correct compartment, etc). This is blocked until the current script on the main thread completes.
On each major GC, we check if the atoms zone is over its threshold and, if so, we create a deferred request to do a full GC in the future. This flag is checked for every new Parser run on the main thread. This means every lazy script or eval that the main thread triggers will cause a full GC. For a complex script, this can be thousands of full GCs on page load.
Perhaps someone can help create a testcase. I'm not sure how to reliably force the atoms zone to go over its threshold (especially if it is 10s of MB). I could not reproduce the issue with MOZ_GCTIMER active and the timing seems to be sensitive. I can look around more in the debugger if there are specific questions.
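To make that concrete, here is an untested sketch of the shape such a testcase might take, based only on the mechanism described in this comment. The atom count, script size and timing are guesses, and there is no guarantee this actually pushes the atoms zone over its threshold or keeps the off-thread parse pending at the right moment.

// Untested sketch of a possible reproduction (all numbers are guesses).

// 1. Create lots of unique atoms so the atoms zone approaches its threshold.
var holder = {};
for (var i = 0; i < 1000000; i++) {
  holder["atom_" + i] = i; // unique property names get atomized
}

// 2. Start an async external script big enough to be parsed off the main
//    thread, served from a blob URL so the page is self-contained.
var bigSource = "function bigFunction() { return 1; }\n".repeat(200000);
var script = document.createElement("script");
script.async = true;
script.src = URL.createObjectURL(new Blob([bigSource], { type: "text/javascript" }));
document.head.appendChild(script);

// 3. In a later task (so the fetch and off-thread parse have a chance to
//    start), run many small main-thread parses. Per the description above,
//    each new Parser run re-checks the deferred full-GC flag, so each eval
//    may trigger another full (but atoms-skipping) GC.
setTimeout(function () {
  for (var j = 0; j < 10000; j++) {
    eval("var v" + j + " = " + j + ";");
  }
}, 0);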
Comment 17•7 years ago
As discussed on #jsapi, the immediate problem is that ~AutoKeepAtoms doesn't check hasHelperThreadZones() before scheduling the GC. This leads to the back-to-back GC problem.
https://searchfox.org/mozilla-central/rev/3a3af33f513071ea829debdfbc628caebcdf6996/js/src/jscntxt.h#1260
There is also the matter that while an async script is waiting for the main thread in order to finalize its parse (copying compartments from the parser to the target), we will never be able to GC the atoms zone during that main-thread script. Jon is going to investigate this some more.
Updated•7 years ago (Assignee)
Assignee: nobody → jcoppeard
Flags: needinfo?(jcoppeard)
Comment 18•7 years ago (Assignee)
The problem is that we're repeatedly triggering a full GC to collect the atoms zone, but then not actually collecting it due to the presence of helper threads.
The patch factors out the check for whether we can collect the atoms zone (no AutoKeepAtoms, no helper threads) into JSContext::canCollectAtoms() and replaces the original checks with this (it appears both were wrong).
I added a specific reason code for this type of GC so it will show up in telemetry and added an assertion that it always collects the atoms zone. I refactored the zone selection logic at the start of the GC: the atoms zone used to be handled separately because it could only be collected in a full GC but this is no longer the case.
I also made sure that the relevant zone was scheduled in cases where we collect non-incrementally due to hitting triggers, as that could potentially cause similar issues.
Attachment #8889527 - Flags: review?(sphink)
Comment 19•7 years ago
By the way, the same thing happens on treeherder, but to a lesser extent than on Facebook.
Comment 20•7 years ago
Comment on attachment 8889527 [details] [diff] [review]
bug1374797-atoms-gcs
Review of attachment 8889527 [details] [diff] [review]:
-----------------------------------------------------------------
Nice!! It's scary to think how much we may be suffering from this right now.
I traced through the logic as much as I could, and it looks good from what I can tell.
::: js/src/jsgc.cpp
@@ +3870,1 @@
> // If we are repeating a GC becuase we noticed dead compartments haven't
*because
@@ +3884,5 @@
> +
> + // If keepAtoms() is true then either an instance of AutoKeepAtoms is
> + // currently on the stack or parsing is currently happening on another
> + // thread. In either case we don't have information about which atoms are
> + // roots, so we must skip collecting atoms.
Comment needs an update. keepAtoms appears to not be a function, and it does not encompass parsing happening (hasHelperThreadZones).
Oh. I guess this can just do s/keepAtoms/canCollectAtoms/
@@ +3898,5 @@
> + // the other collected zones are using are marked, and we can update the
> + // set of atoms in use by the other collected zones at the end of the GC.
> + if (zone->isAtomsZone()) {
> + JSRuntime* rt = zone->runtimeFromActiveCooperatingThread();
> + return !TlsContext.get()->keepAtoms && !rt->hasHelperThreadZones();
...at least if this can be simplified to
if (zone->isAtomsZone())
return TlsContext.get()->canCollectAtoms();
?
@@ +3961,2 @@
> */
> + MOZ_ASSERT_IF(reason == JS::gcreason::DELAYED_ATOMS_GC, atomsZone->isGCMarking());
Ooh, I like this assert.
@@ +6738,5 @@
> }
> +
> + // This ensures we collect zones that have reached the malloc limit.
> + // TODO: Start collecting these zones earlier like the we do for the
> + // GC bytes trigger above.
Please file a bug, and s/the we/we/
Attachment #8889527 - Flags: review?(sphink) → review+
Comment 21•7 years ago
Is this nightly only or does it affect other branches?
Comment 22•7 years ago
In my attempts to bisect on Bug 1382814, I got failures as far back as nightly 2017-03-01, which I think would be FF55. Possibly earlier too.
(In reply to Ted Campbell [:tcampbell] from comment #22)
> I'm my attempts to bisect on Bug 1382814, I got failures as far back as
> nightly 2017-03-01 which I think would be FF55. Possibly earlier too.
54 left central on 2017-03-06, so it sounds like it could also be affected.
[Tracking Requested - why for this release]:
Important perf issue affecting commonly used pages.
status-firefox54: --- → affected
status-firefox55: --- → affected
status-firefox56: --- → affected
tracking-firefox54: --- → ?
tracking-firefox55: --- → ?
tracking-firefox56: --- → ?
Comment 24•7 years ago
You are correct. I can reproduce on both 54 and 55. A quick look at the code suggests ESR-52 is also affected, but I can't get the timing to appear with my test case.
Updated•7 years ago
status-firefox-esr52: --- → affected
Comment 26•7 years ago
We are only one week from 55 RC. Marking 54 as wontfix. Let's see if we can get it fixed in 55.
Comment 27•7 years ago (Assignee)
(In reply to Steve Fink [:sfink] [:s:] from comment #20)
> Comment needs an update. keepAtoms appears to not be a function, and it does
> not encompass parsing happening (hasHelperThreadZones).
Good point, and yes we can simplify this to just call canCollectAtoms().
> > + // This ensures we collect zones that have reached the malloc limit.
> > + // TODO: Start collecting these zones earlier like the we do for the
> > + // GC bytes trigger above.
>
> Please file a bug, and s/the we/we/
Done, filed bug 1384049.
Comment 28•7 years ago
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e33e38d61829
Fix logic around triggering atoms GCs r=sfink
Comment 30•7 years ago
bugherder
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Hi Jon, Julien, should we consider uplifting this to 55? I know it's pretty late and the change doesn't seem trivial. Thoughts?
Flags: needinfo?(jcristau)
Flags: needinfo?(jcoppeard)
Comment 32•7 years ago
I think it's probably too late for 55 unfortunately, unless we want to delay the release over this to get some extra confidence (and that seems unlikely).
Flags: needinfo?(jcristau)
Comment 33•7 years ago
The good news here is that scrolling on both Facebook and treeherder is great now for me. Those are the two places where I saw major issues.
Comment 34•7 years ago (Assignee)
(In reply to Julien Cristau [:jcristau] from comment #32)
RyanVM asked the same question on IRC and we came to the conclusion that it's probably too late.
Flags: needinfo?(jcoppeard)
Comment 35•7 years ago
While I understand the rationale for holding off on uplifting this, the number of bugs that are turning out to be dupes of this one is kind of worrying me. It seems that both Facebook and Gmail (from bug 1242861) were affected, which are not insignificant websites, so I'm worried that a lot of users are quietly having a pretty miserable experience using Firefox at the moment.
Comment 36•7 years ago
I agree we should consider this for 55 or at least a dot release shortly afterwards.
The short summary is that some of the most popular websites (Facebook, Netflix, etc) hang the content process for at least a minute on load.
Comment 37•7 years ago
I think we're all worried about the implications of this not making 55. The concern is how much time this patch needs to bake before we can feel comfortable with uplifting it into the RC build on Monday (or potentially a dot release in a few weeks). Fuzzing, telemetry, something else?
Comment 38•7 years ago
Well, here I am with 20/20 hindsight to say this should have been uplifted as soon as we had a patch, so that we had more testing on the beta channel. It seems a bit late now, and now that we have eliminated Aurora, we no longer have the luxury of doing a beta of a 55.0.1. It seems anything we do here will be wrong.
Comment 39•7 years ago
What probably should have been done (and I am not posting this to criticize, but to show a better way to handle this if something like it comes up again) is to uplift to beta immediately, and then, if the testing was not thought to be long enough, back it out there before release and hold it for a later point release, so that there would be more widespread testing because it was on beta at one point. Having more testing on Nightly is fine but does not seem to compare to the testing on the beta channel.
Comment 40•7 years ago (Assignee)
Here's a backport of the patch to beta that includes the fix for bug 1384550.
I had to give DELAYED_ATOMS_GC a different reason code for this version.
Comment 41•7 years ago (Assignee)
Comment on attachment 8891338 [details] [diff] [review]
bug1374797-beta
Approval Request Comment
[Feature/Bug causing the regression]: Bug 1325050.
[User impact if declined]: Possibility of very slow browsing due to excessive GCs.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: Yes.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None (this patch includes fix for bug 1384550).
[Is the change risky?]: Some risk.
[Why is the change risky/not risky?]: There is some risk due to the possibility of unforeseen side effects from changing a complex area of code, and the short time this has had to bake on central.
[String changes made/needed]: None.
Attachment #8891338 - Flags: approval-mozilla-beta?
Comment 42•7 years ago
Attachment #8891338 - Flags: feedback+
Comment 43•7 years ago
Comment on attachment 8891338 [details] [diff] [review]
bug1374797-beta
Let's try this in 55 RC1; the bug sounds bad enough to take this risk.
Attachment #8891338 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 44•7 years ago
bugherder uplift
Flags: in-testsuite+
Comment 46•7 years ago
This feels baked enough now that I think it's time to consider rebasing this for ESR52 and nominating it for uplift. WDYT, Jon? Could roll the fix for bug 1390087 into it at the same time.
Flags: needinfo?(jcoppeard)
Comment 48•7 years ago
Even better! Thanks and my bad :)
Updated•3 years ago
Performance Impact: --- → P1
Whiteboard: [qf:p1][platform-rel-Facebook] → [platform-rel-Facebook]