Closed Bug 1374797 Opened 7 years ago Closed 7 years ago

Facebook is intermittently unusable on load due to GC

Categories

(Core :: JavaScript: GC, defect)

x86_64
macOS
defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla56
Performance Impact high
Tracking Status
platform-rel --- ?
firefox-esr52 --- unaffected
firefox54 - wontfix
firefox55 + fixed
firefox56 + fixed

People

(Reporter: jdm, Assigned: jonco)

References

(Depends on 1 open bug)

Details

(Whiteboard: [platform-rel-Facebook])

Attachments

(3 files)

I finally captured a profile of something I see on a frequent but intermittent basis. At least once a day, using FF Nightly, I visit facebook.com (logged in) and the page appears but refuses to let me interact with it. At some point the yellow "A page is slowing down your browser" warning appears; if I wait long enough (around 60s) it eventually goes away. Inverting the call stack in the profiler shows that the vast majority of the time is being spent in the GC.

http://bit.ly/2sxTeW5
OS: Unspecified → Mac OS X
Hardware: Unspecified → x86_64
Version: unspecified → Trunk
Can we figure out why these aren't incremental or something?
Flags: needinfo?(jcoppeard)
Flags: needinfo?(bugs)
Do you know the reason for these GCs?

Set javascript.options.mem.log to true and look at the browser console for messages from the relevant child process.
Flags: needinfo?(bugs) → needinfo?(josh)
Hmm, these are all JS-engine-triggered GCs, so my guess is the reason is ALLOC or some such.
Attached file GC log
There are some gems in this log:
Pause: 300.663ms of unlimited budget (@ 31.816ms); Reason: API;
Pause: 281.301ms of 5ms budget (@ 0.000ms); Reason: ALLOC_TRIGGER;
Pause: 265.338ms of 5ms budget (@ 0.000ms); Reason: ALLOC_TRIGGER;
Pause: 275.415ms of 5ms budget (@ 0.000ms); Reason: ALLOC_TRIGGER;
Pause: 280.692ms of 5ms budget (@ 0.000ms); Reason: ALLOC_TRIGGER;
etc.

In total it was probably 15 or 20 seconds of unresponsiveness.
Flags: needinfo?(josh)
I assume those aren't interruptible (trying to recall what the log says in that case).

I don't quite understand why the profile doesn't show anything before the GC for that particular child process.
That log is filtered to only include messages containing "content-". If that's a problem I can do it without filtering.
Whiteboard: [qf] → [qf:p1]
Depends on: GCScheduling
I also experience this periodically, fwiw.
Another profile if it's useful: http://bit.ly/2rL6QLI
The profiles have lots of GC information embedded in them. If I look at /threads/2/markers/data/*/timings/nonincremental_reason, all but one of the 50 or so GCs in there show "GCBytesTrigger". (And yes, the reason is ALLOC_TRIGGER.) Which implies that something is furiously allocating a massive number of GC things.

They are all collecting 6 zones out of a total of 7.

chunks_added and chunks_removed are both 0 throughout. 'allocated' is 32, which appears to mean that the GC heap size is a steady 32MB.

I guess that means we're allocating an enormous amount of short-lived stuff in the tenured heap? It would be nice to have some kind of information on what that is, like a histogram at least by AllocKind.

I downloaded 3 of the leaf script files just above the GC on the stack to look for commonalities, but nothing stood out. (It might work better if we had column numbers!) Clicking around on the stacks to see what part of the engine triggered the GC didn't really suggest anything to me either.
I see this quite often too (I visit Facebook once a day while logged in, and it happens about every third time). Here's another profile from today: https://perfht.ml/2vbpRJi (I clicked "Wait" about ten times on the "is slowing down [Wait] [Stop]" dialog). If I eventually press Stop, FB does not finish loading :( and if I keep clicking "Wait" it seems to run forever... Let me know if I can help by providing more information. FWIW, I am on Linux.
platform-rel: --- → ?
Whiteboard: [qf:p1] → [qf:p1][platform-rel-Facebook]
(In reply to gerd.neumann from comment #10)
> I see this quite often too (I visit Facebook once a day while logged in, and
> it happens about every third time). Here's another profile from today:
> https://perfht.ml/2vbpRJi

Just for the record, this profile *is* the same issue (a long series of back-to-back GCs.)
I think I'm seeing the same behavior in Bug 1382814. I'm able to reproduce reasonably consistently using Netflix.
> Which implies that something is furiously allocating a massive number of GC things.

I wonder if we are triggering GC because the atoms zone is over its threshold but find we can't collect it due to ongoing parsing activity.
Ted's profile is a pretty extreme one. The heap is holding steady at around 16MB, though it slowly crept up to that size. The collections are nonincremental GCs of 3 out of 4 zones. Does it mean something that we always seem to be doing n-1 out of n?
Okay, I looked at my Netflix example (on nightly 2017-05-01, since I can hit the bug predictably there) and can attach a debugger while the GC is acting up.

Jonco's theory about the atoms zone seems to be correct. We wind up with the atoms zone over its threshold, schedule a full GC, and then skip the atoms zone because a background parse is in progress. This hammers the main thread as the script makes lock-step progress with full GCs.

To enter the stalled state, we have an async parse (from <script async=""> in the Netflix case) that is waiting on the parseFinishedList_ queue until the main thread can do finalization tasks (copying to the correct compartment, etc.). This is blocked until the current script on the main thread completes.

On each major GC, we check if the atoms zone is over its threshold and, if so, we create a deferred request to do a full GC in the future. This flag is checked for every new Parser run on the main thread, which means every lazy script or eval the main thread triggers will cause a full GC. For a complex script, this can mean thousands of full GCs on page load (see the sketch at the end of this comment).

Perhaps someone can help create a testcase. I'm not sure how to reliably force the atoms zone over its threshold (especially if it is 10s of MB). I could not reproduce the issue with MOZ_GCTIMER active, and the timing seems to be sensitive. I can look around more in the debugger if there are specific questions.
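
To make the loop concrete, here is a self-contained toy model of the scheduling behaviour described above. None of these names are real SpiderMonkey identifiers; it only simulates how the deferred request keeps getting re-raised while a helper-thread parse holds the atoms zone:

    // Toy model only: not engine code.
    #include <cstdio>

    struct ToyGC {
        bool atomsOverThreshold = true;   // atoms zone exceeded its trigger
        bool helperThreadParsing = true;  // async <script> parse holds the atoms zone
        bool fullGCRequested = false;

        void maybeRequestAtomsGC() {
            if (atomsOverThreshold)
                fullGCRequested = true;   // deferred: serviced on a later parser run
        }
        void runFullGC() {
            if (helperThreadParsing)
                fullGCRequested = atomsOverThreshold;  // atoms zone skipped, request re-raised
            else
                atomsOverThreshold = fullGCRequested = false;
        }
    };

    int main() {
        ToyGC gc;
        gc.maybeRequestAtomsGC();
        // Every lazy script or eval on the main thread sees the pending request
        // and performs a full, nonincremental GC that cannot collect atoms:
        for (int run = 0; run < 5 && gc.fullGCRequested; run++) {
            gc.runFullGC();
            printf("parser run %d: full GC done, request still pending = %d\n",
                   run, gc.fullGCRequested);
        }
        return 0;
    }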
As discussed on #jsapi, the immediate problem is that ~AutoKeepAtoms doesn't check hasHelperThreadZones() before scheduling the GC. This leads to the back-to-back GC problem.

https://searchfox.org/mozilla-central/rev/3a3af33f513071ea829debdfbc628caebcdf6996/js/src/jscntxt.h#1260

There is also the matter that while an async script is waiting for the main thread in order to finalize its parse (copying compartments from the parser to the target), we will never be able to GC the atoms zone during that main-thread script. Jon is going to investigate this some more.
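
For illustration, a simplified sketch of the shape of that destructor; the Runtime/Context types and the flag and method names here are stand-ins rather than the real classes at the linked location:

    struct Runtime {
        bool fullGCForAtomsRequested = false;                // stand-in for the deferred request flag
        bool hasHelperThreadZones() const { return true; }   // off-thread parse in progress
        void triggerFullGC() { /* schedule a full, nonincremental GC */ }
    };
    struct Context {
        Runtime* rt = nullptr;
        unsigned keepAtoms = 0;                              // count of live AutoKeepAtoms
    };

    class AutoKeepAtoms {
        Context* cx_;
      public:
        explicit AutoKeepAtoms(Context* cx) : cx_(cx) { cx_->keepAtoms++; }
        ~AutoKeepAtoms() {
            cx_->keepAtoms--;
            // Only the keep-atoms count is consulted here. Without also checking
            // cx_->rt->hasHelperThreadZones(), the triggered GC skips the atoms
            // zone and the request is immediately raised again.
            if (cx_->keepAtoms == 0 && cx_->rt->fullGCForAtomsRequested)
                cx_->rt->triggerFullGC();
        }
    };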
Assignee: nobody → jcoppeard
Flags: needinfo?(jcoppeard)
The problem is that we're repeatedly triggering a full GC to collect the atoms zone, but then not actually collecting it due to the presence of helper threads.

The patch factors out the check for whether we can collect the atoms zone (no AutoKeepAtoms, no helper threads) into JSContext::canCollectAtoms() and replaces the original checks with it (both of which, it appears, were wrong).

I added a specific reason code for this type of GC so it will show up in telemetry, and added an assertion that it always collects the atoms zone.  I refactored the zone selection logic at the start of the GC: the atoms zone used to be handled separately because it could only be collected in a full GC, but this is no longer the case.

I also made sure that the relevant zone was scheduled in cases where we collect non-incrementally due to hitting triggers, as that could potentially cause similar issues.
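
A sketch of what the factored-out check looks like under those constraints; the surrounding Runtime/Context types are stand-ins, and only the condition itself (no AutoKeepAtoms on the stack, no helper threads using the atoms zone) is taken from the description above:

    struct Runtime {
        // Stand-in: returns true while an off-thread parse is using the atoms zone.
        bool hasHelperThreadZones() const { return false; }
    };
    struct Context {
        unsigned keepAtoms = 0;    // incremented by each live AutoKeepAtoms
        Runtime* rt = nullptr;

        bool canCollectAtoms() const {
            // Atoms can only be collected when nothing on the stack is keeping
            // them alive and no helper thread is parsing into the atoms zone.
            return keepAtoms == 0 && rt && !rt->hasHelperThreadZones();
        }
    };

    // Zone selection at the start of a GC can then ask a single question:
    //     if (zone->isAtomsZone())
    //         return TlsContext.get()->canCollectAtoms();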
Attachment #8889527 - Flags: review?(sphink)
See Also: → 1383833
By the way, the same thing happens on treeherder, but to a lesser extent than on Facebook.
Comment on attachment 8889527 [details] [diff] [review]
bug1374797-atoms-gcs

Review of attachment 8889527 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!! It's scary to think how much we may be suffering from this right now.

I traced through the logic as much as I could, and it looks good from what I can tell.

::: js/src/jsgc.cpp
@@ +3870,1 @@
>      // If we are repeating a GC becuase we noticed dead compartments haven't

*because

@@ +3884,5 @@
> +
> +    // If keepAtoms() is true then either an instance of AutoKeepAtoms is
> +    // currently on the stack or parsing is currently happening on another
> +    // thread. In either case we don't have information about which atoms are
> +    // roots, so we must skip collecting atoms.

Comment needs an update. keepAtoms appears to not be a function, and it does not encompass parsing happening (hasHelperThreadZones).

Oh. I guess this can just do s/keepAtoms/canCollectAtoms/

@@ +3898,5 @@
> +    // the other collected zones are using are marked, and we can update the
> +    // set of atoms in use by the other collected zones at the end of the GC.
> +    if (zone->isAtomsZone()) {
> +        JSRuntime* rt = zone->runtimeFromActiveCooperatingThread();
> +        return !TlsContext.get()->keepAtoms && !rt->hasHelperThreadZones();

...at least if this can be simplified to 

   if (zone->isAtomsZone())
       return TlsContext.get()->canCollectAtoms();

?

@@ +3961,2 @@
>       */
> +    MOZ_ASSERT_IF(reason == JS::gcreason::DELAYED_ATOMS_GC, atomsZone->isGCMarking());

Ooh, I like this assert.

@@ +6738,5 @@
>              }
> +
> +            // This ensures we collect zones that have reached the malloc limit.
> +            // TODO: Start collecting these zones earlier like the we do for the
> +            // GC bytes trigger above.

Please file a bug, and s/the we/we/
Attachment #8889527 - Flags: review?(sphink) → review+
Is this nightly only or does it affect other branches?
In my attempts to bisect on Bug 1382814, I got failures as far back as nightly 2017-03-01, which I think would be FF55. Possibly earlier too.
(In reply to Ted Campbell [:tcampbell] from comment #22)
> In my attempts to bisect on Bug 1382814, I got failures as far back as
> nightly 2017-03-01, which I think would be FF55. Possibly earlier too.

54 left central on 2017-03-06, so it sounds like it could also be affected.

[Tracking Requested - why for this release]:
Important perf issue affecting commonly used pages.
You are correct. I can reproduce on both 54 and 55. A quick look at the code suggests ESR-52 is also affected, but I can't get the timing to appear with my test case.
We are only one week from the 55 RC. Marking 54 as wontfix. Let's see if we can get it fixed in 55.
(In reply to Steve Fink [:sfink] [:s:] from comment #20)
> Comment needs an update. keepAtoms appears to not be a function, and it does
> not encompass parsing happening (hasHelperThreadZones).

Good point, and yes we can simplify this to just call canCollectAtoms().

> > +            // This ensures we collect zones that have reached the malloc limit.
> > +            // TODO: Start collecting these zones earlier like the we do for the
> > +            // GC bytes trigger above.
> 
> Please file a bug, and s/the we/we/

Done, filed bug 1384049.
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e33e38d61829
Fix logic around triggering atoms GCs r=sfink
https://hg.mozilla.org/mozilla-central/rev/e33e38d61829
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Depends on: 1384550
Hi Jon, Julien, should we consider uplifting this to 55? I know it's pretty late and the change doesn't seem trivial. Thoughts?
Flags: needinfo?(jcristau)
Flags: needinfo?(jcoppeard)
I think it's probably too late for 55 unfortunately, unless we want to delay the release over this to get some extra confidence (and that seems unlikely).
Flags: needinfo?(jcristau)
The good news here is that scrolling on both Facebook and Treeherder is great now for me.  Those are the two places where I saw major issues.
(In reply to Julien Cristau [:jcristau] from comment #32)
RyanVM asked the same question on IRC and we came to the conclusion that it's probably too late.
Flags: needinfo?(jcoppeard)
While I understand the rationale for holding off on uplifting this, the number of bugs that are turning out to be dupes of this one is kind of worrying me. It seems that both Facebook and Gmail (from bug 1242861) were affected, which are not insignificant websites, so I'm worried that a lot of users are quietly having a pretty miserable experience using Firefox at the moment.
I agree we should consider this for 55 or at least a dot release shortly afterwards.

The short summary is that some of the most popular websites (Facebook, Netflix, etc) hang the content process for at least a minute on load.
I think we're all worried about the implications of this not making 55. The concern is how much time this patch needs to bake before we can feel comfortable with uplifting it into the RC build on Monday (or potentially a dot release in a few weeks). Fuzzing, telemetry, something else?
Well, here I am with 20/20 hindsight to say this should have been uplifted as soon as we had a patch, so that we got more testing on the beta channel.  It seems a bit late now, and now that we have eliminated Aurora we no longer have the luxury of doing a beta of a 55.0.1.  It seems anything we do here will be wrong.
What probably should have been done (and I am not posting this to criticize, but to show a better way to handle it if something like this comes up again): it should have been uplifted to beta immediately and then, if testing was not thought to be long enough, backed out there before release and held for a later point release, so that there would have been more widespread testing because it was on beta at one time.  Having more testing on Nightly is fine but does not seem to compare to the testing on the beta channel.
Attached patch bug1374797-betaSplinter Review
Here's a backport of the patch to beta that includes the fix for bug 1384550.

I had to give DELAYED_ATOMS_GC a different reason code for this version.
Comment on attachment 8891338 [details] [diff] [review]
bug1374797-beta

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1325050.
[User impact if declined]: Possibility of very slow browsing due to excessive GCs.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: Yes.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None (this patch includes fix for bug 1384550).
[Is the change risky?]: Some risk.
[Why is the change risky/not risky?]: There is some risk due to the possibility of unforeseen side effects from changing a complex area of code, and the short time this has had to bake on central.
[String changes made/needed]: None.
Attachment #8891338 - Flags: approval-mozilla-beta?
Comment on attachment 8891338 [details] [diff] [review]
bug1374797-beta

Green on Try too.
Attachment #8891338 - Flags: feedback+
Comment on attachment 8891338 [details] [diff] [review]
bug1374797-beta

let's try this in 55 rc1, the bug sounds bad enough to take this risk
Attachment #8891338 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
This feels baked enough now that I think it's time to consider rebasing this for ESR52 and nominating it for uplift. WDYT, Jon? Could roll the fix for bug 1390087 into it at the same time.
Flags: needinfo?(jcoppeard)
This problem affects 54 onwards.
Flags: needinfo?(jcoppeard)
Even better! Thanks and my bad :)
Performance Impact: --- → P1
Whiteboard: [qf:p1][platform-rel-Facebook] → [platform-rel-Facebook]