Open Bug 1629064 Opened 1 year ago Updated 12 hours ago

Closing about a hundred tabs resulted in the parent and all content processes being busy GC'ing

Categories

(Core :: DOM: Content Processes, defect, P2)

defect

Tracking


ASSIGNED
Fission Milestone M7a

People

(Reporter: florian, Assigned: pbone)

References

(Blocks 7 open bugs)

Details

(Keywords: perf, Whiteboard: fission-perf)

Attachments

(13 files, 10 obsolete files)

48 bytes, text/x-phabricator-request
Details | Review
2.53 KB, text/plain
chutten
: data-review+
Details

See this profile: https://perfht.ml/2XmQBId
In the parent process (and some content processes, e.g. "Content Process (5/8)") we have a lot of GCSlice markers of about 100ms with the reason PAGE_HIDE. In some other content processes I see a series of GCSlice markers with the CC_WAITING reason.

This made the browser very unresponsive for several seconds; I wonder if we were doing a GC for every closed tab.

The way that we schedule GCs on page close is through a function, PokeGC(), which acts like a saturating bit, so closing many pages at once won't cause us to do many GCs. What you are probably seeing is that there's a ton of work to do when cleaning up a hundred tabs, and the back-and-forth between GC and CC gets pretty bad if there's too much work to do.
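The "saturating bit" behaviour described here can be sketched roughly as follows (a minimal illustration only; the class and method names are invented and this is not the real PokeGC()/nsJSContext code):

```cpp
#include <cassert>

// Hypothetical sketch of saturating GC scheduling: repeated pokes while a
// GC is already pending do not queue extra GCs, so closing 100 tabs at
// once still schedules only one GC.
class GCScheduler {
 public:
  // Returns true if this poke actually scheduled a new GC.
  bool PokeGC() {
    if (mGCPending) {
      return false;  // saturate: a GC is already scheduled
    }
    mGCPending = true;
    return true;
  }

  // Called when the scheduled GC timer fires; clears the pending flag so
  // a later poke can schedule the next GC.
  void GCTimerFired() { mGCPending = false; }

  bool Pending() const { return mGCPending; }

 private:
  bool mGCPending = false;
};
```

The saturation avoids queuing one GC per closed tab, but (as noted above) it doesn't reduce the total cleanup work a single GC/CC pass then has to do.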

Relatedly, when closing a large number of tabs, the very process of garbage collecting uses large amounts of memory. So memory use goes up significantly before going down.

(In reply to Mike Hommey [:glandium] from comment #2)
The GC doesn't allocate any memory itself (or only very small amounts). It might be the cycle collector doing that.

Yeah, the cycle collector will allocate a ton of memory if you are freeing a lot of stuff, because it creates a shadow copy of the possibly-garbage heap.
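A hedged sketch of why that shadow copy temporarily grows memory (illustrative only; the real graph builder lives in the cycle collector and does not have this shape):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative model: before freeing anything, the cycle collector builds
// a shadow graph with one node per suspected-garbage object, so peak
// memory use grows in proportion to the amount being freed.
struct ShadowNode {
  const void* original;       // the object under consideration
  std::vector<size_t> edges;  // indices of nodes this one references
};

// Building the shadow graph allocates one node per suspect; freeing a lot
// of stuff therefore costs extra memory up front.
std::vector<ShadowNode> BuildShadowGraph(
    const std::vector<const void*>& suspects) {
  std::vector<ShadowNode> graph;
  graph.reserve(suspects.size());
  for (const void* s : suspects) {
    graph.push_back(ShadowNode{s, {}});
  }
  return graph;
}
```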

(In reply to Andrew McCreight [:mccr8] from comment #1)
I thought Olli added a mechanism so that all child processes didn't try to collect at once? It seems like that is what's happening here, though.

The thing I added was parent process driven idle detection.
That certainly wouldn't catch the issue when closing processes.

OK that makes sense.

I'm going to move this to DOM since it seems to be more about scheduling GCs between multiple child processes than about GC per se.

Component: JavaScript: GC → DOM: Content Processes

P3
M6

Fission Milestone: --- → M6
Priority: -- → P3

Because this bug's Severity has not been changed from the default since it was filed, and its Priority is P3 (Backlog), indicating it has been triaged, the bug's Severity is being updated to S3 (normal).

Severity: normal → S3
Flags: needinfo?(pbone)

Tracking for Fission M7 Beta.

This doesn't block enabling Fission in Nightly but we should fix soon after.

Fission Milestone: M6 → M7
Priority: P3 → P2

Thanks for NIing me on this, Neha.

Hi Florian,

It looks like you did not have Fission enabled. Is that correct?

I don't know if this would be worse or better with Fission. With Fission there are more processes to contend with each other like this. But many of those processes might be used in a single tab only, and exit rather than GC. It depends on how people use the browser, of course. My point is that this might or might not be a priority for Fission; it might just be a priority for improving the browser.

Flags: needinfo?(pbone)

I guess in a multiprocess Firefox world one solution is for the parent process to have a number of "can GC" tokens it can hand out to child processes, probably at their request. (Child processes can still do emergency GCs.) The number of tokens may be related to the number of cores the browser has access to.
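The token idea could look roughly like this (purely hypothetical; GCTokenPool, RequestToken and ReturnToken are invented names, not Firefox APIs):

```cpp
#include <cassert>

// Sketch of a parent-side token pool: up to `maxTokens` "can GC" tokens
// (e.g. tied to the core count) are handed out; further requests are
// denied until a token is returned. Emergency GCs would bypass this.
class GCTokenPool {
 public:
  explicit GCTokenPool(unsigned maxTokens) : mAvailable(maxTokens) {}

  // A content process asks permission to GC.
  bool RequestToken() {
    if (mAvailable == 0) {
      return false;  // too many concurrent GCs; ask again later
    }
    --mAvailable;
    return true;
  }

  // The content process finished (or aborted) its GC.
  void ReturnToken() { ++mAvailable; }

 private:
  unsigned mAvailable;
};
```

A real implementation would also need a queue of deferred requesters to notify when tokens free up, and a timeout so a crashed child can't hold a token forever.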

Blocks: 1661491

(In reply to Olli Pettay [:smaug] from comment #6)

The thing I added was parent process driven idle detection.
That certainly wouldn't catch the issue when closing processes.

Hi Olli, does that mean that processes will take turns doing their idle GC (USER_INACTIVE)? Can we reuse that code for PAGE_HIDE? Or do both need doing, and you specifically changed the idle detection stuff?

Thanks.

Flags: needinfo?(bugs)

idle != user inactive.

Idleness means that the main event loop doesn't have any other tasks to run than idle tasks and that vsync isn't supposed to come in real soon.
PAGE_HIDE normally uses idle time, because in general all the GCs triggered by nsJSContext use idle time whenever possible.
But there is the issue that if the GC runs because, for example, the JS engine forces it, the idle time isn't used. And if there is an ongoing CC when a GC needs to run, the CC needs to run synchronously.

If you look at the profile in the initial comment, the long GC slice starts using idle time. The budget is 49ms (which is basically the max idle time, 50ms), but for some reason the GC slice takes way more time. Why does GC have such a long slice?

Flags: needinfo?(bugs)

(In reply to Olli Pettay [:smaug] from comment #14)

If you look at the profile in the initial comment, the long GC slice starts using idle time. The budget is 49ms (which is basically the max idle time, 50ms), but for some reason the GC slice takes way more time. Why does GC have such a long slice?

It's blocked on a background task that's taking too long, I filed Bug 1661872.

Thanks.

Depends on: 1661872

(In reply to Olli Pettay [:smaug] from comment #14)

idle != user inactive.

Idleness means that the main event loop doesn't have any other tasks to run than idle tasks and that vsync isn't supposed to come in real soon.
PAGE_HIDE normally uses idle time, because in general all the GCs triggered by nsJSContext use idle time whenever possible.
But there is the issue that if the GC runs because, for example, the JS engine forces it, the idle time isn't used. And if there is an ongoing CC when a GC needs to run, the CC needs to run synchronously.

Thanks for the info Olli,

No longer depends on: 1661872
See Also: → 1661872
Assignee: nobody → pbone
Status: NEW → ASSIGNED

(In reply to Olli Pettay [:smaug] from comment #14)

idle != user inactive.

Idleness means that the main event loop doesn't have any other tasks to run than idle tasks and that vsync isn't supposed to come in real soon.
PAGE_HIDE normally uses idle time, because in general all the GCs triggered by nsJSContext use idle time whenever possible.
But there is the issue that if the GC runs because, for example, the JS engine forces it, the idle time isn't used. And if there is an ongoing CC when a GC needs to run, the CC needs to run synchronously.

If you look at the profile in the initial comment, the long GC slice starts using idle time. The budget is 49ms (which is basically the max idle time, 50ms), but for some reason the GC slice takes way more time. Why does GC have such a long slice?

I'm still unsure what you changed about idle detection. People are telling me that you changed it so that USER_INACTIVE GCs for multiple processes can't run at the same time, but I haven't found the code for that yet, so I'm missing something.

The USER_INACTIVE GCs run from here: https://searchfox.org/mozilla-central/source/dom/base/nsJSEnvironment.cpp#1653 which is triggered by the timer started here: https://searchfox.org/mozilla-central/source/dom/base/nsJSEnvironment.cpp#1878-1887 which is called here: https://searchfox.org/mozilla-central/source/dom/base/nsJSEnvironment.cpp#346-349 for the "user-interaction-inactive" observer notification. I found that this is notified here: https://searchfox.org/mozilla-central/source/dom/events/EventStateManager.cpp#171

I've seen these notifications around, and AFAIK they're also cross-process (but maybe I'm wrong). So my questions/assumptions are:

  • Does the observer service notification in UITimerCallback::Notify run from the content process or the parent process? Is it about the whole browser UI or only the UI for a process?
  • Do all observers receive the notification? Do they receive it at approximately the same time?

I'm missing or misunderstanding something. As far as I can tell, the user stops moving the mouse and clicking things, and the UI sends "user-interaction-inactive" to all the observers (all the client processes), which all start their timers for the shrinking GC. Those timers are all started for the same number of seconds from now, so they all start their shrinking GCs at approximately the same time.

Thanks for your help.

Flags: needinfo?(bugs)

user-interaction-inactive state is per process. Nothing guarantees that user-interaction-active -> user-interaction-inactive state transitions don't happen around the same time in many processes.

Idle detection is per process too, but idle scheduling is cross-process.

So background tabs most likely get user-interaction-inactive rather soon after they have been moved to the background.
The foreground tab may of course get it too, if the user doesn't do anything, or if the user interacts with the browser chrome or so.

Perhaps https://searchfox.org/mozilla-central/source/dom/base/nsJSEnvironment.cpp#1832-1841 should use IdleTaskRunner, not
a timer, if user-interaction-inactive somehow causes issues here.

In general the parent process should probably close tabs and processes gradually to avoid overloading the system. There is also non-GC/CC-related work that probably needs to happen when closing a tab: session store, sending telemetry probes to the parent, etc.

Flags: needinfo?(bugs)

Thanks Olli, and thanks for your answers over Matrix.

To summarise: I wrote that, and then I dumped everything I know / have been considering about GC scheduling from the DOM side.

There are 3 timers used to initiate GCs from nsJSEnvironment.cpp, plus other triggers like running out of memory or hitting a threshold. In an e-mail, sfink told me he might try to reduce this to just one timer.

  • sGCTimer - started when PokeGC is called; fires 4 seconds later to initiate a full or zonal GC.
  • sFullGCTimer - started after a zonal GC completes; fires after 60 seconds to initiate a full GC.
  • sShrinkingGCTimer - started after 15 seconds of inactivity; runs an incremental compacting GC (non-zonal?). The timer is cancelled if user interaction is detected.

These can be started by:

  • PokeGC may be called for things like PAGE_HIDE (this bug).
  • PokeShrinkingGC is called if this process is inactive. The user may have gone to make a coffee, or they might be using a different application, a different browser tab, or browser chrome.

Once a GC is started, all but the first slice (I think) will run during idle time. Idle time is when the process has no "tasks" in the web-spec sense to run, no timers will fire soon, and no vsync is expected soon. smaug changed things so that a content process will check with the parent process before it starts an idle task, to make sure that not too many idle tasks are currently running.

Note that timers such as the ones above will preempt other idle tasks. But AFAIK this is acceptable because only the first slice runs in our timers' time.

The parent process code for idle scheduling is here: https://searchfox.org/mozilla-central/source/ipc/glue/IdleSchedulerParent.h

This means that the slices of a GC should be scheduled not-too-aggressively with respect to other content processes' GC slices. But if a GC slice overruns its budget it could be unfair to other processes. Or if a GC slice does off-thread work, that off-thread work will use more resources than the scheduler in the parent process accounted for. At best this probably won't affect GC throughput, but by doing fewer GCs at once we may be able to return some memory to the OS sooner. E.g.:

  1. P1 GC starts
  2. P2 GC starts
  3. P1 GC does a slice.
  4. P2 GC does a slice
  5. P1 GC finishes - returns memory to the OS
  6. P2 GC finishes - returns memory to the OS

vs:

  1. P1 GC starts
  2. P1 GC does a slice.
  3. P1 GC finishes - returns memory to the OS
  4. P2 GC starts
  5. P2 GC does a slice
  6. P2 GC finishes - returns memory to the OS

We return memory to the OS after 3 steps rather than 5 (quicker response), but the total memory is still returned to the OS after 6 steps (same throughput).

We want to think about the following scenarios:

  • A user closes ~100 tabs mapped to different processes, they all want to start GCs at the same time.
  • A user closes ~100 tabs that use the same process, it starts a very big GC.
  • A user closes 1 really big tab, it starts a big GC.
  • A user closes the last tab that uses a process. If we aren't going to reuse the process, then don't start a GC.
  • A user closes the last tab for a process, we might re-use it.
    ** It didn't have much memory allocated - do a GC and re-use the process. It'd be great if the GC doesn't slow the process down when a new request comes in for it; delay the GC when that happens?
    ** It had a lot of memory allocated - maybe it's faster and avoids fragmentation just to terminate the process and start a new one.
  • A user goes away to make coffee, we want to do compacting GC in all the active processes (the processes that didn't have a compacting GC recently).
  • The user launches a video game and Firefox is idle. Firefox may want to do compacting GC in multiple processes but it's sharing the system with a computer game.
  • The laptop / phone is on battery power, but we still need to avoid exhausting memory and ensure the system is responsive when memory is requested by Firefox or other apps.

There are probably other scenarios, and almost all of these are probably separate bugs. Some are not specific to Fission and might be best filed under Bug 1352524.

Plus there's now discussion in Bug 1661872 about what can be done about the slow first slices of the GCs in this profile.

See Also: → 1675554

Depends on D97714

There are some r+ patches which didn't land, and there has been no activity in this bug for 2 weeks.
:pbone, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(pbone)

It's fine, I have more WIP patches in my workspace.

Flags: needinfo?(pbone)

This refactoring will make the following patches make more sense.

Depends on D97715

Depends on: 1683031

Comment on attachment 9189057 [details]
Bug 1629064 - pt 1. Fix some comments in IdleSchedulerParent.cpp r=smaug

Revision D97714 was moved to bug 1683031. Setting attachment 9189057 [details] to obsolete.

Attachment #9189057 - Attachment is obsolete: true

Comment on attachment 9189058 [details]
Bug 1629064 - pt 2. Remove unused variable r=smaug

Revision D97715 was moved to bug 1683031. Setting attachment 9189058 [details] to obsolete.

Attachment #9189058 - Attachment is obsolete: true

Comment on attachment 9192363 [details]
Bug 1629064 - pt 3. Use a single linked list for idle task queues r=smaug

Revision D99329 was moved to bug 1683031. Setting attachment 9192363 [details] to obsolete.

Attachment #9192363 - Attachment is obsolete: true

Comment on attachment 9192364 [details]
Bug 1629064 - pt 4. IdleScheduler may now schedule multiple items from a loop r=smaug

Revision D99330 was moved to bug 1683031. Setting attachment 9192364 [details] to obsolete.

Attachment #9192364 - Attachment is obsolete: true
Blocks: 1685016

Work remaining:

  • [x] Add test
  • [x] Check that things don't get confused if PokeGC and JS-engine-initiated GCs are interleaved.
  • [x] Check that things don't get confused if a GC is aborted.
  • [x] Check that the scheduler accounts for JS-initiated GCs.
  • Add a timeout to the scheduler: if a GC is taking really long, we suspect the client is either crashing, swapping, or forgot to tell us it finished its GC. In any case, starting another GC won't make the situation (much) worse, and it's better than starving. The timeout in content processes will take care of this.
  • Add a fallback so content processes do a full GC anyway, to avoid starving if the parent always says no.
    Waiting for Bug 1692308.
    Will rely on the JS engine's thresholds.
  • [x] Shrinking GCs should be avoided if the parent delays the decision, since the user may no longer be idle. Or nsJSEnvironment should check if we're still idle.
  • [x] Check AWSY. Depending on how AWSY runs tests we may see no difference or increased memory usage. That's okay and hopefully doesn't affect real-world users.
  • [x] Test is too slow on debug builds; maybe a race causes the next GC events to get lost?
  • [x] Check interaction with content process shutdown. It probably already works.
    Spoke with Jesup: content processes exit without delay when they become unused, so this is fine.
  • Move MayGCNow to CCGCScheduler (in Bug 1703443)
  • [x] Add telemetry probes for how long content processes wait until they can do a GC.

After landing

  • [ ] Observe telemetry. We may see some increased memory usage, but it should be minor and worthwhile. We may also see decreased memory usage, since these changes may free memory sooner.
  • [ ] We may also see shorter GC times due to less contention, but probably not enough to notice in telemetry.
Whiteboard: fission-perf
Blocks: 1686149
Attachment #9195546 - Attachment description: Bug 1629064 - Add IdleScheduler messages for GC → Bug 1629064 - Add IdleScheduler messages for GC r=smaug
Attachment #9195547 - Attachment description: Bug 1629064 - nsJSEnvironment will ask the parent if it can GC → Bug 1629064 - nsJSEnvironment will ask the parent if it can GC r=smaug

RunNextCollectorTimer should run the GC with the reason saved in the
relevant GC timer when run from DOMWindowUtils.

I need this change to cause the tab to ask the parent process if it can GC
in my tests.

Depends on D105419

Depends on D105420

Blocks: 1693712
Attachment #9195546 - Attachment description: Bug 1629064 - Add IdleScheduler messages for GC r=smaug → Bug 1629064 - pt 1. Add IdleScheduler messages for GC r=smaug
Attachment #9195547 - Attachment description: Bug 1629064 - nsJSEnvironment will ask the parent if it can GC r=smaug → Bug 1629064 - pt 2. nsJSEnvironment will ask the parent if it can GC r=smaug
Attachment #9203626 - Attachment description: Bug 1629064 - Notify the observer service of GCs for testing → Bug 1629064 - pt 3. Notify the observer service of GCs for testing r=smaug
Attachment #9203627 - Attachment description: Bug 1629064 - RunNextCollectorTimer should not change the GC reason r=sfink → Bug 1629064 - pt 4. RunNextCollectorTimer should not change the GC reason r=sfink
Attachment #9203628 - Attachment description: Bug 1629064 - WIP add new test → Bug 1629064 - pt 6. Add a basic GC scheduling test r=smaug
Depends on: 1692308

There are still things to test that aren't done yet; this is going to miss the M7 milestone, so I'll move it to M7a.

I considered breaking it up and landing some now and some later, but it's still too close, with one important edge case untested.

Fission Milestone: M7 → M7a

Depends on D105957

Attachment #9203628 - Attachment description: Bug 1629064 - pt 6. Add a basic GC scheduling test r=smaug → Bug 1629064 - pt 7. Add a basic GC scheduling test r=smaug

Depends on D109854

Depends on D110378

Blocks: 1703443

I hoped to use Bug 1692308, but that bug didn't stick and I'm unclear why. The drawback of rebasing this to m-c without Bug 1692308 is that we won't have a time-based fallback in case the parent process doesn't reply to a content process to say it can GC. However, we still have allocation-based fallbacks implemented in the JS engine itself.

No longer depends on: 1692308, 1683031
Depends on: 1683031

Depends on D100857

Depends on D112925

Depends on D112926

Attachment #9195547 - Attachment is obsolete: true
Attachment #9212759 - Attachment is obsolete: true
Attachment #9203627 - Attachment is obsolete: true
Attachment #9212758 - Attachment is obsolete: true
Attachment #9217370 - Attachment description: Bug 1629064 - pt 2b. nsJSEnvironment will ask the parent if it can GC r=smaug → Bug 1629064 - pt 2c. nsJSEnvironment will ask the parent if it can GC r=smaug

We'd like to know if there are any problems with content processes being starved of the chance to clean up memory in a timely way. Add some telemetry to get a sense of this.

Depends on D109855

Attached file Data review request
Attachment #9218226 - Flags: data-review?(chutten)

I'm happy with this code and it can land pending:

  • Code review,
  • Data review,
  • sfink would like to land Bug 1683031, but we want to talk to smaug first about whether that is suitable. If Bug 1683031 lands first, I'll have to revert to the patches I wrote earlier.

Comment on attachment 9218226 [details]
Data review request

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry, so it can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

No. This collection will expire in Firefox 98.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

Yes. :pbone is responsible for renewing or removing the collection before it expires in Firefox 98.


Result: datareview+

Attachment #9218226 - Flags: data-review?(chutten) → data-review+

We decided to land Bug 1692308 so we now switch back to requiring that.

Depends on: 1692308

This seems like a better place for discussing the GC reason patch ( https://phabricator.services.mozilla.com/D112928 ).

The problem as I understand it: pbone wants to land a test that calls PokeGC() to initiate a PAGE_HIDE GC, which should wait for the parent process to allow it to proceed. But the test then calls RunNextCollectorTimer, which overrides the GC reason, and thus when it gets to GCTimerFired the reason doesn't match one of the reasons that wait for the parent's permission.

Simplest fixes:

  • make RunNextCollectorTimer, if called with reason=DOM_WINDOW_UTILS, inherit the previous reason rather than overriding it.
  • add DOM_WINDOW_UTILS to the list of reasons for which the parent process controls the timing

The first adds a bit of complexity to the reason computation, and also discards information that affects how the GC is performed. (eg, when looking at markers in the profiler, it would seem odd for a PAGE_HIDE GC to be all-zones.) The second seems a little dangerous because iiuc some test suites do GCs between tests to make behavior more predictable and prevent things from building up, and this change could delay the GC for a while, allowing the next test to start running before the GC happens. It seems like that might introduce some tricky intermittents? Or maybe that's not a problem, because it isn't a nonincremental GC in the first place so the only difference is whether the first slice happens promptly? (The purpose of preventing dead tabs/windows from building up would still be served.)

Do I have that right?

In the unlikely event that all of the above is correct, I would suggest adding a way to narrowly perform this test (which I think is an important test to have). Perhaps add an optional reason-string parameter to runNextCollectorTimer() (with optional_argc, I guess)? If you get it into a string, it seems like it could be parsed with something like:

for (int i = 0; i < int(JS::GCReason::NUM_TELEMETRY_REASONS); i++) {
  JS::GCReason reason = static_cast<JS::GCReason>(i);
  if (strcmp(JS::ExplainGCReason(reason), aReason) == 0) {
    return reason;
  }
}
Attachment #9212758 - Attachment is obsolete: false
Attachment #9195547 - Attachment is obsolete: false
Attachment #9217371 - Attachment description: Bug 1629064 - pt 4. RunNextCollectorTimer should not change the GC reason r=sfink → Bug 1629064 - pt 4. RunNextCollectorTimer should not change the GC reason r=smaug
Attachment #9218224 - Attachment description: Bug 1629064 - pt 10. Add telemetry r=smaug → Bug 1629064 - pt 11. Add telemetry r=smaug

The JS engine can start GCs based on allocation thresholds (for example).
We don't ask the parent if we should run one of these, but by letting the
parent know that's what's happening, it can schedule other GC requests
appropriately.
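One way the parent-side accounting could work, as a sketch (the class and method names here are illustrative, not the real IdleSchedulerParent API):

```cpp
#include <cassert>

// Sketch: the parent tracks how many GCs are in flight. A content process
// either asks first (MayGC) or, for a JS-engine-forced GC, just informs
// the parent (StartedGC) so that subsequent requests are throttled.
class GCCountingScheduler {
 public:
  explicit GCCountingScheduler(unsigned maxConcurrent) : mMax(maxConcurrent) {}

  // Request path: grant only while under the concurrency limit.
  bool MayGC() {
    if (mRunning >= mMax) {
      return false;
    }
    ++mRunning;
    return true;
  }

  // Notification path: the GC is happening regardless; account for it so
  // other processes' requests see the true level of contention.
  void StartedGC() { ++mRunning; }

  void FinishedGC() {
    if (mRunning > 0) {
      --mRunning;
    }
  }

  unsigned Running() const { return mRunning; }

 private:
  unsigned mRunning = 0;
  unsigned mMax;
};
```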

Depends on D113275

This no longer needs to be atomic and is always greater than zero.

Depends on D114706

Blocks: 1710543
Blocks: 1710552
Attachment #9217371 - Attachment is obsolete: true
Attachment #9217368 - Attachment is obsolete: true
Attachment #9217370 - Attachment is obsolete: true
Attachment #9217369 - Attachment is obsolete: true
Attachment #9204559 - Attachment description: Bug 1629064 - pt 5. Add a pref for the maximum number of concurrent GCs r=smaug → Bug 1629064 - pt 4. Add a pref for the maximum number of concurrent GCs r=smaug
Attachment #9211692 - Attachment description: Bug 1629064 - pt 6. Expose PokeGC from DOMWindowUtils r=smaug → Bug 1629064 - pt 5. Expose PokeGC from DOMWindowUtils r=smaug
Attachment #9203628 - Attachment description: Bug 1629064 - pt 7. Add a basic GC scheduling test r=smaug → Bug 1629064 - pt 6. Add a basic GC scheduling test r=smaug
Attachment #9211693 - Attachment description: Bug 1629064 - pt 8. Check that aborts do not confuse GC scheduler r=smaug → Bug 1629064 - pt 7. Check that aborts do not confuse GC scheduler r=smaug
Attachment #9211694 - Attachment description: Bug 1629064 - pt 9. Add a test for JS initiated GCs r=smaug → Bug 1629064 - pt 8. Add a test for JS initiated GCs r=smaug
Attachment #9212758 - Attachment description: Bug 1629064 - pt 10. Use the current idle slice to start the GC if we can r=smaug → Bug 1629064 - pt 9. Use the current idle slice to start the GC if we can r=smaug
Attachment #9218224 - Attachment description: Bug 1629064 - pt 11. Add telemetry r=smaug → Bug 1629064 - pt 10. Add telemetry r=smaug
Attachment #9221044 - Attachment description: Bug 1629064 - pt 13. Change type of sMaxConcurrentIdleTasksInChildProcesses r=smaug → Bug 1629064 - pt 11. Change type of sMaxConcurrentIdleTasksInChildProcesses r=smaug