Open Bug 1938084 - Opened 1 year ago, Updated 11 months ago

Avoid repeated long cycle collections

Categories

(Core :: Cycle Collector, defect)

People

(Reporter: florian, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

BHR (Background Hang Reporter) data for content processes (https://fqueze.github.io/hang-stats/child/) shows stacks containing "Incremental CC" accounting for more than 30% of the total hang time of content processes.
We suspect this comes primarily from websites that leak large amounts of memory, leading to long cycle collections that don't actually reclaim much memory. This is bad both for responsiveness and for power use.

While discussing this with members of the Performance team, multiple suggestions emerged:

  • stop the script in the affected page, using a UI similar to the one we show for pages that have unresponsive scripts (there are challenges, as we don't have a way to show that UI for background tabs)
  • kill affected content processes. They spend all the CPU time they can get doing ineffective cycle collection, and browsing a tab that lives in them is a very unpleasant experience, so killing the process to have the tabs reload in a fresh content process might be a better user experience, if the data loss is limited.
  • accept that memory that has not been reclaimed after 3 cycle collections likely won't be reclaimed by the next cycle collections either, and from then on only run the cycle collector on new memory in that content process (something was said about creating a new generation, or making nodes black, but I don't remember the full details). Whenever a top-level document is closed, a new full cycle collection would be triggered. (A rough sketch of this idea follows below.)
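
To make that last suggestion a bit more concrete, here is a hedged sketch of what the heuristic could look like. All the names and thresholds below are hypothetical illustrations, not the actual cycle collector code; the point is just "give up on the old graph after a few ineffective collections, until a top-level document goes away":

```cpp
// Hypothetical sketch, not Gecko's CC API.
#include <cstddef>
#include <cstdint>

struct CCResult {
  size_t suspectedNodes;  // nodes traversed by this collection
  size_t freedNodes;      // nodes actually unlinked/freed
};

class IneffectiveCCTracker {
 public:
  // Record the outcome of a completed cycle collection.
  void NoteCollection(const CCResult& aResult) {
    const double freedRatio =
        aResult.suspectedNodes == 0
            ? 1.0
            : double(aResult.freedNodes) / double(aResult.suspectedNodes);
    if (freedRatio < kUselessRatio) {
      ++mConsecutiveIneffective;
    } else {
      mConsecutiveIneffective = 0;
    }
  }

  // After N ineffective collections in a row, only consider new memory
  // (e.g. by treating the old graph as "black") until a top-level
  // document goes away.
  bool ShouldRestrictToNewMemory() const {
    return mConsecutiveIneffective >= kMaxIneffective;
  }

  // A top-level document was closed: allow a full collection again.
  void NoteTopLevelDocumentClosed() { mConsecutiveIneffective = 0; }

 private:
  static constexpr double kUselessRatio = 0.05;   // <5% of the graph freed
  static constexpr uint32_t kMaxIneffective = 3;  // "3 cycle collections"
  uint32_t mConsecutiveIneffective = 0;
};
```

The 5% and 3-collection thresholds are placeholders; real values would need to be measured.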

Olli, can you help identify the next step (or the next person to needinfo)?

Flags: needinfo?(smaug)

If there is only one tab for the process, reloading it will likely wipe out all of the leakage (since you know which tab caused it). Reload-with-new-process in the one-tab case would guarantee a clean process, at only a small perf hit. When there are multiple tabs, killing the process will have surprising impacts from a user perspective; you could, if the user agrees, kill and reload all of them - but it will be unclear to the user which tabs will be reloaded (since it will be roughly 1/4 of all tabs to that site).

We could stop and just send a memory report request to that one process (instead of all of them), then use the result to identify the document with the largest memory usage and reload that one - smaug, WDYT?
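
The selection step itself is simple once per-document numbers are available. A hedged sketch (the types below are stand-ins for whatever a per-process memory report would actually return, not an existing API):

```cpp
// Hypothetical sketch: pick the heaviest document from one process's report.
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct DocumentReport {
  std::string url;
  size_t bytes;  // memory attributed to this top-level document
};

// Given per-document totals from one content process, pick the document
// to reload: the one accounting for the most memory.
const DocumentReport* PickDocumentToReload(
    const std::vector<DocumentReport>& aReports) {
  if (aReports.empty()) {
    return nullptr;
  }
  return &*std::max_element(
      aReports.begin(), aReports.end(),
      [](const DocumentReport& a, const DocumentReport& b) {
        return a.bytes < b.bytes;
      });
}
```

The hard part is attributing memory to documents reliably; this only shows the "pick the heaviest one and reload it" decision.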

Maybe we could build on the tab unloading stuff once a decision has been made to get rid of a page?

Reloading a page in a shared process will fix page leaks but not Gecko leaks; hopefully the latter are rarer.

Gecko leaks are far rarer and much smaller.

I can see a huge amount of CC activity when working with this Google spreadsheet. CPU and memory usage increase quite a lot (Gecko profile) each time I try to search for tests in the Test Results: Firefox sheet.

When I stay idle, the CC starts and keeps the CPU busy at around 80% indefinitely. Here is a Gecko profile showing that:
https://share.firefox.dev/3WdW8hN

Shall I file that as a separate bug, or is it fine to keep it here? It's kinda severe for me, given that my main process and the web content process usually take up 10 GB of memory.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #6)

> Shall I file that as a separate bug, or is it fine to keep it here? It's kinda severe for me, given that my main process and the web content process usually take up 10 GB of memory.

Actually, for that I've already filed bug 1938051.

Another thing we've discussed before is detecting when a process has "gone bad", not putting any new pages in it, and hoping the existing pages get closed so that we can then kill the process without disrupting the user. Our process selection code would need to be made fancier for that.
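
As a very rough illustration (hypothetical types, not the real process-selection code): skip processes that have been marked bad, prefer a healthy process with the fewest tabs, and spawn a new process if nothing healthy is left.

```cpp
// Hypothetical sketch of "don't put new pages in a bad process".
#include <cstdint>
#include <vector>

struct ContentProcessInfo {
  uint32_t pid = 0;
  uint32_t tabCount = 0;
  bool markedBad = false;  // e.g. repeated long, ineffective CCs observed
};

// Prefer a healthy process with the fewest tabs; return nullptr if every
// process is bad, in which case the caller would spawn a fresh process.
ContentProcessInfo* SelectProcessForNewPage(
    std::vector<ContentProcessInfo>& aProcesses) {
  ContentProcessInfo* best = nullptr;
  for (ContentProcessInfo& proc : aProcesses) {
    if (proc.markedBad) {
      continue;
    }
    if (!best || proc.tabCount < best->tabCount) {
      best = &proc;
    }
  }
  return best;
}
```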

One thing I've vaguely experimented with before is a "ghost buster" that would run the unlink method on a ghost window, but it isn't clear how much that would help versus cause random null-deref crashes.
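
Purely as a sketch of the idea (GhostWindow and Unlink() below are placeholders, not real Gecko classes): walk the known ghost windows and force-unlink each one, accepting that anything still holding a reference must now cope with a torn-down window, which is exactly the null-deref risk mentioned above.

```cpp
// Hypothetical sketch of a "ghost buster" pass.
#include <vector>

struct GhostWindow {
  bool unlinked = false;
  // Forcibly drop this window's references to break the cycles keeping it
  // alive. After this, any code still touching the window must null-check.
  void Unlink() { unlinked = true; }
};

// Walk the detected ghost windows and unlink each one. Whether this frees
// meaningful memory or just trades leaks for crashes is the open question.
void RunGhostBuster(std::vector<GhostWindow*>& aGhosts) {
  for (GhostWindow* ghost : aGhosts) {
    if (ghost && !ghost->unlinked) {
      ghost->Unlink();
    }
  }
}
```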
