Closed Bug 1785162 Opened 2 years ago Closed 1 year ago

[experimental] Delay crashing child processes when out of memory

Categories

(Core :: Memory Allocator, task)

Tracking

Status: RESOLVED FIXED
Target Milestone: 107 Branch
Tracking Status
firefox106 --- wontfix
firefox107 --- fixed

People

(Reporter: rkraesig, Assigned: rkraesig)

References

(Blocks 1 open bug)

Details

Attachments

(9 files)

Following bug 1716727, we stall in the parent process but not in child processes. (Technically, child processes do stall, but at most once over the lifetime of the process; this doesn't seem to have any useful effect.)

We should try stalling in child processes more meaningfully, to see what sort of problems crop up and where. Alternatively (or additionally), we could get rid of the "at most once" code and simply not stall in child processes at all.
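
For context, the stall-on-OOM behavior is roughly shaped like the sketch below. This is an illustration only, not the actual mozjemalloc code; the constants, the kStallDelayMs/kMaxStallAttempts names, and the direct VirtualAlloc call are hypothetical stand-ins:

#include <windows.h>
#include <cstddef>

// Hypothetical sketch: retry a failing allocation a bounded number of
// times, sleeping between attempts in the hope that memory frees up.
static void* AllocWithStall(std::size_t aSize) {
  constexpr DWORD kStallDelayMs = 50;    // hypothetical per-attempt delay
  constexpr int kMaxStallAttempts = 10;  // hypothetical retry budget

  for (int attempt = 0; attempt < kMaxStallAttempts; ++attempt) {
    void* p = ::VirtualAlloc(nullptr, aSize, MEM_COMMIT | MEM_RESERVE,
                             PAGE_READWRITE);
    if (p) {
      return p;  // allocation succeeded (possibly after stalling)
    }
    ::Sleep(kStallDelayMs);  // out of memory: stall, then retry
  }
  return nullptr;  // still OOM; the caller reports/crashes as before
}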

It's suspected that this may induce performance regressions. Do it
anyway, just to find out how bad it is. (But only on Nightly, for now.)

For simplicity's sake, this does not include any additional telemetry;
the decision of whether any additional telemetry is needed, and what form
it should take, is deferred until we have some feedback from what we've
already got.

Depends on D155300

Assignee: nobody → rkraesig
Status: NEW → ASSIGNED
Pushed by rkraesig@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/866131bd45bd
Always stall on OOM on Windows in Nightly  r=gsvelto
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 106 Branch

Whoops! This bug is for an ongoing experiment, and should have been tagged leave-open.

Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---

As shown in the attached graph, this patch has decreased child process OOMs-per-hour significantly. (Content processes have been omitted here for clarity, but the effect is similar to that of GPU processes.) However, the effect on main process crashes is obscured.

Main-process OOMs increased on 2022-08-12 or 2022-08-13, and decreased sometime between 2022-08-30 and 2022-09-01. (Main process memory usage generally appears to have increased during this time; see bug 1785209.) These dates don't correspond to any memory-allocator experiments... but :handyman and :spohl have noted that they do correspond to the landing and reversion of bug 965392. Tellingly, 32-bit builds (graph to follow) also show a parallel increase and decrease in main-process OOMs-per-hour around these dates, despite the OOM-stalling experiments having little to no effect on them (as expected).

If bug 965392 is indeed at fault for this, an eyeball-estimate suggests that D155301 has slightly increased 64-bit main-process OOMs from 0.001/hr to 0.0015/hr. While this is still a significant improvement from the pre-D151332 state (which a similar estimate puts at 0.003–0.004/hr), we may need to choose between a moderate reduction in OOMs across-the-board versus a stronger shifting of OOMs from main- to content-processes.

(In reply to Ray Kraesig [:rkraesig] from comment #5)

As shown in the attached graph, this patch has decreased child process OOMs-per-hour significantly. (Content processes have been omitted here for clarity, but the effect is similar to that of GPU processes.) However, the effect on main process crashes is obscured.

This is an amazing result and great analysis of the data, thanks Ray!

If bug 965392 is indeed at fault for this, an eyeball-estimate suggests that D155301 has slightly increased 64-bit main-process OOMs from 0.001/hr to 0.0015/hr. While this is still a significant improvement from the pre-D151332 state (which a similar estimate puts at 0.003–0.004/hr), we may need to choose between a moderate reduction in OOMs across-the-board versus a stronger shifting of OOMs from main- to content-processes.

We do need to prioritize the main process then, possibly by tweaking the delay behavior in content processes to ensure that they never take down the browser if it can be helped. I had suggested a few different approaches for this (such as letting content processes have a shorter delay-before-crash compared to the main process) but I'll leave it to you to pick the best approach.

A recent patch for bug 965392 (now reverted) may have been causing
additional memory use and OOM crashes. To confirm this, temporarily(?)
revert all-process stalling on Nightly to the main-process-only version
active between 2022-07-28 and 2022-08-25, to collect a few days' worth of
telemetry.

On Beta and later, there are no functional changes.

(In reply to Gabriele Svelto [:gsvelto] from comment #7)

We do need to prioritize the main process then, possibly by tweaking the delay behavior in content processes to ensure that they never take down the browser if it can be helped. I had suggested a few different approaches for this (such as letting content processes have a shorter delay-before-crash compared to the main process) but I'll leave it to you to pick the best approach.

That's probably the simplest option; and in the absence of other constraints, I suppose that makes it the best.

(On the one hand, I expect that it will still increase the main process OOM rate above the baseline of "don't stall at all in content processes", and that we'll have to make a decision as to what ratio we find acceptable. On the other hand, I don't expect it to increase the OOM rate much -- the increase may not even be measurable in Nightly.)

Reverts the preceding commit for this bug, since the baseline has been
(XXX: confirmed / reestablished).

Depends on D157124

Refactor to allow specifying a potentially variable stall-count and
-time, rather than a binary stall/don't-stall cue.

No functional changes; this facility will actually be used in the next
commit.

Depends on D157138
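
A rough sketch of the shape this refactor takes is below; the names and field types are illustrative, not necessarily the ones used in the patch:

#include <cstddef>

// Illustrative only: a stall "spec" carrying a count and a delay, so the
// decision is no longer a plain yes/no.
struct StallSpecs {
  std::size_t maxAttempts;  // how many times to retry before giving up
  std::size_t delayMs;      // how long to wait between attempts
};

// Callers that previously passed a boolean stall/don't-stall cue now ask
// for a full spec, which can vary by process type (see the next commit).
StallSpecs GetStallSpecs(bool aIsMainProcess);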

Retain the full stall period in the main process; stall for only half as
long in all other processes.

Depends on D157139
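
Building on the StallSpecs sketch above, the per-process policy described here looks roughly like this; the constants and the choice to halve the attempt count (rather than the per-attempt delay) are illustrative assumptions, not necessarily what the patch does:

StallSpecs GetStallSpecs(bool aIsMainProcess) {
  // Hypothetical baseline values for the full stall period.
  constexpr std::size_t kMaxAttempts = 10;
  constexpr std::size_t kDelayMs = 50;

  if (aIsMainProcess) {
    return StallSpecs{kMaxAttempts, kDelayMs};  // full stall period
  }
  // Non-main processes stall for only half as long overall; here that's
  // modeled by halving the number of retry attempts.
  return StallSpecs{kMaxAttempts / 2, kDelayMs};
}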

Pushed by rkraesig@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/163fd86fb687
Disable all-process stalling in Nightly r=gsvelto
Attachment #9294297 - Attachment description: WIP: Bug 1785162 - Revert "Disable all-process stalling in Nightly" → Bug 1785162 - [1/3] Revert "Disable all-process stalling in Nightly" r?glandium,gsvelto
Attachment #9294299 - Attachment description: WIP: Bug 1785162 - [n-1/n] refactor to specify stall count and time → Bug 1785162 - [2/3] refactor to specify stall count and time r?glandium,gsvelto
Attachment #9294300 - Attachment description: WIP: Bug 1785162 - [n/n] stall for half time in non-main processes → Bug 1785162 - [3/3] stall for half time in non-main processes r?glandium,gsvelto
Attachment #9294299 - Attachment description: Bug 1785162 - [2/3] refactor to specify stall count and time r?glandium,gsvelto → Bug 1785162 - [2/3] refactor to specify stall count and time r?glandium!,gsvelto
Attachment #9294300 - Attachment description: Bug 1785162 - [3/3] stall for half time in non-main processes r?glandium,gsvelto → Bug 1785162 - [3/3] stall for half time in non-main processes r?glandium!,gsvelto
Pushed by rkraesig@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f59707a8ef30
[1/3] Revert "Disable all-process stalling in Nightly"  r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/97f0c3d59ed9
[2/3] refactor to specify stall count and time  r=glandium,gsvelto
https://hg.mozilla.org/integration/autoland/rev/e297c5424088
[3/3] stall for half time in non-main processes  r=gsvelto

The past couple of weeks of crash telemetry indicate that stalling on
OOM in child processes for half as long as in main processes drastically
reduces child process OOM crashes without noticeably increasing
main-process OOM crashes.

Give it a ticket and send it on its way.

Pushed by rkraesig@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/455674a11a81
Allow child-process half-stalls to ride the trains  r=glandium,gsvelto
Regressions: 1794161
No longer regressions: 1794161

It seems like all these patches landed in Firefox 107. Can this be closed?

Flags: needinfo?(rkraesig)

I'd prefer not to, yet: I'd planned to use at least two weeks post-release to collect data (and give latent problems time to surface), and I'd rather not change experimental protocol midstream just because the initial data looks good.

(Even if it looks really good.)

Flags: needinfo?(rkraesig)

Well, it's been two and a half weeks, and the data... still looks really good! 🎉

There's a slight apparent rise in main-process OOM crashes in the most recent point release, but that's only been out for a couple of days — it's already started to level off (0.00134 yesterday vs. 0.00114 today), and I expect that to continue as more data comes in.

Status: REOPENED → RESOLVED
Closed: 2 years ago → 1 year ago
Resolution: --- → FIXED
Target Milestone: 106 Branch → 107 Branch

There was a brief OOM spike across all branches on December 30th (bug 1803675), so you might be seeing that.

