[experimental] Delay crashing child processes when out of memory
Categories: Core :: Memory Allocator (task)
People: Reporter: rkraesig; Assigned: rkraesig
References: Blocks 1 open bug
Attachments
(9 files)
- 48 bytes, text/x-phabricator-request
- 116.35 KB, image/png
- 102.10 KB, image/png
- 48 bytes, text/x-phabricator-request
- 48 bytes, text/x-phabricator-request
- 48 bytes, text/x-phabricator-request
- 48 bytes, text/x-phabricator-request
- 48 bytes, text/x-phabricator-request
- 78.65 KB, image/png
Following bug 1716727, we stall on OOM in the parent process, but not in child processes. (Technically, we stall at most once over the lifetime of the process; this doesn't seem to have any useful effect.)
We should try stalling in child processes more meaningfully, to see what sort of problems crop up and where. Alternatively (or additionally), we could get rid of the "at most once" code and simply never stall in child processes.
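For context, the stall-on-OOM mechanism from bug 1716727 works roughly like this (a hedged sketch only: the retry loop, `AllocWithStall`, `kStallAttempts`, and `kStallDuration` are illustrative names and values, not the actual mozjemalloc code):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>

// Illustrative constants; the real values live in the allocator.
constexpr int kStallAttempts = 10;
constexpr std::chrono::milliseconds kStallDuration{50};

// On allocation failure, sleep briefly and retry a bounded number of
// times before giving up, on the theory that the OS (or another
// process) may free up memory in the meantime.
void* AllocWithStall(std::size_t aSize) {
  for (int attempt = 0; attempt < kStallAttempts; ++attempt) {
    if (void* p = std::malloc(aSize)) {
      return p;  // success, possibly after one or more stalls
    }
    std::this_thread::sleep_for(kStallDuration);
  }
  return nullptr;  // caller escalates this into a controlled OOM crash
}
```

The point of the experiment below is deciding which process types run this loop, and for how long.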
Comment 1•2 years ago (Assignee)
It's suspected that this may induce performance regressions. Do it
anyway, just to find out how bad it is. (But only on Nightly, for now.)
For simplicity's sake, this does not include any additional telemetry;
the decision of whether any additional telemetry is needed, and what
kind, will be deferred until we have feedback from what we've already got.
Depends on D155300
Updated•2 years ago
Comment 2•2 years ago
Pushed by rkraesig@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/866131bd45bd Always stall on OOM on Windows in Nightly r=gsvelto
Comment 3•2 years ago
bugherder
Comment 4•2 years ago (Assignee)
Whoops! This bug is for an ongoing experiment, and should have been tagged leave-open.
Comment 5•2 years ago (Assignee)
As shown in the attached graph, this patch has decreased child process OOMs-per-hour significantly. (Content processes have been omitted here for clarity, but the effect is similar to that of GPU processes.) However, the effect on main process crashes is obscured.
Main-process OOMs increased on 2022-08-12 or 2022-08-13, and decreased sometime between 2022-08-30 and 2022-09-01. (Main process memory usage generally appears to have increased during this time; see bug 1785209.) These dates don't correspond to any memory-allocator experiments... but :handyman and :spohl have noted that they do correspond to the landing and reversion of bug 965392. Tellingly, 32-bit builds (graph to follow) also show a parallel increase and decrease in main-process OOMs-per-hour around these dates, despite the OOM-stalling experiments having little to no effect on them (as expected).
If bug 965392 is indeed at fault for this, an eyeball-estimate suggests that D155301 has slightly increased 64-bit main-process OOMs from 0.001/hr to 0.0015/hr. While this is still a significant improvement over the pre-D151332 state (which a similar estimate puts at 0.003–0.004/hr), we may need to choose between a moderate reduction in OOMs across the board versus a stronger shifting of OOMs from main to content processes.
Comment 6•2 years ago (Assignee)
Comment 7•2 years ago
(In reply to Ray Kraesig [:rkraesig] from comment #5)
> As shown in the attached graph, this patch has decreased child process OOMs-per-hour significantly. (Content processes have been omitted here for clarity, but the effect is similar to that of GPU processes.) However, the effect on main process crashes is obscured.
This is an amazing result and great analysis of the data, thanks Ray!
> If bug 965392 is indeed at fault for this, an eyeball-estimate suggests that D155301 has slightly increased 64-bit main-process OOMs from 0.001/hr to 0.0015/hr. While this is still a significant improvement from the pre-D151332 state (which a similar estimate puts at 0.003–0.004/hr), we may need to choose between a moderate reduction in OOMs across-the-board versus a stronger shifting of OOMs from main- to content-processes.
We do need to prioritize the main process then, possibly by tweaking the delay behavior in content processes to ensure that they never take down the browser if it can be helped. I had suggested a few different approaches for this (such as letting content processes have a shorter delay-before-crash compared to the main process) but I'll leave it to you to pick the best approach.
Comment 8•2 years ago (Assignee)
A recent patch for bug 965392 (now reverted) may have been causing
additional memory use and OOM-crashes. To confirm this, temporarily(?)
revert all-process stalling on Nightly to the main-process-only version
active between 2022-07-28 and 2022-08-25, to collect a few days' worth of
telemetry.
On Beta and later, there are no functional changes.
Comment 9•2 years ago (Assignee)
(In reply to Gabriele Svelto [:gsvelto] from comment #7)
> We do need to prioritize the main process then, possibly by tweaking the delay behavior in content processes to ensure that they never take down the browser if it can be helped. I had suggested a few different approaches for this (such as letting content processes have a shorter delay-before-crash compared to the main process) but I'll leave it to you to pick the best approach.
That's probably the simplest option, and in the absence of other constraints, I suppose that makes it the best.
(On the one hand, I expect that it will still increase the main process OOM rate above the baseline of "don't stall at all in content processes", and that we'll have to make a decision as to what ratio we find acceptable. On the other hand, I don't expect it to increase the OOM rate much -- the increase may not even be measurable in Nightly.)
Comment 10•2 years ago (Assignee)
Reverts the preceding commit for this bug, since the baseline has been
(XXX: confirmed / reestablished).
Depends on D157124
Comment 11•2 years ago (Assignee)
Refactor to allow specifying a potentially variable stall-count and
-time, rather than a binary stall/don't-stall cue.
No functional changes; this facility will actually be used in the next
commit.
Depends on D157138
Comment 12•2 years ago (Assignee)
Retain the full stall period in the main process; stall for only half as
long in all other processes.
Depends on D157139
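Taken together, the refactor in comment 11 and the policy in comment 12 might look roughly like this (a hedged sketch: `StallSpecs`, `GetStallSpecs`, and the baseline constants are illustrative, not the actual mozjemalloc identifiers or values):

```cpp
#include <chrono>

// A stall policy is no longer a binary stall/don't-stall cue, but a
// count of retry attempts plus a per-attempt delay (comment 11).
struct StallSpecs {
  int maxAttempts;                    // how many times to retry the allocation
  std::chrono::milliseconds delayMs;  // how long to sleep between attempts
};

// Assumed baseline budget for the main process (illustrative values).
constexpr StallSpecs kMainProcessStall{10, std::chrono::milliseconds{100}};

// The main process keeps the full stall budget; all other process
// types get half as many attempts, i.e. half the total stall time
// (comment 12).
StallSpecs GetStallSpecs(bool aIsMainProcess) {
  if (aIsMainProcess) {
    return kMainProcessStall;
  }
  return {kMainProcessStall.maxAttempts / 2, kMainProcessStall.delayMs};
}
```

Halving the attempt count rather than the per-attempt delay is one of several ways to implement "half as long"; the thread does not specify which knob was turned.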
Comment 13•2 years ago
Pushed by rkraesig@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/163fd86fb687 Disable all-process stalling in Nightly r=gsvelto
Comment 14•2 years ago
bugherder
Comment 15•2 years ago
Pushed by rkraesig@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f59707a8ef30 [1/3] Revert "Disable all-process stalling in Nightly" r=gsvelto https://hg.mozilla.org/integration/autoland/rev/97f0c3d59ed9 [2/3] refactor to specify stall count and time r=glandium,gsvelto https://hg.mozilla.org/integration/autoland/rev/e297c5424088 [3/3] stall for half time in non-main processes r=gsvelto
Comment 16•2 years ago
bugherder
Comment 17•2 years ago (Assignee)
The past couple of weeks of crash telemetry indicate that stalling on
OOM in child processes for half as long as in main processes drastically
reduces child process OOM crashes without noticeably increasing
main-process OOM crashes.
Give it a ticket and send it on its way.
Comment 18•2 years ago
Pushed by rkraesig@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/455674a11a81 Allow child-process half-stalls to ride the trains r=glandium,gsvelto
Comment 19•2 years ago
bugherder
Comment 20•1 year ago
It seems like all these patches landed in Firefox 107. Can this be closed?
Comment 21•1 year ago (Assignee)
I'd prefer not to, yet: I'd planned to use at least two weeks post-release to collect data (and give latent problems time to surface), and I'd rather not change experimental protocol midstream just because the initial data looks good.
(Even if it looks really good.)
Comment 22•1 year ago (Assignee)
Well, it's been two and a half weeks, and the data... still looks really good! 🎉
There's a slight apparent rise in main-process OOM crashes in the most recent point release, but that release has only been out for a couple of days; the rate has already started to level off (0.00134 yesterday vs. 0.00114 today), and I expect that to continue as more data comes in.
Comment 23•1 year ago
There was a brief OOM spike across all branches on December 30th (bug 1803675), so you might be seeing that.