Closed Bug 1716727 Opened 4 years ago Closed 3 years ago

Delay crashing the main process when running out of memory

Categories

(Core :: Memory Allocator, enhancement)

Unspecified
Windows
enhancement

Tracking


RESOLVED FIXED
105 Branch
Tracking Status
relnote-firefox --- 105+
firefox105 --- fixed

People

(Reporter: gsvelto, Assigned: rkraesig)

References

(Blocks 4 open bugs)

Details

(Whiteboard: [win:stability])

Attachments

(8 files, 1 obsolete file)

Wacky idea I got while discussing bug 1699744. On Linux we have a way to instruct the OOM killer to reap the content processes before the main process when the memory is tight. While we don't use this yet it's something that's feasible. On Windows however there's no such mechanism because there's no overcommit support nor an OOM killer, and we'll be crashing processes on our own when memory gets tight.

So we could delay crashing the main process when we hit an OOM to make it more likely that a content process gets killed and the OOM scenario recedes. Crudely, it could be something like adding a gOOMDelayedCrash boolean in mozalloc_oom.h/cpp - which would be set by the main process - and then tweaking mozalloc_handle_oom() so that it does something like:

void mozalloc_handle_oom(size_t size) {
  if (gOOMDelayedCrash) {
    gOOMDelayedCrash = false;
    Sleep(100); // Sleep for 100ms hoping some other process dies
    return;
  }

  // The rest behaves like it used to
  ...
}

Is this too crazy? I know it'd brutally yank the UI and I know the implementation should be a little bit more involved because of multiple threads hitting the condition at the same time (and re-setting the state if we actually survive the OOM), but you get the idea.

See Also: → 1699744

Maybe we could do something like hard kill child processes in this state? Maybe you can do that without allocating anything...

(In reply to Andrew McCreight [:mccr8] from comment #1)

Maybe we could do something like hard kill child processes in this state? Maybe you can do that without allocating anything...

It's complicated because we'd need to find a suitable candidate, retrieve its handle and call TerminateProcess() on it. We would still have to wait before trying a new allocation because TerminateProcess() is asynchronous, so there's no guarantee that the process has actually been killed by the time the call returns.

Quick update: I tried changing how mozalloc_handle_oom() behaves to implement this but ran into a pretty big stumbling block: mozalloc_handle_oom() is used all over the place in our codebase, and the code calling it makes different assumptions about how it behaves.

So maybe I should implement this without touching mozalloc_handle_oom(), and then audit the codebase for odd calls to mozalloc_handle_oom().

Attached file commit-limit-tests.cpp

(In reply to Gabriele Svelto [:gsvelto] from comment #0)

So we could delay crashing the main process when we hit an OOM to make it more likely that a content process gets killed and the OOM scenario recedes.

I believe this has turned out to be the right idea for the wrong reasons. Rather than just delay the main process on OOM and hope a content process gets killed instead, we probably want to delay any process on OOM and hope that Windows expands the page file before retrying. (Although doing that more in the main process might not be a bad idea...)

This seems to be why crash reporter telemetry so often reports that there's plenty of memory to be had. I'm sure it occasionally happens that another process crashes while we're crashing, but it's likely more often the case that Windows has expanded the page file size between the allocation failing and the memory-status query for telemetry. Surprisingly, once you know what to look for, it's even possible to find Microsoft documentation supporting this idea.

Attached is a simple C++ program which induces a very-low-memory state and simulates the application of this hack. Looks like it mostly works?

Keywords: leave-open
Whiteboard: [win:stability]

Some trivially-resolvable nitpicks pointed out by clang-tidy.

Assignee: nobody → rkraesig
Status: NEW → ASSIGNED

Implement a drop-in-replacement wrapper for VirtualAlloc which, rather
than returning immediately on failure, instead ::Sleep()s and retries.

This will cause performance regressions in some (presumed-uncommon)
circumstances; this wrapper is therefore only enabled in Nightly, to
collect data on its efficacy and on the severity of those regressions.

Depends on D150618
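The stall-and-retry idea above can be sketched in portable C++. This is a minimal illustration only, not the actual mozjemalloc patch: the function name, the retry count, and the 100 ms interval are assumptions, and std::malloc / std::this_thread::sleep_for stand in for ::VirtualAlloc and ::Sleep.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>

// Hypothetical limits; the real values would be tuned from telemetry.
constexpr int kMaxStalls = 10;
constexpr auto kStallInterval = std::chrono::milliseconds(100);

// Rather than returning nullptr immediately on failure, stall and retry,
// hoping the OS frees memory (e.g. Windows expands the page file).
void* StallAndRetryAlloc(size_t aSize) {
  for (int attempt = 0;; ++attempt) {
    void* p = std::malloc(aSize);  // stands in for ::VirtualAlloc
    if (p || attempt >= kMaxStalls) {
      return p;  // success, or give up and let the caller handle the OOM
    }
    std::this_thread::sleep_for(kStallInterval);
  }
}
```

On a successful first attempt the wrapper behaves exactly like the underlying allocator; the stalls only kick in on failure, which is why the regression risk is confined to already-low-memory situations.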

Attachment #9283460 - Attachment description: Bug 1716727 - [1/2] Drive-by cleanup: nitpicks r?glandium → Bug 1716727 - [1/2] Drive-by cleanup: clang-tidy nitpicks r?glandium
Attachment #9283460 - Attachment description: Bug 1716727 - [1/2] Drive-by cleanup: clang-tidy nitpicks r?glandium → Bug 1716727 - [1/3] Drive-by cleanup: clang-tidy nitpicks r?glandium
Attachment #9283461 - Attachment description: Bug 1716727 - [2/2] OOM handling: stall and retry r?glandium,gsvelto → Bug 1716727 - [2/3] OOM handling: stall and retry r?glandium,gsvelto

Make Set_XREProcessType communicate the current process type to
mozjemalloc. Use this to avoid stalling repeatedly in auxiliary
processes.

Depends on D150519.

Depends on D150619
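The process-type gating described above might look roughly like the following sketch. The enum, setter, and predicate names are illustrative stand-ins for Set_XREProcessType / GeckoProcessType, not the real Gecko API surface.

```cpp
#include <atomic>

// Illustrative stand-in for GeckoProcessType.
enum class ProcessType { Main, Content, Other };

std::atomic<ProcessType> gProcessType{ProcessType::Main};

// Stand-in for the Set_XREProcessType -> mozjemalloc plumbing.
void SetProcessType(ProcessType aType) { gProcessType = aType; }

// Only the main process stalls on OOM; auxiliary processes fail fast so
// that killing them can relieve memory pressure for the main process.
bool ShouldStallOnOOM() {
  return gProcessType.load() == ProcessType::Main;
}
```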

Blocks: 1256224
Pushed by rkraesig@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c0fd4f429cd8 [1/3] Drive-by cleanup: clang-tidy nitpicks r=glandium
https://hg.mozilla.org/integration/autoland/rev/895ad6545236 [2/3] OOM handling: stall and retry r=glandium,gsvelto
https://hg.mozilla.org/integration/autoland/rev/11fb3262cccb [3/3] make stalling behavior conditional on process type r=glandium
Pushed by bszekely@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/e701e18d1944 [1/3] Drive-by cleanup: clang-tidy nitpicks r=glandium
https://hg.mozilla.org/mozilla-central/rev/1623fcafb8be [2/3] OOM handling: stall and retry r=glandium,gsvelto
https://hg.mozilla.org/mozilla-central/rev/2cec83cd66fe [3/3] make stalling behavior conditional on process type r=glandium
Regressions: 1782027

Does this bug still need to be open?

I'm going to say "yes". At a minimum, we'll need to remove the current nightly-only experimental code once the experiment's done; but this is also a good place to evaluate the results or propose further refinements, and then to decide what should replace it.

Flags: needinfo?(rkraesig)
Blocks: 1782178

(In reply to Ryan VanderMeulen [:RyanVM] from comment #12)

Does this bug still need to be open?

I agree with Ray's assessment; this is a bit of a shot in the dark given that we don't know the effects. We want to evaluate them before we reach the next code freeze: if it worked by then, we'll let it ride to beta; if it didn't, we'll back the changes out.

For the current experimental code, since the stalling behavior is conditional on the process type (mostly only occurring in the main process), we should expect to see, at minimum, a significant shift in OOM crashes from main processes to content processes.

https://sql.telemetry.mozilla.org/queries/87122#215879

Initial results are quite promising! (There may also have been a drop in total OOM crashes, but it's too soon to say that.)

Since the plausible timeframe for immediate panicked reversion of the
previous patchset has passed without incident, unpin and clean up the
remaining uses of sChildProcessType from nsEmbedFunctions.cpp.

Additionally, get rid of the rump declaration of ..._set_always_stall
on non-Windows builds.

No functional changes.

(Reference: bug 1682520, D152198.)

It looks like some variant of the stalling code will indeed be here to
stay! \o/

In support of this, move all the logic about whether to stall out of
mozjemalloc itself, and into a function supplied to mozjemalloc by
client code (where, for the sake of further experimentation, it may
depend on potentially arbitrary nonsense like userprefs).

No functional changes.
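Moving the stall decision out of the allocator and into client code is a classic callback-registration pattern; a minimal sketch follows. The hook name and signature here are assumptions for illustration, not mozjemalloc's actual interface.

```cpp
// The client (e.g. Gecko) supplies the policy; the allocator only asks.
using StallDecisionFn = bool (*)();

static StallDecisionFn gShouldStall = nullptr;

// Hypothetical registration entry point, called once at startup. The
// registered callback may consult process type, userprefs, etc.
void jemalloc_set_stall_hook(StallDecisionFn aFn) { gShouldStall = aFn; }

// Called by the allocator on allocation failure.
bool AllocatorShouldStall() {
  // Default to not stalling if the embedder registered no policy.
  return gShouldStall ? gShouldStall() : false;
}
```

Keeping the default at "don't stall" preserves the old fail-fast behavior for embedders (like standalone SpiderMonkey) that never register a policy.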

I quickly wrote this query to directly compare the number of OOM crashes in the main process before and after we landed the change:

https://sql.telemetry.mozilla.org/queries/87187

This takes into account that not everybody is on an up-to-date nightly, which makes the improvement more visible. To get an even better feel for the impact we should average the number of crashes over the number of usage hours and measure session length, but I don't remember how to do that.

Blocks: 1784033

I had to take some last-minute PTO and I'll be off until next Monday. Given that the code freeze is almost upon us and this is working, I propose the following: let the patch that already landed ride the trains and close this bug. Let's open separate bugs for the following steps; I've already opened one for hooking VirtualAlloc(), and we should have one for trying to delay child processes too, and so on. Let's then try them one by one for each new version of nightly so we get a good feel for their impact. Ray, Mike, WDYT?

Well, letting the patch that's currently landed ride the trains won't have much of an effect, simply because it's #ifdefd to only do anything on Nightly. I have no fundamental objections to changing that, though.

I also have no particular opinion on how to organize subproblems amongst bugs, except that we probably want a metabug if we're going to split it up any farther.

(In reply to Ray Kraesig [:rkraesig] from comment #20)

Well, letting the patch that's currently landed ride the trains won't have much of an effect, simply because it's #ifdefd to only do anything on Nightly. I have no fundamental objections to changing that, though.

Yeah, let's change that and let it ride the trains.

I also have no particular opinion on how to organize subproblems amongst bugs, except that we probably want a metabug if we're going to split it up any farther.

That's fine by me!

(In reply to Gabriele Svelto [:gsvelto] (PTO until 21/08) from comment #19)

I had to take some last-minute PTO and I'll be off until next Monday. Given that the code freeze is almost upon us and this is working, I propose the following: let the patch that already landed ride the trains and close this bug. Let's open separate bugs for the following steps; I've already opened one for hooking VirtualAlloc(), and we should have one for trying to delay child processes too, and so on. Let's then try them one by one for each new version of nightly so we get a good feel for their impact. Ray, Mike, WDYT?

I think if you and Ray are comfortable with the current Nightly patch riding to release, I'm good with that, too. The benefits seem undeniable, and the risks seem minimal.

Once we land the patch to let this ride to release, we should probably nominate this for a release note as well (we already have a similar one for Linux via bug 1771712). Note that the Nightly soft freeze also starts on Thursday, so getting that patch landed soon would be appreciated :)

Flags: needinfo?(rkraesig)

Although further experiments and refinements are called for, the effect
of this patch has been overall significantly positive. Let it ride the
trains as-is for Fx 105, rather than blocking on those experiments.

(This patch conflicts with D153784, which will need to be rebased and
moved to a new Bugzilla bug.)

Depends on D153783

(In reply to Ryan VanderMeulen [:RyanVM] from comment #23)

Once we land the patch to let this ride to release, we should probably nominate this for a release note as well (we already have a similar one for Linux via bug 1771712). Note that the Nightly soft freeze also starts on Thursday, so getting that patch landed soon would be appreciated :)

Copy that! (And enqueued for review.)

Flags: needinfo?(rkraesig)
Flags: needinfo?(rkraesig)
Pushed by rkraesig@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/53a16a2dd4ed patchset #2 [1/2] - Assorted minor cleanup r=glandium,bobowen
https://hg.mozilla.org/integration/autoland/rev/c64d8efb03c0 patchset #2 [2/2] Deexperimentalize stalling code r=glandium

Release Note Request (optional, but appreciated)
[Why is this notable]: Significantly reduces main-process out-of-memory crashes, effectively offloading them to content processes.
[Affects Firefox for Android]: No.
[Suggested wording]: "On Windows, Firefox itself will be more likely to survive briefly running out of memory, even if individual tabs crash."
[Links (documentation, blog post, etc)]: —

relnote-firefox: --- → ?
Flags: needinfo?(rkraesig)

Added to the Fx105 relnotes.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 105 Branch

Since mozglue is now aware of Gecko process-types, remove the
configurator for this bug's associated experiment. Instead, use
GeckoProcessType directly in mozjemalloc.

This requires a couple of adjustments for non-Firefox uses of
mozjemalloc:

  • Since logalloc-replay uses mozjemalloc in an odd way, tweak its
    moz.build to also include mozglue/misc/ProcessType.cpp.
  • Since GeckoProcessType is not available in standalone SpiderMonkey
    builds, make the mostly-arbitrary choice to always stall there.

(The above is just a cleanup patch; it should have no functional changes, and does not need consideration for uplift.)

Attachment #9291027 - Attachment description: Bug 1716727 - patchset #3 [1/1] - Remove intermediate function r?glandium,gsvelto → Bug 1716727 - patchset #3 [1/1] - Cleanup: remove intermediate function r?glandium,gsvelto
See Also: → 1786451

A patch has been attached on this bug, which was already closed. Filing a separate bug will ensure better tracking. If this was not by mistake and further action is needed, please alert the appropriate party. (Or: if the patch doesn't change behavior -- e.g. landing a test case, or fixing a typo -- then feel free to disregard this message)

(In reply to Ray Kraesig [:rkraesig] from comment #28)
[Why is this notable]: Significantly reduces main-process out-of-memory crashes, effectively offloading them to content processes.

Do you have any data on how much difference this makes? Is it just main-process crashes?

Flags: needinfo?(rkraesig)

The current version is indeed just main-process crashes. (There may be a slight reduction in crashes overall, but as of the last data I had, it hadn't reached the level of statistical significance.)

There are a number of graphs showing this behavior; my favorite is this one (a current snapshot of which is attached).

Flags: needinfo?(rkraesig)
Pushed by rkraesig@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1ec4e7bf2251 patchset #3 [1/1] - Cleanup: remove intermediate function r=glandium
Attachment #9288651 - Attachment is obsolete: true
