Open Bug 1784033 Opened 2 years ago Updated 2 years ago

Implement delaying OOM crashes on Windows by hooking VirtualAlloc directly

Categories

(Core :: Memory Allocator, enhancement)

Unspecified
Windows
enhancement

People

(Reporter: gsvelto, Unassigned)

References

(Blocks 1 open bug)

Details

As per the title: this should ensure that the improvements we're seeing in bug 1716727 also help code that calls VirtualAlloc() directly and bypasses the memory allocator, such as the JS engine.

We've done some due diligence on this one and it might be problematic to just hook VirtualAlloc(). That's because SpiderMonkey uses VirtualAlloc() directly but also handles OOM conditions on its own, for example by running the garbage collector. Since VirtualAlloc() failures don't cause crashes directly there (at least in some cases), delaying the failures might confuse SpiderMonkey's logic.

To move forward here I think we could do a few things:

  1. Look for all the places where we use VirtualAlloc() directly outside of SpiderMonkey and the allocator, then evaluate how often those cause crashes.
  2. If the non-allocator VirtualAlloc() calls are responsible for a significant number of crashes, write a MozVirtualAlloc() with delaying logic, which we'll use in lieu of VirtualAlloc(). This way we'll avoid perturbing SpiderMonkey's logic.
  3. Last but not least, talk to the JIT team to figure out how they handle OOMs and whether it would be useful to expose the delaying logic to SpiderMonkey. One could imagine that in some cases SpiderMonkey would like to prevent VirtualAlloc() from failing, but in other cases the delaying logic might be worth it.
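
The MozVirtualAlloc() idea in step 2 could look roughly like the sketch below. It is only an illustration under stated assumptions: AllocWithStall, StallPolicy, the retry count and the delay are hypothetical names and values, not what bug 1716727 actually landed, and the wrapper takes only a size rather than mirroring the full VirtualAlloc() signature. The retry loop is generic so that it can be exercised with a fake allocator on any platform.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical tuning knobs; the values used by the real allocator are not
// stated in this bug.
struct StallPolicy {
  unsigned maxRetries;              // extra attempts after the first failure
  std::chrono::milliseconds delay;  // pause before each extra attempt
};

// Calls reserve(size); while it returns nullptr, sleeps and tries again, up
// to policy.maxRetries extra times. Returns nullptr if every attempt fails,
// so the caller still decides whether to crash.
template <typename ReserveFn>
void* AllocWithStall(ReserveFn reserve, std::size_t size,
                     const StallPolicy& policy) {
  assert(size > 0);
  void* p = reserve(size);
  for (unsigned i = 0; p == nullptr && i < policy.maxRetries; ++i) {
    std::this_thread::sleep_for(policy.delay);
    p = reserve(size);
  }
  return p;
}

#ifdef _WIN32
// Simplified stand-in for the proposed MozVirtualAlloc(): reserve and commit
// read/write pages, stalling on failure. The 10 x 50 ms policy is made up.
inline void* MozVirtualAlloc(std::size_t size) {
  return AllocWithStall(
      [](std::size_t n) -> void* {
        return VirtualAlloc(nullptr, n, MEM_RESERVE | MEM_COMMIT,
                            PAGE_READWRITE);
      },
      size, StallPolicy{10, std::chrono::milliseconds(50)});
}
#endif
```

Because the wrapper is opt-in, SpiderMonkey's direct VirtualAlloc() calls and its own OOM handling would stay untouched unless a caller is switched over explicitly.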

I'm CC'ing :nbp from the JIT team so that they're aware of our experiments here. Nicholas, you can check bug 1716727 for all the details behind this, but in a nutshell we introduced a delay-and-retry step before crashing when we run out of memory on Windows, and it's doing wonders. It almost cut our main process OOM crash rate in half.

CC'ing jonco and sfink because GC allocations might be more interesting than JIT allocations in this context.

See Also: → 1786451

SpiderMonkey has multiple consumers of VirtualAlloc:

  • Libffi, which is used to handle js-ctypes, to allocate intermediary object representations. These are converted from JS values to their C equivalents and are managed by this in-between layer.
  • The ExecutableAllocator, to pre-reserve pages at specific addresses, which are later mapped as executable pages during execution.
  • The Garbage Collector, to reserve pages as the heap grows. These pages store all objects whose lifetime is managed by the garbage collector.
  • ArrayBufferObject, to allocate the buffers which are likely to be used by WebAssembly as memory.

Except for the Garbage Collector case, I think the stalling behavior is a sane approach, although it is debatable whether it is useful for the ExecutableAllocator, since that allocation happens during process initialization.

Concerning the GC, Jon and Steve are most likely the right people to ask, as there is additional logic to handle the system failing to provide more pages. It probably makes sense to keep this logic in place, without stalling: we might assume that many Firefox processes are fighting over the same system memory, and that they survive because one of the processes decommitted some memory.

So we would probably want another stalling layer on top of the GC allocations, for when there is no choice other than growing the heap or crashing.

There are about two and a half cases within the SpiderMonkey GC.

  • Tenured heap chunk allocation (as you can imagine above)
  • Nursery chunk allocation (What currently happens here? Can it start a minor GC or will it fail over to the tenured heap?)
  • Tenured heap chunk allocation while a GC is running! This is why I want to comment on this bug: it's complex and has a whole meta bug just for this topic (Bug 1472062). Essentially, during a GC (of any kind) the nursery needs to be collected, any reachable cells there need to be allocated into the tenured heap, and if the tenured heap needs a new chunk to do so, that chunk must be allocated. If that allocation fails, the process will crash with an OOM. IMHO, although tenured chunk allocation normally has a delaying behaviour by invoking the GC, this sub-case of tenured chunk allocation doesn't, and the delaying behaviour we've tried elsewhere could be applied here.
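
To make the difference between the tenured cases concrete, here is a rough sketch of how the two paths could share a last-resort stall. Everything in it is hypothetical: AllocContext, ChunkHooks and AllocTenuredChunk are illustrative names, the hooks merely stand in for SpiderMonkey internals, and the real GC code is structured differently. The assumption is that a normal (mutator) request can still fall back on a collection first, while a request made during a GC cannot, so it goes straight to delay-and-retry.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>

// Where the tenured chunk request comes from.
enum class AllocContext {
  Mutator,   // normal allocation; a GC can still be triggered to free memory
  DuringGC,  // e.g. tenuring nursery cells; another GC cannot be started
};

// Hypothetical hooks standing in for SpiderMonkey internals.
struct ChunkHooks {
  std::function<void*(std::size_t)> mapChunk;  // e.g. wraps VirtualAlloc()
  std::function<void()> collectGarbage;        // existing GC-based recovery
};

// Returns a chunk, or nullptr once every fallback has been exhausted (at
// which point the caller would crash with an OOM).
void* AllocTenuredChunk(const ChunkHooks& hooks, AllocContext ctx,
                        std::size_t size, unsigned stallRetries,
                        std::chrono::milliseconds delay) {
  assert(hooks.mapChunk);
  if (void* p = hooks.mapChunk(size)) return p;

  // Existing behaviour: outside of a GC, collecting is the first fallback
  // and involves no stalling.
  if (ctx == AllocContext::Mutator) {
    assert(hooks.collectGarbage);
    hooks.collectGarbage();
    if (void* p = hooks.mapChunk(size)) return p;
  }

  // Last resort for both contexts: wait for another process to release
  // memory, then try again.
  for (unsigned i = 0; i < stallRetries; ++i) {
    std::this_thread::sleep_for(delay);
    if (void* p = hooks.mapChunk(size)) return p;
  }
  return nullptr;
}
```

In this shape the GC's own recovery logic runs before any stalling, and the during-GC sub-case gains a delay where it would otherwise crash immediately.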