Open Bug 1897489 Opened 1 year ago Updated 1 year ago

EINTR is not handled when mprotect is called to manage pages for wasm

Categories

(Core :: JavaScript Engine, defect, P3)

defect

Tracking

()

People

(Reporter: kael, Unassigned)

References

(Blocks 1 open bug)

Details

We're investigating intermittent test failures in our test suite for an application startup scenario where multiple threads are racing to grow the WebAssembly heap. We see them specifically on ubuntu for both Chrome and Firefox. Upon investigating, it looks like both v8 and spidermonkey have a small posix-specific bug that could be causing the issue for us (I've already reported it against v8).

Emscripten libc assumes that growing wasm memory will only ever fail (return -1) in an out of memory condition, so if it ever gets -1 back from growing the heap, sbrk/malloc will fail as well.

During our application startup, we have multiple threads all racing to allocate, which means they will potentially race on sbrk, and we will have multiple threads trying to grow the SAB that backs wasm memory. v8 takes a 'the OS will sort it out' approach to this, and from my reading of the source in dxr it looks like spidermonkey does too.

Unfortunately, despite the docs not mentioning it, it's possible for mprotect to return EINTR (I found at least one place in the linux kernel where it does so). The MapBufferMemory and CommitBufferMemory functions (https://searchfox.org/mozilla-central/source/js/src/vm/ArrayBufferObject.cpp#175) seem to assume that any mprotect failure is a permanent failure, instead of spotting EINTR and retrying.

It seems like these two mprotect call sites and maybe other mprotect call sites in the firefox tree (I found others) need to be updated to handle EINTR. From looking over the kernel source it looks like there's also a code path where it will return EAGAIN.

From experience, mprotect error handling across platforms is highly inconsistent … So this bug should be considered Linux specific unless proven otherwise.

The EAGAIN case is already handled in the Linux kernel implementation.

Checking for crashes with MOZ_RELEASE_ASSERT(ret == 0) from DecommitPages, does not highlight such a large volume of crashes, but this case sounds unlikely that it would be racing against many mmap calls as the kernel source suggest.

(In reply to Katelyn Gadd (:kael) from comment #0)

During our application startup, we have multiple threads all racing to allocate, […]

Do you happen to have a minimal test case reproducing this issue?

Blocks: sm-runtime
Severity: -- → S4
Component: JavaScript: WebAssembly → JavaScript Engine
Priority: -- → P3

(In reply to Nicolas B. Pierron [:nbp] from comment #1)

(In reply to Katelyn Gadd (:kael) from comment #0)

During our application startup, we have multiple threads all racing to allocate, […]

Do you happen to have a minimal test case reproducing this issue?

I do not, I've been unable to reproduce this locally. It happened a few times a day across our CI fleet and only when running in a multithreaded wasm configuration. It stopped causing tests to fail when we began retrying sbrk() (which means retrying when a wasm grow operation fails). I'm unable to come up with an alternative explanation for what's happening.

If necessary I could try to construct something that reproduces this, but it may also be dependent on configuration down to # of web workers and number of physical cores.

You need to log in before you can comment on or make changes to this bug.