Closed Bug 1754258 Opened 3 years ago Closed 3 years ago

MOZ_CRASH("Unexpected CodeRange kind") when profiling wasm-test.libreoffice.org

Categories: Core :: JavaScript: WebAssembly, defect, P2
Status: RESOLVED FIXED
Target Milestone: 99 Branch
Tracking Status: firefox99 --- fixed
People: Reporter: lth, Assigned: lth
References: Blocks 1 open bug
Attachments: 2 files

(Current m-c, Fedora 38)

Make a release build. In a fresh tab, start the profiler. Go to https://wasm-test.libreoffice.org and wait. The tab will crash.

In a debugger, I see this:

Thread 6 "IPC I/O Parent" received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7fffd8c5b640 (LWP 50119)]
__libc_sendmsg (flags=64, msg=0x7fffd8c56550, fd=132) at ../sysdeps/unix/sysv/linux/sendmsg.c:28
Downloading -0.00 MB source file /usr/src/debug/glibc-2.34-25.fc35.x86_64/socket/../sysdeps/unix/sysv/linux/sendmsg.c
28        return SYSCALL_CANCEL (sendmsg, fd, msg, flags);                                                                                              
Missing separate debuginfos, use: dnf debuginfo-install xz-libs-5.2.5-7.fc35.x86_64
(gdb) bt
#0  __libc_sendmsg (flags=64, msg=0x7fffd8c56550, fd=132) at ../sysdeps/unix/sysv/linux/sendmsg.c:28
#1  __libc_sendmsg (fd=132, msg=0x7fffd8c56550, flags=64) at ../sysdeps/unix/sysv/linux/sendmsg.c:25
#2  0x00007fffec058bc5 in IPC::(anonymous namespace)::corrected_sendmsg(int, msghdr const*, int) (socket=132, message=0x7fffd8c56550, flags=64)
    at /home/lhansen/m-u/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:144
#3  IPC::Channel::ChannelImpl::ProcessOutgoingMessages() (this=<optimized out>)
    at /home/lhansen/m-u/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:705
#4  0x00007fffec05a2d1 in IPC::Channel::ChannelImpl::OnFileCanWriteWithoutBlocking(int) (this=0x7fffc0b37200, fd=<optimized out>)
    at /home/lhansen/m-u/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:894
#5  0x00007fffec0498ea in base::MessagePumpLibevent::OnLibeventNotification(int, short, void*) (fd=132, flags=4, context=0x7fffc0b37200)
    at /home/lhansen/m-u/ipc/chromium/src/base/message_pump_libevent.cc:246
#6  0x00007fffec09339c in event_process_active_single_queue (base=0x7fffd96b1700, activeq=0x7fffd96a8230, max_to_process=2147483647, endtime=0x0)
    at /home/lhansen/m-u/ipc/chromium/src/third_party/libevent/event.c:1646
#7  0x00007fffec085e0e in event_process_active (base=0x7fffd96b1700) at /home/lhansen/m-u/ipc/chromium/src/third_party/libevent/event.c:1741
#8  event_base_loop (base=0x7fffd96b1700, flags=1) at /home/lhansen/m-u/ipc/chromium/src/third_party/libevent/event.c:1961
#9  0x00007fffec049ce2 in base::MessagePumpLibevent::Run(base::MessagePump::Delegate*) (this=0x7fffd96495c0, delegate=0x7fffd8c5ab58)
    at /home/lhansen/m-u/ipc/chromium/src/base/message_pump_libevent.cc:353
#10 0x00007fffec047df6 in MessageLoop::RunInternal() (this=0x7fffd8c56550) at /home/lhansen/m-u/ipc/chromium/src/base/message_loop.cc:331
#11 MessageLoop::RunHandler() (this=0x7fffd8c56550) at /home/lhansen/m-u/ipc/chromium/src/base/message_loop.cc:324
#12 MessageLoop::Run() (this=0x7fffd8c56550) at /home/lhansen/m-u/ipc/chromium/src/base/message_loop.cc:306
#13 0x00007fffec0564ce in base::Thread::ThreadMain() (this=0x7fffda0be4c0) at /home/lhansen/m-u/ipc/chromium/src/base/thread.cc:187
#14 0x00007fffec04e31a in ThreadFunc(void*) (closure=0x84, closure@entry=0x7fffda0be4c0)
    at /home/lhansen/m-u/ipc/chromium/src/base/platform_thread_posix.cc:40
#15 0x00005555556208b2 in set_alt_signal_stack_and_start(PthreadCreateParams*) (params=<optimized out>)
    at /home/lhansen/m-u/toolkit/crashreporter/pthread_create_interposer/pthread_create_interposer.cpp:74
#16 0x00007ffff7b0ab1a in start_thread (arg=<optimized out>) at pthread_create.c:443
#17 0x00007ffff7b8f650 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

jseward has the same problem, and it's 100% reproducible, so it's unlikely that this is caused by my hardware. It could be a profiler problem, of course.

Also a problem in the most recent FF99 Nightly, same platform.

SIGPIPE should be ignored; it looks like that's done in this part of NSPR initialization, and searchfox doesn't see anywhere else we'd reset it that would be applicable here.

One thing the profiler does is send a lot of SIGPROF signals, so I wonder if there's some kind of race condition in the kernel (or maybe in libc). I can reproduce it locally on Debian and will try a few variations. Interestingly, it doesn't reproduce under rr.

A few other configurations that crash:

  • pinned to a single CPU with taskset
  • A release build of Firefox (i.e., if this is our bug, it's not recent)
  • In an Ubuntu 18.04 VM with kernel 4.15 (i.e., if this is a kernel bug, it's at least 4 years old)

However, it does work in ESR 91.5, so maybe it can be (slowly) mozregression'ed.

Also crashes on an M1 MacBook Pro, so unlikely to be a Linux problem per se.

Note when running mozregression on this that there was a performance problem in WebAssembly in FF96 and FF97 that will make the "compiling" phase of this test take a very long time indeed (about 3 minutes on my build system, up from 15 seconds in builds without the bug).

Stepping back in history with mozregression on the M1 MacBook Pro, this is broken at least as far back as FF89. ESR 91.6 on this platform doesn't seem to have the same profiling functionality (weird?), so I don't know how to test that. Around FF78/FF79 we should expect to run into problems with missing Wasm features.

Seeing that this also happens on Mac, I was wondering if the real cause of the crash might actually not be SIGPIPE. I reproduced the crash with a build that happens to have my work-in-progress patches for bug 1658072 (which include logging for child process exit statuses) and got an interesting message:

[Parent 127484, IPC I/O Parent] WARNING: process 127883 exited on signal 11

I had been trying to get an ELF core dump and that didn't seem to work, but I did get dmesg output indicating that the segfault was raised on a thread named SamplerThread, and machine code bytes that disassembled into what looked like a MOZ_CRASH.

I tried again with gdb attached to the content process, and got this:

Thread 1 "WebCOOP+COEP Co" received signal SIGILL, Illegal instruction.
0x00003483bb888801 in ?? ()
(gdb) x/i $rip
=> 0x3483bb888801:      ud2    

I guessed that the ud2 was a JIT implementation detail, issued "handle SIGILL nostop noprint pass" in gdb, and continued:

Thread 13 "SamplerThread" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f83a40ff640 (LWP 132173)]
js::wasm::ProfilingFrameIterator::initFromExitFP (…) at /home/jld/src/gecko-dev/js/src/wasm/WasmFrameIter.cpp:941
941           MOZ_CRASH("Unexpected CodeRange kind");

That looks like the crash I was expecting from the dmesg. Unfortunately, before gdb let me enter commands again, this happened:

js/src/gdb/mozilla/asmjs.py: Allowing WasmTrapHandler to run.
[Thread 0x7f8399fff640 (LWP 132337) exited]
…(34 similar lines removed)…
[Thread 0x7f83b602abc0 (LWP 131997) exited]
[Inferior 1 (process 131997) exited normally]

And the process was gone. However, the MOZ_CRASH stack had been logged:

Hit MOZ_CRASH(Unexpected CodeRange kind) at /home/jld/src/gecko-dev/js/src/wasm/WasmFrameIter.cpp:941
#01: js::wasm::ProfilingFrameIterator::initFromExitFP(js::wasm::Frame const*) (/home/jld/src/gecko-dev/js/src/wasm/WasmFrameIter.cpp:941)
#02: js::wasm::ProfilingFrameIterator::ProfilingFrameIterator(js::jit::JitActivation const&, JS::ProfilingFrameIterator::RegisterState const&) (/home/jld/src/gecko-dev/js/src/wasm/WasmFrameIter.cpp:0)
#03: JS::ProfilingFrameIterator::iteratorConstruct(JS::ProfilingFrameIterator::RegisterState const&) (/home/jld/src/gecko-dev/js/src/vm/Stack.cpp:0)
#04: JS::ProfilingFrameIterator::ProfilingFrameIterator(JSContext*, JS::ProfilingFrameIterator::RegisterState const&, mozilla::Maybe<unsigned long> const&) (/home/jld/src/gecko-dev/js/src/vm/Stack.cpp:490)
#05: ExtractJsFrames(bool, mozilla::profiler::ThreadRegistrationUnlockedReaderAndAtomicRWOnThread const&, Registers const&, ProfilerStackCollector&, JS::ProfilingFrameIterator::Frame*, StackWalkControl*) (/home/jld/src/gecko-dev/tools/profiler/core/platform.cpp:1651)
#06: DoSharedSample(bool, unsigned int, mozilla::profiler::ThreadRegistrationUnlockedReaderAndAtomicRWOnThread const&, JS::ProfilingFrameIterator::Frame*, Registers const&, unsigned long, unsigned long, ProfileBuffer&, mozilla::StackCaptureOptions) (/home/jld/src/gecko-dev/tools/profiler/core/platform.cpp:2340)
#07: SamplerThread::Run() (/home/jld/src/gecko-dev/tools/profiler/core/platform.cpp:3967)

I notice, belatedly, that the initial report shows a thread named IPC I/O Parent, meaning that the debugger was attached to the parent process, which isn't the one that's crashing. I think what the debugger was trying to say is that SIGPIPE would have been raised at that point if it weren't ignored, which is true but not especially relevant.

Component: IPC → JavaScript: WebAssembly
Summary: SIGPIPE crash during profiling → MOZ_CRASH("Unexpected CodeRange kind") when profiling wasm-test.libreoffice.org

Nice work! I'll dig into this further.

Assignee: nobody → lhansen
Severity: -- → S3
Status: NEW → ASSIGNED
Priority: -- → P2

The code range kind is ImportJitExit.

The recorded exit reason is:

EXIT REASON 59
--> call to native Allocate<BigInt, NoGC> (in wasm)

So this seems to be a call to AllocateBigIntTenuredNoGC() (in WasmBuiltins.cpp) from code generated from FillArgumentArrayForExit() (in WasmStubs.cpp) on behalf of an ImportJitExit stub.

The call goes via a builtin thunk. This thunk sets the exitFP flag.

So now we have the situation where the stack (growing down) looks like:

  ...
  Wasm code
  Wasm code
  ImportJitExit code
  AllocateBigInt thunk
  AllocateBigIntTenuredNoGC
  <other stuff maybe>

When the profiler kicks in and wants to unwind the wasm activation, it notices that exitFP is set for the thunk, so we hit the hasWasmExitFP case in ProfilingFrameIterator::ProfilingFrameIterator(JitActivation, RegisterState) and go directly to initFromExitFP(). That function skips the bottommost frame (the thunk) and then starts processing the next one, the ImportJitExit code. But it does not expect to see that kind, and so we crash.
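The crashing dispatch can be sketched like this (hypothetical, heavily simplified types, not the real SpiderMonkey classes): after skipping the thunk frame, initFromExitFP() classifies the caller frame by its CodeRange kind, and ImportJitExit is not among the kinds it is prepared to handle:

```cpp
#include <cassert>

// Hypothetical simplification of js::wasm::CodeRange::Kind.
enum class CodeRangeKind {
  Function,          // ordinary wasm function code
  InterpEntry,       // entry trampoline
  ImportJitExit,     // stub calling into JIT'd JS
  ImportInterpExit,
  BuiltinThunk,
  TrapExit,
};

// Sketch of the switch in ProfilingFrameIterator::initFromExitFP():
// only some kinds are expected as the caller of an exit frame; the
// rest fall into MOZ_CRASH("Unexpected CodeRange kind").
bool expectedCallerKind(CodeRangeKind kind) {
  switch (kind) {
    case CodeRangeKind::Function:
    case CodeRangeKind::InterpEntry:
      return true;   // normal cases: unwinding proceeds
    default:
      return false;  // crash path in the real code
  }
}
```

With the BigInt allocation thunk nested inside the ImportJitExit stub, the frame seen after skipping the thunk is the ImportJitExit itself, which lands in the crash path.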

This particular case is only a problem for the ImportJitExit because that's the only case where we need to call the bigint allocator during argument setup (it's the ToValue==true cases in FillArgumentArrayForExit).

I think this is really only an issue with the profiler and (rare) interrupts, and of course the window for anything to break is small, but apparently large enough that the test case breaks 100% of the time.

It may be right for this builtin to have a thunk so that the exit state can be recorded properly, but then we need to be prepared to unwind through the ImportJitExit code too. The alternative view is that the frame for the thunk is part of the frame for the ImportJitExit while we're in this intermediate state; this simplifies unwinding logic at the expense of the exit stub. Investigating.

Note that other primitives called from the stubs, such as CoerceInPlace_ToInt32, do not get thunks; they're called directly. But then the stub code is careful to mark these exits by tagging the FP around the call. The difference is thus only that these callouts get one frame and not two: their frame is part of the ImportJitExit frame.
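The FP-tagging idea can be sketched as follows (hypothetical helpers; the real marking happens in generated stub code on the machine frame pointer): a tag bit on FP tells the profiler that the current callout belongs to the enclosing exit frame rather than forming a frame of its own:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: the low bit of FP marks "inside an exit callout",
// mirroring how the stubs tag FP around direct calls such as
// CoerceInPlace_ToInt32 so unwinding treats the callout's frame as
// part of the ImportJitExit frame.
constexpr uintptr_t kExitFPTag = 1;

uintptr_t tagExitFP(uintptr_t fp)   { return fp | kExitFPTag; }
uintptr_t untagExitFP(uintptr_t fp) { return fp & ~kExitFPTag; }
bool      isExitFP(uintptr_t fp)    { return (fp & kExitFPTag) != 0; }
```

The stub tags FP before the direct call and untags it on return, so there is never a window where the profiler sees an extra, unexpected frame.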

OS: Linux → All
Hardware: x86_64 → All

This works around the issue (obviously the different MOZ_CRASH messages are not necessary) and demonstrates that the diagnosis is correct:

diff --git a/js/src/wasm/WasmFrameIter.cpp b/js/src/wasm/WasmFrameIter.cpp
--- a/js/src/wasm/WasmFrameIter.cpp
+++ b/js/src/wasm/WasmFrameIter.cpp
@@ -932,13 +945,26 @@ void ProfilingFrameIterator::initFromExi
       AssertMatchesCallSite(callerPC_, callerFP_);
       break;
     case CodeRange::ImportJitExit:
+      if (!exitReason_.isFixed() && exitReason_.symbolic() == SymbolicAddress::AllocateBigInt) {
+        fp = fp->wasmCaller();
+        callerPC_ = fp->returnAddress();
+        callerFP_ = fp->rawCaller();
+        break;
+      } else {
+        MOZ_CRASH("Unexpected CodeRange kind ImportJitExit");
+      }
     case CodeRange::ImportInterpExit:
+      MOZ_CRASH("Unexpected CodeRange kind ImportInterpExit");
     case CodeRange::BuiltinThunk:
+      MOZ_CRASH("Unexpected CodeRange kind BuiltinThunk");
     case CodeRange::TrapExit:
+      MOZ_CRASH("Unexpected CodeRange kind TrapExit");
     case CodeRange::DebugTrap:
+      MOZ_CRASH("Unexpected CodeRange kind DebugTrap");
     case CodeRange::Throw:
+      MOZ_CRASH("Unexpected CodeRange kind Throw");
     case CodeRange::FarJumpIsland:
-      MOZ_CRASH("Unexpected CodeRange kind");
+      MOZ_CRASH("Unexpected CodeRange kind FarJumpIsland");
   }
 
   MOZ_ASSERT(!done());

With that problem fixed, we can now see others. In a release-debug build, start the profiler and load the URL. Wait. I now see a new crash:

Assertion failure: callsite, at /home/lhansen/m-u/js/src/wasm/WasmFrameIter.cpp:889

with no meaningful stack at this time. Doesn't happen on every run.

This debug code is called at various points during unwinding, notably when starting to unwind and also later. It is possible my fix above triggers it, but it could also be something else.

(Note that bug 1754819 may get in the way of observing this if you're unlucky. Try enabling software webrender.)

The BigInt initialization code used in the JitExit is not correct
because it nests two exit frames within each other, one for the
JitExit and another for the BigInt allocation thunk, called while
initializing the first. This breaks unwinding, and it's not easy to
disentangle. See further analysis on the bug.

We fix this by disabling JitExits for functions that require a BigInt
allocation. This is not much of a hardship since the allocation
requires saving every register, calling out to the allocator, and then
restoring every register again on return - it would already be a very
expensive operation. Routing the call via C++ isn't going to make
that much difference.
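The shape of the fix can be sketched as a small eligibility predicate (hypothetical signature and placement; the real check lives in the import-call machinery): if any argument or the result is an i64, which would require a BigInt allocation inside the exit stub, take the interpreter exit instead of a JitExit:

```cpp
#include <cassert>
#include <vector>

// Hypothetical mirror of the wasm value types relevant to the check.
enum class ValType { I32, I64, F32, F64, Ref };

// Sketch: a JitExit is only eligible when no i64 <-> BigInt conversion
// (and hence no allocation inside the stub) is needed.
bool jitExitEligible(const std::vector<ValType>& args, ValType result) {
  if (result == ValType::I64) {
    return false;  // result would need a BigInt allocation
  }
  for (ValType t : args) {
    if (t == ValType::I64) {
      return false;  // argument coercion would need a BigInt allocation
    }
  }
  return true;
}
```

Functions that fail the predicate simply route through the (already expensive) C++ interpreter exit, which is prepared for profiling.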

(If the patch in comment 12 lands I will file a followup to investigate this further and maybe find some other way to re-enable that code, but with fewer performance problems and proper support for profiling.)

Pushed by lhansen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f68877bfcb64 disable JitExit for functions with i64 arguments. r=yury
Blocks: 1757130

Bug 1757130: Re-enable i64 jitexit after redesigning enough of the system to make it work sensibly and proving that it helps performance.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 99 Branch