Closed Bug 1678226 Opened 4 years ago Closed 4 years ago

Crash in [@ _pthread_join]

Categories

(Core :: JavaScript: WebAssembly, defect, P1)

Unspecified
macOS
defect

Tracking

()

VERIFIED FIXED
85 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox83 --- unaffected
firefox84 + verified
firefox85 + verified

People

(Reporter: glandium, Assigned: glandium)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

I got this crash while running the JetStream2 benchmark on browserbench.org. It is reproducible and I've seen other similar crashes on crash-stats, all on arm64.

Crash report: https://crash-stats.mozilla.org/report/index/5ba6e3af-aae8-4b7f-b31e-158320201119

Reason: EXC_BAD_ACCESS / KERN_INVALID_ADDRESS

Top 10 frames of crashing thread:

0 libsystem_pthread.dylib _pthread_join 
1 libnss3.dylib PR_JoinThread nsprpub/pr/src/pthreads/ptthread.c:586
2 XUL nsThreadShutdownAckEvent::Run xpcom/threads/nsThread.cpp:252
3 XUL mozilla::SchedulerGroup::Runnable::Run xpcom/threads/SchedulerGroup.cpp:146
4 XUL mozilla::RunnableTask::Run xpcom/threads/TaskController.cpp:450
5 XUL mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:720
6 XUL mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:579
7 XUL mozilla::TaskController::ProcessPendingMTTask xpcom/threads/TaskController.cpp:373
8 XUL mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal xpcom/threads/nsThreadUtils.h:577
9 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1197

It is reproducible

It seems to always happen on the float-mm.c test.

I suspect this is a duplicate of bug 1677817. Could you verify you can not reproduce the problem on a build with the fix?

At least, a recent try build that includes the fix doesn't crash. I'll wait for beta 3 and double confirm with that one.

It's still happening in beta 3.

tracking+ given that this is one of the issues contributing to the current uncertainty over shipping native builds in 84.

Crash Signature: [@ _pthread_join] → [@ _pthread_join] [@ libsystem_pthread.dylib@0x8ad8]

So far, this is what I have: a pthread_t value on macos is a pointer to internal data for the thread. That internal data lives within the allocated stack space, near the end of the block allocated for the stack. The contents of that internal data is corrupted, leading to this crash in _pthread_join. It's corrupted because /something/ (and it still remains to find what that is) is unmapping that stack space before the thread is over, and something else ends up being mapped instead in that memory block.

Crash Signature: [@ _pthread_join] [@ libsystem_pthread.dylib@0x8ad8] → [@ _pthread_join] [@ libsystem_pthread.dylib@0x8ad8]

For what it's worth, I was seeing this frequently on sites that use wasm.

I disabled 'javascript.options.wasm' and 'javascript.options.wasm_trustedprincipals' a few days ago and have not seen a crash since.

Element (Matrix chat; powers http://chat.mozilla.org/) was particularly crashy for me, since it uses wasm for end-to-end encryption. Participating an an e2ee room was almost impossible as a result.

With a debug beta build, I hit the crash in the debugger with the following stack.

Process 87559 stopped
* thread #48, stop reason = signal SIGABRT
    frame #0: 0x000000019ab1fcec libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`__pthread_kill:
->  0x19ab1fcec <+8>:  b.lo   0x19ab1fd0c               ; <+40>
    0x19ab1fcf0 <+12>: pacibsp 
    0x19ab1fcf4 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x19ab1fcf8 <+20>: mov    x29, sp
Target 0: (plugin-container) stopped.
(lldb) bt
* thread #48, stop reason = signal SIGABRT
  * frame #0: 0x000000019ab1fcec libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x000000019ab50c24 libsystem_pthread.dylib`pthread_kill + 292
    frame #2: 0x000000019aa988e4 libsystem_c.dylib`__abort + 116
    frame #3: 0x000000019aa783b4 libsystem_c.dylib`__stack_chk_fail + 96
    frame #4: 0x000000010af716c0 XUL`HandleMachException(request=<unavailable>) at WasmSignalHandlers.cpp:0:12 [opt]

We know Apple has fixed a trap handling issue that we are hitting under Rosetta with wasm (bug 1672843) which was confirmed to be in macOS 11.1 Beta.

I'm going to update my M1 to 11.1 Beta and try and reproduce this.

(In reply to Haik Aftandilian [:haik] from comment #10)

I'm going to update my M1 to 11.1 Beta and try and reproduce this.

The problem is still reproducible on macOS 11.1 beta.

(In reply to Mike Hommey [:glandium] from comment #8)

It's corrupted because /something/ (and it still remains to find what that is) is unmapping that stack space before the thread is over, and something else ends up being mapped instead in that memory block.

That part was a red herring. It turns out what happens is that when the thread terminates, the stack is unmapped, except for the portion that contains the pthread metadata. The question remains what is breaking that data.

My debug MOZCONFIG was still using ac_add_options --enable-optimize=-Og to workaround an older issue on the DTK. With the optimization turned off, I can't reproduce the twitch crash with a debug build.

FTR, applying bug 1677452 and bug 1676624 makes it less likely to happen, but it still happens.
I'm still mystified that it's not happening on central, only beta.

(In reply to Dan Callahan [:callahad] from comment #9)

For what it's worth, I was seeing this frequently on sites that use wasm.

I disabled 'javascript.options.wasm' and 'javascript.options.wasm_trustedprincipals' a few days ago and have not seen a crash since.

Element (Matrix chat; powers http://chat.mozilla.org/) was particularly crashy for me, since it uses wasm for end-to-end encryption. Participating an an e2ee room was almost impossible as a result.

Thanks. Testing with javascript.options.wasm=false, I have not reproduced the problem either. I wonder if there are any differences with our wasm trap handling in nightly vs beta.

After reverse engineering the pthread implementation in Big Sur, which, fortunately, while not identical, is close enough to the code in https://github.com/apple/darwin-libpthread/ that I wasn't going fully blindly, I identified where the problem occurs (all this would have been easier had there been debug symbols for those system libraries but well...).

As mentioned in comment 8, a pthread_t is just a pointer to the pthread metadata. All pthreads are part of an intrusive linked list, meaning the pointer to the next thread is within the pthread metadata, not in a separate struct. The list starts at the _pthread_head symbol, which points to the main thread. Then each thread creation adds to the list.

When pthread_join is called, one of the first things that's done is to go through the list of threads, starting from what _pthread_head points to, and check that the pthread_t that is passed in is, in fact, in that list, otherwise it returns an error.

This is where the crash is happening: while looping through the list, one of the "next" pointers is broken (most of the time, it's 0x11 for me). The first thing the code does with this pointer is compare it to the one that was given, which it doesn't match, so it goes on to try to load the next one, so it tries to read the "next" field in what it thinks is a pthread metadata struct at that location (0x11). The offset of the field in the struct is 0x10, so it tries to read at 0x21, which segfaults.

It so happens that the thread for which the "next" field is overwritten somehow is always the MachExceptionHandlerThread thread.

The reason this doesn't happen entirely reliably (sometimes it takes a few seconds of twitch streaming, sometimes it's instant) is that it depends in which order threads appear in the list, and whether threads being joined appear before or after the MachExceptionHandlerThread thread in the list.

Now, as to what is overwriting the "next" field, this is the backtrace when it happens:

    frame #0: 0x000000018bf14a04 libsystem_platform.dylib`_platform_memmove + 212
  * frame #1: 0x000000018be9c6b0 libsystem_kernel.dylib`thread_get_state + 452
    frame #2: 0x00000001066c05e0 XUL`MachExceptionHandlerThread() [inlined] HandleMachException(request=0x00000002b7e5ec98) at WasmSignalHandlers.cpp:847:10 [opt]
    frame #3: 0x00000001066c05ac XUL`MachExceptionHandlerThread() at WasmSignalHandlers.cpp:913 [opt]

Because I have some debugging code in WasmSignalHandlers.cpp locally, and this was on the beta branch, that line corresponds to: https://hg.mozilla.org/releases/mozilla-beta/file/ddad4255cf307a6b76d5a314cc84d52b2c1b96b8/js/src/wasm/WasmSignalHandlers.cpp#l845

And the reason this is causing these problems is because context.float_ is too small to actually contain the float state, so the thread_get_state is doing a stack buffer overflow. And as mentioned in comment 8, the pthread metadata is at the highest addresses in the stack for the thread. Without a guard page. So a stack buffer overflow in the first frames in the thread will happily overwrite the pthread metadata.

As for the reason it doesn't happen on nightly? I'm not entirely sure yet.

The best part of all this? It's a blunder on my part in bug 1658386.

Assignee: nobody → mh+mozilla
Component: XPCOM → Javascript: WebAssembly
Depends on: 1658386

Comment on attachment 9190966 [details]
Bug 1678226 - Use the appropriately sized arm_neon_state64_t for thread float state on arm64 mac.

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes on web pages using wasm on arm64 macs.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The analysis was pretty thorough, and the patch is a one-liner with a trivial fix for what essentially was a blunder.
  • String changes made/needed:
Attachment #9190966 - Flags: approval-mozilla-beta?
Pushed by mh@glandium.org: https://hg.mozilla.org/integration/autoland/rev/b99c0c159e25 Use the appropriately sized arm_neon_state64_t for thread float state on arm64 mac. r=lth
Pushed by dluca@mozilla.com: https://hg.mozilla.org/mozilla-central/rev/d4b21131a094 Use the appropriately sized arm_neon_state64_t for thread float state on arm64 mac. r=lth
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 85 Branch

Comment on attachment 9190966 [details]
Bug 1678226 - Use the appropriately sized arm_neon_state64_t for thread float state on arm64 mac.

Nice detective work! Glad the fix ended up being pretty simple. Approved for 84.0b8.

Attachment #9190966 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Epilogue:

So, the reason this doesn't happen on nightly is that something nightly-only is making the stack size of the MachExceptionHandlerThread function larger, suchj that the buffer overflow stops at 0x10 into the pthread struct, so right before the pthread->next field, while on beta it stops at 0x20 into the pthread struct, overwriting the field. And since nothing uses the pthread struct for this specific thread, other than going through the list of threads, there's no occasion on nightly to crash because of the 0x10 overwritten bytes. But if I add just a call to pthread_self(), which does some validation of the first word in the pthread struct, I get the crash on central.

As for why -fstack-protector didn't catch this? Because HandleMacException is inlined into MachExceptionHandlerThread, and we never return from that MachExceptionHandlerThread, the stack protector never has an occasion to kick in (it only happens on function return).

Crash Signature: [@ _pthread_join] [@ libsystem_pthread.dylib@0x8ad8] → [@ _pthread_join] [@ libsystem_pthread.dylib@0x8ad8] [@ EMPTY: no crashing thread identified; ERROR_NO_THREAD_LIST]
Severity: -- → S1
Crash Signature: [@ _pthread_join] [@ libsystem_pthread.dylib@0x8ad8] [@ EMPTY: no crashing thread identified; ERROR_NO_THREAD_LIST] → [@ _pthread_join] [@ libsystem_pthread.dylib@0x8ad8] [@ EMPTY: no crashing thread identified; ERROR_NO_THREAD_LIST]
Priority: -- → P1

Verified-fixed on the latest Firefox Beta 84.0 RC and Firefox Nightly (2020-12-10) on miniMac: MacOS Big Sur 11.0.1, Mac mini (M1, 2020), Chip: Apple M1, display 27-inch (2560x1440), with and without Rosetta (DTK).
There were no crashes encountered while running the benchmark mentioned in Description nor twitch tab crashes reported in Bug 1679513.

Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: