After reverse engineering the pthread implementation in Big Sur, which, fortunately, while not identical, is close enough to the code in https://github.com/apple/darwin-libpthread/ that I wasn't going fully blindly, I identified where the problem occurs (all this would have been easier had there been debug symbols for those system libraries but well...). As mentioned in comment 8, a pthread_t is just a pointer to the pthread metadata. All pthreads are part of an intrusive linked list, meaning the pointer to the next thread is within the pthread metadata, not in a separate struct. The list starts at the _pthread_head symbol, which points to the main thread. Then each thread creation adds to the list. When pthread_join is called, one of the first things that's done is to go through the list of threads, starting from what _pthread_head points to, and check that the pthread_t that is passed in is, in fact, in that list, otherwise it returns an error. This is where the crash is happening: while looping through the list, one of the "next" pointers is broken (most of the time, it's 0x11 for me). The first thing the code does with this pointer is compare it to the one that was given, which it doesn't match, so it goes on to try to load the next one, so it tries to read the "next" field in what it things is a pthread metadata struct at that location (0x11). The offset of the field in the struct is 0x10, so it tries to read at 0x21, which segfaults. It so happens that the thread for which the "next" field is overwritten somehow is always the MachExceptionHandlerThread thread. The reason this doesn't happen entirely reliably (sometimes it takes a few seconds of twitch streaming, sometimes it's instant) is that it depends in which order threads appear in the list, and whether threads being joined appear before or after the MachExceptionHandlerThread thread in the list. Now, as to what is overwriting the "next" field, this is the backtrace when it happens: ``` frame #0: 0x000000018bf14a04 libsystem_platform.dylib`_platform_memmove + 212 * frame #1: 0x000000018be9c6b0 libsystem_kernel.dylib`thread_get_state + 452 frame #2: 0x00000001066c05e0 XUL`MachExceptionHandlerThread() [inlined] HandleMachException(request=0x00000002b7e5ec98) at WasmSignalHandlers.cpp:847:10 [opt] frame #3: 0x00000001066c05ac XUL`MachExceptionHandlerThread() at WasmSignalHandlers.cpp:913 [opt] ``` Because I have some debugging code in WasmSignalHandlers.cpp locally, and this was on the beta branch, this corresponds to: https://hg.mozilla.org/releases/mozilla-beta/file/ddad4255cf307a6b76d5a314cc84d52b2c1b96b8/js/src/wasm/WasmSignalHandlers.cpp#l845 And the reason this is causing these problems is because context.float_ is too small to actually contain the float state, so the thread_get_state is doing a stack buffer overflow. And as mentioned in comment 8, the pthread metadata is at the highest addresses in the stack for the thread. Without a guard page. So a stack buffer overflow in the first frames in the thread will happily overwrite the pthread metadata. As for the reason it doesn't happen on nightly? I'm not entirely sure yet. The best part of all this? It's a blunder on my part in bug 1658386.
Bug 1678226 Comment 16 Edit History
Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.
After reverse engineering the pthread implementation in Big Sur, which, fortunately, while not identical, is close enough to the code in https://github.com/apple/darwin-libpthread/ that I wasn't going fully blindly, I identified where the problem occurs (all this would have been easier had there been debug symbols for those system libraries but well...). As mentioned in comment 8, a pthread_t is just a pointer to the pthread metadata. All pthreads are part of an intrusive linked list, meaning the pointer to the next thread is within the pthread metadata, not in a separate struct. The list starts at the _pthread_head symbol, which points to the main thread. Then each thread creation adds to the list. When pthread_join is called, one of the first things that's done is to go through the list of threads, starting from what _pthread_head points to, and check that the pthread_t that is passed in is, in fact, in that list, otherwise it returns an error. This is where the crash is happening: while looping through the list, one of the "next" pointers is broken (most of the time, it's 0x11 for me). The first thing the code does with this pointer is compare it to the one that was given, which it doesn't match, so it goes on to try to load the next one, so it tries to read the "next" field in what it thinks is a pthread metadata struct at that location (0x11). The offset of the field in the struct is 0x10, so it tries to read at 0x21, which segfaults. It so happens that the thread for which the "next" field is overwritten somehow is always the MachExceptionHandlerThread thread. The reason this doesn't happen entirely reliably (sometimes it takes a few seconds of twitch streaming, sometimes it's instant) is that it depends in which order threads appear in the list, and whether threads being joined appear before or after the MachExceptionHandlerThread thread in the list. Now, as to what is overwriting the "next" field, this is the backtrace when it happens: ``` frame #0: 0x000000018bf14a04 libsystem_platform.dylib`_platform_memmove + 212 * frame #1: 0x000000018be9c6b0 libsystem_kernel.dylib`thread_get_state + 452 frame #2: 0x00000001066c05e0 XUL`MachExceptionHandlerThread() [inlined] HandleMachException(request=0x00000002b7e5ec98) at WasmSignalHandlers.cpp:847:10 [opt] frame #3: 0x00000001066c05ac XUL`MachExceptionHandlerThread() at WasmSignalHandlers.cpp:913 [opt] ``` Because I have some debugging code in WasmSignalHandlers.cpp locally, and this was on the beta branch, this corresponds to: https://hg.mozilla.org/releases/mozilla-beta/file/ddad4255cf307a6b76d5a314cc84d52b2c1b96b8/js/src/wasm/WasmSignalHandlers.cpp#l845 And the reason this is causing these problems is because context.float_ is too small to actually contain the float state, so the thread_get_state is doing a stack buffer overflow. And as mentioned in comment 8, the pthread metadata is at the highest addresses in the stack for the thread. Without a guard page. So a stack buffer overflow in the first frames in the thread will happily overwrite the pthread metadata. As for the reason it doesn't happen on nightly? I'm not entirely sure yet. The best part of all this? It's a blunder on my part in bug 1658386.
After reverse engineering the pthread implementation in Big Sur, which, fortunately, while not identical, is close enough to the code in https://github.com/apple/darwin-libpthread/ that I wasn't going fully blindly, I identified where the problem occurs (all this would have been easier had there been debug symbols for those system libraries but well...). As mentioned in comment 8, a pthread_t is just a pointer to the pthread metadata. All pthreads are part of an intrusive linked list, meaning the pointer to the next thread is within the pthread metadata, not in a separate struct. The list starts at the _pthread_head symbol, which points to the main thread. Then each thread creation adds to the list. When pthread_join is called, one of the first things that's done is to go through the list of threads, starting from what _pthread_head points to, and check that the pthread_t that is passed in is, in fact, in that list, otherwise it returns an error. This is where the crash is happening: while looping through the list, one of the "next" pointers is broken (most of the time, it's 0x11 for me). The first thing the code does with this pointer is compare it to the one that was given, which it doesn't match, so it goes on to try to load the next one, so it tries to read the "next" field in what it thinks is a pthread metadata struct at that location (0x11). The offset of the field in the struct is 0x10, so it tries to read at 0x21, which segfaults. It so happens that the thread for which the "next" field is overwritten somehow is always the MachExceptionHandlerThread thread. The reason this doesn't happen entirely reliably (sometimes it takes a few seconds of twitch streaming, sometimes it's instant) is that it depends in which order threads appear in the list, and whether threads being joined appear before or after the MachExceptionHandlerThread thread in the list. Now, as to what is overwriting the "next" field, this is the backtrace when it happens: ``` frame #0: 0x000000018bf14a04 libsystem_platform.dylib`_platform_memmove + 212 * frame #1: 0x000000018be9c6b0 libsystem_kernel.dylib`thread_get_state + 452 frame #2: 0x00000001066c05e0 XUL`MachExceptionHandlerThread() [inlined] HandleMachException(request=0x00000002b7e5ec98) at WasmSignalHandlers.cpp:847:10 [opt] frame #3: 0x00000001066c05ac XUL`MachExceptionHandlerThread() at WasmSignalHandlers.cpp:913 [opt] ``` Because I have some debugging code in WasmSignalHandlers.cpp locally, and this was on the beta branch, that line corresponds to: https://hg.mozilla.org/releases/mozilla-beta/file/ddad4255cf307a6b76d5a314cc84d52b2c1b96b8/js/src/wasm/WasmSignalHandlers.cpp#l845 And the reason this is causing these problems is because context.float_ is too small to actually contain the float state, so the thread_get_state is doing a stack buffer overflow. And as mentioned in comment 8, the pthread metadata is at the highest addresses in the stack for the thread. Without a guard page. So a stack buffer overflow in the first frames in the thread will happily overwrite the pthread metadata. As for the reason it doesn't happen on nightly? I'm not entirely sure yet. The best part of all this? It's a blunder on my part in bug 1658386.