This bug was filed from the Socorro interface and is report bp-460e37a4-1879-4af5-8609-5873d2161104. ============================================================= Crashing Thread (41) Frame Module Signature Source 0 xul.dll `anonymous namespace'::wasapi_stream_render_loop media/libcubeb/src/cubeb_wasapi.cpp:873 1 ucrtbase.dll _crt_at_quick_exit 2 kernel32.dll BaseThreadInitThunk 3 ntdll.dll __RtlUserThreadStart 4 ntdll.dll _RtlUserThreadStart crashes with "wasapi_stream_render_loop" have been around for a while but are regressing in volume starting with firefox 49 builds. it's a crash occurring on windows 7 & upwards - on 49 release this is accounting for ~0.14% of all crashes, on 50.0b it's 0.2%. judging from user comments there is no single theme of what might trigger this but people are often complaining about repeated crashes there.
Three signatures, let's break it down: - CleanupPerAppKey | `anonymous namespace''::wasapi_stream_render_loop I think that's the case where we're intentionally leaking the thread  because we could not join the thread properly: we close the handle brutally and the shutdown event right after. Closing the thread handle does not kill off the thread (that would be `TerminateThread`). Sometimes after, it wakes up and find an invalid handle and it crashes. This is not very frequent. - abort | `anonymous namespace''::wasapi_stream_render_loop We crashing because we're not finding the event we're expecting, in a release-enabled assert. We zero-out the shutdown assert , that must be the cause for the assertion blowing up. We've probably leaked the thread as well. - `anonymous namespace''::wasapi_stream_render_loop This contains a number of different scenarios: 1. UAF on `stm->output_client` , during a reconfigure. 2. The same assert as the previous issue blowing up (maybe it's being mis-categorized when bucketing by signature? It's unclear) 3. UAF on `stm->state_callback` , during shutdown. Those (appart from 2.) look like it could be caused by the rendering thread waking up after having been blocked somehow, and dequeuing some events. In the meantime, either an audio device was changed, or we've shut down the stream (maybe the page was being buggy and the user closed the tab, something like that). In any case, it seems like we could use `TerminateThread` to simply kill off the thread if we can't join for five seconds, so that it stops executing. It's probably going to leak some stuff, but at least it should prevent crashing. That said, it's pretty late in the cycle if we want to attempt a fix. It _should_ be a one liner (`TerminateThread` and returning instead of simply not doing anything and normally closing the handle). Matthew, does the above make sense ? It's mostly speculations since I can't really repro. I'll try a couple things monday when I'm back in the office. : https://hg.mozilla.org/releases/mozilla-beta/annotate/829a3f99f260/media/libcubeb/src/cubeb_wasapi.cpp#l1140 : https://hg.mozilla.org/releases/mozilla-release/annotate/7356baae8e73/media/libcubeb/src/cubeb_wasapi.cpp#l1187 : https://crash-stats.mozilla.com/report/index/eb4c7366-d156-4c5f-9a45-9a4332161030, https://hg.mozilla.org/releases/mozilla-aurora/annotate/0c44c99f7b57/media/libcubeb/src/cubeb_wasapi.cpp#l834 : https://crash-stats.mozilla.com/report/index/7223fd5e-8d42-473d-9041-f02dc2161104, https://hg.mozilla.org/releases/mozilla-release/annotate/4ef9e419a8e1/media/libcubeb/src/cubeb_wasapi.cpp#l827
Assignee: nobody → padenot
Thanks for digging in to this. The analysis makes sense, it makes sense that a blocked render thread may wake up after we abandon waiting for it. The current code isn't prepared to handle that situation since it was originally designed to wait for thread completion indefinitely, and when we added the timeout to work around reported hangs we missed this late wakeup case. In the short term, I wonder if we're better off reverting the thread wait timeout and living with a hang instead of several UAF crashes. It's probably more useful to learn more about where the code hangs than the random places we can crash with a UAF. Prior to the timeout, I suspect there were some hangs that were longer than the timeout we picked (5s) that eventually completed which the workaround converted from a hang to a crash. My take on the TerminateThread suggestion is that it's *very* difficult to use correctly, and probably impossible to use safely in this situation. Beyond the code directly under our control in libcubeb, which would need careful analysis and most likely reworking to permit safe termination at arbitrary points, there's (OS and Gecko (via the user callbacks)) code being called outside our control that may be left in a corrupted state, potentially with non-local consequences. There's a long history of subtle bugs caused by TerminateThread misuse, for example bug 535585. From the little information we have, it's most likely that we're blocked in OS (audio API, probably waiting on audiodg server responses) code, and it's not possible to know the consequences of terminating at arbitrary points within that code. Thinking through some other options: - Softening the risk of TerminateThread by using SuspendThread on the render thread when we abandon the wait, hopefully ensuring it never attempts to access any of the shared state again. That should avoid the risk of corrupting state, but (as with TerminateThread) doesn't provide protection from deadlocks if the hung thread ends up suspended while holding locks. Unfortunately, this effectively converts the crashes back into the original hang, so we're no better off than simply reverting the timeout. In both cases, the forensic evidence from a hang is more useful for a long term fix than reports we get about crashes caused by UAFs. Also worth noting that TerminateThread and SuspendThread are asynchronous, so you need to wait for the target thread's state to change before continuing. - Implement something like .NET's Thread.Abort() where a ThreadAbortException exception is raised in the target thread, allowing some unwinding and cleanup before terminating the thread. I can't see a simple way to do this with the native Win32 API, and I suspect it relies upon VM safepoints and/or other .NET runtime stuff unavailable to native code and difficult or impossible to retrofit. - Avoid the UAF entirely by transferring ownership of libcubeb's shared resources to the render thread, allowing the rest of the code to relinquish its reference while avoiding UAF hazards, reducing the worst case to a leak. Doing this for all shared resources is a significant and risky change, and I haven't thought about the design much to see how easy it'd be to do in practice. - Minimal variant of the last option: introduce a piece of state tracking the shutdown request allocated at the time the thread is created but transferred to the thread for it to release when the thread exits, ensuring it's safe to access at any time. This would then need to be checked before absolutely any access of shared state not owned by the thread (e.g. the cubeb_stream), but this is going to have a small performance penalty and may be difficult to enforce as the code changes.
I like the last option. I'll implement it first thing tomorrow.
Component: Audio/Video → Audio/Video: cubeb
Probably too late for 50 :-(. I'd look to take a safe or one-line fix for a 50.0.1 for stability reasons, and secondarily sec reasons. And it looks like we have several options for that.
The fix for this landed yesterday on Bug 1314514. It should merge to m-c today. We'd like it to bake at least a couple of days before cherry-picking the fix to uplift to Aurora/Beta. padenot is taking point on the uplift.
Was this uplifted around now? I lost track of the various "uplift cubeb" bugs that have been going around.
I agree this was a bit complicated, but this particular underlying problem was causing a bunch of different crash signatures. This has been dealt with in bug 1302231. Duping as it should have been done (although we discovered it was a dupe rather late).
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1302231
Thanks for confirming, Paul.
You need to log in before you can comment on or make changes to this bug.