Crash in [@ IPCError-browser | ShutDownKill | mozilla::gfx::VRManagerChild::ShutDown]
Categories
(Core :: Graphics, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr68 | --- | unaffected |
firefox73 | --- | unaffected |
firefox74 | --- | fixed |
firefox75 | --- | fixed |
People
(Reporter: gsvelto, Assigned: sotaro)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(1 file)
47 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
|
Details | Review |
This bug is for crash report bp-1ab9ea5d-2b34-42d5-858b-be9120200210.
Top 10 frames of crashing thread:
0 ntdll.dll NtWaitForAlertByThreadId
1 ntdll.dll RtlSleepConditionVariableSRW
2 kernelbase.dll SleepConditionVariableSRW
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue>::GetEvent xpcom/threads/ThreadEventQueue.cpp:208
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1140
6 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
7 xul.dll static mozilla::gfx::VRManagerChild::ShutDown gfx/vr/ipc/VRManagerChild.cpp:140
8 xul.dll static mozilla::gfx::VRManagerChild::ReinitForContent gfx/vr/ipc/VRManagerChild.cpp:107
9 xul.dll mozilla::dom::ContentChild::RecvReinitRendering dom/ipc/ContentChild.cpp:1634
The affected content processes are stuck in VRManagerChild::ShutDown()
which is spinning the event loop waiting for something to happen. All the other threads are also idle so the process is effectively deadlocked.
Reporter | ||
Comment 1•5 years ago
|
||
Nathan this is another report exhibiting the behavior I described yesterday. What seems to have happened is the following:
- The process has already received a shutdown IPC message (the
IPCShutdownState
annotation is set toRecvShutdown
). - So we've sent a
content-child-will-shutdown
notification (see here) - That message has told the thread manager that we're shutting down
- Which IIUC means we're not spinning the event loop anymore
- But this process received another IPC message after the shutdown one (which I think was expected) and is now shutting down the VR manager by posting a runnable
- ... and then waiting on it
Obviously the wait will never be over because the event loop isn't spinning and I suspect that's what's happening in other content process shutdown deadlocks like this one.
Note that if the same code executes in the same sequence in the main process (w/o the IPC obviously) it won't be a problem because in the main process we don't stop the event loop until we hit the xpcom-shutdown
. And that's very, very late during the shutdown process; all the machinery that needs to do non-trivial stuff with threads at shutdown will already have executed long before we get there.
Comment 2•5 years ago
|
||
Crash reports suggest this started around build 20200131213333. Pushlog for that is:
Which suggests bug 1612256.
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Reporter | ||
Comment 3•5 years ago
|
||
Disregard the starting date, these crashes have been reprocessed. Before crashes with this signature showed up as bug 1279293 but after landing bug 1612569 they now have separate signatures. The fact that the first now was from build 20200131213333 is just an artifact of that, the crashes started certainly before.
Assignee | ||
Comment 4•5 years ago
•
|
||
Majority of crash happened under VRManagerChild::ReinitForContent(). I am going to look into if it is possible to address to reproduce the problem.
Assignee | ||
Comment 5•5 years ago
•
|
||
(In reply to Sotaro Ikeda [:sotaro] from comment #4)
Majority of crash happened under VRManagerChild::ReinitForContent(). I am going to look into if it is possible to address to reproduce the problem.
I could not reproduce the problem locally. But Firefox does a lot of things during re-creating GPU process. Then if re-creating GPU process and Firefox shutdown happen concurrently, cpu becomes very busy and shutdown kill could happen by timeout.
Assignee | ||
Comment 6•5 years ago
|
||
Bug 1612256 fix added latency, then it seems better to backout the fix for now.
Assignee | ||
Updated•5 years ago
|
Reporter | ||
Comment 7•5 years ago
|
||
We've reprocessed all the crashes, so no we're sure that the first buildid with a crash is 20200131213333 which is consistent with when bug 1612256 landed. Note that this is not just a latency problem, like other shutdown crashes it might require a higher-level fix as described in comment 1.
The chain of events here is:
- The shutdown IPC message was sent from the main process triggering
ContentChild::RecvShutdown()
, this stopped the event loops. - The re-init rendering IPC messages was sent from the main process triggering
ContentChild::RecvReinitRendering()
which you can see on the stack VRManagerChild::Destroy()
posted a runnable to perform its shutdown procedure, but the event loop has already stopped so it won't be executed.- So we're stuck here waiting for an event that will never happen
I don't know if it's possible to fix this w/o altering the way the event loop behaves when child processes are shut down.
![]() |
||
Comment 8•5 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #1)
Nathan this is another report exhibiting the behavior I described yesterday. What seems to have happened is the following:
- The process has already received a shutdown IPC message (the
IPCShutdownState
annotation is set toRecvShutdown
).- So we've sent a
content-child-will-shutdown
notification (see here)- That message has told the thread manager that we're shutting down
- Which IIUC means we're not spinning the event loop anymore
I don't see that these two steps here actually stop spinning the event loop. In context, nsThreadManager::SpinEventLoopUntilOrShutdown
:
is only really meant to be called from JS:
https://searchfox.org/mozilla-central/source/xpcom/threads/nsIThreadManager.idl#135-142
and indeed, you can see that this function only being called from JS is currently the case:
https://searchfox.org/mozilla-central/search?q=SpinEventLoopUntilOrShutdown&redirect=false
I can totally believe somebody might be waiting for an event that will never happen. But it looks more like the main thread either:
a. isn't getting woken up by people posting events so that it can go check whether sVRManagerChildSingleton
is actually null; or
b. the task that's supposed to be running VRManagerChild::AfterDestroy
isn't actually getting run.
a) seems kind of unlikely, but possible, I guess?
Assignee | ||
Comment 9•5 years ago
•
|
||
VRManagerChild::ShutDown() was implemented by mimicking CompositorBridgeChild::ShutDown(). It is called from CompositorManagerChild::Shutdown().
CompositorBridgeChild::ShutDown() is already used in gecko, then I thought that VRManagerChild::ShutDown() should work well. But there was a difference in ContentChild::RecvReinitRendering(). For CompositorManagerChild, CompositorManagerChild::Init() is called. But for VRManagerChild, VRManagerChild::ReinitForContent() is called. And VRManagerChild::ReinitForContent() calls VRManagerChild::ShutDown() in it.
From the above, it seems that the problem could be addressed by changing VRManagerChild more similar to CompositorManagerChild.
Assignee | ||
Comment 10•5 years ago
|
||
Reporter | ||
Comment 11•5 years ago
|
||
Thanks for the analysis to both. I had missed the boolean in SpinEventLoopUntilInternal()
, so in this case we're calling SpinEventLoopUntil()
which is setting that to false so it won't care about the mShuttindDown
flag. My guess is that we're seeing b) because this code should have posted a runnable at this point, but that runnable doesn't seem to be running anywhere, how is that possible?
There's also another wrinkle I hadn't noticed before: the IPCShutdownState
annotation has been set, which means that ContentChild::RecvShutdown()
has been called, however ContentChild::ShutdownInternal()
has not yet run because it otherwise it would have changed that annotation value to SendFinishShutdown (something)
.
That might happen because sometimes we delay that operation here.
So the call to ContentChild::RecvReinitRendering()
might have happened before we executed ContentChild::ShutdownInternal()
.
There's also yet another wrinkle that's even scarier: we have two process kill timers! One in the parent and one in the child. I never noticed before and I don't know why that's the case and I would assume the parent one should always go off before the child one... but I have no idea what's the rationale for having both and I don't know how they interact.
Reporter | ||
Comment 12•5 years ago
|
||
Anyway, sorry for the bugspam Sotaro, I'll file another bug to better analyze the overall issue.
Comment 13•5 years ago
|
||
Comment 14•5 years ago
|
||
bugherder |
Comment 15•5 years ago
|
||
Is this something we need to uplift to Beta?
Assignee | ||
Comment 16•5 years ago
•
|
||
Comment on attachment 9126344 [details]
Bug 1614570 - Make VRManagerChild life time handling more like CompositorManagerChild
Beta/Release Uplift Approval Request
- User impact if declined: Shutdown kill could happen in content process during closing Firefox and GPU process re-start happened concurrently.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: none
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The fix changed VRManagerChild shutdown as same as CompositorManagerChild shutdown that is already used.
- String changes made/needed: none
Comment 17•5 years ago
|
||
Comment on attachment 9126344 [details]
Bug 1614570 - Make VRManagerChild life time handling more like CompositorManagerChild
Crash fix, proved effective on nightly, approved for 74.0b5, thanks.
Comment 18•5 years ago
|
||
bugherder uplift |
Description
•