Closed Bug 1614570 Opened 1 year ago Closed 1 year ago

Crash in [@ IPCError-browser | ShutDownKill | mozilla::gfx::VRManagerChild::ShutDown]

Categories

(Core :: Graphics, defect, P2)

Desktop
Windows
defect

Tracking


RESOLVED FIXED
mozilla75
Tracking Status
firefox-esr68 --- unaffected
firefox73 --- unaffected
firefox74 --- fixed
firefox75 --- fixed

People

(Reporter: gsvelto, Assigned: sotaro)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-1ab9ea5d-2b34-42d5-858b-be9120200210.

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForAlertByThreadId 
1 ntdll.dll RtlSleepConditionVariableSRW 
2 kernelbase.dll SleepConditionVariableSRW 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue>::GetEvent xpcom/threads/ThreadEventQueue.cpp:208
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1140
6 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
7 xul.dll static mozilla::gfx::VRManagerChild::ShutDown gfx/vr/ipc/VRManagerChild.cpp:140
8 xul.dll static mozilla::gfx::VRManagerChild::ReinitForContent gfx/vr/ipc/VRManagerChild.cpp:107
9 xul.dll mozilla::dom::ContentChild::RecvReinitRendering dom/ipc/ContentChild.cpp:1634

The affected content processes are stuck in VRManagerChild::ShutDown() which is spinning the event loop waiting for something to happen. All the other threads are also idle so the process is effectively deadlocked.

Nathan, this is another report exhibiting the behavior I described yesterday. What seems to have happened is the following:

Obviously the wait will never finish because the event loop isn't spinning, and I suspect that's what's happening in other content-process shutdown deadlocks like this one.

Note that if the same code executes in the same sequence in the main process (w/o the IPC obviously) it won't be a problem because in the main process we don't stop the event loop until we hit the xpcom-shutdown. And that's very, very late during the shutdown process; all the machinery that needs to do non-trivial stuff with threads at shutdown will already have executed long before we get there.

Flags: needinfo?(nfroyd)

Crash reports suggest this started around build 20200131213333. Pushlog for that is:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b3fedfc4923c&tochange=76ee1827c820f34b3b595f887f57b4c847316fcc

Which suggests bug 1612256.

Flags: needinfo?(sotaro.ikeda.g)
Regressed by: 1612256
Keywords: regression
OS: Windows 10 → Windows
Hardware: Unspecified → Desktop
Priority: -- → P2

Disregard the starting date; these crashes have been reprocessed. Crashes with this signature previously showed up as bug 1279293, but after bug 1612569 landed they now have separate signatures. The fact that the first one is from build 20200131213333 is just an artifact of that; the crashes certainly started earlier.

The majority of the crashes happened under VRManagerChild::ReinitForContent(). I am going to look into whether it is possible to reproduce the problem.

(In reply to Sotaro Ikeda [:sotaro] from comment #4)

> The majority of the crashes happened under VRManagerChild::ReinitForContent(). I am going to look into whether it is possible to reproduce the problem.

I could not reproduce the problem locally, but Firefox does a lot of work while re-creating the GPU process. If GPU process re-creation and Firefox shutdown happen concurrently, the CPU becomes very busy and a shutdown kill could happen via the timeout.

The bug 1612256 fix added latency, so it seems better to back out the fix for now.

Flags: needinfo?(sotaro.ikeda.g)
Assignee: nobody → sotaro.ikeda.g

We've reprocessed all the crashes, so now we're sure that the first build ID with a crash is 20200131213333, which is consistent with when bug 1612256 landed. Note that this is not just a latency problem; like other shutdown crashes, it might require a higher-level fix as described in comment 1.

The chain of events here is:

  • The shutdown IPC message was sent from the main process, triggering ContentChild::RecvShutdown(); this stopped the event loop.
  • The re-init rendering IPC message was sent from the main process, triggering ContentChild::RecvReinitRendering(), which you can see on the stack.
  • VRManagerChild::Destroy() posted a runnable to perform its shutdown procedure, but the event loop had already stopped, so it will never be executed.
  • So we're stuck here waiting for an event that will never happen.

I don't know if it's possible to fix this w/o altering the way the event loop behaves when child processes are shut down.

(In reply to Gabriele Svelto [:gsvelto] from comment #1)

> Nathan, this is another report exhibiting the behavior I described yesterday. What seems to have happened is the following:

I don't see that these two steps here actually stop spinning the event loop. In context, nsThreadManager::SpinEventLoopUntilOrShutdown:

https://searchfox.org/mozilla-central/rev/a1592902acabf9bded973067133baaac1457f3d3/xpcom/threads/nsThreadManager.cpp#676-718

is only really meant to be called from JS:

https://searchfox.org/mozilla-central/source/xpcom/threads/nsIThreadManager.idl#135-142

and indeed, you can see that this function is currently only called from JS:

https://searchfox.org/mozilla-central/search?q=SpinEventLoopUntilOrShutdown&redirect=false

I can totally believe somebody might be waiting for an event that will never happen. But it looks more like one of two things is going on with the main thread:

a. it isn't getting woken up by people posting events, so it can't check whether sVRManagerChildSingleton is actually null; or
b. the task that's supposed to run VRManagerChild::AfterDestroy isn't actually getting run.

a) seems kind of unlikely, but possible, I guess?

Flags: needinfo?(nfroyd)

VRManagerChild::ShutDown() was implemented by mimicking CompositorBridgeChild::ShutDown(). It is called from CompositorManagerChild::Shutdown().

CompositorBridgeChild::ShutDown() is already used in Gecko, so I thought that VRManagerChild::ShutDown() would work correctly as well. But there was a difference in ContentChild::RecvReinitRendering(): for CompositorManagerChild, CompositorManagerChild::Init() is called, but for VRManagerChild, VRManagerChild::ReinitForContent() is called, and VRManagerChild::ReinitForContent() calls VRManagerChild::ShutDown() internally.

Given the above, it seems that the problem could be addressed by changing VRManagerChild to be more similar to CompositorManagerChild.

Thanks to both of you for the analysis. I had missed the boolean in SpinEventLoopUntilInternal(): in this case we're calling SpinEventLoopUntil(), which sets that boolean to false, so it won't care about the mShuttingDown flag. My guess is that we're seeing b), because this code should have posted a runnable at this point, but that runnable doesn't seem to be running anywhere. How is that possible?

There's also another wrinkle I hadn't noticed before: the IPCShutdownState annotation has been set, which means that ContentChild::RecvShutdown() has been called. However, ContentChild::ShutdownInternal() has not yet run, because otherwise it would have changed that annotation value to SendFinishShutdown (something).

That might happen because sometimes we delay that operation here.

So the call to ContentChild::RecvReinitRendering() might have happened before we executed ContentChild::ShutdownInternal().

There's also yet another wrinkle that's even scarier: we have two process kill timers, one in the parent and one in the child! I never noticed this before, and I don't know why that's the case. I would assume the parent one should always go off before the child one... but I have no idea what the rationale is for having both, and I don't know how they interact.

Anyway, sorry for the bugspam Sotaro, I'll file another bug to better analyze the overall issue.

Pushed by sikeda.birchill@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/51b0f3fef43a
Make VRManagerChild life time handling more like CompositorManagerChild r=aosmond
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla75

Is this something we need to uplift to Beta?

Flags: needinfo?(sotaro.ikeda.g)

Comment on attachment 9126344 [details]
Bug 1614570 - Make VRManagerChild life time handling more like CompositorManagerChild

Beta/Release Uplift Approval Request

  • User impact if declined: A shutdown kill could happen in a content process when closing Firefox and a GPU process restart happen concurrently.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: none
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The fix changed VRManagerChild shutdown to work the same way as the already-proven CompositorManagerChild shutdown.
  • String changes made/needed: none
Flags: needinfo?(sotaro.ikeda.g)
Attachment #9126344 - Flags: approval-mozilla-beta?

Comment on attachment 9126344 [details]
Bug 1614570 - Make VRManagerChild life time handling more like CompositorManagerChild

Crash fix, proved effective on nightly, approved for 74.0b5, thanks.

Attachment #9126344 - Flags: approval-mozilla-beta? → approval-mozilla-beta+