Closed Bug 1614570 Opened 1 year ago Closed 1 year ago

Crash in [@ IPCError-browser | ShutDownKill | mozilla::gfx::VRManagerChild::ShutDown]

Categories

(Core :: Graphics, defect, P2)

Desktop
Windows
defect

Tracking


RESOLVED FIXED
mozilla75
Tracking Status
firefox-esr68 --- unaffected
firefox73 --- unaffected
firefox74 --- fixed
firefox75 --- fixed

People

(Reporter: gsvelto, Assigned: sotaro)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-1ab9ea5d-2b34-42d5-858b-be9120200210.

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForAlertByThreadId 
1 ntdll.dll RtlSleepConditionVariableSRW 
2 kernelbase.dll SleepConditionVariableSRW 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue>::GetEvent xpcom/threads/ThreadEventQueue.cpp:208
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1140
6 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
7 xul.dll static mozilla::gfx::VRManagerChild::ShutDown gfx/vr/ipc/VRManagerChild.cpp:140
8 xul.dll static mozilla::gfx::VRManagerChild::ReinitForContent gfx/vr/ipc/VRManagerChild.cpp:107
9 xul.dll mozilla::dom::ContentChild::RecvReinitRendering dom/ipc/ContentChild.cpp:1634

The affected content processes are stuck in VRManagerChild::ShutDown() which is spinning the event loop waiting for something to happen. All the other threads are also idle so the process is effectively deadlocked.

Nathan, this is another report exhibiting the behavior I described yesterday. What seems to have happened is the following:

Obviously the wait will never finish because the event loop isn't spinning, and I suspect that's what's happening in other content-process shutdown deadlocks like this one.

Note that if the same code executes in the same sequence in the main process (w/o the IPC obviously) it won't be a problem because in the main process we don't stop the event loop until we hit the xpcom-shutdown. And that's very, very late during the shutdown process; all the machinery that needs to do non-trivial stuff with threads at shutdown will already have executed long before we get there.

Flags: needinfo?(nfroyd)

Crash reports suggest this started around build 20200131213333. Pushlog for that is:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b3fedfc4923c&tochange=76ee1827c820f34b3b595f887f57b4c847316fcc

Which suggests bug 1612256.

Flags: needinfo?(sotaro.ikeda.g)
Regressed by: 1612256
Keywords: regression
OS: Windows 10 → Windows
Hardware: Unspecified → Desktop
Priority: -- → P2

Disregard the starting date; these crashes have been reprocessed. Crashes with this signature previously showed up as bug 1279293, but after bug 1612569 landed they now have separate signatures. The fact that the first one is from build 20200131213333 is just an artifact of that; the crashes certainly started earlier.

The majority of the crashes happened under VRManagerChild::ReinitForContent(). I am going to look into whether it is possible to reproduce the problem.

(In reply to Sotaro Ikeda [:sotaro] from comment #4)

> The majority of the crashes happened under VRManagerChild::ReinitForContent(). I am going to look into whether it is possible to reproduce the problem.

I could not reproduce the problem locally, but Firefox does a lot of work while re-creating the GPU process. If GPU process re-creation and Firefox shutdown happen concurrently, the CPU becomes very busy and a shutdown kill could happen via the timeout.

The bug 1612256 fix added latency, so it seems better to back out the fix for now.

Flags: needinfo?(sotaro.ikeda.g)
Assignee: nobody → sotaro.ikeda.g

We've reprocessed all the crashes, so now we're sure that the first build ID with a crash is 20200131213333, which is consistent with when bug 1612256 landed. Note that this is not just a latency problem; like other shutdown crashes, it might require a higher-level fix as described in comment 1.

The chain of events here is:

  • The shutdown IPC message was sent from the main process, triggering ContentChild::RecvShutdown(); this stopped the event loop.
  • The re-init rendering IPC message was sent from the main process, triggering ContentChild::RecvReinitRendering(), which you can see on the stack.
  • VRManagerChild::Destroy() posted a runnable to perform its shutdown procedure, but the event loop had already stopped, so it will never be executed.
  • So we're stuck here waiting for an event that will never happen.

I don't know if it's possible to fix this w/o altering the way the event loop behaves when child processes are shut down.

(In reply to Gabriele Svelto [:gsvelto] from comment #1)

> Nathan, this is another report exhibiting the behavior I described yesterday. What seems to have happened is the following:

I don't see that these two steps here actually stop spinning the event loop. In context, nsThreadManager::SpinEventLoopUntilOrShutdown:

https://searchfox.org/mozilla-central/rev/a1592902acabf9bded973067133baaac1457f3d3/xpcom/threads/nsThreadManager.cpp#676-718

is only really meant to be called from JS:

https://searchfox.org/mozilla-central/source/xpcom/threads/nsIThreadManager.idl#135-142

and indeed, you can see that this function is currently only called from JS:

https://searchfox.org/mozilla-central/search?q=SpinEventLoopUntilOrShutdown&redirect=false

I can totally believe somebody might be waiting for an event that will never happen. But it looks more like one of two things is going on with the main thread:

a. it isn't getting woken up by people posting events, so it can't check whether sVRManagerChildSingleton is actually null; or
b. the task that's supposed to run VRManagerChild::AfterDestroy isn't actually getting run.

a) seems kind of unlikely, but possible, I guess?

Flags: needinfo?(nfroyd)

VRManagerChild::ShutDown() was implemented by mimicking CompositorBridgeChild::ShutDown(). It is called from CompositorManagerChild::Shutdown().

CompositorBridgeChild::ShutDown() is already used in Gecko, so I thought that VRManagerChild::ShutDown() would work correctly as well. But there was a difference in ContentChild::RecvReinitRendering(): for CompositorManagerChild, CompositorManagerChild::Init() is called, but for VRManagerChild, VRManagerChild::ReinitForContent() is called, and VRManagerChild::ReinitForContent() calls VRManagerChild::ShutDown() internally.

Given the above, it seems that the problem could be addressed by changing VRManagerChild to be more similar to CompositorManagerChild.

Thanks to both of you for the analysis. I had missed the boolean in SpinEventLoopUntilInternal(): in this case we're calling SpinEventLoopUntil(), which sets that boolean to false, so it won't care about the mShuttingDown flag. My guess is that we're seeing b), because this code should have posted a runnable at this point, but that runnable doesn't seem to be running anywhere. How is that possible?

There's also another wrinkle I hadn't noticed before: the IPCShutdownState annotation has been set, which means that ContentChild::RecvShutdown() has been called. However, ContentChild::ShutdownInternal() has not yet run, because otherwise it would have changed that annotation value to SendFinishShutdown (something).

That might happen because sometimes we delay that operation here.

So the call to ContentChild::RecvReinitRendering() might have happened before we executed ContentChild::ShutdownInternal().

There's also yet another wrinkle that's even scarier: we have two process kill timers, one in the parent and one in the child! I never noticed this before, and I don't know why that's the case. I would assume the parent one should always go off before the child one... but I have no idea what the rationale is for having both, and I don't know how they interact.

Anyway, sorry for the bugspam Sotaro, I'll file another bug to better analyze the overall issue.

Pushed by sikeda.birchill@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/51b0f3fef43a
Make VRManagerChild life time handling more like CompositorManagerChild r=aosmond
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla75

Is this something we need to uplift to Beta?

Flags: needinfo?(sotaro.ikeda.g)

Comment on attachment 9126344 [details]
Bug 1614570 - Make VRManagerChild life time handling more like CompositorManagerChild

Beta/Release Uplift Approval Request

  • User impact if declined: A shutdown kill could happen in a content process when closing Firefox and a GPU process restart happen concurrently.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: none
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The fix changed VRManagerChild shutdown to work the same way as the already-proven CompositorManagerChild shutdown.
  • String changes made/needed: none
Flags: needinfo?(sotaro.ikeda.g)
Attachment #9126344 - Flags: approval-mozilla-beta?

Comment on attachment 9126344 [details]
Bug 1614570 - Make VRManagerChild life time handling more like CompositorManagerChild

Crash fix, proved effective on nightly, approved for 74.0b5, thanks.

Attachment #9126344 - Flags: approval-mozilla-beta? → approval-mozilla-beta+