Closed Bug 1442730 Opened 6 years ago Closed 6 years ago

[meta] Various QuantumRender testrun crashes, with fatal errors mentioning mozilla::detail::MutexImpl and pthread_mutex_*

Categories

(Core :: Graphics: WebRender, defect, P3)

Tracking

RESOLVED FIXED
Tracking Status
firefox60 --- affected

People

(Reporter: dholbert, Unassigned)

I've noticed several intermittents being filed over the past few days with "application terminated with exit code 11", where the first sign of trouble is in a MutexImpl failure:

bug 1441498:
mozilla::detail::MutexImpl::~MutexImpl: pthread_mutex_destroy failed: Device or resource busy

bug 1442169:
mozilla::detail::MutexImpl::lock: pthread_mutex_lock failed: Invalid argument

bug 1442624:
mozilla::detail::MutexImpl::unlock: pthread_mutex_unlock failed: Invalid argument


I'm guessing these are symptomatic of some general broader issue, so I'm filing this bug to cover that broader issue & so that we can keep these bugs (and future ones of the same sort) logically grouped together.

I'm tentatively filing this in the WebRender component, because these all seem to have happened on the "Linux x64 QuantumRender" platform so far.
Summary: Various QuantumRender testrun crashes, with fatal errors mentioning mozilla::detail::MutexImpl and pthread_mutex_* → [meta] Various QuantumRender testrun crashes, with fatal errors mentioning mozilla::detail::MutexImpl and pthread_mutex_*
Hm, interesting. I looked at the stack from the log in bug 1441498, where the crash happens during MutexImpl destruction. I thought one of the other threads would be holding on to the Mutex while it was being destroyed, which would probably explain the crash. However all the other threads seemed pretty much idle.

The other two bugs, though, show other places where the Mutex crashes, and provide the missing information. The mutex destruction happens on the main thread, but the WebRenderBridgeParent::RecvShutdownSync call (which then tries to acquire/release the mutex as part of its handling) is happening on the Compositor thread, probably concurrently.
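
To make the suspected hazard concrete: if one thread can destroy a mutex-owning object while another thread is still calling into it, the lock/unlock calls on the second thread can hit a dead mutex. The sketch below is not Gecko code. `Guarded`, `HandleShutdown` and `RunShutdownSketch` are hypothetical names, and `std::shared_ptr` and `std::mutex` are stand-ins for the real refcounting and mutex types. It only illustrates one common way to rule this out: each thread holds its own strong reference, so the destructor cannot run until the last user is done.

```cpp
#include <atomic>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical stand-in for an object that owns a mutex and is used
// from two threads during shutdown.
struct Guarded {
  std::mutex lock;
  int handled = 0;
  static std::atomic<int> destroyed;

  ~Guarded() { destroyed.fetch_add(1); }

  // Stand-in for work that acquires and releases the mutex.
  void HandleShutdown() {
    std::lock_guard<std::mutex> guard(lock);
    ++handled;
  }
};
std::atomic<int> Guarded::destroyed{0};

// Returns how many shutdown calls the worker thread completed.
int RunShutdownSketch() {
  auto owner = std::make_shared<Guarded>();
  std::atomic<int> seen{0};

  // The worker ("compositor") thread holds its own strong reference.
  std::thread compositor([ref = owner, &seen] {
    ref->HandleShutdown();
    seen = ref->handled;
  });

  // The "main" thread drops its reference. The object, and the mutex
  // inside it, stay alive until the worker's reference is released too.
  owner.reset();

  compositor.join();
  return seen.load();
}
```

Calling `RunShutdownSketch()` runs the shutdown handler on the worker thread regardless of whether the main thread drops its reference first. With a raw pointer in place of `ref`, the same interleaving would be a use-after-free.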

Sotaro, do you know offhand what the shutdown order is supposed to be here? At some point I used to know the shutdown order of stuff but I've forgotten :/
Flags: needinfo?(sotaro.ikeda.g)
Note: bug 1442511 (another instance of this) has a crash-volume graph at the top of it (since its "crash signature" field is populated, as [@ mozilla::detail::MutexImpl::~MutexImpl ]).

That graph seems to indicate that we're seeing a constant low level of crashes like this in the wild, too. However, if I click through to crash-stats, I can't make it show me any results. (That may just be user error on my part.)
Blocks: 1442511, 1367489
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #1)
> Sotaro, do you know offhand what the shutdown order is supposed to be here?
> At some point I used to know the shutdown order of stuff but I've forgotten
> :/

The shutdown sequence on Linux is as follows.

// Called in Parent process on linux
nsWindow::Destroy()
->nsBaseWidget::DestroyCompositor()
->InProcessCompositorSession::Shutdown()
->CompositorBridgeChild::Destroy()
   ->WebRenderBridgeChild::Destroy() // Trigger PWebRenderBridge shut down
   ->WebRenderBridgeParent::SendShutdown() // This triggers async IPC
   ->WebRenderBridgeParent::RecvShutdown()
   ->WebRenderBridgeParent::HandleShutdown()
   ->WebRenderBridgeParent::Destroy()
   ->WebRenderBridgeParent::ClearResources()
->CompositorBridgeChild::SendWillClose() // Wait until CompositorBridgeParent side shut down
->CompositorBridgeParent::RecvWillClose()
->CompositorBridgeParent::StopAndClearResources()
From the crash logs, I suspect that the crash might be caused by nsBaseWidget::mCompositorVsyncDispatcher. It is used by both the main thread and the compositor thread when the GPU process does not exist.

The mCompositorVsyncDispatcher is cleared in nsBaseWidget::DestroyCompositor() before calling InProcessCompositorSession::Shutdown(). That might cause a race.
  https://dxr.mozilla.org/mozilla-central/source/widget/nsBaseWidget.cpp#265
mWidget->GetCompositorVsyncDispatcher() is called on the compositor thread as follows.

----------------------------------------------------

void
InProcessGtkCompositorWidget::ObserveVsync(VsyncObserver* aObserver)
{
  if (RefPtr<CompositorVsyncDispatcher> cvd = mWidget->GetCompositorVsyncDispatcher()) {
    cvd->SetCompositorVsyncObserver(aObserver);
  }
}
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Sotaro Ikeda [:sotaro] from comment #5)
> mWidget->GetCompositorVsyncDispatcher() is called in compositor thread like
> the following.

As I mentioned on bug 1441498 I don't think this matters, since the CompositorVsyncDispatcher uses threadsafe refcounting.
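
For readers unfamiliar with the term, here is a rough sketch of what threadsafe refcounting buys you. It uses `std::shared_ptr` as a stand-in for `RefPtr` (an assumption for illustration, not the Gecko implementation), and `Dispatcher` and `CountAfterConcurrentCopies` are hypothetical names. Each thread starts with its own reference and repeatedly takes and drops extra ones; because the count is updated atomically, no increment or decrement is lost.

```cpp
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for a refcounted dispatcher object.
struct Dispatcher {
  int id = 7;
};

// Spawns `threads` workers that each copy and release references
// `iterations` times, then reports the remaining reference count.
long CountAfterConcurrentCopies(int threads, int iterations) {
  auto cvd = std::make_shared<Dispatcher>();
  std::vector<std::thread> workers;

  for (int t = 0; t < threads; ++t) {
    // Each worker captures its own copy of the pointer.
    workers.emplace_back([cvd, iterations] {
      for (int i = 0; i < iterations; ++i) {
        std::shared_ptr<Dispatcher> local = cvd;  // atomic increment
        (void)local->id;
      }  // `local` goes out of scope here: atomic decrement
    });
  }

  for (auto& w : workers) {
    w.join();
  }
  workers.clear();

  // Only the reference held by `cvd` should be left.
  return cvd.use_count();
}
```

Note the scope of the guarantee in this sketch: it covers threads that each hold their own `std::shared_ptr` copy. Reading one `std::shared_ptr` variable on one thread while another thread resets that same variable is a separate question that atomic counting alone does not answer.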

What seems really fishy to me is that the crash stacks in bug 1442169 and bug 1442624 are going through WebRenderBridgeParent::RecvShutdownSync. The sync shutdown message (as opposed to the async one) is only ever triggered from [1] in the content process, and from [2] in the in-process compositor case. Both of those should only happen if WR is getting disabled for some reason, and if that's the case I would expect some sort of error message in the log, but I don't see anything. It might be that we have multiple widgets in play here, with one of them hitting the WR error while another widget is shutting down, and we're missing handling of some edge case.

[1] https://searchfox.org/mozilla-central/rev/bffd3e0225b65943364be721881470590b9377c1/dom/ipc/TabChild.cpp#3214
[2] https://searchfox.org/mozilla-central/rev/bffd3e0225b65943364be721881470590b9377c1/gfx/ipc/GPUProcessManager.cpp#483
The patch on bug 1441498 should fix this; kudos to Sotaro for making a gtest that replicates the problem so I could understand it better.

The RecvShutdownSync stuff is still fishy... but may be a red herring here. :)
No longer blocks: 1441498
Depends on: 1441498
Depends on: 1442104
Blocks: 1442104
No longer depends on: 1442104
We haven't seen any more of these since the fix landed.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED