Closed Bug 1374254 Opened 7 years ago Closed 7 years ago

Crash in RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | CUseCountedObject<T>::UCReleaseUse

Categories

(Core :: Graphics: Layers, defect, P2)

55 Branch
All
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 --- fixed
firefox56 --- fixed

People

(Reporter: philipp, Assigned: vliu)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [gfx-noted])

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-e15ee2ed-a9b1-4d08-aa97-83bba0170619.
=============================================================
Crashing Thread (22), Name: Compositor
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	RtlpWaitOnCriticalSection 	
1 	ntdll.dll 	RtlpEnterCriticalSectionContended 	
2 	ntdll.dll 	RtlEnterCriticalSection 	
3 	d3d11.dll 	CUseCountedObject<NOutermost::CDeviceChild>::UCReleaseUse() 	
4 	d3d11.dll 	NDXGI::CDeviceChild<IDXGISurface, IUnknown>::FinalRelease() 	
5 	d3d11.dll 	CUseCountedObject<NOutermost::CDeviceChild>::UCDestroy() 	
6 	d3d11.dll 	TOptImmediateContext::AcquireDevCtxIfaceNoSync(bool) 	
7 	d3d11.dll 	CDevCtxInterface::CDevCtxInterface<CContext>(CContext*, bool, bool, bool) 	
8 	d3d11.dll 	NDXGI::CDevice::RotateResourceIdentities(IDXGIResource*, IDXGIResource* const*, unsigned int) 	
9 	dxgi.dll 	CDXGISwapChain::PresentImplCore(SPresentArgs const*, unsigned int, tagRECT const*, unsigned int, DXGI_SCROLL_RECT const*, IDXGIResource*, bool&, bool&, bool&) 	
10 	dxgi.dll 	CDXGISwapChain::Present(unsigned int, unsigned int) 	
11 	xul.dll 	mozilla::layers::CompositorD3D11::Present() 	gfx/layers/d3d11/CompositorD3D11.cpp:1187
12 	xul.dll 	mozilla::layers::CompositorD3D11::EndFrame() 	gfx/layers/d3d11/CompositorD3D11.cpp:1123
13 	xul.dll 	mozilla::layers::LayerManagerComposite::Render(mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gfx::IntRegionTyped<mozilla::gfx::UnknownUnits> const&) 	gfx/layers/composite/LayerManagerComposite.cpp:971
14 	xul.dll 	mozilla::layers::LayerManagerComposite::UpdateAndRender() 	gfx/layers/composite/LayerManagerComposite.cpp:511
15 	xul.dll 	mozilla::layers::LayerManagerComposite::EndTransaction(mozilla::TimeStamp const&, mozilla::layers::LayerManager::EndTransactionFlags) 	gfx/layers/composite/LayerManagerComposite.cpp:435
16 	xul.dll 	mozilla::layers::CompositorBridgeParent::CompositeToTarget(mozilla::gfx::DrawTarget*, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const*) 	gfx/layers/ipc/CompositorBridgeParent.cpp:1013
17 	xul.dll 	mozilla::layers::CompositorVsyncScheduler::Composite(mozilla::TimeStamp) 	gfx/layers/ipc/CompositorVsyncScheduler.cpp:258
18 	xul.dll 	mozilla::detail::RunnableMethodImpl<SoftwareDisplay* const, void ( SoftwareDisplay::*)(mozilla::TimeStamp), 1, 1, mozilla::TimeStamp>::Run() 	obj-firefox/dist/include/nsThreadUtils.h:1133
19 	xul.dll 	MessageLoop::RunTask(already_AddRefed<mozilla::Runnable>) 	ipc/chromium/src/base/message_loop.cc:361
20 	xul.dll 	MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask&&) 	ipc/chromium/src/base/message_loop.cc:369
21 	xul.dll 	MessageLoop::DoWork() 	ipc/chromium/src/base/message_loop.cc:444
22 	xul.dll 	base::MessagePumpForUI::DoRunLoop() 	ipc/chromium/src/base/message_pump_win.cc:212
23 	xul.dll 	base::MessagePumpWin::RunWithDispatcher(base::MessagePump::Delegate*, base::MessagePumpWin::Dispatcher*) 	ipc/chromium/src/base/message_pump_win.cc:56
24 	xul.dll 	base::MessagePumpWin::Run(base::MessagePump::Delegate*) 	ipc/chromium/src/base/message_pump_win.h:80
25 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:231
26 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:211
27 	xul.dll 	base::Thread::ThreadMain() 	ipc/chromium/src/base/thread.cc:180
28 	xul.dll 	`anonymous namespace'::ThreadFunc 	ipc/chromium/src/base/platform_thread_win.cc:28
29 	kernel32.dll 	BaseThreadInitThunk 	
30 	mozglue.dll 	patched_BaseThreadInitThunk 	mozglue/build/WindowsDllBlocklist.cpp:809
31 	ntdll.dll 	__RtlUserThreadStart 	
32 	ntdll.dll 	_RtlUserThreadStart

this signature is regressing in volume since 55.0a1 build 20170604030205 and subsequent versions. crash reports are coming from users on windows 10 & 8.1 and with all different gpu vendors.
the signature is currently accounting for > 1% of browser crashes in early data from the 55.0b staged rollout.
Looks like a driver/dxgi bug occurring on `swapchain->Present()` call.
Severity: critical → major
Priority: -- → P2
Whiteboard: [gfx-noted]
Milan, is there anything we can do here? /me apologizes for not knowing what dxgi is :)
Flags: needinfo?(milan)
Given the timing of when the crashes spiked, I wouldn't be surprised if patches on bug 1377324 end up fixing it.  Just got r+, let's see if it helps.
There were still a couple of crashes on nightly after bug 1377324 landed, but the volume on nightly is pretty low overall.  Do we need to wait until that's on beta?
(In reply to Julien Cristau [:jcristau] from comment #4)
> There were still a couple of crashes on nightly after bug 1377324 landed,
> but the volume on nightly is pretty low overall.  Do we need to wait until
> that's on beta?

Yeah, it doesn't look like it was related; this is probably another of the "driver reset" ones, actively worked on in bug 1351349.
Flags: needinfo?(milan)
See Also: → 1351349
Add kevin/vincent to monitor this issue
(In reply to Peter Chang[:pchang] from comment #6)
> Add kevin/vincent to monitor this issue

From many of the Crash Signatures indicate that they crashed in swapchain->present in ::EndFrame(). The remind of me that we just revert device reset check in EndFrame in bug 1356537. From this result, it seems that device-reset checked only in BeginFrame() is not enough. David, do you think if we can try to push back this check to ::EndFrame() to see if the crash rate can drop or not?
Assignee: nobody → vliu
Flags: needinfo?(dvander)
These are Windows 10 crashes in the UI process. That should never happen. It sounds like we're using D3D11 in the UI after a device reset.
Flags: needinfo?(dvander)
Oh, I see - these sessions are all not getting E10S, so the GPU process is disabled. That sucks.

@Vincent, if the timing of the EndFrame removal check coincides with this crash spiking - sure, I'd be in favor of re-adding the aggressive TDR checks. Could we gate them on XRE_IsParentProcess() this time?
(In reply to David Anderson [:dvander] from comment #9)
> Oh, I see - these sessions are all not getting E10S, so the GPU process is
> disabled. That sucks.
> 
> @Vincent, if the timing of the EndFrame removal check coincides with this
> crash spiking - sure, I'd be in favor of re-adding the aggressive TDR
> checks. Could we gate them on XRE_IsParentProcess() this time?

Yes, from filtering data Telemetry, it is confirmed that all crash reports were run without E10S enabled and GPU process were unavailable. Based on this. it is worth re-adding TDR checks in ::EndFrame().
Attachment #8889340 - Flags: review?(dvander)
Comment on attachment 8889340 [details] [diff] [review]
0001-Bug-1374254-Re-adding-the-aggressive-TDR-checks-in-D.patch

Review of attachment 8889340 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/d3d11/CompositorD3D11.cpp
@@ +1105,5 @@
>      return;
>    }
>  
> +  if (XRE_IsParentProcess() &&
> +      mDevice->GetDeviceRemovedReason() != S_OK) {

nit: this condition fits on one line, and the body of this |if| should have two-space indents.
Attachment #8889340 - Flags: review?(dvander) → review+
(In reply to David Anderson [:dvander] from comment #11)
> Comment on attachment 8889340 [details] [diff] [review]
> 0001-Bug-1374254-Re-adding-the-aggressive-TDR-checks-in-D.patch
> 
> Review of attachment 8889340 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: gfx/layers/d3d11/CompositorD3D11.cpp
> @@ +1105,5 @@
> >      return;
> >    }
> >  
> > +  if (XRE_IsParentProcess() &&
> > +      mDevice->GetDeviceRemovedReason() != S_OK) {
> 
> nit: this condition fits on one line, and the body of this |if| should have
> two-space indents.

Thanks for the suggestion. Carried r+ed with nit.
Attachment #8889340 - Attachment is obsolete: true
Pushed by vliu@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/01828d347c22
Re-adding the aggressive TDR checks in D3D11 EndFrame(). r=dvander
https://hg.mozilla.org/mozilla-central/rev/01828d347c22
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
thanks for the patch, please request uplift to beta if you deem fit to do so!
(the crash is less frequent on nightly so it won't be possible to verify if this worked by looking at 56.0a1 crash stats)
Flags: needinfo?(vliu)
Comment on attachment 8890167 [details] [diff] [review]
0001-Bug-1374254-Re-adding-the-aggressive-TDR-checks-in-D.patch

Approval Request Comment
[Feature/Bug causing the regression]: No.
[User impact if declined]: Crash
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: No because lower crash rate in Nightly
[Needs manual test from QE? If yes, steps to reproduce]: No 
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: No
[Why is the change risky/not risky?]: Only adding abnormal checking mechanism
[String changes made/needed]:
Flags: needinfo?(vliu)
Attachment #8890167 - Flags: approval-mozilla-beta?
Comment on attachment 8890167 [details] [diff] [review]
0001-Bug-1374254-Re-adding-the-aggressive-TDR-checks-in-D.patch

graphics crash fix with device reset, beta55+
Attachment #8890167 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Vincent Liu[:vliu] from comment #17)
> [Is this code covered by automated tests?]: No
> [Has the fix been verified in Nightly?]: No because lower crash rate in
> Nightly
> [Needs manual test from QE? If yes, steps to reproduce]: No 

Setting qe-verify- based on Vincent's assessment on manual testing needs.
Flags: qe-verify-
(In reply to [:philipp] from comment #21)
> crash reports with this signature are still around in 55.0b13 which got the
> patch:
> https://crash-stats.mozilla.com/signature/
> ?product=Firefox&build_id=%3E%3D20170727114534&signature=RtlpWaitOnCriticalSe
> ction%20%7C%20RtlpEnterCriticalSectionContended%20%7C%20RtlEnterCriticalSecti
> on%20%7C%20CUseCountedObject%3CT%3E%3A%3AUCReleaseUse#reports

yes, it still crashed in Present() even adding TDR check in the beginning of ::EndFrame(). Most of signatures were showed that they got device reset many times before crashing, showing that device-reset happens after TDR checks and before Gecko called Present().

I'd ever tried to filter all signatures about this crash from:

1. The first signature it counted was on build id 20170604030205. At this time 55 still in Nightly build.
2. The first signature counted to 18 was on build id 20170612224034

Looked into Pushlog to see all comments around the above of 2 check point but it is hard to see any root cause. All were e10s disabled and almost were on Windows10.
Until now there is no this crash happens on FF56, probably it still in Nightly build. Anyway, I will keep monitoring to see if there is any way to do more.
Flags: needinfo?(vliu)
You need to log in before you can comment on or make changes to this bug.