Open Bug 1447146 Opened 2 years ago Updated 6 months ago

Crash in nvwgf2umx.dll | RtlAllocateMemoryBlockLookaside | ... (advapi32.dll)

Categories

(Core :: Graphics, defect, P2, critical)

60 Branch
x86_64
Windows 7
defect

Tracking


Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox59 --- unaffected
firefox60 --- wontfix
firefox61 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- fix-optional

People

(Reporter: philipp, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [gfx-noted])

Crash Data

This bug was filed from the Socorro interface and is
report bp-e2b60a52-6910-4ce5-bd93-b12bd0180319.
=============================================================

Content crashes with this signature are regressing in volume in the 60.0b cycle from
users on Windows 7.

This seems to be constrained to a number of device/driver configurations. Would
there be a way to blocklist them, or should we reach out to NVIDIA?

Adapter driver version facet
Rank 	Driver version 	Count 	Share
1 	10.18.13.6510 	61 	30.50 %
2 	10.18.13.6472 	37 	18.50 %
3 	10.18.13.6175 	31 	15.50 %
4 	10.18.13.5900 	21 	10.50 %
5 	10.18.13.5921 	18 	9.00 %
6 	10.18.13.6519 	13 	6.50 %
7 	10.18.13.6451 	8 	4.00 %
8 	10.18.13.5850 	6 	3.00 %
9 	10.18.13.5582 	4 	2.00 %
10 	10.18.13.5906 	1 	0.50 %
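An aside that may help cross-reference these facets with NVIDIA's release notes: by the commonly described convention, the NVIDIA marketing version is the last five digits of the Windows driver version string (so 10.18.13.6510 would be 365.10, and 10.18.13.6472 would be 364.72). A small illustrative helper, not part of any Firefox code:

```cpp
#include <cassert>
#include <string>

// Convert a Windows (DirectX-style) driver version string such as
// "10.18.13.6510" into the NVIDIA marketing version ("365.10"):
// strip the dots and read the last five digits as XXX.YY.
// Returns an empty string if fewer than five digits are present.
std::string NvMarketingVersion(const std::string& dxVersion) {
    std::string digits;
    for (char c : dxVersion) {
        if (c >= '0' && c <= '9') digits += c;
    }
    if (digits.size() < 5) return "";
    std::string last5 = digits.substr(digits.size() - 5);
    return last5.substr(0, 3) + "." + last5.substr(3);
}
```

If that convention holds here, all of the drivers listed above date from roughly 2015-2016.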

Adapter device id facet
Rank 	Device ID 	Count 	Share
1 	0x128b 	77 	38.50 %
2 	0x1380 	41 	20.50 %
3 	0x11c2 	21 	10.50 %
4 	0x1402 	19 	9.50 %
5 	0x11c0 	13 	6.50 %
6 	0x1187 	12 	6.00 %
7 	0x0f02 	10 	5.00 %
8 	0x1287 	4 	2.00 %
9 	0x104a 	3 	1.50 %
Flags: needinfo?(milan)
Whiteboard: [gfx-noted]
This looks like 60 only, but low volume. Let's wait and see.
Still small number of installs, low numbers in B13, maybe gone?
Milan: Verify that it's no longer showing up?
(In reply to Marion Daly [:mdaly] from comment #3)
> Milan: Verify that it's no longer showing up?

still present, very low user count, high volume for those users.
Crash Signature: ApiSetResolveToHost] [@ nvwgf2umx.dll | RtlAllocateMemoryBlockLookaside | nvwgf2umx.dll | RtlAllocateMemoryBlockLookaside | nvwgf2umx.dll | RtlpFindNextActivationContextSection | RtlpDosPathNameToRelativeNtPathName_Ustr | ApiSetResolveToHost] → ApiSetResolveToHost] [@ nvwgf2umx.dll | RtlAllocateMemoryBlockLookaside | nvwgf2umx.dll | RtlAllocateMemoryBlockLookaside | nvwgf2umx.dll | RtlpFindNextActivationContextSection | RtlpDosPathNameToRelativeNtPathName_Ustr | ApiSetResolveToHost] [@ nvwgf2…
Adding 63 and 64 as affected.
The number of crashes and users affected has been going up since we shipped 63; maybe this should be reprioritized as a P2 instead of a P3.
This has been scaling up on Windows on release even prior to 63.
Flags: needinfo?(dbolter)
Priority: P3 → P2
Jeff, any thoughts on who could look at this? (Normally I'd ask Bas)
Flags: needinfo?(dbolter) → needinfo?(jmuizelaar)
Not really. The stacks are pretty unhelpful. It might be best to ask nvidia on mozilla-nvidia discuss.
Flags: needinfo?(jmuizelaar)
Any update here?
Flags: needinfo?(dbolter)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #9)
> Not really. The stacks are pretty unhelpful. It might be best to ask nvidia
> on mozilla-nvidia discuss.

Agreed. Jeff could you kick that off?
Flags: needinfo?(dbolter) → needinfo?(jmuizelaar)
Done.
Flags: needinfo?(jmuizelaar)

Hey Jeff - any update on this bug?

Flags: needinfo?(jmuizelaar)

No. Nvidia did not respond.

Flags: needinfo?(jmuizelaar)

Any suggestions on anything else that could be done for investigation?

Flags: needinfo?(jmuizelaar)

The most productive thing is probably to continue to try to get in touch with Nvidia. I'll try harder.

Flags: needinfo?(jmuizelaar)

Thanks for the report. This is NV bug number 2446669. I'll look into this.

I am emailing six users who submitted crash reports with one of these signatures and included their email addresses. If anyone replies with permission to share the crash dumps, then Jeff or I can provide them, in accordance with Mozilla's data protection policies.

Though looking at it I'm not sure how useful that's going to be!

I've passed a minidump on to Kimmo at Nvidia.

Flags: needinfo?(jmuizelaar)

This particular minidump used driver version 341.44 (driver date 2-3-2015) on a GeForce GTX 260.
The latest "supported" driver for that card is version 342.01 WHQL, released 2016-12-14.
The other crash reports show drivers from newer branches, but from around the same era. Example: one crashed with 364.72, which is from 2016/H1.

I tried Win7 + 342.01 + GT210 to repro this. The callstack looks like initialization code and IIRC the reports indicate the process lives only a few seconds. For the repro I tried to just open and close the app and open multiple windows. I did not succeed. In theory 342.01 could be fixed wrt this, but I doubt it's the case. A more probable explanation is that this is more transient and/or needs a more specific repro scenario. I'll try the exact 341.44 later.

The main thread crashes at
xul.dll!mozilla::gfx::DoesTextureSharingWorkInternal(ID3D11Device * device, DXGI_FORMAT format, unsigned int bindflags) Line 253 C++
xul.dll!mozilla::gfx::DeviceManagerDx::CreateCompositorDevice(mozilla::gfx::FeatureState & d3d11) Line 495 C++
xul.dll!mozilla::gfx::DeviceManagerDx::CreateCompositorDevices() Line 180 C++
xul.dll!gfxWindowsPlatform::InitializeD3D11() Line 1522 C++
xul.dll!gfxWindowsPlatform::InitializeDevices() Line 1497 C++
xul.dll!gfxWindowsPlatform::HandleDeviceReset() Line 431 C++
xul.dll!gfxWindowsPlatform::UpdateRenderMode() Line 486 C++
xul.dll!nsWindow::OnPaint(HDC__ * aDC, unsigned int aNestingLevel) Line 181 C++
xul.dll!nsWindow::ProcessMessage(unsigned int msg, unsigned __int64 & wParam, __int64 & lParam, __int64 * aRetValue) Line 5576 C++
xul.dll!nsWindow::WindowProcInternal(HWND * hWnd, unsigned int msg, unsigned __int64 wParam, __int64 lParam) Line 5034 C++
xul.dll!nsWindow::WindowProc(HWND * hWnd, unsigned int msg, unsigned __int64 wParam, __int64 lParam) Line 4986 C++
[External Code]
xul.dll!nsAppShell::ProcessNextNativeEvent(bool mayWait) Line 569 C++
xul.dll!nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal * thr, bool mayWait) Line 273 C++
xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Line 1160 C++
xul.dll!NS_ProcessNextEvent(nsIThread * aThread, bool aMayWait) Line 530 C++
xul.dll!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate * aDelegate) Line 97 C++
xul.dll!MessageLoop::RunHandler() Line 319 C++
xul.dll!MessageLoop::Run() Line 299 C++
xul.dll!nsBaseAppShell::Run() Line 160 C++
xul.dll!nsAppShell::Run() Line 420 C++
xul.dll!nsAppStartup::Run() Line 291 C++
xul.dll!XREMain::XRE_mainRun() Line 4777 C++
xul.dll!XREMain::XRE_main(int argc, char * * argv, const mozilla::BootstrapConfig & aConfig) Line 4922 C++
xul.dll!XRE_main(int argc, char * * argv, const mozilla::BootstrapConfig & aConfig) Line 5014 C++

The last Mozilla frame in the call stack triggers some sort of D3D11 driver flush. The flush drains the work queued on the driver worker thread, and some bug in the driver code then crashes the process. I got the driver callstacks and did some internal bug queries. Unfortunately the queries did not help: I cannot pinpoint which operation triggers the bug, nor what the bug is.

If other minidumps indicate the same crash point, then maybe one workaround could be blacklisting D3D11 sharing on these old drivers. My knowledge doesn't extend to the driver internals, so I'm not sure how realistic this is.
If checking the other minidumps is a fast operation, that could be one avenue of investigation.

The driver call stack is related to texture uploads, if that makes sense. I'm not sure whether creating a D3D11 keyed-mutex shared texture internally needs to upload something. If this is part of gfx system initialization, I'm not sure how many other textures have been created. Of course, if this is done per window, then most likely there could already be multiple textures in flight from other windows.

Note: I could not find the main thread from the crash report web ui reports.
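If blocklisting on driver version turns out to be the workaround, the gating check reduces to a dotted-version comparison against a cutoff. A minimal sketch of that comparison, assuming a hypothetical cutoff; the function names here are illustrative and are not Firefox's actual gfx blocklist API:

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Parse a dotted driver version ("10.18.13.6510") into numeric components.
std::vector<int> ParseDriverVersion(const std::string& version) {
    std::vector<int> parts;
    std::stringstream ss(version);
    std::string item;
    while (std::getline(ss, item, '.')) {
        parts.push_back(std::stoi(item));
    }
    return parts;
}

// Return true if `version` is strictly older than `cutoff`, comparing
// component by component (missing components compare as 0).
bool DriverOlderThan(const std::string& version, const std::string& cutoff) {
    std::vector<int> a = ParseDriverVersion(version);
    std::vector<int> b = ParseDriverVersion(cutoff);
    size_t n = std::max(a.size(), b.size());
    for (size_t i = 0; i < n; ++i) {
        int av = i < a.size() ? a[i] : 0;
        int bv = i < b.size() ? b[i] : 0;
        if (av != bv) return av < bv;
    }
    return false;
}
```

For example, with a hypothetical cutoff of "10.18.13.6520", DriverOlderThan would match every driver version listed in the facet in comment 0.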

Jessie, are we still looking into this?

Flags: needinfo?(jbonisteel)

Hey Jeff - is this actionable + important?

Flags: needinfo?(jbonisteel) → needinfo?(bugzmuiz)
Flags: needinfo?(bugzmuiz) → needinfo?(jmuizelaar)
Depends on: 1548018

It seems like our symbol server is not serving up binaries any more. That makes analyzing this a bit of a pain.

Flags: needinfo?(jmuizelaar)