Closed Bug 1351349 Opened 3 years ago Closed 3 years ago

Crash in igd10iumd32.dll | CContext::EmptyOutAllDDIBindPoints

Categories

(Core :: Graphics, defect, critical)

52 Branch
x86
Windows
defect
Not set
critical

Tracking

RESOLVED FIXED
mozilla56
Tracking Status
firefox-esr45 --- wontfix
firefox52 --- wontfix
firefox-esr52 55+ fixed
firefox53 --- wontfix
firefox54 --- wontfix
firefox55 --- wontfix
firefox56 --- fixed

People

(Reporter: philipp, Assigned: kechen)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: crash)

Attachments

(3 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-45d5a2bd-9c47-46bc-8b93-2c6162170328.
=============================================================
Crashing Thread (22)
Frame 	Module 	Signature 	Source
Ø 0 	igd10iumd32.dll 	igd10iumd32.dll@0x3a81ea 	
1 	d3d11.dll 	CContext::EmptyOutAllDDIBindPoints(bool) 	
2 	d3d11.dll 	CContext::CompleteContextRemoval(bool) 	
3 	d3d11.dll 	CContext::PerformAmortizedRenderOperations() 	
4 	d3d11.dll 	TOptImmediateContext::AcquireDevCtxIfaceNoSync() 	
5 	d3d11.dll 	CDevCtxInterface::CDevCtxInterface<CContext>(CContext*) 	
6 	d3d11.dll 	CContext::ID3D11DeviceContext_Flush_AppEntered(ID3D11DeviceContext*) 	
7 	xul.dll 	mozilla::layers::CompositorD3D11::CancelFrame() 	gfx/layers/d3d11/CompositorD3D11.cpp:1192
8 	xul.dll 	mozilla::layers::LayerManagerComposite::ChangeCompositor(mozilla::layers::Compositor*) 	gfx/layers/composite/LayerManagerComposite.cpp:1345
9 	xul.dll 	mozilla::layers::CompositorBridgeParent::ResetCompositorImpl(nsTArray<mozilla::layers::LayersBackend> const&) 	gfx/layers/ipc/CompositorBridgeParent.cpp:1859
10 	xul.dll 	mozilla::layers::CompositorBridgeParent::ResetCompositorTask(nsTArray<mozilla::layers::LayersBackend> const&, unsigned __int64, mozilla::Maybe<mozilla::layers::TextureFactoryIdentifier>*) 	gfx/layers/ipc/CompositorBridgeParent.cpp:1807
11 	xul.dll 	mozilla::detail::RunnableMethodImpl<mozilla::layers::CompositorBridgeParent* const, void ( mozilla::layers::CompositorBridgeParent::*)(nsTArray<mozilla::layers::LayersBackend> const&, unsigned __int64, mozilla::Maybe<mozilla::layers::TextureFactoryIdentifier>*), 1, 0, StoreCopyPassByConstLRef<nsTArray<mozilla::layers::LayersBackend> >, unsigned __int64, mozilla::Maybe<mozilla::layers::TextureFactoryIdentifier>*>::Run() 	obj-firefox/dist/include/nsThreadUtils.h:860
12 	xul.dll 	MessageLoop::RunTask(already_AddRefed<mozilla::Runnable>) 	ipc/chromium/src/base/message_loop.cc:358
13 	xul.dll 	MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask&&) 	ipc/chromium/src/base/message_loop.cc:366
14 	xul.dll 	MessageLoop::DoWork() 	ipc/chromium/src/base/message_loop.cc:441
15 	xul.dll 	base::MessagePumpForUI::DoRunLoop() 	ipc/chromium/src/base/message_pump_win.cc:212
16 	xul.dll 	base::MessagePumpWin::RunWithDispatcher(base::MessagePump::Delegate*, base::MessagePumpWin::Dispatcher*) 	ipc/chromium/src/base/message_pump_win.cc:56
17 	xul.dll 	base::MessagePumpWin::Run(base::MessagePump::Delegate*) 	ipc/chromium/src/base/message_pump_win.h:80
18 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:231
19 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:211
20 	xul.dll 	base::Thread::ThreadMain() 	ipc/chromium/src/base/thread.cc:179
21 	xul.dll 	`anonymous namespace'::ThreadFunc 	ipc/chromium/src/base/platform_thread_win.cc:28
22 	kernel32.dll 	BaseThreadInitThunk 	
23 	ntdll.dll 	__RtlUserThreadStart 	
24 	ntdll.dll 	_RtlUserThreadStart

This crash has been around for a while, but it has been increasing in frequency since the 53.0b cycle:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=beta&signature=igd10iumd32.dll%20%7C%20CContext%3A%3AEmptyOutAllDDIBindPoints&date=%3E%3D2016-09-28T15%3A24%3A52.000Z&date=%3C2017-03-28T15%3A24%3A52.000Z#graphs - on 53.0b6 it's 0.35% of browser crashes currently.

The signature seems to affect mostly GPUs from a particular family:
Adapter device id facet
1 	0x1912 	595 	57.49 %
2 	0x1902 	259 	25.02 %
3 	0x1916 	153 	14.78 %
4 	0x1616 	17 	1.64 %
5 	0x1606 	4 	0.39 %
6 	0x191b 	3 	0.29 %
7 	0x1906 	2 	0.19 %
8 	0x22b1 	2 	0.19 %

& some other correlations for Firefox Beta:
(97.25% in signature vs 00.88% overall) Module "igc32.dll" = true
(98.62% in signature vs 03.18% overall) reason = EXCEPTION_ACCESS_VIOLATION_WRITE
(83.49% in signature vs 01.02% overall) CPU Info = GenuineIntel family 6 model 94 stepping 3
(100.0% in signature vs 37.89% overall) Module "d3d11.dll" = true [100.0% vs 32.25% if platform_version = 6.1.7601 Service Pack 1]
(100.0% in signature vs 39.64% overall) Module "dxgi.dll" = true
(100.0% in signature vs 40.07% overall) abort_message = null
(62.84% in signature vs 00.43% overall) GFX_ERROR "(nsWindow) Detected device reset: " = true [56.92% vs 08.77% if adapter_device_id = 0x1912]
(44.95% in signature vs 00.05% overall) address = 0x198
(54.59% in signature vs 00.31% overall) GFX_ERROR "GFX: D3D11 skip BeginFrame with device-removed." = true [40.13% vs 00.29% if startup_crash = 0]
(50.92% in signature vs 00.49% overall) GFX_ERROR "(gfxWindowsPlatform) Detected device reset: " = true [44.62% vs 07.56% if adapter_device_id = 0x1912]
(50.00% in signature vs 00.47% overall) GFX_ERROR "(gfxWindowsPlatform) Finished device reset." = true [43.85% vs 07.26% if adapter_device_id = 0x1912]
(47.25% in signature vs 00.44% overall) GFX_ERROR "LayerManager::EndTransaction skip RenderLayer()." = true [42.31% vs 06.58% if adapter_device_id = 0x1912]
(22.02% in signature vs 00.23% overall) bios_manufacturer = HP
(18.35% in signature vs 00.07% overall) adapter_driver_version_clean = 4364
(18.35% in signature vs 00.07% overall) adapter_driver_version = 20.19.15.4364
Milan, it looks like this is happening more often on beta 6 (for example) than it does on release with many more users. That makes me worry it will be a high volume crash on release 53.
Flags: needinfo?(milan)
Not really Intel specific, just more common there (e.g., here is an AMD crash - https://crash-stats.mozilla.com/report/index/71709da4-4e1d-45ed-8e70-84bc02170330).  It all comes from CompositorD3D11::CancelFrame, I assume as a response to a driver reset.

Given the timing, and the increase in frequency with 53, it's quite possible that bug 1300121 tickled it the wrong way.  Bas, can you take a look?
Flags: needinfo?(milan) → needinfo?(bas)
NI Peter to see if somebody who has worked on device resets can also take a look.
Flags: needinfo?(howareyou322)
So it seems that we've had this particular function crashing in a bunch of different places, see also for example:

https://crash-stats.mozilla.com/signature/?signature=igd10iumd32.dll%20%7C%20CContext%3A%3AEmptyOutAllDDIBindPoints&date=%3E%3D2017-03-24T19%3A32%3A00.000Z&date=%3C2017-03-31T19%3A32%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports

Where crashes go as far back as 24. I suspect this is some invalid state (due to a real bug on our side or just a driver issue) that has been present for a long time, and we've simply added a new call site for it in bug 1300121, causing a new call stack to show up.
Flags: needinfo?(bas)
Too late for firefox 52, mass-wontfix.
Assign to Kevin to investigate.
Assignee: nobody → kechen
Flags: needinfo?(howareyou322)
Most of the crashes happened with the Intel driver on the x86 Windows 7 platform; however, most of the driver versions start with "20" or "10", which, according to Intel's driver naming rule[1], indicates a driver for Windows 10 or Windows 8.1. It might be a compatibility problem between the driver and the graphics card.
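
The version-prefix observation above can be checked mechanically. The following is a minimal sketch, assuming version strings have the four-field dotted form seen in the crash data (for example 20.19.15.4364). The helper names ParseDriverVersion and TargetOsFromDriverVersion are hypothetical, and the mapping covers only the two prefixes mentioned in this comment.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: split "a.b.c.d" into four non-negative integer fields.
// Returns false if the string does not have exactly four numeric fields.
// On failure, `out` may have been partially written.
bool ParseDriverVersion(const std::string& text, std::array<int, 4>& out) {
  if (text.empty() || text.back() == '.') return false;
  std::istringstream in(text);
  std::string part;
  std::size_t count = 0;
  while (std::getline(in, part, '.')) {
    // Reject extra, empty, over-long, or non-digit fields before converting.
    if (count >= out.size() || part.empty() || part.size() > 9) return false;
    for (char c : part) {
      if (c < '0' || c > '9') return false;
    }
    out[count++] = std::stoi(part);
  }
  return count == out.size();
}

// Hypothetical mapping, limited to the prefixes named in the comment above:
// "20" for Windows 10 and "10" for Windows 8.1. Anything else is "unknown".
std::string TargetOsFromDriverVersion(const std::string& text) {
  std::array<int, 4> fields{};
  if (!ParseDriverVersion(text, fields)) return "unknown";
  switch (fields[0]) {
    case 20: return "Windows 10";
    case 10: return "Windows 8.1";
    default: return "unknown";
  }
}
```

Run against the adapter_driver_version values in the crash reports, a check like this would flag a Windows 7 session whose driver targets a newer Windows release.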

To avoid this crash, we might be able to remove the flush call from ChangeCompositor[2]. That function is only invoked during the device reset process, by which point we have already reinitialized the driver, so there is no point in flushing the old context: it is in an unstable state, and flushing it risks corrupting the driver.

This is only one of the entry points of this crash; I will open another bug so that it is easier to track in the future.

[1] http://www.intel.com/content/www/us/en/support/graphics-drivers/000005654.html
[2] https://dxr.mozilla.org/mozilla-central/rev/f40e24f40b4c4556944c762d4764eace261297f5/gfx/layers/composite/LayerManagerComposite.cpp#1372
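
A minimal sketch of the idea in this comment, written as a guard rather than an outright removal of the flush. FakeDevice and SketchCompositor are hypothetical stand-ins, not the real D3D11 or Gecko classes, and this is not the patch that landed. In real code the "removed" flag would presumably come from something like ID3D11Device::GetDeviceRemovedReason.

```cpp
#include <string>
#include <vector>

// Stand-in for a D3D11 device/context pair; not the real API.
struct FakeDevice {
  bool removed = false;            // set once a device reset is detected
  std::vector<std::string> calls;  // log of calls that reached the "driver"
  void Flush() { calls.push_back("Flush"); }
};

// Hypothetical compositor illustrating the guard.
class SketchCompositor {
 public:
  explicit SketchCompositor(FakeDevice* device) : mDevice(device) {}

  // Called when a frame is abandoned, e.g. while swapping compositors.
  // Returns true if the context was flushed.
  bool CancelFrame() {
    if (mDevice->removed) {
      // The device has been reset: the old context is in an unknown state,
      // so nothing further is submitted to it.
      return false;
    }
    mDevice->Flush();
    return true;
  }

 private:
  FakeDevice* mDevice;
};
```

The guard keeps the flush on the normal path and skips it only after a reset, which is the case the comment is concerned about.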
Depends on: 1356119
From the call stack of this crash report[1], Firefox crashes when we are destroying the CompositorD3D11 from LayerComposite.

What happened here was that we only replaced the compositor for LayerManagerComposite after the device reset[2]; however, the layers were still holding the old one.
My assumption is that when we tried to render the layers[3], we used the old compositor to execute the draw commands, which might have put the driver into an unstable state. It finally crashed when we destroyed the old compositor.

I am working now on a patch which also updates the layers' compositor after device reset.


[1] https://crash-stats.mozilla.com/report/index/193f4c52-836e-4a47-8ad8-47c6c0170423#tab-details
[2] https://dxr.mozilla.org/mozilla-central/rev/8e969cc9aff49f845678cba5b35d9dd8aa340f16/gfx/layers/ipc/CompositorBridgeParent.cpp#2025
[3] https://hg.mozilla.org/mozilla-central/annotate/8e969cc9aff49f845678cba5b35d9dd8aa340f16/gfx/layers/composite/LayerManagerComposite.cpp#l515
Some updates on comment 8:

After the device reset, we invalidate all layers in the ClientLayerManagers and won't reuse them[1].
So the FlushRendering call in comment 8's crash report might have been sent before the device reset was processed on the CompositorBridgeChild side but after the CompositorBridgeParent side completed the recovery.

As a result, rather than changing the compositor for the LayerComposites, I think I should avoid rendering these LayerComposites after device reset.

Hello David, do you have any thoughts or advice about this?

[1] https://dxr.mozilla.org/mozilla-central/rev/8e969cc9aff49f845678cba5b35d9dd8aa340f16/layout/painting/FrameLayerBuilder.cpp#2099
Flags: needinfo?(dvander)
(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #9)
> Some updates on comment 8:
> 
> After the device reset, we invalidate all layers in the ClientLayerManagers
> and won't reuse them[1].
> So the FlushRendering call in comment 8's crash report might have been sent
> before the device reset was processed on the CompositorBridgeChild side but
> after the CompositorBridgeParent side completed the recovery.
> 
> As a result, rather than changing the compositor for the LayerComposites, I
> think I should avoid rendering these LayerComposites after device reset.
> 
> Hello David, do you have any thoughts or advice about this?
> 
> [1]
> https://dxr.mozilla.org/mozilla-central/rev/
> 8e969cc9aff49f845678cba5b35d9dd8aa340f16/layout/painting/FrameLayerBuilder.
> cpp#2099

Yes, I think we should refuse to composite any layers whatsoever until we receive a new layer tree from content. There are two mechanisms in place to protect against this, here [1] and here [2]. I'm not sure why [1] didn't kick in; comment #8 suggests it should have. Unfortunately [2] only protects against attaching bad compositables, but we could use it to make sure AutoResolveRefLayers doesn't attach anything that hasn't acknowledged a compositor update. (Or anywhere else that might make sense.)

[1] http://searchfox.org/mozilla-central/source/gfx/layers/composite/ContainerLayerComposite.cpp#413
[2] http://searchfox.org/mozilla-central/source/gfx/layers/ipc/LayerTransactionParent.h#101
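
One way to picture "refuse to composite until a new layer tree arrives" is a generation counter that is bumped on every device reset. This is only a sketch under that assumption: SketchLayer and SketchLayerManager are hypothetical, and the HasStaleCompositor method here merely borrows the name of the Gecko check discussed in this bug without reproducing its implementation.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical layer that remembers which compositor generation built it.
struct SketchLayer {
  std::uint64_t compositorGeneration;
  int id;
};

class SketchLayerManager {
 public:
  // A device reset invalidates everything built against the old compositor.
  void OnDeviceReset() { ++mGeneration; }

  std::uint64_t Generation() const { return mGeneration; }

  // Analogous in spirit to a stale-compositor check: a layer is stale if it
  // was created before the most recent reset.
  bool HasStaleCompositor(const SketchLayer& layer) const {
    return layer.compositorGeneration != mGeneration;
  }

  // Returns the ids of the layers that were actually drawn; stale layers
  // are skipped instead of being rendered with the old compositor.
  std::vector<int> Composite(const std::vector<SketchLayer>& tree) const {
    std::vector<int> drawn;
    for (const SketchLayer& layer : tree) {
      if (HasStaleCompositor(layer)) continue;
      drawn.push_back(layer.id);
    }
    return drawn;
  }

 private:
  std::uint64_t mGeneration = 0;
};
```

With this shape, a flush request that races with the reset draws nothing until content sends layers tagged with the new generation.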
Flags: needinfo?(dvander)
Err, also [2] only works for content layers, not UI layers. Still not sure why HasStaleCompositor didn't help though.
David, thank you for the feedback!
"The crash is caused by executing draw commands via the old compositor" in comment 8 is just my deduction. I hadn't noticed that we have the function "HasStaleCompositor"; maybe it's enough to handle this situation.
I will keep investigating this.
See Also: → 1285333
Depends on: 1363594
Several crash reports show that the application crashed while executing DrawTargetD2D1's EndDraw() or Flush() function[1].

In both methods, the program processes the draw commands in the command buffer.
My assumption is that executing these draw commands with an old DrawTarget after a device reset might corrupt the driver.

We've already skipped a Flush() call on an old context after device reset in bug 1356119, and the result is good: the crash rate decreased after the fix landed[2].

[1] https://crash-stats.mozilla.com/signature/?product=Firefox&address=0x7084&address=0x7088&address=0x707c&address=0x708c&address=0x7080&address=0x70dc&address=0x7078&signature=igd10iumd32.dll%20%7C%20CContext%3A%3AEmptyOutAllDDIBindPoints&date=%3E%3D2017-05-03T02%3A18%3A47.000Z&date=%3C2017-05-10T02%3A18%3A47.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-version&_sort=-date&page=1
[2] https://crash-stats.mozilla.com/signature/?product=Firefox&version=54.0b&signature=igd10iumd32.dll%20%7C%20CContext%3A%3AEmptyOutAllDDIBindPoints&date=%3E%3D2017-02-10T03%3A52%3A51.000Z&date=%3C2017-05-10T03%3A52%3A51.000Z#aggregations
Depends on: 1363677
Kevin, do we have any follow-up action after bug 1363677? I didn't see the crash volume go down.
Flags: needinfo?(kechen)
I think the crashes have been redirected to other call stacks like [1] after bug 1363677.

I guess this crash might be triggered by the deferred destruction in Direct3D 11[2].
Some objects had already been corrupted, but the issue didn't surface until the flush method was invoked and triggered the deferred destruction.
The reason skipping the flush did not solve it might be that the corrupted object will still be destroyed at some point in the future, so we just made it live a little longer.

It's very difficult to find the root cause of this bug since we can't even tell when the object became corrupt.
Maybe we can wrap all the Direct3D APIs, like what we've done in WebGL, and record the draw commands for every flush; then we would know which draw commands were sent before the last flush, which might help us find out which draw command corrupts the object.

David, do you have any thoughts about this deduction and proposal?

[1] https://crash-stats.mozilla.com/report/index/24e0b6d8-b5b1-4fa7-8bfa-a69c10170714
[2] https://msdn.microsoft.com/en-us/library/windows/desktop/ff476425(v=vs.85).aspx
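
The recording proposal could look roughly like the following: a wrapper logs each command name as it is issued and, on every flush, closes the current batch and keeps only the most recent few. This is a sketch, not existing Gecko code; CommandRecorder and its methods are hypothetical, and a real wrapper would sit in front of the actual ID3D11DeviceContext calls.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical recorder: keeps the command batches of the last N flushes.
class CommandRecorder {
 public:
  explicit CommandRecorder(std::size_t maxBatches) : mMaxBatches(maxBatches) {}

  // Called by the wrapper for every command it forwards to the context.
  void Record(const std::string& command) { mPending.push_back(command); }

  // Called when the wrapped context is flushed: the pending commands become
  // one batch, and the oldest batches are dropped beyond the limit.
  void OnFlush() {
    mHistory.push_back(mPending);
    mPending.clear();
    while (mHistory.size() > mMaxBatches) {
      mHistory.pop_front();
    }
  }

  // Commands issued since the last flush.
  const std::vector<std::string>& Pending() const { return mPending; }

  // Batches for the most recent flushes, oldest first.
  const std::deque<std::vector<std::string>>& History() const {
    return mHistory;
  }

 private:
  std::size_t mMaxBatches;
  std::vector<std::string> mPending;
  std::deque<std::vector<std::string>> mHistory;
};
```

Bounding the history keeps memory use predictable; the retained batches could then be attached to a crash report as an annotation.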
Flags: needinfo?(kechen) → needinfo?(dvander)
Also, looking at the telemetry of the reports, lots of them show "gpuProcess":{"status":"unavailable"}; maybe it's related to the TDR fallback process.
(In reply to Kevin Chen[:kechen] (UTC + 8) (PTO 7/3 - 7/11) from comment #17)
> Also, looking at the telemetry of the reports, lots of them show
> "gpuProcess":{"status":"unavailable"}; maybe it's related to the TDR fallback
> process.

Hopefully not, since when the GPU process dies the UI process should not use d3d11. In fact the vast, vast majority of these crashes appear to be on Windows 7 where we're less likely to get the GPU process. Probably these sessions don't have the Platform Update installed.
(In reply to Kevin Chen[:kechen] (UTC + 8) (PTO 7/3 - 7/11) from comment #16)
>
> It's very difficult to find the root cause of this bug since we can't even
> tell when the object became corrupt.
> Maybe we can wrap all the Direct3D APIs, like what we've done in WebGL, and
> record the draw commands for every flush; then we would know which draw
> commands were sent before the last flush, which might help us find out
> which draw command corrupts the object.
> 
> David, do you have any thoughts about this deduction and proposal?

My worry is that it sounds very complicated to do that. On the other hand, maybe it's worth trying on Nightly. It could help us see if there's a particular pattern of ID3D11DeviceContext state that is crashy.

Another idea is to call ID3D11DeviceContext::ClearState at the end of every frame. If we're worried about performance we can make it Nightly only. CompositorD3D11 rebuilds the state each frame so it shouldn't be a problem. (Advanced Layers does not, but I've been meaning to fix that.) I'd be curious if ClearState made the crash any different.

It's worth noting that 94% of these crashes occur on adapters 0x1912, 0x1916, and 0x1902. Those are all Intel HD Graphics 5xx cards. We could consider just blocking that adapter on Windows 7 pre-Platform Update. Let me get some Telemetry on that.
The reason crashes appear heavy on Windows 7 sessions without a GPU process is probably that GPU process crash reports are unlikely to be submitted on release channels. So all the crashes we're seeing are on Windows 7 w/o the Platform Update. But that's still good to know because that means the UI process is being affected.

Some further data from Telemetry... I wanted to see if we can narrow a blacklist down to non-GPU-process users, and how many users that would affect.

I sampled 2,406,510 sessions for Firefox 53+ - all versions that can use the GPU process. Out of those sessions sampled, 20,482 (0.85%) have Windows 7, one of the affected devices, and a D3D11 compositor. Of *those* sessions, 7,733 use the GPU process, and 12,703 do not. Those 12,703 users almost certainly do not have the platform update installed, since otherwise the GPU process would work.

That means, if we blacklist these devices for users who have Windows 7 without the platform update, about 0.5% of users will lose D3D11 support.
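
The percentages above can be re-derived from the quoted counts. The sketch below only restates that arithmetic; the struct and function names are hypothetical. Note that the two sub-counts sum to 20,436, which is 46 short of 20,482; the comment does not say how those remaining sessions were classified.

```cpp
// Session counts quoted in the comment above.
struct TelemetrySample {
  long long total;          // all sampled Firefox 53+ sessions
  long long affected;       // Windows 7 + affected device + D3D11 compositor
  long long withGpuProcess; // affected sessions using the GPU process
  long long noGpuProcess;   // affected sessions without the GPU process
};

const TelemetrySample kSample = {2406510, 20482, 7733, 12703};

// Share of `part` in `whole`, as a percentage.
double Percent(long long part, long long whole) {
  return 100.0 * static_cast<double>(part) / static_cast<double>(whole);
}

// Affected sessions that fall into neither sub-count.
long long Unclassified(const TelemetrySample& s) {
  return s.affected - s.withGpuProcess - s.noGpuProcess;
}
```

Percent(kSample.affected, kSample.total) comes out at roughly 0.85%, and Percent(kSample.noGpuProcess, kSample.total) at roughly 0.53%, which matches the "about 0.5%" estimate for users who would lose D3D11.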

So, I'd be fine with a blocklist entry, if all of this seems correct.
Flags: needinfo?(dvander)
Blocks: 1374254
I tried calling ID3D11DeviceContext::ClearState at the end of every frame, but the performance was bad.

Therefore, in this patch, I use a preference to enable the ID3D11DeviceContext::ClearState call only on Nightly, on Windows 7 without the platform update, with specific graphics cards (Intel HD Graphics 510/520/530).

However, only 11 samples matched this condition in the last 6 months, so I'm not sure whether this will give us the data we want to monitor.

For the other channels, I blacklist the specific graphics cards according to comment 20.
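
The channel split described here can be summarized as a small decision function. This is a sketch of the policy as stated in this comment, not the patch itself: Env, Action, and Decide are hypothetical names, the platform-update flag is assumed to be available as an input, and the adapter IDs are the three named in this bug (0x1912, 0x1916, 0x1902).

```cpp
#include <string>

// What to do with D3D11 for a given session.
enum class Action { UseD3D11, UseD3D11WithClearState, BlockD3D11 };

// Hypothetical description of the session's environment.
struct Env {
  bool isWindows7;
  bool hasPlatformUpdate;  // assumed to be detectable by the caller
  bool isNightly;
  std::string adapterId;   // e.g. "0x1912"
};

// Adapter IDs singled out in this bug.
bool IsAffectedAdapter(const std::string& id) {
  return id == "0x1912" || id == "0x1916" || id == "0x1902";
}

Action Decide(const Env& env) {
  const bool affected = env.isWindows7 && !env.hasPlatformUpdate &&
                        IsAffectedAdapter(env.adapterId);
  if (!affected) return Action::UseD3D11;
  // Nightly keeps D3D11 but clears the context state every frame so the
  // effect on the crash can be observed; other channels block D3D11.
  return env.isNightly ? Action::UseD3D11WithClearState : Action::BlockD3D11;
}
```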

Hello David, do you have any thoughts about this patch?
Attachment #8888296 - Flags: feedback?(dvander)
Comment on attachment 8888296 [details] [diff] [review]
Bug 1351349 - Blacklist Intel HD Graphics 510/520/530 for Windows 7 without platform update;

Review of attachment 8888296 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxWindowsPlatform.cpp
@@ +1328,5 @@
>                             NS_LITERAL_CSTRING("FEATURE_FAILURE_D3D11_NEED_HWCOMP"));
>      return;
>    }
>  
> +  if ((!IsWin8OrLater() && IsWin7SP1OrLater())

Should "!IsWin8OrLater() && IsWin7SP1OrLater()" just be "!IsWin8OrLater()" since any version of Windows 7 might not have the Platform Update?

@@ +1339,5 @@
> +    // update due to the crashes in Bug 1351349.
> +    if (adaptorId.EqualsLiteral("0x1912") || adaptorId.EqualsLiteral("0x1916") ||
> +        adaptorId.EqualsLiteral("0x1902")) {
> +#ifdef RELEASE_OR_BETA
> +      d3d11.Disable(FeatureStatus::Blacklisted, "Block D3D11", NS_LITERAL_CSTRING("block"));

This should be more descriptive, like

    d3d11.Disable(FeatureStatus::Blacklisted,
                  "Blacklisted, see bug 1351349",
                  NS_LITERAL_CSTRING("FEATURE_FAILURE_BUG_1351349"));
Comment on attachment 8888296 [details] [diff] [review]
Bug 1351349 - Blacklist Intel HD Graphics 510/520/530 for Windows 7 without platform update;

Seems okay. If we can't confirm before the next merge whether ClearState helps, we should back this out and put in a normal blocklist entry.
Attachment #8888296 - Flags: feedback?(dvander) → feedback+
Comment on attachment 8888593 [details]
Bug 1351349 - Blacklist Intel HD Graphics 510/520/530 for Windows 7 without platform update;

https://reviewboard.mozilla.org/r/159576/#review164958
Attachment #8888593 - Flags: review?(dvander) → review+
Attachment #8888296 - Attachment is obsolete: true
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/f3058956dcd3
Blacklist Intel HD Graphics 510/520/530 for Windows 7 without platform update; r=dvander
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/f3058956dcd3
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Please request Beta/ESR52 approval on this when you get a chance.
Approval Request Comment
[Feature/Bug causing the regression]:
  Not a regression; this bug has existed for a while.
[User impact if declined]:
  Users with specific platform and specific device will run into the crash.
[Is this code covered by automated tests?]:
  No.
[Has the fix been verified in Nightly?]:
  There is no test to verify it, but it landed two days ago and no crashes have been found so far.
[Needs manual test from QE? If yes, steps to reproduce]: 
  No.
[List of other uplifts needed for the feature/fix]:
  No.
[Is the change risky?]:
  Not risky.
[Why is the change risky/not risky?]:
  It just blocks some devices from using hardware acceleration.
[String changes made/needed]:
  No.

Try push for this :
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a76d8d8324da95f821791dc7bdc2df54fe8f6572
Flags: needinfo?(kechen)
Attachment #8889764 - Flags: review+
Attachment #8889764 - Flags: approval-mozilla-beta?
Attachment #8889764 - Attachment filename: 0001-Bug-1351349-Blacklist-Intel-HD-Graphics-510-520-530-.patch → 530 for Windows 7 without platform update;
[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:
User impact if declined: 
  Users with specific platform and specific device will run into the crash.
Fix Landed on Version:
  56
Risk to taking this patch (and alternatives if risky): 
  Not really risky.
String or UUID changes made by this patch: 
  No.
See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Attachment #8889767 - Flags: approval-mozilla-esr52?
Comment on attachment 8889764 [details] [diff] [review]
0001-Bug-1351349-Blacklist-Intel-HD-Graphics-510-520-530-.patch

I'm going to defer this to 56, it's not a new issue and we don't have time to verify the fix in 55 before release.
Attachment #8889764 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
We should let this bake more. Let's target it on ESR52.4.
Comment on attachment 8889767 [details] [diff] [review]
[esr52 uplift] Blacklist Intel HD Graphics 510/520/530 for Windows 7 without platform update;

The crashes in ESR52 look huge in the past 7 days. I think we can take this one in ESR52 and see if it helps. Take it in ESR52.3.
Attachment #8889767 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52+
(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #29)
> [Is this code covered by automated tests?]:
>   No.
> [Has the fix been verified in Nightly?]:
>   There is no test to verify it, but it landed two days ago and no crashes
> have been found so far.
> [Needs manual test from QE? If yes, steps to reproduce]: 
>   No.

Setting qe-verify- based on Kevin's assessment on manual testing needs.
Flags: qe-verify-