Closed Bug 1254400 Opened 4 years ago Closed 4 years ago

crash in mozilla::layers::CompositorD3D9::FailedToResetDevice

Categories

(Core :: Graphics, defect, critical)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox45 - wontfix
firefox46 + wontfix
firefox47 + fixed
firefox48 + fixed

People

(Reporter: milan, Assigned: mtseng)

References

Details

(Keywords: crash, Whiteboard: [gfx-noted])

Crash Data

Attachments

(3 files, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-93a03670-63ef-44aa-bb6f-ed82b2160303.
=============================================================

See also bug 1245781.
There are too many of these reports to ignore, despite bug 1245781 being closed.  Peter, is there somebody that could take a look?  David is a good person to talk to about different device reset strategies, or if we can still end up in the "compositors might be mixed" situation.
Whiteboard: [gfx-noted]
I'll take a look and see if there's anything obvious we can do short-term. D3D9 may be a candidate for the crash guard.
Assignee: nobody → dvander
Status: NEW → ASSIGNED
I think this is us not handling the device going away and falling back to software.
Yes, we have a "Compositors might be mixed" message.  David, I have Peter CC-d on this bug, if you can guide somebody from his team through this, we can maybe get more people involved.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #3)
> I think this is us not handling the device going away and falling back to
> software.

Yeah, that looks right. Instead of this MOZ_CRASH:

  https://dxr.mozilla.org/mozilla-central/source/gfx/layers/d3d9/CompositorD3D9.cpp#678

We should post a message back to the main thread to run nsBaseWidget::OnRenderingDeviceReset:

  https://dxr.mozilla.org/mozilla-central/source/widget/nsBaseWidget.cpp?from=nsBaseWidget.cpp#296

The compositor will probably render nothing for a few frames but when the task round trips back to the compositor it should fall back to software.
Assignee: dvander → nobody
Status: ASSIGNED → NEW
Thanks - Peter, let me know if you can run with this info.
Morris, please have a look.
Assignee: nobody → mtseng
I'm trying post a message to main thread for running OnRenderingDeviceReset. But it results a crash. I'll keep look into it.
Attachment #8732714 - Flags: review?(matt.woodrow)
Attachment #8732714 - Flags: review?(matt.woodrow) → review?(dvander)
Hi David, review ping for comment 9.
Flags: needinfo?(dvander)
Comment on attachment 8732714 [details] [diff] [review]
Handle device reset for DirectX 9.

Review of attachment 8732714 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxWindowsPlatform.h
@@ +328,5 @@
>      bool mIsWARP;
>      bool mHasDeviceReset;
>      bool mHasFakeDeviceReset;
>      bool mCompositorD3D11TextureSharingWorks;
> +    mozilla::Atomic<bool> mHasD3D9DeviceReset;

We can't use the mHasDeviceReset for this?
(In reply to Morris Tseng [:mtseng] from comment #10)
> Hi David, review ping for comment 9.

Sorry for the delay - I'll test this today.
Flags: needinfo?(dvander)
(In reply to Milan Sreckovic [:milan] from comment #11)
> Comment on attachment 8732714 [details] [diff] [review]
> Handle device reset for DirectX 9.
> 
> Review of attachment 8732714 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: gfx/thebes/gfxWindowsPlatform.h
> @@ +328,5 @@
> >      bool mIsWARP;
> >      bool mHasDeviceReset;
> >      bool mHasFakeDeviceReset;
> >      bool mCompositorD3D11TextureSharingWorks;
> > +    mozilla::Atomic<bool> mHasD3D9DeviceReset;
> 
> We can't use the mHasDeviceReset for this?

mHasDeviceReset involved d3d11 logic. I think if we use mHasDeviceReset, it might cause some side effect for d3d11. Therefore, I create an another variable for d3d9 only.
Comment on attachment 8732714 [details] [diff] [review]
Handle device reset for DirectX 9.

This patch will fix the crash, but if we fail to acquire a new device, the browser will stop compositing and will appear frozen (or entirely black, on Window XP). You can see that by injecting a late permanent failure into EnsureSwapChain and GetD3D9DeviceManager.

We need to tell the main thread to recreate the compositor, which is what OnRenderingDeviceReset does. This would also let you remove mHasD3D9DeviceReset.

Morris, you mentioned you tried that approach in comment #8 and it crashed. Do you remember what the crash was?

One other note is that mDeviceManager is not atomic, which is a pre-existing bug. We should be updating and acquiring references to it within a lock. (This used to affect D3D11 too, see bug 1258174)
Attachment #8732714 - Flags: review?(dvander) → review-
(In reply to David Anderson [:dvander] from comment #14)
> Comment on attachment 8732714 [details] [diff] [review]
> Handle device reset for DirectX 9.
> 
> This patch will fix the crash, but if we fail to acquire a new device, the
> browser will stop compositing and will appear frozen (or entirely black, on
> Window XP). You can see that by injecting a late permanent failure into
> EnsureSwapChain and GetD3D9DeviceManager.
> 
> We need to tell the main thread to recreate the compositor, which is what
> OnRenderingDeviceReset does. This would also let you remove
> mHasD3D9DeviceReset.

https://dxr.mozilla.org/mozilla-central/source/widget/windows/nsWindowGfx.cpp#189

Here will detect device reset and call OnRenderingDeviceReset(). That's why I add mHasD3D9DeviceReset. Do you think I should call OnRenderingDeviceReset directly at FailedToResetDevice? I think the mechanism in nsWindowGfx work quiet well and we should follow it.

On my local test, I injected a permanent failure into CompositorD3D9::Ready and it will recreate a basic compositor and the compositing result is correct. I'll do more tests in EnsureSwapChain and GetD3D9DeviceManager.
 
> 
> Morris, you mentioned you tried that approach in comment #8 and it crashed.
> Do you remember what the crash was?
Yes, that's because the first d3d9 swap chain hold window handle when we fallback to basic compositor. The swap chain didn't release. So I set mSwapChain and mDeviceManager to null when we detect device reset.

> 
> One other note is that mDeviceManager is not atomic, which is a pre-existing
> bug. We should be updating and acquiring references to it within a lock.
> (This used to affect D3D11 too, see bug 1258174)

Ok, I'll try to address this in my next patch. Thanks.
Flags: needinfo?(dvander)
(In reply to Morris Tseng [:mtseng] from comment #15)

> Here will detect device reset and call OnRenderingDeviceReset(). That's why
> I add mHasD3D9DeviceReset. Do you think I should call OnRenderingDeviceReset
> directly at FailedToResetDevice? I think the mechanism in nsWindowGfx work
> quiet well and we should follow it.
> 
> On my local test, I injected a permanent failure into CompositorD3D9::Ready
> and it will recreate a basic compositor and the compositing result is
> correct. I'll do more tests in EnsureSwapChain and GetD3D9DeviceManager.

Interesting, I didn't see the compositor get recreated, but it you're right - it should. It might be that I didn't receive a WM_PAINT. I'll check again.
Flags: needinfo?(dvander)
Sorry for the confusion - I think I see what I did wrong now. When I injected a failure into VerifyDeviceForRendering, I didn't call DestroyDevice. That function (racily) revokes the device from gfxWindowsPlatform, and FailedToResetDevice is only called if there's no device.

Now when I test your patch, we do recover from the device reset. It seems delayed though - Firefox turns black until the window gains focus or something triggers a paint. That's probably okay, so just fixing the mD3D9DeviceManager race is good enough.

We also shouldn't need to null out the pointer in gfxWindowsPlatform::D3D9DeviceReset since it will already be null. (Not a big deal to do it, but it's a little confusing who is responsible for clearing it, so either way maybe put a comment in that function.)
Addressed dvander's comment. Major changes:

1. Remove null out the mDeviceManager in D3D9DeviceReset.
2. Let TextureSourceD3D9 hold refptr of DeviceManagerD3D9 instead of rawptr. Because gfxWindowsPlatform might destroy device manager while device reset and the pointer of device manager in TextureSourceD3D9 would become dangling pointer.
3. Remove gfxCriticalError in TextureHostD3D9::UpdatedInternal because when device reset this operation might fail. We shouldn't crash program here because we handle the device reset.
Attachment #8732714 - Attachment is obsolete: true
Attachment #8734666 - Flags: review?(dvander)
Comment on attachment 8734666 [details] [diff] [review]
Part 1: Handle device reset for DirectX 9 v2.

Review of attachment 8734666 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/d3d9/CompositorD3D9.cpp
@@ +670,5 @@
>    // 10 is a totally arbitrary number that we may want to increase or decrease
>    // depending on how things behave in the wild.
>    if (mFailedResetAttempts > 10) {
> +    mFailedResetAttempts = 0;
> +    gfxWindowsPlatform::GetPlatform()->D3D9DeviceReset();

Maybe add more information when this happens - gfxCriticalNote, or something like that, so that we can see if it happens in crashes.

::: gfx/layers/d3d9/TextureD3D9.cpp
@@ +892,5 @@
>        regionToUpdate = nullptr;
>      }
>    }
>  
> +  mTextureSource->UpdateFromTexture(mTexture, regionToUpdate);

Do we not care if this fails?  We should at least have a gfxWarning, or gfxCriticalNote if we do not want to assert.
(In reply to Milan Sreckovic [:milan] from comment #20)
> Comment on attachment 8734666 [details] [diff] [review]
> Part 1: Handle device reset for DirectX 9 v2.
> 
> Review of attachment 8734666 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: gfx/layers/d3d9/CompositorD3D9.cpp
> @@ +670,5 @@
> >    // 10 is a totally arbitrary number that we may want to increase or decrease
> >    // depending on how things behave in the wild.
> >    if (mFailedResetAttempts > 10) {
> > +    mFailedResetAttempts = 0;
> > +    gfxWindowsPlatform::GetPlatform()->D3D9DeviceReset();
> 
> Maybe add more information when this happens - gfxCriticalNote, or something
> like that, so that we can see if it happens in crashes.
I'll add a gfxWarning here.

> 
> ::: gfx/layers/d3d9/TextureD3D9.cpp
> @@ +892,5 @@
> >        regionToUpdate = nullptr;
> >      }
> >    }
> >  
> > +  mTextureSource->UpdateFromTexture(mTexture, regionToUpdate);
> 
> Do we not care if this fails?  We should at least have a gfxWarning, or
> gfxCriticalNote if we do not want to assert.
I'll add gfxWarning as well in next patch.
Addressed Milan's comment.
Attachment #8734666 - Attachment is obsolete: true
Attachment #8734666 - Flags: review?(dvander)
Attachment #8735357 - Flags: review?(dvander)
Attachment #8735357 - Flags: review?(dvander) → review+
Comment on attachment 8734668 [details] [diff] [review]
Part 2: Make access to d3d9 device manager thread-safe.

Review of attachment 8734668 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for addressing this!

::: gfx/thebes/gfxWindowsPlatform.cpp
@@ +1398,5 @@
>  {
>    // We should only create the d3d9 device on the compositor thread
>    // or we don't have a compositor thread.
> +  RefPtr<DeviceManagerD3D9> result;
> +  MutexAutoLock lock(mDeviceLock);

I don't think it matters much, but I like to do the initialization of mDeviceManager outside the lock, and only take the lock to do the final assignment or reference transfer.
Attachment #8734668 - Flags: review?(dvander) → review+
Try is good!
https://hg.mozilla.org/mozilla-central/rev/f75d2232611e
https://hg.mozilla.org/mozilla-central/rev/2626d857b985
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
(In reply to Morris Tseng [:mtseng] from comment #21)
> > 
> > Do we not care if this fails?  We should at least have a gfxWarning, or
> > gfxCriticalNote if we do not want to assert.
> I'll add gfxWarning as well in next patch.

gfxWarning doesn't show up in crash reports or non-debug builds (by default.)  gfxCriticalNote, on the other hand, shows up in crash reports and non-debug builds' about:support section, which is useful.
(In reply to Milan Sreckovic [:milan] from comment #28)
> (In reply to Morris Tseng [:mtseng] from comment #21)
> > > 
> > > Do we not care if this fails?  We should at least have a gfxWarning, or
> > > gfxCriticalNote if we do not want to assert.
> > I'll add gfxWarning as well in next patch.
> 
> gfxWarning doesn't show up in crash reports or non-debug builds (by
> default.)  gfxCriticalNote, on the other hand, shows up in crash reports and
> non-debug builds' about:support section, which is useful.
But gfxCriticalNote will crash debug build. Do we have another way which doesn't crash debug build and show useful information in about:support section?
gfxCriticalNote would not crash the debug build, so you're OK to use it.  gfxCriticalError does.
(In reply to Milan Sreckovic [:milan] from comment #30)
> gfxCriticalNote would not crash the debug build, so you're OK to use it. 
> gfxCriticalError does.

Thanks for the info. Create a follow-up bug 1262361.
[Tracking Requested - why for this release]:
I'm not sure if this was missed but this shows up for Beta and Release, not just Nightly, but strangely not Aurora.

7-day Ranks:
#33 in Firefox 45.0.2 with 2084 crashes.
#25 in Firefox 46.0b* with 732 crashes.

Is it too late to uplift?
Flags: needinfo?(milan)
[Tracking Requested - why for this release]:

[Tracking Requested - why for this release]:

[Tracking Requested - why for this release]:
45 is EOL in a few days, tracking request moved to 46 & 47
Too late to make it into 46 RC2. But I will track this and keep it open for 46 in case we do a dot release and want to take it as a ridealong.
this is currently the #25 crash on 47.0b2 with 0.3% of crashes.
Hello Morris, David, given that this is a top crasher in Fx47, should we consider uplifting to Beta? I would like to get your assessment on the risk involved. Thanks!
Flags: needinfo?(mtseng)
Flags: needinfo?(dvander)
I'd personally be nervous about uplifting this, but I'll let the author and reviewer pipe in.
Flags: needinfo?(milan)
(In reply to Milan Sreckovic [:milan] from comment #37)
> I'd personally be nervous about uplifting this, but I'll let the author and
> reviewer pipe in.

We've uplifted much scarier reset-handling code (for D3D11, mainly). This doesn't seem to have caused any problems so far, and AFAICT (I've read crash-stats wrong before) there don't seem to be any crashes in Firefox 48/49.

So I'd be okay with uplifting this.
Flags: needinfo?(dvander)
(In reply to David Anderson [:dvander] from comment #38)

That is, this signature seems to have disappeared in 48/49.
David, since bug 1258174 didn't uplift to beta. We may only uplift part 1 of this bug to beta. Do you think this is acceptable?
Flags: needinfo?(mtseng) → needinfo?(dvander)
(In reply to Morris Tseng [:mtseng] from comment #40)
> David, since bug 1258174 didn't uplift to beta. We may only uplift part 1 of
> this bug to beta. Do you think this is acceptable?

Yeah.
Flags: needinfo?(dvander)
This is the patch for beta.
Attachment #8749537 - Flags: review+
Comment on attachment 8749537 [details] [diff] [review]
Handle device reset for d3d9.

Approval Request Comment
[Feature/regressing bug #]: 1254400
[User impact if declined]: Crashes when gpu driver reset on DirectX 9.
[Describe test coverage new/current, TreeHerder]: Tested on nightly and it seems no more crash happened on crash stats.
[Risks and why]: Should be low risk since the crash stats on nightly and aurora looks good.
[String/UUID change made/needed]: n/a
Attachment #8749537 - Flags: approval-mozilla-beta?
Comment on attachment 8749537 [details] [diff] [review]
Handle device reset for d3d9.

This is ranked #31 on 47.0b, let's hope this fix helps, Beta47+
Attachment #8749537 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
No reports from 47.0b4 yet, looks like it worked.
You need to log in before you can comment on or make changes to this bug.