Closed Bug 1160157 Opened 5 years ago Closed 2 years ago

non device reset startup forced crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame()

Categories

(Core :: Graphics, defect, P2, critical)

x86
Windows NT
defect

Tracking

()

RESOLVED WONTFIX
Tracking Status
e10s + ---
firefox39 - wontfix
firefox40 - wontfix
firefox41 --- wontfix
firefox42 --- wontfix
firefox43 --- wontfix
firefox44 --- wontfix
firefox47 --- wontfix
firefox48 --- wontfix
firefox49 --- wontfix
firefox-esr45 --- wontfix
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- fix-optional
firefox53 --- affected
firefox54 --- affected

People

(Reporter: milan, Assigned: kechen)

References

(Blocks 2 open bugs)

Details

(Keywords: crash, topcrash-win, Whiteboard: gfx-noted)

Crash Data

Attachments

(13 files, 4 obsolete files)

4.83 KB, patch
bas.schouten
: feedback+
jrmuizel
: feedback+
Details | Diff | Splinter Review
913 bytes, patch
jrmuizel
: review+
milan
: checkin+
Details | Diff | Splinter Review
8.73 KB, patch
Details | Diff | Splinter Review
10.73 KB, patch
bas.schouten
: review+
Details | Diff | Splinter Review
2.65 KB, patch
Details | Diff | Splinter Review
2.33 KB, patch
dvander
: review+
Details | Diff | Splinter Review
1.48 KB, patch
dvander
: review+
Details | Diff | Splinter Review
13.88 KB, patch
dvander
: review+
Details | Diff | Splinter Review
2.33 KB, patch
kechen
: review+
Details | Diff | Splinter Review
2.27 KB, patch
RyanVM
: checkin+
Details | Diff | Splinter Review
26.19 KB, image/png
Details
1.75 KB, patch
dvander
: feedback+
Details | Diff | Splinter Review
5.04 KB, patch
kechen
: review+
Details | Diff | Splinter Review
+++ This bug was initially created as a clone of Bug #1143806 +++

+++ This bug was initially created as a clone of Bug #1131370 +++

The patches in bug 1131370 reduced the crash volume down to 25% of the original volume, but there is still something that needs tracking down and fixing. See bug 1131370 comment 17 and bug 1131370 comment 18 for some potential next-steps.

Patch in bug 1143806 dealt with timeouts due to device reset.  At least one crash remains that is not being reported as due to device reset, so either we're not reporting it correctly or there is another reason.

https://crash-stats.mozilla.com/report/index/0518dfb3-bd13-4c08-8462-02c7f2150430
Lorenso, just one more question - you get this crash when using the Nvidia card, and only during E10S?  Assuming the driver version is 350.12?  Just in case, can you attach the about:support graphics section for the configuration that does crash?  Thanks for all your help!
Flags: needinfo?(lorensgwine)
(In reply to Milan Sreckovic [:milan] from comment #1)
> Lorenso, just one more question - you get this crash when using the Nvidia
> card, and only during E10S?  Assuming the driver version is 350.12?

Yes. E10s mode with Intel HD works fine.

Just updated Nightly. So crash: https://crash-stats.mozilla.com/report/index/e6be345e-0d74-4fb9-8ad7-910d52150430 On this configuration:

Graphics
--------

Adapter Description: Intel(R) HD Graphics Family
Adapter Description (GPU #2): NVIDIA GeForce 840M
Adapter Drivers: igdumdim64 igd10iumd64 igd10iumd64 igdumdim32 igd10iumd32 igd10iumd32
Adapter Drivers (GPU #2): nvd3dumx,nvwgf2umx,nvwgf2umx nvd3dum,nvwgf2um,nvwgf2um
Adapter RAM: Unknown
Adapter RAM (GPU #2): 2048
Asynchronous Pan/Zoom: none
Device ID: 0x0a16
Device ID (GPU #2): 0x1341
Direct2D Enabled: true
DirectWrite Enabled: true (6.3.9600.17415)
Driver Date: 3-16-2015
Driver Date (GPU #2): 4-8-2015
Driver Version: 10.18.14.4170
Driver Version (GPU #2): 9.18.13.5012
GPU #2 Active: false
GPU Accelerated Windows: 2/2 Direct3D 11 (OMTC)
Subsys ID: 130d1043
Subsys ID (GPU #2): 130d1043
Supports Hardware H264 Decoding: false
Vendor ID: 0x8086
Vendor ID (GPU #2): 0x10de
WebGL Renderer: Google Inc. -- ANGLE (NVIDIA GeForce 840M Direct3D11 vs_5_0 ps_5_0)
windowLayerManagerRemote: true
AzureCanvasBackend: direct2d 1.1
AzureContentBackend: direct2d 1.1
AzureFallbackCanvasBackend: cairo
AzureSkiaAccelerated: 0
Flags: needinfo?(lorensgwine)
Different crashes, perhaps same underlying reason:

https://crash-stats.mozilla.com/report/index/a8758ae5-8220-4d01-8d70-2f15b2150427 - crash because of a null D3D10 device, while trying to process D3D10 synced textures.  Also points out we should probably check the return from AcquireMutex beyond just the timeout (e.g., E_FAIL or WAIT_ABANDONED.)

https://crash-stats.mozilla.com/report/index/e6be345e-0d74-4fb9-8ad7-910d52150430 - MOZ_CRASH because of a failed OpenShareResource, but with DidRenderingDeviceReset() not firing.
Lorenso, one more (for now :) question - you seem to be able to reproduce this at will - is this always a startup crash for you?
Flags: needinfo?(lorensgwine)
(In reply to Milan Sreckovic [:milan] from comment #5)
> Lorenso, one more (for now :) question - you seem to be able to reproduce
> this at will - is this always a startup crash for you?

Actually, now I'm thinking it's not a startup crash, you're just able to reproduce it very quickly, but I'll let you answer the question anyway :)
Just to start the discussion about some of the things we do in FinalizeFrame, related to the "first" scenario (missing D3D10 device.)
Attachment #8599996 - Flags: feedback?(jmuizelaar)
Attachment #8599996 - Flags: feedback?(bas)
Comment on attachment 8599996 [details] [diff] [review]
Conversation piece (wip)

Review of attachment 8599996 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me.
Attachment #8599996 - Flags: feedback?(jmuizelaar) → feedback+
(In reply to Milan Sreckovic [:milan] from comment #4)
> Different crashes, perhaps same underlying reason:
> 
> https://crash-stats.mozilla.com/report/index/a8758ae5-8220-4d01-8d70-
> 2f15b2150427 - crash because of a null D3D10 device, while trying to process
> D3D10 synced textures..

This crash is in the D2D scenario.  GetD3D10Device can return null if gfxWindowsPlatform::mD2DDevice is null, or if cairo_d2d_device_get_device(mD2DDevice) returns null. D2D device as non-null and D3D10 device as null, doesn't really happen, we create D2D from D3D10 using cairo_d2d_create_device_from_d3d10device, assuming D2D is not force enabled.  mD2DDevice could be set to null if DoesD3D11TextureSharingWork return false, among other things.
Comment on attachment 8600030 [details] [diff] [review]
Part 1. Are uninitialized statics safe?

They are, but not everyone knows it.
Attachment #8600030 - Flags: review?(jmuizelaar) → review+
Do we ever end up in a D3D 10.1 and D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX combination?
(In reply to Milan Sreckovic [:milan] from comment #5)
> Lorenso, one more (for now :) question - you seem to be able to reproduce
> this at will - is this always a startup crash for you?

Nightly crashes on everytab. Just watch https://youtu.be/oGQLKIET58w
Flags: needinfo?(lorensgwine)
Attachment #8599996 - Flags: feedback?(bas) → feedback+
Comment on attachment 8600030 [details] [diff] [review]
Part 1. Are uninitialized statics safe?

This is the patch landed in comment 14.
Attachment #8600030 - Flags: checkin+
Attachment #8600030 - Attachment description: Are uninitialized statics safe? → Part 1. Are uninitialized statics safe?
This contains the changes from the "conversation piece" patch, and some additional checks and reports.
Attachment #8600382 - Flags: review?(bas)
Try to get more information, and hopefully crash earlier.  The logic was mostly kept, but perhaps we want to MOZ_CRASH in some of the cases where we didn't before?
Attachment #8600383 - Flags: review?(bas)
Trying some of these patches on try, it turns out that we cannot get mutex during crash/ref/mochi tests (reproducible locally.)  Expected?  OK?  Just in case it's useful, I added the stack when this happens.  Note that today we treat these situations (no mutex coming back) as successful locks/unlocks.  The two patches for the review changed that and try lit up :)

The stack:

Just the critical error "overhead", but note that the texture is not null.

 0:13.34 [GFX1]: Unable to get mutex when locking D3D Texture 0x0D7BCA90
 0:14.37 Assertion failure: false (An assert from the graphics logger), at c:\users\msreckovic\repos\mozilla-central\gfx\2d\Logging.h:492
 0:14.37 #01: mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::Flush (c:\users\msreckovic\repos\mozilla-central\gfx\2d\logging.h:274)
 0:14.37 #02: mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::~Log<1,mozilla::gfx::CriticalLogger> (c:\users\msreckovic\repos\mozilla-central\gfx\2d\logging.h:265)

Relevant part here:

 0:14.37 #03: mozilla::layers::LockD3DTexture<ID3D11Texture2D> (c:\users\msreckovic\repos\mozilla-central\gfx\layers\d3d11\textured3d11.cpp:129)
 0:14.37 #04: mozilla::layers::TextureClientD3D11::Lock (c:\users\msreckovic\repos\mozilla-central\gfx\layers\d3d11\textured3d11.cpp:278)
 0:14.37 #05: mozilla::layers::ContentClientRemoteBuffer::CreateBuffer (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\contentclient.cpp:346)
 0:14.40 #06: mozilla::layers::RotatedContentBuffer::BeginPaint (c:\users\msreckovic\repos\mozilla-central\gfx\layers\rotatedbuffer.cpp:672)
 0:14.40 #07: mozilla::layers::ContentClientRemoteBuffer::BeginPaintBuffer (c:\users\msreckovic\repos\mozilla-central\obj-i686-pc-mingw32\dist\include\mozilla\layers\contentclient.h:220)
 0:14.40 #08: mozilla::layers::ClientPaintedLayer::PaintThebes (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientpaintedlayer.cpp:55)
 0:14.40 #09: mozilla::layers::ClientPaintedLayer::RenderLayerWithReadback (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientpaintedlayer.cpp:1
32
 0:14.40 #10: mozilla::layers::ClientContainerLayer::RenderLayer (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientcontainerlayer.h:71)
 0:14.40 #11: mozilla::layers::ClientLayer::RenderLayerWithReadback (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientlayermanager.h:379)
 0:14.40 #12: mozilla::layers::ClientContainerLayer::RenderLayer (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientcontainerlayer.h:71)
 0:14.40 #13: mozilla::layers::ClientLayerManager::EndTransactionInternal (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientlayermanager.cpp:273)
 0:14.40 #14: mozilla::layers::ClientLayerManager::EndTransaction (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientlayermanager.cpp:316)
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bas)
Comment on attachment 8600382 [details] [diff] [review]
Part 2. Watch for and report missing device, or unexpected texture situations. r=bschouten

Review of attachment 8600382 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/d3d11/TextureD3D11.cpp
@@ +236,5 @@
>    MOZ_ASSERT(aSyncObject->GetSyncType() == SyncObject::SyncType::D3D11);
>  
>    SyncObjectD3D11* sync = static_cast<SyncObjectD3D11*>(aSyncObject);
>  
> +  MOZ_ASSERT(mTexture || mTexture10);

There's shouldn't be a real point to the checks below, there's no way this can change within a session.

If somehow both are true we should add a MOZ_ABORT_IF_FALSE below because this would mean our session is -completely- FUBAR.

@@ +1067,5 @@
>      ID3D10Device* device = gfxWindowsPlatform::GetPlatform()->GetD3D10Device();
> +    if (!device) {
> +      gfxCriticalError() << "Missing D3D10 device(2) " << hexa(gfxWindowsPlatform::GetPlatform()->GetD2DDevice());
> +      if (gfxWindowsPlatform::GetPlatform()->DidRenderingDeviceReset()) {
> +        mutex->ReleaseSync(0);

I'm not certain we should not crash here, this would be really bad, and I'm not sure we can somehow survive if this occurs (I'm also still not clear on how it could occur). There's so much else which should blow up at this point. And I'm worried we'll just spread the pain out through all kinds of other places.
Comment on attachment 8600383 [details] [diff] [review]
Part 3. More explicit with the AcquireSync/ReleaseSync calls, locking/unlocking. r=bschouten

Review of attachment 8600383 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/d3d11/CompositorD3D11.cpp
@@ +1074,5 @@
>      mAttachments->mSyncTexture->QueryInterface((IDXGIKeyedMutex**)byRef(mutex));
>  
>      MOZ_ASSERT(mutex);
>      HRESULT hr = mutex->AcquireSync(0, 10000);
> +    if (FAILED(hr)) {

It's probably a good idea to gfxCriticalError if a WAIT_TIMEOUT occurs. We also probably shouldn't be ignoring WAIT_ABANDONED, so a better thing here might be something like:

if (hr == WAIT_TIMEOUT) {
  gfxCriticalError() << etc.
} else if (hr != S_OK) {
  MOZ_CRASH();
}

::: gfx/layers/d3d11/TextureD3D11.cpp
@@ +201,5 @@
>      MOZ_ASSERT(!mIsLocked);
>      MOZ_ASSERT(mTexture || mTexture10);
>      MOZ_ASSERT(mDrawTarget->refCount() == 1);
>      if (mTexture) {
> +      MOZ_ASSERT(LockD3DTexture(mTexture.get()));

Eep! This will make LockD3DTexture not occur in release builds since MOZ_ASSERT will compile to null.

DebugOnly<bool> didSucceed = LockD3DTexture(mTexture.get()));
MOZ_ASSERT(didSucceed);

Or something of the likes would do it.
Attachment #8600383 - Flags: review?(bas) → review-
Comment on attachment 8600030 [details] [diff] [review]
Part 1. Are uninitialized statics safe?

Review of attachment 8600030 [details] [diff] [review]:
-----------------------------------------------------------------

Just for the record, I believe statics are guaranteed by the spec to be initialized to zero'ed memory by any sane compiler.

ISO 9899:1999 section 6.7.8 paragraph 10 leaves a little bit of wiggle room for bools, but I believe MSVC does this just fine.

Not a bad idea to make it explicit as far as I'm concerned though, but unlikely to be the cause of any practical issue.
(In reply to Milan Sreckovic [:milan] from comment #18)
> Trying some of these patches on try, it turns out that we cannot get mutex
> during crash/ref/mochi tests (reproducible locally.)  Expected?  OK?  Just
> in case it's useful, I added the stack when this happens.  Note that today
> we treat these situations (no mutex coming back) as successful
> locks/unlocks.  The two patches for the review changed that and try lit up :)
> 
> The stack:
> 
> Just the critical error "overhead", but note that the texture is not null.
> 
>  0:13.34 [GFX1]: Unable to get mutex when locking D3D Texture 0x0D7BCA90
>  0:14.37 Assertion failure: false (An assert from the graphics logger), at
> c:\users\msreckovic\repos\mozilla-central\gfx\2d\Logging.h:492
>  0:14.37 #01: mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::Flush
> (c:\users\msreckovic\repos\mozilla-central\gfx\2d\logging.h:274)
>  0:14.37 #02:
> mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::~Log<1,mozilla::gfx::
> CriticalLogger>
> (c:\users\msreckovic\repos\mozilla-central\gfx\2d\logging.h:265)
> 
> Relevant part here:
> 
>  0:14.37 #03: mozilla::layers::LockD3DTexture<ID3D11Texture2D>
> (c:\users\msreckovic\repos\mozilla-central\gfx\layers\d3d11\textured3d11.cpp:
> 129)
>  0:14.37 #04: mozilla::layers::TextureClientD3D11::Lock
> (c:\users\msreckovic\repos\mozilla-central\gfx\layers\d3d11\textured3d11.cpp:
> 278)
>  0:14.37 #05: mozilla::layers::ContentClientRemoteBuffer::CreateBuffer
> (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\contentclient.
> cpp:346)
>  0:14.40 #06: mozilla::layers::RotatedContentBuffer::BeginPaint
> (c:\users\msreckovic\repos\mozilla-central\gfx\layers\rotatedbuffer.cpp:672)
>  0:14.40 #07: mozilla::layers::ContentClientRemoteBuffer::BeginPaintBuffer
> (c:\users\msreckovic\repos\mozilla-central\obj-i686-pc-
> mingw32\dist\include\mozilla\layers\contentclient.h:220)
>  0:14.40 #08: mozilla::layers::ClientPaintedLayer::PaintThebes
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientpaintedlayer.cpp:55)
>  0:14.40 #09: mozilla::layers::ClientPaintedLayer::RenderLayerWithReadback
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientpaintedlayer.cpp:1
> 32
>  0:14.40 #10: mozilla::layers::ClientContainerLayer::RenderLayer
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientcontainerlayer.h:71)
>  0:14.40 #11: mozilla::layers::ClientLayer::RenderLayerWithReadback
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientlayermanager.h:379)
>  0:14.40 #12: mozilla::layers::ClientContainerLayer::RenderLayer
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientcontainerlayer.h:71)
>  0:14.40 #13: mozilla::layers::ClientLayerManager::EndTransactionInternal
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientlayermanager.cpp:273)
>  0:14.40 #14: mozilla::layers::ClientLayerManager::EndTransaction
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientlayermanager.cpp:316)

Yes, expected, those textures are no longer actually created as KeyedMutex supporting textures since the 'One Texture to Rule Them All' solution. (Un)lockD3DTexture are essentially no-ops these days for our practical purposes. They're more or less left in in case there's TextureClients that do choose to use them, we could just remove the methods all together.
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bas)
Comment on attachment 8600383 [details] [diff] [review]
Part 3. More explicit with the AcquireSync/ReleaseSync calls, locking/unlocking. r=bschouten

Review of attachment 8600383 [details] [diff] [review]:
-----------------------------------------------------------------

MOZ_ASSERT is inside a #ifdef DEBUG, missed this.
Attachment #8600383 - Flags: review- → review+
Milan, Bas, does this need more work? Looks like maybe part 3 never landed.

Do you want to request uplift to 39? This crash signature is still showing up for 39.0a2.
Flags: needinfo?(milan)
Flags: needinfo?(bas)
I may have confused some of this with the work in bug 1143806 where there was some discussion of uplift to 39.
No, you're right, there was some discussion, but I'm not sure the patches are correct, especially the mutex side, at least try doesn't like some of them. Need to get back to this.
Noting also, the crash rate so far looks high on this signature for 38.0b99 and 38.0.  
 
38.0b99 	14.69 %	2031
38.0 	3.14 %	434
(In reply to Milan Sreckovic [:milan] from comment #4)
> Different crashes, perhaps same underlying reason:

I'm seeing two categories of crash with this signature, about evenly split. I get the feeling they're not related.

There are MOZ_CRASHes due to the WAIT_TIMEOUTs. Almost all of them have code 0x8899000C (D2DERR_RECREATE_TARGET). Good mix of vendors, devices, and drivers.

The other group have null derefs on |device|. This group is 86% AMD family 20 CPUs. Suspiciously, those are the CPUs from "the AMD bug". But the CPU could be a red herring. Those chips are integrated CPU+GPU, so we could also say that this group is mostly devices 0x9802 to 0x9807. Their driver versions are all over.
http://www.geforce.com/whats-new/articles/geforce-352-86-whql-driver-released

"Windows Vista/Windows 7/Windows 8/Windows 8.1 Fixed Issues
[Firefox] Firefox Nightly crashes when started on the NVIDIA GPU on an Optimus
system. [1609030]"

Release notes says this is fixed now.
My problem seems solved with new drivers
(In reply to Lorenso Gwine from comment #31)
> My problem seems solved with new drivers

That sounds awesome - now we need to get more people to update to those new drivers...
Note that per comment 29, nvidia drivers are only going to solve a portion of these crashes.
At least some of these are startup crashes; any correlation with the startup vs. not-startup and info from comment 29?
Flags: needinfo?(milan)
Summary: non device reset crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame() → non device reset startup crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame()
Let's pick at these little by little - sounds like there is some blocklisting that can happen, we can convert few more crashes into MOZ_CRASHes, check if there was a driver reset more consistently in the FinalizeFrame.
Assignee: nobody → bas
[Tracking Requested - why for this release]: Tracking all startup crashes of any significant numbers.
Adding a tracking flag for firefox40. Based on the discussion so far, it's unclear whether we have a stable fix. It may be getting too close for firefox39. Will let Liz/Sylvestre know about this one.
(In reply to Milan Sreckovic [:milan] from comment #34)
> At least some of these are startup crashes; any correlation with the startup
> vs. not-startup and info from comment 29?

About half of the null-device crashes are within the first 60 seconds. Mostly towards the end of that first minute though; not single-digit uptimes.

The MOZ_CRASH breakpoints are not startup crashes.
Should we go for fixing the MOZ_CRASH issues anyway for 39?
Flags: needinfo?(milan)
This one is still a bit of a mystery, it will probably miss 39 if we can't figure out what to do.  Bas, one last push?
Flags: needinfo?(milan)
There's something very weird going on in the current reports, first of all there's this:

D2D- D2D1.1- D2D+ DWrite- DWrite+ D3D11 Layers- D3D11 Layers+

Second of all this MOZ_CRASH -clearly- passed a gfxCriticalError but for some reason that log message is not in the crash report.

Removing this MOZ_CRASH does not feel like a good idea since it will mean those people would get a -lot- of rendering artifacts. Those weird - and + mixes are likely a good indicator as to what's responsible for this situation though, we should try to figure out what can cause that.
Flags: needinfo?(bas)
I've found one where the gfxCriticalError did get through, and it shows 0x80070057 ERROR_INVALID_PARAMETER, this further suggests we're getting into some weird state due to this - + business.
I'm still able to reproduce this crash (or perhaps one with a similar signature) every time I update my NVIDIA drivers.

https://crash-stats.mozilla.com/report/index/b1927cab-476d-48a4-ad33-185e92150608

Per comment 30, I waited until after I installed 352.86 to see if the newer driver had actually solved my crash.

Unfortunately upon installing driver 353.06 a few minutes ago I was able to bring my e10s content tabs into a broken state where nightly only displays a spinner instead of the page content. Additionally, opening a new about:newtab works fine, but upon navigating the new tab to any website, the tab crashed and submitted the crash dump above.
(In reply to Ryan Stecker from comment #43)
> I'm still able to reproduce this crash (or perhaps one with a similar
> signature) every time I update my NVIDIA drivers.
> 
> https://crash-stats.mozilla.com/report/index/b1927cab-476d-48a4-ad33-
> 185e92150608
> 
> Per comment 30, I waited until after I installed 352.86 to see if the newer
> driver had actually solved my crash.
> 
> Unfortunately upon installing driver 353.06 a few minutes ago I was able to
> bring my e10s content tabs into a broken state where nightly only displays a
> spinner instead of the page content. Additionally, opening a new
> about:newtab works fine, but upon navigating the new tab to any website, the
> tab crashed and submitted the crash dump above.

Interesting, Milan, do we have an NVidia (non-optimus, Ryan has a GTX970 as it looks from the report, so he's running dual GPUs with his intel on board but he's not using optimus) machine somewhere in the Toronto office where we could try updating an NVidia driver.. the crash Ryan posted has the interesting -/+ stuff in there and maybe this reproduces fairly reliably when updating the driver on a real NVidia machine?
Flags: needinfo?(milan)
Not sure if we have one - Jeff?

For the "disappearing" critical errors - they may have wrapped, in which case setting gfx.logging.crash.length (int) preference to a larger number (e.g., 50) may keep it around, if we think this will help us.
Flags: needinfo?(milan) → needinfo?(jmuizelaar)
(In reply to Milan Sreckovic [:milan] from comment #45)
> Not sure if we have one - Jeff?
> 
> For the "disappearing" critical errors - they may have wrapped, in which
> case setting gfx.logging.crash.length (int) preference to a larger number
> (e.g., 50) may keep it around, if we think this will help us.

There's none at all in that report as far as I can tell, but in the new ones we have one, which is an ILLEGAL_PARAMETER one, almost certainly the -/+ business causes us to use a null handle or an invalid handle somehow. This is likely going to be very easy to fix when we can reproduce it. If we -can't- reproduce this I would suggest we decide to ignore INVALID_PARAMETER errors for now and assume the situation corrects itself in a valid way, but I would really rather not.
Unless I'm mistaking, I'm not seeing this show up in significant numbers on either 39 or 40, can anyone confirm this?
They're still around, but way down on the rankings (100+). I'm seeing relatively more of the null-device vs MOZ_CRASH variety; before they used to be pretty even.
I think the driver updates mentioned in comment #30 did really help here.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #49)
> I think the driver updates mentioned in comment #30 did really help here.

I suspect it's more likely something else helped here, users don't tend to update their drivers this fast.

We expanded some of our video blacklisting, I suspect that may be involved. But it's hard to be certain.
Should we consider this sufficiently 'fixed' for 39 and 40 at this point?
I think so.
I will stop tracking for 39 & 40.
Duplicate of this bug: 1188326
Saw this with Nightly 20150804 on Windows 8.1 64 bit with Nvidia 765M and Intel Graphics HD4600 when upgrading from
353.30-notebook-win8-win7-64bit-international-whql.exe
to
353.62-notebook-win8-win7-64bit-international-whql.exe
(In reply to Aryx [:archaeopteryx][:aryx] from comment #54)
> Saw this with Nightly 20150804 on Windows 8.1 64 bit with Nvidia 765M and
> Intel Graphics HD4600 when upgrading from
> 353.30-notebook-win8-win7-64bit-international-whql.exe
> to
> 353.62-notebook-win8-win7-64bit-international-whql.exe

Was Firefox running at the time of the driver upgrade or did Firefox crash during startup after having updated the driver? Did you restart your computer at any point during the driver update process?
I let them running during the upgrade on purpose to test this (the driver upgrade didn't prompt for a restart). Just saw that it was Firefox 39.0 which crashed and not Beta or Nightly which were also running (but had black areas). More info: https://crash-stats.mozilla.com/report/index/e57ac6c5-b7ca-4dd7-9984-acdc12150805
This remains a significant issue but is no longer a top crash. This is currently ranked 43rd in Firefox 41.0.1 with 976 crashes accounting for 0.43%.
Crash Signature: [@ mozilla::layers::SyncObjectD3D11::FinalizeFrame()] → [@ mozilla::layers::SyncObjectD3D11::FinalizeFrame()] [@ mozilla::layers::SyncObjectD3D11::FinalizeFrame]
I am noticing content tabs crashing 1-2 times every day since Aurora47 updates were enabled late last week. MConley looked at the crash reports (below) and pointed me to this bug. 

Here are the top items from my about:crashes
* https://crash-stats.mozilla.com/report/index/bp-533909e4-3ad1-420d-8cbc-c2be92160316 - 3/16/16
* https://crash-stats.mozilla.com/report/index/bp-633d5c25-05df-4346-affb-d1f562160315 - 3/15/16
* https://crash-stats.mozilla.com/report/index/bp-b54dee5b-eb40-41b6-859f-d1bdc2160315 - 3/14/16
Nom'ing for tracking by e10s team mainly due to comment 58 (more crashiness seen since 47 moved to Aurora channel).
tracking-e10s: --- → ?
Yeah, we see these coming in a small number, hopefully some of the patches we have on the device reset handling eventually clear this up.  Note that this is a forced crash on nightly and aurora - it will not happen (as such) on beta or release.
Flags: needinfo?(jmuizelaar)
Summary: non device reset startup crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame() → non device reset startup forced crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame() - nightly and aurora only
Priority: -- → P2
Attachment #8600382 - Flags: review?(bas)
Crash volume for signature 'mozilla::layers::SyncObjectD3D11::FinalizeFrame':
 - nightly (version 50): 136 crashes from 2016-06-06.
 - aurora  (version 49): 304 crashes from 2016-06-07.
 - beta    (version 48): 1174 crashes from 2016-06-06.
 - release (version 47): 9240 crashes from 2016-05-31.
 - esr     (version 45): 263 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly         23         16         22         14         20         20         16
 - aurora          42         64         52         49         36         33          9
 - beta           163        183        161        171        188        184         62
 - release       1444       1472       1323       1390       1420       1301        472
 - esr             39         27         31         32         34         20         26

Affected platform: Windows
Jerry, here is an interesting crash: https://crash-stats.mozilla.com/report/index/c2d46775-f73e-4b05-94e5-aae0e2160821.  We hit this one, MOZ_CRASH becausewe fail to open D3D11 shared resource, and then gfxWindowsPlatform::GetPlatform()->DidRenderingDeviceReset() returns false inside of SyncObjectD3D11::FinalizeFrame.  Looking at the error log, we see the the "device reset" messages just two seconds before this happens and we crash.  So, something is not quite getting initialized in time, perhaps we reset the "device reset" too soon.  What do you think?
Flags: needinfo?(hshih)
Crash volume for signature 'mozilla::layers::SyncObjectD3D11::FinalizeFrame':
 - nightly (version 51): 67 crashes from 2016-08-01.
 - aurora  (version 50): 216 crashes from 2016-08-01.
 - beta    (version 49): 580 crashes from 2016-08-02.
 - release (version 48): 1474 crashes from 2016-07-25.
 - esr     (version 45): 392 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly      32      20       7
 - aurora       81      68      23
 - beta        223     194      52
 - release     552     398     204
 - esr          36      31      36

Affected platform: Windows

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly #49       #86
 - aurora  #131      #13
 - beta    #87       #58
 - release #36       #330
 - esr     #233
In my first view, maybe we still failed at the creation of new compositor(which is triggered by our device removed detection flow).
https://hg.mozilla.org/mozilla-central/annotate/7963ebdd52b93f96b812eff2eab8d94097147b9c/gfx/layers/ipc/CompositorBridgeParent.cpp#l2293

Then, hit the early return at:
https://hg.mozilla.org/mozilla-central/annotate/7963ebdd52b93f96b812eff2eab8d94097147b9c/gfx/layers/ipc/CompositorBridgeParent.cpp#l2305

So, there is no new SyncObjectD3D11 updated from
https://hg.mozilla.org/mozilla-central/annotate/7963ebdd52b93f96b812eff2eab8d94097147b9c/gfx/layers/ipc/CompositorBridgeParent.cpp#l2311

Finally, we still use the "probably invalid" syncObj during painting.

I will try to add some criticalNote to show this.
Flags: needinfo?(hshih)
Jerry, did you have a chance to come up with a patch you mention in comment 65?
Flags: needinfo?(hshih)
Some critical error correlations:

(95.27% in signature vs 00.35% overall) gfx_error 'Failed to create a D3D11 content device: '
(75.27% in signature vs 06.38% overall) gfx_error 'Failed 2 buffer db='
(36.88% in signature vs 00.60% overall) gfx_error 'D3D11 swap chain preset failed '
(21.29% in signature vs 00.70% overall) gfx_error 'Invalid draw target type specified: '
(16.34% in signature vs 00.12% overall) gfx_error '[D3D11] TextureSourceD3D11:GetShaderResourceView CreateSRV failure '
Yes, I'm trying to write a patch for comment 65.

But I have no idea about the crash in comment 67. Content process will do nothing when it detects driver-remove. The UpdateRenderMode() (which will re-create d3d device) always triggered by CompositorBridgeChild::RecvCompositorUpdated(). So, maybe content side should skip all painting task between the timing of knowing driver-remove and receiving the RecvCompositorUpdated() call.
Assignee: bas → hshih
Status: NEW → ASSIGNED
Flags: needinfo?(hshih)
Content process will do nothing when it detects driver-remove. The UpdateRenderMode() (which will re-create d3d device) always triggered by CompositorBridgeChild::RecvCompositorUpdated(). Before RecvCompositorUpdated() message, the content side still use that bad-status device for rendering. This patch try to skip the painging and the SyncObject waiting during driver-reset.
Attachment #8794480 - Flags: review?(dvander)
Since we already have the device-remove checking in ClientLayerManager, this checking is redundant.
Attachment #8794481 - Flags: review?(dvander)
Comment on attachment 8794479 [details] [diff] [review]
show some gfxCritical logs for compositor creation failed during device-reset. v1.

Review of attachment 8794479 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorBridgeParent.cpp
@@ +2379,5 @@
>      return Nothing();
>    }
>  
> +  if (aBackendHints.IsEmpty()) {
> +    gfxCriticalNote << "No BackendHint for compositor creation.";

This should never happen, we should always have at least a basic compositor.

@@ +2387,3 @@
>    RefPtr<Compositor> compositor = NewCompositor(aBackendHints);
>    if (!compositor) {
> +    gfxCriticalNote << "Try to create a new compositor but failed.";

Same here. The only place this should truly be able to fail is OS X, since nsChildView::InitCompositor can technically return false.

::: gfx/thebes/DeviceManagerDx.cpp
@@ +124,5 @@
>  
>    CreateCompositorDevice(d3d11);
>  
>    if (!d3d11.IsEnabled()) {
> +    gfxCriticalNote << "Can't create d3d11 device in CreateCompositorDevices()";

Are these notes needed? The information should already be available in about:support (and crash reports), since every failure should be propagated through gfxConfig/telemetry.
Comment on attachment 8794480 [details] [diff] [review]
skip content painting when the device is in device-reset/remove status. v1

Review of attachment 8794480 [details] [diff] [review]:
-----------------------------------------------------------------

This seems reasonable, but I'm curious: do we actually have a case where the content process gets a device reset, and the compositor does not? Or is this patch more about not trying to use a failing SyncObject?
Attachment #8794480 - Flags: review?(dvander) → review+
Attachment #8794481 - Flags: review?(dvander) → review+
(In reply to David Anderson [:dvander] from comment #74)
> Comment on attachment 8794480 [details] [diff] [review]
> skip content painting when the device is in device-reset/remove status. v1
> 
> Review of attachment 8794480 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This seems reasonable, but I'm curious: do we actually have a case where the
> content process gets a device reset, and the compositor does not? Or is this
> patch more about not trying to use a failing SyncObject?

I think both content and chrome could get the device reset status.

The problem is that the content side can't recover the device-reset status by itself. It will still use that bad-status device for rendering and synchronization until the CompositorBridgeChild::RecvCompositorUpdated() message received. This patch try to skip the d2d/d3d calls before the recovering in RecvCompositorUpdated().
Flags: needinfo?(dvander)
Attachment #8794479 - Flags: review?(dvander)
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #75)
> (In reply to David Anderson [:dvander] from comment #74)
> > Comment on attachment 8794480 [details] [diff] [review]
> > skip content painting when the device is in device-reset/remove status. v1
> > 
> > Review of attachment 8794480 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > This seems reasonable, but I'm curious: do we actually have a case where the
> > content process gets a device reset, and the compositor does not? Or is this
> > patch more about not trying to use a failing SyncObject?
> 
> I think both content and chrome could get the device reset status.
> 
> The problem is that the content side can't recover the device-reset status
> by itself. It will still use that bad-status device for rendering and
> synchronization until the CompositorBridgeChild::RecvCompositorUpdated()
> message received. This patch try to skip the d2d/d3d calls before the
> recovering in RecvCompositorUpdated().

Okay, thanks for explaining.
Flags: needinfo?(dvander)
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e855469d6acf
skip content painting when the device is in device-reset/remove status. r=dvander
https://hg.mozilla.org/integration/mozilla-inbound/rev/4247ce021880
remove the painting aborting for ClientPaintedLayer in device-remove/reset status. r=dvander
Keywords: checkin-needed
Blocks: 1297204
Crash volume for signature 'mozilla::layers::SyncObjectD3D11::FinalizeFrame':
 - nightly (version 52): 61 crashes from 2016-09-19.
 - aurora  (version 51): 65 crashes from 2016-09-19.
 - beta    (version 50): 332 crashes from 2016-09-20.
 - release (version 49): 1523 crashes from 2016-09-05.
 - esr     (version 45): 562 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly      42      19
 - aurora       56       9
 - beta        290      42
 - release    1211     312
 - esr          48      44

Affected platform: Windows

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly #39       #47
 - aurora  #35       #44
 - beta    #65       #36
 - release #36       #131
 - esr     #220
:kechen is trying to investigate the crash problem in comment 80.
There are two main cases in this crash :

The first one is like [1], we tried to open the shared resource with the handle and it failed without device-removed. I think there are two possible reasons:
  1. The shared resource still exists but the handle in content process is not updated.
  2. We accidentally released the shared resource in somewhere, which cause the handle invalid.
According to the crash report[1], device-removed doesn't happened and the crash point is far from starting time; therefore, I think this is caused by the reason 2. I suggest to force reset driver at this point like what we do in chrome process[2].

The second one is like[3], we got a timeout when attempting to acquire the mutex of shared resource and this crash always happened after device-removed. This also appears in other signatures [4][5].
I found that there are some early return cases in OnRenderingDeviceReset[6] so they do not update the SyncObject, should we reset the SyncObject here ?
And if we are allowed to reuse the shared resource, this code[7] might cause problem, it leaves the function in the mutext section. In this case, move this part of code out of the section might solve the problem.
I am now doing an experiment about this.

Jerry, do you think it's proper to let content process reset the driver when we encounter this crash ? and do you have any thought about the second case ?


[1] https://crash-stats.mozilla.com/report/index/46aec604-fad8-4ab7-b282-fa7282161017#tab-metadata
[2] https://dxr.mozilla.org/mozilla-central/rev/944cb0fd05526894fcd90fbe7d1e625ee53cd73d/gfx/layers/d3d11/TextureD3D11.cpp#781
[3] https://crash-stats.mozilla.com/report/index/bb0fef1e-155a-4ab5-a833-b45772161026#tab-metadata
[4] https://crash-stats.mozilla.com/report/index/2d480ff9-6d6a-4174-ac2a-400482161028#tab-metadata
[5] https://crash-stats.mozilla.com/report/index/a1104467-20da-4782-8efd-25d4b2161027#tab-metadata
[6] https://dxr.mozilla.org/mozilla-central/rev/944cb0fd05526894fcd90fbe7d1e625ee53cd73d/widget/nsBaseWidget.cpp#335
[7] https://dxr.mozilla.org/mozilla-central/rev/944cb0fd05526894fcd90fbe7d1e625ee53cd73d/gfx/layers/d3d11/TextureD3D11.cpp#1237
Hello Jerry, can you take a look at comment 84 ?
Flags: needinfo?(hshih)
Assignee: hshih → kechen
Hello David,

Can you help me to review this patch ? This patch use RAII to avoid calling AcquireSync without ReleaseSync.

It mainly fixes the early return case without ReleaseSync in [1]. Hoping this can fix some timeout crashes for AcquireSync in this signature.

The following link[2] is the try of this patch, it looks good since the patch only effects Windows platform.


[1] https://dxr.mozilla.org/mozilla-central/rev/8e8b146fcb8b268e3c09b646087c6b2ef9f0af6f/gfx/layers/d3d11/TextureD3D11.cpp#1238
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=00abdae10271&selectedJob=30434764
Attachment #8808079 - Flags: review?(dvander)
Attachment #8808079 - Flags: review?(dvander) → review+
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e9db19aed6c0
Use AutoTextureLock to manage the shared resource's mutex. r=dvander
Keywords: checkin-needed
Flags: needinfo?(hshih)
Depends on: 1317131
The crash rate doesn't decrease, I am now working on another patch to fix it.
Hello David,
There are some crashes like[1] which attempt to access a null device in[2].

I think the reason of this problem is that content process failed to create D3D11 device during device reset process and leave the device null while chrome process successfully create it. Also, device reset check will return normal in this case[3].
As the result, the code will pass the device check in [4] and run into the crash.

My proposal to solve this problem is to set the device reset reason and ask chrome process to do the device reset with D3D11_COMPOSITING disabled again so that we can recover from this situation.

Can you give me some advices about the patch ? Thank you.

[1] https://crash-stats.mozilla.com/report/index/9f76066c-beb5-4e2d-af1d-5c4802161216
[2] https://hg.mozilla.org/releases/mozilla-release/annotate/cc272f7d48d3/gfx/layers/d3d11/TextureD3D11.cpp#l1124
[3] https://dxr.mozilla.org/mozilla-central/rev/6dbc6e9f62a705d5f523cc750811bd01c8275ec6/gfx/thebes/DeviceManagerDx.cpp#689
[4] https://dxr.mozilla.org/mozilla-central/rev/6dbc6e9f62a705d5f523cc750811bd01c8275ec6/gfx/layers/client/ClientLayerManager.cpp#655
Attachment #8819217 - Flags: feedback?(dvander)
Crash volume for signature 'mozilla::layers::SyncObjectD3D11::FinalizeFrame':
 - nightly (version 53): 170 crashes from 2016-11-14.
 - aurora  (version 52): 850 crashes from 2016-11-14.
 - beta    (version 51): 1123 crashes from 2016-11-14.
 - release (version 50): 18984 crashes from 2016-11-01.
 - esr     (version 45): 1152 crashes from 2016-07-06.

Crash volume on the last weeks (Week N is from 01-02 to 01-08):
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly       4       5      10      12      40      31      65
 - aurora      119     136     169     177     146      76       0
 - beta        169     192     193     218     124     140      45
 - release    2631    2706    2955    3105    3028    2709    1007
 - esr          31      50      67      57      79      47      62

Affected platform: Windows

Crash rank on the last 7 days:
           Browser   Content   Plugin
 - nightly #391      #37
 - aurora  #23       #23
 - beta    #395      #26
 - release #35       #8
 - esr     #239
Most of crashes move to Bug 1322741 due to changes in Bug 1319557.
After uplifting the fix in Bug 1322741, the crash volume should decrease.

The crashes in firefox 53 might be caused by other reasons, I will keep looking into it.
See Also: → 1322741
Perhaps we can land a version of the WIP patch that has gfxCriticalError instead of SendDeviceReset in those places?  Perhaps the ForceDeviceReset as well - in other words, where the WIP patch detects some condition and does something because of it, we just detect it and put a unique gfxCriticalError so that we can see if it triggers in these crashes?
Flags: needinfo?(kechen)
Hello Milan,
Thank you for the advice, this patch can help us to confirm the root cause of null accessing crashes.
Could you help me to review it ? Thank you.
Flags: needinfo?(kechen)
Attachment #8828195 - Flags: review?(milan)
Comment on attachment 8819217 [details] [diff] [review]
[WIP] Trigger device reset when failing to create D3D11 device in content process

Is this obsoleted by bug 1322741? If we're accessing a null D3D11 device after a reset it's not clear how this would fix it. We'd still be accessing a stale device.
Attachment #8819217 - Flags: feedback?(dvander)
Hello David,

I think there are two different problem in this signature :
  1. The SharedResourceHandler is not synchronized with D3D11 device (which is solved by bug 1322741).
  2. The device from DeviceManagerDx is nullptr like[1].

The second problem, which I think is not covered by first problem, is the one I want to solve with this patch.

My theory is that when handling device reset, we reset the device in [2] but failed to create new device in the following code which cause this crash.
Therefore I rise the device reset flag after the creation failure to make the code early-return in [3] without entering the crash segment in the first place to avoid using the stale device.

And the reason that I send a force device reset message to compositor process is to make sure that it knows content side need a new SharedResourceHandler because I don't know if there is any chance that content process have problem in creating D3D11 device while everything works fine in compositor process.

And I think it might be still profitable to land attachment 8828195 [details] [diff] [review], in one hand, it can verify if the crash is really caused by the failure creation of D3D11 device when handling device reset.

Do you have any advice on this solution ?


[1] https://crash-stats.mozilla.com/report/index/9f76066c-beb5-4e2d-af1d-5c4802161216
[2] https://dxr.mozilla.org/mozilla-central/rev/b3774461acc6bee2216c5f57e167f9e5795fb09d/gfx/thebes/gfxWindowsPlatform.cpp#445
[3] https://dxr.mozilla.org/mozilla-central/rev/b3774461acc6bee2216c5f57e167f9e5795fb09d/gfx/layers/client/ClientLayerManager.cpp#661
Flags: needinfo?(dvander)
Comment on attachment 8828195 [details] [diff] [review]
Add gfxCriticalError for failing to create D3D11 device in content side

Review of attachment 8828195 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good - should we also have that change in CompositorBridgeChild::RecvCompositorUpdated that you had in the other patch:
  // If we still get device reset here, something must wrong when creating
  // d3d11 devices.
  if (gfxPlatform::GetPlatform()->DidRenderingDeviceReset()) {
    gfxCriticalError() << "Unexpected reset device processing the compositor update";
  }
or something like that?
Attachment #8828195 - Flags: review?(milan) → review+
Keywords: checkin-needed
Pushed by ihsiao@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/19507933747f
Add gfxCriticalError for failing to create D3D11 device in content side. r=milan
Keywords: checkin-needed
Approval Request Comment
[Feature/Bug causing the regression]:
Not a regression, this patch is for crash tracing.
[User impact if declined]:
This patch is for finding the root cause of this crash. Without uplifting this, it will be difficult for the crash investigating.
[Is this code covered by automated tests?]:
No.
[Has the fix been verified in Nightly?]:
Yes.
[Needs manual test from QE? If yes, steps to reproduce]: 
No.
[List of other uplifts needed for the feature/fix]:
No.
[Is the change risky?]:
Low.
[Why is the change risky/not risky?]:
Only add some gfx errors to somewhere that might cause the crash.
[String changes made/needed]:
No.
Attachment #8829310 - Flags: approval-mozilla-aurora?
Comment on attachment 8829310 [details] [diff] [review]
[Uplift] Add gfxCriticalError for failing to create D3D11 device in content side

add diagnostics for debug gfx crash, beta52+
Attachment #8829310 - Flags: approval-mozilla-aurora? → approval-mozilla-beta+
Summary: non device reset startup forced crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame() - nightly and aurora only → non device reset startup forced crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame()
(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #96)
> Hello David,
> 
> I think there are two different problem in this signature :
>   1. The SharedResourceHandler is not synchronized with D3D11 device (which
> is solved by bug 1322741).
>   2. The device from DeviceManagerDx is nullptr like[1].
> 
> The second problem, which I think is not covered by first problem, is the
> one I want to solve with this patch.
> 
> My theory is that when handling device reset, we reset the device in [2] but
> failed to create new device in the following code which cause this crash.
> Therefore I rise the device reset flag after the creation failure to make
> the code early-return in [3] without entering the crash segment in the first
> place to avoid using the stale device.

If we handle a device reset, we throw away layers and acquire a new TextureFactoryIdentifier for everything affected. This should eliminate any possibility of FinalizeFrame not being synchronized to the device configuration. For content tabs, this happens via TabChild::CompositorUpdated[1]. For top-level widgets, it happens in nsWindow::OnRenderingDeviceReset [2]. Both of these happen *after* we acquire (or fail to acquire) a new ID3D11Device. 

So: if we paint after a device reset, we should have discarded layers and any device information on that ClientLayerManager. If we're not doing that it's a bug.

And in fact, that's what bug 1322741 papers over. Because of how we send messages to content, we can paint a tab without it having acknowledged a device reset. (There's a guard to prevent this from happening [3], but it doesn't update TextureFactoryIdentifier. I also proposed making the messages atomic in bug 1322741 comment #9.)

Another loophole is that, if we fail to acquire a content device, we still have a SyncObjectD3D11. But FinalizeFrame should fail a check [4] - we won't have any ID3D11Texture2D objects registered, so there's no reason to use the ID3D11Device. But defin

Is there a situation you're thinking of that isn't covered by these cases?

[1] http://searchfox.org/mozilla-central/rev/ed04f7a44925dace34f2d2e15ef9c0f2809d70b0/gfx/layers/ipc/CompositorBridgeChild.cpp#366
[2] http://searchfox.org/mozilla-central/rev/ed04f7a44925dace34f2d2e15ef9c0f2809d70b0/widget/nsBaseWidget.cpp#368
[3] http://searchfox.org/mozilla-central/rev/ed04f7a44925dace34f2d2e15ef9c0f2809d70b0/gfx/layers/client/ClientLayerManager.cpp#220
[4] http://searchfox.org/mozilla-central/rev/ed04f7a44925dace34f2d2e15ef9c0f2809d70b0/gfx/layers/d3d11/TextureD3D11.cpp#1256
Sorry, last sentence got cut off. I'm not sure what I meant to say :)
Flags: needinfo?(dvander)
Attachment #8829310 - Flags: checkin+
Comment on attachment 8819217 [details] [diff] [review]
[WIP] Trigger device reset when failing to create D3D11 device in content process

According to comment 103, I find the problem this patch want to fix is covered by Bug 1322741, so I obsolete this patch.
Attachment #8819217 - Attachment is obsolete: true
A crash report with the "Unexpected reset device processing..." message: https://crash-stats.mozilla.com/report/index/74e44e19-1947-4881-8ade-2f7d52170131#tab-metadata

A crash report with the "Failed to create D3D11 device..." message: https://crash-stats.mozilla.com/report/index/0c48ce0d-dafa-4e62-8961-892762170129#tab-metadata
In the meantime, there is a lot of these crashes - we don't want to just check for the null content device in SyncObjectD3D11::FinalizeFrame, before calling OpenSharedResource?  It would then combine this with the other path, where the OpenSharedResource fails, and we can act accordingly depending on whether it's a device reset or not.
Flags: needinfo?(kechen)
I think the null content device issue is covered by Bug 1322741.

After the patch uplift in Bug 1322741, we will check the null device before calling FinalizeFrame and prevent violent reading to null pointer[1]. It will eliminate most of the crashes this signature since most of them are null pointer accessing crashes.
Currently the patch is in firefox 52, should we also uplift it to firefox 51?

Since we don't crash due to the null pointer accessing now, the following question is how to recover from the null content device situation.
According to current device reset structure, if chrome process find out content device is invalid, it will trigger device reset process and renew the syncobject and update the device in content side, in this case, I think we don't have to do anything to current code.


[1] https://dxr.mozilla.org/mozilla-central/rev/1d025ac534a6333a8170a59a95a8a3673d4028ee/gfx/layers/d3d11/TextureD3D11.cpp#1247
Flags: needinfo?(kechen)
Bug 1322741 fix is in 52 beta 2.  Here is a crash in 52 beta 2:
https://crash-stats.mozilla.com/report/index/aaa9e577-4ddb-40b3-84fc-5f5c82170202

We fail to open the shared resource, we don't have device reset flag, so we crash.

Also, same crash signature, with a sync lock timeout in 54, after the patch to bug 1322741 landed on central (in 53):
https://crash-stats.mozilla.com/report/index/df10f88d-9dd2-4628-a302-2e5b22170131

Doesn't look like bug 1322741 quite took care of it all, and we still crash.
Flags: needinfo?(kechen)
With the fix in Bug 1322741, we can ensure the mSyncHandle is updated.
However, according to GraphicsCriticalError in the crashes[1], the failure of calling OpenSharedResource is caused by invalid argument and the only parameter we pass is mSyncHandle.
Which might indicates the mSyncHandle is no longer available.

I am still trying to find out the root cause right now while an easy walkaround might just asks compositor process to update a new mSyncHandle.

David do you have any advice on this ?

And about the sync lock timeout crash, I don't have any thought about it. We've raised the time limitation from 10sec to 20sec to avoid the timeout but raise the number again might not be a good idea. I will keep looking into it.

[1] https://crash-stats.mozilla.com/report/index/aaa9e577-4ddb-40b3-84fc-5f5c82170202
Flags: needinfo?(kechen) → needinfo?(dvander)
See Also: → 1338639
Here is my assumption about the current OpenSharedResource crashes after Bug 1322741:
There is a latency between the chrome process sending the handle and the content process receiving the handle.
If another device rest happened during the latency, the content process will get an expired handle and use the latest device to open it which will cause the failure of the OpenSharedHandle function call while the device won't report device reset[1].
The Same case might happen when opening a new tab which also contains the SyncObject update.

While the patch in Bug 1316788 might reduce the crash probability efficiently, I am now working on a patch that sends a synchronize ipc to chrome process to update the handle before we crash the program in [1].

[1] https://dxr.mozilla.org/mozilla-beta/source/gfx/layers/d3d11/TextureD3D11.cpp#1237
Flags: needinfo?(dvander)
(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #112)
> Created attachment 8839374 [details]
> Invalid handle due to the ipc latency.
> 
> Here is my assumption about the current OpenSharedResource crashes after Bug
> 1322741:
> There is a latency between the chrome process sending the handle and the
> content process receiving the handle.
> If another device rest happened during the latency, the content process will
> get an expired handle and use the latest device to open it which will cause
> the failure of the OpenSharedHandle function call while the device won't
> report device reset[1].
> The Same case might happen when opening a new tab which also contains the
> SyncObject update.
> 
> While the patch in Bug 1316788 might reduce the crash probability
> efficiently, I am now working on a patch that sends a synchronize ipc to
> chrome process to update the handle before we crash the program in [1].
> 
> [1]
> https://dxr.mozilla.org/mozilla-beta/source/gfx/layers/d3d11/TextureD3D11.
> cpp#1237

Hrm... at this point, why not just make the whole thing fallible? If OpenSharedResource fails, stop painting until we get a new TextureFactoryIdentifier. Requesting one synchronously might have the same problem otherwise (the compositor could have closed it already).
Hello David, could you give me some feedbacks for this patch ?

In this patch, I enable the device reset flag when failing to open the shared handle to make it fallible.
However, there is a risk that if there are other reasons besides comment 112 cause this crash and there were no updating ipc after this, the content process will stay in device reset status forever.

Instead of sending a sync ipc in comment 112, what if I send an async ipc to compositor thread, make it do the device reset again and update the handle ?
Attachment #8841463 - Flags: feedback?(dvander)
Why do we need to force a fake device reset, vs. just ignore the bad handle and wait for a new one?
(In reply to David Anderson [:dvander] from comment #115)
> Why do we need to force a fake device reset, vs. just ignore the bad handle
> and wait for a new one?

The reason of forcing a fake device reset is to skip the paintings before we get the new handle. So that we can resume the paintings when we get the new one.
Flags: needinfo?(dvander)
Why do we need a device reset to do that, versus just returning false from the function and not painting anything for that frame?
Flags: needinfo?(dvander)
Clearing this bug may take care of bug 1306619 as well.
This signature spiked in Nightly starting with 20170309030216. We now have 3727 crashes (1096 installations) in the last 7 days.
Adding a note that this is the #2 GPU Process crash @ 41.13% in Nightly 55 (#3 @ 10.24% overall) with 3786 of 4110 reports coming from the GPU Process (92.1%). I don't see any reports showing up in Aurora or Beta in the last week but we do have 18 reports in Release.
Keywords: topcrash-win
(In reply to David Anderson [:dvander] from comment #117)
> Why do we need a device reset to do that, versus just returning false from
> the function and not painting anything for that frame?

Hello David, sorry for the late reply.
The reason I set the device rest is to avoid generating transactions[1] until available handle arrives.

[1] https://dxr.mozilla.org/mozilla-central/rev/6d38ad302429c98115c354d643e81987ecec5d3c/gfx/layers/client/ClientLayerManager.cpp#352
Flags: needinfo?(dvander)
The spike might be related to Bug 1345814.
Attached patch Make AcquireSync RAII. (obsolete) — Splinter Review
I think we might need to make the AcuireSync RAII, or any return between [1] might cause the AcquireSync without Release.

Since the mutex of the texture is generated from ImageBridgeChild::GetSingleton()[2] like mSyncObject does, it might also cause the timeout of mSyncObject's AcquireSync and run into the crash.

[1] https://dxr.mozilla.org/mozilla-central/rev/6d38ad302429c98115c354d643e81987ecec5d3c/dom/media/platforms/wmf/DXVA2Manager.cpp#902-946
[2] https://dxr.mozilla.org/mozilla-central/rev/6d38ad302429c98115c354d643e81987ecec5d3c/dom/media/platforms/wmf/DXVA2Manager.cpp#900
Attachment #8847534 - Flags: feedback?(bas)
Comment on attachment 8841463 [details] [diff] [review]
Make OpenSharedResource fallible.

Review of attachment 8841463 [details] [diff] [review]:
-----------------------------------------------------------------

If you think injecting a fake reset is quite reliable, I'll r+ that. Otherwise, it might make more sense to just set an invalid bit on the SyncObject and have ClientLayerManager check that before it paints.
Attachment #8841463 - Flags: feedback?(dvander) → feedback+
(In reply to Anthony Hughes (:ashughes) [GFX][QA][Mentor] from comment #120)
> Adding a note that this is the #2 GPU Process crash @ 41.13% in Nightly 55
> (#3 @ 10.24% overall) with 3786 of 4110 reports coming from the GPU Process
> (92.1%). I don't see any reports showing up in Aurora or Beta in the last
> week but we do have 18 reports in Release.

It looks like this is a separate bug, probably, since it occurs in media code in the GPU process (this bug occurs in layers code in the content process).
Flags: needinfo?(dvander)
Comment on attachment 8847534 [details] [diff] [review]
Make AcquireSync RAII.

Review of attachment 8847534 [details] [diff] [review]:
-----------------------------------------------------------------

You're absolutely right, good catch, looks good to me.
Attachment #8847534 - Flags: feedback?(bas) → review+
The actual cause of the new crashes though is likely bug 1345814.
Carry r+ from comment 126.

The try result looks fine. (Only run the tests on Windows platform since the change is only for Windows and I think we don't run tests on XP now.)

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=0a5baba64a082d67db01c2401dba168fcaff3f75&selectedJob=84520438
Attachment #8847534 - Attachment is obsolete: true
Attachment #8848404 - Flags: review+
Keywords: checkin-needed
Depends on: 1352376
Hi Kevin,

How do you think about this modification?
Attachment #8891165 - Flags: feedback?(kechen)
Comment on attachment 8891165 [details] [diff] [review]
bug1160157_lock_all_or_lock_none.patch

Thanks Kevin, it seems this modification is not related to this issue.
Attachment #8891165 - Attachment is obsolete: true
Attachment #8891165 - Flags: feedback?(kechen)
This call stack is removed by Bug 1357299, and the crashes are resolved in Bug 1409176.
Set wonfix to this bug for now.
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.