Closed
Bug 1160157
Opened 10 years ago
Closed 7 years ago
non device reset startup forced crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame()
Categories
(Core :: Graphics, defect, P2)
Tracking
()
RESOLVED
WONTFIX
Tracking | Status | |
---|---|---|
e10s | + | --- |
firefox39 | - | wontfix |
firefox40 | - | wontfix |
firefox41 | --- | wontfix |
firefox42 | --- | wontfix |
firefox43 | --- | wontfix |
firefox44 | --- | wontfix |
firefox47 | --- | wontfix |
firefox48 | --- | wontfix |
firefox49 | --- | wontfix |
firefox-esr45 | --- | wontfix |
firefox50 | --- | wontfix |
firefox51 | --- | wontfix |
firefox52 | --- | fix-optional |
firefox53 | --- | affected |
firefox54 | --- | affected |
People
(Reporter: milan, Assigned: kechen)
References
(Blocks 1 open bug)
Details
(Keywords: crash, topcrash-win, Whiteboard: gfx-noted)
Crash Data
Attachments
(13 files, 4 obsolete files)
4.83 KB,
patch
|
bas.schouten
:
feedback+
jrmuizel
:
feedback+
|
Details | Diff | Splinter Review |
913 bytes,
patch
|
jrmuizel
:
review+
milan
:
checkin+
|
Details | Diff | Splinter Review |
8.73 KB,
patch
|
Details | Diff | Splinter Review | |
10.73 KB,
patch
|
bas.schouten
:
review+
|
Details | Diff | Splinter Review |
2.65 KB,
patch
|
Details | Diff | Splinter Review | |
2.33 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
1.48 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
13.88 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
2.33 KB,
patch
|
kechen
:
review+
|
Details | Diff | Splinter Review |
2.27 KB,
patch
|
jcristau
:
approval-mozilla-beta+
RyanVM
:
checkin+
|
Details | Diff | Splinter Review |
26.19 KB,
image/png
|
Details | |
1.75 KB,
patch
|
dvander
:
feedback+
|
Details | Diff | Splinter Review |
5.04 KB,
patch
|
kechen
:
review+
|
Details | Diff | Splinter Review |
+++ This bug was initially created as a clone of Bug #1143806 +++
+++ This bug was initially created as a clone of Bug #1131370 +++
The patches in bug 1131370 reduced the crash volume down to 25% of the original volume, but there is still something that needs tracking down and fixing. See bug 1131370 comment 17 and bug 1131370 comment 18 for some potential next-steps.
Patch in bug 1143806 dealt with timeouts due to device reset. At least one crash remains that is not being reported as due to device reset, so either we're not reporting it correctly or there is another reason.
https://crash-stats.mozilla.com/report/index/0518dfb3-bd13-4c08-8462-02c7f2150430
Reporter | ||
Comment 1•10 years ago
|
||
Lorenso, just one more question - you get this crash when using the Nvidia card, and only during E10S? Assuming the driver version is 350.12? Just in case, can you attach the about:support graphics section for the configuration that does crash? Thanks for all your help!
Flags: needinfo?(lorensgwine)
Comment 2•10 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #1)
> Lorenso, just one more question - you get this crash when using the Nvidia
> card, and only during E10S? Assuming the driver version is 350.12?
Yes. E10s mode with Intel HD works fine.
Just updated Nightly. So crash: https://crash-stats.mozilla.com/report/index/e6be345e-0d74-4fb9-8ad7-910d52150430 On this configuration:
Graphics
--------
Adapter Description: Intel(R) HD Graphics Family
Adapter Description (GPU #2): NVIDIA GeForce 840M
Adapter Drivers: igdumdim64 igd10iumd64 igd10iumd64 igdumdim32 igd10iumd32 igd10iumd32
Adapter Drivers (GPU #2): nvd3dumx,nvwgf2umx,nvwgf2umx nvd3dum,nvwgf2um,nvwgf2um
Adapter RAM: Unknown
Adapter RAM (GPU #2): 2048
Asynchronous Pan/Zoom: none
Device ID: 0x0a16
Device ID (GPU #2): 0x1341
Direct2D Enabled: true
DirectWrite Enabled: true (6.3.9600.17415)
Driver Date: 3-16-2015
Driver Date (GPU #2): 4-8-2015
Driver Version: 10.18.14.4170
Driver Version (GPU #2): 9.18.13.5012
GPU #2 Active: false
GPU Accelerated Windows: 2/2 Direct3D 11 (OMTC)
Subsys ID: 130d1043
Subsys ID (GPU #2): 130d1043
Supports Hardware H264 Decoding: false
Vendor ID: 0x8086
Vendor ID (GPU #2): 0x10de
WebGL Renderer: Google Inc. -- ANGLE (NVIDIA GeForce 840M Direct3D11 vs_5_0 ps_5_0)
windowLayerManagerRemote: true
AzureCanvasBackend: direct2d 1.1
AzureContentBackend: direct2d 1.1
AzureFallbackCanvasBackend: cairo
AzureSkiaAccelerated: 0
Flags: needinfo?(lorensgwine)
Reporter | ||
Comment 3•10 years ago
|
||
Thanks!
Reporter | ||
Comment 4•10 years ago
|
||
Different crashes, perhaps same underlying reason:
https://crash-stats.mozilla.com/report/index/a8758ae5-8220-4d01-8d70-2f15b2150427 - crash because of a null D3D10 device, while trying to process D3D10 synced textures. Also points out we should probably check the return from AcquireMutex beyond just the timeout (e.g., E_FAIL or WAIT_ABANDONED.)
https://crash-stats.mozilla.com/report/index/e6be345e-0d74-4fb9-8ad7-910d52150430 - MOZ_CRASH because of a failed OpenShareResource, but with DidRenderingDeviceReset() not firing.
Reporter | ||
Comment 5•10 years ago
|
||
Lorenso, one more (for now :) question - you seem to be able to reproduce this at will - is this always a startup crash for you?
Flags: needinfo?(lorensgwine)
Reporter | ||
Comment 6•10 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #5)
> Lorenso, one more (for now :) question - you seem to be able to reproduce
> this at will - is this always a startup crash for you?
Actually, now I'm thinking it's not a startup crash, you're just able to reproduce it very quickly, but I'll let you answer the question anyway :)
Reporter | ||
Comment 7•10 years ago
|
||
Just to start the discussion about some of the things we do in FinalizeFrame, related to the "first" scenario (missing D3D10 device.)
Attachment #8599996 -
Flags: feedback?(jmuizelaar)
Attachment #8599996 -
Flags: feedback?(bas)
Comment 8•10 years ago
|
||
Comment on attachment 8599996 [details] [diff] [review]
Conversation piece (wip)
Review of attachment 8599996 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good to me.
Attachment #8599996 -
Flags: feedback?(jmuizelaar) → feedback+
Reporter | ||
Comment 9•10 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #4)
> Different crashes, perhaps same underlying reason:
>
> https://crash-stats.mozilla.com/report/index/a8758ae5-8220-4d01-8d70-
> 2f15b2150427 - crash because of a null D3D10 device, while trying to process
> D3D10 synced textures..
This crash is in the D2D scenario. GetD3D10Device can return null if gfxWindowsPlatform::mD2DDevice is null, or if cairo_d2d_device_get_device(mD2DDevice) returns null. D2D device as non-null and D3D10 device as null, doesn't really happen, we create D2D from D3D10 using cairo_d2d_create_device_from_d3d10device, assuming D2D is not force enabled. mD2DDevice could be set to null if DoesD3D11TextureSharingWork return false, among other things.
Reporter | ||
Comment 10•10 years ago
|
||
Attachment #8600030 -
Flags: review?(jmuizelaar)
Comment 11•10 years ago
|
||
Comment on attachment 8600030 [details] [diff] [review]
Part 1. Are uninitialized statics safe?
They are, but not everyone knows it.
Attachment #8600030 -
Flags: review?(jmuizelaar) → review+
Reporter | ||
Comment 12•10 years ago
|
||
Do we ever end up in a D3D 10.1 and D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX combination?
Comment 13•10 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #5)
> Lorenso, one more (for now :) question - you seem to be able to reproduce
> this at will - is this always a startup crash for you?
Nightly crashes on everytab. Just watch https://youtu.be/oGQLKIET58w
Flags: needinfo?(lorensgwine)
Updated•10 years ago
|
Attachment #8599996 -
Flags: feedback?(bas) → feedback+
Reporter | ||
Comment 14•10 years ago
|
||
The reviewed trivial patch: https://hg.mozilla.org/integration/mozilla-inbound/rev/8fe081b9ff71
Keywords: leave-open
Reporter | ||
Comment 15•10 years ago
|
||
Comment on attachment 8600030 [details] [diff] [review]
Part 1. Are uninitialized statics safe?
This is the patch landed in comment 14.
Attachment #8600030 -
Flags: checkin+
Reporter | ||
Updated•10 years ago
|
Attachment #8600030 -
Attachment description: Are uninitialized statics safe? → Part 1. Are uninitialized statics safe?
Reporter | ||
Comment 16•10 years ago
|
||
This contains the changes from the "conversation piece" patch, and some additional checks and reports.
Attachment #8600382 -
Flags: review?(bas)
Reporter | ||
Comment 17•10 years ago
|
||
Try to get more information, and hopefully crash earlier. The logic was mostly kept, but perhaps we want to MOZ_CRASH in some of the cases where we didn't before?
Attachment #8600383 -
Flags: review?(bas)
Reporter | ||
Comment 18•10 years ago
|
||
Trying some of these patches on try, it turns out that we cannot get mutex during crash/ref/mochi tests (reproducible locally.) Expected? OK? Just in case it's useful, I added the stack when this happens. Note that today we treat these situations (no mutex coming back) as successful locks/unlocks. The two patches for the review changed that and try lit up :)
The stack:
Just the critical error "overhead", but note that the texture is not null.
0:13.34 [GFX1]: Unable to get mutex when locking D3D Texture 0x0D7BCA90
0:14.37 Assertion failure: false (An assert from the graphics logger), at c:\users\msreckovic\repos\mozilla-central\gfx\2d\Logging.h:492
0:14.37 #01: mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::Flush (c:\users\msreckovic\repos\mozilla-central\gfx\2d\logging.h:274)
0:14.37 #02: mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::~Log<1,mozilla::gfx::CriticalLogger> (c:\users\msreckovic\repos\mozilla-central\gfx\2d\logging.h:265)
Relevant part here:
0:14.37 #03: mozilla::layers::LockD3DTexture<ID3D11Texture2D> (c:\users\msreckovic\repos\mozilla-central\gfx\layers\d3d11\textured3d11.cpp:129)
0:14.37 #04: mozilla::layers::TextureClientD3D11::Lock (c:\users\msreckovic\repos\mozilla-central\gfx\layers\d3d11\textured3d11.cpp:278)
0:14.37 #05: mozilla::layers::ContentClientRemoteBuffer::CreateBuffer (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\contentclient.cpp:346)
0:14.40 #06: mozilla::layers::RotatedContentBuffer::BeginPaint (c:\users\msreckovic\repos\mozilla-central\gfx\layers\rotatedbuffer.cpp:672)
0:14.40 #07: mozilla::layers::ContentClientRemoteBuffer::BeginPaintBuffer (c:\users\msreckovic\repos\mozilla-central\obj-i686-pc-mingw32\dist\include\mozilla\layers\contentclient.h:220)
0:14.40 #08: mozilla::layers::ClientPaintedLayer::PaintThebes (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientpaintedlayer.cpp:55)
0:14.40 #09: mozilla::layers::ClientPaintedLayer::RenderLayerWithReadback (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientpaintedlayer.cpp:1
32
0:14.40 #10: mozilla::layers::ClientContainerLayer::RenderLayer (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientcontainerlayer.h:71)
0:14.40 #11: mozilla::layers::ClientLayer::RenderLayerWithReadback (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientlayermanager.h:379)
0:14.40 #12: mozilla::layers::ClientContainerLayer::RenderLayer (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientcontainerlayer.h:71)
0:14.40 #13: mozilla::layers::ClientLayerManager::EndTransactionInternal (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientlayermanager.cpp:273)
0:14.40 #14: mozilla::layers::ClientLayerManager::EndTransaction (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientlayermanager.cpp:316)
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bas)
Comment 19•10 years ago
|
||
Comment on attachment 8600382 [details] [diff] [review]
Part 2. Watch for and report missing device, or unexpected texture situations. r=bschouten
Review of attachment 8600382 [details] [diff] [review]:
-----------------------------------------------------------------
::: gfx/layers/d3d11/TextureD3D11.cpp
@@ +236,5 @@
> MOZ_ASSERT(aSyncObject->GetSyncType() == SyncObject::SyncType::D3D11);
>
> SyncObjectD3D11* sync = static_cast<SyncObjectD3D11*>(aSyncObject);
>
> + MOZ_ASSERT(mTexture || mTexture10);
There's shouldn't be a real point to the checks below, there's no way this can change within a session.
If somehow both are true we should add a MOZ_ABORT_IF_FALSE below because this would mean our session is -completely- FUBAR.
@@ +1067,5 @@
> ID3D10Device* device = gfxWindowsPlatform::GetPlatform()->GetD3D10Device();
> + if (!device) {
> + gfxCriticalError() << "Missing D3D10 device(2) " << hexa(gfxWindowsPlatform::GetPlatform()->GetD2DDevice());
> + if (gfxWindowsPlatform::GetPlatform()->DidRenderingDeviceReset()) {
> + mutex->ReleaseSync(0);
I'm not certain we should not crash here, this would be really bad, and I'm not sure we can somehow survive if this occurs (I'm also still not clear on how it could occur). There's so much else which should blow up at this point. And I'm worried we'll just spread the pain out through all kinds of other places.
Comment 20•10 years ago
|
||
Comment on attachment 8600383 [details] [diff] [review]
Part 3. More explicit with the AcquireSync/ReleaseSync calls, locking/unlocking. r=bschouten
Review of attachment 8600383 [details] [diff] [review]:
-----------------------------------------------------------------
::: gfx/layers/d3d11/CompositorD3D11.cpp
@@ +1074,5 @@
> mAttachments->mSyncTexture->QueryInterface((IDXGIKeyedMutex**)byRef(mutex));
>
> MOZ_ASSERT(mutex);
> HRESULT hr = mutex->AcquireSync(0, 10000);
> + if (FAILED(hr)) {
It's probably a good idea to gfxCriticalError if a WAIT_TIMEOUT occurs. We also probably shouldn't be ignoring WAIT_ABANDONED, so a better thing here might be something like:
if (hr == WAIT_TIMEOUT) {
gfxCriticalError() << etc.
} else if (hr != S_OK) {
MOZ_CRASH();
}
::: gfx/layers/d3d11/TextureD3D11.cpp
@@ +201,5 @@
> MOZ_ASSERT(!mIsLocked);
> MOZ_ASSERT(mTexture || mTexture10);
> MOZ_ASSERT(mDrawTarget->refCount() == 1);
> if (mTexture) {
> + MOZ_ASSERT(LockD3DTexture(mTexture.get()));
Eep! This will make LockD3DTexture not occur in release builds since MOZ_ASSERT will compile to null.
DebugOnly<bool> didSucceed = LockD3DTexture(mTexture.get()));
MOZ_ASSERT(didSucceed);
Or something of the likes would do it.
Attachment #8600383 -
Flags: review?(bas) → review-
Comment 21•10 years ago
|
||
Comment on attachment 8600030 [details] [diff] [review]
Part 1. Are uninitialized statics safe?
Review of attachment 8600030 [details] [diff] [review]:
-----------------------------------------------------------------
Just for the record, I believe statics are guaranteed by the spec to be initialized to zero'ed memory by any sane compiler.
ISO 9899:1999 section 6.7.8 paragraph 10 leaves a little bit of wiggle room for bools, but I believe MSVC does this just fine.
Not a bad idea to make it explicit as far as I'm concerned though, but unlikely to be the cause of any practical issue.
Comment 22•10 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #18)
> Trying some of these patches on try, it turns out that we cannot get mutex
> during crash/ref/mochi tests (reproducible locally.) Expected? OK? Just
> in case it's useful, I added the stack when this happens. Note that today
> we treat these situations (no mutex coming back) as successful
> locks/unlocks. The two patches for the review changed that and try lit up :)
>
> The stack:
>
> Just the critical error "overhead", but note that the texture is not null.
>
> 0:13.34 [GFX1]: Unable to get mutex when locking D3D Texture 0x0D7BCA90
> 0:14.37 Assertion failure: false (An assert from the graphics logger), at
> c:\users\msreckovic\repos\mozilla-central\gfx\2d\Logging.h:492
> 0:14.37 #01: mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::Flush
> (c:\users\msreckovic\repos\mozilla-central\gfx\2d\logging.h:274)
> 0:14.37 #02:
> mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::~Log<1,mozilla::gfx::
> CriticalLogger>
> (c:\users\msreckovic\repos\mozilla-central\gfx\2d\logging.h:265)
>
> Relevant part here:
>
> 0:14.37 #03: mozilla::layers::LockD3DTexture<ID3D11Texture2D>
> (c:\users\msreckovic\repos\mozilla-central\gfx\layers\d3d11\textured3d11.cpp:
> 129)
> 0:14.37 #04: mozilla::layers::TextureClientD3D11::Lock
> (c:\users\msreckovic\repos\mozilla-central\gfx\layers\d3d11\textured3d11.cpp:
> 278)
> 0:14.37 #05: mozilla::layers::ContentClientRemoteBuffer::CreateBuffer
> (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\contentclient.
> cpp:346)
> 0:14.40 #06: mozilla::layers::RotatedContentBuffer::BeginPaint
> (c:\users\msreckovic\repos\mozilla-central\gfx\layers\rotatedbuffer.cpp:672)
> 0:14.40 #07: mozilla::layers::ContentClientRemoteBuffer::BeginPaintBuffer
> (c:\users\msreckovic\repos\mozilla-central\obj-i686-pc-
> mingw32\dist\include\mozilla\layers\contentclient.h:220)
> 0:14.40 #08: mozilla::layers::ClientPaintedLayer::PaintThebes
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientpaintedlayer.cpp:55)
> 0:14.40 #09: mozilla::layers::ClientPaintedLayer::RenderLayerWithReadback
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientpaintedlayer.cpp:1
> 32
> 0:14.40 #10: mozilla::layers::ClientContainerLayer::RenderLayer
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientcontainerlayer.h:71)
> 0:14.40 #11: mozilla::layers::ClientLayer::RenderLayerWithReadback
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientlayermanager.h:379)
> 0:14.40 #12: mozilla::layers::ClientContainerLayer::RenderLayer
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientcontainerlayer.h:71)
> 0:14.40 #13: mozilla::layers::ClientLayerManager::EndTransactionInternal
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientlayermanager.cpp:273)
> 0:14.40 #14: mozilla::layers::ClientLayerManager::EndTransaction
> (c:\users\msreckovic\repos\mozilla-
> central\gfx\layers\client\clientlayermanager.cpp:316)
Yes, expected, those textures are no longer actually created as KeyedMutex supporting textures since the 'One Texture to Rule Them All' solution. (Un)lockD3DTexture are essentially no-ops these days for our practical purposes. They're more or less left in in case there's TextureClients that do choose to use them, we could just remove the methods all together.
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bas)
Comment 23•10 years ago
|
||
Comment 24•10 years ago
|
||
Comment on attachment 8600383 [details] [diff] [review]
Part 3. More explicit with the AcquireSync/ReleaseSync calls, locking/unlocking. r=bschouten
Review of attachment 8600383 [details] [diff] [review]:
-----------------------------------------------------------------
MOZ_ASSERT is inside a #ifdef DEBUG, missed this.
Attachment #8600383 -
Flags: review- → review+
Comment 25•10 years ago
|
||
Milan, Bas, does this need more work? Looks like maybe part 3 never landed.
Do you want to request uplift to 39? This crash signature is still showing up for 39.0a2.
Flags: needinfo?(milan)
Flags: needinfo?(bas)
Comment 26•10 years ago
|
||
I may have confused some of this with the work in bug 1143806 where there was some discussion of uplift to 39.
Reporter | ||
Comment 27•10 years ago
|
||
No, you're right, there was some discussion, but I'm not sure the patches are correct, especially the mutex side, at least try doesn't like some of them. Need to get back to this.
Comment 28•10 years ago
|
||
Noting also, the crash rate so far looks high on this signature for 38.0b99 and 38.0.
38.0b99 14.69 % 2031
38.0 3.14 % 434
![]() |
||
Comment 29•10 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #4)
> Different crashes, perhaps same underlying reason:
I'm seeing two categories of crash with this signature, about evenly split. I get the feeling they're not related.
There are MOZ_CRASHes due to the WAIT_TIMEOUTs. Almost all of them have code 0x8899000C (D2DERR_RECREATE_TARGET). Good mix of vendors, devices, and drivers.
The other group have null derefs on |device|. This group is 86% AMD family 20 CPUs. Suspiciously, those are the CPUs from "the AMD bug". But the CPU could be a red herring. Those chips are integrated CPU+GPU, so we could also say that this group is mostly devices 0x9802 to 0x9807. Their driver versions are all over.
Comment 30•10 years ago
|
||
http://www.geforce.com/whats-new/articles/geforce-352-86-whql-driver-released
"Windows Vista/Windows 7/Windows 8/Windows 8.1 Fixed Issues
[Firefox] Firefox Nightly crashes when started on the NVIDIA GPU on an Optimus
system. [1609030]"
Release notes says this is fixed now.
Comment 31•10 years ago
|
||
My problem seems solved with new drivers
![]() |
||
Comment 32•10 years ago
|
||
(In reply to Lorenso Gwine from comment #31)
> My problem seems solved with new drivers
That sounds awesome - now we need to get more people to update to those new drivers...
![]() |
||
Comment 33•10 years ago
|
||
Note that per comment 29, nvidia drivers are only going to solve a portion of these crashes.
Reporter | ||
Comment 34•10 years ago
|
||
At least some of these are startup crashes; any correlation with the startup vs. not-startup and info from comment 29?
Flags: needinfo?(milan)
Summary: non device reset crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame() → non device reset startup crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame()
Reporter | ||
Comment 35•10 years ago
|
||
Let's pick at these little by little - sounds like there is some blocklisting that can happen, we can convert few more crashes into MOZ_CRASHes, check if there was a driver reset more consistently in the FinalizeFrame.
Assignee: nobody → bas
Reporter | ||
Comment 36•10 years ago
|
||
[Tracking Requested - why for this release]: Tracking all startup crashes of any significant numbers.
tracking-firefox39:
--- → ?
tracking-firefox40:
--- → ?
Comment 37•10 years ago
|
||
Adding a tracking flag for firefox40. Based on the discussion so far, it's unclear whether we have a stable fix. It may be getting too close for firefox39. Will let Liz/Sylvestre know about this one.
Updated•10 years ago
|
status-firefox39:
--- → affected
![]() |
||
Comment 38•10 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #34)
> At least some of these are startup crashes; any correlation with the startup
> vs. not-startup and info from comment 29?
About half of the null-device crashes are within the first 60 seconds. Mostly towards the end of that first minute though; not single-digit uptimes.
The MOZ_CRASH breakpoints are not startup crashes.
Comment 39•10 years ago
|
||
Should we go for fixing the MOZ_CRASH issues anyway for 39?
Flags: needinfo?(milan)
Reporter | ||
Comment 40•10 years ago
|
||
This one is still a bit of a mystery, it will probably miss 39 if we can't figure out what to do. Bas, one last push?
Flags: needinfo?(milan)
Comment 41•10 years ago
|
||
There's something very weird going on in the current reports, first of all there's this:
D2D- D2D1.1- D2D+ DWrite- DWrite+ D3D11 Layers- D3D11 Layers+
Second of all this MOZ_CRASH -clearly- passed a gfxCriticalError but for some reason that log message is not in the crash report.
Removing this MOZ_CRASH does not feel like a good idea since it will mean those people would get a -lot- of rendering artifacts. Those weird - and + mixes are likely a good indicator as to what's responsible for this situation though, we should try to figure out what can cause that.
Flags: needinfo?(bas)
Comment 42•10 years ago
|
||
I've found one where the gfxCriticalError did get through, and it shows 0x80070057 ERROR_INVALID_PARAMETER, this further suggests we're getting into some weird state due to this - + business.
Comment 43•10 years ago
|
||
I'm still able to reproduce this crash (or perhaps one with a similar signature) every time I update my NVIDIA drivers.
https://crash-stats.mozilla.com/report/index/b1927cab-476d-48a4-ad33-185e92150608
Per comment 30, I waited until after I installed 352.86 to see if the newer driver had actually solved my crash.
Unfortunately upon installing driver 353.06 a few minutes ago I was able to bring my e10s content tabs into a broken state where nightly only displays a spinner instead of the page content. Additionally, opening a new about:newtab works fine, but upon navigating the new tab to any website, the tab crashed and submitted the crash dump above.
Comment 44•10 years ago
|
||
(In reply to Ryan Stecker from comment #43)
> I'm still able to reproduce this crash (or perhaps one with a similar
> signature) every time I update my NVIDIA drivers.
>
> https://crash-stats.mozilla.com/report/index/b1927cab-476d-48a4-ad33-
> 185e92150608
>
> Per comment 30, I waited until after I installed 352.86 to see if the newer
> driver had actually solved my crash.
>
> Unfortunately upon installing driver 353.06 a few minutes ago I was able to
> bring my e10s content tabs into a broken state where nightly only displays a
> spinner instead of the page content. Additionally, opening a new
> about:newtab works fine, but upon navigating the new tab to any website, the
> tab crashed and submitted the crash dump above.
Interesting, Milan, do we have an NVidia (non-optimus, Ryan has a GTX970 as it looks from the report, so he's running dual GPUs with his intel on board but he's not using optimus) machine somewhere in the Toronto office where we could try updating an NVidia driver.. the crash Ryan posted has the interesting -/+ stuff in there and maybe this reproduces fairly reliably when updating the driver on a real NVidia machine?
Flags: needinfo?(milan)
Reporter | ||
Comment 45•10 years ago
|
||
Not sure if we have one - Jeff?
For the "disappearing" critical errors - they may have wrapped, in which case setting gfx.logging.crash.length (int) preference to a larger number (e.g., 50) may keep it around, if we think this will help us.
Flags: needinfo?(milan) → needinfo?(jmuizelaar)
Comment 46•10 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #45)
> Not sure if we have one - Jeff?
>
> For the "disappearing" critical errors - they may have wrapped, in which
> case setting gfx.logging.crash.length (int) preference to a larger number
> (e.g., 50) may keep it around, if we think this will help us.
There's none at all in that report as far as I can tell, but in the new ones we have one, which is an ILLEGAL_PARAMETER one, almost certainly the -/+ business causes us to use a null handle or an invalid handle somehow. This is likely going to be very easy to fix when we can reproduce it. If we -can't- reproduce this I would suggest we decide to ignore INVALID_PARAMETER errors for now and assume the situation corrects itself in a valid way, but I would really rather not.
Comment 47•10 years ago
|
||
Unless I'm mistaking, I'm not seeing this show up in significant numbers on either 39 or 40, can anyone confirm this?
![]() |
||
Comment 48•10 years ago
|
||
They're still around, but way down on the rankings (100+). I'm seeing relatively more of the null-device vs MOZ_CRASH variety; before they used to be pretty even.
![]() |
||
Comment 49•10 years ago
|
||
I think the driver updates mentioned in comment #30 did really help here.
Comment 50•10 years ago
|
||
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #49)
> I think the driver updates mentioned in comment #30 did really help here.
I suspect it's more likely something else helped here, users don't tend to update their drivers this fast.
We expanded some of our video blacklisting, I suspect that may be involved. But it's hard to be certain.
Comment 51•10 years ago
|
||
Should we consider this sufficiently 'fixed' for 39 and 40 at this point?
![]() |
||
Comment 54•10 years ago
|
||
Saw this with Nightly 20150804 on Windows 8.1 64 bit with Nvidia 765M and Intel Graphics HD4600 when upgrading from
353.30-notebook-win8-win7-64bit-international-whql.exe
to
353.62-notebook-win8-win7-64bit-international-whql.exe
Comment 55•10 years ago
|
||
(In reply to Aryx [:archaeopteryx][:aryx] from comment #54)
> Saw this with Nightly 20150804 on Windows 8.1 64 bit with Nvidia 765M and
> Intel Graphics HD4600 when upgrading from
> 353.30-notebook-win8-win7-64bit-international-whql.exe
> to
> 353.62-notebook-win8-win7-64bit-international-whql.exe
Was Firefox running at the time of the driver upgrade or did Firefox crash during startup after having updated the driver? Did you restart your computer at any point during the driver update process?
![]() |
||
Comment 56•10 years ago
|
||
I let them running during the upgrade on purpose to test this (the driver upgrade didn't prompt for a restart). Just saw that it was Firefox 39.0 which crashed and not Beta or Nightly which were also running (but had black areas). More info: https://crash-stats.mozilla.com/report/index/e57ac6c5-b7ca-4dd7-9984-acdc12150805
Comment 57•9 years ago
|
||
This remains a significant issue but is no longer a top crash. This is currently ranked 43rd in Firefox 41.0.1 with 976 crashes accounting for 0.43%.
status-firefox41:
--- → affected
status-firefox42:
--- → affected
status-firefox43:
--- → affected
status-firefox44:
--- → affected
Keywords: topcrash
Updated•9 years ago
|
Crash Signature: [@ mozilla::layers::SyncObjectD3D11::FinalizeFrame()] → [@ mozilla::layers::SyncObjectD3D11::FinalizeFrame()]
[@ mozilla::layers::SyncObjectD3D11::FinalizeFrame]
I am noticing content tabs crashing 1-2 times every day since Aurora47 updates were enabled late last week. MConley looked at the crash reports (below) and pointed me to this bug.
Here are the top items from my about:crashes
* https://crash-stats.mozilla.com/report/index/bp-533909e4-3ad1-420d-8cbc-c2be92160316 - 3/16/16
* https://crash-stats.mozilla.com/report/index/bp-633d5c25-05df-4346-affb-d1f562160315 - 3/15/16
* https://crash-stats.mozilla.com/report/index/bp-b54dee5b-eb40-41b6-859f-d1bdc2160315 - 3/14/16
Nom'ing for tracking by e10s team mainly due to comment 58 (more crashiness seen since 47 moved to Aurora channel).
tracking-e10s:
--- → ?
Reporter | ||
Comment 60•9 years ago
|
||
Yeah, we see these coming in a small number, hopefully some of the patches we have on the device reset handling eventually clear this up. Note that this is a forced crash on nightly and aurora - it will not happen (as such) on beta or release.
Flags: needinfo?(jmuizelaar)
Reporter | ||
Updated•9 years ago
|
Summary: non device reset startup crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame() → non device reset startup forced crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame() - nightly and aurora only
Updated•9 years ago
|
Priority: -- → P2
Updated•9 years ago
|
Attachment #8600382 -
Flags: review?(bas)
![]() |
||
Comment 61•9 years ago
|
||
This is still happening sometimes, e.g. there were 5 occurrences in Nightly 20160501030217:
https://crash-stats.mozilla.com/report/index/cf2f68e7-096e-4459-9c74-f23a62160502
https://crash-stats.mozilla.com/report/index/38b4938b-0763-444d-9cca-0f2c72160502
https://crash-stats.mozilla.com/report/index/08de245b-a4a9-4d29-b13e-9ed7e2160502
https://crash-stats.mozilla.com/report/index/4d178bd8-4216-472f-b5a0-718f52160502
https://crash-stats.mozilla.com/report/index/022c4265-c2f3-4ff8-8ef3-3abac2160502
Comment 62•9 years ago
|
||
Crash volume for signature 'mozilla::layers::SyncObjectD3D11::FinalizeFrame':
- nightly (version 50): 136 crashes from 2016-06-06.
- aurora (version 49): 304 crashes from 2016-06-07.
- beta (version 48): 1174 crashes from 2016-06-06.
- release (version 47): 9240 crashes from 2016-05-31.
- esr (version 45): 263 crashes from 2016-04-07.
Crash volume on the last weeks:
Week N-1 Week N-2 Week N-3 Week N-4 Week N-5 Week N-6 Week N-7
- nightly 23 16 22 14 20 20 16
- aurora 42 64 52 49 36 33 9
- beta 163 183 161 171 188 184 62
- release 1444 1472 1323 1390 1420 1301 472
- esr 39 27 31 32 34 20 26
Affected platform: Windows
status-firefox47:
--- → affected
status-firefox48:
--- → affected
status-firefox49:
--- → affected
status-firefox50:
--- → affected
status-firefox-esr45:
--- → affected
Reporter | ||
Comment 63•9 years ago
|
||
Jerry, here is an interesting crash: https://crash-stats.mozilla.com/report/index/c2d46775-f73e-4b05-94e5-aae0e2160821. We hit this one, MOZ_CRASH becausewe fail to open D3D11 shared resource, and then gfxWindowsPlatform::GetPlatform()->DidRenderingDeviceReset() returns false inside of SyncObjectD3D11::FinalizeFrame. Looking at the error log, we see the the "device reset" messages just two seconds before this happens and we crash. So, something is not quite getting initialized in time, perhaps we reset the "device reset" too soon. What do you think?
Flags: needinfo?(hshih)
Comment 64•9 years ago
|
||
Crash volume for signature 'mozilla::layers::SyncObjectD3D11::FinalizeFrame':
- nightly (version 51): 67 crashes from 2016-08-01.
- aurora (version 50): 216 crashes from 2016-08-01.
- beta (version 49): 580 crashes from 2016-08-02.
- release (version 48): 1474 crashes from 2016-07-25.
- esr (version 45): 392 crashes from 2016-05-02.
Crash volume on the last weeks (Week N is from 08-22 to 08-28):
W. N-1 W. N-2 W. N-3
- nightly 32 20 7
- aurora 81 68 23
- beta 223 194 52
- release 552 398 204
- esr 36 31 36
Affected platform: Windows
Crash rank on the last 7 days:
Browser Content Plugin
- nightly #49 #86
- aurora #131 #13
- beta #87 #58
- release #36 #330
- esr #233
status-firefox51:
--- → affected
Comment 65•9 years ago
|
||
In my first view, maybe we still failed at the creation of new compositor(which is triggered by our device removed detection flow).
https://hg.mozilla.org/mozilla-central/annotate/7963ebdd52b93f96b812eff2eab8d94097147b9c/gfx/layers/ipc/CompositorBridgeParent.cpp#l2293
Then, hit the early return at:
https://hg.mozilla.org/mozilla-central/annotate/7963ebdd52b93f96b812eff2eab8d94097147b9c/gfx/layers/ipc/CompositorBridgeParent.cpp#l2305
So, there is no new SyncObjectD3D11 updated from
https://hg.mozilla.org/mozilla-central/annotate/7963ebdd52b93f96b812eff2eab8d94097147b9c/gfx/layers/ipc/CompositorBridgeParent.cpp#l2311
Finally, we still use the "probably invalid" syncObj during painting.
I will try to add some criticalNote to show this.
Reporter | ||
Updated•8 years ago
|
Flags: needinfo?(hshih)
Reporter | ||
Comment 66•8 years ago
|
||
Jerry, did you have a chance to come up with a patch you mention in comment 65?
Flags: needinfo?(hshih)
Reporter | ||
Comment 67•8 years ago
|
||
This crash has no critical errors until the one that takes us down: https://crash-stats.mozilla.com/report/index/35996059-ede0-4199-9d0a-af5742160911
Reporter | ||
Comment 68•8 years ago
|
||
Some critical error correlations:
(95.27% in signature vs 00.35% overall) gfx_error 'Failed to create a D3D11 content device: '
(75.27% in signature vs 06.38% overall) gfx_error 'Failed 2 buffer db='
(36.88% in signature vs 00.60% overall) gfx_error 'D3D11 swap chain preset failed '
(21.29% in signature vs 00.70% overall) gfx_error 'Invalid draw target type specified: '
(16.34% in signature vs 00.12% overall) gfx_error '[D3D11] TextureSourceD3D11:GetShaderResourceView CreateSRV failure '
Comment 69•8 years ago
|
||
Yes, I'm trying to write a patch for comment 65.
But I have no idea about the crash in comment 67. Content process will do nothing when it detects driver-remove. The UpdateRenderMode() (which will re-create d3d device) always triggered by CompositorBridgeChild::RecvCompositorUpdated(). So, maybe content side should skip all painting task between the timing of knowing driver-remove and receiving the RecvCompositorUpdated() call.
Updated•8 years ago
|
Assignee: bas → hshih
Status: NEW → ASSIGNED
Flags: needinfo?(hshih)
Comment 70•8 years ago
|
||
Attachment #8794479 -
Flags: review?(dvander)
Comment 71•8 years ago
|
||
Content process will do nothing when it detects driver-remove. The UpdateRenderMode() (which will re-create d3d device) always triggered by CompositorBridgeChild::RecvCompositorUpdated(). Before RecvCompositorUpdated() message, the content side still use that bad-status device for rendering. This patch try to skip the painging and the SyncObject waiting during driver-reset.
Attachment #8794480 -
Flags: review?(dvander)
Comment 72•8 years ago
|
||
Since we already have the device-remove checking in ClientLayerManager, this checking is redundant.
Attachment #8794481 -
Flags: review?(dvander)
Comment on attachment 8794479 [details] [diff] [review]
show some gfxCritical logs for compositor creation failed during device-reset. v1.
Review of attachment 8794479 [details] [diff] [review]:
-----------------------------------------------------------------
::: gfx/layers/ipc/CompositorBridgeParent.cpp
@@ +2379,5 @@
> return Nothing();
> }
>
> + if (aBackendHints.IsEmpty()) {
> + gfxCriticalNote << "No BackendHint for compositor creation.";
This should never happen, we should always have at least a basic compositor.
@@ +2387,3 @@
> RefPtr<Compositor> compositor = NewCompositor(aBackendHints);
> if (!compositor) {
> + gfxCriticalNote << "Try to create a new compositor but failed.";
Same here. The only place this should truly be able to fail is OS X, since nsChildView::InitCompositor can technically return false.
::: gfx/thebes/DeviceManagerDx.cpp
@@ +124,5 @@
>
> CreateCompositorDevice(d3d11);
>
> if (!d3d11.IsEnabled()) {
> + gfxCriticalNote << "Can't create d3d11 device in CreateCompositorDevices()";
Are these notes needed? The information should already be available in about:support (and crash reports), since every failure should be propagated through gfxConfig/telemetry.
Comment on attachment 8794480 [details] [diff] [review]
skip content painting when the device is in device-reset/remove status. v1
Review of attachment 8794480 [details] [diff] [review]:
-----------------------------------------------------------------
This seems reasonable, but I'm curious: do we actually have a case where the content process gets a device reset, and the compositor does not? Or is this patch more about not trying to use a failing SyncObject?
Attachment #8794480 -
Flags: review?(dvander) → review+
Attachment #8794481 -
Flags: review?(dvander) → review+
Comment 75•8 years ago
|
||
(In reply to David Anderson [:dvander] from comment #74)
> Comment on attachment 8794480 [details] [diff] [review]
> skip content painting when the device is in device-reset/remove status. v1
>
> Review of attachment 8794480 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> This seems reasonable, but I'm curious: do we actually have a case where the
> content process gets a device reset, and the compositor does not? Or is this
> patch more about not trying to use a failing SyncObject?
I think both content and chrome could get the device reset status.
The problem is that the content side can't recover the device-reset status by itself. It will still use that bad-status device for rendering and synchronization until the CompositorBridgeChild::RecvCompositorUpdated() message received. This patch try to skip the d2d/d3d calls before the recovering in RecvCompositorUpdated().
Flags: needinfo?(dvander)
Updated•8 years ago
|
Attachment #8794479 -
Flags: review?(dvander)
Comment 76•8 years ago
|
||
Please land the
attachment 8794480 [details] [diff] [review] and
attachment 8794481 [details] [diff] [review]
to m-c.
Keywords: checkin-needed
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #75)
> (In reply to David Anderson [:dvander] from comment #74)
> > Comment on attachment 8794480 [details] [diff] [review]
> > skip content painting when the device is in device-reset/remove status. v1
> >
> > Review of attachment 8794480 [details] [diff] [review]:
> > -----------------------------------------------------------------
> >
> > This seems reasonable, but I'm curious: do we actually have a case where the
> > content process gets a device reset, and the compositor does not? Or is this
> > patch more about not trying to use a failing SyncObject?
>
> I think both content and chrome could get the device reset status.
>
> The problem is that the content side can't recover the device-reset status
> by itself. It will still use that bad-status device for rendering and
> synchronization until the CompositorBridgeChild::RecvCompositorUpdated()
> message received. This patch try to skip the d2d/d3d calls before the
> recovering in RecvCompositorUpdated().
Okay, thanks for explaining.
Flags: needinfo?(dvander)
Comment 78•8 years ago
|
||
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e855469d6acf
skip content painting when the device is in device-reset/remove status. r=dvander
https://hg.mozilla.org/integration/mozilla-inbound/rev/4247ce021880
remove the painting aborting for ClientPaintedLayer in device-remove/reset status. r=dvander
Keywords: checkin-needed
Comment 79•8 years ago
|
||
bugherder |
Assignee | ||
Comment 80•8 years ago
|
||
Comment 81•8 years ago
|
||
Crash volume for signature 'mozilla::layers::SyncObjectD3D11::FinalizeFrame':
- nightly (version 52): 61 crashes from 2016-09-19.
- aurora (version 51): 65 crashes from 2016-09-19.
- beta (version 50): 332 crashes from 2016-09-20.
- release (version 49): 1523 crashes from 2016-09-05.
- esr (version 45): 562 crashes from 2016-06-01.
Crash volume on the last weeks (Week N is from 10-03 to 10-09):
W. N-1 W. N-2
- nightly 42 19
- aurora 56 9
- beta 290 42
- release 1211 312
- esr 48 44
Affected platform: Windows
Crash rank on the last 7 days:
Browser Content Plugin
- nightly #39 #47
- aurora #35 #44
- beta #65 #36
- release #36 #131
- esr #220
status-firefox52:
--- → affected
Comment 82•8 years ago
|
||
:kechen is trying to investigate the crash problem in comment 80.
Assignee | ||
Comment 84•8 years ago
|
||
There are two main cases in this crash :
The first one is like [1], we tried to open the shared resource with the handle and it failed without device-removed. I think there are two possible reasons:
1. The shared resource still exists but the handle in content process is not updated.
2. We accidentally released the shared resource in somewhere, which cause the handle invalid.
According to the crash report[1], device-removed doesn't happened and the crash point is far from starting time; therefore, I think this is caused by the reason 2. I suggest to force reset driver at this point like what we do in chrome process[2].
The second one is like[3], we got a timeout when attempting to acquire the mutex of shared resource and this crash always happened after device-removed. This also appears in other signatures [4][5].
I found that there are some early return cases in OnRenderingDeviceReset[6] so they do not update the SyncObject, should we reset the SyncObject here ?
And if we are allowed to reuse the shared resource, this code[7] might cause problem, it leaves the function in the mutext section. In this case, move this part of code out of the section might solve the problem.
I am now doing an experiment about this.
Jerry, do you think it's proper to let content process reset the driver when we encounter this crash ? and do you have any thought about the second case ?
[1] https://crash-stats.mozilla.com/report/index/46aec604-fad8-4ab7-b282-fa7282161017#tab-metadata
[2] https://dxr.mozilla.org/mozilla-central/rev/944cb0fd05526894fcd90fbe7d1e625ee53cd73d/gfx/layers/d3d11/TextureD3D11.cpp#781
[3] https://crash-stats.mozilla.com/report/index/bb0fef1e-155a-4ab5-a833-b45772161026#tab-metadata
[4] https://crash-stats.mozilla.com/report/index/2d480ff9-6d6a-4174-ac2a-400482161028#tab-metadata
[5] https://crash-stats.mozilla.com/report/index/a1104467-20da-4782-8efd-25d4b2161027#tab-metadata
[6] https://dxr.mozilla.org/mozilla-central/rev/944cb0fd05526894fcd90fbe7d1e625ee53cd73d/widget/nsBaseWidget.cpp#335
[7] https://dxr.mozilla.org/mozilla-central/rev/944cb0fd05526894fcd90fbe7d1e625ee53cd73d/gfx/layers/d3d11/TextureD3D11.cpp#1237
Assignee | ||
Comment 85•8 years ago
|
||
Hello Jerry, can you take a look at comment 84 ?
Flags: needinfo?(hshih)
Updated•8 years ago
|
Assignee: hshih → kechen
Assignee | ||
Comment 86•8 years ago
|
||
Hello David,
Can you help me to review this patch ? This patch use RAII to avoid calling AcquireSync without ReleaseSync.
It mainly fixes the early return case without ReleaseSync in [1]. Hoping this can fix some timeout crashes for AcquireSync in this signature.
The following link[2] is the try of this patch, it looks good since the patch only effects Windows platform.
[1] https://dxr.mozilla.org/mozilla-central/rev/8e8b146fcb8b268e3c09b646087c6b2ef9f0af6f/gfx/layers/d3d11/TextureD3D11.cpp#1238
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=00abdae10271&selectedJob=30434764
Attachment #8808079 -
Flags: review?(dvander)
Attachment #8808079 -
Flags: review?(dvander) → review+
Assignee | ||
Updated•8 years ago
|
Keywords: checkin-needed
Comment 87•8 years ago
|
||
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e9db19aed6c0
Use AutoTextureLock to manage the shared resource's mutex. r=dvander
Keywords: checkin-needed
Updated•8 years ago
|
Flags: needinfo?(hshih)
Comment 88•8 years ago
|
||
bugherder |
Assignee | ||
Comment 89•8 years ago
|
||
The crash rate doesn't decrease, I am now working on another patch to fix it.
Assignee | ||
Comment 90•8 years ago
|
||
Hello David,
There are some crashes like[1] which attempt to access a null device in[2].
I think the reason of this problem is that content process failed to create D3D11 device during device reset process and leave the device null while chrome process successfully create it. Also, device reset check will return normal in this case[3].
As the result, the code will pass the device check in [4] and run into the crash.
My proposal to solve this problem is to set the device reset reason and ask chrome process to do the device reset with D3D11_COMPOSITING disabled again so that we can recover from this situation.
Can you give me some advices about the patch ? Thank you.
[1] https://crash-stats.mozilla.com/report/index/9f76066c-beb5-4e2d-af1d-5c4802161216
[2] https://hg.mozilla.org/releases/mozilla-release/annotate/cc272f7d48d3/gfx/layers/d3d11/TextureD3D11.cpp#l1124
[3] https://dxr.mozilla.org/mozilla-central/rev/6dbc6e9f62a705d5f523cc750811bd01c8275ec6/gfx/thebes/DeviceManagerDx.cpp#689
[4] https://dxr.mozilla.org/mozilla-central/rev/6dbc6e9f62a705d5f523cc750811bd01c8275ec6/gfx/layers/client/ClientLayerManager.cpp#655
Attachment #8819217 -
Flags: feedback?(dvander)
Comment 91•8 years ago
|
||
Crash volume for signature 'mozilla::layers::SyncObjectD3D11::FinalizeFrame':
- nightly (version 53): 170 crashes from 2016-11-14.
- aurora (version 52): 850 crashes from 2016-11-14.
- beta (version 51): 1123 crashes from 2016-11-14.
- release (version 50): 18984 crashes from 2016-11-01.
- esr (version 45): 1152 crashes from 2016-07-06.
Crash volume on the last weeks (Week N is from 01-02 to 01-08):
W. N-1 W. N-2 W. N-3 W. N-4 W. N-5 W. N-6 W. N-7
- nightly 4 5 10 12 40 31 65
- aurora 119 136 169 177 146 76 0
- beta 169 192 193 218 124 140 45
- release 2631 2706 2955 3105 3028 2709 1007
- esr 31 50 67 57 79 47 62
Affected platform: Windows
Crash rank on the last 7 days:
Browser Content Plugin
- nightly #391 #37
- aurora #23 #23
- beta #395 #26
- release #35 #8
- esr #239
status-firefox53:
--- → affected
Assignee | ||
Comment 92•8 years ago
|
||
Most of crashes move to Bug 1322741 due to changes in Bug 1319557.
After uplifting the fix in Bug 1322741, the crash volume should decrease.
The crashes in firefox 53 might be caused by other reasons, I will keep looking into it.
See Also: → 1322741
Reporter | ||
Comment 93•8 years ago
|
||
Perhaps we can land a version of the WIP patch that has gfxCriticalError instead of SendDeviceReset in those places? Perhaps the ForceDeviceReset as well - in other words, where the WIP patch detects some condition and does something because of it, we just detect it and put a unique gfxCriticalError so that we can see if it triggers in these crashes?
Reporter | ||
Updated•8 years ago
|
Flags: needinfo?(kechen)
Assignee | ||
Comment 94•8 years ago
|
||
Hello Milan,
Thank you for the advice, this patch can help us to confirm the root cause of null accessing crashes.
Could you help me to review it ? Thank you.
Flags: needinfo?(kechen)
Attachment #8828195 -
Flags: review?(milan)
Comment on attachment 8819217 [details] [diff] [review]
[WIP] Trigger device reset when failing to create D3D11 device in content process
Is this obsoleted by bug 1322741? If we're accessing a null D3D11 device after a reset it's not clear how this would fix it. We'd still be accessing a stale device.
Attachment #8819217 -
Flags: feedback?(dvander)
Assignee | ||
Comment 96•8 years ago
|
||
Hello David,
I think there are two different problem in this signature :
1. The SharedResourceHandler is not synchronized with D3D11 device (which is solved by bug 1322741).
2. The device from DeviceManagerDx is nullptr like[1].
The second problem, which I think is not covered by first problem, is the one I want to solve with this patch.
My theory is that when handling device reset, we reset the device in [2] but failed to create new device in the following code which cause this crash.
Therefore I rise the device reset flag after the creation failure to make the code early-return in [3] without entering the crash segment in the first place to avoid using the stale device.
And the reason that I send a force device reset message to compositor process is to make sure that it knows content side need a new SharedResourceHandler because I don't know if there is any chance that content process have problem in creating D3D11 device while everything works fine in compositor process.
And I think it might be still profitable to land attachment 8828195 [details] [diff] [review], in one hand, it can verify if the crash is really caused by the failure creation of D3D11 device when handling device reset.
Do you have any advice on this solution ?
[1] https://crash-stats.mozilla.com/report/index/9f76066c-beb5-4e2d-af1d-5c4802161216
[2] https://dxr.mozilla.org/mozilla-central/rev/b3774461acc6bee2216c5f57e167f9e5795fb09d/gfx/thebes/gfxWindowsPlatform.cpp#445
[3] https://dxr.mozilla.org/mozilla-central/rev/b3774461acc6bee2216c5f57e167f9e5795fb09d/gfx/layers/client/ClientLayerManager.cpp#661
Flags: needinfo?(dvander)
Reporter | ||
Comment 97•8 years ago
|
||
Comment on attachment 8828195 [details] [diff] [review]
Add gfxCriticalError for failing to create D3D11 device in content side
Review of attachment 8828195 [details] [diff] [review]:
-----------------------------------------------------------------
This looks good - should we also have that change in CompositorBridgeChild::RecvCompositorUpdated that you had in the other patch:
// If we still get device reset here, something must wrong when creating
// d3d11 devices.
if (gfxPlatform::GetPlatform()->DidRenderingDeviceReset()) {
gfxCriticalError() << "Unexpected reset device processing the compositor update";
}
or something like that?
Attachment #8828195 -
Flags: review?(milan) → review+
Assignee | ||
Comment 98•8 years ago
|
||
Carry r+ from comment 97.
Attachment #8828195 -
Attachment is obsolete: true
Attachment #8828655 -
Flags: review+
Assignee | ||
Updated•8 years ago
|
Keywords: checkin-needed
Comment 99•8 years ago
|
||
Pushed by ihsiao@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/19507933747f
Add gfxCriticalError for failing to create D3D11 device in content side. r=milan
Keywords: checkin-needed
Comment 100•8 years ago
|
||
bugherder |
Assignee | ||
Comment 101•8 years ago
|
||
Approval Request Comment
[Feature/Bug causing the regression]:
Not a regression, this patch is for crash tracing.
[User impact if declined]:
This patch is for finding the root cause of this crash. Without uplifting this, it will be difficult for the crash investigating.
[Is this code covered by automated tests?]:
No.
[Has the fix been verified in Nightly?]:
Yes.
[Needs manual test from QE? If yes, steps to reproduce]:
No.
[List of other uplifts needed for the feature/fix]:
No.
[Is the change risky?]:
Low.
[Why is the change risky/not risky?]:
Only add some gfx errors to somewhere that might cause the crash.
[String changes made/needed]:
No.
Attachment #8829310 -
Flags: approval-mozilla-aurora?
Comment 102•8 years ago
|
||
Comment on attachment 8829310 [details] [diff] [review]
[Uplift] Add gfxCriticalError for failing to create D3D11 device in content side
add diagnostics for debug gfx crash, beta52+
Attachment #8829310 -
Flags: approval-mozilla-aurora? → approval-mozilla-beta+
Assignee | ||
Updated•8 years ago
|
Summary: non device reset startup forced crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame() - nightly and aurora only → non device reset startup forced crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame()
(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #96)
> Hello David,
>
> I think there are two different problem in this signature :
> 1. The SharedResourceHandler is not synchronized with D3D11 device (which
> is solved by bug 1322741).
> 2. The device from DeviceManagerDx is nullptr like[1].
>
> The second problem, which I think is not covered by first problem, is the
> one I want to solve with this patch.
>
> My theory is that when handling device reset, we reset the device in [2] but
> failed to create new device in the following code which cause this crash.
> Therefore I rise the device reset flag after the creation failure to make
> the code early-return in [3] without entering the crash segment in the first
> place to avoid using the stale device.
If we handle a device reset, we throw away layers and acquire a new TextureFactoryIdentifier for everything affected. This should eliminate any possibility of FinalizeFrame not being synchronized to the device configuration. For content tabs, this happens via TabChild::CompositorUpdated[1]. For top-level widgets, it happens in nsWindow::OnRenderingDeviceReset [2]. Both of these happen *after* we acquire (or fail to acquire) a new ID3D11Device.
So: if we paint after a device reset, we should have discarded layers and any device information on that ClientLayerManager. If we're not doing that it's a bug.
And in fact, that's what bug 1322741 papers over. Because of how we send messages to content, we can paint a tab without it having acknowledged a device reset. (There's a guard to prevent this from happening [3], but it doesn't update TextureFactoryIdentifier. I also proposed making the messages atomic in bug 1322741 comment #9.)
Another loophole is that, if we fail to acquire a content device, we still have a SyncObjectD3D11. But FinalizeFrame should fail a check [4] - we won't have any ID3D11Texture2D objects registered, so there's no reason to use the ID3D11Device. But defin
Is there a situation you're thinking of that isn't covered by these cases?
[1] http://searchfox.org/mozilla-central/rev/ed04f7a44925dace34f2d2e15ef9c0f2809d70b0/gfx/layers/ipc/CompositorBridgeChild.cpp#366
[2] http://searchfox.org/mozilla-central/rev/ed04f7a44925dace34f2d2e15ef9c0f2809d70b0/widget/nsBaseWidget.cpp#368
[3] http://searchfox.org/mozilla-central/rev/ed04f7a44925dace34f2d2e15ef9c0f2809d70b0/gfx/layers/client/ClientLayerManager.cpp#220
[4] http://searchfox.org/mozilla-central/rev/ed04f7a44925dace34f2d2e15ef9c0f2809d70b0/gfx/layers/d3d11/TextureD3D11.cpp#1256
Sorry, last sentence got cut off. I'm not sure what I meant to say :)
Flags: needinfo?(dvander)
Comment 105•8 years ago
|
||
status-firefox54:
--- → affected
Updated•8 years ago
|
Attachment #8829310 -
Flags: checkin+
Assignee | ||
Comment 106•8 years ago
|
||
Comment on attachment 8819217 [details] [diff] [review]
[WIP] Trigger device reset when failing to create D3D11 device in content process
According to comment 103, I find the problem this patch want to fix is covered by Bug 1322741, so I obsolete this patch.
Attachment #8819217 -
Attachment is obsolete: true
Reporter | ||
Comment 107•8 years ago
|
||
A crash report with the "Unexpected reset device processing..." message: https://crash-stats.mozilla.com/report/index/74e44e19-1947-4881-8ade-2f7d52170131#tab-metadata
A crash report with the "Failed to create D3D11 device..." message: https://crash-stats.mozilla.com/report/index/0c48ce0d-dafa-4e62-8961-892762170129#tab-metadata
Reporter | ||
Comment 108•8 years ago
|
||
In the meantime, there is a lot of these crashes - we don't want to just check for the null content device in SyncObjectD3D11::FinalizeFrame, before calling OpenSharedResource? It would then combine this with the other path, where the OpenSharedResource fails, and we can act accordingly depending on whether it's a device reset or not.
Flags: needinfo?(kechen)
Assignee | ||
Comment 109•8 years ago
|
||
I think the null content device issue is covered by Bug 1322741.
After the patch uplift in Bug 1322741, we will check the null device before calling FinalizeFrame and prevent violent reading to null pointer[1]. It will eliminate most of the crashes this signature since most of them are null pointer accessing crashes.
Currently the patch is in firefox 52, should we also uplift it to firefox 51?
Since we don't crash due to the null pointer accessing now, the following question is how to recover from the null content device situation.
According to current device reset structure, if chrome process find out content device is invalid, it will trigger device reset process and renew the syncobject and update the device in content side, in this case, I think we don't have to do anything to current code.
[1] https://dxr.mozilla.org/mozilla-central/rev/1d025ac534a6333a8170a59a95a8a3673d4028ee/gfx/layers/d3d11/TextureD3D11.cpp#1247
Flags: needinfo?(kechen)
Reporter | ||
Comment 110•8 years ago
|
||
Bug 1322741 fix is in 52 beta 2. Here is a crash in 52 beta 2:
https://crash-stats.mozilla.com/report/index/aaa9e577-4ddb-40b3-84fc-5f5c82170202
We fail to open the shared resource, we don't have device reset flag, so we crash.
Also, same crash signature, with a sync lock timeout in 54, after the patch to bug 1322741 landed on central (in 53):
https://crash-stats.mozilla.com/report/index/df10f88d-9dd2-4628-a302-2e5b22170131
Doesn't look like bug 1322741 quite took care of it all, and we still crash.
Flags: needinfo?(kechen)
Assignee | ||
Comment 111•8 years ago
|
||
With the fix in Bug 1322741, we can ensure the mSyncHandle is updated.
However, according to GraphicsCriticalError in the crashes[1], the failure of calling OpenSharedResource is caused by invalid argument and the only parameter we pass is mSyncHandle.
Which might indicates the mSyncHandle is no longer available.
I am still trying to find out the root cause right now while an easy walkaround might just asks compositor process to update a new mSyncHandle.
David do you have any advice on this ?
And about the sync lock timeout crash, I don't have any thought about it. We've raised the time limitation from 10sec to 20sec to avoid the timeout but raise the number again might not be a good idea. I will keep looking into it.
[1] https://crash-stats.mozilla.com/report/index/aaa9e577-4ddb-40b3-84fc-5f5c82170202
Flags: needinfo?(kechen) → needinfo?(dvander)
Assignee | ||
Comment 112•8 years ago
|
||
Here is my assumption about the current OpenSharedResource crashes after Bug 1322741:
There is a latency between the chrome process sending the handle and the content process receiving the handle.
If another device rest happened during the latency, the content process will get an expired handle and use the latest device to open it which will cause the failure of the OpenSharedHandle function call while the device won't report device reset[1].
The Same case might happen when opening a new tab which also contains the SyncObject update.
While the patch in Bug 1316788 might reduce the crash probability efficiently, I am now working on a patch that sends a synchronize ipc to chrome process to update the handle before we crash the program in [1].
[1] https://dxr.mozilla.org/mozilla-beta/source/gfx/layers/d3d11/TextureD3D11.cpp#1237
Flags: needinfo?(dvander)
(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #112)
> Created attachment 8839374 [details]
> Invalid handle due to the ipc latency.
>
> Here is my assumption about the current OpenSharedResource crashes after Bug
> 1322741:
> There is a latency between the chrome process sending the handle and the
> content process receiving the handle.
> If another device rest happened during the latency, the content process will
> get an expired handle and use the latest device to open it which will cause
> the failure of the OpenSharedHandle function call while the device won't
> report device reset[1].
> The Same case might happen when opening a new tab which also contains the
> SyncObject update.
>
> While the patch in Bug 1316788 might reduce the crash probability
> efficiently, I am now working on a patch that sends a synchronize ipc to
> chrome process to update the handle before we crash the program in [1].
>
> [1]
> https://dxr.mozilla.org/mozilla-beta/source/gfx/layers/d3d11/TextureD3D11.
> cpp#1237
Hrm... at this point, why not just make the whole thing fallible? If OpenSharedResource fails, stop painting until we get a new TextureFactoryIdentifier. Requesting one synchronously might have the same problem otherwise (the compositor could have closed it already).
Assignee | ||
Comment 114•8 years ago
|
||
Hello David, could you give me some feedbacks for this patch ?
In this patch, I enable the device reset flag when failing to open the shared handle to make it fallible.
However, there is a risk that if there are other reasons besides comment 112 cause this crash and there were no updating ipc after this, the content process will stay in device reset status forever.
Instead of sending a sync ipc in comment 112, what if I send an async ipc to compositor thread, make it do the device reset again and update the handle ?
Attachment #8841463 -
Flags: feedback?(dvander)
Why do we need to force a fake device reset, vs. just ignore the bad handle and wait for a new one?
Assignee | ||
Comment 116•8 years ago
|
||
(In reply to David Anderson [:dvander] from comment #115)
> Why do we need to force a fake device reset, vs. just ignore the bad handle
> and wait for a new one?
The reason of forcing a fake device reset is to skip the paintings before we get the new handle. So that we can resume the paintings when we get the new one.
Flags: needinfo?(dvander)
Why do we need a device reset to do that, versus just returning false from the function and not painting anything for that frame?
Flags: needinfo?(dvander)
Reporter | ||
Comment 118•8 years ago
|
||
Clearing this bug may take care of bug 1306619 as well.
Comment 119•8 years ago
|
||
This signature spiked in Nightly starting with 20170309030216. We now have 3727 crashes (1096 installations) in the last 7 days.
Comment 120•8 years ago
|
||
Adding a note that this is the #2 GPU Process crash @ 41.13% in Nightly 55 (#3 @ 10.24% overall) with 3786 of 4110 reports coming from the GPU Process (92.1%). I don't see any reports showing up in Aurora or Beta in the last week but we do have 18 reports in Release.
Keywords: topcrash-win
Assignee | ||
Comment 121•8 years ago
|
||
(In reply to David Anderson [:dvander] from comment #117)
> Why do we need a device reset to do that, versus just returning false from
> the function and not painting anything for that frame?
Hello David, sorry for the late reply.
The reason I set the device rest is to avoid generating transactions[1] until available handle arrives.
[1] https://dxr.mozilla.org/mozilla-central/rev/6d38ad302429c98115c354d643e81987ecec5d3c/gfx/layers/client/ClientLayerManager.cpp#352
Flags: needinfo?(dvander)
Assignee | ||
Comment 122•8 years ago
|
||
The spike might be related to Bug 1345814.
Assignee | ||
Comment 123•8 years ago
|
||
I think we might need to make the AcuireSync RAII, or any return between [1] might cause the AcquireSync without Release.
Since the mutex of the texture is generated from ImageBridgeChild::GetSingleton()[2] like mSyncObject does, it might also cause the timeout of mSyncObject's AcquireSync and run into the crash.
[1] https://dxr.mozilla.org/mozilla-central/rev/6d38ad302429c98115c354d643e81987ecec5d3c/dom/media/platforms/wmf/DXVA2Manager.cpp#902-946
[2] https://dxr.mozilla.org/mozilla-central/rev/6d38ad302429c98115c354d643e81987ecec5d3c/dom/media/platforms/wmf/DXVA2Manager.cpp#900
Attachment #8847534 -
Flags: feedback?(bas)
Comment on attachment 8841463 [details] [diff] [review]
Make OpenSharedResource fallible.
Review of attachment 8841463 [details] [diff] [review]:
-----------------------------------------------------------------
If you think injecting a fake reset is quite reliable, I'll r+ that. Otherwise, it might make more sense to just set an invalid bit on the SyncObject and have ClientLayerManager check that before it paints.
Attachment #8841463 -
Flags: feedback?(dvander) → feedback+
(In reply to Anthony Hughes (:ashughes) [GFX][QA][Mentor] from comment #120)
> Adding a note that this is the #2 GPU Process crash @ 41.13% in Nightly 55
> (#3 @ 10.24% overall) with 3786 of 4110 reports coming from the GPU Process
> (92.1%). I don't see any reports showing up in Aurora or Beta in the last
> week but we do have 18 reports in Release.
It looks like this is a separate bug, probably, since it occurs in media code in the GPU process (this bug occurs in layers code in the content process).
Flags: needinfo?(dvander)
Comment 126•8 years ago
|
||
Comment on attachment 8847534 [details] [diff] [review]
Make AcquireSync RAII.
Review of attachment 8847534 [details] [diff] [review]:
-----------------------------------------------------------------
You're absolutely right, good catch, looks good to me.
Attachment #8847534 -
Flags: feedback?(bas) → review+
Comment 127•8 years ago
|
||
The actual cause of the new crashes though is likely bug 1345814.
Assignee | ||
Comment 128•8 years ago
|
||
Carry r+ from comment 126.
The try result looks fine. (Only run the tests on Windows platform since the change is only for Windows and I think we don't run tests on XP now.)
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=0a5baba64a082d67db01c2401dba168fcaff3f75&selectedJob=84520438
Attachment #8847534 -
Attachment is obsolete: true
Attachment #8848404 -
Flags: review+
Assignee | ||
Updated•8 years ago
|
Keywords: checkin-needed
Comment 129•8 years ago
|
||
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9747d8849faa
Make AcquireSync RAII. r=bas
Keywords: checkin-needed
![]() |
||
Comment 130•8 years ago
|
||
bugherder |
Comment 131•8 years ago
|
||
Hi Kevin,
How do you think about this modification?
Attachment #8891165 -
Flags: feedback?(kechen)
Comment 132•8 years ago
|
||
Comment on attachment 8891165 [details] [diff] [review]
bug1160157_lock_all_or_lock_none.patch
Thanks Kevin, it seems this modification is not related to this issue.
Attachment #8891165 -
Attachment is obsolete: true
Attachment #8891165 -
Flags: feedback?(kechen)
Assignee | ||
Comment 133•7 years ago
|
||
This call stack is removed by Bug 1357299, and the crashes are resolved in Bug 1409176.
Set wonfix to this bug for now.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Comment 134•7 years ago
|
||
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in
before you can comment on or make changes to this bug.
Description
•