<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Milan Sreckovic [:milan] (needinfo for best results)

Updated

•

10 years ago

Attachment #8599996 - Flags: feedback?(bas) → feedback+

Reporter

Comment 14

•

10 years ago

The reviewed trivial patch: https://hg.mozilla.org/integration/mozilla-inbound/rev/8fe081b9ff71

Keywords: leave-open

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 15

•

10 years ago

Comment on attachment 8600030 [details] [diff] [review] Part 1. Are uninitialized statics safe? This is the patch landed in comment 14.

Attachment #8600030 - Flags: checkin+

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Updated

•

10 years ago

Attachment #8600030 - Attachment description: Are uninitialized statics safe? → Part 1. Are uninitialized statics safe?

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 16

•

10 years ago

Attached patch Part 2. Watch for and report missing device, or unexpected texture situations. r=bschouten — Details — Splinter Review

This contains the changes from the "conversation piece" patch, and some additional checks and reports.

Attachment #8600382 - Flags: review?(bas)

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 17

•

10 years ago

Attached patch Part 3. More explicit with the AcquireSync/ReleaseSync calls, locking/unlocking. r=bschouten — Details — Splinter Review

Try to get more information, and hopefully crash earlier. The logic was mostly kept, but perhaps we want to MOZ_CRASH in some of the cases where we didn't before?

Attachment #8600383 - Flags: review?(bas)

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 18

•

10 years ago

Trying some of these patches on try, it turns out that we cannot get mutex during crash/ref/mochi tests (reproducible locally.) Expected? OK? Just in case it's useful, I added the stack when this happens. Note that today we treat these situations (no mutex coming back) as successful locks/unlocks. The two patches for the review changed that and try lit up :) The stack: Just the critical error "overhead", but note that the texture is not null. 0:13.34 [GFX1]: Unable to get mutex when locking D3D Texture 0x0D7BCA90 0:14.37 Assertion failure: false (An assert from the graphics logger), at c:\users\msreckovic\repos\mozilla-central\gfx\2d\Logging.h:492 0:14.37 #01: mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::Flush (c:\users\msreckovic\repos\mozilla-central\gfx\2d\logging.h:274) 0:14.37 #02: mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::~Log<1,mozilla::gfx::CriticalLogger> (c:\users\msreckovic\repos\mozilla-central\gfx\2d\logging.h:265) Relevant part here: 0:14.37 #03: mozilla::layers::LockD3DTexture<ID3D11Texture2D> (c:\users\msreckovic\repos\mozilla-central\gfx\layers\d3d11\textured3d11.cpp:129) 0:14.37 #04: mozilla::layers::TextureClientD3D11::Lock (c:\users\msreckovic\repos\mozilla-central\gfx\layers\d3d11\textured3d11.cpp:278) 0:14.37 #05: mozilla::layers::ContentClientRemoteBuffer::CreateBuffer (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\contentclient.cpp:346) 0:14.40 #06: mozilla::layers::RotatedContentBuffer::BeginPaint (c:\users\msreckovic\repos\mozilla-central\gfx\layers\rotatedbuffer.cpp:672) 0:14.40 #07: mozilla::layers::ContentClientRemoteBuffer::BeginPaintBuffer (c:\users\msreckovic\repos\mozilla-central\obj-i686-pc-mingw32\dist\include\mozilla\layers\contentclient.h:220) 0:14.40 #08: mozilla::layers::ClientPaintedLayer::PaintThebes (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientpaintedlayer.cpp:55) 0:14.40 #09: mozilla::layers::ClientPaintedLayer::RenderLayerWithReadback (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientpaintedlayer.cpp:1 32 0:14.40 #10: mozilla::layers::ClientContainerLayer::RenderLayer (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientcontainerlayer.h:71) 0:14.40 #11: mozilla::layers::ClientLayer::RenderLayerWithReadback (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientlayermanager.h:379) 0:14.40 #12: mozilla::layers::ClientContainerLayer::RenderLayer (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientcontainerlayer.h:71) 0:14.40 #13: mozilla::layers::ClientLayerManager::EndTransactionInternal (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientlayermanager.cpp:273) 0:14.40 #14: mozilla::layers::ClientLayerManager::EndTransaction (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\clientlayermanager.cpp:316)

Flags: needinfo?(jmuizelaar)

Flags: needinfo?(bas)

Comment 19

•

10 years ago

Comment on attachment 8600382 [details] [diff] [review] Part 2. Watch for and report missing device, or unexpected texture situations. r=bschouten Review of attachment 8600382 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/d3d11/TextureD3D11.cpp @@ +236,5 @@ > MOZ_ASSERT(aSyncObject->GetSyncType() == SyncObject::SyncType::D3D11); > > SyncObjectD3D11* sync = static_cast<SyncObjectD3D11*>(aSyncObject); > > + MOZ_ASSERT(mTexture || mTexture10); There's shouldn't be a real point to the checks below, there's no way this can change within a session. If somehow both are true we should add a MOZ_ABORT_IF_FALSE below because this would mean our session is -completely- FUBAR. @@ +1067,5 @@ > ID3D10Device* device = gfxWindowsPlatform::GetPlatform()->GetD3D10Device(); > + if (!device) { > + gfxCriticalError() << "Missing D3D10 device(2) " << hexa(gfxWindowsPlatform::GetPlatform()->GetD2DDevice()); > + if (gfxWindowsPlatform::GetPlatform()->DidRenderingDeviceReset()) { > + mutex->ReleaseSync(0); I'm not certain we should not crash here, this would be really bad, and I'm not sure we can somehow survive if this occurs (I'm also still not clear on how it could occur). There's so much else which should blow up at this point. And I'm worried we'll just spread the pain out through all kinds of other places.

Comment 20

•

10 years ago

Comment on attachment 8600383 [details] [diff] [review] Part 3. More explicit with the AcquireSync/ReleaseSync calls, locking/unlocking. r=bschouten Review of attachment 8600383 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/d3d11/CompositorD3D11.cpp @@ +1074,5 @@ > mAttachments->mSyncTexture->QueryInterface((IDXGIKeyedMutex**)byRef(mutex)); > > MOZ_ASSERT(mutex); > HRESULT hr = mutex->AcquireSync(0, 10000); > + if (FAILED(hr)) { It's probably a good idea to gfxCriticalError if a WAIT_TIMEOUT occurs. We also probably shouldn't be ignoring WAIT_ABANDONED, so a better thing here might be something like: if (hr == WAIT_TIMEOUT) { gfxCriticalError() << etc. } else if (hr != S_OK) { MOZ_CRASH(); } ::: gfx/layers/d3d11/TextureD3D11.cpp @@ +201,5 @@ > MOZ_ASSERT(!mIsLocked); > MOZ_ASSERT(mTexture || mTexture10); > MOZ_ASSERT(mDrawTarget->refCount() == 1); > if (mTexture) { > + MOZ_ASSERT(LockD3DTexture(mTexture.get())); Eep! This will make LockD3DTexture not occur in release builds since MOZ_ASSERT will compile to null. DebugOnly<bool> didSucceed = LockD3DTexture(mTexture.get())); MOZ_ASSERT(didSucceed); Or something of the likes would do it.

Attachment #8600383 - Flags: review?(bas) → review-

Comment 21

•

10 years ago

Comment on attachment 8600030 [details] [diff] [review] Part 1. Are uninitialized statics safe? Review of attachment 8600030 [details] [diff] [review]: ----------------------------------------------------------------- Just for the record, I believe statics are guaranteed by the spec to be initialized to zero'ed memory by any sane compiler. ISO 9899:1999 section 6.7.8 paragraph 10 leaves a little bit of wiggle room for bools, but I believe MSVC does this just fine. Not a bad idea to make it explicit as far as I'm concerned though, but unlikely to be the cause of any practical issue.

https://hg.mozilla.org/mozilla-central/rev/8fe081b9ff71

Comment 22

•

10 years ago

(In reply to Milan Sreckovic [:milan] from comment #18) > Trying some of these patches on try, it turns out that we cannot get mutex > during crash/ref/mochi tests (reproducible locally.) Expected? OK? Just > in case it's useful, I added the stack when this happens. Note that today > we treat these situations (no mutex coming back) as successful > locks/unlocks. The two patches for the review changed that and try lit up :) > > The stack: > > Just the critical error "overhead", but note that the texture is not null. > > 0:13.34 [GFX1]: Unable to get mutex when locking D3D Texture 0x0D7BCA90 > 0:14.37 Assertion failure: false (An assert from the graphics logger), at > c:\users\msreckovic\repos\mozilla-central\gfx\2d\Logging.h:492 > 0:14.37 #01: mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::Flush > (c:\users\msreckovic\repos\mozilla-central\gfx\2d\logging.h:274) > 0:14.37 #02: > mozilla::gfx::Log<1,mozilla::gfx::CriticalLogger>::~Log<1,mozilla::gfx:: > CriticalLogger> > (c:\users\msreckovic\repos\mozilla-central\gfx\2d\logging.h:265) > > Relevant part here: > > 0:14.37 #03: mozilla::layers::LockD3DTexture<ID3D11Texture2D> > (c:\users\msreckovic\repos\mozilla-central\gfx\layers\d3d11\textured3d11.cpp: > 129) > 0:14.37 #04: mozilla::layers::TextureClientD3D11::Lock > (c:\users\msreckovic\repos\mozilla-central\gfx\layers\d3d11\textured3d11.cpp: > 278) > 0:14.37 #05: mozilla::layers::ContentClientRemoteBuffer::CreateBuffer > (c:\users\msreckovic\repos\mozilla-central\gfx\layers\client\contentclient. > cpp:346) > 0:14.40 #06: mozilla::layers::RotatedContentBuffer::BeginPaint > (c:\users\msreckovic\repos\mozilla-central\gfx\layers\rotatedbuffer.cpp:672) > 0:14.40 #07: mozilla::layers::ContentClientRemoteBuffer::BeginPaintBuffer > (c:\users\msreckovic\repos\mozilla-central\obj-i686-pc- > mingw32\dist\include\mozilla\layers\contentclient.h:220) > 0:14.40 #08: mozilla::layers::ClientPaintedLayer::PaintThebes > (c:\users\msreckovic\repos\mozilla- > central\gfx\layers\client\clientpaintedlayer.cpp:55) > 0:14.40 #09: mozilla::layers::ClientPaintedLayer::RenderLayerWithReadback > (c:\users\msreckovic\repos\mozilla- > central\gfx\layers\client\clientpaintedlayer.cpp:1 > 32 > 0:14.40 #10: mozilla::layers::ClientContainerLayer::RenderLayer > (c:\users\msreckovic\repos\mozilla- > central\gfx\layers\client\clientcontainerlayer.h:71) > 0:14.40 #11: mozilla::layers::ClientLayer::RenderLayerWithReadback > (c:\users\msreckovic\repos\mozilla- > central\gfx\layers\client\clientlayermanager.h:379) > 0:14.40 #12: mozilla::layers::ClientContainerLayer::RenderLayer > (c:\users\msreckovic\repos\mozilla- > central\gfx\layers\client\clientcontainerlayer.h:71) > 0:14.40 #13: mozilla::layers::ClientLayerManager::EndTransactionInternal > (c:\users\msreckovic\repos\mozilla- > central\gfx\layers\client\clientlayermanager.cpp:273) > 0:14.40 #14: mozilla::layers::ClientLayerManager::EndTransaction > (c:\users\msreckovic\repos\mozilla- > central\gfx\layers\client\clientlayermanager.cpp:316) Yes, expected, those textures are no longer actually created as KeyedMutex supporting textures since the 'One Texture to Rule Them All' solution. (Un)lockD3DTexture are essentially no-ops these days for our practical purposes. They're more or less left in in case there's TextureClients that do choose to use them, we could just remove the methods all together.

Flags: needinfo?(jmuizelaar)

Flags: needinfo?(bas)

Phil Ringnalda (:philor)

Comment 23

•

10 years ago

Comment 24

•

10 years ago

Comment on attachment 8600383 [details] [diff] [review] Part 3. More explicit with the AcquireSync/ReleaseSync calls, locking/unlocking. r=bschouten Review of attachment 8600383 [details] [diff] [review]: ----------------------------------------------------------------- MOZ_ASSERT is inside a #ifdef DEBUG, missed this.

Attachment #8600383 - Flags: review- → review+

Comment 25

•

10 years ago

Milan, Bas, does this need more work? Looks like maybe part 3 never landed. Do you want to request uplift to 39? This crash signature is still showing up for 39.0a2.

Flags: needinfo?(milan)

Flags: needinfo?(bas)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 26

•

10 years ago

I may have confused some of this with the work in bug 1143806 where there was some discussion of uplift to 39.

Reporter

Comment 27

•

10 years ago

No, you're right, there was some discussion, but I'm not sure the patches are correct, especially the mutex side, at least try doesn't like some of them. Need to get back to this.

Comment 28

•

10 years ago

Noting also, the crash rate so far looks high on this signature for 38.0b99 and 38.0. 38.0b99 14.69 % 2031 38.0 3.14 % 434

Comment 29

•

10 years ago

(In reply to Milan Sreckovic [:milan] from comment #4) > Different crashes, perhaps same underlying reason: I'm seeing two categories of crash with this signature, about evenly split. I get the feeling they're not related. There are MOZ_CRASHes due to the WAIT_TIMEOUTs. Almost all of them have code 0x8899000C (D2DERR_RECREATE_TARGET). Good mix of vendors, devices, and drivers. The other group have null derefs on |device|. This group is 86% AMD family 20 CPUs. Suspiciously, those are the CPUs from "the AMD bug". But the CPU could be a red herring. Those chips are integrated CPU+GPU, so we could also say that this group is mostly devices 0x9802 to 0x9807. Their driver versions are all over.

NVD

Comment 30

•

10 years ago

http://www.geforce.com/whats-new/articles/geforce-352-86-whql-driver-released "Windows Vista/Windows 7/Windows 8/Windows 8.1 Fixed Issues [Firefox] Firefox Nightly crashes when started on the NVIDIA GPU on an Optimus system. [1609030]" Release notes says this is fixed now.

Lorenso Gwine

Comment 31

•

10 years ago

My problem seems solved with new drivers

Robert Kaiser

Comment 32

•

10 years ago

(In reply to Lorenso Gwine from comment #31) > My problem seems solved with new drivers That sounds awesome - now we need to get more people to update to those new drivers...

Milan Sreckovic [:milan] (needinfo for best results)

Comment 33

•

10 years ago

Note that per comment 29, nvidia drivers are only going to solve a portion of these crashes.

Reporter

Comment 34

•

10 years ago

At least some of these are startup crashes; any correlation with the startup vs. not-startup and info from comment 29?

Flags: needinfo?(milan)

Summary: non device reset crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame() → non device reset startup crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame()

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 35

•

10 years ago

Let's pick at these little by little - sounds like there is some blocklisting that can happen, we can convert few more crashes into MOZ_CRASHes, check if there was a driver reset more consistently in the FinalizeFrame.

Assignee: nobody → bas

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 36

•

10 years ago

[Tracking Requested - why for this release]: Tracking all startup crashes of any significant numbers.

tracking-firefox39: --- → ?

tracking-firefox40: --- → ?

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 37

•

10 years ago

Adding a tracking flag for firefox40. Based on the discussion so far, it's unclear whether we have a stable fix. It may be getting too close for firefox39. Will let Liz/Sylvestre know about this one.

tracking-firefox40: ? → +

Updated

•

10 years ago

status-firefox39: --- → affected

tracking-firefox39: ? → +

Comment 38

•

10 years ago

(In reply to Milan Sreckovic [:milan] from comment #34) > At least some of these are startup crashes; any correlation with the startup > vs. not-startup and info from comment 29? About half of the null-device crashes are within the first 60 seconds. Mostly towards the end of that first minute though; not single-digit uptimes. The MOZ_CRASH breakpoints are not startup crashes.

Milan Sreckovic [:milan] (needinfo for best results)

Comment 39

•

10 years ago

Should we go for fixing the MOZ_CRASH issues anyway for 39?

Flags: needinfo?(milan)

Reporter

Comment 40

•

10 years ago

This one is still a bit of a mystery, it will probably miss 39 if we can't figure out what to do. Bas, one last push?

Flags: needinfo?(milan)

Comment 41

•

10 years ago

There's something very weird going on in the current reports, first of all there's this: D2D- D2D1.1- D2D+ DWrite- DWrite+ D3D11 Layers- D3D11 Layers+ Second of all this MOZ_CRASH -clearly- passed a gfxCriticalError but for some reason that log message is not in the crash report. Removing this MOZ_CRASH does not feel like a good idea since it will mean those people would get a -lot- of rendering artifacts. Those weird - and + mixes are likely a good indicator as to what's responsible for this situation though, we should try to figure out what can cause that.

Flags: needinfo?(bas)

Comment 42

•

10 years ago

I've found one where the gfxCriticalError did get through, and it shows 0x80070057 ERROR_INVALID_PARAMETER, this further suggests we're getting into some weird state due to this - + business.

Ryan Stecker

Comment 43

•

10 years ago

I'm still able to reproduce this crash (or perhaps one with a similar signature) every time I update my NVIDIA drivers. https://crash-stats.mozilla.com/report/index/b1927cab-476d-48a4-ad33-185e92150608 Per comment 30, I waited until after I installed 352.86 to see if the newer driver had actually solved my crash. Unfortunately upon installing driver 353.06 a few minutes ago I was able to bring my e10s content tabs into a broken state where nightly only displays a spinner instead of the page content. Additionally, opening a new about:newtab works fine, but upon navigating the new tab to any website, the tab crashed and submitted the crash dump above.

Milan Sreckovic [:milan] (needinfo for best results)

Comment 44

•

10 years ago

(In reply to Ryan Stecker from comment #43) > I'm still able to reproduce this crash (or perhaps one with a similar > signature) every time I update my NVIDIA drivers. > > https://crash-stats.mozilla.com/report/index/b1927cab-476d-48a4-ad33- > 185e92150608 > > Per comment 30, I waited until after I installed 352.86 to see if the newer > driver had actually solved my crash. > > Unfortunately upon installing driver 353.06 a few minutes ago I was able to > bring my e10s content tabs into a broken state where nightly only displays a > spinner instead of the page content. Additionally, opening a new > about:newtab works fine, but upon navigating the new tab to any website, the > tab crashed and submitted the crash dump above. Interesting, Milan, do we have an NVidia (non-optimus, Ryan has a GTX970 as it looks from the report, so he's running dual GPUs with his intel on board but he's not using optimus) machine somewhere in the Toronto office where we could try updating an NVidia driver.. the crash Ryan posted has the interesting -/+ stuff in there and maybe this reproduces fairly reliably when updating the driver on a real NVidia machine?

Flags: needinfo?(milan)

Reporter

Comment 45

•

10 years ago

Not sure if we have one - Jeff? For the "disappearing" critical errors - they may have wrapped, in which case setting gfx.logging.crash.length (int) preference to a larger number (e.g., 50) may keep it around, if we think this will help us.

Flags: needinfo?(milan) → needinfo?(jmuizelaar)

Comment 46

•

10 years ago

(In reply to Milan Sreckovic [:milan] from comment #45) > Not sure if we have one - Jeff? > > For the "disappearing" critical errors - they may have wrapped, in which > case setting gfx.logging.crash.length (int) preference to a larger number > (e.g., 50) may keep it around, if we think this will help us. There's none at all in that report as far as I can tell, but in the new ones we have one, which is an ILLEGAL_PARAMETER one, almost certainly the -/+ business causes us to use a null handle or an invalid handle somehow. This is likely going to be very easy to fix when we can reproduce it. If we -can't- reproduce this I would suggest we decide to ignore INVALID_PARAMETER errors for now and assume the situation corrects itself in a valid way, but I would really rather not.

Comment 47

•

10 years ago

Unless I'm mistaking, I'm not seeing this show up in significant numbers on either 39 or 40, can anyone confirm this?

Comment 48

•

10 years ago

They're still around, but way down on the rankings (100+). I'm seeing relatively more of the null-device vs MOZ_CRASH variety; before they used to be pretty even.

Robert Kaiser

Comment 49

•

10 years ago

I think the driver updates mentioned in comment #30 did really help here.

Comment 50

•

10 years ago

(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #49) > I think the driver updates mentioned in comment #30 did really help here. I suspect it's more likely something else helped here, users don't tend to update their drivers this fast. We expanded some of our video blacklisting, I suspect that may be involved. But it's hard to be certain.

Sylvestre Ledru [:Sylvestre]

Comment 51

•

10 years ago

Should we consider this sufficiently 'fixed' for 39 and 40 at this point?

Comment 52

•

10 years ago

I think so. I will stop tracking for 39 & 40.

tracking-firefox39: + → -

tracking-firefox40: + → -

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 54

•

10 years ago

Saw this with Nightly 20150804 on Windows 8.1 64 bit with Nvidia 765M and Intel Graphics HD4600 when upgrading from 353.30-notebook-win8-win7-64bit-international-whql.exe to 353.62-notebook-win8-win7-64bit-international-whql.exe

u279076

Comment 55

•

10 years ago

(In reply to Aryx [:archaeopteryx][:aryx] from comment #54) > Saw this with Nightly 20150804 on Windows 8.1 64 bit with Nvidia 765M and > Intel Graphics HD4600 when upgrading from > 353.30-notebook-win8-win7-64bit-international-whql.exe > to > 353.62-notebook-win8-win7-64bit-international-whql.exe Was Firefox running at the time of the driver upgrade or did Firefox crash during startup after having updated the driver? Did you restart your computer at any point during the driver update process?

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 56

•

10 years ago

I let them running during the upgrade on purpose to test this (the driver upgrade didn't prompt for a restart). Just saw that it was Firefox 39.0 which crashed and not Beta or Nightly which were also running (but had black areas). More info: https://crash-stats.mozilla.com/report/index/e57ac6c5-b7ca-4dd7-9984-acdc12150805

u279076

Comment 57

•

10 years ago

This remains a significant issue but is no longer a top crash. This is currently ranked 43rd in Firefox 41.0.1 with 976 crashes accounting for 0.43%.

status-firefox39: affected → wontfix

status-firefox40: affected → wontfix

status-firefox41: --- → affected

status-firefox42: --- → affected

status-firefox43: --- → affected

status-firefox44: --- → affected

Keywords: topcrash

BMO Automation

Updated

•

10 years ago

Crash Signature: [@ mozilla::layers::SyncObjectD3D11::FinalizeFrame()] → [@ mozilla::layers::SyncObjectD3D11::FinalizeFrame()] [@ mozilla::layers::SyncObjectD3D11::FinalizeFrame]

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 58

•

9 years ago

I am noticing content tabs crashing 1-2 times every day since Aurora47 updates were enabled late last week. MConley looked at the crash reports (below) and pointed me to this bug. Here are the top items from my about:crashes * https://crash-stats.mozilla.com/report/index/bp-533909e4-3ad1-420d-8cbc-c2be92160316 - 3/16/16 * https://crash-stats.mozilla.com/report/index/bp-633d5c25-05df-4346-affb-d1f562160315 - 3/15/16 * https://crash-stats.mozilla.com/report/index/bp-b54dee5b-eb40-41b6-859f-d1bdc2160315 - 3/14/16

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 59

•

9 years ago

Nom'ing for tracking by e10s team mainly due to comment 58 (more crashiness seen since 47 moved to Aurora channel).

tracking-e10s: --- → ?

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 60

•

9 years ago

Yeah, we see these coming in a small number, hopefully some of the patches we have on the device reset handling eventually clear this up. Note that this is a forced crash on nightly and aurora - it will not happen (as such) on beta or release.

Flags: needinfo?(jmuizelaar)

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Updated

•

9 years ago

Summary: non device reset startup crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame() → non device reset startup forced crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame() - nightly and aurora only

Brad Lassey [:blassey] (use needinfo?)

Updated

•

9 years ago

tracking-e10s: ? → +

Priority: -- → P2

Nicholas Nethercote [inactive]

Updated

•

9 years ago

Attachment #8600382 - Flags: review?(bas)

Comment 61

•

9 years ago

This is still happening sometimes, e.g. there were 5 occurrences in Nightly 20160501030217: https://crash-stats.mozilla.com/report/index/cf2f68e7-096e-4459-9c74-f23a62160502 https://crash-stats.mozilla.com/report/index/38b4938b-0763-444d-9cca-0f2c72160502 https://crash-stats.mozilla.com/report/index/08de245b-a4a9-4d29-b13e-9ed7e2160502 https://crash-stats.mozilla.com/report/index/4d178bd8-4216-472f-b5a0-718f52160502 https://crash-stats.mozilla.com/report/index/022c4265-c2f3-4ff8-8ef3-3abac2160502

Milan Sreckovic [:milan] (needinfo for best results)

Comment 62

•

9 years ago

Crash volume for signature 'mozilla::layers::SyncObjectD3D11::FinalizeFrame': - nightly (version 50): 136 crashes from 2016-06-06. - aurora (version 49): 304 crashes from 2016-06-07. - beta (version 48): 1174 crashes from 2016-06-06. - release (version 47): 9240 crashes from 2016-05-31. - esr (version 45): 263 crashes from 2016-04-07. Crash volume on the last weeks: Week N-1 Week N-2 Week N-3 Week N-4 Week N-5 Week N-6 Week N-7 - nightly 23 16 22 14 20 20 16 - aurora 42 64 52 49 36 33 9 - beta 163 183 161 171 188 184 62 - release 1444 1472 1323 1390 1420 1301 472 - esr 39 27 31 32 34 20 26 Affected platform: Windows

status-firefox47: --- → affected

status-firefox48: --- → affected

status-firefox49: --- → affected

status-firefox50: --- → affected

status-firefox-esr45: --- → affected

Reporter

Comment 63

•

9 years ago

Jerry, here is an interesting crash: https://crash-stats.mozilla.com/report/index/c2d46775-f73e-4b05-94e5-aae0e2160821. We hit this one, MOZ_CRASH becausewe fail to open D3D11 shared resource, and then gfxWindowsPlatform::GetPlatform()->DidRenderingDeviceReset() returns false inside of SyncObjectD3D11::FinalizeFrame. Looking at the error log, we see the the "device reset" messages just two seconds before this happens and we crash. So, something is not quite getting initialized in time, perhaps we reset the "device reset" too soon. What do you think?

Flags: needinfo?(hshih)

Comment 64

•

9 years ago

Crash volume for signature 'mozilla::layers::SyncObjectD3D11::FinalizeFrame': - nightly (version 51): 67 crashes from 2016-08-01. - aurora (version 50): 216 crashes from 2016-08-01. - beta (version 49): 580 crashes from 2016-08-02. - release (version 48): 1474 crashes from 2016-07-25. - esr (version 45): 392 crashes from 2016-05-02. Crash volume on the last weeks (Week N is from 08-22 to 08-28): W. N-1 W. N-2 W. N-3 - nightly 32 20 7 - aurora 81 68 23 - beta 223 194 52 - release 552 398 204 - esr 36 31 36 Affected platform: Windows Crash rank on the last 7 days: Browser Content Plugin - nightly #49 #86 - aurora #131 #13 - beta #87 #58 - release #36 #330 - esr #233

status-firefox51: --- → affected

Milan Sreckovic [:milan] (needinfo for best results)

Comment 65

•

9 years ago

In my first view, maybe we still failed at the creation of new compositor(which is triggered by our device removed detection flow). https://hg.mozilla.org/mozilla-central/annotate/7963ebdd52b93f96b812eff2eab8d94097147b9c/gfx/layers/ipc/CompositorBridgeParent.cpp#l2293 Then, hit the early return at: https://hg.mozilla.org/mozilla-central/annotate/7963ebdd52b93f96b812eff2eab8d94097147b9c/gfx/layers/ipc/CompositorBridgeParent.cpp#l2305 So, there is no new SyncObjectD3D11 updated from https://hg.mozilla.org/mozilla-central/annotate/7963ebdd52b93f96b812eff2eab8d94097147b9c/gfx/layers/ipc/CompositorBridgeParent.cpp#l2311 Finally, we still use the "probably invalid" syncObj during painting. I will try to add some criticalNote to show this.

Reporter

Updated

•

9 years ago

Flags: needinfo?(hshih)

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 66

•

9 years ago

Jerry, did you have a chance to come up with a patch you mention in comment 65?

Flags: needinfo?(hshih)

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 67

•

9 years ago

This crash has no critical errors until the one that takes us down: https://crash-stats.mozilla.com/report/index/35996059-ede0-4199-9d0a-af5742160911

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 68

•

9 years ago

Some critical error correlations: (95.27% in signature vs 00.35% overall) gfx_error 'Failed to create a D3D11 content device: ' (75.27% in signature vs 06.38% overall) gfx_error 'Failed 2 buffer db=' (36.88% in signature vs 00.60% overall) gfx_error 'D3D11 swap chain preset failed ' (21.29% in signature vs 00.70% overall) gfx_error 'Invalid draw target type specified: ' (16.34% in signature vs 00.12% overall) gfx_error '[D3D11] TextureSourceD3D11:GetShaderResourceView CreateSRV failure '

Comment 69

•

9 years ago

Yes, I'm trying to write a patch for comment 65. But I have no idea about the crash in comment 67. Content process will do nothing when it detects driver-remove. The UpdateRenderMode() (which will re-create d3d device) always triggered by CompositorBridgeChild::RecvCompositorUpdated(). So, maybe content side should skip all painting task between the timing of knowing driver-remove and receiving the RecvCompositorUpdated() call.

Updated

•

9 years ago

Assignee: bas → hshih

Status: NEW → ASSIGNED

Flags: needinfo?(hshih)

Comment 70

•

9 years ago

Attached patch show some gfxCritical logs for compositor creation failed during device-reset. v1. — Details — Splinter Review

Attachment #8794479 - Flags: review?(dvander)

Comment 71

•

9 years ago

Attached patch skip content painting when the device is in device-reset/remove status. v1 — Details — Splinter Review

Content process will do nothing when it detects driver-remove. The UpdateRenderMode() (which will re-create d3d device) always triggered by CompositorBridgeChild::RecvCompositorUpdated(). Before RecvCompositorUpdated() message, the content side still use that bad-status device for rendering. This patch try to skip the painging and the SyncObject waiting during driver-reset.

Attachment #8794480 - Flags: review?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 72

•

9 years ago

Attached patch remove the painting aborting for ClientPaintedLayer in device-remove/reset status. v1 — Details — Splinter Review

Since we already have the device-remove checking in ClientLayerManager, this checking is redundant.

Attachment #8794481 - Flags: review?(dvander)

Comment 73

•

9 years ago

Comment on attachment 8794479 [details] [diff] [review] show some gfxCritical logs for compositor creation failed during device-reset. v1. Review of attachment 8794479 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/ipc/CompositorBridgeParent.cpp @@ +2379,5 @@ > return Nothing(); > } > > + if (aBackendHints.IsEmpty()) { > + gfxCriticalNote << "No BackendHint for compositor creation."; This should never happen, we should always have at least a basic compositor. @@ +2387,3 @@ > RefPtr<Compositor> compositor = NewCompositor(aBackendHints); > if (!compositor) { > + gfxCriticalNote << "Try to create a new compositor but failed."; Same here. The only place this should truly be able to fail is OS X, since nsChildView::InitCompositor can technically return false. ::: gfx/thebes/DeviceManagerDx.cpp @@ +124,5 @@ > > CreateCompositorDevice(d3d11); > > if (!d3d11.IsEnabled()) { > + gfxCriticalNote << "Can't create d3d11 device in CreateCompositorDevices()"; Are these notes needed? The information should already be available in about:support (and crash reports), since every failure should be propagated through gfxConfig/telemetry.

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 74

•

9 years ago

Comment on attachment 8794480 [details] [diff] [review] skip content painting when the device is in device-reset/remove status. v1 Review of attachment 8794480 [details] [diff] [review]: ----------------------------------------------------------------- This seems reasonable, but I'm curious: do we actually have a case where the content process gets a device reset, and the compositor does not? Or is this patch more about not trying to use a failing SyncObject?

Attachment #8794480 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

9 years ago

Attachment #8794481 - Flags: review?(dvander) → review+

Comment 75

•

9 years ago

(In reply to David Anderson [:dvander] from comment #74) > Comment on attachment 8794480 [details] [diff] [review] > skip content painting when the device is in device-reset/remove status. v1 > > Review of attachment 8794480 [details] [diff] [review]: > ----------------------------------------------------------------- > > This seems reasonable, but I'm curious: do we actually have a case where the > content process gets a device reset, and the compositor does not? Or is this > patch more about not trying to use a failing SyncObject? I think both content and chrome could get the device reset status. The problem is that the content side can't recover the device-reset status by itself. It will still use that bad-status device for rendering and synchronization until the CompositorBridgeChild::RecvCompositorUpdated() message received. This patch try to skip the d2d/d3d calls before the recovering in RecvCompositorUpdated().

Flags: needinfo?(dvander)

Updated

•

9 years ago

Attachment #8794479 - Flags: review?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 76

•

9 years ago

Please land the attachment 8794480 [details] [diff] [review] and attachment 8794481 [details] [diff] [review] to m-c.

Keywords: checkin-needed

Comment 77

•

9 years ago

(In reply to Jerry Shih[:jerry] (UTC+8) from comment #75) > (In reply to David Anderson [:dvander] from comment #74) > > Comment on attachment 8794480 [details] [diff] [review] > > skip content painting when the device is in device-reset/remove status. v1 > > > > Review of attachment 8794480 [details] [diff] [review]: > > ----------------------------------------------------------------- > > > > This seems reasonable, but I'm curious: do we actually have a case where the > > content process gets a device reset, and the compositor does not? Or is this > > patch more about not trying to use a failing SyncObject? > > I think both content and chrome could get the device reset status. > > The problem is that the content side can't recover the device-reset status > by itself. It will still use that bad-status device for rendering and > synchronization until the CompositorBridgeChild::RecvCompositorUpdated() > message received. This patch try to skip the d2d/d3d calls before the > recovering in RecvCompositorUpdated(). Okay, thanks for explaining.

Flags: needinfo?(dvander)

https://hg.mozilla.org/mozilla-central/rev/e855469d6acf https://hg.mozilla.org/mozilla-central/rev/4247ce021880

Comment 78

•

9 years ago

Pushed by cbook@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/e855469d6acf skip content painting when the device is in device-reset/remove status. r=dvander https://hg.mozilla.org/integration/mozilla-inbound/rev/4247ce021880 remove the painting aborting for ClientPaintedLayer in device-remove/reset status. r=dvander

Keywords: checkin-needed

Carsten Book [:Tomcat]

Comment 79

•

9 years ago

bugherder

Vincent Liu[:vliu]

Updated

•

9 years ago

Blocks: 1297204

This crash still happens. https://crash-stats.mozilla.com/signature/?signature=mozilla%3A%3Alayers%3A%3ASyncObjectD3D11%3A%3AFinalizeFrame&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-version&_sort=-date&page=1

Assignee

Comment 80

•

9 years ago

Comment 81

•

9 years ago

Crash volume for signature 'mozilla::layers::SyncObjectD3D11::FinalizeFrame': - nightly (version 52): 61 crashes from 2016-09-19. - aurora (version 51): 65 crashes from 2016-09-19. - beta (version 50): 332 crashes from 2016-09-20. - release (version 49): 1523 crashes from 2016-09-05. - esr (version 45): 562 crashes from 2016-06-01. Crash volume on the last weeks (Week N is from 10-03 to 10-09): W. N-1 W. N-2 - nightly 42 19 - aurora 56 9 - beta 290 42 - release 1211 312 - esr 48 44 Affected platform: Windows Crash rank on the last 7 days: Browser Content Plugin - nightly #39 #47 - aurora #35 #44 - beta #65 #36 - release #36 #131 - esr #220

status-firefox52: --- → affected

Comment 82

•

9 years ago

:kechen is trying to investigate the crash problem in comment 80.

Assignee

Comment 84

•

9 years ago

There are two main cases in this crash : The first one is like [1], we tried to open the shared resource with the handle and it failed without device-removed. I think there are two possible reasons: 1. The shared resource still exists but the handle in content process is not updated. 2. We accidentally released the shared resource in somewhere, which cause the handle invalid. According to the crash report[1], device-removed doesn't happened and the crash point is far from starting time; therefore, I think this is caused by the reason 2. I suggest to force reset driver at this point like what we do in chrome process[2]. The second one is like[3], we got a timeout when attempting to acquire the mutex of shared resource and this crash always happened after device-removed. This also appears in other signatures [4][5]. I found that there are some early return cases in OnRenderingDeviceReset[6] so they do not update the SyncObject, should we reset the SyncObject here ? And if we are allowed to reuse the shared resource, this code[7] might cause problem, it leaves the function in the mutext section. In this case, move this part of code out of the section might solve the problem. I am now doing an experiment about this. Jerry, do you think it's proper to let content process reset the driver when we encounter this crash ? and do you have any thought about the second case ? [1] https://crash-stats.mozilla.com/report/index/46aec604-fad8-4ab7-b282-fa7282161017#tab-metadata [2] https://dxr.mozilla.org/mozilla-central/rev/944cb0fd05526894fcd90fbe7d1e625ee53cd73d/gfx/layers/d3d11/TextureD3D11.cpp#781 [3] https://crash-stats.mozilla.com/report/index/bb0fef1e-155a-4ab5-a833-b45772161026#tab-metadata [4] https://crash-stats.mozilla.com/report/index/2d480ff9-6d6a-4174-ac2a-400482161028#tab-metadata [5] https://crash-stats.mozilla.com/report/index/a1104467-20da-4782-8efd-25d4b2161027#tab-metadata [6] https://dxr.mozilla.org/mozilla-central/rev/944cb0fd05526894fcd90fbe7d1e625ee53cd73d/widget/nsBaseWidget.cpp#335 [7] https://dxr.mozilla.org/mozilla-central/rev/944cb0fd05526894fcd90fbe7d1e625ee53cd73d/gfx/layers/d3d11/TextureD3D11.cpp#1237

Assignee

Comment 85

•

9 years ago

Hello Jerry, can you take a look at comment 84 ?

Flags: needinfo?(hshih)

Peter Chang[:pchang]

Updated

•

9 years ago

Assignee: hshih → kechen

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 86

•

9 years ago

Attached patch Use AutoTextureLock to manage the shared resource's mutex. — Details — Splinter Review

Hello David, Can you help me to review this patch ? This patch use RAII to avoid calling AcquireSync without ReleaseSync. It mainly fixes the early return case without ReleaseSync in [1]. Hoping this can fix some timeout crashes for AcquireSync in this signature. The following link[2] is the try of this patch, it looks good since the patch only effects Windows platform. [1] https://dxr.mozilla.org/mozilla-central/rev/8e8b146fcb8b268e3c09b646087c6b2ef9f0af6f/gfx/layers/d3d11/TextureD3D11.cpp#1238 [2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=00abdae10271&selectedJob=30434764

Attachment #8808079 - Flags: review?(dvander)

Updated

•

9 years ago

Attachment #8808079 - Flags: review?(dvander) → review+

Assignee

Updated

•

9 years ago

Keywords: checkin-needed

Comment 87

•

9 years ago

Pushed by ryanvm@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/e9db19aed6c0 Use AutoTextureLock to manage the shared resource's mutex. r=dvander

Keywords: checkin-needed

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Updated

•

9 years ago

Flags: needinfo?(hshih)

Comment 88

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/e9db19aed6c0

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Updated

•

9 years ago

Updated

•

9 years ago

Depends on: 1317131

Assignee

Comment 89

•

9 years ago

The crash rate doesn't decrease, I am now working on another patch to fix it.

Assignee

Comment 90

•

9 years ago

Attached patch [WIP] Trigger device reset when failing to create D3D11 device in content process (obsolete) — Details — Splinter Review

Hello David, There are some crashes like[1] which attempt to access a null device in[2]. I think the reason of this problem is that content process failed to create D3D11 device during device reset process and leave the device null while chrome process successfully create it. Also, device reset check will return normal in this case[3]. As the result, the code will pass the device check in [4] and run into the crash. My proposal to solve this problem is to set the device reset reason and ask chrome process to do the device reset with D3D11_COMPOSITING disabled again so that we can recover from this situation. Can you give me some advices about the patch ? Thank you. [1] https://crash-stats.mozilla.com/report/index/9f76066c-beb5-4e2d-af1d-5c4802161216 [2] https://hg.mozilla.org/releases/mozilla-release/annotate/cc272f7d48d3/gfx/layers/d3d11/TextureD3D11.cpp#l1124 [3] https://dxr.mozilla.org/mozilla-central/rev/6dbc6e9f62a705d5f523cc750811bd01c8275ec6/gfx/thebes/DeviceManagerDx.cpp#689 [4] https://dxr.mozilla.org/mozilla-central/rev/6dbc6e9f62a705d5f523cc750811bd01c8275ec6/gfx/layers/client/ClientLayerManager.cpp#655

Attachment #8819217 - Flags: feedback?(dvander)

Comment 91

•

9 years ago

Crash volume for signature 'mozilla::layers::SyncObjectD3D11::FinalizeFrame': - nightly (version 53): 170 crashes from 2016-11-14. - aurora (version 52): 850 crashes from 2016-11-14. - beta (version 51): 1123 crashes from 2016-11-14. - release (version 50): 18984 crashes from 2016-11-01. - esr (version 45): 1152 crashes from 2016-07-06. Crash volume on the last weeks (Week N is from 01-02 to 01-08): W. N-1 W. N-2 W. N-3 W. N-4 W. N-5 W. N-6 W. N-7 - nightly 4 5 10 12 40 31 65 - aurora 119 136 169 177 146 76 0 - beta 169 192 193 218 124 140 45 - release 2631 2706 2955 3105 3028 2709 1007 - esr 31 50 67 57 79 47 62 Affected platform: Windows Crash rank on the last 7 days: Browser Content Plugin - nightly #391 #37 - aurora #23 #23 - beta #395 #26 - release #35 #8 - esr #239

status-firefox53: --- → affected

Milan Sreckovic [:milan] (needinfo for best results)

Assignee

Comment 92

•

9 years ago

Most of crashes move to Bug 1322741 due to changes in Bug 1319557. After uplifting the fix in Bug 1322741, the crash volume should decrease. The crashes in firefox 53 might be caused by other reasons, I will keep looking into it.

Comment 93

•

9 years ago

Perhaps we can land a version of the WIP patch that has gfxCriticalError instead of SendDeviceReset in those places? Perhaps the ForceDeviceReset as well - in other words, where the WIP patch detects some condition and does something because of it, we just detect it and put a unique gfxCriticalError so that we can see if it triggers in these crashes?

Reporter

Updated

•

9 years ago

Flags: needinfo?(kechen)

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 94

•

9 years ago

Attached patch Add gfxCriticalError for failing to create D3D11 device in content side (obsolete) — Details — Splinter Review

Hello Milan, Thank you for the advice, this patch can help us to confirm the root cause of null accessing crashes. Could you help me to review it ? Thank you.

Flags: needinfo?(kechen)

Attachment #8828195 - Flags: review?(milan)

Comment 95

•

9 years ago

Comment on attachment 8819217 [details] [diff] [review] [WIP] Trigger device reset when failing to create D3D11 device in content process Is this obsoleted by bug 1322741? If we're accessing a null D3D11 device after a reset it's not clear how this would fix it. We'd still be accessing a stale device.

Attachment #8819217 - Flags: feedback?(dvander)

Milan Sreckovic [:milan] (needinfo for best results)

Assignee

Comment 96

•

9 years ago

Hello David, I think there are two different problem in this signature : 1. The SharedResourceHandler is not synchronized with D3D11 device (which is solved by bug 1322741). 2. The device from DeviceManagerDx is nullptr like[1]. The second problem, which I think is not covered by first problem, is the one I want to solve with this patch. My theory is that when handling device reset, we reset the device in [2] but failed to create new device in the following code which cause this crash. Therefore I rise the device reset flag after the creation failure to make the code early-return in [3] without entering the crash segment in the first place to avoid using the stale device. And the reason that I send a force device reset message to compositor process is to make sure that it knows content side need a new SharedResourceHandler because I don't know if there is any chance that content process have problem in creating D3D11 device while everything works fine in compositor process. And I think it might be still profitable to land attachment 8828195 [details] [diff] [review], in one hand, it can verify if the crash is really caused by the failure creation of D3D11 device when handling device reset. Do you have any advice on this solution ? [1] https://crash-stats.mozilla.com/report/index/9f76066c-beb5-4e2d-af1d-5c4802161216 [2] https://dxr.mozilla.org/mozilla-central/rev/b3774461acc6bee2216c5f57e167f9e5795fb09d/gfx/thebes/gfxWindowsPlatform.cpp#445 [3] https://dxr.mozilla.org/mozilla-central/rev/b3774461acc6bee2216c5f57e167f9e5795fb09d/gfx/layers/client/ClientLayerManager.cpp#661

Flags: needinfo?(dvander)

Reporter

Comment 97

•

9 years ago

Comment on attachment 8828195 [details] [diff] [review] Add gfxCriticalError for failing to create D3D11 device in content side Review of attachment 8828195 [details] [diff] [review]: ----------------------------------------------------------------- This looks good - should we also have that change in CompositorBridgeChild::RecvCompositorUpdated that you had in the other patch: // If we still get device reset here, something must wrong when creating // d3d11 devices. if (gfxPlatform::GetPlatform()->DidRenderingDeviceReset()) { gfxCriticalError() << "Unexpected reset device processing the compositor update"; } or something like that?

Attachment #8828195 - Flags: review?(milan) → review+

Assignee

Comment 98

•

9 years ago

Attached patch Add gfxCriticalError for failing to create D3D11 device in content side — Details — Splinter Review

Carry r+ from comment 97.

Attachment #8828195 - Attachment is obsolete: true

Attachment #8828655 - Flags: review+

Assignee

Updated

•

9 years ago

Keywords: checkin-needed

https://hg.mozilla.org/mozilla-central/rev/19507933747f

Comment 99

•

9 years ago

Pushed by ihsiao@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/19507933747f Add gfxCriticalError for failing to create D3D11 device in content side. r=milan

Keywords: checkin-needed

Carsten Book [:Tomcat]

Comment 100

•

9 years ago

bugherder

Julien Cristau [:jcristau]

Assignee

Comment 101

•

9 years ago

Attached patch [Uplift] Add gfxCriticalError for failing to create D3D11 device in content side — Details — Splinter Review

Approval Request Comment [Feature/Bug causing the regression]: Not a regression, this patch is for crash tracing. [User impact if declined]: This patch is for finding the root cause of this crash. Without uplifting this, it will be difficult for the crash investigating. [Is this code covered by automated tests?]: No. [Has the fix been verified in Nightly?]: Yes. [Needs manual test from QE? If yes, steps to reproduce]: No. [List of other uplifts needed for the feature/fix]: No. [Is the change risky?]: Low. [Why is the change risky/not risky?]: Only add some gfx errors to somewhere that might cause the crash. [String changes made/needed]: No.

Attachment #8829310 - Flags: approval-mozilla-aurora?

Comment 102

•

9 years ago

Comment on attachment 8829310 [details] [diff] [review] [Uplift] Add gfxCriticalError for failing to create D3D11 device in content side add diagnostics for debug gfx crash, beta52+

Attachment #8829310 - Flags: approval-mozilla-aurora? → approval-mozilla-beta+

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Updated

•

9 years ago

Summary: non device reset startup forced crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame() - nightly and aurora only → non device reset startup forced crash in mozilla::layers::SyncObjectD3D11::FinalizeFrame()

Comment 103

•

9 years ago

(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #96) > Hello David, > > I think there are two different problem in this signature : > 1. The SharedResourceHandler is not synchronized with D3D11 device (which > is solved by bug 1322741). > 2. The device from DeviceManagerDx is nullptr like[1]. > > The second problem, which I think is not covered by first problem, is the > one I want to solve with this patch. > > My theory is that when handling device reset, we reset the device in [2] but > failed to create new device in the following code which cause this crash. > Therefore I rise the device reset flag after the creation failure to make > the code early-return in [3] without entering the crash segment in the first > place to avoid using the stale device. If we handle a device reset, we throw away layers and acquire a new TextureFactoryIdentifier for everything affected. This should eliminate any possibility of FinalizeFrame not being synchronized to the device configuration. For content tabs, this happens via TabChild::CompositorUpdated[1]. For top-level widgets, it happens in nsWindow::OnRenderingDeviceReset [2]. Both of these happen *after* we acquire (or fail to acquire) a new ID3D11Device. So: if we paint after a device reset, we should have discarded layers and any device information on that ClientLayerManager. If we're not doing that it's a bug. And in fact, that's what bug 1322741 papers over. Because of how we send messages to content, we can paint a tab without it having acknowledged a device reset. (There's a guard to prevent this from happening [3], but it doesn't update TextureFactoryIdentifier. I also proposed making the messages atomic in bug 1322741 comment #9.) Another loophole is that, if we fail to acquire a content device, we still have a SyncObjectD3D11. But FinalizeFrame should fail a check [4] - we won't have any ID3D11Texture2D objects registered, so there's no reason to use the ID3D11Device. But defin Is there a situation you're thinking of that isn't covered by these cases? [1] http://searchfox.org/mozilla-central/rev/ed04f7a44925dace34f2d2e15ef9c0f2809d70b0/gfx/layers/ipc/CompositorBridgeChild.cpp#366 [2] http://searchfox.org/mozilla-central/rev/ed04f7a44925dace34f2d2e15ef9c0f2809d70b0/widget/nsBaseWidget.cpp#368 [3] http://searchfox.org/mozilla-central/rev/ed04f7a44925dace34f2d2e15ef9c0f2809d70b0/gfx/layers/client/ClientLayerManager.cpp#220 [4] http://searchfox.org/mozilla-central/rev/ed04f7a44925dace34f2d2e15ef9c0f2809d70b0/gfx/layers/d3d11/TextureD3D11.cpp#1256

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 104

•

9 years ago

Sorry, last sentence got cut off. I'm not sure what I meant to say :)

Flags: needinfo?(dvander)

Ryan VanderMeulen [:RyanVM]

Comment 105

•

9 years ago

https://hg.mozilla.org/releases/mozilla-beta/rev/8d303d330ec9

status-firefox41: affected → wontfix

status-firefox42: affected → wontfix

status-firefox43: affected → wontfix

status-firefox44: affected → wontfix

status-firefox47: affected → wontfix

status-firefox48: affected → wontfix

status-firefox49: affected → wontfix

status-firefox50: affected → wontfix

status-firefox51: affected → wontfix

status-firefox52: affected → fix-optional

status-firefox54: --- → affected

status-firefox-esr45: affected → wontfix

Ryan VanderMeulen [:RyanVM]

Updated

•

9 years ago

Attachment #8829310 - Flags: checkin+

Milan Sreckovic [:milan] (needinfo for best results)

Assignee

Comment 106

•

9 years ago

Comment on attachment 8819217 [details] [diff] [review] [WIP] Trigger device reset when failing to create D3D11 device in content process According to comment 103, I find the problem this patch want to fix is covered by Bug 1322741, so I obsolete this patch.

Attachment #8819217 - Attachment is obsolete: true

Reporter

Comment 107

•

9 years ago

A crash report with the "Unexpected reset device processing..." message: https://crash-stats.mozilla.com/report/index/74e44e19-1947-4881-8ade-2f7d52170131#tab-metadata A crash report with the "Failed to create D3D11 device..." message: https://crash-stats.mozilla.com/report/index/0c48ce0d-dafa-4e62-8961-892762170129#tab-metadata

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 108

•

9 years ago

In the meantime, there is a lot of these crashes - we don't want to just check for the null content device in SyncObjectD3D11::FinalizeFrame, before calling OpenSharedResource? It would then combine this with the other path, where the OpenSharedResource fails, and we can act accordingly depending on whether it's a device reset or not.

Flags: needinfo?(kechen)

Milan Sreckovic [:milan] (needinfo for best results)

Assignee

Comment 109

•

9 years ago

I think the null content device issue is covered by Bug 1322741. After the patch uplift in Bug 1322741, we will check the null device before calling FinalizeFrame and prevent violent reading to null pointer[1]. It will eliminate most of the crashes this signature since most of them are null pointer accessing crashes. Currently the patch is in firefox 52, should we also uplift it to firefox 51? Since we don't crash due to the null pointer accessing now, the following question is how to recover from the null content device situation. According to current device reset structure, if chrome process find out content device is invalid, it will trigger device reset process and renew the syncobject and update the device in content side, in this case, I think we don't have to do anything to current code. [1] https://dxr.mozilla.org/mozilla-central/rev/1d025ac534a6333a8170a59a95a8a3673d4028ee/gfx/layers/d3d11/TextureD3D11.cpp#1247

Flags: needinfo?(kechen)

Reporter

Comment 110

•

9 years ago

Bug 1322741 fix is in 52 beta 2. Here is a crash in 52 beta 2: https://crash-stats.mozilla.com/report/index/aaa9e577-4ddb-40b3-84fc-5f5c82170202 We fail to open the shared resource, we don't have device reset flag, so we crash. Also, same crash signature, with a sync lock timeout in 54, after the patch to bug 1322741 landed on central (in 53): https://crash-stats.mozilla.com/report/index/df10f88d-9dd2-4628-a302-2e5b22170131 Doesn't look like bug 1322741 quite took care of it all, and we still crash.

Flags: needinfo?(kechen)

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 111

•

9 years ago

With the fix in Bug 1322741, we can ensure the mSyncHandle is updated. However, according to GraphicsCriticalError in the crashes[1], the failure of calling OpenSharedResource is caused by invalid argument and the only parameter we pass is mSyncHandle. Which might indicates the mSyncHandle is no longer available. I am still trying to find out the root cause right now while an easy walkaround might just asks compositor process to update a new mSyncHandle. David do you have any advice on this ? And about the sync lock timeout crash, I don't have any thought about it. We've raised the time limitation from 10sec to 20sec to avoid the timeout but raise the number again might not be a good idea. I will keep looking into it. [1] https://crash-stats.mozilla.com/report/index/aaa9e577-4ddb-40b3-84fc-5f5c82170202

Flags: needinfo?(kechen) → needinfo?(dvander)

Vincent Liu[:vliu]

Updated

•

8 years ago

Comment 112

•

8 years ago

Attached image Invalid handle due to the ipc latency. — Details

Here is my assumption about the current OpenSharedResource crashes after Bug 1322741: There is a latency between the chrome process sending the handle and the content process receiving the handle. If another device rest happened during the latency, the content process will get an expired handle and use the latest device to open it which will cause the failure of the OpenSharedHandle function call while the device won't report device reset[1]. The Same case might happen when opening a new tab which also contains the SyncObject update. While the patch in Bug 1316788 might reduce the crash probability efficiently, I am now working on a patch that sends a synchronize ipc to chrome process to update the handle before we crash the program in [1]. [1] https://dxr.mozilla.org/mozilla-beta/source/gfx/layers/d3d11/TextureD3D11.cpp#1237

Flags: needinfo?(dvander)

Comment 113

•

8 years ago

(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #112) > Created attachment 8839374 [details] > Invalid handle due to the ipc latency. > > Here is my assumption about the current OpenSharedResource crashes after Bug > 1322741: > There is a latency between the chrome process sending the handle and the > content process receiving the handle. > If another device rest happened during the latency, the content process will > get an expired handle and use the latest device to open it which will cause > the failure of the OpenSharedHandle function call while the device won't > report device reset[1]. > The Same case might happen when opening a new tab which also contains the > SyncObject update. > > While the patch in Bug 1316788 might reduce the crash probability > efficiently, I am now working on a patch that sends a synchronize ipc to > chrome process to update the handle before we crash the program in [1]. > > [1] > https://dxr.mozilla.org/mozilla-beta/source/gfx/layers/d3d11/TextureD3D11. > cpp#1237 Hrm... at this point, why not just make the whole thing fallible? If OpenSharedResource fails, stop painting until we get a new TextureFactoryIdentifier. Requesting one synchronously might have the same problem otherwise (the compositor could have closed it already).

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 114

•

8 years ago

Attached patch Make OpenSharedResource fallible. — Details — Splinter Review

Hello David, could you give me some feedbacks for this patch ? In this patch, I enable the device reset flag when failing to open the shared handle to make it fallible. However, there is a risk that if there are other reasons besides comment 112 cause this crash and there were no updating ipc after this, the content process will stay in device reset status forever. Instead of sending a sync ipc in comment 112, what if I send an async ipc to compositor thread, make it do the device reset again and update the handle ?

Attachment #8841463 - Flags: feedback?(dvander)

Comment 115

•

8 years ago

Why do we need to force a fake device reset, vs. just ignore the bad handle and wait for a new one?

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 116

•

8 years ago

(In reply to David Anderson [:dvander] from comment #115) > Why do we need to force a fake device reset, vs. just ignore the bad handle > and wait for a new one? The reason of forcing a fake device reset is to skip the paintings before we get the new handle. So that we can resume the paintings when we get the new one.

Flags: needinfo?(dvander)

Comment 117

•

8 years ago

Why do we need a device reset to do that, versus just returning false from the function and not painting anything for that frame?

Flags: needinfo?(dvander)

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 118

•

8 years ago

Clearing this bug may take care of bug 1306619 as well.

Marcia Knous [:marcia]

Comment 119

•

8 years ago

This signature spiked in Nightly starting with 20170309030216. We now have 3727 crashes (1096 installations) in the last 7 days.

u279076

Comment 120

•

8 years ago

Adding a note that this is the #2 GPU Process crash @ 41.13% in Nightly 55 (#3 @ 10.24% overall) with 3786 of 4110 reports coming from the GPU Process (92.1%). I don't see any reports showing up in Aurora or Beta in the last week but we do have 18 reports in Release.

Keywords: topcrash-win

Assignee

Comment 121

•

8 years ago

(In reply to David Anderson [:dvander] from comment #117) > Why do we need a device reset to do that, versus just returning false from > the function and not painting anything for that frame? Hello David, sorry for the late reply. The reason I set the device rest is to avoid generating transactions[1] until available handle arrives. [1] https://dxr.mozilla.org/mozilla-central/rev/6d38ad302429c98115c354d643e81987ecec5d3c/gfx/layers/client/ClientLayerManager.cpp#352

Flags: needinfo?(dvander)

Assignee

Comment 122

•

8 years ago

The spike might be related to Bug 1345814.

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 123

•

8 years ago

Attached patch Make AcquireSync RAII. (obsolete) — Details — Splinter Review

I think we might need to make the AcuireSync RAII, or any return between [1] might cause the AcquireSync without Release. Since the mutex of the texture is generated from ImageBridgeChild::GetSingleton()[2] like mSyncObject does, it might also cause the timeout of mSyncObject's AcquireSync and run into the crash. [1] https://dxr.mozilla.org/mozilla-central/rev/6d38ad302429c98115c354d643e81987ecec5d3c/dom/media/platforms/wmf/DXVA2Manager.cpp#902-946 [2] https://dxr.mozilla.org/mozilla-central/rev/6d38ad302429c98115c354d643e81987ecec5d3c/dom/media/platforms/wmf/DXVA2Manager.cpp#900

Attachment #8847534 - Flags: feedback?(bas)

Comment 124

•

8 years ago

Comment on attachment 8841463 [details] [diff] [review] Make OpenSharedResource fallible. Review of attachment 8841463 [details] [diff] [review]: ----------------------------------------------------------------- If you think injecting a fake reset is quite reliable, I'll r+ that. Otherwise, it might make more sense to just set an invalid bit on the SyncObject and have ClientLayerManager check that before it paints.

Attachment #8841463 - Flags: feedback?(dvander) → feedback+

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 125

•

8 years ago

(In reply to Anthony Hughes (:ashughes) [GFX][QA][Mentor] from comment #120) > Adding a note that this is the #2 GPU Process crash @ 41.13% in Nightly 55 > (#3 @ 10.24% overall) with 3786 of 4110 reports coming from the GPU Process > (92.1%). I don't see any reports showing up in Aurora or Beta in the last > week but we do have 18 reports in Release. It looks like this is a separate bug, probably, since it occurs in media code in the GPU process (this bug occurs in layers code in the content process).

Flags: needinfo?(dvander)

Comment 126

•

8 years ago

Comment on attachment 8847534 [details] [diff] [review] Make AcquireSync RAII. Review of attachment 8847534 [details] [diff] [review]: ----------------------------------------------------------------- You're absolutely right, good catch, looks good to me.

Attachment #8847534 - Flags: feedback?(bas) → review+

Comment 127

•

8 years ago

The actual cause of the new crashes though is likely bug 1345814.

Assignee

Comment 128

•

8 years ago

Attached patch Make AcquireSync RAII. — Details — Splinter Review

Carry r+ from comment 126. The try result looks fine. (Only run the tests on Windows platform since the change is only for Windows and I think we don't run tests on XP now.) [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=0a5baba64a082d67db01c2401dba168fcaff3f75&selectedJob=84520438

Attachment #8847534 - Attachment is obsolete: true

Attachment #8848404 - Flags: review+

Assignee

Updated

•

8 years ago

Keywords: checkin-needed

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 129

•

8 years ago

Pushed by ryanvm@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/9747d8849faa Make AcquireSync RAII. r=bas

Keywords: checkin-needed

Comment 130

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/9747d8849faa

Assignee

Updated

•

8 years ago

Depends on: 1352376

Bruce Sun [:brsun]

Comment 131

•

8 years ago

Attached patch bug1160157_lock_all_or_lock_none.patch (obsolete) — Details — Splinter Review

Hi Kevin, How do you think about this modification?

Attachment #8891165 - Flags: feedback?(kechen)

Bruce Sun [:brsun]

Comment 132

•

8 years ago

Comment on attachment 8891165 [details] [diff] [review] bug1160157_lock_all_or_lock_none.patch Thanks Kevin, it seems this modification is not related to this issue.

Attachment #8891165 - Attachment is obsolete: true

Attachment #8891165 - Flags: feedback?(kechen)