Closed Bug 1374164 Opened 7 years ago Closed 7 years ago

Intermittent Main app process exited normally | application crashed [@ mozilla::gfx::Log<1, mozilla::gfx::CriticalLogger>::~Log]

Categories

(Core :: Graphics, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox-esr52 --- unaffected
firefox55 --- unaffected
firefox56 --- fixed
firefox57 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: jesup)

References

Details

(Keywords: crash, intermittent-failure, Whiteboard: [stockwell fixed:product])

Crash Data

Attachments

(2 files, 2 obsolete files)

[task 2017-06-18T19:35:58.600574Z] 19:35:58     INFO - GECKO(2930) | Assertion failure: [GFX1]: Texture deallocated too late during shutdown, at /home/worker/workspace/build/src/gfx/2d/Logging.h:519
[task 2017-06-18T19:36:31.027046Z] 19:36:31     INFO - GECKO(2930) | #01: mozilla::layers::DeallocateTextureClient [gfx/layers/client/TextureClient.cpp:326]
[task 2017-06-18T19:36:31.028023Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.028093Z] 19:36:31     INFO - GECKO(2930) | #02: mozilla::layers::TextureClient::Destroy [gfx/layers/client/TextureClient.cpp:423]
[task 2017-06-18T19:36:31.028707Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.029412Z] 19:36:31     INFO - GECKO(2930) | #03: mozilla::layers::TextureClient::~TextureClient [gfx/layers/client/TextureClient.cpp:601]
[task 2017-06-18T19:36:31.029959Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.030512Z] 19:36:31     INFO - GECKO(2930) | #04: mozilla::layers::TextureClient::~TextureClient [gfx/layers/client/TextureClient.cpp:605]
[task 2017-06-18T19:36:31.031038Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.031589Z] 19:36:31     INFO - GECKO(2930) | #05: mozilla::AtomicRefCountedWithFinalize<mozilla::layers::TextureClient>::Release [gfx/layers/AtomicRefCountedWithFinalize.h:138]
[task 2017-06-18T19:36:31.032138Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.032742Z] 19:36:31     INFO - GECKO(2930) | #06: mozilla::layers::TextureClientHolder::Release [memory/mozalloc/mozalloc.h:218]
[task 2017-06-18T19:36:31.032827Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.033400Z] 19:36:31     INFO - GECKO(2930) | #07: mozilla::layers::TextureClientRecycleAllocator::Destroy [gcc/include/c++/4.9.4/bits/stl_stack.h:218]
[task 2017-06-18T19:36:31.033943Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.034497Z] 19:36:31     INFO - GECKO(2930) | #08: mozilla::layers::CompositableClient::Destroy [gfx/layers/client/CompositableClient.cpp:85]
[task 2017-06-18T19:36:31.034669Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.035345Z] 19:36:31     INFO - GECKO(2930) | #09: mozilla::layers::CompositableClient::~CompositableClient [mfbt/RefPtr.h:77]
[task 2017-06-18T19:36:31.036881Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.037591Z] 19:36:31     INFO - GECKO(2930) | #10: mozilla::layers::ImageClientSingle::~ImageClientSingle [gfx/layers/client/ImageClient.h:87]
[task 2017-06-18T19:36:31.038136Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.039653Z] 19:36:31     INFO - GECKO(2930) | #11: mozilla::layers::CompositableClient::Release [gfx/layers/client/CompositableClient.h:75]
[task 2017-06-18T19:36:31.040193Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.041120Z] 19:36:31     INFO - GECKO(2930) | #12: mozilla::layers::ImageContainer::~ImageContainer [gfx/layers/ImageContainer.cpp:183]
[task 2017-06-18T19:36:31.041697Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.042368Z] 19:36:31     INFO - GECKO(2930) | #13: RefPtr<mozilla::layers::ImageContainer>::~RefPtr [memory/mozalloc/mozalloc.h:218]
[task 2017-06-18T19:36:31.042905Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.043777Z] 19:36:31     INFO - GECKO(2930) | #14: mozilla::MediaEngineCameraVideoSource::~MediaEngineCameraVideoSource [dom/media/webrtc/MediaEngineCameraVideoSource.h:75]
[task 2017-06-18T19:36:31.044408Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.045265Z] 19:36:31     INFO - GECKO(2930) | #15: mozilla::MediaEngineRemoteVideoSource::~MediaEngineRemoteVideoSource [dom/media/webrtc/MediaEngineRemoteVideoSource.h:98]
[task 2017-06-18T19:36:31.045883Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.046789Z] 19:36:31     INFO - GECKO(2930) | #16: mozilla::MediaEngineRemoteVideoSource::Release [dom/media/webrtc/MediaEngineRemoteVideoSource.cpp:28]
[task 2017-06-18T19:36:31.047399Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.048459Z] 19:36:31     INFO - GECKO(2930) | #17: mozilla::MediaDevice::~MediaDevice [xpcom/string/nsTString.h:20]
[task 2017-06-18T19:36:31.049133Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.050080Z] 19:36:31     INFO - GECKO(2930) | #18: mozilla::VideoDevice::~VideoDevice [dom/media/MediaManager.h:112]
[task 2017-06-18T19:36:31.050801Z] 19:36:31     INFO - 
[task 2017-06-18T19:36:31.051666Z] 19:36:31     INFO - GECKO(2930) | #19: mozilla::MediaDevice::Release [dom/media/MediaManager.cpp:704]
[task 2017-06-18T19:36:31.052364Z] 19:36:31     INFO -
Blake, it seems like the resource that held on media side was deallocated after shutdown. Can you have someone to check it?
Flags: needinfo?(bwu)
Ni Randell since this is related to WebRTC.
Flags: needinfo?(bwu) → needinfo?(rjesup)
Sigh....

Something is holding a MediaDevice from some object in a CC situation it appears, and the "final-cc" is after we shut down gfx (which causes a lot of problems, btw).


#18: mozilla::VideoDevice::~VideoDevice [dom/media/MediaManager.h:112]
#19: mozilla::MediaDevice::Release [dom/media/MediaManager.cpp:704]
<called from CC code in final shutdown - nsCycleCollector::Shutdown>

Given I don't think we can know/stop people from doing this, we'll need to neuter all video sources on Shutdown().
Flags: needinfo?(rjesup)
Speculative fix; we know somehow mImage isn't null at final-cc here.
Attachment #8879633 - Flags: review?(padenot)
Assignee: nobody → rjesup
Status: NEW → ASSIGNED
Rank: 15
Priority: -- → P1
Attachment #8879633 - Flags: review?(padenot) → review+
this is linux debug in mochitest-media-e10s tests.  In the last week this has doubled in the number of failures and we should be taking a look at this.

There is a patch with r+, can we just land this?
Flags: needinfo?(padenot)
Whiteboard: [stockwell needswork]
FWIW I seem able to provoke this same assert quite reliably with my local patch set for bug 1088621:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=08d27ef79fc105be8711d300cc5e738dd41ccbd6&selectedJob=111305606

I then added a fix that seemed to help avoid late refs messing up GC for that particular case:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=23b79e7e8b76b8b17d87f59cfea2f5427c7be3bc&selectedJob=111396466

I'll try to rebase those and attach them to a new bug and link it here, in case it's helpful.
clearing NI, per comment 11, jib will handle this.
Flags: needinfo?(padenot)
See Also: → 1339568
(In reply to Paul Adenot (:padenot) from comment #12)
> clearing NI, per comment 11, jib will handle this.

Much thanks for jib for work on bug 1382095, but failures continue, and may have actually increased in frequency since that change.
See Also: → 1347625, 1384125
:jesup - Are you working on this? Making progress? This is one of our most frequent failures.
Flags: needinfo?(rjesup)
(In reply to Geoff Brown [:gbrown] from comment #25)
> :jesup - Are you working on this? Making progress? This is one of our most
> frequent failures.

It's a speculative fix.... and bandaid on something that shouldn't be happening.  I'd like to at least put out a warning when we hit the "bad" case.  We still should never be getting to this point.

Pushing a Try before landing
jib: I just added the 'Unused << NS_WARN_IF...', because I don't want to totally hide this (but want to resolve the orange since it causes pain)
Attachment #8896962 - Flags: review?(jib)
Attachment #8879633 - Attachment is obsolete: true
Attachment #8896962 - Flags: review?(jib) → review+
Pushed by rjesup@wgate.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3f44f13e8021
guarantee mImage is released by video sources before final-cc shutdown r=padenot,jib
Pushed the wallpaper patch... This doesn't really fix the bug, but demotes it to a warning.

Leave-open
Assignee: rjesup → nobody
Severity: critical → normal
Status: ASSIGNED → NEW
Rank: 15 → 18
Flags: needinfo?(rjesup)
Keywords: leave-open
Whiteboard: [stockwell needswork]
Thanks! That will clear up a lot of orange on treeherder.
This is still happening. Is there anything preventing an actual investigation of this? It seems to happen frequently enough on the linux64-qr M-e10s(mda2) job that it should be debuggable via try pushes even if you can't reproduce locally.
Yeah, it looks like the wallpaper patch had no effect on failure frequency. This remains the #1 most frequent test failure.
Whiteboard: [stockwell needswork]
(In reply to Geoff Brown [:gbrown] (pto Aug 19 - Aug 27) from comment #38)
> Yeah, it looks like the wallpaper patch had no effect on failure frequency.
> This remains the #1 most frequent test failure.

AHA!  this is failing on mImageContainer not being nulled, not mImage!  Easy fix.
Assignee: nobody → rjesup
Status: NEW → ASSIGNED
Comment on attachment 8898281 [details] [diff] [review]
Clear mImageContainer as well as mImage when stopping a gUM stream

Review of attachment 8898281 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/webrtc/MediaEngineCameraVideoSource.h
@@ -71,5 @@
> -    // somehow.) (Bug 1374164)
> -
> -    Unused << NS_WARN_IF(mImage);
> -
> -    mImage = nullptr;

I see you're backing off of clearing mImage here. Note that this is a different mImage from that in MediaEngineDefault.h. See bug 1391278.

I think we should still clear mImage and mImageContainer here. This is a hard requirement of the gfx code, and I don't know that we can guarantee that MediaEngine references are not held too long in all cases, or that stop will always have been called (may be the case).

Can we do both, MOZ_ASSERT AND clear?
Attachment #8898281 - Flags: review?(jib) → review-
Comment on attachment 8898281 [details] [diff] [review]
Clear mImageContainer as well as mImage when stopping a gUM stream

From #media, we'd "rather let it be caught by the gfx check than hide it by clearing it", so switching to r+
Attachment #8898281 - Flags: review- → review+
Pushed by rjesup@wgate.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e9e36f27b698
Clear mImageContainer as well as mImage when stopping a gUM stream r=jib
Backed out for failing e.g. mda's test_getUserMedia_mediaElementCapture_tracks.html and browser-chrome's browser_devices_get_user_media_unprompted_access.js:

https://hg.mozilla.org/integration/mozilla-inbound/rev/806d980dba310e653175973bd7f7e1c31b73c2d1

A push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=05358f0058a7a711dce865ea74af6ff451686657&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable

Failure log mda: https://treeherder.mozilla.org/logviewer.html#?job_id=123953587&repo=mozilla-inbound
GECKO(3163) | Assertion failure: mRawPtr != nullptr (You can't dereference a NULL RefPtr with operator->().), at /home/worker/workspace/build/src/obj-firefox/dist/include/mozilla/RefPtr.h:315
GECKO(3163) | Assertion failure: mDestroyed, at /home/worker/workspace/build/src/gfx/layers/composite/ImageLayerComposite.cpp:44
TEST-UNEXPECTED-FAIL | dom/media/tests/mochitest/test_getUserMedia_mediaElementCapture_tracks.html | application terminated with exit code 11

Failure log bc: https://treeherder.mozilla.org/logviewer.html#?job_id=123953681&repo=mozilla-inbound
GECKO(1734) | Assertion failure: mRawPtr != nullptr (You can't dereference a NULL RefPtr with operator->().), at /home/worker/workspace/build/src/obj-firefox/dist/include/mozilla/RefPtr.h:315
TEST-UNEXPECTED-FAIL | browser/base/content/test/webrtc/browser_devices_get_user_media_unprompted_access.js | Test timed out -
Flags: needinfo?(rjesup)
Backout by kwierso@gmail.com:
https://hg.mozilla.org/mozilla-central/rev/6613ef541bb1
Backed out changeset e9e36f27b698 for failing e.g. mda's test_getUserMedia_mediaElementCapture_tracks.html and browser-chrome's browser_devices_get_user_media_unprompted_access.js. r=backout
Also removes hiding of mImage/mImageContainer/mMonitor by
MediaEngineDefault, general cleanup of MediaEngineDefault shutdown, and
fixing locking around access to protected members.
Attachment #8898834 - Flags: review?(jib)
Comment on attachment 8898834 [details] [diff] [review]
Clear mImageContainer as well as mImage when stopping a gUM stream

Review of attachment 8898834 [details] [diff] [review]:
-----------------------------------------------------------------

Lgtm. One question on whether we need to lock twice.

::: dom/media/webrtc/MediaEngineRemoteVideoSource.cpp
@@ +435,2 @@
>    // implicitly releases last image
>    mImage = image.forget();

Just wondering if there are ways to avoid locking twice here. Since borh image and mImage are refptrs, couldn't we set mImage ahead of CopyData instead under the first lock? I may not understand the guarantees here, but wanted to point that out, as a potential optimization.
Attachment #8898834 - Flags: review?(jib) → review+
> ::: dom/media/webrtc/MediaEngineRemoteVideoSource.cpp
> @@ +435,2 @@
> >    // implicitly releases last image
> >    mImage = image.forget();
> 
> Just wondering if there are ways to avoid locking twice here. Since borh
> image and mImage are refptrs, couldn't we set mImage ahead of CopyData
> instead under the first lock? I may not understand the guarantees here, but
> wanted to point that out, as a potential optimization.

I'm locking twice to avoid holding it on a large memcpy/etc.  mImage needs a lock to access/modify, since it's touched from multiple threads.  If we copy data outside the lock, after setting mImage, then the MSG may come along and grab mImage and send to the compositor before we've finished the copy.
Flags: needinfo?(rjesup)
Pushed by rjesup@wgate.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4d6210358c49
Clear mImageContainer as well as mImage when stopping a gUM stream r=jib
Whiteboard: [stockwell needswork] → [stockwell fixed:product]
:jesup, this has so many failures in the last week, almost all on linux64-qr, but we still see issues on regular linux- can you look into this some more, or should we file a new bug?
Flags: needinfo?(rjesup)
Whiteboard: [stockwell fixed:product] → [stockwell needswork]
Seems fixed to me, the last few on basstacks are on builds without the fix.
oh, thanks for pointing out my oversight!  After the 18th (when the fix was merged everywhere) we just have a few spots on try/beta.  sorry for the confusion
Flags: needinfo?(rjesup)
Whiteboard: [stockwell needswork] → [stockwell fixed:product]
Attachment #8898281 - Attachment is obsolete: true
This grafts cleanly to Beta and OrangeFactor seems to indicate that the latest fix worked. Please nominate these patches for Beta approval when you get a chance.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: needinfo?(rjesup)
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Comment on attachment 8898834 [details] [diff] [review]
Clear mImageContainer as well as mImage when stopping a gUM stream

Approval Request Comment
[Feature/Bug causing the regression]: Timing changes especially with QuantumRender

[User impact if declined]: Oranges in automation, possible shutdown crashes

[Is this code covered by automated tests?]: yes

[Has the fix been verified in Nightly?]: yes

[Needs manual test from QE? If yes, steps to reproduce]:  no

[List of other uplifts needed for the feature/fix]: none

[Is the change risky?]: no

[Why is the change risky/not risky?]: simple changes it add locking, and avoid shadowing the monitor

[String changes made/needed]: none
Flags: needinfo?(rjesup)
Attachment #8898834 - Flags: approval-mozilla-beta?
Comment on attachment 8898834 [details] [diff] [review]
Clear mImageContainer as well as mImage when stopping a gUM stream

Possible crash fix, should help with intermittent orange issues in tests.
Please uplift for tomorrow's beta 6 build.
Attachment #8898834 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Randell Jesup [:jesup] from comment #61)
> [Is this code covered by automated tests?]: yes
> 
> [Has the fix been verified in Nightly?]: yes
> 
> [Needs manual test from QE? If yes, steps to reproduce]:  no

Setting qe-verify- based on Randell's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
You need to log in before you can comment on or make changes to this bug.