Closed Bug 1207220 Opened 9 years ago Closed 8 years ago

Media has to be fully shut down before gfx

Categories

(Core :: Audio/Video: Playback, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla46
Tracking Status
firefox44 - wontfix
firefox45 + fixed
firefox46 + fixed

People

(Reporter: nical, Assigned: jya)

References

(Blocks 1 open bug)

Details

(Whiteboard: [gfx-noted])

Attachments

(2 files)

Filing this under gfx but it might move to media since I think that it's where most or all of the references to textures are kept until this late.

Basically, anything that holds on to TextureClient objects should let go of all of its references to them at the latest when the event NS_XPCOM_SHUTDOWN_OBSERVER_ID is handled. After that, the shutdown of ipc stuff and threads commences and the gfx-ipc code goes into panic mode if it finds out that some textures are still alive. It tries to force deallocation of remaining textures in a sad and thread-unsafe way, which leads to intermittent leaks and crashes. Even if we managed to handle the thread-safety issue (which would be complicated and expensive), we'd run into other hazards because TextureClients are holding on to platform resources that are probably already shut down. We also can't have gfx ipc systems shut down later because it is already happening just before the destruction of xpcom threads.

Let's make sure textures are deallocated before things get messy, especially if whatever is holding on to textures is living off the main thread.
Whiteboard: [gfx-noted]
Blocks: 1208071
Note this is blocking a beta top crasher.
No longer blocks: 1208071
See Also: → 1208071
Blocks: 1209724
Depends on: 1217135
Chris, most of the bugs that depend/block this bug are related to gfx stuff being called after gfx was shut down. So far it's mostly been related to media.
Assigning to Chris since we talked about it during the workweek, feel free to reassign to whoever makes sense in the media crew.
Assignee: nobody → cpearce
Bug 1217135 is another case of gfx being used during shutdown (beta top crash). The stack traces in that bug are interesting because we see that Media is also shutting down (which generates some gfx activity), just a tad too late.
Blocks: 1217135, 1223193
No longer depends on: 1217135
Component: Graphics → Audio/Video: Playback
Summary: Don't keep TextureClient references around late in the shutdown → Media has to be fully shut down before gfx
Flags: needinfo?(cpearce)
Blocks: 1215195
[Tracking Requested - why for this release]:
This blocks bug 1217135 which has been identified as a high percentage beta 44 crash, and needs to be handled and uplifted - Chris, let me know if the timing or size of this is such that we shouldn't count on it being done in time.
Blocks: 1215265
I will look at this first thing when I get back from PTO on Jan 5.

(In reply to Milan Sreckovic [:milan] from comment #5)
> [Tracking Requested - why for this release]:
> This blocks bug 1217135 which has been identified as a high percentage beta
> 44 crash, and needs to be handled and uplifted - Chris, let me know if the
> timing or size of this is such that we shouldn't count on it being done in
> time.

Where is the gfx stack's XPCOM shutdown observer? The safest quick fix may just be to call MediaShutdownManager::Shutdown() from there. Or we can add a new xpcom-shutdown notification for media that gets dispatched before whatever one gfx is listening on.

Long term both gfx and media should use nsIAsyncShutdown, where we can express the dependency ordering. That I don't feel comfortable uplifting.
(In reply to PTO until Jan 5 - Chris Pearce (:cpearce) from comment #6)
> Where is the gfx stack's XPCOM shutdown observer? The safest quick fix may
> just be to call MediaShutdownManager::Shutdown() from there.

nical?
Flags: needinfo?(nical.bugzilla)
The relevant gfx shutdown happens in gfxPlatform::ShutdownLayersIPC which is called manually in NS_ShutdownXPCOM in XPComInit.cpp. It's between NS_XPCOM_SHUTDOWN_OBSERVER_ID and NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID, it might just be a matter of making sure everything is shut down synchronously in the former.

I agree about the async shutdown stuff althou I think it means that the shutdowns of threads must move as well then (funky things happens if gfx shut down after the threading stuff).
Flags: needinfo?(nical.bugzilla)
MediaShutdownManager is already called before gfx's shutdown since it is registered on NS_XPCOM_SHUTDOWN_OBSERVER_ID which is fired before gfxPlatform::ShutdownLayersIPC
https://dxr.mozilla.org/mozilla-central/rev/388bdc46ba51ee31da8b8abe977e0ca38d117434/dom/media/MediaShutdownManager.cpp#97

I assume there are some media resources that are destroyed asynchronously, or not handled explicitly by the MediaShutdownManager.
Hmmm, I fixed that in bug 1207220, only discover this bug then...

Will mark as duplicate then.
Flags: needinfo?(cpearce)
Copying analysis here so it's all in one place:

"The issue was properly described by Nicolas in bug 1223193.

When shutting down, ShutdownXPCOM is called which notifies its observers with NS_XPCOM_SHUTDOWN_OBSERVER_ID so that they can shutdown.

XPCOM expects these operations to be synchronous.

One of those observers is the MediaShutdownManager which will loop over all its registered MediaDecoders and call MediaDecoder::Shutdown(), expecting things to be synchronous too.

Once the XPCOM observers have been notified , gfxPlatform::ShutdownLayersIPC() is called.

However, MediaDecoder::Shutdown() itself isn't synchronous. Internally it will call MediaDecoderStateMachine::Shutdown() which is async which itself calls MediaDecoderReader::Shutdown() also async.

So when the MediaShutdownManager loop over the registered MediaDecoder, there are still potentially tasks running.

Should those tasks happen to run after gfxPlatform::ShutdownLayersIPC() has been called: we'll typically get some crashes. The gfx team has papered over the issue for years, but fundamentally the issue is that shutting down of the MediaDecoder / MediaShutdownManager is asyncronous.

Ideally we would want to use MozPromise/AllPromiseHolder to continue shutting down XPCOM once all the decoders have completed their shutdown.
However this is likely too big of a task.

so in the mean time, MediaShutdownManager should block until all decoders have fully shutdown."
PCOM when shutting down expects all tasks to be run synchronously. As such, we must ensure that the remaining MediaDecoder are shut down before continuing on the next task.
In particular destroying gfxPlatforms must only ever happen after, as it is possible for the MediaDecoderReader to make use of gfx resources during shutdown.
Attachment #8703505 - Flags: review?(cpearce)
Assignee: cpearce → jyavenard
Blocks: 1217128
Priority: -- → P2
Comment on attachment 8703505 [details] [diff] [review]
Ensure MediaShutdownManager waits until all MediaDecoder have completed their shutdown.

Review of attachment 8703505 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks jya.
Attachment #8703505 - Flags: review?(cpearce) → review+
jya: How comfortable uplifting this patch are you?
Flags: needinfo?(jyavenard)
I don't think the patch can't be immediately applied as the MediaDecoder / MDSM didn't use ShutdownPromise until very recently.

The problem in earlier version also existed of course: instead of using ShutdownPromise it was calling MDSM::BeginShutdown which when done would call back the MediaDecoder when done, so still async.

But otherwise, rather confident. The logic of running the event loop was taken from the GMP shutdown code (thanks to Gerald who pointed me to that code).

will wrap a patch for 44.
Flags: needinfo?(jyavenard)
To uplift to beta (44), we need bug 1230004. Unfortunately that one relies on a lot of existing cleanup in the MDSM/MediaDecoder ; it also unfortunately combined two tasks into on (making MDSM::Shutdown return a promise and total removal of MediaDecoder::mDecoder).

I'll rework the patch to narrows the changes.
Depends on: 1230004
Comment on attachment 8703505 [details] [diff] [review]
Ensure MediaShutdownManager waits until all MediaDecoder have completed their shutdown.

Approval Request Comment
[Feature/regressing bug #]: 1207220
[User impact if declined]: Crashes on shutdown. So far things have been papered over to avoid the crashes, but really without this fix more can still occur.
[Describe test coverage new/current, TreeHerder]: Manual tests. 
[Risks and why]: Low. There's always a risk that it could cause a hang on shutdown for an unforeseen reasons (I can't think of any). We are already performing a similar operation when shutting down GMP so it's very unlikely
[String/UUID change made/needed]: None
Attachment #8703505 - Flags: approval-mozilla-aurora?
Comment on attachment 8704000 [details] [diff] [review]
Ensure MediaShutdownManager waits until all MediaDecoder have completed their shutdown. Beta

Approval Request Comment
[Feature/regressing bug #]: 1207220
[User impact if declined]: Crashes on shutdown (this is the proper fix for bug 1217135 and many others (1223193, 1209724, etc). So far things have been papered over to avoid the crashes, but really without this fix more can still occur.
[Describe test coverage new/current, TreeHerder]: Manual tests. 
[Risks and why]: Low. There's always a risk that it could cause a hang on shutdown for an unforeseen reasons (I can't think of any). We are already performing a similar operation when shutting down GMP so it's very unlikely
[String/UUID change made/needed]: None

Bug 1230004 must be first uplifted to beta (it's in aurora already)
Attachment #8704000 - Flags: approval-mozilla-beta?
https://hg.mozilla.org/mozilla-central/rev/94eef5cd59ef
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Important issue, tracking.
Comment on attachment 8703505 [details] [diff] [review]
Ensure MediaShutdownManager waits until all MediaDecoder have completed their shutdown.

Fix crashes, taking it!
Attachment #8703505 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
It certainly looks like to me from the patches, but can you confirm that these changes are not at all likely to have an impact outside of the shutdown procedure?
(In reply to Milan Sreckovic [:milan] from comment #27)
> It certainly looks like to me from the patches, but can you confirm that
> these changes are not at all likely to have an impact outside of the
> shutdown procedure?

MediaShutdownManager::Shudown is called only when XPCOM shutdown.
Note that if there are no MediaDecoder still alive when we shutdown, then the behaviour will be identical to what was there previously.

Now a change that affects outside the global shutdown procedure is the change to MediaDecoder::Shutdown to return a ShutdownPromise, for other cases, there will be no one actioning on the ShutdownPromise, it will simply go out of scope and behave just like before. We can let it bake in central and now aurora for a little while and see how it pans out
my beta try runs shows intermittent gtest timeout on 10.7 debug. I'm not 100% sure it's related.
It shows that all gtests have completed, only to timeout 1200s later.
 06:03:14     INFO - TEST-PASS | pkixocsp_VerifyEncodedResponse_WithoutResponseBytes/pkixocsp_VerifyEncodedResponse_WithoutResponseBytes.CorrectErrorCode/7 | test completed (time: 0ms)
 06:03:14     INFO - TEST-PASS | GTest unit test: passed
 06:03:14     INFO - Passed: 1901
 06:03:14     INFO - Failed: 0
 06:08:14  WARNING - gtest TEST-UNEXPECTED-FAIL | gtest | timed out after 1200 seconds
 06:08:15     INFO - make[1]: *** [check] Error 1
 06:08:15     INFO - make: *** [check] Error 2
 06:08:15    ERROR - Return code: 2

I have that exact same machine here and will test locally (can't reproduce it on either my desktop nor mac laptop)
Comment on attachment 8704000 [details] [diff] [review]
Ensure MediaShutdownManager waits until all MediaDecoder have completed their shutdown. Beta

A- until the issues mentioned in comment 30 are addressed.
Attachment #8704000 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Ok.. I reproduced the issue on both 10.10 mac, a 10.7 VM and 10.6.8 mac mini (same hardware as the try machines)

It's actually a crash:
[       OK ] ExpirationTracker.main (18492 ms)
[----------] 1 test from ExpirationTracker (18492 ms total)

[----------] 6 tests from PLDHashTableTest
[ RUN      ] PLDHashTableTest.InitCapacityOk
Hit MOZ_CRASH(Initial length is too large) at /Users/jyavenard/Work/Mozilla/mozilla-central/xpcom/glue/PLDHashTable.cpp:186
#01: PLDHashTable::HashShift(unsigned int, unsigned int)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/gtest/XUL +0x5be708]
#02: PLDHashTable::PLDHashTable(PLDHashTableOps const*, unsigned int, unsigned int)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/

So nothing to do with the changes this patch made.

I also got intermittent hang with gtest in 10.7, but even without those patches. So likely a red herring and unrelated to those changes either.
In particular, GeckoMediaPlugins.GMPTestCodec always hang for me (on either 10.7 or 10.6, fine on 10.10)

My level of confidence is rather high for those changes...

I will run more tests tomorrow, but now I have to go and attend a 40th...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ac0f73005d78

Turned out that timeout happened once. And I can not reproduce it anymore on try today. 

So must have been a temporary localised issue unrelated to the patch.
Milan, this is another one that I would be grateful to get your opinion on. I am inclined towards not uplifting this patch or the patch from bug 1230004 in Beta44 given that we are a week away from RC and general stability issues. 

Do you think the risk associated is low and therefore worth uplifting to Beta44? This has only been on central for a week. As such these kinds of fixes (shutdown crash) are good candidates to uplift but the timing of this change in Beta44 cycle is quite worrisome. Please help!
Flags: needinfo?(milan)
Agreed - I understand we want to improve the shutdown experience, but I would not uplift this to beta.
Flags: needinfo?(milan)
(In reply to Milan Sreckovic [:milan] from comment #35)
> Agreed - I understand we want to improve the shutdown experience, but I
> would not uplift this to beta.

Thanks Milan for a super prompt response. :)
See Also: → 1242891
Blocks: 1264293
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: