Discussion of current patches: ### Making PGMPContent + PGMPService refcounted The GMP machinery has had reoccurring issues with racy shutdowns and IPC (like the one in this bug). It makes sense to me to refcount more and more actors involved to avoid these issues. I'm refcounting PGMPContent + PGMPService because I have triggered asserts and crashes related to these specifically while debugging using the testcase in comment 66. I plan to revise https://phabricator.services.mozilla.com/D75178 once bug 1638124 lands as that patch can then be simplified. ### Having GeckoMediaPluginServiceChild block shutdown This has been touched on in comment 59, but I want to make sure my working is sane so I'm going to go into a lot more detail below. [Chris Pearce's blog on EME/GMP](https://blog.pearce.org.nz/2019/06/firefoxs-gecko-media-plugin-eme.html) is a useful reference here, and the diagram of the [GMP architecture](https://1.bp.blogspot.com/-dpM8VMBt7Zc/XRQiLXIGWcI/AAAAAAAAGDg/5dMSe7UGu_Ih8yHPz-d3r6f-VOnQPlDCACLcBGAs/s1600/gmp.png) is something I often gaze at. The GMP architecture is complicated, I personally find it quite difficult to reason about how it functions in a non-ideal-happy path shutdown. By blocking shutdown we can require that the GMP components are in a certain state before progressing. I.e. blocking at xpcom-will-shutdown until we're in a certain happy state makes it much easier to reason about what we need to deal with when we observe xpcom-shutdown-threads. The specific failure case that this bug stems from is if we shutdown while the content process is requesting a GMP process be started to host the CDM. A quick version of what goes wrong: - We call [`GetCDM` in the content process](https://searchfox.org/mozilla-central/rev/9193635dca8cfdcb68f114306194ffc860456044/dom/media/gmp/ChromiumCDMProxy.cpp#86-87) because a site is using the `MediaKeys` interface. - This will call into [`GetContentParent` in the GMPServiceChild](https://searchfox.org/mozilla-central/rev/9193635dca8cfdcb68f114306194ffc860456044/dom/media/gmp/GMPServiceChild.cpp#126). - This begins the process of requesting a GMP process be started by the parent process. - Ideally the parent then passes back an endpoint to content, which will then [bind that endpoint](https://searchfox.org/mozilla-central/rev/9193635dca8cfdcb68f114306194ffc860456044/dom/media/gmp/GMPServiceChild.cpp#428) and allow communication directly from content to the GMP process. - We shutdown while the above is happening, in the content process we move through shutdown steps and observe xpcom-shutdown-threads so we [shutdown/kill the GMP thread](https://searchfox.org/mozilla-central/rev/9193635dca8cfdcb68f114306194ffc860456044/dom/media/gmp/GMPServiceChild.cpp#359). - We run whatever remaining tasks we have queued on the GMP thread, and end up in a state where we'd need to run yet more tasks on the GMP thread to not leak some of our objects. However, we're out of luck, because our thread is dead and we're shutting down, so we trigger LSAN (seen prior to the backout of my first patch). The parent process actor for the GMPService [has code to block shutdown](https://searchfox.org/mozilla-central/rev/9193635dca8cfdcb68f114306194ffc860456044/dom/media/gmp/GMPServiceParent.cpp#1501) and prevent itself shutting down too early, but the content process actor does not. My patch tries to change that so content will block shutdown until the GMP process associated with it is no longer in use. I'm hoping this results in the following shutdown dependencies: - GMPServiceChild will not shutdown until its GMPContentParents are ready for shutdown. I.e. GMP machinery will block shutdown in the content process until GMP processes are happy. - GMPServiceParent will not shutdown its GMPServiceChildren are ready for shutdown. I.e. GMP machinery will block shutdown in the parent process until the content processes using GMP are happy. This should already be true prior to my patch. This ideally results in a more deterministic shutdown of the machinery involved that avoids bugs like this one. The current patch has comments + logs + asserts I'd like to land in an ideal world, but can yank them for sec and landing and return them later. --- I'm happy to provide further details to the above and/or on how I debugged this to arrive at my conclusions. The testcase above + MOZ_LOG=EME:5,GMP:5 + an LSAN build and connecting different leaks and crashes with logging output where my primary tools. I'm looking for a reviewer for the last patch. I'm not certain we have any suitable media folks ATM. Nika, would you feel comfortable reviewing https://phabricator.services.mozilla.com/D75180 or have any suggestions as to a suitable reviewer?
Bug 1551615 Comment 69 Edit History
Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.
Discussion of current patches: ### Making PGMPContent + PGMPService refcounted The GMP machinery has had reoccurring issues with racy shutdowns and IPC (like the one in this bug). It makes sense to me to refcount more and more actors involved to avoid these issues. I'm refcounting PGMPContent + PGMPService because I have triggered asserts and crashes related to these specifically while debugging using the testcase in comment 66. I plan to revise https://phabricator.services.mozilla.com/D75178 once bug 1638124 lands as that patch can then be simplified. ### Having GeckoMediaPluginServiceChild block shutdown This has been touched on in comment 59, but I want to make sure my working is sane so I'm going to go into a lot more detail below. [Chris Pearce's blog on EME/GMP](https://blog.pearce.org.nz/2019/06/firefoxs-gecko-media-plugin-eme.html) is a useful reference here, and the diagram of the [GMP architecture](https://1.bp.blogspot.com/-dpM8VMBt7Zc/XRQiLXIGWcI/AAAAAAAAGDg/5dMSe7UGu_Ih8yHPz-d3r6f-VOnQPlDCACLcBGAs/s1600/gmp.png) is something I often gaze at. The GMP architecture is complicated, I personally find it quite difficult to reason about how it functions in a non-ideal-happy path shutdown. By blocking shutdown we can require that the GMP components are in a certain state before progressing. I.e. blocking at xpcom-will-shutdown until we're in a certain happy state makes it much easier to reason about what we need to deal with when we observe xpcom-shutdown-threads. The specific failure case that this bug stems from is if we shutdown while the content process is requesting a GMP process be started to host the CDM. A quick version of what goes wrong: - We call [`GetCDM` in the content process](https://searchfox.org/mozilla-central/rev/9193635dca8cfdcb68f114306194ffc860456044/dom/media/gmp/ChromiumCDMProxy.cpp#86-87) because a site is using the `MediaKeys` interface. - This will call into [`GetContentParent` in the GMPServiceChild](https://searchfox.org/mozilla-central/rev/9193635dca8cfdcb68f114306194ffc860456044/dom/media/gmp/GMPServiceChild.cpp#126). - This begins the process of requesting a GMP process be started by the parent process. - Ideally the parent then passes back an endpoint to content, which will then [bind that endpoint](https://searchfox.org/mozilla-central/rev/9193635dca8cfdcb68f114306194ffc860456044/dom/media/gmp/GMPServiceChild.cpp#428) and allow communication directly from content to the GMP process. - We shutdown while the above is happening, in the content process we move through shutdown steps and observe xpcom-shutdown-threads so we [shutdown/kill the GMP thread](https://searchfox.org/mozilla-central/rev/9193635dca8cfdcb68f114306194ffc860456044/dom/media/gmp/GMPServiceChild.cpp#359). - We run whatever remaining tasks we have queued on the GMP thread, and end up in a state where we'd need to run yet more tasks on the GMP thread to not leak some of our objects. However, we're out of luck, because our thread is dead and we're shutting down, so we trigger LSAN (seen prior to the backout of my first patch). The parent process actor for the GMPService [has code to block shutdown](https://searchfox.org/mozilla-central/rev/9193635dca8cfdcb68f114306194ffc860456044/dom/media/gmp/GMPServiceParent.cpp#1501) and prevent itself shutting down too early, but the content process actor does not. My patch tries to change that so content will block shutdown until the GMP process associated with it is no longer in use. I'm hoping this results in the following shutdown dependencies: - GMPServiceChild will not shutdown until its GMPContentParents are ready for shutdown. I.e. GMP machinery will block shutdown in the content process until GMP processes are happy. - GMPServiceParent will not shutdown its GMPServiceChildren are ready for shutdown. I.e. GMP machinery will block shutdown in the parent process until the content processes using GMP are happy. This should already be true prior to my patch. This ideally results in a more deterministic shutdown of the machinery involved that avoids bugs like this one. The current patch has comments + logs + asserts I'd like to land in an ideal world, but can yank them for sec and landing and return them later. --- I'm happy to provide further details to the above and/or on how I debugged this to arrive at my conclusions. The testcase above + MOZ_LOG=EME:5,GMP:5 + an LSAN build and connecting different leaks and crashes with logging output were my primary tools. I'm looking for a reviewer for the last patch. I'm not certain we have any suitable media folks ATM. Nika, would you feel comfortable reviewing https://phabricator.services.mozilla.com/D75180 or have any suggestions as to a suitable reviewer?