(In reply to Jens Stutte [:jstutte] from comment #7)
Actually, this is not enough for this flavor of crashes. Thanks to :smaug, we understood that the current implementation of
ContentProcessManager::GetSingleton seems infallible but is not, as
ClearOnShutdown deletes the CPM immediately if we are already beyond the shutdown phase.
As I previously mentioned on Matrix, I was quite surprised that we were clearing
nsThreadManager::Shutdown(), which is probably what is leading to this issue. It is possible for us to perform some cleanup steps during
nsThreadManager::Shutdown() due to shutdown tasks, and we probably don't want to be running those after
I think a longer term solution here might be to partially-revert the changes in bug 1637890, and move
XPCOMShutdownFinal so that it happens after the bulk of work in
nsThreadManager::Shutdown. We will unfortunately probably need to split
nsThreadManager::Shutdown into two parts to keep the main thread accepting events until after
XPCOMShutdownFinal, but ideally we wouldn't need to anymore. Doing this would allow us to not have to worry as much about this manager object being destroyed before some IPDL actors are destroyed, as all IPDL actors are destroyed during
nsThreadManager::Shutdown due to their corresponding event targets dying. I'll file a separate bug for that.
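For anyone following along, here is a rough, self-contained model of the GetSingleton behaviour described in comment #7. It is only a sketch: everything in the `toy` namespace (`Phase`, `ClearOnShutdown`, `AdvanceTo`, `Manager`, `LateLookupReturnsNull`) is a hypothetical stand-in and not the real Gecko code. The one behaviour it assumes is the one stated above: registering a pointer for clearing after its phase has already passed clears it immediately.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

namespace toy {  // Hypothetical stand-ins, not the real Gecko types.

enum class Phase { Running = 0, ShutdownFinal = 1 };

Phase gPhase = Phase::Running;
std::vector<std::function<void()>> gClearers;

// Toy version of a clear-on-shutdown helper. If the requested phase has
// already been reached, the slot is cleared right away instead of being
// registered for later.
template <typename T>
void ClearOnShutdown(std::unique_ptr<T>* aSlot, Phase aPhase) {
  if (gPhase >= aPhase) {
    aSlot->reset();
    return;
  }
  gClearers.push_back([aSlot] { aSlot->reset(); });
}

// Enter a phase and run every clearer registered so far.
void AdvanceTo(Phase aPhase) {
  gPhase = aPhase;
  for (auto& clear : gClearers) {
    clear();
  }
  gClearers.clear();
}

class Manager {
 public:
  // Looks infallible, but returns nullptr once ShutdownFinal has passed:
  // the new instance is created and then immediately cleared again.
  static Manager* GetSingleton() {
    if (!sInstance) {
      sInstance.reset(new Manager());
      ClearOnShutdown(&sInstance, Phase::ShutdownFinal);
    }
    return sInstance.get();
  }

 private:
  Manager() = default;
  static std::unique_ptr<Manager> sInstance;
};

std::unique_ptr<Manager> Manager::sInstance;

// Exercises the early and late lookups; true if only the late one is null.
bool LateLookupReturnsNull() {
  gPhase = Phase::Running;
  gClearers.clear();
  Manager* early = Manager::GetSingleton();  // created and registered
  AdvanceTo(Phase::ShutdownFinal);           // singleton is cleared here
  Manager* late = Manager::GetSingleton();   // created, then cleared at once
  return early != nullptr && late == nullptr;
}

}  // namespace toy
```

In this model, any caller that runs after the clearing phase has to null-check the result. That is why the ordering of the clearing phase relative to actor teardown matters.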
(In reply to Jens Stutte [:jstutte] from comment #5)
Yes, we should land and uplift the patch on bug 1761182 to avoid those exact crashes, at least. Please note that the underlying problem should be mitigated better by the patch on bug 1632740, but that is not yet ready. Still, this would just handle late content process creation better (effectively refusing it).
Nevertheless, it would be interesting to know why this is spiking now. One patch that might influence the number and timing of content processes that start could be bug 1728332? The other patch that seemed to make it more likely to hit this situation was bug 1738103. Not sure if it is an option to back out either of them, though?
Edit: I think bug 1738103 is kind of a different way to mitigate what might be mitigated better by the patch on bug 1632740 (or other ways to 100% avoid late content process creation)?
bug 1738103 is a general fix to try to avoid having actors running after threads have been killed. It's partially a helper patch to make it easier to implement actors like
PBackground which are bound to a thread or TaskQueue and should be shut down when that thread or task queue is shut down. It's possible that it led to this surge in crashes, as it means that we actually touch any actors which are alive during XPCOM Threads shutdown. Previously, if we got that late, we'd probably end up leaking them instead, as the other place we'd try to touch them would be during I/O thread shutdown, and the actor shutdown message dispatches would be leaked instead of being run.