Crash in [@ PLDHashTable::Search | mozilla::dom::BrowserParent::ActorDestroy]
Categories
(Core :: DOM: Content Processes, defect, P1)
Tracking
()
People
(Reporter: gsvelto, Assigned: jstutte)
References
Details
(Keywords: crash, topcrash, topcrash-thunderbird)
Crash Data
Attachments
(1 file)
48 bytes,
text/x-phabricator-request
|
diannaS
:
approval-mozilla-beta+
|
Details | Review |
Crash report: https://crash-stats.mozilla.org/report/index/a0e7c5d3-74fb-4256-b4e6-c4bf50210304
Reason: EXCEPTION_ACCESS_VIOLATION_READ
Top 10 frames of crashing thread:
0 xul.dll PLDHashTable::Search const xpcom/ds/PLDHashTable.cpp:492
1 xul.dll mozilla::dom::BrowserParent::ActorDestroy dom/ipc/BrowserParent.cpp:681
2 xul.dll mozilla::ipc::IProtocol::DestroySubtree ipc/glue/ProtocolUtils.cpp:603
3 xul.dll mozilla::ipc::IProtocol::DestroySubtree ipc/glue/ProtocolUtils.cpp:591
4 xul.dll mozilla::dom::PContentChild::OnChannelError ipc/ipdl/PContentParent.cpp:16201
5 xul.dll mozilla::dom::ContentParent::OnChannelError dom/ipc/ContentParent.cpp:1950
6 xul.dll mozilla::detail::RunnableMethodImpl<RefPtr<mozilla::dom::WorkerListener>, void xpcom/threads/nsThreadUtils.h:1201
7 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:741
8 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1200
9 xul.dll NS_ProcessPendingEvents xpcom/threads/nsThreadUtils.cpp:496
If I'm reading the trace correctly we should be crashing here becaue mBrowserParentMap
is null. Some comments in the crash reports suggest this happens on shutdown but many others suggests it happens during normal browsing.
Comment 1•2 years ago
|
||
It looks like we're starting a new content process sometime after the async shutdown blocker for ContentParent has been cleared (which is in profile-before-change
or xpcom-shutdown
, whichever happens first), but before the ContentProcessManager
has been destroyed (which occurs during late shutdown due to a ClearOnShutdown
).
In this case we can have a launching process still after the ClearOnShutdown
observer has fired, because we didn't correctly stop process launching after the async shutdown blocker was blocked, but instead based on whether ContentProcessManager still exists (https://searchfox.org/mozilla-central/rev/9bf82ef9c097ee6af0e34a1d21c073b2616cc438/dom/ipc/ContentParent.cpp#2529).
We should instead have the decision be made based on whether the async shutdown blocker is still registered.
Comment 2•2 years ago
|
||
Crash volume peaked in 84 and 85, but may be declining in 86?
P2 S3
Comment 3•1 year ago
|
||
Here's a recent crash: bp-8b02a025-96f1-4b0c-83be-f632f0220324
This looks similar to bug 1761182, in that we're destroying a parent process actor during thread manager shutdown, so some things actor destroy expects have been nulled out.
Updated•1 year ago
|
Comment 4•1 year ago
|
||
[Tracking Requested - why for this release]: This is showing up on beta.
Assignee | ||
Comment 5•1 year ago
•
|
||
(In reply to Andrew McCreight [:mccr8] from comment #3)
This looks similar to bug 1761182, in that we're destroying a parent process actor during thread manager shutdown, so some things actor destroy expects have been nulled out.
Yes, we should land and uplift the patch on bug 1761182 to avoid those exact crashes, at least. Please note that the underlying problem should be mitigated better by the patch on bug 1632740, but that is not yet ready. Still this would just handle better (refuse effectively) late content process creation.
Nevertheless it would be interesting, why this is spiking now. One patch that might influence the number and timing of content processes that start could be bug 1728332 ? The other patch that seemed to make it more likely to hit this situation seemed to be bug 1738103. Not sure if it is an option to backout any of them, though?
Edit: I think bug 1738103 is kind of a different way to mitigate what might be mitigated better by the patch on bug 1632740 (or other ways to 100% avoid late content process creation) ?
Comment 6•1 year ago
|
||
[Tracking Requested - why for this release]:
#1 crasher on beta with 13% of our crashes and we are still at 50% rollout only, probably a release blocker (higher volume even than bug 1761182)
Assignee | ||
Comment 7•1 year ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #5)
Yes, we should land and uplift the patch on bug 1761182 to avoid those exact crashes, at least.
Actually this is not enough for this flavor of crashes here. Thanks to :smaug we understood, that the current implementation of ContentProcessManager::GetSingleton
seems infallible but it is not, as ClearOnShutdown
deletes the CPM immediately if beyond the shutdown phase.
So to avoid this crash we need to always null-check the CPM singleton.
Assignee | ||
Comment 8•1 year ago
|
||
Updated•1 year ago
|
Assignee | ||
Comment 9•1 year ago
|
||
Recap:
Landing this patch together with bug 1761182 should help to avoid these crashes.
The rest of comment 5 is still valid.
Updated•1 year ago
|
Comment 10•1 year ago
|
||
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2531ddbeeffa Always null check the ContentProcessManager singleton before use. r=smaug
Comment 11•1 year ago
|
||
Backed out for causing multiple failures (bc, dt, xpcshell, wpt, gl)
Failure log 1 // Failure log 2 // Failure log 3 // Failure log 4 // Failure log 5 // Failure log 6
Assignee | ||
Comment 12•1 year ago
|
||
OK, there was a really stupid oversight, sorry. Let's try this: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8aa7821055de85f8fd22d17d34220ba6d6506613
Comment 13•1 year ago
|
||
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/be43f0901409 Always null check the ContentProcessManager singleton before use. r=smaug
Comment 14•1 year ago
|
||
bugherder |
Comment 15•1 year ago
|
||
:jstutte can you request an uplift for this patch as well?
Assignee | ||
Comment 16•1 year ago
|
||
Comment on attachment 9271292 [details]
Bug 1696771: Always null check the ContentProcessManager singleton before use. r?smaug
Beta/Release Uplift Approval Request
- User impact if declined: This is a longstanding possible crash but the spike in numbers is a symptom of some order changes of events during shutdown and would bite users quite often (topcrash).
- Is this code covered by automated tests?: Unknown
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This patch just adds appropriate null checks. In some cases this can lead to an unexpected error return value, but that's still preferable over a main process crash.
- String changes made/needed:
Comment 17•1 year ago
|
||
Comment on attachment 9271292 [details]
Bug 1696771: Always null check the ContentProcessManager singleton before use. r?smaug
Approved for 100.0b5
Comment 18•1 year ago
|
||
bugherder uplift |
Comment 19•1 year ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #7)
Actually this is not enough for this flavor of crashes here. Thanks to :smaug we understood, that the current implementation of
ContentProcessManager::GetSingleton
seems infallible but it is not, asClearOnShutdown
deletes the CPM immediately if beyond the shutdown phase.
As I previously mentioned on Matrix, I was quite surprised that we were clearing XPCOMShutdownFinal
before nsThreadManager::Shutdown()
, which is probably what is leading to this issue. It is possible for us to perform some cleanup steps during nsThreadManager::Shutdown()
due to shutdown tasks, and we probably don't want to be running those after XPCOMShutdownFinal
.
I think a longer term solution here might be to partially-revert the changes in bug 1637890, and move XPCOMShutdownFinal
so that it happens after the bulk of work in nsThreadManager::Shutdown
. We will unfortunately probably need to split nsThreadManager::Shutdown
into two parts to keep the main thread accepting events until after XPCOMShutdownFinal
, but ideally we wouldn't need to anymore. Doing this would allow us to not have to worry as much about this manager object being destroyed before some IPDL actors are destroyed, as all IPDL actors are destroyed during nsThreadManager::Shutdown
due to their corresponding event targets dying. I'll file a seperate bug for that.
(In reply to Jens Stutte [:jstutte] from comment #5)
Yes, we should land and uplift the patch on bug 1761182 to avoid those exact crashes, at least. Please note that the underlying problem should be mitigated better by the patch on bug 1632740, but that is not yet ready. Still this would just handle better (refuse effectively) late content process creation.
Nevertheless it would be interesting, why this is spiking now. One patch that might influence the number and timing of content processes that start could be bug 1728332 ? The other patch that seemed to make it more likely to hit this situation seemed to be bug 1738103. Not sure if it is an option to backout any of them, though?
Edit: I think bug 1738103 is kind of a different way to mitigate what might be mitigated better by the patch on bug 1632740 (or other ways to 100% avoid late content process creation) ?
bug 1738103 is a general fix to try to avoid having actors running after threads have been killed. It's partially a helper patch to make it easier to implement actors like PBackground
which are bound to a thread or TaskQueue and should be shut down when that thread or task queue are shut down. It's possible that it lead to this surge in crashes as it means that we actually touch any actors which are alive during XPCOM Threads shutdown, when previously if we got that late we'd probably end up leaking them instead, as the other place we'd try to touch them would be during I/O thread shutdown, and the actor shutdown message dispatches would be leaked instead of being run.
Comment 20•1 year ago
|
||
Was #5 crash for Thunderbird 100.0b2 buildid 20220411124316. But no crashes so far for 100.0b3.
Updated•1 year ago
|
Description
•