Closed Bug 1696771 Opened 3 years ago Closed 2 years ago

Crash in [@ PLDHashTable::Search | mozilla::dom::BrowserParent::ActorDestroy]

Categories

(Core :: DOM: Content Processes, defect, P1)

Unspecified
All
defect

Tracking


RESOLVED FIXED
101 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- wontfix
firefox86 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix
firefox99 --- wontfix
firefox100 + fixed
firefox101 + fixed

People

(Reporter: gsvelto, Assigned: jstutte)

References

Details

(Keywords: crash, topcrash, topcrash-thunderbird)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/a0e7c5d3-74fb-4256-b4e6-c4bf50210304

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0 xul.dll PLDHashTable::Search const xpcom/ds/PLDHashTable.cpp:492
1 xul.dll mozilla::dom::BrowserParent::ActorDestroy dom/ipc/BrowserParent.cpp:681
2 xul.dll mozilla::ipc::IProtocol::DestroySubtree ipc/glue/ProtocolUtils.cpp:603
3 xul.dll mozilla::ipc::IProtocol::DestroySubtree ipc/glue/ProtocolUtils.cpp:591
4 xul.dll mozilla::dom::PContentParent::OnChannelError ipc/ipdl/PContentParent.cpp:16201
5 xul.dll mozilla::dom::ContentParent::OnChannelError dom/ipc/ContentParent.cpp:1950
6 xul.dll mozilla::detail::RunnableMethodImpl<RefPtr<mozilla::dom::WorkerListener>, void  xpcom/threads/nsThreadUtils.h:1201
7 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:741
8 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1200
9 xul.dll NS_ProcessPendingEvents xpcom/threads/nsThreadUtils.cpp:496

If I'm reading the trace correctly we should be crashing here because mBrowserParentMap is null. Some comments in the crash reports suggest this happens on shutdown, but many others suggest it happens during normal browsing.

It looks like we're starting a new content process sometime after the async shutdown blocker for ContentParent has been cleared (which is in profile-before-change or xpcom-shutdown, whichever happens first), but before the ContentProcessManager has been destroyed (which occurs during late shutdown due to a ClearOnShutdown).

In this case we can have a process still launching after the ClearOnShutdown observer has fired, because we didn't correctly stop process launching once the async shutdown blocker was cleared, but instead based the decision on whether the ContentProcessManager still exists (https://searchfox.org/mozilla-central/rev/9bf82ef9c097ee6af0e34a1d21c073b2616cc438/dom/ipc/ContentParent.cpp#2529).

We should instead have the decision be made based on whether the async shutdown blocker is still registered.

Crash volume peaked in 84 and 85, but may be declining in 86?

P2 S3

Severity: -- → S3
Crash Signature: [@ PLDHashTable::Search | mozilla::dom::BrowserParent::ActorDestroy] → [@ PLDHashTable::Search | mozilla::dom::BrowserParent::ActorDestroy] [@ shutdownhang | PLDHashTable::Search | mozilla::dom::BrowserParent::ActorDestroy]
Priority: -- → P2

Here's a recent crash: bp-8b02a025-96f1-4b0c-83be-f632f0220324

This looks similar to bug 1761182, in that we're destroying a parent process actor during thread manager shutdown, so some things actor destroy expects have been nulled out.

See Also: → 1761182

[Tracking Requested - why for this release]: This is showing up on beta.

(In reply to Andrew McCreight [:mccr8] from comment #3)

> This looks similar to bug 1761182, in that we're destroying a parent process actor during thread manager shutdown, so some things actor destroy expects have been nulled out.

Yes, we should land and uplift the patch on bug 1761182 to avoid those exact crashes, at least. Please note that the underlying problem should be mitigated better by the patch on bug 1632740, but that one is not yet ready. Still, this would just handle late content process creation better (by effectively refusing it).

Nevertheless it would be interesting to know why this is spiking now. One patch that might influence the number and timing of content process startups could be bug 1728332. The other patch that seemed to make this situation more likely to hit was bug 1738103. Not sure if backing out either of them is an option, though.

Edit: I think bug 1738103 is sort of a different way to mitigate what might be mitigated better by the patch on bug 1632740 (or by other ways to avoid late content process creation entirely)?

Flags: needinfo?(nika)

[Tracking Requested - why for this release]:

This is the #1 crasher on beta, with 13% of our crashes while we are still at only 50% rollout; probably a release blocker (even higher volume than bug 1761182).

Severity: S3 → S1
Keywords: topcrash
Priority: P2 → P1

(In reply to Jens Stutte [:jstutte] from comment #5)

> Yes, we should land and uplift the patch on bug 1761182 to avoid those exact crashes, at least.

Actually this is not enough for this flavor of crashes. Thanks to :smaug we understood that the current implementation of ContentProcessManager::GetSingleton looks infallible but is not: ClearOnShutdown deletes the CPM immediately if we are already beyond the shutdown phase.

So to avoid this crash we need to always null-check the CPM singleton.

Assignee: nobody → jstutte
Status: NEW → ASSIGNED

Recap:
Landing this patch together with bug 1761182 should help to avoid these crashes.

The rest of comment 5 is still valid.

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2531ddbeeffa
Always null check the ContentProcessManager singleton before use. r=smaug
See Also: → 1632740

Backed out for causing multiple failures (bc, dt, xpcshell, wpt, gl)

Backout link

Push with failures

Failure log 1 // Failure log 2 // Failure log 3 // Failure log 4 // Failure log 5 // Failure log 6

Flags: needinfo?(jstutte)

OK, there was a really stupid oversight, sorry. Let's try this: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8aa7821055de85f8fd22d17d34220ba6d6506613

Flags: needinfo?(jstutte)
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/be43f0901409
Always null check the ContentProcessManager singleton before use. r=smaug
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 101 Branch

:jstutte can you request an uplift for this patch as well?

Flags: needinfo?(jstutte)

Comment on attachment 9271292 [details]
Bug 1696771: Always null check the ContentProcessManager singleton before use. r?smaug

Beta/Release Uplift Approval Request

  • User impact if declined: This is a longstanding possible crash, but the spike in numbers is a symptom of some changes in the order of events during shutdown, and it would bite users quite often (topcrash).
  • Is this code covered by automated tests?: Unknown
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch just adds appropriate null checks. In some cases this can lead to an unexpected error return value, but that's still preferable to a main-process crash.
  • String changes made/needed:
Flags: needinfo?(jstutte)
Attachment #9271292 - Flags: approval-mozilla-beta?
See Also: → 1764043

Comment on attachment 9271292 [details]
Bug 1696771: Always null check the ContentProcessManager singleton before use. r?smaug

Approved for 100.0b5

Attachment #9271292 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to Jens Stutte [:jstutte] from comment #7)

> Actually this is not enough for this flavor of crashes here. Thanks to :smaug we understood, that the current implementation of ContentProcessManager::GetSingleton seems infallible but it is not, as ClearOnShutdown deletes the CPM immediately if beyond the shutdown phase.

As I previously mentioned on Matrix, I was quite surprised that we were clearing XPCOMShutdownFinal before nsThreadManager::Shutdown(), which is probably what is leading to this issue. It is possible for us to perform some cleanup steps during nsThreadManager::Shutdown() due to shutdown tasks, and we probably don't want to be running those after XPCOMShutdownFinal.

I think a longer-term solution here might be to partially revert the changes in bug 1637890 and move XPCOMShutdownFinal so that it happens after the bulk of the work in nsThreadManager::Shutdown. We will unfortunately probably need to split nsThreadManager::Shutdown into two parts to keep the main thread accepting events until after XPCOMShutdownFinal, but ideally we wouldn't need to anymore. Doing this would mean we wouldn't have to worry as much about this manager object being destroyed before some IPDL actors are, as all IPDL actors are destroyed during nsThreadManager::Shutdown when their corresponding event targets die. I'll file a separate bug for that.

(In reply to Jens Stutte [:jstutte] from comment #5)

> Yes, we should land and uplift the patch on bug 1761182 to avoid those exact crashes, at least. Please note that the underlying problem should be mitigated better by the patch on bug 1632740, but that is not yet ready. Still this would just handle better (refuse effectively) late content process creation.

> Nevertheless it would be interesting, why this is spiking now. One patch that might influence the number and timing of content processes that start could be bug 1728332 ? The other patch that seemed to make it more likely to hit this situation seemed to be bug 1738103. Not sure if it is an option to backout any of them, though?

> Edit: I think bug 1738103 is kind of a different way to mitigate what might be mitigated better by the patch on bug 1632740 (or other ways to 100% avoid late content process creation) ?

bug 1738103 is a general fix that tries to avoid having actors running after threads have been killed. It's partially a helper patch to make it easier to implement actors like PBackground, which are bound to a thread or TaskQueue and should be shut down when that thread or task queue is shut down. It's possible that it led to this surge in crashes, as it means we now actually touch any actors that are still alive during XPCOM threads shutdown; previously, if we got that late, we'd probably end up leaking them instead, because the other place we'd try to touch them was during I/O thread shutdown, and the actor shutdown message dispatches would be leaked rather than run.

Flags: needinfo?(nika)
See Also: → 1764181
See Also: → 1764251
Regressions: 1763893

This was the #5 crash for Thunderbird 100.0b2 (buildid 20220411124316), but no crashes so far for 100.0b3.
