Closed Bug 1696771 Opened 3 years ago Closed 2 years ago

Crash in [@ PLDHashTable::Search | mozilla::dom::BrowserParent::ActorDestroy]

Categories

(Core :: DOM: Content Processes, defect, P1)

Unspecified
All
defect

Tracking


RESOLVED FIXED
101 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- wontfix
firefox86 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix
firefox99 --- wontfix
firefox100 + fixed
firefox101 + fixed

People

(Reporter: gsvelto, Assigned: jstutte)

References

Details

(Keywords: crash, topcrash, topcrash-thunderbird)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/a0e7c5d3-74fb-4256-b4e6-c4bf50210304

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0 xul.dll PLDHashTable::Search const xpcom/ds/PLDHashTable.cpp:492
1 xul.dll mozilla::dom::BrowserParent::ActorDestroy dom/ipc/BrowserParent.cpp:681
2 xul.dll mozilla::ipc::IProtocol::DestroySubtree ipc/glue/ProtocolUtils.cpp:603
3 xul.dll mozilla::ipc::IProtocol::DestroySubtree ipc/glue/ProtocolUtils.cpp:591
4 xul.dll mozilla::dom::PContentParent::OnChannelError ipc/ipdl/PContentParent.cpp:16201
5 xul.dll mozilla::dom::ContentParent::OnChannelError dom/ipc/ContentParent.cpp:1950
6 xul.dll mozilla::detail::RunnableMethodImpl<RefPtr<mozilla::dom::WorkerListener>, void  xpcom/threads/nsThreadUtils.h:1201
7 xul.dll mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:741
8 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1200
9 xul.dll NS_ProcessPendingEvents xpcom/threads/nsThreadUtils.cpp:496

If I'm reading the trace correctly we should be crashing here because mBrowserParentMap is null. Some comments in the crash reports suggest this happens on shutdown, but many others suggest it happens during normal browsing.

It looks like we're starting a new content process sometime after the async shutdown blocker for ContentParent has been cleared (which is in profile-before-change or xpcom-shutdown, whichever happens first), but before the ContentProcessManager has been destroyed (which occurs during late shutdown due to a ClearOnShutdown).

In this case we can have a process still launching after the ClearOnShutdown observer has fired, because we didn't correctly stop process launching once the async shutdown blocker was cleared, but instead based the decision on whether the ContentProcessManager still exists (https://searchfox.org/mozilla-central/rev/9bf82ef9c097ee6af0e34a1d21c073b2616cc438/dom/ipc/ContentParent.cpp#2529).

We should instead have the decision be made based on whether the async shutdown blocker is still registered.

Crash volume peaked in 84 and 85, but may be declining in 86?

P2 S3

Severity: -- → S3
Crash Signature: [@ PLDHashTable::Search | mozilla::dom::BrowserParent::ActorDestroy] → [@ PLDHashTable::Search | mozilla::dom::BrowserParent::ActorDestroy] [@ shutdownhang | PLDHashTable::Search | mozilla::dom::BrowserParent::ActorDestroy]
Priority: -- → P2

Here's a recent crash: bp-8b02a025-96f1-4b0c-83be-f632f0220324

This looks similar to bug 1761182, in that we're destroying a parent process actor during thread manager shutdown, so some things actor destroy expects have been nulled out.

See Also: → 1761182

[Tracking Requested - why for this release]: This is showing up on beta.

(In reply to Andrew McCreight [:mccr8] from comment #3)

> This looks similar to bug 1761182, in that we're destroying a parent process actor during thread manager shutdown, so some things actor destroy expects have been nulled out.

Yes, we should land and uplift the patch on bug 1761182 to avoid those exact crashes, at least. Please note that the underlying problem should be mitigated better by the patch on bug 1632740, but that one is not yet ready. Still, this would just handle late content process creation better (by effectively refusing it).

Nevertheless it would be interesting to know why this is spiking now. One patch that might influence the number and timing of content process startups could be bug 1728332. The other patch that seemed to make this situation more likely to hit was bug 1738103. Not sure if backing out either of them is an option, though.

Edit: I think bug 1738103 is sort of a different way to mitigate what might be mitigated better by the patch on bug 1632740 (or by other ways to avoid late content process creation entirely)?

Flags: needinfo?(nika)

[Tracking Requested - why for this release]:

This is the #1 crasher on beta, with 13% of our crashes while we are still at only 50% rollout; probably a release blocker (even higher volume than bug 1761182).

Severity: S3 → S1
Keywords: topcrash
Priority: P2 → P1

(In reply to Jens Stutte [:jstutte] from comment #5)

> Yes, we should land and uplift the patch on bug 1761182 to avoid those exact crashes, at least.

Actually this is not enough for this flavor of crashes. Thanks to :smaug we understood that the current implementation of ContentProcessManager::GetSingleton looks infallible but is not: ClearOnShutdown deletes the CPM immediately if we are already beyond the shutdown phase.

So to avoid this crash we need to always null-check the CPM singleton.

Assignee: nobody → jstutte
Status: NEW → ASSIGNED

Recap:
Landing this patch together with bug 1761182 should help to avoid these crashes.

The rest of comment 5 is still valid.

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2531ddbeeffa
Always null check the ContentProcessManager singleton before use. r=smaug
See Also: → 1632740

Backed out for causing multiple failures (bc, dt, xpcshell, wpt, gl)

Backout link

Push with failures

Failure log 1 // Failure log 2 // Failure log 3 // Failure log 4 // Failure log 5 // Failure log 6

Flags: needinfo?(jstutte)

OK, there was a really stupid oversight, sorry. Let's try this: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8aa7821055de85f8fd22d17d34220ba6d6506613

Flags: needinfo?(jstutte)
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/be43f0901409
Always null check the ContentProcessManager singleton before use. r=smaug
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 101 Branch

:jstutte can you request an uplift for this patch as well?

Flags: needinfo?(jstutte)

Comment on attachment 9271292 [details]
Bug 1696771: Always null check the ContentProcessManager singleton before use. r?smaug

Beta/Release Uplift Approval Request

  • User impact if declined: This is a longstanding possible crash, but the spike in numbers is a symptom of some changes in the order of events during shutdown, and it would bite users quite often (topcrash).
  • Is this code covered by automated tests?: Unknown
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch just adds appropriate null checks. In some cases this can lead to an unexpected error return value, but that's still preferable to a main-process crash.
  • String changes made/needed:
Flags: needinfo?(jstutte)
Attachment #9271292 - Flags: approval-mozilla-beta?
See Also: → 1764043

Comment on attachment 9271292 [details]
Bug 1696771: Always null check the ContentProcessManager singleton before use. r?smaug

Approved for 100.0b5

Attachment #9271292 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to Jens Stutte [:jstutte] from comment #7)

> Actually this is not enough for this flavor of crashes here. Thanks to :smaug we understood, that the current implementation of ContentProcessManager::GetSingleton seems infallible but it is not, as ClearOnShutdown deletes the CPM immediately if beyond the shutdown phase.

As I previously mentioned on Matrix, I was quite surprised that we were clearing XPCOMShutdownFinal before nsThreadManager::Shutdown(), which is probably what is leading to this issue. It is possible for us to perform some cleanup steps during nsThreadManager::Shutdown() due to shutdown tasks, and we probably don't want to be running those after XPCOMShutdownFinal.

I think a longer-term solution here might be to partially revert the changes in bug 1637890 and move XPCOMShutdownFinal so that it happens after the bulk of the work in nsThreadManager::Shutdown. We will unfortunately probably need to split nsThreadManager::Shutdown into two parts to keep the main thread accepting events until after XPCOMShutdownFinal, but ideally we wouldn't need to anymore. Doing this would mean we wouldn't have to worry as much about this manager object being destroyed before some IPDL actors are, as all IPDL actors are destroyed during nsThreadManager::Shutdown when their corresponding event targets die. I'll file a separate bug for that.

(In reply to Jens Stutte [:jstutte] from comment #5)

> Yes, we should land and uplift the patch on bug 1761182 to avoid those exact crashes, at least. Please note that the underlying problem should be mitigated better by the patch on bug 1632740, but that is not yet ready. Still this would just handle better (refuse effectively) late content process creation.

> Nevertheless it would be interesting, why this is spiking now. One patch that might influence the number and timing of content processes that start could be bug 1728332 ? The other patch that seemed to make it more likely to hit this situation seemed to be bug 1738103. Not sure if it is an option to backout any of them, though?

> Edit: I think bug 1738103 is kind of a different way to mitigate what might be mitigated better by the patch on bug 1632740 (or other ways to 100% avoid late content process creation) ?

bug 1738103 is a general fix that tries to avoid having actors running after threads have been killed. It's partially a helper patch to make it easier to implement actors like PBackground, which are bound to a thread or TaskQueue and should be shut down when that thread or task queue is shut down. It's possible that it led to this surge in crashes, as it means we now actually touch any actors that are still alive during XPCOM threads shutdown; previously, if we got that late, we'd probably end up leaking them instead, because the other place we'd try to touch them was during I/O thread shutdown, and the actor shutdown message dispatches would be leaked rather than run.

Flags: needinfo?(nika)
See Also: → 1764181
See Also: → 1764251
Regressions: 1763893

This was the #5 crash for Thunderbird 100.0b2 (buildid 20220411124316), but no crashes so far for 100.0b3.
