Closed Bug 1762604 Opened 2 years ago Closed 2 years ago

Crash in [@ shutdownhang | mozilla::ProcessHangMonitor::RemoveProcess]

Categories

(Core :: DOM: Content Processes, defect)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
101 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox99 --- unaffected
firefox100 --- disabled
firefox101 --- fixed

People

(Reporter: mccr8, Assigned: jstutte)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

Maybe Fission related. (DOMFissionEnabled=1)

Crash report: https://crash-stats.mozilla.org/report/index/e6cf80aa-848e-48cc-9b8b-dceb70220401

MOZ_CRASH Reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForAlertByThreadId 
1 ntdll.dll RtlSleepConditionVariableSRW 
2 KERNELBASE.dll SleepConditionVariableSRW 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll static mozilla::ProcessHangMonitor::RemoveProcess dom/ipc/ProcessHangMonitor.cpp:1193
5 xul.dll mozilla::dom::ContentParent::ActorDestroy dom/ipc/ContentParent.cpp:1972
6 xul.dll mozilla::ipc::IProtocol::DestroySubtree ipc/glue/ProtocolUtils.cpp:577
7 xul.dll mozilla::dom::PContentParent::OnChannelClose ipc/ipdl/PContentParent.cpp:16912
8 xul.dll mozilla::ipc::MessageChannel::Close ipc/glue/MessageChannel.cpp:2082
9 xul.dll mozilla::ThreadEventQueue::RunShutdownTasks xpcom/threads/ThreadEventQueue.cpp:315

This looks like it could be another issue from calling ContentParent::ActorDestroy() during nsThreadManager::Shutdown().

Should the regressing patch be backed out? The next merge day is soon.

Flags: needinfo?(jstutte)

Hmm, unclear how this can happen, but yes, we should probably gain some time and backout:

as from my work on https://phabricator.services.mozilla.com/D142660 it seems that we touch a very fragile balance here.

Flags: needinfo?(jstutte)

what are the next steps here? Im trying to understand here since this happened before beta and the latest uplifts.

Flags: needinfo?(jstutte)

Hi Nika, I fear the nod is getting quite big here.

I looked at a crash and it looks as if we should check here if the shutdown dispatch succeeded?

We come from ContentParent::ActorDestroy and I assume we miss some check (or information) if the hang monitor is actually still alive?

Flags: needinfo?(jstutte) → needinfo?(nika)

PS: I can try to make a patch for this my tomorrow morning, not right now.

If run late in shutdown, we cannot rely on any of our normal infrastructure to be still in place. This patch collects a series of additional guards to avoid potential crashes.

Not all of them might ever been hit, but there is no harm in additional checks as this is not performance critical at all.

Assignee: nobody → jstutte
Status: NEW → ASSIGNED

So we are kicking the can down the road of ContentParent::ActorDestroy it seems. Rather than wait for further crashes I tried to audit the rest of it by code reading, as far as I could. This specific crash should be fixed by that, too, hopefully.

In general we seem to have a bad pattern in our code here and there of not checking if the dispatch of important messages actually succeeded. During the work on shutdown hangs I've seen a few of those and it might be worth auditing our code in that sense.

Leaving the ni? to :nika for now, as backing out bug 1738104 might be still a last-resort option that is easier to apply than unwinding all the shutdown improvements we made so far on our way, considering that all those crashes happened also before (but less often, of course).

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/046d729752b6
Further audit ContentParent::ActorDestroy to avoid crashes if run late during shutdown. r=smaug

I think we should probably look into backing out bug 1738104 from beta, to try to fix the issue at the root. I think it's possible to fix this, but trying to do it when rushed to meet deadlines will be difficult :-)

I'm optimistic we'll only see leaks after backing out that patch for now, but we'll have to see how it goes. My large patch which depends on it the most hasn't landed yet.

Flags: needinfo?(nika)

(In reply to Nika Layzell [:nika] (ni? for response) from comment #9)

I think we should probably look into backing out bug 1738104 from beta, to try to fix the issue at the root. I think it's possible to fix this, but trying to do it when rushed to meet deadlines will be difficult :-)

I'd second that, thanks.

I'm optimistic we'll only see leaks after backing out that patch for now, but we'll have to see how it goes. My large patch which depends on it the most hasn't landed yet.

Good to hear that it will not create to much hassle for you then!

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 101 Branch

(In reply to Dianna Smith [:diannaS] from comment #3)

what are the next steps here? Im trying to understand here since this happened before beta and the latest uplifts.

Per comment 9:

  1. Back out the patches of bug 1738104 from beta
  2. Then optional: uplift this patch that just landed (I'll take a short look at crash-stats tomorrow and in case ask for uplift)

Thanks

Flags: needinfo?(dsmith)

Sounds good. Ill backout bug 1738104 for beta6 and wait for the possibility of the other patch in the next beta.

Flags: needinfo?(dsmith)

Comment on attachment 9272248 [details]
Bug 1762604: Further audit ContentParent::ActorDestroy to avoid crashes if run late during shutdown. r?smaug

Beta/Release Uplift Approval Request

  • User impact if declined: This crash is a long standing issue if we try to destroy a ContentParent late in shutdown. Bug 1738104 amplified the volume here but is now backed out, so the the user impact is as before and can result in sporadic shutdown hangs.
  • Is this code covered by automated tests?: Unknown
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patch just adds some sanity checks. There remains the risk that I overlooked some more issues, as solving the found ones might lead to the execution of formerly unreached code paths under this condition. But we are late enough in shutdown that whatever happens should not really matter.
  • String changes made/needed:
Attachment #9272248 - Flags: approval-mozilla-beta?

The backout of bug 1738104 from beta seemed to help here, too.

Comment on attachment 9272248 [details]
Bug 1762604: Further audit ContentParent::ActorDestroy to avoid crashes if run late during shutdown. r?smaug

Declining uplift due to comment 15

Attachment #9272248 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
See Also: → 1766572
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: