Crash in [@ shutdownhang | mozilla::ProcessHangMonitor::RemoveProcess]
Categories
(Core :: DOM: Content Processes, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr91 | --- | unaffected |
firefox99 | --- | unaffected |
firefox100 | --- | disabled |
firefox101 | --- | fixed |
People
(Reporter: mccr8, Assigned: jstutte)
References
Details
(Keywords: crash)
Crash Data
Attachments
(1 file)
48 bytes,
text/x-phabricator-request
|
diannaS
:
approval-mozilla-beta-
|
Details | Review |
Maybe Fission related. (DOMFissionEnabled=1)
Crash report: https://crash-stats.mozilla.org/report/index/e6cf80aa-848e-48cc-9b8b-dceb70220401
MOZ_CRASH Reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)
Top 10 frames of crashing thread:
0 ntdll.dll NtWaitForAlertByThreadId
1 ntdll.dll RtlSleepConditionVariableSRW
2 KERNELBASE.dll SleepConditionVariableSRW
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll static mozilla::ProcessHangMonitor::RemoveProcess dom/ipc/ProcessHangMonitor.cpp:1193
5 xul.dll mozilla::dom::ContentParent::ActorDestroy dom/ipc/ContentParent.cpp:1972
6 xul.dll mozilla::ipc::IProtocol::DestroySubtree ipc/glue/ProtocolUtils.cpp:577
7 xul.dll mozilla::dom::PContentParent::OnChannelClose ipc/ipdl/PContentParent.cpp:16912
8 xul.dll mozilla::ipc::MessageChannel::Close ipc/glue/MessageChannel.cpp:2082
9 xul.dll mozilla::ThreadEventQueue::RunShutdownTasks xpcom/threads/ThreadEventQueue.cpp:315
This looks like it could be another issue from calling ContentParent::ActorDestroy() during nsThreadManager::Shutdown().
Comment 1•2 years ago
|
||
Should the regressing patch be backed out? The next merge day is soon.
Assignee | ||
Comment 2•2 years ago
|
||
Hmm, unclear how this can happen, but yes, we should probably gain some time and backout:
- https://phabricator.services.mozilla.com/D142082 and
- https://phabricator.services.mozilla.com/D142190
as from my work on https://phabricator.services.mozilla.com/D142660 it seems that we touch a very fragile balance here.
Comment 3•2 years ago
|
||
what are the next steps here? Im trying to understand here since this happened before beta and the latest uplifts.
- https://phabricator.services.mozilla.com/D142082 was merged to beta on b4 which is when the spike here started on beta. Backing this out would bring back bug 1761182
Assignee | ||
Comment 4•2 years ago
|
||
Hi Nika, I fear the nod is getting quite big here.
I looked at a crash and it looks as if we should check here if the shutdown dispatch succeeded?
We come from ContentParent::ActorDestroy
and I assume we miss some check (or information) if the hang monitor is actually still alive?
Assignee | ||
Comment 5•2 years ago
|
||
PS: I can try to make a patch for this my tomorrow morning, not right now.
Assignee | ||
Comment 6•2 years ago
|
||
If run late in shutdown, we cannot rely on any of our normal infrastructure to be still in place. This patch collects a series of additional guards to avoid potential crashes.
Not all of them might ever been hit, but there is no harm in additional checks as this is not performance critical at all.
Updated•2 years ago
|
Assignee | ||
Comment 7•2 years ago
|
||
So we are kicking the can down the road of ContentParent::ActorDestroy
it seems. Rather than wait for further crashes I tried to audit the rest of it by code reading, as far as I could. This specific crash should be fixed by that, too, hopefully.
In general we seem to have a bad pattern in our code here and there of not checking if the dispatch of important messages actually succeeded. During the work on shutdown hangs I've seen a few of those and it might be worth auditing our code in that sense.
Leaving the ni? to :nika for now, as backing out bug 1738104 might be still a last-resort option that is easier to apply than unwinding all the shutdown improvements we made so far on our way, considering that all those crashes happened also before (but less often, of course).
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/046d729752b6 Further audit ContentParent::ActorDestroy to avoid crashes if run late during shutdown. r=smaug
Updated•2 years ago
|
Comment 9•2 years ago
|
||
I think we should probably look into backing out bug 1738104 from beta, to try to fix the issue at the root. I think it's possible to fix this, but trying to do it when rushed to meet deadlines will be difficult :-)
I'm optimistic we'll only see leaks after backing out that patch for now, but we'll have to see how it goes. My large patch which depends on it the most hasn't landed yet.
Assignee | ||
Comment 10•2 years ago
|
||
(In reply to Nika Layzell [:nika] (ni? for response) from comment #9)
I think we should probably look into backing out bug 1738104 from beta, to try to fix the issue at the root. I think it's possible to fix this, but trying to do it when rushed to meet deadlines will be difficult :-)
I'd second that, thanks.
I'm optimistic we'll only see leaks after backing out that patch for now, but we'll have to see how it goes. My large patch which depends on it the most hasn't landed yet.
Good to hear that it will not create to much hassle for you then!
Comment 11•2 years ago
|
||
bugherder |
Assignee | ||
Comment 12•2 years ago
|
||
(In reply to Dianna Smith [:diannaS] from comment #3)
what are the next steps here? Im trying to understand here since this happened before beta and the latest uplifts.
- https://phabricator.services.mozilla.com/D142082 was merged to beta on b4 which is when the spike here started on beta. Backing this out would bring back bug 1761182
Per comment 9:
- Back out the patches of bug 1738104 from beta
- Then optional: uplift this patch that just landed (I'll take a short look at crash-stats tomorrow and in case ask for uplift)
Thanks
Comment 13•2 years ago
|
||
Sounds good. Ill backout bug 1738104 for beta6 and wait for the possibility of the other patch in the next beta.
Assignee | ||
Comment 14•2 years ago
•
|
||
Comment on attachment 9272248 [details]
Bug 1762604: Further audit ContentParent::ActorDestroy to avoid crashes if run late during shutdown. r?smaug
Beta/Release Uplift Approval Request
- User impact if declined: This crash is a long standing issue if we try to destroy a
ContentParent
late in shutdown. Bug 1738104 amplified the volume here but is now backed out, so the the user impact is as before and can result in sporadic shutdown hangs. - Is this code covered by automated tests?: Unknown
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The patch just adds some sanity checks. There remains the risk that I overlooked some more issues, as solving the found ones might lead to the execution of formerly unreached code paths under this condition. But we are late enough in shutdown that whatever happens should not really matter.
- String changes made/needed:
Assignee | ||
Comment 15•2 years ago
|
||
The backout of bug 1738104 from beta seemed to help here, too.
Comment 16•2 years ago
|
||
Comment on attachment 9272248 [details]
Bug 1762604: Further audit ContentParent::ActorDestroy to avoid crashes if run late during shutdown. r?smaug
Declining uplift due to comment 15
Updated•2 years ago
|
Updated•2 years ago
|
Description
•