Investigate effect of raising the priority of Shutdown IPC messages
Categories
(Core :: DOM: Content Processes, task)
Tracking
()
People
(Reporter: jstutte, Assigned: jstutte)
References
(Blocks 1 open bug)
Details
(Keywords: leave-open)
Attachments
(4 files, 1 obsolete file)
As of bug 1279293 comment 195 and bug 1279293 comment 196 we want to ensure a higher priority for child process shutdown messages.
The [Priority=control] annotation seems appropriate for this.
After a first sketch in D138708 I think we should move to a step-by-step approach:
- Introduce a high-priority ShutdownConfirmed message, sent by the parent right before SendShutdown. This sets the CrashReporter::Annotation::IPCShutdownState annotation to something we can recognize in crash-stats, to see if we are looking at the right case.
- Move the switch to ShutdownPhase::AppShutdownConfirmed from RecvShutdown to RecvShutdownConfirmed.
- Advance the shutdown phase to ShutdownPhase::AppShutdown in RecvShutdown instead, replacing mShuttingDown.
- Add more appropriate bailouts step by step (some of the places are already known from my experiments).
- (optional and not ContentProcess specific) Rename AppShutdown (and ShutdownPhase::AppShutdownXxx and …) to ProcessShutdown (see bug 1697745).
Step 1 is purely diagnostic. Step 2 would have some effect on existing JS SpinEventLoopUntilOrQuit, hopefully for the better, and maybe on FOG. Step 3 would just replace mShuttingDown and could probably even be done before step 2.
See this comment in phabricator for a longer discussion of the rationale and assumptions behind this plan.
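The phase ordering behind steps 2 and 3 can be sketched as a monotonic state machine. This is a hypothetical simplification: the real ShutdownPhase enum (in xpcom/base/ShutdownPhase.h) has more members and different machinery, and AdvancePhase is an illustrative name, not the actual API.

```cpp
#include <cassert>

// Hypothetical simplification of Gecko's shutdown phases; only the members
// mentioned in this plan are modeled.
enum class ShutdownPhase {
  NotInShutdown = 0,
  AppShutdownConfirmed,  // step 2: entered in RecvShutdownConfirmed
  AppShutdown,           // step 3: entered in RecvShutdown, replacing mShuttingDown
};

// Phases must only ever advance: a duplicate or out-of-order notification
// must not move the process backwards.
bool AdvancePhase(ShutdownPhase& aCurrent, ShutdownPhase aNext) {
  if (aNext <= aCurrent) {
    return false;  // ignore duplicates and out-of-order transitions
  }
  aCurrent = aNext;
  return true;
}
```

This monotonicity matters because the high-priority ShutdownConfirmed message and the normal Shutdown message can, in principle, race in the queues.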
Comment 1•2 years ago
Comment 2•2 years ago
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5a00e56ade15 Step 1 - Add high priority ShutdownConfirmedHP method and update Annotation::IPCShutdownState if received. r=smaug
Comment 4•2 years ago
bugherder
Comment 5•2 years ago
Sadly, the incoming numbers do not seem to confirm that just raising the priority would help much here. The vast majority of crashes continue to lack any IPC shutdown state annotation.
I assume that we could add some annotation inside ProcessChild::NotifyImpendingShutdown() to check whether that special IMPENDING_SHUTDOWN_MESSAGE_TYPE message sent by ContentParent::StartForceKillTimer() is processed. That would give us an indication of whether we can talk to the child process at all in these situations, such that we could trigger shutdown in a similar way. Otherwise I would assume we never receive any kind of message and IPC communication is somehow stalled.
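The annotation idea above can be sketched as a last-write-wins breadcrumb: each shutdown milestone overwrites the previous value, so a crash report carries the furthest step the process reached (or nothing at all if it never got that far). The names here are hypothetical stand-ins; the real code would go through the crash reporter's annotation API with the IPCShutdownState key.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the IPCShutdownState crash annotation.
// An empty value means no shutdown milestone was ever reached.
static std::string gIPCShutdownState;

// Overwrite the annotation with the latest milestone; a crash dump then
// shows only the last milestone reached before the crash.
void AnnotateShutdownState(const std::string& aMilestone) {
  gIPCShutdownState = aMilestone;
}
```

With this scheme, an entirely empty annotation is itself a signal: the process apparently never even processed the impending-shutdown notification.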
Comment hidden (obsolete)
Comment 7•2 years ago
Comment 8•2 years ago
BTW, looking at the patches in bug 1633791 I was wondering if we could rename NotifyImpendingShutdown to better indicate the difference between sending and receiving.
Comment 9•2 years ago
Depends on D139255
Comment 10•2 years ago
(In reply to Jens Stutte [:jstutte] from comment #5)
Sadly the incoming numbers seem not to confirm that just raising the priority would help much here. The vast majority of crashes continue to not have any IPC shutdown state annotation.
Oh, that's really bad news. Well, there are two possibilities left: the first is that the content process is really getting stuck somewhere, but that would cover only the cases where the acknowledgment message hasn't been sent. The second is that the process was never scheduled and thus never had a chance to run in the first place. It's impossible to discount that, especially on older machines with hard drives and little memory, which might be slowed down by swapping.
Comment 11•2 years ago
(In reply to Gabriele Svelto [:gsvelto] from comment #10)
the first one is that the content process is really getting stuck somewhere, but that would cover only the cases when the acknowledgment message hasn't been sent.
Yeah, that would probably just be the "machine stuck" case I already mentioned in phabricator.
The second one is that the process was never scheduled and thus never had a chance of running in the first place. It's impossible to discount that, especially on older machines with hard-drives and little memory, which might be slowed down by swapping.
At least from the data we see real crashes of apparently started content processes where the shutdown annotation is never set. Those are the ones I am chasing with the additional annotation now. What you are suggesting here would be a totally different thing, IIUC.
Comment 12•2 years ago
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/89d6b8e05394 Update Annotation::IPCShutdownState on NotifyImpendingShutdown r=ipc-reviewers,mccr8
Comment 13•2 years ago
bugherder
Comment 14•2 years ago
So the first incoming data does not show much difference here. Of the 182 reports for version 20220223093800 so far, I see only 1 with "NotifyImpendingShutdown received." (and 1 with "SendFinishShutdown (sent)"); all the others show no sign of any shutdown progress at all.
Comment 15•2 years ago
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/95d43cf0d451 Rename receiving NotifyImpendingShutdown to distinguish send and receive case. r=ipc-reviewers,mccr8
Comment 16•2 years ago
bugherder
Comment 17•2 years ago
Looking at bug 1279293 comment 198, it seems now that we probably should use special IO messages both to signal shutdown to the child and a finished shutdown to the parent.
This would decouple us from the normal event processing order and ensure we can timely react.
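The "special IO messages" idea can be sketched as the IO thread intercepting a message type before it ever reaches the main-thread queue. Names and types here are illustrative, not the actual IPC code, which is far more involved.

```cpp
#include <cassert>
#include <deque>

// Illustrative message kinds; the real code matches a reserved message type
// (cf. IMPENDING_SHUTDOWN_MESSAGE_TYPE) on the IO thread.
enum MsgType { NORMAL_MESSAGE, IMPENDING_SHUTDOWN_MESSAGE };

struct Channel {
  std::deque<MsgType> mMainThreadQueue;
  bool mImpendingShutdown = false;

  void OnMessageReceivedOnIOThread(MsgType aMsg) {
    if (aMsg == IMPENDING_SHUTDOWN_MESSAGE) {
      // Handled right here on the IO thread: delivery does not depend on
      // how many events are already queued for the main thread.
      mImpendingShutdown = true;
      return;
    }
    mMainThreadQueue.push_back(aMsg);
  }
};
```

The key property is that the shutdown signal is observed even if the main thread is busy or its queue is arbitrarily long, which is exactly the decoupling from normal event processing order mentioned above.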
Comment 18•2 years ago
In particular, it could be that even the ShutdownConfirmedHP message is not processed if there are already many events in the control priority queue.
Looking at where we use control priority, the usage in Graphics: Layers (compositor) stands out to me; in particular, we did bug 1730998.
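Why a control-priority message can still starve can be sketched with a two-level queue model. This is an assumed simplification, not the real MessageChannel implementation: higher-priority events are drained first, but FIFO within the same priority, so ShutdownConfirmedHP waits behind every control-priority event that arrived earlier (e.g. from the compositor).

```cpp
#include <cassert>
#include <deque>
#include <string>

// Simplified two-level prioritized event queue model.
struct EventQueues {
  std::deque<std::string> mControl;  // always drained before mNormal
  std::deque<std::string> mNormal;

  // Precondition: at least one queue is non-empty.
  std::string PopNext() {
    std::deque<std::string>& q = mControl.empty() ? mNormal : mControl;
    std::string next = q.front();
    q.pop_front();
    return next;
  }
};
```

So raising priority only helps relative to normal events; it does nothing against a backlog of same-priority traffic.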
Comment 19•2 years ago
Updated data from crash-stats:
Rank | Last annotation | # | % |
---|---|---|---|
1 | NotifyImpendingShutdown received. | 5605 | 58.07 % |
2 | SendFinishShutdown (sent) | 1735 | 17.98 % |
3 | content-child-shutdown started | 31 | 0.32 % |
4 | ShutdownInternal entry | 16 | 0.17 % |
5 | RecvShutdownConfirmedHP entry | 4 | 0.04 % |
6 | RecvShutdown entry | 1 | 0.01 % |
7 | SendFinishShutdown (sending) | 1 | 0.01 % |
This confirms that we still have bottlenecks mostly in receiving / running the shutdown messages; it seems they get queued way too often.
Comment 20•2 years ago
Bug 1775025 might be related to the cases where we see only NotifyImpendingShutdown messages arrive (which is sent over the same channel, meaning that the channel as such seems to be generally working up to that point).
Edit: See :nika's bug 1775025 comment 5.
Comment 21•2 years ago
The leave-open keyword is set and there has been no activity for 6 months.
:jstutte, maybe it's time to close this bug?
For more information, please visit the auto_nag documentation.
Comment 22•1 year ago
Since we landed D155814 we see a bit more noise with annotations like:
NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected)
NotifiedImpendingShutdown - NotifiedImpendingShutdown - RecvShutdownConfirmedHP entry - RecvShutdown entry - content-child-will-shutdown started - ShutdownInternal entry - content-child-shutdown started - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - StartForceKillTimer - SendFinishShutdown (sending) - SendFinishShutdown (sent)
NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - RecvShutdownConfirmedHP entry - RecvShutdown entry - content-child-will-shutdown started - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - [… "ShutdownInternal entry" repeated a few hundred times …]
I think we should reduce the verbosity a bit and ensure that NotifiedImpendingShutdown and RecvRequestContentJSInterrupt are called only once. The repeated ShutdownInternal entry message indicates a stuck nested event loop on the content process' main thread.
Comment 23•1 year ago
In all instances I looked at, the repeated ShutdownInternal entry case seems to come from this SpinEventLoopUntil. It seems reasonable to add a shutdown condition here to abort the loop. We could fake and throw some network timeout as a consequence, just to do something consistent with normal operations.
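The suggestion above can be sketched as a SpinEventLoopUntil-style helper with an added shutdown bail-out. The signature is hypothetical; the real xpcom helper takes different arguments and processes real main-thread events.

```cpp
#include <cassert>
#include <functional>

// Sketch: spin until aCondition holds, but bail out once shutdown has begun
// so a nested event loop cannot spin forever in a shutting-down process.
// aProcessEvent stands in for processing one main-thread event.
bool SpinEventLoopUntilOrShutdown(const std::function<bool()>& aCondition,
                                  const std::function<void()>& aProcessEvent,
                                  const bool& aShuttingDown) {
  while (!aCondition()) {
    if (aShuttingDown) {
      return false;  // caller then e.g. fakes a network timeout and unwinds
    }
    aProcessEvent();
  }
  return true;
}
```

Returning false (instead of spinning on) gives the caller a defined place to synthesize the "timeout" result mentioned above.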
Comment 24•1 year ago
(In reply to Jens Stutte [:jstutte] from comment #22)
I think we should reduce the verbosity a bit and ensure that NotifiedImpendingShutdown and RecvRequestContentJSInterrupt are called only once.
I filed bug 1808691 for this.