Open Bug 1755376 Opened 2 years ago Updated 1 year ago

Investigate effect of raising the priority of Shutdown IPC messages

Categories

(Core :: DOM: Content Processes, task)

task

Tracking

()

ASSIGNED

People

(Reporter: jstutte, Assigned: jstutte)

References

(Blocks 1 open bug)

Details

(Keywords: leave-open)

Attachments

(4 files, 1 obsolete file)

As of bug 1279293 comment 195 and bug 1279293 comment 196 we want to ensure a higher priority for child process shutdown messages.

The [Priority=control] annotation seems appropriate for this.

After a first sketch in D138708 I think we should move to a step-by-step approach:

  1. Introduce a high-priority ShutdownConfirmed message called right before SendShutdown by the parent. This sets the CrashReporter::Annotation::IPCShutdownState annotation to something we can recognize in crash-stats to see if we are looking at the right case.
  2. Move the switch to ShutdownPhase::AppShutdownConfirmed from RecvShutdown to RecvShutdownConfirmed.
  3. Advance instead the shutdown phase to ShutdownPhase::AppShutdown in RecvShutdown to replace mShuttingDown.
  4. Add more appropriate bailouts step by step (some of the places are already known from my experiments).
  5. (optional and not ContentProcess specific) Rename AppShutdown (and ShutdownPhase::AppShutdownXxx and …) to ProcessShutdown (see bug 1697745).

Step 1 is just diagnostic. Step 2 would have some effect on existing JS SpinEventLoopUntilOrQuit, hopefully for the better, and maybe FOG. Step 3 would just replace mShuttingDown and could be even done before 2, probably.

See this comment in phabricator for a longer discussion of the rationale and assumptions behind this plan.

Assignee: nobody → jstutte
Status: NEW → ASSIGNED
Attachment #9263834 - Attachment description: Bug 1755376: Raise priority of Shutdown and FinishShutdown IPC messages. r?mccr8 → WIP: Bug 1755376: Raise priority of ContentChild::Shutdown and harden some entry points against being late in shutdown.
Attachment #9263834 - Attachment description: WIP: Bug 1755376: Raise priority of ContentChild::Shutdown and harden some entry points against being late in shutdown. → Bug 1755376: Raise priority of ContentChild::Shutdown and harden some entry points against being late in shutdown. r?#ipc-reviewers
Attachment #9263834 - Attachment description: Bug 1755376: Raise priority of ContentChild::Shutdown and harden some entry points against being late in shutdown. r?#ipc-reviewers → Bug 1755376: Raise priority of ContentChild::Shutdown and harden some entry points against being late in shutdown. r?#ipc-reviewers,smaug!
Summary: Raise priority of Shutdown and FinishShutdown IPC messages → Investigate effect of raising the priority of Shutdown IPC messages
Attachment #9263834 - Attachment is obsolete: true
Keywords: leave-open
Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5a00e56ade15
Step 1 - Add high priority ShutdownConfirmedHP method and update Annotation::IPCShutdownState if received. r=smaug

Sadly the incoming numbers seem not to confirm that just raising the priority would help much here. The vast majority of crashes continue to not have any IPC shutdown state annotation.

I assume that we could add some annotation inside ProcessChild::NotifyImpendingShutdown() to check if that special IMPENDING_SHUTDOWN_MESSAGE_TYPE message sent by ContentParent::StartForceKillTimer() is processed, which would give us an indication if we might be able to talk to the client process in these situations at all such that we could trigger shutdown in a similar way. Otherwise I would assume we do not even arrive with any kind of message and IPC communication is somehow stalled?

BTW, looking at the patches in bug 1633791 I was wondering if we could rename NotifyImpendingShutdown to better indicate the difference between sending and receiving.

See Also: → 1633791

(In reply to Jens Stutte [:jstutte] from comment #5)

Sadly the incoming numbers seem not to confirm that just raising the priority would help much here. The vast majority of crashes continue to not have any IPC shutdown state annotation.

Oh, that's really bad news. Well there's two possibilities left: the first one is that the content process is really getting stuck somewhere, but that would cover only the cases when the acknowledgment message hasn't been sent. The second one is that the process was never scheduled and thus never had a chance of running in the first place. It's impossible to discount that, especially on older machines with hard-drives and little memory, which might be slowed down by swapping.

(In reply to Gabriele Svelto [:gsvelto] from comment #10)

the first one is that the content process is really getting stuck somewhere, but that would cover only the cases when the acknowledgment message hasn't been sent.

Yeah, that would probably just be the "machine stuck" case I already mentioned in phabricator.

The second one is that the process was never scheduled and thus never had a chance of running in the first place. It's impossible to discount that, especially on older machines with hard-drives and little memory, which might be slowed down by swapping.

At least from the data we see real crashes of apparently started content processes where the shutdown annotation is never set. Those are the ones I am chasing with the additional annotation now. What you are suggesting here would be a totally different thing, IIUC.

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/89d6b8e05394
Update Annotation::IPCShutdownState on NotifyImpendingShutdown r=ipc-reviewers,mccr8
Attachment #9264795 - Attachment description: Bug 1755376: Rename NotifyImpendingShutdown to distinguish send and receive case. r?#ipc-reviewers → Bug 1755376: Rename receiving NotifyImpendingShutdown to distinguish send and receive case. r?#ipc-reviewers

So the first incoming data does not show much difference here. From 182 reports for version 20220223093800 so far I see only 1 with "NotifyImpendingShutdown received." (and 1 with "SendFinishShutdown (sent)"), all the others have no sign of any shutdown progress at all.

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/95d43cf0d451
Rename receiving NotifyImpendingShutdown to distinguish send and receive case. r=ipc-reviewers,mccr8

Looking at bug 1279293 comment 198, it seems now that we probably should use special IO messages both to signal shutdown to the child and a finished shutdown to the parent.

This would decouple us from the normal event processing order and ensure we can timely react.

In particular it could be that even the ShutdownConfirmedHP message is not processed if there are already many events in the control priority queue.

Looking at where we use control, to me the usage in Graphics: Layers (compositor) stands out, in particular we did bug 1730998.

Updated data from crashstats

Rank Last annotation # %
1 NotifyImpendingShutdown received. 5605 58.07 %
2 SendFinishShutdown (sent) 1735 17.98 %
3 content-child-shutdown started 31 0.32 %
4 ShutdownInternal entry 16 0.17 %
5 RecvShutdownConfirmedHP entry 4 0.04 %
6 RecvShutdown entry 1 0.01 %
7 SendFinishShutdown (sending) 1 0.01 %

This confirms that we still have mostly bottlenecks in receiving / running the shutdown messages, it seems they get queued way to often.

Bug 1775025 might be related to the cases where we see only NotifyImpendingShutdown messages arrive (which is sent over the same channel, meaning that the channel as such seems to be generally working up to that point).

Edit: See :nika's bug 1775025 comment 5.

See Also: → 1775025
See Also: → 1777198

The leave-open keyword is there and there is no activity for 6 months.
:jstutte, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)

Since we landed D155814 we see a bit more noise with annotations like:

NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) 
NotifiedImpendingShutdown - NotifiedImpendingShutdown - RecvShutdownConfirmedHP entry - RecvShutdown entry - content-child-will-shutdown started - ShutdownInternal entry - content-child-shutdown started - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - StartForceKillTimer - SendFinishShutdown (sending) - SendFinishShutdown (sent)
NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - RecvShutdownConfirmedHP entry - RecvShutdown entry - content-child-will-shutdown started - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry 

I think we should reduce the verbosity a bit and ensure that NotifiedImpendingShutdown and RecvRequestContentJSInterrupt are called only once. The repeated ShutdownInternal entry message indicates a stuck nested event loop on the content process' main thread.

Flags: needinfo?(jstutte)

The repeated ShutdownInternal entry case on all instances I looked at seems to come from this SpinEventLoopUntil. It seems reasonable to add a condition for shutdown here to abort the loop. We could fake and throw some network timeout as a consequence just to do something consistent with normal operations.

See Also: → 1808533

(In reply to Jens Stutte [:jstutte] from comment #22)

I think we should reduce the verbosity a bit and ensure that NotifiedImpendingShutdown and RecvRequestContentJSInterrupt are called only once.

I filed bug 1808691 for this.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: