Open Bug 1755376 Opened 2 years ago Updated 1 year ago

Investigate effect of raising the priority of Shutdown IPC messages

Tracking

()

Status:

ASSIGNED

People

(Reporter: jstutte, Assigned: jstutte)

References

(Blocks 1 open bug)

Details

(Keywords: leave-open)

Attachments

(4 files, 1 obsolete file)

Bug 1755376: Raise priority of ContentChild::Shutdown and harden some entry points against being late in shutdown. r?#ipc-reviewers,smaug! 2 years ago Jens Stutte [:jstutte] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1755376: Step 1 - Add high priority ShutdownConfirmedHP method and update Annotation::IPCShutdownState if received. r?smaug 2 years ago Jens Stutte [:jstutte] 48 bytes, text/x-phabricator-request		Details \| Review
RecvShutdownConfirmed not appearing? 2 years ago Jens Stutte [:jstutte] 13.21 KB, image/png		Details
Bug 1755376: Update Annotation::IPCShutdownState on NotifyImpendingShutdown r?#ipc-reviewers 2 years ago Jens Stutte [:jstutte] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1755376: Rename receiving NotifyImpendingShutdown to distinguish send and receive case. r?#ipc-reviewers 2 years ago Jens Stutte [:jstutte] 48 bytes, text/x-phabricator-request		Details \| Review

Jens Stutte [:jstutte]

Assignee

Description

•

2 years ago

•

Edited

As of bug 1279293 comment 195 and bug 1279293 comment 196 we want to ensure a higher priority for child process shutdown messages.

The [Priority=control] annotation seems appropriate for this.

After a first sketch in D138708 I think we should move to a step-by-step approach:

Introduce a high-priority ShutdownConfirmed message called right before SendShutdown by the parent. This sets the CrashReporter::Annotation::IPCShutdownState annotation to something we can recognize in crash-stats to see if we are looking at the right case.
Move the switch to ShutdownPhase::AppShutdownConfirmed from RecvShutdown to RecvShutdownConfirmed.
Advance instead the shutdown phase to ShutdownPhase::AppShutdown in RecvShutdown to replace mShuttingDown.
Add more appropriate bailouts step by step (some of the places are already known from my experiments).
(optional and not ContentProcess specific) Rename AppShutdown (and ShutdownPhase::AppShutdownXxx and …) to ProcessShutdown (see bug 1697745).

Step 1 is just diagnostic. Step 2 would have some effect on existing JS SpinEventLoopUntilOrQuit, hopefully for the better, and maybe FOG. Step 3 would just replace mShuttingDown and could be even done before 2, probably.

See this comment in phabricator for a longer discussion of the rationale and assumptions behind this plan.

Jens Stutte [:jstutte]

Assignee

Comment 1

•

2 years ago

Attached file Bug 1755376: Raise priority of ContentChild::Shutdown and harden some entry points against being late in shutdown. r?#ipc-reviewers,smaug! (obsolete) — Details

Phabricator Automation

Updated

•

2 years ago

Assignee: nobody → jstutte

Status: NEW → ASSIGNED

Phabricator Automation

Updated

•

2 years ago

Attachment #9263834 - Attachment description: Bug 1755376: Raise priority of Shutdown and FinishShutdown IPC messages. r?mccr8 → WIP: Bug 1755376: Raise priority of ContentChild::Shutdown and harden some entry points against being late in shutdown.

Phabricator Automation

Updated

•

2 years ago

Attachment #9263834 - Attachment description: WIP: Bug 1755376: Raise priority of ContentChild::Shutdown and harden some entry points against being late in shutdown. → Bug 1755376: Raise priority of ContentChild::Shutdown and harden some entry points against being late in shutdown. r?#ipc-reviewers

Phabricator Automation

Updated

•

2 years ago

Attachment #9263834 - Attachment description: Bug 1755376: Raise priority of ContentChild::Shutdown and harden some entry points against being late in shutdown. r?#ipc-reviewers → Bug 1755376: Raise priority of ContentChild::Shutdown and harden some entry points against being late in shutdown. r?#ipc-reviewers,smaug!

Jens Stutte [:jstutte]

Assignee

Comment 2

•

2 years ago

Attached file Bug 1755376: Step 1 - Add high priority ShutdownConfirmedHP method and update Annotation::IPCShutdownState if received. r?smaug — Details

Jens Stutte [:jstutte]

Assignee

Updated

•

2 years ago

Summary: Raise priority of Shutdown and FinishShutdown IPC messages → Investigate effect of raising the priority of Shutdown IPC messages

Phabricator Automation

Updated

•

2 years ago

Attachment #9263834 - Attachment is obsolete: true

Jens Stutte [:jstutte]

Assignee

Updated

•

2 years ago

Keywords: leave-open

Pulsebot

Comment 3

•

2 years ago

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5a00e56ade15
Step 1 - Add high priority ShutdownConfirmedHP method and update Annotation::IPCShutdownState if received. r=smaug

Cosmin Sabou [:CosminS]

Comment 4

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/5a00e56ade15

Jens Stutte [:jstutte]

Assignee

Comment 5

•

2 years ago

•

Edited

Attached image RecvShutdownConfirmed not appearing? — Details

Sadly the incoming numbers seem not to confirm that just raising the priority would help much here. The vast majority of crashes continue to not have any IPC shutdown state annotation.

I assume that we could add some annotation inside ProcessChild::NotifyImpendingShutdown() to check if that special IMPENDING_SHUTDOWN_MESSAGE_TYPE message sent by ContentParent::StartForceKillTimer() is processed, which would give us an indication if we might be able to talk to the client process in these situations at all such that we could trigger shutdown in a similar way. Otherwise I would assume we do not even arrive with any kind of message and IPC communication is somehow stalled?

Comment hidden (obsolete)

Jens Stutte [:jstutte]

Assignee

Comment 7

•

2 years ago

Attached file Bug 1755376: Update Annotation::IPCShutdownState on NotifyImpendingShutdown r?#ipc-reviewers — Details

Jens Stutte [:jstutte]

Assignee

Comment 8

•

2 years ago

BTW, looking at the patches in bug 1633791 I was wondering if we could rename NotifyImpendingShutdown to better indicate the difference between sending and receiving.

Comment 9

•

2 years ago

Attached file Bug 1755376: Rename receiving NotifyImpendingShutdown to distinguish send and receive case. r?#ipc-reviewers — Details

Depends on D139255

Gabriele Svelto [:gsvelto]

Comment 10

•

2 years ago

(In reply to Jens Stutte [:jstutte] from comment #5)

Sadly the incoming numbers seem not to confirm that just raising the priority would help much here. The vast majority of crashes continue to not have any IPC shutdown state annotation.

Oh, that's really bad news. Well there's two possibilities left: the first one is that the content process is really getting stuck somewhere, but that would cover only the cases when the acknowledgment message hasn't been sent. The second one is that the process was never scheduled and thus never had a chance of running in the first place. It's impossible to discount that, especially on older machines with hard-drives and little memory, which might be slowed down by swapping.

Jens Stutte [:jstutte]

Assignee

Comment 11

•

2 years ago

(In reply to Gabriele Svelto [:gsvelto] from comment #10)

the first one is that the content process is really getting stuck somewhere, but that would cover only the cases when the acknowledgment message hasn't been sent.

Yeah, that would probably just be the "machine stuck" case I already mentioned in phabricator.

The second one is that the process was never scheduled and thus never had a chance of running in the first place. It's impossible to discount that, especially on older machines with hard-drives and little memory, which might be slowed down by swapping.

At least from the data we see real crashes of apparently started content processes where the shutdown annotation is never set. Those are the ones I am chasing with the additional annotation now. What you are suggesting here would be a totally different thing, IIUC.

Pulsebot

Comment 12

•

2 years ago

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/89d6b8e05394
Update Annotation::IPCShutdownState on NotifyImpendingShutdown r=ipc-reviewers,mccr8

Cristina Cozmuta (:CrissCozmuta)

Comment 13

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/89d6b8e05394

Phabricator Automation

Updated

•

2 years ago

Attachment #9264795 - Attachment description: Bug 1755376: Rename NotifyImpendingShutdown to distinguish send and receive case. r?#ipc-reviewers → Bug 1755376: Rename receiving NotifyImpendingShutdown to distinguish send and receive case. r?#ipc-reviewers

Jens Stutte [:jstutte]

Assignee

Comment 14

•

2 years ago

So the first incoming data does not show much difference here. From 182 reports for version 20220223093800 so far I see only 1 with "NotifyImpendingShutdown received." (and 1 with "SendFinishShutdown (sent)"), all the others have no sign of any shutdown progress at all.

Pulsebot

Comment 15

•

2 years ago

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/95d43cf0d451
Rename receiving NotifyImpendingShutdown to distinguish send and receive case. r=ipc-reviewers,mccr8

Marian-Vasile Laza

Comment 16

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/95d43cf0d451

Jens Stutte [:jstutte]

Assignee

Comment 17

•

2 years ago

Looking at bug 1279293 comment 198, it seems now that we probably should use special IO messages both to signal shutdown to the child and a finished shutdown to the parent.

This would decouple us from the normal event processing order and ensure we can timely react.

Jens Stutte [:jstutte]

Assignee

Comment 18

•

2 years ago

•

Edited

In particular it could be that even the ShutdownConfirmedHP message is not processed if there are already many events in the control priority queue.

Looking at where we use control, to me the usage in Graphics: Layers (compositor) stands out, in particular we did bug 1730998.

Jens Stutte [:jstutte]

Assignee

Comment 19

•

2 years ago

•

Edited

Updated data from crashstats

Rank	Last annotation	#	%
1	NotifyImpendingShutdown received.	5605	58.07 %
2	SendFinishShutdown (sent)	1735	17.98 %
3	content-child-shutdown started	31	0.32 %
4	ShutdownInternal entry	16	0.17 %
5	RecvShutdownConfirmedHP entry	4	0.04 %
6	RecvShutdown entry	1	0.01 %
7	SendFinishShutdown (sending)	1	0.01 %

This confirms that we still have mostly bottlenecks in receiving / running the shutdown messages, it seems they get queued way to often.

Jens Stutte [:jstutte]

Assignee

Comment 20

•

2 years ago

•

Edited

Bug 1775025 might be related to the cases where we see only NotifyImpendingShutdown messages arrive (which is sent over the same channel, meaning that the channel as such seems to be generally working up to that point).

Edit: See :nika's bug 1775025 comment 5.

Updated

•

2 years ago

Comment 21

•

2 years ago

The leave-open keyword is there and there is no activity for 6 months.
:jstutte, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)

Jens Stutte [:jstutte]

Assignee

Comment 22

•

1 year ago

Since we landed D155814 we see a bit more noise with annotations like:

NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) 
NotifiedImpendingShutdown - NotifiedImpendingShutdown - RecvShutdownConfirmedHP entry - RecvShutdown entry - content-child-will-shutdown started - ShutdownInternal entry - content-child-shutdown started - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - StartForceKillTimer - SendFinishShutdown (sending) - SendFinishShutdown (sent)
NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - NotifiedImpendingShutdown - HangMonitorChild::RecvRequestContentJSInterrupt (expected) - RecvShutdownConfirmedHP entry - RecvShutdown entry - content-child-will-shutdown started - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry - ShutdownInternal entry

I think we should reduce the verbosity a bit and ensure that NotifiedImpendingShutdown and RecvRequestContentJSInterrupt are called only once. The repeated ShutdownInternal entry message indicates a stuck nested event loop on the content process' main thread.

Flags: needinfo?(jstutte)

Jens Stutte [:jstutte]

Assignee

Comment 23

•

1 year ago

The repeated ShutdownInternal entry case on all instances I looked at seems to come from this SpinEventLoopUntil. It seems reasonable to add a condition for shutdown here to abort the loop. We could fake and throw some network timeout as a consequence just to do something consistent with normal operations.

Jens Stutte [:jstutte]

Assignee

Updated

•

1 year ago

Comment 24

•

1 year ago

(In reply to Jens Stutte [:jstutte] from comment #22)

I think we should reduce the verbosity a bit and ensure that NotifiedImpendingShutdown and RecvRequestContentJSInterrupt are called only once.

I filed bug 1808691 for this.

You need to log in before you can comment on or make changes to this bug.