Open Bug 1795902 Opened 2 years ago Updated 2 years ago

Consider adding a sync IPC timeout by-default to all parent process main-thread channels

Categories

(Core :: IPC, enhancement, P2)

enhancement

Tracking

()

People

(Reporter: nika, Unassigned)

References

Details

MessageChannel has support for timeouts on sync ipc messages, which was originally used to implement timeouts for CPOWs over the PContent protocol (https://searchfox.org/mozilla-central/rev/0a2eba79c24300ce0539f91c1bebac2e75264e58/dom/ipc/ContentParent.cpp#2732). While the name of the pref (dom.ipc.cpow.timeout) suggests that it only applies to CPOWs (which no longer exist), it actually applies to all sync ipc replies.

We don't use any timeout at all on all other toplevel protocols, including other protocols which send sync ipc from the parent process main thread.

As an example, I recently triggered https://bugzilla.mozilla.org/show_bug.cgi?id=1792115 on a local copy of nightly, causing the GPU process to deadlock. Because the parent process main thread sends sync ipc to the GPU process to handle things like apz input events (in my case PAPZInputBridgeChild::SendReceiveKeyboardInputEvent), the parent process main thread also completely hung indefinitely. If we had a timeout, we may be able to recover from a situation like this by detecting the GPU process has hung and restarting it, avoiding the parent process hanging.

As we now have many more toplevel protocols than we used to, and we expect to have more in the future, it's probably not the best idea to rely on each individual protocol implementation to set a timeout like this. It might make sense for us to set a default reply timeout for all sync ipc sent from the parent process main thread (or perhaps any thread in the parent process? we wouldn't want e.g. the PBackground thread to hang either).

As :jstutte mentioned on matrix, we should also add telemetry so that we know what sync IPC messages are timing out in the wild.

I looked into doing this a bit more, and came across the fact that the timeout added for PContent actually is more complex than the 500ms timeout. It appears that it ends up calling ContentParent::ShouldContinueFromReplyTimeout which will only actually time out the request if the content process appears to be hanging according to the ProcessHangMonitor (https://searchfox.org/mozilla-central/rev/a64647a2125cf3d334451051491fef6772e8eb57/dom/ipc/ContentParent.cpp#5057-5060). Every other actor actually defaults to returning true here it appears, meaning that even if we set up a timeout, it wouldn't do anything: (https://searchfox.org/mozilla-central/rev/a64647a2125cf3d334451051491fef6772e8eb57/ipc/ipdl/ipdl/lower.py#3803-3811).

The original motivation for this bug was to add some form of hang recovery for the GPU process to avoid it hanging from causing all of firefox to go down (as it does in bug 1792115). In addition, it appears as if generally speaking the bulk of sync IPC starting in the parent processes are on actors like PAPZInputBridge or other reversed actors, which mostly exist in graphics code.

IIRC we don't have something like the process hang monitor for the gpu process, but we could perhaps add timeout handling support to actors which connect to the gpu process on a per-actor basis, as ideally I imagine that things like the APZ input bridge shouldn't be hanging for an extended period of time in order to keep the browser responsive.

You need to log in before you can comment on or make changes to this bug.