Enable IPDL timeout for communicating to the GPU process

RESOLVED FIXED in Firefox 56

Status

()

P3
normal
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: mstange, Assigned: freesamael)

Tracking

(Blocks: 2 bugs)

Trunk
mozilla56
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox56 fixed)

Details

(Whiteboard: [gfx-noted][qf:p1][bhr])

Attachments

(1 attachment, 4 obsolete attachments)

(Reporter)

Description

2 years ago
We have some synchronous IPC calls from the parent process to the GPU process. The most notable ones are FlushRendering (going away in bug 1361257) and APZ input processing.

These sync calls currently wait indefinitely. If something caused the GPU process to become completely unresponsive, that means we hang the parent process and never recover.

To combat this, we could have a watchdog in the parent process that periodically checks whether the GPU process is responsive, and terminate it if not. Terminating the GPU process will call any sync calls to complete (with an error) and the parent process will be able to recover.
We could also have a timeout for PCompositorBridge messages, when the GPU process is enabled. That's a much easier patch to write if we want something soon.

Comment 2

2 years ago
There is some BHR 8-second hang reports with signatures that may indicate hung GPU processes causing UI hangs here: https://people-mozilla.org/~mlayzell/bhr/20170429/all.html

For example, search for PCompositorBridgeChild::SendFlushRendering().
QA Whiteboard: [qf]

Updated

2 years ago
QA Whiteboard: [qf] → [qf][gfx-noted]
Priority: -- → P3

Updated

2 years ago
QA Whiteboard: [qf][gfx-noted] → [gfx-noted][qf]

Updated

2 years ago
QA Whiteboard: [gfx-noted][qf] → [qf]
Whiteboard: [gfx-noted]

Updated

2 years ago
Whiteboard: [gfx-noted] → [gfx-noted][qf]

Updated

2 years ago
QA Whiteboard: [qf]
Whiteboard: [gfx-noted][qf] → [gfx-noted][qf][bhr]
(In reply to David Anderson [:dvander] from comment #1)
> We could also have a timeout for PCompositorBridge messages, when the GPU
> process is enabled. That's a much easier patch to write if we want something
> soon.

Hi David,

Do you think we still need this after Bug 1361257 was landed?
Flags: needinfo?(dvander)
I don't think bug 1361257 is related, even if we remove a few sync messages we still want hang reports for the GPU process.

I also think we should consider comment #1 and enable the IPDL timeout for UI->GPU channels.
Flags: needinfo?(dvander)

Comment 5

2 years ago
That sounds like a good idea to me.
(In reply to David Anderson [:dvander] from comment #1)
> We could also have a timeout for PCompositorBridge messages, when the GPU
> process is enabled. That's a much easier patch to write if we want something
> soon.

qf triage liked this option. Is it possible to do that?

(qf:p1 because the "best" case when this happens is a 500 ms jank (I think that's what was said))
Flags: needinfo?(bas)
Whiteboard: [gfx-noted][qf][bhr] → [gfx-noted][qf:p1][bhr]
(In reply to Andrew Overholt [:overholt] from comment #6)
> (In reply to David Anderson [:dvander] from comment #1)
> > We could also have a timeout for PCompositorBridge messages, when the GPU
> > process is enabled. That's a much easier patch to write if we want something
> > soon.
> 
> qf triage liked this option. Is it possible to do that?
> 
> (qf:p1 because the "best" case when this happens is a 500 ms jank (I think
> that's what was said))

Hrm, I'm not sure what that means. This bug is about detecting/resolving actual hangs or deadlocks, and what I proposed isn't going to help instances of 500ms jank.
Blocks: 1364015

Comment 8

2 years ago
(In reply to David Anderson [:dvander] from comment #4)
> I don't think bug 1361257 is related, even if we remove a few sync messages
> we still want hang reports for the GPU process.
> 
> I also think we should consider comment #1 and enable the IPDL timeout for
> UI->GPU channels.

Resummarizing the bug based on this comment to correctly reflect what the bug is about.
Summary: Kill the GPU process if it's unresponsive for a long time → Enable IPDL timeout for communicating to the GPU process
If we have timeout mechanism for PCompositorBridge messages, but users still couldn't see any screen update because GPU process becomes unresponsive. Or we can consider to kill GPU process once we hit the timeout for PCommpositorBridge in parent process.
(In reply to Andrew Overholt [:overholt] from comment #6)
> (In reply to David Anderson [:dvander] from comment #1)
> > We could also have a timeout for PCompositorBridge messages, when the GPU
> > process is enabled. That's a much easier patch to write if we want something
> > soon.
> 
> Is it possible to do that?

It seems we have been using MessageChannel::SetReplyTimeoutMs in PPluginModule to detect plugin timeout for a long time. We might be able to take the same approach here. I'm willing to give it a try.
Assignee: nobody → sawang
(In reply to Samael Wang [:freesamael] from comment #10)
> (In reply to Andrew Overholt [:overholt] from comment #6)
> > (In reply to David Anderson [:dvander] from comment #1)
> > > We could also have a timeout for PCompositorBridge messages, when the GPU
> > > process is enabled. That's a much easier patch to write if we want something
> > > soon.
> > 
> > Is it possible to do that?
> 
> It seems we have been using MessageChannel::SetReplyTimeoutMs in
> PPluginModule to detect plugin timeout for a long time. We might be able to
> take the same approach here. I'm willing to give it a try.

Yup that should work, we added that to timeout CPOWs. We can test by putting a sleep on some sync message receiver.
Comment on attachment 8875652 [details] [diff] [review]
v1

Hi David,

I added a timeout at CompositorBridgeChild. Does it look correct & sufficient to you? Also, I'm not sure how long the default timeout value should be. Consider relaunching GPU process is extremely expensive I think we don't want to be aggressive.
Attachment #8875652 - Flags: review?(dvander)
Comment on attachment 8875652 [details] [diff] [review]
v1

Review of attachment 8875652 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxPrefs.h
@@ +549,5 @@
>    DECL_GFX_PREF(Once, "layers.force-shmem-tiles",              ForceShmemTiles, bool, false);
>    DECL_GFX_PREF(Once, "layers.gpu-process.enabled",            GPUProcessEnabled, bool, false);
>    DECL_GFX_PREF(Once, "layers.gpu-process.force-enabled",      GPUProcessForceEnabled, bool, false);
>    DECL_GFX_PREF(Once, "layers.gpu-process.timeout_ms",         GPUProcessTimeoutMs, int32_t, 5000);
> +  DECL_GFX_PREF(Once, "layers.gpu-process.ipc_reply_timeout_ms", GPUProcessIPCReplyTimeoutMs, int32_t, 1000);

I'm a little nervous having the value so low. Any kind of slight hiccup could end up killing the GPU process. How about somewhere between 5-10s?
Attachment #8875652 - Attachment description: Add a timeout to PCompositorBridge & kill GPU process on timeout → v1
Attachment #8875652 - Attachment is obsolete: true
Attachment #8875652 - Flags: review?(dvander)
Comment on attachment 8876058 [details] [diff] [review]
v2

(In reply to David Anderson [:dvander] from comment #14)
> I'm a little nervous having the value so low. Any kind of slight hiccup
> could end up killing the GPU process. How about somewhere between 5-10s?

I'll take 10s, so it won't kill the process unless the GPU process works really really bad.
Attachment #8876058 - Flags: review?(dvander)
(Reporter)

Comment 17

2 years ago
Should this be disabled by default in debug builds? Otherwise, if you attach your debugger to the GPU process, you're going to have a bad experience.

Do you know what happens if the computer goes into sleep mode while an IPC message is being processed? Is the duration of the sleep ignored for the purposes of measuring whether we timed out?
(In reply to Markus Stange [:mstange] from comment #17)
> Should this be disabled by default in debug builds? Otherwise, if you attach
> your debugger to the GPU process, you're going to have a bad experience.
> 
> Do you know what happens if the computer goes into sleep mode while an IPC
> message is being processed? Is the duration of the sleep ignored for the
> purposes of measuring whether we timed out?

That's a really good question.
(In reply to Markus Stange [:mstange] from comment #17)
> Do you know what happens if the computer goes into sleep mode while an IPC
> message is being processed? Is the duration of the sleep ignored for the
> purposes of measuring whether we timed out?

It seems that underlying we're using SleepConditionVariableSRW on Windows [1], so I wrote a simple app [2] with 30 secs timeout and tested it on a Windows 10 laptop. In usual cases, it should print "elapsed time=15000 success". I tried to close my laptop to put the system into sleep while the program's running, then I went for lunch and woke my laptop up when I got back. The result showed "elapsed time=1477813 success". So it looks to me that the timeout function is not affected by system sleep on Windows.

I'm looking into POSIX implementation, which might actually have problems with system sleep as it's using absolute time [3]

[1] http://searchfox.org/mozilla-central/rev/d840ebd5858a61dbc1622487c1fab74ecf235e03/mozglue/misc/ConditionVariable_windows.cpp#78
[2] https://pastebin.mozilla.org/9024337
[3] http://searchfox.org/mozilla-central/source/mozglue/misc/ConditionVariable_posix.cpp#150-151,153
Hum... I think the real problem is we determine IPC timeout by calculating the time interval:
http://searchfox.org/mozilla-central/rev/d840ebd5858a61dbc1622487c1fab74ecf235e03/ipc/glue/MessageChannel.cpp#2267-2268

Comment 21

2 years ago
FWIW if you need to handle the sleep/resume case in a custom fashion, we already have two useful notifications that we dispatch when these things happen: <http://searchfox.org/mozilla-central/rev/61054508641ee76f9c49bcf7303ef3cfb6b410d2/widget/nsIWidget.h#219>.
I found that there has been an attempt to prevent suspension issue in bug 679240. If we set a timeout of 10s, it may first be timed out in 5s, but then it would try again with another 5s. Thus if the system suspended in the first 5s, it would not cause a real timeout to the IPC call. I think this simple approach generally works well. In reality we might have a suspension during the "second half", but that would imply the IPC call hasn't responded for at least 5s.

Still, it's worth trying to understand the details. What matters to us is how PRIntervalTime and ConditionVariableImpl::wait_for are implemented on various platforms, so I tried to dig into the implementation on Windows, macOS & Linux, and made a very simple test for validation: https://pastebin.mozilla.org/9024537

On Windows, PRIntervalTime uses timeGetTime, which seems to use KeQueryInterruptTime[1] underlying, and it does count during suspension. According to MSDN article "Interrupt Time"[2] we should use QueryUnbiasedInterruptTime if we don't want to count suspension; ConditionVariableImpl::wait_for is done by SleepConditionVariableSRW on Windows which takes a relative time as the timeout value. MSDN document doesn't mention if it counts during suspension, but my experiment on Windows 10 shows not.

On macOS, we're using mach_absolute_time for PRIntervalTime, which is hardware dependent[3]. The kernel source comments[4] indicates it's using x86 Time Stamp Counter[5]; ConditionVariableImpl::wait_for uses pthread_cond_timedwait_relative_np. The source code shows it sends relative timeout to kernel, and it eventually points me to timer_call_enter_internal[6] which also seems to uses mach_absolute_time. I have no idea how TSC works but at least on my MacBook both don't include system suspension.

On Linux we're using pthread_cond_timedwait & clock_gettime respectively, and passing CLOCK_MONOTONIC in both cases. The manpage[7] implies CLOCK_MONOTONIC doesn't include suspension, only CLOCK_BOOTTIME does.

[1] https://msdn.microsoft.com/en-us/library/windows/desktop/ff553025%28v=vs.85%29.aspx
[2] https://msdn.microsoft.com/en-us/library/windows/desktop/ee662306%28v=vs.85%29.aspx
[3] https://developer.apple.com/library/content/qa/qa1398/_index.html
[4] https://opensource.apple.com/source/xnu/xnu-3789.1.32/osfmk/i386/rtclock.c.auto.html
[5] https://en.wikipedia.org/wiki/Time_Stamp_Counter
[6] https://opensource.apple.com/source/xnu/xnu-3789.1.32/osfmk/kern/timer_call.c.auto.html
[7] http://man7.org/linux/man-pages/man2/clock_gettime.2.html
Hi David,

Do you think we need additional handling for suspension, or we leave the workaround in bug 679240 as is?
Flags: needinfo?(dvander)
Thanks for the detailed writeup and investigation, Samael! Yes, the workaround sounds fine. The only remaining question is debug builds. How do you feel about putting the SetReplyTimeoutMs call in an #ifndef DEBUG block? (That won't help people attaching a debugger to a release build, but they can always set the pref to disable the timeout.)
Flags: needinfo?(dvander)
Comment on attachment 8876058 [details] [diff] [review]
v2

(In reply to David Anderson [:dvander] from comment #24)
> How do you feel about putting the SetReplyTimeoutMs call in an #ifndef DEBUG block?
Sure I can do this. Let me update the patch.
Attachment #8876058 - Flags: review?(dvander)
Attachment #8876058 - Attachment description: Add a timeout to PCompositorBridge & kill GPU process on timeout → v2
Attachment #8876058 - Attachment is obsolete: true
Attachment #8876058 - Flags: review?(dvander)
Flags: needinfo?(bas)
Keywords: checkin-needed
Needs rebasing.
Keywords: checkin-needed
Oh there's a new top-level protocol PCompositorManager. Trying to move timeout there and re-test.
Attachment #8877860 - Attachment description: Add a timeout to PCompositorBridge & kill GPU process on timeout in release builds → v3
Attachment #8877860 - Attachment is obsolete: true
Created attachment 8879054 [details] [diff] [review]
v4

Move the kill-timer to PCompositorManager now that it's the top-level protocol.
Comment on attachment 8879054 [details] [diff] [review]
v4

Hi David,

I have to move the timer to PCompositorManager as it's the top-level protocol managing PCompositorBridge now. Could you take a second look for the change just in case?
Attachment #8879054 - Flags: review?(dvander)
Attachment #8879054 - Flags: review?(dvander) → review+
Keywords: checkin-needed

Comment 31

2 years ago
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e62923d406f0
Add a timeout to PCompositorManager & kill GPU process on timeout in release builds. r=dvander
Keywords: checkin-needed
This patch should really log something to stderr indicate that the GPU process is getting killed due to timeout. We might start hitting it in automation and it would be good to know this is why things are failing.
Backed out on suspicion of causing reftest failures on Linux x64 QuantumRender opt, e.g. multicol-rule-hidden-000-ref.xht:

https://hg.mozilla.org/integration/mozilla-inbound/rev/5fc8b8976a9892b14c4187a5a133214fcc8f19d5

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=357635c84e498178cbec4d88ad814d8dab00af21&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=108559437&repo=mozilla-inbound

[task 2017-06-20T15:07:11.463623Z] 15:07:11     INFO - REFTEST TEST-UNEXPECTED-FAIL | file:///home/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/css-multicol-1/multicol-rule-hidden-000.xht == file:///home/worker/workspace/build/tests/reftest/tests/layout/reftests/w3c-css/received/css-multicol-1/multicol-rule-hidden-000-ref.xht | image comparison, max difference: 255, number of differing pixels: 800000
Flags: needinfo?(sawang)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #32)
> This patch should really log something to stderr indicate that the GPU
> process is getting killed due to timeout. We might start hitting it in
> automation and it would be good to know this is why things are failing.

Maybe Telemetry is also a good idea, some kind of crash stack/hang report if possible. I know we have this functionality for content and plugin processes, but I'm not entirely sure how it works.
I added gfxCriticalNote, and it did timeout when running reftest with Quantum Render:
https://treeherder.mozilla.org/logviewer.html#?job_id=108799281&repo=try&lineNumber=3141

I'm not exactly sure what we should do if sync IPC calls can take more than 10 seconds in our test cases...
Flags: needinfo?(sawang)
(In reply to Samael Wang [:freesamael] from comment #35)
> I added gfxCriticalNote, and it did timeout when running reftest with
> Quantum Render:
> https://treeherder.mozilla.org/logviewer.
> html#?job_id=108799281&repo=try&lineNumber=3141
> 
> I'm not exactly sure what we should do if sync IPC calls can take more than
> 10 seconds in our test cases...

Can you disable the timeout when gfxVars::UseWebRender() returns true?
Created attachment 8880287 [details] [diff] [review]
Add a timeout to PCompositorManager & kill GPU process on timeout in release builds
Attachment #8879054 - Attachment is obsolete: true
Attachment #8879054 - Attachment description: Add a timeout to PCompositorManager & kill GPU process on timeout in release builds → v4
(In reply to David Anderson [:dvander] from comment #36)
> (In reply to Samael Wang [:freesamael] from comment #35)
> > I added gfxCriticalNote, and it did timeout when running reftest with
> > Quantum Render:
> > https://treeherder.mozilla.org/logviewer.
> > html#?job_id=108799281&repo=try&lineNumber=3141
> > 
> > I'm not exactly sure what we should do if sync IPC calls can take more than
> > 10 seconds in our test cases...
> 
> Can you disable the timeout when gfxVars::UseWebRender() returns true?

OK, and I filed a follow-up bug 1375424 for the web render issue.

Comment 40

2 years ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ffefe7f1771d
Add a timeout to PCompositorManager & kill GPU process on timeout in release builds. r=dvander
Keywords: checkin-needed

Comment 41

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/ffefe7f1771d
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox56: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Comment on attachment 8880287 [details] [diff] [review]
Add a timeout to PCompositorManager & kill GPU process on timeout in release builds

Review of attachment 8880287 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorManagerChild.cpp
@@ +259,5 @@
> +CompositorManagerChild::ShouldContinueFromReplyTimeout()
> +{
> +  if (XRE_IsParentProcess()) {
> +    gfxCriticalNote << "Killing GPU process due to IPC reply timeout";
> +    GPUProcessManager::Get()->KillProcess();

We are potentially calling this when there is no GPU process.
You need to log in before you can comment on or make changes to this bug.