Closed Bug 940737 Opened 6 years ago Closed 6 years ago

Monitor IPC thread hangs using BackgroundHangMonitor

Categories

(Core :: IPC, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla28

People

(Reporter: jchen, Assigned: jchen)

References

Details

(Whiteboard: [qa-])

Attachments

(1 file, 1 obsolete file)

Use BackgroundHangMonitor to detect IPC thread (e.g. Compositor thread) hangs and report these through telemetry.
Do you mean the IPC thread or the compositor thread? This probably belongs in Core:Graphics
Though, if it doesn't hurt perf, why not add to both?
The compositor thread uses the IPC thread loop so it's simpler to add it to IPC threads in general.
Benjamin, can you review this or suggest someone else to review? Thanks!

This patch does several things,

* Add hang timeout options to MessageLoop
** transient_hang_timeout specifies how much time a task must take before we record a transient hang.
** permanent_hang_timeout is similar, but for permanent hangs.
** The two options are set through MessageLoop::set_hang_timeouts().
** The difference between transient hangs and permanent hangs is, for transient hangs, we eventually recover from the hang and we record how long the hang was; a transient hang may also turn into a permanent hang, which we assume is unrecoverable. We may perform more intensive analysis on permanent hangs like stack unwinding.

* Add hang timeout options to Thread, which passes these options to MessageLoop
** MessageLoop is a logical place to store these options because the hang monitoring mechanism is tied to the loop.
** However, for users of Thread, they don't get to interact with MessageLoop before the thread is started, and so Thread must be mde to pass along these options to MessageLoop.

* Add BackgroundHangMonitor (bug 909974) to MessagePumpDefault
** MessagePumpDefault contains the actual loop for IPC threads, so it's the actual user of BackgroundHangMonitor.

* Set hang timeouts for the Compositor thread
** The options are passed to Thread during Compositor thread startup.
** The two timeout values are chosen a bit arbitrarily.
*** 128ms corresponds to 8Hz and I think should be the minimally acceptable goal for Compositor responsiveness (normal goal is 60Hz).
*** 8192ms is chosen for permanent hangs because common OSes consider hangs as being unresponsive for several seconds (5 seconds on Android). And I think we want a slightly longer timeout for a background thread.
*** These values are powers-of-two because of BackgroundHangMonitor's design. Having timeouts that are powers-of-two give us better data.
** It's possible that these timeout values are not optimal, but I think we should get some data with these timeouts first before making adjustments.

* Make minor changes to BackgroundHangMonitor to accomodate the above tasks
** In particular, the default for MessageLoop is no timeouts, so BackgroundHangMonitor needs to support no timeouts.
Attachment #8336857 - Flags: review?(benjamin)
Comment on attachment 8336857 [details] [diff] [review]
Monitor IPC thread hangs using BackgroundHangMonitor (v1)

I'm going to mark r+ on the IPC bits. The actual timeouts for the compositor should be reviewed by benwa or somebody who can decide whether they are reasonable.

I'm also confused by the actual effect of the "permanent" hang timeout. What does this timeout mean? If this is primarily feeding telemetry, why wouldn't we just report the time for any hang that is longer than the threshold?
Attachment #8336857 - Flags: review?(benjamin) → review+
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #5)
> Comment on attachment 8336857 [details] [diff] [review]
> Monitor IPC thread hangs using BackgroundHangMonitor (v1)
> 
> I'm going to mark r+ on the IPC bits. The actual timeouts for the compositor
> should be reviewed by benwa or somebody who can decide whether they are
> reasonable.

Thanks. I'll ask for additional review.

> I'm also confused by the actual effect of the "permanent" hang timeout. What
> does this timeout mean? If this is primarily feeding telemetry, why wouldn't
> we just report the time for any hang that is longer than the threshold?

It's geared towards conditions like deadlocks which we never recover from. We can also do stack unwinding for permanent hangs; we don't want unwinding for transient hangs because it's expensive.
Comment on attachment 8336857 [details] [diff] [review]
Monitor IPC thread hangs using BackgroundHangMonitor (v1)

Review of attachment 8336857 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +148,5 @@
> +  Thread::Options options;
> +  options.transient_hang_timeout = 128; // milliseconds
> +  options.permanent_hang_timeout = 8192; // milliseconds
> +
> +  if (!sCompositorThread->StartWithOptions(options)) {

Can you review these timeout values, BenWa? See the explanations below,

* transient_hang_timeout specifies how much time a task must take before we record a transient hang.
* permanent_hang_timeout is similar, but for permanent hangs.
* The difference between transient hangs and permanent hangs is, for transient hangs, we eventually recover from the hang and we record how long the hang was; a transient hang may also turn into a permanent hang, which we assume is unrecoverable. We may perform more intensive analysis on permanent hangs like stack unwinding.

* The options are passed to Thread during Compositor thread startup.
* The two timeout values are chosen a bit arbitrarily.
** 128ms corresponds to 8Hz and I think should be the minimally acceptable goal for Compositor responsiveness (normal goal is 60Hz).
** 8192ms is chosen for permanent hangs because common OSes consider hangs as being unresponsive for several seconds (5 seconds on Android). And I think we want a slightly longer timeout for a background thread.
** These values are powers-of-two because of BackgroundHangMonitor's design. Having timeouts that are powers-of-two give us better data.
* It's possible that these timeout values are not optimal, but I think we should get some data with these timeouts first before making adjustments.
Attachment #8336857 - Flags: review?(bgirard)
Comment on attachment 8336857 [details] [diff] [review]
Monitor IPC thread hangs using BackgroundHangMonitor (v1)

Review of attachment 8336857 [details] [diff] [review]:
-----------------------------------------------------------------

r+, would like to a comment in the code.

::: gfx/layers/ipc/CompositorParent.cpp
@@ +148,5 @@
> +  Thread::Options options;
> +  options.transient_hang_timeout = 128; // milliseconds
> +  options.permanent_hang_timeout = 8192; // milliseconds
> +
> +  if (!sCompositorThread->StartWithOptions(options)) {

Block for 16.6ms on composites is expected and normal. Blocking for twice that is not desirable but common. A threshold of 128ms for the compositor is serious and hopefully shouldn't be spammy.

IMO this is a good explanation/justification and should be commented near the values in the code.

I think these values are very sane and a great starting point.
Attachment #8336857 - Flags: review?(bgirard) → review+
https://hg.mozilla.org/mozilla-central/rev/bed37a1c69a5
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla28
Whiteboard: [qa-]
You need to log in before you can comment on or make changes to this bug.