Background thread hang monitoring

RESOLVED FIXED in mozilla28

Status

()

Core
XPCOM
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: jchen, Assigned: jchen)

Tracking

unspecified
mozilla28
All
Android
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 9 obsolete attachments)

19.96 KB, patch
jchen
: review+
Details | Diff | Splinter Review
7.81 KB, patch
jchen
: review+
Details | Diff | Splinter Review
3.05 KB, patch
jchen
: review+
Details | Diff | Splinter Review
(Assignee)

Description

5 years ago
Compositor hangs has been a problem for Fennec (frequent cause of 'App Not Responding' messages). To provide some kind of telemetry for these kind of hangs, I think there are a couple approaches.

1) Adapt xpcom/threads/HangMonitor to cover IPC threads like the compositor thread.

2) Set a timeout on the compositor sync channel and detect hangs that way (similar to how the plugin hang detector works)

I'm not sure which approach is better for the situation here.

Comment 1

5 years ago
The details of xpcom/threads/HangMonitor is tied pretty closely to main thread event-loop handling, so I wouldn't try to reuse the hang-detection logic for a worker thread.

What is your goal here? To crash the process if it hangs for a sufficiently long time? Or is this merely something to report to telemetry perhaps with a stack/profile of the affected thread? Or is it possible for us to actually recover from these sorts of hangs by disconnecting all the IPDL and creating a new compositor thread?
(Assignee)

Comment 2

5 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #1)
> 
> What is your goal here? To crash the process if it hangs for a sufficiently
> long time? Or is this merely something to report to telemetry perhaps with a
> stack/profile of the affected thread? Or is it possible for us to actually
> recover from these sorts of hangs by disconnecting all the IPDL and creating
> a new compositor thread?

For now, our goal is to get some simple telemetry for these hangs (for example, how frequently do these occur, on which devices/gpus are they more likely, and how long are the hangs). In the future, maybe we can get the stack and/or do something about the hangs like you suggested.
(Assignee)

Comment 3

5 years ago
Created attachment 816653 [details] [diff] [review]
WIP - Add a hang monitor for background threads (v1)

This is a work in progress patch that adds a hang monitor for background threads. It mostly works.
(Assignee)

Comment 4

5 years ago
Created attachment 816654 [details] [diff] [review]
WIP - Use hang monitor in IPC threads (v1)

This patch makes IPC threads use the hang monitor
(Assignee)

Comment 5

5 years ago
Created attachment 816851 [details] [diff] [review]
WIP - Add a hang monitor for background threads (v2)

Added some more comments and optimizations.
Attachment #816653 - Attachment is obsolete: true
(Assignee)

Comment 6

5 years ago
Comment on attachment 816851 [details] [diff] [review]
WIP - Add a hang monitor for background threads (v2)

Mind taking a look, Benjamin and Brad? The hang detecting mechanism is working and I think it's fairly complete. See the other patch for an example of how it would be used in IPC threads.

I have a separate patch for hooking up threads in jsworker.cpp, but I think that's a separate bug. I also need to figure out the reporting-through-telemetry part.
Attachment #816851 - Flags: feedback?(blassey.bugs)
Attachment #816851 - Flags: feedback?(benjamin)
Attachment #816851 - Flags: feedback?(blassey.bugs) → feedback+
(Assignee)

Comment 7

5 years ago
Created attachment 821942 [details] [diff] [review]
Add a hang monitor for background threads (v3)

I think the hang monitor part is ready for review.
Attachment #816851 - Attachment is obsolete: true
Attachment #816851 - Flags: feedback?(benjamin)
Attachment #821942 - Flags: review?(benjamin)
(Assignee)

Comment 8

5 years ago
Created attachment 824298 [details] [diff] [review]
Add a hang monitor for background threads (v4)

This patch switches to using mozilla::LinkedList instead of PRCList.

It also solves a previous bug with nested event loops by having a single BackgroundHangThread object per registered thread, and having BackgroundHangMonitor point to the thread object.
Attachment #821942 - Attachment is obsolete: true
Attachment #821942 - Flags: review?(benjamin)
Attachment #824298 - Flags: review?(benjamin)
(Assignee)

Updated

5 years ago
Summary: Compositor thread hang monitoring/telemetry → Background thread hang monitoring
(Assignee)

Updated

5 years ago
Component: Graphics, Panning and Zooming → XPCOM
Product: Firefox for Android → Core
Version: Trunk → unspecified
(Assignee)

Updated

5 years ago
Blocks: 932865

Updated

5 years ago
Attachment #824298 - Flags: review?(benjamin) → review?(nfroyd)
Comment on attachment 824298 [details] [diff] [review]
Add a hang monitor for background threads (v4)

Review of attachment 824298 [details] [diff] [review]:
-----------------------------------------------------------------

This looks more-or-less OK.  There should be some documentation in BackgroundHangMonitor.h about what's going on--and you'll need a lot less once stuff gets moved out of the .h into the .cpp.

The code as-is is going to make the thread-race-detector people unhappy.  Do you think actual atomic accesses would slow things down too much?

::: xpcom/threads/BackgroundHangMonitor.cpp
@@ +9,5 @@
> +
> +#include <algorithm>
> +
> +#ifdef ANDROID
> +#include <android/log.h>

This looks unused.

@@ +14,5 @@
> +#endif
> +
> +namespace mozilla {
> +
> +RefPtr<BackgroundHangManager> BackgroundHangManager::sInstance;

Please make this a StaticRefPtr, from mozilla/StaticPtr.h.

@@ +86,5 @@
> +       used to track PR_IntervalNow() and determine our latency. */
> +
> +    PRIntervalTime systemTime = PR_IntervalNow();
> +    PRIntervalTime waitTime = PR_INTERVAL_NO_WAIT;
> +    PRIntervalTime permaHangTimeout = PR_INTERVAL_NO_WAIT;

These values might deserve a comment like "// These are only dummy values to get us through the first iteration of the loop."

@@ +97,5 @@
> +        PRIntervalTime newTime = PR_IntervalNow();
> +        PRIntervalTime systemInterval = newTime - systemTime;
> +        systemTime = newTime;
> +
> +        /* waitTime is quarter the shortest timeout value; If our timing

Wording nit: "a quarter of the..."

@@ +241,5 @@
> +void
> +BackgroundHangThread::ReportPermaHang() const
> +{
> +    // Permanently hanged; called on the monitor thread
> +    // mMonitor->mLock IS locked

This function looks incomplete.

@@ +248,5 @@
> +void
> +BackgroundHangThread::ReportHang(PRIntervalTime aHangTime) const
> +{
> +    // Recovered from a hang; called on the hanged thread
> +    // mMonitor->mLock is NOT locked

So does this one.  Are you filling these in with this bug, or is that follow-up work?

@@ +258,5 @@
> +    if (mWaiting) {
> +        mInterval = mManager->mIntervalNow;
> +        mWaiting = false;
> +        // Use PR_Interrupt to avoid potentially taking a lock
> +        PR_Interrupt(mManager->mHangMonitorThread);

I dunno, PR_Interrupt is going to have to take a lock here anyway.  (I guess you could flip bits atomically, but that's not what nspr does at the moment.)

What is this really here for anyway?  So the lock held by the hang monitor wakes up slightly quicker?

::: xpcom/threads/BackgroundHangMonitor.h
@@ +53,5 @@
> +    BackgroundHangMonitor();
> +
> +    BackgroundHangThread* operator->()
> +    {
> +        return mThread;

I think it'd be better if you just had forwarding methods on BackgroundHangMonitor that called the appropriate methods on mThread.  These forwarding methods should be out-of-line, so that you don't have to expose BackgroundHangThread in this header.

@@ +57,5 @@
> +        return mThread;
> +    }
> +};
> +
> +class BackgroundHangManager : public AtomicRefCounted<BackgroundHangManager>

There's not much reason for this class to be defined in this header.  To expose the startup and shutdown methods, you can use a method similar to how HangMonitor defines them.

@@ +98,5 @@
> +
> +    ~BackgroundHangManager();
> +};
> +
> +class BackgroundHangThread : public RefCounted<BackgroundHangThread>

This is purely an implementation detail and should live in the source file.  Can we do that?  Then we don't have to have weird things like a private constructor and a public destructor.

@@ +125,5 @@
> +    bool mWaiting;
> +
> +    // Report a hang; aManager->mLock is NOT locked
> +    void ReportHang(PRIntervalTime aHangTime) const;
> +    // Report a permanent ahng; aManager->mLock IS locked

Spelling nit: "hang"
Attachment #824298 - Flags: review?(nfroyd) → feedback+
(Assignee)

Comment 10

5 years ago
Created attachment 825693 [details] [diff] [review]
Add a hang monitor for background threads (v5)

Thanks for the review!

(In reply to Nathan Froyd (:froydnj) from comment #9)
> Comment on attachment 824298 [details] [diff] [review]
> Add a hang monitor for background threads (v4)
> 
> Review of attachment 824298 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks more-or-less OK.  There should be some documentation in
> BackgroundHangMonitor.h about what's going on--and you'll need a lot less
> once stuff gets moved out of the .h into the .cpp.

Added usage explanations and examples in the .h

> The code as-is is going to make the thread-race-detector people unhappy.  Do
> you think actual atomic accesses would slow things down too much?

Yeah I'm a bit hesitant to add locks. Any known race should already be harmless, and I think adding locks would have a measurable impact. From what I've seen some thread loops can run up to 1000 iterations each second. That times say ten or twenty threads being monitored, and synchronization overhead can add up very fast.

For Helgrind, I see there are some annotations for it that would make it ignore these races.

> @@ +241,5 @@
> > +void
> > +BackgroundHangThread::ReportPermaHang() const
> > +{
> > +    // Permanently hanged; called on the monitor thread
> > +    // mMonitor->mLock IS locked
> 
> This function looks incomplete.
> 
> @@ +248,5 @@
> > +void
> > +BackgroundHangThread::ReportHang(PRIntervalTime aHangTime) const
> > +{
> > +    // Recovered from a hang; called on the hanged thread
> > +    // mMonitor->mLock is NOT locked
> 
> So does this one.  Are you filling these in with this bug, or is that
> follow-up work?

Yes, that work is being tracked in bug 932865. I added TODO comments for these functions.

> @@ +258,5 @@
> > +    if (mWaiting) {
> > +        mInterval = mManager->mIntervalNow;
> > +        mWaiting = false;
> > +        // Use PR_Interrupt to avoid potentially taking a lock
> > +        PR_Interrupt(mManager->mHangMonitorThread);
> 
> I dunno, PR_Interrupt is going to have to take a lock here anyway.  (I guess
> you could flip bits atomically, but that's not what nspr does at the moment.)

If I'm reading the code right, at least the pthreads implementation does not lock. http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptthread.c

> What is this really here for anyway?  So the lock held by the hang monitor
> wakes up slightly quicker?

When all threads are waiting, the monitor thread also waits indefinitely. So the interrupt wakes up the monitor thread.

> ::: xpcom/threads/BackgroundHangMonitor.h
> @@ +53,5 @@
> > +    BackgroundHangMonitor();
> > +
> > +    BackgroundHangThread* operator->()
> > +    {
> > +        return mThread;
> 
> I think it'd be better if you just had forwarding methods on
> BackgroundHangMonitor that called the appropriate methods on mThread.  These
> forwarding methods should be out-of-line, so that you don't have to expose
> BackgroundHangThread in this header.

Added forwarding methods and moved BackgroundHangThread to the .cpp

> @@ +57,5 @@
> > +        return mThread;
> > +    }
> > +};
> > +
> > +class BackgroundHangManager : public AtomicRefCounted<BackgroundHangManager>
> 
> There's not much reason for this class to be defined in this header.  To
> expose the startup and shutdown methods, you can use a method similar to how
> HangMonitor defines them.

Moved to the .cpp and made the startup and shutdown methods static in BackgroundHangMonitor

Also addressed other comments.
Attachment #824298 - Attachment is obsolete: true
Attachment #825693 - Flags: review?(nfroyd)
Comment on attachment 825693 [details] [diff] [review]
Add a hang monitor for background threads (v5)

Review of attachment 825693 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, just a few small things to fix.

(In reply to Jim Chen [:jchen :nchen] from comment #10)
> Yeah I'm a bit hesitant to add locks. Any known race should already be
> harmless, and I think adding locks would have a measurable impact. From what
> I've seen some thread loops can run up to 1000 iterations each second. That
> times say ten or twenty threads being monitored, and synchronization
> overhead can add up very fast.

OK, I think I can buy that.  Would you please file a follow-up bug dependent on this one for adding the necessary codepaths to the thread-checking blacklists, and CC :decoder?  Thanks.

::: xpcom/threads/BackgroundHangMonitor.cpp
@@ +18,5 @@
> +
> +/**
> + * BackgroundHangManager is the global object that
> + * manages all instances of BackgroundHangThread
> + */

Thanks for moving all this stuff into the .cpp, this looks much better.

@@ +199,5 @@
> +
> +    /* We are in one of the following scenarios,
> +      * Permahang timeout
> +      * Thread added/removed
> +      * Thread wait ended

You could use "-" for your bullet and then you wouldn't have to indent this list peculiarly.

@@ +213,5 @@
> +      if (currentThread->mWaiting) {
> +        // Thread is waiting, not hanging
> +        continue;
> +      }
> +      PRIntervalTime hangTime = mIntervalNow - currentThread->mInterval;

This would be epsilon more efficient (and more efficient if mIntervalNow were ever turned into an atomic variable) if mIntervalNow was pulled into a local variable outside the loop.

@@ +278,5 @@
> +void
> +BackgroundHangThread::ReportHang(PRIntervalTime aHangTime) const
> +{
> +  // Recovered from a hang; called on the hanged thread
> +  // mMonitor->mLock is NOT locked

You mean mManager, right?

@@ +287,5 @@
> +void
> +BackgroundHangThread::ReportPermaHang() const
> +{
> +  // Permanently hanged; called on the monitor thread
> +  // mMonitor->mLock IS locked

Ditto here.

@@ +296,5 @@
> +MOZ_ALWAYS_INLINE void
> +BackgroundHangThread::NotifyActivity()
> +{
> +  if (mWaiting) {
> +    mInterval = mManager->mIntervalNow;

You should pull mManager->mIntervalNow out into a variable above the if and use it in both branches.  Doesn't make much difference in the waiting case, but avoids mIntervalNow potentially changing out from underneath you in the hang case.

@@ +336,5 @@
> +}
> +
> +
> +void
> +BackgroundHangMonitor::BackgroundHangMonitor::Startup()

This looks like it's not going to compile.

@@ +344,5 @@
> +  BackgroundHangManager::sInstance = new BackgroundHangManager();
> +}
> +
> +void
> +BackgroundHangMonitor::BackgroundHangMonitor::Shutdown()

Neither does this.

::: xpcom/threads/BackgroundHangMonitor.h
@@ +5,5 @@
> +
> +#ifndef mozilla_BackgroundHangMonitor_h
> +#define mozilla_BackgroundHangMonitor_h
> +
> +#include "mozilla/RefPtr.h"

Hooray for fewer includes!  Please #include <stdint.h> here to make sure uint32_t is defined.

@@ +13,5 @@
> +class BackgroundHangThread;
> +
> +/**
> + * The background hang monitor is responsible for detecting and reporting
> + * hangs in background (non-main) threads. A thread registers itself using

This is great, thank you for adding it!

@@ +28,5 @@
> + * task. To ensure responsiveness, tasks in a thread often have a target
> + * running time. This is a good starting point for determining the timeout
> + * and maximum timeout values. For example, the Compositor thread has a
> + * responsiveness goal of 60Hz or 17ms, and 10Hz or 100ms is a good
> + * starting value for its hang timeout.

"its hang timeout" is unclear about which timeout you're referring to.  Maybe say "...60Hz or 17ms, which is a good starting value for its transient hang timeout.  10Hz or 100ms might be a good starting value for its permanent hang timeout."

@@ +98,5 @@
> +
> +public:
> +  /**
> +   * Enable hang monitoring.
> +   * Must return before using BackgroundHangMonitor

Nit: periods end sentences.

@@ +104,5 @@
> +  static void Startup();
> +
> +  /**
> +   * Disable hang monitoring.
> +   * Can be called without destroying all BackgroundHangMonitors first

Here too.  And many other places.
Attachment #825693 - Flags: review?(nfroyd) → review+
(Assignee)

Comment 12

5 years ago
Created attachment 826769 [details] [diff] [review]
Add a hang monitor for background threads (v5.1)

Thanks for the review! The new patch addresses the previous comments.
Attachment #825693 - Attachment is obsolete: true
Attachment #826769 - Flags: review+
(Assignee)

Updated

5 years ago
Blocks: 935064
(Assignee)

Comment 13

5 years ago
Created attachment 829482 [details] [diff] [review]
Let monitor thread detect hangs in addition to permahangs (v1)

In order to get the best pseudo-stack when a hang occurs, we decided it's better to detect hangs in the monitor thread rather than in the target thread. This affects how ReportHang is called, but shouldn't affect how it'll be implemented.
Attachment #829482 - Flags: review?(nfroyd)
Comment on attachment 829482 [details] [diff] [review]
Let monitor thread detect hangs in addition to permahangs (v1)

Review of attachment 829482 [details] [diff] [review]:
-----------------------------------------------------------------

Could you provide some more background on why it's better for the monitor thread to detect transient hangs?  I don't immediately see how that enables better pseudostacks.

::: xpcom/threads/BackgroundHangMonitor.cpp
@@ +240,5 @@
> +          // A hang started
> +          currentThread->mHangStart = interval;
> +          currentThread->mHanging = true;
> +        }
> +      } else { // currentThread->mHanging

I don't think you need this comment.

@@ +248,5 @@
> +          currentThread->mHanging = false;
> +        }
> +      }
> +
> +      if (MOZ_LIKELY(!currentThread->mHanging)) {

It looks like this could be combined with the above.  I like the idea of:

PRIntervalTime* tp;
if (MOZ_LIKELY(!currentThread->mHanging)) {
  ... // as above
  tp = &currentThread->mMaxTimeout;
} else {
  ... // as above
  tp = &currentThread->mTimeout;
}

hangTimeout = std::min(hangTimeout, *tp - hangTime);

but I don't have a good name for |tp|.  We also need to get the logic here straightened out, see below.

@@ +249,5 @@
> +        }
> +      }
> +
> +      if (MOZ_LIKELY(!currentThread->mHanging)) {
> +        // We have a hang; revisit when a permahang becomes possible

The comment here seems to be the opposite of the condition.  Oversight or bug?

@@ +253,5 @@
> +        // We have a hang; revisit when a permahang becomes possible
> +        hangTimeout = std::min(
> +          hangTimeout, currentThread->mMaxTimeout - hangTime);
> +      } else {
> +        // We don't have a hang; revisit when a hang becomes possible

Likewise for this comment.

@@ +316,5 @@
>  
>  void
>  BackgroundHangThread::ReportHang(PRIntervalTime aHangTime) const
>  {
>    // Recovered from a hang; called on the hanged thread

This comment should be updated.
Attachment #829482 - Flags: review?(nfroyd) → feedback+
(Assignee)

Comment 15

5 years ago
(In reply to Nathan Froyd (:froydnj) from comment #14)
> Comment on attachment 829482 [details] [diff] [review]
> Let monitor thread detect hangs in addition to permahangs (v1)
> 
> Review of attachment 829482 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Could you provide some more background on why it's better for the monitor
> thread to detect transient hangs?  I don't immediately see how that enables
> better pseudostacks.

When we detect transient hangs on the target thread, we only detect the hang after the hang is over. By that point, we no longer have the pseudostack where the hang occurred. However, if we detect transient hangs on the monitor thread, we can asynchronously get a copy of the target thread's pseudostack. For example, if we have,

> PROFILER_LABEL("Task-1");
> hangMonitor->NotifyActivity();
> {
>   PROFILER_LABEL("Task-1a");
>   // Task-1a causes a transient hang
> }
> {
>   PROFILER_LABEL("Task-1b");
> }
> hangMonitor->NotifyActivity();

In the case of target thread detecting transient hangs, the second NotifyActivity() call detects the hang, but the pseusostack will be only contain "Task-1" by that point.

In the case of monitor thread detecting transient hangs, it can asynchronously retrieve the real "Task-1 > Task-1a" pseudostack at the point of the hang. Therefore, we get a more accurate pseudostack in this case. One problem with this approach is that it is possible that Task-1a finishes just before we retrieve the pseudostack, and we end up with "Task-1 > Task-1b". We get an inaccurate stack in this case, but overall I think the benefit outweighs the drawback.

I'll address the other comments.
(Assignee)

Comment 16

5 years ago
Created attachment 830443 [details] [diff] [review]
Let monitor thread detect hangs in addition to permahangs (v2)

(In reply to Nathan Froyd (:froydnj) from comment #14)
> Comment on attachment 829482 [details] [diff] [review]
> Let monitor thread detect hangs in addition to permahangs (v1)
> 
> Review of attachment 829482 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Could you provide some more background on why it's better for the monitor
> thread to detect transient hangs?  I don't immediately see how that enables
> better pseudostacks.

See comment 15.

> ::: xpcom/threads/BackgroundHangMonitor.cpp
> @@ +240,5 @@
> > +          // A hang started
> > +          currentThread->mHangStart = interval;
> > +          currentThread->mHanging = true;
> > +        }
> > +      } else { // currentThread->mHanging
> 
> I don't think you need this comment.

Removed.

> @@ +248,5 @@
> > +          currentThread->mHanging = false;
> > +        }
> > +      }
> > +
> > +      if (MOZ_LIKELY(!currentThread->mHanging)) {
> 
> It looks like this could be combined with the above.  I like the idea of:
> 
> PRIntervalTime* tp;
> if (MOZ_LIKELY(!currentThread->mHanging)) {
>   ... // as above
>   tp = &currentThread->mMaxTimeout;
> } else {
>   ... // as above
>   tp = &currentThread->mTimeout;
> }
> 
> hangTimeout = std::min(hangTimeout, *tp - hangTime);
> 
> but I don't have a good name for |tp|.  We also need to get the logic here
> straightened out, see below.

The problem is currentThread->mHanging can be changed within the first if block, so to adjust "tp" in it, it would be something like

> if (hanging) {
>   if (no longer hanging) {
>     hanging = false;
>     tp = &currentThread->mTimeout;
>   } else {
>     tp = &currentThread->mMaxTimeout;
>   }
> } else {
>   if (started hanging) {
>     hanging = true;
>     tp = &currentThread->mMaxTimeout;
>   } else {
>     tp = &currentThread->mTimeout;
>   }
> }

which also appears verbose...

Maybe using ?: instead of the second if block is more clear? I also renamed hangTimeout to recheckTimeout, because it's the timeout for rechecking the hang status of threads.

> @@ +249,5 @@
> > +        }
> > +      }
> > +
> > +      if (MOZ_LIKELY(!currentThread->mHanging)) {
> > +        // We have a hang; revisit when a permahang becomes possible
> 
> The comment here seems to be the opposite of the condition.  Oversight or
> bug?

Nice catch; it should be checking mHanging instead of !mHanging.

> @@ +316,5 @@
> >  
> >  void
> >  BackgroundHangThread::ReportHang(PRIntervalTime aHangTime) const
> >  {
> >    // Recovered from a hang; called on the hanged thread
> 
> This comment should be updated.

Done.
Attachment #829482 - Attachment is obsolete: true
Attachment #830443 - Flags: review?(nfroyd)
(In reply to Jim Chen [:jchen :nchen] from comment #16)
> Created attachment 830443 [details] [diff] [review]
> Let monitor thread detect hangs in addition to permahangs (v2)
> 
> (In reply to Nathan Froyd (:froydnj) from comment #14)
> > Could you provide some more background on why it's better for the monitor
> > thread to detect transient hangs?  I don't immediately see how that enables
> > better pseudostacks.
> 
> See comment 15.

Thank you for the explanation, I agree that makes things better.

> The problem is currentThread->mHanging can be changed within the first if
> block, so to adjust "tp" in it, it would be something like
> [...big block of code...]
> which also appears verbose...

Whoops, oversight on my part.  Yeah, we don't want to do something like that.

> Maybe using ?: instead of the second if block is more clear? I also renamed
> hangTimeout to recheckTimeout, because it's the timeout for rechecking the
> hang status of threads.

Eh, not sure about the ?: part, but I do like the renaming.

Will review the actual patch.
Comment on attachment 830443 [details] [diff] [review]
Let monitor thread detect hangs in addition to permahangs (v2)

Review of attachment 830443 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/BackgroundHangMonitor.cpp
@@ +250,5 @@
> +      }
> +
> +      /* If we are hanging, the next time we check for hang status is when
> +         the hang turns into a permahang. If we're not hanging, the next
> +         recheck timeout is when we may be entering a hang. */

I like this comment, but I think it would be better to use if {} else {} in this case, e.g.:

/* comment */
PRIntervalTime nextCheck;
if (currentThread->mHanging) {
  nextCheck = currentThread->mMaxTimeout;
} else {
  nextCheck = currentThread->mTimeout;
}
recheckTimeout = std::min(...);

The expression is complicated enough that the ?: just adds clutter, IMHO.
Attachment #830443 - Flags: review?(nfroyd) → review+
(Assignee)

Comment 19

5 years ago
Created attachment 8334920 [details] [diff] [review]
Let monitor thread detect hangs in addition to permahangs (v2.1)

Addressed comments above.
Attachment #816654 - Attachment is obsolete: true
Attachment #830443 - Attachment is obsolete: true
Attachment #8334920 - Flags: review+
(Assignee)

Updated

5 years ago
Blocks: 940737
(Assignee)

Comment 20

5 years ago
Created attachment 8336843 [details] [diff] [review]
Stop the monitor thread when the BackgroundHangManager instance is destroyed (v1)

Try runs uncovered a race condition where the monitor thread could still be alive after NSPR is shut down, and Bad Things happen. This patch makes sure that the monitor thread is stopped when BackgroundHangManager is destroyed. Try run passes with this patch, https://tbpl.mozilla.org/?tree=Try&rev=6eaa198072b4

In the BackgroundHangManager destructor, I added |if(mHangMonitorThread)| after |MOZ_ASSERT(mHangMonitorThread)| to guard against shutdown crashes in release builds if PR_CreateThread does end up failing, but I'm not sure if it's really necessary.
Attachment #8336843 - Flags: review?(nfroyd)
Comment on attachment 8336843 [details] [diff] [review]
Stop the monitor thread when the BackgroundHangManager instance is destroyed (v1)

Review of attachment 8336843 [details] [diff] [review]:
-----------------------------------------------------------------

Ugh, so the race (pre-patch) is something like:

- We shutdown XPCOM;
- We shutdown the manager, which releases the strong ref we have to the manager;
- Monitor thread is notified, but delaying from actually waking up;
- NSPR shuts down;
- Monitor thread wakes up (this is problematic because NSPR has shut down?);
- Monitor thread releases strong reference to the manager from the NSPR thread function;
- Destructor runs;
- Bad things happen.

Is that correct?  And so with the patch, it looks something like:

- We shutdown XPCOM;
- We shutdown the manager, which release the strong ref we have to the manager;
- Manager's destructor runs, joins the thread in an orderly manner;
- NSPR shuts down;
- Everybody is happy.

Is that correct?  If so, r=me

::: xpcom/threads/BackgroundHangMonitor.cpp
@@ +164,5 @@
> +  MOZ_ASSERT(mHangMonitorThread,
> +    "No monitor thread");
> +
> +  // PR_CreateThread could have failed above due to resource limitation
> +  if (mHangMonitorThread) {

You need to null-check in Wakeup(), too.  I think that's the only other place, but you should make sure yourself.
Attachment #8336843 - Flags: review?(nfroyd) → review+
(Assignee)

Comment 22

5 years ago
Created attachment 8336920 [details] [diff] [review]
Stop the monitor thread when the BackgroundHangManager instance is destroyed (v1.1)

(In reply to Nathan Froyd (:froydnj) from comment #21)
> Comment on attachment 8336843 [details] [diff] [review]
> Stop the monitor thread when the BackgroundHangManager instance is destroyed
> (v1)
> 
> Review of attachment 8336843 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> - Monitor thread wakes up (this is problematic because NSPR has shut down?);

Right. Some of the code on thread exit (e.g. unlocking the monitor) depend on NSPR.

> - Monitor thread releases strong reference to the manager from the NSPR
> thread function;
> - Destructor runs;
> - Bad things happen.
> 
> Is that correct?  And so with the patch, it looks something like:
> 
> - We shutdown XPCOM;
> - We shutdown the manager, which release the strong ref we have to the
> manager;
> - Manager's destructor runs, joins the thread in an orderly manner;
> - NSPR shuts down;
> - Everybody is happy.
> 
> Is that correct?  If so, r=me

That's correct.

> ::: xpcom/threads/BackgroundHangMonitor.cpp
> @@ +164,5 @@
> > +  MOZ_ASSERT(mHangMonitorThread,
> > +    "No monitor thread");
> > +
> > +  // PR_CreateThread could have failed above due to resource limitation
> > +  if (mHangMonitorThread) {
> 
> You need to null-check in Wakeup(), too.  I think that's the only other
> place, but you should make sure yourself.

You're right. I added the check.
Attachment #8336843 - Attachment is obsolete: true
Attachment #8336920 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/febf82ab0596
https://hg.mozilla.org/mozilla-central/rev/6174eab68f5a
https://hg.mozilla.org/mozilla-central/rev/1cd4e62f97c1
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla28
You need to log in before you can comment on or make changes to this bug.