Closed Bug 951431 Opened 10 years ago Closed 10 years ago

Monitor main thread hangs using BackgroundHangMonitor

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla29
Tracking Status
firefox28 --- fixed
firefox29 --- fixed

People

(Reporter: jchen, Assigned: jchen)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

      No description provided.
This patch sets up a BackgroundHangMonitor for the Gecko thread that spans XPCOM lifetime. For now it piggybacks on the existing HangMonitor for activity and wait notifications. Also for now, this is for all platforms, but I can make it Android only if needed (only on Android is the Gecko thread a "background thread").
Attachment #8349797 - Flags: review?(benjamin)
This patch simply makes us report permanent hangs as long transient hangs for now, until we get better analysis (e.g. stack unwinding) for permanent hangs. Not an essential patch and can be reviewed/landed whenever.
Attachment #8349798 - Flags: review?(nfroyd)
ThreadStackHelper's signal handling can interfere with the profiler's signal handling, resulting in intermittent xpcshell timeouts/failures. I think the safest thing for now is to avoid getting the stack when the profiler is active, which is acceptable because in practice, the profiler is rarely running.
Attachment #8349799 - Flags: review?(bgirard)
Comment on attachment 8349799 [details] [diff] [review]
Don't get stacks during profiler runs on Linux (v1)

Review of attachment 8349799 [details] [diff] [review]:
-----------------------------------------------------------------

This feels like a silent failure that is going to cause people to wonder why things aren't working.
Attachment #8349799 - Flags: review?(bgirard) → review+
Comment on attachment 8349798 [details] [diff] [review]
Report permanent hangs as transient hangs for now (v1)

Review of attachment 8349798 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the question below answered satisfactorily.

::: xpcom/threads/BackgroundHangMonitor.cpp
@@ +264,5 @@
>        if (MOZ_UNLIKELY(hangTime >= currentThread->mMaxTimeout)) {
>          // A permahang started
>          // Skip subsequent iterations and tolerate a race on mWaiting here
>          currentThread->mWaiting = true;
> +        currentThread->mHanging = false;

What's the rationale for this change?  (I don't think it's necessarily incorrect, I'm just surprised to find it along with the other changes.)
Attachment #8349798 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd (:froydnj) (AFKish through 3 January 2014) from comment #5)
> Comment on attachment 8349798 [details] [diff] [review]
> Report permanent hangs as transient hangs for now (v1)
> 
> ::: xpcom/threads/BackgroundHangMonitor.cpp
> @@ +264,5 @@
> >        if (MOZ_UNLIKELY(hangTime >= currentThread->mMaxTimeout)) {
> >          // A permahang started
> >          // Skip subsequent iterations and tolerate a race on mWaiting here
> >          currentThread->mWaiting = true;
> > +        currentThread->mHanging = false;
> 
> What's the rationale for this change?  (I don't think it's necessarily
> incorrect, I'm just surprised to find it along with the other changes.)

When mHanging is true, we're looking for either the end of the transient hang or its transitioning into a permanent hang. When either case occurs, we need to reset mHanging to false so that we don't continue to look for those two conditions. And this line fixes the previous bug that mHanging was not reset when entering a permanent hang.
I don't understand what this change actually does. We already have a main-thread hang detector, and this one seems to duplicate that using the new mechanism. Why bother?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #7)
> I don't understand what this change actually does. We already have a
> main-thread hang detector, and this one seems to duplicate that using the
> new mechanism. Why bother?

The current hang monitor is only enabled on Windows, whereas the new mechanism is cross-platform. Especially for Android, we want main-thread hang monitoring because the Gecko "main thread" is actually a background thread and we're seeing performance issues related to that.
Flags: needinfo?(benjamin)
> The current hang monitor is only enabled on Windows

What does that mean? We certainly build it on all platforms, and appear to initialize it on all platforms: http://mxr.mozilla.org/mozilla-central/source/xpcom/build/nsXPComInit.cpp#610
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #9)
> > The current hang monitor is only enabled on Windows
> 
> What does that mean? We certainly build it on all platforms, and appear to
> initialize it on all platforms:
> http://mxr.mozilla.org/mozilla-central/source/xpcom/build/nsXPComInit.cpp#610

Currently the hang monitor is disabled through pref in bug 705748,
http://mxr.mozilla.org/mozilla-central/source/modules/libpref/src/init/all.js#1879

It's only enabled on Windows nightlies because of the Telemetry chrome hangs feature, which overrides that pref. Background hang reporting is similar to chrome hangs but is designed to support multiple platforms and multiple threads.
Flags: needinfo?(benjamin)
Comment on attachment 8349797 [details] [diff] [review]
Monitor main thread hangs using BackgroundHangMonitor (v1)

As I understand it, that pref merely controls whether we *crash* on hang. It does not control anything else about the measurement of hangs.

I'm disappointed that we seem to have two independent hang monitoring systems and we're trying to glue them together like this. If this is expediently necessary, that's fine, but we should file and assign followups to get this back into a relatively sane single set of code.
Attachment #8349797 - Flags: review?(benjamin) → review?(vdjeric)
Flags: needinfo?(benjamin)
Attachment #8349797 - Flags: review?(vdjeric) → review+
(In reply to Jim Chen [:jchen :nchen] from comment #3)
> Created attachment 8349799 [details] [diff] [review]
> Don't get stacks during profiler runs on Linux (v1)

I don't know whether this is an intended or unintended side effect,
but without this patch in place, I get segfaults every run, from
the sigaction() call that is protected by the one-and-only-hunk
in this patch.  I don't know why calling sigaction would cause
Gecko to jump off to a completely nonsense address, but it does.
I was unable to determine whether it is the calling thread or
some other thread, that jumps into outer space at the sigaction().

This is when running on arm-android, using native unwind, with
the new unwinder library from bug 938157.
Comment on attachment 8349799 [details] [diff] [review]
Don't get stacks during profiler runs on Linux (v1)

[Approval Request Comment]

Bug caused by (feature/regressing bug #): Bug 932865 / Bug 935092

User impact if declined: Random deadlocks when using the profiler tool

Testing completed (on m-c, etc.): Locally, m-c

Risk to taking this patch (and alternatives if risky): None; patch only avoids hang condition.

String or IDL/UUID changes made by this patch: None
Attachment #8349799 - Flags: approval-mozilla-aurora?
Attachment #8349799 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Backed out the first two changesets because only the third was intended for uplift.
https://hg.mozilla.org/releases/mozilla-aurora/rev/6332a66fff72
Does this have a reproducible case to verify this is working as expected?
Flags: needinfo?(nchen)
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #18)
> Does this have a reproducible case to verify this is working as expected?

There's now a "Thread Hangs" section in about:telemetry. On startup, there should be a "Gecko" subsection with a histogram called "Gecko-Activity".

When a hang occurs (for example by running "for(var a=0;a<1e7;a++);" in the web console), there should be a new "Gecko-Hang-1" histogram under the "Gecko" subsection.

I'll look into possibly writing an automated test for this.
Flags: needinfo?(nchen)
Flags: in-testsuite?
Hi Jim,

I'm seeing two threads for hang monitoring in my firefox nightlies now (one for the main thread, one for non-main threads). As bsmedberg requested in comment 11, have there been followups filed to unify these threads?
Flags: needinfo?(nchen)
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #20)
> Hi Jim,
> 
> I'm seeing two threads for hang monitoring in my firefox nightlies now (one
> for the main thread, one for non-main threads). As bsmedberg requested in
> comment 11, have there been followups filed to unify these threads?

Thanks for the reminder. I filed bug 951431 for a start.
Flags: needinfo?(nchen)
Ugh. Although this bug is a start, the bug I just filed was bug 982337.
Depends on: 979069
You need to log in before you can comment on or make changes to this bug.