Closed Bug 1443499 Opened 2 years ago Closed 2 years ago

Background Hang Reporter: use only one thread for unwinding and reporting

Categories

(Core :: Gecko Profiler, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox-esr60 --- fixed
firefox61 --- fixed

People

(Reporter: jseward, Assigned: jseward)

References

Details

Attachments

(1 file, 2 obsolete files)

When the BHR wants to report a tardy thread, it hands off the work of
getting a stacktrace to a helper thread.  The helper threads used are in the
Stream Transport Service threadpool, which has a default limit of 25
threads.

This has a bad effect when we are in a severely compute-resource constrained
situation.  Then, many threads will be late, and up to 25 STS worker threads
will be employed to do unwinding, a potentially expensive operation.  This
further restricts the compute resources available to progress the rest of
the system.

Another effect is that the unwinder work will compete against the "real" STS
work for the 25 workers, potentially further slowing forward progress.

This effects are clearly visible when running on Valgrind -- it is possible
to observe 20+ STS threads running for even simple-ish web pages.  This is
possibly related to the automation timeouts seen in bug 1296819 and
therefore some cause for concern.
Attached patch WIP patch (obsolete) — Splinter Review
(WIP) replaces the use of the STS thread pool with a single
nsThread dedicated to unwinding/reporting hangs.  This makes
Firefox on Valgrind markedly more reponsive.

:mystor, is something that you would consider (after cleaning
up, proper error handling, etc); or perhaps the same but using
a small dedicated nsThreadPool instead?
Flags: needinfo?(nika)
(In reply to Julian Seward [:jseward] from comment #1)
> (WIP) replaces the use of the STS thread pool with a single
> nsThread dedicated to unwinding/reporting hangs.  This makes
> Firefox on Valgrind markedly more reponsive.
> 
> :mystor, is something that you would consider (after cleaning
> up, proper error handling, etc); or perhaps the same but using
> a small dedicated nsThreadPool instead?

Yeah, I think this is a good idea. We don't really want to be processing multiple hangs in parallel anyway.
Flags: needinfo?(nika)
Summary: Background Hang Reporter: limit worst case resource consumption → Background Hang Reporter: use only one thread for unwinding and reporting
Attached patch bug1443499-2t.diff (obsolete) — Splinter Review
Attachment #8956556 - Attachment is obsolete: true
Attachment #8957062 - Attachment is obsolete: true
Attachment #8957063 - Flags: review?(nika)
Comment on attachment 8957063 [details] [diff] [review]
bug1443499-3.diff

Review of attachment 8957063 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks.

::: toolkit/components/backgroundhangmonitor/BackgroundHangMonitor.cpp
@@ +103,5 @@
> +
> +  // Unwinding and reporting of hangs is despatched to this thread.  We also
> +  // keep a bool indicating whether the thread was successfully created.
> +  nsCOMPtr<nsIThread> mHangProcessingThread;
> +  bool mHangProcessingThreadExists;

Why not just check if mHangProcessingThread is nullptr instead?

@@ +261,2 @@
>  {
> +  // Lock so we don't race against the new monitor or processing threads

We will never race against the processing thread, so leave this comment as-is.
Attachment #8957063 - Flags: review?(nika) → review+
Assignee: nobody → jseward
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/177fdd3c0a75
Background Hang Reporter: use only one thread for unwinding and reporting.  r=mystor.
https://hg.mozilla.org/mozilla-central/rev/177fdd3c0a75
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Depends on: 1445215
Comment on attachment 8957063 [details] [diff] [review]
bug1443499-3.diff

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:

 For automation runs on Valgrind, Valgrind slows down progress so much
 that the Background Hang Reporter kills the process unnecessarily,
 leading the test to fail.  The problem is exacerbated because the BHR
 actually uses the Service Transport Stream (STS) thread pool to report
 hangs, containing 25 threads, further reducing the progress of threads
 doing "real" work.

 The patch replaces the use of the STS thread pool with a single nsThread
 dedicated to unwinding/reporting hangs, which limits the overall resource
 damage that the BHR can inflict.

 The case for including this is only in order to reduce failures of Valgrind
 automation runs for esr60.  There is no other purpose.

User impact if declined: 

 Zero, because AFAIK end-user (non-Valgrind) runs do not suffer the
 above-described slowdown scenario w.r.t the Background Hang Reporter.

Fix Landed on Version:

 61

Risk to taking this patch (and alternatives if risky): 

 Risks that I can think of (I'm not sure if any of them are actually
 real, or likely):

 * if for whatever reason the BHR needs to make a lot of reports in a short
   timeframe.  In that case it all reports would get processed only by a
   single thread rather than potentially in parallel by a group of up to 25
   threads.

 * In early March there was some evidence that this patch had caused a leak,
   although we only have 4 crash reports for that in total, and it has not
   re-occurred since then.

 * The change to a single-thread scheme is somehow not correct in some other
   way.  I'm not aware of any such problem, though.

 We could skip it, but I fear that would significantly reduce our ability
 to avoid timeouts on Valgrind runs on automation for esr60.  There are
 two other patches involved in trying to mitigate this -- bug 1442603 and
 bug 1296819 -- and those two alone might improve the situation, but by
 how much I don't know.

String or UUID changes made by this patch: 

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Attachment #8957063 - Flags: approval-mozilla-esr60?
Comment on attachment 8957063 [details] [diff] [review]
bug1443499-3.diff

The Valgrind jobs on ESR60 are currently failing frequently due to these issues. Risk seems low enough to warrant taking these.
Attachment #8957063 - Flags: approval-mozilla-esr60? → approval-mozilla-esr60+
You need to log in before you can comment on or make changes to this bug.