Background thread hang reporting

RESOLVED FIXED in mozilla28

Status

()

defect
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: jchen, Assigned: jchen)

Tracking

unspecified
mozilla28
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(relnote-firefox 28+)

Details

(Whiteboard: [qa-])

Attachments

(5 attachments, 7 obsolete attachments)

6.03 KB, patch
froydnj
: review+
froydnj
: feedback+
Details | Diff | Splinter Review
7.75 KB, patch
jchen
: review+
Details | Diff | Splinter Review
5.29 KB, patch
jchen
: review+
Details | Diff | Splinter Review
5.37 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
12.45 KB, patch
jchen
: review+
Details | Diff | Splinter Review
Bug 909974 adds monitoring for transient and permanent hangs in background threads. This bug will add telemetry for data gathered. I think this will involve adding a custom section to telemetry akin to the existing chrome hang section.
Depends on: 935089, 935092
This patch adds a ThreadHangStats class in Telemetry for recording background thread hangs
Attachment #829445 - Flags: review?(vdjeric)
This patch adds an object Telemetry can use to get hang stats for active threads.
Attachment #829510 - Flags: review?(nfroyd)
Comment on attachment 829445 [details] [diff] [review]
Add ThreadHangStats for collecting background hang telemetry (v1)

Review of attachment 829445 [details] [diff] [review]:
-----------------------------------------------------------------

Just a few drive-by comments.

::: toolkit/components/telemetry/ThreadHangStats.h
@@ +33,5 @@
> +    mozilla::PodArrayZero(*this);
> +  }
> +  void Add(PRIntervalTime time)
> +  {
> +    operator[](mozilla::FloorLog2(PR_IntervalToMilliseconds(time)))++;

I'd bet this function winds up non-inlined anyway, so there's not a lot of win in defining it here.  You could then drop mozilla/MathAlgorithms.h.

@@ +47,5 @@
> +
> +private:
> +  static uint32_t GetHash(const Stack& aStack);
> +
> +  Stack mStack;

Can we make this const, too?

@@ +74,5 @@
> + - hang histograms of individual hangs. */
> +class ThreadHangStats
> +{
> +private:
> +  std::string mName;

I think it will make other code better if this is just nsCString.
Comment on attachment 829510 [details] [diff] [review]
Add way for telemetry to iterate over active threads (v1)

Review of attachment 829510 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/BackgroundHangMonitor.cpp
@@ +433,5 @@
> +/* Because we are iterating through the BackgroundHangThread linked list,
> +   we need to take a lock. Using MonitorAutoLock as a base class makes
> +   sure all of that is taken care of for us. */
> +BackgroundHangMonitor::ThreadHangStatsIterator::ThreadHangStatsIterator()
> +  : MonitorAutoLock(BackgroundHangManager::sInstance->mLock)

This does not excite me; locks "leaking" out like this seems like a recipe for trouble.

What do you think about an API that looks something like:

// Or nsTArray, or whatever.  Some appropriate amount of stack
// allocation should be used here.
Vector<ThreadHangStats*, $SOME_AMOUNT> activeThreads;
// The lock is held over this call.
// Could also just return a Vector, and rely on the compiler/move semantics to make it efficient?
monitor->EnumerateActiveThreads(&activeThreads);
// Process the returned data.
Attachment #829510 - Flags: review?(nfroyd) → feedback+
(In reply to Nathan Froyd (:froydnj) from comment #4)
> Comment on attachment 829510 [details] [diff] [review]
> Add way for telemetry to iterate over active threads (v1)
> 
> Review of attachment 829510 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: xpcom/threads/BackgroundHangMonitor.cpp
> @@ +433,5 @@
> > +/* Because we are iterating through the BackgroundHangThread linked list,
> > +   we need to take a lock. Using MonitorAutoLock as a base class makes
> > +   sure all of that is taken care of for us. */
> > +BackgroundHangMonitor::ThreadHangStatsIterator::ThreadHangStatsIterator()
> > +  : MonitorAutoLock(BackgroundHangManager::sInstance->mLock)
> 
> This does not excite me; locks "leaking" out like this seems like a recipe
> for trouble.

Yeah it's not the best :) But I think using RAII here will prevent most mistakes. And hopefully the only user of this will be Telemetry.

> What do you think about an API that looks something like:
> 
> // Or nsTArray, or whatever.  Some appropriate amount of stack
> // allocation should be used here.
> Vector<ThreadHangStats*, $SOME_AMOUNT> activeThreads;
> // The lock is held over this call.
> // Could also just return a Vector, and rely on the compiler/move semantics
> to make it efficient?
> monitor->EnumerateActiveThreads(&activeThreads);
> // Process the returned data.

The problem is ThreadHangStats is a member of BackgroundHangThread, which is only guaranteed to exist within the lock. As soon as we release the lock, BackgroundHangThread can get destroyed because the target thread has exited, making the returned pointer in activeThreads invalid.

I don't think exposing BackgroundHangThread's ref-counting will work either because there's a race condition right after the ref-count becomes 0 and before the destructor is called. At that instant, we can still take the lock and try to call BackgroundHangThread::AddRef, but it won't work because the object is already being destroyed.
(In reply to Nathan Froyd (:froydnj) from comment #3)
> Comment on attachment 829445 [details] [diff] [review]
> Add ThreadHangStats for collecting background hang telemetry (v1)
> 
> Review of attachment 829445 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Just a few drive-by comments.

Thanks! :)

> @@ +47,5 @@
> > +
> > +private:
> > +  static uint32_t GetHash(const Stack& aStack);
> > +
> > +  Stack mStack;
> 
> Can we make this const, too?

mStack gets cleared in HangHistogram's (implicit) move constructor, so it can't be const unfortunately. Maybe I should write out the move constructor to make this more apparent.

I addressed the other two comments in the new patch.
Attachment #829445 - Attachment is obsolete: true
Attachment #829445 - Flags: review?(vdjeric)
Attachment #830285 - Flags: review?(vdjeric)
Attachment 829510 [details] [diff] lets Telemetry iterate over active threads, but once a thread is killed, we need to move the collected stats to Telemetry so that the stats can still be reported.
Attachment #830322 - Flags: review?(vdjeric)
This patch exposes the collected thread stats to JS through nsITelemetry, and make threadHangStats part of a telemetry ping.
Attachment #830338 - Flags: review?(vdjeric)
(In reply to Jim Chen [:jchen :nchen] from comment #5)
> (In reply to Nathan Froyd (:froydnj) from comment #4)
> > This does not excite me; locks "leaking" out like this seems like a recipe
> > for trouble.
> 
> Yeah it's not the best :) But I think using RAII here will prevent most
> mistakes. And hopefully the only user of this will be Telemetry.

"Hopefully". :)  But I agree that Telemetry being the only user is a likely scenario.

> The problem is ThreadHangStats is a member of BackgroundHangThread, which is
> only guaranteed to exist within the lock. As soon as we release the lock,
> BackgroundHangThread can get destroyed because the target thread has exited,
> making the returned pointer in activeThreads invalid.

Mmm, good point.

This all feels too complicated, but maybe that's because Threaded Programming Is Hard.

I'm inclined to give r+, since you've addressed my comments, but I'd like to see the actual recording of bits in ThreadHangStats to see how this all fits together first.
Comment on attachment 830322 [details] [diff] [review]
Record inactive thread hang stats inside telemetry (v1)

Review of attachment 830322 [details] [diff] [review]:
-----------------------------------------------------------------

Drive-by comment here, trying to look at the big picture.

::: xpcom/threads/BackgroundHangMonitor.cpp
@@ +280,5 @@
>      sTlsKey.set(nullptr);
>    }
> +
> +  // Move our copy of ThreadHangStats to Telemetry storage
> +  Telemetry::RecordThreadHangStats(mStats);

This copies the mutex, too, right?  That seems kind of wasteful, as the mutex isn't necessary once we're in long-term storage.  Maybe the mutex doesn't belong in mStats?
Here's the patch that collects data into ThreadHangStats
Attachment #830995 - Flags: review?(nfroyd)
(In reply to Jim Chen [:jchen :nchen] from comment #11)
> Created attachment 830995 [details] [diff] [review]
> Collect thread hang stats in BackgroundHangMonitor (v1)
> 
> Here's the patch that collects data into ThreadHangStats

It requires ThreadStackHelper from bug 935092.
(In reply to Nathan Froyd (:froydnj) from comment #10)
> Comment on attachment 830322 [details] [diff] [review]
> Record inactive thread hang stats inside telemetry (v1)
> 
> Review of attachment 830322 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Drive-by comment here, trying to look at the big picture.
> 
> ::: xpcom/threads/BackgroundHangMonitor.cpp
> @@ +280,5 @@
> >      sTlsKey.set(nullptr);
> >    }
> > +
> > +  // Move our copy of ThreadHangStats to Telemetry storage
> > +  Telemetry::RecordThreadHangStats(mStats);
> 
> This copies the mutex, too, right?  That seems kind of wasteful, as the
> mutex isn't necessary once we're in long-term storage.  Maybe the mutex
> doesn't belong in mStats?

Yeah good point. We actually don't need the mutex now because we moved hang detection to the monitor thread, which holds the global lock.
Comment on attachment 830995 [details] [diff] [review]
Collect thread hang stats in BackgroundHangMonitor (v1)

Review of attachment 830995 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/BackgroundHangMonitor.cpp
@@ +15,5 @@
>  #include "js/HangMonitorSupport.h"
>  
>  #include "prinrval.h"
>  #include "prthread.h"
> +#include "ThreadStackHelper.h"

It would be really nice if this lived in xpcom, rather than in some other random place in the tree.  I'm not going to block the bug on that, because I have a feeling it would be some work to move.  Can you file a followup bug?

@@ +325,5 @@
>  {
>    // Recovered from a hang; called on the monitor thread
>    // mManager->mLock IS locked
>  
> +  Telemetry::HangHistogram newHistogram(Move(mHangStack));

HangHistogram is completely misnamed, right?  It doesn't (at least if I am looking at the right patch) have any histogram associated with it; it's just a stack with some auxiliary information for fast comparison.

@@ +335,5 @@
> +      return;
> +    }
> +  }
> +  // Add new histogram
> +  newHistogram.Add(aHangTime);

I don't see HangHistogram::Add.  Maybe you just want to keep track of the counts here?  I don't think it's worth having a completely separate histogram for every stack in addition to the histogram for the normal activity.

@@ +337,5 @@
> +  }
> +  // Add new histogram
> +  newHistogram.Add(aHangTime);
> +  // Lock so we can access the hangs vector
> +  MutexAutoLock autoLock(mStats.mHangsLock);

As discussed elsewhere, we shouldn't have this.
Attachment #830995 - Flags: review?(nfroyd) → feedback+
Comment on attachment 830322 [details] [diff] [review]
Record inactive thread hang stats inside telemetry (v1)

Review of attachment 830322 [details] [diff] [review]:
-----------------------------------------------------------------

Another drive-by comment.

::: toolkit/components/telemetry/Telemetry.cpp
@@ +2073,5 @@
> +    return;
> +
> +  MutexAutoLock autoLock(sTelemetry->mThreadHangStatsMutex);
> +
> +  sTelemetry->mThreadHangStats.append(Move(aStats));

aStats is basically a big block of memory, so this is doing a memory copy, correct?  Maybe we should have the hang monitor thread hold a pointer to the hang stats and then just relinquish ownership here.

::: toolkit/components/telemetry/Telemetry.h
@@ +182,5 @@
> + * for active ThreadHangStats through BackgroundHangMonitor, but once a
> + * thread exits, the thread's copy of ThreadHangStats needs to be moved to
> + * inside Telemetry using this function.
> + *
> + * @param aHistogram ThreadHangStats to save; the data inside aHistogram

You mean aStats here and throughout the comment.
(In reply to Nathan Froyd (:froydnj) from comment #14)
> Comment on attachment 830995 [details] [diff] [review]
> Collect thread hang stats in BackgroundHangMonitor (v1)
> 
> Review of attachment 830995 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: xpcom/threads/BackgroundHangMonitor.cpp
> @@ +15,5 @@
> >  #include "js/HangMonitorSupport.h"
> >  
> >  #include "prinrval.h"
> >  #include "prthread.h"
> > +#include "ThreadStackHelper.h"
> 
> It would be really nice if this lived in xpcom, rather than in some other
> random place in the tree.  I'm not going to block the bug on that, because I
> have a feeling it would be some work to move.  Can you file a followup bug?

ThreadStackHelper.h will live in xpcom/threads/ (bug 935092 attachment 830940 [details] [diff] [review]), although I see the header should be moved to "mozilla/ThreadStackHelper.h"

> @@ +325,5 @@
> >  {
> >    // Recovered from a hang; called on the monitor thread
> >    // mManager->mLock IS locked
> >  
> > +  Telemetry::HangHistogram newHistogram(Move(mHangStack));
> 
> HangHistogram is completely misnamed, right?  It doesn't (at least if I am
> looking at the right patch) have any histogram associated with it; it's just
> a stack with some auxiliary information for fast comparison.
>
> @@ +335,5 @@
> > +      return;
> > +    }
> > +  }
> > +  // Add new histogram
> > +  newHistogram.Add(aHangTime);
> 
> I don't see HangHistogram::Add.  Maybe you just want to keep track of the
> counts here?  I don't think it's worth having a completely separate
> histogram for every stack in addition to the histogram for the normal
> activity.

So HangHistogram inherits from TimeHistogram, including TimeHistogram::Add (attachment 830285 [details] [diff] [review]). I think it'd be nice to have a per-hang histogram. If a certain code path is slow on some devices/configurations versus others, it'd be easy to tell with per-hang histograms.

> @@ +337,5 @@
> > +  }
> > +  // Add new histogram
> > +  newHistogram.Add(aHangTime);
> > +  // Lock so we can access the hangs vector
> > +  MutexAutoLock autoLock(mStats.mHangsLock);
> 
> As discussed elsewhere, we shouldn't have this.

I'll update the patch.
Removed mHangsLock from ThreadHangStats
Attachment #830285 - Attachment is obsolete: true
Attachment #830285 - Flags: review?(vdjeric)
Attachment #832574 - Flags: review?(vdjeric)
Removed lock on mHangsLock
Attachment #830338 - Attachment is obsolete: true
Attachment #830338 - Flags: review?(vdjeric)
Attachment #832576 - Flags: review?(vdjeric)
(In reply to Jim Chen [:jchen :nchen] from comment #16)
> (In reply to Nathan Froyd (:froydnj) from comment #14)
> > Comment on attachment 830995 [details] [diff] [review]
> > Collect thread hang stats in BackgroundHangMonitor (v1)
> > 
> > Review of attachment 830995 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > >  
> > >  #include "prinrval.h"
> > >  #include "prthread.h"
> > > +#include "ThreadStackHelper.h"
> > 
> > It would be really nice if this lived in xpcom, rather than in some other
> > random place in the tree.  I'm not going to block the bug on that, because I
> > have a feeling it would be some work to move.  Can you file a followup bug?
> 
> ThreadStackHelper.h will live in xpcom/threads/ (bug 935092 attachment
> 830940 [details] [diff] [review]), although I see the header should be moved
> to "mozilla/ThreadStackHelper.h"

Hm ThreadStackHelper.h is not actually exported, and we are including it from the current source directory, so I think |#include "ThreadStackHelper.h"| here is correct.

> > @@ +337,5 @@
> > > +  }
> > > +  // Add new histogram
> > > +  newHistogram.Add(aHangTime);
> > > +  // Lock so we can access the hangs vector
> > > +  MutexAutoLock autoLock(mStats.mHangsLock);
> > 
> > As discussed elsewhere, we shouldn't have this.
> 
> I'll update the patch.

Done.
Attachment #830995 - Attachment is obsolete: true
Attachment #832589 - Flags: review?(nfroyd)
Comment on attachment 829510 [details] [diff] [review]
Add way for telemetry to iterate over active threads (v1)

Review of attachment 829510 [details] [diff] [review]:
-----------------------------------------------------------------

Asking r? again since you said you were waiting to see the patch in attachment 832589 [details] [diff] [review]. Thanks!
Attachment #829510 - Flags: review?(nfroyd)
Comment on attachment 832574 [details] [diff] [review]
Add ThreadHangStats for collecting background hang telemetry (v3)

Review of attachment 832574 [details] [diff] [review]:
-----------------------------------------------------------------

I need a bit more background:
- What's the rationale behind creating another representation for exponential histograms in Telemetry?
- Where is the code for capturing hang stacks?
Attachment #832574 - Flags: review?(vdjeric)
Comment on attachment 830322 [details] [diff] [review]
Record inactive thread hang stats inside telemetry (v1)

Review of attachment 830322 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +248,5 @@
>  #if defined(MOZ_ENABLE_PROFILER_SPS)
>    static void RecordChromeHang(uint32_t duration,
>                                 Telemetry::ProcessedStack &aStack);
>  #endif
> +  static void RecordThreadHangStats(Telemetry::ThreadHangStats& aStats);

aStats should be const?
Attachment #830322 - Flags: review?(vdjeric)
Comment on attachment 832576 [details] [diff] [review]
Expose thread hang stats through nsITelemetry and telemetry ping (v2)

Review of attachment 832576 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +1718,5 @@
> +CreateJSTimeHistogram(JSContext* cx, const Telemetry::TimeHistogram& time)
> +{
> +  // Find the last non-zero index in histogram
> +  size_t lastNonZero = ArrayLength(time) - 1;
> +  for (; lastNonZero > 0; lastNonZero--) {

what if the bucket at index 0 is the only non-empty bucket?

@@ +1746,5 @@
> +  }
> +
> +  const Telemetry::HangHistogram::Stack& hangStack = hang.GetStack();
> +  JS::RootedObject stack(cx,
> +    JS_NewArrayObject(cx, hangStack.length(), nullptr));

Sometimes you pass a 0 length to JS_NewArrayObject and sometimes you don't. Is there a reason?

@@ +1803,5 @@
> +    }
> +  }
> +  // Return object even if call fails
> +  JS_DefineProperty(cx, ret, "hangs", OBJECT_TO_JSVAL(hangs),
> +                    nullptr, nullptr, JSPROP_ENUMERATE);

What's the benefit of returning partial results vs returning nothing if something unexpected goes wrong?

::: toolkit/components/telemetry/nsITelemetry.idl
@@ +122,5 @@
> +   */
> +  [implicit_jscontext]
> +  readonly attribute jsval threadHangStats;
> +
> +  /*

Wrt threadHangStats.activity, will it be useful to have histograms of normal background threads' activities without any other descriptive info?
Attachment #832576 - Flags: review?(vdjeric)
(In reply to Vladan Djeric (:vladan) from comment #21)
> Comment on attachment 832574 [details] [diff] [review]
> Add ThreadHangStats for collecting background hang telemetry (v3)
> 
> Review of attachment 832574 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I need a bit more background:
> - What's the rationale behind creating another representation for
> exponential histograms in Telemetry?

TimeHistogram needs to have very low overhead because we can be taking thousands of measurements per second. The Chromium histogram has more flexibility but is also more expensive, and I think the additional overhead outweighs any benefits here.

> - Where is the code for capturing hang stacks?

Bug 935092 introduced ThreadStackHelper to get the pseudo-stack of a thread. It is used in attachment 832589 [details] [diff] [review] to get hang stacks.

(In reply to Vladan Djeric (:vladan) from comment #22)
> Comment on attachment 830322 [details] [diff] [review]
> Record inactive thread hang stats inside telemetry (v1)
> 
> Review of attachment 830322 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/Telemetry.cpp
> @@ +248,5 @@
> >  #if defined(MOZ_ENABLE_PROFILER_SPS)
> >    static void RecordChromeHang(uint32_t duration,
> >                                 Telemetry::ProcessedStack &aStack);
> >  #endif
> > +  static void RecordThreadHangStats(Telemetry::ThreadHangStats& aStats);
> 
> aStats should be const?

RecordThreadHangStats moves data from aStats to the internal vector, and aStats gets cleared in the process.

(In reply to Vladan Djeric (:vladan) from comment #23)
> Comment on attachment 832576 [details] [diff] [review]
> Expose thread hang stats through nsITelemetry and telemetry ping (v2)
> 
> Review of attachment 832576 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/Telemetry.cpp
> @@ +1718,5 @@
> > +CreateJSTimeHistogram(JSContext* cx, const Telemetry::TimeHistogram& time)
> > +{
> > +  // Find the last non-zero index in histogram
> > +  size_t lastNonZero = ArrayLength(time) - 1;
> > +  for (; lastNonZero > 0; lastNonZero--) {
> 
> what if the bucket at index 0 is the only non-empty bucket?

lastNonZero will be 0 because as long as lastNonZero > 0, the loop continues and lastNonZero decrements.

> @@ +1746,5 @@
> > +  }
> > +
> > +  const Telemetry::HangHistogram::Stack& hangStack = hang.GetStack();
> > +  JS::RootedObject stack(cx,
> > +    JS_NewArrayObject(cx, hangStack.length(), nullptr));
> 
> Sometimes you pass a 0 length to JS_NewArrayObject and sometimes you don't.
> Is there a reason?

I felt it would be more efficient to pre-allocate the array if we are certain of its final length, but I don't know for sure.

> @@ +1803,5 @@
> > +    }
> > +  }
> > +  // Return object even if call fails
> > +  JS_DefineProperty(cx, ret, "hangs", OBJECT_TO_JSVAL(hangs),
> > +                    nullptr, nullptr, JSPROP_ENUMERATE);
> 
> What's the benefit of returning partial results vs returning nothing if
> something unexpected goes wrong?

I think the only benefit is we can potentially get some sets of data even if we lose other sets of data. The data sets are independent so this should be okay. I can change this though if Telemetry prefers no data over partial data.

> ::: toolkit/components/telemetry/nsITelemetry.idl
> @@ +122,5 @@
> > +   */
> > +  [implicit_jscontext]
> > +  readonly attribute jsval threadHangStats;
> > +
> > +  /*
> 
> Wrt threadHangStats.activity, will it be useful to have histograms of normal
> background threads' activities without any other descriptive info?

I think it'd be useful for things like responsiveness targeting (e.g. Foo thread has a responsiveness goal of 50ms, but are we really meeting that on all platforms?) or device correlations (e.g. Foo thread does lots of I/O, but how large is the impact on Bar device with known slow I/O?)
- My preference is that we not report anything to Telemetry if we don't have complete data, it will make it easier to spot when things went wrong
- My concern with threadHangStats.activity was that the data will be hard to interpret, e.g. there might be too much variation in composition thread event times from different workloads (i.e. webpages) such that it might be impossible to make reliable conclusions about devices. But if you guys think you can make use of this data, I'm not objecting to collecting it
- What do you think of stringifying the new histograms to the same string format as the existing histograms? It'd make it a bit easier to hook it up to the dashboard + the about:telemetry page, although some of the field such as "sum_squares" would have to be set to 0
- Have you done any tests to check the impact of this system on memory use or browser performance?

r+ otherwise
(In reply to Vladan Djeric (:vladan) from comment #25)
> - My preference is that we not report anything to Telemetry if we don't have
> complete data, it will make it easier to spot when things went wrong

Yes please.  We've tried to not hand out incomplete things in Telemetry; let's continue that tradition.

> - My concern with threadHangStats.activity was that the data will be hard to
> interpret, e.g. there might be too much variation in composition thread
> event times from different workloads (i.e. webpages) such that it might be
> impossible to make reliable conclusions about devices. But if you guys think
> you can make use of this data, I'm not objecting to collecting it

Related: are you planning on having the same timeouts for things on mobile and desktop?  I suppose the first application here is graphics code, where you want to draw everything at 60 fps.  So maybe the event times (ideally) won't change too much.

> - What do you think of stringifying the new histograms to the same string
> format as the existing histograms? It'd make it a bit easier to hook it up
> to the dashboard + the about:telemetry page, although some of the field such
> as "sum_squares" would have to be set to 0

I approve of this idea.
(In reply to Vladan Djeric (:vladan) from comment #25)
> - My preference is that we not report anything to Telemetry if we don't have
> complete data, it will make it easier to spot when things went wrong

Makes sense. I made the change.

> - My concern with threadHangStats.activity was that the data will be hard to
> interpret, e.g. there might be too much variation in composition thread
> event times from different workloads (i.e. webpages) such that it might be
> impossible to make reliable conclusions about devices. But if you guys think
> you can make use of this data, I'm not objecting to collecting it

I think it's worth a try to see what kind of data we get.

> - What do you think of stringifying the new histograms to the same string
> format as the existing histograms? It'd make it a bit easier to hook it up
> to the dashboard + the about:telemetry page, although some of the field such
> as "sum_squares" would have to be set to 0

Okay, I made the change. Can you take one last look at CreateJSTimeHistogram in Telemetry.cpp?

> - Have you done any tests to check the impact of this system on memory use
> or browser performance?

We allocate about 1kB for each thread being monitored. If there are a lot of hangs, each additional hang record consumes 200B. I measured the CPU times of the hang monitor thread and the most it used was around 0.1% of total CPU time for the process. Additional overhead in each monitored thread should be negligible by design.

(In reply to Nathan Froyd (:froydnj) from comment #26)
> 
> Related: are you planning on having the same timeouts for things on mobile
> and desktop?  I suppose the first application here is graphics code, where
> you want to draw everything at 60 fps.  So maybe the event times (ideally)
> won't change too much.

I think in general we want the same timeouts for mobile and desktop; i.e. we want mobile to have similar responsiveness as desktop.
Attachment #832576 - Attachment is obsolete: true
Attachment #8334714 - Flags: review?(vdjeric)
Comment on attachment 8334714 [details] [diff] [review]
Expose thread hang stats through nsITelemetry and telemetry ping (v3)

Review of attachment 8334714 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +1734,5 @@
> +      !JS_DefineProperty(cx, ret, "histogram_type",
> +                         INT_TO_JSVAL(nsITelemetry::HISTOGRAM_EXPONENTIAL),
> +                         nullptr, nullptr, JSPROP_ENUMERATE)) {
> +    return nullptr;
> +  }

What about sum_squares_lo and sum_squares_hi

@@ +1735,5 @@
> +                         INT_TO_JSVAL(nsITelemetry::HISTOGRAM_EXPONENTIAL),
> +                         nullptr, nullptr, JSPROP_ENUMERATE)) {
> +    return nullptr;
> +  }
> +  // TODO: calculate "sum", "log_sum", and "log_sum_squares"

Talk to Mark Reid to ensure the server doesn't reject histograms with inconsistent fields

::: toolkit/components/telemetry/TelemetryPing.js
@@ +312,5 @@
>    },
>  
> +  getThreadHangStats: function getThreadHangStats(stats) {
> +    let self = this;
> +    stats.forEach(function (thread) {

You can use arrow notation here instead, it won't rebind "this" so you won't need to do "let self = this;"
Attachment #8334714 - Flags: review?(vdjeric) → review+
Comment on attachment 829510 [details] [diff] [review]
Add way for telemetry to iterate over active threads (v1)

Review of attachment 829510 [details] [diff] [review]:
-----------------------------------------------------------------

I don't see a better way forward here than what you've got.  r=me
Attachment #829510 - Flags: review?(nfroyd) → review+
Attachment #832589 - Flags: review?(nfroyd) → review+
(In reply to Vladan Djeric (:vladan) from comment #28)
> Comment on attachment 8334714 [details] [diff] [review]
> Expose thread hang stats through nsITelemetry and telemetry ping (v3)
> 
> Review of attachment 8334714 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/Telemetry.cpp
> @@ +1734,5 @@
> > +      !JS_DefineProperty(cx, ret, "histogram_type",
> > +                         INT_TO_JSVAL(nsITelemetry::HISTOGRAM_EXPONENTIAL),
> > +                         nullptr, nullptr, JSPROP_ENUMERATE)) {
> > +    return nullptr;
> > +  }
> 
> What about sum_squares_lo and sum_squares_hi

AFAICT, they're not recorded for exponential histograms.

> @@ +1735,5 @@
> > +                         INT_TO_JSVAL(nsITelemetry::HISTOGRAM_EXPONENTIAL),
> > +                         nullptr, nullptr, JSPROP_ENUMERATE)) {
> > +    return nullptr;
> > +  }
> > +  // TODO: calculate "sum", "log_sum", and "log_sum_squares"
> 
> Talk to Mark Reid to ensure the server doesn't reject histograms with
> inconsistent fields

Okay. I think we'd be fine since these are not under the "histograms" object in the telemetry ping.

> ::: toolkit/components/telemetry/TelemetryPing.js
> @@ +312,5 @@
> >    },
> >  
> > +  getThreadHangStats: function getThreadHangStats(stats) {
> > +    let self = this;
> > +    stats.forEach(function (thread) {
> 
> You can use arrow notation here instead, it won't rebind "this" so you won't
> need to do "let self = this;"

Done.
Attachment #8334714 - Attachment is obsolete: true
Attachment #8336913 - Flags: review+
Attachment #830322 - Flags: review+
Attachment #832574 - Flags: review+
Forgot to change self to this.
Attachment #8336913 - Attachment is obsolete: true
Attachment #8336925 - Flags: review+

Updated

5 years ago
Whiteboard: [qa-]
You need to log in before you can comment on or make changes to this bug.