Closed Bug 1380081 Opened 7 years ago Closed 7 years ago

Update BHR ping format

Categories

(Core :: XPCOM, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: nika, Assigned: nika)

References

Details

(Whiteboard: [measurement:client:tracking])

Attachments

(15 files, 35 obsolete files)

11.56 KB, patch
Details | Diff | Splinter Review
8.91 KB, patch
Details | Diff | Splinter Review
47.90 KB, patch
Details | Diff | Splinter Review
8.52 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
3.38 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
5.35 KB, patch
Details | Diff | Splinter Review
38.62 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
5.62 KB, patch
Details | Diff | Splinter Review
11.55 KB, patch
Details | Diff | Splinter Review
1.72 KB, patch
Details | Diff | Splinter Review
31.97 KB, patch
Details | Diff | Splinter Review
7.07 KB, patch
Details | Diff | Splinter Review
5.85 KB, patch
liuche
: review+
Details | Diff | Splinter Review
4.16 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
4.01 KB, patch
chutten
: review+
Details | Diff | Splinter Review
In bug 1367406 we're going to get new interleaved stacks, which will require a new format for our pings. We might as well get started on making that transition to a new format for the pings, to make the transition easier when we get interleaved stacks.

This bug is for tracking the design of the new pings and my implementation on the current infrastructure.
This is necessary because changes in future patches changed the way that unified
builds work in the threads folder, and these headers started being included in
files which didn't have a `using namespace mozilla` declaration.

MozReview-Commit-ID: 2ZNoLgNpAbi
This is the first part of purging the existing telemetry code for
ThreadHangStats from the tree. All of these features will be replaced with new
code for BHR telemetry in the future.

MozReview-Commit-ID: BhD5zY2LwUR
MozReview-Commit-ID: KTOSKobhNAJ
MozReview-Commit-ID: EtrktVmc3vP
MozReview-Commit-ID: 7TOGofAYM6W
These will be used to implement IPC serialization and deserialization of the
HangDetails object to send over IPC.

I plan to file a follow up to clean up HangAnnotations a bunch, because it seems
to have accumulated some unnecessary abstraction and complexity since it was
created.

MozReview-Commit-ID: 1WHNvhDrMF5
MozReview-Commit-ID: HKmoP2An4GP
MozReview-Commit-ID: CeikKabY9Vs
MozReview-Commit-ID: 2jG75AEUJfS
MozReview-Commit-ID: 3HyEKWdXaqU
Not putting these patches up for review today, as I don't have enough time before I have to run.

For anyone's future reference, this is the layout of the new pings (hopefully I'll add a patch at some point to add an rst to document this in tree).

{
  environment: ...,
  payload: {
    modules: [
      ["XUL.pdb", "BREAKPAD_ID"],
      ..
    ],
    hangs: [
      {
        duration: 10, // ms
        thread: "Gecko", // name of hanging thread
        runnable: "???", // Name of the runnable running, or ??? if unknown.
        process: "default", // default, content, or GPU
        annotations: {
          "key": "value",
          // e.g. "UserInteracting": "true",
        },
        stack: [
          [0 /* offset in modules */, "FF0F0B" /* Offset in the module as a hex string */ ],
          "Some PseudoStack or JS frame entry", // NOTE: Not interleaved yet
          ...
        ],
        // NOTE: pseudoStack will be removed once we get interleaved stacks.
        pseudoStack: [
          "Some PseudoStack entry",
        ]
      }
    ]
  },
}

dthayer, do you have any thoughts as to how I could either a) make this a better format for the hang, or b) what information we're loosing that might be useful for me to re-add?
Flags: needinfo?(dothayer)
That has all the information that I use, and it should be trivial to adjust to this format. :)
Flags: needinfo?(dothayer)
This is necessary because changes in future patches changed the way that unified
builds work in the threads folder, and these headers started being included in
files which didn't have a `using namespace mozilla` declaration.

MozReview-Commit-ID: 2ZNoLgNpAbi
This is the first part of purging the existing telemetry code for
ThreadHangStats from the tree. All of these features will be replaced with new
code for BHR telemetry in the future.

MozReview-Commit-ID: BhD5zY2LwUR
MozReview-Commit-ID: KTOSKobhNAJ
MozReview-Commit-ID: EtrktVmc3vP
MozReview-Commit-ID: 7TOGofAYM6W
These will be used to implement IPC serialization and deserialization of the
HangDetails object to send over IPC.

In a later part HangAnnotations is almost completely removed, making this patch
largely obsolete. I haven't removed it though, as rebasing everything would take
a very long time.

Let me know if you want me to try and move that patch here.

MozReview-Commit-ID: 1WHNvhDrMF5
A new test using the new APIs is introduced in a later part. This test no longer
functions as ThreadHangStats is no longer present.

MozReview-Commit-ID: HKmoP2An4GP
We're going to use HangDetails as the type containing hang information. We'll
have a JS component which reads the data out of nsIHangDetails, builds the
payload, and submits it to telemetry for us.

We'll do it in JS because telemetry has to be submitted from JS.

MozReview-Commit-ID: CeikKabY9Vs
This patch adds the BHRTelemetryService which is a JS implemented XPCOM service
that simply listens to the bhr-thread-hang observer notification, and uses the
data it collects from it to submit telemetry pings.

data-r=bsmedberg (see bug 1380081 comment 13 for the ping format).

MozReview-Commit-ID: 2hPXAFmHrm5
BHRTelemetryService only runs in the parent process (and we can only submit
pings from there), so we need to send the data which we collect in the GPU and
Content processes over IPC to the parent process.

The IPC serialization code added in this patch is cleaned up in later patches.

MozReview-Commit-ID: 8B5uZKbjNbU
These changes are going to increase the amount of data which we collect from BHR
a lot. It would be dangerous to run it on beta, especially considering how soon
the next merge is.

This should turn it off for 100% of beta users if I understand the logic
correctly.

MozReview-Commit-ID: 3HyEKWdXaqU
HangAnnotations was very complex, required a separate allocation, and used this
unfortunate virtual interface implementation which made it harder to do
interesting things with it (such as serialize it over IPC).

This new implementation is much simpler and more concrete, making
HangAnnotations simply be a nsTArray<Annotation>. This also simplifies some of
the IPC code which was added in part 10.

MozReview-Commit-ID: EzaaxdHpW1t
I intend to extend HangStack with interleaved stack capabilities when we get
them so I figured making it serialize itself over IPC was a good idea. I decided
not to make ProcessedStacks have IPC serialization directly implemented on it as
it is used by other pieces of code, and I don't intend to use it at all in BHR
after we get interleaved stacks.

MozReview-Commit-ID: I69mAZdEOWD
This patch just adds some tests for the new stuff which we added.

MozReview-Commit-ID: 2jG75AEUJfS
Attachment #8888451 - Attachment is obsolete: true
Attachment #8888452 - Attachment is obsolete: true
Attachment #8888453 - Attachment is obsolete: true
Attachment #8888454 - Attachment is obsolete: true
Attachment #8888455 - Attachment is obsolete: true
Attachment #8888456 - Attachment is obsolete: true
Attachment #8888457 - Attachment is obsolete: true
Attachment #8888458 - Attachment is obsolete: true
Attachment #8888459 - Attachment is obsolete: true
Attachment #8888460 - Attachment is obsolete: true
Attachment #8888461 - Attachment is obsolete: true
Attachment #8888462 - Attachment is obsolete: true
Attachment #8888935 - Flags: review?(chutten)
Attachment #8888936 - Flags: review?(chutten)
Attachment #8888940 - Flags: review?(chutten)
Comment on attachment 8888942 [details] [diff] [review]
Part 9: Report bhr-thread-hang hangs to telemetry in a custom bhr ping, r=bsmedberg, r=froydnj

data-review?
Attachment #8888942 - Flags: review?(benjamin)
Attachment #8888934 - Flags: review?(nfroyd)
Attachment #8888934 - Flags: review?(nfroyd) → review+
Comment on attachment 8888942 [details] [diff] [review]
Part 9: Report bhr-thread-hang hangs to telemetry in a custom bhr ping, r=bsmedberg, r=froydnj

I don't want to be code reviewer for this, so please make sure there is another code reviewer.

In terms of data review: the data format needs to be documented in the tree so that it shows up in gecko.readthedocs.org. See e.g. https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/crash-ping.html https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/docs/data/crash-ping.rst?q=file%3Acrash-ping&redirect_type=single

The data review takes place primarily against the docs, and you and the code reviewer are responsible for making sure that the code matches the docs. data-review- for now.
Attachment #8888942 - Flags: review?(benjamin) → review-
HangAnnotations was very complex, required a separate allocation, and used this
unfortunate virtual interface implementation which made it harder to do
interesting things with it (such as serialize it over IPC).

This new implementation is much simpler and more concrete, making
HangAnnotations simply be a nsTArray<Annotation>. This also simplifies some of
the IPC code which was added in part 10.

MozReview-Commit-ID: EzaaxdHpW1t
Attachment #8889493 - Flags: review?(nfroyd)
Attachment #8888945 - Attachment is obsolete: true
Attachment #8888937 - Flags: review?(nfroyd)
Attachment #8888938 - Flags: review?(nfroyd)
Attachment #8888939 - Flags: review?(nfroyd)
Attachment #8888941 - Flags: review?(nfroyd)
Comment on attachment 8888942 [details] [diff] [review]
Part 9: Report bhr-thread-hang hangs to telemetry in a custom bhr ping, r=bsmedberg, r=froydnj

Nathan, can you review the code for now - I'll write up the docs bsmedberg requested and make those a separate patch.
Attachment #8888942 - Flags: review?(nfroyd)
Attachment #8888943 - Flags: review?(nfroyd)
Attachment #8888944 - Flags: review?(nfroyd)
Attachment #8888946 - Flags: review?(nfroyd)
Attachment #8888947 - Flags: review?(nfroyd)
Comment on attachment 8888935 [details] [diff] [review]
Part 2: Remove getChildThreadHangs, r=chutten

Review of attachment 8888935 [details] [diff] [review]:
-----------------------------------------------------------------

Alas poor statuser, my first addon.

Oh well, it's not like it could be made compatible for 57, so it was already mostly dead :)
Attachment #8888935 - Flags: review?(chutten) → review+
Comment on attachment 8888936 [details] [diff] [review]
Part 3: Remove ThreadHangStats from about:telemetry, r=chutten

Review of attachment 8888936 [details] [diff] [review]:
-----------------------------------------------------------------

about:telemetry is still going to have to render threadhangs, regardless of what ping they're being transported on. Are you sure that removing this is necessary/desirable?
Attachment #8888936 - Flags: review?(chutten)
Comment on attachment 8888940 [details] [diff] [review]
Part 7: Remove test_ThreadHangStats.js, r=chutten

Review of attachment 8888940 [details] [diff] [review]:
-----------------------------------------------------------------

I'd like to see the new test so I can ensure we're not losing coverage.
Attachment #8888940 - Flags: review?(chutten) → review+
(In reply to Chris H-C :chutten from comment #34)
> Alas poor statuser, my first addon.
> 
> Oh well, it's not like it could be made compatible for 57, so it was already
> mostly dead :)

I think it would be possible to rewrite the features of statuser on top of the new bhr-thread-hang observer notification which I am improving in this patch, so it isn't dead-dead, but if it isn't making the transition to 57 anyway, then that's OK.

I have been considering to make the observer notification visible through the profiler webextension API as well, as the foxfooding webextension is going to want to be able to trigger profile collection when the browser hangs.
Comment on attachment 8888947 [details] [diff] [review]
Part 14: Add a test for BHR observer notifications

chutten: This is the new test which is replacing test_ThreadHangStats. You mentioned you would be interested to look at it.
Attachment #8888947 - Flags: feedback?(chutten)
Comment on attachment 8888936 [details] [diff] [review]
Part 3: Remove ThreadHangStats from about:telemetry, r=chutten

The old renderer is no longer relevant in any way. The data format is completely different, and keeping any code wouldn't be helpful.

According to gfritzsche on #telemetry, the new custom pings should be being rendered automatically as raw JSON without me having to do any extra work. That view will be more useful than one based on the old code, as it will actually contain the new information.

We should probably add a nicer UI for BHR to about:telemetry in the future, but I'd prefer to do it in a followup.

Re-requesting review with this explanation.
Attachment #8888936 - Flags: review?(chutten)
I've added the documentation for the new ping format.

r?froydnj
data-r?bsmedberg

MozReview-Commit-ID: G4hFZcR2EGL
Attachment #8889953 - Flags: review?(nfroyd)
Attachment #8889953 - Flags: review?(benjamin)
Comment on attachment 8888936 [details] [diff] [review]
Part 3: Remove ThreadHangStats from about:telemetry, r=chutten

Fair 'nuff
Attachment #8888936 - Flags: review?(chutten) → review+
Comment on attachment 8888947 [details] [diff] [review]
Part 14: Add a test for BHR observer notifications

Review of attachment 8888947 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks
Attachment #8888947 - Flags: feedback?(chutten) → feedback+
Just realized - in the aggregation job, I normalize by total subsession length for a given build date. It would be nice to not have to go out to the main ping for this, especially since if it's on a different schedule it will be out of sync with the data from this ping. Can we include a value that contains the number of seconds of usage time between now and when the last ping was sent?
Comment on attachment 8889953 [details] [diff] [review]
Part 15: Add telemetry documentation for the new ping format, r=bsmedberg

>diff --git a/toolkit/components/telemetry/docs/data/bhr-ping.rst b/toolkit/components/telemetry/docs/data/bhr-ping.rst

>+"bhr" ping
>+==========
>+
>+This ping is captured whenever a thread monitored by the Background Hang Monitor
>+hangs. It includes non-identifying metadata about the hang.

I'm a bit confused about when this ping is sent. From reading this sentence, I assumed that there was one ping per hang even. But the ping format below has many hangs in a single ping. Can you clarify here? Are they batched up so that e.g. a ping is sent after 10-100 hang events? Is there any throttling/limit on how often or big?

>+            "process": <type>, // Type of process that hung, see below for a list of types.

For content processes in particular, I'd encourage you to also annotate the subtype/remote type: "web", "file", "extension" (see bug 1370178). (can be a followup)

>+            "annotations": { // list of present hang annotations.
>+                             // These are the same annotations used by chromeHangs.
>+              <key>: <value>,
>+              ...
>+            },

I realize that this is probably just moving existing stuff around: is there a list of possible annotations? (file followup if necessary: it would be nice to have a list in yaml format so we don't introduce privacy risk by accident).
Flags: needinfo?(michael)
Comment on attachment 8888937 [details] [diff] [review]
Part 4: Remove the ThreadHangStats object and related code, r=froydnj

Review of attachment 8888937 [details] [diff] [review]:
-----------------------------------------------------------------

I am going to assume that we can remove all of this code because you removed clients of said code in part 3 and earlier?  Any explanatory text to that effect would have been appreciated.

r=me with that assumption; if that's not the case, we need to discuss the patch further.
Attachment #8888937 - Flags: review?(nfroyd) → review+
Comment on attachment 8888938 [details] [diff] [review]
Part 5: Move BHR into its own component, r=froydnj

Review of attachment 8888938 [details] [diff] [review]:
-----------------------------------------------------------------

Um, sure, I guess, but why?  A justification would be really nice.

::: toolkit/components/moz.build
@@ +19,5 @@
>      'addoncompat',
>      'alerts',
>      'apppicker',
>      'asyncshutdown',
> +    'bhr',

I think spelling this out would be more in line with the other directories in here, and also easier for future readers to understand.

It would be great if the condition for enabling BHR was actually in this moz.build file and avoided building the `bhr` directory entirely if we didn't need to.
Attachment #8888938 - Flags: review?(nfroyd)
Comment on attachment 8888939 [details] [diff] [review]
Part 6: Add some helper methods to HangAnnotations, r=froydnj

Review of attachment 8888939 [details] [diff] [review]:
-----------------------------------------------------------------

This might be easier if you made the ParamTraits specialization a friend of HangAnnotations, and then ParamTraits could just reach into HangAnnotations and rely on ParamTraits specializations for the members?

Also, putting things like "Let me know if you want..." into the actual commit message of the patch probably isn't a good thing; those sorts of comments should be written on the bug, not in the commit message.
Attachment #8888939 - Flags: review?(nfroyd)
Comment on attachment 8888942 [details] [diff] [review]
Part 9: Report bhr-thread-hang hangs to telemetry in a custom bhr ping, r=bsmedberg, r=froydnj

Review of attachment 8888942 [details] [diff] [review]:
-----------------------------------------------------------------

Canceling review until bsmedberg's r- is resolved.
Whiteboard: [measurement:client:tracking]
Comment on attachment 8888941 [details] [diff] [review]
Part 8: Add all necessary data for BHR to nsIHangDetails, r=froydnj

Review of attachment 8888941 [details] [diff] [review]:
-----------------------------------------------------------------

I have some comments, but I'm also not sure I understand the justification(s) in the explanation.

::: toolkit/components/bhr/HangDetails.cpp
@@ +200,5 @@
> +
> +  // NOTE: In the past we had problems with the main thread not being around
> +  // and this dispatch failing. I think the stream transport service may have
> +  // different properties, so it may work better.
> +  nsresult rv = SystemGroup::Dispatch("NotifyBHRHangObservers",

The comment here does not seem to match up with the code that it is commenting on.

@@ +252,5 @@
> +  // Write out the annotation information
> +  {
> +    // If we have no annotations, write out a 0-length annotations.
> +    if (!aParam.mAnnotations) {
> +      WriteParam(aMsg, 0);

Should we return after this?  I don't understand the comment if we shouldn't.

@@ +272,5 @@
> +    WriteParam(aMsg, length);
> +    for (size_t i = 0; i < length; ++i) {
> +      auto& module = aParam.mStack.GetModule(i);
> +      WriteParam(aMsg, module.mName);
> +      nsDependentCString breakpadId(module.mBreakpadId.c_str());

Do we not have ParamTraits specializations for std::string and friends?  http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_message_utils.h#322

@@ +365,5 @@
> +      nsAutoCString breakpadId;
> +      if (!ReadParam(aMsg, aIter, &breakpadId)) {
> +        return false;
> +      }
> +      module.mBreakpadId = breakpadId.get();

I guess you can make this more efficient with ParamTraits specializations as well?

::: toolkit/components/bhr/HangStack.cpp
@@ +11,5 @@
> +  // XXX: I should be able to just memcpy the other stack's mImpl and mBuffer,
> +  // and then just re-offset pointers.
> +  for (size_t i = 0; i < aOther.length(); ++i) {
> +    const char* s = aOther[i];
> +    if (aOther.IsInBuffer(s)) {

Can we provide a good justification for why we want an N^2 algorithm here?

::: toolkit/components/bhr/nsIHangDetails.idl
@@ +32,5 @@
> +
> +  /**
> +   * The name of the runnable which hung if it hung on the main thread.
> +   */
> +  readonly attribute ACString runnable;

Maybe `runnableName` instead?
Attachment #8888941 - Flags: review?(nfroyd)
Comment on attachment 8888943 [details] [diff] [review]
Part 10: Transmit BHR Hangs from the Content and GPU process to the parent process, r=froydnj

Review of attachment 8888943 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/bhr/HangDetails.cpp
@@ +201,5 @@
>        os->NotifyObservers(hangDetails, "bhr-thread-hang", nullptr);
>      }
> +
> +    // If we're not in the parent process, we might need to transmit our hang
> +    // details to another process, so that they can be reported there.

This seems a little weird.  Why are we notifying observers in this process, *and* sending the data to (presumably) the parent process so that the hang can be processed (expensively) there as well?  Shouldn't something be aggregating the information at a central location so we don't have to do it ourselves?

If we only notified in the parent process here, and only called SendBHRThreadHang for the content and GPU processes, that would make a bit more sense to me, but I would still be unclear why the normal telemetry aggregation mechanisms (if such things exist) shouldn't be used.

@@ +280,5 @@
>    }
>  
>    // Write out the annotation information
> +  // If we have no annotations, write out a 0-length annotations.
> +  // XXX: Everything about HangAnnotations is awful.

Maybe we should fix that?

This hunk deserves to be moved into part 8, I think.
Attachment #8888943 - Flags: review?(nfroyd)
Comment on attachment 8888942 [details] [diff] [review]
Part 9: Report bhr-thread-hang hangs to telemetry in a custom bhr ping, r=bsmedberg, r=froydnj

Review of attachment 8888942 [details] [diff] [review]:
-----------------------------------------------------------------

Really canceling review!
Attachment #8888942 - Flags: review?(nfroyd)
Comment on attachment 8888944 [details] [diff] [review]
Part 11: Stop running BHR on beta, r=froydnj

Review of attachment 8888944 [details] [diff] [review]:
-----------------------------------------------------------------

Sure.
Attachment #8888944 - Flags: review?(nfroyd) → review+
Comment on attachment 8888947 [details] [diff] [review]
Part 14: Add a test for BHR observer notifications

Review of attachment 8888947 [details] [diff] [review]:
-----------------------------------------------------------------

I think this is OK.  I'm a little worried about the timeouts causing intermittents here, but maybe it won't be a problem...
Attachment #8888947 - Flags: review?(nfroyd) → review+
Comment on attachment 8888946 [details] [diff] [review]
Part 13: Add IPC serialization for HangStack, r=froydnj

Review of attachment 8888946 [details] [diff] [review]:
-----------------------------------------------------------------

I feel like this patch should be folded into an earlier patch or is should be in the separate bug for extending HangStack?
Attachment #8888946 - Flags: review?(nfroyd)
Comment on attachment 8889493 [details] [diff] [review]
Part 12: Simplify the HangAnnotations abstraction

Review of attachment 8889493 [details] [diff] [review]:
-----------------------------------------------------------------

I would have preferred this rewrite to be much earlier in the series, would make many things easier. :)

::: toolkit/components/telemetry/HangReports.cpp
@@ +114,5 @@
>    for (auto iter = mAnnotationInfo.ConstIter(); !iter.Done(); iter.Next()) {
>      n += iter.Key().SizeOfExcludingThisIfUnshared(aMallocSizeOf);
> +    auto& annotations = iter.Data()->mAnnotations;
> +    n += annotations.ShallowSizeOfExcludingThis(aMallocSizeOf);
> +    n += annotations.Length() * sizeof(Annotation);

This doesn't look right.  ShallowSizeOfExcludingThis already measures the size of the allocated buffer where the annotations are stored, so you are double-counting the stored annotations.

::: xpcom/threads/HangAnnotations.cpp
@@ +99,5 @@
>      MutexAutoLock lock(mMutex);
>      for (std::set<Annotator*>::iterator i = mAnnotators.begin(),
>                                          e = mAnnotators.end();
>           i != e; ++i) {
> +      (*i)->AnnotateHang(annotations);

Maybe this function should be called "RecordAnnotations"?  This line just reads peculiarly, because the name makes it sound like you're passing `annotations` *in* to the function, I think, but `(*i)` isn't a Hang...so confusing.

::: xpcom/threads/HangAnnotations.h
@@ +38,5 @@
> +/**
> + * This class extends nsTArray<Annotation> with some methods for adding
> + * annotations being reported by a registered hang Annotator.
> + */
> +class HangAnnotations : public nsTArray<Annotation>

I dislike classes like this...but maybe it's not that bad here?

@@ +74,5 @@
>  void UnregisterAnnotator(Annotator& aAnnotator);
>  
>  /**
>   * Gathers annotations. This function should be called by ChromeHangs.
>   * @return UniquePtr to HangAnnotations object or nullptr if none.

This documentation string needs to be updated.  Possibly others as well?
Attachment #8889493 - Flags: review?(nfroyd)
Comment on attachment 8889953 [details] [diff] [review]
Part 15: Add telemetry documentation for the new ping format, r=bsmedberg

Review of attachment 8889953 [details] [diff] [review]:
-----------------------------------------------------------------

Benjamin has questions, I have comments.

::: toolkit/components/telemetry/docs/data/bhr-ping.rst
@@ +25,5 @@
> +          ...
> +        ],
> +        "hangs": [
> +          {
> +            "duration": <millis>, // duration of the hang in milliseconds.

Just to clarify: as a number?  Or as a string?

@@ +27,5 @@
> +        "hangs": [
> +          {
> +            "duration": <millis>, // duration of the hang in milliseconds.
> +            "thread": <name>, // name of the hanging thread.
> +            "runnable": <name>, // name of the runnable executing during the hang.

Also for clarification: these are strings?  (The use of <...> in this description is inconsistent; in some cases, you use it to denote the JS type, and in some cases, you denote a description of the value.  Consistency is preferred here; descriptions can go someplace else.)

@@ +31,5 @@
> +            "runnable": <name>, // name of the runnable executing during the hang.
> +                                // Runnable names are only collected for the XPCOM main thread.
> +            "process": <type>, // Type of process that hung, see below for a list of types.
> +            "annotations": { // list of present hang annotations.
> +                             // These are the same annotations used by chromeHangs.

Link to the documentation for chromeHangs, maybe?

@@ +51,5 @@
> +Process Types
> +-------------
> +
> +The ``process`` field contains the type of process that hung. Crashes are
> +currently sent only for the processes below:

Maybe:

The ``process`` field is a string denoting the kind of process that hung.  Crashes are currently only sent for the process kind below:

And then the `Type' column can more clearly show that these things are strings, perhaps?

@@ +67,5 @@
> +Stack Traces
> +------------
> +
> +Each hang object contains a ``stack`` field which has been populated with an
> +interleaved stack trace of the hung thread.

What does "interleaved" mean, here and in the description above?

@@ +70,5 @@
> +Each hang object contains a ``stack`` field which has been populated with an
> +interleaved stack trace of the hung thread.
> +
> +Note that this field only contains native stack frames, pseudostack and chrome
> +JS script frames.

We might not have this field with native stack frames, right?  What happens in cases where we can't collect a native stack trace for whatever reason?
Attachment #8889953 - Flags: review?(nfroyd)
(In reply to Doug Thayer [:dthayer] from comment #43)
> Just realized - in the aggregation job, I normalize by total subsession
> length for a given build date. It would be nice to not have to go out to the
> main ping for this, especially since if it's on a different schedule it will
> be out of sync with the data from this ping. Can we include a value that
> contains the number of seconds of usage time between now and when the last
> ping was sent?

Sure, I'll add a time-since-last-ping number to the payload. Perhaps `usage_time`?

(In reply to Benjamin Smedberg [:bsmedberg] from comment #44)
> >+This ping is captured whenever a thread monitored by the Background Hang Monitor
> >+hangs. It includes non-identifying metadata about the hang.
> 
> I'm a bit confused about when this ping is sent. From reading this sentence,
> I assumed that there was one ping per hang even. But the ping format below
> has many hangs in a single ping. Can you clarify here? Are they batched up
> so that e.g. a ping is sent after 10-100 hang events? Is there any
> throttling/limit on how often or big?

I will try to clarify this sentence.

This ping is transmitted whenever the browser exits, or the Background Hang Monitor hangs 50 times. 

Hangs are currently batched in groups of 50, and there is no throttling/limit as to how many pings a single session can produce.

> For content processes in particular, I'd encourage you to also annotate the
> subtype/remote type: "web", "file", "extension" (see bug 1370178). (can be a
> followup)

I'd prefer to do this in a follow-up, as this bug is big enough already. Filed bug 1385316.

> I realize that this is probably just moving existing stuff around: is there
> a list of possible annotations? (file followup if necessary: it would be
> nice to have a list in yaml format so we don't introduce privacy risk by
> accident).

Yes, it's just moving stuff around. This is an existing set of annotations. I looked around in the code base and they are:

Annotations
-----------

The following annotations are currently present in tree:

+-----------------+-------------------------------------------------+
| Name            | Description                                     |
+=================+=================================================+
| UserInteracting | "true" if the user was actively interacting     |
+-----------------+-------------------------------------------------+
| pluginName      | Name of the currently running plugin            |
+-----------------+-------------------------------------------------+
| pluginVersion   | Version of the currently running plugin         |
+-----------------+-------------------------------------------------+
| HangUIShown     | "true" if the hang UI was shown                 |
+-----------------+-------------------------------------------------+
| HangUIContinued | "true" if continue was selected in the hang UI  |
+-----------------+-------------------------------------------------+
| HangUIDontShow  | "true" if the hang UI was not shown             |
+-----------------+-------------------------------------------------V

I'll include this in the next version of the documentation. (NOTE: I think I have the correct interpretation of all of the annotations other than UserInteracting - they were added before my day and I just interpreted what they did from the code.

(In reply to Nathan Froyd [:froydnj] from comment #45)
> I am going to assume that we can remove all of this code because you removed
> clients of said code in part 3 and earlier?  Any explanatory text to that
> effect would have been appreciated.

Yes, this is correct.


(In reply to Nathan Froyd [:froydnj] from comment #46)
> Comment on attachment 8888938 [details] [diff] [review]
> Part 5: Move BHR into its own component, r=froydnj
> 
> Review of attachment 8888938 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Um, sure, I guess, but why?  A justification would be really nice.

The reasoning was that with the new changes to BHR, it was growing a lot in scale. It started to feel like it was a bit too large to just keep as a collection of files in xpcom/threads. 

I was working on the project originally, and started adding another 5 files to xpcom/threads, and it just started to feel like an unorganized mess, and then someone I was asking for input from from telemetry suggested that I should just create my own BHR module.

> I think spelling this out would be more in line with the other directories
> in here, and also easier for future readers to understand.

OK - I'll rename the module to backgroundhangreporter perhaps? Maybe threadhangmonitor would be a more clear name (I always found the name BHR odd, especially considering it's actually called BackgroundHangMonitor in tree - but it kinda stuck).

> It would be great if the condition for enabling BHR was actually in this
> moz.build file and avoided building the `bhr` directory entirely if we
> didn't need to.

Unfortunately, we do need to expose the BackgroundHangMonitor class from this module in builds which don't support BHR, as we don't guard calls to methods like BackgroundHangMonitor::Startup() or BackgroundHangMonitor::NotifyActivity() throughout the code base. I could potentially disable building more files in the module in the disabled case.

I would prefer to do that in a follow-up however, if that's OK.

(In reply to Nathan Froyd [:froydnj] from comment #47)
> This might be easier if you made the ParamTraits specialization a friend of
> HangAnnotations, and then ParamTraits could just reach into HangAnnotations
> and rely on ParamTraits specializations for the members?

I remove all of this in a later patch in the series. 

I would've put those patches earlier in the series, but I tried this approach first, and only got around to realizing how much nicer the code would be near the end of the patch chain, and wanted to avoid the hellish rebase to get the patches in the right order. I hope that's ok :-/

> Also, putting things like "Let me know if you want..." into the actual
> commit message of the patch probably isn't a good thing; those sorts of
> comments should be written on the bug, not in the commit message.

I'll remove that line from the commit message. Sorry about that. 

(In reply to Nathan Froyd [:froydnj] from comment #48)
> Canceling review until bsmedberg's r- is resolved.

I handled bsmedberg's r- by adding part 15 which contained the requested documentation. I'll clear the r- from him on this part and re-request review.

(In reply to Nathan Froyd [:froydnj] from comment #49)
> ::: toolkit/components/bhr/HangDetails.cpp
> @@ +200,5 @@
> > +
> > +  // NOTE: In the past we had problems with the main thread not being around
> > +  // and this dispatch failing. I think the stream transport service may have
> > +  // different properties, so it may work better.
> > +  nsresult rv = SystemGroup::Dispatch("NotifyBHRHangObservers",
> 
> The comment here does not seem to match up with the code that it is
> commenting on.

The comment refers to this awful hack:
http://searchfox.org/mozilla-central/rev/09c065976fd4f18d4ad764d7cb4bbc684bf56714/xpcom/threads/BackgroundHangMonitor.cpp#628-636

The new code shouldn't have to go through that awful song and dance, as we're dispatching the event to the main thread from a threadpool which isn't running during that phase of shutdown, handily side-stepping the issue. I wasn't sure if that was the case at the time, but it looks from my try runs like that is the case.

> 
> @@ +252,5 @@
> > +  // Write out the annotation information
> > +  {
> > +    // If we have no annotations, write out a 0-length annotations.
> > +    if (!aParam.mAnnotations) {
> > +      WriteParam(aMsg, 0);
> 
> Should we return after this?  I don't understand the comment if we shouldn't.

Yes, that code is distinctly wrong. The final patch doesn't include it at all but I'll rebase-in a fix for it. Sorry about that.

> Do we not have ParamTraits specializations for std::string and friends? 
> http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/
> ipc_message_utils.h#322

I uhh.. grepped the codebase for ParamTraits<std::string>, and didn't realize that there was a ParamtraitsStd type.
I'll switch to using that, thanks :).

> @@ +365,5 @@
> > +      nsAutoCString breakpadId;
> > +      if (!ReadParam(aMsg, aIter, &breakpadId)) {
> > +        return false;
> > +      }
> > +      module.mBreakpadId = breakpadId.get();
> 
> I guess you can make this more efficient with ParamTraits specializations as
> well?

Yup, I can.

> ::: toolkit/components/bhr/HangStack.cpp
> @@ +11,5 @@
> > +  // XXX: I should be able to just memcpy the other stack's mImpl and mBuffer,
> > +  // and then just re-offset pointers.
> > +  for (size_t i = 0; i < aOther.length(); ++i) {
> > +    const char* s = aOther[i];
> > +    if (aOther.IsInBuffer(s)) {
> 
> Can we provide a good justification for why we want an N^2 algorithm here?

IsInBuffer is constant time I'm pretty sure (http://searchfox.org/mozilla-central/rev/09c065976fd4f18d4ad764d7cb4bbc684bf56714/toolkit/components/telemetry/ThreadHangStats.h#138-140). It just does a pointer comparison to see if the string pointer points to some dependent storage from the other object.

Theoretically I should be able to make this copy constructor more efficient, by doing what I mention in the comment above, but it won't change the complexity, rather just making it so instead of inserting each entry one at a time, I do a memcpy and then walk over the new array performing a pointer fixup to point into the new buffer.

> > +  readonly attribute ACString runnable;
> 
> Maybe `runnableName` instead?

Sure.

(In reply to Nathan Froyd [:froydnj] from comment #50)
> This seems a little weird.  Why are we notifying observers in this process,
> *and* sending the data to (presumably) the parent process so that the hang
> can be processed (expensively) there as well?  Shouldn't something be
> aggregating the information at a central location so we don't have to do it
> ourselves?

I technically fire the observer in the child process, but nothing listens to it. The javascript module which handles that observer is only ever started in the parent process. In fact, we don't run JS at all in the GPU process.

This mechanism is how we perform that aggregation. In the content processes, we just transmit the hang information to a central location (the parent process), where the notification is observed and handled.

I can, and perhaps should, stop notifying observers in the content processes. I believe I didn't in the first pass because I had some tests written which listened to the observer notification in the child process.

> If we only notified in the parent process here, and only called
> SendBHRThreadHang for the content and GPU processes, that would make a bit
> more sense to me, but I would still be unclear why the normal telemetry
> aggregation mechanisms (if such things exist) shouldn't be used.

I don't think that for custom pings in the GPU process there is a standard telemetry aggregation mechanism. As an example, in the GPUChild.cpp file, here are a few other basic telemetry accumulators which have been manually implemented:
http://searchfox.org/mozilla-central/rev/09c065976fd4f18d4ad764d7cb4bbc684bf56714/gfx/ipc/GPUChild.cpp#181-193

> 
> @@ +280,5 @@
> >    }
> >  
> >    // Write out the annotation information
> > +  // If we have no annotations, write out a 0-length annotations.
> > +  // XXX: Everything about HangAnnotations is awful.
> 
> Maybe we should fix that?
> 
> This hunk deserves to be moved into part 8, I think.

I do fix this, in a later part. I already explained above why I didn't do the rebase to have it fixed earlier, and I'm really sorry about that mess.

The fact that that hunk is in this patch instead of part 8 is a mistake which I made when rebasing if I had to guess. Sorry about that.

(In reply to Nathan Froyd [:froydnj] from comment #54)
> I feel like this patch should be folded into an earlier patch or is should
> be in the separate bug for extending HangStack?

I can fold this into the patch for part 8 if you would like.

(In reply to Nathan Froyd [:froydnj] from comment #55)

> I would have preferred this rewrite to be much earlier in the series, would
> make many things easier. :)

I'm so sorry about that :-/. When I was writing the earlier parts I was thinking that I would do this in a completely separate bug later, but things just kept getting worse and I just decided to do it.

> This doesn't look right.  ShallowSizeOfExcludingThis already measures the
> size of the allocated buffer where the annotations are stored, so you are
> double-counting the stored annotations.

I misinterpreted the comment on ShallowSizeOfExcludingThis when I originally read it. I'll fix it.

> 
> ::: xpcom/threads/HangAnnotations.cpp
> @@ +99,5 @@
> >      MutexAutoLock lock(mMutex);
> >      for (std::set<Annotator*>::iterator i = mAnnotators.begin(),
> >                                          e = mAnnotators.end();
> >           i != e; ++i) {
> > +      (*i)->AnnotateHang(annotations);
> 
> Maybe this function should be called "RecordAnnotations"?  This line just
> reads peculiarly, because the name makes it sound like you're passing
> `annotations` *in* to the function, I think, but `(*i)` isn't a Hang...so
> confusing.

This is an existing method. I could change it if you'd like, but I was avoiding changing more of this file than I needed to, as I was worried about the size of the patches which I was posting already. :-)


> > +class HangAnnotations : public nsTArray<Annotation>
> 
> I dislike classes like this...but maybe it's not that bad here?

I also dislike classes like this. Looking at it again 3 days later I think I will change the nsTArray<Annotation> to be a member.

> 
> @@ +74,5 @@
> >  void UnregisterAnnotator(Annotator& aAnnotator);
> >  
> >  /**
> >   * Gathers annotations. This function should be called by ChromeHangs.
> >   * @return UniquePtr to HangAnnotations object or nullptr if none.
> 
> This documentation string needs to be updated.  Possibly others as well?

Will update.

(In reply to Nathan Froyd [:froydnj] from comment #56)
> ::: toolkit/components/telemetry/docs/data/bhr-ping.rst
> > +            "duration": <millis>, // duration of the hang in milliseconds.
> 
> Just to clarify: as a number?  Or as a string?

As a number, will clarify.

> > +            "runnable": <name>, // name of the runnable executing during the hang.
> 
> Also for clarification: these are strings?  (The use of <...> in this
> description is inconsistent; in some cases, you use it to denote the JS
> type, and in some cases, you denote a description of the value.  Consistency
> is preferred here; descriptions can go someplace else.)

It's a string - I'll change the <...> to be types, and clarify meaning in the // comment

> Link to the documentation for chromeHangs, maybe?

No idea how to do that, but I'll read some rst docs ^.^

> @@ +51,5 @@
> > +Process Types
> > +-------------
> > +
> > +The ``process`` field contains the type of process that hung. Crashes are
> > +currently sent only for the processes below:
> 
> Maybe:
> 
> The ``process`` field is a string denoting the kind of process that hung. 
> Crashes are currently only sent for the process kind below:
> 
> And then the `Type' column can more clearly show that these things are
> strings, perhaps?

Sure.

> @@ +67,5 @@
> > +Stack Traces
> > +------------
> > +
> > +Each hang object contains a ``stack`` field which has been populated with an
> > +interleaved stack trace of the hung thread.
> 
> What does "interleaved" mean, here and in the description above?

I'll add a description. An interleaved stack is something which is not present yet in this bug, but I am trying to make the documentation forward-compatible with (as it will be present in the next follow-up to this bug, which is bug 1367406). 

Basically an interleaved stack is like a normal C++ native stack, except that JS stack frame and pseudostack entries have also been collected and placed in the correct places in the stack. If you've used the gecko profiler before, this is what the stacks which are used by it are like.

> @@ +70,5 @@
> > +Each hang object contains a ``stack`` field which has been populated with an
> > +interleaved stack trace of the hung thread.
> > +
> > +Note that this field only contains native stack frames, pseudostack and chrome
> > +JS script frames.
> 
> We might not have this field with native stack frames, right?  What happens
> in cases where we can't collect a native stack trace for whatever reason?

This is changing soon once we have the interleaved stacks which I mention in the previous note. Basically once we get the new interleaved stack frame work, what will happen on platforms which don't have native stack tracing support is we will only collect pseudostack and chrome JS stack frame information in the stack, and no native entries or modules will be present.
Flags: needinfo?(michael)
(In reply to Michael Layzell [:mystor] (PTO until July 28) from comment #57)
> Sure, I'll add a time-since-last-ping number to the payload. Perhaps
> `usage_time`?

Sounds good. Ideally the value would only count time that we could be hanging (computer not asleep, etc.) But if that's a bit tricky, then just time since last sent should be fine, as that's what we've been working with.
(In reply to Doug Thayer [:dthayer] from comment #58)
> Sounds good. Ideally the value would only count time that we could be
> hanging (computer not asleep, etc.) But if that's a bit tricky, then just
> time since last sent should be fine, as that's what we've been working with.

BHR internally uses a different sense of time than the rest of firefox - namely it doesn't consider time spent sleeping or waiting for things like OS events.

I could provide a uptime duration measured in this "fake" time if that would be useful? It would have a much higher ratio of hangs to usage hours most likely (as it doesn't include time when the user isn't using firefox).
Flags: needinfo?(dothayer)
(In reply to Michael Layzell [:mystor] from comment #59)
> BHR internally uses a different sense of time than the rest of firefox -
> namely it doesn't consider time spent sleeping or waiting for things like OS
> events.
> 
> I could provide a uptime duration measured in this "fake" time if that would
> be useful? It would have a much higher ratio of hangs to usage hours most
> likely (as it doesn't include time when the user isn't using firefox).

I think that would be good. Maybe including both would be best, but I think the fake time would be more helpful than real time for the purpose of normalization.
Flags: needinfo?(dothayer)
(In reply to Doug Thayer [:dthayer] from comment #60)
> I think that would be good. Maybe including both would be best, but I think
> the fake time would be more helpful than real time for the purpose of
> normalization.

If it's OK with you I would like to do this in a follow-up. I'm looking into how easy this will be to expose to the BHRTelemetryService so I can send it, and it won't be hard but it's enough of a chance I'd like to do it in a separate bug. I filed Bug 1385366 for it.
Attachment #8889953 - Flags: review?(benjamin)
I think in the interest of landing this patch sooner rather than later I'm not going to change HangAnnotations to not inherit from nsTArray, and I'm not going to stop dispatching bhr-thread-hang observer notifications in the content process until a follow-up.
This is the first part of purging the existing telemetry code for
ThreadHangStats from the tree. All of these features will be replaced with new
code for BHR telemetry in the future.

MozReview-Commit-ID: BhD5zY2LwUR
Attachment #8888934 - Attachment is obsolete: true
Attachment #8888935 - Attachment is obsolete: true
Attachment #8888936 - Attachment is obsolete: true
Attachment #8888937 - Attachment is obsolete: true
Attachment #8888938 - Attachment is obsolete: true
Attachment #8888939 - Attachment is obsolete: true
Attachment #8888940 - Attachment is obsolete: true
Attachment #8888941 - Attachment is obsolete: true
Attachment #8888942 - Attachment is obsolete: true
Attachment #8888943 - Attachment is obsolete: true
Attachment #8888944 - Attachment is obsolete: true
Attachment #8888946 - Attachment is obsolete: true
Attachment #8888947 - Attachment is obsolete: true
Attachment #8889493 - Attachment is obsolete: true
Attachment #8889953 - Attachment is obsolete: true
MozReview-Commit-ID: KTOSKobhNAJ
MozReview-Commit-ID: 7TOGofAYM6W
Attachment #8892187 - Flags: review?(nfroyd)
These will be used to implement IPC serialization and deserialization of the
HangDetails object to send over IPC. This is a temporary measure as
HangAnnotations is rewritten in part 11.

MozReview-Commit-ID: 1WHNvhDrMF5
Attachment #8892188 - Flags: review?(nfroyd)
A new test using the new APIs is introduced in a later part. This test no longer
functions as ThreadHangStats is no longer present.

MozReview-Commit-ID: HKmoP2An4GP
We're going to use HangDetails as the type containing hang information. We'll
have a JS component which reads the data out of nsIHangDetails, builds the
payload, and submits it to telemetry for us.

We'll do it in JS because telemetry has to be submitted from JS.

This patch also adds IPC serization for the relevant types so that we can send
HangDetails objects over IPDL.

MozReview-Commit-ID: CeikKabY9Vs
Attachment #8892191 - Flags: review?(nfroyd)
This patch adds the BHRTelemetryService which is a JS implemented XPCOM service
that simply listens to the bhr-thread-hang observer notification, and uses the
data it collects from it to submit telemetry pings.

MozReview-Commit-ID: 2hPXAFmHrm5
Attachment #8892192 - Flags: review?(nfroyd)
BHRTelemetryService only runs in the parent process (and we can only submit
pings from there), so we need to send the data which we collect in the GPU and
Content processes over IPC to the parent process.

MozReview-Commit-ID: 8B5uZKbjNbU
Attachment #8892193 - Flags: review?(nfroyd)
These changes are going to increase the amount of data which we collect from BHR
a lot. It would be dangerous to run it on beta, especially considering how soon
the next merge is.

This should turn it off for 100% of beta users if I understand the logic
correctly.

MozReview-Commit-ID: 3HyEKWdXaqU
HangAnnotations was very complex, required a separate allocation, and used this
unfortunate virtual interface implementation which made it harder to do
interesting things with it (such as serialize it over IPC).

This new implementation is much simpler and more concrete, making
HangAnnotations simply be a nsTArray<Annotation>. This also simplifies some of
the IPC code which was added in part 7.

MozReview-Commit-ID: EzaaxdHpW1t
Attachment #8892195 - Flags: review?(nfroyd)
This patch just adds some tests for the new stuff which we added.

MozReview-Commit-ID: 2jG75AEUJfS
bsmedberg is on PTO, so I'll have to find someone else to do data-review. I'll deal with that tomorrow.

MozReview-Commit-ID: G4hFZcR2EGL
Attachment #8892197 - Flags: review?(nfroyd)
Attachment #8892187 - Flags: review?(nfroyd) → review+
Attachment #8892188 - Flags: review?(nfroyd) → review+
Comment on attachment 8892191 [details] [diff] [review]
Part 7: Add all necessary data for BHR to nsIHangDetails

Review of attachment 8892191 [details] [diff] [review]:
-----------------------------------------------------------------

It looks like you have two HangDetails.cpp files in this patch, which are different from one another?

Canceling review until the patch is straightened out.
Attachment #8892191 - Flags: review?(nfroyd)
Comment on attachment 8892192 [details] [diff] [review]
Part 8: Report bhr-thread-hang hangs to telemetry in a custom bhr ping

Review of attachment 8892192 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the changes below.

::: toolkit/components/backgroundhangmonitor/HangDetails.cpp
@@ +27,5 @@
> +    nsAutoString key;
> +    nsAutoString value;
> +    while (enumerator->Next(key, value)) {
> +      mAnnotations->AddAnnotation(key, value);
> +    }

Please fold these changes into a different patch.

@@ +155,5 @@
> +      jsValue.setString(JS_NewUCStringCopyN(aCx, value.get(), value.Length()));
> +      if (!JS_DefineUCProperty(aCx, jsAnnotation, key.get(), key.Length(),
> +                               jsValue, JSPROP_ENUMERATE)) {
> +        return NS_ERROR_OUT_OF_MEMORY;
> +      }

These too.

::: toolkit/components/backgroundhangmonitor/HangDetails.h
@@ +25,5 @@
>   *
>   * This type is separate, as it can be sent over IPC while nsHangDetails is an
>   * XPCOM interface which is harder to serialize over IPC.
>   */
> +class HangDetails

...and all of these, too.
Attachment #8892192 - Flags: review?(nfroyd) → review+
Comment on attachment 8892192 [details] [diff] [review]
Part 8: Report bhr-thread-hang hangs to telemetry in a custom bhr ping

Review of attachment 8892192 [details] [diff] [review]:
-----------------------------------------------------------------

Oh, actually, are we running this in the parent process only, or are we running it everywhere?  We should change the manifest to reflect that appropriately.
Comment on attachment 8892193 [details] [diff] [review]
Part 9: Transmit BHR Hangs from the Content and GPU process to the parent process

Review of attachment 8892193 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/backgroundhangmonitor/HangDetails.cpp
@@ +198,5 @@
>    nsCOMPtr<nsIRunnable> notifyObservers = NS_NewRunnableFunction("NotifyBHRHangObservers", [hangDetails] {
>      nsCOMPtr<nsIObserverService> os = mozilla::services::GetObserverService();
>      if (os) {
>        os->NotifyObservers(hangDetails, "bhr-thread-hang", nullptr);
>      }

I still think you should move this to GeckoProcessType_Default, unless you still have those tests you mentioned.
Attachment #8892193 - Flags: review?(nfroyd) → review+
Comment on attachment 8892195 [details] [diff] [review]
Part 11: Simplify the HangAnnotations abstraction

Review of attachment 8892195 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with additional error checking.

::: toolkit/components/backgroundhangmonitor/HangDetails.cpp
@@ +126,5 @@
>    }
>  
> +  for (auto& annot : mDetails.mAnnotations) {
> +    JS::RootedValue jsValue(aCx);
> +    jsValue.setString(JS_NewUCStringCopyN(aCx, annot.mValue.get(), annot.mValue.Length()));

I think this needs some more error checking, at least around the return value of JS_NewUCStringCopyN.

::: toolkit/components/telemetry/Telemetry.cpp
@@ +750,2 @@
>          JS::RootedValue jsValue(cx);
> +        jsValue.setString(JS_NewUCStringCopyN(cx, annot.mValue.get(), annot.mValue.Length()));

Likewise with the error checking here.
Attachment #8892195 - Flags: review?(nfroyd) → review+
Comment on attachment 8892197 [details] [diff] [review]
Part 13: Add telemetry documentation for the new ping format, r=bsmedberg

Review of attachment 8892197 [details] [diff] [review]:
-----------------------------------------------------------------

This works for me.  Data review is still necessary, though.

::: toolkit/components/telemetry/docs/data/backgroundhangmonitor-ping.rst
@@ +52,5 @@
> +The ``process`` field is a string denoting the kind of process that hung. Hangs
> +are currently sent only for the processes below:
> +
> ++---------------+---------------------------------------------------+
> +| Type          | Description                                       |

Maybe this first column should be "Kind", for consistency with the description?  (I try to avoid "type" when I'm not talking about programming language types, but sometimes "kind" or "variant" doesn't sound nearly as nice to my ear...)

@@ +66,5 @@
> +------------
> +
> +Each hang object contains a ``stack`` field which has been populated with an
> +interleaved stack trace of the hung thread. An interleaved stack consists of a
> +native backtrace with additional frames interleaved, representing Chrome JS and

Nit: "chrome JS" (it's not capitalized in the below paragraphs and it's not a proper name).
Attachment #8892197 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #78)
> Oh, actually, are we running this in the parent process only, or are we
> running it everywhere?  We should change the manifest to reflect that
> appropriately.

Only in the parent process. I am pretty sure that the profile-after-change category of services is only started in the parent process.

Is there a better way to make that fact clear in the manifest?
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #79)
> I still think you should move this to GeckoProcessType_Default, unless you
> still have those tests you mentioned.

The test added in this patch listens to the notification in the child to make sure we don't shut down the child process before we finish submitting the hang information to the parent process. I can fix this with some message passing probably, but that feels like a follow-up to me.

I also want to move the observer notification to GeckoProcessType_Default.
We're going to use HangDetails as the type containing hang information. We'll
have a JS component which reads the data out of nsIHangDetails, builds the
payload, and submits it to telemetry for us.

We'll do it in JS because telemetry has to be submitted from JS.

This patch also adds IPC serization for the relevant types so that we can send
HangDetails objects over IPDL.

MozReview-Commit-ID: CeikKabY9Vs
Attachment #8892559 - Flags: review?(nfroyd)
Attachment #8892191 - Attachment is obsolete: true
Attachment #8892192 - Attachment is obsolete: true
Attachment #8892193 - Attachment is obsolete: true
Attachment #8892194 - Attachment is obsolete: true
Attachment #8892195 - Attachment is obsolete: true
Attachment #8892196 - Attachment is obsolete: true
Attachment #8892197 - Attachment is obsolete: true
This patch adds the BHRTelemetryService which is a JS implemented XPCOM service
that simply listens to the bhr-thread-hang observer notification, and uses the
data it collects from it to submit telemetry pings.

MozReview-Commit-ID: 2hPXAFmHrm5
BHRTelemetryService only runs in the parent process (and we can only submit
pings from there), so we need to send the data which we collect in the GPU and
Content processes over IPC to the parent process.

MozReview-Commit-ID: 8B5uZKbjNbU
These changes are going to increase the amount of data which we collect from BHR
a lot. It would be dangerous to run it on beta, especially considering how soon
the next merge is.

This should turn it off for 100% of beta users if I understand the logic
correctly.

MozReview-Commit-ID: 3HyEKWdXaqU
HangAnnotations was very complex, required a separate allocation, and used this
unfortunate virtual interface implementation which made it harder to do
interesting things with it (such as serialize it over IPC).

This new implementation is much simpler and more concrete, making
HangAnnotations simply be a nsTArray<Annotation>. This also simplifies some of
the IPC code which was added in part 7.

MozReview-Commit-ID: EzaaxdHpW1t
This patch just adds some tests for the new stuff which we added.

MozReview-Commit-ID: 2jG75AEUJfS
Comment on attachment 8892565 [details] [diff] [review]
Part 13: Add telemetry documentation for the new ping format, r=bsmedberg

data-r?liuche
Attachment #8892565 - Flags: review?(liuche)
(In reply to Michael Layzell [:mystor] from comment #82)
> (In reply to Nathan Froyd [:froydnj] from comment #78)
> > Oh, actually, are we running this in the parent process only, or are we
> > running it everywhere?  We should change the manifest to reflect that
> > appropriately.
> 
> Only in the parent process. I am pretty sure that the profile-after-change
> category of services is only started in the parent process.

Hm, on further review, I think you are correct.  Assuming I actually understand our startup code.

> Is there a better way to make that fact clear in the manifest?

Adding the process=main flag to the appropriate places (the category entry?  maybe others?) would help:

https://dxr.mozilla.org/mozilla-central/search?q=process%3Dmain
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #92)
> Adding the process=main flag to the appropriate places (the category entry? 
> maybe others?) would help:
> 
> https://dxr.mozilla.org/mozilla-central/search?q=process%3Dmain

Awesome, I'll add that process=main flag. (Didn't know that existed :S)
This is important because without it we don't get the service and xpt files bundled in the final firefox build. (oops).

MozReview-Commit-ID: IH56INaSOoK
Attachment #8893423 - Flags: review?(nfroyd)
Comment on attachment 8893423 [details] [diff] [review]
Part 14: Include BHRTelemetryService and xpt files in the package

Review of attachment 8893423 [details] [diff] [review]:
-----------------------------------------------------------------

Whoops?
Attachment #8893423 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #95)
> Comment on attachment 8893423 [details] [diff] [review]
> Part 14: Include BHRTelemetryService and xpt files in the package
> 
> Review of attachment 8893423 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Whoops?

This should've been done when I first added the relevant files - it was whoops because I forgot to add it.
Comment on attachment 8892559 [details] [diff] [review]
Part 7: Add all necessary data for BHR to nsIHangDetails

Review of attachment 8892559 [details] [diff] [review]:
-----------------------------------------------------------------

Can confirm that this patch only has a single copy of all added files.
Attachment #8892559 - Flags: review?(nfroyd) → review+
Comment on attachment 8892565 [details] [diff] [review]
Part 13: Add telemetry documentation for the new ping format, r=bsmedberg

Review of attachment 8892565 [details] [diff] [review]:
-----------------------------------------------------------------

Is this opt-out collection? This is all Type 1 data (technical data), except for one piece of Type 2 data ("UserInteracting"), and this is Nightly only. data r+ from me.
Attachment #8892565 - Flags: review?(liuche) → review+
Comment on attachment 8892565 [details] [diff] [review]
Part 13: Add telemetry documentation for the new ping format, r=bsmedberg

Review of attachment 8892565 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/docs/data/backgroundhangmonitor-ping.rst
@@ +4,5 @@
> +
> +Whenever a thread monitored by the Background Hang Monitor hangs, a stack and
> +some non-identifying information about the hang is captured. When 50 of these
> +hangs are collected across all processes, or the browser exits, this ping is
> +transmitted with the collected hang information.

On e.g. Mac, users may not exit their browser for a long time.
This will affect the data latency for people who see few (<50) hangs daily.
Maybe make sure that the ping is sent at least daily if there were any hangs?
Pushed by michael@thelayzells.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fe3893891ade
Part 1: Remove getChildThreadHangs, r=chutten
https://hg.mozilla.org/integration/mozilla-inbound/rev/b1fdf3c9f074
Part 2: Remove ThreadHangStats from about:telemetry, r=chutten
https://hg.mozilla.org/integration/mozilla-inbound/rev/bd63a8fecf00
Part 3: Remove the ThreadHangStats object and related code, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/68bec6464338
Part 4: Move BHR into its own component, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/2811e4973556
Part 5: Add some helper methods to HangAnnotations, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/d77a13ef7a49
Part 6: Remove test_ThreadHangStats.js, r=chutten
https://hg.mozilla.org/integration/mozilla-inbound/rev/6bc453505dc3
Part 7: Add all necessary data for BHR to nsIHangDetails, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/8d0b5e469b69
Part 8: Report bhr-thread-hang hangs to telemetry in a custom bhr ping, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/0b6c4c95323e
Part 9: Transmit BHR Hangs from the Content and GPU process to the parent process, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/d34e59a86079
Part 10: Stop running BHR on beta, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/9158ca4292bb
Part 11: Simplify the HangAnnotations abstraction, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/cc4253078c54
Part 12: Add a test for BHR observer notifications, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/314ddbc9d70f
Part 13: Add telemetry documentation for the new ping format, r=bsmedberg, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/6bca45190947
Part 14: Include BHRTelemetryService and xpt files in the package, r=froydnj
(In reply to Georg Fritzsche [:gfritzsche] from comment #99)
> Comment on attachment 8892565 [details] [diff] [review]
> Part 13: Add telemetry documentation for the new ping format, r=bsmedberg
> 
> Review of attachment 8892565 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/docs/data/backgroundhangmonitor-ping.rst
> @@ +4,5 @@
> > +
> > +Whenever a thread monitored by the Background Hang Monitor hangs, a stack and
> > +some non-identifying information about the hang is captured. When 50 of these
> > +hangs are collected across all processes, or the browser exits, this ping is
> > +transmitted with the collected hang information.
> 
> On e.g. Mac, users may not exit their browser for a long time.
> This will affect the data latency for people who see few (<50) hangs daily.
> Maybe make sure that the ping is sent at least daily if there were any hangs?

I'll do this in a follow up: bug 1389235
dthayer - if this sticks on inbound (fingers crossed) then we'll be sending the new ping in tomorrow's nightly.
Flags: needinfo?(dothayer)
Pushed by michael@thelayzells.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b69867d87c50
Part 15: Fix typo in android package-manifest.in, a=bustage
Backed out bug 1365309, bug 1380081 and bug 1386369 for almost perma-timeout in many mochitest chunks on Windows 7 opt and pgo:

Bug 1365309
https://hg.mozilla.org/integration/mozilla-inbound/rev/7fefe0f46c4648f92d75e02a5c5bf8b20a6f799d
https://hg.mozilla.org/integration/mozilla-inbound/rev/d5bdc2c7ca34dcb6a51365aee8aef45cd76854c7
https://hg.mozilla.org/integration/mozilla-inbound/rev/d7a7569e436a330711c0cd8b622c399b14fb3190

Bug 1380081
https://hg.mozilla.org/integration/mozilla-inbound/rev/2145dba678d8f69a35149a120908445e55f02587
https://hg.mozilla.org/integration/mozilla-inbound/rev/e6d57bb2b33e6b37585b44e0bfd3f76d45f808d2
https://hg.mozilla.org/integration/mozilla-inbound/rev/f6b47124336fba85e8ba753ae607c2347b5665e6
https://hg.mozilla.org/integration/mozilla-inbound/rev/36c4ddeb51a14c5e4046376dd22501fa22d9889f
https://hg.mozilla.org/integration/mozilla-inbound/rev/718448f617c45649e07e44c9d5d6cd95cf2d72f3
https://hg.mozilla.org/integration/mozilla-inbound/rev/ef7efc0cd92a05231760aff15c546e97cfd0925f
https://hg.mozilla.org/integration/mozilla-inbound/rev/fb0774dc70c54f12cc63cbcc4197d922b5a0b832
https://hg.mozilla.org/integration/mozilla-inbound/rev/6ac0bd6aeda33ab083cca86d36508c8b3d1231b7
https://hg.mozilla.org/integration/mozilla-inbound/rev/f00f5f041f65f3001b6b0ebf23fd7f963baa6fdd
https://hg.mozilla.org/integration/mozilla-inbound/rev/750dc31d7ee77a328e1fbde3b31d262a8ea33a93
https://hg.mozilla.org/integration/mozilla-inbound/rev/dd34bd64de6a02be6860e14c23f90c8613e72c56
https://hg.mozilla.org/integration/mozilla-inbound/rev/fcbcb7b65bf2fb6532d0f1bfc7b421f00239155a
https://hg.mozilla.org/integration/mozilla-inbound/rev/514f6411123722c30a773043ce2bef22486b4982
https://hg.mozilla.org/integration/mozilla-inbound/rev/8c07f2793aa9eb16d6d51db1b0e5a6cb00d54856

Bug 1386369
https://hg.mozilla.org/integration/mozilla-inbound/rev/5cd587ec53be835443367007eeb79bdac0ef4c54
https://hg.mozilla.org/integration/mozilla-inbound/rev/c621494dc3f5f5cfd6e9a8e62f19396dbc9b8990
https://hg.mozilla.org/integration/mozilla-inbound/rev/43cb8075d066fba493779e35922ce83c77bdd70d

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=272d381012677aedc86462ebd8e45b0b30884655&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Flags: needinfo?(michael)
Comment on attachment 8892186 [details] [diff] [review]
Part 3: Remove the ThreadHangStats object and related code

Review of attachment 8892186 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/docs/data/main-ping.rst
@@ -62,5 @@
>        // The following properties may all be null if we fail to collect them.
>        histograms: {...},
>        keyedHistograms: {...},
>        chromeHangs: {...},
> -      threadHangStats: [...],

We should not remove this, instead calling out that it is only present until Firefox X.

@@ -248,5 @@
>  
>  As of Firefox 48, this section does not contain empty keyed histograms anymore.
>  
> -threadHangStats
> ----------------

We should not remove this, this still documents the data in pings before Firefox X.
Instead clearly call out that this was removed in Firefox X.
I think this is the same / a similar problem to bug 1364068 comment 36. I'm writing a patch which should disable calling into telemetry to submit the BHR pings when running tests to avoid that issue.

If it works, I'll re-land this patch.

:gfritzsche, I'll restore the main-ping.rst text and just mention that the fields are not present in firefox 57.
Flags: needinfo?(michael)
Comment on attachment 8897525 [details] [diff] [review]
Part 15: Don't try to submit bhr ping to telemetry while running tests

Review of attachment 8897525 [details] [diff] [review]:
-----------------------------------------------------------------

Seems to be following in the footsteps of the new profile ping. LGTM
Attachment #8897525 - Flags: review?(chutten) → review+
Pushed by michael@thelayzells.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a806fdd3092f
Part 1: Remove getChildThreadHangs, r=chutten
https://hg.mozilla.org/integration/mozilla-inbound/rev/ef5984e08bc2
Part 2: Remove ThreadHangStats from about:telemetry, r=chutten
https://hg.mozilla.org/integration/mozilla-inbound/rev/3ae61f0ba108
Part 3: Remove the ThreadHangStats object and related code, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/a2d2a1bedc3c
Part 4: Move BHR into its own component, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/3afb52d2db56
Part 5: Add some helper methods to HangAnnotations, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/a6159f804417
Part 6: Remove test_ThreadHangStats.js, r=chutten
https://hg.mozilla.org/integration/mozilla-inbound/rev/f35ba2fdb518
Part 7: Add all necessary data for BHR to nsIHangDetails, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/c93121387471
Part 8: Report bhr-thread-hang hangs to telemetry in a custom bhr ping, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/07452b30b4dc
Part 9: Transmit BHR Hangs from the Content and GPU process to the parent process, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/d3910f43605c
Part 10: Stop running BHR on beta, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/5eb62a558dc1
Part 11: Simplify the HangAnnotations abstraction, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/f86e7c4d0ad9
Part 12: Add a test for BHR observer notifications, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/aa9e74a07733
Part 13: Add telemetry documentation for the new ping format, r=bsmedberg, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/3572c91f1ece
Part 14: Include BHRTelemetryService and xpt files in the package, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/2ff0b6602d76
Part 15: Don't try to submit bhr ping to telemetry while running tests, r=chutten
One of the patches from bug 1389165 did this:

>--- a/toolkit/content/aboutTelemetry.js
>+++ b/toolkit/content/aboutTelemetry.js
>@@ -1162,16 +1162,17 @@ var ThreadHangStats = {
>    * Creates and fills data corresponding to a thread
>    */
>   renderThread(aThread) {
>     let div = document.createElement("div");
> 
>     let title = document.createElement("h2");
>     title.textContent = aThread.name;
>     div.appendChild(title);
>+    div.id = title;
> 
>     // Don't localize the histogram name, because the
>     // name is also used as the div element's ID
>     Histogram.render(div, aThread.name + "-Activity",
>                      aThread.activity, {exponential: true}, true);
>     aThread.hangs.forEach((hang, index) => {
>       let hangName = aThread.name + "-Hang-" + (index + 1);
>       let hangDiv = Histogram.render(

Part 2 from this bug removed that whole block of code. When I merged 1389165 around, I left things with that block of code deleted, but I never added that `div.id = title` anywhere else. Does that need to go anywhere else?
Flags: needinfo?(michael)
Flags: needinfo?(flyinggrub)
No, that page in about:telemetry has been completely removed in that patch and that line of code should no longer be necessary.
Flags: needinfo?(michael)
if you removed completely this section no need to add it somewhere else :).
Flags: needinfo?(flyinggrub)
The aggregate of this is visible now on https://squarewave.github.io
Flags: needinfo?(dothayer)
You need to log in before you can comment on or make changes to this bug.