Closed Bug 1225851 Opened 4 years ago Closed 3 years ago

Implementation of capturing keyed stacks in Telemetry

Categories

(Toolkit :: Telemetry, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla53
Tracking Status
firefox45 --- affected
firefox53 --- fixed

People

(Reporter: yarik.sheptykin, Assigned: yarik.sheptykin, Mentored)

References

(Blocks 4 open bugs)

Details

(Whiteboard: [measurement:client] [lang=c++])

Attachments

(1 file, 4 obsolete files)

Bugs 1229129 and 1209131 discuss provide a user story and a concept for capturing stacks with Telemetry. The concept defines that Telemetry should provide a method for capturing a call stack on demand and store captured stacks together with a user defined key. This bug tacks the implementation for these aspects. Implementation for the display of the captured stacks, server-side symbolication, and integration with telemetry pings will be tracked in separate bugs.

API for capturing stacks has the following form:
Telemetry::captureStack("string-key")
Where "string-key" is a user-defined (dynamic) string under which stacks will be agregated.

Data format is a follows:
"capturedStacks": {
    "memoryMap": [
    ["wgdi32.pdb", "08A541B5942242BDB4AEABD8C87E4CFF2"],
    ["igd10iumd32.pdb", "D36DEBF2E78149B5BE1856B772F1C3991"],
    ... other entries in the format ["module name", "breakpad identifier"] ...
    ],
    "stacks": [
    [
          [
            0, // the module index or -1 for invalid module indices
            190649 // the offset of this program counter in its module or an absolute pc
          ],
          [1, 2540075],
          ... other frames ...
    ],
         ... other stacks ...
    ],
    "captures": [["string-key", stack-index, count], ... ]
}

Relevant requirements:
    The time-cost of capturing stacks should be made clear to API users.
    Minimize jank.
    A stack is captured once per session per key. Consequent requests to capture a stack for the key increment counter without re-capturing the stack.
    Ensure thread-safety while unwinding stacks.

The concept for the implementation is also summarized [here](https://public.etherpad-mozilla.org/p/callstack-capture)
From talking about this, it's probably best to go for a rough prototype here first to get it working.
Once we have that, we can still worry about a clean implementation and style issues.
Status: NEW → ASSIGNED
Hi Georg! I have started with the implementation. A quick architectural question: I noticed that Telemetry.cpp is a pretty complex piece of code. I am afraid that adding more functionality will increase its complexity even more. Do you think I should anyways add the code into Telemetry.cpp? or should I better place it into a new file, say "TelemetryStackCapturer.cpp"? Is there any strong benefit from keeping everything in one file?
Flags: needinfo?(gfritzsche)
For a prototype it could be fine to add this to Telemetry.cpp
For the final solution we should have more complex code in a separate module and only simpler functions (interfacing etc.) in Telemetry.cpp
Flags: needinfo?(gfritzsche)
Bug 1225851: Capturing keyed call stacks on demand inside Telemetry. r?gfritzsche
Attachment #8690070 - Flags: review?(gfritzsche)
Comment on attachment 8690070 [details]
MozReview Request: Bug 1225851: Capturing keyed call stacks on demand inside Telemetry. r?gfritzsche

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/25757/diff/1-2/
This is a raw WIP yet, but it is capable of capturing call stacks and should show in what direction I am moving.
Great, that was quick!
I won't have time today, but i will have a look here on monday.
Comment on attachment 8690070 [details]
MozReview Request: Bug 1225851: Capturing keyed call stacks on demand inside Telemetry. r?gfritzsche

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/25757/diff/2-3/
Comment on attachment 8690070 [details]
MozReview Request: Bug 1225851: Capturing keyed call stacks on demand inside Telemetry. r?gfritzsche

https://reviewboard.mozilla.org/r/25757/#review23369

Thanks, this looks like a good start!
I've only given it a rough high-level look, but it looks sane.
We will probably want to factor things out of Telemetry.cpp, but maybe we don't need to do this right now and can move it to a follow-up bug - it doesn't look like the added code will be very complex or long.

::: toolkit/components/telemetry/Telemetry.h:275
(Diff revision 3)
> + * Record the main thread's call stack on demand.

We're limiting this to only record the first stack per key, right?
We should document that here and point out that counts per key are recorded.

::: toolkit/components/telemetry/Telemetry.cpp:74
(Diff revision 3)
> +#include "mozilla/StackWalk.h"

Should this be limited to MOZ_ENABLE_PROFILER_SPS?

::: toolkit/components/telemetry/Telemetry.cpp:423
(Diff revision 3)
> +      mCount = aCount;
> +      mIndex = aIndex;

In C++ you'd use initializer lists:
```
StackFrequencyInfo(uint32_t aCount, uint32_t aIndex)
  : mCount(aCount)
  , mIndex(aIndex)
{
  ...
```

::: toolkit/components/telemetry/Telemetry.cpp:435
(Diff revision 3)
> +  static void
> +  RecordStackWalker(uint32_t aFrameNumber, void* aPC, void* aSP, void* aClosure);

This doesn't need to be public.

::: toolkit/components/telemetry/Telemetry.cpp:812
(Diff revision 3)
> +  static bool CaptureStack(const nsCString& aKey);

Do we need this inside MOZ_ENABLE_PROFILER_SPS?

::: toolkit/components/telemetry/Telemetry.cpp:2805
(Diff revision 3)
> +    const char* key;
> +    NS_CStringGetData(iter.Key(), &key);
> +    JS::Rooted<JSString*> str(cx, JS_NewStringCopyZ(cx, key));

Just `str(cx, JS_NewStringCopyZ(iter.Key().c_str()))` or similar?

::: toolkit/components/telemetry/tests/gtest/TestStackCapture.cpp:12
(Diff revision 3)
> +TEST(TestStackCapture, CaptureStack) {

I think we don't need this, it will be hard to set it up for proper coverage.
The xpcshell will already give us the coverage we need.

::: toolkit/components/telemetry/tests/unit/test_TelemetryCaptureStack.js:8
(Diff revision 3)
> +  dump(JSON.stringify(stacks));

If we add a `captureStack(key)` function to nsITelemetry, then we can do proper testing here:
* capture with different known keys
* capture with multiple counts
* test that the captured data shows up properly in pings
* ...
Attachment #8690070 - Flags: review?(gfritzsche)
https://reviewboard.mozilla.org/r/25757/#review23369

> Just `str(cx, JS_NewStringCopyZ(iter.Key().c_str()))` or similar?

I couild not find c_str equivalent in `nsACString` class.
```
error: ‘const class nsACString_internal’ has no member named ‘c_str’
```
The method I used is from https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/String_functions/NS_CStringGetData
Comment on attachment 8690070 [details]
MozReview Request: Bug 1225851: Capturing keyed call stacks on demand inside Telemetry. r?gfritzsche

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/25757/diff/3-4/
Attachment #8690070 - Flags: review?(gfritzsche)
Comment on attachment 8690070 [details]
MozReview Request: Bug 1225851: Capturing keyed call stacks on demand inside Telemetry. r?gfritzsche

https://reviewboard.mozilla.org/r/25757/#review24937

Thanks for the patience, this generally looks pretty good!
Did you have a chance to test whether the symbolication works for this?

::: toolkit/components/telemetry/Telemetry.cpp:424
(Diff revision 4)
> +    int32_t mCount;
> +    // Index of the stack inside stacks array.
> +    int32_t mIndex;

Presumably those should be `size_t`?

::: toolkit/components/telemetry/Telemetry.cpp:439
(Diff revision 4)
> +  const nsClassHashtable<nsCStringHashKey, StackFrequencyInfo>& getKeyedInfo() const;

C++ coding style is StartingWithUpperCase.
Lets add a typedef for the `nsClassHashTable` instead of repeating that long typedef in multiple places.

::: toolkit/components/telemetry/Telemetry.cpp:456
(Diff revision 4)
> +KeyedStackCapturer::RecordStackWalker(

We could make this a lambda below, i think that makes it easier to read (code is closer to actual usage) and keeps it local to that function.

::: toolkit/components/telemetry/Telemetry.cpp:478
(Diff revision 4)
> +  MozStackWalk(KeyedStackCapturer::RecordStackWalker, /* skipFrames */ 0,

Suggesting something like:
```
auto callback = [](uint32_t, void* aPC, void*, void*) {
  ...
});
MozStackWalk(callback, ...
```

::: toolkit/components/telemetry/Telemetry.cpp:667
(Diff revision 4)
> - 
> +

Can we put the various cleanups in a separate patch?
That would reduce the noise a bit here.

If that is painful to do now, its no problem to leave them... But would be nice to keep it separate in future bigger patches.

::: toolkit/components/telemetry/Telemetry.cpp:2776
(Diff revision 4)
> +#if defined(MOZ_ENABLE_PROFILER_SPS)

This would be easier to follow as:
```
#if defined(MOZ_ENABLE_PROFILER_SPS)
  return mStackCapturer.ReflectCapturedStacks();
#else
  return NS_OK;
#endif

::: toolkit/components/telemetry/Telemetry.cpp:2812
(Diff revision 4)
> +    JS::Rooted<JSString*> str(cx, JS_NewStringCopyZ(cx, key));

You should be able to just use:
```JS::RootedString str(cx, JS_NewStringCopyZ(cx, iter.Key().get()));```

::: toolkit/components/telemetry/Telemetry.cpp:3790
(Diff revision 4)
> +TelemetryImpl::CaptureStackStaticHelper(const nsACString& aKey) {
> +  if (!sTelemetry || !sTelemetry->mCanRecordExtended)
> +    return;
> +  sTelemetry->CaptureStack(aKey);

What do we need this function for?
Can't this code just live in `CaptureStack()`?

::: toolkit/components/telemetry/tests/unit/test_TelemetryCaptureStack.js:30
(Diff revision 4)
> +const checkObjectStructure = function (obj) {

Why not just `function checkObjectStructure()`?

::: toolkit/components/telemetry/tests/unit/test_TelemetryCaptureStack.js:35
(Diff revision 4)
> +    properties.reduce((prev, curr) => prev && curr in obj, true) &&
> +    // are all properties arrays?
> +    properties.reduce((prev, curr) => prev && Array.isArray(obj[curr]), true);

Just use Array.prototype.every() or even explicit loops here (as preferred).

::: toolkit/components/telemetry/tests/unit/test_TelemetryCaptureStack.js:49
(Diff revision 4)
> +const captureStacks = function (key, checkStructure) {

Ditto, `function captureStacks...`.

::: toolkit/components/telemetry/tests/unit/test_TelemetryCaptureStack.js:50
(Diff revision 4)
> +  checkStructure = typeof checkStructure === "undefined" ? true : checkStructure;

This looks unused? Just drop the `checkStructure` argument.

::: toolkit/components/telemetry/tests/unit/test_TelemetryCaptureStack.js:70
(Diff revision 4)
> +  Assert.deepEqual(["stacks", "captures"].map(p => stacks[p].length), [1, 1]);

Lets just do an old fashioned `Assert.deepEqual(stacks.stacks.length, 1)` etc.
That way a test failure actually points us to what exactly failed.
Ditto below.

::: toolkit/components/telemetry/tests/unit/test_TelemetryCaptureStack.js:72
(Diff revision 4)
> +  Assert.deepEqual(stacks.captures[0], [TEST_STACK_KEYS[0], 0, 1]);
> +  Assert.ok(stacks.memoryMap.length, 0);

We can't rely on this being at index 0, there could be platform code triggering stack captures before this test-code runs.
Ditto for memoryMap and below.
Attachment #8690070 - Flags: review?(gfritzsche)
Priority: -- → P3
Whiteboard: [measurement:client]
https://reviewboard.mozilla.org/r/25757/#review24937

> Presumably those should be `size_t`?

Making these size_t seems like a good idea, but when using size_t I get:

```
error: call of overloaded ‘JS_DefineElement(JSContext*&, JS::Rooted<JSObject*>&, int, const size_t&, int)’ is ambiguous
```

when trying ```JS_DefineElement(cx, infoArray, 1, info->mIndex, JSPROP_ENUMERATE)``` in line 2875.

These are the candidates:

```
JS_DefineElement(JSContext* cx, JS::HandleObject obj, uint32_t index, int32_t value,
JS_DefineElement(JSContext* cx, JS::HandleObject obj, uint32_t index, uint32_t value,
JS_DefineElement(JSContext* cx, JS::HandleObject obj, uint32_t index, double value,
```

Should I do explicit type casting here?

> This would be easier to follow as:
> ```
> #if defined(MOZ_ENABLE_PROFILER_SPS)
>   return mStackCapturer.ReflectCapturedStacks();
> #else
>   return NS_OK;
> #endif

Should I move the CaptureStack() implemetation inside the StackCapturer?

> What do we need this function for?
> Can't this code just live in `CaptureStack()`?

I have difficulties moving this code into CaptureStack because sTelemetry seems to be a [privat static variable](https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Telemetry.cpp#736,794). I cannot access it from outside the scope of TelemetryImpl. Maybe you know a convenient workaround?
Hi Georg! Sorry for taking this long for an update. I have been working on your feedback and got most of it addressed. I have some difficulties though. Please take a look at comment 13.
Flags: needinfo?(gfritzsche)
https://reviewboard.mozilla.org/r/25757/#review24937

> Making these size_t seems like a good idea, but when using size_t I get:
> 
> ```
> error: call of overloaded ‘JS_DefineElement(JSContext*&, JS::Rooted<JSObject*>&, int, const size_t&, int)’ is ambiguous
> ```
> 
> when trying ```JS_DefineElement(cx, infoArray, 1, info->mIndex, JSPROP_ENUMERATE)``` in line 2875.
> 
> These are the candidates:
> 
> ```
> JS_DefineElement(JSContext* cx, JS::HandleObject obj, uint32_t index, int32_t value,
> JS_DefineElement(JSContext* cx, JS::HandleObject obj, uint32_t index, uint32_t value,
> JS_DefineElement(JSContext* cx, JS::HandleObject obj, uint32_t index, double value,
> ```
> 
> Should I do explicit type casting here?

`uint32_t` seems fine here.

> Should I move the CaptureStack() implemetation inside the StackCapturer?

I'm not sure what you mean here - the reflection code (serializing to JS) would live better inside the StackCapturer, yes.
We should not need to leak the details used there outside of the StackCapturer.

> I have difficulties moving this code into CaptureStack because sTelemetry seems to be a [privat static variable](https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Telemetry.cpp#736,794). I cannot access it from outside the scope of TelemetryImpl. Maybe you know a convenient workaround?

Alright, but we don't need two functions here.
Let's just call this function `TelemetryImpl::CaptureStack()` and make it call the `Capture()` function directly (similar to `TelemetryImpl::GetHistogramEnumId()`).
Flags: needinfo?(gfritzsche)
https://reviewboard.mozilla.org/r/25757/#review24937

> Alright, but we don't need two functions here.
> Let's just call this function `TelemetryImpl::CaptureStack()` and make it call the `Capture()` function directly (similar to `TelemetryImpl::GetHistogramEnumId()`).

Hi Georg! I tried renaming CaptureStackStaticHelper to CaptureStack. I get a name conflict with member function CaptureStack which I implement from nsITelemetry. Name overloading does not work in this case. How about using DoCaptureStack instead of CaptureStackStaticHelper? Alternative would be renaming captureStack in nsITelemetry.
See comment 16
Flags: needinfo?(gfritzsche)
https://reviewboard.mozilla.org/r/25757/#review24937

> Hi Georg! I tried renaming CaptureStackStaticHelper to CaptureStack. I get a name conflict with member function CaptureStack which I implement from nsITelemetry. Name overloading does not work in this case. How about using DoCaptureStack instead of CaptureStackStaticHelper? Alternative would be renaming captureStack in nsITelemetry.

Ok, looking more at the context, how about:
* we call this e.g. `TelemetryImpl::DoCaptureStack()` (adding `Static` seems redundant)
* let it keep calling `sTelemetry->CaptureStack()`
* move the `mCanRecordExtended` check to `TelemetryImpl::CaptureStack()` ... currently that function doesn't check `mCanRecordExtended` at all
Flags: needinfo?(gfritzsche)
Attachment #8690070 - Flags: review?(gfritzsche)
Comment on attachment 8690070 [details]
MozReview Request: Bug 1225851: Capturing keyed call stacks on demand inside Telemetry. r?gfritzsche

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/25757/diff/4-5/
Comment on attachment 8690070 [details]
MozReview Request: Bug 1225851: Capturing keyed call stacks on demand inside Telemetry. r?gfritzsche

https://reviewboard.mozilla.org/r/25757/#review40311

We are coming back to this after a longer while, so i went over this fully again.
High-level this looks solid to me, so i went for a detailed review. The comments below are mostly about smaller issues and style, nothing major.

More general notes:

* We should definitely check that this builds and passes tests both with and without `MOZ_ENABLE_PROFILER_SPS` defined.
* We should also update the documentation in `toolkit/components/docs/main-ping.rst`, but lets take that to a separate patch to limit scope of this one

To help with scope, we can also move the `about:telemetry` change to a separate patch on this bug (so we can wrap up the main implementation here).
To test the symbolication in `about:telemetry`, we could e.g. take some chrome hangs data from a Nightly build and put that into `payload.capturedStacks`, just to see that the rendering code works.

::: toolkit/components/telemetry/Telemetry.h:285
(Diff revision 5)
>                        int32_t aFirefoxUptime,
>                        mozilla::UniquePtr<mozilla::HangMonitor::HangAnnotations>
>                                aAnnotations);
> +
> +/**
> + * Record the main thread's call stack on demand. Note that, the stack is

We record the current, not the main, thread.

::: toolkit/components/telemetry/Telemetry.cpp:414
(Diff revision 5)
> + * Allows taking a snapshot of a call stack on demand. Captured stacks are
> + * indexed by a string key in a hash table. The stack is only captured Once
> + * for each key. Consequent captures with the same key result in incrementing
> + * capture counter without re-capturing the stack.

Nits:
* "only captured once" (casing)
* "Subsequent captures with ..."
* "incrementing the capture counter"

::: toolkit/components/telemetry/Telemetry.cpp:420
(Diff revision 5)
> + * indexed by a string key in a hash table. The stack is only captured Once
> + * for each key. Consequent captures with the same key result in incrementing
> + * capture counter without re-capturing the stack.
> + */
> +class KeyedStackCapturer {
> +public:

I think the only parts that need to be public are the constructor, `Capture()` and `ReflectCapturedStacks()`. The other functions and types should be private.

::: toolkit/components/telemetry/Telemetry.cpp:436
(Diff revision 5)
> +      : mCount(aCount)
> +      , mIndex(aIndex)
> +    {}
> +  };
> +
> +  typedef nsClassHashtable<nsCStringHashKey, StackFrequencyInfo> FrequencyInfoHash;

Let use `FrequencyInfoMapType`

::: toolkit/components/telemetry/Telemetry.cpp:441
(Diff revision 5)
> +  const CombinedStacks& GetStacks() const;
> +  const FrequencyInfoHash& GetKeyedInfo() const;
> +  Mutex& GetMutex();

These seem to not be used anymore, lets drop them.

::: toolkit/components/telemetry/Telemetry.cpp:450
(Diff revision 5)
> +  static void
> +  RecordStackWalker(uint32_t aFrameNumber, void* aPC, void* aSP, void* aClosure);

This doesn't seem to be used anywhere, lets drop it.

::: toolkit/components/telemetry/Telemetry.cpp:453
(Diff revision 5)
> +   * This is a helper function called by MozStackWalk for every stack frame.
> +   */
> +  static void
> +  RecordStackWalker(uint32_t aFrameNumber, void* aPC, void* aSP, void* aClosure);
> +
> +  FrequencyInfoHash mKeyedInfo;

Nit: This could be named a bit more descriptive - `mStackInfos`?

::: toolkit/components/telemetry/Telemetry.cpp:458
(Diff revision 5)
> +KeyedStackCapturer::KeyedStackCapturer():
> +  mStackCapturerMutex("Telemetry::StackCapturerMutex")

Nit: Lets keep to our common style here:
```
KeyedStackCapturer::KeyedStackCapturer()
  : mStackCapturerMutex(...)
```

::: toolkit/components/telemetry/Telemetry.cpp:466
(Diff revision 5)
> +
> +void KeyedStackCapturer::Capture(const nsACString& aKey) {
> +  // Trying to find and update the stack information.
> +  StackFrequencyInfo* info = mKeyedInfo.Get(aKey);
> +  if (info) {
> +    // Only increment the counter if the key exists.

`// We already recorded this stack before, only increase the count.`

::: toolkit/components/telemetry/Telemetry.cpp:471
(Diff revision 5)
> +    // Only increment the counter if the key exists.
> +    info->mCount++;
> +    return;
> +  }
> +
> +  std::vector<uintptr_t> rawStack;

A comment here would be helpful, e.g.:
// We haven't captured a stack for this key before, do it now.
// Note that this does a stackwalk and is an expensive operation.

::: toolkit/components/telemetry/Telemetry.cpp:478
(Diff revision 5)
> +    std::vector<uintptr_t>* stack =
> +      static_cast<std::vector<uintptr_t>*>(aClosure);
> +    stack->push_back(reinterpret_cast<uintptr_t>(aPC));
> +  };
> +  MozStackWalk(callback, /* skipFrames */ 0,
> +              /* maxFrames */ 0, reinterpret_cast<void*>(&rawStack), 0, nullptr);

The second-to-last parameter is a `uintptr_t`, so we should pass `nullptr` instead of `0` (which means collecting the stack for the current thread).

::: toolkit/components/telemetry/Telemetry.cpp:480
(Diff revision 5)
> +    stack->push_back(reinterpret_cast<uintptr_t>(aPC));
> +  };
> +  MozStackWalk(callback, /* skipFrames */ 0,
> +              /* maxFrames */ 0, reinterpret_cast<void*>(&rawStack), 0, nullptr);
> +  Telemetry::ProcessedStack stack = Telemetry::GetStackAndModules(rawStack);
> +  MutexAutoLock captureStackMutex(mStackCapturerMutex);

This should go clearly with the next section, i.e. just move it to after the comment.

::: toolkit/components/telemetry/Telemetry.cpp:482
(Diff revision 5)
> +  MozStackWalk(callback, /* skipFrames */ 0,
> +              /* maxFrames */ 0, reinterpret_cast<void*>(&rawStack), 0, nullptr);
> +  Telemetry::ProcessedStack stack = Telemetry::GetStackAndModules(rawStack);
> +  MutexAutoLock captureStackMutex(mStackCapturerMutex);
> +
> +  // Append the new stack to the stack's circular queue.

I'm not quite sure what this comment is saying.
Maybe just `// Store the new stack info.` ?

::: toolkit/components/telemetry/Telemetry.cpp:504
(Diff revision 5)
> +Mutex&
> +KeyedStackCapturer::GetMutex()
> +{
> +  return mStackCapturerMutex;
> +}
> +#endif

This needs to move down to after the following method.

::: toolkit/components/telemetry/Telemetry.cpp:511
(Diff revision 5)
> +NS_IMETHODIMP
> +KeyedStackCapturer::ReflectCapturedStacks(JSContext *cx, JS::MutableHandle<JS::Value> ret)
> +{
> +  MutexAutoLock capturedStackMutex(mStackCapturerMutex);
> +
> +  JS::Rooted<JSObject*> fullReportObj(cx, CreateJSStackObject(cx, mStacks));

Let's add a comment here that this adds the `memoryMap` and `stacks` properties.

Nit: Lets use `JS::RootedObject` and `JS::RootedString` in this function instead.

::: toolkit/components/telemetry/Telemetry.cpp:516
(Diff revision 5)
> +  JS::Rooted<JSObject*> fullReportObj(cx, CreateJSStackObject(cx, mStacks));
> +  if (!fullReportObj) {
> +    return NS_ERROR_FAILURE;
> +  }
> +
> +  ret.setObject(*fullReportObj);

Let's only do this at the end of this function so we don't return incomplete objects.

::: toolkit/components/telemetry/Telemetry.cpp:531
(Diff revision 5)
> +    return NS_ERROR_FAILURE;
> +  }
> +
> +  size_t keyIndex = 0;
> +  for (auto iter = mKeyedInfo.ConstIter(); !iter.Done(); iter.Next()) {
> +    const KeyedStackCapturer::StackFrequencyInfo* info = iter.Data();

This is a member function, you can just use `StackFrequencyInfo`.

::: toolkit/components/telemetry/Telemetry.cpp:542
(Diff revision 5)
> +    if (!JS_DefineElement(cx, infoArray, 0, str, JSPROP_ENUMERATE)) {
> +      return NS_ERROR_FAILURE;
> +    }
> +    if (!JS_DefineElement(cx, infoArray, 1, info->mIndex, JSPROP_ENUMERATE)) {
> +      return NS_ERROR_FAILURE;
> +    }
> +    if (!JS_DefineElement(cx, infoArray, 2, info->mCount, JSPROP_ENUMERATE)) {
> +      return NS_ERROR_FAILURE;
> +    }
> +
> +    if (!JS_DefineElement(cx, keysArray, keyIndex, infoArray, JSPROP_ENUMERATE)) {
> +      return NS_ERROR_FAILURE;
> +    }

Nit: We could pull them together:
```
if (!JS_DefineElement(...) ||
    !JS_DefineElement(...) ||
    ...) {
  return NS_ERROR_FAILURE;
}

::: toolkit/components/telemetry/Telemetry.cpp:556
(Diff revision 5)
> +
> +    if (!JS_DefineElement(cx, keysArray, keyIndex, infoArray, JSPROP_ENUMERATE)) {
> +      return NS_ERROR_FAILURE;
> +    }
> +
> +    keyIndex++;

Lets handle that in the loop header (where it's more visible):
`for (...; iter.Next(), ++keyIndex) { ...`

::: toolkit/components/telemetry/TelemetrySession.jsm:1200
(Diff revision 5)
>      };
>  
>      // Add extended set measurements common to chrome & content processes
>      if (Telemetry.canRecordExtended) {
>        payloadObj.chromeHangs = protect(() => Telemetry.chromeHangs);
> +      payloadObj.capturedStacks = protect(() => Telemetry.capturedStacks);

I think we should only add this field if we have any data.
We could do something like:
```
let stacks = protect(...);
if (stacks && ("captures" in stacks) &&
    (stacks.captures.length > 0)) {
  // ... add it

::: toolkit/components/telemetry/nsITelemetry.idl:15
(Diff revision 5)
>  interface nsIFetchTelemetryDataCallback : nsISupports
>  {
>    void complete();
>  };
>  
> -[scriptable, uuid(273d2dd0-6c63-475a-b864-cb65160a1909)]
> +[scriptable, uuid(de4ea30a-2dac-46c3-8060-df715dd400ac)]

We don't need to change the interface UUIDs anymore from Firefox 47 on, so we can drop this :)

::: toolkit/components/telemetry/nsITelemetry.idl:132
(Diff revision 5)
>     */
>    [implicit_jscontext]
>    readonly attribute jsval chromeHangs;
>  
>    /*
> +   * Record the main thread's call stack on demand. Note that, the stack is

We record the current, not the main, thread.

::: toolkit/components/telemetry/nsITelemetry.idl:134
(Diff revision 5)
>    readonly attribute jsval chromeHangs;
>  
>    /*
> +   * Record the main thread's call stack on demand. Note that, the stack is
> +   * only captured at the first call. All subsequent calls result in incrementing
> +   * the capture counter without during actual stack unwinding.

Nit: "without doing ..."

::: toolkit/components/telemetry/tests/unit/test_TelemetryCaptureStack.js:76
(Diff revision 5)
> +add_task(function* test_TelemetryCapturesStacks(){
> +  let stacks = Telemetry.capturedStacks;
> +  let initialLength = stacks.length || 0;
> +
> +  // Perform stack capture and ensure more stacks are available.
> +  stacks = captureStacks(TEST_STACK_KEYS[0]);

This test will not actually collect anything when `MOZ_ENABLE_PROFILER_SPS` is not set.
We could just check that we don't have `capturedStacks` in that case.
To see whether it is enabled, we can add `ENABLE_PROFILER_SPS` to `AppConstants.jsm`.

::: toolkit/content/aboutTelemetry.js:1067
(Diff revision 5)
> +  symbolRequest: null,
> +
> +  render: function CapturedStacks_render(aPing) {
> +    let capturedStacks = aPing.payload.capturedStacks;
> +    let hasData = capturedStacks && capturedStacks.stacks &&
> +                  capturedStacks.stacks.length !== 0;

Nit: `.length > 0`

::: toolkit/content/aboutTelemetry.js:1068
(Diff revision 5)
> +
> +  render: function CapturedStacks_render(aPing) {
> +    let capturedStacks = aPing.payload.capturedStacks;
> +    let hasData = capturedStacks && capturedStacks.stacks &&
> +                  capturedStacks.stacks.length !== 0;
> +    console.log("Captured Stacks: " + JSON.stringify(capturedStacks));

Debugging left-over?

::: toolkit/content/aboutTelemetry.js:1589
(Diff revision 5)
> +      if (!gPingData) {
> +        return;
> +      }
> +      let capturedStacks = gPingData.payload.capturedStacks;
> +      let req = new SymbolicationRequest("captured-stacks",
> +                                         () => {},

We will have to trigger rendering here.

::: toolkit/content/aboutTelemetry.js:1791
(Diff revision 5)
> +  // Show chrome hang stacks
> +  CapturedStacks.render(ping);

We should move this down to the `ThreadHangStats` rendering after the payload selection - that way you can view the captured stack data from child processes too.

::: toolkit/content/aboutTelemetry.js:1795
(Diff revision 5)
>  
> +  // Show chrome hang stacks
> +  CapturedStacks.render(ping);
> +
> +  // Show thread hang stats
> +  ThreadHangStats.render(ping);

`ThreadHangStats` is already rendered below, lets remove this (merge artifact?).
Attachment #8690070 - Flags: review?(gfritzsche)
Talking with the people working on client side stack walking for the crash pings, this does not really overlap with our client side work.
However, they are spinning up a process for symbolification of stacks in the pipeline, which we can leverage for the data here.

Open questions that came up here in London:
* Do we reset this data between subsessions? (we should)
* Do we match the stack data format of the crash pings? (we should)
(In reply to Georg Fritzsche [:gfritzsche] from comment #21)
> Open questions that came up here in London:
> ...
> * Do we match the stack data format of the crash pings?

Gabriele, do you have docs or examples somewhere of the stack format you will use with crash pings?
We probably want to match that to make re-using the symbolification job easier.
Flags: needinfo?(gsvelto)
Whiteboard: [measurement:client] → [measurement:client] [lang=c++]
(In reply to Georg Fritzsche [:gfritzsche] from comment #22)
> Gabriele, do you have docs or examples somewhere of the stack format you
> will use with crash pings?

No, I was thinking of re-using the format used by the profiler by I cannot find any documentation for it.

> We probably want to match that to make re-using the symbolification job
> easier.

Yes, it would be best to have a single format for all the code that produces stacks (this, the profiler, the crash reporter) and document it too.
Flags: needinfo?(gsvelto)
(In reply to Gabriele Svelto [:gsvelto] from comment #23)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #22)
> > Gabriele, do you have docs or examples somewhere of the stack format you
> > will use with crash pings?
> 
> No, I was thinking of re-using the format used by the profiler by I cannot
> find any documentation for it.

We documented what we found here:
https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/main-ping.html#chromehangs

Is that suitable as the shared format?
Ideally we could use shared code for serialization.
Flags: needinfo?(gsvelto)
(In reply to Georg Fritzsche [:gfritzsche] [away Jun 24 - Jul 3] from comment #24)
> We documented what we found here:
> https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/
> telemetry/main-ping.html#chromehangs
> 
> Is that suitable as the shared format?
> Ideally we could use shared code for serialization.

Yes, it's fine! In fact it's basically the same thing I get from breakpad (a memory map with the list of modules then for every thread a list of module index / offset pairs).
Flags: needinfo?(gsvelto)
Attachment #8690070 - Attachment is obsolete: true
Comment on attachment 8690070 [details]
MozReview Request: Bug 1225851: Capturing keyed call stacks on demand inside Telemetry. r?gfritzsche

https://reviewboard.mozilla.org/r/25757/#review40311

> The second-to-last parameter is a `uintptr_t`, so we should pass `nullptr` instead of `0` (which means collecting the stack for the current thread).

passing `nullptr` gives me a compilation error about type mismatch.
Comment on attachment 8784100 [details]
Bug 1225851: Capturing keyed call stacks on demand inside Telemetry.

https://reviewboard.mozilla.org/r/73660/#review73532

Sorry for the delay Iaroslav.
I mostly ran through this higher-level now.
This is getting close, but there is still a bit of work left to do.
We can limit the scope here though and move non-essential work to follow-up bugs.

::: toolkit/components/telemetry/Telemetry.h:332
(Diff revision 1)
> + *
> + * @param aKey - A user defined key associated with the captured stack.
> + *
> + * NOTE: Unwinding call stacks is an expensive operation performance-wise.
> + */
> +void CaptureStack(const nsCString& aKey);

We should file a follow-up bug to document this and the JS API in `toolkit/components/telemetry/docs/collection/`.

::: toolkit/components/telemetry/Telemetry.cpp:456
(Diff revision 1)
> +  size_t stackIndex = mStacks.AddStack(stack);
> +  mStackInfos.Put(aKey, new StackFrequencyInfo(1, stackIndex));

We need to limit the amount of collected data to protect from flooding and oversized ping payloads.

`mStacks` is already protected as `CombinedStacks` is a circular buffer of size `kMaxChromeStacksKept`.

However, we still need to protect `mStackInfos` like here:
https://dxr.mozilla.org/mozilla-central/rev/506facea63169a29e04eb140663da1730052db64/toolkit/components/telemetry/Telemetry.cpp#282

::: toolkit/components/telemetry/Telemetry.cpp:1501
(Diff revision 1)
> +#endif
> +  return NS_OK;

```
...
#else
  return NS_OK;
#endif
```

::: toolkit/components/telemetry/Telemetry.cpp:2420
(Diff revision 1)
>                                     Move(annotations));
>  }
> +
> +void
> +TelemetryImpl::DoStackCapture(const nsACString& aKey) {
> +  if (Telemetry::CanRecordExtended()) {

We haven't really thought about handling data from child processes here.
I think we should get this landed without much more scope bloat, so i suggest we simply only support parent processes for now by checking `XRE_IsParentProcess()`.

Also, i think we need to enforce sanity limits on `aKey`.
I think it should:
* only contain alphanumeric character plus '-'
* not exceed a length of 50 characters

We should log an error otherwise.

::: toolkit/components/telemetry/TelemetrySession.jsm:1270
(Diff revision 1)
>        payloadObj.chromeHangs = protect(() => Telemetry.chromeHangs);
>        payloadObj.threadHangStats = protect(() => this.getThreadHangStats(Telemetry.threadHangStats));
>        payloadObj.log = protect(() => TelemetryLog.entries());
>        payloadObj.webrtc = protect(() => Telemetry.webrtcStats);
> +      // Adding captured stacks to the payload only if any exist.
> +      let stacks = protect(() => Telemetry.capturedStacks);

For this to properly work with main pings, we need to clear the data after collecting & sending it.
Otherwise each ping in a session also includes the stacks collected in previous subsessions.

I suggest we:
* change the `capturedStacks` property to a `snapshotCapturedStacks()` function
* give it an optional parameter `clear`, defaulting to false
* from this code here call `snapshotCapturedStacks(clearSubsession)`

::: toolkit/components/telemetry/TelemetrySession.jsm:1272
(Diff revision 1)
>        payloadObj.log = protect(() => TelemetryLog.entries());
>        payloadObj.webrtc = protect(() => Telemetry.webrtcStats);
> +      // Adding captured stacks to the payload only if any exist.
> +      let stacks = protect(() => Telemetry.capturedStacks);
> +      if (stacks && ("captures" in stacks) && (stacks.captures.length > 0)) {
> +        payloadObj.capturedStacks = stacks;

I think we should get this version landed & tested to see that it works.
I would request that we put this data now into `payloadObj.processes.parent.capturedStacks` further down below *"// Add extended set measurements for chrome process."*.

To make this properly work with main pings, subsessions and content processes, we'll have to do a little work in a follow-up bug, probably with:
* stop storing this data in the child processes, instead sending it up to the parent process like bug 1218576
* change the stack snapshot function to return the snapshots organized by process (`parent`, `content`)
* put the snapshot data here into `payloadObj.processes.parent`, `payloadObj.processes.content`, etc.

::: toolkit/components/telemetry/nsITelemetry.idl:9
(Diff revision 1)
>   * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
>  
>  #include "nsISupports.idl"
>  #include "nsIFile.idl"
>  
> -[scriptable,function, uuid(3d3b9075-5549-4244-9c08-b64fefa1dd60)]
> +[scriptable, uuid(273d2dd0-6c63-475a-b864-cb65160a1909)]

There is no need to change UUIDs for interfaces anymore.

::: toolkit/components/telemetry/nsITelemetry.idl:15
(Diff revision 1)
>  interface nsIFetchTelemetryDataCallback : nsISupports
>  {
>    void complete();
>  };
>  
> -[scriptable, uuid(273d2dd0-6c63-475a-b864-cb65160a1909)]
> +[scriptable, uuid(de4ea30a-2dac-46c3-8060-df715dd400ac)]

Ditto.

::: toolkit/components/telemetry/nsITelemetry.idl:156
(Diff revision 1)
> +   */
> +  void captureStack(in ACString name);
> +
> +  /*
> +   * Returns information about stacks captured explicitly via Telemetry API.
> +   * The data has the following structure:

We also need to update the documentation in `main-ping.rst` with this information.

`threadHangStats` can serve as an example here:
* https://dxr.mozilla.org/mozilla-central/rev/506facea63169a29e04eb140663da1730052db64/toolkit/components/telemetry/docs/data/main-ping.rst#64
* https://dxr.mozilla.org/mozilla-central/rev/506facea63169a29e04eb140663da1730052db64/toolkit/components/telemetry/docs/data/main-ping.rst#214

::: toolkit/components/telemetry/tests/unit/test_TelemetryCaptureStack.js:9
(Diff revision 1)
> +Cu.import("resource://gre/modules/TelemetryController.jsm", this);
> +Cu.import("resource://gre/modules/AppConstants.jsm", this);
> +
> +/**
> + * Ensures that the sctucture of the javascript object used for capturing stacks
> + * is a intended. The structure is expected to be as in this example:

"is as intended"

::: toolkit/components/telemetry/tests/unit/test_TelemetryCaptureStack.js:41
(Diff revision 1)
> +    return false;
> +  }
> +
> +  // Ensuring all properties exist inside the object and are arrays.
> +  for (let property of ["memoryMap", "stacks", "captures"]) {
> +    if (!property in obj || !Array.isArray(obj[property]))

You need parantheses around `property in obj`.
FWIW, `mach eslint toolkit/components/telemetry` will warn you about this.

::: toolkit/components/telemetry/tests/unit/test_TelemetryCaptureStack.js:68
(Diff revision 1)
> +
> +/**
> + * Ensures that captured stacks appear in pings, if any were captured.
> + */
> +add_task({
> +  skip_if: () => AppConstants.MOZ_ENABLE_PROFILER_SPS

Shouldn't this be `skip_if: () => !...`?

::: toolkit/content/aboutTelemetry.js:1941
(Diff revision 1)
>    }
>  
>    // Show thread hang stats
>    ThreadHangStats.render(payload);
>  
> +  // Show chrome hang stacks

Some change in here or in `aboutTelemetry.xhtml` seems to have broken `about:telemetry`.

The first time i load it, it doesn't render any ping data.

::: toolkit/content/aboutTelemetry.js:1942
(Diff revision 1)
>  
>    // Show thread hang stats
>    ThreadHangStats.render(payload);
>  
> +  // Show chrome hang stacks
> +  CapturedStacks.render(ping);

Lets read it from `payload` here.

::: toolkit/content/aboutTelemetry.xhtml:263
(Diff revision 1)
>          <div id="raw-payload-data" class="data">
>            <pre id="raw-payload-data-pre"></pre>
>          </div>
>        </section>
> +
> +      <section id="captured-stacks-section" class="data-section">

I think this section should go before `raw-payload-section`.
Attachment #8784100 - Flags: review?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #28)
> ::: toolkit/components/telemetry/TelemetrySession.jsm:1272
> (Diff revision 1)
> >        payloadObj.log = protect(() => TelemetryLog.entries());
> >        payloadObj.webrtc = protect(() => Telemetry.webrtcStats);
> > +      // Adding captured stacks to the payload only if any exist.
> > +      let stacks = protect(() => Telemetry.capturedStacks);
> > +      if (stacks && ("captures" in stacks) && (stacks.captures.length > 0)) {
> > +        payloadObj.capturedStacks = stacks;
> 
> I think we should get this version landed & tested to see that it works.
> I would request that we put this data now into
> `payloadObj.processes.parent.capturedStacks` further down below *"// Add
> extended set measurements for chrome process."*.
> 
> To make this properly work with main pings, subsessions and content
> processes, we'll have to do a little work in a follow-up bug, probably with:
> * stop storing this data in the child processes, instead sending it up to
> the parent process like bug 1218576
> * change the stack snapshot function to return the snapshots organized by
> process (`parent`, `content`)
> * put the snapshot data here into `payloadObj.processes.parent`,
> `payloadObj.processes.content`, etc.

Chris, does this sound sane to you?
I think for all new additions that are process-specific we should move it to the new `payload.processes.X` scheme right away.
Flags: needinfo?(chutten)
Definitely we should stick new process-specific stuff in the payload.processes.<process-type> location to avoid churn.

Hopefully once bug 1218576 lands there will be a nice pattern to follow to aggregate child process stuff in the parent. ...captured stacks are aggregatable, right?
Flags: needinfo?(chutten)
(In reply to Chris H-C :chutten from comment #30)
> Hopefully once bug 1218576 lands there will be a nice pattern to follow to
> aggregate child process stuff in the parent. ...captured stacks are
> aggregatable, right?

Yes - for each "key", we record a captured stack only once and then only increase the count for future occurrences.
See Also: → 1303077
Hey Iaroslav,

Are you still working on this? Bug 1303077 is a bug that we're hoping to investigate using stacks stored in Telemetry, so this is blocking that work.

Thanks!
Blocks: 1303077
Flags: needinfo?(yarik.sheptykin)
See Also: 1303077
Hi Mike! Yes I am still working on this but I am very slow. Georg helped me to make quite a progress. We implemented stackcapture API, wrote tests and added captured stacks to to the payload. However there are still several minor issues open in the patch. If you need it urgent (earlier than next 2 weeks) please feel free to pick it up. Otherwise I will be slowly moving on.
Flags: needinfo?(yarik.sheptykin)
Comment on attachment 8784100 [details]
Bug 1225851: Capturing keyed call stacks on demand inside Telemetry.

https://reviewboard.mozilla.org/r/25755/#review81746

It looks like re-basing for the patch went wrong. The diff here includes changes which landed in the mean-time, like the ones from bug 1218576.
I'm not 100% sure which changes belong to the patch here - can you re-submit this with a clean diff?
Some initial comments below though.

We should file a follow-up to de-race this data collection (presumably protecting it with a mutex).

::: toolkit/components/telemetry/Telemetry.h:36
(Diff revisions 6 - 7)
> +struct Accumulation;
> +struct KeyedAccumulation;
> +

This looks like the patch was not properly rebased?
It includes changes from bug 1218576 that already landed.

::: toolkit/components/telemetry/Telemetry.cpp:403
(Diff revisions 6 - 7)
> +/**
> + * Helps validating keys used around Telemetry code.
> + * The key are allowed to only contain alphanumeric character plus '-'
> + * and not exceed a length of 50 characters.
> + */
> +class KeyValidator {

I think this doesn't need to be a class.
Classes in C++ are for grouping instance data together with logic that operates for them.
For grouping code together with common names we have namespaces.

In this case though, i think we can just put the two functions, definitions and constanst in the existing (anonymous) namespace here.

::: toolkit/components/telemetry/Telemetry.cpp:440
(Diff revisions 6 - 7)
> +
> +  /** Length limit for keys. */
> +  static const uint8_t kMaxKeyLength;
> +
> +  /**
> +   * Checks if a single character of the key strubg is valid.

"key string"?

::: toolkit/components/telemetry/Telemetry.cpp:475
(Diff revisions 6 - 7)
> +
> +  const char* cur = aKey.BeginReading();
> +  const char* end = aKey.EndReading();
> +
> +  for (; cur < end; ++cur) {
> +      if (!KeyValidator::IsCharValid(*cur)) {

I think this is a bit more generic than we need it here.
It seems clearer to have a single function that loops over the string and checks for the chars being allowed.

If needed, helper functions (like one for "check char is in range "A" to "Z") would be helpful.

::: toolkit/components/telemetry/Telemetry.cpp:1590
(Diff revisions 6 - 7)
>  
>  NS_IMETHODIMP
> -TelemetryImpl::GetCapturedStacks(JSContext *cx, JS::MutableHandle<JS::Value> ret)
> +TelemetryImpl::SnapshotCapturedStacks(bool clear, JSContext *cx, JS::MutableHandle<JS::Value> ret)
>  {
>  #if defined(MOZ_ENABLE_PROFILER_SPS)
> -  return mStackCapturer.ReflectCapturedStacks(cx, ret);
> +  NS_IMETHODIMP capturedStacks = mStackCapturer.ReflectCapturedStacks(cx, ret);

```
nsresult rv = ...
...
return rv;
```

::: toolkit/components/telemetry/Telemetry.cpp:1593
(Diff revisions 6 - 7)
>  {
>  #if defined(MOZ_ENABLE_PROFILER_SPS)
> -  return mStackCapturer.ReflectCapturedStacks(cx, ret);
> -#endif
> +  NS_IMETHODIMP capturedStacks = mStackCapturer.ReflectCapturedStacks(cx, ret);
> +  if (clear) {
> +    mStackCapturer.Clear();
> +  } 

Nit: Trailing whitespace.

::: toolkit/components/telemetry/docs/data/main-ping.rst:268
(Diff revision 7)
>        ... other threads ...
>       ]
>  
> +capturedStacks
> +--------------
> +Contains information about stacks captured on demand via Telemetry API. Similarly to `chromeHangs` exclusively stack captured on the main thread of the parent process are reported. Precise C++ stacks are reported. This is only available on Nightly Release on Windows, when building using "--enable-profiling" switch.

Wording nits:
"This is similar to `chromeHangs`, but only stacks captured on ... It reports precise C++ stacks and is only available ..."

::: toolkit/components/telemetry/docs/data/main-ping.rst:270
(Diff revision 7)
>  
> +capturedStacks
> +--------------
> +Contains information about stacks captured on demand via Telemetry API. Similarly to `chromeHangs` exclusively stack captured on the main thread of the parent process are reported. Precise C++ stacks are reported. This is only available on Nightly Release on Windows, when building using "--enable-profiling" switch.
> +
> +Limits for captured stacks are the same a for chromeHangs (see below). Furthermore:

Nit: "... as for chromeHangs"

::: toolkit/components/telemetry/docs/data/main-ping.rst:273
(Diff revision 7)
> +Contains information about stacks captured on demand via Telemetry API. Similarly to `chromeHangs` exclusively stack captured on the main thread of the parent process are reported. Precise C++ stacks are reported. This is only available on Nightly Release on Windows, when building using "--enable-profiling" switch.
> +
> +Limits for captured stacks are the same a for chromeHangs (see below). Furthermore:
> +
> +* the key length is limited to 50 characters,
> +* keys are allowed to contain alpfa-numeric characters and `-`.

Nit: "keys are restricted to ..."
Nit: "alpha"

::: toolkit/components/telemetry/docs/data/main-ping.rst:283
(Diff revision 7)
> +
> +    "capturedStacks" : {
> +      "memoryMap": [
> +        ["wgdi32.pdb", "08A541B5942242BDB4AEABD8C87E4CFF2"],
> +        ["igd10iumd32.pdb", "D36DEBF2E78149B5BE1856B772F1C3991"],
> +        ... other entries in the format ["module name", "breakpad identifier"] ...

Nit: Lets make this a comment using `//`.
Ditto below.
Attachment #8784100 - Attachment is obsolete: true
Attachment #8784100 - Flags: review?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #35)
> We should file a follow-up to de-race this data collection (presumably
> protecting it with a mutex).

I filed bug 1307746 for this.
No longer blocks: 1303077
Priority: P1 → P3
Comment on attachment 8797984 [details]
Bug 1225851: Capturing keyed call stacks on demand inside Telemetry.

https://reviewboard.mozilla.org/r/25755/#review90070

This looks solid to me now, given the follow-up bugs we already filed.
After looking at this patch for a rather long time i want someone else to have a look over the changes as well.

::: toolkit/components/telemetry/Telemetry.cpp:489
(Diff revisions 7 - 8)
>  {}
>  
>  void KeyedStackCapturer::Capture(const nsACString& aKey) {
>    // Check if the key is ok.
> -  if (!KeyValidator::IsValid(aKey)) {
> +  if (!IsKeyValid(aKey)) {
>      // TODO: Write this into the log.

`NS_WARN_IF()` should be enough here.
Attachment #8797984 - Flags: review?(chutten)
Comment on attachment 8797984 [details]
Bug 1225851: Capturing keyed call stacks on demand inside Telemetry.

https://reviewboard.mozilla.org/r/83590/#review90076
Attachment #8797984 - Flags: review?(gfritzsche) → review+
@Chris:
Could you take a look at this as well?
The interesting changes are in Telemetry.h/.cpp and good context should be in the documentation .rst too.
I should have the rest of the changes covered, but would like a second look over the C++ code here.

Note that de-racing this is handled in a follow-up bug, as is supporting content processes.
Comment on attachment 8797984 [details]
Bug 1225851: Capturing keyed call stacks on demand inside Telemetry.

https://reviewboard.mozilla.org/r/83590/#review90140

A few small things, nothing terrible.

::: toolkit/components/telemetry/Telemetry.cpp:489
(Diff revision 1)
> +{}
> +
> +void KeyedStackCapturer::Capture(const nsACString& aKey) {
> +  // Check if the key is ok.
> +  if (!IsKeyValid(aKey)) {
> +    // TODO: Write this into the log.

checking in a TODO is suboptimal. NS_WARNING ought to do the trick.

::: toolkit/components/telemetry/Telemetry.cpp:501
(Diff revision 1)
> +    // We already recorded this stack before, only increase the count.
> +    info->mCount++;
> +    return;
> +  }
> +
> +  // Check if we have room for new captures.

File a bug for recording telemetry when we run out of room.

::: toolkit/components/telemetry/docs/data/main-ping.rst:268
(Diff revision 1)
>        ... other threads ...
>       ]
>  
> +capturedStacks
> +--------------
> +Contains information about stacks captured on demand via Telemetry API. This is similar to `chromeHangs`, but only stacks captured on the main thread of the parent process are reported. It reports precise C++ stacks are reported and is only available on Nightly Release on Windows, when building using "--enable-profiling" switch.

"Nightly Release on Windows" is an odd turn of phrase. Do you mean "Firefox Nightly on Windows"? And you might be missing an "or" before "when building using..."

::: toolkit/components/telemetry/tests/unit/xpcshell.ini:64
(Diff revision 1)
>  [test_TelemetryReportingPolicy.js]
>  tags = addons
>  [test_TelemetryScalars.js]
>  [test_TelemetryTimestamps.js]
>  skip-if = toolkit == 'android'
> +[test_TelemetryCaptureStack.js]

Shouldn't we skip the tests if they are run on a build that doesn't support the profiler or the stack capture mechanism?

::: toolkit/locales/en-US/chrome/global/aboutTelemetry.dtd:104
(Diff revision 1)
>  <!ENTITY aboutTelemetry.threadHangStatsSection "
>    Thread Hangs
>  ">
>  
> +<!ENTITY aboutTelemetry.capturedStacksSection "
> +  Captured Callstacks

Everywhere else they are just called "Stacks"
Attachment #8797984 - Flags: review?(chutten) → review-
Blocks: 1316088
Blocks: 1316793
Attachment #8797984 - Attachment is obsolete: true
Comment on attachment 8810863 [details]
Bug 1225851: Capturing keyed call stacks on demand inside Telemetry.

https://reviewboard.mozilla.org/r/93162/#review93136

::: toolkit/components/telemetry/Telemetry.cpp:422
(Diff revision 1)
> + *
> + * @param aKey is the key string.
> + * @return True, if the key is valid, False - otherwise.
> + */
> +bool
> +IsKeyValid(const nsACString& aKey)

It's possible that the static analysis on inbound/central will fail the build because this isn't used in builds with the profiler turned off (-Wunused -Werror)

To be safe, maybe include these under the same #ifdef as their uses.
Looks like there's a problem with the debug statements in the Linux debug build:

/home/worker/workspace/build/src/toolkit/components/telemetry/Telemetry.cpp:491:117: error: cannot pass objects of non-trivially-copyable type 'const class nsACString_internal' through '...'
/home/worker/workspace/build/src/obj-firefox/dist/include/nsTSubstring.h:935:3: error: 'nsACString_internal::nsACString_internal(const self_type&)' is protected
/home/worker/workspace/build/src/toolkit/components/telemetry/Telemetry.cpp:491:117: error: within this context
/home/worker/workspace/build/src/toolkit/components/telemetry/Telemetry.cpp:491:5: error: cannot convert 'nsPrintfCString' to 'const char*' for argument '2' to 'void NS_DebugBreak(uint32_t, const char*, const char*, const char*, int32_t)'
So, after an illuminating conversation with :bz on irc.mozilla.org#developers, I've learned that you can only pass plain old data through variadic arguments.

In other words, throw a `.get()` on `aKey` to get the char* out of it, and we should be off to the races.
Er, rather, apparently the standard operating procedure is to also wrap `aKey` in `PromiseFlatCString` first to ensure that it has the correct conversion.
Hi Chris! Thanks for the hints! I was a little surprised to see build failing on tryserver because local build worked fine. Error message was also a bit cryptic, so I wasn't sure what I was missing. I fixed the code and re sent it to the try server.
Looks like you need a get() from the printfstring as well
(In reply to Chris H-C :chutten from comment #54)
> Looks like you need a get() from the printfstring as well

You are right!

(In reply to Iaroslav Sheptykin from comment #55)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=0ba89b1443fc

Now I see that I forgot to go with lint over my code. Fixing it.
Comment on attachment 8810863 [details]
Bug 1225851: Capturing keyed call stacks on demand inside Telemetry.

https://reviewboard.mozilla.org/r/93162/#review93840

LGTM
Attachment #8810863 - Flags: review?(chutten) → review+
Attachment #8810863 - Attachment is obsolete: true
Georg! I had to merge manually the patch after a rebase so I thought another look over the patch from you would be good. I also re sent the patch to the tryserver to make sure everything is ok. Is my choice of platforms and tests ok?
Flags: needinfo?(gfritzsche)
I'm missing a Win32 build (that is where our main population still is).
In general, one Windows/Mac/Linux/Android/lint build should be enough, no need to run this kind of change on e.g. different Mac & Windows platforms.
The test choices look fine to me.

There is an eslint failure that still needs fixing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=708f1f215003&selectedJob=31351934

You can check eslint locally using: mach eslint
Flags: needinfo?(gfritzsche)
Comment on attachment 8811881 [details]
Bug 1225851: Capturing keyed call stacks on demand inside Telemetry. r+chutten,

https://reviewboard.mozilla.org/r/93790/#review94928
Attachment #8811881 - Flags: review?(gfritzsche) → review+
Hm, I haven't seen a green build like that for a while time. I must have forgotten something, what do you think Georg?
Flags: needinfo?(gfritzsche)
That looks good, lets land it!

You might have just gotten lucky with not hitting any intermittents, but xpcshell-tests are affected less by those anyway.
Flags: needinfo?(gfritzsche)
Keywords: checkin-needed
I'll land this.
Keywords: checkin-needed
Pushed by georg.fritzsche@googlemail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/620b85825ca5
Capturing keyed call stacks on demand inside Telemetry. r=chutten,r=gfritzsche
https://hg.mozilla.org/mozilla-central/rev/620b85825ca5
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Depends on: 1320312
See Also: → 1320661
You need to log in before you can comment on or make changes to this bug.