Closed Bug 1312883 Opened 3 years ago Closed 3 years ago

BHR should collect native stacks instead of pseudostacks on builds with frame pointers

Categories

(Toolkit :: Telemetry, defect, P4)

49 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox54 --- wontfix
firefox55 --- fixed

People

(Reporter: mconley, Assigned: mconley)

References

(Blocks 1 open bug)

Details

Attachments

(5 files)

BHR emits pseudostacks for the smaller hangs, but goes for native stacks for what it considers to be permahangs.

The Profiler can get native stacks quite quickly on builds where frame pointers haven't been stripped. We should try to get BHR to use the same mechanism where possible.
Priority: -- → P3
Naveed: Hang stacks from the field would be helpful in identifying bad addons, scripts and pages. This seems like something the Flow team could consider taking on to improve data collection.
Flags: needinfo?(nihsanullah)
Priority: P3 → P4
Assignee: nobody → mconley
Comment on attachment 8838139 [details]
Bug 1312883 - Use MozStackWalk to gather native stacks on Windows.

https://reviewboard.mozilla.org/r/113112/#review114594

::: toolkit/components/telemetry/Telemetry.cpp:1895
(Diff revision 1)
> +    JS::RootedObject fullReportObj(cx, CreateJSStackObject(cx, singleStack));
> +    if (!fullReportObj) {
> +      return nullptr;
> +    }
> +
> +    if (!JS_DefineProperty(cx, ret, "nativeStack", fullReportObj, JSPROP_ENUMERATE)) {

Please document the format changes in the corresponding section in toolkit/components/telemetry/docs/data/main-ping.rst
Attachment #8838139 - Flags: review?(gfritzsche)
Comment on attachment 8838137 [details]
Bug 1312883 - MOZ_THREADSTACKHELPER_NATIVE should only be true on Windows.

https://reviewboard.mozilla.org/r/113108/#review114786

Why remove the ARM code?
Comment on attachment 8838138 [details]
Bug 1312883 - Remove all thread context processing from ThreadStackHelper.

https://reviewboard.mozilla.org/r/113110/#review114788
Attachment #8838138 - Flags: review?(nchen) → review+
Comment on attachment 8838137 [details]
Bug 1312883 - MOZ_THREADSTACKHELPER_NATIVE should only be true on Windows.

https://reviewboard.mozilla.org/r/113108/#review114790

Ah I guess the ARM stuff should be part of the next patch.
Attachment #8838137 - Flags: review?(nchen) → review+
Comment on attachment 8838139 [details]
Bug 1312883 - Use MozStackWalk to gather native stacks on Windows.

https://reviewboard.mozilla.org/r/113112/#review114792

::: toolkit/components/telemetry/ThreadHangStats.h:15
(Diff revision 3)
>  #include "mozilla/Assertions.h"
>  #include "mozilla/HangAnnotations.h"
>  #include "mozilla/Move.h"
>  #include "mozilla/Mutex.h"
>  #include "mozilla/PodOperations.h"
> +#include "mozilla/Telemetry.h"

Don't need this?

::: toolkit/components/telemetry/ThreadHangStats.h:63
(Diff revision 3)
>    Impl mImpl;
>  
>    // Stack entries can either be a static const char*
>    // or a pointer to within this buffer.
>    mozilla::Vector<char, 0> mBuffer;
> +  std::vector<uintptr_t> mNativeFrames;

Add comment about what the `uintptr_t` is supposed to represent.

::: xpcom/threads/ThreadStackHelper.cpp:78
(Diff revision 3)
>  #endif // MOZ_THREADSTACKHELPER_NATIVE
>  
>  namespace mozilla {
>  
> +// The maximum depth for the native stack frames that we might collect.
> +const size_t kThreadStackHelperNativeMaxDepth = 25;

Nit: move to inside ThreadStackHelper

::: xpcom/threads/ThreadStackHelper.cpp:169
(Diff revision 3)
>  } // namespace
>  
>  void
>  ThreadStackHelper::GetStack(Stack& aStack)
>  {
> +  GetStackInternal(aStack, false);

Nit: `GetStackInternal(aStack, /* aAppendNativeStack */ false);`

::: xpcom/threads/ThreadStackHelper.cpp:231
(Diff revision 3)
> +#endif
> +
> +  if (aAppendNativeStack) {
> +    auto callback = [](uint32_t, void* aPC, void*, void* aClosure) {
> +      Stack* stack = static_cast<Stack*>(aClosure);
> +      stack->AppendNativeFrame(reinterpret_cast<uintptr_t>(aPC));

Should we be preallocating this buffer? Possible allocation while another thread is suspended in an arbitrary location seems dangerous.

::: xpcom/threads/ThreadStackHelper.cpp:267
(Diff revision 3)
>  
>  void
>  ThreadStackHelper::GetNativeStack(Stack& aStack)
>  {
>  #ifdef MOZ_THREADSTACKHELPER_NATIVE
> -  // Get pseudostack first and fill the thread context.
> +  GetStackInternal(aStack, true);

Nit: `GetStackInternal(aStack, /* aAppendNativeStack */ true);`
Attachment #8838139 - Flags: review?(nchen)
Comment on attachment 8838139 [details]
Bug 1312883 - Use MozStackWalk to gather native stacks on Windows.

https://reviewboard.mozilla.org/r/113112/#review114792

> Should we be preallocating this buffer? Possible allocation while another thread is suspended in an arbitrary location seems dangerous.

So the reason I'm using a std::vector is because that's what's expected by `Telemetry::GetStackAndModules`. I'm not sure what the right move is here. In this latest patch, I've altered the constructors of HangStack to reserve the number of native frames that we're willing to capture. Is that okay, or would you rather I statically declare a fixed sized array and convert it to a std::vector, or alter ThreadStackHelper::GetStackInternal to take some kind of generic iterator to a collection of uintptr_t's, or something else entirely?
ni? for comment 20.
Flags: needinfo?(nchen)
Comment on attachment 8838139 [details]
Bug 1312883 - Use MozStackWalk to gather native stacks on Windows.

https://reviewboard.mozilla.org/r/113112/#review115158

::: toolkit/components/telemetry/ThreadHangStats.h:73
(Diff revision 4)
> +  std::vector<uintptr_t> mNativeFrames;
>  
>  public:
> -  HangStack() { }
> +  HangStack()
> +  {
> +    mNativeFrames.reserve(sMaxNativeFrames);

Calling `reserve` is fine, but I'd call it in `GetStackInternal`, before `SuspendThread`, if `aAppendNativeFrames` is true.

::: toolkit/components/telemetry/ThreadHangStats.h:153
(Diff revision 4)
>  
>    const char* InfallibleAppendViaBuffer(const char* aText, size_t aLength);
>    const char* AppendViaBuffer(const char* aText, size_t aLength);
> +
> +  void AppendNativeFrame(uintptr_t aPc) {
> +    mNativeFrames.push_back(aPc);

Also assert here that you're not going over the capacity.
Attachment #8838139 - Flags: review?(nchen) → review+
Flags: needinfo?(nchen)
Comment on attachment 8838139 [details]
Bug 1312883 - Use MozStackWalk to gather native stacks on Windows.

https://reviewboard.mozilla.org/r/113112/#review115390

This looks good to me for the Telemetry parts.
I'll comment on the format change on the documentation commit.

::: toolkit/components/telemetry/ThreadHangStats.h:153
(Diff revision 5)
> +  void EnsureNativeFrameCapacity(size_t aCapacity) {
> +    mNativeFrames.reserve(aCapacity);
> +  }
> +
> +  void AppendNativeFrame(uintptr_t aPc) {
> +    MOZ_ASSERT(mNativeFrames.size() <= sMaxNativeFrames);

Does this assert right before the push logic add value?
Attachment #8838139 - Flags: review?(gfritzsche) → review+
Comment on attachment 8838170 [details]
Bug 1312883 - Document the nativeStack property on the threadHangStats object.

https://reviewboard.mozilla.org/r/113142/#review115388

This looks good with the questions below cleared.

::: toolkit/components/telemetry/docs/data/main-ping.rst:251
(Diff revision 4)
>  
>  As of Firefox 48, this section does not contain empty keyed histograms anymore.
>  
>  threadHangStats
>  ---------------
> -Contains the statistics about the hangs in main and background threads. Note that hangs in this section capture the `C++ pseudostack <https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Profiling_with_the_Built-in_Profiler#Native_stack_vs._Pseudo_stack>`_ and an incomplete JS stack, which is not 100% precise.
> +Contains the statistics about the hangs in main and background threads. Note that hangs in this section capture the `C++ pseudostack <https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Profiling_with_the_Built-in_Profiler#Native_stack_vs._Pseudo_stack>`_ and an incomplete JS stack, which is not 100% precise. For particularly egregious hangs, an unsymbolicated native stack is also captured.

What are the criteria to decide that it is that kind of hang?
This is important information for analysts.

::: toolkit/components/telemetry/docs/data/main-ping.rst:276
(Diff revision 4)
>                "Timer::Fire",
>                "(content script)",
>                "IPDL::PPluginScriptableObject::SendGetChildProperty",
>                ... up to 11 frames ...
>              ],
> -            "nativeStack": [...], // optionally available
> +            "nativeStack": { // only captured for egregious hangs

What is the value when it's not captured?

This changes the existing format - are there any existing jobs that could break from changing this?

Note that you may still have submissions coming in with the old format and new buildids (from dev builds with "official" build flags).
You'll have to account for this either in the analysis jobs or rename this field.
Attachment #8838170 - Flags: review?(gfritzsche) → review+
Comment on attachment 8838170 [details]
Bug 1312883 - Document the nativeStack property on the threadHangStats object.

https://reviewboard.mozilla.org/r/113142/#review115388

> What are the criteria to decide that it is that kind of hang?
> This is important information for analysts.

This is mostly up to the implementors of the BackgroundHangMonitor for a particular thread. For main threads, that's defined here:

http://searchfox.org/mozilla-central/rev/12cf11303392edac9f1da0c02e3d9ad2ecc8f4d3/xpcom/build/XPCOMInit.cpp#767-770

Where after the "permanent hang timeout", we capture the native stack. For the main threads in the parent and content processes, that's currently just over 8 seconds: http://searchfox.org/mozilla-central/rev/12cf11303392edac9f1da0c02e3d9ad2ecc8f4d3/xpcom/build/XPCOMInit.cpp#532-535

For the compositor thread (and I assume the main thread in the compositor process), that's just over 2 seconds: http://searchfox.org/mozilla-central/rev/12cf11303392edac9f1da0c02e3d9ad2ecc8f4d3/gfx/layers/ipc/CompositorThread.cpp#94

Because I care so much about getting native stacks for tab switch spinners, I've made it so that the force paint thread monitor only waits 1s before capturing a native stack.

Anyhow, I'll put something to that effect in the documentation.

> What is the value when it's not captured?
> 
> This changes the existing format - are there any existing jobs that could break from changing this?
> 
> Note that you may still have submissions coming in with the old format and new buildids (from dev builds with "official" build flags).
> You'll have to account for this either in the analysis jobs or rename this field.

I don't believe there are any dashboards monitoring native stacks (or the pseudostacks, beyond the ones I myself am gathering via my Spark cluster stuff). The dashboards for BHR were shut down a few months ago because they were broken (didn't survive the Unified Telemetry transition).

So I think I'm okay here. I'll ask around in #telemetry to be sure though.
Comment on attachment 8838139 [details]
Bug 1312883 - Use MozStackWalk to gather native stacks on Windows.

https://reviewboard.mozilla.org/r/113112/#review115390

> Does this assert right before the push logic add value?

Yeah - I added it for jchen from his feedback in https://reviewboard.mozilla.org/r/113112/#review115158
Hey mreid,

I was told I should probably clear this with you, wrt ping size and data budget. In the rare cases where we experience a permahang on any threads that BHR monitors (the thing that populates threadHangStats), I'll be adding an object to the nativeStack property on that thread hang that will probably be similar in size to the sorts of things we collect with chromeHangs.

I've got a stack depth max of 25, so it'd bounded there, and it's also bounded by the number of modules that are contained within the Firefox process.

Does that all sound above board?

Also ni'ing bsmedberg to make sure I've got data peer approval on gathering this stuff. I know we already gather similar stuff for chromeHangs, but that's just for the main thread, and this is going to be on all BHR monitored threads now. There's not much in there that I'd consider privacy sensitive - it's memory addresses within the Firefox process, along with the module identifiers that are used for symbolication. Do let me know if there's a problem, but otherwise, are you able to give me data-review+?

Assuming I get green lit there, I'll send out mail to fhr-dev next to let them know what's going on before landing this thing.
Flags: needinfo?(mreid)
Flags: needinfo?(benjamin)
Clearing naveed's ni, since I've got this more or less done now.
Flags: needinfo?(nihsanullah)
bsmedberg is bogged down with post-vacation stuff, redirecting data-review? to liuche (see comment 30).
Flags: needinfo?(benjamin) → needinfo?(liuche)
(In reply to Mike Conley (:mconley) from comment #30)
> Hey mreid,
> 
> I was told I should probably clear this with you, wrt ping size and data
> budget. In the rare cases where we experience a permahang on any threads
> that BHR monitors (the thing that populates threadHangStats), I'll be adding
> an object to the nativeStack property on that thread hang that will probably
> be similar in size to the sorts of things we collect with chromeHangs.
> 
> I've got a stack depth max of 25, so it'd bounded there, and it's also
> bounded by the number of modules that are contained within the Firefox
> process.
> 
> Does that all sound above board?

Per IRC conversation, this is pre-release only, and a given native stack should be on the order of a few KBs. Sounds fine to me.
Flags: needinfo?(mreid)
Comment on attachment 8838139 [details]
Bug 1312883 - Use MozStackWalk to gather native stacks on Windows.

https://reviewboard.mozilla.org/r/113112/#review116564

Reading this and the documentation patch, recording memory addresses looks fine to me from a data collection perspective.

data-review+ from me
Comment on attachment 8838139 [details]
Bug 1312883 - Use MozStackWalk to gather native stacks on Windows.

https://reviewboard.mozilla.org/r/113112/#review116566
Attachment #8838139 - Flags: review+
Flags: needinfo?(liuche)
Comment on attachment 8838170 [details]
Bug 1312883 - Document the nativeStack property on the threadHangStats object.

https://reviewboard.mozilla.org/r/113142/#review115388

> I don't believe there are any dashboards monitoring native stacks (or the pseudostacks, beyond the ones I myself am gathering via my Spark cluster stuff). The dashboards for BHR were shut down a few months ago because they were broken (didn't survive the Unified Telemetry transition).
> 
> So I think I'm okay here. I'll ask around in #telemetry to be sure though.

And to answer "what is the value when it's not captured" - the property is not defined. It's been that way for a while.
Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/42f0a217197d
MOZ_THREADSTACKHELPER_NATIVE should only be true on Windows. r=jchen
https://hg.mozilla.org/integration/autoland/rev/d6b3328c9573
Remove all thread context processing from ThreadStackHelper. r=jchen
https://hg.mozilla.org/integration/autoland/rev/fddfc0739043
Use MozStackWalk to gather native stacks on Windows. r=gfritzsche,jchen,liuche
https://hg.mozilla.org/integration/autoland/rev/5c3ba01ac170
Document the nativeStack property on the threadHangStats object. r=gfritzsche
Backed out for build bustage Windows due to unknown identifiers:

https://hg.mozilla.org/integration/autoland/rev/25b45d519a90064beef663da8aa9b886c7e94e64
https://hg.mozilla.org/integration/autoland/rev/453711cd7f0ef8610a92e920f9768064925bdcde
https://hg.mozilla.org/integration/autoland/rev/ad03bd423ccf0254d8c9505fcf4153305a97df94
https://hg.mozilla.org/integration/autoland/rev/44261dd4c8927ed56ef68437b2d468fa838c4152

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=5c3ba01ac170f487658e79934016e1ad3cde8b32&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable

13:16:34     INFO -  z:/build/build/src/sccache2/sccache.exe z:/build/build/src/vs2015u3/VC/bin/amd64/cl.exe -FoUnified_cpp_dom_canvas5.obj -c -Iz:/build/build/src/obj-firefox/dist/stl_wrappers  -DNDEBUG=1 -DTRIMMED=1 -DWIN32_LEAN_AND_MEAN -D_WIN32 -DWIN32 -D_CRT_RAND_S -DCERT_CHAIN_PARA_HAS_EXTRA_FIELDS -DOS_WIN=1 -D_UNICODE -DCHROMIUM_BUILD -DU_STATIC_IMPLEMENTATION -DUNICODE -D_WINDOWS -D_SECURE_ATL -DCOMPILER_MSVC -DSTATIC_EXPORTABLE_JS_API -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -Iz:/build/build/src/dom/canvas -Iz:/build/build/src/obj-firefox/dom/canvas -Iz:/build/build/src/js/xpconnect/wrappers -Iz:/build/build/src/obj-firefox/ipc/ipdl/_ipdlheaders -Iz:/build/build/src/ipc/chromium/src -Iz:/build/build/src/ipc/glue -Iz:/build/build/src/dom/workers -Iz:/build/build/src/dom/base -Iz:/build/build/src/dom/html -Iz:/build/build/src/dom/svg -Iz:/build/build/src/dom/workers -Iz:/build/build/src/dom/xul -Iz:/build/build/src/gfx/gl -Iz:/build/build/src/image -Iz:/build/build/src/js/xpconnect/src -Iz:/build/build/src/layout/generic -Iz:/build/build/src/layout/style -Iz:/build/build/src/layout/xul -Iz:/build/build/src/media/libyuv/include -Iz:/build/build/src/gfx/skia -Iz:/build/build/src/gfx/skia/skia/include/config -Iz:/build/build/src/gfx/skia/skia/include/core -Iz:/build/build/src/gfx/skia/skia/include/gpu -Iz:/build/build/src/gfx/skia/skia/include/utils -Iz:/build/build/src/obj-firefox/dist/include  -Iz:/build/build/src/obj-firefox/dist/include/nspr -Iz:/build/build/src/obj-firefox/dist/include/nss        -MD -FI z:/build/build/src/obj-firefox/mozilla-config.h -DMOZILLA_CLIENT -deps.deps/Unified_cpp_dom_canvas5.obj.pp  -TP -nologo -wd5026 -wd5027 -Zc:sizedDealloc- -wd4091 -wd4577 -D_HAS_EXCEPTIONS=0 -W3 -Gy -Zc:inline -utf-8 -Gw -wd4251 -wd4244 -wd4267 -wd4800 -wd4595 -we4553 -GR-  -Z7 -O1 -Oi -Oy- -WX -Iz:/build/build/src/obj-firefox/dist/include/cairo -wd4312   z:/build/build/src/obj-firefox/dom/canvas/Unified_cpp_dom_canvas5.cpp
13:16:35     INFO -  Unified_cpp_xpcom_threads0.cpp
13:16:35     INFO -  z:/build/build/src/xpcom/threads/ThreadStackHelper.cpp(205): error C3861: 'AcquireStackWalkWorkaroundLock': identifier not found
13:16:35     INFO -  z:/build/build/src/xpcom/threads/ThreadStackHelper.cpp(226): error C3861: 'ReleaseStackWalkWorkaroundLock': identifier not found
13:16:35     INFO -  z:/build/build/src/config/rules.mk:999: recipe for target 'Unified_cpp_xpcom_threads0.obj' failed
Flags: needinfo?(mconley)
This also often fails toolkit/components/telemetry/tests/unit/test_ThreadHangStats.js | check_results - [check_results : 89] false == true

https://treeherder.mozilla.org/logviewer.html#?job_id=79964686&repo=autoland
Comment on attachment 8840927 [details]
Bug 1312883 - Update test_ThreadHangStats.js for the new format for nativeStacks.

https://reviewboard.mozilla.org/r/115322/#review116756
Attachment #8840927 - Flags: review?(snorp) → review+
Pushed by mconley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/de5f5cec4080
MOZ_THREADSTACKHELPER_NATIVE should only be true on Windows. r=jchen
https://hg.mozilla.org/integration/autoland/rev/31f5518f9794
Remove all thread context processing from ThreadStackHelper. r=jchen
https://hg.mozilla.org/integration/autoland/rev/141d1bcc6dde
Use MozStackWalk to gather native stacks on Windows. r=gfritzsche,jchen,liuche
https://hg.mozilla.org/integration/autoland/rev/3b85c9efa7dd
Document the nativeStack property on the threadHangStats object. r=gfritzsche
https://hg.mozilla.org/integration/autoland/rev/392ec4e96ddb
Update test_ThreadHangStats.js for the new format for nativeStacks. r=snorp
Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/30fd632d64eb
Update test_ThreadHangStats.js for the new format for nativeStacks: disable test on non-Windows systems. r=xpcshell-fix
Flags: needinfo?(mconley)
See Also: → 1346415
You need to log in before you can comment on or make changes to this bug.