Closed Bug 872496 Opened 11 years ago Closed 11 years ago

Allow early registration of stack tops, to improve native unwind quality

Categories

(Core :: Gecko Profiler, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla24

People

(Reporter: jseward, Assigned: jseward)

References

Details

Attachments

(1 file, 2 obsolete files)

Currently, native unwind often produces incomplete unwinds or fails
entirely.  This happens because the stack chunk copied from the
sampled thread is not big enough, ending at too low an address.  That
in turn happens because the "maximum safe address" is determined
essentially by noting the stack pointer value at the point each thread
registers itself for profiling.  In the case of the main thread, this
is done in a deeply nested call stack and so records a SP value which
is far too low.

The ideal solution would be for all threads to register/deregister
themselves with the profiler in their root functions.  That is
unfortunately too intrusive -- would require hacking around NSPR.
This proposal is for a simpler mechanism, in which threads can if they
want "pre-register" a high (good) SP value without requesting
profiling at that point.
Attached patch Patch (obsolete) — Splinter Review
Patch which does exactly that:

* adds a new function profiler_pre_register_thread_stack_top

* calls it from XREMain::XRE_main

* one other tweak: in UnwinderThread2.cpp
    end = (end & ~(uintptr_t)4095) + (uintptr_t)4095;
  rounds up the copied stack area to the end of the registered
  page, which can't be dangerous and might get us a couple of
  extra frames in some situations.

Hence this improves the situation on the main thread, where the
problem appears worst, but not on any other thread (currently).
Does not make any changes to NSPR or anywhere else outside the
profiler, except XREMain.
Attachment #749835 - Flags: feedback?(bgirard)
Comment on attachment 749835 [details] [diff] [review]
Patch

Review of attachment 749835 [details] [diff] [review]:
-----------------------------------------------------------------

I think the API could be cleaner. Let me know what you think.

This is a good time to split up thread registration and activating a thread for profiling.

::: toolkit/xre/nsAppRunner.cpp
@@ +3886,5 @@
>  XREMain::XRE_main(int argc, char* argv[], const nsXREAppData* aAppData)
>  {
>    GeckoProfilerInitRAII profilerGuard;
>    PROFILER_LABEL("Startup", "XRE_Main");
> +  profiler_pre_register_thread_stack_top(&argc);

This thread will already be fully registered by the profiler init just above it. I'm not sure how this doesn't hit your asserts.

::: tools/profiler/UnwinderThread2.cpp
@@ +176,5 @@
>  }
>  
>  void uwt__register_thread_for_profiling(void* stackTop)
>  {
> +  thread_register_for_profiling(stackTop, true/*isFullReg*/);

NIT: We prefer to use enum instead of bool parameters to make the call site more readable.

@@ +366,5 @@
> +
> +typedef
> +  struct {
> +    pthread_t thrId;
> +    bool      fullyRegd;  // false=>PreReg, true=>Full

IMO it would be better to register threads that my be selected for profiling and then exposing a way to 'select' threads for profiling. Rather then using pre-registration/full-registration.

This means that we would also split the registration function in UnwinderThread2 into two functions. Right now it's really doing two different things.

@@ +541,5 @@
> +    /* Assert that the pre-registration supplied a stack top value
> +       which is better than the one we have now -- since that's the
> +       whole purpose of this preregistration mechanism. */
> +    MOZ_ASSERT(g_knownThreads[i].stackTop != NULL);
> +    MOZ_ASSERT(stackTop <= g_knownThreads[i].stackTop);

I don't like this assert. See 'uwt__register_thread_for_profiling'.

@@ +758,5 @@
> +      // Round |end| up to the end of the containing page.  We might
> +      // as well do this -- there's no danger of a fault, and we might
> +      // get a few more base-of-the-stack frames as a result.  This
> +      // assumes that no target has a page size smaller than 4096.
> +      end = (end & ~(uintptr_t)4095) + (uintptr_t)4095;

Why not do this once in the registration?

::: tools/profiler/UnwinderThread2.h
@@ +39,5 @@
>  
>  // Registers a sampler thread for profiling.  Threads must be registered
>  // before they are allowed to call utb__acquire_empty_buffer or
>  // utb__release_full_buffer.
>  void uwt__register_thread_for_profiling(void* stackTop);

I don't think we should require the user to give us another stackTop. Currently you ask them to provide a stackTop that is as accurate or worse but no better. If we let the user pass in any bad stacktop then it's not useful results. No point in checking if they lied in pre-registration.

This isn't exposed externally but we should still fix it in platform.cpp 'mozilla_sampler_start'.
Attachment #749835 - Flags: feedback?(bgirard) → feedback-
Attached patch Patch v2 (obsolete) — Splinter Review
Early registration of stack tops, second attempt.  This is a new
implementation that is unrelated to the previous patch.

This one does not add any functions to the the profiler's API (the
profiler_* functions).  Instead it uses the the existing
profiler_register_thread function, but adds a second argument to pass
in the stack base.

Because initialisation of the profiler also registers the calling
thread, profiler_init() also has to pass in a stack top argument, as
does the GeckoProfilerInitRAII constructor.

Registration and un-registration events are passed through the
profiler to UnwinderThread2.cpp, as before.  The routines in there
(thread_register_for_profiling, thread_unregister_for_profiling) have
been made robust against duplicate registration and against
deregistration of non-registered threads.

mozilla_sampler_start no longer calls uwt__register_thread_for_profiling.

UnwinderThread2.cpp can track up to 20 registered threads.  I hope
that is enough.  So far on Linux I have seen a max of 10 and on
Android only 4.

One minor thing I wasn't clear about is that
mozilla_sampler_print_location1, mozilla_sampler_start and
mozilla_sampler_stop appear to call profiler_init even though (afaics)
the profiler should already have been initialised.  Hence they have no
way to pass a valid stack value for the initial thread.  So I pass
NULL there and have thread_register_for_profiling() ignore such calls.
It seems to work OK.
Attachment #749835 - Attachment is obsolete: true
Attachment #750418 - Flags: feedback?(bgirard)
Comment on attachment 750418 [details] [diff] [review]
Patch v2

Review of attachment 750418 [details] [diff] [review]:
-----------------------------------------------------------------

Took a quick look waiting for my connection. This looks good :)
Attachment #750418 - Flags: feedback?(bgirard) → feedback+
(In reply to Julian Seward from comment #3)
> One minor thing I wasn't clear about is that
> mozilla_sampler_print_location1, mozilla_sampler_start and
> mozilla_sampler_stop appear to call profiler_init even though (afaics)
> the profiler should already have been initialised.  Hence they have no
> way to pass a valid stack value for the initial thread.  So I pass
> NULL there and have thread_register_for_profiling() ignore such calls.
> It seems to work OK.

Before it was just easier to initialize the profiler then to make sure these calls we're always made after initializing the profiler. But this is a good reason to change this.
Attached patch Patch, v3Splinter Review
https://tbpl.mozilla.org/?tree=Try&rev=98ceca502355

Changes compared to previous version:

* dynamic allocation of UnwinderThread2.cpp's array of threads, so
  arbitrary numbers of threads can register/unregister at any time

* build fixes for Windows
Attachment #750418 - Attachment is obsolete: true
Attachment #751849 - Flags: review?(bgirard)
Comment on attachment 751849 [details] [diff] [review]
Patch, v3

Review of attachment 751849 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with some optional suggested improvements.

::: toolkit/xre/nsAppRunner.cpp
@@ +3885,5 @@
>  int
>  XREMain::XRE_main(int argc, char* argv[], const nsXREAppData* aAppData)
>  {
> +  char aLocal;
> +  GeckoProfilerInitRAII profilerGuard(&aLocal);

We could of used 'this' in GeckoProfilerInitRAII and get a similar stack address but this is fine.

::: tools/profiler/UnwinderThread2.cpp
@@ +582,5 @@
> +
> +    if (old_size != g_stackLimitsSize) {
> +      /* We've been outraced.  Instead of trying to deal in-line with
> +         this extremely rare case, just start all over again by
> +         tail-calling this routine. */

Just putting the above code in a while loop shouldn't of been really hard but this is fine.

@@ +636,5 @@
> +  bool found = false;
> +  pthread_t me = pthread_self();
> +  for (i = 0; i < g_stackLimitsUsed; i++) {
> +    if (g_stackLimits[i].thrId == me)
> +      break;

IMO it would be cleaner if the code that used I was in here. You wouldn't need to check i < g_StackLimitUsed either.
Attachment #751849 - Flags: review?(bgirard) → review+
(In reply to Benoit Girard (:BenWa) from comment #7)
> IMO it would be cleaner if the code that used I was in here. You wouldn't
> need to check i < g_StackLimitUsed either.

I looked at this, but afaics I is used in the entire rest of the function
-- that loop basically finds the registration entry for the sampled thread.
https://hg.mozilla.org/mozilla-central/rev/383bed640c7b
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla24
For reference, there's a more direct way to find a thread's stack bounds on at least glibc and Android systems: pthread_attr_getstack and pthread_getattr_np (the latter is a GNU extension).
(In reply to Jed Davis [:jld] from comment #11)
> For reference, there's a more direct way to find a thread's stack bounds on
> at least glibc and Android systems: pthread_attr_getstack and
> pthread_getattr_np (the latter is a GNU extension).

The former is useless without the latter, so this effectively means it doesn't work for Android.
None of that will work on Windows, and anyway we can avoid system dependencies
by registering stacks close to startup.  Ideally we'd instrument NSPR to do this,
but that requires buy-in from the NSPR people, so that's a longer-term thing.
On Windows, the thread information block contains all the necessary information http://en.wikipedia.org/wiki/Win32_Thread_Information_Block

That leaves OSX and Android.
OSX has pthread_get_stackaddr_np and pthread_get_stacksize_np.

That leaves Android.
And Android has void *__get_stack_base(int  *p_stack_size), which returns both the address and size.
Doesn't js/src/jsnativestack.cpp do all this already?
(In reply to Mike Hommey [:glandium] from comment #12)
> (In reply to Jed Davis [:jld] from comment #11)
> > For reference, there's a more direct way to find a thread's stack bounds on
> > at least glibc and Android systems: pthread_attr_getstack and
> > pthread_getattr_np (the latter is a GNU extension).
> 
> The former is useless without the latter, so this effectively means it
> doesn't work for Android.

I seem to have been unclear: it's originally a GNU extension, but Android also implements it.  (This is what I was trying to express with "at least glibc and Android".)  I know this because I've used it, and also read the source of the bionic version that B2G uses to see if it was doing anything that would be bad in the context of NS_StackWalk.

__get_stack_base also works, but it's a private interface (the header declaring it isn't installed into the sysroot, presumably to discourage outside use), so I wanted to avoid it for bug 810526 if possible.

Also, as long as I'm disclaiming things: glibc documents pthread_getattr_np as possibly reading and parsing /proc/self/maps if called from the main thread, though I haven't tried to confirm or deny this; bionic just reads fields from TLS.
I think what we're doing is a bit unclear but I don't think it's really unsafe. It's also very portable. It's bad if we care about unwind past XRE_main and seeing stuff like pthread_create but we don't so IMO I rather keep our approach.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: