Closed Bug 769241 Opened 12 years ago Closed 12 years ago

implement fast & functional stackwalk on ARM

Categories: Core :: Gecko Profiler, defect
Platform: ARM / Android
Severity: normal
Status: RESOLVED WONTFIX
People: (Reporter: vlad, Unassigned)
Attachments: (2 files)

The gecko profiler is really useful for Fennec and B2G, given that the general profiling situation there is poor.  However, it would be really helpful to have full stack unwinding support there so that we can get a more detailed view into what's going on.

Currently there's some code that uses libunwind; unfortunately, libunwind seems to crash many times during a single stack walk, making it way too slow to be useful.  Julian mentioned that Valgrind has very fast ARM stack walking, but it's license-incompatible (GPL), and may be hard to extract from Valgrind anyway.  Other useful bits of info: we already have DWARF parsing code in breakpad, and potentially CFI parsing code in various places.  We should figure out how to combine all of our knowledge into working fast and functional stackwalk code on ARM.

A first suggestion was to investigate why libunwind is crashing, and see if we can make it not do that; then evaluate its performance, potentially adding caching as needed.
I'll try it out on vanilla arm-linux, first of all.  Are there any more
details on how/why libunwind crashes?
I don't remember the specifics, but libunwind used to return invalid memory addresses in a stack walk step, and on the next step it would crash on them (and it also would crash some times when interpreting the CFI unwinding information).  I had to do this <https://github.com/ehsan/libunwind/commit/c55e24f44eee416ec2883b704bc570a20a1e2af5> in order to recover from those crashes (warning: horrible horrible hack).  Looking at other libunwind architectures, it seems like some of them have similar protections in place as well.
And my hunch is that these signals are why stackwalking using libunwind is currently very slow.
I got a libunwind-based build on x86_64-linux, after some unpleasant
hand-to-hand combat with the build system and a couple of nasty
kludges.

In void TableTicker::doBacktrace(ThreadProfile &aProfile, TickSample*
aSample) I had to make and use x86_64 equivalents for REPLACE_REG.

Some observations:

* With the new REPLACE_REG in place, the stack traces that come out
  of libunwind are complete garbage.  I don't know why; I assume I 
  somehow screwed up the initial unwind state.

* I commented out the REPLACE_REG assignments, so as to start from
  the sighandler provided context.  And that seems to work fine.  
  Got plausible profiles, no segfaulting, not too much complaining
  from Valgrind.

* It appears that libunwind at least on x86_64 has its own scheme for
  checking whether a page is safe to access -- msync_validate in
  tools/profiler/libunwind/src/src/x86_64/Ginit.c.  So at least
  in principle it shouldn't fault (if we don't think too hard about
  multithread badness, I guess).  Not sure how that interacts with
  Ehsan's signal handler hack.

* Having a max unwind depth of 1000 strikes me as exacerbating the
  performance problems.  If the unwinder gets hosed, stuck in a loop,
  goes into outer space, or whatever, it might well generate 1000
  bogus entries, and hence segfaults, for no purpose.  I saw that
  happen several times.  So I limited it to 100.  Can we live with that?
Just to clarify, I started with x86_64-linux as that's easy(ish) to
build and I wanted to check that CFI based unwinding in libunwind isn't
broken in some very obvious way, which it doesn't seem to be.  Will now
transition to ARM/Android as that's the real target.
Note that your x86_64 code is still useful -- we should get it checked in.
(In reply to comment #4)
> I got a libunwind-based build x86_64-linux, after some unpleasant
> hand-to-hand combat with the build system and a couple of nasty
> kludges.
> 
> In void TableTicker::doBacktrace(ThreadProfile &aProfile, TickSample*
> aSample) I had to make and use and x86_64 equivalents for REPLACE_REG.

Makes sense.

> Some observations:
> 
> * With the new REPLACE_REG in place, the stack traces that come out
>   of libunwind are complete garbage.  I don't know why; I assume I 
>   somehow screwed up the initial unwind state.
> 
> * I commented out the REPLACE_REG assignments, so as to start from
>   the sighandler provided context.  And that seems to work fine.  
>   Got plausible profiles, no segfaulting, not too much complaining
>   from Valgrind.

That almost certainly means that the initial state is bad somehow.  libunwind includes some code which is supposed to deal with signal handler frames, and in my experience that code didn't work very well for ARM Android signal handlers, which is why I bothered to set up the initial state.

> * It appears that libunwind at least on x86_64 has its own scheme for
>   checking whether a page is safe to access -- msync_validate in
>   tools/profiler/libunwind/src/src/x86_64/Ginit.c.  So at least
>   in principle it shouldn't fault (if we don't think too hard about
>   multithread badness, I guess).  Not sure how that interacts with
>   Ehsan's signal handler hack.

Yeah, libunwind has that protection in many of its backends.  I didn't find an API that I could use on Android to give me information about whether an address is mapped or not, which is why I resorted to the signal handler hack.  If you know of a way to get that information on Android, that'd be super!

> * Having a max unwind depth of 1000 strikes me as exacerbating the
>   performance problems.  If the unwinder gets hosed, stuck in a loop,
>   goes into outer space, or whatever, it might well generate 1000
>   bogus entries, and hence segfaults, for no purpose.  I saw that
>   happen several times.  So I limited it to 100.  Can we live with that?

Yeah, we can experiment with that and if it's too low in practice, we can increase it gently (i.e., not ten times!)
(In reply to comment #5)
> Just to clarify, I started with x86_64-linux as that's easy(ish) to
> build and I wanted to check that CFI based unwinding in libunwind isn't
> broken in some very obvious way, which it doesn't seem to be.  Will now
> transition to ARM/Android as that's the real target.

Note that on ARM, we fallback to the exidx based unwinding if we fail to do CFI unwinding.  You can turn that off initially if you only want to use the shared CFI unwinding code.
(In reply to Ehsan Akhgari [:ehsan] from comment #7)
> > * It appears that libunwind at least on x86_64 has its own scheme for
> >   checking whether a page is safe to access -- msync_validate in
> >   tools/profiler/libunwind/src/src/x86_64/Ginit.c.  So at least
> >   in principle it shouldn't fault (if we don't think too hard about
> >   multithread badness, I guess).  Not sure how that interacts with
> >   Ehsan's signal handler hack.
> 
> Yeah, libunwind has that protection in many of its backends.  I didn't find
> an API that I could use on Android to give me information about whether an
> address is mapped or not, which is why I resorted to the signal handler
> hack.  If you know of a way to get that information on Android, that'd be
> super!

The x86_64-linux scheme uses sys_msync to test if a page is mapped or
not.  Since that's a direct call into the kernel (AFAICS) I think the
same game should work on Android.  One wild guess about why libunwind
doesn't already do this for ARM is that it is set up to deal with low
end embedded ARM implementations, which don't necessarily have an MMU
and hence for which this trick won't work.  But all the targets we
care about have MMUs.
(In reply to comment #9)
> (In reply to Ehsan Akhgari [:ehsan] from comment #7)
> > > * It appears that libunwind at least on x86_64 has its own scheme for
> > >   checking whether a page is safe to access -- msync_validate in
> > >   tools/profiler/libunwind/src/src/x86_64/Ginit.c.  So at least
> > >   in principle it shouldn't fault (if we don't think too hard about
> > >   multithread badness, I guess).  Not sure how that interacts with
> > >   Ehsan's signal handler hack.
> > 
> > Yeah, libunwind has that protection in many of its backends.  I didn't find
> > an API that I could use on Android to give me information about whether an
> > address is mapped or not, which is why I resorted to the signal handler
> > hack.  If you know of a way to get that information on Android, that'd be
> > super!
> 
> The x86_64-linux scheme uses sys_msync to test if a page is mapped or
> not.  Since that's a direct call into the kernel (AFAICS) I think the
> same game should work on Android.  One wild guess about why libunwind
> doesn't already do this for ARM is that it is set up to deal with low
> end embedded ARM implementations, which don't necessarily have an MMU
> and hence for which this trick won't work.  But all the targets we
> care about have MMUs.

My impression from working on libunwind's ARM backend is that it's very minimally tested, and probably only on a single platform (my best guess is some flavor of desktop linux on ARM?).  It was never tested on Android, I'm sure, as it was relying on things like dl_iterate_phdr which doesn't exist in Android's libc.

That being said, I have not tested whether Android's kernel supports msync, but I have no reason to believe that it doesn't.
(In reply to Ehsan Akhgari [:ehsan] from comment #10)
> (In reply to comment #9)
> > (In reply to Ehsan Akhgari [:ehsan] from comment #7)
> > > > * It appears that libunwind at least on x86_64 has its own scheme for
> > > >   checking whether a page is safe to access -- msync_validate in
> > > >   tools/profiler/libunwind/src/src/x86_64/Ginit.c.  So at least
> > > >   in principle it shouldn't fault (if we don't think too hard about
> > > >   multithread badness, I guess).  Not sure how that interacts with
> > > >   Ehsan's signal handler hack.
> > > 
> > > Yeah, libunwind has that protection in many of its backends.  I didn't find
> > > an API that I could use on Android to give me information about whether an
> > > address is mapped or not, which is why I resorted to the signal handler
> > > hack.  If you know of a way to get that information on Android, that'd be
> > > super!
> > 
> > The x86_64-linux scheme uses sys_msync to test if a page is mapped or
> > not.  Since that's a direct call into the kernel (AFAICS) I think the
> > same game should work on Android.  One wild guess about why libunwind
> > doesn't already do this for ARM is that it is set up to deal with low
> > end embedded ARM implementations, which don't necessarily have an MMU
> > and hence for which this trick won't work.  But all the targets we
> > care about have MMUs.
> 
> My impression from working on libunwind's ARM backend is that it's very
> minimally tested, and probably only on a single platform (my best guess is
> some flavor of desktop linux on ARM?).  It was never tested on Android, I'm
> sure, as it was relying on things like dl_iterate_phdr which doesn't exist
> in Android's libc.

We have it on android fennec, and it would be easy to add to b2g (either in the system linker or in gecko).

> That being said, I have not tested whether the Android's kernel supports
> msync, but I have no reason to believe that it doesn't.

The kernel obviously supports it.  The libc might not expose a wrapper for it, though; after a quick check, it does.
Unwinding on arm-android does seem to work to some extent, especially
when I reduced the sampling frequency to 100Hz and the max stack depth
to 100 entries.

A much more serious problem [than segfaulting and resulting performance
loss] is that fennec very often hangs after a short while.  I wonder if
this is the same problem as bug 757186.

I also noticed w.r.t. sigsegv_handler() in
tools/profiler/libunwind/src/src/arm/Ginit.c that this isn't really right

    switch (PROCESSOR_MODE(uc->uc_mcontext.arm_cpsr)) {
      ...
      case 1: // Thumb
        pc += 2; // each instruction is 16 bits
        break;

Thumb instructions can be 2 or 4 bytes long.  I think the reason it happens
to work as-is is that the faults can only come from one of two instructions
in access_mem() a bit further down the file, and those both happen to be 
compiled into 2-byte instructions.  This is hardly a good thing to rely on,
though.
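Per the ARM architecture manual, whether a Thumb instruction is 2 or 4 bytes is decided by the top five bits of its first halfword: 0b11101, 0b11110 and 0b11111 introduce 32-bit Thumb-2 encodings, everything else is 16-bit. A corrected skip would look roughly like this (a sketch, not a patch against Ginit.c):

```c
#include <stdint.h>

/* Length in bytes of the Thumb instruction whose first halfword is hw.
 * Encodings whose top five bits are 0b11101, 0b11110 or 0b11111 are
 * 32-bit Thumb-2 instructions; all others are 16-bit. */
static unsigned thumb_insn_length(uint16_t hw) {
  unsigned top5 = hw >> 11;
  return (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F) ? 4 : 2;
}
```

With this, the handler could advance the PC by `thumb_insn_length(*(uint16_t *)pc)` instead of assuming 2 bytes.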
Depends on: 757186
I've been working with libunwind/SPS on Android for a while now.  Here
are some observations.  In general I'm not convinced that libunwind is
either as broken or as slow as is claimed above.

* The 1000 frame limit in TableTicker.cpp is way too high.  Since the
  unwinder can sometimes get stuck in a loop, asking it to produce
  1000 frames has the effect of wasting a lot of time filling most of
  them with the same (possibly bogus) value.  I set it to 100, and
  even then the vast majority if not all traces are shorter.

* The 1000Hz sampling rate is also too high.  I set it to 100Hz,
  although I am not sure to what extent it is adhered to.

* With those settings, performance is perfectly adequate for profiling
  on a Motorola Xoom (1 GHz dual core A9).  It idles at around 26% of
  one core with the above settings.

* The segfault handler is hit, but not a lot.  For a profile run
  involving 2400 samples (loading news.bbc.co.uk and then moving to a
  different page), it was hit 98 times, that is, 4% of samples.
  Watching the log, it is hit very roughly once per second.

* I ran it on a hacked version of Valgrind, so that it can ask
  Valgrind's unwinder for a stack whenever it asks libunwind for one.
  The stacks are then compared.  This makes it possible to assess the
  accuracy of libunwind's output.  I'm not claiming V's unwinder is
  infallible (it certainly has problems) but this gives a way to see
  if libunwind is "obviously broken".

  For 2400 samples, all of which have stack traces of 5 or more
  frames, almost all (way more than 99%) are identical between
  libunwind and Valgrind.  Conclusion is that if libunwind is broken,
  it's not broken in a big way.


I see the following issues for successful integration of libunwind
into SPS:

* I spent a lot of time investigating deadlocks.  It's clear that
  libunwind is not signal safe, despite claims to that effect in its
  documentation.  In particular it allocates memory from the heap and
  deadlocks as soon as it takes a sample when the thread being sampled
  is inside malloc (libc or jemalloc) and is holding the malloc lock.

  Avoiding these deadlocks is going to require properly isolating
  libunwind from the outside world, something that we really ought to
  be able to enforce at compile time.

  There are also hangs related to Awesomebar and other activities
  (bug 773308) which I was not able to make sense of.  I think they
  are different from the malloc hangs, though.

* It appears (AIUI, at least) that Cleopatra (the GUI) has an
  algorithm to merge multiple stack traces into a tree, and it assumes
  that all stack traces end up back at some common root (main,
  basically).  This works OK for pseudostack unwinding, but produces a
  bunch of smaller trees with libunwind, because many stack traces are
  incomplete.  This is probably unavoidable, so maybe the tree
  construction algorithm needs to be augmented to deal
  (probabilistically) with partial traces.

* The build system integration of libunwind is problematic.
  In particular things like

    make -C ff-opt/tools/profiler && make -C ff-opt/toolkit/library

  fail to rebuild properly after changes to TableTicker.cpp or
  the libunwind bits.  The only way I found to get a reliable rebuild
  was "make -f client.mk build" at the top level, which is very slow.
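On the malloc-deadlock point: the attachment below mentions a "quick n dirty private allocator". A minimal async-signal-safe bump allocator of that general shape might look like this (a hypothetical sketch; the actual patch is attached to the bug):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a signal-safe private allocator: carve
 * allocations out of a static pool so the unwinder never touches the
 * (possibly locked) system heap.  No free(); reset between samples. */
#define POOL_SIZE (64 * 1024)
static unsigned char pool[POOL_SIZE];
static size_t pool_used = 0;

static void *pool_alloc(size_t n) {
  n = (n + 7) & ~(size_t)7;      /* keep 8-byte alignment */
  if (pool_used + n > POOL_SIZE)
    return NULL;                  /* out of pool: fail, never fall back to malloc */
  void *p = pool + pool_used;
  pool_used += n;
  return p;
}

static void pool_reset(void) { pool_used = 0; }
```

Because the pool is static and the sampler takes one sample at a time, no locks are needed, so it is safe to call with the sampled thread stopped inside malloc.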
That's extremely encouraging; thanks for diving into this!  It sounds like fixing the heap memory issue (and potentially the naive 2-byte instruction assumption issue in comment #12) are the main things that we need before we can integrate it?

Do you have any patches that you applied to either libunwind or the profiler that would be helpful in reproducing your results?  I'd like to play with it at some point soon.
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #14)
> It sounds like fixing the heap memory issue (and potentially the naive
> 2-byte instruction assumption issue in comment #12) are the main
> things that we need before we can integrate it?

Yes.  The 2-byte insn thing is trivial and not critical -- I'll put up
a patch for it in the next day or so.

The heap memory stuff is more difficult, at least to do properly.
What I'd like to do ideally is build libunwind with only a minimal set
of headers (no stdlib, stdio, etc) so it simply can't allocate heap,
either directly or indirectly.  Then add an object that carries round
a bundle of replacement functions that we control (malloc, free,
fread, etc), so as to fix up the resulting link errors.  That way,
provided nobody re-adds #include <stdlib.h> etc, we are guaranteed
safe at compile time.  Unfortunately it's probably quite an intrusive
thing to do.
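The "bundle of replacement functions" idea could be shaped roughly like this; every name here is illustrative, not from any actual patch:

```c
#include <stddef.h>

/* Sketch: libunwind would be compiled against only this struct, so
 * every allocation/IO call is one the host controls.  With stdlib
 * headers excluded, any stray malloc() becomes a link error. */
typedef struct {
  void  *(*malloc_fn)(size_t);
  void   (*free_fn)(void *);
  size_t (*read_fn)(void *buf, size_t n, void *stream);
} UnwinderCallouts;

static UnwinderCallouts callouts;

/* The host installs its signal-safe implementations at init time. */
void unwinder_set_callouts(const UnwinderCallouts *c) { callouts = *c; }

/* Inside the isolated unwinder, all heap use goes through the bundle. */
void *unw_malloc(size_t n) { return callouts.malloc_fn(n); }
void  unw_free(void *p)    { callouts.free_fn(p); }
```

The compile-time guarantee comes from the header diet, not the struct itself: if nobody re-adds #include <stdlib.h>, the unwinder physically cannot reach the real heap.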

> Do you have any patches that you applied to either libunwind or the profiler
> that would be helpful in reproducing your results?  I'd like to play with it
> at some point soon.

I'll attach the hacky patches to libunwind and the profiler in the
next comment.  I can also attach the patches for Valgrind if you want
to do the backtrace-diffing game, although you'll need a lot of patience :)
Provides a quick n dirty private allocator for
tools/profiler/libunwind/src/src/dwarf/Gfind_proc_info-lsb.c
so as to stop it deadlocking every ten seconds.
(In reply to comment #13)
> I've been working with libunwind/SPS on Android for a while now.  Here
> are some observations.  In general I'm not convinced that libunwind is
> either as broken or as slow as is claimed above.
> 
> * The 1000 frame limit in TableTicker.cpp is way too high.  Since the
>   unwinder can sometimes get stuck in a loop, asking it to produce
>   1000 frames has the effect of wasting a lot of time filling most of
>   them with the same (possibly bogus) value.  I set it to 100, and
>   even then the vast majority if not all traces are shorter.

I think this is an acceptable adjustment.

> * The 1000Hz sampling rate is also too high.  I set it to 100Hz,
>   although I am not sure to what extent it is adhered to.

Hmm, 100Hz would be a bit too low for measuring a lot of things.  That would sample at 10ms intervals, which can make profiling things which can cause us to skip a frame for example very hard. :(

> * With those settings, performance is perfectly adequate for profiling
>   on a Motorola Xoom (1 GHz dual core A9).  It idles at around 26% of
>   one core with the above settings.
> 
> * The segfault handler is hit, but not a lot.  For a profile run
>   involving 2400 samples (loading news.bbc.co.uk and then moving to a
>   different page), it was hit 98 times, that is, 4% of samples.
>   Watching the log, it is hit very roughly once per second.

Hmm, that's great, and a lot lower than what I used to see.  I don't know what causes the difference, but I'm not going to question a good thing.

> * I ran it on a hacked version of Valgrind, so that it can ask
>   Valgrind's unwinder for a stack whenever it asks libunwind for one.
>   The stacks are then compared.  This makes it possible to assess the
>   accuracy of libunwind's output.  I'm not claiming V's unwinder is
>   infallible (it certainly has problems) but this gives a way to see
>   if libunwind is "obviously broken".
> 
>   For 2400 samples, all of which have stack traces of 5 or more
>   frames, almost all (way more than 99%) are identical between
>   libunwind and Valgrind.  Conclusion is that if libunwind is broken,
>   it's not broken in a big way.

That's great news!

> I see the following issues for successful integration of libunwind
> into SPS:
> 
> * I spent a lot of time investigating deadlocks.  It's clear that
>   libunwind is not signal safe, despite claims to that effect in its
>   documentation.  In particular it allocates memory from the heap and
>   deadlocks as soon as it takes a sample when the thread being sampled
>   is inside malloc (libc or jemalloc) and is holding the malloc lock.

Yeah this will be tricky.

> * It appears (AIUI, at least) that Cleopatra (the GUI) has an
>   algorithm to merge multiple stack traces into a tree, and it assumes
>   that all stack traces end up back at some common root (main,
>   basically).  This works OK for pseudostack unwinding, but produces a
>   bunch of smaller trees with libunwind, because many stack traces are
>   incomplete.  This is probably unavoidable, so maybe the tree
>   construction algorithm needs to be augmented to deal
>   (probabilistically) with partial traces.

You should just see multiple roots in Cleopatra right?  AFAIK we should be able to handle those cases.

> * The build system integration of libunwind is problematic.
>   In particular things like
> 
>     make -C ff-opt/tools/profiler && make -C ff-opt/toolkit/library
> 
>   fail to rebuild properly after changes to TableTicker.cpp or
>   the libunwind bits.  The only way I found to get a reliable rebuild
>   was "make -f client.mk build" at the top level, which is very slow.

Ouch.  Unfortunately I don't know very much about our build system, but perhaps you can ask one of the build system peers to see if they can spot any obvious mistakes?
(In reply to comment #15)
> I'll attach the hacky patches to libunwind and the profiler in the
> next comment.  I can also attach the patches for Valgrind if you want
> to do the backtrace-diffing game, although you'll need a lot of patience :)

I'd appreciate if you can submit your libunwind changes to my github repo as a pull request and then update the in-tree version from the github repo, as the in-tree version is supposed to be a copy of the github repo.  (This will make it easier to upstream our libunwind changes at some point in the future.)
This is what I've been using with TableTicker to do the backtrace
diffing.  You'll need to edit out the relevant bits before building
(should be obvious), else you'll have something that's unbuildable.

Also:

* decreases number of grabbed frames from 1000 to 100

* adds a bit more logcat printing, in particular so we can see
  when the segfault handler is invoked.
(In reply to Ehsan Akhgari [:ehsan] from comment #17)

> > * The 1000Hz sampling rate is also too high.  I set it to 100Hz,
> >   although I am not sure to what extent it is adhered to.
> 
> Hmm, 100Hz would be a bit too low for measuring a lot of things.  That would
> sample at 10ms intervals, which can make profiling things which can cause us
> to skip a frame for example very hard. :(

We can maybe come back to this once the stability/deadlock issues have
been fixed up.  Besides, looking over my diffs just now, I think the
100Hz change got left behind in an old tree, so maybe it is not
necessary.  Not sure what's going on here.


> > * I spent a lot of time investigating deadlocks.  It's clear that
> >   libunwind is not signal safe, despite claims to that effect in its
> >   documentation.  In particular it allocates memory from the heap and
> >   deadlocks as soon as it takes a sample when the thread being sampled
> >   is inside malloc (libc or jemalloc) and is holding the malloc lock.
> 
> Yeah this will be tricky.

I see it as being tedious and intrusive -- plumbing, basically, so as
to get our bundle of replacement functions everywhere we need.  But
not particularly tricky.  Is there some specific difficulty you
envisage?  I have not looked much at the libunwind code, true.


> You should just see multiple roots in Cleopatra right?  AFAIK we should be
> able to handle those cases.

Yes, it shows multiple roots.  Not so pretty, but not critical either.


> Ouch.  Unfortunately I don't know very much about our build system, but
> perhaps you can ask one of the build system peers to see if they can spot
> any obvious mistakes?

I'll try to get TedM on the case :)
(In reply to Ehsan Akhgari [:ehsan] from comment #18)
> I'd appreciate if you can submit your libunwind changes to my [...]

Will do, as soon as I have something clean to submit.  As you can see
from the two patches I just put up, it's still at the investigation
stage.
(In reply to Ehsan Akhgari [:ehsan] from comment #17)
> > * The segfault handler is hit, but not a lot.  For a profile run
> >   involving 2400 samples (loading news.bbc.co.uk and then moving to a
> >   different page), it was hit 98 times, that is, 4% of samples.
> >   Watching the log, it is hit very roughly once per second.
> 
> Hmm, that's great, and a lot lower than what I used to see.  I don't know
> why the difference is, but I'm not going to question a good thing.

The difference between our observations makes me kinda nervous...
maybe I did something wrong.  Not sure what though.  So it's good
if Vlad can try to repro what I did.
(In reply to Julian Seward from comment #13)
> I've been working with libunwind/SPS on Android for a while now.  Here
> are some observations.  In general I'm not convinced that libunwind is
> either as broken or as slow as is claimed above.
> 
> * The 1000 frame limit in TableTicker.cpp is way too high.  Since the
>   unwinder can sometimes get stuck in a loop, asking it to produce
>   1000 frames has the effect of wasting a lot of time filling most of
>   them with the same (possibly bogus) value.  I set it to 100, and
>   even then the vast majority if not all traces are shorter.
> 
> * The 1000Hz sampling rate is also too high.  I set it to 100Hz,
>   although I am not sure to what extent it is adhered to.

We use 1000Hz because in Pseudostack mode that is fast enough and the most useful. I think running at 100Hz with full stacks is very useful. When the fixes land I also want to experiment with increasing sampling to 1000Hz or so and only unwind the top 20 frames. We will be able to reconstruct the base of the stack by merging the pseudostack which we already do by default. With some additional work we can get it to aggregate properly.

> 
> * With those settings, performance is perfectly adequate for profiling
>   on a Motorola Xoom (1 GHz dual core A9).  It idles at around 26% of
>   one core with the above settings.
> 
> * The segfault handler is hit, but not a lot.  For a profile run
>   involving 2400 samples (loading news.bbc.co.uk and then moving to a
>   different page), it was hit 98 times, that is, 4% of samples.
>   Watching the log, it is hit very roughly once per second.
> 
> * I ran it on a hacked version of Valgrind, so that it can ask
>   Valgrind's unwinder for a stack whenever it asks libunwind for one.
>   The stacks are then compared.  This makes it possible to assess the
>   accuracy of libunwind's output.  I'm not claiming V's unwinder is
>   infallible (it certainly has problems) but this gives a way to see
>   if libunwind is "obviously broken".
> 
>   For 2400 samples, all of which have stack traces of 5 or more
>   frames, almost all (way more than 99%) are identical between
>   libunwind and Valgrind.  Conclusion is that if libunwind is broken,
>   it's not broken in a big way.
> 

That's excellent news. Some error is acceptable in profiling, and being at 99% accuracy is more than enough. Plus, since we merge in the pseudostack, we should be able to sanity-check cheaply.

> 
> I see the following issues for successful integration of libunwind
> into SPS:
> 
> * I spent a lot of time investigating deadlocks.  It's clear that
>   libunwind is not signal safe, despite claims to that effect in its
>   documentation.  In particular it allocates memory from the heap and
>   deadlocks as soon as it takes a sample when the thread being sampled
>   is inside malloc (libc or jemalloc) and is holding the malloc lock.
> 
>   Avoiding these deadlocks is going to require properly isolating
>   libunwind from the outside world, something that we really ought to
>   be able to enforce at compile time.

What do you think the best way to do that is? Writing a custom allocator?

> 
>   There are also hangs related to Awesomebar and other activities
>   (bug 773308) which I was not able to make sense of.  I think they
>   are different from the malloc hangs, though.

Yes, don't worry about these.

> 
> * It appears (AIUI, at least) that Cleopatra (the GUI) has an
>   algorithm to merge multiple stack traces into a tree, and it assumes
>   that all stack traces end up back at some common root (main,
>   basically).  This works OK for pseudostack unwinding, but produces a
>   bunch of smaller trees with libunwind, because many stack traces are
>   incomplete.  This is probably unavoidable, so maybe the tree
>   construction algorithm needs to be augmented to deal
>   (probabilistically) with partial traces.

Yes, it has bugs there. A lot of the code assumes that the non-inverted view only has one parent node. When that's not easy to enforce, I normally attach an anchor (Root) node to all the samples, but I'm going to be fixing this up in Cleopatra soon.

> 
> * The build system integration of libunwind is problematic.
>   In particular things like
> 
>     make -C ff-opt/tools/profiler && make -C ff-opt/toolkit/library
> 
>   fail to rebuild properly after changes to TableTicker.cpp or
>   the libunwind bits.  The only way I found to get a reliable rebuild
>   was "make -f client.mk build" at the top level, which is very slow.

I've never experienced that :(  That's how I already rebuild.
(In reply to Benoit Girard (:BenWa) from comment #23)
> We use 1000Hz because in Pseudostack mode that is fast enough and the most
> useful. I think running at 100Hz with full stacks is very useful. When the
> fixes land I also want to experiment with increasing sampling to 1000Hz or
> so and only unwind the top 20 frames. We will be able to reconstruct the
> base of the stack by merging the pseudostack which we already do by default.
> With some additional work we can get it to aggregate properly.

Honestly I'd love it if we could somehow give the stack address of the topmost pseudostack entry and have unwinding just stop when it hits that SP; generally as soon as we hit a pseudostack marker we know/don't care about anything above it.
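Since stacks grow downward, "stop at the topmost pseudostack entry" reduces to cutting the native trace at the first frame whose SP is at or above the marker's SP. A sketch of that stop condition (all names illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of vlad's suggestion: keep native frames only up to the
 * point where the frame SP reaches the SP recorded at the topmost
 * pseudostack entry; the pseudostack covers everything below that.
 * Stacks grow downward, so frames at or above marker_sp are older. */
static size_t unwind_until_marker(const uintptr_t *frame_sps, size_t nframes,
                                  uintptr_t marker_sp) {
  size_t i;
  for (i = 0; i < nframes; i++) {
    if (frame_sps[i] >= marker_sp)
      break;  /* reached the region the pseudostack already covers */
  }
  return i;   /* number of native frames to keep */
}
```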
(In reply to comment #20)
> (In reply to Ehsan Akhgari [:ehsan] from comment #17)
> 
> > > * The 1000Hz sampling rate is also too high.  I set it to 100Hz,
> > >   although I am not sure to what extent it is adhered to.
> > 
> > Hmm, 100Hz would be a bit too low for measuring a lot of things.  That would
> > sample at 10ms intervals, which can make profiling things which can cause us
> > to skip a frame for example very hard. :(
> 
> We can maybe come back to this once the stability/deadlock issues have
> been fixed up.  Besides, looking over my diffs just now, I think the
> 100Hz change got left behind in an old tree, so maybe it is not
> necessary.  Not sure what's going on here.

OK, /me crosses fingers for higher frequencies.

> > > * I spent a lot of time investigating deadlocks.  It's clear that
> > >   libunwind is not signal safe, despite claims to that effect in its
> > >   documentation.  In particular it allocates memory from the heap and
> > >   deadlocks as soon as it takes a sample when the thread being sampled
> > >   is inside malloc (libc or jemalloc) and is holding the malloc lock.
> > 
> > Yeah this will be tricky.
> 
> I see it as being tedious and intrusive -- plumbing, basically, so as
> to get our bundle of replacement functions everywhere we need.  But
> not particularly tricky.  Is there some specific difficulty you
> envisage?  I have not looked much at the libunwind code, true.

Nothing tricky in that sense.  It's just a lot of code to look at, and patch.  I wasn't implying that there would be any technical problems per se.

> > You should just see multiple roots in Cleopatra right?  AFAIK we should be
> > able to handle those cases.
> 
> Yes, it shows multiple roots.  Not so pretty, but not critical either.

Well, that's as good as things can get if we can't walk up the entire stack.  Out of curiosity, have you investigated why this happens?
(In reply to comment #21)
> (In reply to Ehsan Akhgari [:ehsan] from comment #18)
> > I'd appreciate if you can submit your libunwind changes to my [...]
> 
> Will do, as soon as I have something clean to submit.  As you can see
> from the two patches I just put up, it's still at the investigation
> stage.

Sounds good!
(In reply to comment #24)
> (In reply to Benoit Girard (:BenWa) from comment #23)
> > We use 1000Hz because in Pseudostack mode that is fast enough and the most
> > useful. I think running at 100Hz with full stacks is very useful. When the
> > fixes land I also want to experiment with increasing sampling to 1000Hz or
> > so and only unwind the top 20 frames. We will be able to reconstruct the
> > base of the stack by merging the pseudostack which we already do by default.
> > With some additional work we can get it to aggregate properly.
> 
> Honestly I'd love it if we could somehow give the stack address of the topmost
> pseudostack entry and have unwinding just stop when it hits that SP; generally
> as soon as we hit a pseudostack marker we know/don't care about anything above
> it.

Why do you think that's a good idea?  I mean the existing pseudo-stacks that we have are very much targeted at specific profiling scenarios, and I'd prefer us focus on getting a general purpose profiler which doesn't make assumptions about how well the code is instrumented.  Using the pseudo-stacks as common merge points seems fine to me, but stopping at the first pseudo-stack entry seems to put too much trust in their comprehensiveness.
I say we just experiment with the tradeoff between how much info we get and performance.  No need to guess; they are both easy to code.
(In reply to comment #28)
> I say we just experiment with the tradeoff between how much info we get and
> performance. No need to guess; they are both easy to code.

Fair enough!
Closing this in favour of the approach in bug 779121, which is to try
to achieve the same by using the breakpad unwinder instead.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
(In reply to Julian Seward from comment #30)
> Closing this in favour of the approach in bug 779121, which is to try
                                            ^^^^^^^^^^
Wrong bug -- I meant bug 779291.