Closed Bug 769241 Opened 12 years ago Closed 12 years ago

implement fast & functional stackwalk on ARM

Categories: Core :: Gecko Profiler, defect
Platform: ARM / Android
Severity: normal
Status: RESOLVED WONTFIX
People: (Reporter: vlad, Unassigned)
Attachments: (2 files)

The gecko profiler is really useful for Fennec and B2G, given that the general profiling situation there is poor.  However, it would be really helpful to have full stack unwinding support there so that we can get a more detailed view into what's going on.

Currently there's some code that uses libunwind; unfortunately, libunwind seems to crash many times during a single stack walk, making it way too slow to be useful.  Julian mentioned that Valgrind has very fast ARM stack walking, but it's license-incompatible (GPL), and may be hard to extract from Valgrind anyway.  Other useful bits of info: we already have DWARF parsing code in breakpad, and potentially CFI parsing code in various places.  We should figure out how to combine all of our knowledge into working fast and functional stackwalk code on ARM.

A first suggestion was to investigate why libunwind is crashing, and see if we can make it not do that; then evaluate its performance, potentially adding caching as needed.
I'll try it out on vanilla arm-linux, first of all.  Are there any more
details on how/why libunwind crashes?
I don't remember the specifics, but libunwind used to return invalid memory addresses in a stack walk step, and on the next step it would crash on them (and it also would crash some times when interpreting the CFI unwinding information).  I had to do this <https://github.com/ehsan/libunwind/commit/c55e24f44eee416ec2883b704bc570a20a1e2af5> in order to recover from those crashes (warning: horrible horrible hack).  Looking at other libunwind architectures, it seems like some of them have similar protections in place as well.
And my hunch is that these signals are why stackwalking using libunwind is currently very slow.
I got a libunwind-based build on x86_64-linux, after some unpleasant
hand-to-hand combat with the build system and a couple of nasty
kludges.

In void TableTicker::doBacktrace(ThreadProfile &aProfile, TickSample*
aSample) I had to make and use x86_64 equivalents for REPLACE_REG.

Some observations:

* With the new REPLACE_REG in place, the stack traces that come out
  of libunwind are complete garbage.  I don't know why; I assume I 
  somehow screwed up the initial unwind state.

* I commented out the REPLACE_REG assignments, so as to start from
  the sighandler provided context.  And that seems to work fine.  
  Got plausible profiles, no segfaulting, not too much complaining
  from Valgrind.

* It appears that libunwind at least on x86_64 has its own scheme for
  checking whether a page is safe to access -- msync_validate in
  tools/profiler/libunwind/src/src/x86_64/Ginit.c.  So at least
  in principle it shouldn't fault (if we don't think too hard about
  multithread badness, I guess).  Not sure how that interacts with
  Ehsan's signal handler hack.

* Having a max unwind depth of 1000 strikes me as exacerbating the
  performance problems.  If the unwinder gets hosed, stuck in a loop,
  goes into outer space, or whatever, it might well generate 1000
  bogus entries, and hence segfaults, for no purpose.  I saw that
  happen several times.  So I limited it to 100.  Can we live with that?
Just to clarify, I started with x86_64-linux as that's easy(ish) to
build and I wanted to check that CFI based unwinding in libunwind isn't
broken in some very obvious way, which it doesn't seem to be.  Will now
transition to ARM/Android as that's the real target.
Note that your x86_64 code is still useful -- we should get it checked in.
(In reply to comment #4)
> I got a libunwind-based build x86_64-linux, after some unpleasant
> hand-to-hand combat with the build system and a couple of nasty
> kludges.
> 
> In void TableTicker::doBacktrace(ThreadProfile &aProfile, TickSample*
> aSample) I had to make and use and x86_64 equivalents for REPLACE_REG.

Makes sense.

> Some observations:
> 
> * With the new REPLACE_REG in place, the stack traces that come out
>   of libunwind are complete garbage.  I don't know why; I assume I 
>   somehow screwed up the initial unwind state.
> 
> * I commented out the REPLACE_REG assignments, so as to start from
>   the sighandler provided context.  And that seems to work fine.  
>   Got plausible profiles, no segfaulting, not too much complaining
>   from Valgrind.

That almost certainly means that the initial state is bad somehow.  libunwind includes some code which is supposed to deal with signal handler frames, and in my experience that code didn't work very well for ARM Android signal handlers, which is why I bothered to set up the initial state.

> * It appears that libunwind at least on x86_64 has its own scheme for
>   checking whether a page is safe to access -- msync_validate in
>   tools/profiler/libunwind/src/src/x86_64/Ginit.c.  So at least
>   in principle it shouldn't fault (if we don't think too hard about
>   multithread badness, I guess).  Not sure how that interacts with
>   Ehsan's signal handler hack.

Yeah, libunwind has that protection in many of its backends.  I didn't find an API that I could use on Android to give me information about whether an address is mapped or not, which is why I resorted to the signal handler hack.  If you know of a way to get that information on Android, that'd be super!

> * Having a max unwind depth of 1000 strikes me as exacerbating the
>   performance problems.  If the unwinder gets hosed, stuck in a loop,
>   goes into outer space, or whatever, it might well generate 1000
>   bogus entries, and hence segfaults, for no purpose.  I saw that
>   happen several times.  So I limited it to 100.  Can we live with that?

Yeah, we can experiment with that and if it's too low in practice, we can increase it gently (i.e., not ten times!)
(In reply to comment #5)
> Just to clarify, I started with x86_64-linux as that's easy(ish) to
> build and I wanted to check that CFI based unwinding in libunwind isn't
> broken in some very obvious way, which it doesn't seem to be.  Will now
> transition to ARM/Android as that's the real target.

Note that on ARM, we fallback to the exidx based unwinding if we fail to do CFI unwinding.  You can turn that off initially if you only want to use the shared CFI unwinding code.
(In reply to Ehsan Akhgari [:ehsan] from comment #7)
> > * It appears that libunwind at least on x86_64 has its own scheme for
> >   checking whether a page is safe to access -- msync_validate in
> >   tools/profiler/libunwind/src/src/x86_64/Ginit.c.  So at least
> >   in principle it shouldn't fault (if we don't think too hard about
> >   multithread badness, I guess).  Not sure how that interacts with
> >   Ehsan's signal handler hack.
> 
> Yeah, libunwind has that protection in many of its backends.  I didn't find
> an API that I could use on Android to give me information about whether an
> address is mapped or not, which is why I resorted to the signal handler
> hack.  If you know of a way to get that information on Android, that'd be
> super!

The x86_64-linux scheme uses sys_msync to test if a page is mapped or
not.  Since that's a direct call into the kernel (AFAICS) I think the
same game should work on Android.  One wild guess about why libunwind
doesn't already do this for ARM is that it is set up to deal with low
end embedded ARM implementations, which don't necessarily have an MMU
and hence for which this trick won't work.  But all the targets we
care about have MMUs.
(In reply to comment #9)
> (In reply to Ehsan Akhgari [:ehsan] from comment #7)
> > > * It appears that libunwind at least on x86_64 has its own scheme for
> > >   checking whether a page is safe to access -- msync_validate in
> > >   tools/profiler/libunwind/src/src/x86_64/Ginit.c.  So at least
> > >   in principle it shouldn't fault (if we don't think too hard about
> > >   multithread badness, I guess).  Not sure how that interacts with
> > >   Ehsan's signal handler hack.
> > 
> > Yeah, libunwind has that protection in many of its backends.  I didn't find
> > an API that I could use on Android to give me information about whether an
> > address is mapped or not, which is why I resorted to the signal handler
> > hack.  If you know of a way to get that information on Android, that'd be
> > super!
> 
> The x86_64-linux scheme uses sys_msync to test if a page is mapped or
> not.  Since that's a direct call into the kernel (AFAICS) I think the
> same game should work on Android.  One wild guess about why libunwind
> doesn't already do this for ARM is that it is set up to deal with low
> end embedded ARM implementations, which don't necessarily have an MMU
> and hence for which this trick won't work.  But all the targets we
> care about have MMUs.

My impression from working on libunwind's ARM backend is that it's very minimally tested, and probably only on a single platform (my best guess is some flavor of desktop linux on ARM?).  It was never tested on Android, I'm sure, as it was relying on things like dl_iterate_phdr which doesn't exist in Android's libc.

That being said, I have not tested whether Android's kernel supports msync, but I have no reason to believe that it doesn't.
(In reply to Ehsan Akhgari [:ehsan] from comment #10)
> (In reply to comment #9)
> > (In reply to Ehsan Akhgari [:ehsan] from comment #7)
> > > > * It appears that libunwind at least on x86_64 has its own scheme for
> > > >   checking whether a page is safe to access -- msync_validate in
> > > >   tools/profiler/libunwind/src/src/x86_64/Ginit.c.  So at least
> > > >   in principle it shouldn't fault (if we don't think too hard about
> > > >   multithread badness, I guess).  Not sure how that interacts with
> > > >   Ehsan's signal handler hack.
> > > 
> > > Yeah, libunwind has that protection in many of its backends.  I didn't find
> > > an API that I could use on Android to give me information about whether an
> > > address is mapped or not, which is why I resorted to the signal handler
> > > hack.  If you know of a way to get that information on Android, that'd be
> > > super!
> > 
> > The x86_64-linux scheme uses sys_msync to test if a page is mapped or
> > not.  Since that's a direct call into the kernel (AFAICS) I think the
> > same game should work on Android.  One wild guess about why libunwind
> > doesn't already do this for ARM is that it is set up to deal with low
> > end embedded ARM implementations, which don't necessarily have an MMU
> > and hence for which this trick won't work.  But all the targets we
> > care about have MMUs.
> 
> My impression from working on libunwind's ARM backend is that it's very
> minimally tested, and probably only on a single platform (my best guess is
> some flavor of desktop linux on ARM?).  It was never tested on Android, I'm
> sure, as it was relying on things like dl_iterate_phdr which doesn't exist
> in Android's libc.

We have it on android fennec, and it would be easy to add to b2g (either in the system linker or in gecko).

> That being said, I have not tested whether the Android's kernel supports
> msync, but I have no reason to believe that it doesn't.

The kernel obviously supports it.  The libc might not expose a wrapper for it, though; after a quick check, it does.
Unwinding on arm-android does seem to work to some extent, especially
when I reduced the sampling frequency to 100Hz and the max stack depth
to 100 entries.

A much more serious problem [than segfaulting and resulting performance
loss] is that fennec very often hangs after a short while.  I wonder if
this is the same problem as bug 757186.

I also noticed w.r.t. sigsegv_handler() in
tools/profiler/libunwind/src/src/arm/Ginit.c that this isn't really right

    switch (PROCESSOR_MODE(uc->uc_mcontext.arm_cpsr)) {
      ...
      case 1: // Thumb
        pc += 2; // each instruction is 16 bits
        break;

Thumb instructions can be 2 or 4 bytes long.  I think the reason it happens
to work as-is is that the faults can only come from one of two instructions
in access_mem() a bit further down the file, and those both happen to be 
compiled into 2-byte instructions.  This is hardly a good thing to rely on,
though.
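Per the ARM architecture manual, whether a Thumb instruction is 2 or 4 bytes is decided by the top five bits of its first halfword: 0b11101, 0b11110 and 0b11111 introduce 32-bit Thumb-2 encodings, everything else is 16-bit. A corrected skip would look roughly like this (a sketch, not a patch against Ginit.c):

```c
#include <stdint.h>

/* Length in bytes of the Thumb instruction whose first halfword is hw.
 * Encodings whose top five bits are 0b11101, 0b11110 or 0b11111 are
 * 32-bit Thumb-2 instructions; all others are 16-bit. */
static unsigned thumb_insn_length(uint16_t hw) {
  unsigned top5 = hw >> 11;
  return (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F) ? 4 : 2;
}
```

With this, the handler could advance the PC by `thumb_insn_length(*(uint16_t *)pc)` instead of assuming 2 bytes.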
Depends on: 757186
I've been working with libunwind/SPS on Android for a while now.  Here
are some observations.  In general I'm not convinced that libunwind is
either as broken or as slow as is claimed above.

* The 1000 frame limit in TableTicker.cpp is way too high.  Since the
  unwinder can sometimes get stuck in a loop, asking it to produce
  1000 frames has the effect of wasting a lot of time filling most of
  them with the same (possibly bogus) value.  I set it to 100, and
  even then the vast majority if not all traces are shorter.

* The 1000Hz sampling rate is also too high.  I set it to 100Hz,
  although I am not sure to what extent it is adhered to.

* With those settings, performance is perfectly adequate for profiling
  on a Motorola Xoom (1 GHz dual core A9).  It idles at around 26% of
  one core with the above settings.

* The segfault handler is hit, but not a lot.  For a profile run
  involving 2400 samples (loading news.bbc.co.uk and then moving to a
  different page), it was hit 98 times, that is, 4% of samples.
  Watching the log, it is hit very roughly once per second.

* I ran it on a hacked version of Valgrind, so that it can ask
  Valgrind's unwinder for a stack whenever it asks libunwind for one.
  The stacks are then compared.  This makes it possible to assess the
  accuracy of libunwind's output.  I'm not claiming V's unwinder is
  infallible (it certainly has problems) but this gives a way to see
  if libunwind is "obviously broken".

  For 2400 samples, all of which have stack traces of 5 or more
  frames, almost all (way more than 99%) are identical between
  libunwind and Valgrind.  Conclusion is that if libunwind is broken,
  it's not broken in a big way.


I see the following issues for successful integration of libunwind
into SPS:

* I spent a lot of time investigating deadlocks.  It's clear that
  libunwind is not signal safe, despite claims to that effect in its
  documentation.  In particular it allocates memory from the heap and
  deadlocks as soon as it takes a sample when the thread being sampled
  is inside malloc (libc or jemalloc) and is holding the malloc lock.

  Avoiding these deadlocks is going to require properly isolating
  libunwind from the outside world, something that we really ought to
  be able to enforce at compile time.

  There are also hangs related to Awesomebar and other activities
  (bug 773308) which I was not able to make sense of.  I think they
  are different from the malloc hangs, though.

* It appears (AIUI, at least) that Cleopatra (the GUI) has an
  algorithm to merge multiple stack traces into a tree, and it assumes
  that all stack traces end up back at some common root (main,
  basically).  This works OK for pseudostack unwinding, but produces a
  bunch of smaller trees with libunwind, because many stack traces are
  incomplete.  This is probably unavoidable, so maybe the tree
  construction algorithm needs to be augmented to deal
  (probabilistically) with partial traces.

* The build system integration of libunwind is problematic.
  In particular things like

    make -C ff-opt/tools/profiler && make -C ff-opt/toolkit/library

  fail to rebuild properly after changes to TableTicker.cpp or
  the libunwind bits.  The only way I found to get a reliable rebuild
  was "make -f client.mk build" at the top level, which is very slow.
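On the malloc-deadlock point: the attachment below mentions a "quick n dirty private allocator". A minimal async-signal-safe bump allocator of that general shape might look like this (a hypothetical sketch; the actual patch is attached to the bug):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a signal-safe private allocator: carve
 * allocations out of a static pool so the unwinder never touches the
 * (possibly locked) system heap.  No free(); reset between samples. */
#define POOL_SIZE (64 * 1024)
static unsigned char pool[POOL_SIZE];
static size_t pool_used = 0;

static void *pool_alloc(size_t n) {
  n = (n + 7) & ~(size_t)7;      /* keep 8-byte alignment */
  if (pool_used + n > POOL_SIZE)
    return NULL;                  /* out of pool: fail, never fall back to malloc */
  void *p = pool + pool_used;
  pool_used += n;
  return p;
}

static void pool_reset(void) { pool_used = 0; }
```

Because the pool is static and the sampler takes one sample at a time, no locks are needed, so it is safe to call with the sampled thread stopped inside malloc.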
That's extremely encouraging; thanks for diving into this!  It sounds like fixing the heap memory issue (and potentially the naive 2-byte instruction assumption issue in comment #12) are the main things that we need before we can integrate it?

Do you have any patches that you applied to either libunwind or the profiler that would be helpful in reproducing your results?  I'd like to play with it at some point soon.
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #14)
> It sounds like fixing the heap memory issue (and potentially the naive
> 2-byte instruction assumption issue in comment #12) are the main
> things that we need before we can integrate it?

Yes.  The 2-byte insn thing is trivial and not critical -- I'll put up
a patch for it in the next day or so.

The heap memory stuff is more difficult, at least to do properly.
What I'd like to do ideally is build libunwind with only a minimal set
of headers (no stdlib, stdio, etc) so it simply can't allocate heap,
either directly or indirectly.  Then add an object that carries round
a bundle of replacement functions that we control (malloc, free,
fread, etc), so as to fix up the resulting link errors.  That way,
provided nobody re-adds #include <stdlib.h> etc, we are guaranteed
safe at compile time.  Unfortunately it's probably quite an intrusive
thing to do.
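The "bundle of replacement functions" idea could be shaped roughly like this; every name here is illustrative, not from any actual patch:

```c
#include <stddef.h>

/* Sketch: libunwind would be compiled against only this struct, so
 * every allocation/IO call is one the host controls.  With stdlib
 * headers excluded, any stray malloc() becomes a link error. */
typedef struct {
  void  *(*malloc_fn)(size_t);
  void   (*free_fn)(void *);
  size_t (*read_fn)(void *buf, size_t n, void *stream);
} UnwinderCallouts;

static UnwinderCallouts callouts;

/* The host installs its signal-safe implementations at init time. */
void unwinder_set_callouts(const UnwinderCallouts *c) { callouts = *c; }

/* Inside the isolated unwinder, all heap use goes through the bundle. */
void *unw_malloc(size_t n) { return callouts.malloc_fn(n); }
void  unw_free(void *p)    { callouts.free_fn(p); }
```

The compile-time guarantee comes from the header diet, not the struct itself: if nobody re-adds #include <stdlib.h>, the unwinder physically cannot reach the real heap.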

> Do you have any patches that you applied to either libunwind or the profiler
> that would be helpful in reproducing your results?  I'd like to play with it
> at some point soon.

I'll attach the hacky patches to libunwind and the profiler in the
next comment.  I can also attach the patches for Valgrind if you want
to do the backtrace-diffing game, although you'll need a lot of patience :)
Provides a quick n dirty private allocator for
tools/profiler/libunwind/src/src/dwarf/Gfind_proc_info-lsb.c
so as to stop it deadlocking every ten seconds.
(In reply to comment #13)
> I've been working with libunwind/SPS on Android for a while now.  Here
> are some observations.  In general I'm not convinced that libunwind is
> either as broken or as slow as is claimed above.
> 
> * The 1000 frame limit in TableTicker.cpp is way too high.  Since the
>   unwinder can sometimes get stuck in a loop, asking it to produce
>   1000 frames has the effect of wasting a lot of time filling most of
>   them with the same (possibly bogus) value.  I set it to 100, and
>   even then the vast majority if not all traces are shorter.

I think this is an acceptable adjustment.

> * The 1000Hz sampling rate is also too high.  I set it to 100Hz,
>   although I am not sure to what extent it is adhered to.

Hmm, 100Hz would be a bit too low for measuring a lot of things.  That would sample at 10ms intervals, which can make profiling things which can cause us to skip a frame for example very hard. :(

> * With those settings, performance is perfectly adequate for profiling
>   on a Motorola Xoom (1 GHz dual core A9).  It idles at around 26% of
>   one core with the above settings.
> 
> * The segfault handler is hit, but not a lot.  For a profile run
>   involving 2400 samples (loading news.bbc.co.uk and then moving to a
>   different page), it was hit 98 times, that is, 4% of samples.
>   Watching the log, it is hit very roughly once per second.

Hmm, that's great, and a lot lower than what I used to see.  I don't know what causes the difference, but I'm not going to question a good thing.

> * I ran it on a hacked version of Valgrind, so that it can ask
>   Valgrind's unwinder for a stack whenever it asks libunwind for one.
>   The stacks are then compared.  This makes it possible to assess the
>   accuracy of libunwind's output.  I'm not claiming V's unwinder is
>   infallible (it certainly has problems) but this gives a way to see
>   if libunwind is "obviously broken".
> 
>   For 2400 samples, all of which have stack traces of 5 or more
>   frames, almost all (way more than 99%) are identical between
>   libunwind and Valgrind.  Conclusion is that if libunwind is broken,
>   it's not broken in a big way.

That's great news!

> I see the following issues for successful integration of libunwind
> into SPS:
> 
> * I spent a lot of time investigating deadlocks.  It's clear that
>   libunwind is not signal safe, despite claims to that effect in its
>   documentation.  In particular it allocates memory from the heap and
>   deadlocks as soon as it takes a sample when the thread being sampled
>   is inside malloc (libc or jemalloc) and is holding the malloc lock.

Yeah this will be tricky.

> * It appears (AIUI, at least) that Cleopatra (the GUI) has an
>   algorithm to merge multiple stack traces into a tree, and it assumes
>   that all stack traces end up back at some common root (main,
>   basically).  This works OK for pseudostack unwinding, but produces a
>   bunch of smaller trees with libunwind, because many stack traces are
>   incomplete.  This is probably unavoidable, so maybe the tree
>   construction algorithm needs to be augmented to deal
>   (probabilistically) with partial traces.

You should just see multiple roots in Cleopatra right?  AFAIK we should be able to handle those cases.

> * The build system integration of libunwind is problematic.
>   In particular things like
> 
>     make -C ff-opt/tools/profiler && make -C ff-opt/toolkit/library
> 
>   fail to rebuild properly after changes to TableTicker.cpp or
>   the libunwind bits.  The only way I found to get a reliable rebuild
>   was "make -f client.mk build" at the top level, which is very slow.

Ouch.  Unfortunately I don't know very much about our build system, but perhaps you can ask one of the build system peers to see if they can spot any obvious mistakes?
(In reply to comment #15)
> I'll attach the hacky patches to libunwind and the profiler in the
> next comment.  I can also attach the patches for Valgrind if you want
> to do the backtrace-diffing game, although you'll need a lot of patience :)

I'd appreciate if you can submit your libunwind changes to my github repo as a pull request and then update the in-tree version from the github repo, as the in-tree version is supposed to be a copy of the github repo.  (This will make it easier to upstream our libunwind changes at some point in the future.)
This is what I've been using with TableTicker to do the backtrace
diffing.  You'll need to edit out the relevant bits before building
(should be obvious), else you'll have something that's unbuildable.

Also:

* decreases number of grabbed frames from 1000 to 100

* adds a bit more logcat printing, in particular so we can see
  when the segfault handler is invoked.
(In reply to Ehsan Akhgari [:ehsan] from comment #17)

> > * The 1000Hz sampling rate is also too high.  I set it to 100Hz,
> >   although I am not sure to what extent it is adhered to.
> 
> Hmm, 100Hz would be a bit too low for measuring a lot of things.  That would
> sample at 10ms intervals, which can make profiling things which can cause us
> to skip a frame for example very hard. :(

We can maybe come back to this once the stability/deadlock issues have
been fixed up.  Besides, looking over my diffs just now, I think the
100Hz change got left behind in an old tree, so maybe it is not
necessary.  Not sure what's going on here.


> > * I spent a lot of time investigating deadlocks.  It's clear that
> >   libunwind is not signal safe, despite claims to that effect in its
> >   documentation.  In particular it allocates memory from the heap and
> >   deadlocks as soon as it takes a sample when the thread being sampled
> >   is inside malloc (libc or jemalloc) and is holding the malloc lock.
> 
> Yeah this will be tricky.

I see it as being tedious and intrusive -- plumbing, basically, so as
to get our bundle of replacement functions everywhere we need.  But
not particularly tricky.  Is there some specific difficulty you
envisage?  I have not looked much at the libunwind code, true.


> You should just see multiple roots in Cleopatra right?  AFAIK we should be
> able to handle those cases.

Yes, it shows multiple roots.  Not so pretty, but not critical either.


> Ouch.  Unfortunately I don't know very much about our build system, but
> perhaps you can ask one of the build system peers to see if they can spot
> any obvious mistakes?

I'll try to get TedM on the case :)
(In reply to Ehsan Akhgari [:ehsan] from comment #18)
> I'd appreciate if you can submit your libunwind changes to my [...]

Will do, as soon as I have something clean to submit.  As you can see
from the two patches I just put up, it's still at the investigation
stage.
(In reply to Ehsan Akhgari [:ehsan] from comment #17)
> > * The segfault handler is hit, but not a lot.  For a profile run
> >   involving 2400 samples (loading news.bbc.co.uk and then moving to a
> >   different page), it was hit 98 times, that is, 4% of samples.
> >   Watching the log, it is hit very roughly once per second.
> 
> Hmm, that's great, and a lot lower than what I used to see.  I don't know
> why the difference is, but I'm not going to question a good thing.

The difference between our observations makes me kinda nervous...
maybe I did something wrong.  Not sure what though.  So it's good
if Vlad can try to repro what I did.
(In reply to Julian Seward from comment #13)
> I've been working with libunwind/SPS on Android for a while now.  Here
> are some observations.  In general I'm not convinced that libunwind is
> either as broken or as slow as is claimed above.
> 
> * The 1000 frame limit in TableTicker.cpp is way too high.  Since the
>   unwinder can sometimes get stuck in a loop, asking it to produce
>   1000 frames has the effect of wasting a lot of time filling most of
>   them with the same (possibly bogus) value.  I set it to 100, and
>   even then the vast majority if not all traces are shorter.
> 
> * The 1000Hz sampling rate is also too high.  I set it to 100Hz,
>   although I am not sure to what extent it is adhered to.

We use 1000Hz because in Pseudostack mode that is fast enough and the most useful. I think running at 100Hz with full stacks is very useful. When the fixes land I also want to experiment with increasing sampling to 1000Hz or so and only unwind the top 20 frames. We will be able to reconstruct the base of the stack by merging the pseudostack which we already do by default. With some additional work we can get it to aggregate properly.

> 
> * With those settings, performance is perfectly adequate for profiling
>   on a Motorola Xoom (1 GHz dual core A9).  It idles at around 26% of
>   one core with the above settings.
> 
> * The segfault handler is hit, but not a lot.  For a profile run
>   involving 2400 samples (loading news.bbc.co.uk and then moving to a
>   different page), it was hit 98 times, that is, 4% of samples.
>   Watching the log, it is hit very roughly once per second.
> 
> * I ran it on a hacked version of Valgrind, so that it can ask
>   Valgrind's unwinder for a stack whenever it asks libunwind for one.
>   The stacks are then compared.  This makes it possible to assess the
>   accuracy of libunwind's output.  I'm not claiming V's unwinder is
>   infallible (it certainly has problems) but this gives a way to see
>   if libunwind is "obviously broken".
> 
>   For 2400 samples, all of which have stack traces of 5 or more
>   frames, almost all (way more than 99%) are identical between
>   libunwind and Valgrind.  Conclusion is that if libunwind is broken,
>   it's not broken in a big way.
> 

That's excellent news. Some error is acceptable in profiling, and being at 99% accuracy is more than enough. Plus, since we merge in the pseudostack, we should be able to sanity-check cheaply.

> 
> I see the following issues for successful integration of libunwind
> into SPS:
> 
> * I spent a lot of time investigating deadlocks.  It's clear that
>   libunwind is not signal safe, despite claims to that effect in its
>   documentation.  In particular it allocates memory from the heap and
>   deadlocks as soon as it takes a sample when the thread being sampled
>   is inside malloc (libc or jemalloc) and is holding the malloc lock.
> 
>   Avoiding these deadlocks is going to require properly isolating
>   libunwind from the outside world, something that we really ought to
>   be able to enforce at compile time.

What do you think the best way to do that is? Writing a custom allocator?

> 
>   There are also hangs related to Awesomebar and other activities
>   (bug 773308) which I was not able to make sense of.  I think they
>   are different from the malloc hangs, though.

Yes, don't worry about these.

> 
> * It appears (AIUI, at least) that Cleopatra (the GUI) has an
>   algorithm to merge multiple stack traces into a tree, and it assumes
>   that all stack traces end up back at some common root (main,
>   basically).  This works OK for pseudostack unwinding, but produces a
>   bunch of smaller trees with libunwind, because many stack traces are
>   incomplete.  This is probably unavoidable, so maybe the tree
>   construction algorithm needs to be augmented to deal
>   (probabilistically) with partial traces.

Yes, it has bugs there. A lot of the code assumes that the non-inverted view only has one parent node. When that's not easy to enforce, I normally attach an anchor (Root) node to all the samples, but I'm going to be fixing this up in Cleopatra soon.

> 
> * The build system integration of libunwind is problematic.
>   In particular things like
> 
>     make -C ff-opt/tools/profiler && make -C ff-opt/toolkit/library
> 
>   fail to rebuild properly after changes to TableTicker.cpp or
>   the libunwind bits.  The only way I found to get a reliable rebuild
>   was "make -f client.mk build" at the top level, which is very slow.

I've never experienced that :(  That's how I already rebuild.
(In reply to Benoit Girard (:BenWa) from comment #23)
> We use 1000Hz because in Pseudostack mode that is fast enough and the most
> useful. I think running at 100Hz with full stacks is very useful. When the
> fixes land I also want to experiment with increasing sampling to 1000Hz or
> so and only unwind the top 20 frames. We will be able to reconstruct the
> base of the stack by merging the pseudostack which we already do by default.
> With some additional work we can get it to aggregate properly.

Honestly I'd love it if we could somehow give the stack address of the topmost pseudostack entry and have unwinding just stop when it hits that SP; generally as soon as we hit a pseudostack marker we know/don't care about anything above it.
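Since stacks grow downward, "stop at the topmost pseudostack entry" reduces to cutting the native trace at the first frame whose SP is at or above the marker's SP. A sketch of that stop condition (all names illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of vlad's suggestion: keep native frames only up to the
 * point where the frame SP reaches the SP recorded at the topmost
 * pseudostack entry; the pseudostack covers everything below that.
 * Stacks grow downward, so frames at or above marker_sp are older. */
static size_t unwind_until_marker(const uintptr_t *frame_sps, size_t nframes,
                                  uintptr_t marker_sp) {
  size_t i;
  for (i = 0; i < nframes; i++) {
    if (frame_sps[i] >= marker_sp)
      break;  /* reached the region the pseudostack already covers */
  }
  return i;   /* number of native frames to keep */
}
```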
(In reply to comment #20)
> (In reply to Ehsan Akhgari [:ehsan] from comment #17)
> 
> > > * The 1000Hz sampling rate is also too high.  I set it to 100Hz,
> > >   although I am not sure to what extent it is adhered to.
> > 
> > Hmm, 100Hz would be a bit too low for measuring a lot of things.  That would
> > sample at 10ms intervals, which can make profiling things which can cause us
> > to skip a frame for example very hard. :(
> 
> We can maybe come back to this once the stability/deadlock issues have
> been fixed up.  Besides, looking over my diffs just now, I think the
> 100Hz change got left behind in an old tree, so maybe it is not
> necessary.  Not sure what's going on here.

OK, /me crosses fingers for higher frequencies.

> > > * I spent a lot of time investigating deadlocks.  It's clear that
> > >   libunwind is not signal safe, despite claims to that effect in its
> > >   documentation.  In particular it allocates memory from the heap and
> > >   deadlocks as soon as it takes a sample when the thread being sampled
> > >   is inside malloc (libc or jemalloc) and is holding the malloc lock.
> > 
> > Yeah this will be tricky.
> 
> I see it as being tedious and intrusive -- plumbing, basically, so as
> to get our bundle of replacement functions everywhere we need.  But
> not particularly tricky.  Is there some specific difficulty you
> envisage?  I have not looked much at the libunwind code, true.

Nothing tricky in that sense.  It's just a lot of code to look at, and patch.  I wasn't implying that there would be any technical problems per se.

> > You should just see multiple roots in Cleopatra right?  AFAIK we should be
> > able to handle those cases.
> 
> Yes, it shows multiple roots.  Not so pretty, but not critical either.

Well, that's as good as things can get if we can't walk up the entire stack.  Out of curiosity, have you investigated why this happens?
(In reply to comment #21)
> (In reply to Ehsan Akhgari [:ehsan] from comment #18)
> > I'd appreciate if you can submit your libunwind changes to my [...]
> 
> Will do, as soon as I have something clean to submit.  As you can see
> from the two patches I just put up, it's still at the investigation
> stage.

Sounds good!
(In reply to comment #24)
> (In reply to Benoit Girard (:BenWa) from comment #23)
> > We use 1000Hz because in Pseudostack mode that is fast enough and the most
> > useful. I think running at 100Hz with full stacks is very useful. When the
> > fixes land I also want to experiment with increasing sampling to 1000Hz or
> > so and only unwind the top 20 frames. We will be able to reconstruct the
> > base of the stack by merging the pseudostack which we already do by default.
> > With some additional work we can get it to aggregate properly.
> 
> Honestly I'd love it if we could somehow give the stack address of the topmost
> pseudostack entry and have unwinding just stop when it hits that SP; generally
> as soon as we hit a pseudostack marker we know/don't care about anything above
> it.

Why do you think that's a good idea?  I mean the existing pseudo-stacks that we have are very much targeted at specific profiling scenarios, and I'd prefer us focus on getting a general purpose profiler which doesn't make assumptions about how well the code is instrumented.  Using the pseudo-stacks as common merge points seems fine to me, but stopping at the first pseudo-stack entry seems to put too much trust in their comprehensiveness.
I say we just experiment with the tradeoff between how much info we get and performance.  No need to guess; they are both easy to code.
(In reply to comment #28)
> I say we just experiment with the tradeoff between how much info we get and
> performance. No need to guess; they are both easy to code.

Fair enough!
Closing this in favour of the approach in bug 779121, which is to try
to achieve the same by using the breakpad unwinder instead.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
(In reply to Julian Seward from comment #30)
> Closing this in favour of the approach in bug 779121, which is to try
                                            ^^^^^^^^^^
Wrong bug -- I meant bug 779291.