Last Comment Bug 757186 - Gecko Profiler SIGPROF causes hangs on Linux
: Gecko Profiler SIGPROF causes hangs on Linux
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Gecko Profiler (show other bugs)
: unspecified
: x86_64 Linux
: -- normal with 1 vote (vote)
: ---
Assigned To: Nobody; OK to take it and work on it
:
Mentors:
Depends on:
Blocks: 769241
  Show dependency treegraph
 
Reported: 2012-05-21 13:29 PDT by Steve Fink [:sfink] [:s:]
Modified: 2015-09-06 19:36 PDT (History)
18 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Stack trace (36.91 KB, text/plain)
2012-05-21 13:31 PDT, Steve Fink [:sfink] [:s:]
no flags Details
strace log (1.38 KB, application/x-gzip)
2012-05-21 13:33 PDT, Steve Fink [:sfink] [:s:]
no flags Details
lsof output (7.08 KB, application/x-gzip)
2012-05-21 13:34 PDT, Steve Fink [:sfink] [:s:]
no flags Details
/proc/(pid)/maps output (21.21 KB, application/x-gzip)
2012-05-21 13:34 PDT, Steve Fink [:sfink] [:s:]
no flags Details
Cleopatra screenshot (160.46 KB, image/png)
2012-05-31 13:41 PDT, Steve Fink [:sfink] [:s:]
no flags Details
Stacks when hanging on Android (38.93 KB, text/plain)
2012-07-09 01:24 PDT, Julian Seward [:jseward]
no flags Details
Stacks when hanging on Android, take 2 (72.80 KB, text/plain)
2012-07-09 03:36 PDT, Julian Seward [:jseward]
no flags Details
stacktrace (4.88 KB, text/plain)
2013-03-18 09:29 PDT, Tom Schuster [:evilpie]
no flags Details

Description Steve Fink [:sfink] [:s:] 2012-05-21 13:29:19 PDT
Was working on another workspace, switched back to find Firefox hung.

Stacks attached. It was sitting in a poll() -- well, except I have the geckoprofiler installed, so the poll was getting interrupted constantly with SIGPROF signals.

I'm also attaching a 6-7 second strace log. Oddly, although the poll() appears to be set to a 5-second timeout, it doesn't seem like it ever timed out. It seems unlikely that restarting poll() would reset its timeout, but I can't explain why that poll() never seemed to return otherwise.
Comment 1 Steve Fink [:sfink] [:s:] 2012-05-21 13:31:41 PDT
Created attachment 625740 [details]
Stack trace

Note that this is on the profiling nightly from http://hg.mozilla.org/projects/profiling/rev/edb3df9e27a6.
Comment 2 Steve Fink [:sfink] [:s:] 2012-05-21 13:33:01 PDT
Created attachment 625741 [details]
strace log
Comment 3 Steve Fink [:sfink] [:s:] 2012-05-21 13:34:13 PDT
Created attachment 625743 [details]
lsof output
Comment 4 Steve Fink [:sfink] [:s:] 2012-05-21 13:34:45 PDT
Created attachment 625744 [details]
/proc/(pid)/maps output
Comment 5 Justin Dolske [:Dolske] 2012-05-22 18:11:26 PDT
Seems like nsHttpConnectionMgr was the last notable thing in the stack, so let's try Core::Networking.
Comment 6 Patrick McManus [:mcmanus] 2012-05-22 18:31:10 PDT
I don't know exactly what was happening, but I bet it has something to do with sigprof interrupting it constantly.. pr_setpollable_event event pretty much writes to a fifo iirc and maybe that got filled up.

interesting to look into, but I doubt its hiding something dangerous.
Comment 7 Honza Bambas (:mayhemer) 2012-05-31 04:58:13 PDT
It's known (cannot find the bug # though) the pollable event buffer gets filled up.
Comment 8 Steve Fink [:sfink] [:s:] 2012-05-31 10:13:28 PDT
Not sure it's meaningful, but when this happened again I attached with gdb, stopped passing SIGPROF through, and continued. It remained hung. (Then I reattached and attempted to call sigaction(SIGPROF, SIG_IGN), but botched it and crashed the process.)

I think this may impact the stability of the Gecko Profiler; I get one of these hangs every week or so.

I also see bug 737328, though it probably isn't related to this bug.
Comment 9 Benoit Girard (:BenWa) 2012-05-31 10:23:42 PDT
Are you running a profiling build (using unwinding) or not (using pseudo stack)? Not sure if it matters but just in case.
Comment 10 Steve Fink [:sfink] [:s:] 2012-05-31 11:49:59 PDT
(In reply to Benoit Girard (:BenWa) from comment #9)
> Are you running a profiling build (using unwinding) or not (using pseudo
> stack)? Not sure if it matters but just in case.

Ermmm... I think I'm using unwinding. Cleopatra shows hex addresses in its stack. I'm on the nightly profiling branch.
Comment 11 Benoit Girard (:BenWa) 2012-05-31 12:24:06 PDT
We will show address only on the leaf nodes in pseudo stack mode. Is the symbolication also failing for you? Looks like sps linux support really needs some love.
Comment 12 Steve Fink [:sfink] [:s:] 2012-05-31 13:41:15 PDT
Created attachment 628885 [details]
Cleopatra screenshot

Numbers all the way down
Comment 13 Julian Seward [:jseward] 2012-07-09 00:41:15 PDT
SPS with libunwind-based-unwinding on Android, at 100HZ sampling,
will usually hang within a couple of minutes.  Not sure if it is
the same problem or not.
Comment 14 Julian Seward [:jseward] 2012-07-09 01:24:29 PDT
Created attachment 640154 [details]
Stacks when hanging on Android

Stacks at hang on Android.  The user interface no longer responds
to input, but I can see from debug printing, that signals are still
being delivered to the profiler and that it taking stack snapshots
accordingly.
Comment 15 Julian Seward [:jseward] 2012-07-09 03:36:03 PDT
Created attachment 640176 [details]
Stacks when hanging on Android, take 2

Previous version was completely bogus (debuginfo out of date).
Comment 16 Julian Seward [:jseward] 2012-07-09 04:37:05 PDT
Some of these threads are waiting on jemalloc's allocation lock.  In
this particular snapshot, the signal handler's thread (thread 12) is
too.  So it's clear that the signal handler, calling into libunwind,
eventually calls (je)malloc.

I don't think that's the cause of the hang though, because the signal
handler thread continues to take samples (as per debugging printing)
even though the rest of the app appears to be deadlocked.

What could cause a deadlock is if some thread T calls into jemalloc,
gets the jemalloc lock, and then the handler runs on thread T, and
that also calls jemalloc (as per thread 12).
Comment 17 Julian Seward [:jseward] 2012-07-09 06:10:50 PDT
I checked the libunwind documentation, at
http://www.nongnu.org/libunwind/man/libunwind%283%29.html#section_5

This states that the functions we use (unw_getcontext, unw_init_local,
unw_step, unw_get_reg) are "signal safe", by which I think is meant,
only calls functions which are listed as being "async-signal-safe" by
POSIX.  Whether that is any use to use I don't know, considering 1st
and 3rd paras in comment #16.
Comment 18 Julian Seward [:jseward] 2012-07-10 06:40:37 PDT
Following Ehsan/BenWa comments on irc, I made a simple mmap based
allocator in tools/profiler/libunwind/src/src/dwarf/Gfind_proc_info-lsb.c
and used it to handle all the malloc, calloc, realloc and free calls there.

This appears to remove deadlocks in which threads are waiting for the
jemalloc lock.  It still deadlocks though.  Some of them are due to
contention for a lock in libc (bionic) malloc, which is reached via
calls to fread in Gfind_proc_info-lsb.c, so they are not caught by my
mmap allocator.

At this point I am wondering if libunwind is actually suitable for
this task.  It needs to be much more self contained if we are to run
it safely from a signal handler.  Also I am wondering what the
libunwind docs mean when they claim the functions we use are
signal-safe -- they clearly are not, at least given the way we are
using them.

All of that said, I believe there are still deadlocks that have
nothing to do with bad interactions with {libc/je}malloc, as per the
original reason for filing this bug.
Comment 19 Julian Seward [:jseward] 2012-07-10 06:48:23 PDT
(In reply to Patrick McManus [:mcmanus] from comment #6)
> I don't know exactly what was happening, but I bet it has something to do
> with sigprof interrupting it constantly.. pr_setpollable_event event pretty
> much writes to a fifo iirc and maybe that got filled up.

Do you have any more specifics about what the problem might be and how
to go about checking your hypothesis?  Having looked at several such
deadlock instances in Android, I am inclined to believe at least some
of them are caused by this (kind of) problem.
Comment 20 Patrick McManus [:mcmanus] 2012-07-10 09:11:05 PDT
(In reply to Julian Seward from comment #19)
> (In reply to Patrick McManus [:mcmanus] from comment #6)
> > I don't know exactly what was happening, but I bet it has something to do
> > with sigprof interrupting it constantly.. pr_setpollable_event event pretty
> > much writes to a fifo iirc and maybe that got filled up.
> 
> Do you have any more specifics about what the problem might be and how
> to go about checking your hypothesis?  Having looked at several such
> deadlock instances in Android, I am inclined to believe at least some
> of them are caused by this (kind of) problem.

just a hunch, but probly related to either the pollable-event fifo filling up
http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/io/prpolevt.c#115
Comment 21 :Ehsan Akhgari (out sick) 2012-07-10 10:46:20 PDT
(In reply to comment #18)
> This appears to remove deadlocks in which threads are waiting for the
> jemalloc lock.  It still deadlocks though.  Some of them are due to
> contention for a lock in libc (bionic) malloc, which is reached via
> calls to fread in Gfind_proc_info-lsb.c, so they are not caught by my
> mmap allocator.

Bah, libunwind shouldn't really touch anything from libc if it wants to even claim to be signal safe. :(
Comment 22 Tom Schuster [:evilpie] 2013-03-18 09:29:17 PDT
Created attachment 726203 [details]
stacktrace

I get this hang, when I click on a downloaded file in the new Download Manager (ie. trying to display it), while I have the Gecko Profiler running.

This is on Linux Mint 14.
Comment 23 Mike Hommey [:glandium] 2013-03-18 12:50:26 PDT
(In reply to Tom Schuster [:evilpie] from comment #22)
> Created attachment 726203 [details]
> stacktrace
> 
> I get this hang, when I click on a downloaded file in the new Download
> Manager (ie. trying to display it), while I have the Gecko Profiler running.
> 
> This is on Linux Mint 14.

That sounds like bug 837390
Comment 24 Benoit Girard (:BenWa) 2013-07-10 12:15:56 PDT
Can you still reproduce a hang on poll()?
Comment 25 Steve Fink [:sfink] [:s:] 2013-07-11 17:40:03 PDT
(In reply to Benoit Girard (:BenWa) from comment #24)
> Can you still reproduce a hang on poll()?

I can't use the profiler because of bug 879547, but I just re-enabled it and strace shows the usual SIGPROF storm, so I'll run like this for a while and let you know. I'll needinfo? myself so I remember.
Comment 26 Steve Fink [:sfink] [:s:] 2013-07-22 10:20:33 PDT
I haven't seen this at all in the week and a half. Seems to be fixed.

From the comments, it seems like other people might be seeing slightly different bugs, so I'm resolving this one.  Unless you're getting a hang when spawning an external process, open another bug.

Ooh! Even better, while updating this bug I clicked on the profiler icon, and it actually did something again! Perhaps that's fixed too!
Comment 27 Nickolay_Ponomarev 2014-11-09 15:09:49 PST
This bug is referenced from https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Profiling_with_the_Built-in_Profiler#Availability as a "bug for Native stack" on Linux nightly.

Could someone update MDN - I'm not sure if native stack is available by default or if you need special env variables as the section below seems to say.

Note You need to log in before you can comment on or make changes to this bug.