Closed Bug 757186 Opened 8 years ago Closed 7 years ago

Gecko Profiler SIGPROF causes hangs on Linux

Categories

(Core :: Gecko Profiler, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: sfink, Unassigned)

References

Details

Attachments

(7 files, 1 obsolete file)

Was working on another workspace, switched back to find Firefox hung.

Stacks attached. It was sitting in a poll() -- well, except I have the geckoprofiler installed, so the poll was getting interrupted constantly with SIGPROF signals.

I'm also attaching a 6-7 second strace log. Oddly, although the poll() appears to be set to a 5-second timeout, it doesn't seem like it ever timed out. It seems unlikely that restarting poll() would reset its timeout, but I can't explain why that poll() never seemed to return otherwise.
Attached file Stack trace
Note that this is on the profiling nightly from http://hg.mozilla.org/projects/profiling/rev/edb3df9e27a6.
Attached file strace log
Attached file lsof output
Attachment #625740 - Attachment is patch: false
Seems like nsHttpConnectionMgr was the last notable thing in the stack, so let's try Core::Networking.
Component: General → Networking: HTTP
Product: Firefox → Core
QA Contact: general → networking.http
I don't know exactly what was happening, but I bet it has something to do with sigprof interrupting it constantly.. pr_setpollable_event event pretty much writes to a fifo iirc and maybe that got filled up.

interesting to look into, but I doubt its hiding something dangerous.
It's known (cannot find the bug # though) the pollable event buffer gets filled up.
Not sure it's meaningful, but when this happened again I attached with gdb, stopped passing SIGPROF through, and continued. It remained hung. (Then I reattached and attempted to call sigaction(SIGPROF, SIG_IGN), but botched it and crashed the process.)

I think this may impact the stability of the Gecko Profiler; I get one of these hangs every week or so.

I also see bug 737328, though it probably isn't related to this bug.
Component: Networking: HTTP → Gecko Profiler
QA Contact: networking.http → gecko-profiler
Summary: Hang while doing nothing → Gecko Profiler SIGPROF causes hangs on Linux
Are you running a profiling build (using unwinding) or not (using pseudo stack)? Not sure if it matters but just in case.
(In reply to Benoit Girard (:BenWa) from comment #9)
> Are you running a profiling build (using unwinding) or not (using pseudo
> stack)? Not sure if it matters but just in case.

Ermmm... I think I'm using unwinding. Cleopatra shows hex addresses in its stack. I'm on the nightly profiling branch.
We will show address only on the leaf nodes in pseudo stack mode. Is the symbolication also failing for you? Looks like sps linux support really needs some love.
Attached image Cleopatra screenshot
Numbers all the way down
SPS with libunwind-based-unwinding on Android, at 100HZ sampling,
will usually hang within a couple of minutes.  Not sure if it is
the same problem or not.
Attached file Stacks when hanging on Android (obsolete) —
Stacks at hang on Android.  The user interface no longer responds
to input, but I can see from debug printing, that signals are still
being delivered to the profiler and that it taking stack snapshots
accordingly.
Previous version was completely bogus (debuginfo out of date).
Attachment #640154 - Attachment is obsolete: true
Some of these threads are waiting on jemalloc's allocation lock.  In
this particular snapshot, the signal handler's thread (thread 12) is
too.  So it's clear that the signal handler, calling into libunwind,
eventually calls (je)malloc.

I don't think that's the cause of the hang though, because the signal
handler thread continues to take samples (as per debugging printing)
even though the rest of the app appears to be deadlocked.

What could cause a deadlock is if some thread T calls into jemalloc,
gets the jemalloc lock, and then the handler runs on thread T, and
that also calls jemalloc (as per thread 12).
Blocks: 769241
I checked the libunwind documentation, at
http://www.nongnu.org/libunwind/man/libunwind%283%29.html#section_5

This states that the functions we use (unw_getcontext, unw_init_local,
unw_step, unw_get_reg) are "signal safe", by which I think is meant,
only calls functions which are listed as being "async-signal-safe" by
POSIX.  Whether that is any use to use I don't know, considering 1st
and 3rd paras in comment #16.
Following Ehsan/BenWa comments on irc, I made a simple mmap based
allocator in tools/profiler/libunwind/src/src/dwarf/Gfind_proc_info-lsb.c
and used it to handle all the malloc, calloc, realloc and free calls there.

This appears to remove deadlocks in which threads are waiting for the
jemalloc lock.  It still deadlocks though.  Some of them are due to
contention for a lock in libc (bionic) malloc, which is reached via
calls to fread in Gfind_proc_info-lsb.c, so they are not caught by my
mmap allocator.

At this point I am wondering if libunwind is actually suitable for
this task.  It needs to be much more self contained if we are to run
it safely from a signal handler.  Also I am wondering what the
libunwind docs mean when they claim the functions we use are
signal-safe -- they clearly are not, at least given the way we are
using them.

All of that said, I believe there are still deadlocks that have
nothing to do with bad interactions with {libc/je}malloc, as per the
original reason for filing this bug.
(In reply to Patrick McManus [:mcmanus] from comment #6)
> I don't know exactly what was happening, but I bet it has something to do
> with sigprof interrupting it constantly.. pr_setpollable_event event pretty
> much writes to a fifo iirc and maybe that got filled up.

Do you have any more specifics about what the problem might be and how
to go about checking your hypothesis?  Having looked at several such
deadlock instances in Android, I am inclined to believe at least some
of them are caused by this (kind of) problem.
(In reply to Julian Seward from comment #19)
> (In reply to Patrick McManus [:mcmanus] from comment #6)
> > I don't know exactly what was happening, but I bet it has something to do
> > with sigprof interrupting it constantly.. pr_setpollable_event event pretty
> > much writes to a fifo iirc and maybe that got filled up.
> 
> Do you have any more specifics about what the problem might be and how
> to go about checking your hypothesis?  Having looked at several such
> deadlock instances in Android, I am inclined to believe at least some
> of them are caused by this (kind of) problem.

just a hunch, but probly related to either the pollable-event fifo filling up
http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/io/prpolevt.c#115
(In reply to comment #18)
> This appears to remove deadlocks in which threads are waiting for the
> jemalloc lock.  It still deadlocks though.  Some of them are due to
> contention for a lock in libc (bionic) malloc, which is reached via
> calls to fread in Gfind_proc_info-lsb.c, so they are not caught by my
> mmap allocator.

Bah, libunwind shouldn't really touch anything from libc if it wants to even claim to be signal safe. :(
Attached file stacktrace
I get this hang, when I click on a downloaded file in the new Download Manager (ie. trying to display it), while I have the Gecko Profiler running.

This is on Linux Mint 14.
(In reply to Tom Schuster [:evilpie] from comment #22)
> Created attachment 726203 [details]
> stacktrace
> 
> I get this hang, when I click on a downloaded file in the new Download
> Manager (ie. trying to display it), while I have the Gecko Profiler running.
> 
> This is on Linux Mint 14.

That sounds like bug 837390
Can you still reproduce a hang on poll()?
(In reply to Benoit Girard (:BenWa) from comment #24)
> Can you still reproduce a hang on poll()?

I can't use the profiler because of bug 879547, but I just re-enabled it and strace shows the usual SIGPROF storm, so I'll run like this for a while and let you know. I'll needinfo? myself so I remember.
Flags: needinfo?(sphink)
I haven't seen this at all in the week and a half. Seems to be fixed.

From the comments, it seems like other people might be seeing slightly different bugs, so I'm resolving this one.  Unless you're getting a hang when spawning an external process, open another bug.

Ooh! Even better, while updating this bug I clicked on the profiler icon, and it actually did something again! Perhaps that's fixed too!
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(sphink)
Resolution: --- → FIXED
This bug is referenced from https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Profiling_with_the_Built-in_Profiler#Availability as a "bug for Native stack" on Linux nightly.

Could someone update MDN - I'm not sure if native stack is available by default or if you need special env variables as the section below seems to say.
(In reply to Nickolay_Ponomarev from comment #27)
> This bug is referenced from
> https://developer.mozilla.org/en-US/docs/Mozilla/Performance/
> Profiling_with_the_Built-in_Profiler#Availability as a "bug for Native
> stack" on Linux nightly.
> 
> Could someone update MDN - I'm not sure if native stack is available by
> default or if you need special env variables as the section below seems to
> say.

Yeah, I just got back to this bug through that link. I'm running a Nightly on Linux, and I see to be getting symbols now. (Well, libxul.so symbols. I still see some missing for eg libdbus. But maybe I need to install that debuginfo.)
You need to log in before you can comment on or make changes to this bug.