Closed
Bug 757186
Opened 13 years ago
Closed 11 years ago
Gecko Profiler SIGPROF causes hangs on Linux
Categories
(Core :: Gecko Profiler, defect)
Tracking
()
RESOLVED
FIXED
People
(Reporter: sfink, Unassigned)
References
Details
Attachments
(7 files, 1 obsolete file)
Was working on another workspace, switched back to find Firefox hung.
Stacks attached. It was sitting in a poll() -- well, except I have the geckoprofiler installed, so the poll was getting interrupted constantly with SIGPROF signals.
I'm also attaching a 6-7 second strace log. Oddly, although the poll() appears to be set to a 5-second timeout, it doesn't seem like it ever timed out. It seems unlikely that restarting poll() would reset its timeout, but I can't explain why that poll() never seemed to return otherwise.
Reporter | ||
Comment 1•13 years ago
|
||
Note that this is on the profiling nightly from http://hg.mozilla.org/projects/profiling/rev/edb3df9e27a6.
Reporter | ||
Comment 2•13 years ago
|
||
Reporter | ||
Comment 3•13 years ago
|
||
Reporter | ||
Comment 4•13 years ago
|
||
Reporter | ||
Updated•13 years ago
|
Attachment #625740 -
Attachment is patch: false
Comment 5•13 years ago
|
||
Seems like nsHttpConnectionMgr was the last notable thing in the stack, so let's try Core::Networking.
Component: General → Networking: HTTP
Product: Firefox → Core
QA Contact: general → networking.http
Comment 6•13 years ago
|
||
I don't know exactly what was happening, but I bet it has something to do with sigprof interrupting it constantly.. pr_setpollable_event event pretty much writes to a fifo iirc and maybe that got filled up.
interesting to look into, but I doubt its hiding something dangerous.
Comment 7•12 years ago
|
||
It's known (cannot find the bug # though) the pollable event buffer gets filled up.
Reporter | ||
Comment 8•12 years ago
|
||
Not sure it's meaningful, but when this happened again I attached with gdb, stopped passing SIGPROF through, and continued. It remained hung. (Then I reattached and attempted to call sigaction(SIGPROF, SIG_IGN), but botched it and crashed the process.)
I think this may impact the stability of the Gecko Profiler; I get one of these hangs every week or so.
I also see bug 737328, though it probably isn't related to this bug.
Updated•12 years ago
|
Component: Networking: HTTP → Gecko Profiler
QA Contact: networking.http → gecko-profiler
Summary: Hang while doing nothing → Gecko Profiler SIGPROF causes hangs on Linux
Comment 9•12 years ago
|
||
Are you running a profiling build (using unwinding) or not (using pseudo stack)? Not sure if it matters but just in case.
Reporter | ||
Comment 10•12 years ago
|
||
(In reply to Benoit Girard (:BenWa) from comment #9)
> Are you running a profiling build (using unwinding) or not (using pseudo
> stack)? Not sure if it matters but just in case.
Ermmm... I think I'm using unwinding. Cleopatra shows hex addresses in its stack. I'm on the nightly profiling branch.
Comment 11•12 years ago
|
||
We will show address only on the leaf nodes in pseudo stack mode. Is the symbolication also failing for you? Looks like sps linux support really needs some love.
Reporter | ||
Comment 12•12 years ago
|
||
Numbers all the way down
Comment 13•12 years ago
|
||
SPS with libunwind-based-unwinding on Android, at 100HZ sampling,
will usually hang within a couple of minutes. Not sure if it is
the same problem or not.
Comment 14•12 years ago
|
||
Stacks at hang on Android. The user interface no longer responds
to input, but I can see from debug printing, that signals are still
being delivered to the profiler and that it taking stack snapshots
accordingly.
Comment 15•12 years ago
|
||
Previous version was completely bogus (debuginfo out of date).
Attachment #640154 -
Attachment is obsolete: true
Comment 16•12 years ago
|
||
Some of these threads are waiting on jemalloc's allocation lock. In
this particular snapshot, the signal handler's thread (thread 12) is
too. So it's clear that the signal handler, calling into libunwind,
eventually calls (je)malloc.
I don't think that's the cause of the hang though, because the signal
handler thread continues to take samples (as per debugging printing)
even though the rest of the app appears to be deadlocked.
What could cause a deadlock is if some thread T calls into jemalloc,
gets the jemalloc lock, and then the handler runs on thread T, and
that also calls jemalloc (as per thread 12).
Comment 17•12 years ago
|
||
I checked the libunwind documentation, at
http://www.nongnu.org/libunwind/man/libunwind%283%29.html#section_5
This states that the functions we use (unw_getcontext, unw_init_local,
unw_step, unw_get_reg) are "signal safe", by which I think is meant,
only calls functions which are listed as being "async-signal-safe" by
POSIX. Whether that is any use to use I don't know, considering 1st
and 3rd paras in comment #16.
Comment 18•12 years ago
|
||
Following Ehsan/BenWa comments on irc, I made a simple mmap based
allocator in tools/profiler/libunwind/src/src/dwarf/Gfind_proc_info-lsb.c
and used it to handle all the malloc, calloc, realloc and free calls there.
This appears to remove deadlocks in which threads are waiting for the
jemalloc lock. It still deadlocks though. Some of them are due to
contention for a lock in libc (bionic) malloc, which is reached via
calls to fread in Gfind_proc_info-lsb.c, so they are not caught by my
mmap allocator.
At this point I am wondering if libunwind is actually suitable for
this task. It needs to be much more self contained if we are to run
it safely from a signal handler. Also I am wondering what the
libunwind docs mean when they claim the functions we use are
signal-safe -- they clearly are not, at least given the way we are
using them.
All of that said, I believe there are still deadlocks that have
nothing to do with bad interactions with {libc/je}malloc, as per the
original reason for filing this bug.
Comment 19•12 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #6)
> I don't know exactly what was happening, but I bet it has something to do
> with sigprof interrupting it constantly.. pr_setpollable_event event pretty
> much writes to a fifo iirc and maybe that got filled up.
Do you have any more specifics about what the problem might be and how
to go about checking your hypothesis? Having looked at several such
deadlock instances in Android, I am inclined to believe at least some
of them are caused by this (kind of) problem.
Comment 20•12 years ago
|
||
(In reply to Julian Seward from comment #19)
> (In reply to Patrick McManus [:mcmanus] from comment #6)
> > I don't know exactly what was happening, but I bet it has something to do
> > with sigprof interrupting it constantly.. pr_setpollable_event event pretty
> > much writes to a fifo iirc and maybe that got filled up.
>
> Do you have any more specifics about what the problem might be and how
> to go about checking your hypothesis? Having looked at several such
> deadlock instances in Android, I am inclined to believe at least some
> of them are caused by this (kind of) problem.
just a hunch, but probly related to either the pollable-event fifo filling up
http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/io/prpolevt.c#115
Comment 21•12 years ago
|
||
(In reply to comment #18)
> This appears to remove deadlocks in which threads are waiting for the
> jemalloc lock. It still deadlocks though. Some of them are due to
> contention for a lock in libc (bionic) malloc, which is reached via
> calls to fread in Gfind_proc_info-lsb.c, so they are not caught by my
> mmap allocator.
Bah, libunwind shouldn't really touch anything from libc if it wants to even claim to be signal safe. :(
Comment 22•12 years ago
|
||
I get this hang, when I click on a downloaded file in the new Download Manager (ie. trying to display it), while I have the Gecko Profiler running.
This is on Linux Mint 14.
Comment 23•12 years ago
|
||
(In reply to Tom Schuster [:evilpie] from comment #22)
> Created attachment 726203 [details]
> stacktrace
>
> I get this hang, when I click on a downloaded file in the new Download
> Manager (ie. trying to display it), while I have the Gecko Profiler running.
>
> This is on Linux Mint 14.
That sounds like bug 837390
Comment 24•11 years ago
|
||
Can you still reproduce a hang on poll()?
Reporter | ||
Comment 25•11 years ago
|
||
(In reply to Benoit Girard (:BenWa) from comment #24)
> Can you still reproduce a hang on poll()?
I can't use the profiler because of bug 879547, but I just re-enabled it and strace shows the usual SIGPROF storm, so I'll run like this for a while and let you know. I'll needinfo? myself so I remember.
Flags: needinfo?(sphink)
Reporter | ||
Comment 26•11 years ago
|
||
I haven't seen this at all in the week and a half. Seems to be fixed.
From the comments, it seems like other people might be seeing slightly different bugs, so I'm resolving this one. Unless you're getting a hang when spawning an external process, open another bug.
Ooh! Even better, while updating this bug I clicked on the profiler icon, and it actually did something again! Perhaps that's fixed too!
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(sphink)
Resolution: --- → FIXED
Comment 27•10 years ago
|
||
This bug is referenced from https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Profiling_with_the_Built-in_Profiler#Availability as a "bug for Native stack" on Linux nightly.
Could someone update MDN - I'm not sure if native stack is available by default or if you need special env variables as the section below seems to say.
Reporter | ||
Comment 28•8 years ago
|
||
(In reply to Nickolay_Ponomarev from comment #27)
> This bug is referenced from
> https://developer.mozilla.org/en-US/docs/Mozilla/Performance/
> Profiling_with_the_Built-in_Profiler#Availability as a "bug for Native
> stack" on Linux nightly.
>
> Could someone update MDN - I'm not sure if native stack is available by
> default or if you need special env variables as the section below seems to
> say.
Yeah, I just got back to this bug through that link. I'm running a Nightly on Linux, and I see to be getting symbols now. (Well, libxul.so symbols. I still see some missing for eg libdbus. But maybe I need to install that debuginfo.)
You need to log in
before you can comment on or make changes to this bug.
Description
•