Closed Bug 757186 Opened 13 years ago Closed 11 years ago

Gecko Profiler SIGPROF causes hangs on Linux

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: sfink, Unassigned)

References

Details

Attachments

(7 files, 1 obsolete file)

Stack trace 13 years ago Steve Fink [:sfink] [:s:] 36.91 KB, text/plain		Details
strace log 13 years ago Steve Fink [:sfink] [:s:] 1.38 KB, application/x-gzip		Details
lsof output 13 years ago Steve Fink [:sfink] [:s:] 7.08 KB, application/x-gzip		Details
/proc/(pid)/maps output 13 years ago Steve Fink [:sfink] [:s:] 21.21 KB, application/x-gzip		Details
Cleopatra screenshot 12 years ago Steve Fink [:sfink] [:s:] 160.46 KB, image/png		Details
Stacks when hanging on Android 12 years ago Julian Seward [:jseward] 38.93 KB, text/plain		Details
Stacks when hanging on Android, take 2 12 years ago Julian Seward [:jseward] 72.80 KB, text/plain		Details
stacktrace 12 years ago Tom S [:evilpie] 4.88 KB, text/plain		Details

Steve Fink [:sfink] [:s:]

Reporter

Description

•

13 years ago

Was working on another workspace, switched back to find Firefox hung. Stacks attached. It was sitting in a poll() -- well, except I have the geckoprofiler installed, so the poll was getting interrupted constantly with SIGPROF signals. I'm also attaching a 6-7 second strace log. Oddly, although the poll() appears to be set to a 5-second timeout, it doesn't seem like it ever timed out. It seems unlikely that restarting poll() would reset its timeout, but I can't explain why that poll() never seemed to return otherwise.

Steve Fink [:sfink] [:s:]

Reporter

Comment 1

•

13 years ago

Attached file Stack trace — Details

Note that this is on the profiling nightly from http://hg.mozilla.org/projects/profiling/rev/edb3df9e27a6.

Steve Fink [:sfink] [:s:]

Reporter

Comment 2

•

13 years ago

Attached file strace log — Details

Steve Fink [:sfink] [:s:]

Reporter

Comment 3

•

13 years ago

Attached file lsof output — Details

Steve Fink [:sfink] [:s:]

Reporter

Comment 4

•

13 years ago

Attached file /proc/(pid)/maps output — Details

Steve Fink [:sfink] [:s:]

Reporter

Updated

•

13 years ago

Attachment #625740 - Attachment is patch: false

Justin Dolske [:Dolske]

Comment 5

•

13 years ago

Seems like nsHttpConnectionMgr was the last notable thing in the stack, so let's try Core::Networking.

Component: General → Networking: HTTP

Product: Firefox → Core

QA Contact: general → networking.http

Patrick McManus [:mcmanus]

Comment 6

•

13 years ago

I don't know exactly what was happening, but I bet it has something to do with sigprof interrupting it constantly.. pr_setpollable_event event pretty much writes to a fifo iirc and maybe that got filled up. interesting to look into, but I doubt its hiding something dangerous.

Honza Bambas (:mayhemer)

Comment 7

•

12 years ago

It's known (cannot find the bug # though) the pollable event buffer gets filled up.

Steve Fink [:sfink] [:s:]

Reporter

Comment 8

•

12 years ago

Not sure it's meaningful, but when this happened again I attached with gdb, stopped passing SIGPROF through, and continued. It remained hung. (Then I reattached and attempted to call sigaction(SIGPROF, SIG_IGN), but botched it and crashed the process.) I think this may impact the stability of the Gecko Profiler; I get one of these hangs every week or so. I also see bug 737328, though it probably isn't related to this bug.

Benoit Girard (:BenWa)

Updated

•

12 years ago

Component: Networking: HTTP → Gecko Profiler

QA Contact: networking.http → gecko-profiler

Summary: Hang while doing nothing → Gecko Profiler SIGPROF causes hangs on Linux

Benoit Girard (:BenWa)

Comment 9

•

12 years ago

Are you running a profiling build (using unwinding) or not (using pseudo stack)? Not sure if it matters but just in case.

Steve Fink [:sfink] [:s:]

Reporter

Comment 10

•

12 years ago

(In reply to Benoit Girard (:BenWa) from comment #9) > Are you running a profiling build (using unwinding) or not (using pseudo > stack)? Not sure if it matters but just in case. Ermmm... I think I'm using unwinding. Cleopatra shows hex addresses in its stack. I'm on the nightly profiling branch.

Benoit Girard (:BenWa)

Comment 11

•

12 years ago

We will show address only on the leaf nodes in pseudo stack mode. Is the symbolication also failing for you? Looks like sps linux support really needs some love.

Steve Fink [:sfink] [:s:]

Reporter

Comment 12

•

12 years ago

Attached image Cleopatra screenshot — Details

Numbers all the way down

Julian Seward [:jseward]

Comment 13

•

12 years ago

SPS with libunwind-based-unwinding on Android, at 100HZ sampling, will usually hang within a couple of minutes. Not sure if it is the same problem or not.

Julian Seward [:jseward]

Comment 14

•

12 years ago

Attached file Stacks when hanging on Android (obsolete) — Details

Stacks at hang on Android. The user interface no longer responds to input, but I can see from debug printing, that signals are still being delivered to the profiler and that it taking stack snapshots accordingly.

Julian Seward [:jseward]

Comment 15

•

12 years ago

Attached file Stacks when hanging on Android, take 2 — Details

Previous version was completely bogus (debuginfo out of date).

Attachment #640154 - Attachment is obsolete: true

Julian Seward [:jseward]

Comment 16

•

12 years ago

Some of these threads are waiting on jemalloc's allocation lock. In this particular snapshot, the signal handler's thread (thread 12) is too. So it's clear that the signal handler, calling into libunwind, eventually calls (je)malloc. I don't think that's the cause of the hang though, because the signal handler thread continues to take samples (as per debugging printing) even though the rest of the app appears to be deadlocked. What could cause a deadlock is if some thread T calls into jemalloc, gets the jemalloc lock, and then the handler runs on thread T, and that also calls jemalloc (as per thread 12).

Julian Seward [:jseward]

Updated

•

12 years ago

Blocks: 769241

Julian Seward [:jseward]

Comment 17

•

12 years ago

I checked the libunwind documentation, at http://www.nongnu.org/libunwind/man/libunwind%283%29.html#section_5 This states that the functions we use (unw_getcontext, unw_init_local, unw_step, unw_get_reg) are "signal safe", by which I think is meant, only calls functions which are listed as being "async-signal-safe" by POSIX. Whether that is any use to use I don't know, considering 1st and 3rd paras in comment #16.

Julian Seward [:jseward]

Comment 18

•

12 years ago

Following Ehsan/BenWa comments on irc, I made a simple mmap based allocator in tools/profiler/libunwind/src/src/dwarf/Gfind_proc_info-lsb.c and used it to handle all the malloc, calloc, realloc and free calls there. This appears to remove deadlocks in which threads are waiting for the jemalloc lock. It still deadlocks though. Some of them are due to contention for a lock in libc (bionic) malloc, which is reached via calls to fread in Gfind_proc_info-lsb.c, so they are not caught by my mmap allocator. At this point I am wondering if libunwind is actually suitable for this task. It needs to be much more self contained if we are to run it safely from a signal handler. Also I am wondering what the libunwind docs mean when they claim the functions we use are signal-safe -- they clearly are not, at least given the way we are using them. All of that said, I believe there are still deadlocks that have nothing to do with bad interactions with {libc/je}malloc, as per the original reason for filing this bug.

Julian Seward [:jseward]

Comment 19

•

12 years ago

(In reply to Patrick McManus [:mcmanus] from comment #6) > I don't know exactly what was happening, but I bet it has something to do > with sigprof interrupting it constantly.. pr_setpollable_event event pretty > much writes to a fifo iirc and maybe that got filled up. Do you have any more specifics about what the problem might be and how to go about checking your hypothesis? Having looked at several such deadlock instances in Android, I am inclined to believe at least some of them are caused by this (kind of) problem.

Patrick McManus [:mcmanus]

Comment 20

•

12 years ago

(In reply to Julian Seward from comment #19) > (In reply to Patrick McManus [:mcmanus] from comment #6) > > I don't know exactly what was happening, but I bet it has something to do > > with sigprof interrupting it constantly.. pr_setpollable_event event pretty > > much writes to a fifo iirc and maybe that got filled up. > > Do you have any more specifics about what the problem might be and how > to go about checking your hypothesis? Having looked at several such > deadlock instances in Android, I am inclined to believe at least some > of them are caused by this (kind of) problem. just a hunch, but probly related to either the pollable-event fifo filling up http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/io/prpolevt.c#115

(no longer active)

Comment 21

•

12 years ago

(In reply to comment #18) > This appears to remove deadlocks in which threads are waiting for the > jemalloc lock. It still deadlocks though. Some of them are due to > contention for a lock in libc (bionic) malloc, which is reached via > calls to fread in Gfind_proc_info-lsb.c, so they are not caught by my > mmap allocator. Bah, libunwind shouldn't really touch anything from libc if it wants to even claim to be signal safe. :(

Tom S [:evilpie]

Comment 22

•

12 years ago

Attached file stacktrace — Details

I get this hang, when I click on a downloaded file in the new Download Manager (ie. trying to display it), while I have the Gecko Profiler running. This is on Linux Mint 14.

Mike Hommey [:glandium]

Comment 23

•

12 years ago

(In reply to Tom Schuster [:evilpie] from comment #22) > Created attachment 726203 [details] > stacktrace > > I get this hang, when I click on a downloaded file in the new Download > Manager (ie. trying to display it), while I have the Gecko Profiler running. > > This is on Linux Mint 14. That sounds like bug 837390

Benoit Girard (:BenWa)

Comment 24

•

11 years ago

Can you still reproduce a hang on poll()?

Steve Fink [:sfink] [:s:]

Reporter

Comment 25

•

11 years ago

(In reply to Benoit Girard (:BenWa) from comment #24) > Can you still reproduce a hang on poll()? I can't use the profiler because of bug 879547, but I just re-enabled it and strace shows the usual SIGPROF storm, so I'll run like this for a while and let you know. I'll needinfo? myself so I remember.

Flags: needinfo?(sphink)

Steve Fink [:sfink] [:s:]

Reporter

Comment 26

•

11 years ago

I haven't seen this at all in the week and a half. Seems to be fixed. From the comments, it seems like other people might be seeing slightly different bugs, so I'm resolving this one. Unless you're getting a hang when spawning an external process, open another bug. Ooh! Even better, while updating this bug I clicked on the profiler icon, and it actually did something again! Perhaps that's fixed too!

Status: NEW → RESOLVED

Closed: 11 years ago

Flags: needinfo?(sphink)

Resolution: --- → FIXED

Nickolay_Ponomarev

Comment 27

•

10 years ago

This bug is referenced from https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Profiling_with_the_Built-in_Profiler#Availability as a "bug for Native stack" on Linux nightly. Could someone update MDN - I'm not sure if native stack is available by default or if you need special env variables as the section below seems to say.

Steve Fink [:sfink] [:s:]

Reporter

Comment 28

•

8 years ago

(In reply to Nickolay_Ponomarev from comment #27) > This bug is referenced from > https://developer.mozilla.org/en-US/docs/Mozilla/Performance/ > Profiling_with_the_Built-in_Profiler#Availability as a "bug for Native > stack" on Linux nightly. > > Could someone update MDN - I'm not sure if native stack is available by > default or if you need special env variables as the section below seems to > say. Yeah, I just got back to this bug through that link. I'm running a Nightly on Linux, and I see to be getting symbols now. (Well, libxul.so symbols. I still see some missing for eg libdbus. But maybe I need to install that debuginfo.)

You need to log in before you can comment on or make changes to this bug.