Closed Bug 664453 Opened 13 years ago Closed 13 years ago

build with --enable-jprof often crashes at startup if JP_START is set

Categories

(Core :: General, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jesup, Assigned: jesup)

References

Details

Attachments

(1 file, 3 obsolete files)

Frequently (50-80%) firefox will crash at startup if JP_START is set in the JPROF_FLAGS; may be worse with high interrupt rates.  Probably a timing issue with setting the signal handler.
Assignee: nobody → rjesup
Status: NEW → ASSIGNED
Defers jprof startup until we have a profile (etc); also pause jprof when we're about to launch another app of any type.  Works much more reliably - I've been unable to break it now.
Attachment #540051 - Flags: review?(benjamin)
Attachment #540051 - Flags: review?(benjamin) → review?(dbaron)
Could you explain why this makes sense?  It seems like starting at startup ought to work.
Starting it earlier causes random (frequent) crashes in library opens, things like segfaults, failures to catch the signal, and errors like this:

Inconsistency detected by ld.so: dl-open.c: 260: dl_open_worker: Assertion `_dl_debug_initialize (0, args->nsid)->r_state == RT_CONSISTENT' failed!

Also, profiling the profile manager really is separate from profiling normal app startup. I tried some lesser changes (pausing it before exec-ing the new app, though I left that in) and they didn't solve the problem.
As backup, from the setitimer() man page:

  Notes

  A child created via fork(2) does not inherit its parent's interval timers. 
  Interval timers are preserved across an execve(2). 

I assume they're preserved across execv*().  Signal handlers (of course) are not, so any combination of setitimer() and execv* is DANGEROUS unless preceded by fork() (and maybe spawn()).

Also, the atexit() to pause profiling and dump the address map won't get called it appears:

From the atexit manpage:

       When  a child process is created via fork(2), it inherits copies of its parent’s registrations.
       Upon a successful call to one of the exec(3) functions, all registrations are removed.

See also https://code.google.com/p/chromium/issues/detail?id=84911 and https://bugzilla.redhat.com/show_bug.cgi?id=645528 for examples of some of the asynchronous dangers of setitimer() and fork/exec. 

It's simply unsafe to leave the setitimer running when you might call fork() or execv*() (and quite possibly spawn()).

I'm revising the patch to also cover spawning plugin-container processes as well.
Comment on attachment 540051 [details] [diff] [review]
Patch that defers initing jprof until we launch the real app

Pulling review request for now
Attachment #540051 - Flags: review?(dbaron)
The fork/exec issue may explain why launching flash (plugin-container) was sometimes failing with complaints about broken pipes in chromium ipc.

I'll note that there are some fork()/execv*() uses I can't easily modify: one in NSS (in safe_popen()), the other in ForkAndExec() in NSPR.
Ok, updated patch given what I found about spawning subprocesses via fork and execv*.  

This does not hit every use of those calls.  In particular, it doesn't hit nspr or nss (we can file a follow-on bug/patch on them), the updater app isn't touched (but normally isn't jprofed'd anyways), and js.cpp is modified by this patch but the mod is disabled because there's a build-order issue to resolve (jprof.h hasn't been published in dist/include when js is built).  That can also be a follow-on.  These missed changes don't seem to cause much if any problem in general profiling, though there may be cases where they would fail.

In practice, normally the only things that had problems were initial system startup (crashing around profilemanager/etc) and spawning plugin-containers (i-looping when trying to start flash - this likely was the same problem Chromium had that I mentioned above).  Using profiling/setitimer() in a complex program is tricky at best, especially if you don't control directly every use of fork & exec in every library you use.  This is actually ok, so long as this is solely a debugging tool and it's "ok" if it fails once in a while.  Failing 8 of 10 startups though was obviously unacceptable.
Attachment #540051 - Attachment is obsolete: true
Attachment #540362 - Flags: review?(dbaron)
Attached patch Patch - one more update (obsolete) — Splinter Review
One last (I hope) update to cover another very indirect way we call fork() (the filepicker, which ~25 frames deep calls fork() from dbus).

There will be more I haven't found.  That's ok - jprof is a tool; and most uses of it won't hit any of these forks, OR the user will be either using a longer period (the short period greatly increases the odds of the i-loop trying to clone() the memory map), OR the period will be short but the process won't be as huge as my testcase (bigger the VM map, the bigger the chance of this biting you).  Even now, the reason I even noticed there was an issue is that it was biting me when profiling the startup code; normally you kill -PROF it right before your test, and -USR1 right after.
Attachment #540362 - Attachment is obsolete: true
Attachment #540362 - Flags: review?(dbaron)
Attachment #540377 - Flags: review?(dbaron)
Bug 552864 changed how firefox gets started and shut down, and for example broke the use of atexit() to dump the address map.  This update to the patch gets things working again.
Attachment #540377 - Attachment is obsolete: true
Attachment #540377 - Flags: review?(dbaron)
Attachment #540977 - Flags: review?(dbaron)
Depends on: 666501
I've found one small hole jprofing startup where it if we got a SIGPROF while inside a malloc pthread lock, and backtrace() needs to load a library and it calls malloc() you can get deadlocked.  I've only hit it once, and there's not too much one can do about it (probably force it to load or statically load the library in question before enabling the setitimers).  I don't want to hold this patch (which makes profiling startup usable) for that edge case.

dbaron: any chance of a review before lockdown?
Comment on attachment 540977 [details] [diff] [review]
Fix bitrot due to bug 552864

I should have decided to do this ages ago, but given my review backlog, and the failures of my attempts to find somebody else interested in reviewing jprof code, I'm just going to rubber stamp this r=dbaron.  Sorry for taking so long to decide to do that.
Attachment #540977 - Flags: review?(dbaron) → review+
checked in as http://hg.mozilla.org/mozilla-central/rev/71c422d27ed4
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: