Last Comment Bug 664453 - build with --enable-jprof often crashes at startup if JP_START is set
: build with --enable-jprof often crashes at startup if JP_START is set
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: General (show other bugs)
: Trunk
: x86 Linux
: -- normal (vote)
: ---
Assigned To: [:jesup] on pto until 2016/7/5 Randell Jesup
:
Mentors:
Depends on: 666501
Blocks:
  Show dependency treegraph
 
Reported: 2011-06-15 08:58 PDT by [:jesup] on pto until 2016/7/5 Randell Jesup
Modified: 2011-08-12 10:03 PDT (History)
2 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Patch that defers initing jprof until we launch the real app (2.88 KB, patch)
2011-06-17 07:01 PDT, [:jesup] on pto until 2016/7/5 Randell Jesup
no flags Details | Diff | Review
Patch to defer start of profiling, and pause it around dangerous operations (fork/execv*) (12.72 KB, patch)
2011-06-19 16:51 PDT, [:jesup] on pto until 2016/7/5 Randell Jesup
no flags Details | Diff | Review
Patch - one more update (15.46 KB, patch)
2011-06-19 19:36 PDT, [:jesup] on pto until 2016/7/5 Randell Jesup
no flags Details | Diff | Review
Fix bitrot due to bug 552864 (17.83 KB, patch)
2011-06-21 23:40 PDT, [:jesup] on pto until 2016/7/5 Randell Jesup
dbaron: review+
Details | Diff | Review

Description [:jesup] on pto until 2016/7/5 Randell Jesup 2011-06-15 08:58:43 PDT
Frequently (50-80%) firefox will crash at startup if JP_START is set in the JPROF_FLAGS; may be worse with high interrupt rates.  Probably a timing issue with setting the signal handler.
Comment 1 [:jesup] on pto until 2016/7/5 Randell Jesup 2011-06-17 07:01:04 PDT
Created attachment 540051 [details] [diff] [review]
Patch that defers initing jprof until we launch the real app

Defers jprof startup until we have a profile (etc); also pause jprof when we're about to launch another app of any type.  Works much more reliably - I've been unable to break it now.
Comment 2 David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) 2011-06-17 14:14:53 PDT
Could you explain why this makes sense?  It seems like starting at startup ought to work.
Comment 3 [:jesup] on pto until 2016/7/5 Randell Jesup 2011-06-17 14:29:22 PDT
Starting it earlier causes random (frequent) crashes in library opens, things like segfaults, failures to catch the signal, and errors like this:

Inconsistency detected by ld.so: dl-open.c: 260: dl_open_worker: Assertion `_dl_debug_initialize (0, args->nsid)->r_state == RT_CONSISTENT' failed!

Also, profiling the profile manager really is separate from profiling normal app startup. I tried some lesser changes (pausing it before exec-ing the new app, though I left that in) and they didn't solve the problem.
Comment 4 [:jesup] on pto until 2016/7/5 Randell Jesup 2011-06-19 00:00:14 PDT
As backup, from the setitimer() man page:

  Notes

  A child created via fork(2) does not inherit its parent's interval timers. 
  Interval timers are preserved across an execve(2). 

I assume they're preserved across execv*().  Signal handlers (of course) are not, so any combination of setitimer() and execv* is DANGEROUS unless preceded by fork() (and maybe spawn()).

Also, the atexit() to pause profiling and dump the address map won't get called it appears:

From the atexit manpage:

       When  a child process is created via fork(2), it inherits copies of its parent’s registrations.
       Upon a successful call to one of the exec(3) functions, all registrations are removed.

See also https://code.google.com/p/chromium/issues/detail?id=84911 and https://bugzilla.redhat.com/show_bug.cgi?id=645528 for examples of some of the asynchronous dangers of setitimer() and fork/exec. 

It's simply unsafe to leave the setitimer running when you might call fork() or execv*() (and quite possibly spawn()).

I'm revising the patch to also cover spawning plugin-container processes as well.
Comment 5 [:jesup] on pto until 2016/7/5 Randell Jesup 2011-06-19 00:00:57 PDT
Comment on attachment 540051 [details] [diff] [review]
Patch that defers initing jprof until we launch the real app

Pulling review request for now
Comment 6 [:jesup] on pto until 2016/7/5 Randell Jesup 2011-06-19 00:19:14 PDT
The fork/exec issue may explain why launching flash (plugin-container) was sometimes failing with complaints about broken pipes in chromium ipc.

I'll note that there are some fork()/execv*() uses I can't easily modify: one in NSS (in safe_popen()), the other in ForkAndExec() in NSPR.
Comment 7 [:jesup] on pto until 2016/7/5 Randell Jesup 2011-06-19 16:51:18 PDT
Created attachment 540362 [details] [diff] [review]
Patch to defer start of profiling, and pause it around dangerous operations (fork/execv*)

Ok, updated patch given what I found about spawning subprocesses via fork and execv*.  

This does not hit every use of those calls.  In particular, it doesn't hit nspr or nss (we can file a follow-on bug/patch on them), the updater app isn't touched (but normally isn't jprofed'd anyways), and js.cpp is modified by this patch but the mod is disabled because there's a build-order issue to resolve (jprof.h hasn't been published in dist/include when js is built).  That can also be a follow-on.  These missed changes don't seem to cause much if any problem in general profiling, though there may be cases where they would fail.

In practice, normally the only things that had problems were initial system startup (crashing around profilemanager/etc) and spawning plugin-containers (i-looping when trying to start flash - this likely was the same problem Chromium had that I mentioned above).  Using profiling/setitimer() in a complex program is tricky at best, especially if you don't control directly every use of fork & exec in every library you use.  This is actually ok, so long as this is solely a debugging tool and it's "ok" if it fails once in a while.  Failing 8 of 10 startups though was obviously unacceptable.
Comment 8 [:jesup] on pto until 2016/7/5 Randell Jesup 2011-06-19 19:36:03 PDT
Created attachment 540377 [details] [diff] [review]
Patch - one more update

One last (I hope) update to cover another very indirect way we call fork() (the filepicker, which ~25 frames deep calls fork() from dbus).

There will be more I haven't found.  That's ok - jprof is a tool; and most uses of it won't hit any of these forks, OR the user will be either using a longer period (the short period greatly increases the odds of the i-loop trying to clone() the memory map), OR the period will be short but the process won't be as huge as my testcase (bigger the VM map, the bigger the chance of this biting you).  Even now, the reason I even noticed there was an issue is that it was biting me when profiling the startup code; normally you kill -PROF it right before your test, and -USR1 right after.
Comment 9 [:jesup] on pto until 2016/7/5 Randell Jesup 2011-06-21 23:40:51 PDT
Created attachment 540977 [details] [diff] [review]
Fix bitrot due to bug 552864

Bug 552864 changed how firefox gets started and shut down, and for example broke the use of atexit() to dump the address map.  This update to the patch gets things working again.
Comment 10 [:jesup] on pto until 2016/7/5 Randell Jesup 2011-07-02 19:46:27 PDT
I've found one small hole jprofing startup where it if we got a SIGPROF while inside a malloc pthread lock, and backtrace() needs to load a library and it calls malloc() you can get deadlocked.  I've only hit it once, and there's not too much one can do about it (probably force it to load or statically load the library in question before enabling the setitimers).  I don't want to hold this patch (which makes profiling startup usable) for that edge case.

dbaron: any chance of a review before lockdown?
Comment 11 David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) 2011-08-11 12:16:28 PDT
Comment on attachment 540977 [details] [diff] [review]
Fix bitrot due to bug 552864

I should have decided to do this ages ago, but given my review backlog, and the failures of my attempts to find somebody else interested in reviewing jprof code, I'm just going to rubber stamp this r=dbaron.  Sorry for taking so long to decide to do that.
Comment 12 [:jesup] on pto until 2016/7/5 Randell Jesup 2011-08-12 10:03:47 PDT
checked in as http://hg.mozilla.org/mozilla-central/rev/71c422d27ed4

Note You need to log in before you can comment on or make changes to this bug.