Last Comment Bug 664453 - build with --enable-jprof often crashes at startup if JP_START is set
: build with --enable-jprof often crashes at startup if JP_START is set
Product: Core
Classification: Components
Component: General (show other bugs)
: Trunk
: x86 Linux
-- normal (vote)
: ---
Assigned To: Randell Jesup [:jesup]
Depends on: 666501
  Show dependency treegraph
Reported: 2011-06-15 08:58 PDT by Randell Jesup [:jesup]
Modified: 2011-08-12 10:03 PDT (History)
2 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---

Patch that defers initing jprof until we launch the real app (2.88 KB, patch)
2011-06-17 07:01 PDT, Randell Jesup [:jesup]
no flags Details | Diff | Splinter Review
Patch to defer start of profiling, and pause it around dangerous operations (fork/execv*) (12.72 KB, patch)
2011-06-19 16:51 PDT, Randell Jesup [:jesup]
no flags Details | Diff | Splinter Review
Patch - one more update (15.46 KB, patch)
2011-06-19 19:36 PDT, Randell Jesup [:jesup]
no flags Details | Diff | Splinter Review
Fix bitrot due to bug 552864 (17.83 KB, patch)
2011-06-21 23:40 PDT, Randell Jesup [:jesup]
dbaron: review+
Details | Diff | Splinter Review

Description User image Randell Jesup [:jesup] 2011-06-15 08:58:43 PDT
Frequently (50-80%) firefox will crash at startup if JP_START is set in the JPROF_FLAGS; may be worse with high interrupt rates.  Probably a timing issue with setting the signal handler.
Comment 1 User image Randell Jesup [:jesup] 2011-06-17 07:01:04 PDT
Created attachment 540051 [details] [diff] [review]
Patch that defers initing jprof until we launch the real app

Defers jprof startup until we have a profile (etc); also pause jprof when we're about to launch another app of any type.  Works much more reliably - I've been unable to break it now.
Comment 2 User image David Baron :dbaron: ⌚️UTC-8 2011-06-17 14:14:53 PDT
Could you explain why this makes sense?  It seems like starting at startup ought to work.
Comment 3 User image Randell Jesup [:jesup] 2011-06-17 14:29:22 PDT
Starting it earlier causes random (frequent) crashes in library opens, things like segfaults, failures to catch the signal, and errors like this:

Inconsistency detected by dl-open.c: 260: dl_open_worker: Assertion `_dl_debug_initialize (0, args->nsid)->r_state == RT_CONSISTENT' failed!

Also, profiling the profile manager really is separate from profiling normal app startup. I tried some lesser changes (pausing it before exec-ing the new app, though I left that in) and they didn't solve the problem.
Comment 4 User image Randell Jesup [:jesup] 2011-06-19 00:00:14 PDT
As backup, from the setitimer() man page:


  A child created via fork(2) does not inherit its parent's interval timers. 
  Interval timers are preserved across an execve(2). 

I assume they're preserved across execv*().  Signal handlers (of course) are not, so any combination of setitimer() and execv* is DANGEROUS unless preceded by fork() (and maybe spawn()).

Also, the atexit() to pause profiling and dump the address map won't get called it appears:

From the atexit manpage:

       When  a child process is created via fork(2), it inherits copies of its parent’s registrations.
       Upon a successful call to one of the exec(3) functions, all registrations are removed.

See also and for examples of some of the asynchronous dangers of setitimer() and fork/exec. 

It's simply unsafe to leave the setitimer running when you might call fork() or execv*() (and quite possibly spawn()).

I'm revising the patch to also cover spawning plugin-container processes as well.
Comment 5 User image Randell Jesup [:jesup] 2011-06-19 00:00:57 PDT
Comment on attachment 540051 [details] [diff] [review]
Patch that defers initing jprof until we launch the real app

Pulling review request for now
Comment 6 User image Randell Jesup [:jesup] 2011-06-19 00:19:14 PDT
The fork/exec issue may explain why launching flash (plugin-container) was sometimes failing with complaints about broken pipes in chromium ipc.

I'll note that there are some fork()/execv*() uses I can't easily modify: one in NSS (in safe_popen()), the other in ForkAndExec() in NSPR.
Comment 7 User image Randell Jesup [:jesup] 2011-06-19 16:51:18 PDT
Created attachment 540362 [details] [diff] [review]
Patch to defer start of profiling, and pause it around dangerous operations (fork/execv*)

Ok, updated patch given what I found about spawning subprocesses via fork and execv*.  

This does not hit every use of those calls.  In particular, it doesn't hit nspr or nss (we can file a follow-on bug/patch on them), the updater app isn't touched (but normally isn't jprofed'd anyways), and js.cpp is modified by this patch but the mod is disabled because there's a build-order issue to resolve (jprof.h hasn't been published in dist/include when js is built).  That can also be a follow-on.  These missed changes don't seem to cause much if any problem in general profiling, though there may be cases where they would fail.

In practice, normally the only things that had problems were initial system startup (crashing around profilemanager/etc) and spawning plugin-containers (i-looping when trying to start flash - this likely was the same problem Chromium had that I mentioned above).  Using profiling/setitimer() in a complex program is tricky at best, especially if you don't control directly every use of fork & exec in every library you use.  This is actually ok, so long as this is solely a debugging tool and it's "ok" if it fails once in a while.  Failing 8 of 10 startups though was obviously unacceptable.
Comment 8 User image Randell Jesup [:jesup] 2011-06-19 19:36:03 PDT
Created attachment 540377 [details] [diff] [review]
Patch - one more update

One last (I hope) update to cover another very indirect way we call fork() (the filepicker, which ~25 frames deep calls fork() from dbus).

There will be more I haven't found.  That's ok - jprof is a tool; and most uses of it won't hit any of these forks, OR the user will be either using a longer period (the short period greatly increases the odds of the i-loop trying to clone() the memory map), OR the period will be short but the process won't be as huge as my testcase (bigger the VM map, the bigger the chance of this biting you).  Even now, the reason I even noticed there was an issue is that it was biting me when profiling the startup code; normally you kill -PROF it right before your test, and -USR1 right after.
Comment 9 User image Randell Jesup [:jesup] 2011-06-21 23:40:51 PDT
Created attachment 540977 [details] [diff] [review]
Fix bitrot due to bug 552864

Bug 552864 changed how firefox gets started and shut down, and for example broke the use of atexit() to dump the address map.  This update to the patch gets things working again.
Comment 10 User image Randell Jesup [:jesup] 2011-07-02 19:46:27 PDT
I've found one small hole jprofing startup where it if we got a SIGPROF while inside a malloc pthread lock, and backtrace() needs to load a library and it calls malloc() you can get deadlocked.  I've only hit it once, and there's not too much one can do about it (probably force it to load or statically load the library in question before enabling the setitimers).  I don't want to hold this patch (which makes profiling startup usable) for that edge case.

dbaron: any chance of a review before lockdown?
Comment 11 User image David Baron :dbaron: ⌚️UTC-8 2011-08-11 12:16:28 PDT
Comment on attachment 540977 [details] [diff] [review]
Fix bitrot due to bug 552864

I should have decided to do this ages ago, but given my review backlog, and the failures of my attempts to find somebody else interested in reviewing jprof code, I'm just going to rubber stamp this r=dbaron.  Sorry for taking so long to decide to do that.
Comment 12 User image Randell Jesup [:jesup] 2011-08-12 10:03:47 PDT
checked in as

Note You need to log in before you can comment on or make changes to this bug.