Last Comment Bug 680190 - jemalloc can deadlock during IPC fork on Android
: jemalloc can deadlock during IPC fork on Android
Status: RESOLVED FIXED
[inbound]
:
Product: Core
Classification: Components
Component: IPC (show other bugs)
: Trunk
: ARM Android
: -- normal (vote)
: mozilla9
Assigned To: Jim Chen [:jchen] [:darchons]
:
:
Mentors:
Depends on:
Blocks: 678381 683799
  Show dependency treegraph
 
Reported: 2011-08-18 11:35 PDT by Jim Chen [:jchen] [:darchons]
Modified: 2011-09-01 21:32 PDT (History)
9 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Possible fix (3.95 KB, patch)
2011-08-18 11:42 PDT, Jim Chen [:jchen] [:darchons]
cjones.bugs: review+
jasone: feedback+
Details | Diff | Splinter Review
Temporary fix (4.42 KB, patch)
2011-08-30 11:02 PDT, Jim Chen [:jchen] [:darchons]
nchen: review+
Details | Diff | Splinter Review

Description Jim Chen [:jchen] [:darchons] 2011-08-18 11:35:15 PDT
The problem is Android didn't have pthread_atfork until 2.3, so we specifically don't call it. (http://mxr.mozilla.org/mozilla-central/source/memory/jemalloc/jemalloc.c#5577)

If another thread locks jemalloc during the IPC fork, the new child process will have jemalloc permanently locked. This is fine if we call exec immediately afterwards. However, before exec, we do stuff with file descriptors which involves stl containers that can potentially call jemalloc (http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/process_util_linux.cc#56), thus causing a deadlock in the child process if all the conditions are met.

The symptom is a blank page when loading. The parent is probably still responsive, but the child is not. I've only seen it when running talos on a tegra. It's possible that one of the talos bug is caused by this (also I think up until now this bug has been overshadowed by bug 662936)

On a side note, mxr says we can call pthread_atlock in NSS too (http://mxr.mozilla.org/mozilla-central/source/security/nss/lib/softoken/pkcs11.c). But I don't know the effects of this.
Comment 1 Jim Chen [:jchen] [:darchons] 2011-08-18 11:42:12 PDT
Created attachment 554161 [details] [diff] [review]
Possible fix

This is one possible fix. It's pretty ugly though because it exposes two private functions from jemalloc.
Comment 2 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-18 11:52:39 PDT
bionic strikes again!  Nice analysis, and boo on folks blindly #ifdef'ing stuff out.

I'm not sure even this hacky fix this right; it appears to me that the jemalloc mutexes aren't initialized to be reentrant on linux, so if fork() itself malloc's, then we'll deadlock there.  fork() is specified to be async-signal-safe, so theoretically we can assume fork() doesn't malloc, but this being bionic who knows.  Worth checking bionic, or implementing our own fork() ... not out of the question.
Comment 3 Jim Chen [:jchen] [:darchons] 2011-08-24 11:32:52 PDT
(In reply to Chris Jones [:cjones] [:warhammer] from comment #2)
> bionic strikes again!  Nice analysis, and boo on folks blindly #ifdef'ing
> stuff out.
> 
> I'm not sure even this hacky fix this right; it appears to me that the
> jemalloc mutexes aren't initialized to be reentrant on linux, so if fork()
> itself malloc's, then we'll deadlock there.  fork() is specified to be
> async-signal-safe, so theoretically we can assume fork() doesn't malloc, but
> this being bionic who knows.  Worth checking bionic, or implementing our own
> fork() ... not out of the question.

I think we will be okay? If fork mallocs it will be using the bionic malloc, not jemalloc. It is kind of scary though. fork() is at http://android.git.kernel.org/?p=platform/bionic.git;a=blob_plain;f=libc/bionic/fork.c;hb=HEAD (the __bionic_atfork_ lines are added in 2.3 and don't do us any good)

jmaher is seeing this on talos now, so we should get this fixed soon.
Comment 4 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-24 13:57:42 PDT
(In reply to Jim Chen [:jchen] (mobile intern :) from comment #3)
> (In reply to Chris Jones [:cjones] [:warhammer] from comment #2)
> > bionic strikes again!  Nice analysis, and boo on folks blindly #ifdef'ing
> > stuff out.
> > 
> > I'm not sure even this hacky fix this right; it appears to me that the
> > jemalloc mutexes aren't initialized to be reentrant on linux, so if fork()
> > itself malloc's, then we'll deadlock there.  fork() is specified to be
> > async-signal-safe, so theoretically we can assume fork() doesn't malloc, but
> > this being bionic who knows.  Worth checking bionic, or implementing our own
> > fork() ... not out of the question.
> 
> I think we will be okay? If fork mallocs it will be using the bionic malloc,
> not jemalloc.

As long as it only calls malloc() (or things that call malloc()) directly.

> It is kind of scary though. fork() is at
> http://android.git.kernel.org/?p=platform/bionic.git;a=blob_plain;f=libc/
> bionic/fork.c;hb=HEAD (the __bionic_atfork_ lines are added in 2.3 and don't
> do us any good)

Android's cpuacct_add() calls fopen, fclose, and fprintf(), none of which are async-signal-safe.  (Totally unnecessarily; could have equivalently used open/close/write.)  That means we really do have to guarantee that malloc() in bionic always resolves to bionic malloc().  If it doesn't, and resolves to jemalloc malloc(), then we *will* deadlock.  On the other hand, that's nice too, because we'll immediately know if we mess up.

> jmaher is seeing this on talos now, so we should get this fixed soon.

Want to keep driving this patch in?
Comment 5 Jim Chen [:jchen] [:darchons] 2011-08-24 14:49:49 PDT
Comment on attachment 554161 [details] [diff] [review]
Possible fix

(In reply to Chris Jones [:cjones] [:warhammer] from comment #4)
> (In reply to Jim Chen [:jchen] (mobile intern :) from comment #3)
> 
> > It is kind of scary though. fork() is at
> > http://android.git.kernel.org/?p=platform/bionic.git;a=blob_plain;f=libc/
> > bionic/fork.c;hb=HEAD (the __bionic_atfork_ lines are added in 2.3 and don't
> > do us any good)
> 
> Android's cpuacct_add() calls fopen, fclose, and fprintf(), none of which
> are async-signal-safe.  (Totally unnecessarily; could have equivalently used
> open/close/write.)  That means we really do have to guarantee that malloc()
> in bionic always resolves to bionic malloc().  If it doesn't, and resolves
> to jemalloc malloc(), then we *will* deadlock.  On the other hand, that's
> nice too, because we'll immediately know if we mess up.
> 
Good find. So far I've only seen bionic functions using bionic malloc. Hope it stays this way.

> > jmaher is seeing this on talos now, so we should get this fixed soon.
> 
> Want to keep driving this patch in?

Yeah, although a custom pthread_atfork() and fork() wrapper shouldn't be much work either. Asking for review, and feedback from Jason on exposing the jemalloc functions.
Comment 6 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-08-24 15:56:26 PDT
Comment on attachment 554161 [details] [diff] [review]
Possible fix

This is pretty horrible, but probably less work than trying to maintain our own fork() that integrates properly with bionic.  Three things

+#ifdef ANDROID
+  ::_malloc_prefork();
+#endif

(1) Add a comment explaining this hack here.

(2) These calls need to be conditional on MOZ_MEMORY.

(3) File a followup bug for removing this, after we only care about 2.3 and above.  (And 3.[N] and above, if the bug exists in earlier 3.x.)
Comment 7 Jason Evans 2011-08-24 21:19:40 PDT
Comment on attachment 554161 [details] [diff] [review]
Possible fix

Looks fine to me.
Comment 8 Mike Hommey [:glandium] 2011-08-25 00:57:58 PDT
Please leave the _malloc_* functions static on !android.
Comment 9 Jim Chen [:jchen] [:darchons] 2011-08-30 11:02:36 PDT
Created attachment 556921 [details] [diff] [review]
Temporary fix

Addressed comments from cjones and glandium. Carried over cjones' r+.
Comment 11 Marco Bonardo [::mak] 2011-08-31 02:14:20 PDT
http://hg.mozilla.org/mozilla-central/rev/157511556e27

Note You need to log in before you can comment on or make changes to this bug.