Closed Bug 799805 Opened 12 years ago Closed 12 years ago

Failed fork (possibly during startup)

Categories

(Firefox OS Graveyard :: General, defect, P2)

x86
macOS
defect

Tracking

(blocking-basecamp:+, firefox18 fixed)

RESOLVED FIXED
blocking-basecamp +
Tracking Status
firefox18 --- fixed

People

(Reporter: gwagner, Assigned: mwu)

References

Details

(Keywords: dogfood)

Attachments

(4 files, 3 obsolete files)

There is a zombie process on my otoro:
Application        PID      Vss      Rss      Pss      Uss  cmdline
b2g                106   89700K   75480K   63714K   52144K  /system/b2g/b2g
Gecko_IOThread     422   64304K   50088K   40570K   31192K  /system/b2g/b2g
Browser            710   19564K   19560K   17141K   14932K  /system/b2g/plugin-container

It might have happened during phone restart but I am not 100% sure.

bt: http://pastebin.mozilla.org/1862535

#4  0x4010bd98 in fork () at bionic/libc/bionic/fork.c:58
58	        __bionic_atfork_run_child();
(gdb) list
53	         * Call cpuacct_add passing in our uid, which will take
54	         * the current task id and add it to the uid group passed
55	         * as a parameter.
56	         */
57	        cpuacct_add(getuid());
58	        __bionic_atfork_run_child();
59	    }
60	    return ret;
61	}
One thing to note here is that we're wrapping pthread_atfork(), but this version of bionic supports it.  May or may not be related.

__bionic_atfork_run_child() ends up notifying its callbacks, and then does

    pthread_mutex_unlock(&handler_mutex);

before returning.  This appears to be where the forked task is getting stuck.

There's a corresponding __bionic_atfork_run_prepare() that's run from the bionic fork() impl and does

    pthread_mutex_lock(&handler_mutex);

Nothing jumps out at me :/.

This is an extremely bad bug because it ties up memory in such a way that it can only be reclaimed by rebooting.  This won't be obvious to dogfooders.

This may also be the cause of some of the issues QA sees during smoketests.  Would be great to get STR, but this looks like some race condition.
Severity: normal → critical
blocking-basecamp: --- → +
Keywords: dogfood
(In reply to Chris Jones [:cjones] [:warhammer] from comment #1)
> This is an extremely bad bug because it ties up memory in such a way that it
> can only be reclaimed by rebooting. 

(Through on-device UI, I mean.)
It would also be really good to see the output of b2g-ps (or ps) when this happens. This would also help to establish if it really is a zombie process or not.

The fact that we're seeing Gecko_IOThread as the process name could just be a simple race condition on when prctl happens and not actually mean anything significant.

In any event, we should be reaping these if it is, in fact, a zombie.
http://pastebin.mozilla.org/1862535 is the backtrace from that process, it's most definitely hung in fork().  (Failed prctl() was my first hope too ;).)
This feels like a race where we're forking while another thread is holding the __recursive_lock (bionic/libc/bionic/pthread.c:1015)

Then when it tries to acquire the lock in the forked child, it can't.

So this isn't a zombie process in the normal parlance of what linux calls zombie processes. This is a process which has deadlocked itself.

I think that the only way to make this work is to do one of these:
1 - acquire the _recursive_lock before the fork and release it after in both the parent and the child
2 - in the child, reinitialize the recursive_lock to make it appear to be unlocked, since there can't possibly be any other thread holding it. This seems a little dicey to me since that implies that some other data structures might be in an inconsistent state.
(In reply to Dave Hylands [:dhylands] from comment #5)
> This feels like a race where we're forking while another thread is holding
> the __recursive_lock (bionic/libc/bionic/pthread.c:1015)
> 
> Then when it tries to acquire the lock in the forked child, it can't.
> 

Hah!  That sure looks right.

A threading bug in bionic, unpossible!

> I think that the only way to make this work is to do one of these:
> 1 - acquire the _recursive_lock before the fork and release it after in both
> the parent and the child

This is the right solution if it's done inside bionic itself.

> 2 - in the child, reinitialize the recursive_lock to make it appear to be
> unlocked, since there can't possibly be any other thread holding it. This
> seems a little dicey to me since that implies that some other data
> structures might be in an inconsistent state.

This isn't much different than (1).  We can't touch thread-shared data after fork() anyway, unless we explicitly acquire all the protecting mutexes in atfork.

Unfortunately, we don't have the capability to do either from within gecko :(.  We can patch this at build-time if needed, I guess.  Will think on an evil solution for insurance.

Would be interesting to see if this is fixed in JB/master bionic.
Dave, assigning to you since it seems like you're working on this.  Feel free to re-assign to nobody if that's incorrect.
Assignee: nobody → dhylands
We need some advice from our friends on this.
If dhyland's guess is correct (which I'm pretty sure it is), then we can make this bug much more frequent by having a background thread continually spin acquiring/releasing an errorcheck or recursive mutex.
I can reproduce this very easily with a small C testcase.  Happens with about 1% frequency.
Attached file Testcase
STR
 (1) Download this file
 (2) cd $b2g
 (3) unzip forking-hell.zip
 (4) ./build.sh forking-hell
 (5) adb remount
 (6) adb push out/target/product/otoro/system/bin/forking-hell /system/bin/
 (7) adb shell forking-hell 1000 &
 (8) (Wait for message)
 (9) adb shell ps | grep forking-hell

If you see any forking-hell processes left over in step (9), then you've hit /a/ bionic bug.

I'm pretty sure the test code is correct because it works 100% with a real libc^W^W glibc.

I hope we're hitting the bug here.  gdb having some issues with the child processes.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #12)
> Created attachment 670219 [details]
> Testcase
> 
> If you see any forking-hell processes left over in step (9), then you've hit
> /a/ bionic bug.
> 

Correction: if you see *more than one* forking-hell process.
Attached patch Sketch of a fixSplinter Review
This is a little gross but does the trick.  Need to find out how to get this upstreamed.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #11)
> I can reproduce this very easily with a small C testcase.  Happens with
> about 1% frequency.

To clarify, running

$ adb shell forking-hell 1000 &

reproduces hung subprocesses 100%.

The number of hung subprocesses is quite reliably about 1% of the number forked by the test.
It looks like bionic has proper recursive mutex support in jellybean:
https://github.com/android/platform_bionic/blob/master/libc/bionic/pthread.c#L1132

I built forking-hell and tried it on the jellybean emulator, and it is working as expected. It looks like we might only need this patch temporarily until we can upgrade to jellybean's bionic.
I was going to try that, but I see you beat me to it. Good work.
What's the plan for upgrading to JB's bionic?  If we didn't fix this, would the result be that we use more memory?
Flags: needinfo?(dhylands)
Priority: -- → P2
I'm wondering if we can hit this on android...
I don't see why not. It's really just a case of forking at an inopportune time.
Flags: needinfo?(dhylands)
Failing to fix this could cause random fork hangs, which translates roughly into random failures of launching apps.

From the B2G perspective, we could either use the workaround or upgrade to JB's bionic.

For fennec though, it has to use the bionic on the phone. In that case I think it would either need to provide its own version of bionic, or at least its own version of pthreads, or come up with some other creative workaround.
Upgrading to JB bionic isn't on the table.  We need to apply this patch during our builds.

Fennec doesn't fork() anymore so it's unaffected.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #22)
> Fennec doesn't fork() anymore so it's unaffected.

Are plugins in-process on Fennec? Also, I keep hearing that eventually, fennec will be multi-process again; surely that means it will hit this problem.
You mean flash? ;)  It's not possible to move OOP on android.  I haven't heard plans about fennec doing OOP content again, so let's cross that bridge when we come to it.
Attached patch Avoid bionic fork() (obsolete) — Splinter Review
Attachment #675363 - Flags: review?(mh+mozilla)
Attachment #675363 - Flags: feedback?(dhylands)
Comment on attachment 675363 [details] [diff] [review]
Avoid bionic fork()

Can you prove that there are no bionic consumers of atfork?  Otherwise we'll get same-but-different problems.
Attachment #675363 - Flags: review?(mh+mozilla) → review-
Comment on attachment 675363 [details] [diff] [review]
Avoid bionic fork()

mwu@delta 0% find . -iname "*.c" | xargs grep fork                 ~/B2G/bionic
./libc/bionic/fork.c:extern int  __fork(void);
./libc/bionic/fork.c:int  fork(void)
./libc/bionic/fork.c:    /* Posix mandates that the timers of a fork child process be
./libc/bionic/fork.c:    __bionic_atfork_run_prepare();
./libc/bionic/fork.c:    ret = __fork();
./libc/bionic/fork.c:        __bionic_atfork_run_parent();
./libc/bionic/fork.c:        __bionic_atfork_run_child();
./libc/bionic/malloc_debug_common.c: * In a VM process, this is set to 1 after fork()ing out of zygote.
./libc/bionic/malloc_debug_qemu.c:     * is doing allocation has been forked from the process that initialized
./libc/bionic/pthread-atfork.c:struct atfork_t
./libc/bionic/pthread-atfork.c:    CIRCLEQ_ENTRY(atfork_t) entries;
./libc/bionic/pthread-atfork.c:static CIRCLEQ_HEAD(atfork_head_t, atfork_t) atfork_head = \
./libc/bionic/pthread-atfork.c:    CIRCLEQ_HEAD_INITIALIZER(atfork_head);
./libc/bionic/pthread-atfork.c:void __bionic_atfork_run_prepare()
./libc/bionic/pthread-atfork.c:    struct atfork_t *cursor;
./libc/bionic/pthread-atfork.c:    /* Call pthread_atfork() prepare handlers.  Posix states that the prepare
./libc/bionic/pthread-atfork.c:    for (cursor = atfork_head.cqh_last;
./libc/bionic/pthread-atfork.c:         cursor != (void*)&atfork_head;
./libc/bionic/pthread-atfork.c:void __bionic_atfork_run_child()
./libc/bionic/pthread-atfork.c:    struct atfork_t *cursor;
./libc/bionic/pthread-atfork.c:    /* Call pthread_atfork() child handlers */
./libc/bionic/pthread-atfork.c:    for (cursor = atfork_head.cqh_first;
./libc/bionic/pthread-atfork.c:         cursor != (void*)&atfork_head;
./libc/bionic/pthread-atfork.c:void __bionic_atfork_run_parent()
./libc/bionic/pthread-atfork.c:    struct atfork_t *cursor;
./libc/bionic/pthread-atfork.c:    /* Call pthread_atfork() parent handlers */
./libc/bionic/pthread-atfork.c:    for (cursor = atfork_head.cqh_first;
./libc/bionic/pthread-atfork.c:         cursor != (void*)&atfork_head;
./libc/bionic/pthread-atfork.c:int pthread_atfork(void (*prepare)(void), void (*parent)(void), void(*child)(void))
./libc/bionic/pthread-atfork.c:    struct atfork_t *entry = malloc(sizeof(struct atfork_t));
./libc/bionic/pthread-atfork.c:    CIRCLEQ_INSERT_TAIL(&atfork_head, entry, entries);
./libc/bionic/pthread-timers.c: * Note also an important thing: Posix mandates that in the case of fork(),
./libc/bionic/pthread-timers.c: * this is implemented by providing a fork() wrapper (see bionic/fork.c) which
./libc/bionic/pthread-timers.c: * stops all timers before the fork, and only re-start them in case of error
./libc/bionic/pthread-timers.c: ** this should be called from the 'fork()' wrapper to stop/start
./libc/bionic/pthread-timers.c: ** requirements: the timers of fork child processes must be
./libc/bionic/pthread.c: *       or calls fork()
./libc/unistd/daemon.c:   pid = fork();
./libc/unistd/popen.c:	switch (pid = fork()) {
./libc/unistd/popen.c:		 * We fork()'d, we got our own copy of the list, no
./libc/unistd/system.c:	switch (pid = vfork()) {

The only parts of bionic that calls fork are daemon and popen. daemon isn't called in gecko and popen is only called on maemo. vfork is direct system call.
Attachment #675363 - Flags: review- → review?(jones.chris.g)
Comment on attachment 675363 [details] [diff] [review]
Avoid bionic fork()

One more issue: this doesn't do the __timer_table_start_stop preparation that bionic does.

Why can we skip that?
Attachment #675363 - Flags: review?(jones.chris.g)
Comment on attachment 675363 [details] [diff] [review]
Avoid bionic fork()

Review of attachment 675363 [details] [diff] [review]:
-----------------------------------------------------------------

I like your solution. I think that with the addition of the calls to __timer_table_start_stop() (as cjones mentioned), then this should work for fennec as well as b2g.
Attachment #675363 - Flags: feedback?(dhylands)
Incidently, I ran into this particular hang today by just launching apps and returning to the home screen.
(In reply to Dave Hylands [:dhylands] from comment #29)
> Comment on attachment 675363 [details] [diff] [review]
> Avoid bionic fork()
> 
> I like your solution. I think that with the addition of the calls to
> __timer_table_start_stop() (as cjones mentioned), then this should work for
> fennec as well as b2g.

I agree, I prefer working around things in gecko.

However, that symbol is __attribute__((hidden)), unfortunately.  The argument has to be why we can do without them.  (I think that's going to be hard to make :/.)
(In reply to Chris Jones [:cjones] [:warhammer] from comment #28)
> Comment on attachment 675363 [details] [diff] [review]
> Avoid bionic fork()
> 
> One more issue: this doesn't do the __timer_table_start_stop preparation
> that bionic does.
> 
> Why can we skip that?

That stops timers that are part of the timer_create/timer_settime/etc. API. We do not use that API - we use set/getitimer instead. That api does not require timers to be stopped when forking.

(In reply to Dave Hylands [:dhylands] from comment #29)
> Comment on attachment 675363 [details] [diff] [review]
> Avoid bionic fork()
> 
> Review of attachment 675363 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I like your solution. I think that with the addition of the calls to
> __timer_table_start_stop() (as cjones mentioned), then this should work for
> fennec as well as b2g.

__timer_table_start_stop also calls pthread_mutex_lock, which makes it unsafe.
(In reply to Michael Wu [:mwu] from comment #32)
> (In reply to Chris Jones [:cjones] [:warhammer] from comment #28)
> > Comment on attachment 675363 [details] [diff] [review]
> > Avoid bionic fork()
> > 
> > One more issue: this doesn't do the __timer_table_start_stop preparation
> > that bionic does.
> > 
> > Why can we skip that?
> 
> That stops timers that are part of the timer_create/timer_settime/etc. API.
> We do not use that API - we use set/getitimer instead. That api does not
> require timers to be stopped when forking.

We load an unknown set of vendor .so's at runtime.  Do all of those not use that API?
(In reply to Chris Jones [:cjones] [:warhammer] from comment #33)
> We load an unknown set of vendor .so's at runtime.  Do all of those not use
> that API?

No library that b2g currently loads uses timer_create. I checked on an unagi.

The only one that we might cause problems is libnfc, but that isn't v1.
Since we're preloading libmozglue.so we could provide an implementation of timer_create et al which ASSERTs if called.

That will catch any vendor .so's etc which might call those APIs
(In reply to Michael Wu [:mwu] from comment #34)
> (In reply to Chris Jones [:cjones] [:warhammer] from comment #33)
> > We load an unknown set of vendor .so's at runtime.  Do all of those not use
> > that API?
> 
> No library that b2g currently loads uses timer_create. I checked on an unagi.
> 
> The only one that we might cause problems is libnfc, but that isn't v1.

I don't believe you're seeing the full set of libraries that might be loaded for a v1 product.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #36)
> (In reply to Michael Wu [:mwu] from comment #34)
> > (In reply to Chris Jones [:cjones] [:warhammer] from comment #33)
> > > We load an unknown set of vendor .so's at runtime.  Do all of those not use
> > > that API?
> > 
> > No library that b2g currently loads uses timer_create. I checked on an unagi.
> > 
> > The only one that we might cause problems is libnfc, but that isn't v1.
> 
> I don't believe you're seeing the full set of libraries that might be loaded
> for a v1 product.

I grep'd all of /system/lib from an android unagi image. What else do you suggest checking?
Unfortunately, any vendor lib we might encounter during the v1 cycle :(.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #38)
> Unfortunately, any vendor lib we might encounter during the v1 cycle :(.

The odds of this happening are fairly low and detectable if it does happen.

1. setitimer/getitimer is more commonly used than timer_create. 
2. Current vendor dsos that are loaded tend towards HAL duties, which tends to not run timers. However, nearly every current use of timer_create is for daemons running separate from B2G. The one exception is for libnfc.
3. Vendor dsos that are loaded as xpcom components and such should be using gecko timer APIs and blocking the use of timer_create now can only help.

And if it really matters, we can implement a safe version of timer_create and friends, assuming that posix compliance matters before we exec and kill everything anyway.
Comment on attachment 675363 [details] [diff] [review]
Avoid bionic fork()

I'm afraid that implementing a separate timer_create and then friends is going to be a tar baby of "dark matter" compatibility.  Trying to share with fennec across N bionic versions will just make things worse, I fear.

Perhaps our friends have an opinion, if they feel at liberty to speak on an A2-only issue.
Attachment #675363 - Flags: review-
Attachment #675363 - Flags: feedback?(mvines)
Assignee: dhylands → mwu
Attached patch Avoid bionic fork(), v2 (obsolete) — Splinter Review
This contains a copy of pthread-timers.c from bionic so we support everything exactly the same as bionic. The only change was to explicitly export the required symbols. The wrapping on some symbols was also removed so LD_PRELOAD can force every library to use the mozglue versions. This doesn't change anything on Android builds so there is no risk there.
Attachment #675363 - Attachment is obsolete: true
Attachment #675363 - Flags: feedback?(mvines)
Attachment #675699 - Flags: review?(jones.chris.g)
Comment on attachment 675699 [details] [diff] [review]
Avoid bionic fork(), v2

After calling a consult, there appear to be no issues on the horizon with timer_create.

However, since no one is using it, the idea was mooted that to avoid potential subtle compat issues entirely, just have the impls fatally assert.  I'm in favor of that plan.  We can just add the fatal asserts to the public entry points and leave the remainder of the support intact in case we need to unfreeze it in the future.
Attachment #675699 - Flags: review?(jones.chris.g)
Attached patch Avoid bionic fork(), v3 (obsolete) — Splinter Review
This version abort()s if someone calls timer_create.
Attachment #675699 - Attachment is obsolete: true
Attachment #675727 - Flags: review?(jones.chris.g)
Comment on attachment 675727 [details] [diff] [review]
Avoid bionic fork(), v3

>diff --git a/configure.in b/configure.in

>+    if test -z "$gonkdir"; then
>+        WRAP_LDFLAGS="${WRAP_LDFLAGS} -Wl,--wrap=fork,--wrap=pthread_atfork,--wrap=raise"
>+    fi

You're removing raise() from the android wrap list, which I don't think is intentional.

Do we have to make this gonk-only because android can't find __fork()?
If there's no reason like that to make it gonk-only, I'd prefer we
didn't.

>diff --git a/mozglue/build/BionicGlue.cpp b/mozglue/build/BionicGlue.cpp

>+#else
>+#define __fork() fork()
>+#define cpuacct_add(x)
>+#define WRAP(x) __wrap_##x
>+#endif

Append |// ifdef MOZ_WIDGET_GONK| to this line.

r=me with those addressed.
Attachment #675727 - Flags: review?(jones.chris.g) → review+
(In reply to Chris Jones [:cjones] [:warhammer] from comment #6)
> Unfortunately, we don't have the capability to do either from within gecko
> :(.  We can patch this at build-time if needed, I guess.  Will think on an
> evil solution for insurance.
> 
> Would be interesting to see if this is fixed in JB/master bionic.

Hi Chris and Michael,
1. As later comment#16, android JB seems already fixed this in this commit.
https://github.com/android/platform_bionic/commit/faca92f2f17cea192c5fbde4d869aa7620315196

2. As I know from some partners using ICS bionic, their workaround to fix this bug is to rollback below commit's change in libc/bionic/fork.c -- don't use pthread_atfork() 
https://github.com/android/platform_bionic/commit/4f086aeb4aa06e13079b7fec71a8178ceeacf318

3. Even though we don't use bionic fork, what if partners use their library in hal, which uses bionic fork? It might also hit this bug. Considering (2) and this, could we just patch this at build time?

4. By the way, bionic dlmalloc would also hit this bug in child process as some partners met before. Maybe we have to apply this patch during our builds, too.
https://github.com/android/platform_bionic/commit/3e2d2936b0447ed2f0b0aab3625494b2533cd422
(In reply to Chris Jones [:cjones] [:warhammer] from comment #44)
> Comment on attachment 675727 [details] [diff] [review]
> Avoid bionic fork(), v3
> 
> >diff --git a/configure.in b/configure.in
> 
> >+    if test -z "$gonkdir"; then
> >+        WRAP_LDFLAGS="${WRAP_LDFLAGS} -Wl,--wrap=fork,--wrap=pthread_atfork,--wrap=raise"
> >+    fi
> 
> You're removing raise() from the android wrap list, which I don't think is
> intentional.
> 

We're only wrapping these three functions when on android - test -z "$gonkdir". Gonk overrides these functions directly via LD_PRELOAD.

> Do we have to make this gonk-only because android can't find __fork()?
> If there's no reason like that to make it gonk-only, I'd prefer we
> didn't.
> 

Ok, sure.
This version always uses __fork
Attachment #675727 - Attachment is obsolete: true
https://tbpl.mozilla.org/?tree=Try&rev=bd0a62e2825a

Will land when things look good.
Priority: P2 → --
Priority: -- → P2
https://hg.mozilla.org/mozilla-central/rev/9e99e11d6ef6
Status: NEW → RESOLVED
Closed: 12 years ago
Flags: in-testsuite-
Resolution: --- → FIXED
Blocks: 884239
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: