Closed Bug 772734 Opened 12 years ago Closed 12 years ago

LaunchApp may hang

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dhylands, Assigned: dhylands)

References

Details

Attachments

(2 files, 4 obsolete files)

The ipc/chromium/src/base/process_util_linux.cc version of LaunchApp can hang.

This can happen since it calls setenv after the fork and before the exec, and setenv may allocate memory. If setenv happens to want to allocate memory and the heap was locked at the time of the fork, then setenv will hang.

The code needs to be changed to use execvpe or execve. Android doesn't seem to have an execvpe, so that means that we'd need to use execve and do the path search ourselves. execve is an actual syscall, so there shouldn't be an runtime library allocations.
Assignee: nobody → dhylands
Actually, I'd be inclined to make the code call execvpe and implement execvpe for android by copying/modifying the appropriate code from the bionic runtime.
That would break a handful of little debugging tools that rely on env vars :/.  Thinking.
I don't follow. We'd still be passing in the exact same environment. It's just that with the e versions of the exec functions you pass in the array already formatted.

We can still export the entire environment, it's just that you pass it in explicitly rather than implicity.

We can do any required allocation before the fork.
This backtrace proves my hypothesis of the problem.
Yes, but technically we're not supposed to touch |environ|.  With things like binary extensions and plugins loaded, that could lead to security bugs.

I can't think of any other (correct) way to do this that won't subtly break one of our platform-specific hacks, so maybe that's the best we can do :/.  Will think some more.
Yeah - but if we only touch environ from within the implementation of execvpe, I don't see any harm done. The execvpe function that we provide would only be provided with bionic platforms and will only use stuff that could be used from bionic.

For glibc based builds, we'd just use the execvpe that's provided.
Never mind - I see the problem. The other option is to do the setenv's before the fork. I'll poke around some more too.
That has the same kind of problem, unfortunately :(.  If some other thread dlopen()s while we have LD_LIBRARY_PATH temporarily set, e.g., that could cause a security problem too.
Actually, the problem is bigger. If *anybody* calls setenv or getenv in a multi-threaded  environment, then there are issues. Any call to setenv will potentially invalidate any pointers retrieved from getenv.

The bionic implementations of getenv and setenv are not thread safe.
Gecko code uses PR_SetEnv()/PR_GetEnv() for just that reason.
Doing an MXR search on getenv and setenv says otherwise.

If it is in fact true, that PR_GetEnv and PR_SetEnv are used everywhere, then there's should be no reason why we can't add a PR_DuplicateEnvironment function which returns something like an a copy of envron, and a version of PR_SetEnv which can set a variable into an environment instance. This could then be passed to execvpe/execve and you'd have no security holes.
Another way to workaround this would be to acquire the heap mutex prior to the fork and release it in both the child and the parent. Then the forked heap will be in a guaranteed consistent state.
bug 771765 would lift the need to setenv at all...
(In reply to Dave Hylands [:dhylands] from comment #11)
> Doing an MXR search on getenv and setenv says otherwise.
> 

Those are bugs.

> If it is in fact true, that PR_GetEnv and PR_SetEnv are used everywhere,
> then there's should be no reason why we can't add a PR_DuplicateEnvironment
> function which returns something like an a copy of envron

Yeah, that sounds pretty good.
(In reply to Dave Hylands [:dhylands] from comment #12)
> Another way to workaround this would be to acquire the heap mutex prior to
> the fork and release it in both the child and the parent. Then the forked
> heap will be in a guaranteed consistent state.

Aha!  This is indeed the issue I remembered.  jemalloc does this when pthread_atfork works; bionic doesn't implement pthread_atfork though.  We ended up shimming our impl.

(I don't think we need to release the locks in the child since we're about to exec().)

So we have two problems here
 - not using jemalloc, which lets us get away with this code being broken
 - this code being broken

I think I'd rather fix the brokenness.
PR_SetEnv uses putenv, which may well malloc just as much as setenv does.
(In reply to Mike Hommey [:glandium] from comment #16)
> PR_SetEnv uses putenv, which may well malloc just as much as setenv does.

In fact, in bionic, it calls strdup and setenv.
The "real bug" here is that we're setenv()'ing after fork().  Pre-forking while there's only one thread might work around that, and jemalloc is kind enough to work around that for us, but it's still wrong.

I like the proposed solution of comment 11.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #18)
> I like the proposed solution of comment 11.

I don't like the idea of adding an API to NSPR for just that. Especially when the "real bug" is that bionic is just broken in so many ways, and that the "workaround" from comment 11 is just really needed for bionic.

Why not just fix bionic instead of adding workarounds over workarounds over workarounds for it?

Or, before setenv'ing, actually checking what's in LD_LIBRARY_PATH, because, as it stands, LD_LIBRARY_PATH *already* contains /system/b2g due to /init.b2g.rc.
For the little story, to avoid having to set LD_LIBRARY_PATH, one could use rpath. Except the bionic linker doesn't support rpath.
Gecko is violating POSIX programming rules.  This bug could manifest in any configuration where we're not using jemalloc.  Adding assumptions on top of assumptions on top of assumptions for minor inconveniences seems worse than just fixing what gecko is doing wrong.

The NSPR patch required to fix this would be small, simple, and generally useful.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #21)
> This bug could manifest in any configuration where we're not using jemalloc.

I don't see how, except for bionic brokenness, this could happen.
Another option is to make plugin-container just work without LD_LIBRARY_PATH being required to be properly set beforehand, like the firefox and b2g binaries do.
(In reply to Mike Hommey [:glandium] from comment #23)
> Another option is to make plugin-container just work without LD_LIBRARY_PATH
> being required to be properly set beforehand, like the firefox and b2g
> binaries do.

It turns out that wouldn't be enough, because there's more than LD_LIBRARY_PATH we set. On Android, we also set ANDROID_PROPERTY_WORKSPACE, and on mac, we set DYLD_LIBRARY_PATH and DYLD_INSERT_LIBRARIES. So yeah, getting a copy of the environment and augmenting it before forking (doing it in the forked process would still hit the malloc problem) is about the only solution, however much it sucks to add an API to NSPR for that.

That being said, it would still be worthwhile to make plugin-container work without LD_LIBRARY_PATH/DYLD_LIBRARY_PATH
Depends on: 773414
Removed execvpe function and modified LauncApp to use execve instead.
cjones says he thinks that we always use absolute pathnames.
Attachment #641991 - Attachment is obsolete: true
I'm 100% sure GeckoChildProcessHost always uses absolute paths, because gecko binaries are never in the PATH.  I'm 99.9% sure we don't have other LaunchApp() consumers.
Try build of this patch and the PR_DuplicateEnvironment patch
https://tbpl.mozilla.org/?tree=Try&rev=cd68233ec698
Attachment #642034 - Flags: review?(jones.chris.g)
Comment on attachment 642034 [details] [diff] [review]
Bug 772734 - Fixes LaunchApp hang when launching plugin-container - v2

>diff --git a/ipc/chromium/src/base/process_util_linux.cc b/ipc/chromium/src/base/process_util_linux.cc

>+class EnvironmentEnvp
>+{
>+public:
>+  EnvironmentEnvp()
>+    : mEnvp(PR_DuplicateEnvironment()) {}

Note: this allocation can fail.

>+
>+  EnvironmentEnvp(const environment_map &em)
>+  {
>+    mEnvp = (char **)PR_Malloc(sizeof(char *) * (em.size() + 1));

This allocation can fail.  If it does, this impl will end up crashing
the parent process.  Let's just |return;| here if it fails.  AsEnvp()
will return null and most likely the content process won't start up.

>+  ~EnvironmentEnvp()
>+  {
>+    for (char **e = mEnvp; *e; ++e) {

|mEnvp| can be null here, as a result of either ctor.

r=me with those changes.
Attachment #642034 - Flags: review?(jones.chris.g) → review+
So then the question becomes, what do you want to happen when mEnvp becomes NULL? I took a look at the kernel sources, and it seems to accept a NULL envp in the execve call.

So we have 2 options:
1 - Pass the NULL through to execve. This seems bad because it might just happen that the execve will work even though the envp allocation failed (unlikely, but possible). Then the child will run successfully, but with the wrong environment.
2 - Detect that envp fails, log an error and return false to the caller of LaunchApp.

I'm going to say we should go with option 2.
Attached patch Now works for android and linux (obsolete) — Splinter Review
Added #ifdef so that the new code is only included for android (which is the only env that adds PR_DuplicateEnvironment in the mozglue patch)

Did a try run and got clean builds (at least most of them had completed by the time I wrote this).
https://tbpl.mozilla.org/?tree=Try&rev=a96f24ef0eb3
Attachment #642034 - Attachment is obsolete: true
Attachment #643255 - Flags: review?(jones.chris.g)
Comment on attachment 643255 [details] [diff] [review]
Now works for android and linux

>diff --git a/ipc/chromium/src/base/process_util_linux.cc b/ipc/chromium/src/base/process_util_linux.cc

>+/* Tempoaray until we have PR_DuplicateEnvironment in prenv.h */

Temporary

>+};
>+#endif /* HAVE_PR_DUPLICATE_ENVIRONMENT */

 // <-- use a line comment

Let's get this landed and some more apps out of process! :)
Attachment #643255 - Flags: review?(jones.chris.g) → review+
Cleaned up nits
Attachment #643255 - Attachment is obsolete: true
Attachment #643257 - Flags: review+
Push backed out for XUL Android bustage:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=41f4bdd4b6bd

https://hg.mozilla.org/integration/mozilla-inbound/rev/7dd75469589b

[And yeah I wish we could just ditch those builds as much as you... :-) ]
(In reply to Ed Morley [:edmorley] from comment #35)
> Push backed out for XUL Android bustage:
> https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=41f4bdd4b6bd
> 
> https://hg.mozilla.org/integration/mozilla-inbound/rev/7dd75469589b
> 
> [And yeah I wish we could just ditch those builds as much as you... :-) ]

Actually, that reveals a problem that would happen on native android builds as well when they start using child processes.
Had to move PR_DuplicateEnvironment into process_util_linux.cc so that the heap function alloc/free functions get wrapped/matched properly on android.
Attachment #643257 - Attachment is obsolete: true
Attachment #643485 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/b54897b245b5
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: