[mozprocess] Allow tracking of processes which fork themselves and leave zombie processes behind

NEW
Unassigned

Status

enhancement
P3
normal
2 years ago
5 months ago

People

(Reporter: whimboo, Unassigned)

Tracking

(Depends on 2 bugs, Blocks 4 bugs)

Version 3
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Lately I was digging through bugs like bug 1421289 and bug 1433905 but so far didn't find a solution in how to get a better support for processes which fork themselves into a new process/session. Right now mozprocess completely looses track of those processes, which is most annoying for Firefox because we use mozprocess in all of our test harnesses to handle Firefox.

Firefox forks itself into a new process when it comes to restarts (update, start in safe mode, ...), which then results in a new session and a parent pid of 1, which is init.

That behavior breaks mozprocess because it looses control over the newly forked process. Instead it keeps track of a zombie process, which exists as long as wait() hasn't been called on it. Once done, we are not able to kill Firefox anytime longer.

In case for Marionette, which supports restart tests, I implemented a workaround a while ago (bug 1176758). Hereby the Marionette component as running inside of Firefox tells the client (which uses mozprocess) about the new pid. It then will be updated in mozprocess, so that we keep tracking the correct process.

Sadly this doesn't work for other external commands like `adb` as noticed on bug 1433905 comment 7.

I have spent a bit of time in trying to find a way to handle that but was not successful so far. Maybe someone could assist and help to figure that out? A testcase showing this problem is here: attachment 8945047 [details].

Karl, I know that you have deep knowledge at least about internals on Linux systems. Maybe you have a pointer? I would kinda appreciate it. Thanks.
Flags: needinfo?(karlt)
(In reply to Henrik Skupin (:whimboo) from comment #0)
> Firefox forks itself into a new process when it comes to restarts (update,
> start in safe mode, ...), which then results in a new session and a parent
> pid of 1, which is init.

I'm surprised Firefox would start a new session.

Is there a way to force this?

I tried "firefox --safe-mode -P new -no-remote" and firefox still has the
shell as PPID.

If you can force the restart, then you can determine whether the shell has a
way to track the process by whether it stays in the foreground when run from
the shell.  If it goes into background, check ps output to see whether ps
thinks the process is related to the shell.

> Sadly this doesn't work for other external commands like `adb` as noticed on
> bug 1433905 comment 7.

I assume applications would call setsid because they no longer want to be
tracked by the parent.  I wonder whether the command has an option to disable
this.

If it leaves stdin or stdout or stderr open then perhaps it could be found
through /proc/<pid>/fd matching a special fd used to start the first process.

ptrace is probably not an ideal solution, because that would prevent attaching
a debugger.

Passing ni to jld, in case he has some suggestions.
Flags: needinfo?(karlt) → needinfo?(jld)
According to searchfox we're not using setsid or setpgid/setpgrp anywhere, so I don't think we're creating a new session in the Unix sense when restarting.  However, adb does do this when daemonizing, which is what setsid is for:

$ ps -jp 13386
  PID  PGID   SID TTY          TIME CMD
13386 13386 13386 ?        00:00:00 adb

cgroups allow tracking all descendents of a process, but that's complicated and I think needs things to be set up as root.  If systemd is being used there might be some way to make use of it for this, but I don't know the details offhand.

pid namespaces are also in this area: a process becomes init(8) for its descendents in essentially all respects, and they're all killed when that process exits, but they have a different mapping from pids to processes, and that will break too many things to be useful in this context.  (Also, using this as a normal user means setting up an unprivileged user namespace, and the last time I tried running Firefox in one of those it crashed, apparently because GTK expected some files to be owned by root but root didn't exist.)  This is probably not useful here.

prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) sets the calling process as a subreaper which adopts any orphaned descendents, similar to init(8); it could then find any new children by traversing procfs (there's probably already a library for this) and wait(2) on them normally.  Note that if the subreaper exits, its children are picked up by init (or the next enclosing subreaper, if any) and not killed.  This seems like the most promising approach.

And I should add that these are all Linux-specific; I don't think there's any portable Unix way to do anything like this.


But also, as a workaround for the specific case of adb: `adb kill-server`.
Flags: needinfo?(jld)
Depends on: 921632
Depends on: 1435820
(In reply to Karl Tomlinson (:karlt) from comment #1)
> (In reply to Henrik Skupin (:whimboo) from comment #0)
> > Firefox forks itself into a new process when it comes to restarts (update,
> > start in safe mode, ...), which then results in a new session and a parent
> > pid of 1, which is init.
> 
> I'm surprised Firefox would start a new session.
> 
> Is there a way to force this?

Maybe this is not happening on Linux, but on OS X I see a parent pid of 1 each time a restart is done by Firefox (eg. install updates, or restart in safe mode). You can also see this when starting Firefox with -ProfileManager. After selecting the profile Firefox also gets a parent id of 1.

> I tried "firefox --safe-mode -P new -no-remote" and firefox still has the
> shell as PPID.

Yes, because in this case Firefox directly starts in safe mode by not using the profile manager.

> If you can force the restart, then you can determine whether the shell has a
> way to track the process by whether it stays in the foreground when run from
> the shell.  If it goes into background, check ps output to see whether ps
> thinks the process is related to the shell.

It's not that easy because other processes of Firefox can run on the same host, which should not be obeyed in such a case. 

> > Sadly this doesn't work for other external commands like `adb` as noticed on
> > bug 1433905 comment 7.
> 
> I assume applications would call setsid because they no longer want to be
> tracked by the parent.  I wonder whether the command has an option to disable
> this.

Being able to stop adb from doing that would be great. But so far I haven't found something.

> If it leaves stdin or stdout or stderr open then perhaps it could be found
> through /proc/<pid>/fd matching a special fd used to start the first process.

That's actually a good idea. As it looks like on MacOS `lsof -p $pid` could be used. To make it platform agnostic maybe we could add the dependency to psutils.

(In reply to Jed Davis [:jld] (⏰UTC-7) from comment #2)
> According to searchfox we're not using setsid or setpgid/setpgrp anywhere,
> so I don't think we're creating a new session in the Unix sense when
> restarting.

We first saw this problem with bug 1276220 about 3 years ago. The reason was a change for the updater in bug 394984 and specifically attachment 8747283 [details] [diff] [review]. Maybe in case of Firefox it's just a different process group.

> However, adb does do this when daemonizing, which is what setsid is for:
> 
> $ ps -jp 13386
>   PID  PGID   SID TTY          TIME CMD
> 13386 13386 13386 ?        00:00:00 adb

Maybe `adb` in this case is special for now. As mentioned above it would be great to know if someone could force adb not to run as daemon and to stay in foreground.

> pid namespaces are also in this area: a process becomes init(8) for its
> descendents in essentially all respects, and they're all killed when that
> process exits, but they have a different mapping from pids to processes, and
> that will break too many things to be useful in this context.  (Also, using
> this as a normal user means setting up an unprivileged user namespace, and
> the last time I tried running Firefox in one of those it crashed, apparently
> because GTK expected some files to be owned by root but root didn't exist.) 
> This is probably not useful here.

Maybe that was something similar as what I have noticed with Janitor a while ago on bug 1430756? Do you know on which platforms namespaces are supported? Is that only Linux?

> prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) sets the calling process as a
> subreaper which adopts any orphaned descendents, similar to init(8); it
> could then find any new children by traversing procfs (there's probably
> already a library for this) and wait(2) on them normally.  Note that if the
> subreaper exits, its children are picked up by init (or the next enclosing
> subreaper, if any) and not killed.  This seems like the most promising
> approach.
> And I should add that these are all Linux-specific; I don't think there's
> any portable Unix way to do anything like this.

So this is all Linux only then, but we clearly would also need something for MacOS, and as I noticed recently also Windows.

> But also, as a workaround for the specific case of adb: `adb kill-server`.

You mean that we would have to call this command before we are going to send the real adb command to start the Fennec application on the device? And then adb will run in the foreground?

Thank you both already for helping!
Flags: needinfo?(jld)
(In reply to Henrik Skupin (:whimboo) from comment #3)
> > But also, as a workaround for the specific case of adb: `adb kill-server`.
> 
> You mean that we would have to call this command before we are going to send
> the real adb command to start the Fennec application on the device? And then
> adb will run in the foreground?

Maybe we can just ignore `adb` in all the discussion and concentrate on Firefox. I just figured out that this might be a bug in mozprocess, where `poll` and `wait` actually check the host process (adb) but not the remove process (firefox). For details see bug 1433905.
(In reply to Henrik Skupin (:whimboo) from comment #4)
> Firefox. I just figured out that this might be a bug in mozprocess, where

Just to correct myself.... s/mozprocess/mozrunner.
Depends on: 892902
(In reply to Henrik Skupin (:whimboo) from comment #3)
> Maybe this is not happening on Linux, but on OS X I see a parent pid of 1
> each time a restart is done by Firefox (eg. install updates, or restart in
> safe mode). You can also see this when starting Firefox with
> -ProfileManager. After selecting the profile Firefox also gets a parent id
> of 1.

Sounds like something different is happening for the restart on OS X.

I tried firefox -ProfileManager -no-remote on Linux.  After selecting the
profile, Firefox still has the same PID and PPID.
We seem to have a few types of restart that behave differently.  The parent process crash reporter appears to do a fork/exec[1] while the original process does crash submission in the background and then exits.  There's also a fork() in update-related code[2], and I don't quite understand what's going on there just from quick look at it, especially because this code seems to contradict the comment that was added in the same commit — it looks like a fork-and-exit, but the comment suggests it should be exec'ing directly.  Contrast [3], which does exec on Linux but appears to spawn a new process[4] on OS X — I think that's the updater starting the new browser and the previous one is the browser launching the updater?  And then there's [5], which I think is what the profile manager uses[6] (note that there are several different `LaunchChild`s in the tree).

[1] https://searchfox.org/mozilla-central/rev/84cea84b1214/toolkit/crashreporter/client/crashreporter_gtk_common.cpp#79
[2] https://searchfox.org/mozilla-central/rev/84cea84b1214/toolkit/xre/nsUpdateDriver.cpp#649
[3] https://searchfox.org/mozilla-central/rev/84cea84b1214/toolkit/mozapps/update/updater/updater.cpp#2105
[4] https://searchfox.org/mozilla-central/rev/84cea84b1214/toolkit/mozapps/update/updater/launchchild_osx.mm#31
[5] https://searchfox.org/mozilla-central/rev/84cea84b1214/toolkit/xre/nsAppRunner.cpp#1968
[6] https://searchfox.org/mozilla-central/rev/84cea84b1214/toolkit/xre/nsAppRunner.cpp#2288
And this needinfo might be obsolete by now, but:

* Yes, namespaces are Linux-specific.  Also, they tend to have surprising side-effects — e.g., using the PulseAudio client in a different pid namespace from the server will break audio systemwide, because it sees shared memory files containing pids of processes that don't exist in the client's frame of reference, so it deletes them all.  (A container that has a fully isolated environment with its own set of system services won't have those problems.)

* I didn't find a documented flag to run the adb daemon in the foreground; I was thinking that, as a special case, it could be manually killed at the end of a test.
Flags: needinfo?(jld)
Depends on: 1438009
Thanks Jed for all the additional feedback! I will have a look at this bug again once all the dependencies have been fixed in mozprocess.

As said earlier we don't have to care about adb here, because the is_running implementation in mozrunner is broken, and needs to be fixed.
No longer blocks: 1433905
Depends on: 1433905
Depends on: 1438830
Priority: -- → P3
Depends on: 1493796
Blocks: 1464989
You need to log in before you can comment on or make changes to this bug.