Allow mozprocess to detect and kill detached child processes

RESOLVED FIXED

Status

defect
--
major
RESOLVED FIXED
4 years ago
Last year

People

(Reporter: whimboo, Assigned: whimboo)

Tracking

(Blocks 2 bugs)

unspecified
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox49 fixed, firefox50 fixed)

Details

Attachments

(2 attachments)

+++ This bug was initially created as a clone of Bug #1166033 +++

Originally I have seen it for non-restart tests that Marionette waits until it hits the global timeout to abort the test. Now that this has been fixed on bug 1166033 all is working fine for non-restart tests. But restart tests are still affected, whereby only those which restart the application with the in_app restart flag. 

Exactly this behavior is happening for our Firefox update tests, which are also run by RelEng for release and beta. Having an abnormal application termination would cause a huge delay in processing all the builds. That's why I will bump this priority up to major.

To reproduce please run the following:

1. Clone https://github.com/mozilla/firefox-ui-tests
2. Create a new virtual environment
3. Run `python setup.py develop`
4. Get an older build of Firefox
5. Run `firefox-ui-update --installer %path_to_build%`
6. When the application has been restarted and downloads the patch, close the browser window and the about window.

After step 6 it will wait a long long time...
This is currently by design.  When we restart using in_app=True, we no longer know the pid of the process, and so can no longer track it, and we can't tell the difference between "no data has come in on the socket because the process is dead" and "no data has come in on the socket because the process is busy".  I'm not sure of a good way around this.
So after the restart we will connect via the same port to Firefox and that connection is successful given that we can run tests. Maybe the server should give us the information which process id it currently has? We could then patch the instance to point to the updated pid.
Possibly, although doing this would probably involve quite a bit of hackery, or a lot of changes to mozrunner and mozprocess.  In practice, how often does this cause problems?
Well, for desktop the restart feature is the one we primarily promote. So far we had limited crashes, some related to restart tests especially the update tests. At least so far I haven't seen hangs due to the mentioned problem. But maybe it's some other side-effect beside the latest fixed bug 1141519. I could re-check once the next version of marionette-client and driver has been released. In any case it's not urgent right now.
We have this problem on mozilla-central now and only for OS X that our update tests are perma failing at the end because Firefox cannot be closed via mozprocess due to an known process id. This is currently tracked on bug 1276220.

Something in that changeset changed the behavior:
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=d6d4e8417d2fd71fdf47c319b7a217f6ace9d5a5

It's most likely due to the elevated updater changes landed during that time. So far I was not able to figure that out correctly due to issues in down tracking.

Continuing the discussion from earlier comments on this bug I wonder if we could instruct Marionette server to send the current pid to the client. Therefore the nsIXULRuntime interface could be used:

https://dxr.mozilla.org/mozilla-central/source/xpcom/system/nsIXULRuntime.idl#83

David, what do you think?
Flags: needinfo?(dburns)
Summary: Marionette hangs if application quits unexpectedly after it has been restarted via the in_app flag → Sometimes Marionette is not able to shutdown/kill the process when the application restarted itself
Ok, after some investigation on bug 1276220 it's now clear what's happening. We cannot find the process id because a restart of Firefox started a new process group. This is becoming more prominent those days to kill any kind of ghost process lingering around after a restart. 

So this bug is blocking our capability of running update tests on OS X.
Blocks: 1276220
Summary: Sometimes Marionette is not able to shutdown/kill the process when the application restarted itself → Marionette is not able to kill the process if a restart escapes the process group
Just to add, I do not really see how we could retrieve the new process group id via mozprocess alone. Ted, would you have any idea beside letting Marionette server send it to the client?
Flags: needinfo?(ted)
(In reply to Henrik Skupin (:whimboo) from comment #5)
> We have this problem on mozilla-central now and only for OS X that our
> update tests are perma failing at the end because Firefox cannot be closed
> via mozprocess due to an known process id. This is currently tracked on bug
> 1276220.
> 
> Something in that changeset changed the behavior:
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?changeset=d6d4e8417d2fd71fdf47c319b7a217f6ace9d5a5
> 
> It's most likely due to the elevated updater changes landed during that
> time. So far I was not able to figure that out correctly due to issues in
> down tracking.
> 
> Continuing the discussion from earlier comments on this bug I wonder if we
> could instruct Marionette server to send the current pid to the client.
> Therefore the nsIXULRuntime interface could be used:
> 
> https://dxr.mozilla.org/mozilla-central/source/xpcom/system/nsIXULRuntime.
> idl#83
> 
> David, what do you think?

We can return it in the `sessionCapabilities` object that is returned when we have a session started and when we call `#getSessionCapabilties`.

https://dxr.mozilla.org/mozilla-central/source/testing/marionette/driver.js#150
Flags: needinfo?(dburns)
I found out that in automation.py we make use of a MOZ_PROCESS_LOG to track the pids of the different processes of Firefox:

https://dxr.mozilla.org/mozilla-central/source/build/automation.py.in#560

With that in place for Marionette we could also use a check for zombies to be able to kill the one remaining process of Firefox:

https://dxr.mozilla.org/mozilla-central/source/build/automation.py.in#505

I will check if that approach would make sense and is easily to port.
Chris, would it be wise to use any code in Marionette directly to handle that and even maybe through MOZ_PROCESS_LOG? Or maybe shall we do it more general in mozrunner? Looks like several test harnesses use that environment variable but not Marionette related test harnesses so far. Thanks.
Flags: needinfo?(ted) → needinfo?(cmanchester)
Thanks Ted. As it looks like this feature is only present for Linux and Windows right now. The following file is missing the handling of MOZ_PROCESS_LOG:

https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/process_util_mac.mm

Do you know why we don't have that logging for OS X? If there is no reason I might try to get my feet wet on that and try myself.
Flags: needinfo?(ted)
So the feature is indeed not implemented yet on OS X. I will check if I can get this into this file:
https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/process_util_mac.mm
Flags: needinfo?(ted)
Ok, I'm getting this moved over to mozbase. I will figure out how best we can handle that on the mozrunner or even mozprocess level.

The process logging for OS X I already got working locally. I will file a dependent bug for it.
Component: Marionette → Mozbase
Summary: Marionette is not able to kill the process if a restart escapes the process group → mozrunner / mozprocess should be able to kill all application processes (including forks into new process groups)
Did we verify that the process id for the updated Firefox ends up in that log? It's not clear to me that would be the case.
Flags: needinfo?(cmanchester)
It should, but I cannot tell right now because we would need a 1 day old nightly build on Mac with the patch on bug 950401 included. It means I will know more on Wednesday this week. But I'm fairly sure it will be all listed.
Maja tested the process logging of Firefox on OS X for the latest two nightly builds and that is what the log contains:

==> process 13422 launched child process 13423
==> process 13475 launched child process 13477

We clearly see two distinct processes (13422, 13475) listed whereby the latter is the one of the new process group after the update. Interestingly process logging doesn't work when our tests are getting invoked by mach, so that I will have to do some further investigation first before I can work on the following proposal:

1. Enable process logging when starting the build.
2. When killing the process we try the known process pid (process group) first
3. In case of a failure (Operation permitted) we check for an existing process log file
4. If it exists we retrieve all known parent processes, and try to kill each of them.

We should only re-raise the exception if we weren't able to find any killable process.
So as it looks like using the process logging capabilities (via the MOZ_PROCESS_LOG env variable) across platforms only seem to work if child processes are getting created. In our case for the firefox-ui-update tests we do only show about:blank and don't load anything. In such a case no child process is getting launched and as result nothing logged to the file.

Benjamin and Bill, I wonder if it would be helpful and possible if we could also log any new parent process. It would help to cope with restarts of Firefox which end up in a new process group and we totally lose tracking of them. What do you think?
Flags: needinfo?(wmccloskey)
Flags: needinfo?(benjamin)
This is a long bug and I don't understand most of it. But typically the thing that is responsible for logging that a process was launched is the thing that launched it. So if your test harness is launching Firefox multiple times, it should be recording that fact somewhere.
Flags: needinfo?(benjamin)
Benjamin, I'm sorry that I missed to add the reference to bug 1276220 comment 10 and following comments. The problem here is that not the harness restarts Firefox but that Firefox restarts itself into a new process group after applying the update. That change came in with the patch on bug 394984, and only applies to OS X right now.

I spent quiet a good amount of time on process handling but I have no idea how to get it working again in mozprocess beside hijacking the process log. But for that we would also have to add logging of the parent process during the startup of Firefox. Is that something you would think it's ok, and if yes where would this code have to be located? Maybe nsAppStartup.cpp?
Flags: needinfo?(benjamin)
Redirecting to spohl who implemented bug 394984.
Flags: needinfo?(benjamin) → needinfo?(spohl.mozilla.bugs)
If you want to log the parent pid, it might be easier to do it to a separate log file. Otherwise you'll probably break some test harness code.
Flags: needinfo?(wmccloskey)
Deferring to Bill here.
Flags: needinfo?(spohl.mozilla.bugs)
(In reply to Bill McCloskey (:billm) from comment #22)
> If you want to log the parent pid, it might be easier to do it to a separate
> log file. Otherwise you'll probably break some test harness code.

Ok, we can do that. Any proposal for the environment variable name? Is `MOZ_PARENT_PROCESS_LOG` ok? Otherwise I could also check our test harness code and implement fixes to cope with parent pids in a line only.

Where would I actually have to add the parent process logging? Is the file nsAppStartup.cpp the correct place?
Flags: needinfo?(wmccloskey)
(In reply to Henrik Skupin (:whimboo) from comment #24)
> (In reply to Bill McCloskey (:billm) from comment #22)
> > If you want to log the parent pid, it might be easier to do it to a separate
> > log file. Otherwise you'll probably break some test harness code.
> 
> Ok, we can do that. Any proposal for the environment variable name? Is
> `MOZ_PARENT_PROCESS_LOG` ok?

Yes, that sounds fine.

> Otherwise I could also check our test harness
> code and implement fixes to cope with parent pids in a line only.

I'm not sure what you mean.

> Where would I actually have to add the parent process logging? Is the file
> nsAppStartup.cpp the correct place?

There are many places where this could go. nsAppStartup.cpp seems fine to me.
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #25)
> > Otherwise I could also check our test harness
> > code and implement fixes to cope with parent pids in a line only.
> 
> I'm not sure what you mean.

I mean the test harnesses which check that log file have an regex to extract the child process information. When we now add new lines with a parent only the regexes might still work and nothing is broken, or I fix those regexes and we can keep a single process log.
As discussed on bug 1282570 we want to do all that in two steps. First we need the feature to detect and kill any detached process which has been spawned into a new process group. This could be done if mozprocess knows about the new pid. In the current case the affected harness is Marionette, and it can let mozprocess know about it. So the logging mechanism is not really needed. But it's still something I would like to do as a follow-up bug.

The options we have here are the following ones as also listed on the other bug:

1. The process id Marionette client gets could be used to update `self.proc.pid` in ProcessHandler. It would mean that whatever method we call, there will be a running and no detached (zombie) process, and operations will succeed. But there is also a risk of breaking other code which for now should not happen given that Marionette is the only framework to actually support restarts of Firefox. Beside that there is also the question how to handle the pid and process group id given the setting of `ignore_children`.

2. Another option would be to move the new process from a foreign process group into the one we know since the process has been started. It would mean our call to killpg() would still succeed at the end when we kill left-open processes. With that method any method performed for the process should afaik also succeed.

3. We do not update the pid at all but store it while marking the old pid as detached. That way we can ensure to kill any left-open processes at the end in `kill()`. Side-effect here is that any method of the process instance will work on the detached (zombie) process, and doesn't reflect the reality.

From those three we decided to take option 3 which should not cause any side effects and will allow us to fix the current update test bustage.
Assignee: nobody → hskupin
Blocks: 1282570
Status: NEW → ASSIGNED
Summary: mozrunner / mozprocess should be able to kill all application processes (including forks into new process groups) → Allow mozprocess to detect and kill detached child processes
Ted, if you want tests I can get them added in a follow-up bug once the update tests are working again.
https://reviewboard.mozilla.org/r/61546/#review59128

I like this approach, it's not *too* terrible. I'm a little iffy on the API but I could be convinced to land it as is (or near as is). I have a few "issues" that are more questions I'd like answered below.

::: testing/mozbase/mozprocess/mozprocess/processhandler.py:697
(Diff revision 1)
> +    def getpgid(self, pid=None):
> +        """Get the process group id for the given process id.
> +
> +        If pid is None the currently known process id will be used.
> +        """
> +        try:
> +            return os.getpgid(pid) if pid else self.proc.pgid
> +        except OSError as e:
> +            # No such process
> +            if e.errno == 3:
> +                return None
> +            else:
> +                raise

Is this method necessary? Seems like we can just call os.getpgid below from check_detached.

::: testing/mozbase/mozprocess/mozprocess/processhandler.py:848
(Diff revision 1)
> +    def check_for_detached(self, new_pid):
> +        """Check if the current process has been detached and mark it appropriately.
> +
> +        In case of application restarts the child process can spawn itself into a new process group.
> +        From now on the process can no longer be tracked by mozprocess anymore and has to be
> +        marked as detached. If the consumer of mozprocess still knows the new process id it could
> +        check for the detached state.
> +
> +        new_pid is the new process id of the child process.
> +        """

This API seems very specific to your use case. What if instead we had a "detach" and "attach" method that basically just appends/pops the specified pid to a "managed_pids" list. Then "kill" would terminate every pid in that list.

Then again, maybe that API would be too easy to abuse.

::: testing/mozbase/mozprocess/mozprocess/processhandler.py:865
(Diff revision 1)
> +
> +        if isPosix:
> +            new_pgid = self.getpgid(new_pid)
> +            if new_pgid != self.proc.pgid:
> +                self.proc.detached = True
> +                self.proc._detached_pid = new_pid

Are you doing this because changing self.pid directly doesn't work with subprocess?
Comment on attachment 8766784 [details]
Bug 1176758 - Allow mozprocess to detect and kill detached child processes.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/61546/diff/1-2/
Attachment #8766784 - Flags: review?(ted) → review?(ahalberstadt)
Attachment #8766785 - Flags: review?(ted)
Comment on attachment 8766785 [details]
Bug 1176758 - Release mozprocess 0.23 and mozrunner 6.12.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/61548/diff/1-2/
https://reviewboard.mozilla.org/r/61546/#review59128

I answered them all. Please tell me what you think.

> Is this method necessary? Seems like we can just call os.getpgid below from check_detached.

We don't really need it. I added it to make the handling of the pgid easier. `os.getpgid()` could raise an exception and that is what we will have to take care of. If we remove this method we should wrap as best both calls to `os.getpgid()` in a try/except block.

> This API seems very specific to your use case. What if instead we had a "detach" and "attach" method that basically just appends/pops the specified pid to a "managed_pids" list. Then "kill" would terminate every pid in that list.
> 
> Then again, maybe that API would be too easy to abuse.

When the process spawns itself into a new process group all the old pids are mood. So my thinking is that we should not keep them all around because its unnecessary. Only the original pid would be important to have given that this process lingers around as zombie.

> Are you doing this because changing self.pid directly doesn't work with subprocess?

We can change `self.pid` but I didn't want to loose the id of the original process, given the reason as decribed above. Also the current behavior doesn't interfere with the current API and should not cause regressions.
https://reviewboard.mozilla.org/r/61546/#review59128

> We don't really need it. I added it to make the handling of the pgid easier. `os.getpgid()` could raise an exception and that is what we will have to take care of. If we remove this method we should wrap as best both calls to `os.getpgid()` in a try/except block.

Yeah, self.getpgid is only called once in your patch, so please just inline the try/except in check_detached.

> When the process spawns itself into a new process group all the old pids are mood. So my thinking is that we should not keep them all around because its unnecessary. Only the original pid would be important to have given that this process lingers around as zombie.

So what I was thinking that the consumer could do was:

    proc.detach(old_pid)
    prod.attach(new_pid)

This would be pretty general purposes, but also likely more than we need. I'd be happy to land what you have for now and consider an approach like I suggested in the future if there's a need.
Comment on attachment 8766784 [details]
Bug 1176758 - Allow mozprocess to detect and kill detached child processes.

https://reviewboard.mozilla.org/r/61546/#review59148
Attachment #8766784 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8766784 [details]
Bug 1176758 - Allow mozprocess to detect and kill detached child processes.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/61546/diff/2-3/
Comment on attachment 8766785 [details]
Bug 1176758 - Release mozprocess 0.23 and mozrunner 6.12.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/61548/diff/2-3/
We cannot land this patch as is given that I can see build bustage on Windows:

> 11:01:38     INFO -  AttributeError: 'module' object has no attribute 'getpgid'
>
> 11:01:38 INFO - self.proc.pgid = os.getpgid(self.proc.pid) 

Now when checking the code in `start()` I see that I missed to check for Posix, similar to the check_for_detached method.
Comment on attachment 8766784 [details]
Bug 1176758 - Allow mozprocess to detect and kill detached child processes.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/61546/diff/3-4/
Comment on attachment 8766785 [details]
Bug 1176758 - Release mozprocess 0.23 and mozrunner 6.12.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/61548/diff/3-4/
The last try server push looks fine: https://treeherder.mozilla.org/#/jobs?repo=try&revision=67db1a16834f&selectedJob=23385803

Andrew, mind having a quick look over it again before I push? I did some minor changes e.g. removing the extra and unnecessary `self.detached` flag. Thanks.
Flags: needinfo?(ahalberstadt)
Flags: needinfo?(ahalberstadt)
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f3d52541b503
Allow mozprocess to detect and kill detached child processes. r=ahal
https://hg.mozilla.org/integration/autoland/rev/de1d0a2322db
Release mozprocess 0.23 and mozrunner 6.12. r=ahal
I filed a follow-up bug 1284864 to track the remaining work for the log file.
No longer depends on: 1282148
Released mozprocess 0.23 and mozrunner 6.12:

creating build/bdist.linux-x86_64/wheel/mozprocess-0.23.dist-info/WHEEL
Uploading distributions to https://pypi.python.org/pypi
Uploading mozprocess-0.23-py2-none-any.whl
Uploading mozprocess-0.23.tar.gz

creating build/bdist.linux-x86_64/wheel/mozrunner-6.12.dist-info/WHEEL
Uploading distributions to https://pypi.python.org/pypi
Uploading mozrunner-6.12-py2-none-any.whl
Uploading mozrunner-6.12.tar.gz

If all works fine for todays Nightly builds I will request a backport to mozilla-aurora.
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Keywords: leave-open
Resolution: --- → FIXED
test-only patch works great. So we also need it for mozilla-aurora to fix the bustage. Thanks.
Whiteboard: [checkin-needed-aurora]
Whiteboard: [checkin-needed-aurora]
You need to log in before you can comment on or make changes to this bug.