Confusing "Failed to stop browser process" log message when trying to kill the already exited browser process

RESOLVED FIXED in Firefox 61

Status

defect
P1
normal
RESOLVED FIXED
Last year
Last year

People

(Reporter: whimboo, Assigned: ato)

Tracking

unspecified
mozilla61
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox61 fixed)

Details

Attachments

(2 attachments)

(In reply to Henrik Skupin (:whimboo) from bug 1434618 comment #20)
> I see some strange behavior here. All is based on the following log:
> 
> https://treeherder.mozilla.org/logviewer.html#?job_id=169826750&repo=try
> (log viewer)
> https://taskcluster-artifacts.net/W8In7kJvTES01pc-37mDGA/0/public/logs/
> live_backing.log (full log)
> 
> 
> 1) Geckodriver reports after only 5s that it failed to stop the browser
> process. How can this be given that we should wait 70s for that to be
> happening. Andreas, do you have an idea? Here the excerpt from the log:
> 
> > [task 2018-03-23T09:02:46.253Z] 09:02:46     INFO - PID 8530 | 1521795766245	Marionette	TRACE	0 -> [0,3,"quit",{"flags":["eForceQuit"]}]
> [..]
> > [task 2018-03-23T09:02:51.563Z] 09:02:51     INFO - PID 8530 | 1521795771555	geckodriver::marionette	ERROR	Failed to stop browser process

(In reply to Andreas Tolfsen ‹:ato› from comment #21)
> I think std::process::Child::kill() errors if the process isn’t
> running:
> 
>     https://doc.rust-lang.org/std/process/struct.Child.html#method.kill
> 
> mozrunner unconditionally calls this after
> mozrunner::RunnerProcess::running() is false in the ::wait() loop:
> 
>    
> https://searchfox.org/mozilla-central/rev/
> 003262ae12ce937950ffb8d3b0fa520d1cc38bff/testing/mozbase/rust/mozrunner/src/
> runner.rs#147

(In reply to Andreas Tolfsen ‹:ato› from comment #22)
> Although killing twice an ended process does not:
> https://play.rust-lang.org/
> ?gist=85ae7d8f983307245872fb328b5685b4&version=stable
We should really get rid of this confusing message which originates here:
https://searchfox.org/mozilla-central/rev/003262ae12ce937950ffb8d3b0fa520d1cc38bff/testing/geckodriver/src/marionette.rs#604

It's really interesting why `kill()` reports an error in that case. I would like to know what kind of error this actually is.
Ok, so it can be replicated with the following code:

https://play.rust-lang.org/?gist=3ec0f0a6a8278ad0b218814d6b98f6c3&version=stable

The problem is going to happen when the process exits itself, and we try to kill it afterward.
The failure as thrown here is actually:

> Failed to stop process: Error { repr: Custom(Custom { kind: InvalidInput, error: StringError("invalid argument: can\'t kill an exited process") }) }

It means we should really only call `kill()` if we have to. And I remember that I was unhappy with that change when I reviewed the code.
Just for reference, the related code was added by bug 1443853. Andreas, can you provide a fix for that?
Blocks: 1443853
Flags: needinfo?(ato)
Summary: Confusing "Failed to stop browser process" log messages → Confusing "Failed to stop browser process" log message when trying to kill the already exited browser process
Incidentally my original patch to bug 1443853 had a fix for this
race condition, but I decided to adopt jgraham’s proposal for a
simpler approach.

In any case I will provide a patch for this.
Assignee: nobody → ato
Status: NEW → ASSIGNED
Flags: needinfo?(ato)
Priority: -- → P1
I’m inclined to want to make a 0.20.1 release with this regression
fix.  Inaccurately returning "Failed to stop process" when the
process has already successfully exited is pretty annoying and is
already confusing users.

There have been two functional changes to geckodriver since the
0.20.0 release, which is/will be this patch and
https://bugzilla.mozilla.org/show_bug.cgi?id=1444068.  I think this
meets the requirements for a point release according to semantic
versioning, but do you agree in principle that we should do this,
whimboo?
Flags: needinfo?(hskupin)
The changes are minimal. So I'm happy with a point release. I will try to review today but not if I can make it. At latest tomorrow morning you have it.
Flags: needinfo?(hskupin)
Comment on attachment 8963217 [details]
Bug 1448900 - Avoid killing exited process.

https://reviewboard.mozilla.org/r/232098/#review237672

I don't love this from a readability point of view, but it seems like it should work.
Attachment #8963217 - Flags: review+
Comment on attachment 8963217 [details]
Bug 1448900 - Avoid killing exited process.

https://reviewboard.mozilla.org/r/232098/#review237672

So I think the only non-racy solution here is to use try_wait.  The
problem with "Child::running()", then later "if Child::running" is
that its inherently racy.

Maybe I can add some comments that will improve the readability of
this.
Comment on attachment 8963217 [details]
Bug 1448900 - Avoid killing exited process.

https://reviewboard.mozilla.org/r/232098/#review237818

I've tested it locally with all variations except not being able to kill the browser and it works pretty fine. I only want to see a bit more logging here, which I would need for investigating problems. See my inline comments.

::: testing/mozbase/rust/mozrunner/src/runner.rs:144
(Diff revision 2)
>      fn wait(&mut self, timeout: time::Duration) -> io::Result<process::ExitStatus> {
> -        let now = time::Instant::now();
> -        while self.running() {
> -            if now.elapsed() >= timeout {
> -                break;
> +        let start = time::Instant::now();
> +        loop {
> +            match self.try_wait() {
> +                // child has already exited, reap its exit code
> +                Ok(Some(status)) => return Ok(status),

For investigation of failures I would like to see the exit code logged at least as debug level.

::: testing/mozbase/rust/mozrunner/src/runner.rs:147
(Diff revision 2)
> -            if now.elapsed() >= timeout {
> -                break;
> +            match self.try_wait() {
> +                // child has already exited, reap its exit code
> +                Ok(Some(status)) => return Ok(status),
> +
> +                // child still running and timeout elapsed, kill it
> +                Ok(None) if start.elapsed() >= timeout => return self.kill(),

When we have to kill the process lets log this as info level. This information will be pretty helpful.
Attachment #8963217 - Flags: review?(hskupin) → review+
Blocks: 1449877
No longer blocks: 1441204
Comment on attachment 8963217 [details]
Bug 1448900 - Avoid killing exited process.

https://reviewboard.mozilla.org/r/232098/#review237818

> For investigation of failures I would like to see the exit code logged at least as debug level.

Good idea, I’ve also missed this.
Comment on attachment 8963217 [details]
Bug 1448900 - Avoid killing exited process.

https://reviewboard.mozilla.org/r/232098/#review237870

::: testing/geckodriver/src/marionette.rs:604
(Diff revision 3)
>  
>          if let Some(ref mut runner) = self.browser {
>              // TODO(https://bugzil.la/1443922):
>              // Use toolkit.asyncshutdown.crash_timout pref
> -            if runner.wait(time::Duration::from_secs(70)).is_err() {
> -                error!("Failed to stop browser process");
> +            match runner.wait(time::Duration::from_secs(70)) {
> +                Ok(x) => debug!("Browser process stopped with exit status {}", x),

Don't you want to use info level here, similar to the command we execute and which also gets logged as info level?
Comment on attachment 8963217 [details]
Bug 1448900 - Avoid killing exited process.

https://reviewboard.mozilla.org/r/232098/#review237870

> Don't you want to use info level here, similar to the command we execute and which also gets logged as info level?

I don’t think I would like that.  If anything, we should reduce the
level of the other one.

This isn’t information I think it is generally useful to display
to the user given that everything works.
Pushed by atolfsen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3d25026d07ef
Avoid killing exited process. r=jgraham,whimboo
https://hg.mozilla.org/integration/autoland/rev/56f8adbe41b7
Release mozrunner 0.6.1. r=whimboo
https://hg.mozilla.org/mozilla-central/rev/3d25026d07ef
https://hg.mozilla.org/mozilla-central/rev/56f8adbe41b7
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
It looks like the new message contains `exit status` and `exit code`. Can we remove one of them?

https://treeherder.mozilla.org/logviewer.html#?job_id=171227846&repo=try&lineNumber=41187

> [task 2018-03-30T17:29:38.196Z] 17:29:38     INFO - PID 1961 | 1522430972688	geckodriver::marionette	DEBUG	Browser process stopped with exit status exit code: 0
Flags: needinfo?(ato)
The "exit code: 0" part of that comes from the Display trait
implementation on ExitStatus, so we should remove the "with exit
status" part of the log message, so it comes something like "Browser
process stopped: exit code: 0".

Filed https://bugzilla.mozilla.org/show_bug.cgi?id=1451838.
Flags: needinfo?(ato)
You need to log in before you can comment on or make changes to this bug.