Closed Bug 1448900 Opened 7 years ago Closed 7 years ago

Confusing "Failed to stop browser process" log message when trying to kill the already exited browser process

Categories

(Testing :: geckodriver, defect, P1)

defect

Tracking

(firefox61 fixed)

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: whimboo, Assigned: ato)

References

Details

Attachments

(2 files)

(In reply to Henrik Skupin (:whimboo) from bug 1434618 comment #20) > I see some strange behavior here. All is based on the following log: > > https://treeherder.mozilla.org/logviewer.html#?job_id=169826750&repo=try > (log viewer) > https://taskcluster-artifacts.net/W8In7kJvTES01pc-37mDGA/0/public/logs/ > live_backing.log (full log) > > > 1) Geckodriver reports after only 5s that it failed to stop the browser > process. How can this be given that we should wait 70s for that to be > happening. Andreas, do you have an idea? Here the excerpt from the log: > > > [task 2018-03-23T09:02:46.253Z] 09:02:46 INFO - PID 8530 | 1521795766245 Marionette TRACE 0 -> [0,3,"quit",{"flags":["eForceQuit"]}] > [..] > > [task 2018-03-23T09:02:51.563Z] 09:02:51 INFO - PID 8530 | 1521795771555 geckodriver::marionette ERROR Failed to stop browser process (In reply to Andreas Tolfsen ‹:ato› from comment #21) > I think std::process::Child::kill() errors if the process isn’t > running: > > https://doc.rust-lang.org/std/process/struct.Child.html#method.kill > > mozrunner unconditionally calls this after > mozrunner::RunnerProcess::running() is false in the ::wait() loop: > > > https://searchfox.org/mozilla-central/rev/ > 003262ae12ce937950ffb8d3b0fa520d1cc38bff/testing/mozbase/rust/mozrunner/src/ > runner.rs#147 (In reply to Andreas Tolfsen ‹:ato› from comment #22) > Although killing twice an ended process does not: > https://play.rust-lang.org/ > ?gist=85ae7d8f983307245872fb328b5685b4&version=stable
We should really get rid of this confusing message which originates here: https://searchfox.org/mozilla-central/rev/003262ae12ce937950ffb8d3b0fa520d1cc38bff/testing/geckodriver/src/marionette.rs#604 It's really interesting why `kill()` reports an error in that case. I would like to know what kind of error this actually is.
Ok, so it can be replicated with the following code: https://play.rust-lang.org/?gist=3ec0f0a6a8278ad0b218814d6b98f6c3&version=stable The problem is going to happen when the process exits itself, and we try to kill it afterward.
The failure as thrown here is actually: > Failed to stop process: Error { repr: Custom(Custom { kind: InvalidInput, error: StringError("invalid argument: can\'t kill an exited process") }) } It means we should really only call `kill()` if we have to. And I remember that I was unhappy with that change when I reviewed the code.
Just for reference, the related code was added by bug 1443853. Andreas, can you provide a fix for that?
Blocks: 1443853
Flags: needinfo?(ato)
Summary: Confusing "Failed to stop browser process" log messages → Confusing "Failed to stop browser process" log message when trying to kill the already exited browser process
Incidentally my original patch to bug 1443853 had a fix for this race condition, but I decided to adopt jgraham’s proposal for a simpler approach. In any case I will provide a patch for this.
Assignee: nobody → ato
Status: NEW → ASSIGNED
Flags: needinfo?(ato)
Priority: -- → P1
I’m inclined to want to make a 0.20.1 release with this regression fix. Inaccurately returning "Failed to stop process" when the process has already successfully exited is pretty annoying and is already confusing users. There have been two functional changes to geckodriver since the 0.20.0 release, which is/will be this patch and https://bugzilla.mozilla.org/show_bug.cgi?id=1444068. I think this meets the requirements for a point release according to semantic versioning, but do you agree in principle that we should do this, whimboo?
Flags: needinfo?(hskupin)
The changes are minimal. So I'm happy with a point release. I will try to review today but not if I can make it. At latest tomorrow morning you have it.
Flags: needinfo?(hskupin)
Comment on attachment 8963217 [details] Bug 1448900 - Avoid killing exited process. https://reviewboard.mozilla.org/r/232098/#review237672 I don't love this from a readability point of view, but it seems like it should work.
Attachment #8963217 - Flags: review+
Comment on attachment 8963217 [details] Bug 1448900 - Avoid killing exited process. https://reviewboard.mozilla.org/r/232098/#review237672 So I think the only non-racy solution here is to use try_wait. The problem with "Child::running()", then later "if Child::running" is that its inherently racy. Maybe I can add some comments that will improve the readability of this.
Attachment #8963218 - Flags: review?(hskupin) → review+
Comment on attachment 8963217 [details] Bug 1448900 - Avoid killing exited process. https://reviewboard.mozilla.org/r/232098/#review237818 I've tested it locally with all variations except not being able to kill the browser and it works pretty fine. I only want to see a bit more logging here, which I would need for investigating problems. See my inline comments. ::: testing/mozbase/rust/mozrunner/src/runner.rs:144 (Diff revision 2) > fn wait(&mut self, timeout: time::Duration) -> io::Result<process::ExitStatus> { > - let now = time::Instant::now(); > - while self.running() { > - if now.elapsed() >= timeout { > - break; > + let start = time::Instant::now(); > + loop { > + match self.try_wait() { > + // child has already exited, reap its exit code > + Ok(Some(status)) => return Ok(status), For investigation of failures I would like to see the exit code logged at least as debug level. ::: testing/mozbase/rust/mozrunner/src/runner.rs:147 (Diff revision 2) > - if now.elapsed() >= timeout { > - break; > + match self.try_wait() { > + // child has already exited, reap its exit code > + Ok(Some(status)) => return Ok(status), > + > + // child still running and timeout elapsed, kill it > + Ok(None) if start.elapsed() >= timeout => return self.kill(), When we have to kill the process lets log this as info level. This information will be pretty helpful.
Attachment #8963217 - Flags: review?(hskupin) → review+
Blocks: 1449877
No longer blocks: 1441204
Comment on attachment 8963217 [details] Bug 1448900 - Avoid killing exited process. https://reviewboard.mozilla.org/r/232098/#review237818 > For investigation of failures I would like to see the exit code logged at least as debug level. Good idea, I’ve also missed this.
Comment on attachment 8963217 [details] Bug 1448900 - Avoid killing exited process. https://reviewboard.mozilla.org/r/232098/#review237870 ::: testing/geckodriver/src/marionette.rs:604 (Diff revision 3) > > if let Some(ref mut runner) = self.browser { > // TODO(https://bugzil.la/1443922): > // Use toolkit.asyncshutdown.crash_timout pref > - if runner.wait(time::Duration::from_secs(70)).is_err() { > - error!("Failed to stop browser process"); > + match runner.wait(time::Duration::from_secs(70)) { > + Ok(x) => debug!("Browser process stopped with exit status {}", x), Don't you want to use info level here, similar to the command we execute and which also gets logged as info level?
Comment on attachment 8963217 [details] Bug 1448900 - Avoid killing exited process. https://reviewboard.mozilla.org/r/232098/#review237870 > Don't you want to use info level here, similar to the command we execute and which also gets logged as info level? I don’t think I would like that. If anything, we should reduce the level of the other one. This isn’t information I think it is generally useful to display to the user given that everything works.
Pushed by atolfsen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3d25026d07ef Avoid killing exited process. r=jgraham,whimboo https://hg.mozilla.org/integration/autoland/rev/56f8adbe41b7 Release mozrunner 0.6.1. r=whimboo
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
It looks like the new message contains `exit status` and `exit code`. Can we remove one of them? https://treeherder.mozilla.org/logviewer.html#?job_id=171227846&repo=try&lineNumber=41187 > [task 2018-03-30T17:29:38.196Z] 17:29:38 INFO - PID 1961 | 1522430972688 geckodriver::marionette DEBUG Browser process stopped with exit status exit code: 0
Flags: needinfo?(ato)
The "exit code: 0" part of that comes from the Display trait implementation on ExitStatus, so we should remove the "with exit status" part of the log message, so it comes something like "Browser process stopped: exit code: 0". Filed https://bugzilla.mozilla.org/show_bug.cgi?id=1451838.
Flags: needinfo?(ato)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: