1403923 - By default delete_session has to perform a safe shutdown of Firefox

Assignee

Description

•

7 years ago

Attached file MOZ_CRASH for a forced kill on MacOS — Details

Right now geckodriver is forcing the exit of Firefox by sending the kill signal:

https://searchfox.org/mozilla-central/rev/f54c1723befe6bcc7229f005217d5c681128fcad/testing/geckodriver/src/marionette.rs#582-587

https://github.com/jgraham/rust_mozrunner/blob/master/src/runner.rs#L145

This is unfortunate given the following reasons:

* The shutdown can force an internal crash, which most people noticed with Firefox 55 on Windows. With more recent builds it's a MOZ_CRASH only for debug builds.

* Any component shutdown logic will not be performed. It means that we can not guarantee that eg sessionrestore is able to save the session, or places to store the latest changes to bookmarks

* The exit code of the child process is not 0 but indicates that something was broken

Lets really sanely shutdown Firefox. See bug 1141335 which covers the flip for Marionette, whereby with marionette_driver we have the `in_app` flag, which most of our tests are using nowadays.

It's not that urgent right now, given that we do not have the restart capability yet. But it would still be good to have it soonish.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 1

•

7 years ago

I was a bit wrong in comment 0. We indeed do a in_app shutdown, but we don't wait until the browser has been quit, before force kill the process:

1506631636378	geckodriver::marionette	TRACE	-> 39:[0,2,"quit",{"flags":["eForceQuit"]}]
1506631636395	Marionette	TRACE	0 -> [0,2,"quit",{"flags":["eForceQuit"]}]
1506631636396	Marionette	INFO	New connections will no longer be accepted
1506631636551	Marionette	TRACE	0 <- [1,2,null,{"cause":"shutdown"}]
1506631636559	geckodriver::marionette	TRACE	<- [1,2,null,{"cause":"shutdown"}]
1506631636559	webdriver::server	DEBUG	Deleting session
1506631636559	geckodriver::marionette	DEBUG	Stopping browser process
Hit MOZ_CRASH(Aborting on channel error.) at /builds/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:2543

We should clearly call proc.wait() and give it a bit of time before calling proc.stop().

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 2

•

7 years ago

So I don't think that we should do the call to `proc.wait()` in the DeleteSession command, but instead in `delete_session`. We have to keep in mind that every command which gets executed can potentially close the browser. And when the next command is sent the connection issues can appear.

Generally a shutdown of Firefox can take up to 65 seconds. Then the watchdog thread will kill any remaining and hanging processes/threads. With Marionette we have a 120s timeout right now, given that in some cases the watchdog thread fails to kill content processes (see bug 1395504), and as result it hangs forever. We just want to give it a little bit more time.

Blocks: 1401129

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 3

•

7 years ago

Given that mozrunner_rust doesn't have a `wait` method yet, we would have to get this implemented first.

Before I start I would like to hear some feedback.

Flags: needinfo?(ato)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Blocks: 1401109

Andreas Tolfsen ❲:ato❳

Comment 4

•

7 years ago

I’m not sure what I’m supposed to give feedback on here.  There
seems to be talk about marionette_driver, geckodriver, and the
Marionette server.  Exactly what change are you proposing to make
and where?  Do you have a patch I can look at?

The root cause of runner.stop() causing a MOZ_CRASH assertion
(now on debug builds only?) is that Firefox doesn’t handle
signals properly.  In the event of SIGTERM, the process should
clean up relevant state and exit gracefully.  This is tracked by
https://bugzil.la/336193.

On the other hand, the reason we don’t ask Firefox to shut itself
down is that it causes us to loose leak reports.  jgraham has some
more information on this.

Flags: needinfo?(ato)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 5

•

7 years ago

(In reply to Andreas Tolfsen ‹:ato› from comment #4)
> I’m not sure what I’m supposed to give feedback on here.  There
> seems to be talk about marionette_driver, geckodriver, and the
> Marionette server.  Exactly what change are you proposing to make
> and where?  Do you have a patch I can look at?

This is geckodriver bug and as such a bug in that product. The references to Marionette above explained what we do for Marionette client to handle it properly. 

> The root cause of runner.stop() causing a MOZ_CRASH assertion
> (now on debug builds only?) is that Firefox doesn’t handle
> signals properly.  In the event of SIGTERM, the process should
> clean up relevant state and exit gracefully.  This is tracked by
> https://bugzil.la/336193.

No, this is not the root cause. The problem is that we do not wait a reasonable amount of time for the shutdown of the application, but kill it immediately after receiving the response for the `quit` command. Hereby the browser doesn't have any time to properly shutdown itself. Also we are not using SIGTERM in geckodriver to trigger the shutdown.

> On the other hand, the reason we don’t ask Firefox to shut itself
> down is that it causes us to loose leak reports.  jgraham has some
> more information on this.

Huh, sorry but I don't understand this paragraph. We call `quit` which indeed is triggering an in-application shutdown:

> 1507172730424	geckodriver::marionette	TRACE	-> 37:[0,6,"quit",{"flags":["eForceQuit"]}]
> 1507172730435	Marionette	TRACE	0 -> [0,6,"quit",{"flags":["eForceQuit"]}]
> 1507172730438	Marionette	INFO	New connections will no longer be accepted
> 1507172730653	Marionette	TRACE	0 <- [1,6,null,{"cause":"shutdown"}]
> 1507172730699	geckodriver::marionette	TRACE	<- [1,6,null,{"cause":"shutdown"}]
> 1507172730699	webdriver::server	DEBUG	Deleting session
> 1507172730699	geckodriver::marionette	DEBUG	Stopping browser process

Also see https://github.com/mozilla/geckodriver/issues/979 which now perfectly makes sense to me.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 6

•

7 years ago

Also as the above mentioned geckodriver issue shows, we seem to leak a browser process. Not sure how long the old process will be alive, but running another test right after will break Marionette because it cannot create a server socket due to the port still in use.

Andreas Tolfsen ❲:ato❳

Comment 7

•

7 years ago

(In reply to Henrik Skupin (:whimboo) from comment #5)

> (In reply to Andreas Tolfsen ‹:ato› from comment #4)
> 
> > The root cause of runner.stop() causing a MOZ_CRASH assertion
> > (now on debug builds only?) is that Firefox doesn’t handle
> > signals properly.  In the event of SIGTERM, the process should
> > clean up relevant state and exit gracefully.  This is tracked by
> > https://bugzil.la/336193.
> 
> No, this is not the root cause.

It is evidently a problem that Firefox doesn’t gracefully shut
down when receiving SIGTERM: we loose leak reports, and the IPC
subsystem occasionally trips up.

> The problem is that we do not wait a reasonable amount of time for
> the shutdown of the application, but kill it immediately after
> receiving the response for the `quit` command. Hereby the browser
> doesn't have any time to properly shutdown itself.

This is a tertiary problem _caused_ by what I described.  It is
true that we should wait for the process shut down, but that
would be guaranteed by SIGTERM had Firefox respected that signal.
The meaning of SIGTERM is to request termination of a process.
Unlike SIGKILL, it can be caught by the process to perform
graceful deallocation of resources and safe cleanup of state when
appropriate.

Because Firefox lacks this safety guarantee we need work around the
problem.  Reviewing some geckodriver code this morning, I found that
the DeleteSession handler appears to send Marionette:Quit:

>             DeleteSession => {
>                 let mut body = BTreeMap::new();
>                 body.insert("flags".to_owned(), vec!["eForceQuit".to_json()].to_json());
>                 (Some("quit"), Some(Ok(body)))
>             },

Marionette:Quit causes Services.startup.quit to be called and for
Firefox to begin its shutdown route.  It’s unclear to me if the
eForceQuit bitmask causes certain cleanup steps to be skipped or
not.

Whereas the WebDriverHandler::delete_session helper function is
called when the session is ended implicitly after closing the last
window:

>     fn delete_session(&mut self, _: &Option<Session>) {
>         if let Ok(connection) = self.connection.lock() {
>             if let Some(ref conn) = *connection {
>                 conn.close();
>             }
>         }
>         if let Some(ref mut runner) = self.browser {
>             debug!("Stopping browser process");
>             if runner.stop().is_err() {
>                 error!("Failed to kill browser process");
>             };
>         }
>         self.connection = Mutex::new(None);
>         self.browser = None;
>     }

This effectively means we shut down the Firefox process in two
different ways.  When DELETE /session/{session id} is hit, we first
ask Firefox to quit then try to stop the process prematurely, and
when the last window implicitly ends the session, we simply kill the
process.

Iff we can trust Marionette:Quit to always provide a response whilst
the application is being shut down (this has been a problem in the
past, maybe it is fixed now) the right fix would appear to be to
make DeleteSession call delete_session, then modify this helper
function to send Marionette:Quit, then use std::process::Child::wait
instead of ::kill to allow time for the process to exit.

> Also we are not using SIGTERM in geckodriver to trigger the
> shutdown.

I checked the Rust standard library, and it is true that
std::process::Child#kill sends SIGKILL (equivalent to "kill -9
<PID>").  I would consider this a bug, but the fact is that at the
moment it doesn’t matter which signal we send because of the bug I
referenced.

> > On the other hand, the reason we don’t ask Firefox to shut
> > itself down is that it causes us to loose leak reports. jgraham
> > has some more information on this.
> 
> Huh, sorry but I don't understand this paragraph. We call `quit`
> which indeed is triggering an in-application shutdown:
> 
> > 1507172730424	geckodriver::marionette	TRACE	-> 37:[0,6,"quit",{"flags":["eForceQuit"]}]
> > 1507172730435	Marionette	TRACE	0 -> [0,6,"quit",{"flags":["eForceQuit"]}]
> > 1507172730438	Marionette	INFO	New connections will no longer be accepted
> > 1507172730653	Marionette	TRACE	0 <- [1,6,null,{"cause":"shutdown"}]
> > 1507172730699	geckodriver::marionette	TRACE	<- [1,6,null,{"cause":"shutdown"}]
> > 1507172730699	webdriver::server	DEBUG	Deleting session
> > 1507172730699	geckodriver::marionette	DEBUG	Stopping browser process

As I demonstrated, we do different things depending on the
situation. jgraham complained that for one of the two situations,
leak reports go missing.  But it is possible that the cause of this
is that we interrupt shutdown by prematurely killing the process
instead of allowing it time to exit.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 8

•

7 years ago

(In reply to Andreas Tolfsen ‹:ato› from comment #7)
> It is evidently a problem that Firefox doesn’t gracefully shut
> down when receiving SIGTERM: we loose leak reports, and the IPC
> subsystem occasionally trips up.

Right, but this is unrelated here.

> This is a tertiary problem _caused_ by what I described.  It is
> true that we should wait for the process shut down, but that
> would be guaranteed by SIGTERM had Firefox respected that signal.
> The meaning of SIGTERM is to request termination of a process.
> Unlike SIGKILL, it can be caught by the process to perform
> graceful deallocation of resources and safe cleanup of state when
> appropriate.

Again, we do not use SIGTERM but SIGKILL to force the process to close in mozrunner:

https://github.com/jgraham/rust_mozrunner/blob/master/src/runner.rs#L145
https://doc.rust-lang.org/std/process/struct.Child.html#method.kill

> Because Firefox lacks this safety guarantee we need work around the
> problem.  Reviewing some geckodriver code this morning, I found that
> the DeleteSession handler appears to send Marionette:Quit:
> 
> >             DeleteSession => {
> >                 let mut body = BTreeMap::new();
> >                 body.insert("flags".to_owned(), vec!["eForceQuit".to_json()].to_json());
> >                 (Some("quit"), Some(Ok(body)))
> >             },
> 
> Marionette:Quit causes Services.startup.quit to be called and for
> Firefox to begin its shutdown route.  It’s unclear to me if the
> eForceQuit bitmask causes certain cleanup steps to be skipped or
> not.

It doesn't make a difference locally to use eForceQuit or eAttemptQuit, but for stability reasons we should better use the latter. It will really ensure a clean shutdown with all components being able to clean-up.

> Whereas the WebDriverHandler::delete_session helper function is
> called when the session is ended implicitly after closing the last
> window:
> 
> >     fn delete_session(&mut self, _: &Option<Session>) {
> >         if let Ok(connection) = self.connection.lock() {
> >             if let Some(ref conn) = *connection {
> >                 conn.close();
> >             }
> >         }
> >         if let Some(ref mut runner) = self.browser {
> >             debug!("Stopping browser process");
> >             if runner.stop().is_err() {
> >                 error!("Failed to kill browser process");
> >             };
> >         }
> >         self.connection = Mutex::new(None);
> >         self.browser = None;
> >     }
> 
> This effectively means we shut down the Firefox process in two
> different ways.  When DELETE /session/{session id} is hit, we first
> ask Firefox to quit then try to stop the process prematurely, and
> when the last window implicitly ends the session, we simply kill the
> process.

That's exactly what I'm trying to explain since I filed this bug. Good that we are on an agreement now. :)

> Iff we can trust Marionette:Quit to always provide a response whilst
> the application is being shut down (this has been a problem in the
> past, maybe it is fixed now) the right fix would appear to be to
> make DeleteSession call delete_session, then modify this helper
> function to send Marionette:Quit, then use std::process::Child::wait
> instead of ::kill to allow time for the process to exit.

We would have to use both, ::wait and ::kill, because if a normal shutdown does not work we have to send a SIGKILL to the process. The question is how long ::wait should wait. As mentioned above in case of Marionette we wait 120s, which is about twice the time Firefox is allowed to quit (after 65s any hanging thread should be normally killed - which doesn't always work yet).

> I checked the Rust standard library, and it is true that
> std::process::Child#kill sends SIGKILL (equivalent to "kill -9
> <PID>").  I would consider this a bug, but the fact is that at the
> moment it doesn’t matter which signal we send because of the bug I
> referenced.

Well, ::kill would need an argument as best, which defaults to SIGKILL but also allows to set SIGTERM. But we don't need the latter because we have the capability to trigger a shutdown from inside the application, which is equivalent to a SIGTERM.

> As I demonstrated, we do different things depending on the
> situation. jgraham complained that for one of the two situations,
> leak reports go missing.  But it is possible that the cause of this
> is that we interrupt shutdown by prematurely killing the process
> instead of allowing it time to exit.

Leak reports will get missed in the case of SIGKILL'ing the process.

Andreas Tolfsen ❲:ato❳

Comment 9

•

7 years ago

(In reply to Henrik Skupin (:whimboo) from comment #8)

> (In reply to Andreas Tolfsen ‹:ato› from comment #7)
> 
> > It is evidently a problem that Firefox doesn’t gracefully shut
> > down when receiving SIGTERM: we loose leak reports, and the IPC
> > subsystem occasionally trips up.
> 
> Right, but this is unrelated here.

It is not at all unrelated.  I suggest you read up on how Unix
signalling works.

If Firefox caught SIGTERM and gracefully tore down relevant state,
there would not be a need for an RPC call to tell Marionette to shut
down Firefox.  The signal would block until leak- and crash reports
are generated, and until the IPC subsystem and subprocesses had time
to exit.

> > This is a tertiary problem _caused_ by what I described.  It
> > is true that we should wait for the process shut down, but
> > that would be guaranteed by SIGTERM had Firefox respected that
> > signal.  The meaning of SIGTERM is to request termination of a
> > process.  Unlike SIGKILL, it can be caught by the process to
> > perform graceful deallocation of resources and safe cleanup of
> > state when appropriate.
>
> Again, we do not use SIGTERM but SIGKILL to force the process to
> close in mozrunner:

Right, and I pointed this out in my reply.  Please don’t use
strawman arguments.

geckodriver currently uses SIGKILL, but the point I’m trying to
get across is that even if that was changed to SIGTERM, it would not
make a difference because the latter signal is not caught.

> It doesn't make a difference locally to use eForceQuit or
> eAttemptQuit, but for stability reasons we should better use the
> latter. It will really ensure a clean shutdown with all components
> being able to clean-up.

That’s good to know.  I suspect maybe the different between
eForceQuit and eAttemptQuit could be related to whether the “are
you sure you want to close multiple tabs?” dialogue is shown.  If
that is the case, continuing to use eForceQuit is the right thing to
do.

> The question is how long ::wait should wait. As mentioned above
> in case of Marionette we wait 120s, which is about twice the time
> Firefox is allowed to quit (after 65s any hanging thread should be
> normally killed - which doesn't always work yet).

120 seconds is probably way too much.  We could start with a
reasonably low value and then extend it if it proves to be a
problem.

> > I checked the Rust standard library, and it is true that
> > std::process::Child#kill sends SIGKILL (equivalent to "kill -9
> > <PID>").  I would consider this a bug, but the fact is that at
> > the moment it doesn’t matter which signal we send because of
> > the bug I referenced.
>
> Well, ::kill would need an argument as best, which defaults to
> SIGKILL but also allows to set SIGTERM. But we don't need the
> latter because we have the capability to trigger a shutdown from
> inside the application, which is equivalent to a SIGTERM.

… except Firefox does not respect SIGTERM.  If it did, all this
complexity would not be needed.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 10

•

7 years ago

(In reply to Andreas Tolfsen ‹:ato› from comment #9)
> It is not at all unrelated.  I suggest you read up on how Unix
> signalling works.
> 
> If Firefox caught SIGTERM and gracefully tore down relevant state,
> there would not be a need for an RPC call to tell Marionette to shut
> down Firefox.  The signal would block until leak- and crash reports
> are generated, and until the IPC subsystem and subprocesses had time
> to exit.

I know how signaling works and especially SIGTERM given that I fixed a couple issues in mozbase in the past. Thing is that we cannot use SIGTERM to close Firefox because we would loose the capability to restart the application. Sending a SIGTERM and starting right after is fundamentally different for Firefox. So lets get just stop on the SIGTERM signal now. Also because we cannot use it at all at the moment, and doesn't help us to fix this problem.

> > Again, we do not use SIGTERM but SIGKILL to force the process to
> > close in mozrunner:
> 
> Right, and I pointed this out in my reply.  Please don’t use
> strawman arguments.

Please note that in my original comment on this bug I said that we are sending the kill signal. It's not that I repeat YOUR comment here.

> geckodriver currently uses SIGKILL, but the point I’m trying to
> get across is that even if that was changed to SIGTERM, it would not
> make a difference because the latter signal is not caught.

There is no need to repeat the same argument over and over again. 

> That’s good to know.  I suspect maybe the different between
> eForceQuit and eAttemptQuit could be related to whether the “are
> you sure you want to close multiple tabs?” dialogue is shown.  If
> that is the case, continuing to use eForceQuit is the right thing to
> do.

I checked where `eForceQuit` is used, and it's a lot. So we should keep it, yes.

> > The question is how long ::wait should wait. As mentioned above
> > in case of Marionette we wait 120s, which is about twice the time
> > Firefox is allowed to quit (after 65s any hanging thread should be
> > normally killed - which doesn't always work yet).
> 
> 120 seconds is probably way too much.  We could start with a
> reasonably low value and then extend it if it proves to be a
> problem.

The minimum should be 70s given that by that time a hanging thread during shutdown should have been killed after 65s. We should not overrule the background monitor.

So my proposal is that when we receive the response of the `quit` command, we should immediately wait for the process to quit. Once that happened we return from this command, and allow the next command to execute.

In case Firefox doesn't shutdown after the given time we would also return, but leaving the process alone. Then in `delete_session` we can kill the process as we do now.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Assignee: nobody → hskupin

Status: NEW → ASSIGNED

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 11

•

7 years ago

So all available `wait` commands in `std::process::Child` do not have a timeout argument. It means we cannot just use one of those for speculative waiting. Instead the `FirefoxRunner::is_running` from mozrunner could be used to poll for the exit status. We would need a loop with the said timeout in geckodriver to be able to wait the given amount of time.

Priority: -- → P1

Andreas Tolfsen ❲:ato❳

Comment 12

•

7 years ago

(In reply to Henrik Skupin (:whimboo) from comment #10)

> Thing is that we cannot use SIGTERM to close Firefox because we
> would loose the capability to restart the application.

I didn’t suggest to use SIGTERM for restarting Firefox, but more
to the point there would be no restart requirements here.

> So my proposal is that when we receive the response of the `quit`
> command, we should immediately wait for the process to quit. Once
> that happened we return from this command, and allow the next
> command to execute.

Yes, given a _successul_ response.  It needs to take care not to
wait for the process to exit given an error response.

> In case Firefox doesn't shutdown after the given time we
> would also return, but leaving the process alone. Then in
> `delete_session` we can kill the process as we do now.

Bear in mind that delete_session is called directly when the
session is implicitly closed, so it might make sense to send the
Marionette:Quit command and wait for the process there, before
killing it.

In principle any command may cause the session end, but at the
moment closing the window is the only one we know where this is the
defined behaviour.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 13

•

7 years ago

(In reply to Andreas Tolfsen ‹:ato› from comment #12)
> > So my proposal is that when we receive the response of the `quit`
> > command, we should immediately wait for the process to quit. Once
> > that happened we return from this command, and allow the next
> > command to execute.
> 
> Yes, given a _successul_ response.  It needs to take care not to
> wait for the process to exit given an error response.

That's true. So maybe...

> Bear in mind that delete_session is called directly when the
> session is implicitly closed, so it might make sense to send the
> Marionette:Quit command and wait for the process there, before
> killing it.
> 
> In principle any command may cause the session end, but at the
> moment closing the window is the only one we know where this is the
> defined behaviour.

That seems to be right. So I will check how this can be implemented in `delete_session` by issuing a `quit` command implicitly, and then using `is_running` with the timeout to wait until the process is gone.

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/186974/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/186974/

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 15

•

7 years ago

mozreview-review

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192014

::: testing/geckodriver/src/marionette.rs:587
(Diff revision 1)
> +                command: WebDriverCommand::DeleteSession
> +            };
> +            match self.handle_command(session, delete_session) {
> +                Err(_) => {},
> +                _ => {}
> +            }

I wonder if the above code might be better placed in `delete_session` inside of webdriver's server.js.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 16

•

7 years ago

Andreas, mind telling me if I'm on the right track?

Flags: needinfo?(ato)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Whiteboard: [17/10]

Andreas Tolfsen ❲:ato❳

Comment 17

•

7 years ago

mozreview-review-reply

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192014

> I wonder if the above code might be better placed in `delete_session` inside of webdriver's server.js.

You probably want to handle the error of not being able to send a
command, for whatever reason here.

Andreas Tolfsen ❲:ato❳

Comment 18

•

7 years ago

mozreview-review

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192738

::: testing/geckodriver/src/marionette.rs:589
(Diff revision 1)
> +            if let Some(ref mut runner) = self.browser {
> +                while runner.is_running() {
> +                    info!("process still running");
> +                    sleep(Duration::from_millis(100));
> +                }
> +            }
> +        }
> +
> +        if let Ok(ref mut connection) = self.connection.lock() {
> +            if let Some(conn) = connection.as_mut() {
>                  conn.close();
>              }
>          }
> +
>          if let Some(ref mut runner) = self.browser {
> +            if runner.is_running() {
> -            debug!("Stopping browser process");
> +                debug!("Stopping browser process");
> -            if runner.stop().is_err() {
> +                if runner.stop().is_err() {
> -                error!("Failed to kill browser process");
> +                    error!("Failed to kill browser process");
> -            };
> +                };
> -        }
> +            }
> +        }

Don’t you need to use std::process::Child::wait here and kill the process after a given timeout?  If I read this code right, unconditionally waiting 100 ms doesn’t seem to be what we want here.

You want this function to only kill the process if it fails to exit after some time; not to always have to wait 100 ms before killing it.

Andreas Tolfsen ❲:ato❳

Updated

•

7 years ago

Flags: needinfo?(ato)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 19

•

7 years ago

mozreview-review-reply

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192738

> Don’t you need to use std::process::Child::wait here and kill the process after a given timeout?  If I read this code right, unconditionally waiting 100 ms doesn’t seem to be what we want here.
> 
> You want this function to only kill the process if it fails to exit after some time; not to always have to wait 100 ms before killing it.

As I said earlier on that bug the `wait` method doesn't accept a parameter, so we cannot customize it. By using `is_running()` we have an easy way to check if the process is still alive.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 20

•

7 years ago

mozreview-review-reply

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192014

> You probably want to handle the error of not being able to send a
> command, for whatever reason here.

The above is just a WIP as noted down in the description. It's far from being ready. I just wanted to know if the kind of strategy is correct. Given your reply I assume so, and I will continue. Thanks.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 21

•

7 years ago

mozreview-review-reply

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192738

> As I said earlier on that bug the `wait` method doesn't accept a parameter, so we cannot customize it. By using `is_running()` we have an easy way to check if the process is still alive.

> You want this function to only kill the process if it fails to exit after some time; not to always have to wait 100 ms before killing it.

The wait will be called only if the process is still alive. What's actually missing above is a break out for the maximum timeout duration.

Andreas Tolfsen ❲:ato❳

Comment 22

•

7 years ago

mozreview-review-reply

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192738

> > You want this function to only kill the process if it fails to exit after some time; not to always have to wait 100 ms before killing it.
> 
> The wait will be called only if the process is still alive. What's actually missing above is a break out for the maximum timeout duration.

I think I may have mis-communicated my message above.  What I
would concretely propose is to spin up a thread with a timer that
returns when the process exits or kills the process after N seconds,
and have the main thread use Child::wait.  This will ensure that
geckodriver never hangs on shutdown.

With the patch above as-is, geckodriver will enter an infinite
loop if the firefox process doesn’t exit from as a result of
Marionette:Quit.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 23

•

7 years ago

mozreview-review-reply

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192738

> I think I may have mis-communicated my message above.  What I
> would concretely propose is to spin up a thread with a timer that
> returns when the process exits or kills the process after N seconds,
> and have the main thread use Child::wait.  This will ensure that
> geckodriver never hangs on shutdown.
> 
> With the patch above as-is, geckodriver will enter an infinite
> loop if the firefox process doesn’t exit from as a result of
> Marionette:Quit.

No we will not hang. Lets wait for the next update. Using threads is overcomplicated here.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 24

•

7 years ago

Sorry for the delay. I will continue working on this bug now.

Whiteboard: [17/10] → [17/12]

Comment hidden (mozreview-request)

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/186974/diff/1-2/

Comment hidden (mozreview-request)

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/186974/diff/2-3/

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Attachment #8915761 - Flags: review?(james)

James Graham [:jgraham]

Comment 27

•

7 years ago

mozreview-review

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review212792

::: testing/geckodriver/src/marionette.rs:60
(Diff revision 3)
>  use capabilities::{FirefoxCapabilities, FirefoxOptions};
>  use prefs;
>  
>  const DEFAULT_HOST: &'static str = "localhost";
>  
> +const TIMEOUT_BROWSER_SHUTDOWN: u64 = 130 * 1000;

Explain where this comes from.

::: testing/geckodriver/src/marionette.rs:586
(Diff revision 3)
> +        if let Some(ref s) = *session {
> +            let delete_session = WebDriverMessage {
> +                session_id: Some(s.id.clone()),
> +                command: WebDriverCommand::DeleteSession
> +            };
> +            match self.handle_command(session, delete_session) {

Isn't this just 
```
match self.handle_command(session, delete_session) {
    _ => {}
}
```

which I would prefer to write

```
let _ = self.handle_command(session, delete_session);
```

Attachment #8915761 - Flags: review?(james) → review+

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 28

•

7 years ago

mozreview-review-reply

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review212792

> Isn't this just 
> ```
> match self.handle_command(session, delete_session) {
>     _ => {}
> }
> ```
> 
> which I would prefer to write
> 
> ```
> let _ = self.handle_command(session, delete_session);
> ```

That make sense. Thanks.

Comment hidden (mozreview-request)

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/186974/diff/3-4/

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Attachment #8915761 - Flags: review?(ato)

Pulsebot

Comment 30

•

7 years ago

Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/34faef444ae3
Safely shutdown Firefox from in delete_session. r=jgraham

Cosmin Sabou [:CosminS]

Comment 31

•

7 years ago

Backed out for WD failures on Linux debug and x64 debug at /webdriver/tests/execute_async_script/user_prompts.py

https://hg.mozilla.org/integration/autoland/rev/dffc508ff053ba2a7ac93aa078f5d5448c2380fe

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=34faef444ae338b25ef7eb33ab15e60e9e52e8b8&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&selectedJob=151809340

https://treeherder.mozilla.org/logviewer.html#?job_id=151809340&repo=autoland&lineNumber=6326

Flags: needinfo?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 32

•

7 years ago

The problem here is that test_handle_prompt_twice is causing a shutdown hang of Firefox, and as such the tests time out. Sadly this cannot be seen in the logs because of the maximum duration of all tests set for a test module. James, any idea how we could compensate that? We really should just timeout if a single test method hangs, but also we shouldn't kill the browser because it would hide the real MOZ_CRASH output.

Running the single test alone locally I can reproduce the hang. I get the following output:

 1:17.36 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "Hit MOZ_CRASH(Workers Hanging - A:3|S:0|Q:0-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken) at /builds/worker/workspace/build/src/dom/workers/RuntimeService.cpp:2191"
 1:17.38 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "1513366560312	geckodriver::marionette	DEBUG	Waiting for the browser process to shutdown"
 1:17.48 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "1513366560413	geckodriver::marionette	DEBUG	Waiting for the browser process to shutdown"
 1:17.58 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "1513366560516	geckodriver::marionette	DEBUG	Waiting for the browser process to shutdown"
 1:17.68 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "1513366560621	geckodriver::marionette	DEBUG	Waiting for the browser process to shutdown"
 1:17.69 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "Hit MOZ_CRASH(Aborting on channel error.) at /builds/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:2534"
 1:17.69 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#01: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x6e1034]"
 1:17.69 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#02: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x6b1701]"
 1:17.69 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#03: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x6aea1f]"
 1:17.69 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#04: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x696065]"
 1:17.70 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#05: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x693ac5]"
 1:17.70 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#06: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x6a0bcd]"
 1:17.70 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#07: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x69aafa]"
 1:17.70 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#08: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x393b]"
 1:17.70 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#09: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x3887]"
 1:17.70 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "** Unknown exception behavior: -2147483647"

So this shutdown hang is related to DOM workers. I will have to investigate and file a bug about it.

Interestingly the forced MOZ_CRASH is happening after 65s and not 120s as used in my patch. I will have to clarify which value is the correct one.

Flags: needinfo?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 33

•

7 years ago

James please have a look at the beginning of the last comment. Thanks.

Flags: needinfo?(james)

James Graham [:jgraham]

Comment 34

•

7 years ago

I don't really understand the comment. If we just expect this test to timeout now, we can set it to expected: TIMEOUT in the ini file. If it's causing more than a single test to time out, we need to ensure the testrunner process is killed as expected.

Flags: needinfo?(james)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Depends on: 1425559

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 35

•

7 years ago

So with my changes we actually noticed a serious issue during shutdown of Firefox. See bug 1425559 for details. It means the failures we see here are real, and we might have to wait with re-landing the patch until bug 1425559 got fixed.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 36

•

7 years ago

I had a look at those tests. None of the alert tests make sure that all alerts are closed afterwards. Instead we have a finalizer for the end of a session which is `_dismiss_user_prompts`. Given that there are multiple user prompts open during shutdown of Firefox it means that this method does not close remaining alerts! I will check it the next days.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 37

•

7 years ago

For any test method which is using the `new_session` function, none of the registered finalizers are getting executed! That is also the reason why the alerts are not getting closed correctly, and is causing this hang. I will file a new bug for that.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Depends on: 1425947

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 38

•

7 years ago

We are not going to add finalizers for `new_session`. As such we will have to mark the test as timeout expected for now.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 39

•

7 years ago

So I tried to disable the test by adding the following lines:

  [user_prompts.py::test_handle_prompt_twice]
    disabled: https://bugzilla.mozilla.org/show_bug.cgi?id=1425559

But that doesn't work, and the test is still getting run when executing wp-tests for the execute_script folder. James is there something broken with determining the disabled state? The same can also be seen for other tests and manifests which already exists in tree like ` testing/web-platform/tests/webdriver/tests/navigation/current_url.py::test_get_current_url_after_modified_location`

Flags: needinfo?(james)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Depends on: 1427990

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 40

•

7 years ago

I filed bug 1427990 to cover the case for subtests.

Flags: needinfo?(james)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 41

•

7 years ago

So using TIMEOUT as expected status for the subtest doesn't work. Instead I would have to set this for the test module itself:

> [user_prompts.py]
>   expected: TIMEOUT

I find this awkward given that only a single subtest will cause a timeout. Why has this expected state to be set for everything?

Flags: needinfo?(james)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 42

•

7 years ago

Maybe this is related because the timeout actually happens during shutdown of Firefox, which is past `test_end`. So `test_handle_prompt_twice` reports `FAIL`, but the final TIMEOUT is accounted for the test module? So I could add TIMEOUT as expected result for user_prompts.py, but then the subtest has to be kept as last test in that file.

Would that be appropriate?

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 43

•

7 years ago

Andreas helped me with that question. We are going to mark the module as expected timeout.

Flags: needinfo?(james)

Comment hidden (mozreview-request)

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/186974/diff/4-5/

Comment hidden (mozreview-request)

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/186974/diff/5-6/

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 46

•

7 years ago

As just found out on bug 1425559 the hang always happens when Telemetry is turned on. Setting `datareporting.policy.dataSubmissionEnabled` to false at least fixes the manual case. Strangely this pref is already set by Marionette as recommended preference, and it is clearly `false` after creating the new custom session.

What makes it interesting is that the shutdown hang is gone for the test when a navigation takes place to a different URL than `about:blank`. And I can indeed reproduce it manually when starting firefox with `browser.startup.page=0` as we do with Marionette. Basically it looks like a DOM worker problem.

I would suggest that we wait with pushing this patch, and I will give Olli a bit of time to have a look at this. I really want to avoid the addition of the TIMEOUT flag.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 47

•

7 years ago

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

James, I would like to ask you for an additional review due to some changes have been made, and I want to hear if that looks fine.

Please note that I was wrong with 120s for the background thread monitor. It's actually 65s only as logs of try builds also verify.

Attachment #8915761 - Flags: review?(james)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 48

•

7 years ago

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

Reverting for now given the unexpected test failures currently occurring.

Attachment #8915761 - Flags: review?(james)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 49

•

7 years ago

Those test failures actually happened because the affected tests start/stop Firefox a lot, and due to the safe shutdown it will take a bit longer. As result those exceed the default timeout of 30s.

We agreed to extend the timeout for those affected tests.

Comment hidden (mozreview-request)

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/186974/diff/6-7/

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 51

•

7 years ago

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

Latest try build as pushed via mozreview passes now.

Attachment #8915761 - Flags: review?(james)

James Graham [:jgraham]

Comment 52

•

7 years ago

mozreview-review

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review217092

James Graham [:jgraham]

Updated

•

7 years ago

Attachment #8915761 - Flags: review?(james) → review+

Pulsebot

Comment 53

•

7 years ago

Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/025d4690bfbe
Safely shutdown Firefox from in delete_session. r=jgraham

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 54

•

7 years ago

Increasing the timeout didn't work with the META tag in the test file. So a request for backout has been made.

Andreas mentioned that I should have regenerated the manifest. I did this locally now, and it shows me the changes in testing/web-platform/meta/MANIFEST.json. I will push this tomorrow after the backout, and the trees got reopened.

Comment hidden (mozreview-request)

Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/186974/diff/7-8/

Cristian Brindusan [:cbrindusan]

Comment 56

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/025d4690bfbe

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

status-firefox59: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla59

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 57

•

7 years ago

Attached patch follow-up patch — Details — Splinter Review

The requested backout did not happen, but the code got merged to m-c. Maybe it is easier now to simply land this fix.

Attachment #8941340 - Flags: review+

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Blocks: 1393051

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Blocks: 1408848

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Blocks: 1397480

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Blocks: 1414141

Pulsebot

Comment 58

•

7 years ago

Pushed by rgurzau@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/ecb9935ce223
"By default delete_session has to perform a safe shutdown of Firefox" [r=hskupin] a=aryx

Raul Gurzau (:RaulG)

Comment 59

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/ecb9935ce223

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

Resolution: --- → FIXED

Ryan VanderMeulen [:RyanVM]

Comment 60

•

7 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/218cfb44a3f4

status-firefox58: --- → fixed

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

7 years ago

Whiteboard: [17/12] → [18/01]

Cosmin Sabou [:CosminS]

Updated

•

7 years ago

Blocks: 1431058

Treeherder Bug Filer

Updated

•

7 years ago

Blocks: 1431081

Liz Henry (:lizzard) (relman/hg->git project)

Comment 62

•

7 years ago

Henrik, we are trying out uplift on the patches in bug 1425559, so it sounds like you may need to change a setting on a web platform test in this bug. Is that still the case?

Flags: needinfo?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 63

•

7 years ago

(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #62)
> Henrik, we are trying out uplift on the patches in bug 1425559, so it sounds
> like you may need to change a setting on a web platform test in this bug. Is
> that still the case?

No, this patch already got uplifted and is even life for Firefox 58. Whatever you uplift now to 59 or 58 should apply cleanly. There is nothing to do here.

Flags: needinfo?(hskupin)

Comment hidden (Intermittent Failures Robot)

1 failures in 660 pushes (0.002 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* windows7-32: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1403923&startday=2018-05-07&endday=2018-05-13&tree=trunk

MOZ_CRASH for a forced kill on MacOS 7 years ago Henrik Skupin [:whimboo][⌚️UTC+2] 1.13 KB, text/plain		Details
Bug 1403923 - Safely shutdown Firefox from in delete_session. 7 years ago Henrik Skupin [:whimboo][⌚️UTC+2] 59 bytes, text/x-review-board-request	jgraham : review+	Details
follow-up patch 7 years ago Henrik Skupin [:whimboo][⌚️UTC+2] 1.72 KB, patch	whimboo : review+	Details \| Diff \| Splinter Review