By default delete_session has to perform a safe shutdown of Firefox

RESOLVED FIXED in Firefox 58

Status

enhancement
P1
major
RESOLVED FIXED
2 years ago
Last year

People

(Reporter: whimboo, Assigned: whimboo)

Tracking

(Depends on 1 bug)

57 Branch
mozilla59
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox58 fixed, firefox59 fixed)

Details

(Whiteboard: [18/01])

Attachments

(3 attachments)

Right now geckodriver is forcing the exit of Firefox by sending the kill signal:

https://searchfox.org/mozilla-central/rev/f54c1723befe6bcc7229f005217d5c681128fcad/testing/geckodriver/src/marionette.rs#582-587

https://github.com/jgraham/rust_mozrunner/blob/master/src/runner.rs#L145

This is unfortunate given the following reasons:

* The shutdown can force an internal crash, which most people noticed with Firefox 55 on Windows. With more recent builds it's a MOZ_CRASH only for debug builds.

* Any component shutdown logic will not be performed. It means that we can not guarantee that eg sessionrestore is able to save the session, or places to store the latest changes to bookmarks

* The exit code of the child process is not 0 but indicates that something was broken

Lets really sanely shutdown Firefox. See bug 1141335 which covers the flip for Marionette, whereby with marionette_driver we have the `in_app` flag, which most of our tests are using nowadays.

It's not that urgent right now, given that we do not have the restart capability yet. But it would still be good to have it soonish.
I was a bit wrong in comment 0. We indeed do a in_app shutdown, but we don't wait until the browser has been quit, before force kill the process:

1506631636378	geckodriver::marionette	TRACE	-> 39:[0,2,"quit",{"flags":["eForceQuit"]}]
1506631636395	Marionette	TRACE	0 -> [0,2,"quit",{"flags":["eForceQuit"]}]
1506631636396	Marionette	INFO	New connections will no longer be accepted
1506631636551	Marionette	TRACE	0 <- [1,2,null,{"cause":"shutdown"}]
1506631636559	geckodriver::marionette	TRACE	<- [1,2,null,{"cause":"shutdown"}]
1506631636559	webdriver::server	DEBUG	Deleting session
1506631636559	geckodriver::marionette	DEBUG	Stopping browser process
Hit MOZ_CRASH(Aborting on channel error.) at /builds/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:2543

We should clearly call proc.wait() and give it a bit of time before calling proc.stop().
So I don't think that we should do the call to `proc.wait()` in the DeleteSession command, but instead in `delete_session`. We have to keep in mind that every command which gets executed can potentially close the browser. And when the next command is sent the connection issues can appear.

Generally a shutdown of Firefox can take up to 65 seconds. Then the watchdog thread will kill any remaining and hanging processes/threads. With Marionette we have a 120s timeout right now, given that in some cases the watchdog thread fails to kill content processes (see bug 1395504), and as result it hangs forever. We just want to give it a little bit more time.
Blocks: 1401129
Given that mozrunner_rust doesn't have a `wait` method yet, we would have to get this implemented first.

Before I start I would like to hear some feedback.
Flags: needinfo?(ato)
I’m not sure what I’m supposed to give feedback on here.  There
seems to be talk about marionette_driver, geckodriver, and the
Marionette server.  Exactly what change are you proposing to make
and where?  Do you have a patch I can look at?

The root cause of runner.stop() causing a MOZ_CRASH assertion
(now on debug builds only?) is that Firefox doesn’t handle
signals properly.  In the event of SIGTERM, the process should
clean up relevant state and exit gracefully.  This is tracked by
https://bugzil.la/336193.

On the other hand, the reason we don’t ask Firefox to shut itself
down is that it causes us to loose leak reports.  jgraham has some
more information on this.
Flags: needinfo?(ato)
(In reply to Andreas Tolfsen ‹:ato› from comment #4)
> I’m not sure what I’m supposed to give feedback on here.  There
> seems to be talk about marionette_driver, geckodriver, and the
> Marionette server.  Exactly what change are you proposing to make
> and where?  Do you have a patch I can look at?

This is geckodriver bug and as such a bug in that product. The references to Marionette above explained what we do for Marionette client to handle it properly. 

> The root cause of runner.stop() causing a MOZ_CRASH assertion
> (now on debug builds only?) is that Firefox doesn’t handle
> signals properly.  In the event of SIGTERM, the process should
> clean up relevant state and exit gracefully.  This is tracked by
> https://bugzil.la/336193.

No, this is not the root cause. The problem is that we do not wait a reasonable amount of time for the shutdown of the application, but kill it immediately after receiving the response for the `quit` command. Hereby the browser doesn't have any time to properly shutdown itself. Also we are not using SIGTERM in geckodriver to trigger the shutdown.

> On the other hand, the reason we don’t ask Firefox to shut itself
> down is that it causes us to loose leak reports.  jgraham has some
> more information on this.

Huh, sorry but I don't understand this paragraph. We call `quit` which indeed is triggering an in-application shutdown:

> 1507172730424	geckodriver::marionette	TRACE	-> 37:[0,6,"quit",{"flags":["eForceQuit"]}]
> 1507172730435	Marionette	TRACE	0 -> [0,6,"quit",{"flags":["eForceQuit"]}]
> 1507172730438	Marionette	INFO	New connections will no longer be accepted
> 1507172730653	Marionette	TRACE	0 <- [1,6,null,{"cause":"shutdown"}]
> 1507172730699	geckodriver::marionette	TRACE	<- [1,6,null,{"cause":"shutdown"}]
> 1507172730699	webdriver::server	DEBUG	Deleting session
> 1507172730699	geckodriver::marionette	DEBUG	Stopping browser process

Also see https://github.com/mozilla/geckodriver/issues/979 which now perfectly makes sense to me.
Also as the above mentioned geckodriver issue shows, we seem to leak a browser process. Not sure how long the old process will be alive, but running another test right after will break Marionette because it cannot create a server socket due to the port still in use.
(In reply to Henrik Skupin (:whimboo) from comment #5)

> (In reply to Andreas Tolfsen ‹:ato› from comment #4)
> 
> > The root cause of runner.stop() causing a MOZ_CRASH assertion
> > (now on debug builds only?) is that Firefox doesn’t handle
> > signals properly.  In the event of SIGTERM, the process should
> > clean up relevant state and exit gracefully.  This is tracked by
> > https://bugzil.la/336193.
> 
> No, this is not the root cause.

It is evidently a problem that Firefox doesn’t gracefully shut
down when receiving SIGTERM: we loose leak reports, and the IPC
subsystem occasionally trips up.

> The problem is that we do not wait a reasonable amount of time for
> the shutdown of the application, but kill it immediately after
> receiving the response for the `quit` command. Hereby the browser
> doesn't have any time to properly shutdown itself.

This is a tertiary problem _caused_ by what I described.  It is
true that we should wait for the process shut down, but that
would be guaranteed by SIGTERM had Firefox respected that signal.
The meaning of SIGTERM is to request termination of a process.
Unlike SIGKILL, it can be caught by the process to perform
graceful deallocation of resources and safe cleanup of state when
appropriate.

Because Firefox lacks this safety guarantee we need work around the
problem.  Reviewing some geckodriver code this morning, I found that
the DeleteSession handler appears to send Marionette:Quit:

>             DeleteSession => {
>                 let mut body = BTreeMap::new();
>                 body.insert("flags".to_owned(), vec!["eForceQuit".to_json()].to_json());
>                 (Some("quit"), Some(Ok(body)))
>             },

Marionette:Quit causes Services.startup.quit to be called and for
Firefox to begin its shutdown route.  It’s unclear to me if the
eForceQuit bitmask causes certain cleanup steps to be skipped or
not.

Whereas the WebDriverHandler::delete_session helper function is
called when the session is ended implicitly after closing the last
window:

>     fn delete_session(&mut self, _: &Option<Session>) {
>         if let Ok(connection) = self.connection.lock() {
>             if let Some(ref conn) = *connection {
>                 conn.close();
>             }
>         }
>         if let Some(ref mut runner) = self.browser {
>             debug!("Stopping browser process");
>             if runner.stop().is_err() {
>                 error!("Failed to kill browser process");
>             };
>         }
>         self.connection = Mutex::new(None);
>         self.browser = None;
>     }

This effectively means we shut down the Firefox process in two
different ways.  When DELETE /session/{session id} is hit, we first
ask Firefox to quit then try to stop the process prematurely, and
when the last window implicitly ends the session, we simply kill the
process.

Iff we can trust Marionette:Quit to always provide a response whilst
the application is being shut down (this has been a problem in the
past, maybe it is fixed now) the right fix would appear to be to
make DeleteSession call delete_session, then modify this helper
function to send Marionette:Quit, then use std::process::Child::wait
instead of ::kill to allow time for the process to exit.

> Also we are not using SIGTERM in geckodriver to trigger the
> shutdown.

I checked the Rust standard library, and it is true that
std::process::Child#kill sends SIGKILL (equivalent to "kill -9
<PID>").  I would consider this a bug, but the fact is that at the
moment it doesn’t matter which signal we send because of the bug I
referenced.

> > On the other hand, the reason we don’t ask Firefox to shut
> > itself down is that it causes us to loose leak reports. jgraham
> > has some more information on this.
> 
> Huh, sorry but I don't understand this paragraph. We call `quit`
> which indeed is triggering an in-application shutdown:
> 
> > 1507172730424	geckodriver::marionette	TRACE	-> 37:[0,6,"quit",{"flags":["eForceQuit"]}]
> > 1507172730435	Marionette	TRACE	0 -> [0,6,"quit",{"flags":["eForceQuit"]}]
> > 1507172730438	Marionette	INFO	New connections will no longer be accepted
> > 1507172730653	Marionette	TRACE	0 <- [1,6,null,{"cause":"shutdown"}]
> > 1507172730699	geckodriver::marionette	TRACE	<- [1,6,null,{"cause":"shutdown"}]
> > 1507172730699	webdriver::server	DEBUG	Deleting session
> > 1507172730699	geckodriver::marionette	DEBUG	Stopping browser process

As I demonstrated, we do different things depending on the
situation. jgraham complained that for one of the two situations,
leak reports go missing.  But it is possible that the cause of this
is that we interrupt shutdown by prematurely killing the process
instead of allowing it time to exit.
(In reply to Andreas Tolfsen ‹:ato› from comment #7)
> It is evidently a problem that Firefox doesn’t gracefully shut
> down when receiving SIGTERM: we loose leak reports, and the IPC
> subsystem occasionally trips up.

Right, but this is unrelated here.

> This is a tertiary problem _caused_ by what I described.  It is
> true that we should wait for the process shut down, but that
> would be guaranteed by SIGTERM had Firefox respected that signal.
> The meaning of SIGTERM is to request termination of a process.
> Unlike SIGKILL, it can be caught by the process to perform
> graceful deallocation of resources and safe cleanup of state when
> appropriate.

Again, we do not use SIGTERM but SIGKILL to force the process to close in mozrunner:

https://github.com/jgraham/rust_mozrunner/blob/master/src/runner.rs#L145
https://doc.rust-lang.org/std/process/struct.Child.html#method.kill

> Because Firefox lacks this safety guarantee we need work around the
> problem.  Reviewing some geckodriver code this morning, I found that
> the DeleteSession handler appears to send Marionette:Quit:
> 
> >             DeleteSession => {
> >                 let mut body = BTreeMap::new();
> >                 body.insert("flags".to_owned(), vec!["eForceQuit".to_json()].to_json());
> >                 (Some("quit"), Some(Ok(body)))
> >             },
> 
> Marionette:Quit causes Services.startup.quit to be called and for
> Firefox to begin its shutdown route.  It’s unclear to me if the
> eForceQuit bitmask causes certain cleanup steps to be skipped or
> not.

It doesn't make a difference locally to use eForceQuit or eAttemptQuit, but for stability reasons we should better use the latter. It will really ensure a clean shutdown with all components being able to clean-up.

> Whereas the WebDriverHandler::delete_session helper function is
> called when the session is ended implicitly after closing the last
> window:
> 
> >     fn delete_session(&mut self, _: &Option<Session>) {
> >         if let Ok(connection) = self.connection.lock() {
> >             if let Some(ref conn) = *connection {
> >                 conn.close();
> >             }
> >         }
> >         if let Some(ref mut runner) = self.browser {
> >             debug!("Stopping browser process");
> >             if runner.stop().is_err() {
> >                 error!("Failed to kill browser process");
> >             };
> >         }
> >         self.connection = Mutex::new(None);
> >         self.browser = None;
> >     }
> 
> This effectively means we shut down the Firefox process in two
> different ways.  When DELETE /session/{session id} is hit, we first
> ask Firefox to quit then try to stop the process prematurely, and
> when the last window implicitly ends the session, we simply kill the
> process.

That's exactly what I'm trying to explain since I filed this bug. Good that we are on an agreement now. :)

> Iff we can trust Marionette:Quit to always provide a response whilst
> the application is being shut down (this has been a problem in the
> past, maybe it is fixed now) the right fix would appear to be to
> make DeleteSession call delete_session, then modify this helper
> function to send Marionette:Quit, then use std::process::Child::wait
> instead of ::kill to allow time for the process to exit.

We would have to use both, ::wait and ::kill, because if a normal shutdown does not work we have to send a SIGKILL to the process. The question is how long ::wait should wait. As mentioned above in case of Marionette we wait 120s, which is about twice the time Firefox is allowed to quit (after 65s any hanging thread should be normally killed - which doesn't always work yet).

> I checked the Rust standard library, and it is true that
> std::process::Child#kill sends SIGKILL (equivalent to "kill -9
> <PID>").  I would consider this a bug, but the fact is that at the
> moment it doesn’t matter which signal we send because of the bug I
> referenced.

Well, ::kill would need an argument as best, which defaults to SIGKILL but also allows to set SIGTERM. But we don't need the latter because we have the capability to trigger a shutdown from inside the application, which is equivalent to a SIGTERM.

> As I demonstrated, we do different things depending on the
> situation. jgraham complained that for one of the two situations,
> leak reports go missing.  But it is possible that the cause of this
> is that we interrupt shutdown by prematurely killing the process
> instead of allowing it time to exit.

Leak reports will get missed in the case of SIGKILL'ing the process.
(In reply to Henrik Skupin (:whimboo) from comment #8)

> (In reply to Andreas Tolfsen ‹:ato› from comment #7)
> 
> > It is evidently a problem that Firefox doesn’t gracefully shut
> > down when receiving SIGTERM: we loose leak reports, and the IPC
> > subsystem occasionally trips up.
> 
> Right, but this is unrelated here.

It is not at all unrelated.  I suggest you read up on how Unix
signalling works.

If Firefox caught SIGTERM and gracefully tore down relevant state,
there would not be a need for an RPC call to tell Marionette to shut
down Firefox.  The signal would block until leak- and crash reports
are generated, and until the IPC subsystem and subprocesses had time
to exit.

> > This is a tertiary problem _caused_ by what I described.  It
> > is true that we should wait for the process shut down, but
> > that would be guaranteed by SIGTERM had Firefox respected that
> > signal.  The meaning of SIGTERM is to request termination of a
> > process.  Unlike SIGKILL, it can be caught by the process to
> > perform graceful deallocation of resources and safe cleanup of
> > state when appropriate.
>
> Again, we do not use SIGTERM but SIGKILL to force the process to
> close in mozrunner:

Right, and I pointed this out in my reply.  Please don’t use
strawman arguments.

geckodriver currently uses SIGKILL, but the point I’m trying to
get across is that even if that was changed to SIGTERM, it would not
make a difference because the latter signal is not caught.

> It doesn't make a difference locally to use eForceQuit or
> eAttemptQuit, but for stability reasons we should better use the
> latter. It will really ensure a clean shutdown with all components
> being able to clean-up.

That’s good to know.  I suspect maybe the different between
eForceQuit and eAttemptQuit could be related to whether the “are
you sure you want to close multiple tabs?” dialogue is shown.  If
that is the case, continuing to use eForceQuit is the right thing to
do.

> The question is how long ::wait should wait. As mentioned above
> in case of Marionette we wait 120s, which is about twice the time
> Firefox is allowed to quit (after 65s any hanging thread should be
> normally killed - which doesn't always work yet).

120 seconds is probably way too much.  We could start with a
reasonably low value and then extend it if it proves to be a
problem.

> > I checked the Rust standard library, and it is true that
> > std::process::Child#kill sends SIGKILL (equivalent to "kill -9
> > <PID>").  I would consider this a bug, but the fact is that at
> > the moment it doesn’t matter which signal we send because of
> > the bug I referenced.
>
> Well, ::kill would need an argument as best, which defaults to
> SIGKILL but also allows to set SIGTERM. But we don't need the
> latter because we have the capability to trigger a shutdown from
> inside the application, which is equivalent to a SIGTERM.

… except Firefox does not respect SIGTERM.  If it did, all this
complexity would not be needed.
(In reply to Andreas Tolfsen ‹:ato› from comment #9)
> It is not at all unrelated.  I suggest you read up on how Unix
> signalling works.
> 
> If Firefox caught SIGTERM and gracefully tore down relevant state,
> there would not be a need for an RPC call to tell Marionette to shut
> down Firefox.  The signal would block until leak- and crash reports
> are generated, and until the IPC subsystem and subprocesses had time
> to exit.

I know how signaling works and especially SIGTERM given that I fixed a couple issues in mozbase in the past. Thing is that we cannot use SIGTERM to close Firefox because we would loose the capability to restart the application. Sending a SIGTERM and starting right after is fundamentally different for Firefox. So lets get just stop on the SIGTERM signal now. Also because we cannot use it at all at the moment, and doesn't help us to fix this problem.

> > Again, we do not use SIGTERM but SIGKILL to force the process to
> > close in mozrunner:
> 
> Right, and I pointed this out in my reply.  Please don’t use
> strawman arguments.

Please note that in my original comment on this bug I said that we are sending the kill signal. It's not that I repeat YOUR comment here.

> geckodriver currently uses SIGKILL, but the point I’m trying to
> get across is that even if that was changed to SIGTERM, it would not
> make a difference because the latter signal is not caught.

There is no need to repeat the same argument over and over again. 

> That’s good to know.  I suspect maybe the different between
> eForceQuit and eAttemptQuit could be related to whether the “are
> you sure you want to close multiple tabs?” dialogue is shown.  If
> that is the case, continuing to use eForceQuit is the right thing to
> do.

I checked where `eForceQuit` is used, and it's a lot. So we should keep it, yes.

> > The question is how long ::wait should wait. As mentioned above
> > in case of Marionette we wait 120s, which is about twice the time
> > Firefox is allowed to quit (after 65s any hanging thread should be
> > normally killed - which doesn't always work yet).
> 
> 120 seconds is probably way too much.  We could start with a
> reasonably low value and then extend it if it proves to be a
> problem.

The minimum should be 70s given that by that time a hanging thread during shutdown should have been killed after 65s. We should not overrule the background monitor.

So my proposal is that when we receive the response of the `quit` command, we should immediately wait for the process to quit. Once that happened we return from this command, and allow the next command to execute.

In case Firefox doesn't shutdown after the given time we would also return, but leaving the process alone. Then in `delete_session` we can kill the process as we do now.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
So all available `wait` commands in `std::process::Child` do not have a timeout argument. It means we cannot just use one of those for speculative waiting. Instead the `FirefoxRunner::is_running` from mozrunner could be used to poll for the exit status. We would need a loop with the said timeout in geckodriver to be able to wait the given amount of time.
Priority: -- → P1
(In reply to Henrik Skupin (:whimboo) from comment #10)

> Thing is that we cannot use SIGTERM to close Firefox because we
> would loose the capability to restart the application.

I didn’t suggest to use SIGTERM for restarting Firefox, but more
to the point there would be no restart requirements here.

> So my proposal is that when we receive the response of the `quit`
> command, we should immediately wait for the process to quit. Once
> that happened we return from this command, and allow the next
> command to execute.

Yes, given a _successul_ response.  It needs to take care not to
wait for the process to exit given an error response.

> In case Firefox doesn't shutdown after the given time we
> would also return, but leaving the process alone. Then in
> `delete_session` we can kill the process as we do now.

Bear in mind that delete_session is called directly when the
session is implicitly closed, so it might make sense to send the
Marionette:Quit command and wait for the process there, before
killing it.

In principle any command may cause the session end, but at the
moment closing the window is the only one we know where this is the
defined behaviour.
(In reply to Andreas Tolfsen ‹:ato› from comment #12)
> > So my proposal is that when we receive the response of the `quit`
> > command, we should immediately wait for the process to quit. Once
> > that happened we return from this command, and allow the next
> > command to execute.
> 
> Yes, given a _successul_ response.  It needs to take care not to
> wait for the process to exit given an error response.

That's true. So maybe...

> Bear in mind that delete_session is called directly when the
> session is implicitly closed, so it might make sense to send the
> Marionette:Quit command and wait for the process there, before
> killing it.
> 
> In principle any command may cause the session end, but at the
> moment closing the window is the only one we know where this is the
> defined behaviour.

That seems to be right. So I will check how this can be implemented in `delete_session` by issuing a `quit` command implicitly, and then using `is_running` with the timeout to wait until the process is gone.
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192014

::: testing/geckodriver/src/marionette.rs:587
(Diff revision 1)
> +                command: WebDriverCommand::DeleteSession
> +            };
> +            match self.handle_command(session, delete_session) {
> +                Err(_) => {},
> +                _ => {}
> +            }

I wonder if the above code might be better placed in `delete_session` inside of webdriver's server.js.
Andreas, mind telling me if I'm on the right track?
Flags: needinfo?(ato)
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192014

> I wonder if the above code might be better placed in `delete_session` inside of webdriver's server.js.

You probably want to handle the error of not being able to send a
command, for whatever reason here.
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192738

::: testing/geckodriver/src/marionette.rs:589
(Diff revision 1)
> +            if let Some(ref mut runner) = self.browser {
> +                while runner.is_running() {
> +                    info!("process still running");
> +                    sleep(Duration::from_millis(100));
> +                }
> +            }
> +        }
> +
> +        if let Ok(ref mut connection) = self.connection.lock() {
> +            if let Some(conn) = connection.as_mut() {
>                  conn.close();
>              }
>          }
> +
>          if let Some(ref mut runner) = self.browser {
> +            if runner.is_running() {
> -            debug!("Stopping browser process");
> +                debug!("Stopping browser process");
> -            if runner.stop().is_err() {
> +                if runner.stop().is_err() {
> -                error!("Failed to kill browser process");
> +                    error!("Failed to kill browser process");
> -            };
> +                };
> -        }
> +            }
> +        }

Don’t you need to use std::process::Child::wait here and kill the process after a given timeout?  If I read this code right, unconditionally waiting 100 ms doesn’t seem to be what we want here.

You want this function to only kill the process if it fails to exit after some time; not to always have to wait 100 ms before killing it.
Flags: needinfo?(ato)
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192738

> Don’t you need to use std::process::Child::wait here and kill the process after a given timeout?  If I read this code right, unconditionally waiting 100 ms doesn’t seem to be what we want here.
> 
> You want this function to only kill the process if it fails to exit after some time; not to always have to wait 100 ms before killing it.

As I said earlier on that bug the `wait` method doesn't accept a parameter, so we cannot customize it. By using `is_running()` we have an easy way to check if the process is still alive.
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192014

> You probably want to handle the error of not being able to send a
> command, for whatever reason here.

The above is just a WIP as noted down in the description. It's far from being ready. I just wanted to know if the kind of strategy is correct. Given your reply I assume so, and I will continue. Thanks.
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192738

> As I said earlier on that bug the `wait` method doesn't accept a parameter, so we cannot customize it. By using `is_running()` we have an easy way to check if the process is still alive.

> You want this function to only kill the process if it fails to exit after some time; not to always have to wait 100 ms before killing it.

The wait will be called only if the process is still alive. What's actually missing above is a break out for the maximum timeout duration.
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192738

> > You want this function to only kill the process if it fails to exit after some time; not to always have to wait 100 ms before killing it.
> 
> The wait will be called only if the process is still alive. What's actually missing above is a break out for the maximum timeout duration.

I think I may have mis-communicated my message above.  What I
would concretely propose is to spin up a thread with a timer that
returns when the process exits or kills the process after N seconds,
and have the main thread use Child::wait.  This will ensure that
geckodriver never hangs on shutdown.

With the patch above as-is, geckodriver will enter an infinite
loop if the firefox process doesn’t exit from as a result of
Marionette:Quit.
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review192738

> I think I may have mis-communicated my message above.  What I
> would concretely propose is to spin up a thread with a timer that
> returns when the process exits or kills the process after N seconds,
> and have the main thread use Child::wait.  This will ensure that
> geckodriver never hangs on shutdown.
> 
> With the patch above as-is, geckodriver will enter an infinite
> loop if the firefox process doesn’t exit from as a result of
> Marionette:Quit.

No we will not hang. Lets wait for the next update. Using threads is overcomplicated here.
Sorry for the delay. I will continue working on this bug now.
Whiteboard: [17/10] → [17/12]
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review212792

::: testing/geckodriver/src/marionette.rs:60
(Diff revision 3)
>  use capabilities::{FirefoxCapabilities, FirefoxOptions};
>  use prefs;
>  
>  const DEFAULT_HOST: &'static str = "localhost";
>  
> +const TIMEOUT_BROWSER_SHUTDOWN: u64 = 130 * 1000;

Explain where this comes from.

::: testing/geckodriver/src/marionette.rs:586
(Diff revision 3)
> +        if let Some(ref s) = *session {
> +            let delete_session = WebDriverMessage {
> +                session_id: Some(s.id.clone()),
> +                command: WebDriverCommand::DeleteSession
> +            };
> +            match self.handle_command(session, delete_session) {

Isn't this just 
```
match self.handle_command(session, delete_session) {
    _ => {}
}
```

which I would prefer to write

```
let _ = self.handle_command(session, delete_session);
```
Attachment #8915761 - Flags: review?(james) → review+
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review212792

> Isn't this just 
> ```
> match self.handle_command(session, delete_session) {
>     _ => {}
> }
> ```
> 
> which I would prefer to write
> 
> ```
> let _ = self.handle_command(session, delete_session);
> ```

That make sense. Thanks.
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/34faef444ae3
Safely shutdown Firefox from in delete_session. r=jgraham
The problem here is that test_handle_prompt_twice is causing a shutdown hang of Firefox, and as such the tests time out. Sadly this cannot be seen in the logs because of the maximum duration of all tests set for a test module. James, any idea how we could compensate that? We really should just timeout if a single test method hangs, but also we shouldn't kill the browser because it would hide the real MOZ_CRASH output.

Running the single test alone locally I can reproduce the hang. I get the following output:

 1:17.36 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "Hit MOZ_CRASH(Workers Hanging - A:3|S:0|Q:0-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken-BC:0|WorkerHolderToken) at /builds/worker/workspace/build/src/dom/workers/RuntimeService.cpp:2191"
 1:17.38 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "1513366560312	geckodriver::marionette	DEBUG	Waiting for the browser process to shutdown"
 1:17.48 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "1513366560413	geckodriver::marionette	DEBUG	Waiting for the browser process to shutdown"
 1:17.58 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "1513366560516	geckodriver::marionette	DEBUG	Waiting for the browser process to shutdown"
 1:17.68 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "1513366560621	geckodriver::marionette	DEBUG	Waiting for the browser process to shutdown"
 1:17.69 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "Hit MOZ_CRASH(Aborting on channel error.) at /builds/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:2534"
 1:17.69 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#01: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x6e1034]"
 1:17.69 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#02: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x6b1701]"
 1:17.69 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#03: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x6aea1f]"
 1:17.69 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#04: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x696065]"
 1:17.70 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#05: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x693ac5]"
 1:17.70 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#06: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x6a0bcd]"
 1:17.70 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#07: nsXPTCStubBase::Stub249()[/Volumes/data/code/gecko/obj/debug/dist/NightlyDebug.app/Contents/MacOS/XUL +0x69aafa]"
 1:17.70 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#08: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x393b]"
 1:17.70 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "#09: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x3887]"
 1:17.70 PROCESS_OUTPUT: Thread-TestrunnerManager-1 (pid:12515) "** Unknown exception behavior: -2147483647"

So this shutdown hang is related to DOM workers. I will have to investigate and file a bug about it.

Interestingly the forced MOZ_CRASH is happening after 65s and not 120s as used in my patch. I will have to clarify which value is the correct one.
Flags: needinfo?(hskupin)
James please have a look at the beginning of the last comment. Thanks.
Flags: needinfo?(james)
I don't really understand the comment. If we just expect this test to timeout now, we can set it to expected: TIMEOUT in the ini file. If it's causing more than a single test to time out, we need to ensure the testrunner process is killed as expected.
Flags: needinfo?(james)
So with my changes we actually noticed a serious issue during shutdown of Firefox. See bug 1425559 for details. It means the failures we see here are real, and we might have to wait with re-landing the patch until bug 1425559 got fixed.
I had a look at those tests. None of the alert tests make sure that all alerts are closed afterwards. Instead we have a finalizer for the end of a session which is `_dismiss_user_prompts`. Given that there are multiple user prompts open during shutdown of Firefox it means that this method does not close remaining alerts! I will check it the next days.
For any test method which is using the `new_session` function, none of the registered finalizers are getting executed! That is also the reason why the alerts are not getting closed correctly, and is causing this hang. I will file a new bug for that.
We are not going to add finalizers for `new_session`. As such we will have to mark the test as timeout expected for now.
So I tried to disable the test by adding the following lines:

  [user_prompts.py::test_handle_prompt_twice]
    disabled: https://bugzilla.mozilla.org/show_bug.cgi?id=1425559

But that doesn't work, and the test is still getting run when executing wp-tests for the execute_script folder. James is there something broken with determining the disabled state? The same can also be seen for other tests and manifests which already exists in tree like ` testing/web-platform/tests/webdriver/tests/navigation/current_url.py::test_get_current_url_after_modified_location`
Flags: needinfo?(james)
I filed bug 1427990 to cover the case for subtests.
Flags: needinfo?(james)
So using TIMEOUT as expected status for the subtest doesn't work. Instead I would have to set this for the test module itself:

> [user_prompts.py]
>   expected: TIMEOUT

I find this awkward given that only a single subtest will cause a timeout. Why has this expected state to be set for everything?
Flags: needinfo?(james)
Maybe this is related because the timeout actually happens during shutdown of Firefox, which is past `test_end`. So `test_handle_prompt_twice` reports `FAIL`, but the final TIMEOUT is accounted for the test module? So I could add TIMEOUT as expected result for user_prompts.py, but then the subtest has to be kept as last test in that file.

Would that be appropriate?
Andreas helped me with that question. We are going to mark the module as expected timeout.
Flags: needinfo?(james)
As just found out on bug 1425559 the hang always happens when Telemetry is turned on. Setting `datareporting.policy.dataSubmissionEnabled` to false at least fixes the manual case. Strangely this pref is already set by Marionette as recommended preference, and it is clearly `false` after creating the new custom session.

What makes it interesting is that the shutdown hang is gone for the test when a navigation takes place to a different URL than `about:blank`. And I can indeed reproduce it manually when starting firefox with `browser.startup.page=0` as we do with Marionette. Basically it looks like a DOM worker problem.

I would suggest that we wait with pushing this patch, and I will give Olli a bit of time to have a look at this. I really want to avoid the addition of the TIMEOUT flag.
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

James, I would like to ask you for an additional review due to some changes have been made, and I want to hear if that looks fine.

Please note that I was wrong with 120s for the background thread monitor. It's actually 65s only as logs of try builds also verify.
Attachment #8915761 - Flags: review?(james)
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

Reverting for now given the unexpected test failures currently occurring.
Attachment #8915761 - Flags: review?(james)
Those test failures actually happened because the affected tests start/stop Firefox a lot, and due to the safe shutdown it will take a bit longer. As result those exceed the default timeout of 30s.

We agreed to extend the timeout for those affected tests.
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

Latest try build as pushed via mozreview passes now.
Attachment #8915761 - Flags: review?(james)
Comment on attachment 8915761 [details]
Bug 1403923 - Safely shutdown Firefox from in delete_session.

https://reviewboard.mozilla.org/r/186974/#review217092
Attachment #8915761 - Flags: review?(james) → review+
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/025d4690bfbe
Safely shutdown Firefox from in delete_session. r=jgraham
Increasing the timeout didn't work with the META tag in the test file. So a request for backout has been made.

Andreas mentioned that I should have regenerated the manifest. I did this locally now, and it shows me the changes in testing/web-platform/meta/MANIFEST.json. I will push this tomorrow after the backout, and the trees got reopened.
https://hg.mozilla.org/mozilla-central/rev/025d4690bfbe
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
The requested backout did not happen, but the code got merged to m-c. Maybe it is easier now to simply land this fix.
Attachment #8941340 - Flags: review+
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Pushed by rgurzau@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/ecb9935ce223
"By default delete_session has to perform a safe shutdown of Firefox" [r=hskupin] a=aryx
https://hg.mozilla.org/mozilla-central/rev/ecb9935ce223
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Whiteboard: [17/12] → [18/01]
Blocks: 1431058
Blocks: 1431081
Henrik, we are trying out uplift on the patches in bug 1425559, so it sounds like you may need to change a setting on a web platform test in this bug. Is that still the case?
Flags: needinfo?(hskupin)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #62)
> Henrik, we are trying out uplift on the patches in bug 1425559, so it sounds
> like you may need to change a setting on a web platform test in this bug. Is
> that still the case?

No, this patch already got uplifted and is even life for Firefox 58. Whatever you uplift now to 59 or 58 should apply cleanly. There is nothing to do here.
Flags: needinfo?(hskupin)
You need to log in before you can comment on or make changes to this bug.