Closed Bug 1257476 Opened 4 years ago Closed 3 years ago

Marionette doesn't kill Firefox in case of YSOD (DTD entities not found) issues

Categories

(Testing :: Marionette, defect, critical)

defect
Not set
critical

Tracking

(firefox49 wontfix, firefox50 fixed)

RESOLVED FIXED
mozilla50
Tracking Status
firefox49 --- wontfix
firefox50 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

(Whiteboard: [blocked by bug 1287723])

Attachments

(2 files)

Nearly all locales of Firefox 46.0b1 were busted on Windows due to issues with the l10n migration script. For details see bug 1255811.

The problem we faced here is that due to the yellow screen of dead window, Marionette hangs forever and does not kill Firefox. So that caused hanging instances on several Windows machines in mozmill-ci, and all following jobs on those machines are failing. A manual cleanup is absolutely necessary then by connection via VNC and killing all active Firefox processes.

This is a critical issue given that we run a lot of locales for updates and if this situation comes up all 4 slave machines we have per platform version will be quickly busted. It will cause delays in the QA signoffs for release processes of Firefox.

For the Firefox 46.0b1 release it only affected Windows, but this might be a cross-platform issue with Marionette. 

In case you want to test this on Windows one of the listed failing locales can be used:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=fb3494d06dfb&filter-tier=1&filter-tier=2&filter-tier=3&filter-searchStr=firefox%20ui

Builds can be downloaded from here:

http://archive.mozilla.org/pub/firefox/candidates/46.0b1-candidates/build8/win32/
from a quick scan of the code it looks like we need a try: except: around checking we can connect to the browser. if we can't we need to cleanup the process.

Looks like https://dxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette_driver/marionette.py#602 is where we need to do it.
I digged into that problem by manipulating a Firefox installation which causes such a YSOD window. The proposed location from above is not the correct code which needs a fix. It's the following one:

https://dxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette_driver/marionette.py#676

In the above case the Marionette server gets started because its a XPCOM component. If Firefox comes up broken or not doesn't matter here. When we are trying to send a message for a new session the call to recv() will abort with a timeout, which will raise an IOError. Right now we blindly assume that the application is shutting down itself and we only call `runner.wait()` to wait for it to happen. But that does not happen here! The application is still up and running, and because we raise the IOError again - which doesn't get handled again - we exit Marionette without stopping the application.

The solution here is that we call `runner.stop()` in the case when the application is still running after `runner.wait()`. We really cannot leave the process running when we started it on our own.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
While further testing this patch I noticed that the level when we force stop the binary now will leave out other exceptions, which should also cause a force stop of the binary. It means that the handling of the IOError needs to be higher up in the stack.
Comment on attachment 8767886 [details]
Bug 1257476 - Ensure Marionette error classes use correct inheritance.

https://reviewboard.mozilla.org/r/62262/#review59056

LGTM.

::: testing/marionette/client/marionette_driver/marionette.py:684
(Diff revision 1)
> -                # for it to shut down.
> +                # for it to shut down. In the case when it doesn't happen,
> +                # force its shut down.
>                  returncode = self.instance.runner.wait(timeout=self.DEFAULT_STARTUP_TIMEOUT)
> -                raise IOError("process died with returncode %s" % returncode)
> +                if returncode is None:
> +                    self.instance.runner.stop()
> +                    raise IOError("Process killed because the connection was lost.")

Nit: Drop .

::: testing/marionette/client/marionette_driver/marionette.py:686
(Diff revision 1)
>                  returncode = self.instance.runner.wait(timeout=self.DEFAULT_STARTUP_TIMEOUT)
> -                raise IOError("process died with returncode %s" % returncode)
> +                if returncode is None:
> +                    self.instance.runner.stop()
> +                    raise IOError("Process killed because the connection was lost.")
> +
> +                raise IOError("Process died with returncode: %s." % returncode)

Nit: Drop .
Attachment #8767886 - Flags: review+
Before I want to bring a patch up for review I would like to see bug 1282570 fixed first.
Comment on attachment 8767886 [details]
Bug 1257476 - Ensure Marionette error classes use correct inheritance.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/62262/diff/1-2/
Attachment #8767886 - Flags: review+
Comment on attachment 8767886 [details]
Bug 1257476 - Ensure Marionette error classes use correct inheritance.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/62262/diff/2-3/
Depends on: 1287723
Comment on attachment 8767886 [details]
Bug 1257476 - Ensure Marionette error classes use correct inheritance.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/62262/diff/3-4/
The last updated patch should be complete. I will only have to separate out changes into different commits for an easier review.

David, do you want to have tests for that change? If yes, it might be a bit harder to do given that I would have to mock a socket server which would need to be used to emulate a broken Firefox binary.
Flags: needinfo?(dburns)
Under some circumstances Marionette currently fails to stop the application in case of socket issues. To
ensure that the application always gets closed - in the case when Marionette started it - the check for crashes
decorator gets updated to do a full process check.

Review commit: https://reviewboard.mozilla.org/r/66462/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/66462/
Attachment #8767886 - Attachment description: Bug 1257476 - Marionette has to force close the process if it doesn't shut down itself. → Bug 1257476 - Ensure Marionette error classes use correct inheritance.
Attachment #8773729 - Flags: review?(dburns)
Attachment #8767886 - Flags: review?(dburns)
Comment on attachment 8767886 [details]
Bug 1257476 - Ensure Marionette error classes use correct inheritance.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/62262/diff/4-5/
Attachment #8767886 - Flags: review?(dburns)
Attachment #8773729 - Flags: review?(dburns)
This is very strange and I do not understand it at all. The same failure for test_screenshots.py re-appeared not only on try bug also on my local box. It was perfectly working yesterday. Looks like something else is still affecting us here.
(In reply to Henrik Skupin (:whimboo) from comment #11)
> The last updated patch should be complete. I will only have to separate out
> changes into different commits for an easier review.
> 
> David, do you want to have tests for that change? If yes, it might be a bit
> harder to do given that I would have to mock a socket server which would
> need to be used to emulate a broken Firefox binary.

I am happy to not have tests for this. Hopefully we wont break anything in this area in the future.
Flags: needinfo?(dburns)
Attachment #8767886 - Flags: review?(dburns)
Attachment #8773729 - Flags: review?(dburns)
Comment on attachment 8773729 [details]
Bug 1257476 - Marionette has to force close the process if it doesn't shut down itself.

https://reviewboard.mozilla.org/r/66462/#review63824
Attachment #8773729 - Flags: review?(dburns) → review+
Comment on attachment 8767886 [details]
Bug 1257476 - Ensure Marionette error classes use correct inheritance.

https://reviewboard.mozilla.org/r/62262/#review63826
Attachment #8767886 - Flags: review?(dburns) → review+
I will land these commits once the blocking patch on bug 1287723 got reviewed and landed.
Whiteboard: [blocked by bug 1287723]
Comment on attachment 8767886 [details]
Bug 1257476 - Ensure Marionette error classes use correct inheritance.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/62262/diff/5-6/
Comment on attachment 8773729 [details]
Bug 1257476 - Marionette has to force close the process if it doesn't shut down itself.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/66462/diff/1-2/
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/928a0bdcea4c
Ensure Marionette error classes use correct inheritance. r=ato,automatedtester
https://hg.mozilla.org/integration/autoland/rev/aadff7c91cd6
Marionette has to force close the process if it doesn't shut down itself. r=automatedtester
https://hg.mozilla.org/mozilla-central/rev/928a0bdcea4c
https://hg.mozilla.org/mozilla-central/rev/aadff7c91cd6
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
I may would like to backport this change if possible to reduce the breakage for possibly broken localized versions of Firefox. I will revise once this code is on mozilla-aurora and works fine for localized builds.
https://reviewboard.mozilla.org/r/66462/#review64788

::: testing/marionette/client/marionette_driver/marionette.py:757
(Diff revision 2)
> +
> +        """
> +        if self.instance:
> +            exc, val, tb = sys.exc_info()
> +
> +            returncode = self.instance.runner.returncode

Actually this line caused a regression in reporting the state of the application. I should not have changed it from `self.instance.runner.wait()` to just `returncode`.

I will fix that broken behavior as part of bug 1202392.
https://reviewboard.mozilla.org/r/66462/#review65038

::: testing/marionette/client/marionette_driver/decorators.py:48
(Diff revision 2)
>          except (MarionetteException, socket.error, IOError) as e:
>              exc, val, tb = sys.exc_info()
>              if not isinstance(e, MarionetteException) or type(e) is MarionetteException:
>                  if not always:
> -                    check()
> +                    check_for_crash()
> +            if not isinstance(e, MarionetteException) or type(e) is TimeoutException:

Allowing the TimeoutException here has been caused another regression. It's intermittent and always happens when an operation in Marionette server times out. Instead of TimeoutException the IOError class should have been checked.

A fix for this regression I will add to my patch on bug 1202392.
Given that we have two regressions here I think we should just let this patch ride the trains. It will even be on aurora by early next week. So I hope we can get the follow-up patches landed by then.
You need to log in before you can comment on or make changes to this bug.