Closed
Bug 1257476
Opened 8 years ago
Closed 8 years ago
Marionette doesn't kill Firefox in case of YSOD (DTD entities not found) issues
Categories
(Remote Protocol :: Marionette, defect)
Remote Protocol
Marionette
Tracking
(firefox49 wontfix, firefox50 fixed)
RESOLVED
FIXED
mozilla50
People
(Reporter: whimboo, Assigned: whimboo)
References
Details
(Whiteboard: [blocked by bug 1287723])
Attachments
(2 files)
Nearly all locales of Firefox 46.0b1 were busted on Windows due to issues with the l10n migration script. For details see bug 1255811. The problem we faced here is that due to the yellow screen of dead window, Marionette hangs forever and does not kill Firefox. So that caused hanging instances on several Windows machines in mozmill-ci, and all following jobs on those machines are failing. A manual cleanup is absolutely necessary then by connection via VNC and killing all active Firefox processes. This is a critical issue given that we run a lot of locales for updates and if this situation comes up all 4 slave machines we have per platform version will be quickly busted. It will cause delays in the QA signoffs for release processes of Firefox. For the Firefox 46.0b1 release it only affected Windows, but this might be a cross-platform issue with Marionette. In case you want to test this on Windows one of the listed failing locales can be used: https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=fb3494d06dfb&filter-tier=1&filter-tier=2&filter-tier=3&filter-searchStr=firefox%20ui Builds can be downloaded from here: http://archive.mozilla.org/pub/firefox/candidates/46.0b1-candidates/build8/win32/
Comment 1•8 years ago
|
||
from a quick scan of the code it looks like we need a try: except: around checking we can connect to the browser. if we can't we need to cleanup the process. Looks like https://dxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette_driver/marionette.py#602 is where we need to do it.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 3•8 years ago
|
||
I digged into that problem by manipulating a Firefox installation which causes such a YSOD window. The proposed location from above is not the correct code which needs a fix. It's the following one: https://dxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette_driver/marionette.py#676 In the above case the Marionette server gets started because its a XPCOM component. If Firefox comes up broken or not doesn't matter here. When we are trying to send a message for a new session the call to recv() will abort with a timeout, which will raise an IOError. Right now we blindly assume that the application is shutting down itself and we only call `runner.wait()` to wait for it to happen. But that does not happen here! The application is still up and running, and because we raise the IOError again - which doesn't get handled again - we exit Marionette without stopping the application. The solution here is that we call `runner.stop()` in the case when the application is still running after `runner.wait()`. We really cannot leave the process running when we started it on our own.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Assignee | ||
Comment 4•8 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/62262/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/62262/
Assignee | ||
Comment 5•8 years ago
|
||
While further testing this patch I noticed that the level when we force stop the binary now will leave out other exceptions, which should also cause a force stop of the binary. It means that the handling of the IOError needs to be higher up in the stack.
Comment 6•8 years ago
|
||
Comment on attachment 8767886 [details] Bug 1257476 - Ensure Marionette error classes use correct inheritance. https://reviewboard.mozilla.org/r/62262/#review59056 LGTM. ::: testing/marionette/client/marionette_driver/marionette.py:684 (Diff revision 1) > - # for it to shut down. > + # for it to shut down. In the case when it doesn't happen, > + # force its shut down. > returncode = self.instance.runner.wait(timeout=self.DEFAULT_STARTUP_TIMEOUT) > - raise IOError("process died with returncode %s" % returncode) > + if returncode is None: > + self.instance.runner.stop() > + raise IOError("Process killed because the connection was lost.") Nit: Drop . ::: testing/marionette/client/marionette_driver/marionette.py:686 (Diff revision 1) > returncode = self.instance.runner.wait(timeout=self.DEFAULT_STARTUP_TIMEOUT) > - raise IOError("process died with returncode %s" % returncode) > + if returncode is None: > + self.instance.runner.stop() > + raise IOError("Process killed because the connection was lost.") > + > + raise IOError("Process died with returncode: %s." % returncode) Nit: Drop .
Attachment #8767886 -
Flags: review+
Assignee | ||
Comment 7•8 years ago
|
||
Before I want to bring a patch up for review I would like to see bug 1282570 fixed first.
Assignee | ||
Comment 8•8 years ago
|
||
Comment on attachment 8767886 [details] Bug 1257476 - Ensure Marionette error classes use correct inheritance. Review request updated; see interdiff: https://reviewboard.mozilla.org/r/62262/diff/1-2/
Attachment #8767886 -
Flags: review+
Assignee | ||
Comment 9•8 years ago
|
||
Comment on attachment 8767886 [details] Bug 1257476 - Ensure Marionette error classes use correct inheritance. Review request updated; see interdiff: https://reviewboard.mozilla.org/r/62262/diff/2-3/
Assignee | ||
Comment 10•8 years ago
|
||
Comment on attachment 8767886 [details] Bug 1257476 - Ensure Marionette error classes use correct inheritance. Review request updated; see interdiff: https://reviewboard.mozilla.org/r/62262/diff/3-4/
Assignee | ||
Comment 11•8 years ago
|
||
The last updated patch should be complete. I will only have to separate out changes into different commits for an easier review. David, do you want to have tests for that change? If yes, it might be a bit harder to do given that I would have to mock a socket server which would need to be used to emulate a broken Firefox binary.
Flags: needinfo?(dburns)
Assignee | ||
Comment 12•8 years ago
|
||
Under some circumstances Marionette currently fails to stop the application in case of socket issues. To ensure that the application always gets closed - in the case when Marionette started it - the check for crashes decorator gets updated to do a full process check. Review commit: https://reviewboard.mozilla.org/r/66462/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/66462/
Attachment #8767886 -
Attachment description: Bug 1257476 - Marionette has to force close the process if it doesn't shut down itself. → Bug 1257476 - Ensure Marionette error classes use correct inheritance.
Attachment #8773729 -
Flags: review?(dburns)
Attachment #8767886 -
Flags: review?(dburns)
Assignee | ||
Comment 13•8 years ago
|
||
Comment on attachment 8767886 [details] Bug 1257476 - Ensure Marionette error classes use correct inheritance. Review request updated; see interdiff: https://reviewboard.mozilla.org/r/62262/diff/4-5/
Assignee | ||
Updated•8 years ago
|
Attachment #8767886 -
Flags: review?(dburns)
Attachment #8773729 -
Flags: review?(dburns)
Assignee | ||
Comment 14•8 years ago
|
||
This is very strange and I do not understand it at all. The same failure for test_screenshots.py re-appeared not only on try bug also on my local box. It was perfectly working yesterday. Looks like something else is still affecting us here.
Comment 15•8 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #11) > The last updated patch should be complete. I will only have to separate out > changes into different commits for an easier review. > > David, do you want to have tests for that change? If yes, it might be a bit > harder to do given that I would have to mock a socket server which would > need to be used to emulate a broken Firefox binary. I am happy to not have tests for this. Hopefully we wont break anything in this area in the future.
Flags: needinfo?(dburns)
Assignee | ||
Updated•8 years ago
|
Attachment #8767886 -
Flags: review?(dburns)
Attachment #8773729 -
Flags: review?(dburns)
Comment 16•8 years ago
|
||
Comment on attachment 8773729 [details] Bug 1257476 - Marionette has to force close the process if it doesn't shut down itself. https://reviewboard.mozilla.org/r/66462/#review63824
Attachment #8773729 -
Flags: review?(dburns) → review+
Comment 17•8 years ago
|
||
Comment on attachment 8767886 [details] Bug 1257476 - Ensure Marionette error classes use correct inheritance. https://reviewboard.mozilla.org/r/62262/#review63826
Attachment #8767886 -
Flags: review?(dburns) → review+
Assignee | ||
Comment 18•8 years ago
|
||
I will land these commits once the blocking patch on bug 1287723 got reviewed and landed.
Whiteboard: [blocked by bug 1287723]
Assignee | ||
Comment 19•8 years ago
|
||
Comment on attachment 8767886 [details] Bug 1257476 - Ensure Marionette error classes use correct inheritance. Review request updated; see interdiff: https://reviewboard.mozilla.org/r/62262/diff/5-6/
Assignee | ||
Comment 20•8 years ago
|
||
Comment on attachment 8773729 [details] Bug 1257476 - Marionette has to force close the process if it doesn't shut down itself. Review request updated; see interdiff: https://reviewboard.mozilla.org/r/66462/diff/1-2/
Comment 21•8 years ago
|
||
Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/928a0bdcea4c Ensure Marionette error classes use correct inheritance. r=ato,automatedtester https://hg.mozilla.org/integration/autoland/rev/aadff7c91cd6 Marionette has to force close the process if it doesn't shut down itself. r=automatedtester
Comment 22•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/928a0bdcea4c https://hg.mozilla.org/mozilla-central/rev/aadff7c91cd6
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
status-firefox50:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Assignee | ||
Comment 23•8 years ago
|
||
I may would like to backport this change if possible to reduce the breakage for possibly broken localized versions of Firefox. I will revise once this code is on mozilla-aurora and works fine for localized builds.
status-firefox49:
--- → ?
Assignee | ||
Comment 24•8 years ago
|
||
https://reviewboard.mozilla.org/r/66462/#review64788 ::: testing/marionette/client/marionette_driver/marionette.py:757 (Diff revision 2) > + > + """ > + if self.instance: > + exc, val, tb = sys.exc_info() > + > + returncode = self.instance.runner.returncode Actually this line caused a regression in reporting the state of the application. I should not have changed it from `self.instance.runner.wait()` to just `returncode`. I will fix that broken behavior as part of bug 1202392.
Assignee | ||
Comment 25•8 years ago
|
||
https://reviewboard.mozilla.org/r/66462/#review65038 ::: testing/marionette/client/marionette_driver/decorators.py:48 (Diff revision 2) > except (MarionetteException, socket.error, IOError) as e: > exc, val, tb = sys.exc_info() > if not isinstance(e, MarionetteException) or type(e) is MarionetteException: > if not always: > - check() > + check_for_crash() > + if not isinstance(e, MarionetteException) or type(e) is TimeoutException: Allowing the TimeoutException here has been caused another regression. It's intermittent and always happens when an operation in Marionette server times out. Instead of TimeoutException the IOError class should have been checked. A fix for this regression I will add to my patch on bug 1202392.
Assignee | ||
Comment 26•8 years ago
|
||
Given that we have two regressions here I think we should just let this patch ride the trains. It will even be on aurora by early next week. So I hope we can get the follow-up patches landed by then.
Updated•1 year ago
|
Product: Testing → Remote Protocol
You need to log in
before you can comment on or make changes to this bug.
Description
•