Closed Bug 1166033 Opened 9 years ago Closed 9 years ago

Marionette doesn't recognize when the application dies, waits for the socket timeout, and doesn't start application again for following tests

Categories

(Remote Protocol :: Marionette, defect)

defect
Not set
major

Tracking

(firefox41 fixed)

VERIFIED FIXED
mozilla41
Tracking Status
firefox41 --- fixed

People

(Reporter: whimboo, Assigned: jgriffin)

References

Details

Attachments

(1 file, 1 obsolete file)

While doing some tests with our Mozmill CI and getting the Firefox UI tests running, I noticed that Marionette has a somewhat strange behavior if you interact with the application during the testrun. E.g closing the application (which can also happen by a crash) seems to cause Marionette to wait for the global timeout. This is not necessary because mozrunner should tell us that the application has been closed. With a default global socket timeout of about 6min (or so?) we can safe a lot of time when running tests, and some unexpected behavior happens with the application.
Version: 38 Branch → unspecified
Problem here is that if the application gets closed somehow, Marionette doesn't handle it correctly at all. It even doesn't start a new instance of the application before continuing with the next tests!

Here an example when the application gets closed in the middle of a run:

[..]

  File "/mozilla/code/firefox/nightly/testing/marionette/transport/marionette_transport/transport.py", line 99, in send
    raise IOError("%s: %s" % (str(e), self.connection_lost_msg))

IOError: [Errno 32] Broken pipe: Connection to Marionette server is lost. Check gecko.log (desktop firefox) or logcat (b2g) for errors.

Now it takes a while until Marionette continuous with the next test. But then no new instance of Firefox gets started and I get:

[..]

    self.connect()
  File "/mozilla/code/firefox/nightly/testing/marionette/transport/marionette_transport/transport.py", line 69, in connect
    self.sock.connect((self.addr, self.port))
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused

We should handle the application status more carefully and should not only rely on the socket connection and its timeout.

Jonathan, you mentioned that you might be able to work on this issue. Is it still true?
Severity: normal → major
Flags: needinfo?(jgriffin)
Summary: If application closes Marionette should not wait for global timeout → Marionette doesn't recognize when the application dies, waits for the socket timeout, and doesn't start application again for following tests
Yes, I'll take this one.
Assignee: nobody → jgriffin
Status: NEW → ASSIGNED
Flags: needinfo?(jgriffin)
Do you have a test handy that illustrates this problem?
Flags: needinfo?(hskupin)
Note that handling this in case of a crash is different than handling the normal shutdown case; I don't expect to tackle the former, at least this week.
Is there a way that we can recreate the issue? 

(In reply to Henrik Skupin (:whimboo) from comment #1)
> Problem here is that if the application gets closed somehow, Marionette
> doesn't handle it correctly at all. It even doesn't start a new instance of
> the application before continuing with the next tests!
> 
> Here an example when the application gets closed in the middle of a run:
> 
> [..]
> 
>   File
> "/mozilla/code/firefox/nightly/testing/marionette/transport/
> marionette_transport/transport.py", line 99, in send
>     raise IOError("%s: %s" % (str(e), self.connection_lost_msg))
> 
> IOError: [Errno 32] Broken pipe: Connection to Marionette server is lost.
> Check gecko.log (desktop firefox) or logcat (b2g) for errors.
> 

Is this due to a crash or?


> Now it takes a while until Marionette continuous with the next test. But
> then no new instance of Firefox gets started and I get:
> 
> [..]
> 
>     self.connect()
>   File
> "/mozilla/code/firefox/nightly/testing/marionette/transport/
> marionette_transport/transport.py", line 69, in connect
>     self.sock.connect((self.addr, self.port))
>   File "/usr/lib/python2.7/socket.py", line 228, in meth
>     return getattr(self._sock,name)(*args)
> error: [Errno 111] Connection refused
> 
> We should handle the application status more carefully and should not only
> rely on the socket connection and its timeout.

Does a new firefox appear between test files?
(In reply to David Burns :automatedtester from comment #5)
> Is there a way that we can recreate the issue? 

Just run the firefox ui tests like "firefox-ui-tests --binary %path%". At some point simply close the application via the x button.

> > IOError: [Errno 32] Broken pipe: Connection to Marionette server is lost.
> > Check gecko.log (desktop firefox) or logcat (b2g) for errors.
> > 
> 
> Is this due to a crash or?

Nope, just closing the application.

> > We should handle the application status more carefully and should not only
> > rely on the socket connection and its timeout.
> 
> Does a new firefox appear between test files?

As mentioned in the beginning of that comment there is no new instance of the application coming up. Marionette still thinks the original instance is open.
Flags: needinfo?(hskupin)
If this bug primarily concerns behavior that occurs when manually closing Firefox during a test run (as opposed to closing it programmatically) I would downgrade its importance.  None of our frameworks would handle this situation gracefully, and although I can imagine that it might be useful when debugging things locally, it doesn't seem to warrant a 'major' designation.

I'll still look at it, however.
I feel that it can happen in any situatino the application gets closed unexpectedly. What I could do is to convert one of our Mozmill tests which triggers a crash of the application. It could help to let us understand the problem better.
Attached patch WIP (obsolete) — Splinter Review
WIP.  This reworks the socket code to throw an exception quickly after the process dies, rather than having to wait for a socket.timeout.  We'll need to catch this in the runner in order to restart the process, and probably needs some other things as well.
CC'ing Armen here given that I have seen it for update tests too, and what I read on bug 1148546 comment 113 he had some problem with the application not shutting down. Not sure how long he waited but it could be related to this bug.
Attachment #8613152 - Attachment is obsolete: true
I have a patch which works well locally; going to run through try before requesting review.
if this looks ok, I'll schedule a larger run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=dc748a4392b8
It looks lovely but sadly I cannot test because the tryserver run did not create linux64 binaries. Please let me know once you created those and I will test with our Firefox UI Tests.
Bigger try run, which will include linux64 builds:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0da1af167fb8
Comment on attachment 8614357 [details]
MozReview Request: Bug 1166033 - Notice when the process has died, and restart it, r=dburns

Bug 1166033 - Notice when the process has died, and restart it
Attachment #8614357 - Flags: review?(dburns)
I tested this build by closing Firefox manually while tests were running and we still do not start a new instance of the application. At least not under Linux 64bit. For the next test I see the following exception:

Traceback (most recent call last):
  File "/home/henrik/.virtualenvs/fxtests/local/lib/python2.7/site-packages/marionette_client-0.12-py2.7.egg/marionette/marionette_test.py", line 277, in run
    self.setUp()
  File "/mozilla/code/firefox-ui-tests/firefox_puppeteer/tests/test_places.py", line 13, in setUp
    FirefoxTestCase.setUp(self)
  File "/mozilla/code/firefox-ui-tests/firefox_ui_harness/testcases/base.py", line 42, in setUp
    MarionetteTestCase.setUp(self, *args, **kwargs)
  File "/home/henrik/.virtualenvs/fxtests/local/lib/python2.7/site-packages/marionette_client-0.12-py2.7.egg/marionette/marionette_test.py", line 641, in setUp
    CommonTestCase.setUp(self)
  File "/home/henrik/.virtualenvs/fxtests/local/lib/python2.7/site-packages/marionette_client-0.12-py2.7.egg/marionette/marionette_test.py", line 410, in setUp
    self.marionette.start_session()
  File "/home/henrik/.virtualenvs/fxtests/local/lib/python2.7/site-packages/marionette_driver-0.7-py2.7.egg/marionette_driver/marionette.py", line 951, in start_session
    self.session = self._send_message('newSession', 'value', capabilities=desired_capabilities, sessionId=session_id)
  File "/home/henrik/.virtualenvs/fxtests/local/lib/python2.7/site-packages/marionette_driver-0.7-py2.7.egg/marionette_driver/decorators.py", line 36, in _
    return func(*args, **kwargs)
  File "/home/henrik/.virtualenvs/fxtests/local/lib/python2.7/site-packages/marionette_driver-0.7-py2.7.egg/marionette_driver/marionette.py", line 687, in _send_message
    response = self.client.send(message)
  File "/mozilla/code/firefox/nightly/testing/marionette/transport/marionette_transport/transport.py", line 87, in send
    self.connect()
  File "/mozilla/code/firefox/nightly/testing/marionette/transport/marionette_transport/transport.py", line 69, in connect
    self.sock.connect((self.addr, self.port))
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused

Looks like the socket which was in use before has not been freed up.
I will look at the review after Jonathan has had a look.
Hey Henrik, looking at that stack trace, it doesn't look like my patch was applied; it looks like it's still using the previous version.
This change isn't in the build at all, but in the marionette-transport and marionette-driver packages, so you'd need to download the tests.zip from that build and use those packages.
Oh darn. Sorry for that, you are totally right. So I applied your patch in my recent inbound checkout and didn't notice any difference. Both packages are used for real:

>>> import marionette_driver
>>> marionette_driver.__file__
'/mozilla/code/firefox/inbound/testing/marionette/driver/marionette_driver/__init__.pyc'
>>> import marionette_transport
>>> marionette_transport.__file__
'/mozilla/code/firefox/inbound/testing/marionette/transport/marionette_transport/__init__.pyc'

Here the exceptions:

  File "/mozilla/code/firefox/inbound/testing/marionette/driver/marionette_driver/marionette.py", line 687, in _send_message
    response = self.client.send(message)

  File "/mozilla/code/firefox/inbound/testing/marionette/transport/marionette_transport/transport.py", line 99, in send
    raise IOError("%s: %s" % (str(e), self.connection_lost_msg))

IOError: [Errno 32] Broken pipe: Connection to Marionette server is lost. Check gecko.log (desktop firefox) or logcat (b2g) for errors.


[..]
  File "/mozilla/code/firefox/inbound/testing/marionette/driver/marionette_driver/marionette.py", line 687, in _send_message
    response = self.client.send(message)
  File "/mozilla/code/firefox/inbound/testing/marionette/transport/marionette_transport/transport.py", line 103, in send
    response = self.receive()
  File "/mozilla/code/firefox/inbound/testing/marionette/transport/marionette_transport/transport.py", line 60, in receive
    raise IOError(self.connection_lost_msg)
IOError: Connection to Marionette server is lost. Check gecko.log (desktop firefox) or logcat (b2g) for errors.

[..]
  File "/mozilla/code/firefox/inbound/testing/marionette/driver/marionette_driver/marionette.py", line 687, in _send_message
    response = self.client.send(message)
  File "/mozilla/code/firefox/inbound/testing/marionette/transport/marionette_transport/transport.py", line 87, in send
    self.connect()
  File "/mozilla/code/firefox/inbound/testing/marionette/transport/marionette_transport/transport.py", line 69, in connect
    self.sock.connect((self.addr, self.port))
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
Hey Henrik,

This is somehow still using the old version.  You can see at https://reviewboard.mozilla.org/r/9921/diff/1/#0, using this line from the stack trace:

line 87, in send
    self.connect()
  File "/mozilla/code/firefox/inbound/testing/marionette/transport/marionette_transport/transport.py", 

That code is at that line in the original code, but not with my patch; with my patch that line happens to be:

self.sock = None
Ok, not sure what really went wrong here but now I created a new venv and the only thing I can say is: AWESOME improvements!! That's exactly the behavior the harness should have when the application under test dies. Great work at least from the functional standpoint.

David, nothing should block a code-review anymore.
Comment on attachment 8614357 [details]
MozReview Request: Bug 1166033 - Notice when the process has died, and restart it, r=dburns

https://reviewboard.mozilla.org/r/9921/#review9097

Ship It!
Attachment #8614357 - Flags: review?(dburns) → review+
big try run to make sure this doesn't mess anything up:  https://treeherder.mozilla.org/#/jobs?repo=try&revision=613f713e6ea6
Backed out for OSX Mn failures in test_profile_management.py:

https://treeherder.mozilla.org/logviewer.html#?job_id=10611879&repo=mozilla-inbound
What's happening is that when Firefox restarts itself, the new instance of Firefox has a new pid, and the mozrunner instance we've used to initially launch Firefox basically loses track of the new instance.

I'm not sure why this only happens on OSX..I haven't looked at what happens on the other platforms.  I'm also not sure how to compensate for this without being overly fragile.  The best thing I can think of is to disable the behavior that this patch implements when a test calls self.marionette.restart(in_app=True).
Comment on attachment 8614357 [details]
MozReview Request: Bug 1166033 - Notice when the process has died, and restart it, r=dburns

Bug 1166033 - Notice when the process has died, and restart it, r=dburns
Attachment #8614357 - Attachment description: MozReview Request: Bug 1166033 - Notice when the process has died, and restart it → MozReview Request: Bug 1166033 - Notice when the process has died, and restart it, r=dburns
Attachment #8614357 - Flags: review+ → review?(dburns)
(In reply to Jonathan Griffin (:jgriffin) from comment #32)
> Comment on attachment 8614357 [details]
> MozReview Request: Bug 1166033 - Notice when the process has died, and
> restart it, r=dburns
> 
> Bug 1166033 - Notice when the process has died, and restart it, r=dburns

This version adds a 'detached' attribute to self.instance when we call restart(in_app=True), at which point we stop trying to tell whether the process is alive during socket transactions, since it won't be...a new process will, but we won't have a mozrunner or mozprocess instance which is attached to it.  Try run is all green.
Comment on attachment 8614357 [details]
MozReview Request: Bug 1166033 - Notice when the process has died, and restart it, r=dburns

https://reviewboard.mozilla.org/r/9921/#review9795

Ship It!
Attachment #8614357 - Flags: review?(dburns) → review+
https://hg.mozilla.org/mozilla-central/rev/4c3af36331e0
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
Blocks: 1176408
This works indeed fine with the new releases of driver and transport. But I still see a hang when the application gets closed after it has been restarted. We may still have some wrong process references. I will file a follow-up bug for it.
Status: RESOLVED → VERIFIED
Blocks: 1176758
Blocks: 1182681
Depends on: 1373635
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: