Closed Bug 1190791 Opened 4 years ago Closed 4 years ago

Again failures in various tests in self.marionette.start_session() : IOError: Connection to Marionette server is lost

Categories

(Firefox OS Graveyard :: Gaia::UI Tests, defect)

ARM
Gonk (Firefox OS)
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: martijn.martijn, Unassigned)

References

Details

(Keywords: regression)

Attachments

(1 file)

I think I see similar failures again like we saw in bug 1172343 :(

http://jenkins1.qa.scl3.mozilla.com/view/Bitbar/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/207/HTML_Report/
Traceback (most recent call last):
  File "/var/lib/jenkins/jobs/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/workspace/.env/lib/python2.7/site-packages/marionette_client-0.16-py2.7.egg/marionette/marionette_test.py", line 277, in run
    self.setUp()
  File "/var/lib/jenkins/jobs/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/workspace/tests/python/gaia-ui-tests/gaiatest/tests/functional/system/test_privileged_app_video_capture_prompt.py", line 13, in setUp
    GaiaTestCase.setUp(self)
  File "/var/lib/jenkins/jobs/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/workspace/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 862, in setUp
    self.device.start_b2g()
  File "/var/lib/jenkins/jobs/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/workspace/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 663, in start_b2g
    self.marionette.start_session()
  File "/var/lib/jenkins/jobs/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/workspace/.env/lib/python2.7/site-packages/marionette_driver-0.9-py2.7.egg/marionette_driver/marionette.py", line 1015, in start_session
    self.session = self._send_message('newSession', 'value', capabilities=desired_capabilities, sessionId=session_id)
  File "/var/lib/jenkins/jobs/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/workspace/.env/lib/python2.7/site-packages/marionette_driver-0.9-py2.7.egg/marionette_driver/decorators.py", line 36, in _
    return func(*args, **kwargs)
  File "/var/lib/jenkins/jobs/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/workspace/.env/lib/python2.7/site-packages/marionette_driver-0.9-py2.7.egg/marionette_driver/marionette.py", line 691, in _send_message
    response = self.client.send(message)
  File "/var/lib/jenkins/jobs/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/workspace/.env/lib/python2.7/site-packages/marionette_transport-0.5-py2.7.egg/marionette_transport/transport.py", line 101, in send
    self.connect()
  File "/var/lib/jenkins/jobs/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/workspace/.env/lib/python2.7/site-packages/marionette_transport-0.5-py2.7.egg/marionette_transport/transport.py", line 89, in connect
    hello = self.receive()
  File "/var/lib/jenkins/jobs/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/workspace/.env/lib/python2.7/site-packages/marionette_transport-0.5-py2.7.egg/marionette_transport/transport.py", line 73, in receive
    raise IOError(self.connection_lost_msg)
IOError: Connection to Marionette server is lost. Check gecko.log (desktop firefox) or logcat (b2g) for errors.

I also saw it in smoke 2 of bitbar and I guess this also happens in other places.
Oliver, would you perhaps be willing to find out when this regressed again? (I think you can also look at bitbar to see when it regressed.)
Flags: needinfo?(onelson)
Again, with the patch from bug 1172343, comment 28 applied, running that test gives me this failure in 2 out of 3 repeats.
On mozilla-central Jenkins jobs, I almost never see this occur on smoke runs. I'm curious if it's because the tests don't run enough for this to occur. I see this most commonly in the non-smoke runs, and from what I can discern it appears it started reproing again on August 1st:

* http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1/382/
* http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.mozilla-central.nightly.ui.functional.non-smoke.1.bitbar/204/

This error really hurts automation testing because the tests that fail this way take 15 minutes before they close out. They always appear last in the HTML report, which I assume means they were the last tests run by the Marionette test client. Is it possible the test is taking too long and the client is losing its port from adb?
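
Whether the forwarded port still answers can be checked outside the test harness. This is only a minimal sketch, assuming the Marionette port is forwarded to localhost:2828 (an assumption, not something stated in this bug) and that the server sends some greeting bytes on connect, as the `hello = self.receive()` frame in the traceback suggests. `probe_port` is a hypothetical helper and not part of marionette_driver or marionette_transport.

```python
import socket


def probe_port(host="localhost", port=2828, timeout=5.0):
    """Return the first bytes the server sends after connect, or None.

    None means the connection was refused, timed out, or was closed
    by the other side before any data arrived.
    Host and port defaults are assumptions for illustration.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return None
    try:
        sock.settimeout(timeout)
        data = sock.recv(1024)
        # An empty read means the peer closed without greeting us.
        return data or None
    except (socket.error, socket.timeout):
        return None
    finally:
        sock.close()
```

If the probe returns None while the device is otherwise responsive, that would point at the connection itself rather than at the test that happened to run.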

Could we modify the timeout on tests so they never take more than 5 minutes? It would at least reduce some of the time overhead created by this failure.
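
To make the 5-minute idea concrete, here is a rough sketch of a hard cap around a single blocking call. It is generic Python, not the Marionette client's own timeout mechanism; `run_with_timeout` and `StepTimeout` are hypothetical names introduced only for illustration.

```python
import threading


class StepTimeout(Exception):
    """Raised when the wrapped callable does not finish in time."""


def run_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a worker thread and stop waiting after timeout_s seconds.

    On timeout the worker is abandoned, not killed: it is a daemon
    thread, so it cannot keep the process alive on exit.
    """
    result = {}

    def worker():
        try:
            result["value"] = func(*args, **kwargs)
        except Exception as exc:
            result["error"] = exc

    thread = threading.Thread(target=worker)
    thread.daemon = True
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        raise StepTimeout("no result after %s seconds" % timeout_s)
    if "error" in result:
        raise result["error"]
    return result["value"]
```

A harness could, for instance, call `run_with_timeout(self.device.start_b2g, 300)` in setUp. This only shortens the wait; it does nothing about the lost connection itself.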
Flags: needinfo?(onelson) → needinfo?(martijn.martijn)
(In reply to Oliver Nelson [:oliverthor] from comment #3)
> Could we modify the timeout on tests so they never take more than 5 minutes?
> It would at least reduce some of the time overhead created by this failure.

I don't understand what you mean. We should never get these kinds of failures in the first place. They are very disruptive.
Flags: needinfo?(martijn.martijn)
With the test in the pull request, I can reproduce it once it starts the 2nd run in there. The first test takes something like 272413 ms.
When this issue occurs, I see only this message appearing, repeatedly: 
V/WLAN_PSA(  215): NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028
It's passing on:
Build ID               20150731150205
Gaia Revision          2ca27bbdd84526c6a3b198d9cf10f2caff1dadde
Gaia Date              2015-07-31 08:23:31
Gecko Revision         https://hg.mozilla.org/mozilla-central/rev/afa67b6957bb
Gecko Version          42.0a1
Device Name            flame
Firmware(Release)      4.4.2
Firmware(Incremental)  eng.cltbld.20150727.063909
Firmware Date          Mon Jul 27 06:39:20 EDT 2015
Bootloader             L1TC000118D0

It fails on:
Build ID               20150801030207
Gaia Revision          2ca27bbdd84526c6a3b198d9cf10f2caff1dadde
Gaia Date              2015-07-31 08:23:31
Gecko Revision         https://hg.mozilla.org/mozilla-central/rev/aeb85029c3b3
Gecko Version          42.0a1
Device Name            flame
Firmware(Release)      4.4.2
Firmware(Incremental)  eng.cltbld.20150727.063909
Firmware Date          Mon Jul 27 06:39:20 EDT 2015
Bootloader             L1TC000118D0
There was no change in Gaia between those builds, so there is only a Gecko changelog to look for.
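
The Gecko changelog between the two builds can be pulled up from the two revisions listed above. A small sketch, assuming the `pushloghtml` view on hg.mozilla.org with its `fromchange` and `tochange` query parameters; `pushlog_url` is a hypothetical helper name.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def pushlog_url(good_rev, bad_rev,
                repo="https://hg.mozilla.org/mozilla-central"):
    """Build a pushlog URL covering the range from good_rev to bad_rev."""
    query = urlencode([("fromchange", good_rev), ("tochange", bad_rev)])
    return "%s/pushloghtml?%s" % (repo, query)


# Passing build (afa67b6957bb) to failing build (aeb85029c3b3).
print(pushlog_url("afa67b6957bb", "aeb85029c3b3"))
```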

It looks like this was caused by bug 1180596.

We had similar issues before, for which we filed bug 1172343. There, the regression range matched the Presentation WebAPI landing and the fix range matched it being disabled.
It also makes me wonder whether bug 1171827 might be back again as well.

Gary, can you take a look at this?
Blocks: 1180596
Flags: needinfo?(xeonchen)
Fabrice just disabled device discovery again in bug 1196884, so this should be fixed tomorrow.
Status: NEW → RESOLVED
Closed: 4 years ago
Depends on: 1196884
Resolution: --- → FIXED
Johan, do you think it would be useful to have the test that I've attached to this bug checked in as a regression test?
Flags: needinfo?(xeonchen) → needinfo?(jlorenzo)
Yes, this test is very valuable to us. I'd put it in the sanity suite. What do you guys think?
Flags: needinfo?(martijn.martijn)
Flags: needinfo?(jlorenzo)
Flags: needinfo?(jdorlus)
The sanity suite probably makes the most sense. It doesn't really belong in the unit test suite, but I don't know which functional area it should belong in.
Flags: needinfo?(martijn.martijn)
Depends on: 1198264
I just filed bug 1198264 to keep track of adding this to the test suite.
Yes, I agree that it should go in the sanity suite.
Flags: needinfo?(jdorlus)
I can still reproduce this in the latest Flame build, using the automated test from bug 1198264. Reopening.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This is still very likely caused by bug 1180596.
Flags: needinfo?(xeonchen)
Hmm, perhaps this is a different issue, this is much easier to reproduce in all kinds of testing in the latest Flame build.
Status: REOPENED → RESOLVED
Closed: 4 years ago
Flags: needinfo?(xeonchen)
Resolution: --- → FIXED
(In reply to Martijn Wargers [:mwargers] (QA) from comment #18)
> Hmm, perhaps this is a different issue, this is much easier to reproduce in
> all kinds of testing in the latest Flame build.

I filed bug 1198950 for this.
However, the automated test in bug 1198264 is still causing this. At this point, I'll hold off on testing this further until bug 1198950 is fixed.
Depends on: 1198950