1198950 - [Marionette] All Jenkins jobs are failing almost all of the times - "Connection to Marionette server is lost"

Reporter

Description

•

9 years ago

At least since 2408-2015, we get Marionette connection lost issues again. This is not bug 1190791, which is supposedly fixed and would only happen somewhere at the end of a test run. This is happening almost with every test (including the brick test). A typical error message: http://jenkins1.qa.scl3.mozilla.com/view/Mozilla%20Lab/job/flame-kk-319.b2g-inbound.tinderbox.ui.brick/137/testReport/%28root%29/test_brick_verification_py%20TestBrickVerification_test_verify_phone_not_bricked/test_brick_verification_py_TestBrickVerification_test_verify_phone_not_bricked/ Traceback (most recent call last): File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_client-0.18-py2.7.egg/marionette/marionette_test.py", line 277, in run self.setUp() File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 867, in setUp self.device.start_b2g() File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 667, in start_b2g self.marionette.start_session() File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_driver-0.12-py2.7.egg/marionette_driver/marionette.py", line 1025, in start_session self.session = self._send_message('newSession', 'value', capabilities=desired_capabilities, sessionId=session_id) File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_driver-0.12-py2.7.egg/marionette_driver/decorators.py", line 36, in _ return func(*args, **kwargs) File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_driver-0.12-py2.7.egg/marionette_driver/marionette.py", line 694, in _send_message response = self.client.send(message) File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_transport-0.6-py2.7.egg/marionette_transport/transport.py", line 97, in send self.connect() File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_transport-0.6-py2.7.egg/marionette_transport/transport.py", line 85, in connect hello = self.receive() File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_transport-0.6-py2.7.egg/marionette_transport/transport.py", line 61, in receive raise IOError(self.connection_lost_msg) IOError: Connection to Marionette server is lost. Check gecko.log (desktop firefox) or logcat (b2g) for errors.

Martijn Wargers (dead)

Reporter

Comment 1

•

9 years ago

No-Jun mentioned on irce that these error messages might have something to do with it: 16:17:50 Most recent errors/exceptions are: 16:17:50 08-24 16:17:48.387 E/ACDB-LOADER( 1769): Error: ACDB ANC returned = -1 16:17:50 08-24 16:17:48.387 E/ACDB-LOADER( 1769): Error: ACDB ANC returned = -1 16:17:50 08-24 16:17:48.387 E/ACDB-LOADER( 1769): Error: ACDB ANC returned = -1 16:17:50 08-24 16:17:48.387 E/ACDB-LOADER( 1769): Error: ACDB ANC returned = -1 16:17:50 08-24 16:17:48.387 E/ACDB-LOADER( 1769): Error: ACDB ANC returned = -1 16:17:50 16:17:50 The B2G process has restarted after crashing during the tests so Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure. Bug 1035040, comment 7 mentions these ACDB errors.

No-Jun Park [:njpark]

Comment 2

•

9 years ago

Attached file logcat.txt — Details

Logcat from the failed jenkins run is attached. Seems similar to Bug 1035040

No-Jun Park [:njpark]

Comment 3

•

9 years ago

Hi Paul, since you have the experience with Bug 1035040, was wondering whether you can help out identify the issue. it seems quite similar to what you saw before.

Flags: needinfo?(pyang)

Martijn Wargers (dead)

Reporter

Comment 4

•

9 years ago

Alastor, perhaps you can help out here? Do you know what these ACDB errors are? And are they related to our test failures?

Flags: needinfo?(alwu)

No-Jun Park [:njpark]

Comment 5

•

9 years ago

ni?ing gwagner too since some errors shown in the logcat *could* be sfe related. Feel free to clear if I'm mistaken.

Flags: needinfo?(anygregor)

Gregor Wagner [:gwagner]

Comment 6

•

9 years ago

Maybe related to bug 1198172?

Flags: needinfo?(anygregor)

Martijn Wargers (dead)

Reporter

Updated

•

9 years ago

Keywords: regression, regressionwindow-wanted

Martijn Wargers (dead)

Reporter

Comment 7

•

9 years ago

(In reply to Gregor Wagner [:gwagner] from comment #6) > Maybe related to bug 1198172? I have no idea if bug 1198172 is related, the bug is not clear to me at all. Supposedly that bug is fixed and currently all Jenkins jobs are still failing.

Martijn Wargers (dead)

Reporter

Updated

•

9 years ago

Blocks: 1190791

Alastor Wu [:alwu]

Comment 8

•

9 years ago

Hi, Martin, Sorry I don't know the ACDB errors, and I think that is not related with our test case failures. Transfer NI to the graphic team people, maybe they would have some clues. --- Hi, Jeremy, Do you know who in your team might know about the ACDB error [1]? Thanks! [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1035040#c7

Flags: needinfo?(alwu) → needinfo?(jeremychen)

Paul Yang [:pyang] (away)

Comment 9

•

9 years ago

If seeing "Marionette server lost " but device was still alive after tests finished, indicating device was too busy to keep the connection. gathering cpu & memory info may help IMO. For logcat, it looks like some services didn't initial correctly so there're ipc errors. keeping retrying will be very painful if it was related to hardware, e.g., audio or RIL. bug 1035040 showed more exception about acdb module but this bug has GeckoDump: Crash reporter : Can't fetch app.reportCrashes. Exception: [Exception... "Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [nsIPrefBranch.getBoolPref]" nsresult: "0x8000ffff (NS_ERROR_UNEXPECTED)" location: "JS frame :: chrome://b2g/content/shell.js :: shell_reportCrash :: line 150" data: no] not quite sure they are relative.

Paul Yang [:pyang] (away)

Updated

•

9 years ago

Flags: needinfo?(pyang)

Jeremy Chen (chenpighead@gmail.com)

Comment 10

•

9 years ago

(In reply to Alastor Wu [:alwu] from comment #8) > Hi, Martin, > Sorry I don't know the ACDB errors, and I think that is not related with our > test case failures. > Transfer NI to the graphic team people, maybe they would have some clues. > > --- > Hi, Jeremy, > Do you know who in your team might know about the ACDB error [1]? > Thanks! > > [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1035040#c7 I've never hit this error before. After a quick round offline discussion, seems none of us know this error.

Flags: needinfo?(jeremychen)

Martijn Wargers (dead)

Reporter

Comment 11

•

9 years ago

Looks like this has been magically fixed! More and more things seem to get greenish at http://jenkins1.qa.scl3.mozilla.com/view/Bitbar/ It makes me wonder what caused and fixed it. Perhaps related to the mail that Andreas Tolfsen sent about "Protocol changes and deployment"? Related to bug 1153822 and bug 1168396.

Flags: needinfo?(ato)

Martijn Wargers (dead)

Reporter

Comment 12

•

9 years ago

Hmm, perhaps this is not fixed yet, there are still failures like this, in the bitbar runs: http://jenkins1.qa.scl3.mozilla.com/view/UI/job/flame-kk-319.mozilla-central.nightly.ui.brick/ http://jenkins1.qa.scl3.mozilla.com/view/Bitbar/job/flame-kk-319.b2g-inbound.tinderbox.ui.brick.bitbar/ http://jenkins1.qa.scl3.mozilla.com/view/Bitbar/job/flame-kk-319.mozilla-central.nightly.ui.brick.bitbar/

Martijn Wargers (dead)

Reporter

Comment 13

•

9 years ago

Ok, it seems like the situation is now similar as what I see locally. Occasionally, this is now failing on my device as it is in the labs.

Paul Yang [:pyang] (away)

Comment 14

•

9 years ago

Unfortunately I saw them in our lab today, and all devices failed because of this. Possibly a device issue such like power saving but still unknown.

Paul Yang [:pyang] (away)

Updated

•

9 years ago

Blocks: MTBF-Marionette, mtbf-2015q3

Paul Yang [:pyang] (away)

Comment 15

•

9 years ago

The issue we have was about first boot took too much time. Add marionette.wait_for_port() before start_session() simply fix this problem.

Martijn Wargers (dead)

Reporter

Comment 16

•

9 years ago

Paul, what do you mean? There is already wait_for_port() before start_session, no? http://mxr.mozilla.org/gaia/source/tests/python/gaia-ui-tests/gaiatest/gaia_test.py#666

Flags: needinfo?(pyang)

Paul Yang [:pyang] (away)

Comment 17

•

9 years ago

The flash script we use didn't have this protection and pop same issue. It reminds me perhaps the reason is first time boot took longer time and exceed timeout.

Flags: needinfo?(pyang)

Paul Yang [:pyang] (away)

Comment 18

•

9 years ago

I guess marionette itself didn't protect start_session while initializing. for example, http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/runner/base.py#582 http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/runner/base.py#673

Paul Yang [:pyang] (away)

Comment 19

•

9 years ago

start_session has 60 seconds timeout and sock client connection timeout has been set by 360 seconds, so timeout may not be too short here. Keep trying to reproduce this bug locally. Meanwhile, probably I'll change to retry test runner if we caught a IOError.

Martijn Wargers (dead)

Reporter

Comment 20

•

9 years ago

David, do you perhaps have an idea on what's going on here?

Flags: needinfo?(dburns)

Martijn Wargers (dead)

Reporter

Comment 21

•

9 years ago

(In reply to Paul Yang [:pyang] from comment #18) > I guess marionette itself didn't protect start_session while initializing. > for example, > http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/ > marionette/runner/base.py#582 > http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/ > marionette/runner/base.py#673 The error happens here: http://mxr.mozilla.org/gaia/source/tests/python/gaia-ui-tests/gaiatest/gaia_test.py#667 That's not in base.py.

Andreas Tolfsen ❲:ato❳

Comment 22

•

9 years ago

(In reply to Martijn Wargers [:mwargers] (QA) from comment #11) > Looks like this has been magically fixed! More and more things seem to get > greenish at http://jenkins1.qa.scl3.mozilla.com/view/Bitbar/ > > It makes me wonder what caused and fixed it. Perhaps related to the mail > that Andreas Tolfsen sent about "Protocol changes and deployment"? Related > to bug 1153822 and bug 1168396. I haven’t pushed any server-side protocol changes to Gecko yet. If you’re using the Marionette Python client from in-tree or you’ve updated to the latest release (0.19) there is of course a chance of bugs related to emulator/device Gaia tests because that isn’t tested on try, but in theory the changes I made as part of bug 1168396 should all be backwards-compatible. File bugs if you find that they aren’t.

Flags: needinfo?(ato)

David Burns :automatedtester

Comment 23

•

9 years ago

Nothing is standing out here. When it disconnects can you connect via the remote debugger into the device? are there any other network issues?

Flags: needinfo?(dburns)

Naoki Hirata :nhirata (please use needinfo instead of cc)

Comment 24

•

9 years ago

It might be related to bug 1199863; I think we're running out of system space again.

Naoki Hirata :nhirata (please use needinfo instead of cc)

Comment 25

•

9 years ago

A little more clarification: we don't do OTA on the flame; at the same time, my concern is that our reflashing gecko/gaia on top might be causing issues? Or are we doing full flashing again?

Martijn Wargers (dead)

Reporter

Comment 26

•

9 years ago

(In reply to David Burns :automatedtester from comment #23) > Nothing is standing out here. When it disconnects can you connect via the > remote debugger into the device? are there any other network issues? The thing is, it doesn't seem to be a problem anymore (or at least, not much). I don't know of cases anymore, lately, that shows this issue. There was an automation report 3 days ago that showed one failure that might have been happening due to this problem.

No-Jun Park [:njpark]

Comment 27

•

9 years ago

I was able to repro this issue locally on thursday's build (Sep. 3rd) - I ran a test with a repeat value of 20, and around 13th run, it started to show this issue consistently until I rebooted the phone. Will see whether this reproduces on today's.

Naoki Hirata :nhirata (please use needinfo instead of cc)

Comment 28

•

9 years ago

If this happens locally at any given time, can we start collecting data on the device please? If possible things like df, b2g-info, b2g-ps, dmesg and /system/xbin/procrank would help. note: Looks like dumpstate might incorporate most of these.

Martijn Wargers (dead)

Reporter

Comment 29

•

9 years ago

I just got this perhaps and this is with a black screen. adb devices would still show the device. (test3)mwargers:B2G-flash-tool mwargers$ adb shell b2g-ps APPLICATION SEC USER PID PPID VSIZE RSS WCHAN PC NAME

Paul Yang [:pyang] (away)

Comment 30

•

9 years ago

Looks like no b2g process was found. b2g should automatically restart so it looks pretty weird. check dmesg might help.

Martijn Wargers (dead)

Reporter

Comment 31

•

9 years ago

It suddenly seems to be happening a lot again on Bitbar, not in the Mountain View lab. In the Mountain View lab, all the Brick jobs are green for more than a week. On Bitbar, the various Brick jobs have a varying state of stability. This one, more than half of the time red since 16 sept: http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.b2g-inbound.tinderbox.ui.brick.bitbar/ This one even less stable: http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.mozilla-inbound.tinderbox.ui.brick.bitbar/ This one only 2 jobs since 16 sept, and those are green: http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.mozilla-central.nightly.ui.brick.bitbar/ All of the failure in Bitbar have this again: 03:18:16 09-17 03:18:13.849 E/ACDB-LOADER( 1680): Error: ACDB ANC returned = -1 03:18:16 The B2G process has restarted after crashing during the tests so Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure. And the marionnette connection issue happens. It makes me wonder why this happens only on Bitbar

No-Jun Park [:njpark]

Comment 32

•

9 years ago

Attached file Another logcat from failed run.txt — Details

Martijn Wargers (dead)

Reporter

Comment 33

•

9 years ago

Bugs also with "Error: ACDB ANC returned = -1" in it and broken Flame devices: bug 1050282 and bug 1152283. Could it be that the base image on the devices on Bitbar are getting out of sync with the gaia+gecko layer or something, sometimes? When are the devices on Bitbar full-flashed?

No-Jun Park [:njpark]

Comment 34

•

9 years ago

It's reported that https://bugzilla.mozilla.org/show_bug.cgi?id=1197183 came back on the same day as well. Perhaps they're linked? Also, running a different (older) build did not cause this bug

Martijn Wargers (dead)

Reporter

Updated

•

9 years ago

Severity: normal → critical

Martijn Wargers (dead)

Reporter

Comment 35

•

9 years ago

And the failures on the jobs in Bitbar have mysteriously disappeared again now.

Martijn Wargers (dead)

Reporter

Comment 36

•

9 years ago

Not all failures have disappeard. I guess QAnalyst might be able to use Bitbar again for the reports, but I still see a lot of failurs for the Brick test on Bitbar: http://jenkins1.qa.scl3.mozilla.com/view/Brick/ This one is red/failing almost all of the time: http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.b2g-inbound.tinderbox.ui.brick.bitbar/ This one is also red/failing almost all of the time: http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.mozilla-inbound.tinderbox.ui.brick.bitbar/ This one is half red, half green, approximately: http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.mozilla-central.nightly.ui.brick.bitbar/ This one is completely green: http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.mozilla-central.tinderbox.ui.brick.bitbar/

Bobby Chien

Updated

•

9 years ago

blocking-b2g: --- → 2.5+

Priority: -- → P1

Bobby Chien

Updated

•

9 years ago

Summary: All Jenkins jobs are failing almost all of the times - Connection to Marionette server is lost → [Marionette] All Jenkins jobs are failing almost all of the times - "Connection to Marionette server is lost"

Jayme Mercado [:JMercado] (Inactive 3/4/2016)

Comment 37

•

9 years ago

Martijn this isn't really something we can get a window on right? If so can we remove the tag?

Flags: needinfo?(martijn.martijn)

Martijn Wargers (dead)

Reporter

Comment 38

•

9 years ago

You're right, I don't think a regression-window is useful here anymore.

Flags: needinfo?(martijn.martijn)

Keywords: regressionwindow-wanted

Martijn Wargers (dead)

Reporter

Comment 39

•

9 years ago

I heard this was also sometimes an issue in the Taipei lab.

Bobby Chien

Comment 40

•

9 years ago

We need to get this to be fixed in 2.5. David and Andreas, could you please show us some direction to identify issue? do you need log or information for the issue? Many thanks.

Flags: needinfo?(dburns)

Flags: needinfo?(ato)

David Burns :automatedtester

Comment 41

•

9 years ago

WHen the issue happens are you able to connect to the device still via the remote debugging or via telnet?

Flags: needinfo?(dburns)

Andreas Tolfsen ❲:ato❳

Comment 42

•

9 years ago

I don’t have anything to contribute here.

Flags: needinfo?(ato)

Julien Wajsberg [:julienw]

Comment 43

•

9 years ago

[Blocking Requested - why for this release]: I'm not sure this should block the release.

blocking-b2g: 2.5+ → 2.5?

Mahendra Potharaju [:mahe]

Comment 44

•

9 years ago

[Tracking Requested - why for this release]:Clearing the nom for 2.5. Needs to be looked at in Quality Program

blocking-b2g: 2.5? → ---

tracking-b2g: --- → backlog

Paul Yang [:pyang] (away)

Updated

•

9 years ago

Blocks: mtbf-2015q4
No longer blocks: mtbf-2015q3

Paul Yang [:pyang] (away)

Comment 45

•

9 years ago

We haven't seen this issue for a while. leave it for another two weeks and see it will appear again.

Martijn Wargers (dead)

Reporter

Comment 46

•

9 years ago

I just got a crash like this, for one test in a test run: https://pastebin.mozilla.org/8851208

Bobby Chien

Updated

•

9 years ago

tracking-b2g: backlog → +

Priority: P1 → P2

Julien Wajsberg [:julienw]

Comment 47

•

9 years ago

Pastebin is gone :( next time is better to attach a file to the bug, I guess.

Martijn Wargers (dead)

Reporter

Comment 48

•

8 years ago

Marking WONTFIX, sorry for the bug spam. If somebody still wants to work on this, please file a new bug for it.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → WONTFIX

logcat.txt 9 years ago No-Jun Park [:njpark] 399.04 KB, text/plain		Details
Another logcat from failed run.txt 9 years ago No-Jun Park [:njpark] 2.27 MB, text/plain		Details