Closed Bug 1198950 Opened 6 years ago Closed 6 years ago

[Marionette] All Jenkins jobs are failing almost all of the times - "Connection to Marionette server is lost"

Categories

(Firefox OS Graveyard :: Gaia::UI Tests, defect, P2)

ARM
Gonk (Firefox OS)
defect

Tracking

(tracking-b2g:+)

RESOLVED WONTFIX
tracking-b2g +

People

(Reporter: martijn.martijn, Unassigned)

References

Details

(Keywords: regression)

Attachments

(2 files)

At least since 2408-2015, we get Marionette connection lost issues again. This is not bug 1190791, which is supposedly fixed and would only happen somewhere at the end of a test run.
This is happening almost with every test (including the brick test).

A typical error message:
http://jenkins1.qa.scl3.mozilla.com/view/Mozilla%20Lab/job/flame-kk-319.b2g-inbound.tinderbox.ui.brick/137/testReport/%28root%29/test_brick_verification_py%20TestBrickVerification_test_verify_phone_not_bricked/test_brick_verification_py_TestBrickVerification_test_verify_phone_not_bricked/
Traceback (most recent call last):
  File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_client-0.18-py2.7.egg/marionette/marionette_test.py", line 277, in run
    self.setUp()
  File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 867, in setUp
    self.device.start_b2g()
  File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 667, in start_b2g
    self.marionette.start_session()
  File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_driver-0.12-py2.7.egg/marionette_driver/marionette.py", line 1025, in start_session
    self.session = self._send_message('newSession', 'value', capabilities=desired_capabilities, sessionId=session_id)
  File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_driver-0.12-py2.7.egg/marionette_driver/decorators.py", line 36, in _
    return func(*args, **kwargs)
  File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_driver-0.12-py2.7.egg/marionette_driver/marionette.py", line 694, in _send_message
    response = self.client.send(message)
  File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_transport-0.6-py2.7.egg/marionette_transport/transport.py", line 97, in send
    self.connect()
  File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_transport-0.6-py2.7.egg/marionette_transport/transport.py", line 85, in connect
    hello = self.receive()
  File "/var/lib/jenkins/jobs/flame-kk-319.b2g-inbound.tinderbox.ui.brick/workspace/.env/lib/python2.7/site-packages/marionette_transport-0.6-py2.7.egg/marionette_transport/transport.py", line 61, in receive
    raise IOError(self.connection_lost_msg)

IOError: Connection to Marionette server is lost. Check gecko.log (desktop firefox) or logcat (b2g) for errors.
No-Jun mentioned on irce that these error messages might have something to do with it:
16:17:50 Most recent errors/exceptions are:
16:17:50 08-24 16:17:48.387 E/ACDB-LOADER( 1769): Error: ACDB ANC returned = -1
16:17:50 08-24 16:17:48.387 E/ACDB-LOADER( 1769): Error: ACDB ANC returned = -1
16:17:50 08-24 16:17:48.387 E/ACDB-LOADER( 1769): Error: ACDB ANC returned = -1
16:17:50 08-24 16:17:48.387 E/ACDB-LOADER( 1769): Error: ACDB ANC returned = -1
16:17:50 08-24 16:17:48.387 E/ACDB-LOADER( 1769): Error: ACDB ANC returned = -1
16:17:50 
16:17:50 The B2G process has restarted after crashing during  the tests so Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.

Bug 1035040, comment 7 mentions these ACDB errors.
Attached file logcat.txt
Logcat from the failed jenkins run is attached.  Seems similar to Bug 1035040
Hi Paul, since you have the experience with Bug 1035040, was wondering whether you can help out identify the issue.  it seems quite similar to what you saw before.
Flags: needinfo?(pyang)
Alastor, perhaps you can help out here? Do you know what these ACDB errors are? And are they related to our test failures?
Flags: needinfo?(alwu)
ni?ing gwagner too since some errors shown in the logcat *could* be sfe related.  Feel free to clear if I'm mistaken.
Flags: needinfo?(anygregor)
Maybe related to bug 1198172?
Flags: needinfo?(anygregor)
(In reply to Gregor Wagner [:gwagner] from comment #6)
> Maybe related to bug 1198172?

I have no idea if bug 1198172 is related, the bug is not clear to me at all.
Supposedly that bug is fixed and currently all Jenkins jobs are still failing.
Blocks: 1190791
Hi, Martin,
Sorry I don't know the ACDB errors, and I think that is not related with our test case failures.
Transfer NI to the graphic team people, maybe they would have some clues.

---
Hi, Jeremy,
Do you know who in your team might know about the ACDB error [1]?
Thanks!

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1035040#c7
Flags: needinfo?(alwu) → needinfo?(jeremychen)
If seeing "Marionette server lost " but device was still alive after tests finished, indicating device was too busy to keep the connection.  gathering cpu & memory info may help IMO.

For logcat, it looks like some services didn't initial correctly so there're ipc errors.  keeping retrying will be very painful if it was related to hardware, e.g., audio or RIL.

bug 1035040 showed more exception about acdb module but this bug has 
GeckoDump: Crash reporter : Can't fetch app.reportCrashes. Exception: [Exception... "Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [nsIPrefBranch.getBoolPref]"  nsresult: "0x8000ffff (NS_ERROR_UNEXPECTED)"  location: "JS frame :: chrome://b2g/content/shell.js :: shell_reportCrash :: line 150"  data: no]

not quite sure they are relative.
Flags: needinfo?(pyang)
(In reply to Alastor Wu [:alwu] from comment #8)
> Hi, Martin,
> Sorry I don't know the ACDB errors, and I think that is not related with our
> test case failures.
> Transfer NI to the graphic team people, maybe they would have some clues.
> 
> ---
> Hi, Jeremy,
> Do you know who in your team might know about the ACDB error [1]?
> Thanks!
> 
> [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1035040#c7

I've never hit this error before. After a quick round offline discussion, seems none of us know this error.
Flags: needinfo?(jeremychen)
Looks like this has been magically fixed! More and more things seem to get greenish at http://jenkins1.qa.scl3.mozilla.com/view/Bitbar/

It makes me wonder what caused and fixed it. Perhaps related to the mail that Andreas Tolfsen sent about "Protocol changes and deployment"? Related to bug 1153822 and bug 1168396.
Flags: needinfo?(ato)
Ok, it seems like the situation is now similar as what I see locally. Occasionally, this is now failing on my device as it is in the labs.
Unfortunately I saw them in our lab today, and all devices failed because of this.
Possibly a device issue such like power saving but still unknown.
The issue we have was about first boot took too much time.
Add marionette.wait_for_port() before start_session() simply fix this problem.
Paul, what do you mean? There is already wait_for_port() before start_session, no?
http://mxr.mozilla.org/gaia/source/tests/python/gaia-ui-tests/gaiatest/gaia_test.py#666
Flags: needinfo?(pyang)
The flash script we use didn't have this protection and pop same issue. It reminds me perhaps the reason is first time boot took longer time and exceed timeout.
Flags: needinfo?(pyang)
start_session has 60 seconds timeout and sock client connection timeout has been set by 360 seconds, so timeout may not be too short here.
Keep trying to reproduce this bug locally.  Meanwhile, probably I'll change to retry test runner if we caught a IOError.
David, do you perhaps have an idea on what's going on here?
Flags: needinfo?(dburns)
(In reply to Paul Yang [:pyang] from comment #18)
> I guess marionette itself didn't protect start_session while initializing.
> for example,
> http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/
> marionette/runner/base.py#582
> http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/
> marionette/runner/base.py#673

The error happens here:
http://mxr.mozilla.org/gaia/source/tests/python/gaia-ui-tests/gaiatest/gaia_test.py#667
That's not in base.py.
(In reply to Martijn Wargers [:mwargers] (QA) from comment #11)
> Looks like this has been magically fixed! More and more things seem to get
> greenish at http://jenkins1.qa.scl3.mozilla.com/view/Bitbar/
> 
> It makes me wonder what caused and fixed it. Perhaps related to the mail
> that Andreas Tolfsen sent about "Protocol changes and deployment"? Related
> to bug 1153822 and bug 1168396.

I haven’t pushed any server-side protocol changes to Gecko yet.  If you’re using the Marionette Python client from in-tree or you’ve updated to the latest release (0.19) there is of course a chance of bugs related to emulator/device Gaia tests because that isn’t tested on try, but in theory the changes I made as part of bug 1168396 should all be backwards-compatible.  File bugs if you find that they aren’t.
Flags: needinfo?(ato)
Nothing is standing out here. When it disconnects can you connect via the remote debugger into the device? are there any other network issues?
Flags: needinfo?(dburns)
It might be related to bug 1199863; I think we're running out of system space again.
A little more clarification: we don't do OTA on the flame; at the same time, my concern is that our reflashing gecko/gaia on top might be causing issues?  Or are we doing full flashing again?
(In reply to David Burns :automatedtester from comment #23)
> Nothing is standing out here. When it disconnects can you connect via the
> remote debugger into the device? are there any other network issues?

The thing is, it doesn't seem to be a problem anymore (or at least, not much). I don't know of cases anymore, lately, that shows this issue.
There was an automation report 3 days ago that showed one failure that might have been happening due to this problem.
I was able to repro this issue locally on thursday's build (Sep. 3rd) - I ran a test with a repeat value of 20, and around 13th run, it started to show this issue consistently until I rebooted the phone.  Will see whether this reproduces on today's.
If this happens locally at any given time, can we start collecting data on the device please?
If possible things like df, b2g-info, b2g-ps, dmesg and /system/xbin/procrank would help.

note: Looks like dumpstate might incorporate most of these.
I just got this perhaps and this is with a black screen.
adb devices would still show the device.
(test3)mwargers:B2G-flash-tool mwargers$ adb shell b2g-ps
APPLICATION    SEC USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
Looks like no b2g process was found.  b2g should automatically restart so it looks pretty weird.
check dmesg might help.
It suddenly seems to be happening a lot again on Bitbar, not in the Mountain View lab.
In the Mountain View lab, all the Brick jobs are green for more than a week.

On Bitbar, the various Brick jobs have a varying state of stability.

This one, more than half of the time red since 16 sept:
http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.b2g-inbound.tinderbox.ui.brick.bitbar/

This one even less stable:
http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.mozilla-inbound.tinderbox.ui.brick.bitbar/

This one only 2 jobs since 16 sept, and those are green:
http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.mozilla-central.nightly.ui.brick.bitbar/

All of the failure in Bitbar have this again:
03:18:16 09-17 03:18:13.849 E/ACDB-LOADER( 1680): Error: ACDB ANC returned = -1
03:18:16 The B2G process has restarted after crashing during  the tests so Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.
And the marionnette connection issue happens.

It makes me wonder why this happens only on Bitbar
Bugs also with "Error: ACDB ANC returned = -1" in it and broken Flame devices:
bug 1050282 and bug 1152283.

Could it be that the base image on the devices on Bitbar are getting out of sync with the gaia+gecko layer or something, sometimes?
When are the devices on Bitbar full-flashed?
It's reported that https://bugzilla.mozilla.org/show_bug.cgi?id=1197183 came back on the same day as well.  Perhaps they're linked?

Also, running a different (older) build did not cause this bug
Severity: normal → critical
And the failures on the jobs in Bitbar have mysteriously disappeared again now.
Not all failures have disappeard.
I guess QAnalyst might be able to use Bitbar again for the reports, but I still see a lot of failurs for the Brick test on Bitbar: http://jenkins1.qa.scl3.mozilla.com/view/Brick/

This one is red/failing almost all of the time:
http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.b2g-inbound.tinderbox.ui.brick.bitbar/
This one is also red/failing almost all of the time:
http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.mozilla-inbound.tinderbox.ui.brick.bitbar/
This one is half red, half green, approximately:
http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.mozilla-central.nightly.ui.brick.bitbar/
This one is completely green:
http://jenkins1.qa.scl3.mozilla.com/view/Brick/job/flame-kk-319.mozilla-central.tinderbox.ui.brick.bitbar/
blocking-b2g: --- → 2.5+
Priority: -- → P1
Summary: All Jenkins jobs are failing almost all of the times - Connection to Marionette server is lost → [Marionette] All Jenkins jobs are failing almost all of the times - "Connection to Marionette server is lost"
Martijn this isn't really something we can get a window on right?  If so can we remove the tag?
Flags: needinfo?(martijn.martijn)
You're right, I don't think a regression-window is useful here anymore.
Flags: needinfo?(martijn.martijn)
I heard this was also sometimes an issue in the Taipei lab.
We need to get this to be fixed in 2.5. David and Andreas, could you please show us some direction to identify issue? do you need log or information for the issue? Many thanks.
Flags: needinfo?(dburns)
Flags: needinfo?(ato)
WHen the issue happens are you able to connect to the device still via the remote debugging or via telnet?
Flags: needinfo?(dburns)
I don’t have anything to contribute here.
Flags: needinfo?(ato)
[Blocking Requested - why for this release]:
I'm not sure this should block the release.
blocking-b2g: 2.5+ → 2.5?
[Tracking Requested - why for this release]:Clearing the nom for 2.5. Needs to be looked at in Quality Program
blocking-b2g: 2.5? → ---
Blocks: mtbf-2015q4
No longer blocks: mtbf-2015q3
We haven't seen this issue for a while.  leave it for another two weeks and see it will appear again.
I just got a crash like this, for one test in a test run: https://pastebin.mozilla.org/8851208
Priority: P1 → P2
Pastebin is gone :( next time is better to attach a file to the bug, I guess.
Marking WONTFIX, sorry for the bug spam. If somebody still wants to work on this, please file a new bug for it.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.