Closed Bug 763527 Opened 12 years ago Closed 11 years ago

Investigate failure of mochitest chunks on B2G to start

Categories

(Testing :: Mochitest, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: jgriffin, Unassigned)


We now have mochitest-plain running on B2G CI (bug 759887). However, one of the 8 chunks sometimes fails to start, and the test job is then killed by a timeout.

For the chunks that failed to start, we're successfully copying the test profile to the emulator and restarting B2G. I suspect the failure occurs because the request from Marionette to navigate to the mochitest URL arrives too early, before Gecko is ready to act on it.
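One way to test the "request arrives too early" theory is to hold back the navigation request until the Marionette port accepts connections. The following is a minimal sketch, not the harness's actual code: `wait_for_port` is a hypothetical helper, and the host and port are whatever the emulator's Marionette port is forwarded to (2828 is the usual default, but treat that as an assumption for this setup). An open port only shows that something is listening; it does not prove Gecko is ready to navigate.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True as soon as a connection is accepted, or False if
    `timeout` seconds pass without a successful connection.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError (e.g. connection refused)
            # while nothing is listening on the port yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A caller could run `wait_for_port("localhost", 2828)` after restarting B2G and only start the Marionette session once it returns True.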
It looks like chunk 8 is often being interrupted by a 90-minute timeout I had set up on the mochitest-plain job. I've just increased this to 120 minutes to see if this resolves the problem.

The VM that is running both the builds and mochitests doesn't really have the capacity needed to do so.  We're going to have to expand our VM capacity to handle mochitests and reftests; I'll handle this in a separate bug.
I caught this problem occurring on the VM.  When I used adb to look at the file system of the running emulator, I saw:

- a marionette.log in /data/b2g/mozilla/{profile}
- no marionette.log in /data/local/tests/profile

From this I infer that we're either not successfully restarting B2G (so that it starts up with the test profile), or the automation is getting stuck before ever getting to that point.
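The manual adb check above could be automated so that every failed chunk records which profile B2G actually started with. This is a rough sketch under stated assumptions: `adb_shell`, `remote_file_exists`, and `b2g_used_test_profile` are hypothetical helpers, and `adb` is assumed to be on the PATH. Only stdout is inspected, because older adb versions do not pass the remote exit status through.

```python
import subprocess


def adb_shell(args):
    """Run `adb shell <args>` and return whatever it printed to stdout."""
    result = subprocess.run(
        ["adb", "shell"] + list(args), capture_output=True, text=True
    )
    return result.stdout


def remote_file_exists(path, shell=adb_shell):
    """Return True if `ls <path>` on the device lists the file.

    An empty result or a "No such file" message is treated as missing.
    """
    output = shell(["ls", path])
    return bool(output.strip()) and "No such file" not in output


def b2g_used_test_profile(shell=adb_shell):
    """True if marionette.log was written under the test profile directory."""
    return remote_file_exists("/data/local/tests/profile/marionette.log", shell)
```

If `b2g_used_test_profile()` returns False after the restart step, B2G is most likely still running with its default profile, which matches the observation above.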
One more data point:  on the server, I see these processes:

jenkins   8449  8183  0 21:12 ?        00:00:00 /data/jenkins/jobs/mochitest-plain/workspace/b2g-distro/out/host/linux-x86/bin/adb logcat
jenkins   8450  8183  0 21:12 ?        00:00:00 [adb] <defunct>

The <defunct> adb makes me think adb has crashed; I've seen this on my own machine occasionally.
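A `<defunct>` entry is a process that has exited but has not yet been reaped by its parent (PID 8183 in the listing above), so it is consistent with adb having died, though it does not by itself show why. To spot this automatically, the CI job could scan `ps` output for zombie entries. The sketch below assumes a Linux `ps` that supports `-eo pid,ppid,stat,comm`; `find_zombies` and `list_zombies` are hypothetical helper names.

```python
import subprocess


def find_zombies(ps_output):
    """Return (pid, ppid, command) for each row whose state starts with 'Z'.

    `ps_output` is the text printed by `ps -eo pid,ppid,stat,comm`,
    including its header row.
    """
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, stat, comm = fields
        if stat.startswith("Z"):
            zombies.append((int(pid), int(ppid), comm.strip()))
    return zombies


def list_zombies():
    """Run ps on the local machine and return its zombie processes."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_zombies(out)
```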
I've changed the way mochitests are run in the CI to try to debug what's going on. The last two failures produced this output:

INFO | runtests.py | Received unexpected exception while running application
Traceback (most recent call last):
  File "/data/jenkins/workspace/mochitest/objdir-gecko/_tests/testing/mochitest/runtests.py", line 677, in runTests
    timeout = timeout)
  File "/data/jenkins/workspace/mochitest/objdir-gecko/_tests/testing/mochitest/automation.py", line 900, in runApp
    stderr = subprocess.STDOUT)
  File "/data/jenkins/workspace/mochitest/objdir-gecko/_tests/testing/mochitest/b2gautomation.py", line 205, in Process
    session = self.marionette.start_session()
  File "/data/jenkins/workspace/mochitest/venv/src/marionette/marionette/marionette.py", line 218, in start_session
    self.session = self._send_message('newSession', 'value')
  File "/data/jenkins/workspace/mochitest/venv/src/marionette/marionette/marionette.py", line 140, in _send_message
    raise TimeoutException(message='socket.timeout', status=ErrorCodes.TIMEOUT, stacktrace=None)
TimeoutException: socket.timeout
WARNING | automationutils.processLeakLog() | refcount logging is off, so leaks can't be detected!

I think this may be another case of bug 753273.  In any case, I'm going to add a sleep to try and resolve this.
http://hg.mozilla.org/mozilla-central/rev/515c5d751c5e - increase some sleeps to see if it resolves the chunk timeout problems
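An alternative to a longer fixed sleep is to retry the failing call a few times with a pause between attempts. This is only an illustrative sketch and not what the changeset above does: `retry` is a hypothetical helper, and the attempt count and delay are arbitrary.

```python
import time


def retry(func, attempts=5, delay=5.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or `attempts` tries have failed.

    Sleeps `delay` seconds between tries and re-raises the last error
    if every attempt fails. Only errors listed in `exceptions` are retried.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as err:
            last_error = err
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

In the traceback above, the call to wrap would be `self.marionette.start_session`, retried only on `TimeoutException`, so that other failures still surface immediately.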
The above doesn't appear to have helped.  I'm going to have to manually run mochitests on the CI VM and try to catch it in the act.
Blocks: 778249
I'm going to dupe bug 778249 to bug 777714 rather than having two releng tracking bugs.
Blocks: b2g-testing-track
No longer blocks: 778249
Is this bug for pandas, or for B2G testing on VMs?
This is for old testing on Amazon AWS VMs and is no longer relevant.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME