Closed Bug 895966 Opened 6 years ago Closed 5 years ago

tbpl shows green for Android 4.0 rc2 job that failed with DMError

Categories

(Infrastructure & Operations :: CIDuty, task)

ARM
Android
task
Not set

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: gbrown, Assigned: gbrown)

References

(Blocks 1 open bug)

Details

https://tbpl.mozilla.org/?tree=Try&rev=580e1d322dbe has a failure at https://tbpl.mozilla.org/php/getParsedLog.php?id=25458214&tree=Try&full=1 (middle job in the last group of 3 consecutive greens) but it shows as passed.

15:53:51     INFO -  Traceback (most recent call last):
15:53:51     INFO -    File "/builds/panda-0783/test/build/tests/mochitest/runtestsremote.py", line 636, in main
15:53:51     INFO -      dm.removeDir("/mnt/sdcard/Robotium-Screenshots")
15:53:51     INFO -    File "/builds/panda-0783/test/build/tests/mochitest/devicemanagerSUT.py", line 422, in removeDir
15:53:51     INFO -      if self.dirExists(remoteDir):
15:53:51     INFO -    File "/builds/panda-0783/test/build/tests/mochitest/devicemanagerSUT.py", line 391, in dirExists
15:53:51     INFO -      ret = self._runCmds([{ 'cmd': 'isdir ' + remotePath }]).strip()
15:53:51     INFO -    File "/builds/panda-0783/test/build/tests/mochitest/devicemanagerSUT.py", line 152, in _runCmds
15:53:51     INFO -      self._sendCmds(cmdlist, outputfile, timeout, retryLimit=retryLimit)
15:53:51     INFO -    File "/builds/panda-0783/test/build/tests/mochitest/devicemanagerSUT.py", line 134, in _sendCmds
15:53:51     INFO -      raise err
15:53:51     INFO -  DMError: Automation Error: Timeout in command isdir /mnt/sdcard/Robotium-Screenshots
16:04:33     INFO -  Automation Error: Exception caught while running tests

I assume this is fall-out from bug 829211...but I am not sure.
https://tbpl.mozilla.org/php/getParsedLog.php?id=25439981&tree=Try - same tryrun, a run from somewhere in the middle of the panda restarts
Ok, there are a couple things wrong here.

First, the test harness seems to be exiting 1;
https://tbpl.mozilla.org/php/getParsedLog.php?id=25458214&tree=Try#error2

This block isn't setting tbpl_status to TBPL_WARNING (or even TBPL_FAILURE):
http://hg.mozilla.org/build/mozharness/file/e7e6e4dbcbe7/scripts/b2g_panda.py#l141

I think the check for code == 10 is wrong; we need to figure out the exit codes for mochitest/runtestsremote.py and adjust accordingly.  If the exit codes match up to http://hg.mozilla.org/build/mozharness/file/e7e6e4dbcbe7/mozharness/mozilla/buildbot.py#l40 , then we can set self.return_code directly.

Once we get self.return_code set properly, either by directly setting |self.return_code = ___| or by calling self.buildbot_status()
http://hg.mozilla.org/build/mozharness/file/e7e6e4dbcbe7/mozharness/mozilla/buildbot.py#l69
the test run should go orange or red as needed.
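
As a rough illustration of the fix described above, here is a minimal self-contained sketch. It assumes that exit code 0 means the harness passed and that any other code should be treated as a failure; the real exit codes of mochitest/runtestsremote.py still need to be confirmed. The constants and the status_for_exit_code helper are hypothetical stand-ins, not the actual mozharness definitions.

```python
# Hypothetical stand-ins for the status names used by mozharness;
# the real definitions live in mozharness/mozilla/buildbot.py.
TBPL_SUCCESS = "SUCCESS"
TBPL_FAILURE = "FAILURE"


def status_for_exit_code(code):
    """Map a harness exit code to a status string.

    Assumption: 0 is the only passing exit code. Anything else,
    including the exit 1 seen in this log, is reported as a failure
    so the job can never show green after a harness error.
    """
    if code == 0:
        return TBPL_SUCCESS
    return TBPL_FAILURE


if __name__ == "__main__":
    for code in (0, 1, 10):
        print(code, status_for_exit_code(code))
```

The point of the design is that the fallthrough case is a failure rather than a success, so an unexpected exit code cannot be silently ignored.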


Second, the Automation Error lines in
https://tbpl.mozilla.org/php/getParsedLog.php?id=25458214&tree=Try&full=1#error2
are INFO, not ERROR.  This isn't a terrible bug, but could be remedied by passing an error_list to this run_command:
http://hg.mozilla.org/build/mozharness/file/e7e6e4dbcbe7/scripts/b2g_panda.py#l141

error_lists look like this: an ordered list with substrings that match certain lines, and a level:
http://hg.mozilla.org/build/mozharness/file/e7e6e4dbcbe7/mozharness/base/errors.py#l37
or with re.compile()d regexes: http://hg.mozilla.org/build/mozharness/file/e7e6e4dbcbe7/mozharness/base/errors.py#l73

So the error_list in this case might look like

    [{'substr': r'''Automation Error: ''', 'level': ERROR}]
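
To show how such a list might be applied to log lines, here is a small standalone sketch. It does not import mozharness; the ERROR and INFO level names, the classify_line helper, and the DMError regex entry are assumptions made for illustration, and the real parsing is done by mozharness itself.

```python
import re

# Assumed level names, standing in for the mozharness log levels.
ERROR = "error"
INFO = "info"

# Ordered list: the first matching entry wins.
error_list = [
    {"substr": "Automation Error: ", "level": ERROR},
    # Hypothetical regex entry, shown only to illustrate the second form.
    {"regex": re.compile(r"DMError: .*Timeout"), "level": ERROR},
]


def classify_line(line, error_list, default=INFO):
    """Return the level of the first entry that matches the line."""
    for entry in error_list:
        if "substr" in entry and entry["substr"] in line:
            return entry["level"]
        if "regex" in entry and entry["regex"].search(line):
            return entry["level"]
    return default


if __name__ == "__main__":
    line = "16:04:33     INFO -  Automation Error: Exception caught while running tests"
    print(classify_line(line, error_list))
```

With a list like this passed to run_command, the "Automation Error" lines from the log above would be logged at ERROR rather than INFO.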
Hardware: x86 → ARM
Product: mozilla.org → Release Engineering
See Also: → 917578
Aki: did we do anything to address this on our side? 

Has there really been no recurrence since 2013-08-01, or is this one of those things that happens so often we don't report it any more?
Flags: needinfo?(aki)
(In reply to Chris Cooper [:coop] from comment #4)
> Has there really been no recurrence since 2013-08-01, or is this one of
> those things that happens so often we don't report it any more?

I think I have seen similar problems recently, but failed to report them (sorry). Also, I think this is a case where we are unlikely to notice a problem, since we often don't check logs for green jobs.
Blocks: 1048775
We no longer run robocop on 4.0. In general, it feels like we are doing better at "coloring" jobs appropriately...perhaps since the switch to treeherder? In any case, I don't see much remaining value in this bug.
Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations