Open Bug 1573269 Opened 5 months ago Updated 2 months ago

Make bitbar hosted tests retry for intermittent device issues.

Categories

(Testing :: Autophone, task)

task
Not set

Tracking

(Not tracked)

REOPENED

People

(Reporter: bc, Assigned: aerickson)

Details

Attachments

(1 file)

This try run shows a number of retries on emulators. Most of these are the result of ADBTimeoutErrors. Retrying alleviates the need for sheriffs to triage oranges and obtains test results for pushes that we currently fail to get at bitbar, since we treat ADBTimeoutError as a fatal orange there.

We should at least attempt to retry when ADBTimeoutError occurs on android hardware... less work for sheriffs and more results for us.

def fatal attempts to model itself after the corresponding fatal method in tree but I have the retry code commented out at the moment.

Let's

  1. uncomment #TBPL_RETRY_EXIT_STATUS = 4
  2. delete TBPL_RETRY_EXIT_STATUS = 1
  3. change the sys.exit to just return TBPL_RETRY_EXIT_STATUS
  4. change all of the calls to fatal(...) to return fatal(...)
  5. change the return to just
    return rc
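
The steps above can be sketched roughly as follows. This is only an illustration, not the actual bitbar-docker script: the ADBTimeoutError class is a local stand-in for the real exception, and main(), run_tests and the log message format are hypothetical names chosen for the example.

```python
# Sketch only: ADBTimeoutError is a local stand-in, and main()/run_tests
# are hypothetical names, not taken from the real script.

TBPL_RETRY_EXIT_STATUS = 4  # steps 1 and 2: use 4 rather than 1


class ADBTimeoutError(Exception):
    """Stand-in for the exception raised on adb timeouts."""


def fatal(message):
    # Step 3: report the problem and return the retry status
    # instead of calling sys.exit() here.
    print("FATAL: {}".format(message))
    return TBPL_RETRY_EXIT_STATUS


def main(run_tests):
    try:
        rc = run_tests()
    except ADBTimeoutError as e:
        # Step 4: callers return fatal(...) so the status propagates.
        return fatal("ADBTimeoutError: {}".format(e))
    # Step 5: hand the test result code back unchanged.
    return rc
```

The caller would then pass the returned value to sys.exit(), so a device timeout ends the process with status 4 while any other result code is passed through as-is.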

We've made the change to the bitbar-docker image (currently just in testing), but jobs aren't being rerun when the exit code is 4.

BC thinks it might be some interaction with https://searchfox.org/mozilla-central/source/testing/mozharness/scripts/android_hardware_unittest.py.

I was wrong. We haven't made an image with this change default yet.

Current default is based on https://github.com/bclary/mozilla-bitbar-docker/commit/053ac62ecacfac239c0d7280310ec3413efac823 per https://github.com/bclary/mozilla-bitbar-devicepool/pull/42.

The fix was landed in https://github.com/bclary/mozilla-bitbar-docker/pull/14.

We're still working to get an image deployed with the change.

An image with the change mentioned above is live. Some jobs are doing retries, but others aren't.

Debugging with bc.

:tomprince,

We're trying to get TBPL_RETRY working for android tasks that have adb connection issues. We exit 4 on that particular failure, but the job doesn't get retried.

It seems like we need to set onExitStatus in the payload. That's set by retry-exit-status in the kind.yml for the test?

Where would be the best spot to make the change to get the job below to retry?

example job exiting 4, but not retrying:

https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&selectedJob=268423110&tier=1%2C2%2C3&revision=30fb23d826a8d95e75860d7903934650c8d1326a

BC also notes:

The android emulator tests successfully set onExitStatus but it isn't
clear where it actually sets the retry-exit-status on the test which
causes onExitStatus to be set in the payload.

Flags: needinfo?(mozilla)

Looking at where retry-exit-status is used/defined, it is only supported (in taskgraph) by docker-worker. It gets set here and passed to the task transform here. Looking at generic-worker's schemas, it appears that generic-worker also supports the same option, but that it isn't exposed to taskgraph. Doing so would require copying the docker-worker logic to generic-worker, and adding code to the mozharness_test transform to pass it through there as well.
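
To make the pass-through concrete, here is a minimal sketch of what such a step might look like. It is not the taskgraph code: build_worker_payload and the shape of the worker dict are hypothetical, and it assumes the worker payload accepts onExitStatus with a "retry" list of exit codes.

```python
# Sketch only: build_worker_payload and the worker dict layout are
# hypothetical; the assumption is that the payload takes
# onExitStatus: {"retry": [codes]}.

def build_worker_payload(worker):
    """Build a task payload, copying retry-exit-status when present."""
    payload = {"command": worker["command"]}

    retry_codes = worker.get("retry-exit-status")
    if retry_codes:
        # Exit codes listed here ask the worker to retry the task
        # rather than report it as failed.
        payload["onExitStatus"] = {"retry": list(retry_codes)}

    return payload
```

With a worker description containing "retry-exit-status": [4], the resulting payload carries onExitStatus with retry set to [4]; without that key, onExitStatus is omitted, which would match the behaviour seen in the job that exits 4 but is not retried.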

Flags: needinfo?(mozilla)

Pushed by ccoroiu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5cef7916b094
make g-w tasks support retry-exit-status r=tomprince

Keywords: checkin-needed

There was some fallout from this due to our supersede detection in mozilla-bitbar-docker causing runaway bad workers that failed lots of jobs. https://github.com/bclary/mozilla-bitbar-docker/pull/28 was landed to handle the issues.

Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED

These aren't all being retried yet; :gbrown is working on finding the last few.

https://phabricator.services.mozilla.com/D51432

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Work continuing in Bug 1500266.
