Open Bug 1464219 Opened 7 years ago Updated 2 years ago

decision tasks should retry on intermittent hg / network issues

Categories

(Firefox Build System :: Task Configuration, task)

3 Branch
task

Tracking

(Not tracked)

People

(Reporter: mozilla, Unassigned)

References

Details

Attachments

(1 file)

As of the last couple weeks, we're seeing a lot of decision task retriggers due to failed network calls. These are generally:
- hg clone failures
- json automationrelevance download failures
- taskcluster ISE 500s

I believe each of these should involve retries. Ideally we'd either catch and retry in-task; otherwise we can catch and exit with a specific exit code that results in an `intermittent-task` exception status, which will auto-rerun the task.

Currently, sheriffs retrigger the decision task. This means that any and all tasks downstream will fail Chain of Trust verification. This is generally ok, because this type of failure is most commonly seen on Try, and at worst that means the developer won't get any windows xpcshell tests. However, we have seen these retriggers on autoland, inbound, and even mozilla-beta; at worst here, this would delay merges or even a chemspill.
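For illustration, the two options above (retry in-task, otherwise exit with a dedicated code) could be sketched roughly as follows. This is only a sketch under assumptions: `TransientError`, `retry`, `run_or_exit_intermittent`, and the value of `EXIT_INTERMITTENT` are hypothetical names and not taken from run-task, and the exit code would have to match whatever the worker is actually configured to map to `intermittent-task`.

```python
import sys
import time

# Hypothetical exit code: the real value must match whatever the worker is
# configured to report as an `intermittent-task` exception.
EXIT_INTERMITTENT = 72


class TransientError(Exception):
    """Hypothetical stand-in for a network-related failure
    (hg clone, automationrelevance download, Taskcluster 500)."""


def retry(func, attempts=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call func(), retrying on TransientError with exponential backoff.

    Any other exception propagates immediately, so non-network failures
    are never retried. The last TransientError is re-raised once the
    attempts are exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff


def run_or_exit_intermittent(func, **kwargs):
    """Retry in-task first; if that still fails, exit with the dedicated
    code so the task can be auto-rerun instead of manually retriggered."""
    try:
        return retry(func, **kwargs)
    except TransientError as exc:
        print("giving up after retries: %s" % exc, file=sys.stderr)
        sys.exit(EXIT_INTERMITTENT)
```

Only failures classified as transient are retried here; deciding which real errors belong in that class is the open question in this bug.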
Note: if the decision task fails, aiui we can't run a retrigger or rerun action task. I don't think non-admins have the ability to do anything but `taskcluster task rerun` via tc-cli to fix; admins may be able to retrigger a mozilla-taskcluster run against the revision. I may be wrong.
I think retrying on clone issues could be achieved via fixes to run-task. Greg, what do you think about adding custom exit statuses here [1]?

[1] https://searchfox.org/mozilla-central/source/taskcluster/scripts/run-task#446
Flags: needinfo?(gps)
I think we want `hg robustcheckout` to issue the custom exit code on network related errors: we can't just assume that every VCS failure can be retried. robustcheckout canonically lives in version-control-tools:hgext/robustcheckout.
Flags: needinfo?(gps)
From https://stackoverflow.com/a/16787722 . If we want to take this approach, we could replace some `error.Abort()` calls with `AbortWithExitCode()` with `exit_code` set. We may want to test catching an `AbortWithExitCode`. I'm not sure which Abort calls we want to mark specifically. We could potentially mark them in categories: client-side issues, like a sparse- vs non-sparse clone, and server-/network-side. I don't specifically see the missing-revision error here, or I would have tackled that first.
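A rough sketch of what an `AbortWithExitCode` could look like, assuming it simply subclasses the abort exception and carries an `exit_code` attribute. To keep the example runnable without Mercurial, `Abort` below is a local stand-in for `error.Abort`; `EXIT_NETWORK`, `exit_code_for`, and `main` are hypothetical, and how Mercurial's own dispatch code would treat such an attribute is not verified here.

```python
import sys


class Abort(Exception):
    """Local stand-in for Mercurial's error.Abort, so this runs without hg."""


class AbortWithExitCode(Abort):
    """Abort that also records which process exit code should be used."""

    def __init__(self, message, exit_code=1):
        super().__init__(message)
        self.exit_code = exit_code


# Hypothetical code reserved for server-/network-side failures.
EXIT_NETWORK = 72


def exit_code_for(exc):
    """Plain Abort keeps the generic failure code 1; only errors that were
    explicitly marked carry a custom code."""
    return getattr(exc, "exit_code", 1)


def main(checkout):
    """Run a checkout callable and translate aborts into an exit status."""
    try:
        checkout()
    except Abort as exc:
        print("abort: %s" % exc, file=sys.stderr)
        return exit_code_for(exc)
    return 0
```

With this shape, client-side issues can keep raising plain `Abort`, and only the calls judged to be network-related need to be switched over, which also gives a simple way to test catching an `AbortWithExitCode`.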
See Also: → 1473734
Version: Version 3 → 3 Branch
Severity: normal → S3