Open
Bug 1464219
Opened 7 years ago
Updated 2 years ago
decision tasks should retry on intermittent hg / network issues
Categories
(Firefox Build System :: Task Configuration, task)
Tracking
(Not tracked)
NEW
People
(Reporter: mozilla, Unassigned)
References
Details
Attachments
(1 file)
3.06 KB, patch
Over the last couple of weeks, we've been seeing a lot of decision task retriggers due to failed network calls. These are generally:
- hg clone failures
- json automationrelevance download failures
- Taskcluster ISE 500s (internal server errors)
I believe each of these should be retried. Ideally we'd catch and retry in-task; failing that, we can catch the error and exit with a specific exit code that results in an `intermittent-task` exception status, which will auto-rerun the task.
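A minimal sketch of the two options above, not taken from run-task or the decision task code: retry a flaky step in-task with exponential backoff, and if the retries are exhausted, exit with a dedicated exit code. `TransientError`, `retry`, `run_step` and the value of `EXIT_INTERMITTENT` are all hypothetical names chosen for illustration; the worker would still need to be configured to map that exit code to `intermittent-task`.

```python
import sys
import time

# Hypothetical exit code; the worker would have to be configured to treat
# this value as an intermittent failure and rerun the task.
EXIT_INTERMITTENT = 72


class TransientError(Exception):
    """Stand-in for a failure we believe is safe to retry."""


def retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the delay each time."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** (attempt - 1))


def run_step(func, **kwargs):
    """Retry in-task; if that fails, exit with the intermittent exit code."""
    try:
        return retry(func, **kwargs)
    except TransientError as exc:
        print("giving up after retries: %s" % exc, file=sys.stderr)
        raise SystemExit(EXIT_INTERMITTENT)
```

Only errors raised as `TransientError` are retried here; anything else propagates immediately, so a genuinely broken task still fails fast.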
Currently, sheriffs retrigger the decision task, which means that all downstream tasks will fail Chain of Trust verification. This is generally OK, because this type of failure is most commonly seen on Try, where at worst the developer won't get any Windows xpcshell tests. However, we have also seen these retriggers on autoland, inbound, and even mozilla-beta, where at worst they could delay merges or even a chemspill.
Comment 1 • 7 years ago (Reporter)
Note: if the decision task fails, aiui we can't run a retrigger or rerun action task. I don't think non-admins can do anything to fix it other than `taskcluster task rerun` via tc-cli; admins may be able to retrigger a mozilla-taskcluster run against the revision. I may be wrong.
Comment 2 • 7 years ago (Reporter)
I think retrying on clone issues could be achieved via fixes to run-task. Greg, what do you think about adding custom exit statuses here [1]?
[1] https://searchfox.org/mozilla-central/source/taskcluster/scripts/run-task#446
Flags: needinfo?(gps)
Comment 3 • 7 years ago
I think we want `hg robustcheckout` to issue the custom exit code on network-related errors: we can't just assume that every VCS failure can be retried.
robustcheckout canonically lives in version-control-tools:hgext/robustcheckout.
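To illustrate the distinction between network-related and other failures, here is a rough sketch using only standard-library exception types. It is not how robustcheckout classifies errors; `exit_code_for`, `NETWORK_ERRORS` and both exit code values are hypothetical, and which errors should really count as retryable is exactly the open question.

```python
import socket
import urllib.error

EXIT_GENERIC = 1   # hypothetical: ordinary, non-retryable failure
EXIT_NETWORK = 72  # hypothetical: network-related, safe to retry

# Exception types assumed (for this sketch only) to indicate a network issue.
NETWORK_ERRORS = (ConnectionError, socket.timeout, urllib.error.URLError)


def exit_code_for(exc):
    """Map an exception to an exit code, retrying only network failures."""
    # HTTPError subclasses URLError, so check it first: a 5xx is treated as
    # a server-side hiccup, while a 4xx is treated as a real error.
    if isinstance(exc, urllib.error.HTTPError):
        return EXIT_NETWORK if exc.code >= 500 else EXIT_GENERIC
    if isinstance(exc, NETWORK_ERRORS):
        return EXIT_NETWORK
    return EXIT_GENERIC
```

With this mapping, a connection reset or a 500 would yield the retryable code, while a 404 or an unrelated error such as a `ValueError` would not.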
Flags: needinfo?(gps)
Comment 4 • 7 years ago (Reporter)
From https://stackoverflow.com/a/16787722. If we want to take this approach, we could replace some `error.Abort()` calls with `AbortWithExitCode()`, with `exit_code` set. We may also want to test catching an `AbortWithExitCode`.
I'm not sure which Abort calls we want to mark specifically. We could potentially mark them by category: client-side issues, like a sparse vs. non-sparse clone, and server- or network-side issues. I don't specifically see the missing-revision error here, or I would have tackled that first.
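A self-contained sketch of the `AbortWithExitCode` idea and of catching it, under stated assumptions: `Abort` below is a local stand-in for Mercurial's `error.Abort` so the example runs without Mercurial installed, and `main`, `EXIT_NETWORK` and the generic value 255 are illustrative only. It does not show how Mercurial's own dispatcher would handle the exception.

```python
class Abort(Exception):
    """Local stand-in for Mercurial's error.Abort (assumption for this sketch)."""


class AbortWithExitCode(Abort):
    """An abort that carries the exit code the process should end with."""

    def __init__(self, message, exit_code):
        super().__init__(message)
        self.exit_code = exit_code


EXIT_NETWORK = 72  # hypothetical code for server-/network-side failures


def main(checkout):
    """Run a checkout callable and translate aborts into exit codes."""
    try:
        checkout()
    except AbortWithExitCode as exc:
        # Must be caught before the plain Abort handler, since it subclasses it.
        return exc.exit_code
    except Abort:
        return 255  # generic failure value assumed for this sketch
    return 0
```

Because `AbortWithExitCode` subclasses `Abort`, existing `except Abort` handlers keep working for call sites that are not yet marked, which allows marking them one category at a time.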
Updated • 6 years ago
Version: Version 3 → 3 Branch
Updated • 2 years ago
Severity: normal → S3