Open Bug 1464219 Opened 7 years ago Updated 2 years ago

decision tasks should retry on intermittent hg / network issues

Tracking

(Not tracked)

Status:

NEW

People

(Reporter: mozilla, Unassigned)

References

Details

Attachments

(1 file)

wip - AbortWithExitCode (patch, 3.06 KB), attached 7 years ago by Aki Sasaki (not active)

Aki Sasaki (not active)

Reporter

Description

7 years ago

Over the last couple of weeks, we've been seeing a lot of decision task retriggers due to failed network calls. These are generally:

- hg clone failures
- json automationrelevance download failures
- Taskcluster internal server errors (HTTP 500)

I believe each of these should involve retries. Ideally we'd catch and retry in-task; otherwise we can catch and exit with a specific exit code that results in an `intermittent-task` exception status, which will auto-rerun the task.

Currently, sheriffs retrigger the decision task. This means that any and all downstream tasks will fail Chain of Trust verification. This is generally ok, because this type of failure is most commonly seen on Try, and at worst that means the developer won't get any Windows xpcshell tests. However, we have seen these retriggers on autoland, inbound, and even mozilla-beta; at worst there, this would delay merges or even a chemspill.
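To illustrate the two options described above (retry in-task, then fall back to a dedicated exit status), here is a minimal Python sketch. It is not taken from run-task: `TransientError`, `retry`, `run`, and the value of `RETRY_EXIT_CODE` are all hypothetical names chosen for illustration, and the worker would still need to be configured to treat that exit status as retryable.

```python
import random
import sys
import time

# Hypothetical exit status. The worker must be configured separately to map
# this status to an automatic rerun of the task.
RETRY_EXIT_CODE = 4


class TransientError(Exception):
    """Stand-in for a failure believed to be network-related."""


def retry(func, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func, retrying on TransientError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts; let the caller decide what to do
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter keeps many tasks from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


def run(func, sleep=time.sleep):
    """Return a process exit status for func, retrying in-task first."""
    try:
        retry(func, sleep=sleep)
    except TransientError as exc:
        print(f"giving up after retries: {exc}", file=sys.stderr)
        return RETRY_EXIT_CODE
    return 0
```

A caller would end with something like `sys.exit(run(clone_repository))`, where `clone_repository` is whatever function performs the network call. Only errors already classified as transient are retried; anything else propagates as an ordinary failure.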

Aki Sasaki (not active)

Reporter

Comment 1

7 years ago

Note: if the decision task fails, as I understand it we can't run a retrigger or rerun action task. I don't think non-admins can do anything to fix this other than `taskcluster task rerun` via tc-cli; admins may be able to retrigger a mozilla-taskcluster run against the revision. I may be wrong.

Aki Sasaki (not active)

Reporter

Comment 2

7 years ago

I think retrying on clone issues could be achieved via fixes to run-task. Greg, what do you think about adding custom exit statuses here [1]?

[1] https://searchfox.org/mozilla-central/source/taskcluster/scripts/run-task#446

Flags: needinfo?(gps)

Gregory Szorc [:gps]

Comment 3

7 years ago

I think we want `hg robustcheckout` to issue the custom exit code on network related errors: we can't just assume that every VCS failure can be retried. robustcheckout canonically lives in version-control-tools:hgext/robustcheckout.
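The distinction drawn here, that only network-related failures should map to the retryable exit code, could be sketched as follows. This is not robustcheckout code: it uses only Python standard-library exception types as examples of "network related", and `exit_code_for`, `EXIT_RETRY`, and `EXIT_FAILURE` are hypothetical names and values. Treating HTTP 5xx as retryable and 4xx as permanent is likewise an assumption.

```python
import socket
import ssl
import urllib.error

EXIT_FAILURE = 1  # ordinary, non-retryable failure
EXIT_RETRY = 4    # hypothetical status the worker would treat as intermittent

# Standard-library exceptions that usually indicate a network problem.
NETWORK_ERRORS = (
    urllib.error.URLError,
    socket.timeout,
    ssl.SSLError,
    ConnectionError,
)


def exit_code_for(exc):
    """Map an exception to an exit status; only network errors are retryable."""
    # HTTPError subclasses URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        # Assumption: server-side errors (5xx) are worth retrying, while
        # client-side errors (4xx) would fail again in the same way.
        return EXIT_RETRY if exc.code >= 500 else EXIT_FAILURE
    if isinstance(exc, NETWORK_ERRORS):
        return EXIT_RETRY
    return EXIT_FAILURE
```

Everything not explicitly recognised falls through to `EXIT_FAILURE`, which matches the point that a VCS failure should not be assumed retryable by default.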

Flags: needinfo?(gps)

Aki Sasaki (not active)

Reporter

Comment 4

7 years ago

Attached patch: wip - AbortWithExitCode

Based on https://stackoverflow.com/a/16787722. If we want to take this approach, we could replace some `error.Abort()` calls with `AbortWithExitCode()` calls that set `exit_code`. We may want to test catching an `AbortWithExitCode`.

I'm not sure which Abort calls we want to mark specifically. We could potentially mark them by category: client-side issues, like a sparse vs. non-sparse clone mismatch, and server- or network-side issues. I don't specifically see the missing-revision error here, or I would have tackled that first.
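The attached patch is not reproduced in this bug, so the following is only a guess at the general shape of the idea: an `Abort` subclass that carries an `exit_code`, plus a top-level handler that returns it. The `Abort` class below is a local stand-in for Mercurial's `error.Abort` so the sketch runs without Mercurial installed; `run`, `EXIT_RETRY`, and the default status of 255 for a plain abort are assumptions.

```python
import sys


class Abort(Exception):
    """Local stand-in for Mercurial's error.Abort (not the real class)."""


class AbortWithExitCode(Abort):
    """An abort that also says which exit status the process should use."""

    def __init__(self, message, exit_code=1):
        super().__init__(message)
        self.exit_code = exit_code


EXIT_RETRY = 4  # hypothetical status for server- or network-side problems


def run(command):
    """Run command and translate aborts into a process exit status."""
    try:
        command()
    except AbortWithExitCode as exc:
        # Must come before the generic handler because it is a subclass.
        print(f"abort: {exc}", file=sys.stderr)
        return exc.exit_code
    except Abort as exc:
        print(f"abort: {exc}", file=sys.stderr)
        return 255  # assumed default status for an unmarked abort
    return 0
```

Because `AbortWithExitCode` subclasses `Abort`, existing `except Abort` handlers would keep catching it, so call sites could be converted one at a time.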

Geoff Brown [:gbrown]

Updated

7 years ago

Updated

6 years ago

Version: Version 3 → 3 Branch

BMO Automation

Updated

2 years ago

Severity: normal → S3

Categories

(Firefox Build System :: Task Configuration, task)

Security

(public)
