Closed Bug 1146197 Opened 9 years ago Closed 7 years ago

docker-worker / queue: Consider failure to pull image for a task due to networking reasons a situation to issue retries

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: jlal, Unassigned)

References

Details

(Whiteboard: [docker-worker])

James Lal [:lightsofapollo]

Reporter

Description

•

9 years ago

Today quay.io did some scheduled maintenance. This is not a big deal, but it did cause tasks to fail without any retries. We should figure out how to retry this and message that clearly to sheriffs.

There is other work to be done in Q2 to get off quay.io but that is independent of hardening our retries.

James Lal [:lightsofapollo]

Reporter

Comment 1

•

9 years ago

Do we have an easy way to do this now?

Flags: needinfo?(jopsen)

James Lal [:lightsofapollo]

Reporter

Comment 2

•

9 years ago

http://docs.taskcluster.net/tools/task-inspector/#b9J4CuGmQo2LbYhRx0yMPw/0

James Lal [:lightsofapollo]

Reporter

Updated

•

9 years ago

Blocks: 1146188

Jonas Finnemann Jensen (:jonasfj)

Comment 3

•

9 years ago

If docker-worker is able to detect that this failed because of a network error, docker-worker should
retry the request with exponential back-off. I.e., docker-worker should pull with exponential back-off
if docker doesn't already do this (I'm not sure what docker's behaviour is).
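
A minimal sketch of the pull-with-back-off idea above, in Python for illustration only (this is not docker-worker's actual code). The names `pull_with_backoff` and `TransientPullError`, the attempt limit, the delays, and the jitter factor are all assumptions, as is the premise that the caller can tell a network failure from other failures.

```python
import random
import time


class TransientPullError(Exception):
    """Hypothetical marker for a pull failure caused by the network."""


def pull_with_backoff(pull, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call pull() until it succeeds or max_attempts is reached.

    The delay doubles after each failed attempt, is capped at max_delay,
    and is scaled by a random factor in [0.5, 1.0) to add jitter.
    """
    for attempt in range(max_attempts):
        try:
            return pull()
        except TransientPullError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; the caller decides how to report
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * (0.5 + rng() / 2))
```

Only errors marked as transient are retried; anything else propagates immediately, which keeps the retry from being blind. The `sleep` and `rng` parameters exist so the behaviour can be exercised without real waiting.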

If all retries are exhausted we have one of the following cases:
i) docker-worker is running on a host with network problems, or
ii) the resource (quay.io in this case) that we're trying to access is down.

How docker-worker distinguishes between (i) and (ii) is beyond the scope of this comment.

In case (i):
docker-worker should assume that the current instance is broken; hence, docker-worker should
reportException with 'worker-shutdown' and do one of the following things:
A) shut down (bad because it can create a spawn/shutdown cycle),
B) wait until the end of the billing cycle and shut down without taking on new tasks, or
C) notify the operator (garndt), then wait around for the operator to debug and shut down the node.

In case (ii):
docker-worker should reportFailed and continue with what it's doing.

Note that if docker-worker can validate that the referenced remote resource doesn't exist (in this
case, that the image doesn't exist because of a 404 or something like that),
docker-worker should reportException with malformed-payload.
See http://docs.taskcluster.net/queue/api-docs/#reportException

I'm not keen on offering a retry for this kind of error at queue-level.
If (ii) is the case and retries with exponential backoffs have failed, then clearly any retries of
the task will have the same result. Hence, reportFailed is appropriate.
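
The case analysis above can be summarised as a small lookup. This is a sketch only: the `PullOutcome` enum and `resolution_for` function are hypothetical names, and how a worker would actually tell case (i) from case (ii) is left out, as it is in the comment. The call names and reasons (`reportException`, `reportFailed`, `worker-shutdown`, `malformed-payload`) are the ones mentioned in this comment.

```python
from enum import Enum


class PullOutcome(Enum):
    """Hypothetical classification of a pull that exhausted its retries."""
    WORKER_NETWORK_BROKEN = "worker-network-broken"  # case (i)
    REMOTE_UNAVAILABLE = "remote-unavailable"        # case (ii)
    IMAGE_NOT_FOUND = "image-not-found"              # e.g. a 404


def resolution_for(outcome):
    """Return (queue call, reason) for a pull outcome.

    The reason is None for reportFailed, which carries no reason
    in this sketch.
    """
    if outcome is PullOutcome.WORKER_NETWORK_BROKEN:
        return ("reportException", "worker-shutdown")
    if outcome is PullOutcome.IMAGE_NOT_FOUND:
        return ("reportException", "malformed-payload")
    if outcome is PullOutcome.REMOTE_UNAVAILABLE:
        return ("reportFailed", None)
    raise ValueError(f"unknown outcome: {outcome!r}")
```

For example, `resolution_for(PullOutcome.REMOTE_UNAVAILABLE)` yields `("reportFailed", None)`, matching case (ii): the task fails and the worker carries on.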

---
If remote resources being temporarily unavailable for extended periods of time is a common case, and
we want to schedule a retry 10 minutes later we can implement a feature for this using reportException
with a special reason.
But I don't think that is a common case: remote resources should be reliable, or we should cache them
on S3 to avoid problems like this. Designing for tasks to be retried 10-20 minutes later is a bad idea.

@lightsofapollo,
I'm all for retrying things. We have to retry things at multiple levels, and for this reason it's
important that we never retry blindly. Doing so builds up cascading retries, and the number of attempts
grows exponentially with the number of retry levels we build.
Note, the auto-reruns implemented at scheduler-level are a blind retry; we blow up our load if we add
blind retries at multiple levels.
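
To make the cascading-retry arithmetic concrete, here is a small sketch. The function name `worst_case_attempts` and the retry counts in the example are illustrative assumptions, not figures from any real Taskcluster configuration.

```python
from math import prod


def worst_case_attempts(retries_per_level):
    """Worst-case number of attempts at the innermost operation.

    Each level makes 1 initial try plus its retries, and every try at an
    outer level re-runs all the inner levels, so the counts multiply.
    """
    return prod(r + 1 for r in retries_per_level)
```

With three levels that each blindly retry 4 times, `worst_case_attempts([4, 4, 4])` gives 5 * 5 * 5 = 125 attempts against the same failing resource, compared with 5 for a single level.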

Flags: needinfo?(jopsen)

Jonas Finnemann Jensen (:jonasfj)

Comment 4

•

9 years ago

tl;dr: "failure to pull image for a task due to networking reasons" is only
       "a situation to issue retries" if the worker instance is broken/compromised.
       In which case the worker instance should shut down.

Pete Moore [:pmoore][:pete]

Updated

•

9 years ago

Component: TaskCluster → Docker-Worker

Product: Testing → Taskcluster

Selena Deckelmann :selenamarie :selena

Updated

•

8 years ago

Whiteboard: [docker-worker]

Selena Deckelmann :selenamarie :selena

Updated

•

8 years ago

Component: Docker-Worker → Worker

Greg Arndt [:garndt]

Comment 5

•

7 years ago

We do have retries within the workers now, and a lot of this problem went away when we started loading images from S3 artifacts rather than an external registry. We also load some of the common sidecar images into the AMI at creation time, rather than at task run time.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Assignee

Updated

•

5 years ago

Component: Worker → Workers

Categories

(Taskcluster :: Workers, defect)

Security

(public)
