Today quay.io did some scheduled maintenance. This is not a big deal, but it did cause tasks to fail without any retries. We should figure out how to retry this and message that clearly to sheriffs. There is other work to be done in Q2 to get off quay.io, but that is independent of hardening our retries.
Do we have an easy way to do this now?
If docker-worker is able to detect that the pull failed because of a network error, docker-worker should retry the request with exponential backoff. That is, docker-worker should pull with exponential backoff if docker doesn't already do this (I'm not sure what docker's behaviour is).

If all retries are exhausted, we have one of the following cases:
 i) docker-worker is running on a host with network problems, or
 ii) the resource that we're trying to access (quay.io in this case) is down.
How docker-worker distinguishes between (i) and (ii) is beyond the scope of this comment.

In case (i): docker-worker should assume that the current instance is broken. Hence, docker-worker should reportException with 'worker-shutdown' and do one of the following things:
 A) shut down (bad because it can create a spawn/shutdown cycle),
 B) wait until the end of the billing cycle and shut down without taking on new tasks, or
 C) notify the operator (garndt), then wait around for the operator to debug and shut down the node.

In case (ii): docker-worker should reportFailed and continue with what it's doing.

Note that if docker-worker can validate that the referenced remote resource doesn't exist (in this case, that the image doesn't exist because of a 404 or something like that), docker-worker should reportException with malformed-payload.
See http://docs.taskcluster.net/queue/api-docs/#reportException

I'm not keen on offering a retry for this kind of error at queue level. If (ii) is the case and retries with exponential backoff have failed, then clearly any retry of the task will have the same result. Hence, reportFailed is appropriate.

---
If remote resources being temporarily unavailable for extended periods of time is a common case, and we want to schedule a retry 10 minutes later, we can implement a feature for this using reportException with a special reason.
But I don't think that is a common case: remote resources should be reliable, or we should cache them on S3 to avoid problems like this. Designing for tasks to be retried 10-20 minutes later is a bad idea.

@lightsofapollo, I'm all for retrying things, but we have to retry things at multiple levels, and for this reason it's important that we never retry blindly. Doing so builds up cascading retries, and the number of attempts grows exponentially with every level of retries we build. Note that the auto-reruns implemented at scheduler level are a blind retry; we blow up our load if we add blind retries at multiple levels.
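To put a number on the cascading-retry concern: if every level retries blindly, the worst-case number of attempts against the remote resource is the product of the attempts allowed at each level. A minimal sketch follows; the function name and the per-level counts are made up for illustration and are not actual docker-worker, queue, or scheduler settings.

```python
import math
from typing import Sequence


def worst_case_attempts(attempts_per_level: Sequence[int]) -> int:
    """Worst-case pulls against the remote resource when every level retries blindly.

    Each entry is the number of attempts one level makes before giving up.
    The values used below are hypothetical, not real Taskcluster settings.
    """
    return math.prod(attempts_per_level)


# Hypothetical: 5 pull attempts in the worker, 3 task runs, 2 scheduler reruns.
print(worst_case_attempts([5, 3, 2]))  # 30
```

Adding one more blind level with n attempts multiplies the total by n, which is why the retry decision should be made at exactly one level where possible.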
tl;dr: "failure to pull image for a task due to networking reasons" is only "a situation to issue retries" if the worker instance is broken/compromised, in which case the worker instance should shut down.
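A rough sketch of the policy described above, in Python rather than docker-worker's own code. Everything here is an assumption made for illustration: the exception classes, `pull_with_backoff`, `resolve_pull`, the `host_network_ok` check (how to tell a broken host from a down registry is left open above), and the returned strings, which only name the queue call and reason the worker would use.

```python
import time
from typing import Callable


class NetworkError(Exception):
    """Hypothetical: the pull failed for a transient network reason."""


class ImageNotFound(Exception):
    """Hypothetical: the registry confirmed the image does not exist (e.g. a 404)."""


def pull_with_backoff(pull: Callable[[], str], max_attempts: int = 5,
                      base_delay: float = 1.0, max_delay: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Call pull(), retrying only network errors, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return pull()
        except NetworkError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; let the caller decide what to report
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("max_attempts must be at least 1")


def resolve_pull(pull: Callable[[], str], host_network_ok: Callable[[], bool],
                 **backoff_options) -> str:
    """Map the outcome of a pull to the resolution suggested in the comment above."""
    try:
        pull_with_backoff(pull, **backoff_options)
    except ImageNotFound:
        # The image verifiably does not exist: the task payload is wrong.
        return "reportException:malformed-payload"
    except NetworkError:
        if host_network_ok():
            # Case (ii): the remote resource is down; rerunning the task won't help.
            return "reportFailed"
        # Case (i): this instance is broken; stop taking new tasks.
        return "reportException:worker-shutdown"
    return "ok"
```

Only `NetworkError` is retried, so a missing image fails immediately instead of burning through the backoff schedule. The `sleep` parameter exists so the delays can be replaced in tests.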
Component: TaskCluster → Docker-Worker
Product: Testing → Taskcluster
Component: Docker-Worker → Worker
We do have retries within the workers now, and a lot of this problem went away when we started loading images from S3 artifacts rather than an external registry. We also load some of the common sidecar images into the AMI at creation time, rather than at task run time.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED