docker.io is having service disruptions

RESOLVED FIXED in mozilla46

Status

Product: Taskcluster
Component: Integration
Priority: --
Severity: blocker
Status: RESOLVED FIXED
Reported: 2 years ago
Last modified: 2 years ago

People

Reporter: pmoore
Assignee: Unassigned

Tracking

Version: unspecified
Target Milestone: mozilla46


(Reporter)

Description

2 years ago
This is preventing decision tasks from running, see e.g.:
https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Zu0Ygrq_T2GisdZs8xVQHA/3/public/logs/live_backing.log

https://treeherder.mozilla.org/#/jobs?repo=fx-team&filter-searchStr=gecko-decision%20opt

Locally, I get:

$ docker pull taskcluster/livelog:v3
Pulling repository taskcluster/livelog
FATA[0065] Get https://registry-1.docker.io/v1/repositories/taskcluster/livelog/tags: read tcp 54.152.161.54:443: i/o timeout 



This seems to be the root cause: 

From https://status.docker.com/:

Incident: Investigating issue with high load
Incident Status: Degraded Performance
Components: Docker Hub Web, Docker Registry Hub API, Docker Hub Oauth and Accounts API, Docker Registry API, Docker Registry Hub WEB, Docker Hub Automated Builds, Docker Docs
Locations: IAD3

August 10, 2015 8:27AM UTC
[Investigating] We are currently experience very high load on our registry servers and we are looking to see what is causing the problems.
(Reporter)

Comment 1

2 years ago
I'm not sure if we can serve the affected docker images from a different registry (such as quay.io). This might be the solution, until the service disruption is resolved on docker.io.
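A minimal sketch of what serving from a different registry could look like, assuming the image is still cached locally on some host and that a (hypothetical) taskcluster organization exists on quay.io:

# On a host that still has the image in its local cache:
$ docker tag taskcluster/livelog:v3 quay.io/taskcluster/livelog:v3
$ docker push quay.io/taskcluster/livelog:v3

# Workers would then pull the mirrored image instead of the docker.io one:
$ docker pull quay.io/taskcluster/livelog:v3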
I closed the trees; please ping me on IRC when this clears up.
FWIW, looks like AWS is having issues, too: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&fromchange=0e269a1f1beb&filter-searchStr=night
(Reporter)

Comment 4

2 years ago
This is confirmed by the latest post on http://status.docker.com/:

"The issues that we have been investigating has been identified as being related to an AWS S3 issue that is currently affecting AWS-East, we have contacted Amazon, and are looking at ways to mitigate the issue."
(Reporter)

Comment 5

2 years ago
I can pull from docker.io again locally now; hopefully things are clearing up...
(Reporter)

Comment 6

2 years ago
Gecko decision tasks are working again:
https://tools.taskcluster.net/task-inspector/#0x0gWz8sRciT07M8HxaT7Q/0

Comment 7

2 years ago
Is there a bug somewhere for the fact that taskcluster is violating policy by depending on external services?
(Reporter)

Comment 8

2 years ago
(In reply to James Graham [:jgraham] from comment #7)
> Is there a bug somewhere for the fact that taskcluster is violating policy
> by depending on external services?

I'm not sure. Did you search for one? Please feel free to add one.

The root cause was an AWS S3 service disruption, which also impacted non-taskcluster services such as buildbot nightly builds.

Could you provide a link to the policy document? I believe much of Mozilla's workflow relies on external services, such as GitHub pushes, Travis builds, Heroku app deployment, and storage in S3, so if it really is the case that we should not depend on external services, there is much more than just taskcluster to fix. Even our email goes through the Gmail service. I also understood that we are intentionally migrating away from managing our own data centres to deploying in the cloud, so I'm not sure how we can avoid depending on external services.

Comment 9

2 years ago
Regarding depending on a service like docker.io: there is a bug on file for a quarterly goal to investigate storing images as artifacts within taskcluster [1] (a rough sketch of that idea appears after the references below).

Unfortunately, since this was an Amazon issue in us-east-1 [2], not only would images as artifacts (as proposed in that bug) have been disrupted, but possibly also the creation of the EC2 instances that run the tasks within that region. While we do have a bug on file to move away from less robust solutions (like docker.io) to hosting images as artifacts within taskcluster, we still ultimately rely on Amazon. We do not currently have a good story for not relying exclusively on Amazon for our workers, but we do support multiple regions, so if an issue like this persisted within one region we have the option of provisioning workers only in other regions.

Also, as a side note, relying on other docker registries wouldn't have solved this problem either, because the other popular hosted docker registry was also affected by this outage. And it would still violate the rule of not depending on external resources.


[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1182490
[2] From status.aws.amazon.com: "3:46 AM PDT Between 12:08 AM and 3:40 AM PDT, Amazon S3 experienced elevated error rates and latencies. We identified the root cause and pursued multiple paths to recovery. The error has been corrected and the service is operating normally."
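A rough, hypothetical sketch of the images-as-artifacts idea, assuming images are exported with docker save, published as task artifacts, and loaded by workers with docker load; the task ID and artifact path are placeholders and the artifact URL layout is illustrative:

# Publishing side: export the image and upload the tarball as a task artifact.
$ docker save taskcluster/livelog:v3 > livelog-v3.tar

# Worker side: fetch the artifact from the taskcluster queue and load it into
# the local docker daemon, with no external registry involved.
$ curl -L "https://queue.taskcluster.net/v1/task/<taskId>/artifacts/public/livelog-v3.tar" | docker load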
(Reporter)

Comment 10

2 years ago
At the moment we are just waiting for some retriggered jobs to complete, and then the trees will be reopened.
(Reporter)

Updated

2 years ago
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED

> Could you provide a link to the policy document?

Here you go :pmoore 

https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Must_avoid_patterns_known_to_cause_non_deterministic_failures

Specifically "Must not rely on resources from sites whose content we do not control/have no SLA"
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Must_avoid_patterns_known_to_cause_non_deterministic_failures says:

"Must not rely on resources from sites whose content we do not control/have no SLA"

If this problem was actually "amazon was down in a way that would affect our 'internal' infrastructure too", then it's quite possible that in this specific case we would have had a problem even if we were self-hosting. But there are any number of reasons that docker.io might go down that should not cause total loss of CI/release service at Mozilla.
> If this problem was actually "amazon was down in a way that would affect our
> 'internal' infrastructure too", then it's quite possible that in this
> specific case we would have had a problem even if we were self-hosting. But
> there are any number of reasons that docker.io might go down that should not
> cause total loss of CI/release service at Mozilla.

Absolutely agree. This is why we want to move images from docker.io to artifacts within taskcluster. Unfortunately in this case it wouldn't have helped, because of the larger Amazon outage, but either way it would make taskcluster image handling more robust than relying on docker.io, and would also give us the ability to host images in multiple regions.
6 comments hidden (Treeherder Robot)
(Reporter)

Comment 20

2 years ago
Moving closed bugs across to new Bugzilla product "TaskCluster".
Component: TaskCluster → Integration
Product: Testing → Taskcluster
Target Milestone: --- → mozilla46