This is preventing decision tasks from running, see e.g.:
https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Zu0Ygrq_T2GisdZs8xVQHA/3/public/logs/live_backing.log
https://treeherder.mozilla.org/#/jobs?repo=fx-team&filter-searchStr=gecko-decision%20opt

Locally, I get:

$ docker pull taskcluster/livelog:v3
Pulling repository taskcluster/livelog
FATA Get https://registry-1.docker.io/v1/repositories/taskcluster/livelog/tags: read tcp 184.108.40.206:443: i/o timeout

This seems to be the root cause, from https://status.docker.com/:

Investigating issue with high load
Incident Status: Degraded Performance
Components: Docker Hub Web, Docker Registry Hub API, Docker Hub Oauth and Accounts API, Docker Registry API, Docker Registry Hub WEB, Docker Hub Automated Builds, Docker Docs
Locations: IAD3
August 10, 2015 8:27AM UTC [Investigating] We are currently experiencing very high load on our registry servers and we are looking to see what is causing the problems.
I'm not sure if we can serve the affected docker images from a different registry (such as quay.io). That might be a workaround until the service disruption on docker.io is resolved.
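For reference, mirroring an image to an alternate registry is mechanically just a retag and push. This is only a sketch: the "mozilla-taskcluster" quay.io organization below is hypothetical, and it assumes the image is already available locally or pullable from a working registry.

```shell
# Mirror an existing image to an alternate registry (quay.io shown as an example).
# NOTE: the "mozilla-taskcluster" organization is a hypothetical placeholder.
docker pull taskcluster/livelog:v3                                  # from docker.io, when reachable
docker tag taskcluster/livelog:v3 quay.io/mozilla-taskcluster/livelog:v3
docker push quay.io/mozilla-taskcluster/livelog:v3

# Workers would then pull the mirrored image instead:
docker pull quay.io/mozilla-taskcluster/livelog:v3
```

Of course this only helps if the outage is limited to docker.io; the task definitions would also need to be updated to reference the mirrored image name.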
I closed the trees, please ping me on IRC when this clears up.
FWIW, looks like AWS is having issues, too: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&fromchange=0e269a1f1beb&filter-searchStr=night
This is confirmed by the latest post on http://status.docker.com/: "The issues that we have been investigating has been identified as being related to an AWS S3 issue that is currently affecting AWS-East, we have contacted Amazon, and are looking at ways to mitigate the issue."
I can pull from docker.io again locally now, hopefully things are clearing up...
gecko decision tasks are working again: https://tools.taskcluster.net/task-inspector/#0x0gWz8sRciT07M8HxaT7Q/0
Is there a bug somewhere for the fact that taskcluster is violating policy by depending on external services?
(In reply to James Graham [:jgraham] from comment #7)
> Is there a bug somewhere for the fact that taskcluster is violating policy
> by depending on external services?

I'm not sure. Did you search for one? Please feel free to add one.

The root cause of the disruption was an AWS S3 service disruption, which also impacted non-taskcluster services such as buildbot nightly builds.

Could you provide a link to the policy document? I believe much of Mozilla's workflow relies on external services, such as github push requests, travis builds, heroku app deployment and storage in S3, so if it really is the case that we should not depend on external services, there is much more than just taskcluster to fix. Even our email goes through the gmail service. I also understood that we are intentionally migrating away from managing our own data centres to deploying in the cloud, so I'm not sure how we could avoid depending on external services.
In regards to depending on a service like docker.io, there is a bug for a quarterly goal to investigate storing images as artifacts within taskcluster: https://bugzilla.mozilla.org/show_bug.cgi?id=1182490

Unfortunately, since this was an Amazon issue in us-east-1, not only would images as artifacts (as proposed in that bug) have been disrupted, but perhaps also the creation of the EC2 instances which run the tasks within that region. While we do have a bug to move away from less robust solutions (like docker.io) to hosting images as artifacts within taskcluster, we still ultimately rely on Amazon. We do not currently have a good story for not relying exclusively on Amazon for our workers, but we do have the ability to support multiple regions, so if an issue like this persisted within one region we have the option to provision our workers only in other regions.

As a side note, relying on other docker registries wouldn't have solved this problem either, because the other popular hosted docker registry was also affected by this outage. And that would still violate the rule of not depending on external resources.

As stated on status.aws.amazon.com:

3:46 AM PDT Between 12:08 AM and 3:40 AM PDT, Amazon S3 experienced elevated error rates and latencies. We identified the root cause and pursued multiple paths to recovery. The error has been corrected and the service is operating normally.
At the moment we are just waiting for some retriggered jobs to complete, and then the trees will be reopened.
> Could you provide a link to the policy document?

Here you go :pmoore
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Must_avoid_patterns_known_to_cause_non_deterministic_failures

Specifically: "Must not rely on resources from sites whose content we do not control/have no SLA"
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Must_avoid_patterns_known_to_cause_non_deterministic_failures says: "Must not rely on resources from sites whose content we do not control/have no SLA" If this problem was actually "amazon was down in a way that would affect our 'internal' infrastructure too", then it's quite possible that in this specific case we would have had a problem even if we were self-hosting. But there are any number of reasons that docker.io might go down that should not cause total loss of CI/release service at Mozilla.
> If this problem was actually "amazon was down in a way that would affect our
> 'internal' infrastructure too", then it's quite possible that in this
> specific case we would have had a problem even if we were self-hosting. But
> there are any number of reasons that docker.io might go down that should not
> cause total loss of CI/release service at Mozilla.

Absolutely agree. This is why we want to move images from docker.io to artifacts within taskcluster. Unfortunately, in this case it wouldn't have helped because of the larger Amazon outage, but either way it would make taskcluster image handling more robust than relying on docker.io, and also give us the ability to host images in multiple regions.
Moving closed bugs across to new Bugzilla product "TaskCluster".