1192758 - docker.io is having service disruptions

Reporter

Description

•

10 years ago

This is preventing decision tasks from running, see e.g.:
https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Zu0Ygrq_T2GisdZs8xVQHA/3/public/logs/live_backing.log
https://treeherder.mozilla.org/#/jobs?repo=fx-team&filter-searchStr=gecko-decision%20opt

Locally, I get:

$ docker pull taskcluster/livelog:v3
Pulling repository taskcluster/livelog
FATA[0065] Get https://registry-1.docker.io/v1/repositories/taskcluster/livelog/tags: read tcp 54.152.161.54:443: i/o timeout

This seems to be the root cause. From https://status.docker.com/:

Investigating issue with high load
Degraded Performance
Incident Status: Degraded Performance
Components: Docker Hub Web, Docker Registry Hub API, Docker Hub Oauth and Accounts API, Docker Registry API, Docker Registry Hub WEB, Docker Hub Automated Builds, Docker Docs
Locations: IAD3
August 10, 2015 8:27AM UTC [Investigating] We are currently experience very high load on our registry servers and we are looking to see what is causing the problems.

Pete Moore [:pmoore][:pete]

Reporter

Comment 1

•

10 years ago

I'm not sure if we can serve the affected docker images from a different registry (such as quay.io). This might be the solution, until the service disruption is resolved on docker.io.
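The fallback idea in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names `docker_pull` and `pull_with_fallback` are hypothetical, and it assumes the same image has already been published under the same name on each registry listed, which this bug does not confirm for quay.io.

```python
import subprocess
from typing import Callable, Iterable, Optional


def docker_pull(reference: str, timeout: float = 120.0) -> bool:
    """Run `docker pull <reference>` and report whether it succeeded."""
    try:
        result = subprocess.run(
            ["docker", "pull", reference],
            capture_output=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung registry or a missing docker binary both count as failure.
        return False
    return result.returncode == 0


def pull_with_fallback(
    image: str,
    registries: Iterable[str],
    pull: Callable[[str], bool] = docker_pull,
) -> Optional[str]:
    """Try each registry in order; return the first reference that pulls.

    Returns None when every registry fails.
    """
    for registry in registries:
        reference = f"{registry}/{image}"
        if pull(reference):
            return reference
    return None
```

A call such as `pull_with_fallback("taskcluster/livelog:v3", ["docker.io", "quay.io"])` would try docker.io first and only fall back to quay.io on failure. This does not help when both registries share an underlying dependency that is down.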

:Ms2ger (he/him; ⌚ UTC+1/+2)

Comment 2

•

10 years ago

I closed the trees, please ping me on IRC when this clears up.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 3

•

10 years ago

FWIW, looks like AWS is having issues, too: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&fromchange=0e269a1f1beb&filter-searchStr=night

Pete Moore [:pmoore][:pete]

Reporter

Comment 4

•

10 years ago

This is confirmed by the latest post on http://status.docker.com/: "The issues that we have been investigating has been identified as being related to an AWS S3 issue that is currently affecting AWS-East, we have contacted Amazon, and are looking at ways to mitigate the issue."

Pete Moore [:pmoore][:pete]

Reporter

Comment 5

•

10 years ago

I can pull from docker.io again locally now, hopefully things are clearing up...

Pete Moore [:pmoore][:pete]

Reporter

Comment 6

•

10 years ago

gecko decision tasks are working again: https://tools.taskcluster.net/task-inspector/#0x0gWz8sRciT07M8HxaT7Q/0

James Graham [:jgraham]

Comment 7

•

10 years ago

Is there a bug somewhere for the fact that taskcluster is violating policy by depending on external services?

Pete Moore [:pmoore][:pete]

Reporter

Comment 8

•

10 years ago

(In reply to James Graham [:jgraham] from comment #7)
> Is there a bug somewhere for the fact that taskcluster is violating policy
> by depending on external services?

I'm not sure. Did you search for one? Please feel free to add one.

The root cause of the disruption was an AWS S3 service disruption, which also impacted non-taskcluster services such as buildbot nightly builds.

Could you provide a link to the policy document? I believe much of Mozilla's workflow relies on external services, such as github push requests, travis builds, heroku app deployment and storage in s3, so if it really is the case that we should not depend on external services, there is much more than just taskcluster to fix. Even our email goes through the gmail service. I also understood that we are intentionally migrating away from managing our own data centres to deploying in the cloud, so I'm not sure how we can not depend on external services.

Greg Arndt [:garndt]

Comment 9

•

10 years ago

With regard to depending on a service like docker.io, there is a bug for a quarterly goal to investigate storing images as artifacts within taskcluster [1]. Unfortunately, since this was an Amazon issue in us-east-1, not only would images as artifacts (as proposed in that bug) have been disrupted, but perhaps also the creation of our EC2 instances which run the tasks within that region.

While we do have a bug to move away from less robust solutions (like docker.io) to hosting images as artifacts within taskcluster, we still ultimately rely on Amazon. We do not currently have a good story for not relying exclusively on Amazon for our workers, but we do have the ability to support multiple regions, so if this issue had persisted within a region we would have the option to provision our workers only in other regions.

Also, as a side note, relying on other docker registries wouldn't have solved this problem either, because the other popular hosted docker registry was also affected by this outage, and doing so would still violate the rule of not depending on external resources.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1182490
[2] (as stated on status.aws.amazon.com) 3:46 AM PDT Between 12:08 AM and 3:40 AM PDT, Amazon S3 experienced elevated error rates and latencies. We identified the root cause and pursued multiple paths to recovery. The error has been corrected and the service is operating normally.
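The option of provisioning workers only in unaffected regions can be sketched as a simple filter. This is a minimal illustration, not the actual taskcluster provisioner logic: the function name `provisionable_regions` is hypothetical, and any region name other than us-east-1 is an assumption. How a region gets marked as degraded is left out.

```python
from typing import Iterable, List, Set


def provisionable_regions(configured: Iterable[str], degraded: Set[str]) -> List[str]:
    """Return the configured regions not currently marked as degraded.

    Order is preserved so that any preference order in the configuration
    still applies to the remaining regions.
    """
    healthy = [region for region in configured if region not in degraded]
    if not healthy:
        # With every region degraded there is nowhere left to provision.
        raise RuntimeError("no healthy region available for provisioning")
    return healthy
```

For example, `provisionable_regions(["us-east-1", "us-west-2"], {"us-east-1"})` leaves only the second region in play.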

Pete Moore [:pmoore][:pete]

Reporter

Comment 10

•

10 years ago

At the moment we are just waiting for some retriggered jobs to complete, and then the trees will be reopened.

Pete Moore [:pmoore][:pete]

Reporter

Updated

•

10 years ago

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Greg Arndt [:garndt]

Comment 11

•

10 years ago

> Could you provide a link to the policy document?

Here you go :pmoore
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Must_avoid_patterns_known_to_cause_non_deterministic_failures

Specifically: "Must not rely on resources from sites whose content we do not control/have no SLA"

James Graham [:jgraham]

Comment 12

•

10 years ago

https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Must_avoid_patterns_known_to_cause_non_deterministic_failures says:

"Must not rely on resources from sites whose content we do not control/have no SLA"

If this problem was actually "amazon was down in a way that would affect our 'internal' infrastructure too", then it's quite possible that in this specific case we would have had a problem even if we were self-hosting. But there are any number of reasons that docker.io might go down that should not cause total loss of CI/release service at Mozilla.

Greg Arndt [:garndt]

Comment 13

•

10 years ago

> If this problem was actually "amazon was down in a way that would affect our
> 'internal' infrastructure too", then it's quite possible that in this
> specific case we would have had a problem even if we were self-hosting. But
> there are any number of reasons that docker.io might go down that should not
> cause total loss of CI/release service at Mozilla.

Absolutely agree. This is why we want to move images from docker.io to artifacts within taskcluster. Unfortunately, in this case it wouldn't have helped because of the larger Amazon outage, but either way it would make taskcluster image handling more robust than relying on docker.io, and it would also give us the ability to host images in multiple regions.

Comment hidden (Legacy TBPL/Treeherder Robot)

Pete Moore [:pmoore][:pete]

Reporter

Comment 20

•

9 years ago

Moving closed bugs across to new Bugzilla product "TaskCluster".

Component: TaskCluster → Integration

Product: Testing → Taskcluster

Target Milestone: --- → mozilla46

Nobody; OK to take it and work on it

Assignee

Updated

•

7 years ago

Component: Integration → Services

Categories

(Taskcluster :: Services, defect)

Tracking

(Not tracked)

People

(Reporter: pmoore, Unassigned)

Security

(public)
