Closed Bug 1447585 Opened 7 years ago Closed 7 years ago

Migrate off Docker Cloud by May 21, 2018

Categories: Taskcluster :: Operations and Service Requests (task)
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: bstack; Assigned: bstack

"Cluster Management in Docker Cloud will be discontinued on May 21." That gives us two months to migrate everything off their platform. The current plan is to terraform up some instances in our AWS account and run the containers directly on them using CoreOS, which we can copy from what Jonas did in [0]. We'll also need to figure out some DNS updates and such.

[0] https://github.com/taskcluster/taskcluster-infrastructure/blob/master/modules/taskcluster-worker-packet/machines.tf
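For reference, a minimal sketch of what the per-service terraform could look like, loosely following the machines.tf linked in [0]. Everything here is a placeholder (AMI ID, instance size, file name), not the actual configuration:

```hcl
# Hypothetical sketch only: one CoreOS instance per service,
# with the container started via user_data.
resource "aws_instance" "statsum" {
  ami           = "ami-XXXXXXXX"                # a CoreOS Container Linux AMI in us-west-2
  instance_type = "t2.medium"                   # placeholder size
  user_data     = "${file("cloud-config.yml")}" # cloud-config that pulls and runs the container
  tags {
    Name = "statsum"
  }
}
```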
Our initial plan here was to basically copy [0] for each of our services. I got that working today, but now I realize we have a fatal flaw: this setup does not allow us to update the version of a running service without booting an entirely new instance, and it isn't very HA, etc. Options I see:

1. Come up with a little systemd service that restarts the service with the newest version of the container every so often. (I do not like this.)
2. We already need to get used to running Kubernetes for real, and that would allow us to run these services HA, etc. (Better! Just a bit more complex.)
3. ??? (your idea here)

I think a lot of the reason we didn't go with k8s in the first place is that we thought we would need to run multiple clusters to get things in different regions, but when I look at the Docker Cloud nodes we're actually running, I don't think we _need_ to run anywhere other than us-west-2 (which is a required region because of cloud-mirror). We could run one cluster in us-west-2 and put all of our Docker Cloud services there.

Your thoughts? Please feel free to ask for opinions from others in your timezones if you think they would be interested. I don't want to go full RFC on this, just because we should get this done ASAP and move on.

[0] https://github.com/taskcluster/taskcluster-infrastructure/blob/master/modules/taskcluster-worker-packet/machines.tf
Flags: needinfo?(jopsen)
Flags: needinfo?(dustin)
Another option is maybe AWS's container service, or whatever it is called? There are also the options in Docker's migration guide [1]. Also NI'ing John, since cloud-mirror is the most important service here.

[1] https://docs.docker.com/docker-cloud/migration/
Flags: needinfo?(jhford)
Only one of these four services is destined for actual redeployment, so I don't think we need to figure out how to build them using tc-installer. So, standing up a k8s cluster just for this seems pretty reasonable; under the hood, that's basically what Docker Cloud was doing. If we don't plan to allow that cluster, or at least that configuration, to later morph into a "taskcluster instance", then I think we're free to make mistakes and misdesigns without causing our future selves pain. Even if that meant running a k8s cluster in every region, that wouldn't be awful (though it sounds like we won't need to).
Flags: needinfo?(dustin)
++, on doing a k8s cluster.
Flags: needinfo?(jopsen)
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #4)
> ++, on doing a k8s cluster.

-- for using k8s ;)
For cloud-mirror, I think the systemd service that updates + reboots every 24h would be fine. Rather than just checking out the master branch, I'd like it to use something like a 'production' branch. The only key detail here is that the copier nodes *must* be in us-west-2.
Flags: needinfo?(jhford)
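A rough sketch of what that daily-refresh approach could look like as a systemd service plus timer. The unit names, image name, and 'production' tag are all hypothetical; the real versions would live in the terraform cloud-config:

```ini
# cloud-mirror-update.service (hypothetical):
# pull the 'production' image and bounce the running service
[Unit]
Description=Refresh cloud-mirror to the latest production container

[Service]
Type=oneshot
ExecStart=/usr/bin/docker pull taskcluster/cloud-mirror:production
ExecStart=/usr/bin/systemctl restart cloud-mirror.service

# cloud-mirror-update.timer (hypothetical): run the refresh once a day
[Timer]
OnUnitActiveSec=24h
Persistent=true

[Install]
WantedBy=timers.target
```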
Blocks: 1454770
Blocks: 1454773
OK, everything is up and running now! I've created some bugs for people to update docs and turn off old copies of services, etc. The next step is to point DNS at webhooktunnel/statsum. Jonas, can you verify during your workday that these services are working as intended, and then file bugs to update DNS? Their static IPs are available with `terraform output`.
Flags: needinfo?(jopsen)
@bstack,
We need to move these services to use their own log destination. I think it's okay for all static services to share a log destination, but they shouldn't share one that's also used by workers. Ideally we can then set them in a group and we have a shot at finding them.

I created a log destination for "static services", which adds systems to a "static systems" group. I've updated the file in the password store, but I couldn't terraform apply:

> [taskcluster-infrastructure]$ terraform plan
> Acquiring state lock. This may take a few moments...
> var.cloudmirror_aws_access_key_id
>   Enter a value: ^C
> Interrupt received.
> Please wait for Terraform to exit or data loss may occur.
> Gracefully shutting down...
> Releasing state lock. This may take a few moments...
>
> Error: Error asking for user input: Error asking for cloudmirror_aws_access_key_id: interrupted

@bstack, if you re-deploy we should be able to find the log systems.

I saw one of the systems might still be called "kubernetes-statsum", though I can see why it would be called that. It's not in terraform... so it was probably set manually; that's also fine. But at least they should be grouped automatically.

---
I wish we could do DNS in terraform, so we wouldn't have to file bugs this way. Rolling back is arguably going to be hard :)
Flags: needinfo?(jopsen)
@bstack,
Okay, so apart from the logging thing, it appears we're also DNS-ignorant: we can't just CNAME things to static IPs. We need to either:
A) assign these some dummy domain name we control in terraform,
B) change the domain records to some other record type, or
C) move the domain administration into terraform with Route 53.
Let's chat about it.
I'm not DNS-ignorant :) What's the issue here? Comment 8 seems to be about logging, not DNS.
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #8)
> @bstack,
> We need to move these services to use their own log destination.
> I think it's okay for all static services to share a log destination, but they
> shouldn't share one that's also used by workers.
> Ideally we can then set them in a group and we have a shot at finding them.

They already have a new log destination, in that it is the destination papertrail provided me when I asked for a new one. They also have systems for each service, configured automatically by the configuration we have in terraform:

webhooktunnel: https://papertrailapp.com/systems/webhooktunnel/events
stateless dns: https://papertrailapp.com/systems/stateless_dns/events
cloud mirror: https://papertrailapp.com/systems/cloudmirrorcopiers/events
statsum: https://papertrailapp.com/systems/statsum/events

> I created a log destination for "static services", which adds systems to
> a "static systems" group.
> I've updated the file in the password store, but I couldn't terraform apply

Did you update your password-store? That value seems to be in there for me.

> @bstack, if you re-deploy we should be able to find the log systems.
>
> I saw one of the systems might still be called "kubernetes-statsum", though I
> can see why it would be called that.

It was called that because of me clicking buttons in papertrail when I was trying to get this working in Kubernetes. It has been fixed.

> It's not in terraform... so it was probably set manually; that's also fine.
> But at least they should be grouped automatically.
>
> ---
> I wish we could do DNS in terraform, so we wouldn't have to file bugs this way.
> Rolling back is arguably going to be hard :)

I agree it would be nice if we managed it ourselves, but yeah, these need to be A records. Is that going to be a problem for how things work currently? I assume the DNS admin people can do that for us. Something along the lines of `statsum IN A 54.71.54.57` might work.
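For concreteness, the request to the DNS admins might look like this zone-file fragment. Only statsum's IP appears in this bug; the other services' addresses would come from `terraform output`, and the TTL here is a guess:

```
; hypothetical zone-file fragment -- IP from this bug, TTL a placeholder
statsum   300   IN   A   54.71.54.57
```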
OK, I've turned off all services other than cloud-mirror, since they are all out of DNS now anyway. Things seem to be working a-ok (other than the fact that docker-cloud is failing to stop one of the containers for stateless-dns).

John: I leave the turning off of cloud-mirror in your hands so that you can watch the logs while things happen. Please assign back to me when the service in docker cloud is turned off so that I can wrap up whatever remaining things there are to do!

Logs for the new cloud-mirror: https://papertrailapp.com/systems/cloudmirrorcopiers/events
Assignee: bstack → jhford
I'm satisfied with how things are working in cloud-mirror at this point. Thanks very much for setting this up! One strange thing I noticed is that I cannot get logs from that link before late in the day on April 30. Do you know what's going on there? I looked at the instances and they do seem to be transferring a lot on their network interfaces, so I suspect they're still online.
Assignee: jhford → bstack
Papertrail only retains searchable logs for 3 days, so that seems normal, unless I've misunderstood.
Yeah, I believe that's what is happening. The logs are also retained in S3 for a year, if you want to get some historical data!
OK, I have terminated every instance that was in our docker-cloud cluster! I deem this bug fixed! Thanks for the help, everyone.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
ヾ( ´・_ゝ・`)ノ docker cloud
Oh, just to be clear -- you terminated the cloud-mirror instances, or the EC2 instances? We should terminate the EC2 instances, and the VPC they occupy, too.
Terminated the EC2 instances. I will go and terminate the VPC now as well. Good catch.
See Also: → 1458996
I have disabled the access creds for the AWS user (which had ec2:*!) that was hooked up to docker cloud, and created a followup bug (1447585) to delete it permanently. I have deleted "vpc-5fe1c93a | tutum-vpc" in eu-west-1. I can't figure out which VPC was being used in us-west-2; the instances have been terminated long enough that I can't find them anymore. Is it possible they were just in the default VPC? Anything else you can think of?
That sounds good. My main concern was that we not continue running the EC2 instances in perpetuity, but you had already covered that.
Component: Operations → Operations and Service Requests