Closed
Bug 1496210
Opened 6 years ago
Closed 6 years ago
[tracking] give ciduty access to terminate bad instances
Categories: Taskcluster :: Operations and Service Requests (task)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: zfay, Assigned: bstack
Attachments: 1 file
After a talk with Brian, we've agreed that it would be helpful for ciduty to be able to terminate bad instances[1] that are under the TC account in AWS.

[1] Instances which have all their jobs completed as fail or exception
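A minimal sketch of the "bad instance" criterion from footnote [1], for illustration only. The function name and the resolution labels ("failed", "exception", "completed") are assumptions, not taken from any existing ciduty tooling.

```python
from typing import Iterable

# Hypothetical resolution labels; "failed" and "exception" mirror the
# "fail or exception" wording in the footnote.
BAD_RESOLUTIONS = {"failed", "exception"}


def is_bad_instance(resolutions: Iterable[str]) -> bool:
    """Return True if every finished job on the instance failed or hit an exception."""
    results = list(resolutions)
    # An instance with no finished jobs gives no evidence either way,
    # so it is not treated as bad.
    return bool(results) and all(r in BAD_RESOLUTIONS for r in results)
```

Treating an instance with no finished jobs as healthy is a design choice in this sketch; the bug itself does not say how that case should be handled.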
Comment 1 (Assignee) • 6 years ago
ajvb: does this seem like a reasonable permission to give? It seems fine to me. We can just add it to the assume-role stuff.
Flags: needinfo?(abahnken)
Comment 2 • 6 years ago
This seems fine to me. How will we be limiting it to "Instances which have all their jobs completed as fail or exception"?
Flags: needinfo?(abahnken)
Comment 3 (Assignee) • 6 years ago
I think permissions-wise we'll just give access to all instances. Maybe we can figure out a way to use tags to limit it to just aws-provisioner instances, but that might be a bit overly cautious?
Comment 4 • 6 years ago
I think it would be worth limiting it to workers using EC2 tags. We can use a condition (like in https://aws.amazon.com/premiumsupport/knowledge-center/iam-policy-tags-restrict/) where "Owner" == "ec2-provisioner" or similar?
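A sketch of what such a tag-restricted policy could look like, built as a Python dict and printed as JSON. The "Owner" / "ec2-provisioner" tag pair is the example from this comment and may not match the tags actually on the instances; the function name is hypothetical, and this is not the policy that was eventually deployed.

```python
import json


def build_terminate_policy(tag_key: str = "Owner",
                           tag_value: str = "ec2-provisioner") -> dict:
    """Build an IAM policy document allowing termination of tagged instances only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ec2:TerminateInstances",
                # Any instance in any region/account; the condition narrows it.
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    # ec2:ResourceTag/<key> matches a tag on the target instance.
                    "StringEquals": {f"ec2:ResourceTag/{tag_key}": tag_value}
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_terminate_policy(), indent=2))
```

Console users would typically also need a separate statement allowing `ec2:DescribeInstances` in order to find the instances, which this sketch does not cover.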
Comment 5 (Reporter) • 6 years ago
I agree with Aj. The EC2 instances are the only ones we have had to interact with so far, so the approach in comment 4 makes sense.
Comment 6 (Assignee) • 6 years ago
Ok, I will work on setting up a policy. I'm also curious why we want this on top of what you can already do through the provisioner explorer on the tools site?
Comment 7 (Reporter) • 6 years ago
The main reason is that when sheriffs report bustage to ciduty and our investigation shows that a particular worker is failing every task, we want to be able to stop that instance so that another, healthy one takes its place.
Comment 8 (Assignee) • 6 years ago
https://github.com/taskcluster/taskcluster-infrastructure/pull/39
Comment 9 (Assignee) • 6 years ago
Ok, you all now have AWS console/API access to do this (thanks for verifying, apop). We are now working on giving you scopes!
Comment 10 (Assignee) • 6 years ago
Looking into the ec2-manager api, the scope guarding this action is ec2-manager:manage-resources:<workerType>.

This allows someone to terminate a single instance, terminate all instances or run a single instance of a workerType. I think this seems like something that is totally in ciduty's wheelhouse.

Looking at apop's groups, I think maybe the best bet is to just give ec2-manager:manage-resources:gecko-* to mozilla-group:releng. This will cut the taskcluster team out of the loop for one more common action and move us towards a world where releng owns that whole side of the world. Thoughts, people I've NI'd?
Flags: needinfo?(jlund)
Flags: needinfo?(jhford)
Flags: needinfo?(abahnken)
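For reference, a small sketch of how a trailing-`*` scope such as the one proposed in comment 10 would cover per-workerType scopes, assuming the usual Taskcluster rule that a scope ending in `*` satisfies any scope beginning with that prefix. The function name and the worker type names in the usage lines are illustrative only; real checks are done by the Taskcluster services, not by code like this.

```python
def scope_satisfies(granted: str, required: str) -> bool:
    """Return True if the granted scope satisfies the required scope."""
    if granted.endswith("*"):
        # A trailing "*" acts as a prefix wildcard.
        return required.startswith(granted[:-1])
    # Without a trailing "*", the scopes must match exactly.
    return granted == required


if __name__ == "__main__":
    granted = "ec2-manager:manage-resources:gecko-*"
    # Illustrative worker type names, not taken from the bug.
    for worker_type in ("gecko-t-linux-large", "some-other-worker"):
        required = f"ec2-manager:manage-resources:{worker_type}"
        print(worker_type, scope_satisfies(granted, required))
```

Under this rule the proposed grant would cover every workerType whose name starts with `gecko-`, and nothing else.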
Comment 11 • 6 years ago
(In reply to Brian Stack [:bstack] from comment #10)
> Looking into the ec2-manager api, the scope guarding this action is
> ec2-manager:manage-resources:<workerType>.
>
> This allows someone to terminate a single instance, terminate all instances
> or run a single instance of a workerType. I think this seems like something
> that is totally in ciduty's wheelhouse.
>
> Looking at apop's groups, I think maybe the best bet is to just give
> ec2-manager:manage-resources:gecko-* to mozilla-group:releng. This will cut
> the taskcluster team out of the loop for one more common action and move us
> towards a world where releng owns that whole side of the world. Thoughts,
> people I've NI'd?

sounds good to me
Flags: needinfo?(jhford)
Comment 12 • 6 years ago
wfm. whenever we cut responsibility off from Taskcluster and hand off to ciduty, we should make sure we also go over what the responsibility entails: daily actionables, troubleshooting, etc. Docs where possible.

@danut - should we discuss this responsibility more or do you think ciduty have a handle on this already?
Flags: needinfo?(jlund) → needinfo?(dlabici)
Comment 13 • 6 years ago
(In reply to Brian Stack [:bstack] from comment #10)
> Looking into the ec2-manager api, the scope guarding this action is
> ec2-manager:manage-resources:<workerType>.
>
> This allows someone to terminate a single instance, terminate all instances
> or run a single instance of a workerType. I think this seems like something
> that is totally in ciduty's wheelhouse.
>
> Looking at apop's groups, I think maybe the best bet is to just give
> ec2-manager:manage-resources:gecko-* to mozilla-group:releng. This will cut
> the taskcluster team out of the loop for one more common action and move us
> towards a world where releng owns that whole side of the world. Thoughts,
> people I've NI'd?

Sounds good to me as well.
Flags: needinfo?(abahnken)
Comment 14 (Assignee) • 6 years ago
Updated (Assignee) • 6 years ago
Assignee: nobody → bstack
Status: NEW → ASSIGNED
Comment 15 • 6 years ago
(In reply to Jordan Lund (:jlund) from comment #12)
> wfm. whenever we cut responsibility off from Taskcluster and hand off to
> ciduty, we should make sure we also go over what the responsibility entails:
> daily actionables, troubleshooting, etc. Docs where possible.
>
> @danut - should we discuss this responsibility more or do you think ciduty
> have a handle on this already?

I believe everyone in CiDuty knows how to terminate the instance(s) via the console or API, and when to do it, so we should be good on this part. I do think it would be good to define some boundaries (for when we take action); I will follow up via email.
Flags: needinfo?(dlabici)
Comment 16 (Assignee) • 6 years ago
Ok, the patch has landed! I think this should all work now, but we can't test it until there's an actual issue that requires terminating instances. If historical trends continue, that should happen sometime next week. I'm closing this for now; let's reopen if it doesn't work.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated • 5 years ago
Component: Service Request → Operations and Service Requests