Closed Bug 1496210 Opened 6 years ago Closed 6 years ago

[tracking] give ciduty access to terminate bad instances

Categories

(Taskcluster :: Operations and Service Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zfay, Assigned: bstack)

Details

Attachments

(1 file)

After talking with Brian, we agreed that it would be helpful for ciduty to be able to terminate bad instances[1] that run under the Taskcluster (TC) account in AWS.

[1] Instances on which every job has ended as failed or exception.
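The definition in [1] can be written down as a small check. This is only a sketch: the outcome strings "failed" and "exception" and the function name are assumptions made for illustration, and treating an instance with no finished jobs as not bad is a choice made here, not something stated in this bug.

```python
# Outcomes that count against an instance (assumed labels, for illustration).
BAD_OUTCOMES = {"failed", "exception"}


def is_bad_instance(outcomes):
    """Return True when the instance ran at least one job and every job ended badly."""
    outcomes = list(outcomes)
    # An instance with no finished jobs is not considered bad in this sketch.
    return bool(outcomes) and all(o in BAD_OUTCOMES for o in outcomes)
```

For example, `is_bad_instance(["failed", "exception"])` is true, while a single "completed" job in the list makes it false.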
ajvb: does this seem like a reasonable permission to give? It seems fine to me. We can just add it to the assume-role stuff.
Flags: needinfo?(abahnken)
This seems fine to me. How will we be limiting it to "Instances which have all their jobs completed as fail or exception"?
Flags: needinfo?(abahnken)
I think permissions-wise we'll just give access to all instances. Maybe we can figure out a way to use tags to limit it to just aws-provisioner instances, but that might be overly cautious?
I think it would be worth limiting it to workers using EC2 tags. We can use a condition (as in https://aws.amazon.com/premiumsupport/knowledge-center/iam-policy-tags-restrict/) where "Owner" == "ec2-provisioner" or similar.
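A tag-restricted policy along those lines might look roughly like the following. This is a sketch, not the policy that was actually deployed: the tag key "Owner" and value "ec2-provisioner" are the "or similar" placeholders from the comment above, and the wildcard resource ARN is an assumption.

```python
import json


def terminate_policy(tag_key="Owner", tag_value="ec2-provisioner"):
    """Build an IAM policy document that allows terminating only tagged instances."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ec2:TerminateInstances",
                # Assumed scope: any instance in any region of the account.
                "Resource": "arn:aws:ec2:*:*:instance/*",
                # Only instances carrying the given tag value match.
                "Condition": {
                    "StringEquals": {f"ec2:ResourceTag/{tag_key}": tag_value}
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(terminate_policy(), indent=2))
```

The condition key `ec2:ResourceTag/<key>` compares against the tag on the target instance, so untagged or differently tagged instances are not covered by the Allow statement.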
I agree with Aj. EC2 instances are the only ones we have had to interact with so far, so what was stated in comment 4 makes sense.
Ok, I will work on setting up a policy. I'm also curious why we want this on top of what you can already do through the provisioner explorer on the tools site.
The main reason is this: when sheriffs report bustage to ciduty and our investigation shows that a particular worker is failing every task, we want to be able to terminate that instance so that a healthy one takes its place.
Ok, you all now have AWS console/API access to do this. (Thanks for verifying, apop.)
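For the API route, terminating a single instance could be sketched as below. The helper name is hypothetical, and the client is passed in so the function itself has no dependencies; it is assumed to be a boto3 EC2 client (for example `boto3.client("ec2")`), whose `terminate_instances` call returns a `TerminatingInstances` list.

```python
def terminate_instance(ec2_client, instance_id):
    """Ask EC2 to terminate one instance and return its reported new state name."""
    response = ec2_client.terminate_instances(InstanceIds=[instance_id])
    # Each entry describes one instance and its state transition.
    for change in response.get("TerminatingInstances", []):
        if change.get("InstanceId") == instance_id:
            return change["CurrentState"]["Name"]
    # The instance did not appear in the response.
    return None
```

Whether the call is permitted for a given instance depends on the policy attached to the assumed role.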

We are now working on giving you scopes!
Looking into the ec2-manager api, the scope guarding this action is ec2-manager:manage-resources:<workerType>.

This allows someone to terminate a single instance, terminate all instances or run a single instance of a workerType. I think this seems like something that is totally in ciduty's wheelhouse. 

Looking at apop's groups, I think maybe the best bet is to just give ec2-manager:manage-resources:gecko-* to mozilla-group:releng. This will cut the taskcluster team out of the loop for one more common action and move us towards a world where releng owns that whole side of the world. Thoughts, people I've NI'd?
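To illustrate why a single `gecko-*` grant covers every gecko workerType: a Taskcluster scope ending in `*` satisfies any scope that starts with the text before the star. The sketch below shows that rule only; the function names and the example workerType names are made up for illustration, and this is not the ec2-manager implementation.

```python
def scope_satisfies(granted, required):
    """Return True if one granted scope satisfies the required scope."""
    if granted.endswith("*"):
        # A trailing star matches any scope sharing the prefix before it.
        return required.startswith(granted[:-1])
    return granted == required


def can_manage(granted_scopes, worker_type):
    """Check whether any granted scope allows managing the given workerType."""
    required = f"ec2-manager:manage-resources:{worker_type}"
    return any(scope_satisfies(g, required) for g in granted_scopes)
```

With `["ec2-manager:manage-resources:gecko-*"]` granted, `can_manage(..., "gecko-t-linux-large")` holds, while a workerType that does not begin with `gecko-` is not covered.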
Flags: needinfo?(jlund)
Flags: needinfo?(jhford)
Flags: needinfo?(abahnken)
(In reply to Brian Stack [:bstack] from comment #10)
> Looking into the ec2-manager api, the scope guarding this action is
> ec2-manager:manage-resources:<workerType>.
> 
> This allows someone to terminate a single instance, terminate all instances
> or run a single instance of a workerType. I think this seems like something
> that is totally in ciduty's wheelhouse. 
> 
> Looking at apop's groups, I think maybe the best bet is to just give
> ec2-manager:manage-resources:gecko-* to mozilla-group:releng. This will cut
> the taskcluster team out of the loop for one more common action and move us
> towards a world where releng owns that whole side of the world. Thoughts,
> people I've NI'd?

sounds good to me
Flags: needinfo?(jhford)
wfm. whenever we cut responsibility off from Taskcluster and hand off to ciduty, we should make sure we also go over what the responsibility entails: daily actionables, troubleshooting, etc. Docs where possible.

@danut - should we discuss this responsibility more or do you think ciduty have a handle on this already?
Flags: needinfo?(jlund) → needinfo?(dlabici)
(In reply to Brian Stack [:bstack] from comment #10)
> Looking into the ec2-manager api, the scope guarding this action is
> ec2-manager:manage-resources:<workerType>.
> 
> This allows someone to terminate a single instance, terminate all instances
> or run a single instance of a workerType. I think this seems like something
> that is totally in ciduty's wheelhouse. 
> 
> Looking at apop's groups, I think maybe the best bet is to just give
> ec2-manager:manage-resources:gecko-* to mozilla-group:releng. This will cut
> the taskcluster team out of the loop for one more common action and move us
> towards a world where releng owns that whole side of the world. Thoughts,
> people I've NI'd?

Sounds good to me as well.
Flags: needinfo?(abahnken)
Assignee: nobody → bstack
Status: NEW → ASSIGNED
(In reply to Jordan Lund (:jlund) from comment #12)
> wfm. whenever we cut responsibility off from Taskcluster and hand off to
> ciduty, we should make sure we also go over what the responsibility entails:
> daily actionables, troubleshooting, etc. Docs where possible.
> 
> @danut - should we discuss this responsibility more or do you think ciduty
> have a handle on this already?

I believe everyone in CiDuty knows how to terminate instances via the console or API, and when to do it, so we should be good on that part.
But I do think it would be good to define some boundaries for when we take action. I will follow up via email.
Flags: needinfo?(dlabici)
Ok, the patch has landed! I think this should all work now, but we can't test it until there's an actual issue that requires terminating instances. If historical trends continue, that should happen sometime next week.

I'm closing this for now; let's reopen it if it doesn't work.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Service Request → Operations and Service Requests