Closed Bug 965691 Opened 6 years ago Closed 2 years ago

Create a Comprehensive Slave Loan tool

Categories

(Release Engineering :: General, defect, P3)

x86_64
Windows 7
defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: Callek, Unassigned)

References

Details

(Keywords: meta, Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2573] )

Attachments

(4 files, 2 obsolete files)

This is a tracking bug for the loaner "tool" -- a brief dump of thoughts below; I will try to formulate a better high-level summary later

* web interface
** for devs to request machines and we would approve it
*** start/stop aws machines
*** puppet do the secret clean up and disable puppet itself
*** plus setup the password
* slaveapi
** inventory changes for AWS
** can IT export a way to 'kick' DNS instead of us waiting for it to update?
* bug filing
* nagging emails
* LDAP changes for machine and user
* automatically do re-imaging
** we can do this for Mac (you can't switch OS versions)
** also iX and HP, but maybe not headless
* what are we solving?
** look better to the rest of the company
** reduce human time by reducing steps
** reduce costs (by shutting down things faster)
Love it Callek!

Additional thoughts:

* Regarding approval - maybe we could work out what our own criteria are for approving a request (e.g. do we need to know a bug number? is a person only allowed to loan one machine at a time? etc) - and see how much of this can also be automated; maybe it will suffice to send a notification email that a person has loaned a machine, and keep a history database, so that there are no human delays in the process

* Regarding auto-reimaging - can this cover both reimaging before loan-out and reimaging before return to the pool, so that both reimaging steps are automatically taken care of? e.g. an api call or web interface that allows a user to say "i'm all done, you can have the machine back"
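
A minimal sketch of how the approval criteria floated above (a bug number is required, one machine per person at a time) could be checked automatically. The `LoanRequest` fields and the `approval_problems` helper are hypothetical names for illustration only; the actual criteria are still an open question in this bug.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoanRequest:
    requester: str             # hypothetical: e-mail of the requesting developer
    slave_type: str            # hypothetical: name of the requested slave type
    bug_number: Optional[int]  # bug the developer hopes to fix with the loan


def approval_problems(request: LoanRequest,
                      active_loans: List[LoanRequest]) -> List[str]:
    """Return the reasons a request cannot be auto-approved.

    An empty list means the request meets both example criteria.
    """
    problems = []
    if request.bug_number is None:
        problems.append("no bug number given")
    if any(loan.requester == request.requester for loan in active_loans):
        problems.append("requester already has a machine on loan")
    return problems
```

If the list comes back empty, the tool could record the loan and send a notification e-mail rather than wait for a human to approve it.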

What I also like about your proposal is:
1) We have a history of loan requests, and real data on it that can be shared with other parties: how many machine loans there were, and how long a loan request takes a machine out of the pool for. If we can associate the loan request back to the bug number that is being solved, we can also see how many loan requests result in a resolved bug (i.e. the dev was able to fix the problem they sought to fix by borrowing the machine)
2) Less work for our team, cutting costs
3) Immediate turnaround time for loaning a machine (assuming we can automate approval or "report" on borrowing rather than "approve")
4) Safer regarding forgetting steps such as leaving keys on machines
5) More reliable - fewer human steps which can introduce errors
6) Works when we are on holiday, e.g. over Christmas/New Year!

Ideally the process would involve *no* human intervention from our team at all.
As various parts of this get fleshed out, we'll need some sec-team review/input. Roping them in early.
Duplicate of this bug: 919402
Coop, it looks to me like the other bug is not about a web-based self-serve tool that allows arbitrary machines to be loaned out. Rather, it covers the specific case of recurring crashes, where a machine will automatically keep its test process alive, attach a debugger, and notify parties that this machine can be loaned.

I think the topics are closely related, but I don't think they are duplicates - I see this bug more as a means to have a humanly-interfacable web service that allows machines to be loaned out, and the workflow around this process to be automatically driven by input to the web interface. I think the other bug is more driven by test results.
Attached image slaveloan_v0 - architechture (infra) (obsolete) —
After discussion with ben on tuesday I created this last night. I had hoped to get feedback on if this matched our discussion over email but he got overly busy today.

So I'm putting this out there, so others can mull over it anyway.

I would like to keep this bug relatively clean for now, so please reach out to me directly if there are any questions. (only exception is ben whom I already spoke with and can call out a mistake in this model compared to what we discussed)
Attachment #8371879 - Flags: feedback?(bhearsum)
Attachment #8371879 - Attachment is patch: false
Attachment #8371879 - Attachment mime type: text/plain → image/png
(In reply to Justin Wood (:Callek) from comment #5)
> Created attachment 8371879 [details]
> slaveloan_v0 - architechture (infra)
> 
> After discussion with ben on tuesday I created this last night. I had hoped
> to get feedback on if this matched our discussion over email but he got
> overly busy today.
> 
> So I'm putting this out there, so others can mull over it anyway.
> 
> I would like to keep this bug relatively clean for now, so please reach out
> to me directly if there are any questions. (only exception is ben whom I
> already spoke with and can call out a mistake in this model compared to what
> we discussed)

I put this in e-mail, but I'll echo it here:
Mostly looks fine. Having the app talk directly with SlaveAPI at all is a security risk at this point though - there's no way to enable "read only" access or such. Do you know what you're anticipating needing that for, that won't already be in the db?
Attachment #8371879 - Flags: feedback?(bhearsum) → feedback+
since v0:

Added display of e-mail path. (where e-mails will come from)

Settled on RelengAPI vs a generic flask app

Settled on Celery Workers rather than mozharness via crontask

Settled on Slaveapi being the only (prominent) slave access method. (since it holds all the secrets)
Attachment #8371879 - Attachment is obsolete: true
Comment on attachment 8398519 [details]
slaveloan_v1 - Loaner Request Flow Chart

Not explicitly stated in this diagram, but for reference: the Celery tasks will be short-lived rather than long-lived, so that each piece is resilient to failure and not perpetually running.

Most of the work is happening in slaveapi. The pieces that run in celery should also be resilient to failure and re-doable (by a new celery task run) if they do fail.

The celery task(s) will relaunch on a form of cron to pick up where we left off.
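
A minimal sketch of the "short-lived, re-doable, pick up where we left off" idea in plain Python, without Celery itself. Everything here is an assumption for illustration: the step names, the `completed_steps` bookkeeping, and the `advance_loan` helper are hypothetical, and the real flow is the one in the attached flow charts. Each invocation runs at most one step and records progress, so a periodic relaunch can resume after a failure.

```python
from typing import Callable, Dict, List

# Hypothetical step names, for illustration only.
STEPS: List[str] = [
    "select_slave",
    "disable_in_pool",
    "prepare_machine",
    "notify_requester",
]


def advance_loan(loan: Dict, handlers: Dict[str, Callable[[Dict], None]]) -> bool:
    """Run at most one pending step; return True once every step is done.

    Progress is stored on the loan record, so a later run (for example one
    started by a periodic scheduler) resumes at the first unfinished step.
    A failed step is left pending and retried by the next run.
    """
    done = loan.setdefault("completed_steps", [])
    pending = [step for step in STEPS if step not in done]
    if not pending:
        return True
    step = pending[0]
    try:
        handlers[step](loan)
    except Exception as exc:
        # Remember the failure; the step stays pending for the next run.
        loan["last_error"] = f"{step}: {exc}"
        return False
    done.append(step)
    loan.pop("last_error", None)
    return len(pending) == 1
```

In this sketch the loan record is a plain dict; in a real deployment it would presumably live in the tool's database, and each handler would need to be safe to run again if it failed part-way.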
Attachment #8398612 - Attachment description: Loan Reclaim Flow → slaveloan_v1 - Loan Reclaim Flow Chart
Attachment #8398519 - Attachment description: slaveloan_v1 - Loaner Process Flow Chart → slaveloan_v1 - Loaner Request Flow Chart
Attachment #8398670 - Attachment is obsolete: true
Ok, so I'm very very bad at UX, I'll invite feedback from others in the field (webdev/etc) to get this going well.

That said, I have a VERY rudimentary example of what this could look like, even if not pretty at http://people.mozilla.org/~jwood/slaveloan/

The code will be implemented in such a way that almost anyone could help design a better UI anyway.
Thanks for adding me to the bug. So far it looks good and I appreciate the level of documentation you're putting together.
Looks great Callek, and I'd like to echo Adam's words - really appreciate the upfront architecture designs.

A thought about the stop/start workflows.

I see it makes sense to have a generic solution in place regardless of slave, but i'm also aware that if we have a web interface that talks to the slave tool backend, which talks to a celery worker, which then proxies to the slave api rest api service, which then talks to the slaves, we might require quite a lot of end-to-end infrastructure, e.g. in the case of wanting to shut down an aws machine.

If it is possible to automatically grant the user access to the aws console, with limited rights *just* to the user's loan machines, that might be a way to give them the power of the aws console at their fingertips, without the need for all the infrastructure in between that we'd need to support, and potential sources of breakage that could impact them. In the vein of keeping it as simple as possible, but not simpler (something like "occam's razor").

For other machine types, if the dev has access to the machine, I'm guessing they should be able to reboot it themselves, for example. I think it is only AWS ones where they would want an interface to suspend it - and i think access to aws console would provide this, if we can sort out automatic provisioning of grants.

Interested to know your thoughts. :)

Pete
(In reply to Pete Moore [:pete][:pmoore] from comment #15)
> If it is possible to automatically grant the user access to the aws console,
> with limited rights *just* to the user's loan machines, that might be a way
> to give them the power of the aws console at their fingertips, without the
> need for all the infrastructure in between that we'd need to support, and
> potential sources of breakage that could impact them. In the vein of keeping
> it as simple as possible, but not simpler (something like "occam's razor").

I'm not really interested in expecting devs to figure out how to navigate AWS, also in navigating what perms can/should be set, especially with regard to sec considerations around them having direct AWS access (e.g. can they steal an AMI with ssh keys, can they get into other active AMIs, can they change the instance type into something REALLY expensive that we don't want to support, etc)
(In reply to Justin Wood (:Callek) from comment #16)

> I'm not really interested ....

ouch

> ... in expecting devs to figure out how to navigate AWS

I think a lot of devs already know how to navigate AWS, and actually it may be a smaller learning-curve to use AWS than learning how to use the new slave loan tool. Both are web interfaces - I'm not sure AWS is necessarily harder to use.

> also in navigating what perms can/should be set, especially with regard
> to sec considerations around them having direct AWS access (e.g. can they
> steal an AMI with ssh keys, can they get into other active AMIs, can they
> change the instance type into something REALLY expensive that we don't want
> to support, etc)

I think that is a one-time problem - we need to work out which permissions to assign them so they have access only to stop/start machines, and nothing else. This is quite reasonable, and should be entirely possible. That is not a reason to block, as far as i can see.

Let us not forget the benefits - a simpler, smaller tool is easier to maintain. It costs less in maintenance. It uses less infrastructure, and therefore reduces operational costs. The work has been done by AWS to provide role/grants systems so that users can be granted access precisely to what they need, exactly that - nothing more, nothing less; and AWS have created comprehensive APIs to create users and slaves exactly for this purpose.
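
A rough sketch of what such a stop/start-only grant might look like as an IAM policy document, built here as a Python dict. The `LoanedTo` tag name and the `loan_policy` helper are hypothetical, and this is an unreviewed illustration, not a vetted policy; the security questions raised earlier in this bug would still need sec-team input.

```python
import json


def loan_policy(owner_tag_value: str) -> dict:
    """Build an IAM policy limited to starting/stopping tagged instances.

    Assumes loaned instances carry a hypothetical "LoanedTo" tag whose
    value identifies the borrower.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ec2:StartInstances", "ec2:StopInstances"],
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    "StringEquals": {
                        "ec2:ResourceTag/LoanedTo": owner_tag_value
                    }
                },
            },
            {
                # Describe calls do not support resource-level restrictions,
                # so listing instances has to be granted on "*".
                "Effect": "Allow",
                "Action": "ec2:DescribeInstances",
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(loan_policy("dev@example.com"), indent=2))
```

Note that the second statement lets the user list every instance in the account, not just their own loans, which is one of the trade-offs that would need weighing.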

It is always tempting to create large architectures, because they are interesting and "generic" - but we shouldn't forget this comes at a cost. If a simpler solution does everything necessary, we should justify the reason not to go with it in terms of a business case. I currently don't see any benefits of engineering a large stack of intercommunicating systems in order to achieve what could potentially be achieved directly by the user with the existing AWS console, and some grants.

Regarding the rest of the slave loan tool - I agree on the architectural suggestions you make, since they are compliant with occam's razor - the problem *is* that complex so that necessitates a complex architecture, as you nicely laid out and documented. I just don't see it for the part related to suspending hosts (which i think is only needed for aws loans - please tell me if I'm wrong).

I think this will be a fantastic tool, and really help serve the developers. It would be a shame though if we over-engineered parts that could be kept simple, IMO.
(In reply to Justin Wood (:Callek) from comment #13)
> Ok, so I'm very very bad at UX, I'll invite feedback from others in the
> field (webdev/etc) to get this going well.
> 
> That said, I have a VERY rudimentary example of what this could look like,
> even if not pretty at http://people.mozilla.org/~jwood/slaveloan/
> 
> The code will be implemented in such a way that almost anyone could help
> design a better UI anyway.

There are two parts to this:

1) Making sure we understand the workflows, and present the selections to devs in as straightforward a manner as possible; and, 
2) Making it look pretty.

It's way too early to worry about #2, but I do have some comments on #1 based on your mock-up:

* In the "My Loans" section, does history refer to the history of the slave being loaned, or the history of the request itself?
* In the "New Loan" section, for the first pass (and possibly always), I think we only want a list of existing slave types; I don't think there's value in starting the conversation about loaning a one-off slave type in this tool. We should certainly explain what each slave type is in an info box, and we could also have a bugzilla template link for requesting a novel slave type.

As a next step, I think we should mock-up the full workflow, including the admin side.

I've looked over the diagrams and they seem good, but I'll have a more thorough look over the weekend.

Thanks, Callek.
(In reply to Pete Moore [:pete][:pmoore] from comment #15) 
> If it is possible to automatically grant the user access to the aws console,
> with limited rights *just* to the user's loan machines, that might be a way
> to give them the power of the aws console at their fingertips, without the
> need for all the infrastructure in between that we'd need to support, and
> potential sources of breakage that could impact them. In the vein of keeping
> it as simple as possible, but not simpler (something like "occam's razor").

I don't think this is actually easier. We already have APIs to manipulate host state in AWS the way we need to, whereas we have no existing automation to integrate with AWS auth to give devs direct access.

> For other machine types, if the dev has access to the machine, I'm guessing
> they should be able to reboot it themselves, for example. I think it is only
> AWS ones where they would want an interface to suspend it - and i think
> access to aws console would provide this, if we can sort out automatic
> provisioning of grants.

Yes, and no. If the machine is reachable, they can log in and reboot as per normal. If the machine is wedged, they would need access to the PDU or IPMI interface. This access doesn't come automatically with the loan. Again, slaveapi already lets us do this though.
(In reply to Chris Cooper [:coop] from comment #18)
> * In the "My Loans" section, does history refer to the history of the slave
> being loaned, or the history of the request itself?

History of the loan-actions, specifically "shut down", "loan requested", "started", "loan ready" etc.
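
As an illustration only, the loan-action history could be modelled as a list of timestamped events attached to each loan. The `Loan` and `LoanEvent` names below are hypothetical and not taken from the mock-up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class LoanEvent:
    action: str    # e.g. "loan requested", "started", "loan ready", "shut down"
    when: datetime


@dataclass
class Loan:
    requester: str
    history: List[LoanEvent] = field(default_factory=list)

    def record(self, action: str) -> None:
        """Append an action to this loan's history with a UTC timestamp."""
        self.history.append(LoanEvent(action, datetime.now(timezone.utc)))
```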

> * In the "New Loan" section, for the first pass (and possibly always), I
> think we only want a list of existing slave types; I don't think there's
> value in starting the conversation about loaning a one-off slave type in
> this tool. We should certainly explain what each slave type is in an info
> box, and we could also have a bugzilla template link for requesting a novel
> slave type.

This was also for the sake of "I'm confused on what loan I need", and "Releng missed updating this for a supported Loan Type" - less for support of one-off slave types.

Of course I can see value in a way to track non-standard things with the UI, even if no "actions" are supported via it.

> I've looked over the diagrams and they seem good, but I'll have a more
> thorough look over the weekend.

Would love any followup thoughts you have there. (if any)
Depends on: 1007501
Depends on: 1019732
Depends on: 932396
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2573]
Depends on: 1154548
Component: Tools → General
Assignee: bugspam.Callek → nobody
Priority: -- → P3
Not expecting to work on this tool anymore.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX