Closed Bug 1019732 Opened 10 years ago Closed 9 years ago

implement 'aws_create_instance' action in slaveapi

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: jlund)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2699] )

Attachments

(3 files, 1 obsolete file)

like 'disable' (Bug 932396) the core motivation for this is for the benefit of slaveloan.

for the short term, the plan is to use slaveapi to call commands based on https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#AWS_machines against the aws-manager host rather than giving slaveapi the ability itself to create aws instances.

This action should be able to create both build and test machines against a given owner and bug#.
Depends on: 1020512
Component: Tools → RelOps: Puppet
Product: Release Engineering → Infrastructure & Operations
QA Contact: hwine → dustin
Summary: implement 'aws_create_instance' action in slaveapi → give slaveapi aws_manager powers
Component: RelOps: Puppet → Tools
Product: Infrastructure & Operations → Release Engineering
QA Contact: dustin → hwine
Summary: give slaveapi aws_manager powers → implement 'aws_create_instance' action in slaveapi
Depends on: 1030332
Depends on: 1031953
impls the ability to create instances via slaveapi.

this needs to be tested before r+
Attachment #8448127 - Flags: review?(bugspam.Callek)
Comment on attachment 8448127 [details] [diff] [review]
140630_bug_1019732_slaveapi_create_instances-puppet.patch

Review of attachment 8448127 [details] [diff] [review]:
-----------------------------------------------------------------

did you forget to git add before diffing? or is this the entirety of the patch?
this is the entirety that I think we need.

the other puppet patch I wrote: 'Bug 1020512 - give slaveapi aws_manager powers' equipped the slaveapi nodes with the ability to create instances

this puppet patch is to actually modify slaveapi code: 'slaveapi.ini', as that ini file is managed by puppet not by the slaveapi repo. I didn't realize I needed to modify it until after.
Comment on attachment 8448127 [details] [diff] [review]
140630_bug_1019732_slaveapi_create_instances-puppet.patch

Review of attachment 8448127 [details] [diff] [review]:
-----------------------------------------------------------------

Ahh totally misread this as the slaveapi patch itself not the puppet patch
Attachment #8448127 - Flags: review?(bugspam.Callek) → review+
See Also: → 1045114
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2691]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2691] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2699]
Comment on attachment 8448127 [details] [diff] [review]
140630_bug_1019732_slaveapi_create_instances-puppet.patch

thought this was landed a long time ago and is needed in 1045114

https://hg.mozilla.org/build/puppet/rev/71da1c895b3f
Attachment #8448127 - Flags: checked-in+
I sprinted this over the last couple days.

I think it's pretty much done. I finally just created my first successful instance from scratch. There were so many annoying edge cases trying to combine cloud-tools and slaveapi.

you can use this on the dev instance now (it's pinned to my puppet repo that supports it):

to create an instance, you need three things:
- instance_type (can be build or test)
- email
- bug #

there is also one optional arg:
- arch (can be 64 or 32 but will fail if you try to use 32 with instance_type=build)

example:
$ curl -d "instance_type=build&email=jlund5@foo.com&bug=1111111" slaveapi-dev1.srv.releng.scl3.mozilla.com:8080/slaves/default/actions/aws_create_instance

note, this endpoint is a weird circumstance where the slave doesn't exist (have a hostname) so you can see I pass 'default' as the above and it will choose a host name based on the POST params passed. If you use anything other than default, it will take that literally as the hostname when it goes to create.

I'll update the docs once it makes it through review
Attachment #8448082 - Attachment is obsolete: true
Attachment #8581431 - Flags: review?(coop)
another puppet patch needed to support new way that cloud-tools makes everything a package
Attachment #8581432 - Flags: review?(coop)
fwiw I'm not a huge fan of .../default/actions...

You can always add an endpoint:

/actions/create_aws_instance instead of /slaves/.../actions/ without the need to use <slave> in the file:

web/__init__.py  and using its related code.

That said, I don't quite think this API quirk needs to block basic deployment, and has some quirky fallouts based on "rest of slaveapi" assumptions (e.g. actions always having a slave, and logging semantics).
> You can always add an endpoint:
> 
> /actions/create_aws_instance instead of /slaves/.../actions/ without the
> need to use <slave> in the file:
> 

I started this route but 1) I wanted to keep convention and 2) I wanted the ability for us to be able to explicitly name our hosts. e.g. if we are loaning 5 slaves out to the same person or persons, this would be useful I think. Happy to weigh all the pros and cons if you feel strongly
Attachment #8581432 - Flags: review?(coop) → review+
Comment on attachment 8581431 [details] [diff] [review]
150322_aws_create_instance-slaveapi-2.patch

Review of attachment 8581431 [details] [diff] [review]:
-----------------------------------------------------------------

Overall looks really good. No showstoppers, just some small nits/questions.

Do we need some functions for performing the secret cleanup? Is that something Callek is working on more generally, or is he expecting some AWS-specific functions here?

::: slaveapi/actions/aws_create_instance.py
@@ +21,5 @@
> +    :type instance_type: str
> +
> +    :rtype: tuple
> +    """
> +    # TODO allow for region us west as well

We don't currently allow this for by-hand loans, so no rush. :)

@@ +30,5 @@
> +    assert value_in_values(instance_type, ['build', 'test'])
> +    assert not (arch == 32 and instance_type == 'build')
> +
> +    if not arch:
> +        arch = 64

Is there a reason why you wouldn't set this as the default in the function params instead of None?

@@ +42,5 @@
> +            name = 'dev-linux64-ec2-%s' % nick
> +        fqdn = '%s.dev.releng.use1.mozilla.com' % name
> +        aws_config = 'dev-linux%s' % arch
> +        data = 'us-east-1.instance_data_dev.json'
> +    else:  # instance_type == 'test'

Do we let users borrow try slaves? Is that something we would ever want to do?

@@ +67,5 @@
> +        if return_code == SUCCESS:
> +            status_msgs.append("Success\nwaiting for DNS to propagate...")
> +            log.info("host: {0} - waiting for DNS to propagate".format(name))
> +            # TODO - rather than waiting 20 min, maybe we can poll
> +            # inventory.get_system(name) and if that yields a result, we can say

I do like idea. Maybe sleep for a minute, check the DNS, loop for as many iterations as required, capped at 30 or some such.

@@ +71,5 @@
> +            # inventory.get_system(name) and if that yields a result, we can say
> +            # it's propagated?
> +            time.sleep(20 * 60)
> +            status_msgs.append("Success\ncreating and assimilating aws instance...")
> +            return_code, instance = aws.create_aws_instance(fqdn, name, email, bug, aws_config,

This command is likely to take a while. Would be worth indicating that with a comment.

::: slaveapi/clients/aws.py
@@ +4,2 @@
>  import subprocess
> +import re

Is there a reason for the reordering here?

@@ +20,5 @@
> +        return False
> +
> +
> +def ip_is_free(address):
> +    # TODO - we should use inventory to validate an address being free or not

I think gethostbyaddr is fine since it doesn't rely on a particular backend.

::: slaveapi/clients/inventory.py
@@ +47,5 @@
> +        log.info(debug_msg)
> +        response = requests.post(
> +            url, headers=headers, data=json.dumps(payload, indent=2), auth=auth
> +        )
> +    except RequestException as e:

This is the generic exception that would also catch Timeout errors, correct? That's the one I tend to worry about most with our infra.

::: slaveapi/slave.py
@@ +45,4 @@
>      def load_slavealloc_info(self):
>          log.info("Getting slavealloc info")
>          info = slavealloc.get_slave(config["slavealloc_api_url"], name=self.name)
> +        if info:  # some slaves might not be in slavealloc. e.g., loans

Good point. Do we need an else clause to cover the cases where the slave is not in slavealloc?

::: slaveapi/util/__init__.py
@@ +4,4 @@
>  import traceback
>  
>  
> +def normalize_truthiness(target_value):

The need for this pains me so much, and I used to write perl.

::: slaveapi/web/slave.py
@@ +111,5 @@
> +        if not all(required_fields.values()):
> +            return missing_fields_response(required_fields)
> +
> +        if not value_in_values(required_fields['instance_type'],
> +                               ['build', 'test']):

Again, should we offer 'try' as an option here, or is it functionally the same as 'build'? 

This might be a user interface for Callek to solve to let people who request loans know build==try for loaner instances.

@@ +125,5 @@
> +                optional_fields == '32'):
> +            return make_response(
> +                jsonify(
> +                    {'error': 'mismatching instance_type and arch',
> +                     'msg': '32bit instances are not used for building'}

Again, Callek can probably help prevent this via the interface, but always important to validate.
Attachment #8581431 - Flags: review?(coop) → review+
Thanks for the detailed review. anywhere I add a will-fix will be part of a follow up patch


> Do we need some functions for performing the secret cleanup? Is that
> something Callek is working on more generally, or is he expecting some
> AWS-specific functions here?

This may be a separate slaveapi call or something built into slaveloan itself. Either way, not part of this endpoint.

> 
> @@ +30,5 @@
> > +    assert value_in_values(instance_type, ['build', 'test'])
> > +    assert not (arch == 32 and instance_type == 'build')
> > +
> > +    if not arch:
> > +        arch = 64
> 
> Is there a reason why you wouldn't set this as the default in the function
> params instead of None?

will-fix - none that I can think of.

> 
> @@ +42,5 @@
> > +            name = 'dev-linux64-ec2-%s' % nick
> > +        fqdn = '%s.dev.releng.use1.mozilla.com' % name
> > +        aws_config = 'dev-linux%s' % arch
> > +        data = 'us-east-1.instance_data_dev.json'
> > +    else:  # instance_type == 'test'
> 
> Do we let users borrow try slaves? Is that something we would ever want to
> do?

at this point, this is functionally the same as 'build'. I think, like supporting different regions or aws instance types, this can be part of follow up patches.

> 
> @@ +67,5 @@
> > +        if return_code == SUCCESS:
> > +            status_msgs.append("Success\nwaiting for DNS to propagate...")
> > +            log.info("host: {0} - waiting for DNS to propagate".format(name))
> > +            # TODO - rather than waiting 20 min, maybe we can poll
> > +            # inventory.get_system(name) and if that yields a result, we can say
> 
> I do like idea. Maybe sleep for a minute, check the DNS, loop for as many
> iterations as required, capped at 30 or some such.

will-fix - I'll try this out on the dev node in an optimization follow up patch

> 
> @@ +71,5 @@
> > +            # inventory.get_system(name) and if that yields a result, we can say
> > +            # it's propagated?
> > +            time.sleep(20 * 60)
> > +            status_msgs.append("Success\ncreating and assimilating aws instance...")
> > +            return_code, instance = aws.create_aws_instance(fqdn, name, email, bug, aws_config,
> 
> This command is likely to take a while. Would be worth indicating that with
> a comment.

currently slaveapi only supports the message 'Still in progress..' until the whole thing is completed when you make a GET call on the current POST request. However, I did add logging so we can debug. e.g. within aws.create_aws_instance() we should log:

log.info('creating instance: {0}'.format(fqdn))

I could add to that saying this could take 40min

> 
> ::: slaveapi/clients/aws.py
> @@ +4,2 @@
> >  import subprocess
> > +import re
> 
> Is there a reason for the reordering here?

I suspect it just due to back and forth hacking while implementing this. no race conditions or weird buildbot issues where we need to reload() existing imports :)

> 
> @@ +20,5 @@
> > +        return False
> > +
> > +
> > +def ip_is_free(address):
> > +    # TODO - we should use inventory to validate an address being free or not
> 
> I think gethostbyaddr is fine since it doesn't rely on a particular backend.

will-fix - I'll remove the TODO

> 
> ::: slaveapi/clients/inventory.py
> @@ +47,5 @@
> > +        log.info(debug_msg)
> > +        response = requests.post(
> > +            url, headers=headers, data=json.dumps(payload, indent=2), auth=auth
> > +        )
> > +    except RequestException as e:
> 
> This is the generic exception that would also catch Timeout errors, correct?
> That's the one I tend to worry about most with our infra.

this is the accompanying and recommended exception handler for requests module. It *should* catch all errors including timeouts.
> 
> ::: slaveapi/slave.py
> @@ +45,4 @@
> >      def load_slavealloc_info(self):
> >          log.info("Getting slavealloc info")
> >          info = slavealloc.get_slave(config["slavealloc_api_url"], name=self.name)
> > +        if info:  # some slaves might not be in slavealloc. e.g., loans
> 
> Good point. Do we need an else clause to cover the cases where the slave is
> not in slavealloc?

so afaict the logic within the `if info:` block just sets the slavealloc properties to the slave. Slave.__init__ assigns default values (None) for all of these, effectively implementing the behaviour that an `else:` would do so I don't think it's needed.
this was landed on the 1st of april and I forgot to close this: http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=d18f65e56f39b996155f00e41c9305515fb95556

There can be follow up work such as the above 'will-fix' comments but I think they belong in an optimizing follow up bug (filing now)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1154548
Component: Tools → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: