Don't require puppet or DNS to launch new instances

RESOLVED FIXED

Status

defect
RESOLVED FIXED
5 years ago
Last year

People

(Reporter: catlee, Assigned: rail)

Details

Attachments

(11 attachments, 5 obsolete attachments)

2.99 KB, patch, catlee: review+
1.43 KB, patch, catlee: review+
3.38 KB, patch, catlee: review+
3.64 KB, patch, catlee: review+
1000 bytes, patch, Callek: review+
1.35 KB, patch, dustin: review+
93.22 KB, patch
1.61 KB, patch, dustin: review+
1.61 KB, patch, dustin: review+
4.04 KB, patch, dustin: review+
4.88 KB, patch, dustin: review+

Currently all our instances run puppet on boot, which requires valid forward and reverse DNS. This is sub-optimal for several reasons:

- Running puppet on boot means we're waiting longer before we can get real work done on the machine
- Adding new instances is painful since it takes 10-20 minutes for DNS changes to propagate
- Inventory and AWS get out-of-sync easily if instances are being added/deleted.
- We need to keep a pool of detached network interfaces to allocate to spot instances. This artificially limits how many spot instances we can have running at once, and unnecessarily complicates our code.

I'd like to revamp our AMI process and at the same time remove our dependency on puppet and DNS.

The process will look something like this:

- Create a base root snapshot for our target OS (e.g. CentOS 6.4)
- Create two boot snapshots, one for HVM and one for PV virtualization.
- Create a pair of base AMIs that have the boot snapshot as the "root" device, and the root snapshot as a second EBS volume. The boot volume mounts the root volume on boot.

(we have up to this part working)

- For our various end worker types (e.g. bld-linux64, try-linux64), create a reference instance from the base AMI.
- Run puppet on the reference instance so it gets all the required configuration, packages, etc. installed.
- Disable/remove puppet on the instance
- Create snapshot from the puppetized reference instance's root volume.
- Create new AMIs for HVM, PV using the new root snapshot and the existing boot snapshots for each virtualization type

At this point we should be left with AMIs that we can launch directly as on-demand or spot instances and that don't require puppet or DNS to function properly.
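The last step above (one AMI per virtualization type, sharing a root snapshot) can be sketched roughly as follows. This is only an illustration, not the cloud-tools code: the helper names, the device names (`/dev/sda1`, `/dev/sdf`) and the snapshot IDs are assumptions. The dictionary keys follow the EC2 RegisterImage parameters as exposed by boto3's `register_image`; a real PV image would likely also need a `KernelId`, which is left out here.

```python
def register_image_params(name, virtualization_type, boot_snapshot_id,
                          root_snapshot_id, architecture="x86_64"):
    """Build keyword arguments for an EC2 RegisterImage call.

    The boot snapshot becomes the "root" device and the puppetized root
    snapshot is attached as a second EBS volume. Device names are assumed.
    """
    if virtualization_type not in ("hvm", "paravirtual"):
        raise ValueError("unknown virtualization type: %r" % virtualization_type)
    return {
        "Name": name,
        "Architecture": architecture,
        "VirtualizationType": virtualization_type,
        "RootDeviceName": "/dev/sda1",
        "BlockDeviceMappings": [
            # Small boot volume, specific to the virtualization type.
            {"DeviceName": "/dev/sda1",
             "Ebs": {"SnapshotId": boot_snapshot_id, "DeleteOnTermination": True}},
            # Shared root volume, snapshotted from the reference instance.
            {"DeviceName": "/dev/sdf",
             "Ebs": {"SnapshotId": root_snapshot_id, "DeleteOnTermination": True}},
        ],
    }


def amis_for_worker(worker_type, root_snapshot_id, boot_snapshots):
    """Return one parameter set per virtualization type for a worker type."""
    return [
        register_image_params("%s-%s" % (worker_type, virt), virt,
                              boot_id, root_snapshot_id)
        for virt, boot_id in sorted(boot_snapshots.items())
    ]
```

Each returned dictionary could then be passed to an EC2 client, e.g. `client.register_image(**params)`, once per virtualization type.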
We also should figure out how to allocate/release hostnames used by buildslave to connect to masters. Releasing may be tricky for spot instances. Maybe we need a service to check used but dead hostnames.
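A minimal sketch of what such a hostname service might do, assuming we can get a list of allocated hostnames and a list of hostnames belonging to instances that are actually running. Both function names and the data sources are hypothetical; nothing here reflects how slavealloc or cloud-tools really work.

```python
def allocate_hostname(pool, in_use):
    """Return the first hostname from pool that is not in use, or None."""
    taken = set(in_use)
    for name in pool:
        if name not in taken:
            return name
    return None


def dead_hostnames(allocated, running):
    """Return hostnames that are still allocated but have no running instance.

    For spot instances this is the interesting case: the instance can be
    terminated by AWS without ever releasing its name.
    """
    return sorted(set(allocated) - set(running))
```

A periodic job could call `dead_hostnames()` and release whatever it returns, so terminated spot instances don't hold names forever.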
Have you considered using Packer to build AMIs from templates?
http://www.packer.io/intro

You could also reuse existing puppet manifests:
http://www.packer.io/docs/provisioners/puppet-masterless.html
Assignee: nobody → rail
Depends on: 989814
If you don't use CentOS 6.2, please use CentOS 6.5, since that's what we'll be supporting on onsite hardware.  Preferably that would be created from the repos in puppet, so we don't have minor/release version differences between AWS and onsite.
(:thumbsup: for the idea by the way!)
Depends on: 1001714
publish all available AMIs somewhere accessible from everywhere!
Attachment #8418374 - Flags: review?(catlee)
Attachment #8418374 - Flags: review?(catlee) → review+
Enable publishing
Attachment #8418384 - Flags: review?(catlee)
For my reference -- the AMIs themselves are *not* public, just the https://s3.amazonaws.com/mozilla-releng-amis/amis.json file?
(In reply to Dustin J. Mitchell [:dustin] (PTO until ~5/20) from comment #8)
> For my reference -- the AMIs themselves are *not* public, just the
> https://s3.amazonaws.com/mozilla-releng-amis/amis.json file?

Correct, the AMIs may have some secrets.
Posted patch configs (obsolete) — Splinter Review
To make the testing part simpler I'd prefer to add some slaves not backed by network interfaces. Still need to add them to slavealloc.
Attachment #8418823 - Flags: review?(catlee)
Posted patch configs (obsolete) — Splinter Review
err, some garbage removed
Attachment #8418823 - Attachment is obsolete: true
Attachment #8418823 - Flags: review?(catlee)
Attachment #8418826 - Flags: review?(catlee)
Posted patch configs
Bah, padding!
Attachment #8418826 - Attachment is obsolete: true
Attachment #8418826 - Flags: review?(catlee)
Attachment #8418832 - Flags: review?(catlee)
Attachment #8418384 - Flags: review?(catlee) → review+
Attachment #8418832 - Flags: review?(catlee) → review+
(In reply to Rail Aliiev [:rail] from comment #14)
> Comment on attachment 8418384 [details] [diff] [review]
> puppet_aws_publish_amis.diff
> 
> remote:   https://hg.mozilla.org/build/puppet/rev/19b038a962c5
> remote:   https://hg.mozilla.org/build/puppet/rev/3e6c51f3e8ec

I had to adjust IAM policies for aws-manager user and added s3:* actions for the mozilla-releng-amis bucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1399488079000",
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::mozilla-releng-amis/*"
      ]
    }
  ]
}
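For reference, the same policy document can be generated from Python, which makes it easier to see what is being granted. This is a sketch only: the function name is made up, and the default `Sid` is simply copied from the policy above. Note that the resource ARN ends in `/*`, so the statement applies to objects in the bucket, not to the bucket itself.

```python
import json


def bucket_objects_policy(bucket, actions=("s3:*",), sid="Stmt1399488079000"):
    """Build an IAM policy allowing the given actions on all objects in bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": sid,
                "Effect": "Allow",
                "Action": list(actions),
                # Object-level ARN: covers keys inside the bucket only.
                "Resource": ["arn:aws:s3:::%s/*" % bucket],
            }
        ],
    }


policy = bucket_objects_policy("mozilla-releng-amis")
print(json.dumps(policy, indent=2))
```

Passing a narrower tuple of actions would restrict the grant; which actions the publishing script actually needs is not stated in this bug.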
I added the "golden" DNS entries in both regions, just in case (we will be copying the AMIs across the regions):

invtool A create --ip 10.134.49.65 --fqdn try-linux64-ec2-golden.build.releng.use1.mozilla.com  --private  --description "Golden AMI"
invtool PTR create --ip 10.134.49.65 --target try-linux64-ec2-golden.build.releng.use1.mozilla.com  --private --description "Golden AMI"
invtool A create --ip 10.134.49.4 --fqdn tst-linux64-ec2-golden.build.releng.use1.mozilla.com  --private  --description "Golden AMI"
invtool PTR create --ip 10.134.49.4 --target tst-linux64-ec2-golden.build.releng.use1.mozilla.com  --private --description "Golden AMI"
invtool A create --ip 10.134.49.89 --fqdn tst-linux32-ec2-golden.build.releng.use1.mozilla.com  --private  --description "Golden AMI"
invtool PTR create --ip 10.134.49.89 --target tst-linux32-ec2-golden.build.releng.use1.mozilla.com  --private --description "Golden AMI"
invtool A create --ip 10.132.49.90 --fqdn try-linux64-ec2-golden.build.releng.usw2.mozilla.com  --private  --description "Golden AMI"
invtool PTR create --ip 10.132.49.90 --target try-linux64-ec2-golden.build.releng.usw2.mozilla.com  --private --description "Golden AMI"
invtool A create --ip 10.132.50.36 --fqdn tst-linux64-ec2-golden.build.releng.usw2.mozilla.com  --private  --description "Golden AMI"
invtool PTR create --ip 10.132.50.36 --target tst-linux64-ec2-golden.build.releng.usw2.mozilla.com  --private --description "Golden AMI"
invtool A create --ip 10.132.49.98 --fqdn tst-linux32-ec2-golden.build.releng.usw2.mozilla.com  --private  --description "Golden AMI"
invtool PTR create --ip 10.132.49.98 --target tst-linux32-ec2-golden.build.releng.usw2.mozilla.com  --private --description "Golden AMI"
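The A/PTR pairs above follow one pattern, so they can be generated from a table of (IP, FQDN) entries. A small sketch, using only the invtool flags that appear in the commands above; the helper name is made up, and it only prints the commands rather than running them.

```python
import shlex

# (ip, fqdn) pairs copied from the commands above.
GOLDEN_HOSTS = [
    ("10.134.49.65", "try-linux64-ec2-golden.build.releng.use1.mozilla.com"),
    ("10.134.49.4", "tst-linux64-ec2-golden.build.releng.use1.mozilla.com"),
    ("10.134.49.89", "tst-linux32-ec2-golden.build.releng.use1.mozilla.com"),
    ("10.132.49.90", "try-linux64-ec2-golden.build.releng.usw2.mozilla.com"),
    ("10.132.50.36", "tst-linux64-ec2-golden.build.releng.usw2.mozilla.com"),
    ("10.132.49.98", "tst-linux32-ec2-golden.build.releng.usw2.mozilla.com"),
]


def invtool_commands(ip, fqdn, description="Golden AMI"):
    """Return the forward (A) and reverse (PTR) invtool commands for one host."""
    a_record = ["invtool", "A", "create", "--ip", ip, "--fqdn", fqdn,
                "--private", "--description", description]
    ptr_record = ["invtool", "PTR", "create", "--ip", ip, "--target", fqdn,
                  "--private", "--description", description]
    # shlex.join (Python 3.8+) quotes arguments containing spaces.
    return [shlex.join(a_record), shlex.join(ptr_record)]


for host_ip, host_fqdn in GOLDEN_HOSTS:
    for command in invtool_commands(host_ip, host_fqdn):
        print(command)
```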
Merged and deployed to production.
Depends on: 1007967
Depends on: 1011257
Depends on: 1008241
In the new world we won't be using network interface tags. This code works fine for the current setup as well.
Attachment #8427257 - Flags: review?(catlee)
Attachment #8427257 - Flags: review?(catlee) → review+
Posted patch no-puppet2-cloud-tools.diff (obsolete) — Splinter Review
This version is in semi-production now (running in parallel with some hacks to avoid collisions). Still need to address some inline todos and remove some of them when the code lands.
Attachment #8429331 - Flags: feedback?(catlee)
Depends on: 1016579
Depends on: 1017634
Posted patch no-puppet2-cloud-tools-1.diff (obsolete) — Splinter Review
Attachment #8429331 - Attachment is obsolete: true
Attachment #8429331 - Flags: feedback?(catlee)
Attachment #8432480 - Flags: feedback?(catlee)
Depends on: 1019013
Current spot capacity (evenly split across 2 regions):

tst-linux64: 200 old + 900 new
tst-linux32: 200 old + 700 new
bld-linux64: 200 old + 300 new
try-linux64: 200 old + 300 new
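To make the numbers above easier to compare, here is the arithmetic spelled out: the total per slave type and the share per region, assuming "evenly split" means the total for each type is divided equally between the 2 regions.

```python
# (old-style, new-style) spot capacity per slave type, from the list above.
CAPACITY = {
    "tst-linux64": (200, 900),
    "tst-linux32": (200, 700),
    "bld-linux64": (200, 300),
    "try-linux64": (200, 300),
}
REGIONS = 2


def summarize(capacity, regions=REGIONS):
    """Return total and per-region capacity for each slave type."""
    summary = {}
    for name, (old, new) in capacity.items():
        total = old + new
        summary[name] = {"total": total, "per_region": total // regions}
    return summary


for slave_type, numbers in summarize(CAPACITY).items():
    print(slave_type, numbers["total"], numbers["per_region"])
```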
I deleted the following ranges of network interfaces to free up some IP space for new style instances:

tst-linux64-spot-600..999, leaving 300 network interfaces per region
tst-linux64-spot-600..799, leaving 300 network interfaces per region

If everything goes as expected, I'm going to shrink the range again today.
Moved the following ranges:

bld-linux64-spot-001..099
bld-linux64-spot-300..399
try-linux64-spot-001..099
try-linux64-spot-300..399

At this point all bld and try instances are supposed to use the new system
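The `name-NNN..MMM` notation used for the ranges above can be expanded into individual names with a few lines of Python. A sketch, assuming the ranges are inclusive on both ends and that the zero padding of the start number should be kept; the function name is made up.

```python
import re

RANGE_RE = re.compile(r"^(?P<prefix>.+-)(?P<start>\d+)\.\.(?P<end>\d+)$")


def expand_range(spec):
    """Expand e.g. 'bld-linux64-spot-001..099' into a list of names."""
    match = RANGE_RE.match(spec)
    if match is None:
        raise ValueError("not a range: %r" % spec)
    width = len(match.group("start"))  # keep the zero padding
    start = int(match.group("start"))
    end = int(match.group("end"))
    prefix = match.group("prefix")
    # Inclusive on both ends (assumption).
    return ["%s%0*d" % (prefix, width, n) for n in range(start, end + 1)]
```

For example, `expand_range("bld-linux64-spot-001..099")` yields 99 names, from `bld-linux64-spot-001` to `bld-linux64-spot-099`.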
Blocks: 1019869
no need to run this anymore
Attachment #8433755 - Flags: review?(dustin)
once it's deleted we can delete the code
Attachment #8433756 - Flags: review?(dustin)
Comment on attachment 8433755 [details] [diff] [review]
stop running instance2ami

r- due to http://mxr.mozilla.org/build/source/puppet/modules/aws_manager/manifests/cron.pp#61 existing, but change that to absent and test it does what we think with a --noop run and you can have a "I don't need to see this again" r+
Attachment #8433755 - Flags: review?(dustin) → review-
Comment on attachment 8433755 [details] [diff] [review]
stop running instance2ami

err ignore me
Attachment #8433755 - Flags: review- → review+
Attachment #8433756 - Flags: review?(dustin) → review+
Posted file use slavealloc for reportor (obsolete) —
Attachment #8434157 - Flags: review?(bhearsum)
Comment on attachment 8434157 [details] [review]
use slavealloc for reportor

Looks like catlee merged this already...
Attachment #8434157 - Attachment is obsolete: true
Attachment #8434157 - Flags: review?(bhearsum)
The current version. I plan to land this version today and stop the one running in parallel.
Attachment #8432480 - Attachment is obsolete: true
Attachment #8432480 - Flags: feedback?(catlee)
certs no more!
Attachment #8435001 - Flags: review?(dustin)
Comment on attachment 8435001 [details] [diff] [review]
watch_pending_puppet.diff

\o/
Attachment #8435001 - Flags: review?(dustin) → review+
Blocks: 1022368
Added a wiki page regarding generated AMIs: https://wiki.mozilla.org/ReleaseEngineering/How_To/Manage_spot_AMIs
Deletes old AMIs (leaving last 10) once a day
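The retention rule (keep the newest 10, delete the rest) can be sketched like this. It is not the attached patch: the function name and the `(ami_id, created)` tuple layout are assumptions, and whether the real script counts AMIs per slave type or per region is not stated here.

```python
def amis_to_delete(amis, keep_last=10):
    """Return IDs of AMIs to delete, keeping the keep_last newest ones.

    amis is a list of (ami_id, created) tuples, where created is anything
    that sorts chronologically (e.g. ISO 8601 timestamp strings).
    """
    newest_first = sorted(amis, key=lambda ami: ami[1], reverse=True)
    return [ami_id for ami_id, _created in newest_first[keep_last:]]
```

With 12 AMIs and the default `keep_last=10`, only the two oldest IDs are returned.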
Attachment #8440788 - Flags: review?(dustin)
Attachment #8440788 - Flags: review?(dustin) → review+
The last piece!
Attachment #8441546 - Flags: review?(dustin)
Comment on attachment 8441546 [details] [diff] [review]
amis-puppet.diff

Will it be annoying to have to add new slave types here?
Attachment #8441546 - Flags: review?(dustin) → review+
(In reply to Dustin J. Mitchell [:dustin] from comment #42)
> Comment on attachment 8441546 [details] [diff] [review]
> amis-puppet.diff
> 
> Will it be annoying to have to add new slave types here?

Not a big deal now. We can refactor the code in the future to use a config file.
Posted patch fix paths
err, forgot to adjust according to production cwd.
Attachment #8442034 - Flags: review?(dustin)
Comment on attachment 8442034 [details] [diff] [review]
fix paths

Not a change I really understand, but from a puppet perspective this is fine.
Attachment #8442034 - Flags: review?(dustin) → review+
All works fine here! \o/
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: General Automation → General