Closed Bug 1022368 Opened 8 years ago Closed 8 years ago

Kill some puppet masters in AWS

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: rail)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/710] )

Attachments

(1 file, 1 obsolete file)

Since we reduced our load on puppet, I think we can kill 2 out of 4 running AWS puppet masters. Any reason why we shouldn't do this?
Redundancy? If we go down to one per region and that machine goes down, do we try other regions?
If one of the region masters goes down, the slaves fall back to the next available puppet master across all regions (including in-house).

We have this in our config:
http://hg.mozilla.org/build/puppet/file/27b30fedfb63/manifests/moco-config.pp#l20

The implementation is here: http://hg.mozilla.org/build/puppet/file/27b30fedfb63/modules/config/lib/puppet/parser/functions/sort_servers_by_group.rb

Example output of /etc/puppet/puppetmasters.txt (the file puppet agents use to iterate over puppet masters) from one of the slaves:

$ cat /etc/puppet/puppetmasters.txt 
releng-puppet2.srv.releng.use1.mozilla.com
releng-puppet1.srv.releng.use1.mozilla.com
releng-puppet2.srv.releng.usw2.mozilla.com
releng-puppet2.build.scl1.mozilla.com
releng-puppet1.srv.releng.scl3.mozilla.com
releng-puppet1.srv.releng.usw2.mozilla.com
releng-puppet2.srv.releng.scl3.mozilla.com
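
For illustration, the in-order fallback over that file could look roughly like the sketch below. This is not the actual agent wrapper: `parse_masters`, `first_reachable`, and the `probe` callback are hypothetical names, and how the real agent decides that a master is unavailable is not shown in this bug.

```python
from typing import Callable, Iterable, List, Optional


def parse_masters(text: str) -> List[str]:
    """Return the non-empty lines of a puppetmasters.txt-style listing, in file order."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def first_reachable(masters: Iterable[str], probe: Callable[[str], bool]) -> Optional[str]:
    """Walk the list in order and return the first master for which probe() is true.

    `probe` is a hypothetical stand-in for whatever availability check the agent does.
    Returns None when no master in the list responds.
    """
    for host in masters:
        if probe(host):
            return host
    return None
```

With the listing above, a probe that fails for both use1 hosts would make this return releng-puppet2.srv.releng.usw2.mozilla.com, the next entry in file order.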
At this point, we have 8, so if we lose one then the load on the others increases by about 14%.  

We'll lose scl1 soon, too, so if we turn off two of the AWS masters we'll only have four, which means that a failure will increase load by 33%.
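
Those percentages follow from a simple model, sketched here under the assumption that load is spread evenly across masters and is redistributed evenly when one fails. The function name is made up for this example.

```python
def load_increase_after_failure(masters: int) -> float:
    """Fractional load increase on each surviving master when one of `masters` fails.

    Assumes load is spread evenly and redistributed evenly across the survivors,
    so each survivor goes from 1/masters of the load to 1/(masters - 1).
    """
    if masters < 2:
        raise ValueError("need at least two masters to survive a failure")
    return 1 / (masters - 1)


for n in (8, 4):
    print(f"{n} masters: +{load_increase_after_failure(n):.0%} per surviving master")
```

For 8 masters this gives 1/7, about 14%; for 4 masters it gives 1/3, about 33%.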

So we should watch load carefully after turning these off, and if it's getting anywhere near capacity, add more capacity, either in more powerful instances or more instances.
(In reply to Dustin J. Mitchell [:dustin] from comment #3)
> We'll lose scl1 soon, too, so if we turn off two of the AWS masters we'll
> only have four, which means that a failure will increase load by 33%.

... which was decreased last week by "a lot" :)

I don't have any puppet master load stats, but the fact that we moved 2-3K instances off of puppet, leaving rarely started on-demand instances as the only users of puppet, makes me think that the 14% and 33% increases are far smaller than the decrease.
Yep, I think it's a good idea to turn them off -- we just need to be aware of the risk.

Load average seems to be the best metric for load.
Attached patch no-puppet2.diff (obsolete) — Splinter Review
Not shutting down the masters yet.
Attachment #8437061 - Flags: review?(dustin)
Attachment #8437061 - Flags: review?(dustin) → review+
It looks like the cron jobs are still running to try to rsync stuff from releng-puppet2.srv.releng.scl3.mozilla.com and it's filling up the releng-shared puppet-errors folder with hundreds of authentication failed messages.
I've disabled the jobs in releng-puppet2.srv.releng.{use1,usw2}:/etc/cron.d/{rsync-secrets,ssl_git_sync} to quench the spamming.
I completely forgot that we also use these masters for mock packages...

It's probably worth leaving them alive and just changing the instance type from m3.xlarge back to m3.large.
The masters are back, all use m3.large. No puppet errors so far.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
It's time to revisit this. We barely use puppet masters in AWS. The main consumers are buildbot-masters; slaves have switched to an AMI-based solution.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/655]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/655] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/710] [kanban:engops:https://kanbanize.com/ctrl_board/6/655]
Switched to the scl3 masters for golden images. https://hg.mozilla.org/build/cloud-tools/rev/31b64f52d7a6
Take 2. Once this lands, I'm going to watch the logs to see what accesses them before I shut them down.
Attachment #8437061 - Attachment is obsolete: true
Attachment #8516869 - Flags: review?(dustin)
Comment on attachment 8516869 [details] [diff] [review]
killem-again.diff

Review of attachment 8516869 [details] [diff] [review]:
-----------------------------------------------------------------

Hm, I thought you were going to remove one from each region.

Can you do a quick grep to figure out the rate of catalog compilation on these hosts?  If it's a tiny fraction of that on the scl3 hosts, this is probably OK.
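
One way to get that rate, sketched in Python rather than grep. The assumptions here: the master logs a "Compiled catalog for ..." line per compilation, and the log uses a syslog-style "Mon DD HH:MM:SS" prefix. Neither the log path nor the exact line format is given in this bug, so the pattern may need adjusting for the real logs.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed syslog-style line, e.g.
# "Nov  5 10:23:45 host puppet-master[123]: Compiled catalog for node ..."
# Group 1 captures month, day, and hour, which serves as the hourly bucket.
_LINE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}):\d{2}:\d{2}\s.*Compiled catalog for ")


def compilations_per_hour(lines: Iterable[str]) -> Counter:
    """Count 'Compiled catalog for' log lines, bucketed by month/day/hour."""
    counts: Counter = Counter()
    for line in lines:
        match = _LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it an open log file (for example `compilations_per_hour(open(path))`, with `path` being wherever the master logs) gives one count per hour, which can then be compared between an scl3 master and an AWS master.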
Attachment #8516869 - Flags: review?(dustin) → review+
I see about 350 catalog compilations per hour on an scl3 master.  I see about 20/hr on an AWS master.  So the total is 350*2 + 20*4 = 780/hr, and divided by two scl3 masters that's 390/hr.  Which seems unlikely to cause excessive pain.
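
The same estimate as a small Python sketch, using the rates quoted above. It assumes all AWS traffic moves to the scl3 masters and is split evenly between them; the function name is made up for this example.

```python
def per_master_rate_after_shutdown(
    scl3_masters: int, scl3_rate: float, aws_masters: int, aws_rate: float
) -> float:
    """Catalog compilations per hour on each scl3 master once AWS masters are gone.

    Assumes the total rate is unchanged and spread evenly over the scl3 masters.
    """
    total = scl3_masters * scl3_rate + aws_masters * aws_rate
    return total / scl3_masters


rate = per_master_rate_after_shutdown(scl3_masters=2, scl3_rate=350, aws_masters=4, aws_rate=20)
print(f"{rate:.0f} compilations/hr per scl3 master")
```

That works out to 390/hr per scl3 master, roughly an 11% increase over the current 350/hr.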
When this does land, please file a relops bug for me to remove flows for those puppetmasters and update firewall-tests accordingly.
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/710] [kanban:engops:https://kanbanize.com/ctrl_board/6/655] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/710]
Depends on: 1100560
Thanks for ack'ing alerts Rail. BTW do we need to do anything to permanently remove nagios checks, or is the ack sufficient?
(In reply to Pete Moore [:pete][:pmoore] from comment #20)
> Thanks for ack'ing alerts Rail. BTW do we need to do anything to permanently
> remove nagios checks, or is the ack sufficient?

I filed bug 1100560 to deal with that.
Depends on: 1101051
I shut the masters down this morning. I'll let them sit around for a bit before I terminate them.
A little too late, but for future reference, https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/HowTo/Remove_a_Puppetmaster is the guide for doing this.

bug 1101051 had some other missed work for shutting down these hosts.

I just changed all of the CNAMEs pointing to the now-gone hosts to point to scl3 hosts instead.
I also removed the IPs from the puppetagain-apt.pvt.build.mozilla.org A entry list.
Dustin, do you think that I can go ahead and kill the AWS masters now? Are the in-house masters OK?
They're looking fine.

However, given the fluid nature of near-term plans, I'd suggest leaving them in place.  Given hindsight, I think it might have been smarter not to shut these down, since eventually we'll be moving things *out* of scl3, not in.
I prefer to add them when we need them. The process is quite straightforward.
Done.
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED