Closed Bug 1022368 (opened 10 years ago, closed 10 years ago)
Kill some puppet masters in AWS
Categories: Infrastructure & Operations :: RelOps: Puppet (task)
Tracking: not tracked
Status: RESOLVED FIXED
People: Reporter: rail; Assigned: rail
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/710]
Attachments (1 file, 1 obsolete file):
  8.67 KB, patch (dustin: review+; rail: checked-in+)
Since we reduced our load on puppet, I think we can kill 2 out of the 4 running AWS puppet masters. Any reason why we shouldn't do this?
Comment 1 • 10 years ago
Redundancy? If we go down to one per region, and that machine goes down, do we try other regions?
Comment 2 (Assignee) • 10 years ago
If one of the region masters goes down, the slaves fall back to the next available puppet master in any region (including in-house).
We have this in our config:
http://hg.mozilla.org/build/puppet/file/27b30fedfb63/manifests/moco-config.pp#l20
The implementation is here: http://hg.mozilla.org/build/puppet/file/27b30fedfb63/modules/config/lib/puppet/parser/functions/sort_servers_by_group.rb
Example output of /etc/puppet/puppetmasters.txt (the file puppet agents iterate over to choose a master) from one of the slaves, with a sketch of the sorting idea after the listing:
$ cat /etc/puppet/puppetmasters.txt
releng-puppet2.srv.releng.use1.mozilla.com
releng-puppet1.srv.releng.use1.mozilla.com
releng-puppet2.srv.releng.usw2.mozilla.com
releng-puppet2.build.scl1.mozilla.com
releng-puppet1.srv.releng.scl3.mozilla.com
releng-puppet1.srv.releng.usw2.mozilla.com
releng-puppet2.srv.releng.scl3.mozilla.com
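
For illustration only, here is a minimal Ruby sketch of what a sort_servers_by_group-style function could do, based on the behavior described above: prefer masters in the agent's own region/group, then fall back to everything else. This is not the actual implementation from the linked file; the substring-based group matching and the shuffling are assumptions.

def sort_servers_by_group(servers, local_group)
  # Split masters into same-group (e.g. same region) and the rest.
  local, remote = servers.partition { |s| s.include?(local_group) }
  # Shuffle within each tier to spread agent load across masters,
  # while still trying the local region first.
  local.shuffle + remote.shuffle
end

masters = %w[
  releng-puppet1.srv.releng.use1.mozilla.com
  releng-puppet2.srv.releng.use1.mozilla.com
  releng-puppet1.srv.releng.usw2.mozilla.com
  releng-puppet1.srv.releng.scl3.mozilla.com
]
puts sort_servers_by_group(masters, 'use1')

An agent in use1 would then try the two use1 masters (in random order) before any of the others, much like the puppetmasters.txt example above, which begins with the two use1 masters.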
Comment 3 • 10 years ago
At this point we have 8 masters, so if we lose one, the load on the others increases by about 14% (each of the 7 survivors serves 1/7 of the load instead of 1/8).
We'll lose scl1 soon, too, so if we turn off two of the AWS masters we'll only have four, which means a single failure will increase load on the rest by 33%.
So we should watch load carefully after turning these off, and if it gets anywhere near capacity, add more capacity, either as more powerful instances or as additional instances.
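
As a quick sanity check on those percentages, a small Ruby snippet (illustrative only; it just restates the N/(N-1) arithmetic above):

# Per-master load increase when one of n equally loaded masters fails:
# each survivor serves 1/(n-1) of the total load instead of 1/n.
def load_increase_pct(n)
  (n.to_f / (n - 1) - 1) * 100
end

puts load_increase_pct(8) # ~14.3: one of eight masters fails
puts load_increase_pct(4) # ~33.3: one of four masters fails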
Comment 4 (Assignee) • 10 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #3)
> We'll lose scl1 soon, too, so if we turn off two of the AWS masters we'll
> only have four, which means a single failure will increase load on the rest
> by 33%.
... which was decreased last week by "a lot" :)
I don't have any puppet master load stats, but the fact that we moved 2-3K instances off of puppet, leaving rarely-started on-demand instances as its only users, makes me think the load drop far outweighs those 14% and 33% increases.
Comment 5 • 10 years ago
Yep, I think it's a good idea to turn them off -- we just need to be aware of the risk.
Load average seems to be the best metric for load.
Comment 6 (Assignee) • 10 years ago
Not shutting down the masters yet.
Attachment #8437061 - Flags: review?(dustin)
Updated • 10 years ago
Attachment #8437061 - Flags: review?(dustin) → review+
Comment 7 (Assignee) • 10 years ago
Comment on attachment 8437061 (no-puppet2.diff)
https://hg.mozilla.org/build/puppet/rev/9f2fb7db6535
Attachment #8437061 - Flags: checked-in+
Comment 8 • 10 years ago
It looks like the cron jobs are still running, trying to rsync from releng-puppet2.srv.releng.scl3.mozilla.com, and they're filling the releng-shared puppet-errors folder with hundreds of authentication-failed messages.
Comment 9 • 10 years ago
I've disabled the jobs in releng-puppet2.srv.releng.{use1,usw2}:/etc/cron.d/{rsync-secrets,ssl_git_sync} to quench the spam.
Comment 10 (Assignee) • 10 years ago
I completely forgot that we also use these masters for mock packages...
It's probably worth leaving them alive and just changing the instance type from m3.xlarge back to m3.large.
Comment 11 (Assignee) • 10 years ago
Comment on attachment 8437061 (no-puppet2.diff)
remote: https://hg.mozilla.org/build/puppet/rev/da69a61b039e
remote: https://hg.mozilla.org/build/puppet/rev/7d69f5e64c78
Attachment #8437061 - Flags: checked-in+ → checked-in-
Comment 12 (Assignee) • 10 years ago
The masters are back, all using m3.large. No puppet errors so far.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Comment 13 (Assignee) • 10 years ago
It's time to revisit this. We barely use the puppet masters in AWS anymore: the main remaining consumers are buildbot masters, since the slaves switched to an AMI-based solution.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Updated • 10 years ago
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/655]
Updated • 10 years ago
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/655] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/710] [kanban:engops:https://kanbanize.com/ctrl_board/6/655]
Comment 14 (Assignee) • 10 years ago
Switched to the scl3 masters for golden images. https://hg.mozilla.org/build/cloud-tools/rev/31b64f52d7a6
Comment 15 (Assignee) • 10 years ago
Take 2. Once this lands, I'm going to watch the logs to see what accesses these masters before I shut them down.
Attachment #8437061 - Attachment is obsolete: true
Attachment #8516869 - Flags: review?(dustin)
Comment 16 • 10 years ago
Review of attachment 8516869 (killem-again.diff):
Hm, I thought you were going to remove one from each region.
Can you do a quick grep to figure out the rate of catalog compilation on these hosts? If it's a tiny fraction of that on the scl3 hosts, this is probably OK.
Attachment #8516869 - Flags: review?(dustin) → review+
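
One possible way to do the quick grep dustin asks for above, as a rough Ruby sketch. The log path and the "Compiled catalog for ..." message format are assumptions about a typical puppet master setup, and the hour bucketing is approximate:

# Count catalog compilations per hour from a puppet master log.
# Assumes syslog-style lines containing
# "Compiled catalog for <host> in environment ... in N.NN seconds".
LOG = '/var/log/messages' # assumed log location

hours = Hash.new(0)
File.foreach(LOG) do |line|
  next unless line.include?('Compiled catalog for')
  # Bucket by the syslog timestamp prefix, e.g. "Nov 12 14".
  hours[line[0, 9]] += 1
end
hours.sort.each { |hour, count| puts "#{hour} #{count}" }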
Comment 17 • 10 years ago
I see about 350 catalog compilations per hour on an scl3 master. I see about 20/hr on an AWS master. So the total is 350*2 + 20*4 = 780/hr, and divided by two scl3 masters that's 390/hr. Which seems unlikely to cause excessive pain.
Comment 18 • 10 years ago
When this does land, please file a relops bug for me to remove flows for those puppetmasters and update firewall-tests accordingly.
Updated • 10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/710] [kanban:engops:https://kanbanize.com/ctrl_board/6/655] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/710]
Comment 19 (Assignee) • 10 years ago
Comment on attachment 8516869 (killem-again.diff)
remote: https://hg.mozilla.org/build/puppet/rev/3e7b3aedb9aa
remote: https://hg.mozilla.org/build/puppet/rev/841dc1c7197a
Attachment #8516869 - Flags: checked-in+
Comment 20 • 10 years ago
Thanks for ack'ing the alerts, Rail. BTW, do we need to do anything to permanently remove the nagios checks, or is the ack sufficient?
Comment 21 (Assignee) • 10 years ago
(In reply to Pete Moore [:pete][:pmoore] from comment #20)
> Thanks for ack'ing the alerts, Rail. BTW, do we need to do anything to
> permanently remove the nagios checks, or is the ack sufficient?
I filed bug 1100560 to deal with that.
Comment 22 (Assignee) • 10 years ago
I shut the masters down this morning. I'll let them sit around for a bit before I terminate them.
Comment 23 • 10 years ago
A little too late, but for future reference, https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/HowTo/Remove_a_Puppetmaster is the guide for doing this.
bug 1101051 had some other missed work for shutting down these hosts.
I just changed all of the CNAMEs pointing to the now-gone hosts to point to scl3 hosts instead.
Comment 24 (Assignee) • 10 years ago
I also removed the IPs from the puppetagain-apt.pvt.build.mozilla.org A-record list.
Comment 25 (Assignee) • 10 years ago
Dustin, do you think I can go ahead and kill the AWS masters now? Are the in-house masters OK?
Comment 26 • 10 years ago
They're looking fine.
However, given the fluid nature of near-term plans, I'd suggest leaving them in place. In hindsight, I think it might have been smarter not to shut these down, since eventually we'll be moving things *out* of scl3, not in.
Comment 27 (Assignee) • 10 years ago
I'd rather add them back when we need them. The process is quite straightforward.
Comment 28 (Assignee) • 10 years ago
Done.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED